Using StringReader and Transient collection for tokenisation
lynxluna committed Sep 8, 2014
1 parent 9a46ffc commit 7f8c17d
Showing 3 changed files with 38 additions and 9 deletions.
13 changes: 12 additions & 1 deletion README.md
@@ -12,7 +12,7 @@ Unlike its Ruby Counterparts, `clj-libil` is only implemented as library.
Just include this to fetch it from Clojars

```clojure
-[clj-libil "0.1.0"]
+[clj-libil "0.1.1"]
```

## Usage
@@ -38,6 +38,17 @@ There are 4 functions to convert word and sentences using `clj-libil`
(convert-sentence-ngalam "Ngalup Ayabarus") ;; Pulang Surabaya
```

+## Release Notes
+
+### Version 0.1.1
+
+- Using `StringReader` to process token.
+- Using transient collection for optimisation.
+
+### Version 0.1.0
+
+- Initial Version
+
## License

The MIT License (MIT)
2 changes: 1 addition & 1 deletion project.clj
@@ -1,4 +1,4 @@
-(defproject clj-libil "0.1.0"
+(defproject clj-libil "0.1.1"
:description "Clojure port of Libil, Processor of Bahasa Walikan"
:url "http://github.com/lynxluna/clj-libil"
:license {:name "MIT License"
32 changes: 25 additions & 7 deletions src/libil/core.clj
@@ -1,5 +1,6 @@
(ns libil.core
-  (:use [clojure.string :only [split lower-case upper-case capitalize join]]))
+  (:use [clojure.string :only [split lower-case upper-case capitalize join]])
+  (:import [java.io Reader StringReader]))

(def first-pair
["h" "n" "c" "r" "k" "d" "t" "s" "w" "l"])
@@ -13,15 +14,32 @@
(defn- within? [coll item]
((complement nil?) (some (set [item]) coll)))

+(defn- rdr-peek
+  [^Reader rdr]
+  (.mark rdr 1)
+  (let [c (.read rdr)]
+    (.reset rdr)
+    c))
+
+(defn tokenize-rdr
+  "Tokenize a reader"
+  [^Reader rdr]
+  (loop [tokens (transient [])
+         current (.read rdr)
+         ahead (rdr-peek rdr)]
+    (let [cc (-> current char str)]
+      (cond (== -1 ahead) (persistent! (conj! tokens cc))
+            (within? all-con (lower-case (str cc (char ahead))))
+            (let [pair (str (char current) (char ahead))]
+              (.skip rdr 1)
+              (if (== -1 (rdr-peek rdr)) (persistent! (conj! tokens pair))
+                  (recur (conj! tokens pair) (.read rdr) (rdr-peek rdr))))
+            :else (recur (conj! tokens cc) (.read rdr) (rdr-peek rdr))))))
+
(defn tokenize-word
"Tokenizing the word, to be able to be mapped"
[^String w]
-  (loop [l [] rstr w]
-    (let [pair (apply str (take 2 rstr))
-          fstr (str (first rstr))]
-      (cond (empty? rstr) l
-            (within? all-con (lower-case pair)) (recur (conj l pair) (apply str (-> rstr rest rest)))
-            :else (recur (conj l fstr) (apply str (rest rstr)))))))
+  (tokenize-rdr (StringReader. w)))

(defn- inv-cap
"Inverse Capitalize"
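
For readers following the change in `src/libil/core.clj`, here is a minimal standalone sketch of the same idea: peek one character ahead with `mark`/`reset` on a `StringReader`, and collect tokens in a transient vector. It is not the committed code. The `digraphs` set is a hypothetical stand-in for `all-con`, whose definition is collapsed in this diff, and the names `peek-char` and `tokenize` are illustrative only. The sketch also checks for end of input before converting the character, so an empty string yields `[]`. The committed `tokenize-rdr` appears to call `(char -1)` in that case, which throws.

```clojure
(ns sketch.tokenize
  (:require [clojure.string :as str])
  (:import [java.io Reader StringReader]))

;; Hypothetical stand-in for `all-con`; the real set is not shown in this diff.
(def digraphs #{"ng" "ny"})

(defn- peek-char
  "Return the next char code without consuming it (-1 at end of input).
  Relies on the reader supporting mark/reset, which StringReader does."
  [^Reader rdr]
  (.mark rdr 1)
  (let [c (.read rdr)]
    (.reset rdr)
    c))

(defn tokenize
  "Split s into one- or two-character tokens, accumulating in a transient vector."
  [^String s]
  (let [rdr (StringReader. s)]
    (loop [tokens (transient [])]
      (let [c (.read rdr)]
        (if (== -1 c)
          ;; End of input; this branch also covers the empty string.
          (persistent! tokens)
          (let [nxt  (peek-char rdr)
                pair (when-not (== -1 nxt) (str (char c) (char nxt)))]
            (if (and pair (contains? digraphs (str/lower-case pair)))
              (do (.skip rdr 1) ; consume the character we only peeked at
                  (recur (conj! tokens pair)))
              (recur (conj! tokens (str (char c)))))))))))

(tokenize "Ngalam") ;; => ["Ng" "a" "l" "a" "m"]
(tokenize "")       ;; => []
```

The transient vector never leaves the `loop`, and `persistent!` is called exactly once on the way out, which is the usage pattern transients are designed for.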
