This package is a Julia implementation of:
- Text classification based on BoW models (e.g. topic/language id)
- Language ID (training and processing) based on word and character n-grams
- Lewis's SMART stop list for English
- TF-IDF/TF-LLR text feature normalization
- n-gram feature extractors
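To make the n-gram and TF-IDF items above concrete, here is a minimal sketch in plain Julia (1.x, Base only). It is an illustration under stated assumptions, not this package's API: the helper names `char_ngrams`, `word_ngrams`, `bag_of_features`, and `tfidf` are hypothetical, and the weighting shown is the common `tf * log(N / df)` form, which may differ from what the package computes.

```julia
# Illustrative sketch only: these helpers are hypothetical and are NOT the Text.jl API.

# All character n-grams of length n (collecting Chars keeps Unicode indexing safe).
function char_ngrams(s::AbstractString, n::Int)
    chars = collect(s)
    return [String(chars[i:i+n-1]) for i in 1:(length(chars) - n + 1)]
end

# All word n-grams over lowercased, whitespace-separated tokens.
function word_ngrams(s::AbstractString, n::Int)
    words = split(lowercase(s))
    return [join(words[i:i+n-1], " ") for i in 1:(length(words) - n + 1)]
end

# Count features into a bag-of-words style dictionary.
function bag_of_features(features)
    counts = Dict{String,Int}()
    for f in features
        counts[f] = get(counts, f, 0) + 1
    end
    return counts
end

# TF-IDF with raw counts as tf and log(N / df) as idf (one common variant).
function tfidf(docs::Vector{Dict{String,Int}})
    N = length(docs)
    df = Dict{String,Int}()
    for d in docs, t in keys(d)
        df[t] = get(df, t, 0) + 1
    end
    return [Dict{String,Float64}(t => c * log(N / df[t]) for (t, c) in d) for d in docs]
end

char_ngrams("text", 2)   # ["te", "ex", "xt"]

docs = [bag_of_features(word_ngrams(t, 1)) for t in ["the cat sat", "the dog sat down"]]
weights = tfidf(docs)    # terms shared by every document get weight 0.0
```

Character n-grams are the usual choice for language id because they need no tokenizer, while word n-grams suit topic classification. In this sketch both feed the same bag-of-features and weighting steps.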
Prerequisites:
- Stage - needed for logging and memoization (Note: requires manual install)
- Ollam - online learning modules (Note: requires manual install)
- Devectorize - macro-based devectorization
- DataStructures - for DefaultDict
- GZip
- Iterators - for iterator helper functions
This is an experimental package that is not currently registered in the Julia central package repository. You can install it via:
Pkg.clone("https://github.com/saltpork/Stage.jl")
Pkg.clone("https://github.com/mit-nlp/Ollam.jl")
Pkg.clone("https://github.com/mit-nlp/Text.jl")
See test/runtests.jl for detailed usage.
This package was created for the DARPA XDATA and Memex programs and is released under the Apache v2 License.