A demo with 500k+ documents should be running at search.lookies.io.
# sudo apt-get install python3.4-dev
virtualenv -p /usr/local/bin/python3 py3env # see: which python3
source py3env/bin/activate
pip install Flask pymongo
pip install nltk # for the stemmer (todo)
pip install gensim # for word2vec (may also need: cython, numpy)
wget https://s3.amazonaws.com/mordecai-geo/GoogleNews-vectors-negative300.bin.gz
# mirror of: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
python app.py
# listening on localhost:6001
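Once downloaded, the vectors can be loaded through gensim. A minimal sketch (the `limit` value is just an assumption to keep memory usage down; drop it to load all 3M vectors):

```python
# Minimal sketch: load the pretrained Google News vectors with gensim.
# Assumes a gensim version where KeyedVectors is available.
from gensim.models import KeyedVectors

# binary=True because the file is in the original word2vec binary format;
# gensim handles the .gz compression transparently.
vectors = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True,
    limit=500000)  # assumption: cap the vocabulary to save RAM

print(vectors.most_similar('search', topn=5))  # nearest neighbours in vector space
```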
Tests are green.
I clearly need to transition from build-then-test/maintain to test-then-build/maintain...
- The tokenizer is tested (a sketch of such a test is shown below).
- The rest is still not tested as much...
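A sketch of what a tokenizer unit test can look like; the `my_tokenizer` import and the expected tokens are assumptions, not the actual test suite:

```python
# Sketch of a tokenizer unit test; the tokenizer name and the expected
# output are hypothetical, not the shipped tests.
import unittest
from tokenizers import my_tokenizer  # hypothetical import

class TokenizerTest(unittest.TestCase):
    def test_lowercase_and_split(self):
        tokens = my_tokenizer.tokenize("Hello World")
        self.assertEqual(tokens, ["hello", "world"])  # assumed behaviour

if __name__ == '__main__':
    unittest.main()
```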
- All the logic is in the class term_document_matrix_abstract in term_document.py.
- Low-level details are drafted in abstract methods and left to be implemented (see the sketch below).
- An implementation using a dict-of-dicts is available.
- A sparse matrix could be useful as well, and interesting for a comparison.
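A sketch of that layout; the method names here are illustrative, not the actual term_document.py API:

```python
# Sketch: the abstract class holds the logic, storage details are left
# to subclasses. Method names are illustrative, not the real API.
from abc import ABCMeta, abstractmethod

class term_document_matrix_abstract(metaclass=ABCMeta):
    @abstractmethod
    def _increment(self, term, doc_id):
        """Low-level detail: record one occurrence of term in doc_id."""

    @abstractmethod
    def _frequency(self, term, doc_id):
        """Low-level detail: how often term occurs in doc_id."""

    def index_document(self, doc_id, tokens):
        # High-level logic lives in the abstract class.
        for token in tokens:
            self._increment(token, doc_id)

class term_document_matrix_dict(term_document_matrix_abstract):
    """Dict-of-dicts implementation: {term: {doc_id: count}}."""
    def __init__(self):
        self._counts = {}

    def _increment(self, term, doc_id):
        self._counts.setdefault(term, {})
        self._counts[term][doc_id] = self._counts[term].get(doc_id, 0) + 1

    def _frequency(self, term, doc_id):
        return self._counts.get(term, {}).get(doc_id, 0)
```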
- String transforms and token filters can easily be combined to create a tokenizer (sketched below).
- Stemming, lowercasing and a few others are shown as examples in tokenizers.py.
- Tokenization is available through `tokens = my_tokenizer.tokenize(string)`.
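A minimal sketch of such a composition, assuming a hypothetical Tokenizer class and NLTK's Porter stemmer (tokenizers.py has the actual examples):

```python
# Sketch of composing string transforms and token filters into a
# tokenizer. The Tokenizer class here is hypothetical.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

class Tokenizer:
    def __init__(self, transforms, filters):
        self.transforms = transforms  # str -> str, applied before splitting
        self.filters = filters        # token -> token or None, applied after

    def tokenize(self, string):
        for transform in self.transforms:
            string = transform(string)
        tokens = string.split()
        for f in self.filters:
            tokens = [t for t in map(f, tokens) if t]
        return tokens

my_tokenizer = Tokenizer(
    transforms=[str.lower],
    filters=[stemmer.stem])

tokens = my_tokenizer.tokenize("Searching the Documents")
# ['search', 'the', 'document']
```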
- Extensible: we just need to provide the term_document_matrix_abstract constructor with an iterable over documents.
- Available: a JSON file reader and one fetching docs from MongoDB (both sketched below).
- We could read the JSON in chunks, but since we keep all the data in memory anyway...
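A sketch of what such iterables could look like; the file layout, database and collection names are assumptions, not the shipped readers:

```python
# Sketch of two document iterables; layout and names are placeholders.
import json
from pymongo import MongoClient

def json_documents(path):
    """Yield documents from a JSON file containing a list of dicts."""
    with open(path) as f:
        for doc in json.load(f):  # whole file in memory, see the note above
            yield doc

def mongo_documents(db_name, collection_name):
    """Yield documents straight from a MongoDB collection."""
    client = MongoClient()  # localhost:27017 by default
    for doc in client[db_name][collection_name].find():
        yield doc

# Either iterable can feed the matrix constructor, e.g.
# matrix = term_document_matrix_dict(...)  # the dict-of-dicts sketch above
```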
- It may be more SOLID to have the term-doc-freq data structure as a member of the main data structure.
- A full-fledged class for documents, instead of a dict, could help (see the sketch below).
- Typing is poor (Python...).
- Python 3 only.
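A minimal sketch of such a Document class; the field names are assumptions:

```python
# Sketch of a Document class replacing the raw dict; field names are
# hypothetical. __slots__ catches typo'd attributes, which a dict would not.
class Document:
    __slots__ = ('doc_id', 'title', 'body')

    def __init__(self, doc_id, title, body):
        self.doc_id = doc_id
        self.title = title
        self.body = body

    def text(self):
        """The text to be tokenized and indexed."""
        return self.title + ' ' + self.body
```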
- Add a bigram transformation (continue the work from train_word2vec...); a sketch follows this list.
- See how to improve performance:
  - Indexing: time should grow in O(tokens) ≈ O(documents).
  - Index size: the dict-of-dicts approach is heavy...
  - Search: O(n · ln n), where n is the number of documents in which the query terms appear; see the second sketch below.
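For the bigram item above, a sketch using gensim's Phrases model; the toy corpus, min_count and threshold values are placeholders:

```python
# Sketch of a bigram transformation with gensim's Phrases model.
from gensim.models.phrases import Phrases

# Toy corpus: each document is a list of tokens (placeholder data).
sentences = [['new', 'york', 'taxi'], ['new', 'york', 'subway']] * 10
bigram = Phrases(sentences, min_count=5, threshold=0.1)

print(bigram[['new', 'york', 'is', 'big']])
# -> ['new_york', 'is', 'big'] once the pair scores above the threshold
```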
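And for the search bound: the n · ln n comes from sorting the scored candidates. Roughly, as in this sketch (`index.postings` is a hypothetical lookup, and the scoring is a plain term-count sum):

```python
# Sketch of why search is O(n * ln n): gather the n documents containing
# the query terms (cheap with the dict-of-dicts index), then sort by score.
def search(index, query_tokens):
    scores = {}
    for term in query_tokens:
        # index.postings(term) is a hypothetical {doc_id: count} lookup.
        for doc_id, count in index.postings(term).items():
            scores[doc_id] = scores.get(doc_id, 0) + count
    # sorted() dominates: O(n * ln n) over the n matching documents.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```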