Simple library for building position, time, and inverted indexes on .srt files for quick indexing and n-gram search. The goal of this module is to provide reasonably efficient iterators over caption/text files for analytics code, while limiting resident memory usage and IO/computational latency.
First, install Rust (tested on stable 1.43.0). Then run python3 setup.py install --user to install the package.
Run scripts/build_index.py and point it to the directory containing your
subtitle files to build an index. This can take some time and require
significant computation and memory resources if there are many files
(e.g., hundreds of thousands).
After the indexer has run, there will be four entries in the index directory. These are:
documents.txt
lexicon.txt
index.bin
data/
Note that if you ran the indexer with the --chunk-size option set, then
index.bin will be a directory containing the index files.
data is a directory containing binary-encoded captions, one file per
document, each named by its document id. Do not manually rename these files!
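The layout described above can be checked with a short script before running analytics code against an index. This is a minimal sketch that uses only the Python standard library; the function names describe_index_dir and is_chunked are hypothetical helpers written for this illustration and are not part of the library.

```python
from pathlib import Path

# Entry names as listed above. "data" is always a directory; index.bin is a
# file, or a directory when the indexer was run with --chunk-size.
EXPECTED_ENTRIES = ("documents.txt", "lexicon.txt", "index.bin", "data")


def describe_index_dir(index_dir):
    """Map each expected entry to 'file', 'directory', or 'missing'.

    Hypothetical helper for illustration; not part of the library.
    """
    root = Path(index_dir)
    layout = {}
    for name in EXPECTED_ENTRIES:
        entry = root / name
        if entry.is_dir():
            layout[name] = "directory"
        elif entry.is_file():
            layout[name] = "file"
        else:
            layout[name] = "missing"
    return layout


def is_chunked(index_dir):
    """Return True when index.bin is a directory (indexer ran with --chunk-size)."""
    return (Path(index_dir) / "index.bin").is_dir()
```

Calling describe_index_dir on the index directory and looking for any "missing" value is a quick way to catch an interrupted indexing run.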
Sometimes we need to index additional documents after the index was first
built. To do this, run scripts/update_index.py. You can optionally also
update the lexicon.
The tools directory contains examples of how to use the various indices
that were built by scripts/build_index.py.
- tools/search.py demonstrates n-gram and topic search in a command line application.
- tools/scan.py performs a scan over all of the tokens in all documents.
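For readers unfamiliar with n-gram search over a positional inverted index, the idea can be sketched in plain Python. This is a conceptual illustration only: it does not use this library's API or on-disk format, and the function names build_positional_index and ngram_search are assumptions made for the example.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map each token to {doc_id: [positions]}.

    `documents` is a dict of doc_id -> list of tokens (an assumed input shape).
    """
    index = defaultdict(dict)
    for doc_id, tokens in documents.items():
        for position, token in enumerate(tokens):
            index[token].setdefault(doc_id, []).append(position)
    return index


def ngram_search(index, ngram):
    """Return {doc_id: [start positions]} where the n-gram's tokens are consecutive."""
    if not ngram:
        return {}
    results = {}
    # Candidate start positions come from the postings of the first token.
    for doc_id, starts in index.get(ngram[0], {}).items():
        matches = []
        for start in starts:
            # Every later token must appear at start + its offset in the same document.
            # Membership in a list is linear; a real index would use sorted postings.
            if all(
                start + offset in index.get(token, {}).get(doc_id, ())
                for offset, token in enumerate(ngram[1:], start=1)
            ):
                matches.append(start)
        if matches:
            results[doc_id] = matches
    return results
```

A query then looks like ngram_search(index, ["quick", "brown"]), which returns the documents containing the phrase together with the token position where each match starts.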
Run pytest -v from inside the tests directory.