Caption Index

Simple library for building position, time, and inverted indexes on .srt files for quick indexing and n-gram search. The goal of this module is provide reasonably efficient iterators over caption/text files for analytics code, while limiting resident memory usage and IO/computational latency.

Install

First, install Rust (tested on stable 1.43.0). Run python3 setup.py install --user.

Usage

Indexing your files

Run scripts/build_index.py and point it to the directory containing your subtitle files to build an index. This can take some time and require significant computation and memory resources if there are many files (i.e., hundreds of thousands).

After the indexer has run, there will be four entries in the index directory. These are:

documents.txt
lexicon.txt
index.bin
data/

Note that if you set ran the indexer with the --chunk-size set, then index.bin will be a directory containing the index files.

data is a directory containing binary encoded captions, one per file, and named by the document id. Do not manually rename these files!

Updating an index

Sometimes, we may need to index additional documents after we first built our index. To do this, run scripts/update_index.py. You can optionally also update the lexicon.

Using your index

The tools directory contains examples for how to use the various indices that were built by scripts/build_index.py.

tools/search.py demonstrates n-gram and topic search in a command line application.
tools/scan.py performs a scan over all of the tokens in all documents.

Tests

Run pytest -v from inside the tests directory.

Name		Name	Last commit message	Last commit date
Latest commit History 141 Commits
captions		captions
scripts		scripts
src		src
tests		tests
tools		tools
.gitignore		.gitignore
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
setup.cfg		setup.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Caption Index

Install

Usage

Indexing your files

Updating an index

Using your index

Tests

About

Releases

Packages

Contributors 5

Languages

License

scanner-research/caption-index

Folders and files

Latest commit

History

Repository files navigation

Caption Index

Install

Usage

Indexing your files

Updating an index

Using your index

Tests

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 5

Languages

Packages