This docker script is designed to facilitate indexing the XML files from thlib-texts-xml, as well as running and indexing the results of intertextual analysis. It is a companion to thlib-texts-indexer and to the Linode StackScript here.
The suggested workflow is:
- Spin up a new Linode using the StackScript linked above. You will need a GitHub personal access token with the scopes admin:org, admin:public_key, and repo.
- SSH onto that machine
- Spin up the docker container with:

      docker run -it tti
- Run either indexing or analysis commands, e.g.:

      cd /thlib-texts-indexer
      # activate the virtualenv
      source .venv/bin/activate
      ./examples/index-all-texts.sh
Currently there are two different strategies in use for running intertextual analysis. Steps for running each are below.
The first strategy runs the analysis and indexes the results in a single step:

    cd /thlib-texts-indexer
    # activate the virtualenv
    source .venv/bin/activate
    set -o allexport; source /solr-variables; set +o allexport
    python index.py -ttxd ../thlib-texts-xml -solr https://$SOLR_HOST -coll $SOLR_CORE -saxon ./saxon-8.jar --solr_auth $SOLR_USER:$SOLR_PASS \
        -include ngb.pt,lccw --index_itx --tib_data_path ../TibetanData --itx_type itx
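The `set -o allexport` line exports every variable assigned while sourcing `/solr-variables`, so that child processes such as `index.py` can read them from the environment. A minimal sketch of the pattern, using a hypothetical stand-in file rather than the real `/solr-variables`:

```shell
# Sketch of the allexport pattern; /tmp/solr-variables and its contents
# are stand-in examples, not the real /solr-variables on the container.
cat > /tmp/solr-variables <<'EOF'
SOLR_HOST=solr.example.org
SOLR_CORE=thlib
EOF
# While allexport is on, every variable assignment is marked for export
set -o allexport; source /tmp/solr-variables; set +o allexport
# Child processes now see the variables:
python3 -c 'import os; print(os.environ["SOLR_HOST"])'   # prints solr.example.org
```

Without the allexport wrapping, a plain `source` would set the variables in the current shell only, and `index.py` would not see them.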
In the second strategy, analysis and indexing happen as two distinct steps. First, run the analysis:

    cd /tibetan-text-reuse
    # activate the virtualenv
    source .venv/bin/activate
    python bo_reuse/text_reuse.py -c /texts/lccw-raw.txt /texts/ngb.pt-raw.txt -d . -o result.txt
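Because this analysis runs for hours, it can be safer to launch it under `nohup` so an SSH disconnect does not kill it. A sketch using the same paths as above (the log file name is an arbitrary choice):

```shell
# Run the long analysis in the background; nohup plus the redirect means it
# survives the SSH session ending, and progress is captured in analysis.log.
cd /tibetan-text-reuse
source .venv/bin/activate
nohup python bo_reuse/text_reuse.py -c /texts/lccw-raw.txt /texts/ngb.pt-raw.txt -d . -o result.txt \
    > analysis.log 2>&1 &
echo "analysis started as PID $!"
tail -f analysis.log   # Ctrl-C stops tail, not the analysis
```

Alternatively, running inside `tmux` or `screen` achieves the same effect and lets you reattach later.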
This analysis will take several hours to run. Once it has finished, index the results:
    cd /thlib-texts-indexer
    # activate the virtualenv
    source .venv/bin/activate
    set -o allexport; source /solr-variables; set +o allexport
    python index.py -ttxd ../thlib-texts-xml -solr https://$SOLR_HOST -coll $SOLR_CORE -saxon ./saxon-8.jar --solr_auth $SOLR_USER:$SOLR_PASS \
        --results_file ../tibetan-text-reuse/result.txt --text_file_1 /texts/lccw.txt --text_file_2 /texts/ngb.pt.txt --itx_type itx2
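After either strategy, you can sanity-check that documents actually landed in Solr by asking the core for its document count. A hedged sketch using Solr's standard select handler and the same variables sourced from `/solr-variables` (it assumes the core is served at the conventional `/solr/<core>/select` path):

```shell
# Query the core for its total document count; rows=0 returns only the
# numFound figure in the JSON response, not any documents.
set -o allexport; source /solr-variables; set +o allexport
curl -s -u "$SOLR_USER:$SOLR_PASS" \
    "https://$SOLR_HOST/solr/$SOLR_CORE/select?q=*:*&rows=0"
```

A nonzero `numFound` in the response confirms the index was populated.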