App: DeepDive OCR

Put this repo under deepdive/app.

A DeepDive application on OCR systems.

Requirements

PostreSQL
Python
Matplotlib (pip install matplotlib)

How to run the system

Create a ddocr database (createdb ddocr)
Change the application.conf db.default.user entry to yours.
If necessary, add database connection details to run.sh
Prepare your OCR output data, Google ngram data, distant supervision data.
Execute prepare_supv_data.sh to load supervision data into database
Execute load_ngram_to_db.sh to load Google Ngram data into database
Do OCR Alignment by script/AlignJournals.py
Execute prepare_data.sh to load aligned OCR outputs to databse.
Execute run.sh.

Datasets

Error analysis location

/Users/Robin/Documents/repos/deepdive_ocr/data-140111/compare/getaccrecall.py

Raw OCR Results

140,982  /lfs/madmax3/0/czhang/cleanpaleo/TORUNEXT_JOURNAL
 43,487  /lfs/madmax3/0/czhang/cleanpaleo/TORUNEXT_feb15
 14,646  /lfs/madmax3/0/czhang/cleanpaleo/NLPRS_jan20_overlap.22/

 dirty papers: 
 http://hazy.cs.wisc.edu/hazy/share/zifeipdf/
 /lfs/madmax/0/czhang/paleopaleo/input_large_compact/
 or with cuneiform HTML outputs: 
 /lfs/madmax/0/czhang/paleopaleo/input_large/

Candidates

# /dfs/madmax3/0/zifei/deepdive/app/ocr/data/journals-output-new
<!-- THIS IS WRONG!!! -->

OR:
/dfs/madmax/0/zifei/deepdive/app/ocr/data/journals-output

HOW TO GET: from raw candidates, do:

cd script/
python2.7 align_one_document.py ......

Supervision HTMLs

first half:

17,506  /dfs/madmax5/0/zifei/deepdive/app/ocr/data/sd-html/output-140508

second half:

20,040  /dfs/madmax5/0/zifei/deepdive/app/ocr/data/output-secondhalf

or TOTAL:

/dfs/hulk/0/zifei/ocr/sd-html/

Escaped supervision / evaluation data:

How to prepare: (e.g.)

bash prepare_supv_data_from_html_xargs.sh eval /dfs/hulk/0/zifei/ocr/sd-html/ /dfs/hulk/0/zifei/ocr/evaluation_escaped_2/

/dfs/hulk/0/zifei/ocr/supervision_escaped/
/dfs/hulk/0/zifei/ocr/evaluation_escaped/

<!-- /dfs/hulk/0/zifei/ocr/evaluation_escaped_new/ -->

Google Ngram

/dfs/madmax/0/zifei/google-ngram/1gram/
/dfs/madmax/0/zifei/google-ngram/2gram/
OR
/dfs/madmax5/0/zifei/deepdive/app/ocr/data/google-ngram/1gram/
/dfs/madmax5/0/zifei/deepdive/app/ocr/data/google-ngram/2gram/
OR 
/dfs/hulk/0/zifei/ocr/google-ngram/1gram/
/dfs/hulk/0/zifei/ocr/google-ngram/1gram.tsv
/dfs/hulk/0/zifei/ocr/google-ngram/2gram_reduced.tsv

Web Ngram (filtered by 10000)

/dfs/hulk/0/zifei/ocr/web_ngram/3gram.tsv
/dfs/hulk/0/zifei/ocr/web_ngram/4gram.tsv
/dfs/hulk/0/zifei/ocr/web_ngram/5gram.tsv

Domain corpus (HTML aggregated by docid)

/dfs/hulk/0/zifei/ocr/domain-corpus/domain-corpus.tsv

KB data

/dfs/hulk/0/zifei/ocr/kb/intervals.tsv
/dfs/hulk/0/zifei/ocr/kb/paleodb_taxons.tsv
/dfs/hulk/0/zifei/ocr/kb/supervision_occurrences.tsv

# Aggregated:
/dfs/hulk/0/zifei/ocr/kb/entity_kb.tsv
/dfs/hulk/0/zifei/ocr/kb/entity_kb_words.txt

# ngrams
/dfs/hulk/0/zifei/ocr/kb/domain_1gram_100docs_reduced5.txt
/dfs/hulk/0/zifei/ocr/kb/domain_1gram_100docs.txt
/dfs/hulk/0/zifei/ocr/kb/google_1gram_1000.txt
/dfs/hulk/0/zifei/ocr/kb/google_1gram_10k.txt

Ground truth

HTML: (hold out 1/5 as ground truth and rest for distant supervision; run 5 times to make sure they do not differ wildly...)

168,790  /lfs/madmax3/0/czhang/cleanpaleo/jid2url.tsv

 37,545  grep sciencedirect /lfs/madmax3/0/czhang/cleanpaleo/jid2url.tsv
 

 priority: 
 1. /lfs/madmax3/0/czhang/cleanpaleo/TORUNEXT_feb15/[JID] 
 2. /lfs/madmax3/0/czhang/cleanpaleo/TORUNEXT_JOURNAL/[JID]

Dependencies

python-Levenshtein
pyquery
snappy 0.8.5 or higher (http://snap.stanford.edu/snappy/0.8.5/)
psql "fuzzystrmatch" module:
- http://blog.2ndquadrant.com/wp-content/uploads/2011/03/fuzzystrmatch-gp-4.0.4.0.tar.gz
- Doc: http://blog.2ndquadrant.com/fuzzystrmatch_greenplum/
psql "pg_trgm" module (in GiST):
- http://www.sai.msu.su/~megera/postgres/gist/pg_trgm/pg_trgm.tar.gz
- Docs: http://www.sai.msu.su/~megera/postgres/gist/

Name		Name	Last commit message	Last commit date
Latest commit History 159 Commits
data		data
doc		doc
env		env
error-analysis		error-analysis
evaluation		evaluation
experiments		experiments
plotting		plotting
prepare_data		prepare_data
script		script
testgroupby		testgroupby
udf		udf
util		util
.gitignore		.gitignore
README.md		README.md
application-multinomial.conf		application-multinomial.conf
application.conf		application.conf
bestpick.conf		bestpick.conf
evaluate_ocr_result.sh		evaluate_ocr_result.sh
feature-analysis.py		feature-analysis.py
generate_all_ocr_results.sh		generate_all_ocr_results.sh
generate_ocr_result-master.sh		generate_ocr_result-master.sh
generate_ocr_result.sh		generate_ocr_result.sh
kfold-run.sh		kfold-run.sh
ocr-evaluation-NORESEG.py		ocr-evaluation-NORESEG.py
ocr-evaluation-strict.py		ocr-evaluation-strict.py
ocr-evaluation-xargs.py		ocr-evaluation-xargs.py
ocr-evaluation.py		ocr-evaluation.py
pick-result.eps		pick-result.eps
plot-eval-recall.py		plot-eval-recall.py
rerun-all-evaluation.sh		rerun-all-evaluation.sh
run-bestpick.sh		run-bestpick.sh
run-ddocr_1k.sh		run-ddocr_1k.sh
run-evaluation-sample.sh		run-evaluation-sample.sh
run-evaluation.sh		run-evaluation.sh
run-full-pipeline.sh		run-full-pipeline.sh
run-multinomial.sh		run-multinomial.sh
run-ocr-evaluation-raw-candword.sh		run-ocr-evaluation-raw-candword.sh
run-ocr-evaluation-sample.sh		run-ocr-evaluation-sample.sh
run-ocr-evaluation.sh		run-ocr-evaluation.sh
run.sh		run.sh
save-exp-result.sh		save-exp-result.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

App: DeepDive OCR

Put this repo under deepdive/app.

Requirements

How to run the system

Datasets

Error analysis location

Raw OCR Results

Candidates

Supervision HTMLs

Escaped supervision / evaluation data:

Google Ngram

Web Ngram (filtered by 10000)

Domain corpus (HTML aggregated by docid)

KB data

Ground truth

Dependencies

About

Releases

Packages

Languages

zifeishan/deepdive_ocr_app

Folders and files

Latest commit

History

Repository files navigation

App: DeepDive OCR

Put this repo under deepdive/app.

Requirements

How to run the system

Datasets

Error analysis location

Raw OCR Results

Candidates

Supervision HTMLs

Escaped supervision / evaluation data:

Google Ngram

Web Ngram (filtered by 10000)

Domain corpus (HTML aggregated by docid)

KB data

Ground truth

Dependencies

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages