NoraExtraction
Many packages exist for extracting text from PDF files. Some are based on OCR-like techniques (primarily for scanned documents); others work as limited PDF interpreters, reading out a pure text stream from 'digitally born' documents. One of the more widely used packages appears to be Apache [http://incubator.apache.org/pdfbox/ PDFBox], which we will evaluate as our baseline, in parallel with much ongoing work in the international ACL community.
Other open-source tools that we should assess include [http://pdftohtml.sourceforge.net/ PDFtoHTML], [http://poppler.freedesktop.org/ Poppler], and [http://www.unixuser.org/~euske/python/pdfminer/index.html PDFMiner]. For a smaller sample of NORA documents, it may also make sense to look contrastively at non-open tools like [http://a-pdf.com/text/index.htm A-PDF Text Extractor] and Adobe Acrobat. Some of these packages were briefly discussed at the 2009 DELPH-IN Summit; please see the [http://wiki.delph-in.net/moin/BarcelonaPreprocessing discussion notes] for details.
We have not yet decided exactly what the architecture of the various parts will be. I am currently setting up an Eclipse workspace for this, for ease of editing, and I suggest using git to keep track of that workspace. Git is an advanced (it is used for the Linux kernel, among other projects) but also very lightweight source control manager. I am currently setting it up on my workspace for (at least) personal use, but others are welcome to give it a go.
Say git clone ~johanbev/wescience0 on ifi-linux to get the current branch. See [http://www.kernel.org/pub/software/scm/git/docs/gittutorial.html] for a short and good tutorial on git. Johanbev can then manually pull changes from other developers' repositories. Later on we also plan to allow git push (similar to an svn check-in), but first one of us has to learn xfs-acl-lists or set up a group for the NORA people.
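The pull-based workflow described above can be tried out locally before touching the real repository. The following is only a sketch under stated assumptions: the directory names maintainer and developer, the file names, and the commit messages are all hypothetical, and a temporary directory stands in for ~johanbev/wescience0. It shows one developer cloning the maintainer's repository, committing a change, and the maintainer then pulling that change without the developer needing any push access.

```shell
#!/bin/sh
set -eu

# Throwaway identity (hypothetical) so commits work without any git configuration.
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.org"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.org"

work=$(mktemp -d)

# 1. The maintainer's repository (a local stand-in for ~johanbev/wescience0).
git init -q "$work/maintainer"
cd "$work/maintainer"
echo "NORA extraction workspace" > README
git add README
git commit -q -m "Initial commit"

# 2. Another developer clones it and commits a change in the clone.
git clone -q "$work/maintainer" "$work/developer"
cd "$work/developer"
echo "Notes on the PDFBox baseline" > notes.txt
git add notes.txt
git commit -q -m "Add notes"

# 3. The maintainer pulls the developer's work directly from the clone.
cd "$work/maintainer"
git pull -q --ff-only "$work/developer" HEAD

# Show the resulting history in the maintainer's repository.
git log --oneline
```

In the real setup, step 3 would be run by Johanbev with the path of the other developer's repository in place of the temporary directory. The --ff-only flag is a cautious choice for this sketch: it makes the pull fail rather than create a merge commit if the two histories have diverged.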