Extract citations from PDFs.
Note: The version number is 2.x, but it really should be 0.2.x.
You might also want to try Grobid, which I have found to perform better than the version of ParsCit used here. This version of ParsCit throws away non-textual information (font, formatting, etc.).
# Extract metadata from PDF content held in a string, using default tools and settings
result = Biblicit::Extractor.extract(content: "a string containing the content of a PDF file")
# Extract metadata from a file using all available tools
result = Biblicit::Extractor.extract(file: "myfile.pdf", tools: [:parshed, :cb2bib], remote: true, token: false)
# See reference information for "myfile.pdf"
result[:citeseer][:title]
result[:parshed][:title]
result[:citeseer][:authors]
# etc
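Because each tool reports its own fields, it can be handy to take the first non-empty value across tools. The following is a minimal sketch, not part of Biblicit: `first_present` is a hypothetical helper, and the nested-hash shape of `sample` is assumed from the lookups shown above rather than taken from the gem's documentation.

```ruby
# Hypothetical helper (not part of Biblicit): return the first non-empty value
# of `field` among the given tools, checked in priority order.
def first_present(result, field, tools = [:citeseer, :parshed, :cb2bib])
  tools.each do |tool|
    value = (result[tool] || {})[field]
    next if value.nil?
    next if value.respond_to?(:empty?) && value.empty?
    return value
  end
  nil
end

# Stand-in for the hash returned by Biblicit::Extractor.extract.
# The shape (tool => field => value) is an assumption.
sample = {
  citeseer: { title: "", authors: ["A. Author"] },
  parshed:  { title: "An Example Title" }
}

puts first_present(sample, :title)
puts first_present(sample, :authors).inspect
```

The priority order in `tools` is arbitrary here; reorder it to prefer whichever tool works best on your documents.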
Wrapper around Perl code extracted from CiteSeerX.
Uses a model trained with the svm-light Support Vector Machine library.
Wrapper around Perl & Ruby code from ParsCit, which is included as a Git submodule.
Uses a model trained with the CRF++ Conditional Random Fields library.
Wrapper around cb2Bib in command-line mode.
Uses an apparently less sophisticated parsing algorithm than the others to extract metadata. If remote: true is set, it then scrapes one of a large number of journal or public repository websites for a structured version of the citation data. Warning: sometimes it finds the wrong work!
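One way to guard against the wrong-work problem is to compare the title returned by the remote lookup with a title extracted locally by another tool. The following is a rough sketch only: `title_tokens` and `titles_agree?` are hypothetical helpers, not part of Biblicit or cb2Bib, and the 0.6 threshold is an arbitrary assumption that you would need to tune.

```ruby
require "set"

# Hypothetical helper: lowercase a title and split it into a set of
# alphanumeric word tokens, ignoring punctuation and spacing.
def title_tokens(title)
  title.to_s.downcase.scan(/[a-z0-9]+/).to_set
end

# Hypothetical helper: treat two titles as the same work when the Jaccard
# overlap of their word sets reaches the threshold (0.6 is an assumption).
def titles_agree?(a, b, threshold = 0.6)
  ta = title_tokens(a)
  tb = title_tokens(b)
  return false if ta.empty? || tb.empty?
  overlap = (ta & tb).size.to_f / (ta | tb).size
  overlap >= threshold
end

local  = "Deep Learning for Citation Parsing"
remote = "deep learning for citation parsing."
puts titles_agree?(local, remote)
puts titles_agree?(local, "A Survey of Protein Folding")
```

A word-overlap check like this is crude: it will reject legitimate matches when one tool truncates the title, so treat a mismatch as a prompt to inspect the record rather than as proof of an error.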
There are a lot of dependencies, but you may not need all of them, depending on your use case.
Different tools are used for different input file formats.
PDF - Poppler
This provides pdftotext. You could install xpdf instead.
Requires fontconfig.
wget http://poppler.freedesktop.org/poppler-0.22.1.tar.gz
tar -xzf poppler-0.22.1.tar.gz
cd poppler-0.22.1
./configure
make
sudo make install
sudo apt-get install poppler-utils
brew install poppler
Postscript - Ghostscript
This provides ps2ascii.
wget http://downloads.ghostscript.com/public/ghostscript-9.06.tar.gz
tar -xzf ghostscript-9.06.tar.gz
cd ghostscript-9.06
make
sudo make install
sudo apt-get install ghostscript
brew install ghostscript
Other (e.g. docx) - AbiWord
This provides abiword.
sudo apt-get install abiword
As of writing, you're out of luck, because AbiWord doesn't compile on recent versions of OS X. According to their website, however, this is being actively worked on.
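To see which of the converters named above are already installed, you can look for their executables on your PATH. This is a minimal sketch, not something Biblicit provides: `which` is a hypothetical helper defined here, and the command names are the ones mentioned above (pdftotext, ps2ascii, abiword).

```ruby
# Hypothetical helper: return the full path of `cmd` if an executable with
# that name exists in one of the PATH directories, otherwise nil.
def which(cmd, path = ENV["PATH"].to_s)
  path.split(File::PATH_SEPARATOR).each do |dir|
    candidate = File.join(dir, cmd)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Input format => command-line converter, as listed above.
CONVERTERS = {
  "PDF"               => "pdftotext",
  "PostScript"        => "ps2ascii",
  "Other (e.g. docx)" => "abiword"
}

CONVERTERS.each do |format, cmd|
  location = which(cmd)
  status = location ? "found at #{location}" : "not found on PATH"
  puts "#{format}: #{cmd} #{status}"
end
```

A missing converter only matters if you plan to process that input format.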
More than these might be required; this is what I had to add to my default installation.
sudo cpan Digest::SHA1
sudo cpan String::Approx
You can specify where you have installed CRF++ by setting the CRFPP_HOME environment variable.
wget http://crfpp.googlecode.com/files/CRF%2B%2B-0.57.tar.gz
tar xvzf CRF++-0.57.tar.gz
cd CRF++-0.57
./configure
make
sudo make install
sudo apt-add-repository 'deb http://cl.naist.jp/~eric-n/ubuntu-nlp oneiric all'
sudo apt-get update
sudo apt-get install libcrf++-dev crf++
brew install crf++
svm-light is required for header extraction (reference information for the input work itself).
The included model requires version 5, not the current version. You can specify where you have installed svm-light by setting the SVM_LIGHT_HOME environment variable.
mkdir svm_light5
cd svm_light5
wget http://download.joachims.org/svm_light/v5.00/svm_light.tar.gz
tar -xzf svm_light.tar.gz
make
echo "export SVM_LIGHT_HOME=`pwd`" >> ~/.profile # or .bashrc or whatever
source ~/.profile
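Before running the extractors, it may help to confirm that CRFPP_HOME and SVM_LIGHT_HOME are set and point to real directories. The following is a minimal sketch: `check_home` is a hypothetical helper, not part of Biblicit, and it checks only that each directory exists. It makes no assumption about which files the gem expects to find inside.

```ruby
# Hypothetical helper: describe the state of an environment variable that is
# supposed to hold an installation directory.
def check_home(var, env = ENV)
  dir = env[var]
  return "#{var} is not set" if dir.nil? || dir.empty?
  return "#{var} points to a missing directory: #{dir}" unless File.directory?(dir)
  "#{var} ok: #{dir}"
end

# The two variables named in the installation notes above.
%w[CRFPP_HOME SVM_LIGHT_HOME].each { |var| puts check_home(var) }
```

If a variable is reported as not set after you edit ~/.profile, remember that the change only applies to shells started (or re-sourced) afterwards.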
wget http://www.molspaces.com/dl/progs/cb2bib-1.4.9.tar.gz
tar -xzvf cb2bib-1.4.9.tar.gz
cd cb2bib-1.4.9
./configure --prefix /usr/local
make
sudo make install
Requires Qt & X11, unfortunately, and still requires a hack to work on recent versions of OS X.
wget http://www.molspaces.com/dl/progs/cb2bib-1.4.9.tar.gz
tar -xzvf cb2bib-1.4.9.tar.gz
cd cb2bib-1.4.9
./configure --prefix /Applications/cb2Bib
make # fails first time...
mv src/Makefile src/Makefile.old
sed 's|-lX11 -framework QtWebKit|-lX11 -L/usr/X11/lib -I/usr/X11/include -framework QtWebKit|' src/Makefile.old > src/Makefile
make # should succeed now
sudo make install
sudo apt-get install cb2bib
(I'm not currently sure what this was required for; TODO figure it out!)
sudo apt-get install libicu-dev
Copyright Academia.edu or the original author(s) - see documentation in the included parscit and svm-header-parse directories.
Apache licensed (see LICENSE.TXT).
Please note svm-light is in general free only for non-commercial use, but can be used in this gem by permission of the author. For conditions on additional uses see the website.