Goal: extract LaTeX math from the .tex content available from arXiv.
Caveat when cloning this repo: Total download size is 640 MB.
Read latex-in-arxiv/src/postings_list/query/README.md
Everything is containerized, so in this repo (latex-in-arxiv/) use either make docker (for Linux) or make docmac (for Mac).
To run the application, run /opt/scanner.out inside the Docker container.
To recompile the scanner, run the following inside the Docker container:
cd latex-in-arxiv/src/postings_list/query
make scanner
./scanner.out path_to_tex_files tex$
Suppose you have a .tex file that contains math, like
\documentclass{article}
\title{test}
\begin{document}
\maketitle
\section{Introduction}
This is a great paper.
\begin{equation}
a+b = c
\end{equation}
Where $c$ is some variable.
\end{document}
There's an expression, a+b=c, and an inline variable, c.
How can the expression and the variables be extracted?
There are a few options for parsing LaTeX; see #14. The options that give good-quality results are also slow.
This repo uses Ragel to quickly parse LaTeX and find math.
Depends on Ragel State Machine Compiler version 7.0.4 (February 2021).
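To make the extraction task concrete, here is a minimal Python sketch that pulls the display expression and the inline variable out of the sample document above. It is not the Ragel scanner used in this repo; it is a regular-expression approximation, and the function name extract_math and the list of environments are assumptions made for illustration. It ignores comments, \( ... \), \[ ... \], $$ ... $$ and nested environments.

```python
import re

# Environments treated as display math in this sketch (an assumed, incomplete list).
MATH_ENVIRONMENTS = ("equation", "equation*", "align", "align*")


def extract_math(tex: str) -> dict:
    """Return the display and inline math found in a LaTeX string."""
    env_names = "|".join(re.escape(name) for name in MATH_ENVIRONMENTS)
    # \begin{env} ... \end{env}, where both names must be identical.
    display_pattern = re.compile(
        r"\\begin\{(" + env_names + r")\}(.*?)\\end\{\1\}", re.DOTALL
    )
    display = [m.group(2).strip() for m in display_pattern.finditer(tex)]

    # Blank out display blocks so their contents are not scanned again.
    remaining = display_pattern.sub(" ", tex)

    # Inline math: $...$ whose delimiters are neither escaped (\$) nor doubled ($$).
    inline_pattern = re.compile(r"(?<![\\$])\$([^$]+?)\$(?!\$)")
    inline = [m.group(1).strip() for m in inline_pattern.finditer(remaining)]

    return {"display": display, "inline": inline}


sample = r"""\documentclass{article}
\title{test}
\begin{document}
\maketitle
\section{Introduction}
This is a great paper.
\begin{equation}
a+b = c
\end{equation}
Where $c$ is some variable.
\end{document}
"""

result = extract_math(sample)
print(result)  # {'display': ['a+b = c'], 'inline': ['c']}
```

A regex pass like this is quick to write but fragile on real arXiv sources, which is one reason to prefer a state-machine scanner for bulk processing.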
https://www.cs.cornell.edu/projects/kddcup/datasets.html
In the directory latex-in-arxiv/get_sample_data, use make get_sample_data
# curl http://export.arxiv.org/api/query?search_query=all:rigorous%20derivation
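The same query can be built from Python using only the standard library. This is a hedged sketch, not part of the repo: the function names build_query_url, fetch_feed and entry_titles are hypothetical, and max_results is an arXiv API parameter that the curl line above does not set. The API returns an Atom feed, so titles are read from the Atom namespace.

```python
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import quote, urlencode

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def build_query_url(terms: str, max_results: int = 10) -> str:
    """Build an all-fields search URL like the curl example."""
    params = {"search_query": f"all:{terms}", "max_results": max_results}
    # quote_via=quote encodes spaces as %20; safe=":" keeps the field prefix readable.
    return ARXIV_API + "?" + urlencode(params, quote_via=quote, safe=":")


def fetch_feed(url: str, timeout: float = 30.0) -> str:
    """Download the Atom feed for a query URL (performs a network request)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def entry_titles(atom_xml: str) -> list:
    """Return the title of every entry in an Atom feed, with whitespace collapsed."""
    root = ET.fromstring(atom_xml)
    titles = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        titles.append(" ".join(title.split()))
    return titles
```

For example, entry_titles(fetch_feed(build_query_url("rigorous derivation"))) would list the titles returned for the query in the curl line.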
For details, see https://arxiv.org/help/bulk_data_s3
# s3cmd get s3://arxiv/src/arXiv_src_manifest.xml . --requester-pays
# s3cmd get s3://arxiv/src/arXiv_src_9912_001.tar . --requester-pays
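Once a source archive such as arXiv_src_9912_001.tar has been downloaded, a reasonable first step is to look at what it contains before feeding anything to the scanner. Below is a minimal sketch using Python's standard tarfile module. The function name list_members and the default .gz suffix are assumptions, since the archive layout is not described here; adjust them after inspecting the real member names.

```python
import tarfile


def list_members(tar_path: str, suffix: str = ".gz") -> list:
    """Return the names of regular files in a tar archive that end with `suffix`."""
    # mode="r" lets tarfile detect whether the archive is compressed.
    with tarfile.open(tar_path, mode="r") as archive:
        return [
            member.name
            for member in archive.getmembers()
            if member.isfile() and member.name.endswith(suffix)
        ]
```

Calling list_members("arXiv_src_9912_001.tar") would then return the matching member names without extracting anything to disk.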