
ArXivExplore supports deep data analysis of the more than 2.6 million physics, math, cs, etc. articles on ArXiv, providing functionality for, e.g., title/abstract word statistics; dissection of TeX sources, formulae and citations; neural networks for classification, clustering or recommendation; and LLM-automated concept definitions and author reports.
For a general explanation, see the presentation I gave at the Wolfram Technology Conference 2024. You can also watch the recording on YouTube.
The first article ever on ArXiv:
In[]:= ArXivIDs[All] // First
Out[]= "physics/9403001"
In[]:= ArXivVersions["physics/9403001"]
Out[]= {{"version" -> "v1", "created" -> "Fri, 25 Apr 1986 15:39:49 GMT"}}
In[]:= ArXivTitles["physics/9403001"]
Out[]= "Desperately Seeking Superstrings"
In[]:= ArXivAuthors["physics/9403001"]
Out[]= {"Paul Ginsparg", "Sheldon Glashow"}
A plot showing the trends of the most popular title words in the theoretical physics category ("hep-th", primary or cross-list):
In[]:= Block[{words = {"black", "gauge", "gravity", "string"}},
ArXivPlot[words, {"hep-th", All}, PlotRange -> Full, PlotLegends -> words]]

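Assuming the same call pattern accepts the other category specifications used in this walkthrough, one could compare word trends in the aggregated computer science listing in the same way (a sketch, with an illustrative word choice):
In[]:= Block[{words = {"learning", "neural", "quantum"}},
ArXivPlot[words, {"cs", All}, PlotRange -> Full, PlotLegends -> words]]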
The 50 most common 2-neighbour title words on the whole ArXiv, ever:
In[]:= ArXivTopTitles[All, 50, 2] // Normal // Multicolumn[#, 3] &

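Assuming Normal@ArXivTopTitles[...] yields word -> count rules (as the tabulations below suggest), the same data feeds a standard WordCloud directly:
In[]:= WordCloud[Association@Normal@ArXivTopTitles[All, 100]]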
Authors with more than one possible name (and their categories) are conveniently registered as "ArXivAuthor" entities. For example:
In[]:= ArXivAuthorRegister["Vescovi", {"E. Vescovi", "Edoardo Vescovi"}]

We can then easily create an author citations graph, with the tooltips indicating the article IDs:
In[]:= ArXivAuthorGraph[Entity["ArXivAuthor", "Vescovi"], VertexLabels -> Placed[Automatic, Tooltip]]

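Assuming ArXivAuthorGraph returns an ordinary Graph object, the built-in graph functions apply to it directly, e.g. for a quick size check:
In[]:= With[{g = ArXivAuthorGraph[Entity["ArXivAuthor", "Vescovi"]]},
<|"Vertices" -> VertexCount[g], "Edges" -> EdgeCount[g]|>]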
The dimensions of the whole ArXiv main dataset (at the end of June 2025):
In[]:= ArXivDataset[All] // Dimensions
Out[]= {2775152, 14}
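Assuming ArXivDataset[All] is an ordinary Dataset (its name and the Dimensions call above suggest so), the usual Dataset queries apply, e.g. peeking at the first rows and at the 14 column names:
In[]:= ArXivDataset[All][1 ;; 3]
In[]:= Keys@Normal@First@ArXivDataset[All]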
Let us create a super-database with all computer science "cs" categories (primary or cross-list):
In[]:= ArXivDataset[{"cs", All}] = ArXivDatasetAggregate[{"cs", All}] // EchoFunction[Dimensions];
>> {696632, 14}
and then let us visualize the most and least frequent title words:
In[]:= Block[{cat = {"cs", All}, tabs, colrules, tabskey, compl, cut = 160, res = 10},
 colrules = {"learning" -> Style["learning", Purple, Bold], "using" -> Style["using", Purple, Bold],
   "theory" -> Style["theory", Red, Bold], "understanding" -> Style["understanding", Red, Bold]};
 tabs = MapAt[Apply[Sequence, #] &,
    MapIndexed[Partition[Riffle[Map[Style[#, Bold] &, Range[res*(First[#2] - 1) + 1, res*First[#2]]], #], 2] &,
     Partition[Normal@ArXivTopTitles[cat, cut], UpTo@res]], {All, All, 2}] /. colrules;
 tabskey = Cases[tabs, _List?(MemberQ[#[[All, 2]], Alternatives["theory", "understanding"] /. colrules] &)];
 compl = Text[Style["... " <> ToString[Round[First@tabskey[[1, 1, 1]] - 1, 10]] <>
     "+ words more popular than \"understanding\" or \"theory\" in CS!", Bold, 9, TextAlignment -> Center]];
 GraphicsRow[Join[{TextGrid@tabs[[1]], compl}, TextGrid /@ tabskey], ImageSize -> Large]]

Let us calculate the 10 most frequent categories, with their meanings and the number of articles in each:
In[]:= KeyValueMap[{#1, ArXivCategoriesLegend[#1], #2} &, ArXivTopCategories[10]] // Normal // TableForm
hep-ph            | High Energy Physics - Phenomenology      | 134315
quant-ph          | Quantum Physics                          | 113002
cs.CV             | Computer Vision and Pattern Recognition  | 107036
hep-th            | High Energy Physics - Theory             | 106818
cs.LG             | Machine Learning                         |  94387
astro-ph          | Astrophysics                             |  94246
gr-qc             | General Relativity and Quantum Cosmology |  64940
cond-mat.mes-hall | Mesoscale and Nanoscale Physics          |  64255
cond-mat.mtrl-sci | Materials Science                        |  62135
cs.CL             | Computation and Language                 |  56074
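Assuming ArXivTopCategories[10] returns a category -> count association (as the KeyValueMap above implies), the same counts can be charted directly with the built-in BarChart:
In[]:= BarChart[ArXivTopCategories[10], ChartLabels -> Automatic, BarOrigin -> Left]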
We can create train and test sets using only 5000 = {4500, 500} (train/test) titles and abstracts for each category:
In[]:= {train10, test10} = ArXivClassifyCategoriesTrainTest[10, 5000];
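A quick sanity check on the split sizes: with 10 categories and {4500, 500} examples each, we expect 45000 training and 5000 test items (assuming the two sets are plain lists or associations of labeled examples):
In[]:= Length /@ {train10, test10}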
We can then train a neural network (NN) to classify these categories, with layer dimension 80 and dropout level 0.5:
In[]:= net10 = ArXivClassifyCategoriesNet[10, 80, 0.5]

In[]:= netTrained10 = NetTrain[net10, train10, All, ValidationSet -> Scaled[0.07], MaxTrainingRounds -> 5]

Even with a basic 30-minute training on a laptop CPU, we obtain 89% accuracy:
In[]:= NetMeasurements[netTrained10["TrainedNet"], test10, "Accuracy"]~PercentForm~2

and a rather clean confusion matrix:
In[]:= NetMeasurements[netTrained10["TrainedNet"], test10, "ConfusionMatrixPlot"]

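Other standard NetMeasurements properties work on the same trained net and test set, e.g. the per-class F1 scores:
In[]:= NetMeasurements[netTrained10["TrainedNet"], test10, "F1Score"]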
We could even classify authors within the same category, with ArXivClassifyAuthorNet.
Extracting the TeX introduction of a random article:
In[]:= ArXivTeXIntroduction[Echo@RandomChoice@ArXivIDs[All]] // Short[#, 10] &
>> "2211.13033"

We can also extract the formulae from the TeX source of a random "hep-th" article:
In[]:= Table[i -> Take[Lookup[#, i], UpTo[50]], {i, Keys[#]}] &@ArXivTeXFormulae[Echo@RandomChoice[ArXivIDs["hep-th"]]] // TabView
>> "2305.12610"

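Assuming ArXivTeXFormulae returns an association of formula lists keyed by type (the Keys/Lookup usage above suggests so), counting the formulae per key is a one-liner; here we reuse the ID echoed above:
In[]:= Map[Length, ArXivTeXFormulae["2305.12610"]]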
Explain a technical concept using an article introduction and the LLM functionality:
In[]:= ArXivExplainConcept["Viterbi algorithm", "2401.02314", LLMEvaluator ->
<|"Prompts" -> "Keep the output contained and emphasize the relation to this paper"|>] // Text

Let us visualize all authors with more than 7 papers in the primary category "cs.NA":
In[]:= ArXivTopAuthors["cs.NA", 7] // Column

Let us pick a random author among them and use the LLM functionality to explain their overall work:
In[]:= ArXivExplainAuthor["Kevin Carlberg", "cs.NA", LLMEvaluator -> <|"Prompts" -> "Keep the output contained"|>] // Text

For the full documentation of all ArXivExplore paclet functions, see the Guides and Symbols webpages on the Wolfram Paclet Repository.