
ArXivExplore supports deep data analysis of the more than 2.6 million physics, math, cs, etc. articles on ArXiv, providing functionality for, e.g., title/abstract word statistics; dissection of TeX sources, formulae and citations; neural networks for classification, clustering or recommendation; and LLM-automated concept definitions and author reports.
For a general explanation, see the presentation I gave at the Wolfram Technology Conference 2024. You can also watch the recording on YouTube.
The first article ever on ArXiv:
In[]:= ArXivIDs[All] // First
Out[]= "physics/9403001"
In[]:= ArXivVersions["physics/9403001"]
Out[]= {{"version" -> "v1", "created" -> "Fri, 25 Apr 1986 15:39:49 GMT"}}
In[]:= ArXivTitles["physics/9403001"]
Out[]= "Desperately Seeking Superstrings"
In[]:= ArXivAuthors["physics/9403001"]
Out[]= {"Paul Ginsparg", "Sheldon Glashow"}
A plot showing the trends of the most popular title words in the theoretical physics category ("hep-th", primary or cross-list):
In[]:= Block[{words = {"black", "gauge", "gravity", "string"}},
ArXivPlot[words, {"hep-th", All}, PlotRange -> Full, PlotLegends -> words]]

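Assuming the same call pattern accepts the other category specifications used in this walkthrough, one could compare word trends in the aggregated computer science listing in the same way (a sketch, with an illustrative word choice):
In[]:= Block[{words = {"learning", "neural", "quantum"}},
ArXivPlot[words, {"cs", All}, PlotRange -> Full, PlotLegends -> words]]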
The 50 most common 2-neighbour title words on the whole ArXiv, ever:
In[]:= ArXivTopTitles[All, 50, 2] // Normal // Multicolumn[#, 3] &

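Assuming Normal@ArXivTopTitles[...] yields word -> count rules (as the tabulations below suggest), the same data feeds a standard WordCloud directly:
In[]:= WordCloud[Association@Normal@ArXivTopTitles[All, 100]]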
Authors with more than one possible name (and their categories) are conveniently registered as "ArXivAuthor" entities. For example:
In[]:= ArXivAuthorRegister["Vescovi", {"E. Vescovi", "Edoardo Vescovi"}]

We can then easily create an author citations graph, with the tooltips indicating the article IDs:
In[]:= ArXivAuthorGraph[Entity["ArXivAuthor", "Vescovi"], VertexLabels -> Placed[Automatic, Tooltip]]

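Assuming ArXivAuthorGraph returns an ordinary Graph object, the built-in graph functions apply to it directly, e.g. for a quick size check:
In[]:= With[{g = ArXivAuthorGraph[Entity["ArXivAuthor", "Vescovi"]]},
<|"Vertices" -> VertexCount[g], "Edges" -> EdgeCount[g]|>]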
The dimensions of the whole ArXiv main dataset (at the end of June 2025):
In[]:= ArXivDataset[All] // Dimensions
Out[]= {2775152, 14}
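Assuming ArXivDataset[All] is an ordinary Dataset (its name and the Dimensions call above suggest so), the usual Dataset queries apply, e.g. peeking at the first rows and at the 14 column names:
In[]:= ArXivDataset[All][1 ;; 3]
In[]:= Keys@Normal@First@ArXivDataset[All]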
Let us create a super-database with all computer science "cs" categories (primary or cross-list):
In[]:= ArXivDataset[{"cs", All}] = ArXivDatasetAggregate[{"cs", All}] // EchoFunction[Dimensions];
>> {696632, 14}
and then let us visualize the most and least frequent title words:
In[]:= Block[{cat = {"cs", All}, tabs, colrules, tabskey, compl, cut = 160, res = 10},
 colrules = {"learning" -> Style["learning", Purple, Bold], "using" -> Style["using", Purple, Bold],
   "theory" -> Style["theory", Red, Bold], "understanding" -> Style["understanding", Red, Bold]};
 tabs = MapAt[Apply[Sequence, #] &,
    MapIndexed[Partition[Riffle[Map[Style[#, Bold] &, Range[res*(First[#2] - 1) + 1, res*First[#2]]], #], 2] &,
     Partition[Normal@ArXivTopTitles[cat, cut], UpTo@res]], {All, All, 2}] /. colrules;
 tabskey = Cases[tabs, _List?(MemberQ[#[[All, 2]], Alternatives["theory", "understanding"] /. colrules] &)];
 compl = Text[Style["... " <> ToString[Round[First@tabskey[[1, 1, 1]] - 1, 10]] <>
     "+ words more popular than \"understanding\" or \"theory\" in CS!", Bold, 9, TextAlignment -> Center]];
 GraphicsRow[Join[{TextGrid@tabs[[1]], compl}, TextGrid /@ tabskey], ImageSize -> Large]]

Let us calculate the 10 most frequent categories, with their meanings and the number of articles in each:
In[]:= KeyValueMap[{#1, ArXivCategoriesLegend[#1], #2} &, ArXivTopCategories[10]] // Normal // TableForm
hep-ph            | High Energy Physics - Phenomenology      | 134315
quant-ph          | Quantum Physics                          | 113002
cs.CV             | Computer Vision and Pattern Recognition  | 107036
hep-th            | High Energy Physics - Theory             | 106818
cs.LG             | Machine Learning                         |  94387
astro-ph          | Astrophysics                             |  94246
gr-qc             | General Relativity and Quantum Cosmology |  64940
cond-mat.mes-hall | Mesoscale and Nanoscale Physics          |  64255
cond-mat.mtrl-sci | Materials Science                        |  62135
cs.CL             | Computation and Language                 |  56074
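Assuming ArXivTopCategories[10] returns a category -> count association (as the KeyValueMap above implies), the same counts can be charted directly with the built-in BarChart:
In[]:= BarChart[ArXivTopCategories[10], ChartLabels -> Automatic, BarOrigin -> Left]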
We can create train and test sets using only 5000 = {4500, 500} (train/test) titles and abstracts for each category:
In[]:= {train10, test10} = ArXivClassifyCategoriesTrainTest[10, 5000];
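A quick sanity check on the split sizes: with 10 categories and {4500, 500} examples each, we expect 45000 training and 5000 test items (assuming the two sets are plain lists or associations of labeled examples):
In[]:= Length /@ {train10, test10}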
We can then train a neural network (NN) to classify these categories, with layer dimension 80 and dropout level 0.5:
In[]:= net10 = ArXivClassifyCategoriesNet[10, 80, 0.5]

In[]:= netTrained10 = NetTrain[net10, train10, All, ValidationSet -> Scaled[0.07], MaxTrainingRounds -> 5]

Even with a basic 30-minute training on a laptop CPU, we obtain 89% accuracy:
In[]:= NetMeasurements[netTrained10["TrainedNet"], test10, "Accuracy"]~PercentForm~2

and a rather clean confusion matrix:
In[]:= NetMeasurements[netTrained10["TrainedNet"], test10, "ConfusionMatrixPlot"]

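Other standard NetMeasurements properties work on the same trained net and test set, e.g. the per-class F1 scores:
In[]:= NetMeasurements[netTrained10["TrainedNet"], test10, "F1Score"]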
We could even classify authors within the same category, with ArXivClassifyAuthorNet.
Extracting the TeX introduction of a random article:
In[]:= ArXivTeXIntroduction[Echo@RandomChoice@ArXivIDs[All]] // Short[#, 10] &
>> "2211.13033"

We can also extract the formulae from the TeX source of a random "hep-th" article:
In[]:= Table[i -> Take[Lookup[#, i], UpTo[50]], {i, Keys[#]}] &@ArXivTeXFormulae[Echo@RandomChoice[ArXivIDs["hep-th"]]] // TabView
>> "2305.12610"

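Assuming ArXivTeXFormulae returns an association of formula lists keyed by type (the Keys/Lookup usage above suggests so), counting the formulae per key is a one-liner; here we reuse the ID echoed above:
In[]:= Map[Length, ArXivTeXFormulae["2305.12610"]]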
Explain a technical concept using an article introduction and the LLM functionality:
In[]:= ArXivExplainConcept["Viterbi algorithm", "2401.02314", LLMEvaluator ->
<|"Prompts" -> "Keep the output contained and emphasize the relation to this paper"|>] // Text

Let us visualize all authors with more than 7 papers in the primary category "cs.NA":
In[]:= ArXivTopAuthors["cs.NA", 7] // Column

Let us pick a random author among them and use the LLM functionality to explain their overall work:
In[]:= ArXivExplainAuthor["Kevin Carlberg", "cs.NA", LLMEvaluator -> <|"Prompts" -> "Keep the output contained"|>] // Text

For the full documentation of all ArXivExplore paclet functions, see the Guides and Symbols webpages on the Wolfram Paclet Repository.