ArXivExplore

(Headline image)

Basic Description

ArXivExplore supports deep data analysis of all 2.7+ million physics, math, cs, etc. articles on ArXiv, providing functionality for, e.g., title/abstract word statistics; TeX source, formula and citation dissection; neural networks for classification, clustering or recommendation; and LLM-automated concept definitions and author reports.
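To get started, the paclet can be installed from the Wolfram Paclet Repository. Below is a minimal loading sketch; the resource name "DanieleGregori/ArXivExplore" and the context name are assumptions inferred from the repository name, so check the Paclet Repository page for the exact identifiers.

(* Install the paclet from the Wolfram Paclet Repository and load its context.
   Assumption: the resource and context follow the repository name. *)
PacletInstall["DanieleGregori/ArXivExplore"];
Needs["DanieleGregori`ArXivExplore`"]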

WTC Presentation

For a general explanation, see the presentation I gave at the Wolfram Technology Conference 2024. You can also watch the recording on YouTube.

Examples

Basic Examples

The first article ever on ArXiv:

In[]:= ArXivIDs[All] // First
Out[]= "physics/9403001"
In[]:= ArXivVersions["physics/9403001"]
Out[]= {{"version" -> "v1", "created" -> "Fri, 25 Apr 1986 15:39:49 GMT"}}
In[]:= ArXivTitles["physics/9403001"]
Out[]= "Desperately Seeking Superstrings"
In[]:= ArXivAuthors["physics/9403001"]
Out[]= {"Paul Ginsparg", "Sheldon Glashow"}

A plot showing the trends in the most popular title words in the theoretical physics category ("hep-th", primary or cross-list):

In[]:= Block[{words = {"black", "gauge", "gravity", "string"}}, 
   ArXivPlot[words, {"hep-th", All}, PlotRange -> Full, PlotLegends -> words]]
(output image)

The 50 most common 2-neighbour title words (adjacent word pairs) across the whole ArXiv, ever:

In[]:= ArXivTopTitles[All, 50, 2] // Normal // Multicolumn[#, 3] &
(output image)

Authors with more than one possible name (and their categories) are conveniently registered as "ArXivAuthor" entities. For example:

In[]:= ArXivAuthorRegister["Vescovi", {"E. Vescovi", "Edoardo Vescovi"}]
(output image)

We can then easily create an author citation graph, with tooltips indicating the article IDs:

In[]:= ArXivAuthorGraph[Entity["ArXivAuthor", "Vescovi"], VertexLabels -> Placed[Automatic, Tooltip]]
(output image)

Scope

The dimensions of the whole ArXiv main dataset (at the end of June 2025):

In[]:= ArXivDataset[All] // Dimensions
Out[]= {2775152, 14}

Let us create a super-database with all computer science ("cs"-type) categories, primary or cross-list:

In[]:= ArXivDataset[{"cs", All}] = ArXivDatasetAggregate[{"cs", All}] // EchoFunction[Dimensions];
>> {696632, 14}

and then let us visualize the most and least frequent title words:

In[]:= Block[{cat = {"cs", All}, tabs, colrules, tabskey, compl, cut = 160, res = 10}, 
   colrules = {"learning" -> Style["learning", Purple, Bold], "using" -> Style["using", Purple, Bold], "theory" -> Style["theory", Red, Bold], "understanding" -> Style["understanding", Red, Bold]}; 
   tabs = MapAt[Apply[Sequence, #] &, 
      MapIndexed[Partition[Riffle[Map[Style[#, Bold] &, Range[res*(First[#2] - 1) + 1, res*First[#2]]], #], 2] &, Partition[Normal@ArXivTopTitles[cat, cut], UpTo@res]], {All, All,2}] /. colrules; 
   tabskey = Cases[tabs, _List?(MemberQ[#[[All, 2]], Alternatives["theory", "understanding"] /. colrules] &)]; 
   compl = Text[Style["... " <> ToString[Round[First@tabskey[[1, 1, 1]] - 1, 10]] <> "+ words more popular than \"understanding\" or \"theory\" in CS!", Bold, 9, TextAlignment -> Center]]; 
   GraphicsRow[Join[{TextGrid@tabs[[1]], compl}, TextGrid /@ tabskey], ImageSize -> Large]]
(output image)

Let us calculate the 10 most frequent categories, with their meanings and the number of articles in each:

In[]:= KeyValueMap[{#1, ArXivCategoriesLegend[#1], #2} &, ArXivTopCategories[10]] // Normal // TableForm
hep-ph               High Energy Physics - Phenomenology         134315
quant-ph             Quantum Physics                             113002
cs.CV                Computer Vision and Pattern Recognition     107036
hep-th               High Energy Physics - Theory                106818
cs.LG                Machine Learning                             94387
astro-ph             Astrophysics                                 94246
gr-qc                General Relativity and Quantum Cosmology     64940
cond-mat.mes-hall    Mesoscale and Nanoscale Physics              64255
cond-mat.mtrl-sci    Materials Science                            62135
cs.CL                Computation and Language                     56074

We can create train and test sets using only 5000 = {4500, 500} titles and abstracts for each category (4500 for training, 500 for testing):

In[]:= {train10, test10} = ArXivClassifyCategoriesTrainTest[10, 5000];

We can then train a neural network to classify these categories, with layer dimension 80 and dropout level 0.5:

In[]:= net10 = ArXivClassifyCategoriesNet[10, 80, 0.5]
(output image)
In[]:= netTrained10 = NetTrain[net10, train10, All, ValidationSet -> Scaled[0.07], MaxTrainingRounds -> 5]
(output image)

Even with a basic 30-minute training on a laptop CPU, we obtain 89% accuracy:

In[]:= NetMeasurements[netTrained10["TrainedNet"], test10, "Accuracy"]~PercentForm~2
(output image)

and a rather clean confusion matrix:

In[]:= NetMeasurements[netTrained10["TrainedNet"], test10, "ConfusionMatrixPlot"]
(output image)

We could even classify authors within the same category, with ArXivClassifyAuthorNet.
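A hedged sketch of what that could look like, assuming ArXivClassifyAuthorNet takes arguments analogous to ArXivClassifyCategoriesNet above, and with trainAuthors/testAuthors standing in for author-labelled train/test data prepared in the same way; see the paclet documentation for the actual signature.

(* Hypothetical author-classification workflow, mirroring the category example above.
   Assumption: ArXivClassifyAuthorNet[nAuthors, layerDim, dropout] parallels
   ArXivClassifyCategoriesNet; trainAuthors and testAuthors are placeholder
   author-labelled training and test sets. *)
netAuthors = ArXivClassifyAuthorNet[10, 80, 0.5];
netTrainedAuthors = NetTrain[netAuthors, trainAuthors, All,
   ValidationSet -> Scaled[0.07], MaxTrainingRounds -> 5];
NetMeasurements[netTrainedAuthors["TrainedNet"], testAuthors, "Accuracy"]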

Extracting the TeX introduction of an article:

In[]:= ArXivTeXIntroduction[Echo@RandomChoice@ArXivIDs[All]] // Short[#, 10] &
>> "2211.13033"
(output image)

and also the TeX formulae:

In[]:= Table[i -> Take[Lookup[#, i], UpTo[50]], {i, Keys[#]}] &@ArXivTeXFormulae[Echo@RandomChoice[ArXivIDs["hep-th"]]] // TabView
>> "2305.12610"
(output image)

Explain a technical concept using an article introduction and LLM functionality:

In[]:= ArXivExplainConcept["Viterbi algorithm", "2401.02314", LLMEvaluator -> 
     <|"Prompts" -> "Keep the output contained and emphasize the relation to this paper"|>] // Text
(output image)

Let us visualize all authors with more than 7 papers in the primary category "cs.NA":

In[]:= ArXivTopAuthors["cs.NA", 7] // Column
(output image)

Let us pick a random author among them and use LLM functionality to explain their overall work:

In[]:= ArXivExplainAuthor["Kevin Carlberg", "cs.NA", LLMEvaluator -> <|"Prompts" -> "Keep the output contained"|>] // Text
(output image)

Full Documentation

For the full documentation of all ArXivExplore paclet functions, see the Guides and Symbols webpages on the Wolfram Paclet Repository.
