Wiki2Gexf is a Python (3.13.5) project that lets you run a breadth-first search over a locally stored (and indexed) Wikipedia archive and export the resulting graph to the .gexf format.
Example renders made in Gephi:

| Chicory: A Colorful Tale | Wikipedia: Unusual Articles |
| --- | --- |
| ![]() | ![]() |
- To start, clone this repository. You will need to install the `NetworkX` package for GEXF functionality.
- Next, you will need to download two Wikipedia archive files from https://dumps.wikimedia.org/. MAKE SURE THEY HAVE THE SAME DATE.
  - pages-articles-multistream.xml.bz2 --> The (heavily compressed) text archive of Wikipedia; it should be well over 20 GB. Torrent it if possible! DO NOT EXTRACT!
  - pages-articles-multistream-index.txt.bz2 --> The index of byte offsets for all articles in the multistream archive (a sketch of how these offsets are used appears after this list).
- Extract the index file
- Place both the index and the multistream files in the same directory as this repository.
- Run `indexWiki.py`. This creates a folder (`index/`) that splits the index file up and sorts the articles into subfiles.
  - This may take a while, and due to open file limits it will NOT work on Windows. If you must use Windows, I suggest either running this step in WSL or moving the `index/` folder over to Windows from Linux.
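The byte offsets in the index are what allow individual articles to be read out of a 20+ GB compressed dump without ever extracting it. Below is a minimal sketch, not the repository's actual code, of how such a lookup can work: the function name `read_article` and the chunk size are illustrative assumptions, while the `offset:page_id:title` line format is the standard one used by Wikimedia multistream index files.

```python
import bz2

# Minimal sketch (not the repository's code): pull one article's raw XML out of
# the multistream dump using an index entry. Index lines follow the standard
# Wikimedia format "offset:page_id:title", where `offset` is the byte position
# of the bz2 stream that contains the page.
def read_article(dump_path: str, offset: int, title: str) -> str | None:
    with open(dump_path, "rb") as f:
        f.seek(offset)
        decomp = bz2.BZ2Decompressor()
        raw = b""
        # Feed compressed bytes until this bz2 stream ends; each stream in the
        # multistream dump holds a batch of <page> elements.
        while not decomp.eof:
            data = f.read(256 * 1024)
            if not data:
                break
            raw += decomp.decompress(data)
    xml = raw.decode("utf-8", errors="replace")
    # Crude text scan for the requested page; real code would use an XML parser.
    hit = xml.find(f"<title>{title}</title>")
    if hit == -1:
        return None
    start = xml.rfind("<page>", 0, hit)
    end = xml.find("</page>", hit) + len("</page>")
    return xml[start:end]
```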
To create a .gexf file for a search, run `Wiki2Gexf.py`.
- You can specify either an article name with `-n` or an article URL with `-u`.
- You can limit the depth of the search with `-d`. This defaults to 1, and any search with a depth greater than 1 will result in a very large .gexf file.
- You need to specify an output file.
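For context on why `NetworkX` is required, here is a minimal sketch of how a depth-limited breadth-first search maps onto a GEXF export. It is an illustration under assumptions, not the script's actual implementation: `bfs_to_gexf` and the `get_links` callback are hypothetical names, while `nx.write_gexf` is the real NetworkX call that writes the output file.

```python
from collections import deque

import networkx as nx

# Minimal sketch of a depth-limited BFS exported to GEXF. `get_links(title)` is
# a hypothetical stand-in for looking up an article's outgoing links in the
# local archive.
def bfs_to_gexf(start_title, get_links, out_path, max_depth=1):
    graph = nx.DiGraph()
    graph.add_node(start_title)
    seen = {start_title}
    queue = deque([(start_title, 0)])
    while queue:
        title, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for linked in get_links(title):
            graph.add_edge(title, linked)
            if linked not in seen:
                seen.add(linked)
                queue.append((linked, depth + 1))
    nx.write_gexf(graph, out_path)  # NetworkX handles the .gexf serialization
```

The node count grows multiplicatively with each extra level of depth, which is why depths above 1 produce very large .gexf files.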