slomas04/WikitoGexf

Wikipedia 2 Gexf

There's a converter for everything these days!

Wiki2Gexf is a Python (3.13.5) project that runs a breadth-first search over a locally stored (and indexed) Wikipedia archive, then exports the result of that search to a .gexf file.
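As a rough illustration of the idea, here is a minimal sketch of a depth-limited breadth-first search whose edges are then written out with NetworkX. It is not the code in Wiki2Gexf.py: the names bfs_edges, export_gexf, get_links and TOY_LINKS are hypothetical, and the toy link table stands in for the real archive lookup.

```python
from collections import deque


def bfs_edges(start, get_links, max_depth=1):
    """Depth-limited breadth-first search; returns (source, target) edges.

    get_links is a hypothetical callback that returns the titles an
    article links to. In the real project this would read the archive.
    """
    seen = {start}
    queue = deque([(start, 0)])
    edges = []
    while queue:
        title, depth = queue.popleft()
        if depth >= max_depth:
            continue  # articles at the depth limit are not expanded
        for target in get_links(title):
            edges.append((title, target))
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return edges


def export_gexf(edges, path):
    """Write the edges to a .gexf file using NetworkX."""
    import networkx as nx  # imported lazily so the search runs without it

    graph = nx.DiGraph()
    graph.add_edges_from(edges)
    nx.write_gexf(graph, path)


# Toy link table standing in for the local Wikipedia archive (assumption).
TOY_LINKS = {"A": ["B", "C"], "B": ["C", "D"], "C": ["A"], "D": []}


def toy_get_links(title):
    return TOY_LINKS.get(title, [])
```

Calling bfs_edges("A", toy_get_links, max_depth=1) only follows the links found on the starting article. Each extra level of depth expands every article found so far, which is why the graph grows so quickly.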

Example renders made in Gephi:

[Images: Gephi renders for "Chicory: A Colorful Tale" and "Wikipedia: Unusual Articles"]

Setup/Install Guide

  • To start, clone this repository. You will need to install the NetworkX package, which provides the Gexf functionality.
  • Next, download two Wikipedia archive files from https://dumps.wikimedia.org/. MAKE SURE THEY HAVE THE SAME DATE!
    1. pages-articles-multistream.xml.bz2 --> The (heavily compressed) text archive of Wikipedia. It should be well over 20GB, so torrent it if possible. DO NOT EXTRACT!
    2. pages-articles-multistream-index.txt.bz2 --> The index of byte offsets for all articles in the multistream archive.
  • Extract the index file.
  • Place both the extracted index and the multistream archive in the same directory as this repository.
  • Run indexWiki.py. This creates a folder (index/) that splits the index file into subfiles and sorts the articles within each subfile.
    • This may take a while, and due to open file limits it will NOT work on Windows. If you must use Windows, either run the script in WSL or build the index folder on Linux and copy it over to Windows.
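To show why the multistream archive can stay compressed, here is a minimal sketch of how an index entry can be used to read a single bz2 stream at a byte offset. It assumes the index uses the usual Wikimedia layout of one offset:page_id:title entry per line. The functions parse_index_line and read_stream are hypothetical names, and this is not necessarily how indexWiki.py or Wiki2Gexf.py do it.

```python
import bz2


def parse_index_line(line):
    """Split one index line into (byte offset, page id, title).

    Assumes the offset:page_id:title layout. Titles may contain colons,
    so the line is split at most twice.
    """
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return int(offset), int(page_id), title


def read_stream(path, offset, chunk_size=64 * 1024):
    """Decompress the single bz2 stream that starts at the given offset."""
    decompressor = bz2.BZ2Decompressor()
    parts = []
    with open(path, "rb") as archive:
        archive.seek(offset)
        # Feed chunks until the decompressor reaches the end of this stream;
        # bytes belonging to later streams are left unused.
        while not decompressor.eof:
            chunk = archive.read(chunk_size)
            if not chunk:
                break
            parts.append(decompressor.decompress(chunk))
    return b"".join(parts).decode("utf-8")
```

Each stream in the archive holds a group of pages, so the returned text still has to be searched for the page with the wanted title.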

Usage Guide

To create a gexf file for a search, run Wiki2Gexf.py.

  • Specify the starting article either by name with -n or by URL with -u.
  • Limit the depth of the search with -d. The depth defaults to 1; any search with a depth greater than 1 will result in a very large gexf file.
  • You must also specify an output file.
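Since a search can start from either an article name or an article URL, it may help to see how the two relate. The sketch below turns a Wikipedia article URL into a title using the standard /wiki/ path convention. The function title_from_url is a hypothetical helper, and how Wiki2Gexf.py actually handles -u is not shown here.

```python
from urllib.parse import unquote, urlparse


def title_from_url(url):
    """Turn a Wikipedia article URL into an article title.

    Assumes the usual /wiki/<Title> path, with underscores for spaces
    and percent-encoding for other characters.
    """
    path = urlparse(url).path
    prefix = "/wiki/"
    if not path.startswith(prefix):
        raise ValueError(f"not a Wikipedia article URL: {url}")
    return unquote(path[len(prefix):]).replace("_", " ")
```

For example, title_from_url("https://en.wikipedia.org/wiki/Chicory:_A_Colorful_Tale") gives "Chicory: A Colorful Tale".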

About

Convert Wikipedia to a Gexf file with ease!
