The user-generated content on Stack Overflow and the whole Stack Exchange network is licensed under the Creative Commons CC BY-SA 3.0 license. There are (more or less) regular data dumps provided in XML format that contain all the questions/answers/comments/vote counts/... of the Stack Exchange sites.
The sodata tool creates a real PostgreSQL or SQLite3 database from these XML files.
For more information about the data dumps and where to download them, see:
- Original Stack Overflow data dump announcement
- Data dump download page on archive.org
- Other Stack Exchange blog posts about the data dump
- Explanation of database schema of the dump
- Historic data dumps
If you just want to play around with the database a little without all the effort of downloading GBs of data and importing them into your own database, you might want to have a look at the online Stack Exchange Data Explorer instead.
This code can be compiled to generate three import tools:
- `sqliteimport`, which imports the XML files into a SQLite3 database
- `pgimport`, which imports the XML files into a PostgreSQL database
- `csvimport`, which converts the XML files into CSV files
Generally it is recommended to use a new, empty database to import into, but the tools don't really care if there are existing tables, as long as there are no name conflicts.
All the tools expect to be started in a directory that contains the extracted data dump XML files which should be imported.
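For example, the archives from the archive.org data dump can be extracted with 7-Zip before running one of the tools in that directory; a minimal sketch, assuming archive names as used by the current Stack Overflow dumps:

```
cd ~/so-dump
# each archive contains one of the XML files the import tools look for
7z x stackoverflow.com-Posts.7z   # -> Posts.xml
7z x stackoverflow.com-Users.7z   # -> Users.xml
```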
`sqliteimport` creates a SQLite3 database. By default it creates a new database file called `dump.db`, but this can be changed with the `-f` flag:
sqliteimport -f myimport.db
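The resulting file is an ordinary SQLite3 database and can be inspected with the sqlite3 shell, for example as a quick sanity check against the `posts` table described further below:

```
sqlite3 myimport.db "SELECT COUNT(*) FROM posts;"
```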
`pgimport` expects an existing and preferably empty PostgreSQL database and fills it with data. You should specify connect options with the `-c` flag, like this:
pgimport -c "host=localhost dbname=so user=soimporter password=abc"
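If you do not have such a database yet, one way to set it up on a local PostgreSQL installation (the role and database names simply mirror the connect string above) is:

```
sudo -u postgres createuser --pwprompt soimporter
sudo -u postgres createdb -O soimporter so
```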
By default, `pgimport` will use a temporary file and SQL `COPY` to import the data. This means that you'll need enough free disk space to store a temporary copy of the data. Also, the importing database user will need to have superuser privileges in the database to be allowed to use `COPY`.
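One way to grant these rights temporarily, assuming the role name from the connect string above, is:

```
sudo -u postgres psql -c 'ALTER ROLE soimporter SUPERUSER;'
```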
The directory used for the temporary data can be specified with the `-d` option:
pgimport -d tmpdir -c ...
There has to be enough free disk space in that directory to store a temporary copy of each table. Putting this directory on a different hard disk than the database or the input files will speed up the import process.
The import can also be done without temporary files and without requiring a database superuser. For this, use the `-s` flag:
pgimport -s -c ...
`csvimport` generates CSV (comma-separated values) files from the XML data. The `-d` parameter can be used to specify a target directory; by default the current working directory is used.
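For example, to write the CSV files into a (hypothetical) directory called csvout:

```
csvimport -d csvout
```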
The resulting CSV files use escaping as understood by PostgreSQL. This means that the characters `\` (backslash), `\n` (newline), `\r` (carriage return), and `,` (comma) will be escaped by a preceding backslash when they occur in a text field. NULL values will be represented by `\N` (a backslash followed by a capital N).
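Since this matches PostgreSQL's text `COPY` format with a comma delimiter, such a file can also be loaded by hand; a sketch, assuming the `posts` table and the connection details from the `pgimport` example above:

```
psql -h localhost -U soimporter so -c "\copy posts FROM 'posts.csv' WITH (FORMAT text, DELIMITER ',')"
```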
With the `-I` switch the generation of indexes can be disabled. This makes the import faster.
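For example, assuming the switch can be combined with the other options shown above:

```
pgimport -I -s -c "host=localhost dbname=so user=soimporter password=abc"
```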
The import tools create tables with the same names as the XML files, so they will be called `posts`, `comments`, `users` and so on. The column names in those tables correspond to the attribute names in the XML. For more detail see `soschema.hpp`, which defines the tables and columns that get imported.
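After an import you can sanity-check the schema with an ad-hoc query, for example assuming columns named after the XML attributes such as `Id`, `Score` and `Title` on `posts`:

```
psql -h localhost -U soimporter so -c 'SELECT Id, Score, Title FROM posts ORDER BY Score DESC LIMIT 10;'
```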
All programs use `libexpat` to parse the input files; both the library and its header files have to be available. On Debian- or Ubuntu-based systems you can install the required packages, for example:
sudo apt install libexpat1 libexpat1-dev
To compile `sqliteimport` you'll need to have `libsqlite3` installed, including its header files:
sudo apt install libsqlite3-0 libsqlite3-dev
To compile `pgimport` you'll need to have `libpqxx` installed, including its header files:
sudo apt install libpqxx-4.0 libpqxx-dev
Your distribution might also have a newer version of `libpqxx`, for example:
sudo apt install libpqxx-6.2 libpqxx-dev
(Tested with libpqxx-4.0 and libpqxx-6.2)
You can compile the tools with CMake:
mkdir build
cd build
cmake ..
make
This tries to build everything, but skips tools if the dependencies for sqlite/postgres cannot be found.
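If you only need one of the tools, it may be possible to build just that target, assuming the CMake targets are named after the binaries:

```
make pgimport
```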
Licensed under LGPL. This allows the creation of a library from this code that can be used in non-GPL programs.