
Social media data import tools

This repository contains tools to import social media data into SQLite (sqlite3) databases. This allows much more efficient access than parsing CSV files directly.

Hints on using the data

Why databases?

You will be given some database files for your use.

The basic use is executing queries and getting tabular data out. Think of it like reading a CSV file, except you can pre-filter in the database engine before the data reaches you. Since the database engine is very fast and designed for exactly this, it saves you time and lets you focus on implementing only the interesting analyses yourself.

Advantages include:

  • faster, since your code doesn't have to parse strings into numbers, etc.
  • knows more about types than raw CSV (ints come out as ints, etc.)
  • if you want to do some basic filtering without reading everything in, you can: since the work is pushed down to C and the database layer, it's much faster than reading everything and then filtering. You also don't need as many kinds of intermediate files.
  • (if you want to get fancier, you can gradually add more elaborate queries, but that isn't required)

Disadvantages include:

  • Having to learn something new to use it (but it's a really valuable skill anyway)
  • Having to do a lot of work to insert the data into the database (but much of that work is basic data pre-processing, so doing it once may be a worthwhile time investment anyway)
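As a concrete sketch of the filtering point above, using a toy in-memory database (the `submissions` columns here match the ones used in the examples below):

```python
import sqlite3

# Build a tiny in-memory database with the same columns the
# usage examples below rely on.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE submissions "
    "(subreddit TEXT, title TEXT, author TEXT, created_utc INTEGER)"
)
conn.executemany(
    "INSERT INTO submissions VALUES (?, ?, ?, ?)",
    [
        ("python", "Intro to sqlite3", "alice", 1600000000),
        ("python", "Pandas tips", "bob", 1600000100),
        ("askscience", "Why is the sky blue?", "carol", 1600000200),
    ],
)

# The WHERE clause runs inside the database engine: only matching
# rows ever reach Python, instead of reading everything and filtering.
rows = conn.execute(
    "SELECT author, title FROM submissions WHERE subreddit = ?",
    ("python",),
).fetchall()
print(rows)  # [('alice', 'Intro to sqlite3'), ('bob', 'Pandas tips')]
```

The `?` placeholder with a parameter tuple is the standard sqlite3 way to pass values into a query safely.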

Usage

Interactively open from the command line:

$ module load sqlite
$ sqlite3 /PATH/TO/THE/DATABASE.sqlite3
sqlite> .tables
sqlite> .schema submissions

You can directly iterate over rows that come out:

import sqlite3

conn = sqlite3.connect("file:/PATH/TO/THE/DATABASE.sqlite3?immutable=1&mode=ro", uri=True)
for sub, title, author, time in conn.execute('SELECT subreddit, title, author, created_utc FROM submissions LIMIT 10'):
    print(f"{author} posted {title} to {sub}")

Reading the data directly into a pandas dataframe:

import sqlite3
import pandas as pd
# could give plain filename, but this lets us open read-only.
db = sqlite3.connect('file:/scratch/cs/socialmediadata/db.sqlite3?mode=ro&immutable=1', uri=True)
df = pd.read_sql("QUERY", db)
# do stuff with the dataframe
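For instance, with a concrete query (toy in-memory data here so it runs anywhere; on the real database you would pass the read-only connection from above):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE submissions (subreddit TEXT, author TEXT, created_utc INTEGER)"
)
conn.executemany(
    "INSERT INTO submissions VALUES (?, ?, ?)",
    [("python", "alice", 1600000000), ("askscience", "bob", 1600000100)],
)

# read_sql runs the query and builds a typed dataframe in one step;
# passing params keeps the filtering inside the database engine.
df = pd.read_sql(
    "SELECT author, created_utc FROM submissions WHERE subreddit = ?",
    conn,
    params=("python",),
)
print(df)
```

The columns of the dataframe come straight from the SELECT clause, with integer columns already typed as integers.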

Loading into a networkx graph:

import sqlite3
import networkx

conn = sqlite3.connect('/scratch/cs/socialmediadata/db.sqlite3')
G = networkx.DiGraph()
G.add_edges_from(conn.execute('SELECT parent_id, id FROM comments'))

Common queries

From here on, it's all about writing the queries you need. SQL is fairly standardized, so whatever SQL you already know can be used here. A lot will probably be learned by asking your colleagues what they use, since a list here could never be kept up to date.

TODO: add sample queries (ask your colleagues for now)
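In the meantime, a couple of illustrative queries against a toy database (only the `submissions` columns already shown above are assumed; adapt them to the real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE submissions "
    "(subreddit TEXT, title TEXT, author TEXT, created_utc INTEGER)"
)
conn.executemany(
    "INSERT INTO submissions VALUES (?, ?, ?, ?)",
    [
        ("python", "Intro to sqlite3", "alice", 1600000000),
        ("python", "Pandas tips", "bob", 1600000100),
        ("askscience", "Why is the sky blue?", "alice", 1600000200),
    ],
)

# Submissions per subreddit, busiest first.
per_sub = conn.execute(
    "SELECT subreddit, COUNT(*) AS n FROM submissions "
    "GROUP BY subreddit ORDER BY n DESC"
).fetchall()
print(per_sub)  # [('python', 2), ('askscience', 1)]

# Everything a given author posted.
by_author = conn.execute(
    "SELECT title FROM submissions WHERE author = ?", ("alice",)
).fetchall()
print(by_author)  # [('Intro to sqlite3',), ('Why is the sky blue?',)]
```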
