To structure and simplify Sparkify's access to user activity.
- facilitating queries concerning user activity (songs listened to)
- moving from a directory of JSON logs into a star-schema relational database on PostgreSQL
Currently, Sparkify doesn't have an easy way to query their data, which resides in a directory of JSON logs of user activity on the app, as well as a directory of JSON metadata on the songs in their app.
- Define a STAR-like data structure in PostgreSQL
- Define a set of SQL queries to (a minimal sketch follows this list):
  - drop tables
  - create the designed tables (schema)
  - populate the tables
- Define a set of scripts for batch processing of the gathered JSON logs
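As an illustration only, the drop/create/populate queries might look like the following sketch in the style of `sql_queries.py`. The column types, and the `ON CONFLICT` upsert for `users`, are assumptions made for illustration, not the project's actual definitions.

```python
# Hypothetical excerpt in the style of sql_queries.py; table and column
# names follow the schema overview below, but the types are assumptions.

songplay_table_drop = "DROP TABLE IF EXISTS songplays;"

songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

# UPSERT: if the user already exists, keep their subscription level current.
user_table_insert = """
INSERT INTO users (user_id, first_name, last_name, gender, level)
VALUES (%s, %s, %s, %s, %s)
ON CONFLICT (user_id) DO UPDATE SET level = EXCLUDED.level;
"""
```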
- `sql_queries.py` - defines the schema structure and the queries to DROP and UPSERT into tables
- `create_tables.py` - a helper script to clear the existing schema and create a blank (empty) structure ready to fill
- `etl.py` - functions to batch-process the files in a) `song_data` and b) `log_data` in the provided directory (see the sketch after this list)
- `etl.ipynb` - a notebook to document and test the functions inserting data into tables. The notebook works on a single (example) file and demonstrates/prototypes the functions and processing for the proper upload of data into PostgreSQL
- `test.ipynb` - a notebook to document that the data was properly formatted and uploaded into the designed schema. A set of simple SQL queries to display table structure.
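As an illustration of the batch-processing pattern described for `etl.py`, here is a minimal sketch; the helper names (`get_files`, `process_data`) and the per-file callback are assumptions, not the project's exact code.

```python
# Hypothetical sketch of batch-processing a directory tree of JSON files.
import glob
import os


def get_files(filepath):
    """Collect every *.json file found under the given directory tree."""
    all_files = []
    for root, _dirs, _files in os.walk(filepath):
        all_files.extend(glob.glob(os.path.join(root, "*.json")))
    return all_files


def process_data(cur, conn, filepath, func):
    """Apply a per-file processing function, committing after each file."""
    for datafile in get_files(filepath):
        func(cur, datafile)  # e.g. a function that parses the file and INSERTs rows
        conn.commit()
```

In this pattern, `create_tables.py` would be run first to reset the schema, and `process_data` would then be called once for `song_data` and once for `log_data`.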
Overview:
- `songplays` - records in log data associated with song plays, i.e. records with page `NextSong`: `songplay_id`, `start_time`, `user_id`, `level`, `song_id`, `artist_id`, `session_id`, `location`, `user_agent`
- `users` - users in the app: `user_id`, `first_name`, `last_name`, `gender`, `level`
- `songs` - songs in the music database: `song_id`, `title`, `artist_id`, `year`, `duration`
- `artists` - artists in the music database: `artist_id`, `name`, `location`, `latitude`, `longitude`
- `time` - timestamps of records in `songplays` broken down into specific units (see the sketch after this list): `start_time`, `hour`, `day`, `week`, `month`, `year`, `weekday`
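As an illustration of how the `time` dimension can be derived from the raw log timestamps, here is a minimal pandas sketch; the millisecond-epoch `ts` values are an assumption about the log format.

```python
# Hypothetical sketch: breaking raw timestamps into the time-table units.
import pandas as pd

# Assumed: logs carry a `ts` field in milliseconds since the epoch.
ts = pd.to_datetime(pd.Series([1541990000796, 1542069417796]), unit="ms")

time_df = pd.DataFrame({
    "start_time": ts,
    "hour": ts.dt.hour,
    "day": ts.dt.day,
    "week": ts.dt.isocalendar().week,
    "month": ts.dt.month,
    "year": ts.dt.year,
    "weekday": ts.dt.weekday,
})
print(time_df)
```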
Please see the `test.ipynb` notebook for basic example queries.

The central "fact" table is `songplays`, which can easily be queried to obtain the desired information.
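For example, here is a minimal sketch of an analytic query against the fact table; the connection parameters are placeholders.

```python
# Hypothetical example: count song plays per subscription level by
# joining the songplays fact table to the users dimension.
import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

cur.execute("""
    SELECT u.level, COUNT(*) AS plays
    FROM songplays sp
    JOIN users u ON sp.user_id = u.user_id
    GROUP BY u.level;
""")
for level, plays in cur.fetchall():
    print(level, plays)

conn.close()
```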