TL;DR: This project is analyzing MTG card data. Follow the steps mentioned under How to make it work? to set it up.

What is this about?

If you have never seen the logo above, here a very quick intro: Magic: The Gathering (Magic or MTG) is the first collectible card game of its kind, created in 1993. It can be played by two or more players. The exact rules differ a bit by the format played but in general: players battle against each other with their card decks, trying to defeat the other players by casting spells, artifacts and creatures. Curious? Learn how to play...

With this project I aim to built an end-to-end orchestrated data pipeline. I am calling the Scryfall API to get a complete export of the latest card information available, and save the data to my Google Cloud Storage. Afterwards, I filter the data on the needed columns, enforce a data type and push it to BigQuery. In there, I use DBT to derive new columns and make sure that only the most up-to-date dataset is used for visualization in the dashboard.

What questions am I trying to answer?

As this project looks at MTG from a meta-perspective we are diving into the following:

How many cards are being released over time?
What is the color distribution?
Which set has the highest count of cards?
Who are the most common artists?
What are the most expensive cards?

What technologies are being used?

Cloud: Google Cloud
Infrastructure: Terraform
Orchestration: Prefect
Data lake: Google Cloud Storage
Data transformation: DBT
Data warehouse: BigQuery
Data visualization: Google Looker Studio

Dashboard example

Click here to see my Looker dashboard.

What is the structure of the production table?

Column	Description
primary_key	Unique surrogate key from card_id and released_at data points
card_id	Card ID in database, IDs can be repeated due to reprintings
name	The name of this card
released_at	The date this card was first released
color_identity	This card’s color identity
color_category	Based on the color_identity: Black, Blue, White, Green, Red, Colorless or Mixed
set_name	This card’s full set name
artist	The name of the illustrator of this card face
price	Price information of this card in US Dollar
data_update	Timestamp when the data was updated in the database

Here the dbt lineage graph
Partitioned on the released_at column - in favor of question 1 and 3 - assuming that in most cases, cards with the same release date are from the same set
Clustered on the color_category column - in favor of question 2 - assuming that within one set the number of colors is lower than the numbers of unique prices and artists

How to make it work?

Setup your Google Cloud environment

Create a Google Cloud Platform project
Configure Identity and Access Management (IAM) for the service account, giving it the following privileges: BigQuery Admin, Storage Admin and Storage Object Admin
Download the JSON credentials and save it, e.g. to ~/.gc/<credentials>
Install the Google Cloud SDK
Let the environment variable point to your GCP key, authenticate it and refresh the session token

export GOOGLE_APPLICATION_CREDENTIALS=<path_to_your_credentials>.json
gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
gcloud auth application-default login

Install all required dependencies into your environment

pip install -r requirements.txt

Setup your infrastructure

Assuming you are using Linux AMD64 run the following commands to install Terraform - if you are using a different OS please choose the correct version here and exchange the download link and zip file name

sudo apt-get install unzip
cd ~/bin
wget https://releases.hashicorp.com/terraform/1.4.1/terraform_1.4.1_linux_amd64.zip
unzip terraform_1.4.1_linux_amd64.zip
rm terraform_1.4.1_linux_amd64.zip

To initiate, plan and apply the infrastructure, adjust and run the following Terraform commands

cd terraform/
terraform init
terraform plan -var="project=<your-gcp-project-id>"
terraform apply -var="project=<your-gcp-project-id>"

Setup your orchestration

If you do not have a prefect workspace, sign-up for the prefect cloud and create a workspace here
Create the prefect blocks via the cloud UI or adjust the variables in /prefect/prefect_blocks.py and run

python magic-the-gathering/prefect/prefect_blocks.py

Adjust the keyfile location at dbt/profiles.yml to the path of your Google Cloud credentials JSON
To execute the flow, run the following commands in two different CL terminals

prefect agent start -q 'default'

python prefect/api_to_gcs_to_bq.py

Data deep dive

The data will be available in BigQuery at mtg_card_data_dbt.dbt_mtg_latest_data
Query the data in-place or build a dashboard

Potential next steps

With a growing database, I would be able to further explore the following:

What is the color distribution over time?
What is the price development of specific cards / colors / sets over time?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

What is this about?

What questions am I trying to answer?

What technologies are being used?

Dashboard example

What is the structure of the production table?

How to make it work?

Potential next steps

Files

README.md

Latest commit

History

README.md

File metadata and controls

What is this about?

What questions am I trying to answer?

What technologies are being used?

Dashboard example

What is the structure of the production table?

How to make it work?

Potential next steps