A data pipeline that extracts data from the Reddit API for a specific subreddit and feeds a Google Looker Studio (formerly Data Studio) report. Motivated by interest in the subject and a desire to sharpen skills, the project deliberately adds tools like dbt, Airflow, Docker, and cloud-based storage for extra complexity.
- Data Extraction from Reddit API
- Managing AWS Resources with Terraform
- Data Upload to S3
- Data Loading to Redshift Serverless
- Data Transformation with dbt
- Apache Airflow Orchestration in Docker
- Data Visualization with Google Looker Studio
Link

P.S. To avoid incurring further costs, the data for this chart was taken from a CSV file. For how to export a table from Redshift into a CSV file, see this file (one possible approach is also sketched below).
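In case that file moves, here is a minimal sketch of one way to do the export (not necessarily the approach the file describes): issue Redshift's UNLOAD command through the boto3 Redshift Data API. The table, bucket, IAM role, workgroup, and database names below are placeholders.

```python
import boto3

# Placeholders: substitute your own table, bucket, IAM role, workgroup, and database.
client = boto3.client("redshift-data")

unload_sql = """
UNLOAD ('SELECT * FROM reddit_posts')
TO 's3://your-bucket/exports/reddit_posts_'
IAM_ROLE 'arn:aws:iam::123456789012:role/your-redshift-role'
FORMAT AS CSV
HEADER
PARALLEL OFF;
"""

# Runs asynchronously; Redshift writes the CSV file(s) to the S3 prefix above.
resp = client.execute_statement(
    WorkgroupName="your-workgroup",  # Redshift Serverless workgroup
    Database="dev",
    Sql=unload_sql,
)
print("Statement id:", resp["Id"])
```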
- Clone the repository:
git clone https://github.com/Baitur5/reddit_api_elt.git
cd reddit_api_elt
- Get Reddit API keys from here
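For reference, this is roughly how those keys are used: Reddit's application-only OAuth flow exchanges the client id and secret for a bearer token. This is a minimal sketch with placeholder values (the subreddit here is just an example); the repository's own extraction code may differ.

```python
import requests

# Placeholder values: use the keys from your Reddit app.
CLIENT_ID = "your-client-id"
SECRET_KEY = "your-secret-key"
HEADERS = {"User-Agent": "reddit_api_elt/0.1 by your_username"}

# Exchange the keys for a bearer token (application-only OAuth).
resp = requests.post(
    "https://www.reddit.com/api/v1/access_token",
    auth=requests.auth.HTTPBasicAuth(CLIENT_ID, SECRET_KEY),
    data={"grant_type": "client_credentials"},
    headers=HEADERS,
)
resp.raise_for_status()
token = resp.json()["access_token"]

# The token authorizes requests against oauth.reddit.com.
posts = requests.get(
    "https://oauth.reddit.com/r/dataengineering/hot",
    headers={**HEADERS, "Authorization": f"bearer {token}"},
    params={"limit": 5},
).json()
for child in posts["data"]["children"]:
    print(child["data"]["title"])
```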
- You must have the AWS CLI configured
- Create a variables.tf file in the root directory and set up the following variables:
variable "s3_bucket"{
type = string
default = ""
}
variable "namespace_name"{
type = string
default = ""
}
variable "workgroup_name"{
type = string
default = ""
}
# password for db admin
variable "password"{
type = string
default = ""
}
- Initialize Terraform and create the resources:
terraform init
terraform apply
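Optionally, you can sanity-check the created resources from Python with boto3 (a convenience sketch, not part of the repository); the bucket and workgroup names are placeholders for the values in your variables.tf.

```python
import boto3

# Placeholders: use the values from your variables.tf.
BUCKET = "your-s3-bucket"
WORKGROUP = "your-workgroup"

# head_bucket raises a ClientError if the bucket does not exist or is not yours.
boto3.client("s3").head_bucket(Bucket=BUCKET)
print(f"S3 bucket {BUCKET} exists")

# The workgroup should report AVAILABLE once Redshift Serverless has provisioned it.
wg = boto3.client("redshift-serverless").get_workgroup(workgroupName=WORKGROUP)
print(f"Workgroup {WORKGROUP} status: {wg['workgroup']['status']}")
```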
- Create a .env file in the root directory and set up the following variables (note: no spaces around `=`):

AIRFLOW_UID=
# client id from the Reddit API
CLIENT_ID=''
# secret key from the Reddit API
SECRET_KEY=''
IAM_ROLE=''
S3_BUCKET=''
SUBREDDIT=''
WORKGROUP=''
DBT_TABLE=''
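The small script below (not part of the repository) checks that every variable listed above is present and non-empty before you start the containers.

```python
from pathlib import Path

# The variable names the pipeline expects, per the .env template above.
REQUIRED = ["AIRFLOW_UID", "CLIENT_ID", "SECRET_KEY", "IAM_ROLE",
            "S3_BUCKET", "SUBREDDIT", "WORKGROUP", "DBT_TABLE"]

env = {}
for line in Path(".env").read_text().splitlines():
    line = line.strip()
    if line and not line.startswith("#") and "=" in line:
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip("'\"")

missing = [k for k in REQUIRED if not env.get(k)]
if missing:
    raise SystemExit(f"Missing or empty variables: {', '.join(missing)}")
print("All required variables are set.")
```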
- Set up Airflow in Docker:
docker compose up airflow-init
docker compose up
- Open http://localhost:8080/ and trigger the DAG
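If you prefer the command line over the UI, the DAG can also be triggered through Airflow's stable REST API. The DAG id and the airflow/airflow credentials below are assumptions based on the default docker-compose setup; substitute your own.

```python
import requests

DAG_ID = "reddit_dag"  # assumption: replace with the actual DAG id shown in the UI

resp = requests.post(
    f"http://localhost:8080/api/v1/dags/{DAG_ID}/dagRuns",
    auth=("airflow", "airflow"),  # default docker-compose credentials
    json={"conf": {}},
)
resp.raise_for_status()
print("Triggered run:", resp.json()["dag_run_id"])
```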
- Transform data in Redshift with dbt
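If you want to run the transformations by hand rather than through the DAG, dbt can be invoked directly. This sketch assumes dbt-core and dbt-redshift are installed and a profiles.yml pointing at your Redshift Serverless workgroup is configured.

```python
import subprocess

# Runs all models in the dbt project; equivalent to typing `dbt run` in the
# project directory.
result = subprocess.run(["dbt", "run"], capture_output=True, text=True)
print(result.stdout)
result.check_returncode()  # raises if any model failed
```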
- Set up Google Looker Studio and connect it to Redshift Serverless. For tutorials, check this out
Contribute or report issues; pull requests are welcome! Please handle sensitive information such as API keys and credentials securely.
Thanks to this repo for providing inspiration and clear docs to learn from.