Airflow DAGs to make it easier to transfer data from one place to another, all while having full oversight of the data pipeline. In this project, I take tables/queries from a Postgres instance and transfer them to a Google Cloud Storage bucket as a CSV or TSV. I also show how you can kick it up a notch and transfer the CSV to a BigQuery table.
I wanted to set up a process where I can automatically bring all my data from a Postgres instance to a Google Cloud Storage bucket and then transfer the CSV files to Google Cloud BigQuery for analytics/warehousing.
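At its core, each DAG pairs a Postgres-export task with a Google Cloud load task. The sketch below illustrates that pattern (it is not a copy of this repo's DAGs) using the Google provider operators from Airflow 2.x; on Airflow 1.10 the equivalent classes live in the contrib modules, and every connection ID, bucket, SQL statement, and dataset/table name shown here is a placeholder.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.postgres_to_gcs import PostgresToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

# All names below (connection ID, bucket, SQL, dataset/table) are placeholders.
with DAG(
    dag_id="postgres_to_gcs_to_bq_sketch",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Dump a Postgres table (or any SELECT) to CSV file(s) in a GCS bucket.
    postgres_to_gcs = PostgresToGCSOperator(
        task_id="postgres_to_gcs",
        postgres_conn_id="my_postgres",        # the connection you create in the steps below
        sql="SELECT * FROM public.my_table;",
        bucket="my-gcs-bucket",
        filename="exports/my_table_{}.csv",    # {} is replaced with a chunk index
        export_format="csv",                   # a tab field_delimiter would give TSV instead
        gzip=False,
    )

    # Load the exported CSV(s) from GCS into a BigQuery table.
    gcs_to_bq = GCSToBigQueryOperator(
        task_id="gcs_to_bigquery",
        bucket="my-gcs-bucket",
        source_objects=["exports/my_table_*.csv"],
        destination_project_dataset_table="my_dataset.my_table",
        source_format="CSV",
        skip_leading_rows=1,                   # the CSV export writes a header row
        autodetect=True,
        write_disposition="WRITE_TRUNCATE",
    )

    postgres_to_gcs >> gcs_to_bq
```

On Cloud Composer, the Google Cloud side of both operators can fall back to the preset google_cloud_default connection, so the Postgres connection is usually the only one you have to add yourself.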
Note: The data depicted in the GIFs below is from one of my personal servers.
- Install Airflow locally or create a Google Cloud Composer instance (~$10-$12/day). Pros and cons of each below.
  - Local Airflow Instance
    - Pros:
      - Free
      - The new Docker image makes it a bit easier to build.
      - Best if you're experienced and don't have money or don't want to spend it.
    - Cons:
      - Adding an Ubuntu or other Linux VM to your computer can slow it down or overheat it unless it's a souped-up PC.
      - Can be very confusing for a newbie.
      - Many requirements, so it takes a while to set up (can be north of a full day's time).
  - GCP Cloud Composer
    - Pros:
      - If you have never used Google Cloud before, you can get a FREE $300 credit for your first month, so this would be a free environment for the first ~15-20 days.
      - Relatively fast setup (~25 minutes to initialize)
      - Easiest setup by far!
      - All the GCP connections come preset, so you can run your DAGs with minimal connection setup time.
      - The best choice if you're a newbie with $10-$20 to burn on a weekend and are looking for experience.
    - Cons:
      - Cost: $10-$12 USD/day
      - You can't build it out gradually over time; you need to focus and dedicate yourself to coding, since time is money.
- Create your Postgres Connection
- Test your new connection with the Airflow "Data Profiling" tool (see the connection-check sketch after these steps).
- Edit the DAG(s) so they pull the right data from Postgres and push it to Google Cloud in the right format.
- In the Google Cloud Storage bucket created for this Airflow instance, upload these DAGs to the dags folder.
- Run the DAG(s). Read the logs if the operation fails. Make sure the data got to the storage bucket and is in the right format.
- Check out BigQuery to make sure the CSV data made it over (if you ran the postgre_table_to_gcs_to_bq DAG); a quick query sketch follows below.
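For the connection steps above, here is a minimal, hedged way to sanity-check the Postgres connection from code instead of the "Data Profiling" UI. The connection ID my_postgres is a placeholder for whatever you named yours, and the Airflow 1.10/2.x import paths differ as noted in the comment.

```python
from airflow.providers.postgres.hooks.postgres import PostgresHook
# On Airflow 1.10 the import is: from airflow.hooks.postgres_hook import PostgresHook

# "my_postgres" is a placeholder connection ID -- use the one you created in the Airflow UI.
hook = PostgresHook(postgres_conn_id="my_postgres")
print(hook.get_records("SELECT 1;"))  # prints [(1,)] if the connection works
```

Run it somewhere that can see your Airflow environment and its metadata database, for example inside a PythonOperator task or the scheduler's environment.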
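And for the final BigQuery check, a quick row count with the google-cloud-bigquery client (preinstalled on Cloud Composer) is an alternative to eyeballing the console; my_dataset.my_table is a placeholder for the destination table configured in the DAG.

```python
from google.cloud import bigquery

# "my_dataset.my_table" is a placeholder for the table the DAG loads into.
client = bigquery.Client()
query = "SELECT COUNT(*) AS row_count FROM `my_dataset.my_table`"
for row in client.query(query).result():
    print(f"Rows loaded: {row.row_count}")
```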