This repository contains an Apache Beam batch processing pipeline that reads Parquet files from Google Cloud Storage (GCS), performs ETL transformations, and writes the processed data to a BigQuery table. The pipeline is designed to run on Google Cloud Dataflow. As an example, it uses a subset of the Yelp dataset. The ETL transformations and the full project can be found here.
The pipeline is built using Apache Beam and includes the following main components:
- Reading Parquet files from a GCS bucket.
- Parsing the Parquet files and extracting the data.
- Performing custom ETL transformations on the data.
- Writing the processed data to a BigQuery table.
- Monitoring for new files in the GCS bucket and processing them on arrival.
The pipeline uses custom functions located in the `pipeline_trial.custom_fns` module for ETL transformations and utility functions.
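At a high level, the pipeline follows Beam's read-transform-write pattern. Below is a minimal sketch of that shape, not the repository's actual code: the bucket path, table name, and `apply_etl` wrapper are placeholders, and the real transformations live in `pipeline_trial.custom_fns`.

```python
# Minimal sketch of the pipeline shape; all names here are placeholders,
# since the real transforms live in pipeline_trial.custom_fns.
import apache_beam as beam
from apache_beam.io.parquetio import ReadFromParquet
from apache_beam.options.pipeline_options import PipelineOptions


def apply_etl(record):
    # Stand-in for the custom ETL functions; the real ones reshape each record.
    return record


def run():
    options = PipelineOptions()  # runner, project, temp_location, etc. come from the job parameters
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadParquet" >> ReadFromParquet("gs://your-bucket/input/*.parquet")
            | "ApplyETL" >> beam.Map(apply_etl)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "your-project:your_dataset.your_table",
                # The repo drives table creation from schema_original.json;
                # a schema= argument would be supplied here.
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )


if __name__ == "__main__":
    run()
```

The following steps set the pipeline up and run it on Dataflow: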
- Install the Google Cloud CLI (`gcloud`). Create a Google Cloud project, a GCS bucket, and a BigQuery dataset, and grant the permissions needed to use Google Dataflow and Google Artifact Registry.
- Add a table named `registro` with the following schema: `archivo:STRING, fecha:DATETIME` (the first snippet after this list shows one way to create it).
- Modify the `etl.py` script and add the necessary transformations if you want to use your own data and ETL functions. If you use your own data, please modify `schema_original.json` to match your data; this schema instructs BigQuery to create the table where the data will be loaded (a transform sketch appears after this list).
- Modify the Makefile with your GCP project data.
- Run `make init` just once to create the necessary buckets and permissions.
- Run `make template` to create your custom template, which is saved to Artifact Registry.
- Go to Dataflow and add a job with a custom template, selecting the `.json` file saved in the new bucket. Remember to fill in all the parameters, and under optional parameters also set the runtime, temporary, and staging locations (usually `gs://yourbucket/temp` and `/staging`).
- If your script fails, check the log saved in `/staging`.
- Add `sample.snappy.parquet` to the bucket directory you specified and wait for the magic to happen.
- Open BigQuery and run a query on your table and the `registro` table to check your results (the last snippet below shows one way to do this from Python).
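For the `registro` table mentioned above, here is a minimal sketch using the `google-cloud-bigquery` client; the project and dataset names are placeholders.

```python
# Create the "registro" audit table; project and dataset names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
table = bigquery.Table(
    "your-project.your_dataset.registro",
    schema=[
        bigquery.SchemaField("archivo", "STRING"),
        bigquery.SchemaField("fecha", "DATETIME"),
    ],
)
client.create_table(table)  # raises google.api_core.exceptions.Conflict if it already exists
```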
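If you bring your own data, a transformation in `etl.py` can be a plain function or a `DoFn` applied per record. The sketch below is illustrative only, with made-up Yelp-style field names; whatever fields you emit must match `schema_original.json`.

```python
import apache_beam as beam


class CleanRecord(beam.DoFn):
    """Illustrative transform; output keys must match schema_original.json."""

    def process(self, record):
        # Each record arrives as a dict parsed from the Parquet file.
        yield {
            "business_id": record.get("business_id"),
            "name": (record.get("name") or "").strip().title(),
            "stars": float(record.get("stars") or 0.0),
        }
```

A transform like this would be applied in the pipeline with `beam.ParDo(CleanRecord())`.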
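To verify the load from Python rather than the BigQuery console (again with placeholder names):

```python
# Quick check of the audit table; project and dataset names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT archivo, fecha
    FROM `your-project.your_dataset.registro`
    ORDER BY fecha DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.archivo, row.fecha)
```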
Planned improvements:
- A more detailed tutorial.
- Proper script documentation.
Thanks to kevenpinto, who provides sample code for streaming on Apache Beam and a Medium tutorial.