Apache Beam example

This data pipeline reads immigration data and uses Apache Beam to transform it and join it with other datasets.
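The concrete datasets and join keys live in pipeline.py; as a rough illustration of the pattern, a minimal Beam join over two keyed CSV inputs might look like the sketch below (the file names and the country-code join key are assumptions for illustration, not taken from this repo):

```python
import apache_beam as beam

def key_by_column(line, key_index):
    """Split a CSV line and key the record by the column at key_index."""
    fields = line.split(",")
    return (fields[key_index], fields)

with beam.Pipeline() as p:
    # Hypothetical input files; the real pipeline defines its own inputs.
    immigration = (
        p
        | "ReadImmigration" >> beam.io.ReadFromText("data/immigration.csv", skip_header_lines=1)
        | "KeyImmigration" >> beam.Map(key_by_column, key_index=0)
    )
    countries = (
        p
        | "ReadCountries" >> beam.io.ReadFromText("data/countries.csv", skip_header_lines=1)
        | "KeyCountries" >> beam.Map(key_by_column, key_index=0)
    )
    # CoGroupByKey groups both inputs by the shared key, i.e. a relational join.
    (
        {"immigration": immigration, "countries": countries}
        | "JoinOnCountryCode" >> beam.CoGroupByKey()
        | "FormatRow" >> beam.Map(str)
        | "Write" >> beam.io.WriteToText("output/joined")
    )
```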

For a local run

1. From your working directory (on Windows), run in a terminal (a sketch of how these flags might be parsed follows below):

python3 pipeline.py --input_dir <PATH>\<TO>\<DATA>\ --output_dir <PATH>\<TO>\<OUTPUT>\
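How pipeline.py actually consumes these flags is not shown here; a common Beam pattern for accepting --input_dir and --output_dir while forwarding every other flag to the runner looks roughly like this (a sketch, not the repo's actual code; the file names are placeholders):

```python
import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_dir", required=True, help="directory holding the input data")
    parser.add_argument("--output_dir", required=True, help="directory for pipeline output")
    known_args, pipeline_args = parser.parse_known_args(argv)

    # Flags not consumed above (e.g. --runner, --project) are passed to Beam.
    options = PipelineOptions(pipeline_args)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText(known_args.input_dir + "immigration.csv")
            | "Write" >> beam.io.WriteToText(known_args.output_dir + "result")
        )

if __name__ == "__main__":
    run()
```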

For a Dataflow run

This assumes that gsutil is installed and configured on your local computer, and that file paths are written with Linux-style separators (/) rather than Windows-style separators (\).

1. Create a Google Cloud Storage bucket.

2. Upload the contents of the data/ directory to the bucket:

gsutil cp -r data gs://<YOUR GCP BUCKET>/

3. In a Google Cloud Shell terminal, install the required packages:

sudo apt-get install python3-pip

sudo pip3 install -U pip

sudo pip3 install apache-beam[gcp] oauth2client==3.0.0 pandas
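To sanity-check the install before continuing (an optional step, not part of the original instructions), you can print the installed Beam version:

python3 -c "import apache_beam as beam; print(beam.__version__)"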

4. In the Google Cloud Shell editor, upload the run file pipeline.py.

5. Run the job from the Cloud Shell terminal:

python3 pipeline.py --input_dir gs://<YOUR GCP BUCKET>/<PATH>/<TO>/<DATA>/ --output_dir gs://<YOUR GCP BUCKET>/<PATH>/<TO>/<OUTPUT>/ --project <YOUR GCP PROJECT ID> --job_name <SET JOB NAME> --temp_location gs://<YOUR GCP BUCKET>/staging/ --staging_location gs://<YOUR GCP BUCKET>/staging/ --region us-central1 --runner DataflowRunner
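For reference, the Dataflow-specific flags above can also be set programmatically inside the script; a minimal sketch, where the bucket, project ID, and job name are placeholders:

```python
from apache_beam.options.pipeline_options import (
    GoogleCloudOptions,
    PipelineOptions,
    StandardOptions,
)

options = PipelineOptions()
gcloud = options.view_as(GoogleCloudOptions)
gcloud.project = "my-project"            # <YOUR GCP PROJECT ID>
gcloud.job_name = "immigration-etl"      # <SET JOB NAME>
gcloud.temp_location = "gs://my-bucket/staging/"
gcloud.staging_location = "gs://my-bucket/staging/"
gcloud.region = "us-central1"
options.view_as(StandardOptions).runner = "DataflowRunner"
```

Dataflow stages the pipeline code from the staging location and runs the job on workers in the given region.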
