- Enable the required GCP APIs (Dataflow API, Cloud Storage JSON API, Cloud Logging API, BigQuery API, Cloud Storage API, Datastore API) from the GCP console UI.
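If you prefer the CLI to the console, the same APIs can typically be enabled with `gcloud`; the service names below are a best guess at the APIs listed above, so verify them against your project:

```bash
# Enable the GCP services used by the Beam/Dataflow examples
# (assumes gcloud is installed, authenticated, and a default project is set).
gcloud services enable \
  dataflow.googleapis.com \
  storage.googleapis.com \
  storage-api.googleapis.com \
  logging.googleapis.com \
  bigquery.googleapis.com \
  datastore.googleapis.com
```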
- Establish an environment for Beam:
  - Create a conda environment: `conda create -n beam-sandbox`
  - Activate it: `conda activate beam-sandbox`
  - Install Beam with the GCP & test extras: `pip install "apache-beam[gcp,test]"`
- Test the environment with `python -m apache_beam.examples.wordcount --output beam/text`, then `cat beam/t*` to see the words and counts.
- Create a bucket for Dataflow on GCP Storage right after creating a GCP project!
- Edit the `./run-count-dataflow.sh` file and replace `${PROJECT_ID}` with your own project ID (see the sketch below).
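For orientation only, a Dataflow wrapper script like `run-count-dataflow.sh` / `line-count-dataflow.sh` usually boils down to invoking the pipeline with DataflowRunner options. The sketch below is an assumption about its shape (the script in this repo is authoritative) and uses the bucket layout described in the next bullets:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a *-dataflow.sh wrapper; adjust names to the real script.
PROJECT_ID=your-project-id          # replace with your ${PROJECT_ID}
REGION=us-central1                  # replace with your Dataflow region
BUCKET=gs://beam-pipelines-123

python line-count.py \
  --runner DataflowRunner \
  --project "${PROJECT_ID}" \
  --region "${REGION}" \
  --staging_location "${BUCKET}/line-count/staging" \
  --temp_location "${BUCKET}/line-count/temp"
```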
- Create a bucket named `beam-pipelines-123`. Under it, create a folder for every Beam pipeline, such as `line-count`, and then staging and temp folders inside it, such as `line-count/staging` and `line-count/temp`.
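The same bucket can also be created from the command line with `gsutil` (the region below is just an example). Note that GCS "folders" are only object-name prefixes, so the staging and temp paths come into existence as soon as something is written under them:

```bash
# Create the pipelines bucket (choose your own name and region).
gsutil mb -l us-central1 gs://beam-pipelines-123/

# No explicit folder creation is needed: Dataflow will create objects under
# gs://beam-pipelines-123/line-count/staging and .../line-count/temp on its own.
```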
- Create a dataset bucket `gs://spark-dataset-1` on GCP Storage and upload the `dataset` folder into it. Public bucket-level access is easiest. Then export your credentials: `export GOOGLE_APPLICATION_CREDENTIALS=PATH_OF_SERVICE_ACCOUNT.json`
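A rough CLI equivalent of this step, assuming the `dataset` folder sits in the repo root and that making it world-readable is acceptable for your data:

```bash
# Create the dataset bucket and upload the local dataset folder into it.
gsutil mb gs://spark-dataset-1/
gsutil -m cp -r dataset gs://spark-dataset-1/

# Optional: make objects publicly readable (only for non-sensitive data).
gsutil iam ch allUsers:objectViewer gs://spark-dataset-1

# Point the GCP client libraries at your service account key.
export GOOGLE_APPLICATION_CREDENTIALS=PATH_OF_SERVICE_ACCOUNT.json
```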
- To run: execute `python line-count.py` on your local machine (uses the DirectRunner), or run `./line-count-dataflow.sh` on your local machine or a GCP Cloud Shell/instance (uses the DataflowRunner).
- Look at the Dataflow UI in the GCP console to see the Dataflow jobs running.
- Check the logs (see the CLI examples below).
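Besides the console UI, job states and logs can also be checked from the CLI; the region and filter below are examples to adapt:

```bash
# List Dataflow jobs and their current states.
gcloud dataflow jobs list --region=us-central1

# Pull recent Dataflow step logs from Cloud Logging.
gcloud logging read 'resource.type="dataflow_step"' --limit=50
```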