This repo contains code examples for *Cost Effective Data Pipelines*, available for purchase wherever books are sold.
Please send any comments, concerns, or problems to sev@thedatascout.com
- Install pyenv
- Install pyenv-virtualenv
- Install Python 3.8.5

```shell
pyenv install 3.8.5
```

- Create the virtualenv

```shell
pyenv virtualenv 3.8.5 oreilly-book
```

- Activate the virtual environment

```shell
pyenv activate oreilly-book
```

- Clone this repo

```shell
git clone git@github.com:gizm00/oreilly_dataeng_book.git
cd oreilly_dataeng_book
```

- Install wheel

```shell
pip install wheel
```

- Install dependencies

```shell
python -m pip install -r requirements.txt
```
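After installing the dependencies, a quick sanity check can confirm the environment is wired up (the expected version string assumes the `oreilly-book` venv from above is active):

```shell
# Confirm the active interpreter and that installed packages are consistent.
# Inside the oreilly-book venv this should report Python 3.8.5.
python --version
python -m pip check
```

If `pip check` reports broken requirements, re-run the `pip install -r requirements.txt` step.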
(based on these instructions)
Within the virtualenv created above, run the following:
- Download Apache Spark. This material was developed using Spark 3.2.1 with Hadoop 3.2.
- Move the tgz file to a place you will refer to it from, e.g. ~/Development/, and extract it:

```shell
tar -xvf ~/Development/spark-3.2.1-bin-hadoop3.2.tgz
```

- Add the following to your shell startup file, for example ~/.bash_profile:

```shell
export SPARK_HOME="/Users/sev/Development/spark-3.2.1-bin-hadoop3.2"
export PATH="$SPARK_HOME/bin:$PATH"
```

- Reload your profile and start pyspark:

```shell
source ~/.bash_profile
pyspark
```
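Before launching the pyspark shell itself, you can verify the install by asking Spark for its version (a sketch; it assumes the exports above are in effect in your current shell):

```shell
# spark-submit ships in $SPARK_HOME/bin; if PATH is set correctly, this prints
# the Spark, Scala, and Java versions (expect Spark 3.2.1 here).
if command -v spark-submit >/dev/null 2>&1; then
  spark-submit --version
else
  echo "spark-submit not on PATH; re-check SPARK_HOME" >&2
fi
```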
If you use VS Code on macOS, you can run pyspark notebooks with these instructions:
- When you start the notebook in VS Code, choose the `oreilly-book` venv as the Python interpreter.