This repo shows how to easily get started with a local Spark cluster (one master, one worker) and run PySpark jobs on it, provided you have Docker.
docker-compose up -d
docker-compose exec work-env sql.py
Sample output:
Creating network "mg-spark_default" with the default driver
Creating mg-spark_spark_1 ... done
Creating mg-spark_spark-worker-1_1 ... done
Creating mg-spark_work-env_1 ... done
21/03/05 15:56:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/03/05 15:56:25 WARN SparkContext: Please ensure that the number of slots available on your executors is limited by the number of cores to task cpus and not another custom resource. If cores is not the limiting resource then dynamic allocation will not work properly!
[Row(col0=0, col1=1, col2=2), Row(col0=3, col1=1, col2=5), Row(col0=6, col1=2, col2=8)]
+----+---------+---------+---------+
|col1|sum(col0)|sum(col1)|sum(col2)|
+----+---------+---------+---------+
| 1| 3| 2| 7|
| 2| 6| 2| 8|
+----+---------+---------+---------+
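For reference, a job that produces output like the above could look roughly like the following. This is a minimal sketch, not the actual `sql.py` from this repo, and the master URL `spark://spark:7077` is an assumption based on the `spark` service name and Spark's default master port.

```python
#!/usr/bin/env python
# Hypothetical sketch of a job similar to sql.py; the real script may differ.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://spark:7077")  # assumed master URL for this compose setup
    .appName("sql-example")
    .getOrCreate()
)

# Small in-memory DataFrame matching the sample output above.
df = spark.createDataFrame(
    [(0, 1, 2), (3, 1, 5), (6, 2, 8)],
    ["col0", "col1", "col2"],
)

print(df.collect())
df.groupBy("col1").sum().orderBy("col1").show()

spark.stop()
```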
There are also `run_sql.sh`, `run_file.sh`, and `run_s3.sh` scripts that work on macOS and Linux.
Feel free to edit the provided `.py` files or create new ones. However, make sure any new files are inside the same directory.
- python. It is provided in the `bitnami/spark:3-debian-10` image
- pyspark. It is already installed inside the `bitnami/spark:3-debian-10` image
`read_file.py` reads a file from the local filesystem. However, it is the worker that actually reads the file. Because of that, a `volumes` section is defined for each Docker service:

    volumes:
      - .:/app

so that the "local" file path resolves the same way on every node.
Please update `run_s3.py` with your S3 credentials (and endpoint, if you run your own S3 service) to run the S3 example. The credentials are provided inside the Python code, which is not optimal; please do not do that for files going into any code repository. To supply credentials and other settings such as the S3 endpoint properly, one option is to use environment variables or an env file (see the sketch below). Remember to add `.env` and related files to `.gitignore`.
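One way to do that is to read the values from environment variables and pass them to the S3A filesystem configuration, roughly like this. This is a sketch only: the variable names are examples rather than something this repo defines, and it assumes the `hadoop-aws` / AWS SDK jars are available to Spark.

```python
#!/usr/bin/env python
# Hypothetical sketch: S3 settings taken from environment variables instead of
# being hard-coded in the script. The variable names are examples.
import os

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://spark:7077")  # assumed master URL for this compose setup
    .appName("s3-example")
    .getOrCreate()
)

hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", os.environ["S3_ACCESS_KEY"])
hadoop_conf.set("fs.s3a.secret.key", os.environ["S3_SECRET_KEY"])
# Only needed when running your own S3-compatible service (e.g. MinIO).
if "S3_ENDPOINT" in os.environ:
    hadoop_conf.set("fs.s3a.endpoint", os.environ["S3_ENDPOINT"])
    hadoop_conf.set("fs.s3a.path.style.access", "true")

# "my-bucket/some-file.csv" is a placeholder, not a real object in this repo.
df = spark.read.csv("s3a://my-bucket/some-file.csv", header=True)
df.show()

spark.stop()
```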
- Bitnami for their easy-to-use image
- @dani8art for the excellent explanation of how to connect to the cluster from PySpark