- Spark is a platform for cluster computing: it lets you spread data and computations over clusters with multiple nodes (think of each node as a separate computer)
- Splitting up your data makes it easier to work with very large datasets because each node only works with a small amount of data
- As each node works on its own subset of the total data, it also carries out part of the total calculations, so both data processing and computation are performed in parallel across the nodes of the cluster
- Parallel computation can make certain types of programming tasks much faster
- However, with greater computing power comes greater complexity
- Deciding whether or not Spark is the best solution for your problem takes some experience, but you can consider questions like the following (a minimal PySpark sketch appears after this list):
- Is my data too big to work with on a single machine?
- Can my calculations be easily parallelized?
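If both answers point toward Spark, the usual first step is to connect to a cluster through a `SparkSession`. The sketch below is a minimal, illustrative example, not part of the original notes: the app name, the `local[*]` master, and the sample data are assumptions chosen just to show how a simple aggregation gets split into tasks that Spark runs in parallel.

```python
from pyspark.sql import SparkSession

# Connect to Spark; "local[*]" runs Spark locally using all available cores
# (an assumption for illustration -- on a real cluster you would point this
# at the cluster manager instead).
spark = (SparkSession.builder
         .master("local[*]")
         .appName("spark-or-not")   # hypothetical app name
         .getOrCreate())

# A tiny sample dataset, just to show the pattern; real Spark workloads
# involve data too large to fit on a single machine.
rows = [("cat", 1), ("dog", 2), ("cat", 3), ("dog", 4)]
df = spark.createDataFrame(rows, ["animal", "count"])

# The groupBy/sum is broken into tasks that run in parallel across the
# workers, and the partial results are then combined into the final answer.
df.groupBy("animal").sum("count").show()

spark.stop()
```

Running with `local[*]` is a convenient way to check that the logic parallelizes at all before paying the operational cost of a real cluster.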
About
This repository contains the materials (code & theory) I compiled while undertaking DataCamp's Big Data with PySpark Learning Track