- Spark is a platform for cluster computing: it lets you spread data and computations over clusters with multiple nodes (think of each node as a separate computer)
- Splitting up your data makes it easier to work with very large datasets because each node only works with a small amount of data
- As each node works on its own subset of the total data, it also carries out part of the total calculations, so both data processing and computation are performed in parallel across the nodes of the cluster
- Parallel computation can make certain types of programming tasks much faster
- However, with greater computing power comes greater complexity
- Deciding whether or not Spark is the best solution for your problem takes some experience, but you can consider questions like the following (a minimal PySpark sketch appears after this list):
- Is my data too big to work with on a single machine?
- Can my calculations be easily parallelized?
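If both answers point toward Spark, the usual first step is to connect to a cluster through a `SparkSession`. The sketch below is a minimal, illustrative example, not part of the original notes: the app name, the `local[*]` master, and the sample data are assumptions chosen just to show how a simple aggregation gets split into tasks that Spark runs in parallel.

```python
from pyspark.sql import SparkSession

# Connect to Spark; "local[*]" runs Spark locally using all available cores
# (an assumption for illustration -- on a real cluster you would point this
# at the cluster manager instead).
spark = (SparkSession.builder
         .master("local[*]")
         .appName("spark-or-not")   # hypothetical app name
         .getOrCreate())

# A tiny sample dataset, just to show the pattern; real Spark workloads
# involve data too large to fit on a single machine.
rows = [("cat", 1), ("dog", 2), ("cat", 3), ("dog", 4)]
df = spark.createDataFrame(rows, ["animal", "count"])

# The groupBy/sum is broken into tasks that run in parallel across the
# workers, and the partial results are then combined into the final answer.
df.groupBy("animal").sum("count").show()

spark.stop()
```

Running with `local[*]` is a convenient way to check that the logic parallelizes at all before paying the operational cost of a real cluster.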
About
This repository contains the materials (code & theory) I compiled while undertaking DataCamp's Big Data with PySpark Learning Track