
Big Data with PySpark

Table of Contents

Introduction

  • Spark is a platform for cluster computing: it lets you spread data and computations over clusters with multiple nodes (think of each node as a separate computer); a short PySpark sketch appears after this list
  • Splitting up your data makes it easier to work with very large datasets because each node only works with a small amount of data
  • As each node works on its own subset of the total data, it also carries out a part of the total calculations required, so that both data processing and computation are performed in parallel over the nodes in the cluster
  • Parallel computation can make certain types of programming tasks much faster
  • However, with greater computing power comes greater complexity
  • Deciding whether or not Spark is the best solution for your problem takes some experience, but you can consider questions like:
    • Is my data too big to work with on a single machine?
    • Can my calculations be easily parallelized?
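
As a quick illustration of the ideas above, here is a minimal sketch (not taken from the course materials) of starting a Spark session with PySpark and running a small aggregation; the app name, toy data, and use of a local master are assumptions for the example.

```python
from pyspark.sql import SparkSession

# "local[*]" runs Spark on this machine using all available cores; on a real
# cluster the master URL would point at the cluster manager instead.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("intro-example")
    .getOrCreate()
)

# A tiny DataFrame stands in for a large dataset that Spark would split
# into partitions across the nodes of a cluster.
df = spark.createDataFrame(
    [(1, 2.0), (2, 3.5), (3, 7.1)],
    schema=["id", "value"],
)

# The aggregation is executed in parallel over the DataFrame's partitions.
df.groupBy().avg("value").show()

spark.stop()
```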

About

This repository contains the materials (code & theory) I compiled while undertaking DataCamp's Big Data with PySpark Learning Track
