spark-cluster
Here are 24 public repositories matching this topic...
Self-documentation of learning distributed data storage, parallel processing, and the Linux OS using Apache Hadoop, Apache Spark, and Raspbian OS. In this project, a 3-node cluster is set up on Raspberry Pi 4 boards, HDFS is installed, and Spark processing jobs are run via YARN (a PySpark sketch follows below).
Updated Jul 13, 2024 - Shell
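A minimal sketch of the kind of Spark job such a YARN-backed cluster would run; the app name and HDFS path are placeholders, and it assumes HADOOP_CONF_DIR/YARN_CONF_DIR already point at the cluster's configuration.

```python
from pyspark.sql import SparkSession

# Assumes HADOOP_CONF_DIR / YARN_CONF_DIR point at the cluster's config,
# so Spark can find the YARN ResourceManager and the HDFS NameNode.
spark = (
    SparkSession.builder
    .appName("pi-cluster-wordcount")   # hypothetical app name
    .master("yarn")
    .getOrCreate()
)

# Read a text file from HDFS (placeholder path) and count words.
lines = spark.read.text("hdfs:///data/sample.txt")
words = lines.selectExpr("explode(split(value, ' ')) AS word")
counts = words.groupBy("word").count().orderBy("count", ascending=False)
counts.show(10)

spark.stop()
```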
A distributed application that identifies the top 50 taxi pickup locations in New York by analyzing over 1 billion records using Apache Spark, the Hadoop Distributed File System (HDFS), and Scala (sketch below).
Updated May 6, 2020 - Scala
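The repository itself is written in Scala; this is an equivalent PySpark sketch of the aggregation, with the HDFS path and column name as assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("top-pickup-locations").getOrCreate()

# Placeholder path and schema for the trip records stored on HDFS.
trips = spark.read.csv("hdfs:///nyc/taxi_trips.csv", header=True, inferSchema=True)

top50 = (
    trips.groupBy("pickup_location_id")        # hypothetical column name
         .agg(F.count("*").alias("trips"))
         .orderBy(F.desc("trips"))
         .limit(50)
)
top50.show(50, truncate=False)

spark.stop()
```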
I'll walk you through launching a cluster manually in Spark standalone deploy mode, connecting an app to the cluster, launching the app, and where to view monitoring and logging (a connection sketch follows).
Updated Jul 28, 2020
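A minimal sketch of connecting an application to a manually launched standalone master; the hostname "spark-master" and the executor memory setting are placeholders.

```python
from pyspark.sql import SparkSession

# Connect to a standalone cluster; 7077 is the default master port.
# Monitoring is available on the master's Web UI (port 8080 by default)
# and the driver's application UI (port 4040 by default).
spark = (
    SparkSession.builder
    .appName("standalone-demo")
    .master("spark://spark-master:7077")   # placeholder master host
    .config("spark.executor.memory", "1g")
    .getOrCreate()
)

# Run a trivial job so the app shows up in the UIs.
print(spark.range(1_000_000).selectExpr("sum(id)").first())

spark.stop()
```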
👷🌇 Set up and build a big data processing pipeline with Apache Spark and 📦 AWS services (S3, EMR, EC2, IAM, VPC, Redshift), using Terraform to provision the infrastructure and Airflow to automate the workflows 🥊 (sketch of the Spark step below).
Updated Jul 12, 2024 - Python
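A sketch of what the Spark step in such a pipeline might look like: read raw events from S3, clean them lightly, and write partitioned Parquet back to S3. Bucket names and columns are assumptions; on EMR the s3:// scheme is handled natively, elsewhere the hadoop-aws package and s3a:// may be needed.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-etl-step").getOrCreate()

# Hypothetical raw-events bucket and schema.
events = spark.read.json("s3://my-raw-bucket/events/")
clean = events.dropDuplicates(["event_id"]).filter("event_id IS NOT NULL")

# Write curated data back to S3, partitioned by a hypothetical date column.
(clean.write
      .mode("overwrite")
      .partitionBy("event_date")
      .parquet("s3://my-curated-bucket/events_parquet/"))

spark.stop()
```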
A spark-submit extension of bde2020/spark-submit for Scala projects built with SBT.
Updated Apr 13, 2020 - Scala
KMeans, CURE, and Canopy clustering algorithms demonstrated using PySpark (sketch below).
Updated May 19, 2021 - Jupyter Notebook
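A minimal KMeans sketch with PySpark's ML library on toy data; CURE and Canopy are not part of pyspark.ml and would need custom code, so only KMeans is shown.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("kmeans-demo").getOrCreate()

# Tiny toy dataset with two obvious clusters.
data = spark.createDataFrame(
    [(0.0, 0.1), (0.2, 0.0), (9.0, 9.1), (9.2, 8.9)],
    ["x", "y"],
)
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(data)

model = KMeans(k=2, seed=42, featuresCol="features").fit(features)
model.transform(features).select("x", "y", "prediction").show()

spark.stop()
```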
Start clusters in VirtualBox VMs.
Updated Mar 10, 2020
To facilitate the initial setup of Apache Spark, this repository provides a beginner-friendly, step-by-step guide on setting up a master node and two worker nodes.
Updated Jun 10, 2024 - Python
In this project, we used both Hadoop MapReduce and Spark for distributed computing. The first task performed a series of operations with Mapper and Reducer Java classes running on a Hadoop server; the second task performed similar operations on Spark instead (a word-count sketch follows).
Updated Oct 31, 2022 - Java
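The exact operations are not listed here, so a classic word count stands in: the RDD flatMap/map and reduceByKey calls mirror the Mapper/Reducer split of the Hadoop version. The input path is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mapreduce-vs-spark").getOrCreate()
sc = spark.sparkContext

counts = (
    sc.textFile("hdfs:///input/books/")        # placeholder input path
      .flatMap(lambda line: line.split())      # "map" phase
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)         # "reduce" phase
)
print(counts.takeOrdered(10, key=lambda kv: -kv[1]))

spark.stop()
```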
My contribution to the Diastema project.
Updated Sep 2, 2022 - Python
In this study, we propose a distributed storage and computation system for tracking money transfers in real time. In particular, we keep the transaction history in a distributed file system as a graph data structure and try to detect illegal activity using Graph Neural Networks (GNNs) in a distributed manner (an ingestion sketch follows).
Updated Jan 30, 2024 - Python
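A sketch of the storage side only: transactions become directed edges of a graph kept on HDFS as Parquet. Paths and column names are assumptions, and the GNN training itself is outside plain PySpark and not shown.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tx-graph-ingest").getOrCreate()

# Hypothetical location of incoming transaction records.
tx = spark.read.json("hdfs:///transactions/stream/")

# Each transaction is a directed edge sender -> receiver.
edges = tx.select(
    F.col("sender_id").alias("src"),
    F.col("receiver_id").alias("dst"),
    "amount",
    "timestamp",
)
nodes = edges.selectExpr("src AS id").union(edges.selectExpr("dst AS id")).distinct()

edges.write.mode("append").parquet("hdfs:///graph/edges/")
nodes.write.mode("overwrite").parquet("hdfs:///graph/nodes/")

spark.stop()
```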
Steps to deploy a local Spark cluster with Docker. Bonus: a ready-to-use notebook for model prediction in PySpark using a spark.ml Pipeline() on a well-known dataset (pipeline sketch below).
Updated Jul 10, 2023 - Jupyter Notebook
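The notebook's dataset is not named here, so this Pipeline() sketch uses toy data with placeholder column names (feat1, feat2, label_str).

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pipeline-demo").getOrCreate()

# Toy dataset standing in for the real one.
df = spark.createDataFrame(
    [(1.0, 2.0, "yes"), (0.5, 1.0, "no"), (3.0, 0.2, "yes"), (0.1, 0.3, "no")],
    ["feat1", "feat2", "label_str"],
)

# Index the string label, assemble features, then fit a classifier.
pipeline = Pipeline(stages=[
    StringIndexer(inputCol="label_str", outputCol="label"),
    VectorAssembler(inputCols=["feat1", "feat2"], outputCol="features"),
    LogisticRegression(maxIter=10),
])
model = pipeline.fit(df)
model.transform(df).select("features", "label", "prediction").show()

spark.stop()
```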
A Spark cluster containing multiple Spark masters, based on docker-compose (sketch below).
Updated Mar 23, 2018 - Shell
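If the masters are set up as ZooKeeper-backed standbys (an assumption about this compose setup), an application can list all of them in the master URL and the driver will fail over to whichever one is active. Hostnames and ports here are placeholders matching a typical docker-compose service layout.

```python
from pyspark.sql import SparkSession

# With ZooKeeper-based HA, only one standalone master is active at a time;
# listing both lets the driver recover if the active master goes down.
spark = (
    SparkSession.builder
    .appName("multi-master-demo")
    .master("spark://spark-master-1:7077,spark-master-2:7077")
    .getOrCreate()
)
spark.range(10).show()
spark.stop()
```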
Spark standalone architecture, local architecture, and reading Hadoop file formats, i.e. Avro, Parquet, and ORC (reader sketch below).
Updated Jan 4, 2021 - Jupyter Notebook
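A short sketch of reading the three formats mentioned above; the paths are placeholders. Parquet and ORC readers are built in, while Avro requires the external spark-avro package (e.g. --packages org.apache.spark:spark-avro_2.12:&lt;spark version&gt;).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hadoop-formats").getOrCreate()

# Placeholder HDFS paths for the same logical dataset in three formats.
parquet_df = spark.read.parquet("hdfs:///data/events.parquet")
orc_df = spark.read.orc("hdfs:///data/events.orc")
avro_df = spark.read.format("avro").load("hdfs:///data/events.avro")  # needs spark-avro

for name, df in [("parquet", parquet_df), ("orc", orc_df), ("avro", avro_df)]:
    print(name, df.count())

spark.stop()
```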
Docker Spark standalone setup.
Updated Jul 8, 2019 - Dockerfile
Script to find similarities between movies in the MovieLens dataset using Python and Spark clustering (sketch below).
Updated Sep 30, 2020 - Python
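One common approach to MovieLens item similarity, sketched here as an assumption about the script: self-join the ratings on userId to form movie pairs, then rank pairs by how many users rated both. The file path follows the standard MovieLens CSV layout.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("movie-similarities").getOrCreate()

# Standard MovieLens ratings file: userId, movieId, rating, timestamp.
ratings = spark.read.csv("ml-latest-small/ratings.csv", header=True, inferSchema=True)

pairs = (
    ratings.alias("a")
    .join(ratings.alias("b"), "userId")
    .filter(F.col("a.movieId") < F.col("b.movieId"))   # drop self-pairs and duplicates
    .groupBy(F.col("a.movieId").alias("movie1"), F.col("b.movieId").alias("movie2"))
    .agg(F.count("*").alias("co_ratings"))
    .orderBy(F.desc("co_ratings"))
)
pairs.show(10)

spark.stop()
```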