Scheduling Big Data Workloads and Data Pipelines in the Cloud with pyDag
Updated Sep 19, 2022 - Python
Run Jupyter Notebooks (and store data) on Google Cloud Platform.
GCP_Data_Enginner
An educational project that builds an end-to-end pipeline for near-real-time and batch data processing, with the output feeding visualisation and a machine learning model.
Data Workflows with GCP Dataproc, Apache Airflow and Apache Spark
A Java-based project that extracts news articles from large .sgm files, processes them, and loads them into a MongoDB database. It includes an Apache Spark job for word frequency analysis directly on the .sgm files, and a sentiment analysis implementation using a Bag-of-Words model in Java.
A Scala and Spark based project for experimenting with map-reduce algorithms on large graph-shaped datasets
Yelp ETL Pipeline in Apache Spark on Google Cloud Dataproc
Collection of personal resources on Google Cloud
Project for the course "Creating a Fully Managed Hadoop Ecosystem with Google Cloud Dataproc" from the Digital Innovation One Data Engineer Bootcamp
Create a Dataproc cluster with gcloud using this GitHub Action
Example Terraform project that uses GCP to provision an Apache Spark cluster with a Jupyter Notebook interface.
Deploying a production-ready environment for a Spark cluster
Counts how often each word occurs in a dataset of textbooks, using a Dataproc cluster on Google Cloud Platform.
Source code: flight analysis based on the work of Valliappa Lakshmanan.
Training a classification model as a Dataproc job and using a Kafka/PubSub connector for real-time prediction with pre-trained models
PySpark job that runs on a Dataproc cluster and loads data from Cloud Storage into a BigQuery table.
Kaggle - Outbrain Click Prediction (Oct-2016 - Jan-2017)
Content about how to create big data ecosystems on the Cloud