Azure projects: an end-to-end data engineering project with a medallion architecture using Azure Data Factory and Azure Databricks, plus an Azure serverless/logical data warehouse using Azure Synapse Analytics to demonstrate CETAS, data modeling, incremental loading, CDC, and SQL monitoring of the data processing, connected to Power BI.
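The incremental-loading step mentioned above typically follows a high-watermark pattern: each run pulls only rows modified since the last recorded watermark, then advances the watermark. A minimal plain-Python sketch of that logic (the in-memory `SOURCE` table and `load_increment` helper are hypothetical stand-ins for the ADF/Databricks equivalents):

```python
from datetime import datetime, timezone

# Hypothetical source table: each row carries a last_modified timestamp.
SOURCE = [
    {"id": 1, "last_modified": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "last_modified": datetime(2024, 2, 1, tzinfo=timezone.utc)},
    {"id": 3, "last_modified": datetime(2024, 3, 1, tzinfo=timezone.utc)},
]

def load_increment(rows, watermark):
    """Return rows modified after the watermark, plus the new watermark."""
    fresh = [r for r in rows if r["last_modified"] > watermark]
    new_watermark = max((r["last_modified"] for r in fresh), default=watermark)
    return fresh, new_watermark

# A run with a mid-January watermark picks up only the February and March rows.
wm = datetime(2024, 1, 15, tzinfo=timezone.utc)
batch, wm = load_increment(SOURCE, wm)
```

In the real pipeline the watermark would be persisted (e.g. in a control table) between runs rather than held in a local variable.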
Contains my project that analyzes air quality sensor data to determine if the NO (Nitric Oxide) sensor in N. Mai, Los Angeles, CA can be removed without affecting data accuracy.
A PySpark repository for data analysis, machine learning projects, and hands-on exercises. Explore scalable data processing and advanced ML workflows with Spark.
This repository contains an end-to-end real-time YouTube comments sentiment analysis solution. It uses Azure Event Hub for data ingestion, Azure Data Factory for orchestration, and Databricks for data processing with VADER for sentiment analysis. The pipeline outputs results to Delta Lake for scalable querying and storage.
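VADER produces a normalized compound score in [-1, 1], and the conventional thresholds for turning that score into a sentiment label are ±0.05. A minimal sketch of that labeling step (the compound values below are illustrative, not computed by VADER):

```python
def vader_label(compound: float) -> str:
    """Map a VADER compound score to a sentiment label
    using the conventional +/-0.05 thresholds."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

# Illustrative compound scores for three comments.
print(vader_label(0.62))   # positive
print(vader_label(-0.40))  # negative
print(vader_label(0.01))   # neutral
```

In the pipeline itself, the compound score would come from `SentimentIntensityAnalyzer.polarity_scores(text)["compound"]` inside the Databricks job before writing labeled rows to Delta Lake.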
This repository contains a data engineering project analyzing global earthquake events. Utilizing Microsoft Fabric, PySpark, and Power BI, it automates data fetching and cleaning from the USGS Earthquake Catalog and provides dynamic visualizations to uncover insights.
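The USGS Earthquake Catalog exposes an FDSN event query endpoint, so the automated fetch described above can be parameterized along these lines (the magnitude cutoff and date range here are example values, not the project's actual settings):

```python
from urllib.parse import urlencode

BASE = "https://earthquake.usgs.gov/fdsnws/event/1/query"

def usgs_query_url(start: str, end: str, min_magnitude: float = 4.0) -> str:
    """Build a GeoJSON query URL for the USGS Earthquake Catalog."""
    params = {
        "format": "geojson",
        "starttime": start,
        "endtime": end,
        "minmagnitude": min_magnitude,
    }
    return f"{BASE}?{urlencode(params)}"

url = usgs_query_url("2024-01-01", "2024-01-02")
# The resulting URL can then be fetched (e.g. with requests.get(url).json())
# and the GeoJSON features flattened into a PySpark DataFrame for cleaning.
```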
The IPL Data Analysis project aims to explore and analyze the Indian Premier League (IPL) data using PySpark for data processing and Matplotlib and Seaborn for data visualization. The goal is to derive actionable insights into player performances, match trends, and overall league dynamics.
This project presents a comprehensive data pipeline designed to predict customer churn using historical customer data. By leveraging Hadoop and PySpark, this pipeline efficiently processes large datasets, performs feature engineering, and trains a machine learning model to identify customers at risk of leaving.
In this project I performed a complete analysis of 2017 PSL data: I uploaded the data to an AWS S3 bucket, loaded it into Databricks, and applied various transformations to the datasets to derive insights.
This repository contains Databricks projects utilizing RDDs, DataFrames, and SQL to process and analyze various real-world datasets. Data cleaning and analysis have been performed using PySpark functions to handle challenges such as inconsistent formats, missing values, and complex data structures. The project ensures efficient data transformation.
Design and populate a data lake on video game sales. Combine two (semi-)structured, denormalized data sources: the Kaggle API (a games dataset with release dates and ratings) and the Twitter API (comments gathered from hashtags of the game names, retrieved with Python code).