This project focuses on cleaning and analyzing a dataset containing information on layoffs in the tech industry. The dataset includes details on affected companies, industries, locations, and funding levels. The goal is to clean and process the data using both MySQL and Python (Pandas) to compare their effectiveness in handling data cleaning and analysis.
- layoffs_cleaning.sql – SQL script for cleaning the dataset using MySQL.
- layoffs_cleaning.ipynb – Jupyter Notebook replicating the cleaning process using Python (Pandas).
- layoffs.csv – The original raw dataset.
- layoffs_cleaned.csv – The cleaned dataset after processing.
- README.md – This file, which provides project details and comparisons between MySQL and Python.
- Created a staging table to preserve raw data.
- Identified and removed duplicates using `ROW_NUMBER()`.
- Standardized company and country names using `TRIM()` and `LIKE`.
- Converted the `date` column to a proper `DATE` format using `STR_TO_DATE()`.
- Handled missing values by filling them based on related records.
- Removed rows where critical numerical values were missing.
- Performed analysis on layoffs by industry, company, country, and year.
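
The staging, deduplication, and text-standardization steps above can be sketched roughly as follows. This is an illustration only, not the contents of `layoffs_cleaning.sql`: it uses SQLite through Python's built-in `sqlite3` module as a stand-in for MySQL, the table and column names are hypothetical, and it keeps the first row of each duplicate group in a new table rather than issuing a `DELETE`.

```python
import sqlite3

# Hypothetical miniature of the raw table; the real column names may differ.
rows = [
    ("  Acme ", "United States.", 100),
    ("  Acme ", "United States.", 100),  # exact duplicate of the row above
    ("Globex", "Canada", 50),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE layoffs (company TEXT, country TEXT, total_laid_off INTEGER)"
)
conn.executemany("INSERT INTO layoffs VALUES (?, ?, ?)", rows)

# Staging copy, so the raw table is never modified.
conn.execute("CREATE TABLE layoffs_staging AS SELECT * FROM layoffs")

# ROW_NUMBER() numbers the rows inside each group of identical values;
# any row numbered above 1 is a duplicate and is left out of the new table.
conn.execute("""
    CREATE TABLE layoffs_dedup AS
    SELECT company, country, total_laid_off
    FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY company, country, total_laid_off
        ) AS row_num
        FROM layoffs_staging
    )
    WHERE row_num = 1
""")

# TRIM() removes stray whitespace; LIKE finds the variants of a country name.
conn.execute("UPDATE layoffs_dedup SET company = TRIM(company)")
# SQLite spelling; in MySQL the same idea is TRIM(TRAILING '.' FROM country).
conn.execute(
    "UPDATE layoffs_dedup SET country = RTRIM(country, '.') "
    "WHERE country LIKE 'United States%'"
)

cleaned = conn.execute(
    "SELECT company, country, total_laid_off FROM layoffs_dedup ORDER BY company"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM layoffs").fetchone()[0]
```

Date conversion is left out of this sketch because `STR_TO_DATE()` is specific to MySQL and has no direct SQLite counterpart.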
- Loaded the dataset using Pandas.
- Removed duplicates with `groupby()` and `cumcount()`.
- Standardized text fields by converting them to lowercase and stripping special characters.
- Converted the `date` column to `datetime` format using `pd.to_datetime()`.
- Filled missing values using grouped data (mode per country).
- Identified and removed outliers using the interquartile range (IQR) method.
- Analyzed layoffs by company, industry, country, and year.
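
A minimal sketch of how the deduplication, text standardization, date conversion, and IQR steps above might look in Pandas. The sample frame, its column names, the `%m/%d/%Y` date format, the 1.5 × IQR cutoff, and the helper function names are all assumptions made for illustration; the notebook itself may differ.

```python
import pandas as pd

# Hypothetical sample; column names are assumptions, not read from layoffs.csv.
raw = pd.DataFrame({
    "company": [" Acme!", " Acme!", "Globex ", "Initech", "Umbrella", "Hooli"],
    "country": ["USA", "USA", "Canada", "USA", "USA", "USA"],
    "date": ["03/06/2023", "03/06/2023", "12/01/2022",
             "01/15/2023", "02/01/2023", "02/10/2023"],
    "total_laid_off": [100, 100, 90, 120, 110, 5000],
})

def remove_duplicates(df):
    # cumcount() numbers rows within each group of identical values (0, 1, 2, ...),
    # much like ROW_NUMBER() in SQL; anything above 0 is a repeat.
    occurrence = df.groupby(list(df.columns), dropna=False).cumcount()
    return df[occurrence == 0].reset_index(drop=True)

def standardize_text(series):
    # Lowercase, drop everything except letters, digits, and spaces, then trim.
    return (
        series.str.lower()
        .str.replace(r"[^a-z0-9 ]", "", regex=True)
        .str.strip()
    )

def remove_iqr_outliers(df, column):
    # Keep rows within 1.5 * IQR of the quartiles (rows with NaN are dropped too).
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    keep = df[column].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df[keep].reset_index(drop=True)

clean = remove_duplicates(raw)
clean["company"] = standardize_text(clean["company"])
clean["date"] = pd.to_datetime(clean["date"], format="%m/%d/%Y")  # assumed format
clean = remove_iqr_outliers(clean, "total_laid_off")
```

In this sample the repeated Acme row is dropped first, and the 5000 entry falls outside the IQR fence and is removed as an outlier.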
| Feature | MySQL | Python (Pandas) |
|---|---|---|
| Duplicate Removal | Uses `ROW_NUMBER()` & `DELETE` | Uses `groupby().cumcount()` & `drop_duplicates()` |
| Text Standardization | Uses `TRIM()` & `LIKE` | Uses `str.strip()` & `apply()` |
| Date Conversion | Uses `STR_TO_DATE()` & `ALTER` | Uses `pd.to_datetime()` |
| Handling Missing Data | Uses `UPDATE` & `JOIN` | Uses `fillna()` & `map()` |
| Performance | Faster for large structured data | More flexible for complex transformations |
| Ease of Use | Requires SQL queries | More programmatic and adaptable |
- MySQL is efficient for handling structured datasets stored in databases.
- Python (Pandas) is more flexible for complex data transformations and analysis.
- Both approaches work well, but Python simplifies handling missing values dynamically.
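
To illustrate the point about missing values, here is one way the `fillna()` and `map()` combination could fill gaps with the most frequent value per country. The frame, the `industry` column, and the helper name `fill_with_group_mode` are hypothetical and only serve the example.

```python
import pandas as pd

# Hypothetical data with gaps in the "industry" column.
df = pd.DataFrame({
    "country": ["USA", "USA", "USA", "Canada", "Canada"],
    "industry": ["retail", "retail", None, "finance", None],
})

def fill_with_group_mode(df, target, by):
    # Most frequent non-missing value of `target` within each `by` group.
    modes = (
        df.dropna(subset=[target])
        .groupby(by)[target]
        .agg(lambda s: s.mode().iloc[0])
    )
    # map() looks up the mode for each row's group; fillna() only touches gaps.
    # Groups with no known value at all are left missing.
    return df[target].fillna(df[by].map(modes))

df["industry"] = fill_with_group_mode(df, "industry", "country")
```

The SQL counterpart needs a self-`JOIN` inside an `UPDATE`, which is why the Pandas version tends to read as the shorter of the two.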
- Import `layoffs.csv` into MySQL.
- Execute `layoffs_cleaning.sql`.
- Query `layoffs_clean2` for the cleaned dataset.
- Open `layoffs_cleaning.ipynb` in Jupyter Notebook.
- Run all cells to process the dataset.
- The cleaned dataset will be saved as `layoffs_cleaned.csv`.
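
Once the cleaned data is available, the analysis by company, industry, country, and year could be sketched as below. The column names (`company`, `industry`, `country`, `date`, `total_laid_off`) and the helper `summarize_layoffs` are assumptions for illustration and should be adjusted to match the actual file.

```python
import pandas as pd

def summarize_layoffs(df):
    """Return total layoffs per company, industry, country, and year."""
    # Derive the year from the date column without modifying the input frame.
    df = df.assign(year=pd.to_datetime(df["date"]).dt.year)
    return {
        key: df.groupby(key)["total_laid_off"].sum().sort_values(ascending=False)
        for key in ["company", "industry", "country", "year"]
    }

# Tiny hypothetical frame standing in for the cleaned dataset.
sample = pd.DataFrame({
    "company": ["acme", "acme", "globex"],
    "industry": ["retail", "retail", "finance"],
    "country": ["usa", "usa", "canada"],
    "date": ["2022-11-01", "2023-03-06", "2023-01-15"],
    "total_laid_off": [100, 40, 50],
})

totals = summarize_layoffs(sample)
```

With the real output, the same function could be called as `summarize_layoffs(pd.read_csv("layoffs_cleaned.csv"))`, provided the file uses these column names.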
This project explores layoff trends in the tech industry, highlighting affected companies, industries, and regions. The comparison between MySQL and Python demonstrates how both tools handle data cleaning efficiently but with different strengths.
If you have suggestions or improvements, feel free to contribute or raise an issue!
Author: Naitik Nayak