This repository contains an exercise on data cleaning using a dataset focused on company layoffs. The aim was to apply data cleaning techniques to prepare the dataset for further analysis, visualization, and modeling. This exercise helps demonstrate the importance of data cleaning in ensuring data integrity and reliability for effective data analysis and machine learning.
The dataset used for this exercise is `layoffs.csv`, which includes various attributes related to layoffs at different companies.
- Pandas: For data manipulation and analysis.
- Seaborn: For creating informative and attractive visualizations.
- Matplotlib: For additional plotting capabilities.
- Scipy: For statistical functions.
- Scikit-learn: For preprocessing tasks.
**Exploration:**
- Loaded the dataset and performed initial exploration to identify missing values and get summary statistics.
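A minimal sketch of this step, assuming `layoffs.csv` sits in the working directory:

```python
import pandas as pd

# Load the raw dataset
df = pd.read_csv("layoffs.csv")

# Initial exploration: shape, column types, missing values, and summary statistics
print(df.shape)
df.info()
print(df.isnull().sum())
print(df.describe())
```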
**Handling Missing Values:**
- Addressed missing values by using mode for categorical data and mean/median for numerical fields based on their distribution.
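A sketch of that imputation logic, continuing with the `df` loaded above; which columns are categorical versus numerical, and which numerical fields are skewed enough to warrant the median, are assumptions about the dataset:

```python
# Categorical columns (assumed names): fill with the most frequent value
for col in ["industry", "stage", "country"]:
    df[col] = df[col].fillna(df[col].mode()[0])

# Numerical columns: mean for roughly symmetric fields, median for skewed ones (assumed split)
df["percentage_laid_off"] = df["percentage_laid_off"].fillna(df["percentage_laid_off"].mean())
df["total_laid_off"] = df["total_laid_off"].fillna(df["total_laid_off"].median())
df["funds_raised_millions"] = df["funds_raised_millions"].fillna(df["funds_raised_millions"].median())
```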
**Outlier Detection and Reduction:**
- Detected outliers using the z-score for `total_laid_off` and the IQR method for `funds_raised_millions`.
- Utilized boxplots and distplots to visualize the impact of outliers and the effectiveness of the chosen methods.
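A sketch of both methods; the cut-offs (|z| > 3 and 1.5 × IQR) are the conventional defaults rather than values confirmed here, and outliers are simply dropped:

```python
import numpy as np
from scipy import stats

# Z-score method for total_laid_off: keep rows within 3 standard deviations of the mean
z = np.abs(stats.zscore(df["total_laid_off"]))
df = df[z <= 3]

# IQR method for funds_raised_millions: keep rows inside the 1.5 * IQR fences
q1 = df["funds_raised_millions"].quantile(0.25)
q3 = df["funds_raised_millions"].quantile(0.75)
iqr = q3 - q1
df = df[df["funds_raised_millions"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```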
**Data Type Conversion:**
- Converted the `date` column to the datetime type to ensure proper formatting.
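In pandas this is a single call, assuming the column is literally named `date`:

```python
# Parse the date column into datetime64; unparseable values become NaT
df["date"] = pd.to_datetime(df["date"], errors="coerce")
```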
**Removing Duplicates:**
- Cleaned up duplicate entries to enhance data quality.
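The corresponding pandas sketch:

```python
# Drop exact duplicate rows, keeping the first occurrence
df = df.drop_duplicates().reset_index(drop=True)
```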
**Encoding & Scaling:**
- Applied label encoding to categorical columns and Min-Max scaling to numerical features.
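A sketch with scikit-learn; the split between categorical and numerical columns is an assumption about the dataset:

```python
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Label-encode categorical columns (assumed names)
for col in ["company", "location", "industry", "stage", "country"]:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

# Scale numerical features to the [0, 1] range
num_cols = ["total_laid_off", "percentage_laid_off", "funds_raised_millions"]
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])
```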
- Boxplots and Distplots: Visualized the distribution and characteristics of the data to aid in understanding and handling various data integrity issues.
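A sketch of the kinds of plots used; note that seaborn's `distplot` is deprecated, so `histplot` appears here as its current equivalent:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Boxplot to inspect outliers in a numerical column
sns.boxplot(x=df["total_laid_off"])
plt.show()

# Distribution plot (histplot replaces the deprecated distplot)
sns.histplot(df["total_laid_off"], kde=True)
plt.show()
```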
This exercise provided hands-on experience with data cleaning techniques, illustrating their importance in preparing data for accurate analysis and modeling. The process involved trial and error to find the most effective methods, making the learning experience both comprehensive and engaging.
- Clone the repository:
  `git clone https://github.com/Mehnaz2004/Data-Cleaning-CaseStudy.git`
- Navigate to the directory:
  `cd Data-Cleaning-CaseStudy`
- Run the data cleaning script:
  `python data_cleaning_script.py`