This repository contains an exercise on data cleaning using a dataset focused on company layoffs. The aim was to apply data cleaning techniques to prepare the dataset for further analysis, visualization, and modeling. This exercise helps demonstrate the importance of data cleaning in ensuring data integrity and reliability for effective data analysis and machine learning.
The dataset used for this exercise is `layoffs.csv`, which includes various attributes related to layoffs at different companies.
- Pandas: For data manipulation and analysis.
- Seaborn: For creating informative and attractive visualizations.
- Matplotlib: For additional plotting capabilities.
- Scipy: For statistical functions.
- Scikit-learn: For preprocessing tasks.
**Exploration:**
- Loaded the dataset and performed initial exploration to identify missing values and get summary statistics.
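A minimal sketch of this step, assuming `layoffs.csv` sits in the working directory:

```python
import pandas as pd

# Load the raw dataset
df = pd.read_csv("layoffs.csv")

# Initial exploration: shape, column types, missing values, and summary statistics
print(df.shape)
df.info()
print(df.isnull().sum())
print(df.describe())
```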
**Handling Missing Values:**
- Addressed missing values by using mode for categorical data and mean/median for numerical fields based on their distribution.
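A sketch of that imputation logic, continuing with the `df` loaded above; which columns are categorical versus numerical, and which numerical fields are skewed enough to warrant the median, are assumptions about the dataset:

```python
# Categorical columns (assumed names): fill with the most frequent value
for col in ["industry", "stage", "country"]:
    df[col] = df[col].fillna(df[col].mode()[0])

# Numerical columns: mean for roughly symmetric fields, median for skewed ones (assumed split)
df["percentage_laid_off"] = df["percentage_laid_off"].fillna(df["percentage_laid_off"].mean())
df["total_laid_off"] = df["total_laid_off"].fillna(df["total_laid_off"].median())
df["funds_raised_millions"] = df["funds_raised_millions"].fillna(df["funds_raised_millions"].median())
```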
**Outlier Detection and Reduction:**
- Detected outliers using the z-score for `total_laid_off` and the IQR method for `funds_raised_millions`.
- Utilized boxplots and distplots to visualize the impact of outliers and the effectiveness of the chosen methods.
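A sketch of both methods; the cut-offs (|z| > 3 and 1.5 × IQR) are the conventional defaults rather than values confirmed here, and outliers are simply dropped:

```python
import numpy as np
from scipy import stats

# Z-score method for total_laid_off: keep rows within 3 standard deviations of the mean
z = np.abs(stats.zscore(df["total_laid_off"]))
df = df[z <= 3]

# IQR method for funds_raised_millions: keep rows inside the 1.5 * IQR fences
q1 = df["funds_raised_millions"].quantile(0.25)
q3 = df["funds_raised_millions"].quantile(0.75)
iqr = q3 - q1
df = df[df["funds_raised_millions"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```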
**Data Type Conversion:**
- Converted the `date` column to the datetime type to ensure proper formatting.
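In pandas this is a single call, assuming the column is literally named `date`:

```python
# Parse the date column into datetime64; unparseable values become NaT
df["date"] = pd.to_datetime(df["date"], errors="coerce")
```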
**Removing Duplicates:**
- Cleaned up duplicate entries to enhance data quality.
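The corresponding pandas sketch:

```python
# Drop exact duplicate rows, keeping the first occurrence
df = df.drop_duplicates().reset_index(drop=True)
```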
**Encoding & Scaling:**
- Applied label encoding to categorical columns and Min-Max scaling to numerical features.
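A sketch with scikit-learn; the split between categorical and numerical columns is an assumption about the dataset:

```python
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Label-encode categorical columns (assumed names)
for col in ["company", "location", "industry", "stage", "country"]:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

# Scale numerical features to the [0, 1] range
num_cols = ["total_laid_off", "percentage_laid_off", "funds_raised_millions"]
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])
```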
- Boxplots and Distplots: Visualized the distribution and characteristics of the data to aid in understanding and handling various data integrity issues.
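A sketch of the kinds of plots used; note that seaborn's `distplot` is deprecated, so `histplot` appears here as its current equivalent:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Boxplot to inspect outliers in a numerical column
sns.boxplot(x=df["total_laid_off"])
plt.show()

# Distribution plot (histplot replaces the deprecated distplot)
sns.histplot(df["total_laid_off"], kde=True)
plt.show()
```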
This exercise provided hands-on experience with data cleaning techniques, illustrating their importance in preparing data for accurate analysis and modeling. The process involved trial and error to find the most effective methods, making the learning experience both comprehensive and engaging.
- Clone the repository:
  `git clone https://github.com/Mehnaz2004/Data-Cleaning-CaseStudy.git`
- Navigate to the directory:
  `cd Data-Cleaning-CaseStudy`
- Run the data cleaning script:
  `python data_cleaning_script.py`