This repository contains the tasks assigned to me by TechnoHacks EduTech during my Data Analytics Internship.
Task 1: Data Cleaning - Cleaned a dataset by removing missing values and outliers. I used the Titanic dataset from Kaggle and Python libraries such as NumPy, Pandas, and SciPy to complete this task.
Identified missing values in the dataset using functions such as `isnull()` and `isnull().sum()`. Handled missing values by filling them with the `fillna()` function, using `mean()` and `mode()` to fill specific columns. Dropped the column that contained many missing values, then verified the cleaned DataFrame to confirm no missing values remained. Used a statistical method for outlier detection: the IQR (Interquartile Range) method. Calculated the IQR, then detected and removed outliers from specific columns.
Task 2: Summary Statistics - Calculated summary statistics (mean, median, mode, standard deviation) for the numeric columns in a dataset. I used the Titanic dataset from Kaggle and Python libraries such as NumPy and Pandas to complete this task.
Calculated the mean (average) with the Pandas `mean()` function, the median (middle value) with the `median()` function, and the standard deviation (dispersion of a set of data points) with the `std()` function. Calculated the mode (most frequent value) of the numeric columns with the Pandas `mode()` function.
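A short sketch of these calls, using a small made-up numeric sample standing in for a Titanic column such as `Age` (the values are illustrative, not from the dataset):

```python
import pandas as pd

# Illustrative numeric sample standing in for a column like "Age".
ages = pd.Series([22, 38, 26, 35, 35, 27, 54, 2])

print("mean:", ages.mean())      # average -> 29.875
print("median:", ages.median())  # middle value -> 31.0
print("std:", ages.std())        # dispersion (sample std; Pandas uses ddof=1 by default)
print("mode:", ages.mode()[0])   # most frequent value -> 35
```

Note that `mode()` returns a Series (there can be ties), so the first entry is taken with `[0]`.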
Task 3: Remove Duplicates - Identified and removed duplicate rows in a dataset. I used the Iris dataset from Kaggle and Python libraries such as NumPy and Pandas to complete this task.
Identified duplicate rows with the `duplicated()` method, which flags each row that repeats an earlier one. Removed them with the `drop_duplicates()` method, and finally checked that no duplicates remained in the dataset.
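The same workflow in miniature, with a tiny hand-made frame (two Iris-style columns and one repeated row) in place of the full Iris CSV:

```python
import pandas as pd

# Toy frame with a repeated row, standing in for the Iris data.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 5.1, 6.3],
    "sepal_width":  [3.5, 3.0, 3.5, 2.9],
})

# duplicated() flags rows that repeat an earlier row (row 2 here).
print(df.duplicated())

# drop_duplicates() keeps the first occurrence of each row.
df = df.drop_duplicates()

# Verify the duplicates are gone.
assert not df.duplicated().any()
```

By default both methods compare entire rows and keep the first occurrence; a `subset=` argument restricts the comparison to particular columns.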