Data Science Internship Project [Team M]
This project involves analyzing car data obtained from web scraping. The dataset includes information on car models, their prices, transmission types, and kilometers driven. The analysis includes handling missing data, detecting and removing outliers, and visualizing various aspects of the data.
To run the analysis, you will need the following Python libraries:
pandas numpy matplotlib seaborn
The dataset is loaded from a CSV file named "Final File.csv" into a pandas DataFrame.
The code checks for missing values in the DataFrame to ensure data completeness.
The maximum values for each column are calculated to understand the range of data.
Relevant columns are extracted from the DataFrame for focused analysis.
Outliers in the 'Kilometers' column are detected using Z-scores. A Z-score threshold of 3 is used to identify outliers, which are then removed from the DataFrame.
Various visualizations are created to gain insights from the data:
Bar Plot of Car Prices: A bar plot is created to show the prices of different car models. The x-axis represents the car models, and the y-axis represents their prices in lakhs.
Stacked Bar Chart of Transmission Distribution: The distribution of transmission types (manual or automatic) across different car models is visualized using a stacked bar chart. This helps in understanding the prevalence of each transmission type for different car models.
Scatter Plot of Car Prices: A scatter plot is created to show the relationship between car models and their prices. This plot helps in identifying any patterns or trends in pricing across different models.
The average price and average kilometers driven for the cars in the dataset are calculated and printed. This provides a summary statistic that gives an overall view of the data.
This project demonstrates the process of cleaning and analyzing car data obtained from web scraping. By handling missing data, removing outliers, and visualizing the data, valuable insights can be gained. This analysis can be extended further to include more sophisticated models and predictions based on the car data.