Exploratory data analysis (EDA) is the practice of examining a dataset to find correlations between variables, understand how its values are distributed, detect and handle outliers, and visualize the data in order to better understand its nature and characteristics. Here are some key concepts in EDA:
1. Data Cleaning:
Missing data: Handle through imputation, removal, or more advanced techniques such as model-based imputation. Outlier detection: Identify and handle data points that differ significantly from the rest. Duplicate data: Identify and remove identical or redundant records.
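As a minimal sketch of the cleaning steps above, assuming pandas is available and using a small made-up table (the column names and values are hypothetical), missing values can be counted, duplicates dropped, and the remaining gaps imputed with the median:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing age and one exact duplicate row.
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0, 32.0],
    "city": ["Oslo", "Lima", "Pune", "Kyiv", "Lima"],
})

# Count missing values per column before deciding how to handle them.
missing_per_column = df.isna().sum()

# Drop exact duplicate rows, keeping the first occurrence.
deduped = df.drop_duplicates()

# Impute the remaining missing ages with the column median.
median_age = deduped["age"].median()
cleaned = deduped.assign(age=deduped["age"].fillna(median_age))

print(missing_per_column)
print(cleaned)
```

Median imputation is only one option; whether to impute or drop rows depends on how much data is missing and why.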
2. Univariate Analysis:
Descriptive statistics: Mean, median, mode, range, and percentiles. Visualizations: Histograms, kernel density plots, box plots for data distribution.
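The range, percentiles, and histogram mentioned above can be sketched with NumPy as follows. The sample values and the choice of four bins are assumptions made for illustration only:

```python
import numpy as np

# Hypothetical sample of a single numeric variable.
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9], dtype=float)

# Range: distance between the largest and smallest value.
value_range = values.max() - values.min()

# 25th, 50th and 75th percentiles (linear interpolation by default).
q1, median, q3 = np.percentile(values, [25, 50, 75])

# Histogram counts: how many values fall into each of four equal-width bins.
counts, edges = np.histogram(values, bins=4, range=(1, 9))

print("range:", value_range)
print("quartiles:", q1, median, q3)
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:.0f}, {right:.0f}): {'#' * count}")
```

The printed bars are a text stand-in for a histogram plot; the single value at 9 shows up as an isolated bin far from the bulk of the data.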
3. Bivariate Analysis:
Scatter plots: Visualize the relationship between two continuous variables. Heatmaps: Depict the pairwise correlations between variables as a color-coded correlation matrix.
4. Multivariate Analysis:
Use techniques like pair plots or 3D scatter plots to explore relationships among multiple variables simultaneously.
5. Statistical Measures:
Mean: Average value. Median: Middle value of the sorted data. Mode: Most frequently occurring value. Variance: Average squared deviation from the mean. Standard Deviation: Square root of the variance, expressed in the same units as the data.
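The measures above can be computed with Python's standard `statistics` module. This is a small sketch on a made-up sample; it also shows the difference between the population variance (divide by n) and the sample variance (divide by n - 1), a distinction the list above does not spell out:

```python
import statistics

# Hypothetical sample.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)

# Population variance and standard deviation divide by n.
pop_var = statistics.pvariance(values)
pop_std = statistics.pstdev(values)

# Sample variance divides by n - 1 and is slightly larger.
sample_var = statistics.variance(values)

print(mean, median, mode, pop_var, pop_std, sample_var)
```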
6. Data Visualization:
Understand when to use common plots: bar charts compare values across categories, line charts show trends over an ordered variable such as time, and pie charts show parts of a whole. Choose the visualization based on the type of data.
7. Dimensionality Reduction:
Principal Component Analysis (PCA): Reduce the number of variables by projecting the data onto a few uncorrelated components that retain most of its variance. Feature selection: Identify and keep only the most relevant features.
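To make the PCA idea concrete, here is a minimal sketch using NumPy's singular value decomposition. The helper name `pca` and the synthetic two-feature data are assumptions for illustration; in practice a library implementation would usually be used instead:

```python
import numpy as np

def pca(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    # Center each feature so variance is measured around the mean.
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data; the rows of Vt are the principal directions.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Fraction of the total variance captured by each component.
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    return X_centered @ components.T, explained_ratio[:n_components]

# Hypothetical data: the second feature is almost a multiple of the first.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

projected, ratio = pca(X, n_components=1)
print(projected.shape, ratio)
```

Because the two features are nearly collinear, a single component should capture almost all of the variance, which is the situation where dropping dimensions loses little information.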
8. Correlation Analysis:
Correlation coefficient: Measure the strength and direction of the linear relationship between two variables, on a scale from -1 to 1. Understand how to interpret positive, negative, and zero correlations.
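The three cases above can be illustrated with the Pearson coefficient from `numpy.corrcoef`. The arrays are made up so that the results are close to +1, -1 and 0; the helper name `pearson` is hypothetical:

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]

hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

up = 2 * hours + 1                             # rises with hours
down = -3 * hours + 10                         # falls as hours rise
flat = np.array([1.0, -1.0, 0.0, -1.0, 1.0])   # no linear trend

print("positive:", pearson(hours, up))
print("negative:", pearson(hours, down))
print("zero:", pearson(hours, flat))
```

A coefficient near zero only rules out a linear relationship; the `flat` series still follows a clear non-linear pattern.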
9. Outlier Detection:
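Z-score and IQR methods: Flag values that lie many standard deviations from the mean, or well outside the interquartile range (IQR), since such outliers can skew the analysis.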
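Both methods can be sketched in a few lines of NumPy. The function names, the sample data, and the cutoffs (3 standard deviations and 1.5 times the IQR, which are common conventions rather than fixed rules) are assumptions for illustration:

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) > threshold]

def iqr_outliers(values, k=1.5):
    """Return values more than k * IQR below Q1 or above Q3."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical measurements with one obviously extreme value.
data = np.array(
    [10, 12, 11, 13, 12, 11, 10, 13, 12, 11, 12, 10, 13, 11, 95],
    dtype=float,
)

print("z-score:", zscore_outliers(data))
print("IQR:", iqr_outliers(data))
```

The z-score method relies on the mean and standard deviation, which the outlier itself inflates, so on very small samples it can miss extreme values that the IQR method still flags.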