This project performs a detailed analysis of Test cricket batting records using data cleaning, feature engineering, and exploratory data analysis (EDA). The goal is to uncover insights into players' performances and career trends.
import pandas as pd
df = pd.read_csv('TestMatch_Data - Test matches _ Batting records _ Highest career batting average _ ESPNcricinfo.csv')
- Initial Inspection:
  - Displayed the dataset using df.head() and verified dimensions with df.shape.
  - Checked for missing and duplicate values using df.isnull() and df.duplicated().
- Columns contained mixed data formats (e.g., Span, Matches).
- Player names included country affiliations.
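The inspection step can be sketched on a small stand-in table. The rows below are invented for illustration and are not taken from the ESPNcricinfo file; the point is only that summing the boolean results of isnull() and duplicated() gives counts that are easier to read than the raw masks.

```python
import pandas as pd

# Hypothetical miniature of the batting table; all values are made up.
sample = pd.DataFrame({
    "Player": ["A Batter (AUS)", "B Striker (ENG)", "B Striker (ENG)", "C Opener (IND)"],
    "Span": ["1928-1948", "2010-2023", "2010-2023", "1996-2012"],
    "Mat": [52, 20, 20, None],
})

# Count missing values per column instead of scanning the boolean mask by eye.
missing_per_column = sample.isnull().sum()

# Count rows that exactly repeat an earlier row.
duplicate_rows = int(sample.duplicated().sum())

print(sample.shape)
print(missing_per_column)
print(duplicate_rows)
```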
df.rename(columns={
'Mat': 'Matches',
'NO': 'Not_Out',
'HS': 'Highest_Inns_Score',
'BF': 'Ball_Faced',
'SR': 'Batting_Strike_Rate',
'0': 'Ducks'
}, inplace=True)
df.dropna(inplace=True)
df.drop_duplicates(inplace=True)
- Split Span into Debut_Year and Last_Year:
df['Debut_Year'] = df['Span'].str.split('-').str[0].astype(int)
df['Last_Year'] = df['Span'].str.split('-').str[1].astype(int)
df.drop(['Span'], axis=1, inplace=True)
- Extract Player Name and Country:
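# Strip the trailing space left before the opening parenthesis
df['Player_Name'] = df['Player'].str.split('(').str[0].str.strip()
# Raw string so the escaped parentheses are passed to the regex engine unchanged
df['Country'] = df['Player'].str.extract(r'\((.*?)\)')[0]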
df.drop(['Player'], axis=1, inplace=True)
- Convert Columns to Numeric:
# '*' marks a not-out innings; regex=False treats it as a literal character
df['Highest_Inns_Score'] = df['Highest_Inns_Score'].str.replace('*', '', regex=False).astype(int)
df['Matches'] = df['Matches'].astype(int)
df['4s'] = df['4s'].astype(int)
df['6s'] = df['6s'].astype(int)
# A character class avoids the invalid pattern '+|-', where '+' has nothing to repeat
df['Ball_Faced'] = df['Ball_Faced'].str.split(r'[+-]').str[0].astype(int)
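As an alternative sketch of the same cleaning steps, the parsing can be done with str.extract and pd.to_numeric. This is not the project's own code. The rows are invented, the raw column names (Player, Span, HS, BF) are the pre-rename names used above, and it assumes that markers such as a trailing `*` or `+` are the only non-digit characters in the score columns.

```python
import pandas as pd

# Hypothetical rows in the raw layout; all values are invented.
raw = pd.DataFrame({
    "Player": ["A Batter (AUS)", "B Striker (ENG)", "C Opener (ICC/IND)"],
    "Span": ["1928-1948", "2010-2023", "1996-2012"],
    "HS": ["334", "211*", "248*"],
    "BF": ["9800+", "3568", "29437+"],
})

# Named groups pull both years out of Span in one pass.
years = raw["Span"].str.extract(r"(?P<Debut_Year>\d{4})-(?P<Last_Year>\d{4})").astype(int)

# The name is everything before the opening parenthesis; the country is what is inside it.
who = raw["Player"].str.extract(r"^(?P<Player_Name>[^(]+?)\s*\((?P<Country>[^)]*)\)")

# Drop every non-digit, then let to_numeric turn anything unparseable into NaN.
highest = pd.to_numeric(raw["HS"].str.replace(r"\D", "", regex=True), errors="coerce")
balls = pd.to_numeric(raw["BF"].str.replace(r"\D", "", regex=True), errors="coerce")

clean = pd.concat([who, years], axis=1).assign(Highest_Inns_Score=highest, Ball_Faced=balls)
print(clean)
```

With errors="coerce", a value that cannot be parsed becomes NaN instead of raising an exception, which makes unexpected formats easier to find with isnull() afterwards.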
df['Career_Length'] = df['Last_Year'] - df['Debut_Year']
- Average Career Length:
df['Career_Length'].mean()
- Average Strike Rate for Players with Long Careers:
df[df['Career_Length'] > 10]['Batting_Strike_Rate'].mean()
- Players Who Debuted Before 1960:
df[df['Debut_Year'] < 1960]['Player_Name'].count()
- Highest Innings Scores by Country:
df.groupby('Country')['Highest_Inns_Score'].max()
- Most Centuries by a Single Player, per Country:
df.groupby('Country')['100'].max()
- Averages of Key Metrics by Country:
df.groupby('Country')[['100', '50', 'Ducks']].mean()
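The country-level queries above can also be collected into one table with named aggregation. The sketch below uses invented rows and hypothetical output column names. It includes both a max and a sum over the '100' column because they answer different questions: the most centuries by one player, and the total across a country's listed players.

```python
import pandas as pd

# Hypothetical cleaned rows using the renamed columns; all values are invented.
players = pd.DataFrame({
    "Country": ["AUS", "AUS", "ENG", "ENG", "IND"],
    "Highest_Inns_Score": [334, 250, 211, 364, 248],
    "100": [29, 10, 5, 19, 51],
    "Ducks": [7, 3, 2, 5, 14],
})

# Each keyword names an output column as (source column, aggregation).
summary = players.groupby("Country").agg(
    top_score=("Highest_Inns_Score", "max"),
    most_centuries_by_one_player=("100", "max"),
    total_centuries=("100", "sum"),
    mean_ducks=("Ducks", "mean"),
)
print(summary)
```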
import seaborn as sns
sns.set_theme(style='darkgrid')
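# Pass the frame as data= so the call also works on seaborn versions where arguments are keyword-only
sns.scatterplot(data=df.groupby('Country')[['100', '50', 'Ducks']].mean())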
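If a grouped chart is preferred over the wide-form scatter plot, the country means can first be reshaped to long form. This is a sketch with invented rows; the names Metric and Mean_Per_Player are hypothetical labels chosen for the example.

```python
import pandas as pd

# Hypothetical cleaned rows; all values are invented.
players = pd.DataFrame({
    "Country": ["AUS", "AUS", "ENG", "IND"],
    "100": [29, 11, 19, 51],
    "50": [13, 27, 23, 68],
    "Ducks": [7, 3, 5, 14],
})

country_means = players.groupby("Country")[["100", "50", "Ducks"]].mean()

# One row per (Country, Metric) pair, the long form that categorical plots expect.
long_means = country_means.reset_index().melt(
    id_vars="Country", var_name="Metric", value_name="Mean_Per_Player"
)
print(long_means)
```

The resulting frame can then be passed to a call such as sns.barplot(data=long_means, x='Country', y='Mean_Per_Player', hue='Metric') to draw one bar per metric within each country.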
This analysis summarizes Test cricket batting performances across players and countries. It highlights:
- Career trends, including the average career length and the average strike rate of players whose careers lasted more than 10 years.
- Country-wise performance metrics such as highest innings scores and the average number of centuries, fifties, and ducks per player.
- The importance of data cleaning and feature engineering in deriving meaningful insights from raw data.