Skip to content

Latest commit

 

History

History
425 lines (325 loc) · 22.4 KB

README.md

File metadata and controls

425 lines (325 loc) · 22.4 KB

MY DATA CLEANING PROJECT WITH SQL

Data Wrangling - also referred to as data cleaning, data remediation, or data munging - encompasses a range of processes aimed at converting raw data into more easily usable formats. This step is crucial for successful data analysis as it enables proper and efficient data examination, leading to informed business decisions.

PROJECT OVERVIEW

In this project, I will meticulously detail every step I undertook to remediate a substantial FIFA 2021 dataset. This dataset was sourced from Kaggle and contains comprehensive records of over 18,000 players, featuring information such as ID, Name, Age, Country, Club, and numerous other attributes across 77 columns. My motivation for selecting this dataset is to demonstrate my proficiency in working with poorly formatted variables, rectifying data inconsistencies to establish a well-structured dataset devoid of errors during subsequent analysis. My aim is to showcase my ability to transform exceedingly messy data into a clean and usable format.

DATA PREPARATION

Prior to importing the dataset into Microsoft SQL Server Management Studio (MSSMS), I initially opened the dataset in Excel and saved it as an xlsx file. This saved file was then imported into MSSMS using the MSSMS Import and Export wizard. DATA CLEANING PROCESS The steps I executed during the data cleaning project within MSSMS encompassed the following phases:

  • Data Discovery: Examining the dataset to gain insights into its structure, dimensions, and initial quality.
  • Data Structuring: Organizing and structuring the data for analysis, including reformatting incorrectly formatted data types.
  • Cleaning: Addressing data entry inconsistencies and irregularities, handling missing or blank values, identifying, and resolving duplicates, and other necessary data quality enhancements.

By meticulously following these steps, I aimed to transform the FIFA 2021 dataset into a clean, well-organized, and error-free resource, ready for in-depth analysis and valuable insights. This project serves as a testament to my data wrangling skills and my ability to turn challenging data into an asset for decision-making.

LET’S GET STARTED!!!

You can take a look at the entire SQL code Here on my GitHub profile. I used Microsoft SQL Server Management Studio to develop this project.

DATA DISCOVERY

select * from FifaData.dbo.FIFA2021DATA

image

COLUMN STANDARDIZATION

When I initially examined my dataset, I observed that both the 'NAME' and 'LONGNAME' columns contained inconsistent data. However, the 'PLAYERURL' column consistently held the full names of the players. I decided to extract this information from the 'PLAYERURL' column and create a new column named 'FULLNAME.' Consequently, I eliminated the previous two inconsistent columns.

Procedure: Step 1: Adding a Temporary Column Added a new column named "tempcol" to the existing table.

-- Create a temporary column that stores the first substring from the playerUrl
alter table FifaData.dbo.FIFA2021DATA
add TempCol Nvarchar(255);

Wrote SQL queries to manipulate the data within "playerUrl" to extract the player names.

-- Reversing the string and counting manually counting from the number of strings before the name 
-- Which is where the substring will start extracting from
-- Updating the created column with the substring data
update FifaData.dbo.FIFA2021DATA
set TempCol = substring(reverse(playerUrl), 9, LEN(playerUrl))
from FifaData.dbo.FIFA2021DATA

Step 2: Adding the Fullname Column Added a new column named "Fullname" to the existing table.

-- Create the fullname column that stores the second substring from the TempCol 
-- Which is the names of the players
alter table FifaData.dbo.FIFA2021DATA
add Fullname Nvarchar(255);

Wrote SQL queries to manipulate the data within "tempcol" to extract the player names.

-- Update the fullname column with string extracted from the TempCol
update FifaData.dbo.FIFA2021DATA
set Fullname = reverse(substring((TempCol), 1, CHARINDEX('/', TempCol)-1)) 
from FifaData.dbo.FIFA2021DATA

Step 3: Dropping columns that are no more needed

-- Droppping uneccessary columns
alter table FifaData.dbo.FIFA2021DATA
drop column TempCol, TempCol2, Name, LongName, photoUrl, playerUrl

Step 4: Replace the dashes in the name and capitalize them.

-- Replace the '-' with space and capitalize the names
update FifaData.dbo.FIFA2021DATA
set Fullname = upper(replace(Fullname, '-', ' '))
from FifaData.dbo.FIFA2021DATA

Before:

image

After:

image

The dataset now has a standardized "fullname" column, and redundant columns "name" and "longname" have been removed.

COLUMN NORMALIZATION

The primary goal was to scale various columns within the dataset to the percentage format, allowing for uniform representation of percentage-based data points. For each relevant column, integers were converted into decimals by dividing the integer values by 100. This transformation ensured that the data, originally presented as whole numbers, were accurately scaled to fit within the percentage range of 0 to 100. The SQL statement

set BOV_ = BOV/100.0

was utilized to convert integer values in the column BOV and other relevant columns into decimal format, representing percentages. The initial columns were dropped after and the new created columns replaced them.

Before:

image

After:

image

COLUMN UNIT STANDARDIZATION

The dataset contained heights recorded in both centimeter (cm) and in feet and inches (e.g., 6'7"), weights recorder in kilogram (kg) and pounds (lbs), leading to inconsistencies and potential misinterpretations. To ensure accuracy and uniformity, a crucial step was taken to convert all weight values to pounds, and all height values to inches.

Height Conversion

Heights, presented in both centimeters (cm) and the feet and inches format (e.g., 6'7"), were standardized into inches (inch). For centimeters, the conversion factor of 1 cm equals approximately 0.393701 inches was applied. For the feet and inches format, the feet value was multiplied by 12 to convert it into inches, which was then added to the remaining inches. The query:

-- Convert Height to inches and update the table
update FifaData.dbo.FIFA2021DATA
set Height_In_Inch =
CASE 
    -- If the value contains 'cm', convert cm to inches (1 cm = 0.393701 inches)
    WHEN Height LIKE '%cm' THEN CAST(REPLACE(Height, 'cm', '') AS FLOAT) * 0.393701
    -- If the value contains feet and inches (e.g., 6'7"), convert to inches (1 foot = 12 inches)
    WHEN Height LIKE '%"%' THEN CAST(LEFT(Height, 1) AS FLOAT) * 12 + CAST(SUBSTRING(Height, len(Height) - 1, 1) AS FLOAT)
    -- For other cases, assume the value is already in inches
    ELSE CAST(Height AS FLOAT)
END
from FifaData.dbo.FIFA2021DATA

Before: This shows that there are having inconsistencies in the Height column

image

After: This inconsistencies as now been fixed and the column has been standardized

image

Weight Conversion

The weight data in the dataset shows a variety of formats, including 'kg' and 'lbs' appended to the numerical values (e.g., '23kg', '44lbs'). To ensure uniformity, a two-step approach was adopted.

Handling 'kg' Values:

  • Values with 'kg' units were first parsed to remove the 'kg' suffix.
  • The parsed values were then converted to floating-point numbers to ensure numerical accuracy.
  • These numerical values were subsequently converted into pounds using the conversion factor of 1 kilogram equals approximately 2.20462 pounds.

Handling 'lbs' Values:

  • Values with 'lbs' units were parsed to remove the 'lbs' suffix.
  • The parsed values were converted to floating-point numbers directly since the goal was to represent all weights in pounds uniformly.

The Query:

-- Convert Weight to Lbs and update the table
update FifaData.dbo.FIFA2021DATA
set Weight_In_Lbs =
CASE 
    -- If the value contains 'kg', convert kg to lbs (1 kg = 2.20462 pounds)
    WHEN Weight LIKE '%kg' THEN CAST(REPLACE(Weight, 'kg', '') AS FLOAT) * 2.20462
    -- If the value contains lbs, remove the appended 'lbs' and convert to float
    WHEN Weight LIKE '%lbs' THEN CAST(REPLACE(Weight, 'lbs', '') AS FLOAT)
    -- For other cases, assume the value is already in lbs
    ELSE CAST(Weight AS FLOAT)
END
from FifaData.dbo.FIFA2021DATA

Result:

image

Currency and Unit Normalization

Within the dataset, the columns Value, Wage, and Release Clause had some special characters that needed fixing. These columns featured the Euro sign (€) before the numerical values, indicating amounts in euros. These euro values were then converted to dollars using the average 2021 exchange rate. Additionally, the values were tagged with 'K' for thousands and 'M' for millions. To make sure the data was consistent for accurate analysis, i performed the following steps:

Approach

  1. Identify Euro Values:

    • I locate rows containing the euro symbol (€) in the respective columns: Value, Wage, and Release Clause.
    • I identify values denoted in millions (M) or thousands (K).
  2. Conversion Process:

    • Millions Conversion:

      • Eliminate the euro symbol (€) and the 'M' suffix.
      • Convert the numeric part to float.
      • Multiply by 1,000,000 to get the equivalent value in euros.
      • Multiply by the exchange rate of 1.183 to convert to US dollars.
    • Thousands Conversion:

      • Eliminate the euro symbol (€) and the 'K' suffix.
      • Convert the numeric part to float.
      • Multiply by 1,000 to get the equivalent value in euros.
      • Multiply by the exchange rate of 1.183 to convert to US dollars.
    • Other Values:

      • Eliminate the euro symbol (€).
      • Convert the numeric part to float.
      • Multiply by the exchange rate of 1.183 to convert to US dollars.
  3. SQL Implementation:

    • Utilized SQL CASE statements to handle different scenarios for millions, thousands, and other values.
    • Replaced euro symbols and suffixes ('M', 'K') with empty strings.
    • Converted numeric parts to floats and applied necessary multiplications for conversion.
    • Applied the exchange rate of 1.183 to get the final values in US dollars.

SQL QUERY

update FifaData.dbo.FIFA2021DATA
set ValueIN$ = 
			case
				-- Eliminate "€", "M", convert to million then convert to $ using $1.183 per euro xchnage rate
				when Value like '%€%' and Value like '%M%' then (cast(replace(replace(Value, '€', ''), 'M', '') as float) * 1000000) * 1.183
				-- Eliminate "€", "K", convert to thousand then convert to $ using $1.183 per euro xchnage rate
				when Value like '%€%' and Value like '%K%' then (cast(replace(replace(Value, '€', ''), 'K', '') as float) * 1000) * 1.183
				-- Eliminate "€", convert to float then convert to $ using $1.183 per euro xchnage rate
				else cast(replace(Value, '€', '') as float) * 1.183
			end,
	WageIN$ = 
			case
				-- Eliminate "€", "M", convert to million then convert to $ using $1.183 per euro xchnage rate
				when Wage like '%€%' and Wage like '%M%' then (cast(replace(replace(Wage, '€', ''), 'M', '') as float) * 1000000) * 1.183
				-- Eliminate "€", "K", convert to thousand then convert to $ using $1.183 per euro xchnage rate
				when Wage like '%€%' and Wage like '%K%' then (cast(replace(replace(Wage, '€', ''), 'K', '') as float) * 1000) * 1.183
				-- Eliminate "€", convert to float then convert to $ using $1.183 per euro xchnage rate
				else cast(replace(Wage, '€', '') as float) * 1.183
			end,
	Release_ClauseIN$ = 
			case
				-- Eliminate "€", "M", convert to million then convert to $ using $1.183 per euro xchnage rate
				when [Release Clause] like '%€%' and [Release Clause] like '%M%' then (cast(replace(replace([Release Clause], '€', ''), 'M', '') as float) * 1000000) * 1.183
				-- Eliminate "€", "K", convert to thousand then convert to $ using $1.183 per euro xchnage rate
				when [Release Clause] like '%€%' and [Release Clause] like '%K%' then (cast(replace(replace([Release Clause], '€', ''), 'K', '') as float) * 1000) * 1.183
				-- Eliminate "€", convert to float then convert to $ using $1.183 per euro xchnage rate
				else cast(replace([Release Clause], '€', '') as float) * 1.183
			end
from FifaData.dbo.FIFA2021DATA

Before:

image

After:

image

The columns ValueIN$, WageIN$, and Release_ClauseIN$ now contain the corresponding values in US dollars after the conversion process.

Standardizing Player Contract Dates

Upon thorough examination of the dataset, it became evident that the columns Contract, Joined, and Loan End Date held crucial information regarding players' contract start and end dates, loan start and end dates, as well as instances of players marked as "Free" (indicating no contractual obligation). However, these columns exhibited inconsistencies and redundancies.

The Contract column housed data related to the player's contract start year, contract end year, loan end date, and instances marked as "Free." Meanwhile, the Joined column contained both contract start dates and loan start dates. Lastly, the Loan End Date column denoted the end date of player loans.

See the screenshot below

image

To address these inconsistencies, i harmonized the data and eliminating redundant entries to ensures a clean and properly formatted columns.

Approach:

  1. Created a New Column for Contract Status: I established a fresh column named "ContractStatus" to house information about free players, as well as players on loan or under contract. For free players, the data was extracted from the existing "Contract" column. To ensure completeness, I filled the null values by verifying the player's status. If they were on contract, the field was populated with "On Contract," and for players on loan, it was marked as "On Loan."

SQL Query

-- Creating the Contract status column
alter table FifaData.dbo.FIFA2021DATA
add ContractStatus nvarchar(255)

-- Updating the contract status column with the Contract column
update FifaData.dbo.FIFA2021DATA
set ContractStatus = 
		case
			when Contract like '%Free%' then Contract
			when Contract like '%Loan%' then 'On Loan'
			else 'On Contract'
		end
from FifaData.dbo.FIFA2021DATA

Result:

image

  1. Created two new columns for Contract Start Date and Contract End Date:

Contract End Date Extraction: The contract end dates were sourced from the existing "Contract" column, which contained data in the format "2004 ~ 2023." The earlier year represented the contract start date, while the latter indicated the contract's end. Using the PARSENAME function, I precisely extracted the second year, denoting the contract's end date, and stored it in the Contract End Date column.

Contract Start Date Extraction: Simultaneously, the Contract Start Date was extracted from the "Joined" column, ensuring a comprehensive representation of player contract durations.

SQL Query

--Creating the ContractStartDate column and the ContractEndDate column
alter table FifaData.dbo.FIFA2021DATA
add ContractStartDate date, ContractEndDate nvarchar(255)


-- Updating the ContractStartDate column using the Joined column
update FifaData.dbo.FIFA2021DATA
set ContractStartDate = 
		case
			when Contract not like '%Loan%' and Contract not like '%Free%' and isdate(Joined) = 1 then Joined
			else null
		end
from FifaData.dbo.FIFA2021DATA

---- Updating the ContractEndDate column using the Joined column
update FifaData.dbo.FIFA2021DATA
set ContractEndDate = 
		case
			when Contract not like '%Loan%' and Contract not like '%Free%' 
			then parsename(replace(Contract, '~', '.'), 1)
			else null
		end
from FifaData.dbo.FIFA2021DATA

Result:

image

  1. Created two new columns for LoanStartDate, LoanEndDate:

Loan Start Date Extraction: The Loan Start Date was extracted from the "Joined" column, by checking if the innitial "Loan End Date" column is not empty and ensuring a comprehensive representation of player Loan durations.

Loan End Date Extraction: To capture the Loan End Dates, I implemented a straightforward approach. First, I checked the "Loan End Date" column for non-empty values. If a value existed, I transferred it to the new Loan End Date column. This strategic move was essential to rectify the data format. The original "Loan End Date" column, being a string, required a more precise data type for accurate storage. Therefore, the creation of a designated column with the appropriate data type ensured the seamless and accurate representation of loan end dates in the dataset.

SQL Query

--Creating the LoanStartDate column and the LoanEndDate column
alter table FifaData.dbo.FIFA2021DATA
add LoanStartDate date, LoanEndDate date

-- Updating the LoanStartDate
update FifaData.dbo.FIFA2021DATA
set LoanStartDate = 
	case
		when [Loan Date End] is not null and isdate(Joined) = 1 then Joined
		else null
	end
from FifaData.dbo.FIFA2021DATA

-- Updating the LoanEndDate
update FifaData.dbo.FIFA2021DATA
set LoanEndDate = 
	case
		when [Loan Date End] is not null and isdate([Loan Date End]) = 1 then [Loan Date End]
		else null
	end
from FifaData.dbo.FIFA2021DATA

Result:

image

  1. Dropping the column that are no more needed
--Dropping columns
alter table FifaData.dbo.FIFA2021DATA
drop Contract, Joined, [Loan Date End]

Before the four approach

image

After the four approach

image

Cleaning and Standardizing W/F, SM, IR Columns

This columns with ratings (1-5 stars) contain diacritics, introducing inconsistencies in the data. To maintain uniformity and ensure accurate analysis, the following approach has been implemented to remove these diacritics and standardize the rating format.

Approach:

  1. Create new columns to store the standardize data

SQL Query:

-- Creating new columns for W/F, SM, IR
ALTER TABLE FifaData.dbo.FIFA2021DATA
ADD [W/F_] int, SM_ int, IR_ int
  1. Populate the newly created columns with cleaned data, where diacritics have been removed and extraneous spaces trimmed, sourced from the original columns.

SQL Query:

-- Updating the newly created columns
update FifaData.dbo.FIFA2021DATA
set [W/F_] = rtrim(ltrim(replace([W/F], '★', ''))),
	SM_ = rtrim(ltrim(replace(SM, '★', ''))),
	IR_ = rtrim(ltrim(replace(IR, '★', '')))
from  FifaData.dbo.FIFA2021DATA
  1. Drop the intial columns SQL Query:
--Dropping the intial columns
alter table FifaData.dbo.FIFA2021DATA
drop column [W/F], SM, IR

Before:

image

After:

image

Conclusion

In the realm of data analytics, the significance of data quality cannot be overstated. A thorough approach to cleaning and organizing raw data is the bedrock upon which robust analyses and informed decision-making processes stand. Throughout this data cleaning project, a comprehensive strategy was devised to address a myriad of challenges present in the FIFA 2021 dataset sourced from Kaggle.

The process embarked upon a transformative journey, addressing issues like inconsistent units of measurement, special characters, and varied formats. By systematically converting weights to a unified kilogram format and heights to inches, the dataset achieved uniformity, enabling precise comparisons and analyses. Furthermore, handling of special characters and textual inconsistencies paved the way for a standardized dataset.

Striking a balance between accuracy and efficiency, the project applied various SQL functions and techniques. String manipulations, datatype conversions, and collation adjustments were delicately executed, ensuring that the resulting dataset met the highest standards of quality. The creation of new columns, careful data copying, and intelligent collation choices added layers of depth to the cleaning process.

The journey was not without its challenges. Diacritics, special characters, and inconsistent formats tested the limits of my data wrangling expertise. However, each hurdle was an opportunity for growth, and the solutions devised stand testament to the power of systematic problem-solving. Not only did this project serve the purpose of preparing the dataset for analysis, but it also aimed at upskilling myself and preparing for greater challenges in the realm of data analysis. Through this endeavour, I have honed my skills, ensuring the dataset is not only cleaned but also primed for more complex analyses. As I move forward, I am now well-equipped to tackle more significant data challenges, armed with a refined skill set and a deeper understanding of data cleaning methodologies.