You are a data engineer at an aeronautics consulting company. Your company prides itself on efficiently designing airfoils for use in planes and sports cars. The data scientists in your office need to work with different algorithms and with data in different formats. While they are good at machine learning, they count on you to handle ETL jobs and build ML pipelines. In this project, you will use a modified version of the NASA Airfoil Self Noise dataset. You will clean the dataset by dropping duplicate rows and removing rows with null values. You will then build an ML pipeline that trains a model to predict SoundLevel from all the other columns. Finally, you will evaluate the model and persist it.
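
The cleaning step described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the project's actual implementation: the project does not name a framework here, the `clean_rows` helper and the `Frequency` column are hypothetical, and the sample values are made up. Only `SoundLevel` comes from the project description.

```python
import csv
import io


def clean_rows(rows):
    """Drop rows with missing values and exact duplicate rows, keeping order."""
    seen = set()
    cleaned = []
    for row in rows:
        # Treat None or empty/whitespace-only strings as null values.
        if any(v is None or str(v).strip() == "" for v in row.values()):
            continue
        # Two rows are duplicates when every column value matches.
        key = tuple(str(v) for v in row.values())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Tiny in-memory stand-in for the CSV file (values are made up).
SAMPLE_CSV = """Frequency,SoundLevel
800,126.2
800,126.2
1000,
1250,125.9
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
cleaned = clean_rows(rows)
print(len(rows), "rows loaded,", len(cleaned), "rows kept")
```

In a real run the rows would come from the dataset file rather than a string, and a dataframe library would typically provide both operations directly.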
- Part 1: Perform ETL activity
  - Load a CSV dataset
  - Remove duplicates, if any
  - Drop rows with null values, if any
  - Make transformations
  - Store the cleaned data in Parquet format
- Part 2: Create a Machine Learning Pipeline
  - Create a machine learning pipeline for prediction
- Part 3: Evaluate the Model
  - Evaluate the model using relevant metrics
- Part 4: Persist the Model
  - Save the model for future production use
  - Load and verify the stored model
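
As a rough illustration of Parts 3 and 4, the sketch below computes common regression metrics (RMSE, MAE, R²) and then saves, reloads, and verifies a model. Everything in it is an assumption made for illustration: the "model" is a hypothetical linear model stored as a plain dictionary, the data points are made up, and JSON is used as the storage format only to keep the example dependency-free. The project's real model and persistence mechanism may differ.

```python
import json
import math
import os
import tempfile


def predict(model, features):
    """Linear prediction: intercept + sum(weight * feature)."""
    return model["intercept"] + sum(
        w * x for w, x in zip(model["weights"], features)
    )


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R^2. Assumes y_true values are not all equal."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }


# Hypothetical model and made-up evaluation data.
model = {"weights": [2.0], "intercept": 1.0}
X = [[1.0], [2.0], [3.0]]
y = [3.0, 5.0, 8.0]

preds = [predict(model, x) for x in X]
metrics = regression_metrics(y, preds)
print(metrics)

# Persist the model, load it back, and verify the predictions still match.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.json")
    with open(path, "w") as f:
        json.dump(model, f)
    with open(path) as f:
        loaded = json.load(f)

reloaded_preds = [predict(loaded, x) for x in X]
print("stored model verified:", reloaded_preds == preds)
```

Comparing predictions before and after the round trip is a simple way to confirm that the stored model is the one that was evaluated.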
Contributions are welcome! Please open an issue or pull request for any changes or improvements.