Project DATA Engineering is provided by INSA Lyon.
Students: LARIOZA Andrea, SIA Zhong Qing
This repo contains a jupyter notebook, where you will find the code for the population of relations from csv into postgresql, and the outputs for the sql code used to answer the natural language queries below. Most pre-processing steps were completed using a csv file manager, and can be found in the commit history. They include replacement of '\N' values, improper date-time formats, and special characters in names and circuits. It was simply easier to clean them using excel rather than python.
We are using the data on Formula 1 Races from 1950 to 2024 and can be obtained on Kaggle. It consists of several separate csv files; circuits, results, races, constructors, drivers and pit stops.
- List circuits in the USA
- List race count of the top 10 drivers in order.
- Which driver has the most wins?
- Which drivers have the same nationality as the constructor?
- Show across time who drove the fastest in the Singapore Grand Prix?
- Which constructors have not gotten first place before?
- Which races have the tightest outcome?
- Who are the constructors within the top 10 fastest pitstops of all time?
- List all drivers who have driven for Mclaren
- Which constructor and driver-pairs have the most podium finishes?
All you will have to do is open the .ipynb file and click Run-All. As some files contain over 25,000 records and there are 7 files, please expect a waiting time of around 2 minutes.