- Python 3.7 installed
- PySpark installed and correctly configured
- Knowledge of how to run Jupyter notebooks
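
The first two prerequisites can be checked from a Python prompt before opening the notebook. The snippet below is a minimal sketch, not part of the lab code: the function name `check_prerequisites` is hypothetical, and it assumes that any Python version from 3.7 upward is acceptable. It only confirms that the `pyspark` package can be found, not that Spark is correctly configured.

```python
import importlib.util
import sys


def check_prerequisites():
    """Report whether the interpreter and PySpark look usable (hypothetical helper)."""
    return {
        # Assumption: any Python version from 3.7 upward is acceptable.
        "python": sys.version_info >= (3, 7),
        # find_spec returns None when the package cannot be located.
        "pyspark": importlib.util.find_spec("pyspark") is not None,
    }


for name, ok in check_prerequisites().items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```
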

The code for this exercise is provided as a Jupyter notebook. To run it, follow these steps:

- Download the dataset from http://stat-computing.org/dataexpo/2009/the-data.html or https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/HG7NV7
- Extract the files to a directory
- Rename the files so that each name begins with the year, e.g. `2009_some_name.csv`
- Open `lab4.ipynb` and modify the variable `fileLocation` to point to the directory where you extracted the files
- Run `lab4.ipynb` from Jupyter Notebook
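
Before running the notebook, it can help to confirm that the files in the data directory follow the year-prefix naming convention described in the steps above. The following is a minimal sketch using only the standard library. The helper names `files_by_year` and `misnamed_files` and the example path are hypothetical, and the sketch assumes the extracted files are plain `.csv` files. How `lab4.ipynb` itself reads the files is not shown here.

```python
import re
from pathlib import Path

# Hypothetical path; use the same directory you assign to fileLocation in lab4.ipynb.
fileLocation = "/path/to/extracted/data"

# A file name "begins with the year" if its first four characters are digits.
YEAR_PREFIX = re.compile(r"^(\d{4})")


def files_by_year(directory):
    """Group CSV file names by their leading four-digit year."""
    grouped = {}
    root = Path(directory)
    if not root.is_dir():
        return grouped
    for path in sorted(root.glob("*.csv")):
        match = YEAR_PREFIX.match(path.name)
        if match:
            grouped.setdefault(int(match.group(1)), []).append(path.name)
    return grouped


def misnamed_files(directory):
    """List CSV file names that do not start with a four-digit year."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return [p.name for p in sorted(root.glob("*.csv"))
            if not YEAR_PREFIX.match(p.name)]


print("Files by year:", files_by_year(fileLocation))
print("Files to rename:", misnamed_files(fileLocation))
```

If the second list is not empty, rename those files before running the notebook.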