Project 1 of the Machine Learning course given at EPFL, Fall 2021.
- Quentin Deschamps
- Emilien Seiler
- Louis Le Guillouzic
To reproduce our submission on AIcrowd, move into the `scripts` folder and run:

python3 run.py

The CSV file produced will be `out/predictions.csv`.
To compute the accuracy scores obtained for each model, use the `run_accuracy.py` script. It loads the parameters of the optimization algorithms from the `parameters.json` file. The figures are saved in the `figs` directory.
The program shows:
- The global accuracy score and the one for each subset.
- The global confusion matrix and the one for each subset.
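For reference, here is a minimal NumPy sketch of how an accuracy score and a 2x2 confusion matrix can be computed for labels in {-1, 1}. It is only an illustration, not necessarily the implementation used in `metrics.py`:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of correctly predicted labels."""
    return np.mean(y_true == y_pred)

def confusion_matrix(y_true, y_pred, labels=(-1, 1)):
    """2x2 matrix whose rows are true labels and columns are predicted labels."""
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for i, true_label in enumerate(labels):
        for j, pred_label in enumerate(labels):
            cm[i, j] = np.sum((y_true == true_label) & (y_pred == pred_label))
    return cm
```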
Usage:

python3 run_accuracy.py --clf [CLASSIFIER]

Where `CLASSIFIER` can be:
- `gradient_descent`
- `stochastic_gradient_descent`
- `least_squares`
- `ridge_regression` (default)
- `logistic_regression`
- `regularized_logistic_regression`
Options:
- `--save`: save the figures in the `figs` folder.
- `--hide`: hide the figures.
- `-h, --help`: show help.
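For example, to compute the accuracy scores for ridge regression and save the figures:

python3 run_accuracy.py --clf ridge_regression --save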
The main strategy is the following:
- Split the train and the test sets into 3 subsets, according to the jet number feature:
  - JET = 0
  - JET = 1
  - JET >= 2
- Clean each subset individually:
  - Remove the columns that have the same value in every row.
  - Replace -999 values with the median of the column.
  - Apply a log transformation to the data.
  - Standardize the columns using the mean and the standard deviation of the train dataset.
- Expand the features using polynomial expansion. The degree is determined using cross-validation.
- Perform ridge regression on each subset.
- Predict the labels of each test subset using the model fitted on the corresponding train subset.
- Merge the predictions of the three subsets (a sketch of the full pipeline is given below).
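Below is a minimal, self-contained NumPy sketch of this pipeline. It is an illustration rather than the actual code in `src`: the jet-number column index (22), the exact log transform, and the `degree`/`lambda_` values are assumptions, and the real hyperparameters are selected by cross-validation (see `main_cross_validation.ipynb`).

```python
import numpy as np

JET_COL = 22  # assumed index of the jet-number feature in the raw data

def clean(x_train, x_test):
    """Drop constant columns, replace -999 by the train median, apply a log
    transform and standardize with the train mean/std (illustrative choices)."""
    keep = x_train.std(axis=0) > 0
    x_train, x_test = x_train[:, keep], x_test[:, keep]
    for j in range(x_train.shape[1]):
        valid = x_train[:, j] != -999
        median = np.median(x_train[valid, j]) if valid.any() else 0.0
        x_train[x_train[:, j] == -999, j] = median
        x_test[x_test[:, j] == -999, j] = median
    x_train, x_test = np.log1p(np.abs(x_train)), np.log1p(np.abs(x_test))
    mean, std = x_train.mean(axis=0), x_train.std(axis=0)
    std[std == 0] = 1.0
    return (x_train - mean) / std, (x_test - mean) / std

def poly_expand(x, degree):
    """Polynomial feature expansion with a bias column."""
    return np.hstack([np.ones((x.shape[0], 1))] + [x ** d for d in range(1, degree + 1)])

def ridge_regression(y, tx, lambda_):
    """Closed-form ridge regression: w = (X^T X + 2 N lambda I)^(-1) X^T y."""
    n, d = tx.shape
    a = tx.T @ tx + 2 * n * lambda_ * np.eye(d)
    return np.linalg.solve(a, tx.T @ y)

def predict_labels(y_train, x_train, x_test, degree=7, lambda_=1e-5):
    """Split by jet number, fit one ridge model per subset, merge the predictions.
    degree and lambda_ are placeholders; the tuned values come from cross-validation."""
    y_pred = np.zeros(x_test.shape[0])
    groups = [lambda c: c == 0, lambda c: c == 1, lambda c: c >= 2]
    for group in groups:
        mask_tr, mask_te = group(x_train[:, JET_COL]), group(x_test[:, JET_COL])
        xtr, xte = clean(x_train[mask_tr], x_test[mask_te])
        xtr, xte = poly_expand(xtr, degree), poly_expand(xte, degree)
        w = ridge_regression(y_train[mask_tr], xtr, lambda_)
        y_pred[mask_te] = np.where(xte @ w >= 0, 1, -1)  # labels in {-1, 1}
    return y_pred
```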
This is the structure of the repository:
- `data`: contains the datasets
- `docs`: contains the documentation
- `figs`: contains the figures (accuracies, confusion matrices, results of cross-validation)
- `scripts`: contains the main scripts and the notebooks
  - `csv_utils.py`: functions to load data and create submissions
  - `main_cross_validation.ipynb`: performs cross-validation
  - `main_ridge_regression.ipynb`: explores the training dataset and computes the accuracy score with ridge regression
  - `parameters.json`: parameters for the optimization algorithms
  - `path.py`: paths and procedures to manage archives and directories
  - `run_accuracy.py`: computes the accuracy with a given classifier
  - `run.py`: makes predictions for AIcrowd using ridge regression
- `src`: source code
  - `clean_data.py`: functions to clean data
  - `cross_validation.py`: functions to perform cross-validation
  - `gradient.py`: gradient functions
  - `helpers.py`: utility functions
  - `implementations.py`: implementations of the Machine Learning algorithms
  - `loss.py`: loss functions
  - `metrics.py`: score and performance functions
  - `plot_utils.py`: plot utilities using matplotlib
  - `print_utils.py`: print utilities
  - `split_data.py`: split functions to handle data
  - `stats_tests.py`: statistical tests
Best accuracy score on AIcrowd: 0.831 (link)
Results of all models with the `run_accuracy.py` script:
| Model | Accuracy |
|---|---|
| Gradient descent | 0.715 |
| Stochastic gradient descent | 0.709 |
| Least squares | 0.827 |
| Ridge regression | 0.828 |
| Logistic regression | 0.760 |
| Regularized logistic regression | 0.760 |