Project 1 of the Machine Learning course given at EPFL, Fall 2021.
- Quentin Deschamps
- Emilien Seiler
- Louis Le Guillouzic
To reproduce our submission on AIcrowd, move into the `scripts` folder and run:

python3 run.py

The CSV file produced will be `out/predictions.csv`.
To compute the accuracy scores obtained for each model, use the `run_accuracy.py` script. It loads the parameters of the optimization algorithms from the `parameters.json` file. The figures are saved in the `figs` directory.
The program shows:
- The global accuracy score and the one for each subset.
- The global confusion matrix and the one for each subset.
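For reference, here is a minimal NumPy sketch of how an accuracy score and a 2x2 confusion matrix can be computed for labels in {-1, 1}. It is only an illustration, not necessarily the implementation used in `metrics.py`:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of correctly predicted labels."""
    return np.mean(y_true == y_pred)

def confusion_matrix(y_true, y_pred, labels=(-1, 1)):
    """2x2 matrix whose rows are true labels and columns are predicted labels."""
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for i, true_label in enumerate(labels):
        for j, pred_label in enumerate(labels):
            cm[i, j] = np.sum((y_true == true_label) & (y_pred == pred_label))
    return cm
```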
Usage:

python3 run_accuracy.py --clf [CLASSIFIER]

Where `CLASSIFIER` can be:
- `gradient_descent`
- `stochastic_gradient_descent`
- `least_squares`
- `ridge_regression` (default)
- `logistic_regression`
- `regularized_logistic_regression`
Options:
- `--save`: save the figures in the `figs` folder.
- `--hide`: hide the figures.
- `-h, --help`: show help.
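For example, to compute the accuracy scores for ridge regression and save the figures:

python3 run_accuracy.py --clf ridge_regression --save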
The main strategy is the following:
- Split the train and the test sets into 3 subsets, according to the jet number feature:
  - JET = 0
  - JET = 1
  - JET >= 2
- Clean each subset individually:
  - Remove the columns that have the same value in every row.
  - Replace -999 values with the median of the column.
  - Apply a log transformation to the data.
  - Standardize the columns using the mean and the standard deviation of the train dataset.
- Expand the features using polynomial expansion. The degree is determined using cross-validation.
- Perform ridge regression on each subset.
- Predict the labels of each test subset using the model fitted on the corresponding train subset.
- Merge the predictions of the three subsets (a sketch of the full pipeline is given below).
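Below is a minimal, self-contained NumPy sketch of this pipeline. It is an illustration rather than the actual code in `src`: the jet-number column index (22), the exact log transform, and the `degree`/`lambda_` values are assumptions, and the real hyperparameters are selected by cross-validation (see `main_cross_validation.ipynb`).

```python
import numpy as np

JET_COL = 22  # assumed index of the jet-number feature in the raw data

def clean(x_train, x_test):
    """Drop constant columns, replace -999 by the train median, apply a log
    transform and standardize with the train mean/std (illustrative choices)."""
    keep = x_train.std(axis=0) > 0
    x_train, x_test = x_train[:, keep], x_test[:, keep]
    for j in range(x_train.shape[1]):
        valid = x_train[:, j] != -999
        median = np.median(x_train[valid, j]) if valid.any() else 0.0
        x_train[x_train[:, j] == -999, j] = median
        x_test[x_test[:, j] == -999, j] = median
    x_train, x_test = np.log1p(np.abs(x_train)), np.log1p(np.abs(x_test))
    mean, std = x_train.mean(axis=0), x_train.std(axis=0)
    std[std == 0] = 1.0
    return (x_train - mean) / std, (x_test - mean) / std

def poly_expand(x, degree):
    """Polynomial feature expansion with a bias column."""
    return np.hstack([np.ones((x.shape[0], 1))] + [x ** d for d in range(1, degree + 1)])

def ridge_regression(y, tx, lambda_):
    """Closed-form ridge regression: w = (X^T X + 2 N lambda I)^(-1) X^T y."""
    n, d = tx.shape
    a = tx.T @ tx + 2 * n * lambda_ * np.eye(d)
    return np.linalg.solve(a, tx.T @ y)

def predict_labels(y_train, x_train, x_test, degree=7, lambda_=1e-5):
    """Split by jet number, fit one ridge model per subset, merge the predictions.
    degree and lambda_ are placeholders; the tuned values come from cross-validation."""
    y_pred = np.zeros(x_test.shape[0])
    groups = [lambda c: c == 0, lambda c: c == 1, lambda c: c >= 2]
    for group in groups:
        mask_tr, mask_te = group(x_train[:, JET_COL]), group(x_test[:, JET_COL])
        xtr, xte = clean(x_train[mask_tr], x_test[mask_te])
        xtr, xte = poly_expand(xtr, degree), poly_expand(xte, degree)
        w = ridge_regression(y_train[mask_tr], xtr, lambda_)
        y_pred[mask_te] = np.where(xte @ w >= 0, 1, -1)  # labels in {-1, 1}
    return y_pred
```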
This is the structure of the repository:
- `data`: contains the datasets
- `docs`: contains the documentation
- `figs`: contains the figures (accuracies, confusion matrices, results of cross-validation)
- `scripts`: contains the main scripts and the notebooks
  - `csv_utils.py`: functions to load data and create submissions
  - `main_cross_validation.ipynb`: performs cross-validation
  - `main_ridge_regression.ipynb`: explores the training dataset and computes the accuracy score with ridge regression
  - `parameters.json`: parameters for the optimization algorithms
  - `path.py`: paths and procedures to manage archives and directories
  - `run_accuracy.py`: computes the accuracy with a given classifier
  - `run.py`: makes predictions for AIcrowd using ridge regression
- `src`: source code
  - `clean_data.py`: functions to clean data
  - `cross_validation.py`: functions to perform cross-validation
  - `gradient.py`: gradient functions
  - `helpers.py`: utility functions
  - `implementations.py`: implementations of the Machine Learning algorithms
  - `loss.py`: loss functions
  - `metrics.py`: score and performance functions
  - `plot_utils.py`: plot utilities using matplotlib
  - `print_utils.py`: print utilities
  - `split_data.py`: split functions to handle data
  - `stats_tests.py`: statistical tests
Best accuracy score on AIcrowd: 0.831 (link)
Results of all models with the `run_accuracy.py` script:
| Model | Accuracy |
|---|---|
| Gradient descent | 0.715 |
| Stochastic gradient descent | 0.709 |
| Least squares | 0.827 |
| Ridge regression | 0.828 |
| Logistic regression | 0.760 |
| Regularized logistic regression | 0.760 |