This project analyzes and predicts machine learning (ML) job failures on the SURFLisa HPC cluster. The analysis covers failure statistics, job arrival patterns, CPU usage, temporal correlation of failures, and peak analysis. Building on these findings, the project implements a hybrid LSTM-TCN model that predicts the occurrence of failed ML jobs at several time granularities.

The repository is organized into three directories:

- job_analysis/: Contains scripts for analyzing the data and generating plots.
- job_prediction/: Includes code for predicting failure events based on historical data.
- job_datasets/: Stores the datasets used for analysis and prediction.
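
Predicting failures "at several time granularities" implies turning individual job records into a failure-count series per time bin. The sketch below illustrates one way this could be done with the standard library only. The record layout `(end_time, state)`, the `"FAILED"` label, and the function name `failure_counts` are assumptions made for illustration; they are not taken from the project's code or datasets.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical job records: (end_time, state). The field layout and state
# labels are assumptions, not the schema of the files in job_datasets/.
JOBS = [
    (datetime(2022, 1, 1, 0, 5), "FAILED"),
    (datetime(2022, 1, 1, 0, 40), "COMPLETED"),
    (datetime(2022, 1, 1, 0, 55), "FAILED"),
    (datetime(2022, 1, 1, 2, 10), "FAILED"),
]

EPOCH = datetime(1970, 1, 1)


def failure_counts(jobs, granularity):
    """Count failed jobs per time bin of width `granularity`.

    Bins are aligned to the Unix epoch. The result spans the first to the
    last bin that contains a failure, with empty bins reported as 0.
    """
    failed = [t for t, state in jobs if state == "FAILED"]
    if not failed:
        return []
    # timedelta // timedelta yields an integer bin index.
    counts = Counter((t - EPOCH) // granularity for t in failed)
    first, last = min(counts), max(counts)
    return [counts.get(i, 0) for i in range(first, last + 1)]


hourly = failure_counts(JOBS, timedelta(hours=1))
half_hourly = failure_counts(JOBS, timedelta(minutes=30))
print(hourly)       # [2, 0, 1]
print(half_hourly)  # [1, 1, 0, 0, 1]
```

The same records yield series of different length and sparsity depending on the bin width, which is why a prediction model may behave differently at each granularity.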