cactusutcac/honours-programme-project

Job Failure Characterization and Prediction in HPC Datacenter Using Deep Learning

Overview

This project analyzes and predicts machine learning (ML) job failures in the SURFLisa HPC cluster. The analysis covers failure statistics, arrival patterns, CPU usage, time correlations between failures, and peak analysis. Building on these findings, the project implements a hybrid LSTM-TCN model that predicts the occurrence of failed ML jobs at varying time granularities.
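To make "varying time granularities" concrete, the sketch below shows one way a job log could be turned into per-bucket failure counts and then into (history, next value) pairs suitable for a sequence model. It is an illustration only, not the code in this repository: the function names `failure_counts` and `sliding_windows`, the `"FAILED"` state label, and the sample timestamps are all assumptions made for the example.

```python
import math
from datetime import datetime, timedelta


def failure_counts(end_times, states, start, end, bucket):
    """Count failed jobs per fixed-width time bucket in [start, end).

    The "FAILED" label is an assumed job state, not taken from the dataset.
    """
    n_buckets = math.ceil((end - start) / bucket)
    counts = [0] * n_buckets
    for t, state in zip(end_times, states):
        if state != "FAILED" or not (start <= t < end):
            continue
        # timedelta // timedelta yields the integer bucket index
        counts[(t - start) // bucket] += 1
    return counts


def sliding_windows(series, lookback):
    """Pair each run of `lookback` counts with the count that follows it."""
    return [
        (series[i:i + lookback], series[i + lookback])
        for i in range(len(series) - lookback)
    ]


# Hypothetical sample log: five jobs finishing within a four-hour span.
start = datetime(2022, 1, 1, 0, 0)
end = start + timedelta(hours=4)
end_times = [
    start + timedelta(minutes=10),
    start + timedelta(minutes=50),
    start + timedelta(minutes=90),
    start + timedelta(minutes=125),
    start + timedelta(minutes=239),
]
states = ["FAILED", "FAILED", "COMPLETED", "FAILED", "FAILED"]

# The same log viewed at two granularities.
hourly = failure_counts(end_times, states, start, end, timedelta(hours=1))
two_hourly = failure_counts(end_times, states, start, end, timedelta(hours=2))

# Supervised pairs: two hours of history predict the next hour's count.
pairs = sliding_windows(hourly, lookback=2)
```

Changing only the `bucket` argument re-bins the same log at a coarser or finer resolution, so a single preprocessing path could feed a model at each granularity.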

Repository Structure

  • job_analysis/: Contains scripts for analyzing the data and generating plots.
  • job_prediction/: Includes code for predicting failure events based on historical data.
  • job_datasets/: Stores the datasets used for analysis and prediction.
