Job Failure Characterization and Prediction in HPC Datacenter Using Deep Learning

Overview

This project analyzes and predicts machine learning (ML) job failures in the SURFLisa HPC cluster. The analysis covers failure statistics, arrival patterns, CPU usage, time correlations of failures, and peak analysis. Building on these findings, the project implements a hybrid LSTM-TCN model that predicts the occurrence of failed ML jobs at varying time granularities.
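To make "varying time granularities" concrete, here is a minimal sketch of how failure timestamps could be bucketed into fixed-width windows and turned into (history, label) pairs for a sequence model. It is not taken from the repository: the function names `failure_counts` and `make_samples`, the timestamps, and the window sizes are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def failure_counts(failure_times, start, end, granularity):
    """Count failed jobs per fixed-width window in [start, end)."""
    # Ceiling division on timedeltas gives the number of windows.
    n_windows = -(-(end - start) // granularity)
    counts = [0] * n_windows
    for t in failure_times:
        if start <= t < end:
            counts[(t - start) // granularity] += 1
    return counts


def make_samples(counts, history):
    """Pair each run of `history` counts with a 0/1 label for the next window."""
    return [
        (counts[i:i + history], int(counts[i + history] > 0))
        for i in range(len(counts) - history)
    ]


# Hypothetical failure timestamps, not real SURFLisa data.
start = datetime(2024, 1, 1, 0, 0)
end = datetime(2024, 1, 1, 4, 0)
failures = [
    datetime(2024, 1, 1, 0, 5),
    datetime(2024, 1, 1, 0, 40),
    datetime(2024, 1, 1, 1, 10),
    datetime(2024, 1, 1, 3, 59),
]

hourly = failure_counts(failures, start, end, timedelta(hours=1))
half_hourly = failure_counts(failures, start, end, timedelta(minutes=30))
samples = make_samples(hourly, history=2)

print(hourly)       # [2, 1, 0, 1]
print(half_hourly)  # [1, 1, 1, 0, 0, 0, 0, 1]
print(samples)      # [([2, 1], 0), ([1, 0], 1)]
```

The same events look different at each granularity: coarser windows give denser counts, while finer windows give sparser, more imbalanced labels. That is one reason a predictor may behave differently across granularities.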

Repository Structure

job_analysis/: Contains scripts for analyzing the data and generating plots.
job_prediction/: Includes code for predicting failure events based on historical data.
job_datasets/: Stores the datasets used for analysis and prediction.

2024_HP_Project_Report.pdf: Project report (PDF).

cactusutcac/honours-programme-project
