Methodology

Title

Abstract

Introduction

Background

Virtual Address Layout
ASLR
Stack Smashing
Pointer Encryption
Process failure
- Intro
- Failure Prediction
Checkpoint [Recovery]
Replication
- Redundant Computation
ULFM
- What is ULFM?
- What functionalities are supported?
- Explain different functions in detail
- How ULFM is used

Design

Adaptive Process Replication [talk about different mapped regions and how user segment is totally different]
- Data Seg Replication
  
  The data segment is the memory region that holds global variables, both initialized and uninitialized. Each variable is assigned an address within the segment when the program is built, and this mapping does not change during program execution.
  
  As noted above, the segment is divided into two parts: an initialized part and an uninitialized part. Variables initialized with a non-zero value go in the initialized part; variables that are uninitialized or initialized to 0 go in the uninitialized part.
  
  Looking at the symbol table of an executable (using nm), one can find four symbols of interest associated with the data segment: __data_start marks the start of the initialized data section, _edata marks its end, __bss_start marks the start of the uninitialized data section, and _end marks its end. These symbols cannot be used as ordinary variables; they serve only to obtain the boundaries of the memory associated with them, using the & operator. Ex: void *a = &__data_start.
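
  As a minimal sketch of how these boundaries could be read from inside a program, the snippet below takes the addresses of the four symbols and turns them into byte ranges. It assumes a Linux toolchain with GNU ld and glibc, where all four symbols are defined. The type `mem_range` and the helper functions are hypothetical names introduced only for illustration; they are not part of the framework described here.

  ```c
  #include <assert.h>
  #include <stddef.h>
  #include <stdint.h>

  /* Boundary symbols provided by the toolchain (GNU ld / glibc assumed).
     Only their addresses are meaningful; never read or write through them. */
  extern char __data_start, _edata, __bss_start, _end;

  /* Hypothetical helper type: a half-open address range [start, end). */
  struct mem_range {
      uintptr_t start;
      uintptr_t end;
  };

  /* Range covering the initialized data section. */
  struct mem_range initialized_data_range(void) {
      struct mem_range r = { (uintptr_t)&__data_start, (uintptr_t)&_edata };
      assert(r.start <= r.end);
      return r;
  }

  /* Range covering the uninitialized data section. */
  struct mem_range uninitialized_data_range(void) {
      struct mem_range r = { (uintptr_t)&__bss_start, (uintptr_t)&_end };
      assert(r.start <= r.end);
      return r;
  }

  /* Number of bytes in a range. */
  size_t range_size(struct mem_range r) {
      return (size_t)(r.end - r.start);
  }

  /* Returns 1 if the address p lies inside the range, 0 otherwise. */
  int range_contains(struct mem_range r, const void *p) {
      uintptr_t a = (uintptr_t)p;
      return a >= r.start && a < r.end;
  }

  int initialized_global = 42;  /* non-zero initializer: initialized part */
  int uninitialized_global;     /* no initializer: uninitialized part */
  ```

  Calling `range_contains(initialized_data_range(), &initialized_global)` from `main` should return 1 under the stated assumptions, and the same check against `uninitialized_data_range()` should hold for `uninitialized_global`. A replication scheme could copy `range_size(...)` bytes starting at each `start` address, though whether that matches the framework's actual copy routine is not stated in these notes.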
- Stack Seg Replication
  
  The stack segment is used for function calls and for storing the local variables that belong to a particular function.
- Heap Seg Replication
  - Pointers by value to functions
Integrating ULFM
- Basic Design
- comm err handler
- MPI_* functions pseudocode [Algorithm design] [imp]
- Explain above in detail [imp]
- Handling Collectives
- Special case for send/recv

Checkpointing and Restart

One process failure [if job has more than 1 rank]
All process [rank] failure in a job [Fail Stop]
More inter checkpointing time because of replication

Others

Fortran wrapper
Process Manager

Experiments

Setup
- Computation configuration
- Failure Model [Fault Injection]
Plain run with n order redundancy [n = 1, 2, 3]
n order redundancy with fault injection
- random faults [kill any]
- kill the job with > 1 ranks [no fail stop]
n order redundancy with replication map update
n order redundancy with replication map update and fault injection

Future Work

Failure Predictions
More intelligent process manager
Handling communicator split
Compatibility with more Async MPI_I* functions
Optimizing heap segment container storage for fast insertion/deletion.
Efficient MPI_ANY_SOURCE

Terminologies
- Job
- Node
Constraints
- ASLR [drawback]
- Stack Smashing problem [drawback]
Out of the box support
Compiler

How this framework supports different Fault Tolerance mechanisms.
