Methodology

Abhishek Upperwal edited this page May 16, 2018 · 12 revisions

Title

Abstract

Introduction

Background

  • Virtual Address Layout
  • ASLR
  • Stack Smashing
  • Pointer Encryption
  • Process failure
    • Intro
    • Failure Prediction
  • Checkpoint [Recovery]
  • Replication
    • Redundant Computation
  • ULFM
    • What is ULFM?
    • What functionalities are supported?
    • Explain different functions in detail
    • How ULFM is used

Design

  • Adaptive Process Replication [talk about different mapped regions and how user segment is totally different]
    • Data Seg Replication

      The data segment is the memory where initialized and uninitialized global variables reside. Each variable is assigned a location in this segment when the program is compiled and linked, and this mapping does not change during program execution.

      As noted above, this segment is divided into two parts: the initialized segment and the uninitialized segment. Variables initialized with a non-zero value go in the initialized segment; variables that are uninitialized or initialized with "0" go in the uninitialized segment.

      The symbol table of an executable (viewed with nm) contains four symbols of interest for the data segment: __data_start marks the start of the initialized data section, _edata marks its end, __bss_start marks the start of the uninitialized data section, and _end marks its end. These symbols cannot be used as ordinary variables; only their addresses are meaningful, and the & operator yields the corresponding memory boundary. Ex: void *a = &__data_start.
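      The boundary lookup described above can be sketched in C as follows. This is a minimal illustration, not the framework's actual code: it assumes a Linux/ELF toolchain (GNU ld with glibc start files) that defines the four symbols, and the mem_region type and the helper function names are hypothetical, introduced only for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Linker-defined symbols (assumed: GNU ld + glibc on Linux/ELF).
   They carry no usable value; only their addresses are meaningful. */
extern char __data_start, _edata, __bss_start, _end;

/* Hypothetical helper type: a half-open address range [start, end). */
struct mem_region {
    char *start;
    char *end;
};

/* Range covering initialized globals. */
struct mem_region initialized_data_region(void) {
    struct mem_region r = { &__data_start, &_edata };
    return r;
}

/* Range covering uninitialized (or zero-initialized) globals. */
struct mem_region uninitialized_data_region(void) {
    struct mem_region r = { &__bss_start, &_end };
    return r;
}

/* Size of a region in bytes, computed on integer addresses. */
size_t region_size(struct mem_region r) {
    return (size_t)((uintptr_t)r.end - (uintptr_t)r.start);
}

/* Returns 1 if address p lies inside the region, 0 otherwise. */
int region_contains(struct mem_region r, const void *p) {
    uintptr_t a = (uintptr_t)p;
    return a >= (uintptr_t)r.start && a < (uintptr_t)r.end;
}

/* Example globals: one per part of the data segment. */
int g_initialized = 42; /* non-zero initializer: initialized part */
int g_uninitialized;    /* no initializer: uninitialized part */
```

      A replication routine could, under these assumptions, copy region_size(r) bytes starting at r.start for each of the two regions. Other linkers or C libraries may not define __data_start, so the symbol names should be checked with nm on the target platform.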

    • Stack Seg Replication

      The stack segment is used for function calls and for storing local variables, which belong only to a particular function.

    • Heap Seg Replication

      • Pointers by value to functions
  • Integrating ULFM
    • Basic Design
    • comm err handler
    • MPI_* functions pseudocode [Algorithm design] [imp]
    • Explain above in detail [imp]
    • Handling Collectives
    • Special case for send/recv

Checkpointing and Restart

  • One process failure [if the job has more than one rank]
  • All process [rank] failure in a job [Fail Stop]
  • More inter checkpointing time because of replication

Others

  • Fortran wrapper
  • Process Manager

Experiments

  • Setup
    • Computation configuration
    • Failure Model [Fault Injection]
  • Plain run with n order redundancy [n = 1, 2, 3]
  • n order redundancy with fault injection
    • random faults [kill any]
    • kill the job with > 1 ranks [no fail stop]
  • n order redundancy with replication map update
  • n order redundancy with replication map update and fault injection

Future Work

  • Failure Predictions
  • More intelligent process manager
  • Handling communicator split
  • Compatibility with more Async MPI_I* functions
  • Optimizing heap segment container storage for fast insertion/deletion.
  • Efficient MPI_ANY_SOURCE

  • Terminologies
    • Job
    • Node
  • Constraints
    • ASLR [drawback]
    • Stack Smashing problem [drawback]
  • Out of the box support
  • Compiler

How this framework supports different fault tolerance mechanisms.