Commit

Add redundancy chapter
dgptha committed Feb 17, 2025
1 parent 1774acf commit 4691163
Showing 3 changed files with 277 additions and 0 deletions.
247 changes: 247 additions & 0 deletions src/chapters/redundancy.adoc
@@ -0,0 +1,247 @@
[#sec:redundancy]
## Redundancy

[#sec:redundancy:safety]
### Safety needs

To be safe is to reduce faults to a level that avoids unreasonable risk.
Faults may be transient, i.e. one-off faults after which the behavior returns
to normal (unless the fault has already been consumed/propagated), or
permanent, i.e. faults that persist and require intervention.

Faults can have an internal (e.g. bug, device aging) or external (e.g. high
energy particle, electrostatic discharge, excessive heat) origin.

A failure that has a single root cause but affects more than one component is
known as a common-cause failure.

Redundancy is the (partial) replication of components or information to detect
and/or correct faults.

Redundancy can be introduced into the system on different levels:

* Hardware redundancy is achieved by replicating physical resources, for example
including multiple processors or whole SoCs or replicating elements such as
memory or sensors.
* Software redundancy means a task is implemented in software in multiple
different ways and each version is executed.
* Information redundancy can help detect or correct data corruption by storing
additional data, such as an error detecting/correcting code.
* Time redundancy is brought about by executing the same task on the same
processor multiple times.

In all cases, execution on a redundant system is followed by a decision step in
which the outputs of the redundant executions (or the redundant information, in
the case of information redundancy) are compared.
If the outputs differ, a failure has occurred in one or more executions and
an error can be reported.
If more than two replicas are used, the system can also continue executing as
long as the outputs of a majority of executions agree.
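
The decision step can be sketched as follows. This is a minimal illustration only, not taken from any standard or RISC-V specification; the function name `decide` and its return convention are assumptions made for this example.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence, Tuple


def decide(outputs: Sequence[Hashable]) -> Tuple[Optional[Hashable], bool]:
    """Compare the outputs of redundant executions (hypothetical helper).

    Returns (value, fault_detected). value is None when no strict majority
    exists, i.e. the fault is detected but cannot be masked.
    """
    if not outputs:
        raise ValueError("at least one output is required")
    counts = Counter(outputs)
    value, votes = counts.most_common(1)[0]
    # Any disagreement between replicas means at least one execution failed.
    fault_detected = len(counts) > 1
    # Continue only if a strict majority of the replicas agree.
    if votes * 2 > len(outputs):
        return value, fault_detected
    return None, fault_detected
```

With two replicas a mismatch yields `(None, True)`: the error can be reported but not masked. With three replicas a single deviating output is outvoted and execution can continue with the majority value.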

Different types of redundancy are used to detect and/or prevent different
categories of failures.
Hardware redundancy is used to handle malfunction of hardware elements, software
redundancy is applied to avoid defects introduced during the development phase,
information redundancy protects the system against data corruption and time
redundancy is employed to avoid transient faults.
Note that HW-SW co-engineering methods exist for which the strict
classification above might not be fully correct; it is nevertheless helpful for
establishing a fundamental understanding.

Redundancy can mitigate both random-permanent and transient faults, of both
internal and external origin. Diverse redundancy also supports claims with
respect to systematic faults.

Both software redundancy and time redundancy are typically managed by purely
software means without explicit hardware support.
Hence, we exclude them from the discussion in the remainder of this chapter.

[#sec:redundancy:features]
#### Features

##### Information Redundancy

Apart from full replication, information redundancy can be achieved by
techniques such as error detection and correction codes (EDC/ECC) in memory.
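
As an illustration of an error correcting code, the sketch below implements the classic Hamming(7,4) code, which stores three parity bits alongside four data bits and can correct any single-bit error. It is a teaching example only: the function names are assumptions made here, and memory ECC in real devices typically uses wider codes.

```python
from typing import Tuple


def hamming74_encode(data: int) -> int:
    """Encode a 4-bit value into a 7-bit Hamming codeword."""
    if not 0 <= data < 16:
        raise ValueError("data must fit in 4 bits")
    # Codeword positions are numbered 1..7 (index 0 is unused).
    # Parity bits sit at positions 1, 2 and 4; data bits at 3, 5, 6 and 7.
    bits = [0] * 8
    bits[3], bits[5], bits[6], bits[7] = [(data >> i) & 1 for i in range(4)]
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    return sum(bits[pos] << (pos - 1) for pos in range(1, 8))


def hamming74_decode(codeword: int) -> Tuple[int, int]:
    """Return (data, error_position); error_position is 0 if no error is seen."""
    bits = [0] + [(codeword >> (pos - 1)) & 1 for pos in range(1, 8)]
    # The syndrome is the XOR of the positions of all set bits. It is 0 for
    # a valid codeword and equals the flipped position after a single error.
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos
    if syndrome:
        bits[syndrome] ^= 1  # correct the single-bit error
    data = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return data, syndrome
```

Note that this plain code miscorrects double-bit errors; adding one overall parity bit (SECDED) allows double-bit errors to be detected in addition to single-bit errors being corrected.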

##### Hardware Redundancy

Redundancy can be characterized by its level of diversity.
If, for example, multiple identical hardware elements are integrated without modification the diversity is low.
Diversity can be increased by triggering the execution of the software on such a
system with an offset of a number of clock cycles. Additional diversity can be
achieved by incorporating elements of differing design, differing ISAs or even
differing manufacturers.

Diverse redundancy is particularly useful to mitigate common-cause failures.
When a fault affects diverse redundant components, the resulting errors are
likely to differ between components; hence the errors will be detected and
the failure avoided.

The level of diversity required by a system depends on the target safety level,
as well as the domain.
For example, executing a process with a clock cycle offset on identical
processors may be sufficient for systems in the automotive domain, but not in
avionics.

Redundancy can be further characterized by its cardinality.
In general, redundancy can be characterized as an M-out-of-N (MooN) system, where
N is the cardinality, i.e. the number of redundant elements, and M is the number
of these elements that are required to be functional for the whole system to
remain functional.
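
To make the MooN notation concrete, the following sketch computes the probability that a MooN system remains functional. It assumes that all N elements fail independently and that each is functional with the same probability `r`; it therefore ignores common-cause failures and any failure of the decision logic. The function name `moon_reliability` is hypothetical.

```python
from math import comb


def moon_reliability(m: int, n: int, r: float) -> float:
    """Probability that at least m of n elements are functional.

    Assumes independent elements, each functional with probability r.
    """
    if not 0 <= m <= n:
        raise ValueError("require 0 <= m <= n")
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be a probability")
    # Sum the binomial probabilities of exactly k working elements, k = m..n.
    return sum(comb(n, k) * r**k * (1.0 - r) ** (n - k) for k in range(m, n + 1))
```

Under these assumptions and with `r = 0.9`, a single element is functional with probability 0.9, whereas a 2oo3 arrangement is functional with probability 0.972.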

By including two redundant components (1oo2) in the system (also known as dual
modular redundancy), individual faults can be detected, but when the outputs of
the two components differ the system cannot decide in which component the fault
occurred and hence which result is the correct one.
To allow the system to continue to execute in such a case, and therefore correct
the fault, the component needs to be included in triplicate (2oo3, triple
modular redundancy), or more (such as 3oo5).
The comparison module can then decide by majority voting which output is the
right one and which component has failed.

NOTE: This implies that the comparison module has to fulfill high integrity
requirements with respect to absence of systematic faults.

Redundant hardware can be used either as a fallback component which is idle
until the primary component fails (assuming full replication), or during normal
execution and in parallel with the primary system (certain models also support
partial replication), in order to detect or correct faults in the components.

In the former case the functionality of the fallback system is typically a
subset of the main system providing degraded operation and it is used only in a
time-limited fashion until the primary system is back online or to put the
system into a safe state.

Hardware redundancy is often implemented by replicating a processor and
comparing the output of the computation at the end of a task's execution.
If, instead, the task is initiated in the same, or offset, clock cycle on
identical processors and the state of the execution is compared at regular
intervals, or steps, it is referred to as lockstep computing.
The definition of the step can vary, e.g. a set number of clock cycles, or
delineated by specific events such as writes to memory, interrupts, or other
events off-core.

For lockstep computing an additional comparison module is required to perform
regular checks of the state of the systems.
Such a module must also be intrinsically redundant or tolerant to the relevant
fault modes affecting the redundant components.

Time redundancy can additionally be introduced into lockstep execution by
starting the redundant tasks not in the same clock cycle but offset by a number
of cycles.
This method avoids common-cause failures due to transient faults.
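
The effect of the offset can be illustrated with a toy simulation. Everything in it is an assumption made for illustration: the "core" is a simple accumulator, a transient fault is modeled as XOR-ing a mask into the state of every core executing in a given wall-clock cycle, and the checker compares the two cores' states step by step (in hardware, the leading core's state would be delayed by the offset before comparison).

```python
from typing import List, Optional, Sequence

MASK32 = 0xFFFFFFFF


def run_core(program: Sequence[int], start_cycle: int,
             fault_cycle: Optional[int], fault_mask: int) -> List[int]:
    """Run a toy core and return its state after every program step.

    The core starts at wall-clock cycle start_cycle and executes one step
    per cycle. A transient fault in fault_cycle corrupts the state of
    whatever step the core is executing in that cycle.
    """
    acc = 0
    trace = []
    for step, operand in enumerate(program):
        acc = (acc * 31 + operand) & MASK32
        if fault_cycle is not None and start_cycle + step == fault_cycle:
            acc ^= fault_mask  # common-cause transient hits this core
        trace.append(acc)
    return trace


def lockstep_detects(program: Sequence[int], offset: int,
                     fault_cycle: Optional[int], fault_mask: int = 0x10) -> bool:
    """Return True if the step-by-step state comparison sees a mismatch."""
    primary = run_core(program, 0, fault_cycle, fault_mask)
    shadow = run_core(program, offset, fault_cycle, fault_mask)
    return any(p != s for p, s in zip(primary, shadow))
```

In this model, a fault that hits both cores in the same cycle goes unnoticed when the offset is zero, because both states are corrupted identically. With a non-zero offset the cores are at different program steps when the fault strikes, so the comparison reveals a mismatch.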

Examples of lockstep include dual-core lockstep (DCLS), as an implementation
of 1oo2, and triple-core lockstep (TCLS), which is 2oo3.
The cores may be initiated in the same clock cycle or offset.

In the context of RISC-V there may be additional levels of diversity of
redundancy that can be considered.
Since RISC-V is an open instruction set architecture (ISA) several different
implementations exist by different independent providers.
Different implementations could be combined in a system to provide a level of
diverse (microarchitectural) redundancy.
However, while this kind of redundancy will detect (and possibly correct)
failures due to the implementation, it assumes absence of faults in the ISA
itself.

Traditionally, heterogeneous systems imply diversity of ISAs. Diversity of ISA
has the added benefit of introducing a certain degree of diversity in the
software as well, since different tools, such as compilers, are required to
program for the processor.
In the case of using different RISC-V implementations it is therefore worth
considering using different compiler implementations for each replicated
processor, unless other measures (like qualification kits etc.) exist.

Whether the level of redundancy achieved by combining RISC-V implementations,
and perhaps using different toolchains, is sufficient depends on the safety
level and domain of the system.

[#sec:redundancy:safety:level]
#### Level

Both the core and the whole SoC can be replicated for redundancy purposes.

[#sec:redundancy:safety:importance]
#### Importance

While most safety standards do not require redundancy, to achieve the levels of
fault tolerance required at higher safety levels redundancy is one of the key
techniques employed. This can be seen in functional safety standards such as
ISO 26262 and IEC 61508.

In the context of ISO 26262 cite:[iso26262:2018], redundancy is integral to
"`ASIL tailoring`" i.e. the decomposition of safety requirements to redundant
architectural elements,
whereby, if absence of dependent failures can be demonstrated, the ASIL
allocated to the redundant elements can be reduced.

Similarly, IEC 61508-2:2010 cite:[iec16508-2:2010] provides guidance on the
maximum diagnostic coverage of a variety of redundancy techniques, and
IEC 61508-6:2010 cite:[iec16508-6:2010] Annex B provides guidance for evaluating
probabilities of hardware failure, including for various redundant voting
schemes.
In addition, IEC 61508-2:2010 cite:[iec16508-2:2010] Annex E defines normative
"`__Special architecture requirements for digital integrated circuits (ICs) with
on-chip redundancy__`" to avoid common-cause failures for ICs that share the same
substrate.

[#sec:redundancy:safety:justification]
#### Justification

Redundancy is often the only mechanism to detect errors and remain operational
to the extent required by systems with high safety levels.

Basic redundancy can improve integrity by providing a method for error detection
and possibly correction (both could be accompanied by degradation of the main
functionality).
If the redundancy is further increased, the system can also show improved fault
tolerance and hence reliability, since single faults are corrected as long as
they do not lead to common-cause failures, which would additionally require
diversity.
In the case of a primary-backup setup availability can be said to increase,
since the backup component may be available even if the primary component has
failed.

ISO 26262-5 cite:[iso26262-5:2018] mentions redundancy as a safety mechanism,
with the typical diagnostic coverage considered achievable described as "`High`".

ISO 26262-11 cite:[iso26262-11:2018] also specifically mentions diverse
redundancy as a tool to reduce the risk of hardware failures when using IP with
limited documentation and insufficient historic (aka "`proven in use`") data.

Error detection/correction codes are described in
ISO 26262-11 cite:[iso26262-11:2018] as a technique to detect failures in
memory.

[#sec:redundancy:rv]
### RISC-V solutions

Given that redundancy is intended to be completely transparent, no RISC-V
specific features have been devised to our knowledge.
However, control and capture interfaces will add to the register interface
(of both core and uncore IP); consequently, a standardized minimal set of such
interfaces (ideally mapped against safety requirements from various standards)
would improve consideration of RISC-V by safety-system vendors.

[#sec:redundancy:recom]
### Recommendations

Redundancy is intended to be completely transparent, hence no changes to the ISA
are required.

[#sec:redundancy:activities]
### Relevant activities

#### Related external bodies

None.

#### Related chapters

Potentially the error management chapter (to be released), for errors detected
and/or corrected by means of redundancy.
For instance, to program actions to take upon unrecoverable errors, and to
collect statistics about corrected errors.
2 changes: 2 additions & 0 deletions src/fusa-whitepaper.adoc
@@ -79,6 +79,8 @@ include::chapters/pmc.adoc[]

include::chapters/qos.adoc[]

include::chapters/redundancy.adoc[]

// The index must precede the bibliography
include::index.adoc[]

28 changes: 28 additions & 0 deletions src/fusa-whitepaper.bib
@@ -151,3 +151,31 @@ @techreport{ft-qosid:2023
institution = {RISC-V International},
year = {2023}
}

% REF9
@techreport{iec16508-2:2010,
title = {{IEC} 61508-2:2010 {F}unctional safety of electrical/electronic/programmable electronic safety-related systems - {P}art 2: {R}equirements for electrical/electronic/programmable electronic safety-related systems},
institution = {International Electrotechnical Commission (IEC), available at https://webstore.iec.ch},
year = {2010}
}

% REF10
@techreport{iec16508-6:2010,
title = {{IEC} 61508-6:2010 {F}unctional safety of electrical/electronic/programmable electronic safety-related systems - {P}art 6: {G}uidelines on the application of {IEC} 61508-2 and {IEC} 61508-3},
institution = {International Electrotechnical Commission (IEC), available at https://webstore.iec.ch},
year = {2010}
}

% REF11
@techreport{iso26262-5:2018,
title = {{ISO} 26262-5:2018 {R}oad vehicles - {F}unctional safety, {P}art 5: Product development at the hardware level},
institution = {International Organization for Standardization (ISO), available at https://www.iso.org/standard/68387.html},
year = {2018}
}

% REF12
@techreport{iso26262-11:2018,
title = {{ISO} 26262-11:2018 {R}oad vehicles - {F}unctional safety, {P}art 11: Guidelines on application of {ISO} 26262 to semiconductors},
institution = {International Organization for Standardization (ISO), available at https://www.iso.org/standard/69604.html},
year = {2018}
}
