generated from riscv/docs-spec-template
-
Notifications
You must be signed in to change notification settings - Fork 2
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
3 changed files
with
277 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,247 @@ | ||
[#sec:redundancy] | ||
## Redundancy | ||
|
||
[#sec:redundancy:safety] | ||
### Safety needs | ||
|
||
To be safe is to reduce faults to a level that avoids unreasonable risk. | ||
These include transient faults, which are one-off faults, after which the | ||
behavior returns to normal (unless the fault has already been | ||
consumed/propagated), and permanent faults, which persist and require | ||
intervention. | ||
|
||
Faults can have an internal (e.g. bug, device aging) or external (e.g. high | ||
energy particle, electrostatic discharge, excessive heat) origin. | ||
|
||
A failure that has a single root cause but affects more than one component is | ||
known as a common-cause failure. | ||
|
||
Redundancy is the (partial) replication of components or information to detect | ||
and/or correct faults. | ||
|
||
Redundancy can be introduced into the system on different levels: | ||
|
||
* Hardware redundancy is achieved by replicating physical resources, for example | ||
including multiple processors or whole SoCs or replicating elements such as | ||
memory or sensors. | ||
* Software redundancy means a task is implemented in software in multiple | ||
different ways and each version is executed. | ||
* Information redundancy can help detect or correct data corruption by storing | ||
additional data, such as an error detecting/correcting code. | ||
* Time redundancy is brought about by executing the same task on the same | ||
processor multiple times. | ||
In all cases execution on a redundant system is followed by a decision step, in | ||
which the output from each redundant execution (or the redundant information in | ||
the case of information redundancy), is compared. | ||
If the output differs then a failure has occurred in one or more executions and | ||
an error can be reported. | ||
If the redundancy is implemented by replicating by more than two the system can | ||
also continue executing in the case where the output of the majority of | ||
executions concurs. | ||
|
||
Different types of redundancy are used to detect and/or prevent different | ||
categories of failures. | ||
Hardware redundancy is used to handle malfunction of hardware elements, software | ||
redundancy is applied to avoid defects introduced during the development phase, | ||
information redundancy protects the system against data corruption and time | ||
redundancy is employed to avoid transient faults. | ||
It has to be noted that HW-SW co-engineering methods exist, for which the strict | ||
classification above might not be fully correct, however it still is helpful to | ||
establish fundamental understanding. | ||
|
||
Redundancy can mitigate both random-permanent and transient faults, of both | ||
internal and external origin. Diverse redundancy also supports claims with | ||
respect to systematic faults. | ||
|
||
Both software redundancy and time redundancy are typically managed by purely | ||
software means without explicit hardware support. | ||
Hence, we exclude them from the discussion in the remainder of this chapter. | ||
|
||
[#sec:redundancy:features] | ||
#### Features | ||
|
||
##### Information Redundancy | ||
|
||
Apart from full replication, information redundancy can be achieved by | ||
techniques such as error detection and correction codes (EDC/ECC) in memory. | ||
|
||
##### Hardware Redundancy | ||
|
||
Redundancy can be characterized by its level of diversity. | ||
If, for example, multiple identical hardware elements are integrated without modification the diversity is low. | ||
Diversity can be increased by triggering the execution of the software on such a | ||
system with an offset of a number of clock cycles. Additional diversity can be | ||
achieved by incorporating elements of differing design, differing ISAs or even | ||
differing manufacturers. | ||
|
||
Diverse redundancy is particularly useful to mitigate common-cause failures. | ||
Upon a fault affecting redundant diverse components, the error experienced by | ||
any such component is likely to differ and hence, errors will be detected and | ||
the failure avoided. | ||
|
||
The level of diversity required by a system depends on the target safety level, | ||
as well as the domain. | ||
For example, executing a process with a clock cycle offset on identical | ||
processors may be sufficient for systems in the automotive domain, but not in | ||
avionics. | ||
|
||
Redundancy can be further characterized by its cardinality. | ||
In general, redundancy can be characterized as a M-out-of-N (MooN) system, where | ||
N is the cardinality, i.e. the number of redundant elements, and M is the number | ||
of these elements that are required to be functional for the whole system to | ||
remain functional. | ||
|
||
By including two redundant components (1oo2) in the system (also known as dual | ||
modular redundancy) individual faults can be detected, but when the output of | ||
the two components differs the system cannot decide in which the fault occurred | ||
and hence which result is the correct one. | ||
To allow the system to continue to execute in such a case, and therefore correct | ||
the fault, the component needs to be included in triplicate (2oo3, triple | ||
modular redundancy), or more (such as 3oo5). | ||
The comparison module can then decide by majority voting which output is the | ||
right one and which component has failed. | ||
|
||
NOTE: This implies that the comparison module has to fulfill high integrity | ||
requirements with respect to absence of systematic faults. | ||
|
||
Redundant hardware can be used either as a fallback component which is idle | ||
until the primary component fails (assuming full replication), or during normal | ||
execution and in parallel with the primary system (certain models also support | ||
partial replication), in order to detect or correct faults in the components. | ||
|
||
In the former case the functionality of the fallback system is typically a | ||
subset of the main system providing degraded operation and it is used only in a | ||
time-limited fashion until the primary system is back online or to put the | ||
system into a safe state. | ||
|
||
Hardware redundancy is often implemented by replicating a processor and | ||
comparing the output of the computation at the end of a task's execution. | ||
If, instead, the task is initiated in the same, or offset, clock cycle on | ||
identical processors and the state of the execution is compared at regular | ||
intervals, or steps, it is referred to as lockstep computing. | ||
The definition of the step can vary, e.g. a set number of clock cycles, or | ||
delineated by specific events such as writes to memory, interrupts, or other | ||
events off-core. | ||
|
||
For lockstep computing an additional comparison module is required to perform | ||
regular checks of the state of the systems. | ||
Such module must also be intrinsically redundant or tolerant to the relevant | ||
fault modes affecting the redundant components. | ||
|
||
Time redundancy can additionally be introduced into lockstep execution by | ||
starting the redundant tasks not in the same clock cycle but offset by a number | ||
of cycles. | ||
This method avoids common-cause failures due to transient faults. | ||
|
||
Examples of lock-step include dual core lock-step (DCLS), as an implementation | ||
of 1oo2, and triple core lock-step (TCLS), which is 2oo3. | ||
The cores may be initiated in the same clock cycle or offset. | ||
|
||
In the context of RISC-V there may be additional levels of diversity of | ||
redundancy that can be considered. | ||
Since RISC-V is an open instruction set architecture (ISA) several different | ||
implementations exist by different independent providers. | ||
Different implementations could be combined in a system to provide a level of | ||
diverse (microarchitectural) redundancy. | ||
However, while this kind of redundancy will detect (and possibly correct) | ||
failures due to the implementation, it assumes absence of faults in the ISA | ||
itself. | ||
|
||
Traditionally, heterogenous systems imply diversity of ISAs. Diversity of ISA | ||
has the added benefit of introducing a certain degree of diversity in the | ||
software as well, since different tools, such as compilers, are required to | ||
program for the processor. | ||
In the case of using different RISC-V implementations it is therefore worth | ||
considering using different compiler implementations for each replicated | ||
processor, unless other measures (like qualification kits etc.) exist. | ||
|
||
Whether the level of redundancy achieved by combining RISC-V implementations, | ||
and perhaps using different toolchains, is sufficient depends on the safety | ||
level and domain of the system | ||
|
||
[#sec:redundancy:safety:level] | ||
#### Level | ||
|
||
Both the core and the whole SoC can be replicated for redundancy purposes. | ||
|
||
[#sec:redundancy:safety:importance] | ||
#### Importance | ||
|
||
While most safety standards do not require redundancy, to achieve the levels of | ||
fault tolerance required at higher safety levels redundancy is one of the key | ||
techniques employed. This can be seen in functional safety standards such as | ||
ISO 26262 and IEC 61508. | ||
|
||
In the context of ISO 26262 cite:[iso26262:2018], redundancy is integral to | ||
"`ASIL tailoring`" i.e. the decomposition of safety requirements to redundant | ||
architectural elements, | ||
whereby, if absence of dependent failures can be demonstrated, the ASIL | ||
allocated to the redundant elements can be reduced. | ||
|
||
Similarly IEC 61508-2:2010 cite:[iec16508-2:2010] provides guidance for the maximum | ||
diagnostics coverage of a variety of redundancy techniques, and | ||
IEC 61508-6:2010 cite:[iec16508-6:2010] Annex B provides guidance for evaluating probabilities of hardware failure that includes various redundant voting schemas. | ||
In addition, IEC 61508-2:2010 cite:[iec16508-2:2010] Annex E defines normative | ||
requirements for: | ||
"`__Special architecture requirements for digital integrated circuits (ICs) with | ||
on-chip redundancy__`" to avoid common cause failures for ICs that share the same | ||
substrate. | ||
|
||
[#sec:redundancy:safety:justification] | ||
#### Justification | ||
|
||
Redundancy is often the only mechanism to detect errors and remain operational | ||
to the extent required by systems with high safety levels. | ||
|
||
Basic redundancy can improve integrity by providing a method for error detection | ||
and eventually correction (both could by accompanied by degradation of main | ||
functionality). | ||
If the redundancy is further increased the system can also show improved fault | ||
tolerance and hence reliability, since single faults are corrected as long as | ||
they do not lead to common cause failures, which would need diversity in | ||
addition. | ||
In the case of a primary-backup setup availability can be said to increase, | ||
since the backup component may be available even if the primary component has | ||
failed. | ||
|
||
ISO 26262:5 cite:[iso26262-5:2018] mentions redundancy as a safety mechanism, with typical diagnostic | ||
coverage considered achievable described as "`High`". | ||
|
||
ISO 26262:11 cite:[iso26262-11:2018] also specifically mentions diverse redundancy as a tool to reduce | ||
risk of hardware failures when using IP with limited documentation and | ||
insufficient historic (aka "`proven in use`") data. | ||
|
||
Error detection/correction modes are described in | ||
ISO 26262:11 cite:[iso26262-11:2018] as a technique to detect failures in | ||
memory. | ||
|
||
[#sec:redundancy:rv] | ||
### RISC-V solutions | ||
|
||
Given that redundancy is intended to be completely transparent, no RISC-V | ||
specific features have been devised to our knowledge. | ||
However, it has to be noted that control- and capture-interfaces will add to | ||
register-interface (core and uncore-IP), and consequently a standardized minimal | ||
set (ideally mapped against safety requirements from various standards), will | ||
improve consideration of RISC-V by Safety-System vendors. | ||
|
||
[#sec:redundancy:recom] | ||
### Recommendations | ||
|
||
Redundancy is intended to be completely transparent, hence no changes to the ISA | ||
are required. | ||
|
||
[#sec:redundancy:activities] | ||
### Relevant activities | ||
|
||
#### Related external bodies | ||
|
||
None. | ||
|
||
#### Related chapters | ||
|
||
Potentially the error management chapter (to be released), for errors detected | ||
and/or corrected by means of redundancy. | ||
For instance, to program actions to take upon unrecoverable errors, and to | ||
collect statistics about corrected errors. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters