Add redundancy chapter

riscv · Feb 17, 2025 · 4691163 · 4691163
1 parent 1774acf
commit 4691163
Show file tree

Hide file tree

Showing 3 changed files with 277 additions and 0 deletions.
diff --git a/src/chapters/redundancy.adoc b/src/chapters/redundancy.adoc
@@ -0,0 +1,247 @@
+[#sec:redundancy]
+## Redundancy
+
+[#sec:redundancy:safety]
+### Safety needs
+
+To be safe is to reduce faults to a level that avoids unreasonable risk.
+These include transient faults, which are one-off faults, after which the
+behavior returns to normal (unless the fault has already been
+consumed/propagated), and permanent faults, which persist and require
+intervention.
+
+Faults can have an internal (e.g. bug, device aging) or external (e.g. high
+energy particle, electrostatic discharge, excessive heat) origin.
+
+A failure that has a single root cause but affects more than one component is
+known as a common-cause failure.
+
+Redundancy is the (partial) replication of components or information to detect
+and/or correct faults.
+
+Redundancy can be introduced into the system on different levels:
+
+* Hardware redundancy is achieved by replicating physical resources, for example
+including multiple processors or whole SoCs or replicating elements such as
+memory or sensors.
+* Software redundancy means a task is implemented in software in multiple
+different ways and each version is executed.
+* Information redundancy can help detect or correct data corruption by storing
+additional data, such as an error detecting/correcting code.
+* Time redundancy is brought about by executing the same task on the same
+processor multiple times.
+
+In all cases execution on a redundant system is followed by a decision step, in
+which the output from each redundant execution (or the redundant information in
+the case of information redundancy), is compared.
+If the output differs then a failure has occurred in one or more executions and
+an error can be reported.
+If the redundancy is implemented by replicating by more than two the system can
+also continue executing in the case where the output of the majority of
+executions concurs.
+
+Different types of redundancy are used to detect and/or prevent different
+categories of failures.
+Hardware redundancy is used to handle malfunction of hardware elements, software
+redundancy is applied to avoid defects introduced during the development phase,
+information redundancy protects the system against data corruption and time
+redundancy is employed to avoid transient faults.
+It has to be noted that HW-SW co-engineering methods exist, for which the strict
+classification above might not be fully correct, however it still is helpful to
+establish fundamental understanding.
+
+Redundancy can mitigate both random-permanent and transient faults, of both
+internal and external origin. Diverse redundancy also supports claims with
+respect to systematic faults.
+
+Both software redundancy and time redundancy are typically managed by purely
+software means without explicit hardware support.
+Hence, we exclude them from the discussion in the remainder of this chapter.
+
+[#sec:redundancy:features]
+#### Features
+
+##### Information Redundancy
+
+Apart from full replication, information redundancy can be achieved by
+techniques such as error detection and correction codes (EDC/ECC) in memory.
+
+##### Hardware Redundancy
+
+Redundancy can be characterized by its level of diversity.
+If, for example, multiple identical hardware elements are integrated without modification the diversity is low.
+Diversity can be increased by triggering the execution of the software on such a
+system with an offset of a number of clock cycles. Additional diversity can be
+achieved by incorporating elements of differing design, differing ISAs or even
+differing manufacturers.
+
+Diverse redundancy is particularly useful to mitigate common-cause failures.
+Upon a fault affecting redundant diverse components, the error experienced by
+any such component is likely to differ and hence, errors will be detected and
+the failure avoided.
+
+The level of diversity required by a system depends on the target safety level,
+as well as the domain.
+For example, executing a process with a clock cycle offset on identical
+processors may be sufficient for systems in the automotive domain, but not in
+avionics.
+
+Redundancy can be further characterized by its cardinality.
+In general, redundancy can be characterized as a M-out-of-N (MooN) system, where
+N is the cardinality, i.e. the number of redundant elements, and M is the number
+of these elements that are required to be functional for the whole system to
+remain functional.
+
+By including two redundant components (1oo2) in the system (also known as dual
+modular redundancy) individual faults can be detected, but when the output of
+the two components differs the system cannot decide in which the fault occurred
+and hence which result is the correct one.
+To allow the system to continue to execute in such a case, and therefore correct
+the fault, the component needs to be included in triplicate (2oo3, triple
+modular redundancy), or more (such as 3oo5).
+The comparison module can then decide by majority voting which output is the
+right one and which component has failed.
+
+NOTE: This implies that the comparison module has to fulfill high integrity
+requirements with respect to absence of systematic faults.
+
+Redundant hardware can be used either as a fallback component which is idle
+until the primary component fails (assuming full replication), or during normal
+execution and in parallel with the primary system (certain models also support
+partial replication), in order to detect or correct faults in the components.
+
+In the former case the functionality of the fallback system is typically a
+subset of the main system providing degraded operation and it is used only in a
+time-limited fashion until the primary system is back online or to put the
+system into a safe state.
+
+Hardware redundancy is often implemented by replicating a processor and
+comparing the output of the computation at the end of a task's execution.
+If, instead, the task is initiated in the same, or offset, clock cycle on
+identical processors and the state of the execution is compared at regular
+intervals, or steps, it is referred to as lockstep computing.
+The definition of the step can vary, e.g. a set number of clock cycles, or
+delineated by specific events such as writes to memory, interrupts, or other
+events off-core.
+
+For lockstep computing an additional comparison module is required to perform
+regular checks of the state of the systems.
+Such module must also be intrinsically redundant or tolerant to the relevant
+fault modes affecting the redundant components.
+
+Time redundancy can additionally be introduced into lockstep execution by
+starting the redundant tasks not in the same clock cycle but offset by a number
+of cycles.
+This method avoids common-cause failures due to transient faults.
+
+Examples of lock-step include dual core lock-step (DCLS), as an implementation
+of 1oo2, and triple core lock-step (TCLS), which is 2oo3.
+The cores may be initiated in the same clock cycle or offset.
+
+In the context of RISC-V there may be additional levels of diversity of
+redundancy that can be considered.
+Since RISC-V is an open instruction set architecture (ISA) several different
+implementations exist by different independent providers.
+Different implementations could be combined in a system to provide a level of
+diverse (microarchitectural) redundancy.
+However, while this kind of redundancy will detect (and possibly correct)
+failures due to the implementation, it assumes absence of faults in the ISA
+itself.
+
+Traditionally, heterogenous systems imply diversity of ISAs. Diversity of ISA
+has the added benefit of introducing a certain degree of diversity in the
+software as well, since different tools, such as compilers, are required to
+program for the processor.
+In the case of using different RISC-V implementations it is therefore worth
+considering using different compiler implementations for each replicated
+processor, unless other measures (like qualification kits etc.) exist.
+
+Whether the level of redundancy achieved by combining RISC-V implementations,
+and perhaps using different toolchains, is sufficient depends on the safety
+level and domain of the system
+
+[#sec:redundancy:safety:level]
+#### Level
+
+Both the core and the whole SoC can be replicated for redundancy purposes.
+
+[#sec:redundancy:safety:importance]
+#### Importance
+
+While most safety standards do not require redundancy, to achieve the levels of
+fault tolerance required at higher safety levels redundancy is one of the key
+techniques employed. This can be seen in functional safety standards such as
+ISO 26262 and IEC 61508.
+
+In the context of ISO 26262 cite:[iso26262:2018], redundancy is integral to
+"`ASIL tailoring`" i.e. the decomposition of safety requirements to redundant
+architectural elements,
+whereby, if absence of dependent failures can be demonstrated, the ASIL
+allocated to the redundant elements can be reduced.
+
+Similarly IEC 61508-2:2010 cite:[iec16508-2:2010] provides guidance for the maximum
+diagnostics coverage of a variety of redundancy techniques, and
+IEC 61508-6:2010 cite:[iec16508-6:2010] Annex B provides guidance for evaluating probabilities of hardware failure that includes various redundant voting schemas.
+In addition, IEC 61508-2:2010 cite:[iec16508-2:2010] Annex E defines normative
+requirements for:
+"`__Special architecture requirements for digital integrated circuits (ICs) with
+on-chip redundancy__`" to avoid common cause failures for ICs that share the same
+substrate.
+
+[#sec:redundancy:safety:justification]
+#### Justification
+
+Redundancy is often the only mechanism to detect errors and remain operational
+to the extent required by systems with high safety levels.
+
+Basic redundancy can improve integrity by providing a method for error detection
+and eventually correction (both could by accompanied by degradation of main
+functionality).
+If the redundancy is further increased the system can also show improved fault
+tolerance and hence reliability, since single faults are corrected as long as
+they do not lead to common cause failures, which would need diversity in
+addition.
+In the case of a primary-backup setup availability can be said to increase,
+since the backup component may be available even if the primary component has
+failed.
+
+ISO 26262:5 cite:[iso26262-5:2018] mentions redundancy as a safety mechanism, with typical diagnostic
+coverage considered achievable described as "`High`".
+
+ISO 26262:11 cite:[iso26262-11:2018] also specifically mentions diverse redundancy as a tool to reduce
+risk of hardware failures when using IP with limited documentation and
+insufficient historic (aka "`proven in use`") data.
+
+Error detection/correction modes are described in
+ISO 26262:11 cite:[iso26262-11:2018] as a technique to detect failures in
+memory.
+
+[#sec:redundancy:rv]
+### RISC-V solutions
+
+Given that redundancy is intended to be completely transparent, no RISC-V
+specific features have been devised to our knowledge.
+However, it has to be noted that control- and capture-interfaces will add to
+register-interface (core and uncore-IP), and consequently a standardized minimal
+set (ideally mapped against safety requirements from various standards), will
+improve consideration of RISC-V by Safety-System vendors.
+
+[#sec:redundancy:recom]
+### Recommendations
+
+Redundancy is intended to be completely transparent, hence no changes to the ISA
+are required.
+
+[#sec:redundancy:activities]
+### Relevant activities
+
+#### Related external bodies
+
+None.
+
+#### Related chapters
+
+Potentially the error management chapter (to be released), for errors detected
+and/or corrected by means of redundancy.
+For instance, to program actions to take upon unrecoverable errors, and to
+collect statistics about corrected errors.
diff --git a/src/fusa-whitepaper.adoc b/src/fusa-whitepaper.adoc
@@ -79,6 +79,8 @@ include::chapters/pmc.adoc[]
 
 include::chapters/qos.adoc[]
 
+include::chapters/redundancy.adoc[]
+
 // The index must precede the bibliography
 include::index.adoc[]
 

diff --git a/src/fusa-whitepaper.bib b/src/fusa-whitepaper.bib
@@ -151,3 +151,31 @@ @techreport{ft-qosid:2023
   institution = {RISC-V International},
   year = {2023}
 }
+
+% REF9
+@techreport{iec16508-2:2010,
+  title = {{IEC}16508-2:2010 {F}unctional safety of electrical/electronic/programmable electronic safety-related systems - {P}art 2: {R}equirements for electrical/electronic/programmable electronic safety-related systems},
+  institution = {International Electrotechnical Commission (IEC), available at https://webstore.iec.ch},
+  year = {2010}
+}
+
+% REF10
+@techreport{iec16508-6:2010,
+  title = {{IEC} 16508-2:2010 {F}unctional safety of electrical/electronic/programmable electronic safety-related systems - {P}art 6: {G}uidelines on the application of {IEC} 61508-2 and {IEC} 61508-3},
+  institution = {International Electrotechnical Commission (IEC), available at https://webstore.iec.ch},
+  year = {2010}
+}
+
+#REF11
+@techreport{iso26262-5:2018,
+  title = {{ISO} 26262-5:2018 {R}oad vehicles - {F}unctional safety, {P}art 5: Product development at the hardware level},
+  institution = {International Organization for Standardization (ISO), avalaible at https://www.iso.org/standard/68387.html},
+  year = 2018
+}
+
+#REF12
+@techreport{iso26262-11:2018,
+  title = {{ISO} 26262-11:2018 {R}oad vehicles - {F}unctional safety, {P}art 11: Guidelines on application of {ISO} 26262 to semiconductors},
+  institution = {International Organization for Standardization (ISO), avalaible at https://www.iso.org/standard/69604.html},
+  year = 2018
+}