OTERA

OTERA: Online Test Strategies for Reliable Reconfigurable Architectures

Dynamically reconfigurable architectures enable a major acceleration of diverse applications by changing and optimizing the structure of the system at runtime. Permanent and transient faults threaten the correct operation of such an architecture. This project aims to increase dependability of runtime reconfigurable systems by a novel system-level strategy for online tests and online adaptation to an impaired state. This will be achieved by (a) scheduling such that tests for reconfigurable resources are executed with minimal performance impact, (b) resource management such that partially faulty resources are used for components which do not require the faulty elements, and (c) online monitoring and error checking. To ensure reliable runtime reconfiguration, each reconfiguration process is thoroughly tested by a novel and efficient combination of online structural and functional tests. Compared to existing fault-tolerance approaches, our proposal avoids the large hardware overhead of structural redundancy schemes. The saved resources are available for further application acceleration. Still, the proposed scheme covers faults in the fabric, in the reconfigured application logic and errors in the process of reconfiguration.

10.2010 - 06.2017, DFG-Project: WU 245/10-1, 10-2, 10-3

The project in detail:

In the framework of the SPP 1500 priority program, this project contributes to

Dependable Hardware Architectures,
Design Methods and
Operation, Observation and Adaptation.

Motivation

Dynamically reconfigurable architectures can adapt and optimize themselves at runtime according to the current system state, load, and application. This enables a major acceleration of diverse applications with low hardware overhead. To achieve the desired execution of these applications, the underlying reconfigurable fabric must provide a high degree of reliability, i.e. error-free operation. In addition to classical fault models found in static VLSI hardware, new types of faults threaten the operation of such reconfigurable architectures and need to be considered. The reconfigured application module may suffer not only from permanent faults in the underlying reconfigurable fabric (e.g. due to aging) but also from errors during the reconfiguration process and operation. In particular, an erroneous reconfiguration or a transient fault affecting the configuration memory changes the structure of the reconfigured application module.

Effective methods for a manufacturing test of FPGA fabric exist, i.e. the tests can be performed in a reasonable time while covering the entire configurable fabric. Reconfigurability of the system can also be exploited to conduct efficient and thorough tests both of the fabric and the application logic. In case of faults, mitigation is possible by adaptation to the impaired state. This requires that the system is capable of detecting and diagnosing faults at runtime, and taking an appropriate adaptation decision.

Altogether, these approaches target non-reconfigurable systems that are implemented using a reconfigurable fabric. The reconfigurability can be used to increase the reliability in these cases. However, techniques are missing for runtime reconfigurable systems that exploit partial runtime reconfiguration as part of their normal operation. Here, it is crucial to ensure a reliable runtime reconfiguration that can be applied efficiently with low overhead, online in the field, with a limited number of hardware resources. Therefore, the OTERA project aims to realize a reliable reconfiguration and system adaptation that additionally provides high resource efficiency, for instance by utilizing partially faulty FPGA structures, i.e. by selectively reconfiguring into a partially faulty area those application modules that do not require the faulty parts.

Goals

This project aims to increase the dependability of reconfigurable systems by a novel online system-level strategy for reliable runtime reconfiguration and system adaptation. Over all three phases (3 x 2 years), the approach ensures that:

Errors are detected concurrently and can be contained (do not spread system-wide),
Faults are detected in the reconfigurable fabric and reconfigured application logic to ensure correct completion of a reconfiguration,
Root causes of detected errors are determined by diagnosis,
Potential future errors are predicted (based on recent errors, online monitoring, and system load),
Reliable system operation is achieved by the runtime system that dynamically schedules test routines (while trading test coverage and system stress due to testing), and
Adaptation to an impaired state is managed by the runtime system with minimal impact on the application performance.

This work is supported by the German Research Foundation (DFG) under grant WU 245/10-1 (2011-2012), WU 245/10-2 (2013-2014), and WU 245/10-3 (2015-2016).

Books and Book Chapters

2019
1. Advances in Hardware Reliability of Reconfigurable Many-core Embedded Systems. Lars Bauer; Hongyan Zhang; Michael A. Kochte; Eric Schneider; Hans-Joachim Wunderlich and Jörg Henkel. In Many-Core Computing: Hardware and software, B. M. Al-Hashimi and G. V. Merrett (eds.). Institution of Engineering and Technology (IET), 2019, pp. 395–416. DOI: https://doi.org/10.1049/PBPC022E_ch16
  Abstract
  The chapter discusses the background for the most demanding dependability challenges for reconfigurable processors in many-core systems and presents a dependable runtime reconfigurable processor for high reliability. It uses an adaptive modular redundancy technique that guarantees an application-specified level of reliability under changing SEU rates by budgeting the effective critical bits among all kernels and all accelerators of an application. This allows reconfigurable processors to be deployed in harsh environments without statically protecting them.
  BibTeX
  @inbook{BauerZKSWH2019, abstract = {The chapter discusses the background for the most demanding dependability challenges for reconfigurable processors in many-core systems and presents a dependable runtime reconfigurable processor for high reliability. It uses an adaptive modular redundancy technique that guarantees an application-specified level of reliability under changing SEU rates by budgeting the effective critical bits among all kernels and all accelerators of an application. This allows reconfigurable processors to be deployed in harsh environments without statically protecting them.}, abteilung = {rabook}, author = {Bauer, Lars and Zhang, Hongyan and Kochte, Michael A. and Schneider, Eric and Wunderlich, Hans-Joachim and Henkel, Jörg}, booktitle = {Many-Core Computing: Hardware and software}, chapter = {Chapter 16}, doi = {10.1049/PBPC022E_ch16}, editor = {Al-Hashimi, B. M. and Merrett, G. V.}, isbn = {978-1-78561-582-5}, pages = {395--416}, project = {OTERA}, publisher = {Institution of Engineering and Technology (IET)}, series = {Computing}, title = {{Advances in Hardware Reliability of Reconfigurable Many-core Embedded Systems}}, url = {https://digital-library.theiet.org/content/books/10.1049/pbpc022e_ch16}, year = 2019 }
  Link
  https://digital-library.theiet.org/content/books/10.1049/pbpc022e_ch16
  DOI
  10.1049/PBPC022E_ch16

Journals and Conference Proceedings

2017
1. Aging Resilience and Fault Tolerance in Runtime Reconfigurable Architectures. Hongyan Zhang; Lars Bauer; Michael A. Kochte; Eric Schneider; Hans-Joachim Wunderlich and Jörg Henkel. IEEE Transactions on Computers 66, (June 2017), pp. 957–970. DOI: https://doi.org/10.1109/TC.2016.2616405
  Abstract
  Runtime reconfigurable architectures based on Field-Programmable Gate Arrays (FPGAs) allow area- and power-efficient acceleration of complex applications. However, being manufactured in latest semiconductor process technologies, FPGAs are increasingly prone to aging effects, which reduce the reliability and lifetime of such systems. Aging mitigation and fault tolerance techniques for the reconfigurable fabric become essential to realize dependable reconfigurable architectures. This article presents an accelerator diversification method that creates multiple configurations for runtime reconfigurable accelerators that are diversified in their usage of Configurable Logic Blocks (CLBs). In particular, it creates a minimal number of configurations such that all single-CLB and some multi-CLB faults can be tolerated. For each fault we ensure that there is at least one configuration that does not use that CLB. Secondly, a novel runtime accelerator placement algorithm is presented that exploits the diversity in resource usage of these configurations to balance the stress imposed by executions of the accelerators on the reconfigurable fabric. By tracking the stress due to accelerator usage at runtime, the stress is balanced both within a reconfigurable region as well as over all reconfigurable regions of the system. The accelerator placement algorithm also considers faulty CLBs in the regions and selects the appropriate configuration such that the system maintains a high performance in presence of multiple permanent faults. Experimental results demonstrate that our methods deliver up to 3.7x higher performance in presence of faults at marginal runtime costs and 1.6x higher MTTF than state-of-the-art aging mitigation methods.
  BibTeX
  @article{ZhangBKSWH2017, abstract = {Runtime reconfigurable architectures based on Field-Programmable Gate Arrays (FPGAs) allow area- and power-efficient acceleration of complex applications. However, being manufactured in latest semiconductor process technologies, FPGAs are increasingly prone to aging effects, which reduce the reliability and lifetime of such systems. Aging mitigation and fault tolerance techniques for the reconfigurable fabric become essential to realize dependable reconfigurable architectures. This article presents an accelerator diversification method that creates multiple configurations for runtime reconfigurable accelerators that are diversified in their usage of Configurable Logic Blocks (CLBs). In particular, it creates a minimal number of configurations such that all single-CLB and some multi-CLB faults can be tolerated. For each fault we ensure that there is at least one configuration that does not use that CLB. Secondly, a novel runtime accelerator placement algorithm is presented that exploits the diversity in resource usage of these configurations to balance the stress imposed by executions of the accelerators on the reconfigurable fabric. By tracking the stress due to accelerator usage at runtime, the stress is balanced both within a reconfigurable region as well as over all reconfigurable regions of the system. The accelerator placement algorithm also considers faulty CLBs in the regions and selects the appropriate configuration such that the system maintains a high performance in presence of multiple permanent faults. Experimental results demonstrate that our methods deliver up to 3.7x higher performance in presence of faults at marginal runtime costs and 1.6x higher MTTF than state-of-the-art aging mitigation methods. }, abteilung = {ra}, author = {Zhang, Hongyan and Bauer, Lars and Kochte, Michael A. 
and Schneider, Eric and Wunderlich, Hans-Joachim and Henkel, Jörg}, day = 1, doi = {10.1109/TC.2016.2616405}, journal = {IEEE Transactions on Computers}, month = {06}, number = 6, owner = {haefneht}, pages = {957--970}, project = {OTERA}, title = {{Aging Resilience and Fault Tolerance in Runtime Reconfigurable Architectures}}, url = {https://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2017/TC_ZhangBKSWH2017.pdf}, volume = 66, year = 2017 }
  Link
  https://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2017/TC_ZhangBKSWH2017.pdf
  DOI
  10.1109/TC.2016.2616405
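The diversification property stated in the abstract above (for every single-CLB fault there is at least one configuration that does not use that CLB) can be illustrated with a small sketch. The set-based representation of a configuration, the CLB coordinates, and the function names `tolerates_all_single_faults` and `usable_configurations` are assumptions made for illustration only; this is not the algorithm from the article.

```python
from typing import FrozenSet, List, Set, Tuple

CLB = Tuple[int, int]            # hypothetical (column, row) coordinate of a CLB
Configuration = FrozenSet[CLB]   # the CLBs that one configuration occupies


def tolerates_all_single_faults(configs: List[Configuration], region: Set[CLB]) -> bool:
    """True if every CLB in the region is left unused by at least one configuration."""
    return all(any(clb not in cfg for cfg in configs) for clb in region)


def usable_configurations(configs: List[Configuration], faulty: Set[CLB]) -> List[Configuration]:
    """Configurations that avoid every CLB currently marked as faulty."""
    return [cfg for cfg in configs if cfg.isdisjoint(faulty)]


# Toy region of four CLBs and two configurations that occupy disjoint halves.
region = {(0, 0), (0, 1), (1, 0), (1, 1)}
config_a = frozenset({(0, 0), (0, 1)})
config_b = frozenset({(1, 0), (1, 1)})

covered = tolerates_all_single_faults([config_a, config_b], region)
survivors = usable_configurations([config_a, config_b], faulty={(0, 1)})
```

In this toy case a fault in CLB (0, 1) leaves only `config_b` usable, which is the situation a fault-aware placement step would have to handle at runtime.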
2016
1. Functional Diagnosis for Graceful Degradation of NoC Switches. Atefe Dalirsani and Hans-Joachim Wunderlich. In Proceedings of the 25th IEEE Asian Test Symposium (ATS′16), Hiroshima, Japan, 2016, pp. 246–251. DOI: https://doi.org/10.1109/ATS.2016.18
  Abstract
  Reconfigurable Networks-on-Chip (NoCs) allow discarding the corrupted ports of a defective switch instead of deactivating it entirely, and thus enable fine-grained reconfiguration of the network, making the NoC structures more robust. A prerequisite for such a fine-grained reconfiguration is to identify the corrupted port of a faulty switch. This paper presents a functional diagnosis approach which extracts structural fault information from functional tests and utilizes this information to identify the broken functions/ports of a defective switch. The broken parts are discarded while the remaining functions are used for the normal operation. The non-intrusive method introduced is independent of the switch architecture and the NoC topology and can be applied for any type of structural fault. The diagnostic resolution of the functional test is so high that for nearly 64% of the faults in the example switch only a single port has to be switched off. As the remaining parts stay completely functional, the impact of faults on throughput and performance is minimized.
  BibTeX
  @inproceedings{DalirW2016, abstract = {Reconfigurable Networks-on-Chip (NoCs) allow discarding the corrupted ports of a defective switch instead of deactivating it entirely, and thus enable fine-grained reconfiguration of the network, making the NoC structures more robust. A prerequisite for such a fine-grained reconfiguration is to identify the corrupted port of a faulty switch. This paper presents a functional diagnosis approach which extracts structural fault information from functional tests and utilizes this information to identify the broken functions/ports of a defective switch. The broken parts are discarded while the remaining functions are used for the normal operation. The non-intrusive method introduced is independent of the switch architecture and the NoC topology and can be applied for any type of structural fault. The diagnostic resolution of the functional test is so high that for nearly 64% of the faults in the example switch only a single port has to be switched off. As the remaining parts stay completely functional, the impact of faults on throughput and performance is minimized.}, abteilung = {ra}, address = {Hiroshima, Japan}, author = {Dalirsani, Atefe and Wunderlich, Hans-Joachim}, booktitle = {Proceedings of the 25th IEEE Asian Test Symposium (ATS'16)}, day = {21--24}, doi = {10.1109/ATS.2016.18}, location = {Hiroshima, Japan}, month = {11}, owner = {hellmelr}, pages = {246--251}, project = {OTERA}, title = {{Functional Diagnosis for Graceful Degradation of NoC Switches}}, url = {https://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2016/ATS_DalirW2016.pdf}, year = 2016 }
  Link
  https://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2016/ATS_DalirW2016.pdf
  DOI
  10.1109/ATS.2016.18
2015
1. STRAP: Stress-Aware Placement for Aging Mitigation in Runtime Reconfigurable Architectures. Hongyan Zhang; Michael A. Kochte; Eric Schneider; Lars Bauer; Hans-Joachim Wunderlich and Jörg Henkel. In Proceedings of the 34th IEEE/ACM International Conference on Computer-Aided Design (ICCAD′15), Austin, Texas, USA, 2015, pp. 38–45.
  Abstract
  Aging effects in nano-scale CMOS circuits impair the reliability and Mean Time to Failure (MTTF) of embedded systems. Especially for FPGAs that are manufactured in the latest technology node, aging is a major concern. We introduce the first cross-layer aging-aware placement method for accelerators in FPGA-based runtime reconfigurable architectures. It optimizes stress distribution by accelerator placement at runtime, i.e. to which reconfigurable region an accelerator shall be reconfigured. Additionally, it optimizes logic placement at synthesis time to diversify the resource usage of individual accelerators, i.e. which CLBs of a reconfigurable region shall be used by an accelerator. Both layers together balance the intra- and inter-region stress induced by the application workload at negligible performance cost. Experimental results show significant reduction of maximum stress of up to 64% and 35%, which leads to up to 177% and 14% MTTF improvement relative to state-of-the-art methods w.r.t. HCI and BTI aging, respectively.
  BibTeX
  @inproceedings{ZhangKSBWH2015, abstract = {Aging effects in nano-scale CMOS circuits impair the reliability and Mean Time to Failure (MTTF) of embedded systems. Especially for FPGAs that are manufactured in the latest technology node, aging is a major concern. We introduce the first cross-layer aging-aware placement method for accelerators in FPGA-based runtime reconfigurable architectures. It optimizes stress distribution by accelerator placement at runtime, i.e. to which reconfigurable region an accelerator shall be reconfigured. Additionally, it optimizes logic placement at synthesis time to diversify the resource usage of individual accelerators, i.e. which CLBs of a reconfigurable region shall be used by an accelerator. Both layers together balance the intra- and inter-region stress induced by the application workload at negligible performance cost. Experimental results show significant reduction of maximum stress of up to 64% and 35%, which leads to up to 177% and 14% MTTF improvement relative to state-of-the-art methods w.r.t. HCI and BTI aging, respectively.}, abteilung = {ra}, address = {Austin, Texas, USA}, author = {Zhang, Hongyan and Kochte, Michael A. and Schneider, Eric and Bauer, Lars and Wunderlich, Hans-Joachim and Henkel, Jörg}, booktitle = {Proceedings of the 34th IEEE/ACM International Conference on Computer-Aided Design (ICCAD'15)}, day = {2--6}, isbn = {978-1-4673-8389-9}, location = {Austin, Texas, USA}, month = {11}, owner = {haefneht}, pages = {38--45}, project = {OTERA}, title = {{STRAP: Stress-Aware Placement for Aging Mitigation in Runtime Reconfigurable Architectures}}, url = {https://doi.org/10.1109/ICCAD.2015.7372547}, year = 2015 }
  Link
  https://doi.org/10.1109/ICCAD.2015.7372547
2. Adaptive Multi-Layer Techniques for Increased System Dependability. Lars Bauer; Jörg Henkel; Andreas Herkersdorf; Michael A. Kochte; Johannes M. Kühn; Wolfgang Rosenstiel; Thomas Schweizer; Stefan Wallentowitz; Volker Wenzel; Thomas Wild; Hans-Joachim Wunderlich and Hongyan Zhang. it - Information Technology 57, (June 2015), pp. 149–158. DOI: https://doi.org/10.1515/itit-2014-1082
  Abstract
  Achieving system-level dependability is a demanding task. The manifold requirements and dependability threats can no longer be statically addressed at individual abstraction layers. Instead, all components of future multi-processor systems-on-chip (MPSoCs) have to contribute to this common goal in an adaptive manner. In this paper we target a generic heterogeneous MPSoC that combines general purpose processors along with dedicated application-specific hard-wired accelerators, fine-grained reconfigurable processors, and coarse-grained reconfigurable architectures. We present different reactive and proactive measures at the layers of the runtime system (online resource management), system architecture (global communication), micro architecture (individual tiles), and gate netlist (tile-internal circuits) to address dependability threats.
  BibTeX
  @article{BauerHHKKRSWWWWZ2015, abstract = {Achieving system-level dependability is a demanding task. The manifold requirements and dependability threats can no longer be statically addressed at individual abstraction layers. Instead, all components of future multi-processor systems-on-chip (MPSoCs) have to contribute to this common goal in an adaptive manner. In this paper we target a generic heterogeneous MPSoC that combines general purpose processors along with dedicated application-specific hard-wired accelerators, fine-grained reconfigurable processors, and coarse-grained reconfigurable architectures. We present different reactive and proactive measures at the layers of the runtime system (online resource management), system architecture (global communication), micro architecture (individual tiles), and gate netlist (tile-internal circuits) to address dependability threats.}, abteilung = {ra}, author = {Bauer, Lars and Henkel, Jörg and Herkersdorf, Andreas and Kochte, Michael A. and Kühn, Johannes M. and Rosenstiel, Wolfgang and Schweizer, Thomas and Wallentowitz, Stefan and Wenzel, Volker and Wild, Thomas and Wunderlich, Hans-Joachim and Zhang, Hongyan}, day = 8, doi = {10.1515/itit-2014-1082}, issn = {1611-2776}, journal = {it - Information Technology}, month = {06}, number = 3, owner = {hellmelr}, pages = {149--158}, project = {OTERA}, title = {{Adaptive Multi-Layer Techniques for Increased System Dependability}}, url = {https://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2015/ITIT_BauerHHKKRSWWWWZ2015.pdf}, volume = 57, year = 2015 }
  Link
  https://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2015/ITIT_BauerHHKKRSWWWWZ2015.pdf
  DOI
  10.1515/itit-2014-1082
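The runtime half of the stress-aware placement described in the STRAP abstract above (deciding into which reconfigurable region an accelerator is loaded) can be pictured with a simple greedy rule. The sketch below is an assumption for illustration, not the published STRAP algorithm: it always picks the free region with the lowest accumulated stress, and the region names, the abstract stress units, and the function `place_accelerator` are hypothetical.

```python
from typing import Dict, List


def place_accelerator(region_stress: Dict[str, float],
                      free_regions: List[str],
                      added_stress: float) -> str:
    """Pick the free region with the lowest accumulated stress and charge it."""
    if not free_regions:
        raise ValueError("no free reconfigurable region available")
    # Greedy choice: least-stressed region first (an illustrative heuristic).
    target = min(free_regions, key=lambda r: region_stress[r])
    region_stress[target] += added_stress
    return target


# Three regions with different stress histories (arbitrary units).
stress = {"R0": 5.0, "R1": 2.0, "R2": 3.5}
first = place_accelerator(stress, ["R0", "R1", "R2"], added_stress=2.0)
second = place_accelerator(stress, ["R0", "R1", "R2"], added_stress=2.0)
```

Repeated placements gradually even out the stress values across regions, which is the inter-region balancing effect the abstract refers to; intra-region balancing via diversified CLB usage is not modeled here.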
2014
1. GUARD: GUAranteed Reliability in Dynamically Reconfigurable Systems. Hongyan Zhang; Michael A. Kochte; Michael E. Imhof; Lars Bauer; Hans-Joachim Wunderlich and Jörg Henkel. In Proceedings of the 51st ACM/EDAC/IEEE Design Automation Conference (DAC′14), San Francisco, California, USA, 2014, pp. 1–6. DOI: https://doi.org/10.1145/2593069.2593146
  Abstract
  Soft errors are a reliability threat for reconfigurable systems implemented with SRAM-based FPGAs. They can be handled through fault tolerance techniques like scrubbing and modular redundancy. However, selecting these techniques statically at design or compile time tends to be pessimistic and prohibits optimal adaptation to changing soft error rate at runtime. We present the GUARD method which allows for autonomous runtime reliability management in reconfigurable architectures: Based on the error rate observed during runtime, the runtime system dynamically determines whether a computation should be executed by a hardened processor, or whether it should be accelerated by inherently less reliable reconfigurable hardware which can trade off performance and reliability. GUARD is the first runtime system for reconfigurable architectures that guarantees a target reliability while optimizing the performance. This allows applications to dynamically choose the desired degree of reliability. Compared to related work with statically optimized fault tolerance techniques, GUARD provides up to 68.3% higher performance at the same target reliability.
  BibTeX
  @inproceedings{ZhangKIBWH2014, abstract = {Soft errors are a reliability threat for reconfigurable systems implemented with SRAM-based FPGAs. They can be handled through fault tolerance techniques like scrubbing and modular redundancy. However, selecting these techniques statically at design or compile time tends to be pessimistic and prohibits optimal adaptation to changing soft error rate at runtime. We present the GUARD method which allows for autonomous runtime reliability management in reconfigurable architectures: Based on the error rate observed during runtime, the runtime system dynamically determines whether a computation should be executed by a hardened processor, or whether it should be accelerated by inherently less reliable reconfigurable hardware which can trade off performance and reliability. GUARD is the first runtime system for reconfigurable architectures that guarantees a target reliability while optimizing the performance. This allows applications to dynamically choose the desired degree of reliability. Compared to related work with statically optimized fault tolerance techniques, GUARD provides up to 68.3% higher performance at the same target reliability.}, abteilung = {ra}, address = {San Francisco, California, USA}, author = {Zhang, Hongyan and Kochte, Michael A. and Imhof, Michael E. and Bauer, Lars and Wunderlich, Hans-Joachim and Henkel, Jörg}, award = {HiPEAC Paper Award}, booktitle = {Proceedings of the 51st ACM/EDAC/IEEE Design Automation Conference (DAC'14)}, comment = {ISBN: 978-1-4503-2730-5}, day = {1--5}, doi = {10.1145/2593069.2593146}, location = {San Francisco, California, USA}, month = {06}, owner = {imhofml}, pages = {1--6}, project = {OTERA}, title = {{GUARD: GUAranteed Reliability in Dynamically Reconfigurable Systems}}, url = {https://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2014/DAC_ZhangKIBWH2014.pdf}, year = 2014 }
  Link
  https://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2014/DAC_ZhangKIBWH2014.pdf
  DOI
  10.1145/2593069.2593146
2. Resilience Articulation Point (RAP): Cross-layer Dependability Modeling for Nanometer System-on-chip Resilience. Andreas Herkersdorf; Hananeh Aliee; Michael Engel; Michael Glaß; Christina Gimmler-Dumont; Jörg Henkel; Veit B. Kleeberger; Michael A. Kochte; Johannes M. Kühn; Daniel Mueller-Gritschneder; Sani R. Nassif; Holm Rauchfuss; Wolfgang Rosenstiel; Ulf Schlichtmann; Muhammad Shafique; Mehdi B. Tahoori; Jürgen Teich; Norbert Wehn; Christian Weis and Hans-Joachim Wunderlich. Elsevier Microelectronics Reliability Journal 54, (2014), pp. 1066–1074. DOI: https://doi.org/10.1016/j.microrel.2013.12.012
  Abstract
  The Resilience Articulation Point (RAP) model aims at provisioning researchers and developers with a probabilistic fault abstraction and error propagation framework covering all hardware/software layers of a System on Chip. RAP assumes that physically induced faults at the technology or CMOS device layer will eventually manifest themselves as a single or multiple bit flip(s). When probabilistic error functions for specific fault origins are known at the bit or signal level, knowledge about the unit of design and its environment allow the transformation of the bit-related error functions into characteristic higher layer representations, such as error functions for data words, Finite State Machine (FSM) state, macro-interfaces or software variables. Thus, design concerns at higher abstraction layers can be investigated without the necessity to further consider the full details of lower levels of design. This paper introduces the ideas of RAP based on examples of radiation induced soft errors in SRAM cells, voltage variations and sequential CMOS logic. It shows by example how probabilistic bit flips are systematically abstracted and propagated towards higher abstraction levels up to the application software layer, and how RAP can be used to parameterize architecture-level resilience methods.
  BibTeX
  @article{HerkeAEGGHKKKMNRRSSTTWWW2014, abstract = {The Resilience Articulation Point (RAP) model aims at provisioning researchers and developers with a probabilistic fault abstraction and error propagation framework covering all hardware/software layers of a System on Chip. RAP assumes that physically induced faults at the technology or CMOS device layer will eventually manifest themselves as a single or multiple bit flip(s). When probabilistic error functions for specific fault origins are known at the bit or signal level, knowledge about the unit of design and its environment allow the transformation of the bit-related error functions into characteristic higher layer representations, such as error functions for data words, Finite State Machine (FSM) state, macro-interfaces or software variables. Thus, design concerns at higher abstraction layers can be investigated without the necessity to further consider the full details of lower levels of design. This paper introduces the ideas of RAP based on examples of radiation induced soft errors in SRAM cells, voltage variations and sequential CMOS logic. It shows by example how probabilistic bit flips are systematically abstracted and propagated towards higher abstraction levels up to the application software layer, and how RAP can be used to parameterize architecture-level resilience methods. }, abteilung = {ra}, author = {Herkersdorf, Andreas and Aliee, Hananeh and Engel, Michael and Glaß, Michael and Gimmler-Dumont, Christina and Henkel, Jörg and Kleeberger, Veit B. and Kochte, Michael A. and Kühn, Johannes M. and Mueller-Gritschneder, Daniel and Nassif, Sani R. and Rauchfuss, Holm and Rosenstiel, Wolfgang and Schlichtmann, Ulf and Shafique, Muhammad and Tahoori, Mehdi B. 
and Teich, Jürgen and Wehn, Norbert and Weis, Christian and Wunderlich, Hans-Joachim}, day = {June--July}, doi = {10.1016/j.microrel.2013.12.012}, journal = {Elsevier Microelectronics Reliability Journal}, number = {6--7}, owner = {haefneht}, pages = {1066--1074}, project = {OTERA}, title = {{Resilience Articulation Point (RAP): Cross-layer Dependability Modeling for Nanometer System-on-chip Resilience}}, url = {https://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2014/Elsevier_HerkeAEGGHKKKMNRRSSTTWWW2014.pdf}, volume = 54, year = 2014 }
  Link
  https://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2014/Elsevier_HerkeAEGGHKKKMNRRSSTTWWW2014.pdf
  DOI
  10.1016/j.microrel.2013.12.012
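The runtime decision described in the GUARD abstract above (hardened processor versus reconfigurable accelerator, depending on the observed error rate) can be sketched as follows. The reliability model is an assumption chosen for illustration, not taken from the paper: soft errors are treated as a Poisson process, so the probability that no critical bit is hit during one execution is exp(-rate x bits x time). The function names, parameters, and numeric values are hypothetical.

```python
import math


def accelerator_reliability(error_rate: float, critical_bits: int, exec_time: float) -> float:
    """Probability of no soft error in the critical bits during one execution.

    Assumes a Poisson error model with `error_rate` in errors per bit per second.
    """
    return math.exp(-error_rate * critical_bits * exec_time)


def choose_executor(error_rate: float, critical_bits: int,
                    exec_time: float, target: float) -> str:
    """Use the accelerator only if it meets the target; otherwise fall back."""
    if accelerator_reliability(error_rate, critical_bits, exec_time) >= target:
        return "accelerator"
    return "hardened_processor"


# Same accelerator and target reliability, two different observed error rates.
low = choose_executor(error_rate=1e-12, critical_bits=100_000, exec_time=0.01, target=0.999999)
high = choose_executor(error_rate=1e-6, critical_bits=100_000, exec_time=0.01, target=0.999999)
```

With the low error rate the accelerator meets the target and is selected; with the high rate the sketch falls back to the hardened processor. A real runtime system would additionally weigh redundancy and scrubbing options, which are left out here.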
2013
1. Module Diversification: Fault Tolerance and Aging Mitigation for Runtime Reconfigurable Architectures. Hongyan Zhang; Lars Bauer; Michael A. Kochte; Eric Schneider; Claus Braun; Michael E. Imhof; Hans-Joachim Wunderlich and Jörg Henkel. In Proceedings of the IEEE International Test Conference (ITC′13), Anaheim, California, USA, 2013. DOI: https://doi.org/10.1109/TEST.2013.6651926
  Abstract
  Runtime reconfigurable architectures based on Field-Programmable Gate Arrays (FPGAs) are attractive for realizing complex applications. However, being manufactured in latest semiconductor process technologies, FPGAs are increasingly prone to aging effects, which reduce the reliability of such systems and must be tackled by aging mitigation and application of fault tolerance techniques. This paper presents module diversification, a novel design method that creates different configurations for runtime reconfigurable modules. Our method provides fault tolerance by creating the minimal number of configurations such that for any faulty Configurable Logic Block (CLB) there is at least one configuration that does not use that CLB. Additionally, we determine the fraction of time that each configuration should be used to balance the stress and to mitigate the aging process in FPGA-based runtime reconfigurable systems. The generated configurations significantly improve reliability by fault-tolerance and aging mitigation.
  BibTeX
  @inproceedings{ZhangBKSBIWH2013, abstract = {Runtime reconfigurable architectures based on Field-Programmable Gate Arrays (FPGAs) are attractive for realizing complex applications. However, being manufactured in latest semiconductor process technologies, FPGAs are increasingly prone to aging effects, which reduce the reliability of such systems and must be tackled by aging mitigation and application of fault tolerance techniques. This paper presents module diversification, a novel design method that creates different configurations for runtime reconfigurable modules. Our method provides fault tolerance by creating the minimal number of configurations such that for any faulty Configurable Logic Block (CLB) there is at least one configuration that does not use that CLB. Additionally, we determine the fraction of time that each configuration should be used to balance the stress and to mitigate the aging process in FPGA-based runtime reconfigurable systems. The generated configurations significantly improve reliability by fault-tolerance and aging mitigation.}, abteilung = {ra}, address = {Anaheim, California, USA}, author = {Zhang, Hongyan and Bauer, Lars and Kochte, Michael A. and Schneider, Eric and Braun, Claus and Imhof, Michael E. and Wunderlich, Hans-Joachim and Henkel, Jörg}, booktitle = {Proceedings of the IEEE International Test Conference (ITC'13)}, day = {10--12}, doi = {10.1109/TEST.2013.6651926}, location = {Anaheim, California, USA}, month = {09}, owner = {imhofml}, project = {OTERA}, title = {{Module Diversification: Fault Tolerance and Aging Mitigation for Runtime Reconfigurable Architectures}}, url = {https://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2013/ITC_ZhangBKSBIWH2013.pdf}, year = 2013 }
  Link
  https://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2013/ITC_ZhangBKSBIWH2013.pdf
  DOI
  10.1109/TEST.2013.6651926
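The coverage property stated in the abstract above (for any faulty CLB, at least one configuration avoids it) can be illustrated with a small check. This is only a sketch under assumptions: configurations are modeled as plain sets of CLB indices, and the names `tolerates_any_single_fault` and `clb_stress` are hypothetical, not taken from the paper. It does not show how the paper generates a minimal set of configurations or derives the usage fractions.

```python
from typing import Dict, List, Set


def tolerates_any_single_fault(configs: List[Set[int]], clbs: Set[int]) -> bool:
    """Return True if, for every CLB, at least one configuration does not use it."""
    return all(any(clb not in cfg for cfg in configs) for clb in clbs)


def clb_stress(configs: List[Set[int]], fractions: List[float],
               clbs: Set[int]) -> Dict[int, float]:
    """Fraction of time each CLB is in use when configuration i runs for fractions[i]."""
    return {
        clb: sum(f for cfg, f in zip(configs, fractions) if clb in cfg)
        for clb in clbs
    }


# Hypothetical region of four CLBs and two diversified configurations.
region = {0, 1, 2, 3}
diverse = [{0, 1}, {2, 3}]      # no CLB is used by both configurations
overlapping = [{0, 1}, {1, 2}]  # CLB 1 is used by both configurations

print(tolerates_any_single_fault(diverse, region))      # True
print(tolerates_any_single_fault(overlapping, region))  # False
```

With equal usage fractions of 0.5 for the two diversified configurations, `clb_stress` reports every CLB as active half of the time, which is the kind of balance the abstract refers to.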
2. Test Strategies for Reliable Runtime Reconfigurable Architectures. Lars Bauer; Claus Braun; Michael E. Imhof; Michael A. Kochte; Eric Schneider; Hongyan Zhang; Jörg Henkel and Hans-Joachim Wunderlich. IEEE Transactions on Computers 62, 8 (August 2013), pp. 1494–1507. DOI: https://doi.org/10.1109/TC.2013.53
  Abstract
  FPGA-based reconfigurable systems allow the online adaptation to dynamically changing runtime requirements. The reliability of FPGAs, being manufactured in latest technologies, is threatened by soft errors, as well as aging effects and latent defects. To ensure reliable reconfiguration, it is mandatory to guarantee the correct operation of the reconfigurable fabric. This can be achieved by periodic or on-demand online testing. This paper presents a reliable system architecture for runtime-reconfigurable systems, which integrates two non-concurrent online test strategies: Pre-configuration online tests (PRET) and post-configuration online tests (PORT). The PRET checks that the reconfigurable hardware is free of faults by periodic or on-demand tests. The PORT has two objectives: It tests reconfigured hardware units after reconfiguration to check that the configuration process completed correctly and it validates the expected functionality. During operation, PORT is used to periodically check the reconfigured hardware units for malfunctions in the programmable logic. Altogether, this paper presents PRET, PORT, and the system integration of such test schemes into a runtime-reconfigurable system, including the resource management and test scheduling. Experimental results show that the integration of online testing in reconfigurable systems incurs only minimum impact on performance while delivering high fault coverage and low test latency.
  BibTeX
  @article{BauerBIKSZHW2013, abstract = {FPGA-based reconfigurable systems allow the online adaptation to dynamically changing runtime requirements. The reliability of FPGAs, being manufactured in latest technologies, is threatened by soft errors, as well as aging effects and latent defects. To ensure reliable reconfiguration, it is mandatory to guarantee the correct operation of the reconfigurable fabric. This can be achieved by periodic or on-demand online testing. This paper presents a reliable system architecture for runtime-reconfigurable systems, which integrates two non-concurrent online test strategies: Pre-configuration online tests (PRET) and post-configuration online tests (PORT). The PRET checks that the reconfigurable hardware is free of faults by periodic or on-demand tests. The PORT has two objectives: It tests reconfigured hardware units after reconfiguration to check that the configuration process completed correctly and it validates the expected functionality. During operation, PORT is used to periodically check the reconfigured hardware units for malfunctions in the programmable logic. Altogether, this paper presents PRET, PORT, and the system integration of such test schemes into a runtime-reconfigurable system, including the resource management and test scheduling. Experimental results show that the integration of online testing in reconfigurable systems incurs only minimum impact on performance while delivering high fault coverage and low test latency.}, abteilung = {ra}, address = {Los Alamitos, California, USA}, author = {Bauer, Lars and Braun, Claus and Imhof, Michael E. and Kochte, Michael A. and Schneider, Eric and Zhang, Hongyan and Henkel, Jörg and Wunderlich, Hans-Joachim}, doi = {10.1109/TC.2013.53}, issn = {0018-9340}, journal = {IEEE Transactions on Computers}, location = {Los Alamitos, California, USA}, month = {08}, number = 8, owner = {imhofml}, pages = {1494--1507}, project = {OTERA}, publisher = {IEEE Computer Society}, title = {{Test Strategies for Reliable Runtime Reconfigurable Architectures}}, url = {https://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2013/TC_BauerBIKSZHW2013.pdf}, volume = 62, year = 2013 }
  Link
  https://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2013/TC_BauerBIKSZHW2013.pdf
  DOI
  10.1109/TC.2013.53
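The ordering of the two test strategies described in the abstract above (PRET before configuration, PORT after it) can be sketched as a small control flow. This is a minimal illustration under assumptions: the function `reliable_reconfigure`, its callback parameters, and the returned status strings are hypothetical and do not reflect the paper's actual interfaces, resource management, or test scheduling.

```python
from typing import Callable


def reliable_reconfigure(
    run_pret: Callable[[], bool],
    configure: Callable[[], None],
    run_port: Callable[[], bool],
) -> str:
    """Hypothetical flow: test the fabric, reconfigure, then test the result."""
    # PRET: check that the target reconfigurable region is free of faults.
    if not run_pret():
        return "fabric_faulty"
    # Load the configuration into the region that passed the test.
    configure()
    # PORT: check that configuration completed correctly and the unit works.
    if not run_port():
        return "configuration_faulty"
    return "ready"


# Example with stub callbacks standing in for real test and configuration logic.
status = reliable_reconfigure(lambda: True, lambda: None, lambda: True)
print(status)  # ready
```

In a real system the two failure outcomes would feed back into resource management, for example by choosing a different region or repeating the reconfiguration; that part is omitted here.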
2012
1. OTERA: Online Test Strategies for Reliable Reconfigurable Architectures. Lars Bauer; Claus Braun; Michael E. Imhof; Michael A. Kochte; Hongyan Zhang; Hans-Joachim Wunderlich and Jörg Henkel. In Proceedings of the NASA/ESA Conference on Adaptive Hardware and Systems (AHS′12), Erlangen, Germany, 2012, pp. 38–45. DOI: https://doi.org/10.1109/AHS.2012.6268667
  Abstract
  FPGA-based reconfigurable systems allow the online adaptation to dynamically changing runtime requirements. However, the reliability of FPGAs, which are manufactured in latest technologies, is threatened not only by soft errors, but also by aging effects and latent defects. To ensure reliable reconfiguration, it is mandatory to guarantee the correct operation of the underlying reconfigurable fabric. This can be achieved by periodic or on-demand online testing. The OTERA project develops and evaluates components and strategies for reconfigurable systems that feature reliable reconfiguration. The research focus ranges from structural online tests for the FPGA infrastructure and functional online tests for the configured functionality up to the resource management and test scheduling. This paper gives an overview of the project tasks and presents first results.
  BibTeX
  @inproceedings{BauerBIKZWH2012, abstract = {FPGA-based reconfigurable systems allow the online adaptation to dynamically changing runtime requirements. However, the reliability of FPGAs, which are manufactured in latest technologies, is threatened not only by soft errors, but also by aging effects and latent defects. To ensure reliable reconfiguration, it is mandatory to guarantee the correct operation of the underlying reconfigurable fabric. This can be achieved by periodic or on-demand online testing. The OTERA project develops and evaluates components and strategies for reconfigurable systems that feature reliable reconfiguration. The research focus ranges from structural online tests for the FPGA infrastructure and functional online tests for the configured functionality up to the resource management and test scheduling. This paper gives an overview of the project tasks and presents first results.}, abteilung = {ra}, address = {Erlangen, Germany}, author = {Bauer, Lars and Braun, Claus and Imhof, Michael E. and Kochte, Michael A. and Zhang, Hongyan and Wunderlich, Hans-Joachim and Henkel, Jörg}, booktitle = {Proceedings of the NASA/ESA Conference on Adaptive Hardware and Systems (AHS'12)}, day = {25--28}, doi = {10.1109/AHS.2012.6268667}, isbn = {978-1-4673-1915-7}, language = {English}, location = {Erlangen, Germany}, month = {06}, owner = {Thomas}, pages = {38--45}, project = {OTERA}, publisher = {IEEE Computer Society}, title = {{OTERA: Online Test Strategies for Reliable Reconfigurable Architectures}}, url = {https://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2012/AHS_BauerBIKZWH2012.pdf}, year = 2012 }
  Link
  https://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2012/AHS_BauerBIKZWH2012.pdf
  DOI
  10.1109/AHS.2012.6268667
2011
1. Efficient BDD-based Fault Simulation in Presence of Unknown Values. Michael A. Kochte; S. Kundu; Kohei Miyase; Xiaoqing Wen and Hans-Joachim Wunderlich. In Proceedings of the 20th IEEE Asian Test Symposium (ATS′11), New Delhi, India, 2011, pp. 383–388. DOI: https://doi.org/10.1109/ATS.2011.52
  Abstract
  Unknown (X) values, originating from memories, clock domain boundaries or A/D interfaces, may compromise test signatures and fault coverage. Classical logic and fault simulation algorithms are pessimistic w.r.t. the propagation of X values in the circuit. This work proposes efficient hybrid logic and stuck-at fault simulation algorithms which combine heuristics and local BDDs to increase simulation accuracy. Experimental results on benchmark and large industrial circuits show significantly increased fault coverage and low runtime. The achieved simulation precision is quantified for the first time.
  BibTeX
  @inproceedings{KochtKMWW2011, abstract = {Unknown (X) values, originating from memories, clock domain boundaries or A/D interfaces, may compromise test signatures and fault coverage. Classical logic and fault simulation algorithms are pessimistic w.r.t. the propagation of X values in the circuit. This work proposes efficient hybrid logic and stuck-at fault simulation algorithms which combine heuristics and local BDDs to increase simulation accuracy. Experimental results on benchmark and large industrial circuits show significantly increased fault coverage and low runtime. The achieved simulation precision is quantified for the first time.}, abteilung = {ra}, address = {New Delhi, India}, author = {Kochte, Michael A. and Kundu, S. and Miyase, Kohei and Wen, Xiaoqing and Wunderlich, Hans-Joachim}, booktitle = {Proceedings of the 20th IEEE Asian Test Symposium (ATS'11)}, day = {20--23}, doi = {10.1109/ATS.2011.52}, isbn = {978-1-4577-1984-4}, issn = {1081-7735}, language = {English}, location = {New Delhi, India}, month = {11}, owner = {Thomas}, pages = {383--388}, project = {OTERA}, publisher = {IEEE Computer Society}, title = {{Efficient BDD-based Fault Simulation in Presence of Unknown Values}}, url = {https://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2011/ATS_KochtKMWW2011.pdf}, year = 2011 }
  Link
  https://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2011/ATS_KochtKMWW2011.pdf
  DOI
  10.1109/ATS.2011.52
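The pessimism of classical three-valued simulation mentioned in the abstract above can be seen on a tiny reconvergent circuit, y = a AND (NOT a). The sketch below is an assumption-laden illustration: it uses exhaustive enumeration of the X input as a stand-in for exact analysis and is not the hybrid heuristic and BDD algorithm of the paper. All function names are hypothetical.

```python
X = "X"  # unknown value


def and3(a, b):
    """Three-valued AND over {0, 1, X}."""
    if a == 0 or b == 0:
        return 0
    if a == 1 and b == 1:
        return 1
    return X


def not3(a):
    """Three-valued NOT over {0, 1, X}."""
    return X if a == X else 1 - a


def simulate_3valued(a):
    """Classical simulation: each gate is evaluated locally, so the
    correlation between the two copies of the input is lost."""
    return and3(a, not3(a))


def simulate_exact(a):
    """Exact result by enumerating both possible values of an X input."""
    candidates = [0, 1] if a == X else [a]
    results = {and3(v, not3(v)) for v in candidates}
    return results.pop() if len(results) == 1 else X


print(simulate_3valued(X))  # X  (pessimistic)
print(simulate_exact(X))    # 0  (the output is 0 for both values of a)
```

Enumeration grows exponentially with the number of X sources, which is why more compact exact representations are of interest for larger circuits.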
2. Design and Architectures for Dependable Embedded Systems. Jörg Henkel; Lars Bauer; Joachim Becker; Oliver Bringmann; Uwe Brinkschulte; Samarjit Chakraborty; Michael Engel; Rolf Ernst; Hermann Härtig; Lars Hedrich; Andreas Herkersdorf; Rüdiger Kapitza; Daniel Lohmann; Peter Marwedel; Marco Platzner; Wolfgang Rosenstiel; Ulf Schlichtmann; Olaf Spinczyk; Mehdi Tahoori; Jürgen Teich; Norbert Wehn and Hans-Joachim Wunderlich. In Proceedings of the 9th IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis (CODES+ISSS′11), Taipei, Taiwan, 2011, pp. 69–78. DOI: https://doi.org/10.1145/2039370.2039384
  Abstract
  The paper presents an overview of a major research project on dependable embedded systems that has started in Fall 2010 and is running for a projected duration of six years. Aim is a 'dependability co-design' that spans various levels of abstraction in the design process of embedded systems starting from gate level through operating system, applications software to system architecture. In addition, we present a new classification on faults, errors, and failures.
  BibTeX
  @inproceedings{HenkeBBBBCEEHHHKLMPRSSTTWW2011, abstract = {The paper presents an overview of a major research project on dependable embedded systems that has started in Fall 2010 and is running for a projected duration of six years. Aim is a 'dependability co-design' that spans various levels of abstraction in the design process of embedded systems starting from gate level through operating system, applications software to system architecture. In addition, we present a new classification on faults, errors, and failures.}, abteilung = {ra}, address = {Taipei, Taiwan}, author = {Henkel, Jörg and Bauer, Lars and Becker, Joachim and Bringmann, Oliver and Brinkschulte, Uwe and Chakraborty, Samarjit and Engel, Michael and Ernst, Rolf and Härtig, Hermann and Hedrich, Lars and Herkersdorf, Andreas and Kapitza, Rüdiger and Lohmann, Daniel and Marwedel, Peter and Platzner, Marco and Rosenstiel, Wolfgang and Schlichtmann, Ulf and Spinczyk, Olaf and Tahoori, Mehdi and Teich, Jürgen and Wehn, Norbert and Wunderlich, Hans-Joachim}, booktitle = {Proceedings of the 9th IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis (CODES+ISSS'11)}, day = {9--14}, doi = {10.1145/2039370.2039384}, isbn = {978-1-4503-0715-4}, language = {English}, location = {Taipei, Taiwan}, month = {10}, owner = {Thomas}, pages = {69--78}, project = {OTERA}, publisher = {ACM}, title = {{Design and Architectures for Dependable Embedded Systems}}, url = {https://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2011/CODES_ISSS_HenkeBBBBCEEHHHKLMPRSSTTWW2011.pdf}, year = 2011 }
  Link
  https://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2011/CODES_ISSS_HenkeBBBBCEEHHHKLMPRSSTTWW2011.pdf
  DOI
  10.1145/2039370.2039384

Workshop Contributions

2013
1. Cross-Layer Dependability Modeling and Abstraction in Systems on Chip. Andreas Herkersdorf; Michael Engel; Michael Glaß; Jörg Henkel; Veit B. Kleeberger; Michael A. Kochte; Johannes M. Kühn; Sani R. Nassif; Holm Rauchfuss; Wolfgang Rosenstiel; Ulf Schlichtmann; Muhammad Shafique; Mehdi B. Tahoori; Jürgen Teich; Norbert Wehn; Christian Weis and Hans-Joachim Wunderlich. In Selse-9: The 9th Workshop on Silicon Errors in Logic - System Effects, Stanford, California, USA, 2013.
  Abstract
  The Resilience Articulation Point (RAP) model aims at provisioning researchers and developers with a probabilistic fault abstraction and error propagation framework covering all hardware/software layers of a System on Chip. RAP assumes that physically induced faults at the technology or CMOS device layer will eventually manifest themselves as a single or multiple bit flip(s). When probabilistic error functions for specific fault origins are known at the bit or signal level, knowledge about the unit of design and its environment allow the transformation of the bit-related error functions into characteristic higher layer representations, such as error functions for data words, Finite State Machine (FSM) state, macro interfaces or software variables. Thus, design concerns at higher abstraction layers can be investigated without the necessity to further consider the full details of lower levels of design. This paper introduces the ideas of RAP based on examples of radiation induced soft errors in SRAM cells and sequential CMOS logic. It shows by example how probabilistic bit flips are systematically abstracted and propagated towards higher abstraction levels up to the application software layer, and how RAP can be used to parameterize architecture level resilience methods.
  BibTeX
  @inproceedings{HerkersdorfEGHKKKNRRSSTTWWW2013, abstract = {The Resilience Articulation Point (RAP) model aims at provisioning researchers and developers with a probabilistic fault abstraction and error propagation framework covering all hardware/software layers of a System on Chip. RAP assumes that physically induced faults at the technology or CMOS device layer will eventually manifest themselves as a single or multiple bit flip(s). When probabilistic error functions for specific fault origins are known at the bit or signal level, knowledge about the unit of design and its environment allow the transformation of the bit-related error functions into characteristic higher layer representations, such as error functions for data words, Finite State Machine (FSM) state, macro interfaces or software variables. Thus, design concerns at higher abstraction layers can be investigated without the necessity to further consider the full details of lower levels of design. This paper introduces the ideas of RAP based on examples of radiation induced soft errors in SRAM cells and sequential CMOS logic. It shows by example how probabilistic bit flips are systematically abstracted and propagated towards higher abstraction levels up to the application software layer, and how RAP can be used to parameterize architecture level resilience methods.}, abteilung = {rawork}, address = {Stanford, California, USA}, author = {Herkersdorf, Andreas and Engel, Michael and Glaß, Michael and Henkel, Jörg and Kleeberger, Veit B. and Kochte, Michael A. and Kühn, Johannes M. and Nassif, Sani R. and Rauchfuss, Holm and Rosenstiel, Wolfgang and Schlichtmann, Ulf and Shafique, Muhammad and Tahoori, Mehdi B. and Teich, Jürgen and Wehn, Norbert and Weis, Christian and Wunderlich, Hans-Joachim}, booktitle = {Selse-9: The 9th Workshop on Silicon Errors in Logic - System Effects}, day = {26--27}, location = {Stanford, California, USA}, month = {03}, owner = {hellmelr}, project = {OTERA}, title = {{Cross-Layer Dependability Modeling and Abstraction in Systems on Chip}}, year = 2013 }
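The idea in the abstract above of lifting a bit-level error function to a data-word error function can be illustrated with the simplest possible case. This sketch assumes that bit flips are independent and equally likely for every bit, which is an assumption made here for illustration and not a claim about the RAP model itself. The function name `word_error_probability` is hypothetical.

```python
def word_error_probability(p_bit: float, n_bits: int) -> float:
    """Probability that at least one bit of an n-bit word flips,
    assuming independent flips with identical per-bit probability p_bit."""
    return 1.0 - (1.0 - p_bit) ** n_bits


# Example: a 2-bit word with a per-bit flip probability of 0.5.
print(word_error_probability(0.5, 2))  # 0.75
```

Under these assumptions the word-level error probability is 1 - (1 - p)^n; correlated or multi-bit upsets would need a different error function at the bit level.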
2012
1. Fault Modeling in Testing. Stefan Holst; Michael A. Kochte and Hans-Joachim Wunderlich. In RAP Day Workshop, DFG SPP 1500, Munich, Germany, 2012.
  BibTeX
  @inproceedings{HolstKW2012, abteilung = {rawork}, address = {Munich, Germany}, author = {Holst, Stefan and Kochte, Michael A. and Wunderlich, Hans-Joachim}, booktitle = {RAP Day Workshop, DFG SPP 1500}, day = 21, location = {Munich, Germany}, month = {12}, owner = {hellmelr}, project = {OTERA}, title = {{Fault Modeling in Testing}}, year = 2012 }
