OTERA: Online Test Strategies for Reliable Reconfigurable Architectures

Since October 2010, DFG project: WU 245/10-1, 10-2, 10-3

Project Description

Dynamically reconfigurable architectures enable a major acceleration of diverse applications by changing and optimizing the structure of the system at runtime. Permanent and transient faults threaten the correct operation of such an architecture. This project aims to increase dependability of runtime reconfigurable systems by a novel system-level strategy for online tests and online adaptation to an impaired state. This will be achieved by (a) scheduling such that tests for reconfigurable resources are executed with minimal performance impact, (b) resource management such that partially faulty resources are used for components which do not require the faulty elements, and (c) online monitoring and error checking. To ensure reliable runtime reconfiguration, each reconfiguration process is thoroughly tested by a novel and efficient combination of online structural and functional tests.
Compared to existing fault-tolerance approaches, our proposal avoids the large hardware overhead of structural redundancy schemes. The saved resources are available for further application acceleration. Still, the proposed scheme covers faults in the fabric, in the reconfigured application logic and errors in the process of reconfiguration.

In the framework of the SPP 1500 priority program, this project contributes to
- Dependable Hardware Architectures,
- Design Methods and
- Operation, Observation and Adaptation.

Motivation

Dynamically reconfigurable architectures can adapt and optimize themselves at runtime according to the current system state, load, and application. This enables a major acceleration of diverse applications with low hardware overhead. To achieve the desired execution of these applications, the underlying reconfigurable fabric must provide a high degree of reliability, i.e. error-free operation. In addition to the classical fault models found in static VLSI hardware, new types of faults threaten the operation of such reconfigurable architectures and need to be considered. A reconfigured application module may suffer not only from permanent faults in the underlying reconfigurable fabric (e.g. due to aging) but also from errors during the reconfiguration process and during operation. In particular, an erroneous reconfiguration or a transient fault affecting the configuration memory changes the structure of the reconfigured application module.
Effective methods for a manufacturing test of FPGA fabric exist, i.e. the tests can be performed in a reasonable time while covering the entire configurable fabric. Reconfigurability of the system can also be exploited to conduct efficient and thorough tests both of the fabric and the application logic. In case of faults, mitigation is possible by adaptation to the impaired state. This requires that the system is capable of detecting and diagnosing faults at runtime, and taking an appropriate adaptation decision.
Altogether, these approaches target non-reconfigurable systems that are implemented using a reconfigurable fabric. The reconfigurability can be used to increase the reliability in these cases. However, techniques are missing for runtime reconfigurable systems that exploit partial runtime reconfiguration as part of their normal operation. Here, it is crucial to ensure reliable runtime reconfiguration that can be applied efficiently with low overhead, online in the field, and with a limited number of hardware resources. Therefore, the OTERA project aims to realize reliable reconfiguration and system adaptation that additionally provide high resource efficiency, for instance by utilizing partially faulty FPGA structures, i.e. by selectively reconfiguring into a partially faulty area those application modules that do not require the faulty parts.

Goals

This project aims to increase the dependability of reconfigurable systems by a novel online system-level strategy for reliable runtime reconfiguration and system adaptation. Over all three phases (3 x 2 years), the approach ensures that:

  • Errors are detected concurrently and can be contained (do not spread system-wide),
  • Faults are detected in the reconfigurable fabric and reconfigured application logic to ensure correct completion of a reconfiguration,
  • Root causes of detected errors are determined by diagnosis,
  • Potential future errors are predicted (based on recent errors, online monitoring, and system load),
  • Reliable system operation is achieved by the runtime system that dynamically schedules test routines (while trading test coverage and system stress due to testing), and
  • Adaptation to an impaired state is managed by the runtime system with minimal impact on the application performance.

This work is supported by the German Research Foundation (DFG) under grants WU 245/10-1 (2011-2012), WU 245/10-2 (2013-2014), and WU 245/10-3 (2015-2016).


Activities
  • H.-J. Wunderlich: "Fault Tolerance Meets Diagnosis", Keynote at the 21st IEEE International On-Line Testing Symposium (IOLTS), Elia, Halkidiki, Greece, July 6-8, 2015

Publications

Journals and Conference Proceedings
13. Aging Resilience and Fault Tolerance in Runtime Reconfigurable Architectures
Zhang, H., Bauer, L., Kochte, M.A., Schneider, E., Wunderlich, H.-J. and Henkel, J.
IEEE Transactions on Computers
Vol. 66(6), 1 June 2017, pp. 957-970
2017
Keywords: Runtime reconfiguration, aging mitigation, fault-tolerance, resilience, graceful degradation, FPGA
Abstract: Runtime reconfigurable architectures based on Field-Programmable Gate Arrays (FPGAs) allow area- and power-efficient acceleration of complex applications. However, being manufactured in latest semiconductor process technologies, FPGAs are increasingly prone to aging effects, which reduce the reliability and lifetime of such systems. Aging mitigation and fault tolerance techniques for the reconfigurable fabric become essential to realize dependable reconfigurable architectures. This article presents an accelerator diversification method that creates multiple configurations for runtime reconfigurable accelerators that are diversified in their usage of Configurable Logic Blocks (CLBs). In particular, it creates a minimal number of configurations such that all single-CLB and some multi-CLB faults can be tolerated. For each fault we ensure that there is at least one configuration that does not use that CLB.
Secondly, a novel runtime accelerator placement algorithm is presented that exploits the diversity in resource usage of these configurations to balance the stress imposed by executions of the accelerators on the reconfigurable fabric. By tracking the stress due to accelerator usage at runtime, the stress is balanced both within a reconfigurable region as well as over all reconfigurable regions of the system. The accelerator placement algorithm also considers faulty CLBs in the regions and selects the appropriate configuration such that the system maintains a high performance in presence of multiple permanent faults.
Experimental results demonstrate that our methods deliver up to 3.7x higher performance in presence of faults at marginal runtime costs and 1.6x higher MTTF than state-of-the-art aging mitigation methods.
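To make the placement idea in this abstract concrete, the following is a minimal sketch, not the algorithm from the article. It assumes a hypothetical model in which each reconfigurable region tracks accumulated stress per CLB, each accelerator configuration is given as the set of CLB indices it uses, and stress grows by the execution time of the accelerator. The names `Region`, `place`, and `exec_time` are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """One reconfigurable region (hypothetical model): stress is tracked per CLB index."""
    num_clbs: int
    faulty: set = field(default_factory=set)
    stress: list = field(default_factory=list)

    def __post_init__(self):
        if not self.stress:
            self.stress = [0.0] * self.num_clbs


def place(regions, configurations, exec_time):
    """Pick a (region index, configuration index) pair that avoids faulty CLBs
    and minimizes the resulting peak CLB stress of the chosen region.
    Returns None if no configuration fits into any region."""
    best = None
    for r_idx, region in enumerate(regions):
        for c_idx, used in enumerate(configurations):
            if used & region.faulty:
                continue  # this configuration would need a faulty CLB
            peak = max(
                region.stress[i] + (exec_time if i in used else 0.0)
                for i in range(region.num_clbs)
            )
            if best is None or peak < best[0]:
                best = (peak, r_idx, c_idx)
    if best is None:
        return None
    _, r_idx, c_idx = best
    for i in configurations[c_idx]:
        regions[r_idx].stress[i] += exec_time  # record the added stress
    return r_idx, c_idx


# Two regions with four CLBs each; CLB 0 of the first region is faulty.
regions = [Region(4, faulty={0}), Region(4)]
# Two diversified configurations of the same accelerator.
configurations = [{0, 1}, {2, 3}]
placements = [place(regions, configurations, 1.0) for _ in range(3)]
```

In this toy setup, successive calls spread executions over both regions and both configurations, and the configuration that needs the faulty CLB is never placed in the first region. A real system would also have to account for reconfiguration cost, which this sketch ignores.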
BibTeX:
@article{ZhangBKSWH2017,
  author = {Zhang, Hongyan and Bauer, Lars and Kochte, Michael A. and Schneider, Eric and Wunderlich, Hans-Joachim and Henkel, Jörg},
  title = {{Aging Resilience and Fault Tolerance in Runtime Reconfigurable Architectures}},
  journal = {IEEE Transactions on Computers},
  year = {2017},
  volume = {66},
  number = {6},
  pages = {957--970},
  keywords = {Runtime reconfiguration, aging mitigation, fault-tolerance, resilience, graceful degradation, FPGA},
  abstract = {Runtime reconfigurable architectures based on Field-Programmable Gate Arrays (FPGAs) allow area- and power-efficient acceleration of complex applications. However, being manufactured in latest semiconductor process technologies, FPGAs are increasingly prone to aging effects, which reduce the reliability and lifetime of such systems. Aging mitigation and fault tolerance techniques for the reconfigurable fabric become essential to realize dependable reconfigurable architectures. This article presents an accelerator diversification method that creates multiple configurations for runtime reconfigurable accelerators that are diversified in their usage of Configurable Logic Blocks (CLBs). In particular, it creates a minimal number of configurations such that all single-CLB and some multi-CLB faults can be tolerated. For each fault we ensure that there is at least one configuration that does not use that CLB.
Secondly, a novel runtime accelerator placement algorithm is presented that exploits the diversity in resource usage of these configurations to balance the stress imposed by executions of the accelerators on the reconfigurable fabric. By tracking the stress due to accelerator usage at runtime, the stress is balanced both within a reconfigurable region as well as over all reconfigurable regions of the system. The accelerator placement algorithm also considers faulty CLBs in the regions and selects the appropriate configuration such that the system maintains a high performance in presence of multiple permanent faults.
Experimental results demonstrate that our methods deliver up to 3.7x higher performance in presence of faults at marginal runtime costs and 1.6x higher MTTF than state-of-the-art aging mitigation methods.},
  doi = {http://dx.doi.org/10.1109/TC.2016.2616405},
  file = {http://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2017/TC_ZhangBKSWH2017.pdf}
}
12. Functional Diagnosis for Graceful Degradation of NoC Switches
Dalirsani, A. and Wunderlich, H.-J.
Proceedings of the 25th IEEE Asian Test Symposium (ATS'16), Hiroshima, Japan, 21-24 November 2016, pp. 246-251
2016
Keywords: Functional test, functional failure mode, fault classification, functional diagnosis, pattern generation, fine-grained reconfiguration
Abstract: Reconfigurable Networks-on-Chip (NoCs) allow discarding the corrupted ports of a defective switch instead of deactivating it entirely, and thus enable fine-grained reconfiguration of the network, making the NoC structures more robust. A prerequisite for such a fine-grained reconfiguration is to identify the corrupted port of a faulty switch. This paper presents a functional diagnosis approach which extracts structural fault information from functional tests and utilizes this information to identify the broken functions/ports of a defective switch. The broken parts are discarded while the remaining functions are used for the normal operation. The non-intrusive method introduced is independent of the switch architecture and the NoC topology and can be applied for any type of structural fault. The diagnostic resolution of the functional test is so high that for nearly 64% of the faults in the example switch only a single port has to be switched off. As the remaining parts stay completely functional, the impact of faults on throughput and performance is minimized.
BibTeX:
@inproceedings{DalirW2016,
  author = {Dalirsani, Atefe and Wunderlich, Hans-Joachim},
  title = {{Functional Diagnosis for Graceful Degradation of NoC Switches}},
  booktitle = {Proceedings of the 25th IEEE Asian Test Symposium (ATS'16)},
  year = {2016},
  pages = {246--251},
  keywords = {Functional test, functional failure mode, fault classification, functional diagnosis, pattern generation, fine-grained reconfiguration},
  abstract = {Reconfigurable Networks-on-Chip (NoCs) allow discarding the corrupted ports of a defective switch instead of deactivating it entirely, and thus enable fine-grained reconfiguration of the network, making the NoC structures more robust. A prerequisite for such a fine-grained reconfiguration is to identify the corrupted port of a faulty switch. This paper presents a functional diagnosis approach which extracts structural fault information from functional tests and utilizes this information to identify the broken functions/ports of a defective switch. The broken parts are discarded while the remaining functions are used for the normal operation. The non-intrusive method introduced is independent of the switch architecture and the NoC topology and can be applied for any type of structural fault. The diagnostic resolution of the functional test is so high that for nearly 64% of the faults in the example switch only a single port has to be switched off. As the remaining parts stay completely functional, the impact of faults on throughput and performance is minimized.},
  doi = {http://dx.doi.org/10.1109/ATS.2016.18},
  file = {http://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2016/ATS_DalirW2016.pdf}
}
11. STRAP: Stress-Aware Placement for Aging Mitigation in Runtime Reconfigurable Architectures
Zhang, H., Kochte, M.A., Schneider, E., Bauer, L., Wunderlich, H.-J. and Henkel, J.
Proceedings of the 34th IEEE/ACM International Conference on Computer-Aided Design (ICCAD'15), Austin, Texas, USA, 2-6 November 2015, pp. 38-45
2015
Abstract: Aging effects in nano-scale CMOS circuits impair the reliability and Mean Time to Failure (MTTF) of embedded systems. Especially for FPGAs that are manufactured in the latest technology node, aging is a major concern. We introduce the first cross-layer aging-aware placement method for accelerators in FPGA-based runtime reconfigurable architectures. It optimizes stress distribution by accelerator placement at runtime, i.e. to which reconfigurable region an accelerator shall be reconfigured. Additionally, it optimizes logic placement at synthesis time to diversify the resource usage of individual accelerators, i.e. which CLBs of a reconfigurable region shall be used by an accelerator. Both layers together balance the intra- and inter-region stress induced by the application workload at negligible performance cost. Experimental results show significant reduction of maximum stress of up to 64% and 35%, which leads to up to 177% and 14% MTTF improvement relative to state-of-the-art methods w.r.t. HCI and BTI aging, respectively.
BibTeX:
@inproceedings{ZhangKSBWH2015,
  author = {Zhang, Hongyan and Kochte, Michael A. and Schneider, Eric and Bauer, Lars and Wunderlich, Hans-Joachim and Henkel, Jörg},
  title = {{STRAP: Stress-Aware Placement for Aging Mitigation in Runtime Reconfigurable Architectures}},
  booktitle = {Proceedings of the 34th IEEE/ACM International Conference on Computer-Aided Design (ICCAD'15)},
  year = {2015},
  pages = {38--45},
  abstract = {Aging effects in nano-scale CMOS circuits impair the reliability and Mean Time to Failure (MTTF) of embedded systems. Especially for FPGAs that are manufactured in the latest technology node, aging is a major concern. We introduce the first cross-layer aging-aware placement method for accelerators in FPGA-based runtime reconfigurable architectures. It optimizes stress distribution by accelerator placement at runtime, i.e. to which reconfigurable region an accelerator shall be reconfigured. Additionally, it optimizes logic placement at synthesis time to diversify the resource usage of individual accelerators, i.e. which CLBs of a reconfigurable region shall be used by an accelerator. Both layers together balance the intra- and inter-region stress induced by the application workload at negligible performance cost. Experimental results show significant reduction of maximum stress of up to 64% and 35%, which leads to up to 177% and 14% MTTF improvement relative to state-of-the-art methods w.r.t. HCI and BTI aging, respectively.},
  url = {http://dl.acm.org/citation.cfm?id=2840825},
  file = {http://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2015/ICCAD_ZhangKSBWH2015.pdf}
}
10. Adaptive Multi-Layer Techniques for Increased System Dependability
Bauer, L., Henkel, J., Herkersdorf, A., Kochte, M.A., Kühn, J.M., Rosenstiel, W., Schweizer, T., Wallentowitz, S., Wenzel, V., Wild, T., Wunderlich, H.-J. and Zhang, H.
it - Information Technology
Vol. 57(3), 8 June 2015, pp. 149-158
2015
Keywords: Dependability, fault tolerance, graceful degradation, aging mitigation, online test and error detection, thermal management, multi-core architecture, reconfigurable architecture
Abstract: Achieving system-level dependability is a demanding task. The manifold requirements and dependability threats can no longer be statically addressed at individual abstraction layers. Instead, all components of future multi-processor systems-on-chip (MPSoCs) have to contribute to this common goal in an adaptive manner.
In this paper we target a generic heterogeneous MPSoC that combines general purpose processors along with dedicated application-specific hard-wired accelerators, fine-grained reconfigurable processors, and coarse-grained reconfigurable architectures. We present different reactive and proactive measures at the layers of the runtime system (online resource management), system architecture (global communication), micro architecture (individual tiles), and gate netlist (tile-internal circuits) to address dependability threats.
BibTeX:
@article{BauerHHKKRSWWWWZ2015,
  author = {Bauer, Lars and Henkel, Jörg and Herkersdorf, Andreas and Kochte, Michael A. and Kühn, Johannes M. and Rosenstiel, Wolfgang and Schweizer, Thomas and Wallentowitz, Stefan and Wenzel, Volker and Wild, Thomas and Wunderlich, Hans-Joachim and Zhang, Hongyan},
  title = {{Adaptive Multi-Layer Techniques for Increased System Dependability}},
  journal = {it - Information Technology},
  year = {2015},
  volume = {57},
  number = {3},
  pages = {149--158},
  keywords = {Dependability, fault tolerance, graceful degradation, aging mitigation, online test and error detection, thermal management, multi-core architecture, reconfigurable architecture},
  abstract = {Achieving system-level dependability is a demanding task. The manifold requirements and dependability threats can no longer be statically addressed at individual abstraction layers. Instead, all components of future multi-processor systems-on-chip (MPSoCs) have to contribute to this common goal in an adaptive manner.
In this paper we target a generic heterogeneous MPSoC that combines general purpose processors along with dedicated application-specific hard-wired accelerators, fine-grained reconfigurable processors, and coarse-grained reconfigurable architectures. We present different reactive and proactive measures at the layers of the runtime system (online resource management), system architecture (global communication), micro architecture (individual tiles), and gate netlist (tile-internal circuits) to address dependability threats.},
  doi = {http://dx.doi.org/10.1515/itit-2014-1082},
  file = {http://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2015/ITIT_BauerHHKKRSWWWWZ2015.pdf}
}
9. GUARD: GUAranteed Reliability in Dynamically Reconfigurable Systems
Zhang, H., Kochte, M.A., Imhof, M.E., Bauer, L., Wunderlich, H.-J. and Henkel, J.
Proceedings of the 51st ACM/EDAC/IEEE Design Automation Conference (DAC'14), San Francisco, California, USA, 1-5 June 2014, pp. 1-6
HiPEAC Paper Award
2014
Abstract: Soft errors are a reliability threat for reconfigurable systems implemented with SRAM-based FPGAs. They can be handled through fault tolerance techniques like scrubbing and modular redundancy. However, selecting these techniques statically at design or compile time tends to be pessimistic and prohibits optimal adaptation to changing soft error rate at runtime.
We present the GUARD method which allows for autonomous runtime reliability management in reconfigurable architectures: Based on the error rate observed during runtime, the runtime system dynamically determines whether a computation should be executed by a hardened processor, or whether it should be accelerated by inherently less reliable reconfigurable hardware which can trade off performance and reliability. GUARD is the first runtime system for reconfigurable architectures that guarantees a target reliability while optimizing the performance. This allows applications to dynamically choose the desired degree of reliability. Compared to related work with statically optimized fault tolerance techniques, GUARD provides up to 68.3% higher performance at the same target reliability.
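The runtime decision described in this abstract can be illustrated with a small sketch. It is not the GUARD implementation: it assumes a Poisson soft-error model in which a run of duration t at observed error rate r is error-free with probability exp(-r * v * t), where v is a hypothetical per-variant vulnerability factor (0 for a hardened processor). The variant names and all numbers are made up for illustration.

```python
import math


def pick_variant(variants, error_rate, target_reliability):
    """Return the name of the fastest variant whose estimated reliability
    meets the target. Each variant is (name, exec_time, vulnerability)."""
    feasible = []
    for name, exec_time, vulnerability in variants:
        # Probability of an error-free run under the assumed Poisson model.
        reliability = math.exp(-error_rate * vulnerability * exec_time)
        if reliability >= target_reliability:
            feasible.append((exec_time, name))
    if not feasible:
        raise ValueError("no variant meets the target reliability")
    return min(feasible)[1]  # shortest execution time wins


# Hypothetical variants: (name, execution time, vulnerability factor).
variants = [
    ("hardened_cpu", 10.0, 0.0),        # slow, assumed immune to soft errors
    ("accelerator", 1.0, 1.0),          # fast, fully exposed
    ("accelerator_redundant", 2.0, 0.1),  # slower, partially protected
]

low = pick_variant(variants, error_rate=0.001, target_reliability=0.99)
medium = pick_variant(variants, error_rate=0.04, target_reliability=0.99)
high = pick_variant(variants, error_rate=1.0, target_reliability=0.99)
```

With these assumed numbers, the unprotected accelerator is chosen at a low error rate, the protected accelerator at a moderate one, and the hardened processor once neither accelerator variant can meet the target.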
BibTeX:
@inproceedings{ZhangKIBWH2014,
  author = {Zhang, Hongyan and Kochte, Michael A. and Imhof, Michael E. and Bauer, Lars and Wunderlich, Hans-Joachim and Henkel, Jörg},
  title = {{GUARD: GUAranteed Reliability in Dynamically Reconfigurable Systems}},
  booktitle = {Proceedings of the 51st ACM/EDAC/IEEE Design Automation Conference (DAC'14)},
  year = {2014},
  pages = {1--6},
  abstract = {Soft errors are a reliability threat for reconfigurable systems implemented with SRAM-based FPGAs. They can be handled through fault tolerance techniques like scrubbing and modular redundancy. However, selecting these techniques statically at design or compile time tends to be pessimistic and prohibits optimal adaptation to changing soft error rate at runtime.
We present the GUARD method which allows for autonomous runtime reliability management in reconfigurable architectures: Based on the error rate observed during runtime, the runtime system dynamically determines whether a computation should be executed by a hardened processor, or whether it should be accelerated by inherently less reliable reconfigurable hardware which can trade off performance and reliability. GUARD is the first runtime system for reconfigurable architectures that guarantees a target reliability while optimizing the performance. This allows applications to dynamically choose the desired degree of reliability. Compared to related work with statically optimized fault tolerance techniques, GUARD provides up to 68.3% higher performance at the same target reliability.},
  doi = {http://dx.doi.org/10.1145/2593069.2593146},
  file = {http://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2014/DAC_ZhangKIBWH2014.pdf}
}
8. Resilience Articulation Point (RAP): Cross-layer Dependability Modeling for Nanometer System-on-chip Resilience
Herkersdorf, A., Aliee, H., Engel, M., Glaß, M., Gimmler-Dumont, C., Henkel, J., Kleeberger, V.B., Kochte, M.A., Kühn, J.M., Mueller-Gritschneder, D., Nassif, S.R., Rauchfuss, H., Rosenstiel, W., Schlichtmann, U., Shafique, M., Tahoori, M.B., Teich, J., Wehn, N., Weis, C. and Wunderlich, H.-J.
Elsevier Microelectronics Reliability Journal
Vol. 54(6-7), June-July 2014, pp. 1066-1074
2014
Keywords: Cross-layer SoC resilience, probabilistic dependability modeling, SRAM error models, critical charge, transient soft errors, permanent aging defects, error abstraction, error transformation, system-level failure analysis, resilience articulation point
Abstract: The Resilience Articulation Point (RAP) model aims at provisioning researchers and developers with a probabilistic fault abstraction and error propagation framework covering all hardware/software layers of a System on Chip. RAP assumes that physically induced faults at the technology or CMOS device layer will eventually manifest themselves as a single or multiple bit flip(s). When probabilistic error functions for specific fault origins are known at the bit or signal level, knowledge about the unit of design and its environment allow the transformation of the bit-related error functions into characteristic higher layer representations, such as error functions for data words, Finite State Machine (FSM) state, macro-interfaces or software variables. Thus, design concerns at higher abstraction layers can be investigated without the necessity to further consider the full details of lower levels of design. This paper introduces the ideas of RAP based on examples of radiation induced soft errors in SRAM cells, voltage variations and sequential CMOS logic. It shows by example how probabilistic bit flips are systematically abstracted and propagated towards higher abstraction levels up to the application software layer, and how RAP can be used to parameterize architecture-level resilience methods.
BibTeX:
@article{HerkeAEGGHKKKMNRRSSTTWWW2014,
  author = {Herkersdorf, Andreas and Aliee, Hananeh and Engel, Michael and Glaß, Michael and Gimmler-Dumont, Christina and Henkel, Jörg and Kleeberger, Veit B. and Kochte, Michael A. and Kühn, Johannes M. and Mueller-Gritschneder, Daniel and Nassif, Sani R. and Rauchfuss, Holm and Rosenstiel, Wolfgang and Schlichtmann, Ulf and Shafique, Muhammad and Tahoori, Mehdi B. and Teich, Jürgen and Wehn, Norbert and Weis, Christian and Wunderlich, Hans-Joachim },
  title = {{Resilience Articulation Point (RAP): Cross-layer Dependability Modeling for Nanometer System-on-chip Resilience}},
  journal = {Elsevier Microelectronics Reliability Journal},
  year = {2014},
  volume = {54},
  number = {6--7},
  pages = {1066--1074},
  keywords = {Cross-layer SoC resilience, probabilistic dependability modeling, SRAM error models, critical charge, transient soft errors, permanent aging defects, error abstraction, error transformation, system-level failure analysis, resilience articulation point},
  abstract = {The Resilience Articulation Point (RAP) model aims at provisioning researchers and developers with a probabilistic fault abstraction and error propagation framework covering all hardware/software layers of a System on Chip. RAP assumes that physically induced faults at the technology or CMOS device layer will eventually manifest themselves as a single or multiple bit flip(s). When probabilistic error functions for specific fault origins are known at the bit or signal level, knowledge about the unit of design and its environment allow the transformation of the bit-related error functions into characteristic higher layer representations, such as error functions for data words, Finite State Machine (FSM) state, macro-interfaces or software variables. Thus, design concerns at higher abstraction layers can be investigated without the necessity to further consider the full details of lower levels of design. This paper introduces the ideas of RAP based on examples of radiation induced soft errors in SRAM cells, voltage variations and sequential CMOS logic. It shows by example how probabilistic bit flips are systematically abstracted and propagated towards higher abstraction levels up to the application software layer, and how RAP can be used to parameterize architecture-level resilience methods. },
  doi = {http://dx.doi.org/10.1016/j.microrel.2013.12.012},
  file = {http://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2014/Elsevier_HerkeAEGGHKKKMNRRSSTTWWW2014.pdf}
}
7. SAT-based Code Synthesis for Fault-Secure Circuits
Dalirsani, A., Kochte, M.A. and Wunderlich, H.-J.
Proceedings of the 16th IEEE Symp. Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT'13), New York City, NY, USA, 2-4 October 2013, pp. 38-44
2013
Keywords: Concurrent error detection (CED), error control coding, self-checking circuit, totally self-checking (TSC)
Abstract: This paper presents a novel method for synthesizing fault-secure circuits based on parity codes over groups of circuit outputs. The fault-secure circuit is able to detect all errors resulting from combinational and transition faults at a single node. The original circuit is not modified. If the original circuit is non-redundant, the result is a totally self-checking circuit. At first, the method creates the minimum number of parity groups such that the effect of each fault is not masked in at least one parity group. To ensure fault-secureness, the obtained groups are split such that no fault leads to silent data corruption. This is performed by a formal Boolean satisfiability (SAT) based analysis. Since the proposed method reduces the number of required parity groups, the number of two-rail checkers and the complexity of the prediction logic required for fault-secureness decreases as well. Experimental results show that the area overhead is much less compared to duplication and less in comparison to previous methods for synthesis of totally self-checking circuits. Since the original circuit is not modified, the method can be applied for fixed hard macros and IP cores.
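The role of the parity groups in this abstract can be illustrated with a small check. The sketch below is not the SAT-based analysis from the paper: it simply enumerates a given list of error patterns, where each pattern is the set of circuit outputs that a fault flips for some input. A parity group observes an error only if an odd number of its outputs are flipped. The function names and the example patterns are assumptions made for illustration.

```python
def detected(error_outputs, parity_groups):
    """True if at least one parity group sees an odd number of erroneous outputs."""
    return any(len(error_outputs & group) % 2 == 1 for group in parity_groups)


def undetected_patterns(error_patterns, parity_groups):
    """Return the non-empty error patterns that are masked in every parity group."""
    return [p for p in error_patterns if p and not detected(p, parity_groups)]


# Hypothetical error patterns over four outputs, numbered 0 to 3.
patterns = [{0}, {0, 1}, {1, 2, 3}]

# A single parity bit over all outputs masks the double error {0, 1}.
single_group = [{0, 1, 2, 3}]
masked_before = undetected_patterns(patterns, single_group)

# Splitting the group separates outputs 0 and 1, so every pattern is detected.
split_groups = [{0, 2}, {1, 3}]
masked_after = undetected_patterns(patterns, split_groups)
```

Here `masked_before` contains the pattern `{0, 1}` and `masked_after` is empty, which mirrors the idea of splitting parity groups until no fault effect remains masked.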
BibTeX:
@inproceedings{DalirKW2013,
  author = {Dalirsani, Atefe and Kochte, Michael A. and Wunderlich, Hans-Joachim},
  title = {{SAT-based Code Synthesis for Fault-Secure Circuits}},
  booktitle = {Proceedings of the 16th IEEE Symp. Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT'13)},
  year = {2013},
  pages = {38--44},
  keywords = {Concurrent error detection (CED), error control coding, self-checking circuit, totally self-checking (TSC)},
  abstract = {This paper presents a novel method for synthesizing fault-secure circuits based on parity codes over groups of circuit outputs. The fault-secure circuit is able to detect all errors resulting from combinational and transition faults at a single node. The original circuit is not modified. If the original circuit is non-redundant, the result is a totally self-checking circuit. At first, the method creates the minimum number of parity groups such that the effect of each fault is not masked in at least one parity group. To ensure fault-secureness, the obtained groups are split such that no fault leads to silent data corruption. This is performed by a formal Boolean satisfiability (SAT) based analysis. Since the proposed method reduces the number of required parity groups, the number of two-rail checkers and the complexity of the prediction logic required for fault-secureness decreases as well. Experimental results show that the area overhead is much less compared to duplication and less in comparison to previous methods for synthesis of totally self-checking circuits. Since the original circuit is not modified, the method can be applied for fixed hard macros and IP cores.},
  url = {http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6653580},
  doi = {http://dx.doi.org/10.1109/DFT.2013.6653580},
  file = {http://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2013/DFTS_DalirKW2013.pdf}
}
6. Module Diversification: Fault Tolerance and Aging Mitigation for Runtime Reconfigurable Architectures
Zhang, H., Bauer, L., Kochte, M.A., Schneider, E., Braun, C., Imhof, M.E., Wunderlich, H.-J. and Henkel, J.
Proceedings of the IEEE International Test Conference (ITC'13), Anaheim, California, USA, 10-12 September 2013
2013
Keywords: Reliability, online test, fault-tolerance, aging mitigation, partial runtime reconfiguration, FPGA
Abstract: Runtime reconfigurable architectures based on Field-Programmable Gate Arrays (FPGAs) are attractive for realizing complex applications. However, being manufactured in latest semiconductor process technologies, FPGAs are increasingly prone to aging effects, which reduce the reliability of such systems and must be tackled by aging mitigation and application of fault tolerance techniques. This paper presents module diversification, a novel design method that creates different configurations for runtime reconfigurable modules. Our method provides fault tolerance by creating the minimal number of configurations such that for any faulty Configurable Logic Block (CLB) there is at least one configuration that does not use that CLB. Additionally, we determine the fraction of time that each configuration should be used to balance the stress and to mitigate the aging process in FPGA-based runtime reconfigurable systems. The generated configurations significantly improve reliability by fault-tolerance and aging mitigation.
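The fault-tolerance condition in this abstract can be stated compactly: for every CLB there must be at least one configuration that leaves it unused, which is the same as saying that no CLB is used by all configurations. The sketch below checks this condition and computes the relative stress per CLB for a given time share of each configuration. It is an illustration under assumed data structures (configurations as sets of CLB indices, hypothetical function names), not the design method from the paper.

```python
def tolerates_all_single_clb_faults(configurations):
    """True if for every CLB some configuration does not use it,
    i.e. the intersection of all configurations is empty."""
    return not set.intersection(*configurations)


def expected_stress(configurations, fractions, num_clbs):
    """Relative stress per CLB when configuration k is active
    for fractions[k] of the total operating time."""
    return [
        sum(f for used, f in zip(configurations, fractions) if i in used)
        for i in range(num_clbs)
    ]


# A module that needs two CLBs inside a region of four CLBs.
diversified = [{0, 1}, {2, 3}]
overlapping = [{0, 1}, {1, 2}]  # CLB 1 is used by both configurations

ok = tolerates_all_single_clb_faults(diversified)
not_ok = tolerates_all_single_clb_faults(overlapping)
balanced = expected_stress(diversified, [0.5, 0.5], 4)
```

In this example the diversified pair tolerates any single faulty CLB and, when both configurations are used half of the time, every CLB carries the same relative stress of 0.5.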
BibTeX:
@inproceedings{ZhangBKSBIWH2013,
  author = {Zhang, Hongyan and Bauer, Lars and Kochte, Michael A. and Schneider, Eric and Braun, Claus and Imhof, Michael E. and Wunderlich, Hans-Joachim and Henkel, Jörg},
  title = {{Module Diversification: Fault Tolerance and Aging Mitigation for Runtime Reconfigurable Architectures}},
  booktitle = {Proceedings of the IEEE International Test Conference (ITC'13)},
  year = {2013},
  keywords = {Reliability, online test, fault-tolerance, aging mitigation, partial runtime reconfiguration, FPGA},
  abstract = {Runtime reconfigurable architectures based on Field-Programmable Gate Arrays (FPGAs) are attractive for realizing complex applications. However, being manufactured in latest semiconductor process technologies, FPGAs are increasingly prone to aging effects, which reduce the reliability of such systems and must be tackled by aging mitigation and application of fault tolerance techniques. This paper presents module diversification, a novel design method that creates different configurations for runtime reconfigurable modules. Our method provides fault tolerance by creating the minimal number of configurations such that for any faulty Configurable Logic Block (CLB) there is at least one configuration that does not use that CLB. Additionally, we determine the fraction of time that each configuration should be used to balance the stress and to mitigate the aging process in FPGA-based runtime reconfigurable systems. The generated configurations significantly improve reliability by fault-tolerance and aging mitigation.},
  url = {http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6651926},
  doi = {http://dx.doi.org/10.1109/TEST.2013.6651926},
  file = {http://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2013/ITC_ZhangBKSBIWH2013.pdf}
}
5. Test Strategies for Reliable Runtime Reconfigurable Architectures
Bauer, L., Braun, C., Imhof, M.E., Kochte, M.A., Schneider, E., Zhang, H., Henkel, J. and Wunderlich, H.-J.
IEEE Transactions on Computers
Vol. 62(8), Los Alamitos, California, USA, August 2013, pp. 1494-1507
2013
Keywords: FPGA, Reconfigurable Architectures, Online Test
Abstract: FPGA-based reconfigurable systems allow the online adaptation to dynamically changing runtime requirements. The reliability of FPGAs, being manufactured in latest technologies, is threatened by soft errors, as well as aging effects and latent defects. To ensure reliable reconfiguration, it is mandatory to guarantee the correct operation of the reconfigurable fabric. This can be achieved by periodic or on-demand online testing. This paper presents a reliable system architecture for runtime-reconfigurable systems, which integrates two non-concurrent online test strategies: Pre-configuration online tests (PRET) and post-configuration online tests (PORT). The PRET checks that the reconfigurable hardware is free of faults by periodic or on-demand tests. The PORT has two objectives: It tests reconfigured hardware units after reconfiguration to check that the configuration process completed correctly and it validates the expected functionality. During operation, PORT is used to periodically check the reconfigured hardware units for malfunctions in the programmable logic. Altogether, this paper presents PRET, PORT, and the system integration of such test schemes into a runtime-reconfigurable system, including the resource management and test scheduling. Experimental results show that the integration of online testing in reconfigurable systems incurs only minimum impact on performance while delivering high fault coverage and low test latency.
BibTeX:
@article{BauerBIKSZHW2013,
  author = {Bauer, Lars and Braun, Claus and Imhof, Michael E. and Kochte, Michael A. and Schneider, Eric and Zhang, Hongyan and Henkel, Jörg and Wunderlich, Hans-Joachim},
  title = {{Test Strategies for Reliable Runtime Reconfigurable Architectures}},
  journal = {IEEE Transactions on Computers},
  publisher = {IEEE Computer Society},
  year = {2013},
  volume = {62},
  number = {8},
  pages = {1494--1507},
  keywords = {FPGA, Reconfigurable Architectures, Online Test},
  abstract = {FPGA-based reconfigurable systems allow the online adaptation to dynamically changing runtime requirements. The reliability of FPGAs, being manufactured in latest technologies, is threatened by soft errors, as well as aging effects and latent defects. To ensure reliable reconfiguration, it is mandatory to guarantee the correct operation of the reconfigurable fabric. This can be achieved by periodic or on-demand online testing. This paper presents a reliable system architecture for runtime-reconfigurable systems, which integrates two non-concurrent online test strategies: Pre-configuration online tests (PRET) and post-configuration online tests (PORT). The PRET checks that the reconfigurable hardware is free of faults by periodic or on-demand tests. The PORT has two objectives: It tests reconfigured hardware units after reconfiguration to check that the configuration process completed correctly and it validates the expected functionality. During operation, PORT is used to periodically check the reconfigured hardware units for malfunctions in the programmable logic. Altogether, this paper presents PRET, PORT, and the system integration of such test schemes into a runtime-reconfigurable system, including the resource management and test scheduling. Experimental results show that the integration of online testing in reconfigurable systems incurs only minimum impact on performance while delivering high fault coverage and low test latency.},
  url = {http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6475939},
  doi = {http://dx.doi.org/10.1109/TC.2013.53},
  file = {http://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2013/TC_BauerBIKSZHW2013.pdf}
}
4. Transparent Structural Online Test for Reconfigurable Systems
Abdelfattah, M.S., Bauer, L., Braun, C., Imhof, M.E., Kochte, M.A., Zhang, H., Henkel, J. and Wunderlich, H.-J.
Proceedings of the 18th IEEE International On-Line Testing Symposium (IOLTS'12), Sitges, Spain, 27-29 June 2012, pp. 37-42
2012
Keywords: FPGA; Reconfigurable Architectures; Online Test
Abstract: FPGA-based reconfigurable systems allow the online adaptation to dynamically changing runtime requirements. However, the reliability of modern FPGAs is threatened by latent defects and aging effects. Hence, it is mandatory to ensure the reliable operation of the FPGA’s reconfigurable fabric. This can be achieved by periodic or on-demand online testing. In this paper, a system-integrated, transparent structural online test method for runtime reconfigurable systems is proposed. The required tests are scheduled like functional workloads, and thorough optimizations of the test overhead reduce the performance impact. The proposed scheme has been implemented on a reconfigurable system. The results demonstrate that thorough testing of the reconfigurable fabric can be achieved at negligible performance impact on the application.
BibTeX:
@inproceedings{AbdelBBIKZHW2012,
  author = {Abdelfattah, Mohamed S. and Bauer, Lars and Braun, Claus and Imhof, Michael E. and Kochte, Michael A. and Zhang, Hongyan and Henkel, Jörg and Wunderlich, Hans-Joachim},
  title = {{Transparent Structural Online Test for Reconfigurable Systems}},
  booktitle = {Proceedings of the 18th IEEE International On-Line Testing Symposium (IOLTS'12)},
  publisher = {IEEE Computer Society},
  year = {2012},
  pages = {37--42},
  keywords = {FPGA; Reconfigurable Architectures; Online Test},
  abstract = {FPGA-based reconfigurable systems allow the online adaptation to dynamically changing runtime requirements. However, the reliability of modern FPGAs is threatened by latent defects and aging effects. Hence, it is mandatory to ensure the reliable operation of the FPGA’s reconfigurable fabric. This can be achieved by periodic or on-demand online testing. In this paper, a system-integrated, transparent structural online test method for runtime reconfigurable systems is proposed. The required tests are scheduled like functional workloads, and thorough optimizations of the test overhead reduce the performance impact. The proposed scheme has been implemented on a reconfigurable system. The results demonstrate that thorough testing of the reconfigurable fabric can be achieved at negligible performance impact on the application.},
  doi = {http://dx.doi.org/10.1109/IOLTS.2012.6313838},
  file = {http://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2012/IOLTS_AbdelBBIKZHW2012.pdf}
}
3. OTERA: Online Test Strategies for Reliable Reconfigurable Architectures
Bauer, L., Braun, C., Imhof, M.E., Kochte, M.A., Zhang, H., Wunderlich, H.-J. and Henkel, J.
Proceedings of the NASA/ESA Conference on Adaptive Hardware and Systems (AHS'12), Erlangen, Germany, 25-28 June 2012, pp. 38-45
2012
Abstract: FPGA-based reconfigurable systems allow the online adaptation to dynamically changing runtime requirements. However, the reliability of FPGAs, which are manufactured in latest technologies, is threatened not only by soft errors, but also by aging effects and latent defects. To ensure reliable reconfiguration, it is mandatory to guarantee the correct operation of the underlying reconfigurable fabric. This can be achieved by periodic or on-demand online testing. The OTERA project develops and evaluates components and strategies for reconfigurable systems that feature reliable reconfiguration. The research focus ranges from structural online tests for the FPGA infrastructure and functional online tests for the configured functionality up to the resource management and test scheduling. This paper gives an overview of the project tasks and presents first results.
BibTeX:
@inproceedings{BauerBIKZWH2012,
  author = {Bauer, Lars and Braun, Claus and Imhof, Michael E. and Kochte, Michael A. and Zhang, Hongyan and Wunderlich, Hans-Joachim and Henkel, Jörg},
  title = {{OTERA: Online Test Strategies for Reliable Reconfigurable Architectures}},
  booktitle = {Proceedings of the NASA/ESA Conference on Adaptive Hardware and Systems (AHS'12)},
  publisher = {IEEE Computer Society},
  year = {2012},
  pages = {38--45},
  abstract = {FPGA-based reconfigurable systems allow the online adaptation to dynamically changing runtime requirements. However, the reliability of FPGAs, which are manufactured in latest technologies, is threatened not only by soft errors, but also by aging effects and latent defects. To ensure reliable reconfiguration, it is mandatory to guarantee the correct operation of the underlying reconfigurable fabric. This can be achieved by periodic or on-demand online testing. The OTERA project develops and evaluates components and strategies for reconfigurable systems that feature reliable reconfiguration. The research focus ranges from structural online tests for the FPGA infrastructure and functional online tests for the configured functionality up to the resource management and test scheduling. This paper gives an overview of the project tasks and presents first results.},
  doi = {http://dx.doi.org/10.1109/AHS.2012.6268667},
  file = {http://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2012/AHS_BauerBIKZWH2012.pdf}
}
2. Efficient BDD-based Fault Simulation in Presence of Unknown Values
Kochte, M.A., Kundu, S., Miyase, K., Wen, X. and Wunderlich, H.-J.
Proceedings of the 20th IEEE Asian Test Symposium (ATS'11), New Delhi, India, 20-23 November 2011, pp. 383-388
2011
Keywords: Unknown values; X propagation; precise fault simulation; symbolic simulation; BDD
Abstract: Unknown (X) values, originating from memories, clock domain boundaries or A/D interfaces, may compromise test signatures and fault coverage. Classical logic and fault simulation algorithms are pessimistic w.r.t. the propagation of X values in the circuit. This work proposes efficient hybrid logic and stuck-at fault simulation algorithms which combine heuristics and local BDDs to increase simulation accuracy. Experimental results on benchmark and large industrial circuits show significantly increased fault coverage and low runtime. The achieved simulation precision is quantified for the first time.
BibTeX:
@inproceedings{KochtKMWW2011,
  author = {Kochte, Michael A. and Kundu, S. and Miyase, Kohei and Wen, Xiaoqing and Wunderlich, Hans-Joachim},
  title = {{Efficient BDD-based Fault Simulation in Presence of Unknown Values}},
  booktitle = {Proceedings of the 20th IEEE Asian Test Symposium (ATS'11)},
  publisher = {IEEE Computer Society},
  year = {2011},
  pages = {383--388},
  keywords = {Unknown values; X propagation; precise fault simulation; symbolic simulation; BDD},
  abstract = {Unknown (X) values, originating from memories, clock domain boundaries or A/D interfaces, may compromise test signatures and fault coverage. Classical logic and fault simulation algorithms are pessimistic w.r.t. the propagation of X values in the circuit. This work proposes efficient hybrid logic and stuck-at fault simulation algorithms which combine heuristics and local BDDs to increase simulation accuracy. Experimental results on benchmark and large industrial circuits show significantly increased fault coverage and low runtime. The achieved simulation precision is quantified for the first time.},
  doi = {http://dx.doi.org/10.1109/ATS.2011.52},
  file = {http://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2011/ATS_KochtKMWW2011.pdf}
}
1. Design and Architectures for Dependable Embedded Systems
Henkel, J., Bauer, L., Becker, J., Bringmann, O., Brinkschulte, U., Chakraborty, S., Engel, M., Ernst, R., Härtig, H., Hedrich, L., Herkersdorf, A., Kapitza, R., Lohmann, D., Marwedel, P., Platzner, M., Rosenstiel, W., Schlichtmann, U., Spinczyk, O., Tahoori, M., Teich, J., Wehn, N. and Wunderlich, H.-J.
Proceedings of the 9th IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis (CODES+ISSS'11), Taipei, Taiwan, 9-14 October 2011, pp. 69-78
2011
Keywords: Resilience; Fault-Tolerance; Embedded Systems; MPSoCs; Dependability
Abstract: The paper presents an overview of a major research project on dependable embedded systems that has started in Fall 2010 and is running for a projected duration of six years. Aim is a 'dependability co-design' that spans various levels of abstraction in the design process of embedded systems starting from gate level through operating system, applications software to system architecture. In addition, we present a new classification on faults, errors, and failures.
BibTeX:
@inproceedings{HenkeBBBBCEEHHHKLMPRSSTTWW2011,
  author = {Henkel, Jörg and Bauer, Lars and Becker, Joachim and Bringmann, Oliver and Brinkschulte, Uwe and Chakraborty, Samarjit and Engel, Michael and Ernst, Rolf and Härtig, Hermann and Hedrich, Lars and Herkersdorf, Andreas and Kapitza, Rüdiger and Lohmann, Daniel and Marwedel, Peter and Platzner, Marco and Rosenstiel, Wolfgang and Schlichtmann, Ulf and Spinczyk, Olaf and Tahoori, Mehdi and Teich, Jürgen and Wehn, Norbert and Wunderlich, Hans-Joachim},
  title = {{Design and Architectures for Dependable Embedded Systems}},
  booktitle = {Proceedings of the 9th IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis (CODES+ISSS'11)},
  publisher = {ACM},
  year = {2011},
  pages = {69--78},
  keywords = {Resilience; Fault-Tolerance; Embedded Systems; MPSoCs; Dependability},
  abstract = {The paper presents an overview of a major research project on dependable embedded systems that has started in Fall 2010 and is running for a projected duration of six years. Aim is a 'dependability co-design' that spans various levels of abstraction in the design process of embedded systems starting from gate level through operating system, applications software to system architecture. In addition, we present a new classification on faults, errors, and failures.},
  url = {http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=6062320},
  doi = {http://dx.doi.org/10.1145/2039370.2039384},
  file = {http://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2011/CODES+ISSS_HenkeBBBBCEEHHHKLMPRSSTTWW2011.pdf}
}
Created by JabRef on 23/05/2017.
Workshop Contributions
2. Cross-Layer Dependability Modeling and Abstraction in Systems on Chip
Herkersdorf, A., Engel, M., Glaß, M., Henkel, J., Kleeberger, V.B., Kochte, M.A., Kühn, J.M., Nassif, S.R., Rauchfuss, H., Rosenstiel, W., Schlichtmann, U., Shafique, M., Tahoori, M.B., Teich, J., Wehn, N., Weis, C. and Wunderlich, H.-J.
Selse-9: The 9th Workshop on Silicon Errors in Logic - System Effects, Stanford, California, USA, 26-27 March 2013
2013
 
Keywords: Reliability Modeling, Cross-Layer
Abstract: The Resilience Articulation Point (RAP) model aims at provisioning researchers and developers with a probabilistic fault abstraction and error propagation framework covering all hardware/software layers of a System on Chip. RAP assumes that physically induced faults at the technology or CMOS device layer will eventually manifest themselves as a single or multiple bit flip(s). When probabilistic error functions for specific fault origins are known at the bit or signal level, knowledge about the unit of design and its environment allow the transformation of the bit-related error functions into characteristic higher layer representations, such as error functions for data words, Finite State Machine (FSM) state, macro interfaces or software variables. Thus, design concerns at higher abstraction layers can be investigated without the necessity to further consider the full details of lower levels of design. This paper introduces the ideas of RAP based on examples of radiation induced soft errors in SRAM cells and sequential CMOS logic. It shows by example how probabilistic bit flips are systematically abstracted and propagated towards higher abstraction levels up to the application software layer, and how RAP can be used to parameterize architecture level resilience methods.
BibTeX:
@inproceedings{HerkersdorfEGHKKKNRRSSTTWWW2013,
  author = {Herkersdorf, Andreas and Engel, Michael and Glaß, Michael and Henkel, Jörg and Kleeberger, Veit B. and Kochte, Michael A. and Kühn, Johannes M. and Nassif, Sani R. and Rauchfuss, Holm and Rosenstiel, Wolfgang and Schlichtmann, Ulf and Shafique, Muhammad and Tahoori, Mehdi B. and Teich, Jürgen and Wehn, Norbert and Weis, Christian and Wunderlich, Hans-Joachim},
  title = {{Cross-Layer Dependability Modeling and Abstraction in Systems on Chip}},
  booktitle = {Selse-9: The 9th Workshop on Silicon Errors in Logic - System Effects},
  year = {2013},
  keywords = {Reliability Modeling, Cross-Layer},
  abstract = {The Resilience Articulation Point (RAP) model aims at provisioning researchers and developers with a probabilistic fault abstraction and error propagation framework covering all hardware/software layers of a System on Chip. RAP assumes that physically induced faults at the technology or CMOS device layer will eventually manifest themselves as a single or multiple bit flip(s). When probabilistic error functions for specific fault origins are known at the bit or signal level, knowledge about the unit of design and its environment allow the transformation of the bit-related error functions into characteristic higher layer representations, such as error functions for data words, Finite State Machine (FSM) state, macro interfaces or software variables. Thus, design concerns at higher abstraction layers can be investigated without the necessity to further consider the full details of lower levels of design. This paper introduces the ideas of RAP based on examples of radiation induced soft errors in SRAM cells and sequential CMOS logic. It shows by example how probabilistic bit flips are systematically abstracted and propagated towards higher abstraction levels up to the application software layer, and how RAP can be used to parameterize architecture level resilience methods.}
}
1. Fault Modeling in Testing
Holst, S., Kochte, M.A. and Wunderlich, H.-J.
RAP Day Workshop, DFG SPP 1500, Munich, Germany, 21 December 2012
2012
 
Keywords: Fault modeling, generalized fault models, conditional fault models
BibTeX:
@inproceedings{HolstKW2012,
  author = {Holst, Stefan and Kochte, Michael A. and Wunderlich, Hans-Joachim},
  title = {{Fault Modeling in Testing}},
  booktitle = {RAP Day Workshop, DFG SPP 1500},
  year = {2012},
  keywords = {Fault modeling, generalized fault models, conditional fault models}
}
 
Student Theses
  • Delay Characterization in FPGA-based Reconfigurable Systems, S. Zhang, 3 June 2013 - 3 December 2013 (Master Thesis)
  • Accelerated Computation Using Runtime Partial Reconfiguration, N. Nayak, 27 May 2013 - 26 November 2013 (Master Thesis)
  • Online Self-Test Wrapper for Runtime-Reconfigurable Systems, J. Wang, 3 December 2012 - 2 June 2013 (Master Thesis)
  • Evaluation of Advanced Techniques for Structural FPGA Self-Test, M. Abdelfattah, 1 March 2011 - 31 August 2011 (Master Thesis)

 


Project Partner

 


Contacts