OTERA: Online Test Strategies for Reliable Reconfigurable Architectures
Dynamically reconfigurable architectures enable a major acceleration of diverse applications by changing and optimizing the structure of the system at runtime. Permanent and transient faults threaten the correct operation of such an architecture. This project aims to increase dependability of runtime reconfigurable systems by a novel system-level strategy for online tests and online adaptation to an impaired state. This will be achieved by (a) scheduling such that tests for reconfigurable resources are executed with minimal performance impact, (b) resource management such that partially faulty resources are used for components which do not require the faulty elements, and (c) online monitoring and error checking. To ensure reliable runtime reconfiguration, each reconfiguration process is thoroughly tested by a novel and efficient combination of online structural and functional tests. Compared to existing fault-tolerance approaches, our proposal avoids the large hardware overhead of structural redundancy schemes. The saved resources are available for further application acceleration. Still, the proposed scheme covers faults in the fabric, in the reconfigured application logic and errors in the process of reconfiguration.
10.2010 - 06.2017, DFG-Project: WU 245/10-1, 10-2, 10-3
The project in detail:
In the framework of the SPP 1500 priority program, this project contributes to
- Dependable Hardware Architectures,
- Design Methods and
- Operation, Observation and Adaptation.
Dynamically reconfigurable architectures can be adapted and optimized at runtime according to the current system state, load, and application. This enables a major acceleration of diverse applications with low hardware overhead. To achieve the desired execution of these applications, the underlying reconfigurable fabric must provide a high degree of reliability, i.e. error-free operation. In addition to the classical fault models found in static VLSI hardware, new types of faults threaten the operation of such reconfigurable architectures and need to be considered. A reconfigured application module may suffer not only from permanent faults in the underlying reconfigurable fabric (e.g. due to aging) but also from errors during the reconfiguration process and during operation. In particular, an erroneous reconfiguration or a transient fault affecting the configuration memory changes the structure of the reconfigured application module.
Effective methods for a manufacturing test of FPGA fabric exist, i.e. the tests can be performed in a reasonable time while covering the entire configurable fabric. Reconfigurability of the system can also be exploited to conduct efficient and thorough tests both of the fabric and the application logic. In case of faults, mitigation is possible by adaptation to the impaired state. This requires that the system is capable of detecting and diagnosing faults at runtime, and taking an appropriate adaptation decision.
Altogether, the existing approaches target non-reconfigurable systems that are implemented using a reconfigurable fabric. In these cases, the reconfigurability can be used to increase the reliability. What is missing are techniques for runtime reconfigurable systems, i.e. systems that exploit partial runtime reconfiguration as part of their normal operation. Here, it is crucial to ensure a reliable runtime reconfiguration that can be applied efficiently and with low overhead, online in the field, using a limited number of hardware resources. The OTERA project therefore aims to realize a reliable reconfiguration and system adaptation that additionally provides high resource efficiency, for instance by utilizing partially faulty FPGA structures: application modules that do not need the faulty parts are selectively reconfigured into a partially faulty area.
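The idea of using partially faulty areas can be illustrated with a small sketch. It is not the OTERA runtime system; the names `Region`, `Module`, `usable`, and `place`, the resource identifiers, and the greedy policy are assumptions made only for illustration. Each region records which of its resources have been diagnosed as faulty, and a module may be placed in a region only if none of the resources it requires are faulty there.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A reconfigurable area (hypothetical model)."""
    name: str
    resources: frozenset  # all resource ids available in the region
    faulty: frozenset     # subset diagnosed as faulty


@dataclass(frozen=True)
class Module:
    """An application module and the resources it needs (hypothetical model)."""
    name: str
    required: frozenset


def usable(region, module):
    """True if the region offers every required resource and none of them is faulty."""
    return module.required <= region.resources and not (module.required & region.faulty)


def place(modules, regions):
    """Greedy placement, one module per region.

    Modules with more requirements are placed first. Among the regions that
    fit, the one with the most faults is chosen, so that fault-free regions
    stay available for modules that need them. Unplaceable modules map to None.
    """
    free = list(regions)
    placement = {}
    for module in sorted(modules, key=lambda m: len(m.required), reverse=True):
        candidates = [r for r in free if usable(r, module)]
        if not candidates:
            placement[module.name] = None
            continue
        best = max(candidates, key=lambda r: len(r.faulty))
        placement[module.name] = best.name
        free.remove(best)
    return placement


all_res = frozenset({"lut0", "lut1", "dsp0"})
regions = [
    Region("R0", all_res, frozenset({"dsp0"})),  # DSP block diagnosed as faulty
    Region("R1", all_res, frozenset()),          # fault-free
]
modules = [
    Module("fir", frozenset({"lut0", "dsp0"})),   # needs the DSP block
    Module("ctrl", frozenset({"lut0", "lut1"})),  # needs only LUTs
]
placement = place(modules, regions)
```

In this example the module that needs the DSP block ends up in the fault-free region, while the LUT-only module is placed in the region with the faulty DSP block, so neither region is wasted.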
This project aims to increase the dependability of reconfigurable systems by a novel online system-level strategy for reliable runtime reconfiguration and system adaptation. Over all three phases (3 x 2 years), the approach ensures that:
- Errors are detected concurrently and can be contained (do not spread system-wide),
- Faults are detected in the reconfigurable fabric and reconfigured application logic to ensure correct completion of a reconfiguration,
- Root causes of detected errors are determined by diagnosis,
- Potential future errors are predicted (based on recent errors, online monitoring, and system load),
- Reliable system operation is achieved by the runtime system that dynamically schedules test routines (while trading test coverage and system stress due to testing), and
- Adaptation to an impaired state is managed by the runtime system with minimal impact on the application performance.
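The trade-off between test coverage and test-induced load in the scheduling item above can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's scheduler: the function `schedule_tests`, the per-slot `budget`, and the region bookkeeping are hypothetical. Only idle regions are tested, so running applications are not disturbed, and the budget caps how many tests run in one time slot.

```python
def schedule_tests(regions, now, budget):
    """Pick the regions to test in the current time slot.

    regions: dict mapping region name -> (busy, last_tested_time)
    now:     current time, in the same unit as last_tested_time
    budget:  maximum number of tests per slot (limits test-induced stress)

    Busy regions are skipped; idle regions are tested in order of how long
    they have gone without a test, longest first.
    """
    overdue = [
        (now - last_tested, name)
        for name, (busy, last_tested) in regions.items()
        if not busy
    ]
    overdue.sort(reverse=True)
    return [name for _, name in overdue[:budget]]


state = {
    "R0": (False, 10),  # idle, tested long ago
    "R1": (True, 0),    # busy, must not be interrupted
    "R2": (False, 50),  # idle, tested recently
    "R3": (False, 30),  # idle
}
selected = schedule_tests(state, now=100, budget=2)
```

A larger budget raises coverage per slot at the cost of more test activity; a real scheduler would also account for system load and recently observed errors, as the list above indicates.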
This work is supported by the German Research Foundation (DFG) under grants WU 245/10-1 (2011-2012), WU 245/10-2 (2013-2014), and WU 245/10-3 (2015-2016).