CA - Current Research Projects
Computer systems have reached a point where significant improvements in computational performance and energy efficiency have become very hard to achieve. The main reason is a power and efficiency wall CMOS technology is facing. Physical limitations such as high power densities and a variety of reliability degradations now enforce larger design margins which reduce efficiency.
Approximate Computing trades off precision against power, energy, storage, bandwidth or performance, and can be applied to hardware, software and algorithms. It enables much more efficient computing by providing additional, adjustable design and runtime parameters to find Pareto optimal solutions. However, its application is still rather limited and a significant extension of the scope of applications is required, including applications that are not necessarily inherently error-tolerant.
The ACCROSS project will tackle this challenge with a cross-layer approach to analysis and optimization, which considers the system stack from the application down to the hardware. At the higher levels, ACCROSS covers the analysis of applications from different computational problem classes, which will act as enablers for mainstream approximate computing. This includes the development of new methods for the analysis of approximation potentials in applications, the adaptation of existing applications to approximation and the quantification of efficiency gains. Moreover, new methods for combining suitable approximation techniques at different system layers during runtime will be provided to maximize efficiency with respect to performance and energy. New error metrics and methods for lightweight runtime monitoring of accuracy will be developed to ensure the usefulness of the targeted applications. At the lower levels, ACCROSS covers the systematic evaluation of the impact of removing design margins which will lead to approximate behavior and improved efficiency. Abstract but accurate models linking the hardware and software will be provided, enabling designers to accurately quantify the error and efficiency impact of approximation across the system stack.
CA - Completed Projects
An important problem in modern technology nodes in nano-electronics are early life failures, which often cause recalls of shipped products and incur high costs. An important root cause of such failures are marginal circuit structures, which pass a conventional manufacturing test, but are not able to cope with the later workload and stress in the field. Such structures can be identified on the basis of non-functional indicators, in particular by testing the timing behavior. For an effective and cost-efficient test of these indicators, the FAST project investigates novel scan designs and built-in self-test strategies for circuits, which can operate at frequencies beyond the functional specification to detect small deviations of the nominal timing behavior and thus potential early life failures.
since 02.2017, DFG-Project: WU 245/19-1
The project in detail:
State-of-the-art nanoscale technologies allow for the integration of billions of transistors with feature sizes of 14 nm or below into a single chip. This enables innovative approaches and solutions in many application domains, but it also comes along with fundamental challenges. Early life failures are particularly critical, as they can cause product recalls associated with a loss of billions of dollars. A major cause of early life failures are "weak" devices that operate correctly during manufacturing test, but cannot stand operational stress in the field. While other failure mechanisms, such as aging or external disturbances, to some extent, may be compensated by a robust design, potential early life failures must be detected by tests, and the respective systems have to be sorted out. This requires specific approaches far beyond today’s state-of-the-art.
As they work properly in the beginning, weak structures must be identified by analyzing the non-functional circuit behavior with the help of appropriate observables. Besides power consumption, the circuit timing is one of the most important reliability indicators. In particular, small delay faults may indicate marginal hardware that can degrade further under stress. However, they can be “hidden” at nominal frequency and only be detected at higher frequencies (“faster-than-at-speed test” / FAST). Therefore, conventional approaches for testing reach their limitations, and new methods must be investigated and developed in the following three domains:
- Specific techniques for „design for test“ (DFT) must be developed to deal with the challenges of testing beyond nominal frequency.
- Strategies for test scheduling must ensure that a maximum fault coverage is achieved with a minimum number of test frequencies and a short test time.
- Appropriate metrics are needed to quantify the coverage of weak devices. Here it is particularly challenging to distinguish the behavior of week devices from variations due to nanoscale integration.
Since FAST imposes extreme requirements on the automatic test equipment (ATE), it is very important to support an efficient implementation as a built-in self-test (BIST).
Within the framework of the project, strategies and solutions will be developed for the problems mentioned above. This way, the enormous cost of a traditional „burn-in“ test can be reduced, thus enabling the introduction of nanoscale technology to new application domains.
since 08.2014, DFG-Project: WU 245/17-1, WU 245/17-2
Please visit our project page for detailed information.
Project Description (Phase 2)
RSNs were initially brought up to manage the extensive amount of instrumentation in modern systems-on-chip to facilitate cost-efficient bring-up and debug, test, diagnosis and maintenance. Recently, the reuse of RSNs at system runtime for online fault classification and fault management moved into the center of research activities. Reasons are not only the increased complexity and dependability requirements in new technologies, but also the emerging application paradigms of self-aware and autonomous systems. Especially in safety-critical applications, online test, system monitoring and fault tolerance at low cost become mandatory. For example, the standard ISO 26262 specifies critical faults to be detected within certain test intervals at runtime and allows only a maximum fault reaction time until the system has to be transferred into a safe state. The periodic test is usually structure oriented and targets stuck-at, transition and delay faults.
It is common practice that the required tasks for initializing the periodic testing, for fault detection and for fault reaction are executed by the system functionality transparent in the background. Disadvantages of this approach are manifold: The periodic test and test evaluation constitute some significant additional workload, reduce performance, consume a large amount of additional power, and may take too much time for avoiding dangerous situations. Guaranteeing deadlines and verifying fault tolerance is extremely difficult as the properties have to be proven in the presence of faults.An alternative is the use of the non-functional infrastructure for concurrent fault detection and fault management, and first approaches to employ RSNs in in-system runtime tests have already been proposed. These first attempts still require a dedicated regular structure of RSNs and its permanent background operation which should be avoided in practice.
The results of the first phase of ACCESS provide an excellent basis for further research on the runtime use of RSNs. Since RSNs are integrated into the chip any way, the required cost of the modifications for runtime use are affordable even for a mass market like automotive. The goal of the second phase of ACCESS is a technique for a robust online use of RSNs to support safety, fault tolerance and reliability management.
- In-system run-time test using RSNs
- System wide collection of diagnosis information
- Online diagnosis of RSNs
- Investigation of robust and fault-tolerant RSNs
This work is supported by the German Research Foundation (DFG) under grant WU 245/17-2 (2019-2021).
In the project „SHIVA: Secure Hardware for Information Processing“, coordinated by Prof. Dr. Wunderlich (Institut für Technische Informatik), novel design and verification methods are researched and developed to increase and assure the security of microelectronic hardware, used for instance in automobile, medical, or industrial applications. These methods will help to achieve increasing security requirements and prevent system manipulation, extraction of critical data or process information, and IP theft.
More information on our german website.
02.2016 - 05.2019, Forschungsprogramm der Baden-Württemberg Stiftung
IKT-Sicherheit für weltweit vernetzte vertrauenswürdige Infrastrukturen
Design and test validation is one of the most important and complex tasks within modern semi-conductor product development cycles. The design validation process analyzes and assesses a developed design with respect to certain validation targets to ensure its compliance with given specifications and customer requirements. Test validation evaluates the defect coverage obtained by certain test strategies and assesses the quality of the products tested and delivered. The validation targets include both, functional and non-functional properties, as well as the complex interactions and interdependencies between them. The validation means rely mainly on compute-intensive simulations which require more and more highly parallel hardware acceleration.
In this project novel methods for versatile simulation-based VLSI design and test validation on high throughput data-parallel architectures will be developed, which enable a wide range of important state-of-the-art validation tasks for large circuits. In general, due to the nature of the design validation processes and due to the massive amount of data involved, parallelism and throughput-optimization are the keys for making design validation feasible for future industrial-sized designs. The main focus and key features lie in the structure of the simulation model, the abstraction level and the used algorithms, as well as their parallelization on data-parallel architectures. The simulation algorithms should be kept simple to run fast, yet accurate enough to produce acceptable and valuable data for cross-layer validation of complex digital systems.
10.2014 - 12.2018, DFG-Projekt: WU 245/16-1
This project aims to find novel abstraction and algorithm mapping methods to allow highly accurate timing and NFP-aware simulation of multi-million gate circuits on data-parallel architectures such as graphics processing units (GPUs). The expected dramatic speedup compared to the existing state-of-the-art allows fault simulation of millions of faults and thousands of patterns. The increased accuracy of the simulation results allow to optimize test patterns w.r.t. test power and small delay defect coverage in presence of power noise, clock skew or even circuit variations.
01.2015 - 12.2016, DAAD/JSPS PPP Japan Project: #57155440
Dynamically reconfigurable architectures enable a major acceleration of diverse applications by changing and optimizing the structure of the system at runtime. Permanent and transient faults threaten the correct operation of such an architecture. This project aims to increase dependability of runtime reconfigurable systems by a novel system-level strategy for online tests and online adaptation to an impaired state. This will be achieved by (a) scheduling such that tests for reconfigurable resources are executed with minimal performance impact, (b) resource management such that partially faulty resources are used for components which do not require the faulty elements, and (c) online monitoring and error checking. To ensure reliable runtime reconfiguration, each reconfiguration process is thoroughly tested by a novel and efficient combination of online structural and functional tests. Compared to existing fault-tolerance approaches, our proposal avoids the large hardware overhead of structural redundancy schemes. The saved resources are available for further application acceleration. Still, the proposed scheme covers faults in the fabric, in the reconfigured application logic and errors in the process of reconfiguration.
10.2010 - 06.2017, DFG-Project: WU 245/10-1, 10-2, 10-3
Since the beginning of the DFG Cluster of Excellence "Simulation Technology" (SimTech) at the University of Stuttgart in 2008, the Institute of Computer Architecture and Computer Engineering (ITI, RA) is an active part of the research within the Stuttgart Research Center for Simulation Technology (SRC SimTech). The institute's research includes the development of fault tolerant simulation algorithms for new, tightly-coupled many-core computer architectures like GPUs, the acceleration of existing simulations on such architectures, as well as the mapping of complex simulation applications to innovative reconfigurable heterogeneous computer architectures.
Within the research cluster, Hans-Joachim Wunderlich acts as a principal investigator (PI) and he co-coordinates the research activities of the SimTech Project Network PN2 "High-Performance Simulation across Computer Architectures". This project network is unique in terms of its interdisciplinary nature and its interfaces between the participating researchers and projects. Scientists from computer science, chemistry, physics and chemical engineering work together to develop and provide new solutions for some of the major challenges in simulation technology. The classes of computational problems treated within project network PN2 comprise quantum mechanics, molecular mechanics, electronic structure methods, molecular dynamics, Markov-chain Monte-Carlo simulations and polarizable force fields.
06.2008 - 10.2017, SimTech Cluster of Excellence
The main objective of the RM-BIST project is to extend Design for Test (DFT) circuitry, which is primarily used during manufacturing test of VLSI chips, to Design for Reliability (DFR) infrastructure. We take advantage of existing built-in self-test (BIST) circuitry and reuse it during lifetime operation to provide system monitoring and perform reliability prediction. Moreover, the modified BIST infrastructure is used to perform targeted reliability improvements. We identify, monitor, predict, and mitigate errors affecting the system reliability at different time scales to handle various reliability detractors (radiation-induced soft errors, intermittent faults due to process and runtime variations, transistor aging and electromigration). The goal is to provide runtime support for reliability screening and improvement by modifying and reusing existing DFT infrastructure with minimum costs.
07.2012 - 06.2015, DFG-Project: WU 245/13-1
The project ROCK targets the analysis and the prototypical development of robust architectures and associated design practices for Networks-on-Chips. Thereby, it meets the challenges of increased susceptibility of on-chip communication infrastructures against the massive influences caused by escalating integration density. ROCK pursues the strategy of conducting fault detection, online diagnosis und specific reconfiguration to tackle faults in a hierarchical manner throughout all network layers, aiming at selecting an optimal combination of activities over all layers. The quality of potential solutions is measured by their energy-minimal compliance to assurances made with respect to the performability of the network. For this purpose, performability will be defined for the research area of NoCs, incorporating communication performance and fault statistics. Any algorithms and architectures for controlling and performing diagnosis and reconfiguration shall themselves be designed as fault tolerant. Furthermore, their operation shall be transparent to the application processes and minimize interference with regular NoC communication. A wide range of architectures will be investigated based on the enabling technology of NoC fault models and high-level NoC fault simulation.
08.2011 - 12.2015, DFG-Project: WU 245/12-1
Microelectronic circuits suffer from life-time limiting aging. In this project, online in-field methods to assess circuit performance and remaining life-time will be developed to predict failures due to aging processes. Sensors and monitoring infrastructure are used to analyze both operating conditions as well as aging indicators so that a system failure can be early indicated and prevented by technical measures. Novel maintenance concepts based on failure prediction allow for a substantial simplification of established structural fault tolerance measures (e.g. redundancy concepts) even in safety-critical applications since specific counter measures can be applied before an actual aging induced failure. With the aid of such an on-line monitoring the effective life-time of a microelectronic product can be significantly increased at low cost.
03.2011 - 12.2014, DFG-Project: WU 245/11-1
Functionality in embedded systems is more and more realized by integrated hardware / software systems. Typically, these systems are strongly coupled with technical processes, as for instance the control of a vehicle, which show time-dependent, discrete-continuous dynamics. Testing for the correct functionality of their according design as well as of the final product contributes large sums to the production costs due to its complexity. An efficient method is required for the integrated test of hardware and software in these systems, which respects all the aspects of validation, debug, test and diadnosis. Model-based development and test gains importance in research and also in industrial practice, as they support the systematic, stepwise refinement of requirements down to the implementation. By using models to describe the functionality of integrated hard- and software systems a higher efficiency of their test can be achieved. The central goal of this project is the generation of tests for the functionality and structure of an embedded hardware / software system from its system model along with an automatic evaluation and failure diagnosis.
10.2010 - 09.2013, DFG-Project: WU 245/9-1
In nanoelectronic circuit technology, circuits exhibit a high susceptibility to soft errors not only in memory arrays, but also in memory elements in random logic. Consequently, a goal of this project is the development of an efficient soft error protection scheme that uses both time and space redundancy.
01.2006 - 07.2013, DFG-Project: WU 245/5-1, 5-2