Research

Our research focuses on test, reliability and fault tolerance of digital systems, as well as on innovative architectures for approximate and heterogeneous computing. Part of our work is done in cooperation with partners from national and international universities and from industry.

CA - Current Research Projects

Project Description

Computer systems have reached a point where significant improvements in computational performance and energy efficiency have become very hard to achieve. The main reason is the power and efficiency wall that CMOS technology is facing. Physical limitations such as high power densities and a variety of reliability degradations now enforce larger design margins, which reduce efficiency.

Approximate Computing trades off precision against power, energy, storage, bandwidth or performance, and can be applied to hardware, software and algorithms. It enables much more efficient computing by providing additional, adjustable design and runtime parameters to find Pareto optimal solutions. However, its application is still rather limited and a significant extension of the scope of applications is required, including applications that are not necessarily inherently error-tolerant.
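The notion of Pareto-optimal solutions can be illustrated with a minimal sketch. It assumes that each approximation configuration has already been characterized by two costs, an output error and an energy figure; the `Config` type, the configuration names and all numbers are hypothetical and serve only as an illustration.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    error: float   # output error, lower is better (hypothetical values)
    energy: float  # energy per operation, lower is better (hypothetical values)


def dominates(a: Config, b: Config) -> bool:
    """a dominates b if it is no worse in both costs and strictly better in one."""
    return (a.error <= b.error and a.energy <= b.energy
            and (a.error < b.error or a.energy < b.energy))


def pareto_front(configs: list[Config]) -> list[Config]:
    """Keep every configuration that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


CONFIGS = [
    Config("exact", 0.00, 10.0),
    Config("approx_a", 0.01, 7.0),
    Config("approx_b", 0.02, 8.0),  # worse than approx_a in both costs
    Config("approx_c", 0.05, 4.0),
]

FRONT = pareto_front(CONFIGS)
print([c.name for c in FRONT])
```

Only the configurations on the front are worth offering as adjustable design or runtime parameters; `approx_b` is dropped because `approx_a` is both more accurate and cheaper.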

The ACCROSS project will tackle this challenge with a cross-layer approach to analysis and optimization, which considers the system stack from the application down to the hardware. At the higher levels, ACCROSS covers the analysis of applications from different computational problem classes, which will act as enablers for mainstream approximate computing. This includes the development of new methods for the analysis of approximation potentials in applications, the adaptation of existing applications to approximation and the quantification of efficiency gains. Moreover, new methods for combining suitable approximation techniques at different system layers during runtime will be provided to maximize efficiency with respect to performance and energy. New error metrics and methods for lightweight runtime monitoring of accuracy will be developed to ensure the usefulness of the targeted applications. At the lower levels, ACCROSS covers the systematic evaluation of the impact of removing design margins which will lead to approximate behavior and improved efficiency. Abstract but accurate models linking the hardware and software will be provided, enabling designers to accurately quantify the error and efficiency impact of approximation across the system stack.
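To make the idea of an error metric with lightweight runtime monitoring concrete, the following sketch uses a well-known textbook setup rather than the metrics developed in the project: an adder that ignores the lowest operand bits, the mean relative error as the metric, and a monitor that re-checks only every n-th operation. All function names, the threshold and the sample data are assumptions made for illustration.

```python
def approx_add(a: int, b: int, dropped_bits: int) -> int:
    """Add two non-negative integers after zeroing their lowest `dropped_bits` bits."""
    mask = ~((1 << dropped_bits) - 1)
    return (a & mask) + (b & mask)


def mean_relative_error(pairs, dropped_bits: int) -> float:
    """Average of |exact - approx| / exact over all operand pairs."""
    total = 0.0
    for a, b in pairs:
        exact = a + b
        approx = approx_add(a, b, dropped_bits)
        total += (abs(exact - approx) / exact) if exact else 0.0
    return total / len(pairs)


def accuracy_ok(pairs, dropped_bits: int, max_error: float, sample_every: int = 2) -> bool:
    """Lightweight monitor: re-check only every `sample_every`-th operation."""
    sampled = pairs[::sample_every]
    return mean_relative_error(sampled, dropped_bits) <= max_error


PAIRS = [(100, 200), (255, 1), (1024, 1024), (7, 9)]

for bits in (0, 2, 4):
    print(bits, mean_relative_error(PAIRS, bits), accuracy_ok(PAIRS, bits, max_error=0.05))
```

The small operands `(7, 9)` dominate the error once four bits are dropped, which illustrates why an error metric has to match the value range of the target application.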

Early life failures are an important problem in modern nanoelectronic technology nodes; they often cause recalls of shipped products and incur high costs. An important root cause of such failures is marginal circuit structures, which pass a conventional manufacturing test but cannot cope with the later workload and stress in the field. Such structures can be identified on the basis of non-functional indicators, in particular by testing the timing behavior. For an effective and cost-efficient test of these indicators, the FAST project investigates novel scan designs and built-in self-test strategies for circuits that can operate at frequencies beyond the functional specification, in order to detect small deviations from the nominal timing behavior and thus potential early life failures.

since 02.2017, DFG-Project: WU 245/19-1

The project in detail:

State-of-the-art nanoscale technologies allow for the integration of billions of transistors with feature sizes of 14 nm or below into a single chip. This enables innovative approaches and solutions in many application domains, but it also comes with fundamental challenges. Early life failures are particularly critical, as they can cause product recalls associated with a loss of billions of dollars. A major cause of early life failures is "weak" devices that operate correctly during manufacturing test but cannot withstand operational stress in the field. While other failure mechanisms, such as aging or external disturbances, may to some extent be compensated by a robust design, potential early life failures must be detected by tests, and the affected systems have to be sorted out. This requires specific approaches far beyond today’s state of the art.

Since they work properly in the beginning, weak structures must be identified by analyzing the non-functional circuit behavior with the help of appropriate observables. Besides power consumption, the circuit timing is one of the most important reliability indicators. In particular, small delay faults may indicate marginal hardware that can degrade further under stress. However, they can be “hidden” at nominal frequency and only be detected at higher frequencies (“faster-than-at-speed test” / FAST). Conventional test approaches therefore reach their limits, and new methods must be investigated and developed in the following three domains:

  1. Specific techniques for “design for test” (DFT) must be developed to deal with the challenges of testing beyond nominal frequency.
  2. Strategies for test scheduling must ensure that a maximum fault coverage is achieved with a minimum number of test frequencies and a short test time.
  3. Appropriate metrics are needed to quantify the coverage of weak devices. Here it is particularly challenging to distinguish the behavior of weak devices from variations due to nanoscale integration.
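The scheduling problem in the second item can be sketched as a set-cover problem. The example below rests on a deliberately simplified assumption that is not taken from the project: a small delay fault is counted as detected at a clock period if the fault-free path still meets that period while the faulty path misses it. The `Fault` type, the greedy heuristic and all delay values (in ns) are hypothetical.

```python
from typing import NamedTuple


class Fault(NamedTuple):
    name: str
    path_delay: float   # fault-free delay of the sensitized path (ns)
    extra_delay: float  # size of the small delay fault (ns)


def detected(fault: Fault, period: float) -> bool:
    """Simplified model: fault-free path meets the period, faulty path misses it."""
    return fault.path_delay <= period < fault.path_delay + fault.extra_delay


def select_periods(faults, candidate_periods):
    """Greedy set cover: repeatedly pick the period that detects most uncovered faults."""
    uncovered = set(faults)
    chosen = []
    while uncovered:
        best = max(candidate_periods,
                   key=lambda p: sum(detected(f, p) for f in uncovered))
        newly = {f for f in uncovered if detected(f, best)}
        if not newly:
            break  # remaining faults are undetectable at every candidate period
        chosen.append(best)
        uncovered -= newly
    return chosen, uncovered


NOMINAL_PERIOD = 1.0
FAULTS = [
    Fault("f1", 0.90, 0.20),  # visible at the nominal period
    Fault("f2", 0.60, 0.10),  # hidden at nominal, needs a faster clock
    Fault("f3", 0.55, 0.12),
    Fault("f4", 0.30, 0.05),
]
PERIODS = [1.0, 0.8, 0.65, 0.5, 0.33]

CHOSEN, UNCOVERED = select_periods(FAULTS, PERIODS)
print(CHOSEN, UNCOVERED)
```

In this toy instance three of the five candidate periods suffice, and only `f1` would be caught by a test at nominal speed. A realistic flow would also have to handle outputs that become unknown when long fault-free paths miss the shortened period, which this sketch ignores.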

Since FAST imposes extreme requirements on the automatic test equipment (ATE), it is very important to support an efficient implementation as a built-in self-test (BIST).

Within the framework of the project, strategies and solutions will be developed for the problems mentioned above. This way, the enormous cost of a traditional “burn-in” test can be reduced, thus enabling the introduction of nanoscale technology to new application domains.

since 08.2014, DFG-Project: WU 245/17-1, WU 245/17-2

Please visit our project page for detailed information.

 

Project Description (Phase 2)

Reconfigurable scan networks (RSNs) were initially introduced to manage the extensive amount of instrumentation in modern systems-on-chip and to facilitate cost-efficient bring-up and debug, test, diagnosis and maintenance. Recently, the reuse of RSNs at system runtime for online fault classification and fault management has moved into the focus of research activities. The reasons are not only the increased complexity and dependability requirements in new technologies, but also the emerging application paradigms of self-aware and autonomous systems. Especially in safety-critical applications, online test, system monitoring and fault tolerance at low cost become mandatory. For example, the standard ISO 26262 requires critical faults to be detected within certain test intervals at runtime and allows only a maximum fault reaction time before the system has to be transferred into a safe state. The periodic test is usually structure-oriented and targets stuck-at, transition and delay faults.

It is common practice that the tasks required for initializing the periodic test, for fault detection and for fault reaction are executed transparently in the background by the system functionality. The disadvantages of this approach are manifold: the periodic test and test evaluation constitute significant additional workload, reduce performance, consume a large amount of additional power, and may take too long to avoid dangerous situations. Guaranteeing deadlines and verifying fault tolerance is extremely difficult, as the properties have to be proven in the presence of faults. An alternative is the use of the non-functional infrastructure for concurrent fault detection and fault management, and first approaches to employ RSNs in in-system runtime tests have already been proposed. These first attempts still require a dedicated regular structure of RSNs and their permanent background operation, which should be avoided in practice.

The results of the first phase of ACCESS provide an excellent basis for further research on the runtime use of RSNs. Since RSNs are integrated into the chip anyway, the cost of the modifications required for runtime use is affordable even for a mass market like automotive. The goal of the second phase of ACCESS is a technique for the robust online use of RSNs to support safety, fault tolerance and reliability management.

This comprises:

  • In-system run-time test using RSNs
  • System-wide collection of diagnosis information
  • Online diagnosis of RSNs
  • Investigation of robust and fault-tolerant RSNs

This work is supported by the German Research Foundation (DFG) under grant WU 245/17-2 (2019-2021).

Hans-Joachim Wunderlich (i.R.)
Prof. Dr. rer. nat. habil.

Heading the Research Group Computer Architecture
