ES - Completed Student Theses
- Umair Rasheed
Abstract Workloads for Evaluation of Parallel Simulation Algorithms for Many-Core Systems (Master Thesis)
In parallel simulation of networks on chip (NoCs), execution performance depends predominantly on the simulation algorithm, which determines how the individual simulator threads synchronize and how the simulation workload is distributed among them. The execution time is also affected by the simulated system itself. For example, the number of nodes in the system, the way nodes interact with each other, and the complexity of simulating the individual nodes all have a significant impact on simulation duration. Unfortunately, no single parallel algorithm performs equally well for all systems.
The main focus of this thesis is to investigate the relationship between the simulated NoCs and the parallel simulation algorithms in an abstract manner. Taking the effects of different topologies into account, the efficiency of particular algorithms has to be investigated over the possible ranges of computational and communication workloads that the simulator has to handle. For a given node, the computational workload reflects the execution time needed to simulate the node's behavior, and the communication workload reflects the necessary data accesses. Starting with uniform workloads, base models of simulation performance have to be developed for each individual algorithm to reveal its suitable range of application. Then, a relation to non-uniform workloads shall be established, which eventually has to be evaluated for task-graph-based application scenarios.
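As a rough illustration of what a base model for uniform workloads might look like, the Python sketch below estimates the wall-clock time per simulated cycle for a hypothetical barrier-synchronized algorithm. Each of P threads simulates an equal share of N nodes, every node costs `t_comp` for computation plus `t_comm` for data accesses, and each cycle ends with a barrier whose cost is assumed to grow with log2(P). The cost terms, the barrier model, and the function names are assumptions for illustration, not results of the thesis.

```python
import math


def cycle_time(n_nodes: int, n_threads: int, t_comp: float, t_comm: float,
               t_barrier: float) -> float:
    """Estimated wall-clock time to simulate one cycle (hypothetical model).

    Uniform workload: every node costs t_comp (computation) + t_comm (data
    accesses). Nodes are split as evenly as possible, so the thread with
    ceil(N/P) nodes reaches the barrier last. The barrier cost is assumed to
    grow with log2(P), as for a tree barrier; one thread needs no barrier.
    """
    nodes_per_thread = math.ceil(n_nodes / n_threads)
    work = nodes_per_thread * (t_comp + t_comm)
    sync = t_barrier * math.log2(n_threads) if n_threads > 1 else 0.0
    return work + sync


def speedup(n_nodes: int, n_threads: int, t_comp: float, t_comm: float,
            t_barrier: float) -> float:
    """Speedup of n_threads over a single thread under the same model."""
    sequential = cycle_time(n_nodes, 1, t_comp, t_comm, t_barrier)
    return sequential / cycle_time(n_nodes, n_threads, t_comp, t_comm, t_barrier)


for threads in (1, 8, 64):
    print(threads, cycle_time(64, threads, t_comp=1.0, t_comm=0.0, t_barrier=4.0))
```

For 64 lightweight nodes this prints 64.0, 20.0 and 25.0 time units per cycle, so under these assumptions 64 threads are slower than 8. Revealing such crossover points is what a per-algorithm base model would be used for.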
- Florian Reichelt
Fast and Accurate Full-System HW/SW Co-Simulation by Automated Transformation of Host-Compiled Applications (Master Thesis)
Network on Chip (NoC) architectures consist of processing elements (PEs) that are connected via a network topology rather than a bus system. In a shared-memory NoC, all memory accesses of the PEs are translated into packets, which are sent through the network. To minimize this overhead, each PE is connected to its own instruction and data caches. Further cache levels, with caches shared among multiple PEs, are possible. For a realistic evaluation and efficient design space exploration of such systems, a fast and accurate HW/SW co-simulation of the whole system is important. While it is common practice to evaluate NoCs using synthetic traffic or traffic traces, this thesis investigates a new method of performing HW/SW co-simulation of shared-memory NoC systems.
The main idea is to automatically transform a given multithreaded program such that it can be directly integrated into a parallel hardware-level NoC simulator. The main benefit of this approach is that the whole system can be simulated with real-world software programs to measure and analyze the resulting performance, network traffic, and cache behavior.
For this purpose, threads have to be bound to PEs in the simulation model, and the simulator must execute threads as part of the PE simulation. Furthermore, the program’s memory accesses have to be automatically relayed to the memory model of the simulator. Simulating a memory access implicitly synchronizes application execution with the simulator kernel. By estimating the delay of selected memory accesses instead of simulating them in full detail, simulation accuracy can be traded off against simulation performance.
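The accuracy/performance trade-off described above can be sketched as a proxy through which a transformed thread performs its loads and stores. Accesses to selected addresses are simulated in full and count as synchronization points with the simulator kernel. All other accesses keep their functional effect but only add a fixed delay estimate. The classes `MemoryModel` and `MemoryProxy`, the latencies, and the "detailed address set" selection criterion are all hypothetical; a real integration would hook into the simulator's own memory model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MemoryModel:
    """Stand-in for the simulator's memory model (hypothetical)."""
    words: dict = field(default_factory=dict)
    latency: int = 20  # assumed cycles for a fully simulated access

    def access(self, addr: int, value: Optional[int] = None):
        """Simulate a load (value is None) or a store; return (result, delay)."""
        if value is not None:
            self.words[addr] = value
            return None, self.latency
        return self.words.get(addr, 0), self.latency


@dataclass
class MemoryProxy:
    """Relays a transformed thread's loads and stores (hypothetical API).

    Addresses in `detailed` are simulated in full; each such access is a
    synchronization point with the simulator kernel. All other accesses keep
    their functional effect but only add a fixed delay estimate, so the
    thread continues without synchronizing.
    """
    model: MemoryModel
    detailed: set
    estimate: int = 5      # assumed estimated delay in cycles
    local_time: int = 0    # thread-local simulated time
    sync_points: int = 0   # number of kernel synchronizations

    def access(self, addr: int, value: Optional[int] = None):
        if addr in self.detailed:
            self.sync_points += 1  # control would return to the kernel here
            result, delay = self.model.access(addr, value)
        else:
            if value is not None:
                self.model.words[addr] = value
                result = None
            else:
                result = self.model.words.get(addr, 0)
            delay = self.estimate
        self.local_time += delay
        return result
```

Choosing which addresses go into `detailed` (for instance, shared synchronization variables) is the knob that trades accuracy for speed in this sketch.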
- Manuel Strobel
Highly-Parallel NoC Simulation on Many-Core Architectures (Master Thesis)
The steadily increasing number of on-chip components implemented in recent designs goes hand in hand with a severe demand for Network on Chip (NoC) architectures that are capable of handling the considerable traffic loads. To investigate the eventual behavior of a NoC, thorough analysis through simulation is necessary. Efficient handling of this time-consuming and computationally expensive process is therefore currently the subject of intensive research. Exploiting parallel execution models on recent multi- and many-core simulation machines is one promising approach to this end.
In this thesis, a given parallel NoC simulation approach shall be evaluated with respect to performance-related optimizations for many-core architectures. Promising starting points in this context are workload distribution schemes, efficient cache utilization, and code vectorization. Points that turn out to be worth further consideration shall be evaluated in detail. For verification and comparison, an implementation targeting the Intel Many Integrated Core Architecture (MIC) shall be carried out.
- Dana Damaschin
Fault Tolerant Management of Communication Channels of an NoC Switch (Master Thesis)
In Networks-on-Chip (NoCs) with wormhole-based flow control, data packets are divided into multiple flow control units (flits). Only the first flit (HEAD) of a packet is routed, and it exclusively reserves an output channel at each switch it passes. Thus, it reserves a path through the network from sender to receiver. The remaining flits simply follow this reserved path. When the last flit (TAIL) passes a switch, it cancels the output reservation. However, if a TAIL flit is lost due to a fault in the network, one or more reservations are never canceled, and the corresponding channels can no longer be used for other communication. If a HEAD flit is lost, the remaining flits of the packet cannot be forwarded any further, as they carry no routing information. If not dropped, these flits block an input channel.
The goal of this Master Thesis is to design a reliable method that guarantees that channels are not blocked by orphaned reservations or non-routable flits. The design shall be implemented and integrated into a provided VHDL switch.
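To make the two failure modes concrete, the following Python sketch models the reservation logic of a single switch at flit level. It is a deliberate simplification: reservations are keyed by packet id rather than by input channel, and buffering and link-level flow control are ignored. The class and field names are hypothetical, and the sketch reproduces the problem the thesis addresses, not the provided VHDL switch or a solution.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Flit:
    kind: str                        # "HEAD", "BODY" or "TAIL"
    packet: int
    dest_port: Optional[int] = None  # only HEAD flits carry routing information


class WormholeSwitch:
    """Output-channel reservation of one wormhole switch (simplified).

    Reservations are keyed by packet id instead of by input channel, and
    buffering and flow control are ignored.
    """

    def __init__(self, n_outputs: int):
        self.owner: List[Optional[int]] = [None] * n_outputs  # port -> packet
        self.route: Dict[int, int] = {}                       # packet -> port

    def receive(self, flit: Flit) -> Optional[int]:
        """Return the output port the flit is forwarded to, or None if stuck."""
        if flit.kind == "HEAD":
            port = flit.dest_port
            if self.owner[port] is not None:
                return None              # output reserved by another packet
            self.owner[port] = flit.packet
            self.route[flit.packet] = port
            return port
        port = self.route.get(flit.packet)
        if port is None:
            return None                  # HEAD was lost: flit is non-routable
        if flit.kind == "TAIL":
            self.owner[port] = None      # release the reservation
            del self.route[flit.packet]
        return port


sw = WormholeSwitch(n_outputs=4)
sw.receive(Flit("HEAD", packet=1, dest_port=2))
sw.receive(Flit("BODY", packet=1))
# The TAIL of packet 1 is lost, so output 2 stays reserved indefinitely:
print(sw.receive(Flit("HEAD", packet=2, dest_port=2)))  # None
# A BODY flit whose HEAD was lost cannot be routed either:
print(sw.receive(Flit("BODY", packet=3)))               # None
```

Any mechanism that releases orphaned entries in `owner` and discards flits without an entry in `route` would address the two cases shown here.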
- Hossam ElAtali
Configurable Shared Cache and Memory Model for Parallel NoC Simulation (Master Thesis)
To evaluate the performance of a certain Network on Chip (NoC) architecture, synthetically generated traffic is often used. However, a thorough analysis of whether the NoC will also perform well in the final productive environment requires simulating the actual traffic generated by the applications running on the NoC. The interconnection fabric of a network on chip serves two purposes: explicit communication between processing elements, and implicit communication with memories caused by read and write operations or instruction fetches.
In this thesis, a configurable memory model shall be developed to investigate system performance when memory accesses contribute to NoC traffic. The model shall be developed according to a given design specification and should follow a modular approach. Two components need to be modeled: an internal, hierarchical cache model and an external memory model. The model shall be integrated with an existing NoC model in a parallel simulation environment; thus, the calling semantics of the parallel simulator must be adhered to.
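A minimal sketch of the two components, assuming a timing-only model: a configurable set-associative cache level with LRU replacement that forwards misses to the next level, and an external memory with a fixed latency. Chaining an L1 to an L2 gives the hierarchy. Sizes, latencies, LRU replacement, and the class names are assumptions; write policies, coherence, sharing between PEs, and the parallel simulator's calling semantics are not modeled.

```python
from collections import OrderedDict


class ExternalMemory:
    """External memory with a fixed access latency (assumed value)."""

    def __init__(self, latency: int):
        self.latency = latency

    def access(self, addr: int) -> int:
        return self.latency


class Cache:
    """Set-associative cache level with LRU replacement (timing only).

    Only tags are tracked, not data. `next_level` is another Cache (e.g., a
    shared L2) or the ExternalMemory; misses are forwarded to it.
    """

    def __init__(self, size: int, line_size: int, ways: int,
                 hit_latency: int, next_level):
        self.line_size = line_size
        self.ways = ways
        self.n_sets = size // (line_size * ways)
        self.sets = [OrderedDict() for _ in range(self.n_sets)]
        self.hit_latency = hit_latency
        self.next_level = next_level
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> int:
        """Return the latency of a read at this level, including lower levels."""
        line = addr // self.line_size
        tags = self.sets[line % self.n_sets]
        tag = line // self.n_sets
        if tag in tags:
            tags.move_to_end(tag)        # mark as most recently used
            self.hits += 1
            return self.hit_latency
        self.misses += 1
        if len(tags) >= self.ways:
            tags.popitem(last=False)     # evict the least recently used line
        tags[tag] = True
        return self.hit_latency + self.next_level.access(addr)


memory = ExternalMemory(latency=100)
l2 = Cache(size=1024, line_size=32, ways=4, hit_latency=10, next_level=memory)
l1 = Cache(size=256, line_size=32, ways=2, hit_latency=1, next_level=l2)
print(l1.access(0x40), l1.access(0x44))  # 111 1
```

Because every level exposes the same `access` method, levels can be stacked or swapped freely, which matches the modular approach the thesis asks for.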
- Ibrahim Ahmed
Reliable Routing Table Reconfiguration for On-chip Network Switches (Master Thesis)
In case of faults in the communication structure of an on-chip network (NoC), table-based routing has to be reconfigured. A reliable reconfiguration process is important to guarantee the correct behavior of the system after reconfiguration. New table entries are calculated in software and communicated by means of reconfiguration flits/packets through the network to a switch, where they are used to reconfigure the switch’s routing table.
However, reconfiguration flits may be affected by faults in the form of data corruption or loss and thus lead to a faulty or incomplete routing table. It therefore has to be ensured that all table entries are received correctly and written to the table. If reconfiguration flits are lost, the missing entries have to be requested again.
In this Master Thesis, a reconfiguration unit shall be designed and implemented in VHDL that ensures the reliable reconfiguration of a switch’s routing table. If the reconfiguration information sent by the primary source cannot be received correctly, the reconfiguration unit shall request the information from a secondary source.
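The protocol logic of such a reconfiguration unit can be sketched in Python before committing to VHDL. In this sketch each flit carries one table index, one entry, and a CRC-32 (computed with `zlib.crc32`) used to detect corruption. Corrupted flits are discarded, missing indices are requested once more from the primary source and then from the secondary source, and the table is only returned once it is complete. The flit format, the CRC choice, the single-retry policy, and all names are assumptions for illustration.

```python
import zlib
from typing import Callable, Iterable, List, Tuple

Flit = Tuple[int, str, int]  # (table index, routing entry, CRC-32)


def make_flit(index: int, entry: str) -> Flit:
    """Hypothetical reconfiguration flit carrying one table entry plus a CRC."""
    return index, entry, zlib.crc32(f"{index}:{entry}".encode())


def is_valid(flit: Flit) -> bool:
    index, entry, crc = flit
    return zlib.crc32(f"{index}:{entry}".encode()) == crc


class ReconfigurationUnit:
    """Collects routing table entries and re-requests missing ones."""

    def __init__(self, table_size: int):
        self.table_size = table_size
        self.received = {}  # index -> entry, only validated entries

    def receive(self, flits: Iterable[Flit]) -> List[int]:
        """Accept valid flits (corrupted ones are dropped); return missing indices."""
        for flit in flits:
            if is_valid(flit):
                index, entry, _ = flit
                self.received[index] = entry
        return sorted(set(range(self.table_size)) - set(self.received))

    def reconfigure(self, primary: Callable[[List[int]], List[Flit]],
                    secondary: Callable[[List[int]], List[Flit]]) -> List[str]:
        """Fetch all entries: primary source, one retry, then secondary source."""
        missing = self.receive(primary(list(range(self.table_size))))
        if missing:
            missing = self.receive(primary(missing))    # request again
        if missing:
            missing = self.receive(secondary(missing))  # fall back
        if missing:
            raise RuntimeError(f"entries {missing} could not be received")
        # Only a complete, validated set of entries is written to the table.
        return [self.received[i] for i in range(self.table_size)]
```

Writing the table only after all entries are validated avoids a half-updated routing table; a hardware implementation might instead use a shadow table that is swapped in atomically.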
- Muhammad Afzal
Design and Implementation of a Fault-Tolerant VHDL Switch with Reconfigurable Routing Tables (Master Thesis)
Networks on Chip (NoCs), mainly composed of switches connected to each other by links, are an interconnection structure that provides high-performance communication for on-chip multiprocessor systems. A key aspect to be considered during NoC design is the influence of faults such as link or switch failures. To ensure that the failure of a single component does not lead to a complete system failure, the remaining functional switches must be able to adapt to the fault situation.
In this Master Thesis, a fault-tolerant NoC switch with reconfigurable routing tables for a NoC organized into hierarchical units shall be designed and implemented in VHDL. Special attention shall be paid to the design of the routing tables as well as to the reconfiguration mechanism. The switch shall be evaluated by means of simulation.
- Mariem Saied
Dynamically Adapting Fault-Tolerant End-to-End Protocol for Networks-on-Chip (Master Thesis)
In an on-chip network (Network-on-Chip, NoC), external influences (e.g., radiation) or permanent faults in network components (e.g., broken wires) may lead to the corruption or loss of data packets. To guarantee reliable communication between network nodes even in the presence of faults, flow control protocols are used. A common type of flow control protocol used for this purpose is the end-to-end (E2E) protocol.
To handle the loss of data packets, E2E protocols use timers. If the sender receives no acknowledgement for a packet before the timer elapses, the packet is considered lost and is retransmitted. Static timer values have proven appropriate for scenarios with low network load and a low probability of faults. However, as the network load increases, acknowledgements take longer to arrive, and timers may elapse before they reach the sender. Packets are then retransmitted unnecessarily, which in turn increases the network load even further.
In this thesis, an existing fault-tolerant E2E flow control protocol for NoCs shall be extended so that it takes network parameters into account to adjust the timers online. For this purpose, an appropriate monitoring mechanism shall be designed and implemented that, e.g., measures the current network load and provides this information to the protocol. The overhead caused by transmitting monitoring data shall be kept as low as possible.
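One well-known way to adapt a retransmission timer online is the smoothed round-trip-time estimator used by TCP (RFC 6298). It is borrowed here as an example technique, not taken from the thesis. The sketch feeds the measured send-to-acknowledgement latency of each packet into the estimator, so the timeout grows when the network load, and with it the latency, rises. The class name, the cycle-based units, and the minimum timeout are assumptions; the thesis's monitoring of network load could feed a similar estimator.

```python
class AdaptiveTimer:
    """Retransmission timeout adapted online from measured ACK latencies.

    Uses the smoothed round-trip-time estimator of TCP (RFC 6298) as a
    stand-in; all times are in clock cycles. Latencies of retransmitted
    packets are ambiguous and should not be passed to sample() (Karn's rule).
    """
    ALPHA = 1 / 8   # gain for the smoothed latency
    BETA = 1 / 4    # gain for the latency variation
    K = 4           # safety factor applied to the variation

    def __init__(self, initial_timeout: float, min_timeout: float = 1.0):
        self.srtt = None          # smoothed latency
        self.rttvar = None        # smoothed latency variation
        self.timeout = initial_timeout
        self.min_timeout = min_timeout

    def sample(self, latency: float) -> float:
        """Feed one measured send-to-ACK latency; return the new timeout."""
        if self.srtt is None:
            self.srtt, self.rttvar = latency, latency / 2
        else:
            self.rttvar = ((1 - self.BETA) * self.rttvar
                           + self.BETA * abs(self.srtt - latency))
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * latency
        self.timeout = max(self.min_timeout, self.srtt + self.K * self.rttvar)
        return self.timeout


timer = AdaptiveTimer(initial_timeout=100)
for latency in (40, 40, 200):   # the last ACK is delayed by rising network load
    print(timer.sample(latency))
```

The loop prints 120.0, 100.0 and 265.0: the timeout settles while latencies are stable and grows once a longer latency is observed. The estimator only needs the sender's own timestamps, which keeps monitoring overhead low.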
- Karim Eissa
Modeling of a multi-core MicroBlaze system at RTL and TLM abstraction levels in SystemC (Master Thesis)
Transaction Level Modeling (TLM) has become a popular approach for modeling contemporary Systems-on-Chip (SoCs) at a higher abstraction level than Register-Transfer Level (RTL). In this thesis, a multi-core system based on the Xilinx MicroBlaze microprocessor should be modeled at the RTL and TLM abstraction levels in SystemC. At both levels, the models should be cycle-accurate. The implemented models should be verified against the reference VHDL models using VHDL/SystemC mixed-language simulation with ModelSim. Finally, performance measurements should be carried out to evaluate the simulation speedup at the transaction level.
- Michael Kaufmann
Reliable Communication by Fault-Tolerant Multilayer Routing
Modern supercomputers are highly parallel systems that scale up to several thousands of nodes. To provide fast communication in such systems, microprocessor vendors are integrating messaging units into their chips. These integrated network interfaces enable direct cache-to-cache communication between processor cores, providing low latency transmissions and high data throughput.
Due to the high degree of parallelism, reliability and availability are becoming major concerns in supercomputer systems, so mechanisms to tolerate component failures have to be provided. As the predominant topology of current supercomputers’ interconnection networks is the multidimensional torus, fault tolerance is implicitly supported by multiple redundant paths between nodes. Exploiting these paths requires dynamic routing functions that can react to detected faults. However, area constraints and high clock frequencies restrict hardware-based routing functions to simple deterministic schemes. To circumvent these limitations, multilayer routing is used: a second routing layer implemented in software is put on top of the simpler hardware routing.
When resources such as links or nodes fail, this second layer directs messages around the faults by routing them in software over one or more intermediate nodes. The intermediate nodes are chosen such that they form a chain of valid hardware routing paths from source to destination. The solution developed here uses a compact representation of detected faults to minimize the overhead in terms of runtime and memory requirements. In addition, the selection process considers the additional load caused by rerouted traffic in order to keep the link load balanced. The implementation has been shown to work on an IBM Blue Gene/Q supercomputer.
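The core idea of multilayer routing can be sketched on a 2D torus, assuming the hardware layer uses deterministic dimension-order routing (shortest direction in each dimension). If the hardware path from source to destination crosses a faulty link, the software layer searches for an intermediate node such that both hardware segments avoid all faults. This sketch tries a single intermediate node, uses a plain set of directed faulty links, and ignores link load. The compact fault representation and load balancing described above are not modeled, and all names are hypothetical.

```python
from itertools import product
from typing import List, Optional, Set, Tuple

Node = Tuple[int, ...]
Link = Tuple[Node, Node]  # directed link (from, to)


def dor_path(src: Node, dst: Node, dims: Tuple[int, ...]) -> List[Link]:
    """Links used by deterministic dimension-order routing on a torus.

    Dimensions are resolved in order; in each one the shorter way around
    the ring is taken (ties go in the + direction).
    """
    links, cur = [], list(src)
    for d, size in enumerate(dims):
        forward = (dst[d] - cur[d]) % size
        step = 1 if forward <= size - forward else -1
        while cur[d] != dst[d]:
            nxt = cur.copy()
            nxt[d] = (cur[d] + step) % size
            links.append((tuple(cur), tuple(nxt)))
            cur = nxt
    return links


def software_route(src: Node, dst: Node, dims: Tuple[int, ...],
                   faulty: Set[Link]) -> Optional[List[Node]]:
    """Hops for the software layer: [dst] if the hardware path is fault-free,
    else [intermediate, dst] using the first node whose two hardware segments
    avoid all faulty links, or None if one intermediate node is not enough."""
    def usable(a: Node, b: Node) -> bool:
        return not any(link in faulty for link in dor_path(a, b, dims))

    if usable(src, dst):
        return [dst]
    for mid in product(*(range(n) for n in dims)):
        if mid not in (src, dst) and usable(src, mid) and usable(mid, dst):
            return [mid, dst]
    return None


faulty = {((1, 0), (2, 0))}
print(software_route((0, 0), (2, 0), (4, 4), faulty))  # [(0, 1), (2, 0)]
```

Picking the first usable intermediate node is the simplest policy; a load-aware selection would rank all usable candidates by the expected load on their segments instead.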
- Zixuan Cheng
Transaction-Level Instruction Set Simulator of an ATMEL AVR Microcontroller Core (Master Thesis)
Modern design flows require the simulation of software running on a CPU in a larger system context. For this purpose, an instruction set simulator (ISS) specific to the ATMEL AVR processor architecture shall be developed. To interface with the rest of the system simulation model, the ISS shall have a transaction-level interface. To transform AVR assembler code (generated with a given cross compiler from, e.g., C/C++ sources) into a representation suitable for compiled instruction set simulation, a preprocessor has to be developed. As time permits, the implementation of an interface with an IDE / debugger (AVR Studio or GNU gdb) is desirable.
The thesis is performed in our Embedded Systems Lab in close cooperation with ATMEL, Heilbronn, as part of the research project ROBUST. Post-thesis job opportunities with ATMEL exist.
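The difference between interpretive and compiled instruction set simulation can be sketched as follows. A preprocessor parses the assembler text once and turns every instruction into a pre-built Python closure. The simulation loop then only calls these closures and never decodes instructions again. The sketch supports just LDI, ADD, DEC and BRNE, models only the Z flag, and has no transaction-level interface. It is an illustration in Python under these assumptions, not the thesis's preprocessor; all function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class CpuState:
    r: List[int] = field(default_factory=lambda: [0] * 32)  # R0..R31
    z: bool = False                                        # only flag modeled
    cycles: int = 0


Op = Callable[[CpuState, int], Tuple[int, int]]  # (state, pc) -> (next pc, cycles)


def _ldi(d: int, k: int) -> Op:
    def op(s, pc):
        s.r[d] = k & 0xFF
        return pc + 1, 1
    return op


def _add(d: int, r: int) -> Op:
    def op(s, pc):
        s.r[d] = (s.r[d] + s.r[r]) & 0xFF
        s.z = s.r[d] == 0
        return pc + 1, 1
    return op


def _dec(d: int) -> Op:
    def op(s, pc):
        s.r[d] = (s.r[d] - 1) & 0xFF
        s.z = s.r[d] == 0
        return pc + 1, 1
    return op


def _brne(target: int) -> Op:
    def op(s, pc):
        return (pc + 1, 1) if s.z else (target, 2)  # 2 cycles when taken
    return op


def preprocess(asm: str) -> List[Op]:
    """Translate AVR assembler text into pre-decoded operations (done once)."""
    labels, lines = {}, []
    for raw in asm.splitlines():
        line = raw.split(";")[0].strip()          # strip comments
        if line.endswith(":"):
            labels[line[:-1]] = len(lines)
        elif line:
            lines.append(line)

    def reg(tok: str) -> int:
        return int(tok.lower().lstrip("r"))

    program = []
    for line in lines:
        mnemonic, _, rest = line.partition(" ")
        args = [a.strip() for a in rest.split(",")]
        m = mnemonic.upper()
        if m == "LDI":
            d = reg(args[0])
            if not 16 <= d <= 31:
                raise ValueError("LDI only accepts R16..R31")
            program.append(_ldi(d, int(args[1], 0)))
        elif m == "ADD":
            program.append(_add(reg(args[0]), reg(args[1])))
        elif m == "DEC":
            program.append(_dec(reg(args[0])))
        elif m == "BRNE":
            program.append(_brne(labels[args[0]]))
        else:
            raise ValueError(f"unsupported instruction: {line}")
    return program


def run(program: List[Op], max_steps: int = 10_000) -> CpuState:
    """Execute pre-decoded operations; no instruction decoding happens here."""
    s, pc = CpuState(), 0
    for _ in range(max_steps):
        if pc >= len(program):
            break
        pc, cycles = program[pc](s, pc)
        s.cycles += cycles
    return s


state = run(preprocess("""
    ldi r16, 0      ; accumulator
    ldi r17, 5
    ldi r18, 3      ; loop counter
loop:
    add r16, r17
    dec r18
    brne loop
"""))
print(state.r[16], state.cycles)  # 15 14
```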
- Nikolaos Batzolis
Fault-tolerant End-to-End Flow Control Protocol for Networks-On-Chip (NoC) (Master Thesis)
On-chip networks (Networks-on-Chip, NoCs) are communication networks that predominantly provide packet-switched communication between the processing elements of an embedded system. With ongoing decreases in feature size, complex systems with hundreds of processing elements can be implemented on a single chip. However, decreasing feature sizes incur the serious drawback of higher susceptibility to manufacturing tolerances and external influences, resulting in an increased probability of chip faults. Faulty components or communication links inside NoC-enabled chips can lead to data corruption or packet loss.
In the near future, NoCs will be used to implement safety-critical applications. The loss of packets or corruption of data during communication between network elements may cause the system to deviate from its correct behavior or even to fail completely. Such deviation from the specified behavior can damage devices irreparably or even result in the loss of human life. For that reason, fault-free communication between processing elements is a primary concern, which can be achieved by ensuring that every packet reaches its destination even in the presence of permanent errors.
- Adán Kohler
Modeling and Simulation of Networks-on-Chip at the Transaction Level
Networks-on-Chip (NoCs) provide communication between the processing elements of multiprocessor systems-on-chip (MPSoCs). When designing NoCs, network topologies, routing mechanisms, and other aspects of the network must be selected such that the communication requirements of the applications to be implemented can be met. Evaluating this requires a simulation of the network that includes the communication behavior of the processing elements. For bus-based systems, transaction-level modeling and simulation has been developed: it combines communication operations into so-called transactions and achieves higher simulation performance by abstracting from protocol details (e.g., individual signals). In this Diploma thesis, the transaction concept shall now be applied to the modeling of NoCs and adapted if necessary. The work can build on the SystemC simulation library and the TLM 2.0 library for transaction-level simulation. A suitable framework, e.g., in the form of a NoC simulation library with defined interfaces, shall be created that allows users to define the details of a NoC architecture (topology, routing, etc.) themselves.
WS 2007/08 and older
- George Raju
Transaction Level Modelling of H.264 Decoding Processes
The standard H.264 / MPEG-4 part 10 defines an encoded representation of digital video sequences and its decoding process. The decoding process is implemented as software in the JM reference model. Due to its sequential nature, the JM reference is not well suited as a reference against which a parallel hardware implementation of an H.264 decoder could be verified. The subject of this thesis is the design of a parallel reference model of H.264 decoding in SystemC. The model shall be designed at the Transaction Level of abstraction.
- Ms. Weining Hao
Architecture and Implementation of a H.264 Deblocking Accelerator
The standard H.264 / MPEG-4 part 10 defines an encoded representation of digital video sequences and its decoding process. This process includes a deblocking sub-process to reduce the visual impact of block artefacts. Unlike in previous video coding standards, H.264 deblocking is part of the decoding loop ("in-loop filter"): the deblocked video frames serve as references for frames that are decoded later. Therefore, the deblocking process is time-critical. Furthermore, deblocking is known to account for about one third of the performance requirements of H.264 decoding. The subject of this thesis is the design of a hardware accelerator for H.264 deblocking that can speed up an otherwise software-based decoder.
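The per-edge work that such an accelerator has to perform can be illustrated with the luma "normal" filter of H.264 (boundary strength bS < 4). Samples across an edge are filtered only if bS is non-zero and the sample differences are below the thresholds alpha and beta. The two samples nearest the edge are then shifted by a clipped correction delta. In this sketch alpha, beta and the clipping bound tc are passed in directly instead of being looked up in the standard's QP-indexed tables. The p1/q1 updates and the strong bS = 4 filter are omitted, and the function names are hypothetical.

```python
def clip3(lo: int, hi: int, x: int) -> int:
    return max(lo, min(hi, x))


def filter_luma_edge(p0: int, p1: int, q0: int, q1: int, bs: int,
                     alpha: int, beta: int, tc: int, bit_depth: int = 8):
    """Filter the two samples nearest an edge (normal filter, bS < 4 only).

    p1, p0 | q0, q1 are consecutive samples across a block edge. alpha, beta
    and tc would come from the standard's QP-indexed tables (not reproduced).
    Returns the new (p0, q0).
    """
    filter_samples = (bs != 0 and abs(p0 - q0) < alpha
                      and abs(p1 - p0) < beta and abs(q1 - q0) < beta)
    if not filter_samples:
        return p0, q0   # a large step is treated as a real image edge
    delta = clip3(-tc, tc, (((q0 - p0) << 2) + (p1 - q1) + 4) >> 3)
    max_sample = (1 << bit_depth) - 1
    return clip3(0, max_sample, p0 + delta), clip3(0, max_sample, q0 - delta)


print(filter_luma_edge(100, 100, 110, 110, bs=2, alpha=20, beta=5, tc=3))  # (103, 107)
print(filter_luma_edge(100, 100, 150, 150, bs=2, alpha=20, beta=5, tc=3))  # (100, 150)
```

Each edge requires only a few comparisons, additions and shifts, but this is repeated for every edge of every macroblock, and neighboring edges depend on each other's results. That combination makes deblocking both costly in software and attractive for a dedicated data path.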
- Thomas Bruni
A Formalized Approach to Transaction Level Modeling
In transaction level modeling (TLM), high simulation speed is achieved by modeling at higher levels of abstraction than signals and RTL. The level of abstraction at which modeling is performed depends on the context in which a model is used and on the required accuracy. The levels of accuracy required in most modeling activities have been identified and proposed by several researchers and institutes active in the TLM field. For example, the OSCI TLM approach proposes the PV (Programmer's View), PVT (Programmer's View with Timing), CX (Cycle Approximate), and CA (Cycle Accurate) abstraction levels, in increasing order of precision and decreasing order of simulation speed. However, these definitions of the abstraction levels are informal, and the transition from one abstraction level to another is neither systematic nor automatable. For example, although transaction level models of a bus at different abstraction levels represent the same underlying communication protocol, the CX, CA, and PVT models are often developed independently with little or no reuse. The objective of this thesis is the development of a more formal, generic approach to modeling buses, so that models at different abstraction levels can be generated from a single formal description (e.g., communicating state machines) in a systematic and potentially automatable manner. The proposed approach shall be validated using an existing bus protocol, and the final executable models shall be implemented in SystemC.
- Muhammad Shaharyar Awan
Transaction Level Power and Timing Exploration of Bus Architectures
In modern embedded systems, low power consumption is an increasingly important factor that should be taken into account when exploring the design space. Limited energy resources such as batteries, size constraints, and limited cooling possibilities have motivated power-aware design techniques, which take power consumption limitations into account in addition to performance and timing. Low-power design at lower levels (i.e., physical, gate, and transistor levels) has been extensively studied and successfully applied to complex integrated circuits such as microprocessors. A recent trend is system-level power-aware design, in which power consumption is analyzed and optimized at higher levels. One example is software optimization techniques that reduce cache misses and hence result in fewer external memory accesses and lower power consumption. Another example is the power consumption of buses, where factors such as the number of transitions on the address, data, and control lines directly affect power consumption. Therefore, factors such as arbitration policies and address/data coding schemes can be used to control the power consumption associated with a bus. The objective of this thesis is the conception and development of an OSCI-TLM-based framework for unified power and timing exploration. The focus is on the bus model and the effect of different arbitration policies on timing and power consumption. A model of an existing bus protocol shall be developed. For masters and slaves, generic models with simple power models (e.g., simple traffic pattern generators for masters and memory modules for slaves) shall be implemented and used in the experiments.
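How a data coding scheme changes the number of bus transitions, and therefore dynamic bus power, can be illustrated with bus-invert coding. This is a well-known scheme chosen here as an example; the thesis does not necessarily study it. Whenever more than half of the data lines would toggle, the complement of the word is driven instead and an extra INV line is set. The transition count serves as a simple proxy for dynamic power; line capacitances and supply voltage are ignored, and the bus is assumed to start at all zeros.

```python
from typing import List, Tuple


def toggles(prev: int, cur: int) -> int:
    """Number of bus lines that change value between two consecutive words."""
    return bin(prev ^ cur).count("1")


def raw_toggles(words: List[int]) -> int:
    """Transitions when words are sent unencoded (bus starts at all zeros)."""
    total, prev = 0, 0
    for w in words:
        total += toggles(prev, w)
        prev = w
    return total


def bus_invert(words: List[int], width: int = 8) -> Tuple[List[int], List[int], int]:
    """Bus-invert coding: drive the complement (and set an extra INV line)
    whenever more than half of the data lines would toggle.

    Returns (bus values, INV values, total transitions including the INV line).
    """
    mask = (1 << width) - 1
    prev_bus, prev_inv = 0, 0
    bus_values, inv_values, total = [], [], 0
    for w in words:
        if toggles(prev_bus, w) > width // 2:
            bus, inv = ~w & mask, 1
        else:
            bus, inv = w, 0
        total += toggles(prev_bus, bus) + (prev_inv ^ inv)
        bus_values.append(bus)
        inv_values.append(inv)
        prev_bus, prev_inv = bus, inv
    return bus_values, inv_values, total


data = [0xFF, 0x00, 0xFF, 0x00]
print(raw_toggles(data), bus_invert(data)[2])  # 32 4
```

The receiver recovers each word by XOR-ing the bus value with all ones whenever INV is set. In a power-aware TLM framework, a per-transaction transition count like this could be accumulated alongside the timing annotation.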
- Adán Kohler (Student Research Project)
Porting and Optimization of H.264 Decoding Software for an Embedded System
The standard H.264 / MPEG-4 Part 10 defines an encoded representation of digitized video sequences and an associated decoding process for various picture resolutions (levels) and various combinations of alternative coding tools (profiles). The decoding process is implemented (with restrictions regarding profiles and levels) by the open-source software X264. The task of this student research project is to port this software, written for desktop computers, to an embedded development board (ARM Versatile Platform Board with an ARM926EJ-S processor). Furthermore, decoding shall be accelerated by delegating a sub-process, the so-called deblocking filtering, to an application-specific integrated circuit.
- Rauf Salimi Khaligh
Transaction-Based Simulation of ARM Platforms
ARM is a family of microprocessors frequently used in embedded systems. Such systems contain hardware accelerators, peripheral units, and memories that are connected to the ARM processor via a bus system and together form a so-called platform. The topic of this Diploma thesis is the development of an efficient simulation system for such a platform, based on transaction-level modeling with SystemC. The ARM instruction set simulator ("Armulator") shall be integrated into the simulation system. A library of models, e.g., for memories and the AMBA bus system, is to be developed. The simulation system shall be tested with an example application.
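The basic mechanism of a transaction-level platform model can be sketched as a bus that decodes the address of each transaction and forwards it to the matching memory-mapped slave. Every hop adds its latency to an annotated delay instead of stepping the simulation cycle by cycle. In SystemC TLM-2.0 this role is played by blocking transport (`b_transport` with a `tlm_generic_payload` and an `sc_time` delay). The Python classes below only mirror that idea and are hypothetical: latencies are assumed values, and AMBA arbitration and protocol phases are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    """Simplified stand-in for a TLM-2.0 generic payload (hypothetical)."""
    command: str              # "read" or "write"
    address: int
    data: int = 0
    status: str = "incomplete"


class Memory:
    """Memory slave: word storage plus a fixed access latency (assumed)."""

    def __init__(self, size: int, latency: int):
        self.size = size
        self.latency = latency
        self.words = {}

    def transport(self, t: Transaction, delay: int) -> int:
        if t.command == "write":
            self.words[t.address] = t.data
        else:
            t.data = self.words.get(t.address, 0)
        t.status = "ok"
        return delay + self.latency


class Bus:
    """Address decoder forwarding transactions to memory-mapped slaves.

    Each hop adds its latency to the annotated delay instead of advancing
    simulation time per cycle, in the spirit of loosely timed TLM.
    Arbitration between masters is not modeled.
    """

    def __init__(self, latency: int):
        self.latency = latency
        self.regions = []  # (base address, size, slave)

    def attach(self, base: int, slave: Memory) -> None:
        self.regions.append((base, slave.size, slave))

    def transport(self, t: Transaction, delay: int = 0) -> int:
        for base, size, slave in self.regions:
            if base <= t.address < base + size:
                t.address -= base                 # slave sees a local address
                delay = slave.transport(t, delay + self.latency)
                t.address += base                 # restore the global address
                return delay
        t.status = "address_error"
        return delay + self.latency


bus = Bus(latency=2)
bus.attach(0x0000, Memory(size=0x1000, latency=1))   # e.g., ROM
bus.attach(0x2000, Memory(size=0x1000, latency=3))   # e.g., RAM
print(bus.transport(Transaction("write", 0x2004, data=42)))  # 5
```

An instruction set simulator such as the Armulator would issue its memory accesses as such transactions, and other slaves (accelerators, peripherals) only need to provide the same `transport` method to be attached to the bus.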