Architecture and Real-Time Systems (ARTS) Laboratory


Real-Time Techniques

RAPIDS 4.0

Running a real-time application on a target platform, with a tool that provides a detailed pictorial view of the important events in the lifetime of the application, can help us further understand the interaction of hardware and software in the real-time system.

RAPIDS 4.0 will provide an integrated platform for launching and monitoring a real-time system and for evaluating its performance. The collected data will not only be useful in correcting design errors but can also provide feedback for improvement. Beyond monitoring the application running on the target platform, the tool will also let the user analyze the impact of certain system parameters on performance and determine their optimal values. These parameters fall into three categories.

Configuration parameters: In addition to launching the application on the target platform, the user can override the configuration of the target platform to test the performance of the application under different configurations within the scope of the resources available. For example, in a bus-based system, the user can restrict the application to run on a subset of the nodes attached to the bus. In a large distributed point-to-point connected system, the user can choose a specific sub-network in which the application must be run. The user will also be able to override the default task management software, i.e., RAPIDS will provide the capability of assigning the various tasks of the application to specific nodes in the target platform based on the user's directive or a specific algorithm. Such a capability can help in finding an appropriate system configuration for the application under consideration.
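As an illustration only, the configuration override described above might be sketched as follows. The names PlatformConfig and assign_tasks are hypothetical and not part of RAPIDS, and round-robin placement is an assumed stand-in for whatever allocation algorithm the user selects.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PlatformConfig:
    """Hypothetical configuration override (not a RAPIDS data structure)."""
    all_nodes: list[str]                 # every node on the target platform
    allowed_nodes: list[str]             # subset the application may run on
    pinned: dict[str, str] = field(default_factory=dict)  # task -> node directive


def assign_tasks(tasks: list[str], config: PlatformConfig) -> dict[str, str]:
    """Map tasks to nodes: user directives first, round-robin for the rest."""
    unknown = set(config.allowed_nodes) - set(config.all_nodes)
    if unknown:
        raise ValueError(f"nodes not on the platform: {sorted(unknown)}")
    if not config.allowed_nodes:
        raise ValueError("at least one node must be allowed")

    assignment: dict[str, str] = {}
    next_index = 0
    for task in tasks:
        if task in config.pinned:
            node = config.pinned[task]
            if node not in config.allowed_nodes:
                raise ValueError(f"task {task} pinned to excluded node {node}")
        else:
            # Assumed default policy: cycle through the allowed subset.
            node = config.allowed_nodes[next_index % len(config.allowed_nodes)]
            next_index += 1
        assignment[task] = node
    return assignment


config = PlatformConfig(
    all_nodes=["n0", "n1", "n2", "n3"],
    allowed_nodes=["n1", "n2"],          # restrict the run to two bus nodes
    pinned={"t2": "n2"},                 # user directive for one task
)
result = assign_tasks(["t0", "t1", "t2", "t3"], config)
```

Rerunning the same application with a different allowed_nodes subset or a different placement policy is the kind of comparison this capability is meant to support.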

Task parameters: The tool will provide a detailed pictorial view of various events during the execution of a task. Events include the start of a task, completion of a task, sending or broadcasting a message, receiving a message, preemption, checkpointing, etc. When the entire application set is not available, the tool will provide the capability to generate synthetic tasks in addition to the "real" tasks. Synthetic tasks can also be used to simulate the ambient (or background) workload of the system. The user will be given full control over the description of the synthetic tasks so as to mimic an actual workload. The user can describe a synthetic task either through a task trace that was generated earlier or through a detailed user interface. The user will also be given the option of overriding the scheduling algorithm at each of the nodes in the system.
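A minimal sketch of how a synthetic background workload could be described and expanded into a release trace is given below. It assumes strictly periodic tasks, and the SyntheticTask fields and helper names are hypothetical rather than the RAPIDS task format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntheticTask:
    """Hypothetical description of a periodic synthetic task."""
    name: str
    period: int      # time units between successive releases
    exec_time: int   # execution time requested by each job
    deadline: int    # deadline relative to the release time


def release_trace(tasks, horizon):
    """Return (release_time, task_name, absolute_deadline) tuples in time order."""
    events = []
    for task in tasks:
        for release in range(0, horizon, task.period):
            events.append((release, task.name, release + task.deadline))
    return sorted(events)


def utilization(tasks):
    """Fraction of one processor the task set requests on average."""
    return sum(task.exec_time / task.period for task in tasks)


tasks = [
    SyntheticTask("background", period=4, exec_time=1, deadline=4),
    SyntheticTask("control", period=10, exec_time=2, deadline=8),
]
trace = release_trace(tasks, horizon=10)
load = utilization(tasks)
```

A trace recorded from an earlier run could be fed to the same consumer in place of the generated one, which mirrors the two input options mentioned above.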

Traditionally, the dependability of a real-time system has been equated with the effectiveness of the fault-tolerant mechanisms embedded in the system. Recently, the ability of the system to handle load surges [4] has been pointed out as a better measure of real-time system dependability. RAPIDS will provide the user with the capability of simulating load surges and obtaining values of the surge-handling capability of the real-time system.
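One simple way to picture a load surge experiment is sketched below. It assumes a single node serving jobs in FIFO order and uses the number of missed deadlines as a stand-in for surge handling capability; the measure defined in [4] may differ, and all function names and parameters are hypothetical.

```python
def arrivals(horizon, base_gap, surge_start, surge_end, surge_factor):
    """Arrival times up to horizon; gaps shrink by surge_factor during the surge."""
    times = []
    t = 0.0
    while t < horizon:
        times.append(t)
        in_surge = surge_start <= t < surge_end
        t += base_gap / surge_factor if in_surge else base_gap
    return times


def missed_deadlines(times, service_time, deadline):
    """Count jobs whose response time exceeds the relative deadline (FIFO server)."""
    free_at = 0.0
    missed = 0
    for arrival in times:
        start = max(arrival, free_at)      # wait if the node is still busy
        free_at = start + service_time
        if free_at - arrival > deadline:
            missed += 1
    return missed


# Nominal load: one job every 2 time units, no surge window.
nominal = arrivals(10, 2.0, surge_start=100, surge_end=100, surge_factor=1)
# Surge: arrivals come four times as fast between t=2 and t=6.
surged = arrivals(10, 2.0, surge_start=2, surge_end=6, surge_factor=4)

nominal_misses = missed_deadlines(nominal, service_time=1.0, deadline=2.0)
surge_misses = missed_deadlines(surged, service_time=1.0, deadline=2.0)
```

Sweeping surge_factor or the surge duration and recording where deadlines start to be missed gives a rough picture of how much surge the system can absorb.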

Fault parameters: Experimental evaluation by fault injection has become an attractive way of validating specific fault-handling mechanisms and of estimating dependability measures such as fault coverage and error latency. The user will be given the option of specifying different types of faults, and these faults will then be emulated through software (Software Implemented Fault Injection) to study their real impact. Apart from low-level faults such as those affecting the CPU and memory, simulation of various network-related faults such as message corruption, message reordering and message delaying will also be provided. By logging important events during the fault recovery process, values of important parameters such as rollback overhead and hardware and software reconfiguration penalties can be obtained.
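The network-related faults listed above can be illustrated with a small software fault injector. This is a sketch under stated assumptions, not the RAPIDS mechanism: Message and MessageFaultInjector are hypothetical names, corruption is modeled as a single bit flip, and reordering is modeled as a side effect of random delays.

```python
import random
from dataclasses import dataclass


@dataclass
class Message:
    seq: int            # sequence number assigned by the sender
    payload: bytes
    deliver_at: float   # scheduled delivery time


class MessageFaultInjector:
    """Hypothetical injector for corruption and delay faults."""

    def __init__(self, corrupt_prob=0.0, delay_prob=0.0, max_delay=0.0, seed=None):
        self.corrupt_prob = corrupt_prob
        self.delay_prob = delay_prob
        self.max_delay = max_delay
        self.rng = random.Random(seed)   # seeded so experiments are repeatable
        self.log = []                    # (fault_type, seq) for later analysis

    def inject(self, msg):
        payload, deliver_at = msg.payload, msg.deliver_at
        if payload and self.rng.random() < self.corrupt_prob:
            corrupted = bytearray(payload)
            index = self.rng.randrange(len(corrupted))
            corrupted[index] ^= 1 << self.rng.randrange(8)   # flip one bit
            payload = bytes(corrupted)
            self.log.append(("corrupt", msg.seq))
        if self.rng.random() < self.delay_prob:
            deliver_at += self.rng.uniform(0.0, self.max_delay)
            self.log.append(("delay", msg.seq))
        return Message(msg.seq, payload, deliver_at)


def deliver(messages, injector):
    """Apply faults, then order by delivery time; delays may reorder messages."""
    faulty = [injector.inject(m) for m in messages]
    return sorted(faulty, key=lambda m: m.deliver_at)


messages = [Message(i, b"abcd", float(i)) for i in range(3)]
clean = deliver(messages, MessageFaultInjector(seed=1))
injector = MessageFaultInjector(corrupt_prob=1.0, seed=1)
corrupted = deliver(messages, injector)
```

The injector's log plays the role of the event record mentioned above: comparing it with what the receiver detects gives an estimate of fault coverage.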

RAPIDS 3.0 takes system-related parameters such as preemption cost, checkpoint overhead and reconfiguration penalty as input from the user, so the accuracy of the simulation depends on the user's experience or knowledge of the target platform. By running the application on the target platform and monitoring important events, more precise values of these parameters can be obtained. Another advantage of monitoring a live system is that it can identify performance bottlenecks. Exposing performance bottlenecks can help the designer determine the suitability of the system for the application. The designer might then decide to try out a different configuration or modify some parameters, algorithms or mechanisms to ensure the system meets the requirements of the application. In RAPIDS 3.0, all faults affecting the nodes or links are assumed to be benign and the nodes are assumed to be fail-silent. This is not always true unless some special hardware is provided. RAPIDS 4.0 will provide a more complete framework for actual fault injection and subsequent monitoring of the fault recovery process.


Figure 1: Setup of the Tool

The Main Display Node (MDN) houses the Main Monitor Module (MMM). It is not only a central location for the collection of data from all the application nodes, but also a point of data entry by the user. MMM provides a detailed graphical user interface for data input and viewing of the collected data.

Apart from application tasks, the Application Computing Nodes (ACNs) contain a Local Monitoring Module (LMM) and a Micro Kernel Module (MKM). MKM is a small module that works in close conjunction with the kernel and is responsible for task allocation and fault injection. LMM mainly receives task and fault parameters from the MMM, passes them over to the kernel module at the right time and monitors important events local to the node. Whenever possible, events will be logged in fail-safe places at the local machines. Message passing between the ACNs and the MDN can cause high network traffic that may affect the average message latency of the application. In order to circumvent this problem, a technique utilizing postponed message delivery will be used. Event information can be sent to the MMM during the quiescent state of the network, i.e., when the network traffic is low and the impact of the messages generated by the monitoring tool on the transfer time of the actual application messages will be minimal.
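The postponed message delivery idea can be sketched as a local buffer that is flushed only when the network is quiet. The class below is a hypothetical illustration: the network_load measure, the quiet_threshold value and the forced flush when the buffer fills are all assumptions, not details given for the LMM.

```python
from collections import deque


class PostponedEventSender:
    """Hypothetical LMM-side buffer that defers event reports to the MMM."""

    def __init__(self, send, quiet_threshold, max_buffered=1000):
        self.send = send                      # callable that ships a batch of events
        self.quiet_threshold = quiet_threshold
        self.max_buffered = max_buffered      # assumed bound on local log size
        self.buffer = deque()

    def record(self, timestamp, event):
        """Log an event locally without generating any network traffic."""
        self.buffer.append((timestamp, event))

    def poll(self, network_load):
        """Flush when the network is quiescent or the buffer is full.

        Returns the number of events sent in this call.
        """
        if not self.buffer:
            return 0
        busy = network_load > self.quiet_threshold
        if busy and len(self.buffer) < self.max_buffered:
            return 0                          # postpone: application traffic first
        batch = list(self.buffer)
        self.buffer.clear()
        self.send(batch)
        return len(batch)


sent_batches = []
sender = PostponedEventSender(sent_batches.append, quiet_threshold=0.2)
sender.record(1.0, "task_start")
sender.record(2.0, "preemption")
sent_while_busy = sender.poll(network_load=0.9)    # network busy, nothing sent
sent_while_quiet = sender.poll(network_load=0.1)   # quiescent, batch is flushed
```

Because events carry their own timestamps, delaying the transfer does not change what the MMM eventually displays, only when it receives it.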

To facilitate the integration of any application into the RAPIDS framework, a set of link libraries and APIs will be provided. The link libraries will attempt to reduce, as much as possible, the amount of interference with the normal working of the application.

Performance Estimation: Another important aspect of RAPIDS 4.0 is that it will provide valuable suggestions to the user on how the performance of the system can be improved. It will do this by analyzing the monitoring data.

In some cases, we would like to study the impact of modifications to the hardware configuration beyond the scope of the resources available in the testbed. We would also like to study how well the application will scale to other systems, or how system performance will be affected if more applications are introduced into the system. This can be done by collecting parameters from the testbed and then applying them to the RAPIDS simulation module. With this simulation module, designers can study the impact of parameter changes (e.g., the number of processors, the failure rate and the workload) and architectural choices on performance and reliability.