
Architecture and Real-Time Systems (ARTS) Laboratory

UMASS

The goal of the project is to define and implement new and better measures of computer system reliability. We can divide the work into three tasks, each of which is described below:

Task 1. New Fault-Tolerance and Reliability Measures

  1. Network Measure: The choice of an appropriate interconnection network is key to determining the performance of a fault-tolerant system. We will develop a measure of the interconnection network to guide the designer in choosing an appropriate topology and the bandwidths of the various links in that topology.

    This measure will quantify the following two properties of the interconnection network:

    • Structural Properties: The structural properties of a network include the number of hops between individual nodes, the way the structure can degrade as a result of node or link failure, and the number of independent paths between any two nodes. These properties can be used to indicate lower bounds on the time taken to send a message from one node to another, and the options that exist for reconfiguring the network in the event of node or link failure.


    • Bandwidth Properties: The bandwidth properties of a network indicate how much traffic can be sustained between the source-destination pairs. They depend on the structural properties and on the bandwidths of the links and the nodal network interfaces.

  2. Computer Measures: Currently available measures either treat the computer in purely static, hardware terms (in the case of traditional reliability and its derivative measures), or require exact characterization of the application and the workload (in the case of performability or application-related cost functions). We understand that computer system dependability is a function of many factors. We will develop a measure that can adequately capture the fault-tolerant capacity of a real-time computer in general terms, such as its ability to execute a critical workload at a given rate. Such a measure will take into account not only traditional inputs such as the processor failure rate, but also the fault-tolerance impact of the recovery procedures (together with the number and spacing of the checkpoints), as well as the overall embedded environment, including the task scheduling and rescheduling algorithms.

    The measures will gauge a system's ability to handle a surge as an indicator of its reliability. We term these Surge Handling Measures.

  3. Integrated Measure: When some information is available about the application, the workload, and the operating environment, we can do better than with the computer measure described above. We need a measure that takes all this information into account and characterizes reliability within the context of the application. Clearly, the more precisely we can characterize the application, workload, and operating environment, the more precise our reliability evaluation will be.

    This measure will assume a much more detailed modeling of the computer system and its environment than in the computer measure. It can be used to fine-tune the machine (both the hardware and the operating system algorithms such as allocation/scheduling or recovery algorithms) to optimize performance, or to minimize the amount of redundancy needed to achieve preset reliability goals.
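The structural properties listed under the Network Measure can be made concrete with a small sketch. The code below is only an illustration under our own assumptions, not the measure this project will define: the four-node ring topology and the function names `hop_counts`, `link_disjoint_paths`, and `without_link` are hypothetical. It computes hop counts by breadth-first search, counts link-disjoint paths with a unit-capacity maximum flow, and shows how both change after a link failure.

```python
from collections import deque


def hop_counts(adj, source):
    """Breadth-first search: minimum hop count from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def link_disjoint_paths(adj, s, t):
    """Count link-disjoint s-t paths using unit-capacity max flow (BFS augmenting paths)."""
    # Each undirected link is modeled as two directed arcs of capacity 1.
    cap = {(u, v): 1 for u in adj for v in adj[u]}
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow  # no augmenting path remains
        # Push one unit of flow back along the path that was found.
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u
        flow += 1


def without_link(adj, a, b):
    """Return a copy of the topology with the undirected link a-b removed."""
    return {
        u: [v for v in nbrs if {u, v} != {a, b}]
        for u, nbrs in adj.items()
    }


# Hypothetical topology: a four-node ring 0-1-2-3-0.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
degraded = without_link(ring, 0, 1)

print(hop_counts(ring, 0), link_disjoint_paths(ring, 0, 2))
print(hop_counts(degraded, 0), link_disjoint_paths(degraded, 0, 2))
```

In this toy ring, the failure of link 0-1 raises the hop count from node 0 to node 1 from one to three and leaves a single path between nodes 0 and 2, which is the kind of structural degradation the measure is meant to expose.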

Task 2. Analytical Models and Algorithms

We will develop algorithms to compute each of the measures developed as part of Task 1. In their most exact form, these algorithms are likely to be suitable for offline use. However, we will also develop approximate, fast, online variations. Such online versions will be used if the operating system needs to estimate reliabilities on the fly. Such a need can arise, for example, when the system has to decide on the optimal new configuration in the event of component failure or a change in mission characteristics (e.g., task loading, task priorities, task deadlines).
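The trade-off between an exact offline computation and a fast approximate one can be sketched as follows. This is an illustration under our own assumptions rather than one of the algorithms to be developed: we take two-terminal connectivity of a five-link bridge network as a stand-in measure, assume every link is independently up with the same probability, and use hypothetical function names. The exact version enumerates every combination of link states, so its cost grows exponentially with the number of links; the approximate version samples a fixed number of random states.

```python
import itertools
import random

# Hypothetical bridge network between terminals "s" and "t".
LINKS = [("s", "a"), ("s", "b"), ("a", "b"), ("a", "t"), ("b", "t")]


def connected(up_links, s, t):
    """Depth-first search over the working links: can s still reach t?"""
    adj = {}
    for u, v in up_links:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    seen = {s}
    stack = [s]
    while stack:
        u = stack.pop()
        if u == t:
            return True
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return False


def exact_reliability(links, p_up, s, t):
    """Exact value: sum the probability of every link-state combination that connects s and t."""
    total = 0.0
    for states in itertools.product([True, False], repeat=len(links)):
        up = [link for link, ok in zip(links, states) if ok]
        if connected(up, s, t):
            total += p_up ** len(up) * (1 - p_up) ** (len(links) - len(up))
    return total


def estimated_reliability(links, p_up, s, t, samples, rng):
    """Approximate value: fraction of randomly sampled link states that connect s and t."""
    hits = 0
    for _ in range(samples):
        up = [link for link in links if rng.random() < p_up]
        if connected(up, s, t):
            hits += 1
    return hits / samples


exact = exact_reliability(LINKS, 0.9, "s", "t")
approx = estimated_reliability(LINKS, 0.9, "s", "t", 20000, random.Random(1))
print(exact, approx)
```

The sample count sets the cost of the estimate directly, so it can be reduced when an answer is needed quickly, at the price of a wider statistical error.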

Task 3. Reliability Evaluation Tool

A software tool for evaluating these reliability measures will be built according to these specifications. An important feature of this tool will be its acceptance of imperfect or incomplete information. For example, users who don't know the exact failure rates of processors can input ranges and obtain in response a range of reliability estimates.
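As a minimal sketch of how a range of failure rates could map to a range of reliability estimates, consider the example below. It rests entirely on our own illustrative assumptions: a constant (exponential) failure-rate model, a triple-modular-redundant system with a perfect voter, and hypothetical function names. None of these is a specification of the tool. Because the reliability in this model falls monotonically as the failure rate rises, evaluating the two endpoints of the input range is enough to bound the result.

```python
import math


def unit_reliability(failure_rate, mission_time):
    """Probability that one processor survives the mission (exponential model, assumed)."""
    return math.exp(-failure_rate * mission_time)


def tmr_reliability(r):
    """Reliability of a 2-out-of-3 system of identical units, assuming a perfect voter."""
    return 3 * r ** 2 - 2 * r ** 3


def reliability_range(rate_low, rate_high, mission_time):
    """Map a failure-rate interval to a (worst, best) reliability interval.

    Valid here because the modeled reliability is monotonically decreasing
    in the failure rate, so the interval endpoints give the bounds.
    """
    if rate_low > rate_high:
        raise ValueError("rate_low must not exceed rate_high")
    worst = tmr_reliability(unit_reliability(rate_high, mission_time))
    best = tmr_reliability(unit_reliability(rate_low, mission_time))
    return worst, best


# Hypothetical input: failure rate known only to lie between 1e-4 and 1e-3
# per hour, for a 100-hour mission.
worst, best = reliability_range(1e-4, 1e-3, 100.0)
print(worst, best)
```

For a model that is not monotonic in its uncertain inputs, endpoint evaluation would not be sufficient, and the range would have to be found by a search over the input interval.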

  1. User Interface Module: The graphical user interface will guide the user, and thus reduce the learning time needed to operate the tool. The interface will provide outputs in a variety of formats.

  2. Computer/Network Module: The second module will accept inputs from the user interface, and compute the three measures mentioned in Task 1, using the computational algorithms developed as part of Task 2.

  3. Validation: The modules, once written, will be subjected to rigorous testing. The inputs will include the real-time system benchmarks being developed by Honeywell and Mitre Corporations under ARPA sponsorship.

  4. Documentation: It is our intention to distribute the source code of our reliability tool so that other workers in the field can, if necessary, customize it to meet their own needs. Detailed user literature will accompany the code to facilitate this technology transfer.

    In particular, we will study the levels of precision that our numerical algorithms can provide, and ensure numerical stability over the range of parameters likely to be encountered by this tool. This task will include portability tests: the software will be tested for portability on Sun, DEC, and PC (Linux) platforms.

Full Timeline for the Project

