The goal of the project was to define and implement new and better measures of computer system reliability. We can divide the work into three tasks, each of which is described below:
Task 1. New Fault-Tolerance and Reliability Measures
- Network Measure The choice of an appropriate interconnection
network is key to determining the performance of a fault-tolerant system.
We developed a measure of the interconnection network to guide the
designer in choosing an appropriate topology and the bandwidths of the
links in that topology.
This measure quantifies the following two properties of the interconnection network:
- Structural Properties The structural properties of a network include such things as the number of hops between individual nodes, the way the structure can degrade as a result of node or link failure, and the number of independent paths between any two nodes. The structural properties can be used to indicate lower bounds on the time taken to send a message from one node to another, and which options exist for reconfiguring the network in the event of node or link failure.
- Bandwidth Properties The bandwidth properties of a network indicate how much traffic can be sustained between the source-destination pairs. The bandwidth properties depend on the structural properties and on the bandwidths of the links and the nodal network interfaces.
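Two of the structural properties named above, hop counts and the number of independent paths, can be illustrated with a minimal sketch. It assumes an undirected topology stored as a symmetric adjacency dictionary, and it reads "independent paths" as link-disjoint paths, counted with a unit-capacity maximum flow. The function names and the four-node ring are hypothetical and are not taken from the project's own tool.

```python
from collections import deque


def hop_counts(adj, src):
    """Breadth-first search: minimum hop count from src to each reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def edge_disjoint_paths(adj, src, dst):
    """Count link-disjoint paths from src to dst (unit-capacity max flow).

    Assumes adj is symmetric: every link appears in both endpoints' lists.
    """
    # Each undirected link becomes two directed arcs of capacity 1.
    cap = {(u, v): 1 for u in adj for v in adj[u]}
    flow = 0
    while True:
        # Search the residual graph for one more augmenting path.
        parent = {src: None}
        queue = deque([src])
        while queue and dst not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if dst not in parent:
            return flow
        # Push one unit of flow back along the path that was found.
        v = dst
        while parent[v] is not None:
            u = parent[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u
        flow += 1


# Hypothetical topology: a four-node ring 0-1-2-3-0.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
opposite_hops = hop_counts(ring, 0)[2]
opposite_paths = edge_disjoint_paths(ring, 0, 2)
```

On the ring, opposite nodes are two hops apart and joined by two link-disjoint paths. Removing one link from the dictionary and rerunning both functions shows how the structure degrades under a link failure.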
- Computer Measures Currently available measures either treat
the computer in purely static, hardware terms (in the case of traditional
reliability and its derivative measures), or require exact characterization
of the application and the workload (in the case of performability or
application-related cost functions). We understand that computer system
dependability is a function of many factors.
We developed a measure that can adequately capture the fault-tolerant
capacity of a real-time computer in general terms such as its ability
to execute critical workload at a given rate. Such a measure takes
into account not only such traditional inputs as the processor failure
rate, but also the fault-tolerant impact of the recovery
procedures (together with the number and spacing of the checkpoints),
as well as the overall embedded environment,
including the task scheduling and rescheduling algorithms.
The measures assess a system's reliability by determining its ability to handle a surge. We term these Surge Handling Measures.
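To make the role of checkpoint number and spacing concrete, here is a minimal sketch. It is not the measure developed in the project. It assumes failures arrive as a Poisson process, that a failure rolls execution back to the most recent checkpoint at no extra recovery cost, and that checkpoints are equally spaced. Under those assumptions, the expected time to complete a segment of length s is (e^(λs) - 1)/λ. All names and parameter values are hypothetical.

```python
import math


def expected_completion_time(work, failure_rate, checkpoint_cost, segments):
    """Expected time to finish `work` units split into equal checkpointed segments.

    Assumptions (illustrative only): Poisson failures at `failure_rate`,
    rollback to the last checkpoint on failure, zero recovery overhead,
    and a fixed `checkpoint_cost` added to every segment.
    """
    if failure_rate <= 0 or segments < 1:
        raise ValueError("failure_rate must be positive and segments >= 1")
    segment_length = work / segments + checkpoint_cost
    per_segment = (math.exp(failure_rate * segment_length) - 1.0) / failure_rate
    return segments * per_segment


def sustainable_rate(work, failure_rate, checkpoint_cost, segments):
    """Useful work completed per unit of expected elapsed time."""
    elapsed = expected_completion_time(work, failure_rate, checkpoint_cost, segments)
    return work / elapsed


# Hypothetical numbers: compare a few checkpoint counts for the same workload.
candidates = [1, 5, 100]
rates = {n: sustainable_rate(100.0, 0.01, 1.0, n) for n in candidates}
best_count = max(rates, key=rates.get)
```

With these numbers, too few checkpoints waste time on re-execution after failures and too many waste it on checkpoint overhead, so the intermediate spacing sustains the highest rate.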
- Integrated Measure When some information is available about
the application, the workload, and the operating environment, we can
do better than with the computer measure described above. We need a
measure that takes all this information into account and characterizes
reliability within the context of the application. Clearly, the more
precisely we can characterize the application, workload, and operating
environment, the more precise our reliability evaluation will be.
This measure assumes a much more detailed modeling of the computer system and its environment than the computer measure does. It can be used to fine-tune the machine (both the hardware and the operating system algorithms, such as allocation/scheduling or recovery algorithms) to optimize performance, or to minimize the amount of redundancy needed to achieve preset reliability goals.
Task 2. Analytical Models and Algorithms
We developed algorithms to compute each of the measures developed as part of Task 1. In their most exact form, these algorithms are likely to be suitable for offline use. However, we also developed approximate, fast, online variations. Such online versions can be used if the operating system needs to estimate reliabilities on the fly. Such a need can arise, for example, when the system has to decide on the optimal new configuration in the event of component failure or a change in mission characteristics (e.g., task loading, task priorities, task deadlines).
Task 3. Reliability Evaluation Tool
A software tool for evaluating these reliability measures was built according to the specifications listed below. An important feature of this tool was its acceptance of imperfect or incomplete information. For example, users who do not know the exact failure rates of processors can input ranges and obtain in response a range of reliability estimates.
- User Interface Module The graphical user interface would guide the user, and thus reduce the learning time needed to operate the tool. The interface would provide outputs in a variety of formats.
- Computer/Network Module The second module would accept inputs from the user interface, and compute the three measures mentioned in Task 1, using the computational algorithms developed as part of Task 2.
- Validation The modules, once written, would be subjected to rigorous testing. The inputs would include the real-time system benchmarks being developed by Honeywell and Mitre Corporations under ARPA sponsorship.
- Documentation It was our intention to distribute the source code of our reliability tool so that other workers in the field can, if necessary, customize it to meet their own needs. Detailed user literature would accompany the code to facilitate this technology transfer.
In particular, we studied the levels of precision that our numerical algorithms can provide, and ensured numerical stability over the range of parameters likely to be encountered by this tool. This task included portability tests: the software was tested for portability on Sun, DEC, and PC (Linux) platforms.
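The range-input feature described under Task 3 (failure-rate ranges in, reliability ranges out) can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual algorithm. It assumes identical processors with exponentially distributed lifetimes and a system that survives as long as at least k of its n processors are working. Because reliability in this model decreases monotonically with the failure rate, the two ends of the input range map directly to the two ends of the output range. Function and parameter names are hypothetical.

```python
import math


def k_of_n_reliability(failure_rate, mission_time, n, k):
    """Probability that at least k of n identical processors survive the mission.

    Assumes independent, exponentially distributed processor lifetimes.
    """
    r = math.exp(-failure_rate * mission_time)  # single-processor reliability
    return sum(
        math.comb(n, i) * r**i * (1.0 - r) ** (n - i)
        for i in range(k, n + 1)
    )


def reliability_range(rate_low, rate_high, mission_time, n, k):
    """Map a failure-rate interval to a (worst, best) reliability interval."""
    if not 0.0 <= rate_low <= rate_high:
        raise ValueError("expected 0 <= rate_low <= rate_high")
    # Reliability falls as the failure rate rises, so the endpoints swap.
    worst = k_of_n_reliability(rate_high, mission_time, n, k)
    best = k_of_n_reliability(rate_low, mission_time, n, k)
    return worst, best


# Hypothetical input: rate known only to lie between 1e-4 and 1e-3 per hour,
# 10-hour mission, triplicated system that needs 2 working processors.
worst_case, best_case = reliability_range(1e-4, 1e-3, 10.0, n=3, k=2)
```

For measures that are not monotone in an uncertain parameter, evaluating only the endpoints is not sufficient, and the interval would have to be searched or bounded by other means.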