The goal of the project is to define and implement new and better measures
of computer system reliability. We can divide
the work into three tasks, each of which is described below:
Task 1. New Fault-Tolerance and Reliability Measures
- Network Measure The choice of an appropriate interconnection
network is key to determining the performance of a fault-tolerant system.
We will develop a measure of the interconnection network to guide the
designer in choosing an appropriate topology and the bandwidths of various
links in that topology.
This measure will quantify the following two properties of the interconnection
network:
- Structural Properties - The structural properties of a
network include such things as the number of hops between individual
nodes, the way the structure can degrade as a result of node or
link failure, and the number of independent paths between any two
nodes. The structural properties can be used to indicate lower bounds
on the time taken to send a message from one node to another, and
what options exist for reconfiguring the network in the event of
node or link failure.
- Bandwidth Properties The bandwidth properties of a network
indicate how much traffic can be sustained between the source-destination
pairs. The bandwidth properties depend on the structural properties,
and the bandwidths of the links and the nodal network interfaces.
- Computer Measures Currently available measures either treat
the computer in purely static, hardware terms (in the case of traditional
reliability and its derivative measures), or require exact characterization
of the application and the workload (in the case of performability or
application-related cost functions). We understand that computer system
dependability is a function of many factors.
We will develop a measure which can adequately capture the fault-tolerant
capacity of a real-time computer in general terms such as its ability
to execute critical workload at a given rate. Such a measure will take
into account not only such traditional inputs as the processor failure
rate, but also the fault-tolerant impact of the recovery
procedures (together with the number and spacing of the checkpoints),
as well as the overall embedded environment,
including the task scheduling and rescheduling algorithms.
The measures will determine a system's ability to handle
a surge, as a means of determining the system's reliability. We
term these Surge Handling Measures.
- Integrated Measure When some information is available about
the application, the workload, and the operating environment, we can
do better than with the computer measure described above. We need a
measure that will take into account all this information, and characterize
reliability within the context of the application. Clearly, the more
precisely we can characterize the application, workload and operating
environment, the more precise our reliability evaluation will be.
This measure will assume a much more detailed modeling of the computer
system and its environment than in the computer measure. It can be used
to fine-tune the machine (both the hardware and the operating system
algorithms such as allocation/scheduling or recovery algorithms) to
optimize performance, or to minimize the amount of redundancy needed
to achieve preset reliability goals.
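As a rough illustration of the structural and bandwidth properties listed under the Network Measure above, the sketch below computes a minimum hop count, the number of link-disjoint paths, and an upper bound on the traffic a single source-destination pair can sustain. It is not the measure this project will develop. The function names (`symmetric`, `hop_count`, `max_flow`), the four-node ring, and the bandwidth values are all hypothetical, chosen only for the example. It relies on the standard result that a maximum flow with unit link capacities equals the number of link-disjoint paths.

```python
from collections import deque


def symmetric(links):
    """Expand undirected links {(u, v): capacity} into directed capacities."""
    cap = {}
    for (u, v), c in links.items():
        cap[(u, v)] = c
        cap[(v, u)] = c
    return cap


def hop_count(capacity, src, dst):
    """Minimum number of hops from src to dst (breadth-first search)."""
    adj = {}
    for (u, v) in capacity:
        adj.setdefault(u, []).append(v)
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None  # dst is unreachable


def max_flow(capacity, src, dst):
    """Edmonds-Karp maximum flow over directed capacities {(u, v): c}."""
    if src == dst:
        raise ValueError("src and dst must differ")
    residual = dict(capacity)
    for (u, v) in capacity:
        residual.setdefault((v, u), 0)  # reverse edge for the residual graph
    neighbors = {}
    for (u, v) in residual:
        neighbors.setdefault(u, set()).add(v)
    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {src: None}
        queue = deque([src])
        while queue and dst not in parent:
            u = queue.popleft()
            for v in neighbors.get(u, ()):
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if dst not in parent:
            return total
        # Find the bottleneck along the path, then push that much flow.
        bottleneck = float("inf")
        v = dst
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[(parent[v], v)])
            v = parent[v]
        v = dst
        while parent[v] is not None:
            u = parent[v]
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
            v = u
        total += bottleneck


# Hypothetical 4-node ring; values are link bandwidths in arbitrary units.
ring = {(0, 1): 10, (1, 2): 5, (2, 3): 8, (3, 0): 4}

hops = hop_count(symmetric(ring), 0, 2)
disjoint_paths = max_flow(symmetric({link: 1 for link in ring}), 0, 2)
sustainable = max_flow(symmetric(ring), 0, 2)
```

Running the same functions on the topology with a failed link removed from `ring` gives a simple picture of how the structure degrades. A full bandwidth measure would also have to account for the nodal network interfaces and for many source-destination pairs at once, which this single-pair bound ignores.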
Task 2. Analytical Models and Algorithms
We will develop algorithms to compute each of the measures developed
as part of Task 1. In their most exact form, these algorithms are likely
to be suitable for offline use. However, we will also develop approximate,
fast, online variations. Such online versions will be used if the operating
system needs to estimate reliabilities on the fly. Such a need can arise,
for example, when the system has to decide on the optimal new configuration
in the event of component failure or a change in mission characteristics
(e.g., task loading, task priorities, task deadlines).
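To make the offline/online distinction concrete, the sketch below contrasts an exact but exponential-time computation with a fixed-budget sampling estimate of the same quantity. It uses two-terminal network reliability (the probability that two nodes remain connected when links fail independently) purely as a stand-in; the algorithms to be developed in this task are not specified here. All names, the ring topology, and the link survival probability are assumptions made for the example.

```python
import itertools
import random


def is_connected(up_links, src, dst):
    """Depth-first search over the links that are currently up."""
    adj = {}
    for u, v in up_links:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    seen, stack = {src}, [src]
    while stack:
        u = stack.pop()
        if u == dst:
            return True
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return False


def exact_reliability(link_up_prob, src, dst):
    """Exact value by enumerating all 2**m link states (offline use only)."""
    links = list(link_up_prob)
    total = 0.0
    for states in itertools.product((True, False), repeat=len(links)):
        prob = 1.0
        for link, up in zip(links, states):
            prob *= link_up_prob[link] if up else 1.0 - link_up_prob[link]
        up_links = [link for link, up in zip(links, states) if up]
        if is_connected(up_links, src, dst):
            total += prob
    return total


def sampled_reliability(link_up_prob, src, dst, samples=20000, seed=0):
    """Monte Carlo estimate; run time is fixed by the sample budget."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        up_links = [l for l, p in link_up_prob.items() if rng.random() < p]
        hits += is_connected(up_links, src, dst)
    return hits / samples


# Hypothetical 4-node ring in which every link survives with probability 0.9.
ring = {(0, 1): 0.9, (1, 2): 0.9, (2, 3): 0.9, (3, 0): 0.9}

exact = exact_reliability(ring, 0, 2)
estimate = sampled_reliability(ring, 0, 2)
```

The exact routine doubles its work with every added link, while the sampled one trades accuracy for a predictable run time by changing `samples`. That trade-off is the kind an on-the-fly reconfiguration decision would have to make.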
Task 3. Reliability Evaluation Tool
A software tool for evaluating these reliability measures will be built
according to the specifications listed below. An important
feature of this tool will be its acceptance of imperfect or incomplete
information. For example, users who don't know the exact failure rates
of processors can input ranges and obtain in response a range of reliability
estimates.
- User Interface Module The graphical user interface will guide the user, and thus reduce the learning time needed to operate the tool. The interface will provide outputs in a variety of formats.
- Computer/Network Module The second module will accept inputs from the user interface, and compute the three measures mentioned in Task 1, using the computational algorithms developed as part of Task 2.
- Validation The modules, once written, will be subjected to rigorous testing. The inputs will include the real-time system benchmarks being developed by Honeywell and Mitre Corporations under ARPA sponsorship. In particular, we will study the levels of precision that our numerical algorithms can provide, and ensure numerical stability over the range of parameters likely to be encountered by this tool. This task will include portability tests: the software will be tested for portability on Sun, DEC, and PC (Linux) platforms.
- Documentation It is our intention to distribute the source code of our reliability tool so that other workers in the field can, if necessary, customize it to meet their own needs. Detailed user literature will accompany the code to facilitate this technology transfer.
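The range-input feature described for this tool (a range of processor failure rates in, a range of reliability estimates out) can be sketched under simple textbook assumptions. The sketch assumes independent processors with exponentially distributed lifetimes and a system that works while at least k of its n processors work. Under those assumptions system reliability decreases monotonically with the failure rate, so evaluating the two endpoints of the input range bounds the result. The function names and the numeric values are hypothetical; the tool's actual models are not specified here.

```python
import math


def k_of_n_reliability(p, n, k):
    """Probability that at least k of n independent units work,
    where each unit works with probability p."""
    return sum(
        math.comb(n, i) * p**i * (1.0 - p) ** (n - i)
        for i in range(k, n + 1)
    )


def reliability_range(rate_lo, rate_hi, mission_time, n, k):
    """Map a failure-rate range to a (low, high) system reliability range.

    Assumes exponential lifetimes, so one unit survives the mission
    with probability exp(-rate * mission_time).
    """
    if not 0.0 <= rate_lo <= rate_hi:
        raise ValueError("expected 0 <= rate_lo <= rate_hi")
    p_best = math.exp(-rate_lo * mission_time)   # lowest failure rate
    p_worst = math.exp(-rate_hi * mission_time)  # highest failure rate
    return k_of_n_reliability(p_worst, n, k), k_of_n_reliability(p_best, n, k)


# Hypothetical triple-redundant system (2 of 3 processors must work),
# failure rate known only to lie between 1e-4 and 2e-4 per hour,
# mission time 1000 hours.
low, high = reliability_range(1e-4, 2e-4, 1000.0, n=3, k=2)
```

Endpoint evaluation is enough only because the model is monotone in the failure rate. A model without that property would need the whole input range explored, not just its ends.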
Full Timeline for the Project