Architecture and Real-Time Systems (ARTS) Laboratory

 Architecture and Real-Time Systems (ARTS) Laboratoryclock

Recovery Algorithms

RAMP: The Theoretical Background

Given the tasks and system resources, an optimal configuration exists which depends on system parameters such as fault rates, remaining mission time, workload pattern, and task deadlines. From these parameters, we conceive a model of the system in question. As the system operation progresses, some of that parameters, most notably, the remaining mission time, will very, necessitating a dynamic reconfiguration of system resources.

The effect of a system fault can be corrected by a fault recovery procedure, traditionally consisting of a sequence of action phases applied in a fixed order. For a specific fault, not all phases are beneficial and an efficient recovery procedure will consist of a partial sequence of actions. Moreover, different recovery procedures may lead to different system functional states, resulting in different system behavior for the rest of the mission time. An optimal dynamic recovery scheme should, therefore, take into account both recovery efficiency and future system behavior.

In this project, we propose a policy incorporating both dynamic reconfiguration and dynamic recovery with the objective of maximizing the system reliability.

The considered system is a real-time, distributed and mission-oriented system, consisting of multiple types of system resources and real-time tasks. The system resources can be reconfigured during the mission time, with the reconfiguration being either fault-triggered or scheduled.

Due to the timing constraint and the disastrous consequence of missing some critical tasks's deadlines, a decision policy must take care of every critical task in the mission. Therefore, the decisions on selecting recovery and reconfiguration procedures are made dynamically according to not only the system hardware state (such as fault pattern), external parameters (such as task arrival rate), system remaining mission time, and the procedure overhead, but also the current instantaneous workload.

System workload plays a very important role in the decision of whether or not a procedure should be applied. A procedure is selected according to its short term efficiency during the procedure interval, and the system behavior in the succeeding new state. All the recovery and reconfiguration procedures will incur certain processing overhead. At a time instant when system workload is high, a procedure with high overhead is not recommended even though it can change the system state to achieve high system reliability for the remaining mission, since the service delay can cause some critical task to miss its deadlines. In the case of light workload, on the other hand, the above mentioned procedure would be the best choice.

This could result in a prohibitively large state space for any practical analysis. We reduce the size of the analytic state space by identifying the system states in two levels of resolution: configuration states and operational states. The latter reflects, in addition to the system configuration information, system operational information such as workload and fault pattern. By employing steady state probability information, every operational state can be associated with a configuration state, and the operational state space is reduced to the configuration state space. The system reliability is calculated using the stochastic dynamic programming technique and a decision is selected so as to maximize the reliability of the remaining mission time. We call our decision model a Reduced State Space Markov Decision Process (RAMP). Our model can not only describe a realistic system in great detail, but also results in a very efficient and practical computation algorithm.

In a faulty state, some recovery procedure will be chosen and performed. A successful recovery procedure will change the system's state from faulty to normal. We assume that all the available alternative recovery procedures constitute a set of formula The system recovery interval can be divided into three sub-intervals: the fault detection and identification interval, actual recovery procedure execution interval, and the system workload restoration interval.

Some recovery procedures can achieve a better one-time efficiency, while others may bring the system back to a better-behaved state for the remaining mission time. The design objective here is to achieve the best system reliability for the entire mission. Therefore, at any instant in the system's mission, the optimal policy should select the procedure which can achieve the best system behavior for the remaining mission time from that instant on.

The following is the basic equations used in MDP Theory.

For policy pi, let formula denote the expected reward in state s for the remaining mission time formula, where S is the state space and Tau is the total mission time).

It can be expressed as


where v is the immediate reward, Q is the probability distribution function of the next state after transition, z is the next state, and u is the transition time.

The optimal policy Pi^* is determined by


A graphic of the system states and transitions considered is provided here.

Our objective is to maximize the system reliability during a mission time of length Tau. To this end, we developed the following algorithm, based on the MDP theory, to calculate the optimal recovery and reconfiguration policy for the given system. The Algorithm operates as such: