Reliability
The reliability of an I/O system has become as important as its performance and cost. This part of the tutorial examines the factors that affect the reliability of disk arrays and presents formulae for estimating it.

Redundancy in disk arrays is motivated by the need to fight disk failures. Two key factors, MTTF (Mean Time To Failure) and MTTR (Mean Time To Repair), are of primary concern in estimating the reliability of a disk array. Following are formulae for the array's mean time to failure:

RAID level 5

      MTTF(disk)^2
----------------------
N * (G-1) * MTTR(disk)
 
 
Disk array with two redundant disks per parity group (e.g., P+Q redundancy)

         MTTF(disk)^3
--------------------------------
N * (G-1) * (G-2) * MTTR(disk)^2

N - total number of disks in the system
G - number of disks in the parity group
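
As a quick illustration, here is a minimal Python sketch (not part of the original tutorial) that evaluates both formulae; the disk count, group size, and the MTTF/MTTR figures are assumed example values.

def mttf_raid5(mttf_disk, mttr_disk, n, g):
    # MTTF(disk)^2 / (N * (G-1) * MTTR(disk))
    return mttf_disk ** 2 / (n * (g - 1) * mttr_disk)

def mttf_pq(mttf_disk, mttr_disk, n, g):
    # MTTF(disk)^3 / (N * (G-1) * (G-2) * MTTR(disk)^2)
    return mttf_disk ** 3 / (n * (g - 1) * (g - 2) * mttr_disk ** 2)

HOURS_PER_YEAR = 24 * 365
# Assumed figures: 100 disks, groups of 10, MTTF(disk) = 200,000 h, MTTR(disk) = 1 h.
for name, formula in (("RAID 5", mttf_raid5), ("P+Q", mttf_pq)):
    hours = formula(200_000.0, 1.0, n=100, g=10)
    print(f"{name}: {hours / HOURS_PER_YEAR:,.0f} years")

Note how the second redundant disk in the P+Q organization multiplies the mean lifetime by roughly MTTF(disk) / ((G-2) * MTTR(disk)).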

Factors affecting Reliability

Three factors that can dramatically affect the reliability of disk arrays are:

System Crashes

A system crash refers to any event, such as a power failure, operator error, hardware breakdown, or software crash, that interrupts an I/O operation to a disk array.

Such crashes can interrupt write operations, resulting in states where the data is updated but the parity is not, or vice versa. In either case the parity is inconsistent and cannot be used in the event of a disk failure. Techniques such as redundant hardware and power supplies can make such crashes less frequent.

System crashes can cause parity inconsistencies in both bit-interleaved and block-interleaved disk arrays, but the problem is of practical concern only in block-interleaved disk arrays.

For reliability purposes, system crashes in block-interleaved disk arrays are similar to disk failures in that they may result in the loss of the correct parity for stripes that were being modified during the crash.
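
The following toy sketch (not from the tutorial; the stripe contents are made up) illustrates the problem: if a crash lands between the data write and the parity write, a later reconstruction from the stale parity silently produces garbage.

def parity(blocks):
    # XOR parity over a stripe of equal-length byte blocks
    p = bytes(len(blocks[0]))
    for b in blocks:
        p = bytes(x ^ y for x, y in zip(p, b))
    return p

stripe = [b"AAAA", b"BBBB", b"CCCC"]    # made-up stripe contents
p = parity(stripe)

stripe[0] = b"XXXX"                     # crash: data written, parity update lost

# The disk holding block 1 now fails; rebuild it from the survivors + stale parity.
rebuilt = parity([stripe[0], stripe[2], p])
assert rebuilt != b"BBBB"               # reconstruction returns the wrong data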

Uncorrectable bit-errors

Most uncorrectable bit-errors are generated because data is incorrectly written or gradually damaged as the magnetic media ages. These errors are detected only when we attempt to read the data.

Uncorrectable bit-error rates are best interpreted as the rate at which errors are detected during reads from the disk in the course of its normal operation.
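
To make this concrete, the short sketch below estimates the probability of reading an entire disk without hitting an uncorrectable bit error, treating bit errors as independent; the 500 GB capacity and the 10^-14 bit-error rate are assumed figures. This is the quantity that appears as p(disk) in the MTTDL formulae later in this section.

import math

BER = 1e-14                  # assumed uncorrectable bit-error rate, per bit read
DISK_BITS = 500e9 * 8        # assumed 500 GB disk

# p(disk) = (1 - BER)^bits, computed stably via exp(bits * log1p(-BER))
p_disk = math.exp(DISK_BITS * math.log1p(-BER))
print(f"p(disk) = {p_disk:.4f}")   # ~0.96, i.e. ~4% chance of an error per full read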

One approach, which can be used with or without redundancy, is to protect against bit errors by predicting when a disk is about to fail. VAXsimPLUS, a product from DEC, monitors the warnings issued by disks and notifies an operator when it determines that a disk is about to fail.

Correlated disk failures

Correlated disk failures are caused by common environmental and manufacturing factors.

For example, an accident might sharply increase the failure rate for all disks in a disk array for a short period of time. In general, power surges, power failures and simply switching the disks on and off can place stress on the electrical components of all affected disks. Disks also share common support hardware; when this hardware fails, it can lead to multiple, simultaneous disk failures.

Disks are generally more likely to fail either very early or very late in their lifetimes:

Early failures are frequently caused by transient defects that were not detected during the manufacturer's burn-in process.
Late failures occur when a disk wears out.

Correlated disk failures greatly reduce the reliability of disk arrays by making it much more likely that an initial disk failure will be closely followed by additional disk failures before the failed disk can be reconstructed.

Mean Time To Data Loss (MTTDL)

Following are formulae to calculate the mean time to data loss (MTTDL). In a block-interleaved parity-protected disk array, data loss commonly occurs in three ways:

a double disk failure,
a system crash followed by a disk failure,
a disk failure followed by an uncorrectable bit error during reconstruction.

These three failure modes are the hardest failure combinations, in that there are currently no techniques to protect against them without sacrificing performance.
 

RAID Level 5

Double disk failure:
      MTTF(disk)^2
----------------------
N * (G-1) * MTTR(disk)

System crash + disk failure:
MTTF(system) * MTTF(disk)
-------------------------
     N * MTTR(system)

Disk failure + bit error:
       MTTF(disk)
-----------------------
N * (1 - p(disk)^(G-1))

Software RAID: harmonic sum of the above three terms
Hardware RAID: harmonic sum of the above, excluding system crash + disk failure

Failure characteristics for RAID level 5 disk arrays (source: Reference 1)
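
The sketch below (assumed figures throughout, in hours; p(disk) is defined below the tables) evaluates the three terms and combines them as a harmonic sum. The harmonic sum applies because independent failure rates add, and MTTDL is the reciprocal of the total rate.

def harmonic_sum(mttdls):
    # Failure rates (1/MTTDL) add; the combined MTTDL is the reciprocal of the sum.
    return 1.0 / sum(1.0 / t for t in mttdls)

N, G = 100, 10                            # assumed array size and group size
MTTF_DISK, MTTR_DISK = 200_000.0, 1.0     # assumed disk figures (hours)
MTTF_SYS, MTTR_SYS = 1_000.0, 1.0         # assumed system crash figures (hours)
P_DISK = 0.96                             # assumed probability of reading a whole disk

double_disk = MTTF_DISK ** 2 / (N * (G - 1) * MTTR_DISK)
crash_plus_disk = MTTF_SYS * MTTF_DISK / (N * MTTR_SYS)
disk_plus_bit = MTTF_DISK / (N * (1 - P_DISK ** (G - 1)))

print("software RAID MTTDL:", harmonic_sum([double_disk, crash_plus_disk, disk_plus_bit]))
print("hardware RAID MTTDL:", harmonic_sum([double_disk, disk_plus_bit]))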

P+Q Disk Array

Triple disk failure:
          MTTF(disk)^3
--------------------------------
N * (G-1) * (G-2) * MTTR(disk)^2

System crash + disk failure:
MTTF(system) * MTTF(disk)
-------------------------
     N * MTTR(system)

Double disk failure + bit error:
              MTTF(disk)^2
--------------------------------------------
N * (G-1) * (1 - p(disk)^(G-2)) * MTTR(disk)

Software RAID: harmonic sum of the above three terms
Hardware RAID: harmonic sum of the above, excluding system crash + disk failure

Failure characteristics for a P+Q disk array (source: Reference 1)
 
p(disk) = the probability of successfully reading all sectors on a disk (derived from the disk size, sector size, and bit-error rate)
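
A matching sketch for the P+Q table, again with assumed figures; here p(disk) is derived from an assumed bit-error rate as in the earlier bit-error example.

import math

N, G = 100, 10                            # assumed array size and group size
MTTF_DISK, MTTR_DISK = 200_000.0, 1.0     # assumed disk figures (hours)
MTTF_SYS, MTTR_SYS = 1_000.0, 1.0         # assumed system crash figures (hours)
p_disk = math.exp(500e9 * 8 * math.log1p(-1e-14))  # assumed 500 GB disk, BER 1e-14

triple_disk = MTTF_DISK ** 3 / (N * (G - 1) * (G - 2) * MTTR_DISK ** 2)
crash_plus_disk = MTTF_SYS * MTTF_DISK / (N * MTTR_SYS)
double_plus_bit = MTTF_DISK ** 2 / (N * (G - 1) * (1 - p_disk ** (G - 2)) * MTTR_DISK)

rates = [triple_disk, crash_plus_disk, double_plus_bit]
print("software RAID MTTDL:", 1.0 / sum(1.0 / t for t in rates))
print("hardware RAID MTTDL:", 1.0 / (1.0 / triple_disk + 1.0 / double_plus_bit))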

A tool for computing reliability using the above equations is described in Reference 3.
 
 
