RAID level 5
N - total number of disks in the system
G - number of disks in the parity group
Three factors that can dramatically affect the reliability of disk arrays are system crashes, uncorrectable bit errors, and correlated disk failures.
A system crash is any event, such as a power failure, operator error, hardware breakdown, or software crash, that can interrupt an I/O operation to a disk array.
Such crashes can interrupt write operations, resulting in states where the data is updated and the parity is not updated or vice versa. In either case, parity is inconsistent and cannot be used in the event of a disk failure. Techniques such as redundant hardware and power supplies can be applied to make such crashes less frequent.
System crashes can cause parity inconsistencies in both bit-interleaved and block-interleaved disk arrays, but the problem is of practical concern only in block-interleaved disk arrays.
For reliability purposes, system crashes in block-interleaved disk arrays are similar to disk failures in that they may result in the loss of the correct parity for stripes that were being modified during the crash.
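The parity inconsistency described above can be illustrated with a small sketch. It assumes simple XOR parity over one stripe; the block contents, the three-data-block layout, and the helper name `xor_blocks` are hypothetical and chosen only for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks (hypothetical helper)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe: three data blocks plus their parity block.
data = [b"\x0f\x0f", b"\xf0\xf0", b"\x55\xaa"]
parity = xor_blocks(data)

# Consistent stripe: a lost block is rebuilt from the survivors and the parity.
recovered = xor_blocks([data[0], data[2], parity])

# A crash interrupts a small write: the new data block reaches the disk,
# but the matching parity update does not.
data_after_crash = [b"\x12\x34", data[1], data[2]]
stale_parity = parity
parity_consistent = xor_blocks(data_after_crash) == stale_parity

# If the disk holding block 1 now fails, reconstruction uses the stale parity
# and returns contents that were never written to block 1.
bad_recovery = xor_blocks([data_after_crash[0], data_after_crash[2], stale_parity])
```

Here `recovered` equals the original block, while `bad_recovery` does not, even though block 1 itself was never touched by the interrupted write.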
Most uncorrectable bit-errors are generated because data is incorrectly written or gradually damaged as the magnetic media ages. These errors are detected only when we attempt to read the data.
Our interpretation of uncorrectable bit error rates is that they represent the rate at which errors are detected during reads from the disk during the normal operation of the disk drive.
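Under that interpretation, and with the added assumption that bit errors are independent, the chance of reading a whole disk without hitting an uncorrectable error can be estimated as follows. This is a rough sketch: the disk size, the error rate of one bit in 10^14, the group size, and the function name are illustrative assumptions, not values taken from this text.

```python
import math

def prob_clean_read(bytes_read, bit_error_rate):
    """Probability that reading `bytes_read` bytes hits no uncorrectable bit error."""
    bits = bytes_read * 8
    # Equivalent to (1 - rate) ** bits; log1p keeps precision for tiny rates.
    return math.exp(bits * math.log1p(-bit_error_rate))

DISK_BYTES = 5e9          # assumed 5 GB disk
BIT_ERROR_RATE = 1e-14    # assumed: one uncorrectable error per 10^14 bits read
G = 16                    # assumed parity group size

# Probability of reading every sector of one disk successfully.
p_disk = prob_clean_read(DISK_BYTES, BIT_ERROR_RATE)

# Rebuilding a failed disk requires reading all G - 1 surviving disks in the group.
p_rebuild = p_disk ** (G - 1)
```

With these assumed numbers `p_disk` is about 0.9996, so a full-group read during reconstruction fails noticeably more often than a single-disk read.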
One approach that can be used with or without redundancy is to try to protect against bit errors by predicting when a disk is about to fail. VAXsimPLUS, a product from DEC, monitors the warnings issued by disks and notifies an operator when it judges that a disk is about to fail.
Correlated disk failures are caused by common environmental and manufacturing factors.
For example, an accident might sharply increase the failure rate for all disks in a disk array for a short period of time. In general, power surges, power failures and simply switching the disks on and off can place stress on the electrical components of all affected disks. Disks also share common support hardware; when this hardware fails, it can lead to multiple, simultaneous disk failures.
Disks are generally more likely to fail either very early or very late in their lifetimes.
Early failures are frequently caused by transient defects that may not have been detected during the manufacturer's burn-in process.
Late failures occur when a disk
wears out. Correlated disk failures greatly reduce the reliability
of disk arrays by making it much more likely that an initial disk failure
will be closely followed by additional disk failures before the failed
disk can be reconstructed.
The following formulae calculate the mean time to data loss (MTTDL). In a block-interleaved, parity-protected disk array, data loss can occur in three common ways:
Double Disk Failure | |
System Crash + Disk Failure | |
Disk Failure + Bit Error | |
Software RAID | harmonic sum of the above |
Hardware RAID | harmonic sum of the above, excluding system crash + disk failure |
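The harmonic-sum rows can be sketched in code. The combination rule follows the table; the three per-mode expressions are commonly cited approximations from the RAID literature and are an assumption here, since they may not match the formulae originally shown in the table. All parameter values and function names are illustrative, and all times are in hours.

```python
def harmonic_sum(mttdls):
    """Combine failure modes: the individual loss rates (1 / MTTDL) add up."""
    return 1.0 / sum(1.0 / m for m in mttdls)

# Assumed approximations, with N and G as defined at the top of this section.
def double_disk_failure(mttf_disk, mttr_disk, n, g):
    return mttf_disk ** 2 / (n * (g - 1) * mttr_disk)

def crash_plus_disk_failure(mttf_sys, mttr_sys, mttf_disk, n):
    return mttf_sys * mttf_disk / (n * mttr_sys)

def disk_failure_plus_bit_error(mttf_disk, p_disk, n, g):
    # p_disk: probability of reading every sector of one disk without error.
    return mttf_disk / (n * (1.0 - p_disk ** (g - 1)))

# Illustrative values only.
N, G = 100, 16
MTTF_DISK, MTTR_DISK = 200_000.0, 1.0
MTTF_SYS, MTTR_SYS = 720.0, 1.0
P_DISK = 0.9996

ddf = double_disk_failure(MTTF_DISK, MTTR_DISK, N, G)
scd = crash_plus_disk_failure(MTTF_SYS, MTTR_SYS, MTTF_DISK, N)
dbe = disk_failure_plus_bit_error(MTTF_DISK, P_DISK, N, G)

software_raid = harmonic_sum([ddf, scd, dbe])
hardware_raid = harmonic_sum([ddf, dbe])  # excludes system crash + disk failure
```

Because a harmonic sum is dominated by its smallest term, the combined MTTDL is always below the weakest individual mode, and the software RAID figure is lower than the hardware RAID figure, which omits one term.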
Triple Disk Failures | |
System Crash + Disk Failure | |
Double Disk Failure + Bit Error | |
Software RAID | harmonic sum of the above |
Hardware RAID | harmonic sum of the above, excluding system crash + disk failure |
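The "Triple Disk Failures" row suggests an array that survives two concurrent disk failures. As a hedged sketch, the commonly cited approximation for that term extends the double-failure case by one more failure and one more repair window. Both expressions below are assumptions rather than the formulae originally shown in the table, and the parameter values are illustrative.

```python
def double_disk_failure(mttf_disk, mttr_disk, n, g):
    # Assumed approximation for an array that tolerates one disk failure.
    return mttf_disk ** 2 / (n * (g - 1) * mttr_disk)

def triple_disk_failure(mttf_disk, mttr_disk, n, g):
    # Assumed approximation: a third disk in the same group must fail
    # while the first two are still being repaired.
    return mttf_disk ** 3 / (n * (g - 1) * (g - 2) * mttr_disk ** 2)

# Illustrative values only (hours).
N, G = 100, 16
MTTF_DISK, MTTR_DISK = 200_000.0, 1.0

double = double_disk_failure(MTTF_DISK, MTTR_DISK, N, G)
triple = triple_disk_failure(MTTF_DISK, MTTR_DISK, N, G)

# Improvement factor from tolerating one additional failure.
improvement = triple / double
```

Under these assumptions the improvement factor reduces to MTTF(disk) / ((G - 2) × MTTR(disk)), so it grows with faster repair and smaller parity groups.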
Tool for Reliability Using the Above Equations (source: Reference 3).