D.K. Pradhan (ed.), Fault Tolerant Computer System Design,
Prentice-Hall, 1996.
D.P. Siewiorek and R.S. Swarz, Reliable Computer Systems:
Design and Evaluation, A.K. Peters, 1998.
B.W. Johnson, Design and Analysis of Fault-Tolerant Digital
Systems, Addison-Wesley, 1989.
P. Jalote, Fault Tolerance in Distributed Systems,
PTR Prentice Hall, 1994.
Software Fault Tolerance:
N. Leveson, J. Knight, and T. Shimeall, ``The use of self
check and voting in software error detection: An empirical
study,'' IEEE Transactions on Software Engineering, April 1990.
A. Avizienis and J. Kelly, ``Fault Tolerance by Design
Diversity: Concepts and Experiments,'' IEEE Computer, August
1984, pp. 67-80.
J.H. Purtilo and P. Jalote, ``An environment for developing
fault-tolerant software,'' IEEE Trans. Software Engg., vol.17,
no.2, pp.153-159, Feb. 1991.
General Fault Tolerance:
A. Agbaria and J.S. Plank. Design, implementation, and performance of checkpointing in NetSolve. In Proceedings of the International Conference on Dependable Systems and Networks (DSN), pages 49-54, June 2000.
Z. Alkhalifa and V. S. Nair, "Design of a portable control-flow checking technique," Proc. High-Assurance Engineering Workshop, pp. 120-123, 1997.
Z. Al-Ars and A. van de Goor. Static and Dynamic Behavior of Memory Cell Array Opens and Shorts in Embedded DRAMs. In Proceedings of the Design Automation and Test in Europe (DATE), pages 496-503, March 2001.
H. Al-Asaad, J.P. Hayes, and T. Mudge. Modeling and Detecting Control Errors in Microprocessors. In Proceedings of DYCONS, August 1999.
Z. Alkhalifa, V.S.S. Nair, N. Krishnamurthy, and J.A. Abraham. Design and Evaluation of System-Level Checks for On-Line Control Flow Error Detection. IEEE Transactions on Parallel and Distributed Systems, 10(6):627-641, June 1999.
L. Anghel and M. Nicolaidis. Cost Reduction and Evaluation of a Temporary Faults Detecting Technique. In Proceedings of the Design Automation and Test in Europe (DATE), pages 591-598, March 2000.
T. Austin. DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 196-207, November 1999.
A. Benso, S. Chiusano, P. Prinetto, and L. Tagliaferri. A C/C++ source-to-source compiler for dependable applications. In Proceedings of the International Conference on Dependable Systems and Networks (DSN), pages 71-78, June 2000.
A. Benso, S. Chiusano, G. Di Natale, and P. Prinetto. An on-line BIST RAM architecture with self-repair capabilities. IEEE Transactions on Reliability, 51(1):123-128, March 2002.
P. Bose. Ensuring dependable processor performance: an experience report on pre-silicon performance validation. In Proceedings of the International Conference on Dependable Systems and Networks (DSN), pages 481-486, July 2000.
D.C. Bossen, A. Kitamorn, K.F. Reick, and M.S. Floyd. Fault-Tolerant Design of the IBM pSeries 690 System Using POWER4 Processor Technology. IBM Journal of Research and Development, 46(1):77-86, January 2002.
D.C. Bossen, J.M. Tendler, and K. Reick. POWER4 System Design for High Reliability. IEEE Micro, 22(2):16-24, March 2002.
S. Chatterjee, C. Weaver, and T. Austin. Efficient Checker Processor Design. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 87-97, December 2000.
C. Chen and A. K. Somani. Fault Containment in Cache Memories for TMR Redundant Processor Systems. IEEE Transactions on Computers, 48(4):386-397, March 1999.
C. L. Chen et al., "Error-correcting codes for Semiconductor Memory Applications: A state-of-the-art review", IBM Journal of Research and Development, pp. 124-132, March 1984.
G. Choi, R. K. Iyer, and V. Carreno, "FOCUS: An experimental environment for validation of fault-tolerant systems-case study of a jet-engine controller," Proc. Int'l Conf. Computer Design: VLSI in Computers and Processors, pp. 561-564, 1989.
A. Chou, J. Yang, B. Chelf, S. Hallem, and D. Engler. An Empirical Study of Operating Systems Errors. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), pages 73-88, October 2001.
F. Corno, M. Sonza Reorda, S. Squillero, and M. Violante. On the Test of Microprocessor IP Cores. In Proceedings of the Design Automation and Test in Europe (DATE), pages 209-213, March 2001.
T.J. Dell. A White Paper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory, IBM Microelectronics Division, November 1997.
D. Engler, D.Y. Chen, S. Hallem, A. Chou, and B. Chelf. Bugs as Deviant Behavior: A General Approach to Inferring Errors in Systems Code. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), pages 57-72, October 2001.
D. Chen et al. JVM Susceptibility to Memory Errors. In Proceedings of the USENIX Java Virtual Machine Research and Technology Symposium, April 2001.
M. Favalli and C. Metra. Optimization of Error Detecting Codes for the Detection of Crosstalk Originated Errors. In Proceedings of the Design Automation and Test in Europe (DATE), pages 290-296, March 2001.
J. Gaisler, "A portable and fault-tolerant microprocessor based on the SPARC V8 architecture," Proceedings of the International Conference on Dependable Systems and Networks, pp. 409-415, 2002.
K.K. Goswami. DEPEND: a simulation-based environment for system level dependability analysis. IEEE Transactions on Computers, 46(1):60-74, January 1997.
M. Hamada and E. Fujiwara, "A class of error control codes for byte organized memory systems-SbEC-(Sb+S)ED codes," IEEE Trans. on Computers, 46(1):105-109, Jan. 1997.
I. Hartanto, S. Venkataraman, W.K. Fuchs, E.M. Rudnick, J.H. Patel, and S. Chakravarty. Diagnostic simulation of stuck-at faults in sequential circuits using compact lists. ACM Transactions on Design Automation of Electronic Systems (TODAES), 6(4):471-489, October 2001.
M. Hiller. Executable Assertions for Detecting Data Errors in Embedded Control Systems. In Proceedings of the International Conference on Dependable Systems and Networks (DSN), pages 24-33, June 2000.
D. E. Hoffman et al., "Deep submicron design techniques for the 500 MHz IBM S/390 G5 custom microprocessor," Proc. Int'l Conf. Computer Design, pp. 258-263, 1998.
Itanium Processor Family Error Handling Guide, August 2001. http://www.intel.com/design/itanium/downloads/249278.htm.
R. Johansson, "On single event upset error manifestation," Proc. European Dependable Computing Conf., pp. 217-231, 1994.
T. Juhnke and H. Klar, "Calculation of the soft error rate of submicron CMOS logic circuits," IEEE J. Solid-State Circuits, 30(7):830-834, July 1995.
J. Karlsson et al., "Using heavy-ion radiation to validate fault-handling mechanisms," IEEE Micro, 14(1):8-23, Feb. 1994.
C. K. Kouba and G. Choi, "The single event upset characteristics of the 486-DX4 microprocessor," IEEE Radiation Effects Data Workshop, pp. 48-52, 1997.
S. Kim and A.K. Somani. An Adaptive Write Error Detection Technique in On-Chip Caches of Multi-Level Caching Systems. Journal of Microprocessors and Microsystems, 22(9):561-570, March 1999.
S. Kim and A.K. Somani. Area Efficient Architectures for Information Integrity in Cache Memories. In Proceedings of the International Symposium on Computer Architecture (ISCA), pages 246-255, May 1999.
S. Kim and A.K. Somani. On-Line Integrity Monitoring of Microprocessor Control Logic. Microelectronics Journal, 32(12):999-1007, December 2001.
S. Kim and A. K. Somani, "SSD: an affordable fault tolerance for superscalar processors," Proc. Pacific Rim Int'l Symp. Dependable Computing, 2001.
S. Kim and A. K. Somani, "On-line integrity monitoring of microprocessor control logic," Proc. Int'l Conf. Computer Design, pp. 314-319, 2001.
S. Kim and A. K. Somani, "An affordable transient fault tolerance for superscalar processors," Fast Abstract in conjunction with IEEE DSN-2001, pp. 10-11, Goteborg, June 2001.
S.W. Kwak, B.J. Choi, and B.K. Kim. An optimal checkpointing-strategy for real-time control systems under transient faults. IEEE Transactions on Reliability, 50(3):293-301, September 2001.
J.L. Lawall and G. Muller. Efficient incremental checkpointing of Java programs. In Proceedings of the International Conference on Dependable Systems and Networks (DSN), pages 61-70, June 2000.
P. Liggesmeyer and O. Maeckel. Quantifying the reliability of embedded systems by automated analysis. In Proceedings of the International Conference on Dependable Systems and Networks (DSN), pages 89-94, July 2000.
J. Li and E. E. Swartzlander, "Concurrent error detection in ALUs by recomputing with rotated operands," Proc. Int'l Workshop Defect and Fault Tolerance in VLSI Systems, pp. 109-116, 1992.
C-Y. Lin, S-Y. Kuo, and Y. Huang. A checkpointing tool for Palm operating system. In Proceedings of the International Conference on Dependable Systems and Networks (DSN), pages 71-76, July 2000.
A. Maamar and G. Russell, "A 32 bit RISC processor with concurrent error detection," Proc. Euromicro Conf., pp. 461-467, 1998.
A. Mahmood and E.J. McCluskey. Concurrent Error Detection Using Watchdog Processors-A Survey. IEEE Transactions on Computers, 37(2):160-174, February 1988.
I. Majzik, W. Hohl, A. Pataricza, and V. Sieh, "Multiprocessor checking using watchdog processors," Computer Systems Science and Engineering, 11(5):301-310, Sept. 1996.
A. Mendelson and N. Suri, "Designing high-performance and reliable superscalar architectures-the out of order reliable superscalar (O3RS) approach," Proc. Int'l Conf. Dependable Systems and Networks, pp. 473-481, 2000.
A. Messer, P. Bernadat, F. Fu, D. Chen, Z. Dimitrijevic, D. Lie, D.D. Mannaru, A. Riska, and D. Milojicic. Susceptibility of Modern Systems and Software to Soft Errors. Technical Report HPL-2001-43, HP Laboratories, March 2001.
D. Milojicic, A. Messer, J. Shau, G. Fu, and A. Munoz. Increasing Relevance of Memory Hardware Errors - A Case for Recoverable Programming Models. In Proceedings of the ACM SIGOPS European Workshop, September 2000.
G. Miremadi and J. Torin, "Effects of physical injection of transient faults on control flow and evaluation of some software-implemented error detection techniques," Proc. Dependable Computing for Critical Applications 4, pp. 435-457, 1995.
G. Miremadi and J. Torin, "Evaluating processor-behavior and three error-detection mechanisms using physical fault-injection," IEEE Trans. Reliability, 44(3):441-454, Sept. 1995.
W. A. Moreno et al., "First test results of system level fault tolerant design validation through laser fault injection," Proc. Int'l Conf. Computer Design: VLSI in Computers and Processors, pp. 544-548, 1997.
S. S. Mukherjee, M. Kontz, and S. K. Reinhardt, "Detailed design and evaluation of redundant multi-threading alternatives", Proceedings of the 29th Annual International Symposium on Computer Architecture, pp. 99-110, 2002.
J. B. Nickel and A. K. Somani, "REESE: A method of soft error detection in microprocessors," Proc. Int'l Conf. Dependable Systems and Networks, pp. 401-410, 2001.
M. Nicolaidis, "Design for soft-error robustness to rescue deep submicron scaling," Proc. Int'l Test Conf., pp. 1140, 1998.
N. Oh, P.P. Shirvani, and E.J. McCluskey. Control Flow Checking by Software Signatures. IEEE Transactions on Reliability, 51(1):111-122, March 2002.
N. Oh, P.P. Shirvani, and E.J. McCluskey. Error Detection by Duplicating Instructions in Super-scalar Processors. IEEE Transactions on Reliability, 51(1):63-75, March 2002.
J. Ohlsson and M. Rimen, "Implicit signature checking," Int'l Symp. Fault-Tolerant Computing, pp. 218-227, 1995.
J. Oplinger and M. Lam. Enhancing Software Reliability using Speculative Threads. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 2002.
S. J. Patel, Z. Kalbarczyk, R. K. Iyer, W. Magda, and N. Nakka. A Processor-Level Framework for High-Performance and High-Dependability. In Proceedings of the Workshop on Evaluating and Architecting Systems for Dependability, 2001.
J. H. Patel and L. Y. Fung, "Concurrent error detection in ALU's by recomputing with shifted operands," IEEE Trans. Computers, C-32, April 1983.
K. Prager, M. Vahey, W. Farwell, J. Whitney, and J. Lieb. A fault tolerant signal processing computer. In Proceedings of the International Conference on Dependable Systems and Networks (DSN), pages 169-174, June 2000.
N. Quach, "High availability and reliability in the Itanium processor," IEEE Micro, 20(5):61-69, Sept.-Oct. 2000.
F. Rashid, K.K. Saluja, and P. Ramanathan. Fault tolerance through re-execution in multiscalar architecture. In Proceedings of the International Conference on Dependable Systems and Networks (DSN), pages 482-491, June 2000.
J. Ray, J. Hoe, and B. Falsafi. Dual Use of Superscalar Datapath for Transient-Fault Detection and Recovery. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 214-224, December 2001.
S. K. Reinhardt and S. S. Mukherjee, "Transient fault detection via simultaneous multithreading," Proc. Int'l Symp. Computer Architecture, pp. 25-36, 2000.
D.A. Rennels and R. Hwang. Recovery in fault-tolerant distributed microcontrollers. In Proceedings of the International Conference on Dependable Systems and Networks (DSN), pages 475-480, July 2000.
A. Rosing, A. Richardson, and A. Dorey. A Fault Simulation Methodology for MEMS. In Proceedings of the Design Automation and Test in Europe (DATE), pages 476-483, March 2000.
E. Rotenberg, "AR-SMT: A microarchitectural approach to fault tolerance in microprocessor," Proc. Int'l Symp. Fault-Tolerant Computing, 1999.
K. Saab, N. Ben-Hamida, and B. Kaminska. Parametric Fault Simulation and Test Vector Generation. In Proceedings of the Design Automation and Test in Europe (DATE), pages 650-656, March 2000.
R. A. Sahner, K. S. Trivedi, and A. Puliafito. Performance and Reliability Analysis of Computer Systems: An Example-Based Approach Using the SHARPE Software Package. Kluwer Academic Publishers, 1995.
T. Sato and I. Arita. Tolerating Transient Faults through an Instruction Reissue Mechanism. In Proceedings of the International Conference on Parallel and Distributed Computing Systems (PDCS), pages 240-247, August 2001.
N. R. Saxena et al., "Fault-tolerant features in the HaL memory management unit," IEEE Trans. on Computers, 44(2):170-180, Feb. 1995.
M. A. Schuette and J. P. Shen, "Exploiting instruction-level parallelism for integrated control-flow monitoring," IEEE Trans. Computers, vol. 43, no. 2, pp. 129-140, Feb. 1994.
G. Sohi et al., "A study of time-redundant fault tolerance techniques for high-performance pipelined computers," Int'l Symp. Fault-Tolerant Computing, 1989.
L. Spainhower and T. A. Gregg, "G4: a fault-tolerant CMOS mainframe," Proc. Int'l Symp. Fault-Tolerant Computing, pp. 432-440, 1998.
L. Schaelicke and M. Parker. ML-RSIM.
N. Seifert et al., "Historical trend in alpha-particle induced soft error rates of the Alpha(tm) microprocessor," Proc. Int'l Symp. Reliability Physics, pp. 259-265, 2001.
P.P. Shirvani and E.J. McCluskey. PADded Cache: A New Fault Tolerance Technique for Cache Memories. In Proceedings of the IEEE VLSI Test Symposium, pages 440-445, April 1999.
P. Shivakumar, M. Kistler, S. Keckler, D. Burger, and L. Alvisi. Modeling the Effect of Technology Trends on Soft Error Rate of Combinational Logic. In Proceedings of the International Conference on Dependable Systems and Networks (DSN), June 2002.
A. K. Somani and K. S. Trivedi. A Cache Error Propagation Model. In Proceedings of the Pacific Rim International Symposium on Fault Tolerant Systems, pages 15-21, December 1997.
A. K. Somani and S. Kim, "Transient Fault Detection in Cache Memories by Employing a Small Shadow Cache," Proc. of Dependable Computing for Critical Applications 6, pp 19-39, 1998.
J. Sosnowski, "Detection of control flow errors using signature and checking instructions," Proc. Int'l Test Conf., pp. 81-99, 1988.
J. Sosnowski, "Transient fault tolerance in digital systems," IEEE Micro, 14(1):24-35, Feb. 1994.
G.R. Srinivasan. Modeling the Cosmic-Ray-Induced Soft-Error Rate in Integrated Circuits: An Overview. IBM Journal of Research and Development, 40(1):77-89, January 1996.
M. Turmon, R. Granat, and D. Katz. Software-implemented fault detection for high-performance space applications. In Proceedings of the International Conference on Dependable Systems and Networks (DSN), pages 107-116, June 2000.
T.N. Vijaykumar, I. Pomeranz, and K. Cheng. Transient-Fault Recovery via Simultaneous Multithreading. In Proceedings of the International Symposium on Computer Architecture (ISCA), pages 87-98, May 2002.
C. Weaver and T. Austin. A fault tolerant approach to microprocessor design. In Proceedings of the International Conference on Dependable Systems and Networks (DSN), pages 411-420, July 2000.
K. Wilken and J. Shen, "Concurrent error detection using signature monitoring and encryption," Proc. Dependable Computing for Critical Applications 1, 1989.
K. D. Wilken and J. P. Shen, "Embedded signature monitoring: analysis and technique," Proc. Int'l Test Conf., pp. 324-333, 1987.
K. D. Wilken, "Optimal signature placement for processor-error detection using signature monitoring," Int'l Symp. Fault-Tolerant Computing, pp. 326-333, 1991.
K. Wu and R. Karri. Exploiting Idle Cycles for Algorithm Level Re-computing. In Proceedings of the International Conference on Design Automation and Test in Europe, pages 842-846, 2002.
J. Xu, S. Chen, Z. Kalbarczyk, and R.K. Iyer. An experimental study of security vulnerabilities caused by errors. In Proceedings of the International Conference on Dependable Systems and Networks (DSN), pages 421-430, July 2000.
S. S. Yau and F. C. Chen, "An approach to concurrent control flow checking," IEEE Trans. Software Engineering, vol. SE-5, no.2, pp. 126-137, 1980.
W. Zhang, M. Kandemir, A. Sivasubramaniam, and S. Gurumurthi. ICR: In-Cache Replication for Enhancing Data Cache Reliability. Technical Report CSE-02-020, The Pennsylvania State University, December 2002.
J.F. Ziegler. Terrestrial Cosmic Rays. IBM Journal of Research and Development, 40(1):19-39, January 1996.
J. F. Ziegler et al., "IBM experiments in soft fails in computer electronics," IBM J. Res. Develop., 40(1):3-18, 1996.
Created by Prof. Israel Koren, koren@ecs.umass.edu