Architecture and Real-Time Systems (ARTS) Laboratory

NetFT Project

The NetFT project is exploring novel fault tolerance techniques for microprocessor-based network interfaces. The key idea is that a network processor in each node of a distributed system opens new avenues for network-level and network-assisted fault tolerance. Compute resources available on the network interface make it possible to add fault tolerance with minimal intrusion on the node and its applications. Used cooperatively, the compute resources in the network interfaces lead to a networked system whose level of fault tolerance is far greater than what can be achieved when fault tolerance mechanisms are confined to the host system.

Our research mainly focused on the Myrinet network technology. Our initial implementation on an in-house Myrinet cluster showed very promising results: we were able to detect network interface hangs in less than a millisecond and recover from the fault in under two seconds, including the time to reload the Myrinet Control Program (MCP). Further details on this initial implementation can be found in the recently accepted DSN'03 paper "Low overhead fault tolerant networking in Myrinet".
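To make the idea of hang detection followed by recovery concrete, here is a minimal sketch of a generic heartbeat watchdog. It is not the mechanism described in the DSN'03 paper, which this page does not spell out. The names `Watchdog`, `monitor`, `timeout_us`, and the `recover` callback are all hypothetical and chosen only for illustration, and time is passed in explicitly rather than read from hardware.

```python
from dataclasses import dataclass


@dataclass
class Watchdog:
    """Generic heartbeat watchdog (illustrative only, not the NetFT design)."""
    timeout_us: int          # hypothetical detection threshold, in microseconds
    last_beat_us: int = 0    # timestamp of the most recent heartbeat

    def beat(self, now_us: int) -> None:
        # Called by the monitored component whenever it makes progress.
        self.last_beat_us = now_us

    def hung(self, now_us: int) -> bool:
        # The component is presumed hung once no heartbeat has arrived
        # within the timeout window.
        return now_us - self.last_beat_us > self.timeout_us


def monitor(wd: Watchdog, now_us: int, recover) -> bool:
    """Check the watchdog once; run the recovery callback if a hang is seen."""
    if wd.hung(now_us):
        recover()            # assumed action, e.g. reset and reload firmware
        wd.beat(now_us)      # restart the detection window after recovery
        return True
    return False
```

In a sketch like this, the detection latency is bounded by the timeout plus the interval between checks, so sub-millisecond detection would need both to be well below one millisecond. The recovery time depends entirely on what the callback does.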

We extended our fault tolerance techniques to more sophisticated network processor technologies such as the IXP1200. The design of the IXP1200 provides further capabilities and avenues for network-level fault tolerance research.

Our research was supported by the National Science Foundation. We expected all results and new technologies developed in this project to be publicly available and to benefit both academia and industry.