Our project is divided into two phases. Phase I covers the development of RAPIDS 4.0 into a minimally intrusive monitoring tool. In Phase II, RAPIDS will be shaped into a complete performance evaluation tool. Each milestone below marks the completion of a key module.
The MPI Wrapper: We assume that applications use MPI for communication. The design of the LMM will require either modifications to MPI or the development of a "wrapper" to be used on top of MPI.
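The wrapper idea can be illustrated with a minimal sketch: a communication call is intercepted, timed, and logged before control returns to the application. This is not the RAPIDS design. `CommLog`, `send`, and `monitored_send` are hypothetical names, and `send` is a stand-in that does not call MPI at all.

```python
import time
from functools import wraps


class CommLog:
    """Collects one record per intercepted communication call (hypothetical)."""

    def __init__(self):
        self.events = []

    def wrap(self, fn):
        """Return a version of fn that logs its name, payload size, and duration."""
        @wraps(fn)
        def wrapper(payload, *args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(payload, *args, **kwargs)
            finally:
                # Record the call even if the wrapped function raised.
                self.events.append({
                    "call": fn.__name__,
                    "bytes": len(payload),
                    "seconds": time.perf_counter() - start,
                })
        return wrapper


def send(payload, dest):
    """Stand-in for a real MPI send; it only reports how many bytes it was given."""
    return len(payload)


log = CommLog()
monitored_send = log.wrap(send)   # the application calls this instead of send
monitored_send(b"hello", dest=1)
```

The application code is unchanged apart from calling the wrapped function, which is what keeps this style of monitoring minimally intrusive.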
The Monitoring Module: The main monitoring module collects information while the application is running and displays it in real time. Its completion will mark the end of Phase I.
Fault-tolerant Synthetic Workloads: These synthetic applications will be generic and tunable through user input.
Fault Injection and Recovery Monitoring: User-specified faults will be injected via a fault injector, and the system's response will be monitored. For applications that do not have a built-in recovery scheme, recovery techniques such as checkpointing or ALFT will be investigated.
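As a rough illustration of fault injection combined with checkpoint-based recovery, the sketch below runs a toy computation, injects faults at user-chosen steps, and rolls back to the last saved state each time. It covers checkpointing only, not ALFT. The function and exception names are hypothetical, and the in-memory "checkpoint" is an assumption made for brevity.

```python
class InjectedFault(Exception):
    """Raised by the hypothetical injector to simulate a failure."""


def run_with_checkpoints(n_steps, checkpoint_every, fault_at):
    """Sum 0..n_steps-1, restoring from the last checkpoint when a fault fires.

    fault_at is a set of step indices at which a fault is injected once each.
    Returns (result, number_of_recoveries, steps_re_executed).
    """
    pending = set(fault_at)
    checkpoint = (0, 0)              # saved state: (next step, running total)
    step, total = checkpoint
    recoveries = redone = 0
    while step < n_steps:
        try:
            if step in pending:
                pending.discard(step)    # each fault fires only once
                raise InjectedFault(f"fault at step {step}")
            total += step
            step += 1
            if step % checkpoint_every == 0:
                checkpoint = (step, total)
        except InjectedFault:
            recoveries += 1
            redone += step - checkpoint[0]   # work lost since the checkpoint
            step, total = checkpoint         # roll back to the saved state
    return total, recoveries, redone
```

Calling `run_with_checkpoints(10, 4, {6})` returns the same sum as a fault-free run, together with the number of recoveries and the number of steps that had to be repeated, which are the kinds of quantities a recovery monitor would report.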
Allocation and Scheduling Algorithms: Within the limits of the available resources, the user will be able to change these system-level algorithms.
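One way to make a scheduling algorithm user-changeable is to pass the policy in as a parameter. The sketch below is an assumption about how that could look, not the planned interface: `fifo`, `shortest_first`, and `mean_wait` are hypothetical names, and the model is a single processor running tasks back to back.

```python
def fifo(tasks):
    """Run tasks in the order they were submitted."""
    return list(tasks)


def shortest_first(tasks):
    """Run the shortest task first."""
    return sorted(tasks, key=lambda task: task[1])


def mean_wait(tasks, policy):
    """Average time a task waits before starting, under the given policy."""
    clock = waited = 0
    for _name, duration in policy(tasks):
        waited += clock      # this task waited for everything scheduled before it
        clock += duration
    return waited / len(tasks)


# Each task is (name, duration); the values are made up for illustration.
tasks = [("a", 8), ("b", 2), ("c", 4)]
fifo_wait = mean_wait(tasks, fifo)
sjf_wait = mean_wait(tasks, shortest_first)
```

Swapping the policy argument changes the schedule without touching the evaluation code, so the effect of each algorithm can be compared on the same workload.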
Integration: All of the above modules will be integrated into one complete, user-friendly system. This will mark the end of Phase II.