In designing the system, our goal was to require no changes to the checkpointed programs. We also wanted to avoid any changes to the kernel code, especially as we had no access to it. The system has to support both 32-bit and 64-bit processes, and it should cope with both single-threaded and multithreaded processes. One of our design decisions was not to put every procedure in the kernel module, because kernel code has a low tolerance for errors and is subject to resource usage limits: all functionality that does not require access to kernel data is kept out of the kernel module. The kernel module is designed for a 64-bit kernel, because most systems used for HPC are 64-bit.
The C/R utility is implemented as a pair of cooperating tools designed for 64-bit Solaris 8 (it also runs on Solaris 9). The first is a user-level program that makes extensive use of several kernel services in order to stop a process, gain control of it and access its information. This tool uses the procfs module, which is accessible through the /proc filesystem (by default). The second is our own kernel module, which is accessible as a standard character device through the /dev/checkpoint file. This file can be opened and closed, but cannot be read or written; all functions are accessible through ioctl operations. Although we have implemented the kernel module only for the 64-bit kernel, the C/R utility can handle both 32-bit and 64-bit processes.
We rely on the procfs system and our own dynamically loaded kernel module. This approach has the following advantages:
- There is no need for any changes in the processes, so the sources of the checkpointed programs do not have to be modified.
- There are no assumptions about the programming language, library versions, etc. We perform a full checkpoint: all data, including shared libraries, are saved and restored without involving any runtime linker.
- There is no need for kernel changes. The kernel part of the C/R utility is implemented as a dynamically loaded module that can be added (and removed) at any time, with no need for a system reboot.
- There is no runtime overhead. We do not trace any kernel mechanisms while the process is running, so we do not consume any kernel resources (logging resource usage for each process in the system may impact overall system efficiency). All information about the process is gathered during the checkpoint.
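The access pattern for the /dev/checkpoint device (open, ioctl, close, with no read or write) can be sketched in C roughly as follows. This is only an illustration under stated assumptions: the ioctl commands and argument structures of the module are not documented here, so `CR_IOC_EXAMPLE`, `struct cr_request` and the helper names `cr_open` and `cr_call` are hypothetical placeholders, and opening the device read-only is likewise an assumption. Only the device path and the ioctl-only interface come from the description above.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Device path as given in the text. */
#define CR_DEVICE_PATH "/dev/checkpoint"

/* HYPOTHETICAL command number: the real ioctl codes are not documented. */
#define CR_IOC_EXAMPLE 0x4301

/* HYPOTHETICAL argument block passed to the module. */
struct cr_request {
    pid_t pid; /* process the request refers to */
};

/* Open the control device. The device is never read or written, only
 * used for ioctl, so read-only access is assumed to be sufficient.
 * Returns a file descriptor, or -1 on failure. */
int cr_open(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        fprintf(stderr, "cannot open %s: %s\n", path, strerror(errno));
    return fd;
}

/* Issue one ioctl request to the module.
 * Returns 0 on success, or -1 on failure. */
int cr_call(int fd, unsigned long cmd, struct cr_request *req)
{
    if (ioctl(fd, cmd, req) < 0) {
        fprintf(stderr, "ioctl 0x%lx failed: %s\n", cmd, strerror(errno));
        return -1;
    }
    return 0;
}
```

A caller would open the device with `cr_open(CR_DEVICE_PATH)`, fill in a `struct cr_request`, pass it to `cr_call`, and finally `close` the descriptor. If the module is not loaded, `cr_open` simply fails and returns -1.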
The presentation was published at the Sun Microsystems HPC Consortium Meeting, Phoenix, November 2003.
How to download
Before downloading, please read the Copyright and Licence below and register.
Please select the package you are interested in from those available below. This is the first freely available version of the package. If you find any bugs, please contact us.