The chpox - Checkpointing Utility and How to Use It


by Matt Rechenburg

First, thanks to Olexander Sudakov and Eugeniy Meshcheryakov for the chpox utility and for contributing to the openMosix community :) I have been waiting for this feature for quite some time, and many other people have asked for it.

Since checkpointing is a very interesting and useful topic in high-performance computing, I immediately started to test and use chpox. Here I give a short introduction and an example from my testing to show how you could use it. I hope it is helpful to somebody.

How to Use chpox:

1) Install
The installation is quite easy. You just need a configured kernel-source directory for the kernel you are running; then follow the README file included in the sources exactly.

I used clusterKNOPPIX for this testing, which includes the chpox module, so everything is pre-configured. On another system with a hard-disk installation the usual "./configure && make && make install" worked well for me too. You should execute "depmod -a" to make the new chpox module available and then load it into your running kernel with:

insmod chpox_mod

On my node:

root@0[root]# insmod chpox_mod
Using /lib/modules/2.4.22-openmosix-1/misc/chpox_mod.o

After loading the module you can continue with registering your processes.

2) My test applications for chpox
For my testing I used the latest clusterKNOPPIX release, which includes the chpox module and command-line tools.

I used two different test applications:

- A simple shell script

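#!/bin/bash
# Count upwards once per second, printing each value and logging it to a file.
LOOP=1
> /tmp/chpoxloop.txt    # start with an empty log file
while true; do
    echo "loop $LOOP"
    echo "loop $LOOP" >> /tmp/chpoxloop.txt
    LOOP=$((LOOP+1))
    sleep 1
done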

- The openMosixcollector
(Also included in clusterKNOPPIX and started by default)

BTW: You should be able to reproduce all tests here by booting the latest clusterKNOPPIX and "cut and paste".

3) Register Processes
There are two flavors ;) the /proc interface or the chpox command-line tools.
To register a process you just need to write to the /proc/chpox/register file,
e.g.
echo "[PID]:31:1:/tmp/proc-dump" > /proc/chpox/register

The same registration can also be done with the "chpoxctl" utility:

chpoxctl add [PID] 31 1 /tmp/proc-dump

This registers the PID and makes it possible to checkpoint the process.

On my node I registered the openMosixcollector with PID 312:

root@0[root]# ps ax | grep openmosixcollector
312 ? S 0:07 openmosixcollector -d
root@0[root]# chpoxctl add 312 31 1 /tmp/proc-dump
root@0[root]# cat /proc/chpox/info
312:31:1 [C|0] -> /tmp/proc-dump [0.0]
root@0[root]#
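
If you register processes often, the /proc variant can be wrapped in a small helper. This is only a sketch: register_line and register_pid are hypothetical names that are not part of chpox, and the values 31 and 1 are simply copied from the example above.

```shell
# register_line (hypothetical helper): build the string written to
# /proc/chpox/register. The fields 31 and 1 are taken unchanged from
# the example in this article.
register_line() {
    printf '%s:31:1:%s\n' "$1" "$2"
}

# register_pid (hypothetical helper): register PID $1 with dump file $2
# through the /proc interface, after checking that the process exists.
register_pid() {
    local pid=$1 dump=$2
    kill -0 "$pid" 2>/dev/null || { echo "no such process: $pid" >&2; return 1; }
    register_line "$pid" "$dump" > /proc/chpox/register
}
```

On my node this would be called as "register_pid 312 /tmp/proc-dump". It needs the chpox module loaded, otherwise the write to /proc/chpox/register fails.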

4) Add required libs for your process
Do not forget to register the required libs for your process(es). Restoring a registered and checkpointed process will only work if you tell chpox which libraries are required for restoring, starting and running the process. Again, use chpoxctl to register the needed libs:

chpoxctl addlib [filename]

I wanted to register and checkpoint the openMosixcollector, so I first needed to find out which libraries it requires, using the ldd utility.
On my node:

root@0[root]# ldd /usr/bin/openmosixcollector
libstdc++.so.5 => /usr/lib/libstdc++.so.5 (0x40023000)
libm.so.6 => /lib/libm.so.6 (0x400db000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x400fe000)
libc.so.6 => /lib/libc.so.6 (0x40106000)
/lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
root@0[root]#

Then I added those libs to chpox by:

root@0[root]# chpoxctl addlib /usr/lib/libstdc++.so.5
root@0[root]# chpoxctl addlib /lib/libm.so.6
root@0[root]# chpoxctl addlib /lib/libgcc_s.so.1
root@0[root]# chpoxctl addlib /lib/libc.so.6
root@0[root]# chpoxctl addlib /lib/ld-linux.so.2

Then I list the added libs:

root@0[root]# chpoxctl listlibs
/lib/ld-linux.so.2
/lib/libc.so.6
/lib/libgcc_s.so.1
/lib/libm.so.6
/usr/lib/libstdc++.so.5
root@0[root]#
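
The ldd and addlib steps above can also be combined in a loop, so that no library is forgotten. This is a minimal sketch, not something shipped with chpox: parse_ldd and register_libs are hypothetical helper names, and the parsing assumes ldd output in the format shown above.

```shell
# parse_ldd (hypothetical helper): read ldd output on stdin and print
# the first absolute path found on each line, each path only once.
parse_ldd() {
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^\//) { print $i; break } }' |
        LC_ALL=C sort -u
}

# register_libs (hypothetical helper): pass every library that ldd
# reports for the binary $1 to "chpoxctl addlib".
register_libs() {
    ldd "$1" | parse_ldd | while read -r lib; do
        chpoxctl addlib "$lib"
    done
}
```

For the example above, "register_libs /usr/bin/openmosixcollector" would replace the five separate addlib calls. Lines without an absolute path (for example a library that ldd cannot find) are skipped, so it is still worth checking the result with "chpoxctl listlibs".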

5) Checkpoint
To checkpoint a process, just use the "kill" command and send signal 31:

kill -31 [PID]

This will "dump" the current state of the process PID to the /tmp/proc-dump file, which will be used for the "restore" later.

On my node:

root@0[root]# cat /proc/chpox/info
312:31:1 [C|0] -> /tmp/proc-dump [0.0]
root@0[root]# kill -31 312
root@0[root]# cat /proc/chpox/info
312:31:1 [O|1] -> /tmp/proc-dump [1064153742.922500]
root@0[root]# kill -31 312
root@0[root]# cat /proc/chpox/info
312:31:1 [O|2] -> /tmp/proc-dump [1064153760.345908]
root@0[root]#

The first + second checkpoint:)
This "checkpoint" can be executed at any time by running the "kill -31 [PID]" command.
You may want to separate process-dump files by using time-stamps.
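
One way to keep separate, time-stamped dump files is to copy the dump right after each checkpoint. This is only a sketch: checkpoint_and_keep and stamped_name are hypothetical helper names, and the one-second pause before copying is an assumption, since I have not checked when exactly the dump file is completely written.

```shell
# stamped_name (hypothetical helper): append a date-time suffix to a path.
stamped_name() {
    printf '%s.%s\n' "$1" "$(date +%Y%m%d-%H%M%S)"
}

# checkpoint_and_keep (hypothetical helper): checkpoint PID $1 with
# signal 31, then copy its dump file $2 to a time-stamped name.
checkpoint_and_keep() {
    local pid=$1 dump=$2
    kill -31 "$pid" || return 1
    sleep 1    # assumption: give chpox a moment to finish writing the dump
    cp "$dump" "$(stamped_name "$dump")"
}
```

With the registration from step 3, "checkpoint_and_keep 312 /tmp/proc-dump" would leave a copy with a name like /tmp/proc-dump.20030921-141542 next to the original dump.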

6) Restore
(Maybe the most interesting section of this document)
To restore a process, just pick the latest checkpoint-dump file of the registered process and execute:

ld-chpox [process-dump-file]

On my node:
Ok, I want to restore the openMosixcollector from its last checkpoint...
... so I have to stop it first.

root@0[root]# ps ax | grep openmosixcollector
312 ? S 0:09 openmosixcollector -d
root@0[root]# kill 312
root@0[root]# ps ax | grep openmosixcollector
root@0[root]#

Ok, it is stopped.
Now I restore/restart it from its last checkpoint:

root@0[root]# ld-chpox /tmp/proc-dump &
root@0[root]# ps ax | grep openmosixcollector
2328 pts/1 S 0:00 openmosixcollector -d
root@0[root]#

... Awesome! And it is working again:)

I also tested it with the simple shell script from point 2) and it continues at the same "place" where the last checkpoint was written to the dump file.
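
If you have kept several dump files of the same process, picking the latest one can be scripted. The following is a sketch under assumptions: newest_dump is a hypothetical helper name, and the files are assumed to share a prefix such as /tmp/proc-dump and to have no newlines in their names.

```shell
# newest_dump (hypothetical helper): print the most recently modified
# file among the paths given as arguments.
newest_dump() {
    ls -t -- "$@" 2>/dev/null | head -n 1
}
```

The restore would then look like: ld-chpox "$(newest_dump /tmp/proc-dump*)" &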

7) Important
Be sure to add all required libs to chpox, otherwise restoring the process won't work.
I started testing with some "unsuccessful attempts" until I noticed that.

8) Known Limitations
In my testing chpox was always able to checkpoint and restore all the processes I tried. It might be problematic for parallel applications that spawn and run processes on remote hosts. chpox is (currently?) limited to non-interactive applications (e.g. daemons). The chpox developers are working on support for sockets, shared memory, IPC and threads.

9) My Summary
chpox is very useful for checkpointing "long running" applications and simulations, for example.
I have had no problems with it in my testing so far and I will use it :)

You will find the chpox documentation in the included README file or on the chpox website at:

http://www.cluster.kiev.ua/tasks/chpx_eng.html

(Parts of this document are taken from there)
I just wanted to give a short introduction and an example of how to use this very useful piece of software.

Many thanks to Olexander and Eugeniy! Great Work!

Matt