The openMosix-API


by Matt Rechenburg

Overview

General description

Detailed description

The /proc/hpc-interface:

    getting information
    set values
    Local information from /proc/hpc/admin/
    examples
    Information about the (remote) nodes
    Additional Information about local and remote processes

The openMosix Filesystem MFS

The functions from libmos

Examples using /proc/hpc

    cpucount -> functions returning the number of nodes
    foreach node -> do "something" for each node
    unlock self -> ensure that the created process is unlocked
    get ip-address -> which ip-address node N has
    get node IDs -> convert ip-addresses into node IDs
    migrate all -> migrate all possible processes

Examples using /mfs

    distributing a file -> copy a file to each node

Examples using libmos

    get load of a node -> function returning the openMosix-load
    get speed of a node -> function returning the openMosix-speed

Using in applications

Summary

Disclaimer

Additional sources





General description

This documentation about the openMosix API (application programming interface) explains in detail the functionality of the openMosix-structure in the kernel and how to access it during runtime. It will help (and encourage) you to use this API with your applications and in programs you develop yourself.

The example section of this documentation provides several functions for some "common languages" (shell, perl, php, c/c++) which you can use directly in your code.
It will also enhance your knowledge of your cluster and ease its administration.

Please contribute your own ideas to the openMosix-community, either by posting them to the openMosix-mailing list or to the "Wiki-area" on the openMosix-website:

http://www.openmosix.org/

Detailed description

openMosix is a Linux-kernel enhancement and most of its functionality is located in the kernel. To configure, administer and access statistical information from the kernel there needs to be a well-defined interface.

The most common way to provide this is to create a proc-interface during kernel-bootup or module-load. The interface has to provide a way to "get" and "set" values in kernel-space from user-level.

This is exactly the way openMosix does it. The directory /proc/hpc provides access to the openMosix-structure in the kernel. It contains files which are used either for "local" configuration or for information about all "remote" nodes ("local" means the node you are currently logged on to, "remote" means every other node in your cluster).

openMosix holds information on all nodes in the /proc/hpc directory, and these values are updated within the "decay-interval" (which is also configurable during runtime).

So each node in your cluster knows the exact values of all other "remote" nodes. This is required for calculating how to balance the load and the processes across the cluster. This calculation is called "the openMosix load-balancing algorithm" and was invented by Moshe Bar. Because this mechanism is organized in a decentralized way, each node decides itself whether a process should be migrated to another node (this minimizes the overhead and provides linear scalability with up to 1000 or more nodes).

Administrators and applications can directly interact with an openMosix-cluster through the /proc/hpc interface and easily change the whole configuration of the super-computer from any system.

The /proc/hpc-interface


getting information

Most of the files in /proc/hpc are pure text files. You can use standard commandline-utilities like "cat", or open the files with your favorite editor, to "read" the current values from the openMosix /proc-interface.
In your application you can use the regular file-related functions to read values from /proc/hpc (these functions all build on the "open", "read" and "close" system calls).
There are only some "binary" files (marked in the list below) which cannot be read without parsing them.
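
A minimal sketch in C (the helper name read_hpc_value is made up for this example; /proc/hpc/admin/mospe, the local node id, serves as an example file):

#include <stdio.h>

/* read a single integer value from a file in the /proc/hpc-interface */
int read_hpc_value(const char *path, int *value) {
    FILE *fp = fopen(path, "r");
    if (!fp)
        return -1;
    if (fscanf(fp, "%d", value) != 1) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    return 0;
}

int main() {
    int nodeid;
    if (read_hpc_value("/proc/hpc/admin/mospe", &nodeid) == 0)
        printf("local openMosix node id: %d\n", nodeid);
    return 0;
}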

set values

As you can "read" from the files in /proc/hpc there are also several files you can write to and directly "communicate" with the kernel and your openMosix-cluster. The values you "write" into those files are influencing the behavior of your cluster.
e.g.
echo 1 > /proc/hpc/admin/block
-blocks the arrival of remote processes for this node

echo 1 > /proc/hpc/admin/bring
-brings all migrated processes home for this node
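
The same can be done from C; a minimal sketch (the helper name write_hpc_value is made up for this example; writing to /proc/hpc usually requires root):

#include <stdio.h>

/* write an integer value into a file in the /proc/hpc-interface */
int write_hpc_value(const char *path, int value) {
    FILE *fp = fopen(path, "w");
    if (!fp)
        return -1;
    fprintf(fp, "%d", value);
    fclose(fp);
    return 0;
}

int main() {
    /* block the arrival of remote processes, like the echo above */
    if (write_hpc_value("/proc/hpc/admin/block", 1) != 0)
        printf("could not write to /proc/hpc/admin/block\n");
    return 0;
}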

Local information from /proc/hpc/admin/

(flat files)
The files in /proc/hpc/admin/ contain information about the "local" configuration
of the openMosix node you are logged on to.

block          allow/forbid arrival of remote processes
bring          bring home all migrated processes
dfsalinks      list of current symbolic dfsa-links
expel          send guest processes home
gateways       maximum number of gateways
lstay          local processes should stay
mospe          contains the openMosix node id
nomfs          disables/enables MFS
overheads      for tuning
quiet          stop collecting load-balancing information
decayinterval  interval for collecting information about load-balancing
slowdecay      default 975
fastdecay      default 926
speed          speed relative to PIII/1GHz
stay           enables/disables automatic process migration

(binary files)
config         the main configuration file (written by the setpe util)

examples

Writing a 1 into the following files in /proc/hpc/decay/ has these effects:
clear clears the decay statistics
   echo '1' > /proc/hpc/decay/clear

cpujob tells openMosix that the process is cpu-bound
   echo '1' > /proc/hpc/decay/cpujob

iojob tells openMosix that the process is io-bound
   echo '1' > /proc/hpc/decay/iojob

slow tells openMosix to decay its statistics slowly
   echo '1' > /proc/hpc/decay/slow

fast tells openMosix to decay its statistics fast
   echo '1' > /proc/hpc/decay/fast

Information about the (remote) nodes

The directories in /proc/hpc/nodes/[openMosix_ID] contain information (text files) about all other nodes in your openMosix-cluster. Each directory belongs to the openMosix node with the same node-ID as the directory name, e.g. the file /proc/hpc/nodes/4/load contains the load-value (openMosix-load value) of node 4. These directories are identical (updated within the decay-interval) on all nodes, so you can get information about every node in your cluster from any node.

/proc/hpc/nodes/[openMosix_ID]/cpus how many cpu's the node has
/proc/hpc/nodes/[openMosix_ID]/load the openMosix load of this node
/proc/hpc/nodes/[openMosix_ID]/mem available memory as openMosix believes
/proc/hpc/nodes/[openMosix_ID]/rmem available memory as Linux believes
/proc/hpc/nodes/[openMosix_ID]/speed speed of the node relative to PIII/1GHz
/proc/hpc/nodes/[openMosix_ID]/status status of the node
/proc/hpc/nodes/[openMosix_ID]/tmem available memory
/proc/hpc/nodes/[openMosix_ID]/util utilization of the node

These values are extremely useful for monitoring applications. They are very easy to access because they are "cluster-wide" (you can access the information about every node from each node in your cluster, as explained before).
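
For a monitoring application this is little more than building the per-node path and reading the file; a sketch in C (the helper name read_node_value is made up for this example):

#include <stdio.h>

/* read one statistic ("load", "speed", "cpus", ...) of a node
   from /proc/hpc/nodes/[openMosix_ID]/[item] */
int read_node_value(int node, const char *item, int *value) {
    char path[256];
    FILE *fp;
    snprintf(path, sizeof(path), "/proc/hpc/nodes/%d/%s", node, item);
    fp = fopen(path, "r");
    if (!fp)
        return -1;
    if (fscanf(fp, "%d", value) != 1) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    return 0;
}

int main() {
    int load;
    if (read_node_value(1, "load", &load) == 0)  /* node 1 assumed to exist */
        printf("openMosix load of node 1: %d\n", load);
    return 0;
}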

Additional Information about local and remote processes

The following files in the /proc directories are additional process-information provided by openMosix. Applications can read them and make them available, e.g. for process-statistics.

local process-information:
(values to read/get)
/proc/[PID]/cantmove reason why a process cannot be migrated
/proc/[PID]/lock if a process is locked to its home node
/proc/[PID]/nmigs how many times the process migrated
/proc/[PID]/where where the process is currently being computed

(values to write/set)
/proc/[PID]/lock    if a process is locked to its home node you can
                    write a 0 into this file; if the process can
                    migrate it will then be unlocked
/proc/[PID]/goto    write the node-ID into this file to tell the
                    process to migrate to the requested node
/proc/[PID]/migrate same as goto, but for remote processes
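
A minimal C sketch of this write/set interface (the helper name send_process_to is made up for this example; the pid and node-ID must of course exist):

#include <stdio.h>

/* tell process 'pid' to migrate to node 'node' by writing
   the node-ID into /proc/[pid]/goto */
int send_process_to(int pid, int node) {
    char path[64];
    FILE *fp;
    snprintf(path, sizeof(path), "/proc/%d/goto", pid);
    fp = fopen(path, "w");
    if (!fp)
        return -1;
    fprintf(fp, "%d", node);
    fclose(fp);
    return 0;
}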

remote process-information:
(values to read/get)
/proc/hpc/remote/from the home node of the process
/proc/hpc/remote/identity additional information about the process
/proc/hpc/remote/statm memory statistic of the process
/proc/hpc/remote/stats cpu statistics of the process

The openMosix Filesystem MFS

Besides the /proc/hpc-interface, every node, administrator or application can access every filesystem of every node in an openMosix-cluster if the MFS-filesystem is mounted
(read more about the openMosix-filesystem internals and configuration in the openMosix-HOWTO).
This documentation describes the fundamental function and use of MFS in your cluster and applications. Below the directory /mfs you will find several directories, which are now discussed in detail.

/mfs/here -> / filesystem of the current node where your process runs.
/mfs/home -> / filesystem of the home node.
/mfs/magic -> / filesystem of the current node when used by the "creat" system
call (or an "open" with the "O_CREAT" option) - otherwise, the
last node on which an MFS magical file was successfully created.
/mfs/lastexec -> / filesystem of the node on which the process last issued a
successful "execve" system-call.
/mfs/selected -> / filesystem of the node you selected by either your process itself
or one of its ancestors (before forking this process), writing a
number into "/proc/self/selected".

In addition to these MFS-directories there is one directory for each node in your cluster. It is named after the openMosix node-ID it belongs to and contains the complete / filesystem of that remote node (without /proc, to avert endless loops).
These directories in /mfs are very useful for distributing files to all nodes or for creating a single-system-image filesystem.
In the example section you will find out how to use and take advantage of MFS.
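
From a program, accessing a remote node through MFS is just ordinary file I/O; a minimal sketch (node-ID 1 and the default mount point /mfs are assumptions for this example):

#include <stdio.h>

int main() {
    char line[128];
    /* read /etc/hostname of node 1 through the MFS mount */
    FILE *fp = fopen("/mfs/1/etc/hostname", "r");
    if (!fp) {
        printf("MFS not mounted or node 1 not reachable\n");
        return 1;
    }
    if (fgets(line, sizeof(line), fp))
        printf("hostname of node 1: %s", line);
    fclose(fp);
    return 0;
}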

The functions from libmos

openMosix also provides a programming-library which is useful to include because it gives very easy access to the openMosix-information. This library, called 'libmos', is included in the openMosix user-tools and is normally installed automatically.
The following list contains the functions from libmosix.h and explains them in detail:

int msx_readval(char *path, int *val);
-reads 'val' from 'path'
('path' here is always the path to a file in the /proc/hpc-interface to read/write)

int msx_readval2(char *path, int *val1, int *val2);
-reads 'val1' and 'val2' from 'path'

int msx_write(char *path, int val);
-writes 'val' to 'path'

int msx_write2(char *path, int val1, int val2);
-writes 'val1' and 'val2' to 'path'

int msx_readnode(int node, char *item);
-reads 'item' from 'node'
('item' can be e.g. load, speed, cpus, util, status, mem, rmem, tmem)

int msx_readproc(int pid, char *item);
-read 'item' from a specific 'pid'
( this function reads information about a 'pid' from /proc/[pid]/[item],
'item' can be e.g. block, bring, stay.. )

int msx_read(char *path);
-reads value from 'path'

int msx_writeproc(int pid, char *item, int val);
-writes 'val' to 'item' for process 'pid'

int msx_readdata(char *fn, void *into, int max, int size);
(no information yet)

int msx_writedata(char *fn, char *from, int size);
(no information yet)

int msx_replace(char *fn, int val);
(no information yet)

int msx_count_ints(char *fn);
(no information yet)

int msx_fill_ints(char *fn, int *, int);
(no information yet)
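
As a small sketch of how these functions fit together (the return-value conventions are an assumption here: a negative result is taken as an error, as in the examples further below):

#include <stdio.h>
#include <libmosix.h>

int main() {
    int nodeid = 0;
    int load;

    /* read the local node id from the /proc/hpc-interface */
    if (msx_readval("/proc/hpc/admin/mospe", &nodeid) < 0)
        printf("could not read the local node id\n");

    /* read the openMosix-load of node 1 (node 1 assumed to exist) */
    load = msx_readnode(1, "load");
    printf("node id: %d, load of node 1: %d\n", nodeid, load);
    return 0;
}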

The libmosix-API also contains functions to control or change the openMosix-behavior:

int msxctl(msx_cmd_t cmd, int arg, void *resp, int len);
#define msxctl1(x) (msxctl((x), 0, NULL, 0))
#define msxctl2(x,y) (msxctl((x), (y), NULL, 0))
#define msxctl3(x,y,z) (msxctl((x), (y), (z), sizeof(*(z))))

e.g. 'mosctl' uses this function to let you administer your cluster.
The possible commands (msx_cmd_t cmd) are listed and explained below
(explanations taken from libmosix.h):

D_STAY, /* Disable automatic migration from here */
D_NOSTAY, /* Allow automatic migrations from here */
D_LSTAY, /* Disable automatic mig. of local processes */
D_NOLSTAY, /* Allow automatic mig. of local processes */
D_BLOCK, /* Block automatic migration to here */
D_NOBLOCK, /* Enable automatic migration to here */
D_EXPEL, /* Expel all processes to remote processors */
D_BRING, /* Bring back all processes */
D_GETLOAD, /* Get current load */
D_QUIET, /* Stop internal load-balancing activity */
D_NOQUIET, /* Resume internal load-balancing activity */
D_TUNE, /* Enter tuning mode */
D_NOTUNE, /* Exit tuning mode */
D_NOMFS, /* Disallow MFS access to this node */
D_MFS, /* Reallow MFS access to this node */
D_SETSSPEED, /* Set the standard speed, affecting D_GETLOAD */
D_GETSSPEED, /* Get the standard speed (default=1000) */
D_GETSPEED, /* Get machine's speed */
D_SETSPEED, /* Set machine's speed */
D_MOSIX_TO_IP, /* Convert openMosix to IP address */
D_IP_TO_MOSIX, /* Convert IP to openMosix address */
D_GETNTUNE, /* get number of kernel tuning parameters */
D_GETTUNE, /* get kernel tuning parameters */
D_GETSTAT, /* get openMosix status */
D_GETMEM, /* get current memory (free and total) */
D_GETDECAY, /* get decay parameters */
D_SETDECAY, /* set decay parameters */
D_GETRMEM, /* get OS' idea of memory (free and total) */
D_GETUTIL, /* get CPU utilizability % */
D_SETWHERETO, /* send a process somewhere */
D_GETPE, /* get node number */
D_GETCPUS, /* get number of CPUs */
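
A small sketch using the macros above (this assumes, as in example 9 below, that the 'get' commands return their value as the function result):

#include <stdio.h>
#include <libmosix.h>

int main() {
    int load;

    /* keep local processes here while we measure */
    msxctl1(D_LSTAY);

    /* read the current load of this node (assumed to be the return value) */
    load = msxctl1(D_GETLOAD);
    printf("current load: %d\n", load);

    /* allow automatic migration of local processes again */
    msxctl1(D_NOLSTAY);
    return 0;
}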

To use the libmosix-API in your application you need to include 'libmosix.h' in your source files and link with the -lmos option.
e.g. add the following line at the top of your source file(s):

#include <libmosix.h>

and compile with the command:

gcc -o your_program your_program.c -lmos

Then you can use the functions explained above in your source-code, e.g. like the following small example:

int nodeid=1;
int omstatus=0;
struct mosix_info info;
omstatus=msx_readnode(nodeid, "status");
info.status = omstatus;
printf("status of node %d is %d", nodeid, omstatus);

You will find more code-snippets showing how to use the libmos-functions in the example-section.


Examples using /proc/hpc

    1) cpucount -> functions returning the number of nodes
    2) foreach node -> do "something" for each node
    3) unlock self -> ensure that the created process is unlocked
    4) get ip-address -> which ip-address node N has
    5) get node IDs -> convert ip-addresses into node IDs
    6) migrate all -> migrate all possible processes

Examples using /mfs

    7) distributing a file -> copy a file to each node

Examples using libmos

    8) get load of a node -> function returning the openMosix-load
    9) get speed of a node -> function returning the openMosix-speed



1) functions to get the number of nodes in an openMosix-cluster


shell
######################### cpucount.sh ####################################
#!/bin/bash

function cpucount() {
    HOWMANY=0
    for n in `ls /proc/hpc/nodes`
    do
        TMP=`cat /proc/hpc/nodes/$n/cpus`
        if (( TMP != -101 ))
        then
            let "HOWMANY=HOWMANY+TMP"
        fi
    done
    echo $HOWMANY
}

cpucount

#######################################################################
perl
########################## cpucount.pl ##################################
#!/usr/bin/perl

sub cpucount {
    my $CLUSTERDIR = "/proc/hpc/nodes/";
    my $howmany = 0;
    opendir(my $nodes, $CLUSTERDIR) or die "cannot open $CLUSTERDIR\n";
    while (my $entry = readdir($nodes)) {
        next if ($entry eq '.' or $entry eq '..');
        $howmany++;
    }
    closedir($nodes);
    print "$howmany\n";
}

cpucount;

#######################################################################
php
########################## cpucount.php #################################
<?php

function cpucount() {
    $CLUSTERDIR = "/proc/hpc/nodes/";
    $howmany = 0;
    exec("ls ".$CLUSTERDIR, $ls);
    for ($p = 0; $p < count($ls); $p++) {
        $cpus = file($CLUSTERDIR.$ls[$p]."/cpus");
        if ($cpus[0] != -101) {
            $howmany++;
        }
    }
    echo $howmany;
}

cpucount();
?>

#######################################################################
c/c++
########################## cpucount.c ####################################
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>
#include <dirent.h>
#define clusterdir "/proc/hpc/nodes"

int cpucount() {
    DIR *dirhpc;
    struct dirent *dir_info;
    int howmany=0;
    int cpus=0;
    FILE *fp;
    char tmpfilename[200];
    if ((dirhpc=opendir(clusterdir))==NULL)
        return 0;
    while ((dir_info = readdir(dirhpc))!=NULL) {
        /* skip the "." and ".." directory entries */
        if (strchr(dir_info->d_name, '.'))
            continue;
        strcpy(tmpfilename, "/proc/hpc/nodes/");
        strcat(tmpfilename, dir_info->d_name);
        strcat(tmpfilename, "/cpus");
        fp=fopen(tmpfilename, "r");
        if (fp) {
            if (fscanf(fp, "%d", &cpus) == 1 && cpus > 0)
                howmany=howmany+cpus;
            fclose(fp);
        }
    }
    closedir(dirhpc);
    printf("%d\n", howmany);
    return howmany;
}

int main() {
    int processors=0;
    processors=cpucount();
    // fork as many computing-processes as "processors"
    // (see the C sketch in "Using in applications" below)
    return 0;
}

#######################################################################

2) foreach node -> do "something" for each node


shell
######################### foreach.sh ####################################
#!/bin/bash

function foreach() {
    for n in `ls /proc/hpc/nodes`
    do
        TMP=`cat /proc/hpc/nodes/$n/cpus`
        if (( TMP != -101 ))
        then
            # execute something for node $n
            echo "execute something for node $n"
        fi
    done
}
foreach

#######################################################################
perl
########################## for_each.pl ##################################
#!/usr/bin/perl

sub for_each {
    my $CLUSTERDIR = "/proc/hpc/nodes/";
    opendir(my $nodes, $CLUSTERDIR) or die "cannot open $CLUSTERDIR\n";
    while (my $nodeid = readdir($nodes)) {
        next if ($nodeid eq '.' or $nodeid eq '..');
        open(INFILE, "$CLUSTERDIR$nodeid/cpus") or next;
        my $cpus = <INFILE>;
        close(INFILE);
        if ($cpus != -101) {
            print "do something for node $nodeid\n";
        }
    }
    closedir($nodes);
}
for_each;

#######################################################################
php
########################## foreach.php #################################
<?php

function for_each() {
    $CLUSTERDIR = "/proc/hpc/nodes/";
    exec("ls ".$CLUSTERDIR, $ls);
    for ($p = 0; $p < count($ls); $p++) {
        $cpus = file($CLUSTERDIR.$ls[$p]."/cpus");
        if ($cpus[0] != -101) {
            echo "do something with node $ls[$p]\n";
        }
    }
}
for_each();

?>

#######################################################################

3) unlock self -> ensure that the created process is unlocked


shell
######################### unlock.sh ####################################
#!/bin/bash

function unlock() {
echo "0" > /proc/self/lock
}
unlock

#######################################################################
perl
########################## unlock.pl ##################################
#!/usr/bin/perl

sub unlock {
    open(OUTFILE, ">/proc/self/lock") ||
        die "Could not unlock myself!\n";
    print OUTFILE "0";
    close(OUTFILE);
}
unlock;

#######################################################################
php
########################## unlock.php #################################
<?php

function unlock() {
    $fd = fopen("/proc/self/lock", "w");
    fputs($fd, "0", 1);
    fclose($fd);
}
unlock();
?>

(!! requires that your webserver is running as user "root" which is a security risk)

#######################################################################
c/c++
########################## unlock.c ####################################
#include <stdio.h>
#include <stdlib.h>

int unlock() {
    FILE *fp;
    fp=fopen("/proc/self/lock", "w");
    if (fp) {
        fputc('0', fp);
        fclose(fp);
        return 1;
    } else {
        printf("could not unlock myself\n");
        return 0;
    }
}
// usage in application source-code
int main() {
    if (unlock()) {
        // do "something" here
        printf("hello\n");
    }
    return 0;
}

#######################################################################

4) get ip-address -> which ip-address has node N


shell
######################### whichip.sh ####################################
#!/bin/bash

NODE_ID="1"
IPADDRESS=`mosctl whois $NODE_ID`
echo "node $NODE_ID has the ip-address $IPADDRESS"

#######################################################################
perl
########################## whichip.pl ##################################
#!/usr/bin/perl

my $id = 1;
sub whichip {
    my $ipaddress = `mosctl whois $id`;
    chomp($ipaddress);
    return $ipaddress;
}
print whichip() . "\n";

#######################################################################
php
########################## whichip.php #################################
<?php
function whichip($id) {
    $ipaddress = exec("mosctl whois $id");
    return $ipaddress;
}
$nodeip = whichip(1);
echo "$nodeip";
?>

#######################################################################

5) get node Ids -> converts ip-addresses into node IDs


c/c++
########################## ip2id.c #######################################
#include <stdlib.h>
#include <stdio.h>
#include <ctype.h>
#include <time.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(int argc, char *argv[]) {
    unsigned long ip;
    long nodeid;

    if (argc != 2) {
        printf("Usage: ip2id [ip_address]\n");
        exit(1);
    }

    ip = inet_addr(argv[1]);
    nodeid = ntohl(ip) & 0xffff;
    printf("node id = %ld\n", nodeid);
    return 0;
}

#######################################################################

6) migrate all -> migrate all possible processes


shell
######################### migrateall.sh ###################################
#!/bin/bash

function migrateall() {
    for n in `ps -eo pid | sed -e "s/PID//"`
    do
        migrate $n balance
    done
}
migrateall

#######################################################################

7) distributing a file -> copy a file to each node


shell
######################### distributefile.sh #################################
#!/bin/bash

function distributefile() {
    for n in `ls /proc/hpc/nodes`
    do
        TMP=`cat /proc/hpc/nodes/$n/cpus`
        if (( TMP != -101 ))
        then
            /bin/cp /tmp/test.dat /mfs/$n/tmp/test.dat
        fi
    done
}
distributefile

#######################################################################

8) get load of a node -> function which return the openMosix-load


c/c++
########################## getload.c #####################################
#include <stdio.h>
#include <stdlib.h>
#include <libmosix.h>

int main() {
    int nodeid=1;
    int load=0;
    struct mosix_info info;
    load=msx_readnode(nodeid, "load");
    if (load >= 0)
        info.load = load;
    printf("load of node %d is %d\n", nodeid, load);
    return 0;
}

#######################################################################

9) get speed of a node -> function which return the openMosix-speed


c/c++
########################## getspeed.c ###################################
#include <stdio.h>
#include <stdlib.h>
#include <libmosix.h>

int main() {
    int nodeid=1;
    int speed=0;
    struct mosix_info info;
    speed = msxctl(D_GETSPEED, 0, NULL, 0);
    if (speed >= 0)
        info.speed = speed;
    printf("speed of node %d is %d\n", nodeid, speed);
    return 0;
}

#######################################################################

Using in applications

e.g. povray (graphic rendering)
With the pvm-patched povray you can set the number of processes at the commandline with the -NT option. Using the cpucount.sh from example 1 above you can automatically calculate how many processes to start (one process for each processor in your cluster).

Many applications use commandline-parameters to set the number of processes to fork. For all those programs you can easily use `cpucount.sh` instead of a static number, or edit the program-sources and include the cpucount-function from cpucount.c (example 1) at the point where the application forks its computing-processes (see the C sketch after the pov.sh example below).
############################## pov.sh ###################################
#!/bin/bash

# start pvm in the background
pvm &
# unlock myself
echo "0" > /proc/self/lock
# start rendering
/usr/bin/x-pvmpov -NT`cpucount.sh` +I/root/mypovpic.pov +O/root/mypovpic.tga +L/usr/local/povray31/include +W1024 +H768 antialias=on +P +V +D

#######################################################################
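
If you maintain the program-sources yourself, the same idea looks roughly like this in C (a sketch only: do_work is a hypothetical placeholder for one slice of your computation, and cpucount() is the function from example 1):

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int cpucount(void);   /* from example 1 (cpucount.c) */
void do_work(int i);  /* hypothetical: one slice of the computation */

int main() {
    int i, processors = cpucount();
    for (i = 0; i < processors; i++) {
        if (fork() == 0) {
            /* child: openMosix balances the children automatically */
            do_work(i);
            _exit(0);
        }
    }
    /* wait for all children to finish */
    while (wait(NULL) > 0)
        ;
    return 0;
}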

e.g. bladeenc (converting audio into mp3)

This short example converts an audio-CD into mp3-files. All processes start in parallel on the system with the CD inserted and on which the script is executed. openMosix will take care of balancing the load across your cluster-nodes automatically.

############################## mp3rip.sh ################################
#!/bin/bash

cdparanoia -B
for n in `ls *.wav`
do
    bladeenc -quit -quiet $n -256 -copy -crc &
done

#######################################################################

Other examples of applications which are using the openMosix-API are:
- mon (openMosix-utility) -> uses libmosix
- mosctl (openMosix-utility) -> uses libmosix
- openMosixview/Mosixview -> uses the /proc/hpc-interface



Summary

'You do not have to change your application' is one of the slogans of openMosix. That is and will stay true. This document describes the well-defined openMosix-API which you can use in or with your applications. The explained functions are useful to automate monitoring and administration of your cluster.

Of course you can also use these functions to create "pure" openMosix-applications (e.g. you can "hardcode" your application to run on specific nodes, or to "do something" if the load increases/decreases). This is an additional option and not a must, because openMosix provides this functionality transparently in the kernel.

Disclaimer

All code-examples come without guarantee.
Use the information in this document at your own risk and feel free to contribute your own ideas.

Matt Rechenburg (mosixview@t-online.de)

Additional sources

Many thanks to Kris Buytaert for:
The openMosix HOWTO by Kris Buytaert (buytaert@be.stone-it.com)
http://howto.ipng.be/openMosix-HOWTO/
(some of the explanations of the /proc-interface are from this great howto)
Also many thanks to Bruce Knox for giving this page a much nicer look.


This page is: http://www.openmosixview.com/docs/openMosixAPI.html
