Q U I C K S T A R T
Create a .rhosts
file in your home directory with
permissions set to 600 (-rw-------) that allows you to
rsh without a password to nova, to plab-01 through plab-34, and to plab-151
through plab-168.
Then connect to nova and type qsub -l nodes=1:m1GB -I .
This will allocate a dedicated 1GB server for you and connect you
to a shell on that server. To terminate, type exit .
Do not log on directly to these servers; your processes may
be killed.
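Assuming the .rhosts file above is in place, the quickstart steps can be sketched as a short shell session (this merely restates the commands above; it must be run on nova):

```shell
# On nova: request one dedicated 1GB server and open an interactive
# shell on it (the job waits in the queue until a node is free).
qsub -l nodes=1:m1GB -I
# ... work in the shell on the allocated server ...
exit   # terminates the interactive job and releases the node
```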
G E N E R A L
The School of Computer Science maintains a cluster of 52
dual-processor Linux computers for parallel and other
compute-intensive computations.
The cluster is divided into two groups of computers:
- 34 computers that can also be used as workstations. These computers
have 256MB of main memory and 2 Pentium-III processors running at
500 or 550MHz. They can run Windows under VMware.
These machines are located in the student lab in room 004.
- 18 dedicated compute servers.
These computers have 1 or 2GB of main memory, 2 Pentium-III
processors running at 600MHz, and fast SCSI disks. They have
a lot of available scratch disk space under /tmp.
These machines are located in the computer room in the basement.
Parallel and compute-intensive jobs are submitted to the
cluster using a job-queueing system called
PBS.
PBS allocates computers to jobs according to availability and
requested features (such as memory size).
It is possible to run interactive programs and shells under PBS.
R U L E S !
-
The 18 dedicated servers should be used only through PBS,
since we want to ensure that they are not time-shared among several
compute-intensive jobs. (Avoiding time
sharing ensures that users get all the physical memory of the machine
and allows reliable performance measurements.)
Do not log on to them directly using rsh, rlogin, telnet, or ssh
except to kill jobs, delete files from /tmp, and so on.
-
Jobs that use the 34 servers in room 004 should not use more
than one physical processor and should not use more than 128MB
of main memory, since we want to leave enough resources for
interactive use.
Do not run multithreaded processes on them and do not run more
than one process.
-
The cluster was funded by the Israel Science Foundation (ISF),
by the Vaada le-Tiktzuv ve-Tikhnun (the Planning and Budgeting
Committee), and by the University.
Research articles
that describe research that was conducted using
the cluster (either the workstations in 004 or the dedicated servers)
must acknowledge the ISF using the phrase "This research was
supported in part by THE ISRAEL SCIENCE FOUNDATION founded by the
Israel Academy of Sciences and Humanities." You must also provide
me (Sivan) with the full citation of all such articles in both
email and hardcopy.
H O W   T O   R U N   P R O G R A M S
We run programs using a job-submission system called PBS.
The programs can be compiled on nova or any of the interactive plab computers
(plab-02 up to plab-34; some may be down).
Programs are submitted from nova (only; not from plabs) and they
run on plab computers. Here are instructions that explain how
to compile MPI programs and how to use PBS.
I recommend that you download the sample files below on a
Unix or Linux machine, as opposed to downloading on a Windows machine and
transferring them to your account.
In one case that I have seen, downloading the samples on a Windows
machine caused some hidden control characters to be inserted into the files,
and this prevented PBS from running them properly.
- To compile a parallel MPI program, use the command
mpicc .
This command accepts the same arguments as gcc .
If you intend to run sequential jobs, compile your
programs normally on nova. You can also run existing sequential
programs, such as Matlab.
- We run programs by submitting them to PBS.
To run a program,
you submit a script to PBS and wait until it completes.
You can only use PBS on nova!
-
Before you try to use PBS, make sure that you can use rsh
to and from any of the plabs without typing your password;
otherwise PBS and MPI won't work. To ensure that rsh works,
copy this file to your home directory
under the name
.rhosts (don't forget the period),
change username in the file to your user name,
and give it permissions 600 (use chmod 600 ~/.rhosts ).
Check that it works by running rsh from nova to one of the plabs,
and from one of the plabs to another, making sure you are
not asked to type your password.
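If you prefer to generate the file rather than edit the sample, here is a minimal sketch, assuming your login is username and using the host list from the quickstart (nova, plab-01 through plab-34, plab-151 through plab-168):

```shell
# Generate a .rhosts file listing every cluster host followed by your
# user name. Replace "username" with your actual login; the real file
# belongs in your home directory as ~/.rhosts.
USER_NAME=username
RHOSTS=.rhosts
{
  echo "nova $USER_NAME"
  for i in $(seq -w 1 34); do echo "plab-$i $USER_NAME"; done   # plab-01..plab-34
  for i in $(seq 151 168); do echo "plab-$i $USER_NAME"; done   # plab-151..plab-168
} > "$RHOSTS"
chmod 600 "$RHOSTS"   # rsh ignores a group- or world-readable .rhosts
```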
-
Here is a sample PBS script called
script.pbs.
The MPI program that you want to run and its arguments are specified
in the last line.
In our case, the program is hello.c,
and we compile it using
mpicc -O3 -o hello hello.c
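The linked file is the authoritative version; as a rough sketch, a PBS script of this kind typically looks like the following. The two #PBS lines are quoted from this document, while the mpirun invocation (with $PBS_NODEFILE, MPICH-style options) is an assumption about how the sample launches the program:

```shell
#!/bin/sh
# Hypothetical sketch of script.pbs -- the actual sample is the linked file.
#PBS -m abe           # mail when the job begins, ends, or aborts
#PBS -l cput=05:00    # request 5 minutes of total CPU time
cd $PBS_O_WORKDIR     # run in the directory from which qsub was invoked
# Last line: the MPI program that you want to run and its arguments.
mpirun -machinefile $PBS_NODEFILE -np $(wc -l < $PBS_NODEFILE) ./hello
```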
-
We run the program using the command
qsub -l nodes=3 script.pbs (3 is the number
of computers that will run the program in parallel).
We get back from qsub the job identifier; in my case it
was 11.nova.math.tau.ac.il .
-
We can find out whether the job is running or not and
what other jobs are running using the command
qstat .
The important states that a job can be in are queued (Q; waiting),
running (R), and exiting (E).
-
You can find out more details about your job using
the command
qstat -f 11 (here 11 is the job id).
-
You can cancel the job, whether it is still waiting or running,
using the command
qdel 11 (here 11 is the job id).
-
When the job completes, its standard output and standard error
are copied into the directory that contains the script under
names of the form scriptname.oNN (output) and
scriptname.eNN (error), where NN is the job id. In my case, these files
were called script.pbs.o11 and script.pbs.e11.
-
PBS will send you mail when the job starts running, exits,
or is aborted (due to the line
#PBS -m abe
in the script).
-
The script requests 5 minutes CPU time (total over all
processors) using the line
#PBS -l cput=05:00 .
PBS will kill the job when it exceeds its CPU time.
PBS will also kill the job when it runs for too long even
if it does not consume CPU time. The resource name that controls
how long a job can run is walltime .
If you need to run long jobs, talk to me.
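As a sketch, the two limits can be requested with #PBS lines like these (cput and walltime are standard PBS resource names; the values here are only examples):

```shell
# Hypothetical resource-request lines for a PBS script.
#PBS -l cput=05:00          # kill the job after 5 minutes of total CPU time
#PBS -l walltime=00:30:00   # kill the job after 30 minutes of elapsed time
```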
-
The graphical tool xpbsmon is useful for determining the status
of the entire system.
Here is a screenshot.
You can see that I am running one job that uses 4 nodes (brown).
There are also free nodes
(green), inaccessible ones (black), and nodes that are down (red).
To set the tool up, click on its Pref.. button
and add nova as a server (delete the other ones). To update
the view, click Pref and Redisplay (or set up the auto-update
feature).
- You can also run MPI programs interactively using mpirun directly
(not in a PBS script). Please don't do it unless it's an
emergency and PBS does not work.
- You can also use PBS to run sequential jobs. Here is a script,
matjob.pbs , that runs a Matlab
program,
sample.m .
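The linked matjob.pbs is the authoritative version; a rough sketch of such a script (the #PBS lines mirror the MPI example, and the matlab flag is an assumption) might be:

```shell
#!/bin/sh
# Hypothetical sketch of matjob.pbs -- the actual sample is the linked file.
#PBS -m abe           # mail when the job begins, ends, or aborts
#PBS -l cput=05:00    # request 5 minutes of CPU time
cd $PBS_O_WORKDIR     # run in the directory from which qsub was invoked
# Feed the Matlab program sample.m to matlab on standard input.
matlab -nodisplay < sample.m
```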
-
You can run interactive
programs under PBS using
qsub -I .
-
To request a dedicated node with 1GB, use
qsub -l nodes=1:m1GB . For a 2GB dedicated node, use
qsub -l nodes=1:m2GB .
-
You can request any dedicated node using
qsub -l nodes=1:s600MHz , but we ask that
you avoid this, since you might be allocated a 2GB node when you
can use a 1GB one.
- It is also possible to use shared memory as the communication medium
for MPI (we normally use TCP/IP). This works on all the dual- and
quad-processor machines in the school and on larger machines in the computation
center and the HPCU.
- The HPCU
has several large parallel computers that
can run MPI programs. If you need access, talk to me.