Contents
Sun Grid Engine
About
Sun Grid Engine is the scheduling system that Beocat uses.
Getting Started with SGE
Submitting Simple Jobs
There are two major classes of jobs in SGE. Batch and Interactive. The type of job is determined by the command you use to submit the job. qsub will submit a "Batch" job. Batch jobs run without interactivity from you. They should be self-contained. They are by far the most common type of job. qrsh will submit an "Interactive" job. This type of job will log you into a node directly, and will allow you to troubleshoot or test a script that you are trying to run. In the following examples, qrsh and qsub can be substituted at will, as the options are generally the same for them.
Resource requests
To specify the resources you need, you do it as arguments your job submission command. If I were to need 1G of memory, I would specify it like so:
mozes@loki ~ $ qsub -l mem=1G
To request 4G of memory (per core), infiniband, a max runtime of 2hrs, and two cores across 2 machines it is very similar
mozes@loki ~ $ qsub -l mem=4G,ib=TRUE,h_rt=2:00:00 -pe mpi-1 2
The default resource request is mem=1G and h_rt=1:00:00. I have found that this is very close to what most of our users need.
Requestable Resources
A full listing of the objects you can request is available from the following command:
mozes@loki ~ $ qconf -sc | awk '{ if ($5 != "NO") { print }}'Sample Output:
name |
shortcut |
type |
relop |
requestable |
consumable |
default |
arch |
a |
RESTRING |
== |
YES |
NO |
NONE |
calendar |
c |
RESTRING |
== |
YES |
NO |
NONE |
cpu |
cpu |
DOUBLE |
>= |
YES |
NO |
0 |
cuda |
cuda |
INT |
<= |
YES |
JOB |
0 |
display_win_gui |
dwg |
BOOL |
== |
YES |
NO |
0 |
exclusive |
excl |
BOOL |
EXCL |
YES |
YES |
0 |
h_core |
h_core |
MEMORY |
<= |
YES |
NO |
0 |
h_cpu |
h_cpu |
TIME |
<= |
YES |
NO |
0:0:0 |
h_data |
h_data |
MEMORY |
<= |
YES |
NO |
0 |
h_fsize |
h_fsize |
MEMORY |
<= |
YES |
NO |
0 |
h_rss |
h_rss |
MEMORY |
<= |
YES |
NO |
0 |
h_rt |
h_rt |
TIME |
<= |
FORCED |
NO |
0:0:0 |
h_stack |
h_stack |
MEMORY |
<= |
YES |
NO |
0 |
h_vmem |
h_vmem |
MEMORY |
<= |
YES |
NO |
0 |
hostname |
h |
HOST |
== |
YES |
NO |
NONE |
infiniband |
ib |
BOOL |
== |
YES |
NO |
FALSE |
m_core |
core |
INT |
<= |
YES |
NO |
0 |
m_socket |
socket |
INT |
<= |
YES |
NO |
0 |
m_thread |
thread |
INT |
<= |
YES |
NO |
0 |
m_topology |
topo |
RESTRING |
== |
YES |
NO |
NONE |
m_topology_inuse |
utopo |
RESTRING |
== |
YES |
NO |
NONE |
mem_free |
mf |
MEMORY |
<= |
YES |
NO |
0 |
mem_total |
mt |
MEMORY |
<= |
YES |
NO |
0 |
mem_used |
mu |
MEMORY |
>= |
YES |
NO |
0 |
memory |
mem |
MEMORY |
<= |
FORCED |
YES |
0 |
myrinet |
mx |
BOOL |
== |
YES |
NO |
FALSE |
num_proc |
p |
INT |
== |
YES |
NO |
0 |
qname |
q |
RESTRING |
== |
YES |
NO |
NONE |
s_core |
s_core |
MEMORY |
<= |
YES |
NO |
0 |
s_cpu |
s_cpu |
TIME |
<= |
YES |
NO |
0:0:0 |
s_data |
s_data |
MEMORY |
<= |
YES |
NO |
0 |
s_fsize |
s_fsize |
MEMORY |
<= |
YES |
NO |
0 |
s_rss |
s_rss |
MEMORY |
<= |
YES |
NO |
0 |
s_rt |
s_rt |
TIME |
<= |
YES |
NO |
0:0:0 |
s_stack |
s_stack |
MEMORY |
<= |
YES |
NO |
0 |
s_vmem |
s_vmem |
MEMORY |
<= |
YES |
NO |
0 |
slots |
s |
INT |
<= |
YES |
YES |
1 |
swap_free |
sf |
MEMORY |
<= |
YES |
NO |
0 |
swap_rate |
sr |
MEMORY |
>= |
YES |
NO |
0 |
swap_rsvd |
srsv |
MEMORY |
>= |
YES |
NO |
0 |
swap_total |
st |
MEMORY |
<= |
YES |
NO |
0 |
swap_used |
su |
MEMORY |
>= |
YES |
NO |
0 |
virtual_free |
vf |
MEMORY |
<= |
YES |
NO |
0 |
virtual_total |
vt |
MEMORY |
<= |
YES |
NO |
0 |
virtual_used |
vu |
MEMORY |
>= |
YES |
NO |
0 |
Parallel Jobs
Please note, as this seems to be a common misconception, BEOCAT WILL NOT MAGICALLY THREAD YOUR APPLICATIONS. It is up to you to make sure that the application can make use of multiple cores, or even multiple machines, if necessary.
SGE's Multi-threaded job requests are managed with what SGE terms Parallel Environments. In the SGE world, you simply request a number of processors, and ask for an environment that will allocate them.
Parallel Environments
To use more than one core, you need to request a parallel environment. Currently, there are 11 of them depending on what you want to do.
mpi-fill
- This environment will use as many slots on each node as it can until it reaches the number of cores you have requested.
mpi-spread
- This environment will spread itself out over as many nodes as possible until it reaches the number of cores you have requested.
single
This environment will fit itself into a single node, allocating as many cores as requested. The number of cores must be <= 80 because our largest nodes have 80 cores. This is useful for jobs that cannot work with more than one machine. If you are not using mpirun you most likely want to use this environment.
mpi-1
- This environment will allocate the slots you've requested 1 per node.
mpi-2
- This environment will allocate the slots you've requested 2 per node. You must request cores as a multiple of 2
mpi-4
- This environment will allocate the slots you've requested 4 per node. You must request cores as a multiple of 4
mpi-8
- This environment will allocate the slots you've requested 8 per node. You must request cores as a multiple of 8
mpi-10
- This environment will allocate the slots you've requested 10 per node. You must request cores as a multiple of 10
mpi-12
- This environment will allocate the slots you've requested 12 per node. You must request cores as a multiple of 12
mpi-16
- This environment will allocate the slots you've requested 16 per node. You must request cores as a multiple of 16
mpi-80
- This environment will allocate the slots you've requested 80 per node. You must request cores as a multiple of 80
For example, if you wanted to get 8 cores on one node, you could do that in two different ways (ideally you would use the first):
mozes@loki ~ $ qsub -pe single 8 mpijob.sh
mozes@loki ~ $ qsub -pe mpi-8 8 mpijob.sh
One nice feature of the scheduler is the ability to request a range of processors. It will give you as many of them as it can. This can be done using the following syntax: "2|3|5-8|10|16". This would give you an effective request of either 16, 10, 8, 7, 6, 5, 3, or 2 cores, depending on how many cores are available when you submit.
Memory Requests with Multi-core jobs
Memory requests in the SGE environment are per-core. For example:
mozes@loki ~ $ qrsh -l mem=1G -pe mpi-spread 32
This would have requested 1G of memory per slot, ending up with a total request of 32G. If you actually wanted 1G for the entire job, you would need to compute 1G/32cores == 1024M/32cores == 32M/core. This means that you should make the following request when submitting the job:
mozes@loki ~ $ qrsh -l mem=32M -pe mpi-spread 32
Node Group Requests
Sometimes it is beneficial to run jobs on a specific set of nodes. This can be done like this:
mozes@loki ~ $ qsub -q '*@@scouts' test.sh
Please note the "@@" syntax, there are two '@' symbols, as hostgroups always begin with @. A group listing can be obtained from:
mozes@loki ~ $ qconf -shgrpl
Sample Output:
@fiends |
@ib |
@paladins |
@rangers |
@rogues |
@scouts |
@scouts-rack1 |
@scouts-rack2 |
@scouts-rack3 |
@titans |
To see the machines in each group you would do something like:
mozes@loki ~ $ qconf -shgrp @rogues
Sample Output:
group_name @rogues |
hostlist rogue1.beocat rogue2.beocat rogue3.beocat rogue4.beocat rogue5.beocat \ |
rogue6.beocat rogue7.beocat rogue8.beocat rogue9.beocat \ |
rogue10.beocat rogue11.beocat rogue12.beocat rogue13.beocat \ |
rogue14.beocat rogue15.beocat rogue16.beocat |
Monitoring your job
You have now submitted your job, and you want to see whether or not it has started.
The status command
'status' is a new program that gives an easy overview of your jobs.
With no other arguments, status shows all of your jobs both in the queue and being processed. In the case of queued jobs, it will show when the job was submitted. In the case of running jobs, it will show both the time the job began running and the queue in which it is running.
In most cases, you don't need to know the time the job was submitted, you need to know how long it has been since the job was submitted or how long it has been running. To find this out, use the '-r' switch:
status -r
Instead of showing that your job was submitted on the first of Octember, it will tell you it was submitted 4 days and 3 hours ago.
For more complex uses of this command, you can type
man status
for more information.
The qstat command
To check running jobs, use can also use SGE's own qstat command:
mozes@loki ~ $ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
101 0.00000 sge_test.s mozes qw 04/27/2009 10:00:36 1 The useful information for us is the piece under the item "state." "qw" means that the job is queued and waiting. If the job were running, the output would be more like this:
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
101 0.50000 sge_test.s mozes r 04/27/2009 10:00:50 batch.q@titan4.beocat 1 If that doesn't provide you with enough useful information, you could always use "qstat -ext":
job-ID prior ntckts name user project department state cpu mem io tckts ovrts otckt ftckt stckt share queue slots ja-task-ID
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
102 0.00000 0.00000 sge_test.s mozes CIS defaultdep qw 0 0 0 0 0 0.00 1 or
job-ID prior ntckts name user project department state cpu mem io tckts ovrts otckt ftckt stckt share queue slots ja-task-ID
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
102 0.50000 0.50000 sge_test.s mozes CIS defaultdep r 0:00:02:36 96.56516 0.00000 0 0 0 0 0 0.00 batch.q@titan4.beocat 1 One of the niceties of SGE is that memory utilization and actual CPU time are reported to the scheduler. Thus we have almost realtime statistics on per job utilization.
I now want to know what the states are for all of my jobs, and what resources I request for them:
mozes@loki ~ $ qstat -f -u mozes -r -ne
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
batch.q@titan4.beocat BIPC 0/1/16 2.95 lx24-amd64
102 0.50000 sge_test.s mozes r 04/27/2009 10:05:35 1
Full jobname: sge_test.sub
Master Queue: batch.q@titan4.beocat
Hard Resources: h_rt=3600 (0.000000)
memory=4G (0.000000)
Soft Resources:
Hard requested queues: *@titan4.beocatThe -f means you want the full output. Unfortunately, this means that it will print information for all hosts, too. The -ne suppresses the output from hosts that are not used. The -r tells qstat that you want information about the resources requested, and the -u mozes says you want info just about mozes' jobs.
Diagnosing jobs
What if you have submitted a job, and it just won't start? Begin with the same command shown above, and couple that with the output from 'qstat -j'
mozes@loki ~ $ qstat -f -u mozes -r -ne
############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
104 0.50000 sge_test.s mozes qw 04/27/2009 11:16:48 32
Full jobname: sge_test.sub
Requested PE: single 32
Hard Resources: h_rt=3600 (0.000000)
memory=4G (0.000000)
Soft Resources:
Hard requested queues: *@@titans
mozes@loki ~ $ qstat -j
scheduling info: queue instance "batch.q@brute4.beocat" dropped because it is overloaded: np_load_avg=1.341250 (no load adjustment) >= 1.25
Jobs can not run because queue instance is not contained in its hard queue list
104
Jobs can not run because available slots combined under PE are not in range of job
104Well, the last few lines of this command show that job 104, my job, cannot start because I requested more slots on a single machine than there are anywhere. To fix this, we could do something like this:
mozes@loki ~ $ qalter -pe mpi-fill 32 104 modified parallel environment of job 104 modified slot range of job 104
This modifies the job, telling it to use a different parallel environment. When we run the qstat -j command, we see that:
mozes@loki ~ $ qstat -j
scheduling info: queue instance "batch.q@titan1.beocat" dropped because it is full
queue instance "batch.q@titan2.beocat" dropped because it is full
queue instance "batch.q@titan1.beocat" dropped because it is overloaded: memory=0.000000 (no load value) <= 0G
queue instance "batch.q@titan2.beocat" dropped because it is overloaded: memory=0.000000 (no load value) <= 0G
mozes@loki ~ $ qstat -f -u mozes -r -ne
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
batch.q@titan1.beocat BIPC 0/16/16 0.86 lx24-amd64 a
104 0.50000 sge_test.s mozes r 04/27/2009 11:26:35 16
Full jobname: sge_test.sub
Master Queue: batch.q@titan1.beocat
Requested PE: mpi-fill 32
Granted PE: mpi-fill 32
Hard Resources: h_rt=3600 (0.000000)
memory=4G (0.000000)
Soft Resources:
Hard requested queues: *@@titans
---------------------------------------------------------------------------------
batch.q@titan2.beocat BIPC 0/16/16 0.15 lx24-amd64 a
104 0.50000 sge_test.s mozes r 04/27/2009 11:26:35 16
Full jobname: sge_test.sub
Master Queue: batch.q@titan2.beocat
Requested PE: mpi-fill 32
Granted PE: mpi-fill 32
Hard Resources: h_rt=3600 (0.000000)
memory=4G (0.000000)
Soft Resources:
Hard requested queues: *@@titansThe job seems to be running.
One more note on 'qstat -j'... Until we get the power and air conditioning upgraded, we only keep our most efficient nodes powered on to keep maximum performance within our power budget. For those machines we have powered off, you will see several messages that "queue instance (queue@nodename) dropped because it is temporarily not available." This is nothing to be concerned about. You might, however want to run the output through a pager so you can quickly skim over that information:
qstat -j JobNumber | less
Deleting Jobs
For those rare occasions that require you to delete a job, this can be done with qdel
qdel <Job-ID>
Array Jobs
One of SGE's useful options is the ability to run "Array Jobs"
It can be used with the following option to qsub.
-t n[-m[:s]]
Submits a so called Array Job, i.e. an array of identical tasks being differentiated only by an index number and being treated by Grid
Engine almost like a series of jobs. The option argument to -t specifies the number of array job tasks and the index number which will be
associated with the tasks. The index numbers will be exported to the job tasks via the environment variable SGE_TASK_ID. The option arguments
n, m and s will be available through the environment variables SGE_TASK_FIRST, SGE_TASK_LAST and SGE_TASK_STEPSIZE.
Following restrictions apply to the values n and m:
1 <= n <= 1,000,000
1 <= m <= 1,000,000
n <= m
The task id range specified in the option argument may be a single number, a simple range of the form n-m or a range with a step size.
Hence, the task id range specified by 2-10:2 would result in the task id indexes 2, 4, 6, 8, and 10, for a total of 5 identical tasks, each
with the environment variable SGE_TASK_ID containing one of the 5 index numbers.
Array jobs are commonly used to execute the same type of operation on varying input data sets correlated with the task index number. The
number of tasks in a array job is unlimited.
STDOUT and STDERR of array job tasks will be written into different files with the default location
<jobname>.['e'|'o']<job_id>'.'<task_id>
Example 1
Array Jobs have a variety of uses, one of the easiest to comprehend is the following:
I have an application, app1 I need to run the exact same way, on the same data set, with only the size of the run changing.
My original script looks like this:
1 #!/bin/bash
2 RUNSIZE=50
3 #RUNSIZE=100
4 #RUNSIZE=150
5 #RUNSIZE=200
6 app1 $RUNSIZE dataset.txt
For every run of that job I have to change the RUNSIZE variable, and submit each script. This gets tedious.
With Array Jobs the script can be written like so:
1 #!/bin/bash
2 #$ -t 50:200:50
3 RUNSIZE=$SGE_TASK_ID
4 app1 $RUNSIZE dataset.txt
I then submit that job, and SGE understands that it needs to run it 4 times, once for each task. It also knows that it can and should run these tasks in parallel.
Example 2
A slightly more complex use of Array Jobs is the following:
I have an application, app2, that needs to be run against every line of my dataset. Every line changes how app2 runs slightly, but I need to compare the runs against each other.
Originally I had to take each line of my dataset and generate a new submit script and submit the job. This was done with yet another script:
1 #!/bin/bash
2 DATASET=dataset.txt
3 scriptnum=0
4 while read LINE
5 do
6 echo "app2 $LINE" > ${scriptnum}.sh
7 qsub ${scriptnum}.sh
8 scriptnum=$(( $scriptnum + 1 ))
9 done < $DATASET
Not only is this needlessly complex, it is also slow, as qsub has to verify each job as it is submitted. This can be done easily with array jobs, as long as you know the number of lines in the dataset. This number can be obtained like so: wc -l dataset.txt in this case lets call it 5000.
1 #!/bin/bash
2 #$ -t 1:5000
3 app2 `sed -n "${SGE_TASK_ID}p" dataset.txt`
This uses a subshell via `, and has the sed command print out only the line number $SGE_TASK_ID out of the file dataset.txt.
Not only is this a smaller script, it is also faster to submit because it is one job instead of 5000, so qsub doesn't have to verify as many.
To give you an idea about time saved: submitting 1 job takes 1-2 seconds. by extension if you are submitting 5000, that is 5,000-10,000 seconds, or 1.5-3 hours.
Other qsub features
Output file control
Current Working Directory
qsub by default will place the output files in your home directory. If you feel that they should be organized differently, an easy way to do that is to use the -cwd option. cwd stands for current working directory, this means that the file will be placed in whatever directory you are currently in.
Combining STDOUT and STDERR
In your qsub command, you can add -j yes to add the errors directly into the output file.
Job Naming
Sometimes it is nice to be able to specify the job name so that you can better see what is happening in the queue. To name your job, add -N $WhatEverNameYouWant to the qsub command.
Other SGE Features
Job limits
Because of increased ability of SGE to handle parallel jobs, and some users affinity for filling the queue with long jobs, we have implemented a couple of methods for handling this.
Max Threads/Jobs per user
There is a hard limit on the number of cores any one person can utilize at any point in time. It is set to 500 currently, which I think is reasonable. If, for some reason, a user were to need this limitation raised, the administrator would need to be e-mailed with the request, and a detailed explanation as to why you need that many cores. To determine how many cores you have utilized, and what the limit is, you would need to have a job running and run the qquota command:
mozes@loki ~ $ qquota resource quota rule limit filter -------------------------------------------------------------------------------- max_slots_per_user/1 slots=4/500 users *
Time Limits
Roughly 1/4 of the nodes cannot be used for longer than 72 hours. This means that there should always be cores available for short jobs. You do not need to know which machines are setup this way, the scheduler will handle the changes on it's own. qstat -j will tell you why a job is not starting.
Checkpointing
What is Checkpointing
Checkpointing is the ability for the running job to be "snapshotted" at a scheduled interval. If the machine the job is running on were to power off unexpectedly or your job gets suspended for any reason, the job could be resumed from the "snapshot." Checkpointing is turned off by default, so you must enable a checkpointing method for it to be enabled.
DMTCP
DMTCP is a new method of checkpointing on Beocat.
MPI checkpointing is currently broken, we are working to resolve the issue. When we get this fixed, your mpirun command should look like this mpirun --mca btl tcp,self <app>
- Unlike the old BLCR method, there are no compilation requirements.
- Infiniband is still not supported. The DMTCP developers are working on this for an upcoming release.
Checkpoints are taken only once every 12 hours by default. As this is a blocking checkpoint, if you are using a lot of memory (>=64GB) you may want to think about increasing the checkpoint interval. This can be done via the qsub option -c <interval> i.e. qsub -ckpt dmtcp -c 36:00:00 script.qsub
Will Checkpointing work for me?
Well, it depends.
- Does your application use Infiniband? If so, none of our current checkpointing software will work.
Does your application use OpenMPI? If so, is it a multi node job? If you answered yes to both of these questions, DMTCP checkpointing may work for you.
Are you using more than one application? If so, are they running at the same time? If you answered yes to both of the questions, checkpointing MAY work for you.
- For all other use cases, checkpointing method should work.
If you still have problems with checkpointing, e-mail the administrator (beocat@cis.ksu.edu), and he should be able to help you.
To enable DMTCP checkpointing for your application, add the -ckpt dmtcp option to your job submission script:
mozes@loki ~ $ qsub -ckpt dmtcp checkpoint.sh
In this submission, -ckpt dmtcp indicates that you want to use DMTCP checkpointing.
Accounting
Sometimes it is useful to know about the resources you have utilized. This can be done through the qacct command:
mozes@loki ~ $ qacct -o $USER OWNER WALLCLOCK UTIME STIME CPU MEMORY IO IOW ====================================================================================================================== mozes 21856 11525.840 572.390 14276.910 32520.257 0.000 0.000
It may be more useful if you specify a timelimit for this query, for example the last day (-d 1):
mozes@loki ~ $ qacct -o $USER -d 1 OWNER WALLCLOCK UTIME STIME CPU MEMORY IO IOW ====================================================================================================================== mozes 3832 2962.040 145.480 3234.560 7785.764 0.000 0.000
If you work with a group of people on Beocat, and need to account for everyone's time: NOTE This only works if you have a group setup in Beocat.
mozes@loki ~ $ qacct -P `qconf -suser $USER | grep default_project | awk '{ print $2 }'`
PROJECT WALLCLOCK UTIME STIME CPU MEMORY IO IOW
========================================================================================================================
CIS 12917 7656.330 360.300 9121.270 20576.334 0.000 0.000
Job script options
If you were fortunate enough to use the Torque/Maui setup, you may have used an option of specifying resource requests within the job script. This was done via "#PBS $OPTION". This can also be done in SGE:
mozes@loki ~ $ vim sge_test.sub #!/bin/bash #$ -S /usr/local/bin/sh # Specify my shell as sh #$ -l h_rt=2:00:00 # Give me a 2 hour limit to finish the job echo "Running osm2navit" /usr/bin/env cr_run ~/navit/navit/osm2navit Blah.bin < ~/osm/blah.osm echo "finished osm2navit with exit code $?"
As you can see, I used the construct "#$ $OPTION" to implement submit options in the job script.
SGE Environment variables
Within an actual job, sometimes you need to know specific things about the running environment to setup your scripts correctly. Here is a listing of environment variables that SGE makes available to you.
HOSTNAME=titan1.beocat SGE_TASK_STEPSIZE=undefined SGE_INFOTEXT_MAX_COLUMN=5000 SHELL=/usr/local/bin/sh NHOSTS=2 SGE_O_WORKDIR=/homes/mozes TMPDIR=/tmp/105.1.batch.q SGE_O_HOME=/homes/mozes SGE_ARCH=lx24-amd64 SGE_CELL=default RESTARTED=0 ARC=lx24-amd64 USER=mozes QUEUE=batch.q PVM_ARCH=LINUX64 SGE_TASK_ID=undefined SGE_BINARY_PATH=/opt/sge/bin/lx24-amd64 SGE_STDERR_PATH=/homes/mozes/sge_test.sub.e105 SGE_STDOUT_PATH=/homes/mozes/sge_test.sub.o105 SGE_ACCOUNT=sge SGE_RSH_COMMAND=builtin JOB_SCRIPT=/opt/sge/default/spool/titan1/job_scripts/105 JOB_NAME=sge_test.sub SGE_NOMSG=1 SGE_ROOT=/opt/sge REQNAME=sge_test.sub SGE_JOB_SPOOL_DIR=/opt/sge/default/spool/titan1/active_jobs/105.1 ENVIRONMENT=BATCH PE_HOSTFILE=/opt/sge/default/spool/titan1/active_jobs/105.1/pe_hostfile SGE_CWD_PATH=/homes/mozes NQUEUES=2 SGE_O_LOGNAME=mozes SGE_O_MAIL=/var/mail/mozes TMP=/tmp/105.1.batch.q JOB_ID=105 LOGNAME=mozes PE=mpi-fill SGE_TASK_FIRST=undefined SGE_O_HOST=loki SGE_O_SHELL=/bin/bash SGE_CLUSTER_NAME=beocat REQUEST=sge_test.sub NSLOTS=32 SGE_STDIN_PATH=/dev/null
Sometimes it is nice to know what hosts you have access to during a PE job. You would checkout the PE_HOSTFILE to know that. If your job has been restarted, it is nice to be able to change what happens rather than redoing all of your work. If this is the case, RESTARTED would equal 1. There are lots of useful Environment Variables there, I will leave it to you to identify the ones you want.
OpenMP Jobs
OpenMP is a set of directives for Fortran, C and C++. It allows you to define certain aspects of an application as parallelizable and the compiler will handle the threading/parallelizing aspect of it.
hello-openmp.c
Here is an example openmp source. Compile it with gcc -fopenmp -o hello-openmp hello-openmp.c
1 #include <omp.h>
2 #include <stdio.h>
3 #include <stdlib.h>
4 int main (int argc, char *argv[]) {
5
6 int nthreads, tid;
7
8 /* Fork a team of threads giving them their own copies of variables */
9 #pragma omp parallel private(nthreads, tid)
10 {
11
12 /* Obtain thread number */
13 tid = omp_get_thread_num();
14 printf("Hello World from thread = %d\n", tid);
15
16 /* Only master thread does this */
17 if (tid == 0)
18 {
19 nthreads = omp_get_num_threads();
20 printf("Number of threads = %d\n", nthreads);
21 }
22
23 } /* All threads join master thread and disband */
24
25 }
Running OpenMP applications
The example above will run on all cores on a node. This is rarely the ideal, as you are usually not given an entire node to yourself. To make use of only the cores assigned to you, you must first make sure you have requested the 'single' parallel environment and in your job script you will need something like the following (before the application you are trying to run):
bash, sh, zsh
export OMP_NUM_THREADS=${NSLOTS}
csh or tcsh
setenv OMP_NUM_THREADS ${NSLOTS}
Non-mpi multi-node jobs
Some people have the need to run multi-node jobs that are not MPI aware. If this is the case, most of them use ssh or rsh to connect to other nodes to do this. Under Torque/Maui ssh access was allowed to nodes that had one of your jobs allocated to it. This does not function in the same way with SGE. SGE provides the utility qrsh to manage this sort of thing. Earlier in this document I stated that qrsh is used for interactive jobs. While this is true, it also allows access to currently running jobs provided you have given it the correct options. Here is an example:
# Let's say I am running a script that needs two threads, each on one node.
# Since I already know that I will have 2 nodes, I need to figure out the name of the other host I have access to
OTHERHOST="`cat $PE_HOSTFILE | awk '{print $1}' | grep -v $HOSTNAME`"
# Now I am going to fire up a server on the other host
qrsh -inherit -nostdin $OTHERHOST $SERVERAPPS &
# Now I continue on with my script, confident the server is running on the other host.
$CLIENTAPPS
PE Hostfile syntax
Some of you will need to know what hosts, and what processors on those hosts, a PE job has access to. In the job an environment variable called PE_HOSTFILE gets set, pointing to a file with the following syntax:
hostname.domainname #ofprocessors queue@hostname.domainname <NULL>
Sample:
scout41.beocat 1 batch.q@scout41.beocat <NULL> scout42.beocat 1 batch.q@scout42.beocat <NULL> scout43.beocat 1 batch.q@scout43.beocat <NULL> scout44.beocat 1 batch.q@scout44.beocat <NULL> titan8.beocat 1 batch.q@titan8.beocat <NULL>
This can be read into a loop rather simply with the following:
1 i=1
2 c=1
3 while read name cores other
4 do
5 # skip over blank lines
6 if [ "$name" == '' ]; then
7 continue;
8 fi
9 j=0;
10 while [ $j -lt $cores ]; do
11 qrsh -inherit -nostdin $name "/some/command/that/you/want/to/run" &
12 j=$[$j+1];
13 i=$(($i+$c));
14 done
15 done < <(cat $PE_HOSTFILE)
To generate a standard machinefile for various MPI wrappers, you can use the following:
1 awk '/beocat/ {for (i=1; i<=$2; i++) print $1}' ${PE_HOSTFILE} > ${TMPDIR}/machines
Memory Request Table
Here is a table of some common memory-core combinations:
Memory/Cores |
512M |
1G |
2G |
3G |
4G |
8G |
16G |
32G |
64G |
1 Core |
512M |
1G |
2G |
3G |
4G |
8G |
16G |
32G |
64G |
2 Cores |
256M |
512M |
1G |
1536M |
2G |
4G |
8G |
16G |
32G |
3 Cores |
171M |
342M |
683M |
1G |
1350M |
2700M |
5400M |
10800M |
21600M |
4 Cores |
128M |
256M |
512M |
768M |
1G |
2G |
4G |
8G |
16G |
8 Cores |
64M |
128M |
256M |
384M |
512M |
1G |
2G |
4G |
8G |
16 Cores |
32M |
64M |
128M |
192M |
256M |
512M |
1G |
2G |
4G |
32 Cores |
16M |
32M |
64M |
96M |
128M |
256M |
512M |
1G |
2G |
64 Cores |
8M |
16M |
32M |
48M |
64M |
128M |
256M |
512M |
1G |
The math to get these numbers is simple:
Memory Needed per job
---------------------
Number of CoresSimple conversions between Megabyte and Gigabyte are done like so:
1 Terabyte |
1024 Gigabytes |
1 Gigabyte |
1024 Megabytes |
1 Megabyte |
1024 Kilobytes |
1 Kilobyte |
1024 Bytes |
Hadoop With SGE
Apache Hadoop 0.20.2, 1.0.4, and 1.1.1 have been added to the cluster.
Basics
Hadoop, unfortunately isn't easily integrated with SGE, so there are a couple of rules you MUST follow. Currently, our startup scripts for Hadoop only work with bash, as such, please do *NOT* try to use any other shell within your job script.
SGE Options
First of all, you must request the mpi-fill parallel environment. This can be done either from your job script, with #$ -pe mpi-fill $CORES or via an option to qsub: qsub -pe mpi-fill $CORES $SCRIPT . Second, you need to get exclusive access to whatever nodes you are on. To do this, either put #$ -l exclusive=TRUE in your job script, or add the option to qsub: qsub -l exclusive=TRUE
Skeleton Job Script
For simplicity's sake, please use the following as a base "job script." This will take care of starting Hadoop on all of the nodes you have access to. You can remove anything in between hadoop_start and hadoop_end, but leave those two commands. If necessary, change the version
1 #!/bin/bash
2 #$ -pe mpi-fill 4
3 #$ -l exclusive=TRUE
4
5 #HADOOP_HOME=/opt/beocat/hadoop/0.20.2
6 #HADOOP_HOME=/opt/beocat/hadoop/1.0.4
7 #HADOOP_HOME=/opt/beocat/hadoop/1.1.1
8 HADOOP_HOME=/opt/beocat/hadoop/current
9 . $HADOOP_HOME/conf/hadoop-sge.sh
10 . /opt/sge/default/common/settings.sh
11
12 # This configures the environment and starts the Hadoop Cluster.
13 hadoop_start
14
15
16 ##############################
17 # Place your hadoop job here #
18 ##############################
19
20 # Copy the input file into the HDFS filesystem
21 #hadoop fs -put file.txt file.txt
22
23 # Running the hadoop task(s) here. I am specifying the jar, class, input, and output:
24 #hadoop jar ./wordcount.jar org.myorg.WordCount file.txt output
25
26 # Copying the output files from the HDFS filesystem
27 #hadoop fs -get output hadoop-output.$JOB_ID
28
29
30 # Stops the Hadoop cluster.
31 hadoop_end
Running a different version
Running a different version is simple with the script I prepared above. Simply remove the # at the beginning of the correct HADOOP_HOME line, then delete the HADOOP_HOME=/opt/beocat/hadoop/current. For instance to run 1.0.4 it would look like this:
#HADOOP_HOME=/opt/beocat/hadoop/0.20.2 HADOOP_HOME=/opt/beocat/hadoop/1.0.4 #HADOOP_HOME=/opt/beocat/hadoop/1.1.1
GPU Computing (CUDA)
We have 16 machines with nVidia Tesla m2050 cards in them. These cards are wonderful at Floating Point math. In fact, each card can do 1 TeraFLOP of Single-Precision operations.
Introductory CUDA Material
CUDA Programming Model Overview: http://www.youtube.com/watch?v=aveYOlBSe-Y
CUDA Programming Basics Part I (Host functions): http://www.youtube.com/watch?v=79VARRFwQgY
CUDA Programming Basics Part II (Device functions): http://www.youtube.com/watch?v=G5-iI1ogDW4
Writing CUDA Applications
CUDA applications can be written in C/C++ (on Beocat).
As CUDA is an nVidia technology your best bet for apis and development resources is to check out nVidia's GPU Development page.
The programming guide for the CUDA Toolkit 4.0 is available here.
nVidia Tesla m2050 Specifications
- 448x 1.15 GHz Cores
- 3GB of memory (12.5% is used for ECC)
- Architecture sm_20
Requesting CUDA
CUDA is requested simply by specifying -l cuda=TRUE on your normal qsub line. You get exclusive access to the CUDA card on that node when your job gets scheduled.
Compiling CUDA Applications
nvcc is your compiler for CUDA applications. When compiling your applications manually you will need to keep 3 things in mind:
- The CUDA development headers are located here: /opt/cuda/sdk/C/common/inc
- The CUDA architecture is: sm_20
- The CUDA SDK is currently not available on the headnode. (compile on the nodes with CUDA, either in your jobscript or via qrsh -l cuda=TRUE)
Do not run your cuda applications on the headnode. I cannot guarantee it will run, and it will give you terrible results if it does run.
Putting it all together you can compile CUDA applications as follows:
nvcc -I /opt/cuda/sdk/C/common/inc -arch sm_20 <source>.cu -o <output>
Logging in, Compiling, and Running your CUDA Application
Copy the following Application into Beocat as vecadd.cu
Sample Cuda Application -- vecadd.cu
1 // Kernel definition, see also section 4.2.3 of Nvidia Cuda Programming Guide
2 __global__ void vecAdd(float* A, float* B, float* C)
3 {
4 // threadIdx.x is a built-in variable provided by CUDA at runtime
5 int i = threadIdx.x;
6 A[i]=0;
7 B[i]=i;
8 C[i] = A[i] + B[i];
9 }
10
11 #include <stdio.h>
12 #define SIZE 10
13 int main()
14 {
15 int N=SIZE;
16 float A[SIZE], B[SIZE], C[SIZE];
17 float *devPtrA;
18 float *devPtrB;
19 float *devPtrC;
20 int memsize= SIZE * sizeof(float);
21
22 cudaMalloc((void**)&devPtrA, memsize);
23 cudaMalloc((void**)&devPtrB, memsize);
24 cudaMalloc((void**)&devPtrC, memsize);
25 cudaMemcpy(devPtrA, A, memsize, cudaMemcpyHostToDevice);
26 cudaMemcpy(devPtrB, B, memsize, cudaMemcpyHostToDevice);
27 // __global__ functions are called: Func<<< Dg, Db, Ns >>>(parameter);
28 vecAdd<<<1, N>>>(devPtrA, devPtrB, devPtrC);
29 cudaMemcpy(C, devPtrC, memsize, cudaMemcpyDeviceToHost);
30
31 for (int i=0; i<SIZE; i++)
32 printf("C[%d]=%f\n",i,C[i]);
33
34 cudaFree(devPtrA);
35 cudaFree(devPtrA);
36 cudaFree(devPtrA);
37
38 }
Gain Access to a CUDA Capable Node
1 qrsh -l cuda=TRUE
Running that line will log you into a CUDA node (if one is available)
Compile your application
1 nvcc -I /opt/cuda/sdk/C/common/inc -arch sm_20 vecadd.cu -o vecadd
The preceding line will compile vecadd.cu into the binary vecadd.
Run the application
To run your application, you would do the same as you would normally:
1 ./vecadd
The output should be:
C[0]=0.000000 C[1]=1.000000 C[2]=2.000000 C[3]=3.000000 C[4]=4.000000 C[5]=5.000000 C[6]=6.000000 C[7]=7.000000 C[8]=8.000000 C[9]=9.000000