Site moved

Beocat Docs have now moved to http://support.beocat.cis.ksu.edu

Sun Grid Engine

About

Sun Grid Engine is the scheduling system that Beocat uses.

Getting Started with SGE

Submitting Simple Jobs

There are two major classes of jobs in SGE: batch and interactive. The type of job is determined by the command you use to submit it. qsub submits a "batch" job. Batch jobs run without any interaction from you and should be self-contained; they are by far the most common type of job. qrsh submits an "interactive" job. This type of job logs you into a node directly and lets you troubleshoot or test a script you are trying to run. In the following examples, qrsh and qsub can be substituted at will, as the options are generally the same for both.
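
For example (myjob.sh is just a placeholder name for your own submit script):

qsub myjob.sh
qrsh

The first command queues myjob.sh as a batch job and returns immediately; the second logs you into a compute node for interactive work.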

Resource requests

To specify the resources you need, pass them as arguments to your job submission command. If I needed 1G of memory, I would specify it like so:

mozes@loki ~ $ qsub -l mem=1G

Requesting 4G of memory (per core), infiniband, a maximum runtime of 2 hours, and two cores spread across 2 machines is very similar:

mozes@loki ~ $ qsub -l mem=4G,ib=TRUE,h_rt=2:00:00 -pe mpi-1 2

The default resource request is mem=1G and h_rt=1:00:00. I have found that this is very close to what most of our users need.
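
For instance, a complete batch submission that overrides both defaults might look like this (myjob.sh is a hypothetical script name):

qsub -l mem=2G,h_rt=6:00:00 myjob.sh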

Requestable Resources

A full listing of the objects you can request is available from the following command:

mozes@loki ~ $ qconf -sc | awk '{ if ($5 != "NO") { print }}'

Sample Output:

name              shortcut  type      relop  requestable  consumable  default
------------------------------------------------------------------------------
arch              a         RESTRING  ==     YES          NO          NONE
calendar          c         RESTRING  ==     YES          NO          NONE
cpu               cpu       DOUBLE    >=     YES          NO          0
cuda              cuda      INT       <=     YES          JOB         0
display_win_gui   dwg       BOOL      ==     YES          NO          0
exclusive         excl      BOOL      EXCL   YES          YES         0
h_core            h_core    MEMORY    <=     YES          NO          0
h_cpu             h_cpu     TIME      <=     YES          NO          0:0:0
h_data            h_data    MEMORY    <=     YES          NO          0
h_fsize           h_fsize   MEMORY    <=     YES          NO          0
h_rss             h_rss     MEMORY    <=     YES          NO          0
h_rt              h_rt      TIME      <=     FORCED       NO          0:0:0
h_stack           h_stack   MEMORY    <=     YES          NO          0
h_vmem            h_vmem    MEMORY    <=     YES          NO          0
hostname          h         HOST      ==     YES          NO          NONE
infiniband        ib        BOOL      ==     YES          NO          FALSE
m_core            core      INT       <=     YES          NO          0
m_socket          socket    INT       <=     YES          NO          0
m_thread          thread    INT       <=     YES          NO          0
m_topology        topo      RESTRING  ==     YES          NO          NONE
m_topology_inuse  utopo     RESTRING  ==     YES          NO          NONE
mem_free          mf        MEMORY    <=     YES          NO          0
mem_total         mt        MEMORY    <=     YES          NO          0
mem_used          mu        MEMORY    >=     YES          NO          0
memory            mem       MEMORY    <=     FORCED       YES         0
myrinet           mx        BOOL      ==     YES          NO          FALSE
num_proc          p         INT       ==     YES          NO          0
qname             q         RESTRING  ==     YES          NO          NONE
s_core            s_core    MEMORY    <=     YES          NO          0
s_cpu             s_cpu     TIME      <=     YES          NO          0:0:0
s_data            s_data    MEMORY    <=     YES          NO          0
s_fsize           s_fsize   MEMORY    <=     YES          NO          0
s_rss             s_rss     MEMORY    <=     YES          NO          0
s_rt              s_rt      TIME      <=     YES          NO          0:0:0
s_stack           s_stack   MEMORY    <=     YES          NO          0
s_vmem            s_vmem    MEMORY    <=     YES          NO          0
slots             s         INT       <=     YES          YES         1
swap_free         sf        MEMORY    <=     YES          NO          0
swap_rate         sr        MEMORY    >=     YES          NO          0
swap_rsvd         srsv      MEMORY    >=     YES          NO          0
swap_total        st        MEMORY    <=     YES          NO          0
swap_used         su        MEMORY    >=     YES          NO          0
virtual_free      vf        MEMORY    <=     YES          NO          0
virtual_total     vt        MEMORY    <=     YES          NO          0
virtual_used      vu        MEMORY    >=     YES          NO          0

Parallel Jobs

Please note, as this seems to be a common misconception, BEOCAT WILL NOT MAGICALLY THREAD YOUR APPLICATIONS. It is up to you to make sure that the application can make use of multiple cores, or even multiple machines, if necessary.
Multi-threaded job requests are managed with what SGE terms Parallel Environments. In the SGE world, you simply request a number of processors and ask for an environment that will allocate them.

Parallel Environments

To use more than one core, you need to request a parallel environment. Currently, there are 11 of them depending on what you want to do.

  1. mpi-fill: This environment will use as many slots on each node as it can until it reaches the number of cores you have requested.
  2. mpi-spread: This environment will spread itself out over as many nodes as possible until it reaches the number of cores you have requested.
  3. single: This environment will fit itself into a single node, allocating as many cores as requested. The number of cores must be <= 80 because our largest nodes have 80 cores. This is useful for jobs that cannot work with more than one machine. If you are not using mpirun, you most likely want to use this environment.
  4. mpi-1: This environment will allocate the slots you've requested 1 per node.
  5. mpi-2: This environment will allocate the slots you've requested 2 per node. You must request cores as a multiple of 2.
  6. mpi-4: This environment will allocate the slots you've requested 4 per node. You must request cores as a multiple of 4.
  7. mpi-8: This environment will allocate the slots you've requested 8 per node. You must request cores as a multiple of 8.
  8. mpi-10: This environment will allocate the slots you've requested 10 per node. You must request cores as a multiple of 10.
  9. mpi-12: This environment will allocate the slots you've requested 12 per node. You must request cores as a multiple of 12.
  10. mpi-16: This environment will allocate the slots you've requested 16 per node. You must request cores as a multiple of 16.
  11. mpi-80: This environment will allocate the slots you've requested 80 per node. You must request cores as a multiple of 80.

For example, if you wanted to get 8 cores on one node, you could do that in two different ways (ideally you would use the first):

  1. mozes@loki ~ $ qsub -pe single 8 mpijob.sh

  2. mozes@loki ~ $ qsub -pe mpi-8 8 mpijob.sh

One nice feature of the scheduler is the ability to request a range of processors. It will give you as many of them as it can. This can be done using the following syntax: "2|3|5-8|10|16". This would give you an effective request of either 16, 10, 8, 7, 6, 5, 3, or 2 cores, depending on how many cores are available when you submit.
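
As a sketch, a submission using a slot range might look like the following (quoting keeps the shell from interpreting the '|' characters; myjob.sh is a placeholder):

qsub -pe mpi-fill '4-16' myjob.sh
qsub -pe mpi-fill '2|3|5-8|10|16' myjob.sh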

Memory Requests with Multi-core jobs

Memory requests in the SGE environment are per-core. For example:

mozes@loki ~ $ qrsh -l mem=1G -pe mpi-spread 32

This would have requested 1G of memory per slot, ending up with a total request of 32G. If you actually wanted 1G for the entire job, you would need to compute 1G/32cores == 1024M/32cores == 32M/core. This means that you should make the following request when submitting the job:

mozes@loki ~ $ qrsh -l mem=32M -pe mpi-spread 32
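
As a sketch, you can let the shell do that arithmetic for you when you know the total memory the whole job needs:

# 1G (1024M) total spread across 32 cores works out to 32M per core
TOTAL_MB=1024
CORES=32
qrsh -l mem=$(( TOTAL_MB / CORES ))M -pe mpi-spread $CORES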

Node Group Requests

Sometimes it is beneficial to run jobs on a specific set of nodes. This can be done like this:

 mozes@loki ~ $ qsub -q '*@@scouts' test.sh

Please note the "@@" syntax: there are two '@' symbols because hostgroups always begin with @. A listing of hostgroups can be obtained from:

 mozes@loki ~ $ qconf -shgrpl

Sample Output:

@paladins

@scouts

@elves

@mages

To see the machines in each group you would do something like:

mozes@loki ~ $ qconf -shgrp @rogues

Sample Output:

group_name @rogues

hostlist rogue1.beocat rogue2.beocat rogue3.beocat rogue4.beocat rogue5.beocat \

rogue6.beocat rogue7.beocat rogue8.beocat rogue9.beocat \

rogue10.beocat rogue11.beocat rogue12.beocat rogue13.beocat \

rogue14.beocat rogue15.beocat rogue16.beocat

Monitoring your job

You have now submitted your job, and you want to see whether or not it has started.

The status command

'status' is a new program that gives an easy overview of your jobs.

With no other arguments, status shows all of your jobs both in the queue and being processed. In the case of queued jobs, it will show when the job was submitted. In the case of running jobs, it will show both the time the job began running and the queue in which it is running.

In most cases, you don't need to know the time the job was submitted; rather, you need to know how long it has been since the job was submitted or how long it has been running. To find this out, use the '-r' switch:

status -r

Instead of showing that your job was submitted on the first of October, it will tell you it was submitted 4 days and 3 hours ago.

For more complex uses of this command, you can type

man status

for more information.

The qstat command

To check running jobs, you can also use SGE's own qstat command:

mozes@loki ~ $ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
    101 0.00000 sge_test.s mozes        qw    04/27/2009 10:00:36                                    1        

The useful information for us is the entry under the "state" column. "qw" means that the job is queued and waiting. If the job were running, the output would look more like this:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
    101 0.50000 sge_test.s mozes        r     04/27/2009 10:00:50 batch.q@titan4.beocat              1        

If that doesn't provide you with enough useful information, you could always use "qstat -ext":

job-ID  prior   ntckts  name       user         project          department state cpu        mem     io      tckts ovrts otckt ftckt stckt share queue                          slots ja-task-ID 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    102 0.00000 0.00000 sge_test.s mozes        CIS              defaultdep qw                                   0     0     0     0     0 0.00                                     1        

or

job-ID  prior   ntckts  name       user         project          department state cpu        mem     io      tckts ovrts otckt ftckt stckt share queue                          slots ja-task-ID 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    102 0.50000 0.50000 sge_test.s mozes        CIS              defaultdep r     0:00:02:36 96.56516 0.00000     0     0     0     0     0 0.00  batch.q@titan4.beocat              1        

One of the niceties of SGE is that memory utilization and actual CPU time are reported to the scheduler, so we have almost real-time statistics on per-job utilization.
Now suppose I want to know the states of all of my jobs and what resources I requested for them:

mozes@loki ~ $ qstat -f -u mozes -r -ne
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
batch.q@titan4.beocat          BIPC  0/1/16         2.95     lx24-amd64    
    102 0.50000 sge_test.s mozes        r     04/27/2009 10:05:35     1        
       Full jobname:     sge_test.sub
       Master Queue:     batch.q@titan4.beocat
       Hard Resources:   h_rt=3600 (0.000000)
                         memory=4G (0.000000)
       Soft Resources:   
       Hard requested queues: *@titan4.beocat

The -f means you want the full output. Unfortunately, this means that it will print information for all hosts, too. The -ne suppresses the output from hosts that are not used. The -r tells qstat that you want information about the resources requested, and the -u mozes says you want info just about mozes' jobs.

Diagnosing jobs

What if you have submitted a job and it just won't start? Begin with the same command shown above, and couple that with the output from 'qstat -j':

mozes@loki ~ $ qstat -f -u mozes -r -ne

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    104 0.50000 sge_test.s mozes        qw    04/27/2009 11:16:48    32        
       Full jobname:     sge_test.sub
       Requested PE:     single 32
       Hard Resources:   h_rt=3600 (0.000000)
                         memory=4G (0.000000)
       Soft Resources:   
       Hard requested queues: *@@titans
mozes@loki ~ $ qstat -j
scheduling info:            queue instance "batch.q@brute4.beocat" dropped because it is overloaded: np_load_avg=1.341250 (no load adjustment) >= 1.25

Jobs can not run because queue instance is not contained in its hard queue list
        104

Jobs can not run because available slots combined under PE are not in range of job
        104

Well, the last few lines of this output show that job 104 (my job) cannot start because I requested more slots on a single machine than any available machine has. To fix this, we could do something like this:

mozes@loki ~ $ qalter -pe mpi-fill 32 104
modified parallel environment of job 104
modified slot range of job 104

This modifies the job, telling it to use a different parallel environment. When we run the qstat -j command, we see that:

mozes@loki ~ $ qstat -j
scheduling info:            queue instance "batch.q@titan1.beocat" dropped because it is full
                            queue instance "batch.q@titan2.beocat" dropped because it is full
                            queue instance "batch.q@titan1.beocat" dropped because it is overloaded: memory=0.000000 (no load value) <= 0G
                            queue instance "batch.q@titan2.beocat" dropped because it is overloaded: memory=0.000000 (no load value) <= 0G
mozes@loki ~ $ qstat -f -u mozes -r -ne
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
batch.q@titan1.beocat          BIPC  0/16/16        0.86     lx24-amd64    a
    104 0.50000 sge_test.s mozes        r     04/27/2009 11:26:35    16        
       Full jobname:     sge_test.sub
       Master Queue:     batch.q@titan1.beocat
       Requested PE:     mpi-fill 32
       Granted PE:       mpi-fill 32
       Hard Resources:   h_rt=3600 (0.000000)
                         memory=4G (0.000000)
       Soft Resources:   
       Hard requested queues: *@@titans
---------------------------------------------------------------------------------
batch.q@titan2.beocat          BIPC  0/16/16        0.15     lx24-amd64    a
    104 0.50000 sge_test.s mozes        r     04/27/2009 11:26:35    16        
       Full jobname:     sge_test.sub
       Master Queue:     batch.q@titan2.beocat
       Requested PE:     mpi-fill 32
       Granted PE:       mpi-fill 32
       Hard Resources:   h_rt=3600 (0.000000)
                         memory=4G (0.000000)
       Soft Resources:   
       Hard requested queues: *@@titans

The job seems to be running.

One more note on 'qstat -j'... Until we get the power and air conditioning upgraded, we only keep our most efficient nodes powered on to keep maximum performance within our power budget. For the machines we have powered off, you will see several messages that "queue instance (queue@nodename) dropped because it is temporarily not available." This is nothing to be concerned about. You might, however, want to run the output through a pager so you can quickly skim over that information:

qstat -j JobNumber | less

Deleting Jobs

For those rare occasions that require you to delete a job, this can be done with qdel:

qdel <Job-ID>
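
qdel also accepts a user filter, which is handy if you need to clear out everything you have queued (use with care):

qdel 104          # delete a single job by its ID
qdel -u $USER     # delete all of your own jobs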

Array Jobs

One of SGE's useful features is the ability to run "Array Jobs".

They are submitted with the following option to qsub:

  -t n[-m[:s]]
     Submits  a  so  called  Array  Job,  i.e. an array of identical tasks being differentiated only by an index number and being treated by  Grid
     Engine almost like a series of jobs. The option argument to -t specifies the number of array job tasks and the index  number  which  will  be
     associated with the tasks. The index numbers will be exported to the job tasks via the environment variable SGE_TASK_ID. The option arguments
     n, m and s will be available through the environment variables SGE_TASK_FIRST, SGE_TASK_LAST and  SGE_TASK_STEPSIZE.

     Following restrictions apply to the values n and m:

            1 <= n <= 1,000,000
            1 <= m <= 1,000,000
            n <= m

     The task id range specified in the option argument may be a single number, a simple range of the form n-m or  a  range  with  a  step  size.
     Hence,  the task id range specified by 2-10:2 would result in the task id indexes 2, 4, 6, 8, and 10, for a total of 5 identical tasks, each
     with the environment variable SGE_TASK_ID containing one of the 5 index numbers.

     Array  jobs  are  commonly  used to execute the same type of operation on varying input data sets correlated with the task index number. The
     number of tasks in a array job is unlimited.

     STDOUT and STDERR of array job tasks will be written into different files with the default location

     <jobname>.['e'|'o']<job_id>'.'<task_id>

Example 1

Array Jobs have a variety of uses; one of the easiest to comprehend is the following:

I have an application, app1, that I need to run the exact same way, on the same data set, with only the size of the run changing.

My original script looks like this:

#!/bin/bash
RUNSIZE=50
#RUNSIZE=100
#RUNSIZE=150
#RUNSIZE=200
app1 $RUNSIZE dataset.txt

For every run of that job I have to change the RUNSIZE variable, and submit each script. This gets tedious.

With Array Jobs the script can be written like so:

#!/bin/bash
#$ -t 50-200:50
RUNSIZE=$SGE_TASK_ID
app1 $RUNSIZE dataset.txt

I then submit that job, and SGE understands that it needs to run it 4 times, once for each task. It also knows that it can and should run these tasks in parallel.

Example 2

A slightly more complex use of Array Jobs is the following:

I have an application, app2, that needs to be run against every line of my dataset. Every line changes how app2 runs slightly, but I need to compare the runs against each other.

Originally I had to take each line of my dataset and generate a new submit script and submit the job. This was done with yet another script:

#!/bin/bash
DATASET=dataset.txt
scriptnum=0
while read LINE
do
    echo "app2 $LINE" > ${scriptnum}.sh
    qsub ${scriptnum}.sh
    scriptnum=$(( $scriptnum + 1 ))
done < $DATASET

Not only is this needlessly complex, it is also slow, as qsub has to verify each job as it is submitted. The same thing can be done easily with an array job, as long as you know the number of lines in the dataset. That number can be obtained with wc -l dataset.txt; in this case, let's call it 5000.

#!/bin/bash
#$ -t 1-5000
app2 `sed -n "${SGE_TASK_ID}p" dataset.txt`

This uses command substitution via backticks: the sed command prints only line number $SGE_TASK_ID of the file dataset.txt, and that line becomes the arguments to app2.

Not only is this a smaller script, it is also faster to submit because it is one job instead of 5000, so qsub doesn't have to verify as many.

To give you an idea of the time saved: submitting 1 job takes 1-2 seconds, so submitting 5000 jobs takes 5,000-10,000 seconds, or roughly 1.5-3 hours.

Other qsub features

Output file control

Current Working Directory

qsub by default will place the output files in your home directory. If you would rather organize them differently, an easy way to do so is the -cwd option. cwd stands for "current working directory"; with this option, the output files will be placed in whatever directory you submitted the job from.

Combining STDOUT and STDERR

In your qsub command, you can add -j yes to merge the error stream (STDERR) directly into the output file.

Job Naming

Sometimes it is nice to be able to specify the job name so that you can better see what is happening in the queue. To name your job, add -N $WhatEverNameYouWant to the qsub command.
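
These options combine naturally on one submit line; for example (the script and job name are placeholders):

qsub -cwd -j yes -N my_analysis myjob.sh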

Other SGE Features

Job limits

Because of SGE's increased ability to handle parallel jobs, and some users' affinity for filling the queue with long jobs, we have implemented a couple of methods for handling this.

Max Threads/Jobs per user

There is a hard limit on the number of cores any one person can utilize at any point in time. It is currently set to 500, which I think is reasonable. If, for some reason, you need this limitation raised, e-mail the administrator with the request and a detailed explanation of why you need that many cores. To determine how many cores you are using, and what the limit is, you need to have a job running and then run the qquota command:

mozes@loki ~ $ qquota
resource quota rule limit                filter
--------------------------------------------------------------------------------
max_slots_per_user/1 slots=4/500        users *

Time Limits

Roughly 1/4 of the nodes cannot be used for jobs longer than 72 hours. This means that there should always be cores available for short jobs. You do not need to know which machines are set up this way; the scheduler will handle the placement on its own. qstat -j will tell you why a job is not starting.

Checkpointing

What is Checkpointing

Checkpointing is the ability for a running job to be "snapshotted" at a scheduled interval. If the machine the job is running on were to power off unexpectedly, or your job were suspended for any reason, the job could be resumed from the "snapshot." Checkpointing is off by default, so you must explicitly request a checkpointing method in order to use it.

DMTCP

DMTCP is a new method of checkpointing on Beocat.

  • MPI checkpointing is currently broken; we are working to resolve the issue. When we get this fixed, your mpirun command should look like this: mpirun --mca btl tcp,self <app>

  • Unlike the old BLCR method, there are no compilation requirements.
  • Infiniband is still not supported. The DMTCP developers are working on this for an upcoming release.
  • Checkpoints are taken only once every 12 hours by default. As this is a blocking checkpoint, if you are using a lot of memory (>=64GB) you may want to think about increasing the checkpoint interval. This can be done via the qsub option -c <interval>, e.g. qsub -ckpt dmtcp -c 36:00:00 script.qsub

Will Checkpointing work for me?

Well, it depends.

  1. Does your application use Infiniband? If so, none of our current checkpointing software will work.
  2. Does your application use OpenMPI? If so, is it a multi-node job? If you answered yes to both of these questions, DMTCP checkpointing may work for you.
  3. Are you using more than one application? If so, are they running at the same time? If you answered yes to both of these questions, checkpointing MAY work for you.
  4. For all other use cases, checkpointing should work.

If you still have problems with checkpointing, e-mail the administrator (beocat@cis.ksu.edu), and he should be able to help you.

To enable DMTCP checkpointing for your application, add the -ckpt dmtcp option to your job submission command:

mozes@loki ~ $ qsub -ckpt dmtcp checkpoint.sh

In this submission, -ckpt dmtcp indicates that you want to use DMTCP checkpointing.

Accounting

Sometimes it is useful to know about the resources you have utilized. This can be done through the qacct command:

mozes@loki ~ $ qacct -o $USER
OWNER     WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
======================================================================================================================
mozes         21856     11525.840       572.390     14276.910          32520.257              0.000              0.000

It may be more useful to limit the query to a period of time, for example the last day (-d 1):

mozes@loki ~ $ qacct -o $USER -d 1
OWNER     WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
======================================================================================================================
mozes          3832      2962.040       145.480      3234.560           7785.764              0.000              0.000

If you work with a group of people on Beocat and need to account for everyone's time, you can query by project. NOTE: This only works if you have a group set up in Beocat.

mozes@loki ~ $ qacct -P `qconf -suser $USER | grep default_project | awk '{ print $2 }'`
PROJECT     WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
========================================================================================================================
CIS             12917      7656.330       360.300      9121.270          20576.334              0.000              0.000

Job script options

If you were fortunate enough to use the Torque/Maui setup, you may have used the option of specifying resource requests within the job script. This was done via "#PBS $OPTION". The same can be done in SGE:

mozes@loki ~ $ vim sge_test.sub
#!/bin/bash
#$ -S /usr/local/bin/sh  # Specify my shell as sh
#$ -l h_rt=2:00:00  # Give me a 2 hour limit to finish the job
echo "Running osm2navit"
/usr/bin/env
cr_run ~/navit/navit/osm2navit Blah.bin < ~/osm/blah.osm
echo "finished osm2navit with exit code $?"

As you can see, I used the construct "#$ $OPTION" to embed submit options in the job script.
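
Here is a sketch of a fuller job script that embeds several of the options covered above (the resource values and program name are only examples):

#!/bin/bash
#$ -N my_analysis            # job name shown in qstat
#$ -l mem=2G,h_rt=12:00:00   # 2G per core, 12 hour runtime limit
#$ -pe single 4              # 4 cores on one node
#$ -cwd                      # keep output files in the submission directory
#$ -j yes                    # merge STDERR into the output file
./my_program input.dat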

SGE Environment variables

Within an actual job, you sometimes need to know specific things about the running environment to set up your scripts correctly. Here is a listing of the environment variables that SGE makes available to you.

HOSTNAME=titan1.beocat
SGE_TASK_STEPSIZE=undefined
SGE_INFOTEXT_MAX_COLUMN=5000
SHELL=/usr/local/bin/sh
NHOSTS=2
SGE_O_WORKDIR=/homes/mozes
TMPDIR=/tmp/105.1.batch.q
SGE_O_HOME=/homes/mozes
SGE_ARCH=lx24-amd64
SGE_CELL=default
RESTARTED=0
ARC=lx24-amd64
USER=mozes
QUEUE=batch.q
PVM_ARCH=LINUX64
SGE_TASK_ID=undefined
SGE_BINARY_PATH=/opt/sge/bin/lx24-amd64
SGE_STDERR_PATH=/homes/mozes/sge_test.sub.e105
SGE_STDOUT_PATH=/homes/mozes/sge_test.sub.o105
SGE_ACCOUNT=sge
SGE_RSH_COMMAND=builtin
JOB_SCRIPT=/opt/sge/default/spool/titan1/job_scripts/105
JOB_NAME=sge_test.sub
SGE_NOMSG=1
SGE_ROOT=/opt/sge
REQNAME=sge_test.sub
SGE_JOB_SPOOL_DIR=/opt/sge/default/spool/titan1/active_jobs/105.1
ENVIRONMENT=BATCH
PE_HOSTFILE=/opt/sge/default/spool/titan1/active_jobs/105.1/pe_hostfile
SGE_CWD_PATH=/homes/mozes
NQUEUES=2
SGE_O_LOGNAME=mozes
SGE_O_MAIL=/var/mail/mozes
TMP=/tmp/105.1.batch.q
JOB_ID=105
LOGNAME=mozes
PE=mpi-fill
SGE_TASK_FIRST=undefined
SGE_O_HOST=loki
SGE_O_SHELL=/bin/bash
SGE_CLUSTER_NAME=beocat
REQUEST=sge_test.sub
NSLOTS=32
SGE_STDIN_PATH=/dev/null

Sometimes it is nice to know what hosts you have access to during a PE job; you would check the file named by PE_HOSTFILE for that. If your job has been restarted, it is nice to be able to change what happens rather than redoing all of your work; in that case, RESTARTED will equal 1. There are lots of other useful environment variables there; I will leave it to you to identify the ones you want.
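
As a sketch, a job script might react to a few of these variables like so:

#!/bin/bash
if [ "$RESTARTED" = "1" ]; then
    echo "Job $JOB_ID was restarted; picking up from existing output"
fi
echo "Running $JOB_NAME with $NSLOTS slots in parallel environment $PE"
# List the hosts allocated to this job (only meaningful for PE jobs)
cat "$PE_HOSTFILE"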

SGE Email notifications

Frequently, you will want to know when a job has started, finished, or given an error. There are two directives you can give SGE to control these settings.

The first, '-M', gives the email address which should be notified. Please be sure to use a full email address, rather than just your login name.

The second, '-m', tells SGE when to send notifications:

  • b - Mail is sent at the beginning of the job.
  • e - Mail is sent at the end of the job.
  • a - Mail is sent when the job is aborted or rescheduled.
  • s - Mail is sent when the job is suspended.
  • n - No mail is sent.

So, in a typical configuration, you might specify

 -M MyEID@ksu.edu -m abe

This will send an email to MyEID@ksu.edu whenever a job is (a)borted, (b)eginning, or (e)nding.

OpenMP Jobs

OpenMP is a set of directives for Fortran, C, and C++. It allows you to mark certain parts of an application as parallelizable, and the compiler handles the threading/parallelizing for you.

hello-openmp.c

Here is an example OpenMP source file. Compile it with gcc -fopenmp -o hello-openmp hello-openmp.c

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[])
{
    int nthreads, tid;

    /* Fork a team of threads giving them their own copies of variables */
    #pragma omp parallel private(nthreads, tid)
    {
        /* Obtain thread number */
        tid = omp_get_thread_num();
        printf("Hello World from thread = %d\n", tid);

        /* Only master thread does this */
        if (tid == 0)
        {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    } /* All threads join master thread and disband */

    return 0;
}

Running OpenMP applications

The example above will run on all cores of a node. This is rarely ideal, as you are usually not given an entire node to yourself. To make use of only the cores assigned to you, first make sure you have requested the 'single' parallel environment; then, in your job script, you will need something like the following (before the application you are trying to run):

bash, sh, zsh

export OMP_NUM_THREADS=${NSLOTS}

csh or tcsh

setenv OMP_NUM_THREADS ${NSLOTS}
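
Putting it together, a minimal job script for the hello-openmp example above might look like this (the resource values are only examples):

#!/bin/bash
#$ -pe single 8                  # 8 cores on a single node
#$ -l mem=512M,h_rt=0:30:00
export OMP_NUM_THREADS=${NSLOTS} # only use the cores SGE actually gave us
./hello-openmp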

Non-mpi multi-node jobs

Some people need to run multi-node jobs that are not MPI-aware. Most such jobs use ssh or rsh to connect to the other nodes. Under Torque/Maui, ssh access was allowed to any node that had one of your jobs allocated to it. This does not work the same way with SGE. SGE provides the utility qrsh to manage this sort of thing. Earlier in this document I stated that qrsh is used for interactive jobs. While this is true, it also allows access to nodes allocated to your currently running job, provided you give it the correct options. Here is an example:

# Let's say I am running a script that needs two threads, each on one node.
# Since I already know that I will have 2 nodes, I need to figure out the name of the other host I have access to
OTHERHOST="`cat $PE_HOSTFILE | awk '{print $1}' | grep -v $HOSTNAME`"
# Now I am going to fire up a server on the other host
qrsh -inherit -nostdin $OTHERHOST $SERVERAPPS &
# Now I continue on with my script, confident the server is running on the other host.
$CLIENTAPPS

PE Hostfile syntax

Some of you will need to know what hosts, and what processors on those hosts, a PE job has access to. In the job an environment variable called PE_HOSTFILE gets set, pointing to a file with the following syntax:

hostname.domainname #ofprocessors queue@hostname.domainname <NULL>

Sample:

scout41.beocat 1 batch.q@scout41.beocat <NULL>
scout42.beocat 1 batch.q@scout42.beocat <NULL>
scout43.beocat 1 batch.q@scout43.beocat <NULL>
scout44.beocat 1 batch.q@scout44.beocat <NULL>
titan8.beocat 1 batch.q@titan8.beocat <NULL>

This can be read into a loop rather simply with the following:

i=1
c=1
while read name cores other
do
    # skip over blank lines
    if [ "$name" == '' ]; then
        continue;
    fi
    j=0;
    while [ $j -lt $cores ]; do
        qrsh -inherit -nostdin $name "/some/command/that/you/want/to/run" &
        j=$[$j+1];
        i=$(($i+$c));
    done
done < <(cat $PE_HOSTFILE)

To generate a standard machinefile for various MPI wrappers, you can use the following:

awk '/beocat/ {for (i=1; i<=$2; i++) print $1}' ${PE_HOSTFILE} > ${TMPDIR}/machines
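
That machinefile can then be handed to your MPI launcher. The exact flags vary between MPI implementations, so treat this as a sketch (my_mpi_app is a placeholder):

mpirun -np ${NSLOTS} -machinefile ${TMPDIR}/machines ./my_mpi_app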

Memory Request Table

Here is a table of some common memory-core combinations. The column headers give the total memory the job needs; each cell gives the per-core mem= value to request:

Memory/Cores   512M    1G      2G      3G      4G      8G      16G     32G     64G
-----------------------------------------------------------------------------------
1 Core         512M    1G      2G      3G      4G      8G      16G     32G     64G
2 Cores        256M    512M    1G      1536M   2G      4G      8G      16G     32G
3 Cores        171M    342M    683M    1G      1350M   2700M   5400M   10800M  21600M
4 Cores        128M    256M    512M    768M    1G      2G      4G      8G      16G
8 Cores        64M     128M    256M    384M    512M    1G      2G      4G      8G
16 Cores       32M     64M     128M    192M    256M    512M    1G      2G      4G
32 Cores       16M     32M     64M     96M     128M    256M    512M    1G      2G
64 Cores       8M      16M     32M     48M     64M     128M    256M    512M    1G

The math to get these numbers is simple:

    Memory Needed per job
    ---------------------
       Number of Cores

Simple conversions between these units are as follows:

1 Terabyte = 1024 Gigabytes
1 Gigabyte = 1024 Megabytes
1 Megabyte = 1024 Kilobytes
1 Kilobyte = 1024 Bytes

Hadoop on Beocat

Cloudera Hadoop has been installed on a subset of the cluster.

Basics

Hadoop, unfortunately, isn't easily integrated with SGE, so we have set up a small Hadoop cluster for you to use.

Hadoop Headnode

To get to the Hadoop install, run ssh theia from the headnode.

Filesystem access

The Hadoop and Beocat home directories are on different filesystems. You must use hadoop fs -put and hadoop fs -get to place data into, and bring it back out of, the Hadoop filesystem (HDFS).

The Hadoop filesystem is relatively small. Please do not leave large amounts of data there that will not be in use for some time. I make no guarantees about the viability of data in HDFS.

Running a Hadoop Job

hadoop jar path/to/file.jar
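
A typical workflow is to copy input data into HDFS, run the job, and then copy the results back out (the paths and jar name here are placeholders):

hadoop fs -put dataset.txt input/dataset.txt
hadoop jar path/to/file.jar
hadoop fs -get output ./results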

GPU Computing (CUDA)

We have 16 machines with nVidia Tesla m2050 cards in them. These cards are wonderful at floating-point math; in fact, each card can do 1 TeraFLOP of single-precision operations.

Introductory CUDA Material

CUDA Programming Model Overview: http://www.youtube.com/watch?v=aveYOlBSe-Y

CUDA Programming Basics Part I (Host functions): http://www.youtube.com/watch?v=79VARRFwQgY

CUDA Programming Basics Part II (Device functions): http://www.youtube.com/watch?v=G5-iI1ogDW4

Writing CUDA Applications

CUDA applications can be written in C/C++ (on Beocat).

nVidia Tesla m2050 Specifications

  • 448x 1.15 GHz Cores
  • 3GB of memory (12.5% is used for ECC)
  • Architecture sm_20

Requesting CUDA

CUDA is requested simply by specifying -l cuda=TRUE on your normal qsub line. You get exclusive access to the CUDA card on that node when your job gets scheduled.

Compiling CUDA Applications

nvcc is your compiler for CUDA applications. When compiling your applications manually, you will need to keep a few things in mind:

  • The CUDA development headers are located here: /opt/cuda/sdk/C/common/inc
  • The CUDA architecture is: sm_20
  • The CUDA SDK is currently not available on the headnode. (compile on the nodes with CUDA, either in your jobscript or via qrsh -l cuda=TRUE)
  • Do not run your cuda applications on the headnode. I cannot guarantee it will run, and it will give you terrible results if it does run.

Putting it all together, you can compile CUDA applications as follows:

 nvcc -I /opt/cuda/sdk/C/common/inc -arch sm_20 <source>.cu -o <output>

Logging in, Compiling, and Running your CUDA Application

Copy the following application into Beocat as vecadd.cu

Sample Cuda Application -- vecadd.cu

// Kernel definition, see also section 4.2.3 of Nvidia Cuda Programming Guide
__global__ void vecAdd(float* A, float* B, float* C)
{
    // threadIdx.x is a built-in variable provided by CUDA at runtime
    int i = threadIdx.x;
    A[i] = 0;
    B[i] = i;
    C[i] = A[i] + B[i];
}

#include <stdio.h>
#define SIZE 10
int main()
{
    int N = SIZE;
    float A[SIZE], B[SIZE], C[SIZE];
    float *devPtrA;
    float *devPtrB;
    float *devPtrC;
    int memsize = SIZE * sizeof(float);

    cudaMalloc((void**)&devPtrA, memsize);
    cudaMalloc((void**)&devPtrB, memsize);
    cudaMalloc((void**)&devPtrC, memsize);
    cudaMemcpy(devPtrA, A, memsize, cudaMemcpyHostToDevice);
    cudaMemcpy(devPtrB, B, memsize, cudaMemcpyHostToDevice);
    // __global__ functions are called:  Func<<< Dg, Db, Ns >>>(parameter);
    vecAdd<<<1, N>>>(devPtrA, devPtrB, devPtrC);
    cudaMemcpy(C, devPtrC, memsize, cudaMemcpyDeviceToHost);

    for (int i = 0; i < SIZE; i++)
        printf("C[%d]=%f\n", i, C[i]);

    // free each device buffer
    cudaFree(devPtrA);
    cudaFree(devPtrB);
    cudaFree(devPtrC);

    return 0;
}

Gain Access to a CUDA Capable Node

qrsh -l cuda=TRUE

Running that line will log you into a CUDA node (if one is available).

Compile your application

nvcc -I /opt/cuda/sdk/C/common/inc -arch sm_20 vecadd.cu -o vecadd

The preceding line will compile vecadd.cu into the binary vecadd.

Run the application

To run your application, you would do the same as you would normally:

./vecadd

The output should be:

C[0]=0.000000
C[1]=1.000000
C[2]=2.000000
C[3]=3.000000
C[4]=4.000000
C[5]=5.000000
C[6]=6.000000
C[7]=7.000000
C[8]=8.000000
C[9]=9.000000