Other SGE Features

Job limits

Because of SGE's increased ability to handle parallel jobs, and some users' affinity for filling the queue with long jobs, we have implemented a couple of limits.

Max Threads/Jobs per user

There is a hard limit on the number of cores any one person can use at any point in time. It is currently set to 500, which I think is reasonable. If you need this limit raised, e-mail the administrator with your request and a detailed explanation of why you need that many cores. To see how many cores you are currently using, and what the limit is, you need to have a job running and run the qquota command:

mozes@loki ~ $ qquota
resource quota rule limit                filter
--------------------------------------------------------------------------------
max_slots_per_user/1 slots=4/500        users *

Time Limits

Roughly 1/4 of the nodes cannot be used for jobs longer than 72 hours. This means that there should always be cores available for short jobs. You do not need to know which machines are set up this way; the scheduler handles this on its own. qstat -j <job_id> will tell you why a job is not starting.
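For example, if job 105 were waiting in the queue (105 is just a placeholder job ID), you could run:

mozes@loki ~ $ qstat -j 105

and look near the bottom of the output for the scheduler's messages about why the job has not started.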

Checkpointing

What is Checkpointing?

Checkpointing is the ability for a running job to be "snapshotted" at a scheduled interval. If the machine the job is running on powers off unexpectedly, or your job is suspended for any reason, the job can be resumed from the "snapshot." Checkpointing is turned off by default, so you must select a checkpointing method to use it.

DMTCP

DMTCP is a new method of checkpointing on Beocat.

  • MPI checkpointing is currently broken; we are working to resolve the issue. Once it is fixed, your mpirun command should look like this: mpirun --mca btl tcp,self <app>
  • Unlike the old BLCR method, there are no compilation requirements.
  • Infiniband is still not supported. The DMTCP developers are working on this for an upcoming release.
  • Checkpoints are taken only once every 12 hours by default. Because this is a blocking checkpoint, if you are using a lot of memory (>=64GB) you may want to consider increasing the checkpoint interval. This can be done via the qsub option -c <interval>, e.g. qsub -ckpt dmtcp -c 36:00:00 script.qsub

Will Checkpointing work for me?

Well, it depends.

  1. Does your application use Infiniband? If so, none of our current checkpointing software will work.
  2. Does your application use OpenMPI? If so, is it a multi-node job? If you answered yes to both of these questions, DMTCP checkpointing may work for you.
  3. Are you using more than one application? If so, are they running at the same time? If you answered yes to both of these questions, checkpointing MAY work for you.
  4. For all other use cases, our checkpointing method should work.

If you still have problems with checkpointing, e-mail the administrator (beocat@cis.ksu.edu), who should be able to help you.

To enable DMTCP checkpointing for your application, add the -ckpt dmtcp option when you submit your job:

mozes@loki ~ $ qsub -ckpt dmtcp checkpoint.sh

In this submission, -ckpt dmtcp indicates that you want to use DMTCP checkpointing.
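Since -ckpt is an ordinary submit option, you can also embed it in the job script itself using the "#$" construct described under Job script options below. A minimal sketch, with a placeholder application and a 36-hour checkpoint interval:

#!/bin/bash
#$ -ckpt dmtcp  # Enable DMTCP checkpointing
#$ -c 36:00:00  # Checkpoint every 36 hours instead of the 12-hour default
~/my_app  # Placeholder for your actual application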

Accounting

Sometimes it is useful to know about the resources you have utilized. This can be done through the qacct command:

mozes@loki ~ $ qacct -o $USER
OWNER     WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
======================================================================================================================
mozes         21856     11525.840       572.390     14276.910          32520.257              0.000              0.000

It may be more useful to limit this query to a specific time period, for example the last day (-d 1):

mozes@loki ~ $ qacct -o $USER -d 1
OWNER     WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
======================================================================================================================
mozes          3832      2962.040       145.480      3234.560           7785.764              0.000              0.000

If you work with a group of people on Beocat and need to account for everyone's time, you can query by project. NOTE: This only works if you have a group set up in Beocat.

mozes@loki ~ $ qacct -P `qconf -suser $USER | grep default_project | awk '{ print $2 }'`
PROJECT     WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
========================================================================================================================
CIS             12917      7656.330       360.300      9121.270          20576.334              0.000              0.000
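The backticked portion of that command simply looks up your default project name; split into two steps, it is equivalent to:

mozes@loki ~ $ PROJECT=`qconf -suser $USER | grep default_project | awk '{ print $2 }'`
mozes@loki ~ $ qacct -P $PROJECT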

Job script options

If you were fortunate enough to use the Torque/Maui setup, you may have specified resource requests within the job script via "#PBS $OPTION" lines. The same can be done in SGE:

mozes@loki ~ $ vim sge_test.sub
#!/bin/bash
#$ -S /usr/local/bin/sh  # Specify my shell as sh
#$ -l h_rt=2:00:00  # Give me a 2 hour limit to finish the job
echo "Running osm2navit"
/usr/bin/env
cr_run ~/navit/navit/osm2navit Blah.bin < ~/osm/blah.osm
echo "finished osm2navit with exit code $?"

As you can see, I used the construct "#$ $OPTION" to embed submit options in the job script.

SGE Environment variables

Within an actual job, sometimes you need to know specific things about the running environment to set up your scripts correctly. Here is a listing of the environment variables that SGE makes available to you.

HOSTNAME=titan1.beocat
SGE_TASK_STEPSIZE=undefined
SGE_INFOTEXT_MAX_COLUMN=5000
SHELL=/usr/local/bin/sh
NHOSTS=2
SGE_O_WORKDIR=/homes/mozes
TMPDIR=/tmp/105.1.batch.q
SGE_O_HOME=/homes/mozes
SGE_ARCH=lx24-amd64
SGE_CELL=default
RESTARTED=0
ARC=lx24-amd64
USER=mozes
QUEUE=batch.q
PVM_ARCH=LINUX64
SGE_TASK_ID=undefined
SGE_BINARY_PATH=/opt/sge/bin/lx24-amd64
SGE_STDERR_PATH=/homes/mozes/sge_test.sub.e105
SGE_STDOUT_PATH=/homes/mozes/sge_test.sub.o105
SGE_ACCOUNT=sge
SGE_RSH_COMMAND=builtin
JOB_SCRIPT=/opt/sge/default/spool/titan1/job_scripts/105
JOB_NAME=sge_test.sub
SGE_NOMSG=1
SGE_ROOT=/opt/sge
REQNAME=sge_test.sub
SGE_JOB_SPOOL_DIR=/opt/sge/default/spool/titan1/active_jobs/105.1
ENVIRONMENT=BATCH
PE_HOSTFILE=/opt/sge/default/spool/titan1/active_jobs/105.1/pe_hostfile
SGE_CWD_PATH=/homes/mozes
NQUEUES=2
SGE_O_LOGNAME=mozes
SGE_O_MAIL=/var/mail/mozes
TMP=/tmp/105.1.batch.q
JOB_ID=105
LOGNAME=mozes
PE=mpi-fill
SGE_TASK_FIRST=undefined
SGE_O_HOST=loki
SGE_O_SHELL=/bin/bash
SGE_CLUSTER_NAME=beocat
REQUEST=sge_test.sub
NSLOTS=32
SGE_STDIN_PATH=/dev/null

Sometimes it is nice to know which hosts you have access to during a PE job; the file named by the PE_HOSTFILE variable contains that list. If your job has been restarted, RESTARTED will be set to 1, which lets you change what your script does rather than redoing all of your work. There are many other useful environment variables here; I will leave it to you to identify the ones you want.
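For example, a job script might use these variables as follows (a minimal sketch; expensive_setup and my_app are placeholders for your own commands):

#!/bin/bash
cat $PE_HOSTFILE  # Show which hosts were allocated to this PE job
if [ "$RESTARTED" -eq 1 ]; then
    echo "Job $JOB_ID was restarted, skipping setup"
else
    expensive_setup  # Placeholder: work that only needs to happen on the first run
fi
my_app  # Placeholder for your actual application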

SGE Email notifications

Frequently, you will want to know when a job has started, finished, or given an error. There are two directives you can give SGE to control these settings.

The first, '-M', gives the email address which should be notified. Please be sure to use a full email address, rather than just your login name.

The second, '-m', tells SGE when to send notifications.

  • b - Mail is sent at the beginning of the job.
  • e - Mail is sent at the end of the job.
  • a - Mail is sent when the job is aborted or rescheduled.
  • s - Mail is sent when the job is suspended.
  • n - No mail is sent.

So, in a typical configuration, you might specify

 -M MyEID@ksu.edu -m abe

This will send an email to MyEID@ksu.edu whenever a job is (a)borted, (b)eginning, or (e)nding.
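Like any other submit options, these can also be embedded in the job script with the "#$" construct:

#!/bin/bash
#$ -M MyEID@ksu.edu  # Send notifications to this full address
#$ -m abe  # Mail on (a)bort, (b)egin, and (e)nd
echo "job running on $HOSTNAME"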