Other SGE Features
Job limits
SGE handles parallel jobs better than our previous setup did, and some users have an affinity for filling the queue with long jobs. We have implemented a couple of limits to keep the cluster usable for everyone.
Max Threads/Jobs per user
There is a hard limit on the number of cores any one person can use at any point in time. It is currently set to 500, which I think is reasonable. If you need this limit raised, e-mail the administrator with the request and a detailed explanation of why you need that many cores. To see how many cores you are using and what the limit is, run the qquota command while you have a job running:
mozes@loki ~ $ qquota
resource quota rule    limit                filter
--------------------------------------------------------------------------------
max_slots_per_user/1   slots=4/500          users *
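If you check this often, a small helper can turn the qquota output into a one-line summary. This is only a sketch: the function name slot_usage is made up for illustration, and it assumes the limit column always looks like slots=USED/LIMIT, as it does in the sample output.

```shell
#!/bin/bash
# Sketch: summarize slot usage from `qquota` output read on stdin.
# Assumption: the limit column has the form "slots=USED/LIMIT".
slot_usage() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^slots=[0-9]+\/[0-9]+$/) {
                # Drop the leading "slots=" (6 characters) and split on "/"
                split(substr($i, 7), parts, "/")
                printf "%d of %d slots in use (%d free)\n", parts[1], parts[2], parts[2] - parts[1]
            }
        }
    }'
}

# On a login node, with a job running, you would use it like this:
#   qquota | slot_usage
```

The helper only reads text, so it never changes anything in the queue.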
Time Limits
Roughly 1/4 of the nodes will not run jobs for longer than 72 hours. This means that there should always be cores available for short jobs. You do not need to know which machines are set up this way; the scheduler will handle it on its own. qstat -j <job_id> will tell you why a job is not starting.
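If you want to know ahead of time whether a job can land on the short-job nodes, you can compare your requested runtime against the 72 hour limit. The sketch below assumes the scheduler bases this decision on the h_rt request (the same -l h_rt=... option used elsewhere on this page), which is not spelled out here. The function names are made up for illustration.

```shell
#!/bin/bash
# Sketch: check an h_rt request against the 72 hour short-node limit.
# Assumption: the scheduler uses the h_rt request to decide eligibility.

# Convert an h_rt value (H:MM:SS, or plain seconds) to seconds.
hrt_to_seconds() {
    echo "$1" | awk -F: '{
        if (NF == 3) print $1 * 3600 + $2 * 60 + $3
        else         print $1 + 0
    }'
}

# Succeeds (exit status 0) if the runtime is 72 hours or less.
fits_short_nodes() {
    [ "$(hrt_to_seconds "$1")" -le 259200 ]   # 72 * 3600 seconds
}

# Example:
#   if fits_short_nodes 36:00:00; then echo "eligible for every node"; fi
```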
Checkpointing
What is Checkpointing?
Checkpointing is the ability for a running job to be "snapshotted" at a scheduled interval. If the machine the job is running on powers off unexpectedly, or your job gets suspended for any reason, the job can be resumed from the snapshot. Checkpointing is turned off by default, so you must request a checkpointing method when you submit your job.
DMTCP
DMTCP is a new method of checkpointing on Beocat.
- MPI checkpointing is currently broken; we are working to resolve the issue. Once it is fixed, your mpirun command should look like this: mpirun --mca btl tcp,self <app>
- Unlike the old BLCR method, there are no compilation requirements.
- Infiniband is still not supported. The DMTCP developers are working on this for an upcoming release.
Checkpoints are taken only once every 12 hours by default. Because this is a blocking checkpoint, if your job uses a lot of memory (64GB or more) you may want to increase the checkpoint interval. This can be done with the qsub option -c <interval>, e.g. qsub -ckpt dmtcp -c 36:00:00 script.qsub
Will Checkpointing work for me?
Well, it depends.
- Does your application use Infiniband? If so, none of our current checkpointing software will work.
- Does your application use OpenMPI? If so, is it a multi-node job? If you answered yes to both of these questions, DMTCP checkpointing may work for you.
- Are you using more than one application? If so, are they running at the same time? If you answered yes to both of these questions, checkpointing MAY work for you.
- For all other use cases, checkpointing should work.
If you still have problems with checkpointing, e-mail the administrator (beocat@cis.ksu.edu) for help.
To enable DMTCP checkpointing for your application, add the -ckpt dmtcp option when you submit your job:
mozes@loki ~ $ qsub -ckpt dmtcp checkpoint.sh
In this submission, -ckpt dmtcp indicates that you want to use DMTCP checkpointing.
Accounting
Sometimes it is useful to know about the resources you have utilized. This can be done through the qacct command:
mozes@loki ~ $ qacct -o $USER
OWNER WALLCLOCK UTIME STIME CPU MEMORY IO IOW
======================================================================================================================
mozes 21856 11525.840 572.390 14276.910 32520.257 0.000 0.000
It may be more useful to restrict the query to a time window, for example the last day (-d 1):
mozes@loki ~ $ qacct -o $USER -d 1
OWNER WALLCLOCK UTIME STIME CPU MEMORY IO IOW
======================================================================================================================
mozes 3832 2962.040 145.480 3234.560 7785.764 0.000 0.000
If you work with a group of people on Beocat and need to account for everyone's time, you can query by project instead. NOTE: This only works if you have a group set up in Beocat.
mozes@loki ~ $ qacct -P `qconf -suser $USER | grep default_project | awk '{ print $2 }'`
PROJECT WALLCLOCK UTIME STIME CPU MEMORY IO IOW
========================================================================================================================
CIS 12917 7656.330 360.300 9121.270 20576.334 0.000 0.000
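The WALLCLOCK and CPU columns are reported in seconds, which gets hard to read for long-running work. Here is a small sketch that converts them to hours. The function name qacct_hours is made up for illustration, and it assumes the eight-column summary layout shown above.

```shell
#!/bin/bash
# Sketch: convert the `qacct -o` / `qacct -P` summary to hours.
# Reads qacct output on stdin and prints "NAME wallclock_hours cpu_hours".
# Assumption: data rows have 8 columns, with WALLCLOCK second and CPU fifth.
qacct_hours() {
    awk 'NF == 8 && $2 ~ /^[0-9]+$/ {
        # The header row is skipped because its second field is not a number
        printf "%s %.1f %.1f\n", $1, $2 / 3600, $5 / 3600
    }'
}

# Example:
#   qacct -o $USER -d 1 | qacct_hours
```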
Job script options
If you were fortunate enough to use the old Torque/Maui setup, you may have specified resource requests inside the job script itself, using lines of the form "#PBS $OPTION". The same thing can be done in SGE:
mozes@loki ~ $ vim sge_test.sub
#!/bin/bash
#$ -S /usr/local/bin/sh
# Specify my shell as sh
#$ -l h_rt=2:00:00
# Give me a 2 hour limit to finish the job
echo "Running osm2navit"
/usr/bin/env cr_run ~/navit/navit/osm2navit Blah.bin < ~/osm/blah.osm
echo "finished osm2navit with exit code $?"
As you can see, I used the construct "#$ $OPTION" to implement submit options in the job script.
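Before submitting, it can be handy to double-check which options a script carries. The sketch below lists every line that starts with the "#$" prefix. The function name embedded_options is made up for illustration, and it assumes you have not changed the default prefix.

```shell
#!/bin/bash
# Sketch: print the submit options embedded in a job script.
# Assumption: options use the default "#$" prefix at the start of a line.
embedded_options() {
    # Strip the "#$" prefix and any following whitespace; print only matching lines
    sed -n 's/^#\$[[:space:]]*//p' "$1"
}

# Example:
#   embedded_options sge_test.sub
```

Ordinary comments (lines starting with "#" alone) and the "#!" line are ignored, since neither begins with "#$".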
SGE Environment variables
Within a running job, you sometimes need to know specific things about the environment to set up your scripts correctly. Here is a listing of the environment variables that SGE makes available to you.
HOSTNAME=titan1.beocat
SGE_TASK_STEPSIZE=undefined
SGE_INFOTEXT_MAX_COLUMN=5000
SHELL=/usr/local/bin/sh
NHOSTS=2
SGE_O_WORKDIR=/homes/mozes
TMPDIR=/tmp/105.1.batch.q
SGE_O_HOME=/homes/mozes
SGE_ARCH=lx24-amd64
SGE_CELL=default
RESTARTED=0
ARC=lx24-amd64
USER=mozes
QUEUE=batch.q
PVM_ARCH=LINUX64
SGE_TASK_ID=undefined
SGE_BINARY_PATH=/opt/sge/bin/lx24-amd64
SGE_STDERR_PATH=/homes/mozes/sge_test.sub.e105
SGE_STDOUT_PATH=/homes/mozes/sge_test.sub.o105
SGE_ACCOUNT=sge
SGE_RSH_COMMAND=builtin
JOB_SCRIPT=/opt/sge/default/spool/titan1/job_scripts/105
JOB_NAME=sge_test.sub
SGE_NOMSG=1
SGE_ROOT=/opt/sge
REQNAME=sge_test.sub
SGE_JOB_SPOOL_DIR=/opt/sge/default/spool/titan1/active_jobs/105.1
ENVIRONMENT=BATCH
PE_HOSTFILE=/opt/sge/default/spool/titan1/active_jobs/105.1/pe_hostfile
SGE_CWD_PATH=/homes/mozes
NQUEUES=2
SGE_O_LOGNAME=mozes
SGE_O_MAIL=/var/mail/mozes
TMP=/tmp/105.1.batch.q
JOB_ID=105
LOGNAME=mozes
PE=mpi-fill
SGE_TASK_FIRST=undefined
SGE_O_HOST=loki
SGE_O_SHELL=/bin/bash
SGE_CLUSTER_NAME=beocat
REQUEST=sge_test.sub
NSLOTS=32
SGE_STDIN_PATH=/dev/null
Sometimes it is nice to know which hosts you have access to during a PE job. The file named by PE_HOSTFILE lists them. If your job has been restarted, it is also nice to be able to change what happens rather than redoing all of your work; in that case RESTARTED will equal 1. There are lots of useful environment variables in that list, and I will leave it to you to identify the ones you want.
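As a starting point, here is a sketch of how a job script might use PE_HOSTFILE and RESTARTED. The function names are made up for illustration, and the hostfile layout (hostname, slot count, queue, processor range on each line) is an assumption based on how SGE normally writes this file.

```shell
#!/bin/bash
# Sketch: helpers for a job script that reads SGE's environment.
# Assumption: each PE_HOSTFILE line is "hostname slots queue processor-range".

# Print "hostname slots" for every host assigned to this PE job.
pe_hosts() {
    awk '{ print $1, $2 }' "${PE_HOSTFILE:?PE_HOSTFILE is not set}"
}

# Succeeds (exit status 0) when SGE reports that this job was restarted.
job_was_restarted() {
    [ "${RESTARTED:-0}" = "1" ]
}

# Inside a job script you might write:
#   if job_was_restarted; then
#       echo "Restarted job $JOB_ID, picking up where I left off"
#   else
#       echo "Fresh start with $NSLOTS slots on:"
#       pe_hosts
#   fi
```

What "picking up where I left off" means depends entirely on your application, so that part is left for you to fill in.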