Getting an account

You will first need a Kansas State eID. If you do not already have one, you can request one at https://eid.k-state.edu. After you have received your eID, send an email to beocat@cis.ksu.edu with your name, department, degree level, research area, and research advisor.

Logging in

Once you have received login rights to Beocat for your eID, you can log in to the system. Beocat currently offers access only through SSH (Secure Shell).

From Windows

For logging in from Windows, I recommend using PuTTY http://putty.nl. The download page will have many options; you want putty.exe. After downloading, there is no setup; simply double-click the executable. You will be greeted with the following settings window.

putty-settings.png

Enter beocat.cis.ksu.edu in the 'Host Name' text box and make sure the 'SSH' radio button is checked. You may wish to save these settings under a name such as 'beocat' to save typing on future logins. After you have entered that information, click 'Open' and you will be prompted to save the host's key in the cache.

putty-host_key.png

Click 'Yes' and you will arrive at the login screen.

putty-login.png

Enter your login credentials here. Please note that your password will not show up at all as you type, not even as the traditional '*****'.

putty-prompt.png

You are now logged in to Beocat. Your login scripts may display different information.

From OSX/Linux/Unix

If you are using OSX, Linux, or Unix, start your favorite terminal emulator. Use the program ssh for connecting to Beocat.

kuffs@[dr-orpheus] ~ % ssh beocat.cis.ksu.edu

or use username@beocat.cis.ksu.edu if your username on your local computer is different from your username on Beocat.

aaron@[dr-orpheus] ~ % ssh kuffs@beocat.cis.ksu.edu
Password for kuffs@W2K.CIS.KSU.EDU: 

You will be prompted for your password. Note that it will not be displayed, not even as '*'s.

After you are logged in via your client you will be greeted with a prompt similar to this.

Last login: Wed Mar 21 00:59:30 2007 from adsl-69-155-228-250.dsl.tpkaks.swbell.net
kuffs@[loki] ~ % 

You are now ready to start your work on Beocat.
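If you log in often, OpenSSH lets you define a host alias in ~/.ssh/config so that you do not have to type the full hostname and username each time. The sketch below only prints such a stanza so you can review it before using it. The alias beocat and the function name beocat_ssh_stanza are illustrative choices of mine, not something Beocat requires.

```shell
# Print an ssh_config stanza for Beocat to standard output.
# Assumption: "beocat" is just a convenient alias; pick any name you like.
beocat_ssh_stanza() {
    stanza_user="$1"    # your Beocat (eID) username
    printf 'Host beocat\n'
    printf '    HostName beocat.cis.ksu.edu\n'
    printf '    User %s\n' "$stanza_user"
}

# Example: review the output, then append it to your config yourself:
#   beocat_ssh_stanza kuffs >> ~/.ssh/config
#   ssh beocat
```

With a stanza like this in place, ssh beocat should behave like ssh kuffs@beocat.cis.ksu.edu.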

Transferring files to or from Beocat

Once you can log in to Beocat, you will likely have some files to transfer over from another system.

From Windows

For Windows, I recommend WinSCP http://winscp.net. Download it and install it. When you first start it up you will be prompted with the window below.

winscp-settings.png

Fill out the 'Host name' and 'User name' fields with the appropriate values. Optionally, you can also enter your password and save the connection as a bookmark. If you do not enter your password now, you will be prompted for it upon connecting. Click 'Login' and you will be prompted with a host key dialog similar to the one you were presented with in PuTTY.

winscp-host_key.png

Click 'Yes' on this one as well.

winscp-commander.png

This is the working window for WinSCP. You can see that your local files are presented in the left-hand pane and your files on Beocat are presented in the right-hand pane. To transfer files or directories from one place to another, simply drag them from one side to the other.

From OSX/Linux/Unix

For OSX, Linux, or Unix use the program scp.

Open up a console or virtual terminal and run:

scp local_file beocat.cis.ksu.edu:~/

The same works for directories with the -r flag. Put the flag before the file arguments, since not every scp accepts options after them:

scp -r local_directory beocat.cis.ksu.edu:~/

(As with regular ssh, you will need to include username@ if your Beocat username is not the same as your local username.)

To copy files from Beocat, use a syntax like this:

scp beocat.cis.ksu.edu:~/some_file .

Or for a directory:

scp -r beocat.cis.ksu.edu:~/some_dir .

Read the scp manpage for more details.
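As a rough sketch of the upload rules above, the helper below prints (but does not run) the scp command for a given local path, adding -r for directories and username@ when a username is supplied. The name beocat_upload_cmd is hypothetical, and the sketch assumes paths without spaces.

```shell
# Print (do not run) an scp command that uploads a path to your Beocat home.
# beocat_upload_cmd is a hypothetical helper, not a tool installed on Beocat.
beocat_upload_cmd() {
    src="$1"                      # local file or directory
    remote="beocat.cis.ksu.edu"
    if [ -n "${2:-}" ]; then      # optional Beocat username
        remote="$2@$remote"
    fi
    if [ -d "$src" ]; then
        # Directories need -r, placed before the file arguments.
        printf 'scp -r %s %s:~/\n' "$src" "$remote"
    else
        printf 'scp %s %s:~/\n' "$src" "$remote"
    fi
}

# Example: beocat_upload_cmd results/ kuffs
```

Printing the command first is a cheap way to check the direction of the copy before any files move.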

DEPRECATED

Everything below here no longer works with the current scheduler. If you need information for the current scheduler, look here: http://support.cis.ksu.edu/BeocatDocs/SunGridEngine

Using the job queue

Beocat uses a fork of the OpenPBS batch system called Torque for job management. It supports all of the normal PBS commands and attributes, and adds some extensions.

The job queue allows users to select systems based on the resources they require. It maintains a fair computing environment and keeps users from interfering with each other.

Running computational tasks on the head node is prohibited, except for small (around an hour or so) single-processor post-processing jobs and the like. Please see http://support.cis.ksu.edu/BeocatDocs/Policy for more details.

Basic introduction

Submitting jobs

qsub is the tool used to submit jobs to the cluster. A 'job' is simply a shell script. So, for an extremely simple example, my_job.sh is a short shell script that simply reports back the name of the machine and uptime.

#!/bin/bash
echo "$(hostname): $(uptime)"

Please note that the first line is actually unnecessary. All scripts will be run through the shell you specify to qsub, defaulting to your normal shell. In fact, job scripts need not even be executable. (The first line is just handy for debugging the script interactively.)

kuffs@[loki] ~ % qsub my_job.sh 
213.loki.beocat

So we have submitted our job and it was given job number 213.

Checking the queue

You can use qstat to view the status of the jobs in the queue

kuffs@[loki] ~ % qstat
Job id              Name             User            Time Use S Queue
------------------- ---------------- --------------- -------- - -----
213.loki          my_job.sh        kuffs                  0 Q batch          

In this screen we see our job, its name, who it was launched by, and a few other details. We also notice that under the 'S' column there is a 'Q', meaning that the job is queued and waiting to be assigned somewhere to work. This isn't a big deal; it usually takes a few seconds for the scheduler to figure out what it is going to do.
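If you want to check a job's state from a script, a small filter over the qstat output is enough. This is only a sketch: job_state is a name I made up, and it assumes the default column layout shown above, where the state is the fifth whitespace-separated field.

```shell
# Read default-format qstat output on stdin and print the state letter
# of the job whose id starts with "<number>.".
# Assumption: the state is the fifth whitespace-separated column.
job_state() {
    awk -v id="$1" 'index($1, id ".") == 1 { print $5 }'
}

# Example: qstat | job_state 213
```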

Deleting jobs

Use the program qdel to delete jobs that you no longer want queued, or that have started running and that you want stopped.

kuffs@[loki] ~ % qdel 213
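Since qsub prints the full job identifier (213.loki.beocat above) while the qdel example uses just the number, a script that submits and later deletes a job may want to strip the suffix. A minimal sketch, with job_number being a hypothetical helper name:

```shell
# Strip everything from the first dot onward: 213.loki.beocat -> 213
job_number() {
    printf '%s\n' "${1%%.*}"
}

# Example, on the cluster:
#   id="$(qsub my_job.sh)"
#   qdel "$(job_number "$id")"
```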

Required resource definitions

There are only two resource requirements that you absolutely need to define for your jobs. You need to define the walltime, as it is only five minutes by default, and you need to include some sort of memory request.

Memory resource requests best for single-processor jobs:

Memory resource requests best for multi-processor jobs:

Submitting parallel jobs

Parallel job submission can get a little more complex. In this case there are two resources you are concerned about: the number of nodes and the number of processors per node.

For example:

kuffs@[loki] ~ % qsub -l nodes=4:ppn=5 my_job.sh

That will submit your job to be executed on 4 nodes with 5 processors each for a total of 20 cpus.

We can mix and match the values to get exactly what we would like:

kuffs@[loki] ~ % qsub -l nodes=16:ppn=2 my_job.sh

This job will run on 16 nodes with 2 processors each for a total of 32 cpus.
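The arithmetic is simply nodes times ppn. As a sanity check before submitting, a sketch like the following computes the total from a request string. The function name total_cpus is mine, and it assumes the string contains both a numeric nodes= value and a ppn= value.

```shell
# Print nodes * ppn for a request string such as "nodes=4:ppn=5".
# Assumption: the string starts with nodes=<number> and contains ppn=<number>.
total_cpus() {
    spec="$1"
    nodes="${spec#nodes=}"     # drop the leading "nodes="
    nodes="${nodes%%:*}"       # keep the text up to the first ":"
    ppn="${spec##*ppn=}"       # keep the text after "ppn="
    ppn="${ppn%%[:,]*}"        # cut at the next ":" or ","
    echo $((nodes * ppn))
}

# Example: total_cpus nodes=16:ppn=2
```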

A note about parallel jobs using MPI

The MPI library installed on Beocat supports tight integration with our job system. What this means for you is that you will not have to provide a hosts file or specify the number of nodes to run on to mpiexec / mpirun. The host list and number of nodes will be passed through in the job's environment, and mpiexec / mpirun will pick that information up.

So for example, we have an MPI job we would like to run. Normally, we would have to specify something like this to start our MPI program:

mpiexec -np 8 --hostfile ~/.hosts ~/myprogram

But with our job system, you aren't required to pass that information to mpiexec. Your job script would look like this:

#!/bin/bash
mpiexec ~/myprogram

You would submit it specifying your cpu requests:

kuffs@[loki] ~ % qsub -l nodes=4:ppn=2 my_job.sh

And that's it. You don't have to maintain a separate hosts file or keep the number of processors updated in your submit script.
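If you are curious what the job system handed to your job, Torque normally exposes the allocated host list through the PBS_NODEFILE environment variable, a file with one line per allocated processor slot. The sketch below counts those lines. allocated_slots is a hypothetical helper, and it prints 0 when run outside a job.

```shell
# Print the number of processor slots allocated to the current job.
# Assumption: standard Torque behavior, where $PBS_NODEFILE names a file
# with one hostname per allocated slot. Outside a job this prints 0.
allocated_slots() {
    if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
        wc -l < "$PBS_NODEFILE" | tr -d ' '
    else
        echo 0
    fi
}

# Example, inside a job script:
#   echo "running on $(allocated_slots) slots"
#   mpiexec ~/myprogram
```

For a job submitted with nodes=4:ppn=2 I would expect this to report 8.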

Requesting resources

If your job requires a specific kind of hardware or other resource, you can add it as a prerequisite to your submission. To request resources for your job, use the -l flag followed by your resource request string. As a simple example, we will submit our simple job above with a request for 5GB of memory and an execution time of 1 hour.

kuffs@[loki] ~ % qsub -l mem=5g,walltime=1:00:00 my_job.sh

Keep in mind that the default walltime on the cluster is 5 minutes. Remember to specify a walltime so that your job is not killed before it is finished.
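The walltime value is written as hours:minutes:seconds. If you estimate run times in seconds, a small conversion helper avoids mistakes. This is a sketch only, and walltime_from_seconds is a name I chose for illustration.

```shell
# Convert a number of seconds into the H:MM:SS form used for walltime.
walltime_from_seconds() {
    total="$1"
    printf '%d:%02d:%02d\n' \
        $((total / 3600)) $((total % 3600 / 60)) $((total % 60))
}

# Example, on the cluster:
#   qsub -l mem=5g,walltime="$(walltime_from_seconds 3600)" my_job.sh
```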

The most common resource request for jobs is for a specific platform. Beocat is a heterogeneous cluster, so not all nodes are the same.

To make the selection of certain hardware types easier, we have the following special resource flags defined:

As an example

kuffs@[loki] ~ % qsub -l nodes=1:titan my_job.sh

This selects one of the titans for execution.

You can also select a specific node by using its name

kuffs@[loki] ~ % qsub -l nodes=brute5 my_job.sh

I generally recommend against specific-node selections, as you potentially limit yourself in certain cases. For example, when a running job on an identical node finishes early, your job would stay queued waiting for your requested node even though the now-free node is perfectly acceptable. Unless you are sure you want this behavior, don't use this option.

Keep in mind that the queue system WILL DO WHAT IS BEST FOR YOUR JOB. It is designed to utilize as much of the cluster as possible, and so will schedule your job to run as quickly as it can. It is unlikely that you have found a 'cheat' in the system. Please look at the troubleshooting entry towards the bottom of the page for more information about what causes jobs to wait in the queue.

More complex examples

Options can be mixed and matched to get the exact specification you want.

This selects two processors on each of 4 brutes, with an execution time of one hour:

kuffs@[loki] ~ % qsub -l nodes=4:brute:ppn=2,walltime=1:00:00 my_job.sh

Where to get more information

The official documentation is at http://www.clusterresources.com/wiki/doku.php?id=torque:torque_wiki. It has more detail on resources and the subtleties of job submission.

Common uses

Interactive jobs

Many users may wish to test their code interactively before they clean it up for regular job submission. Thankfully, there is a way to run interactive jobs through the queueing system. To use the interactive mode of the queue system, omit your job script and add the -I flag to qsub.

kuffs@[loki] ~ % qsub -I
qsub: waiting for job 3087.loki.beocat to start
qsub: job 3087.loki.beocat ready
kuffs@[virgrack6] ~ %

You can also run parallel jobs in the same manner, just provide the appropriate resource requests to qsub in addition to the -I flag.

kuffs@[loki] ~ % qsub -I -l nodes=4:x86_64:ppn=4
qsub: waiting for job 3088.loki.beocat to start
qsub: job 3088.loki.beocat ready
kuffs@[brute5] ~ %

Adding qsub parameters to your job script

To save yourself a bit of typing on the command line when testing, all the qsub parameters can be added as special comments in your job's shell script.

For example, let's say you have these parameters to give to qsub:

kuffs@[loki] ~ % qsub -l walltime=20:00:00 -S /bin/bash -N sim_pass3 my_job.sh

This would run the script my_job.sh using /bin/bash as the shell with a maximum run time of 20 hours and named 'sim_pass3'. This could get to be a pain to remember and edit every time. Fortunately, we can use special comments in my_job.sh that make our life easier:

#!/bin/bash
#PBS -l walltime=20:00:00
#PBS -S /bin/bash
#PBS -N sim_pass3
echo "$(hostname): $(uptime)"

Now you would submit the job with the much simpler command:

kuffs@[loki] ~ % qsub my_job.sh
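Once the parameters live inside the script, it is easy to forget what a given script requests. A quick way to review them before submitting is to list the #PBS lines. The sketch below does that with sed, and pbs_directives is a hypothetical helper name.

```shell
# Print the qsub options embedded in a job script, one per line,
# with the leading "#PBS" marker removed.
pbs_directives() {
    sed -n 's/^#PBS[[:space:]]*//p' "$1"
}

# Example: pbs_directives my_job.sh
```

For the my_job.sh shown above, this should list the walltime, shell, and name options.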

Troubleshooting jobs

Job won't start, always in queued state

So you've started a job, and can't figure out why it won't start.

kuffs@[loki] ~ % qsub my_job.sh -l mem=100G
219.loki.beocat
kuffs@[loki] ~ % qstat -n

loki.beocat: 
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
219.loki.beocat    kuffs    batch    my_job.sh     --      1  --  100gb 00:05 Q   -- 
    --          

Everything is fine, except that the scheduler doesn't seem to want to select nodes for the job to run on. There is a handy tool checkjob for this task.

kuffs@[loki] ~ % checkjob 219
checking job 219

State: Idle  EState: Deferred
Creds:  user:kuffs  group:kuffs_users  class:batch  qos:DEFAULT
WallTime: 00:00:00 of 00:05:00
SubmitTime: Wed Nov 15 17:09:03
  (Time Queued  Total: 00:14:14  Eligible: 00:00:15)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Dedicated Resources Per Task: PROCS: 1  MEM: 100G


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
Flags:       RESTARTABLE

job is deferred.  Reason:  NoResources  (cannot create reservation for job '219' (intital reservation attempt)
)
Holds:    Defer  (hold reason:  NoResources)
PE:  38.18  StartPriority:  1
cannot select job 219 for partition DEFAULT (job hold active)

This incredibly detailed output tells us that the scheduler can't find the resources to give our task. Poking through more of the output, we see Dedicated Resources Per Task: PROCS: 1  MEM: 100G... doh! The scheduler happens to be right: there are no nodes in the cluster that have 100G of RAM.

So we delete the broken job.

kuffs@[loki] ~ % qdel 219

And submit our job, sans the typo.

kuffs@[loki] ~ % qsub my_job.sh -l mem=100m
221.loki.beocat

And things run just fine.

kuffs@[loki] ~ % checkjob 221
checking job 221

State: Running
Creds:  user:kuffs  group:kuffs_users  class:batch  qos:DEFAULT
WallTime: 00:00:00 of 00:05:00
SubmitTime: Wed Nov 15 17:30:17
  (Time Queued  Total: 00:00:14  Eligible: 00:00:14)

StartTime: Wed Nov 15 17:30:31
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Dedicated Resources Per Task: PROCS: 1  MEM: 100M
Allocated Nodes:
[virgrack6:1]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       RESTARTABLE

Reservation '221' (00:00:00 -> 00:05:00  Duration: 00:05:00)
PE:  1.00  StartPriority:  1

The job appears to be running just fine now. The scheduler has chosen one of the processors on virgrack6 for our task.
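When you only want the deferral reason rather than the full checkjob listing, a one-line filter will do. This sketch assumes the "job is deferred.  Reason:" line looks like the one in the first listing above. defer_reason is a made-up name, and it prints nothing when the job is not deferred.

```shell
# Read checkjob output on stdin and print the deferral reason, if any.
# Assumption: the line has the form "job is deferred.  Reason:  <Word> ...".
defer_reason() {
    sed -n 's/^job is deferred\.[[:space:]]*Reason:[[:space:]]*\([A-Za-z]*\).*/\1/p'
}

# Example: checkjob 219 | defer_reason
```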

Node seems idle, but jobs won't start on it

Let's say, for example, that you've decided to ignore my warnings about requesting a specific node and submit a job like so:

kuffs@[loki] ~ % qsub -l nodes=brute2 my_job.sh 
10547.loki.beocat

And after waiting a bit, we see that it is not getting started.

kuffs@[loki] ~ % qstat -u kuffs

loki.beocat: 
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
10547.loki.beocat    kuffs    batch    my_job.sh     --      1  --    --  00:05 Q   -- 

So, following the normal troubleshooting patterns, let's look at what checkjob says

kuffs@[loki] ~ % checkjob 10547


checking job 10547

State: Idle
Creds:  user:kuffs  group:kuffs_users  class:batch  qos:DEFAULT
WallTime: 00:00:00 of 00:05:00
SubmitTime: Tue Sep 11 16:11:11
  (Time Queued  Total: 00:02:13  Eligible: 00:02:13)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
Flags:       HOSTLIST RESTARTABLE
HostList: 
  [brute2:1]
PE:  1.00  StartPriority:  12
job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 1 procs found)
idle procs:  88  feasible procs:   0

Rejection Reasons: [ReserveTime  :    1][HostList     :   47]
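The last line of that output packs the rejection counts into bracketed pairs, which are easy to misread. As a sketch, the filter below rewrites them as one name=count pair per line. rejection_reasons is a hypothetical name, and the parsing assumes the bracket layout shown above.

```shell
# Read checkjob output on stdin and print each rejection reason
# as name=count, one per line.
# Assumption: the line looks like "Rejection Reasons: [Name : N][Name : N]".
rejection_reasons() {
    awk '/^Rejection Reasons:/ {
        line = $0
        sub(/^Rejection Reasons:[ ]*/, "", line)
        n = split(line, parts, "]")      # one element per bracketed pair
        for (i = 1; i <= n; i++) {
            item = parts[i]
            gsub(/\[/, "", item)         # drop the opening bracket
            gsub(/ /, "", item)          # drop the padding spaces
            if (item != "") {
                split(item, kv, ":")
                print kv[1] "=" kv[2]
            }
        }
    }'
}

# Example: checkjob 10547 | rejection_reasons
```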

Note the rejection reasons: ReserveTime and HostList. This means that 47 other hosts do not match the name of the node we requested, and that the one that does match is reserved (or will be reserved before your job has time to run to completion). To see what reservations are in effect, use the tools showres or diagnose -r.

kuffs@[loki] ~ % showres
Reservations

ReservationID       Type S       Start         End    Duration    N/P    StartTime

10282                Job R -7:03:00:36  4:20:59:24 12:00:00:00    1/1    Tue Sep  4 13:15:54
10291                Job I  6:07:43:30 14:15:43:30  8:08:00:00    1/8    Tue Sep 18 00:00:00
10389                Job R -5:07:13:40  1:16:46:20  7:00:00:00    1/1    Thu Sep  6 09:02:50
10393                Job R -5:07:05:24  1:16:54:36  7:00:00:00    1/4    Thu Sep  6 09:11:06
10434                Job R -4:03:22:50  5:10:37:10  9:14:00:00    1/1    Fri Sep  7 12:53:40
10444                Job R -3:04:45:18  5:03:13:42  8:07:59:00    1/1    Sat Sep  8 11:31:12
10446                Job R -3:04:40:39  5:03:18:21  8:07:59:00    1/1    Sat Sep  8 11:35:51
10450                Job R -3:01:45:17     5:14:43  3:07:00:00    1/8    Sat Sep  8 14:31:13
10451                Job R -3:01:42:42     5:17:18  3:07:00:00    1/16   Sat Sep  8 14:33:48
10489                Job R -1:04:52:46  1:16:07:14  2:21:00:00    1/8    Mon Sep 10 11:23:44
10490                Job R -1:00:50:52  1:20:09:08  2:21:00:00    1/8    Mon Sep 10 15:25:38
10492                Job R -1:17:08:15  1:03:51:45  2:21:00:00    1/4    Sun Sep  9 23:08:15
10507                Job R   -18:21:31     1:37:29    19:59:00    1/8    Mon Sep 10 21:54:59
10508                Job R   -17:28:49    12:30:11  1:05:59:00    1/8    Mon Sep 10 22:47:41
10510                Job R    -4:10:24  1:20:49:36  2:01:00:00    1/8    Tue Sep 11 12:06:06
10521                Job R    -3:22:49    16:36:11    19:59:00    1/8    Tue Sep 11 12:53:41
10522                Job R    -3:22:49  1:12:36:11  1:15:59:00    1/8    Tue Sep 11 12:53:41
10523                Job R    -2:40:27  1:13:18:33  1:15:59:00    1/8    Tue Sep 11 13:36:03
10527                Job R    -6:38:18     1:21:42     8:00:00    1/4    Tue Sep 11 09:38:12
10538                Job R    -3:58:31     3:00:29     6:59:00    1/1    Tue Sep 11 12:17:59
10540                Job R    -2:47:41    17:11:19    19:59:00    1/1    Tue Sep 11 13:28:49
10542                Job R    -2:16:10    18:42:50    20:59:00    1/1    Tue Sep 11 14:00:20
10546                Job R   -00:34:10    19:24:50    19:59:00    1/1    Tue Sep 11 15:42:20
santos.1            User - -7:03:03:17    INFINITY    INFINITY    3/24   Tue Sep  4 13:13:13
pvfs-testing.0      User - -5:05:22:33    INFINITY    INFINITY    1/4    Thu Sep  6 10:53:57
Sept_17-Maintenance.1  User -  5:19:43:30  6:07:43:30    12:00:00   48/244  Mon Sep 17 12:00:00

26 reservations located

Most of the reservations are for currently running jobs. The ones you need to be concerned about are the 'User' reservations at the bottom.
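On a busy cluster the job reservations can bury the interesting ones. A sketch of a filter that keeps only the reservations whose Type column is User follows. user_reservations is a name I picked, and it assumes the showres column layout shown above.

```shell
# Read showres output on stdin and print the IDs of reservations
# whose Type (second column) is "User".
user_reservations() {
    awk '$2 == "User" { print $1 }'
}

# Example: showres | user_reservations
```

Each ID it prints can then be inspected with diagnose -r.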

kuffs@[loki] ~ % diagnose -r santos.1
Diagnosing Reservations
ResID                      Type Par   StartTime     EndTime     Duration Node Task Proc
-----                      ---- ---   ---------     -------     -------- ---- ---- ----
santos.1                   User DEF -7:03:05:52    INFINITY     INFINITY    3    3   24
    Flags: PREEMPTEE
    ACL: RES==santos= USER==alk5588+ 
    CL:  RES==santos 
    Task Resources: PROCS: [ALL]
    Attributes (HostList='brute[2-4]')
    Active PH: 0.00/4106.55 (0.00%)

Brute[2-4] are reserved for alk5588 presently, which is why your job will not start.

This problem will also come up as the biweekly maintenance periods approach. The queue will seem broken, but in actuality the scheduler has not started your job because it would not be able to finish before maintenance.