ACES queue system
To run jobs on ACES you must use PBS. Example PBS jobs are given below.
Each ACES site has access to a set of queues that, by default, send jobs to the 32-bit hardware groups at that site. To send jobs to machines other than the default set of 32-bit machines, you need to specify additional attribute information during job submission.
It is also possible to send jobs to other ACES sites by specifying an alternate site in the job submission command. To do this you must specify the "head node" for that site, as given in the hardware groups table.
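Standard PBS allows the destination to be given as queue@server at submission time. As a sketch (the queue and head-node names here are placeholders; substitute the head node listed in the hardware groups table for the site you want):
$ qsub -q four@itrda.acesgrid.org pbs_script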
ACES Portable Batch System (PBS)
The PBS resource management system handles the management and monitoring of the computational workload on the ACESGrid. Users submit "jobs" to the resource management system where they are queued up until the system is ready to run them. PBS selects which jobs to run, when, and where, according to a predetermined site policy meant to balance competing user needs and to maximize efficient use of the cluster resources.
It is important that all users learn about and cooperate with the queueing system in order to avoid system "hogging" or unnecessary resource contention (e.g., two or more people trying to use the same CPU at the same time). The queue systems described here should meet the needs of the majority of users. However, they are not "graven in stone" and flexibility is possible to accommodate special needs. Please contact the ACESGrid administrators if you have special queueing requirements.
To use PBS, you create a batch job command file which you submit to the PBS server to run on the ACESGrid. A batch job file is simply a shell script containing the set of commands you want run on some set of cluster compute nodes. It also contains directives which specify the characteristics (attributes) and resource requirements (e.g., number of compute nodes and maximum runtime) that your job needs. Once you create your PBS job file, you can reuse it if you wish or modify it for subsequent runs.
PBS also provides a special kind of batch job called interactive-batch. An interactive-batch job is treated just like a regular batch job, in that it is placed into the queue and must wait for resources to become available before it can run. Once it is started, however, the user's terminal input and output are connected to the job in what appears to be an rlogin session to one of the compute nodes. Many users find this useful for debugging their applications or for computational steering.
The ACESGrid is a heterogeneous computing environment made up of several sites, each with different types of hardware. The PBS queue system controls which hardware a particular job will execute on.
Viewing ACESGrid PBS Queues
The qstat command is used to view PBS queues and jobs. Useful options include:

qstat -a         Lists all of the jobs within the PBS cluster.
qstat -an        Lists all of the jobs within the PBS cluster and their respective execution hosts.
qstat -q         Lists all of the queues within the PBS cluster (including resource limits).
qstat -s         Lists all of the jobs within PBS with their respective status comments.
qstat -Qf queue  Lists all information about a specific queue.
qstat -f jobid   Lists detailed information about a specific job.
Additional options are available; see the qstat man page (man qstat) for details.
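For example, a common sequence is to list all jobs and then inspect one of your own in detail (the job identifier below is illustrative):
$ qstat -a
$ qstat -f 1354.itrda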
Summary of available ACESGrid PBS Queues and their attributes
Queue: one
- max running jobs in this queue = 1024
- max nodes per job = 1
- max running jobs per user = 64
- max walltime per job = 2 hours

Queue: four (default queue)
- max running jobs in this queue = 1024
- max nodes per job = 16
- max running jobs per user = 8
- max walltime per job = 2 hours

Queue: four-twelve
- max running jobs in this queue = 1024
- max nodes per job = 26
- max running jobs per user = 4
- max walltime per job = 12 hours

Queue: long
- max running jobs in this queue = 1024
- max nodes per job = 16
- max running jobs per user = 8
- max walltime per job = 24 hours

Queue: toolong
- max running jobs in this queue = 1024
- max nodes per job = 4
- max running jobs per user = 4
- max walltime per job = 168 hours

Queue: all
- CNH's private queue containing all available ACES resources!
- max nodes per job = 1024
- max running jobs per user = 4
- max walltime per job = unlimited
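To submit a job to a queue other than the default, name the queue with qsub's -q option. For example, to use the four-twelve queue from the table above:
$ qsub -q four-twelve pbs_script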
Job submission
The PBS qsub command is used to submit job command files for scheduling and execution. For example, to submit your job using a PBS command file called "pbs_script", the syntax would be
$ qsub pbs_script
1354.itrda
Notice that upon successful submission of a job, PBS returns a job identifier of the form jobid.itrda, where jobid is an integer assigned by PBS to that job. You will need the job identifier for any actions involving the job, such as checking its status or deleting it. A simple example of a PBS command file is given below.
There are many options to the qsub command, as can be seen by typing man qsub at the command prompt on itrda.acesgrid.org. In general, jobs are submitted using qsub either in "batch" mode (above) or in "interactive" mode using the -I option (below). The -I option declares that the job is to be run "interactively"; the -l option allows resource requirements to be listed as part of the qsub command.
$ qsub -I -l nodes=2
qsub: waiting for job 46167.itrda to start
qsub: job 46167.itrda ready
aE34-500-036:simon <501>:
Notice that once the interactive job starts, you are automatically logged into the first of the requested interactive nodes. Type exit from this shell to end the interactive session.
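Multiple resource requirements can be combined in a single -l list. As a sketch, assuming the standard PBS resource syntax, the following requests two nodes and a one-hour walltime for an interactive session:
$ qsub -I -l nodes=2,walltime=1:00:00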
PBS batch script example
#!/bin/csh
#
#filename: pbs_script
#
# Example PBS script to run a job on ACESGrid compute nodes.
# The lines beginning #PBS set various queuing parameters.
#
# o -N Job Name
#PBS -N pbs_script
#
#
# o -l resource lists that control where job goes
# here we ask for 3 nodes, each with the attribute "gigabit".
#PBS -l nodes=3:gigabit
#
# o Where to write output: standard error goes to a file
#   named "stderr", standard output to a file named "stdout"
#PBS -e stderr
#
#PBS -o stdout
#
#
# o Export all my environment variables to the job
#PBS -V
#
echo $PBS_NODEFILE
cat $PBS_NODEFILE
echo 'The list above shows the nodes this job has exclusive access to.'
echo 'The list can be found in the file named in the variable $PBS_NODEFILE'
Submit the file using the command:
$ qsub pbs_script
You should see output something like:
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
/var/spool/PBS/aux/22703.itrda
aE34-500-036
aE34-500-037
aE34-500-038
The list above shows the nodes this job has exclusive access to.
The list can be found in the file named in the variable $PBS_NODEFILE
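The node list is typically handed to a parallel launcher at the end of the batch script. The following is a minimal sketch, assuming an MPI installation whose mpirun accepts a -machinefile option (option names vary between MPI implementations); my_program is a placeholder for your executable:
# Launch one MPI process per allocated node
mpirun -np `wc -l < $PBS_NODEFILE` -machinefile $PBS_NODEFILE ./my_program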
Killing PBS Jobs
If for any reason you wish to kill a job (perhaps one submitted in error), use the qdel command, for example:
$ qdel 46784.itrda
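To delete several jobs at once, qdel can be combined with the standard PBS qselect command, if it is available on your system. A sketch (replace username with your login name):
$ qdel `qselect -u username`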