ACES queue system

To run jobs on ACES you must use PBS. Examples of PBS jobs can be found on the Queue Examples page and below.

Each ACES site has access to a set of queues that, by default, send jobs to the 32-bit hardware groups at that site. To send jobs to machines other than the default set of 32-bit machines, you need to specify additional attribute information during job submission.
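
For example, an attribute name can be appended to the node request with a colon. The sketch below uses "ia64" purely as an illustration; the actual attribute names are listed in the hardware groups table:

      $ qsub -l nodes=2:ia64 pbs_script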

It is also possible to send jobs to other ACES sites by specifying an alternate site in the job submission command. To do this, you must specify the "head node" for that site, as given in the hardware groups table.
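
In standard PBS syntax the destination is written as queue@server. As a sketch, using the head node itrda.acesgrid.org mentioned elsewhere on this page (substitute the queue name and head node of the site you actually want):

      $ qsub -q four@itrda.acesgrid.org pbs_script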

ACES Portable Batch System (PBS)

The PBS resource management system handles the management and monitoring of the computational workload on the ACESGrid. Users submit "jobs" to the resource management system where they are queued up until the system is ready to run them. PBS selects which jobs to run, when, and where, according to a predetermined site policy meant to balance competing user needs and to maximize efficient use of the cluster resources.

It is important that all users learn about and cooperate with the queueing system in order to avoid system "hogging" or unnecessary resource contention (e.g. two or more people trying to use the same CPU at the same time). The queues described here should meet the needs of the majority of users. However, they are not "graven in stone", and flexibility is possible to accommodate special needs. Please contact us if you have special queueing requirements.

To use PBS, you create a batch job command file which you submit to the PBS server to run on the ACESGrid. A batch job file is simply a shell script containing the set of commands you want run on some set of cluster compute nodes. It also contains directives that specify the characteristics (attributes) and resource requirements (e.g. number of compute nodes and maximum runtime) of your job. Once you create your PBS job file, you can reuse it or modify it for subsequent runs.
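
As a minimal sketch (the resource values here are arbitrary; a fuller example is given later on this page), a job file is just a script whose leading #PBS lines carry the directives:

      #!/bin/csh
      # Ask for 2 nodes and at most 30 minutes of walltime
      #PBS -l nodes=2,walltime=00:30:00
      echo Running on `hostname`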

PBS also provides a special kind of batch job called interactive-batch. An interactive-batch job is treated just like a regular batch job, in that it is placed into the queue and must wait for resources to become available before it can run. Once it is started, however, the user's terminal input and output are connected to the job in what appears to be an rlogin session to one of the compute nodes. Many users find this useful for debugging their applications or for computational steering.

The ACESGrid is a heterogeneous computing environment consisting of several sites, each with different types of hardware. The PBS queue system controls which hardware a particular job will be executed on.

Viewing ACESGrid PBS Queues

The qstat command is used to view PBS queues and jobs. Useful options include:

qstat -a          Lists all jobs within the PBS cluster.
qstat -an         Lists all jobs within the PBS cluster, with their execution hosts.
qstat -q          Lists all queues within the PBS cluster (including resource limits).
qstat -s          Lists all jobs within the PBS cluster, with their status comments.
qstat -Qf queue   Lists all information about a specific queue.
qstat -f jobid    Lists detailed information about a specific job.

Additional options are available; see the qstat man page (man qstat) for details.

Summary of available ACESGrid PBS Queues and their attributes

one
  • max running jobs in this queue = 1024
  • max nodes per job = 1
  • max running jobs per user = 64
  • max walltime per job = 2 hours
four
  • default queue
  • max running jobs in this queue = 1024
  • max nodes per job = 16
  • max running jobs per user = 8
  • max walltime per job = 2 hours
four-twelve
  • max running jobs in this queue = 1024
  • max nodes per job = 26
  • max running jobs per user = 4
  • max walltime per job = 12 hours
long
  • max running jobs in this queue = 1024
  • max nodes per job = 16
  • max running jobs per user = 8
  • max walltime per job = 24 hours
toolong
  • max running jobs in this queue = 1024
  • max nodes per job = 4
  • max running jobs per user = 4
  • max walltime per job = 168 hours
all
  • CNH's private queue containing all available ACES resources!
  • max nodes per job = 1024
  • max running jobs per user = 4
  • max walltime per job = unlimited
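
Jobs go to the default queue ("four") unless another queue is named with the -q option of qsub. For example, a job needing up to 12 hours of walltime could be sent to the four-twelve queue:

      $ qsub -q four-twelve pbs_script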

Job submission

The PBS qsub command is used to submit job command files for scheduling and execution. For example, to submit your job using a PBS command file called "pbs_script", the syntax would be

      $ qsub pbs_script
      1354.itrda

Notice that upon successful submission of a job, PBS returns a job identifier of the form jobid.itrda, where jobid is an integer assigned by PBS to that job. You will need this identifier for any actions involving the job, such as checking its status or deleting it. A simple example of a PBS command file is given below.

There are many options to the qsub command, as can be seen by typing man qsub at the command prompt on itrda.acesgrid.org. In general, jobs are submitted using qsub either in "batch" mode (above) or in "interactive" mode using the -I option (below). The -I option declares that the job is to be run interactively; the -l option allows resource requirements to be listed as part of the qsub command.

      $ qsub -I -l nodes=2
      qsub: waiting for job 46167.itrda to start
      qsub: job 46167.itrda ready
      aE34-500-036:simon <501>:

Notice that once the interactive job starts, you are automatically logged into the first of the requested nodes. Type exit from this shell to end the interactive session.
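
The -I option can be combined with other resource requests in the usual way. For instance, the following (the walltime value is illustrative) asks for a one-hour interactive session on a single node in the one queue:

      $ qsub -I -q one -l nodes=1,walltime=1:00:00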

PBS batch script example

   
      #!/bin/csh 
      #
      #filename: pbs_script
      #
      # Example PBS script to run a job on nodes with the "gigabit" attribute.
      # The lines beginning #PBS set various queuing parameters. 
      #
      # o -N Job Name
      #PBS -N pbs_script
      #
      #
      # o -l resource lists that control where job goes
      #    here we ask for 3 nodes, each with the attribute "gigabit".
      #PBS -l nodes=3:gigabit
      #
      # o Where to write the job's standard error and standard output
      #   (here, files named "stderr" and "stdout" in the working directory)
      #PBS -e stderr
      #
      #PBS -o stdout
      #
      #
      # o Export all my environment variables to the job
      #PBS -V
      #
      echo $PBS_NODEFILE
      cat  $PBS_NODEFILE
      echo 'The list above shows the nodes this job has exclusive access to.'
      echo 'The list can be found in the file named in the variable $PBS_NODEFILE'

Submit the file using the command:

   
      $ qsub pbs_script

You should see output similar to:

      Warning: no access to tty (Bad file descriptor).
      Thus no job control in this shell.
      /var/spool/PBS/aux/22703.itrda
      aE34-500-036
      aE34-500-037
      aE34-500-038
      The list above shows the nodes this job has exclusive access to.
      The list can be found in the file named in the variable $PBS_NODEFILE
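
A common use of the node list in $PBS_NODEFILE is to hand it to an MPI launcher. The following sketch could replace the echo commands at the end of the script above; it assumes an mpirun that accepts -np and -machinefile (typical of MPICH, but the exact options depend on the MPI installation) and a hypothetical executable named my_mpi_program:

      # Launch one MPI process per allocated node (csh syntax, as in
      # the script above). mpirun options vary by MPI implementation.
      mpirun -np `wc -l < $PBS_NODEFILE` -machinefile $PBS_NODEFILE ./my_mpi_program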

Killing PBS Jobs

If for any reason you wish to kill a job (perhaps one submitted in error), use the qdel command. An example of the syntax is:

   
      $ qdel 46784.itrda
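
qdel accepts more than one job identifier, so several jobs can be removed with a single command (use qstat -a, described above, to find the identifiers):

      $ qdel 46784.itrda 46785.itrda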