Commands to Monitor and Control Jobs on Eagle

Learn about a variety of Slurm commands to monitor and control jobs on Eagle.

Please see man pages for more information on the commands listed below. Also see --help or --usage.

Also see our Presentation on Advanced Slurm Features, which has supplementary information on how to manage jobs.

On Github, see another great resource for Slurm on Eagle.

Command	Description
squeue	Show the Slurm queue. Users can specify JOBID or USER.
scontrol	Controls various aspects of jobs such as job suspension, re-queuing or resuming jobs and can display diagnostic info about each job.
scancel	Cancel specified job(s).
sinfo	View information about all Slurm nodes and partitions.
sacct	Detailed information on accounting for all jobs and job steps.
sprio	View priority and the factors that determine scheduling priority.

Usage Examples

squeue

$ squeue -u hpcuser
           JOBID    PARTITION       NAME      USER   ST       TIME      NODES   NODELIST(REASON)
          506955          gpu   wait_tes   hpcuser   PD       0:00          1      (Resources)

$ squeue -l
Thu Dec 13 12:17:31 2018
 JOBID  PARTITION NAME     USER     STATE   TIME    TIME_LIMI   NODES  NODELIST(REASON)
 516890 standard Job007    user1    PENDING 0:00    12:00:00    1050   (Dependency)
 516891 standard Job008    user1    PENDING 0:00    12:00:00    1050   (Dependency)
 516897      gpu Job009    user2    PENDING 0:00    04:00:00       1   (Resources)
 516898 standard Job010    user3    PENDING 0:00    15:00:00      71   (Priority)
 516899 standard Job011    user3    PENDING 0:00    15:00:00      71   (Priority)
-----------------------------------------------------------------------------
 516704 standard Job001    user4    RUNNING 4:09:48 15:00:00      71    r1i0n[0-35],r1i1n[0-34]
 516702 standard Job002    user4    RUNNING 4:16:50 15:00:00      71    r1i6n35,r1i7n[0-35],r2i0n[0-33]
 516703 standard Job003    user4    RUNNING 4:16:57 15:00:00      71    r1i5n[0-35],r1i6n[0-34]
 516893 standard Job004    user4    RUNNING 7:19     3:00:00      71    r1i1n35,r1i2n[0-35],r1i3n[0-33]
 516894 standard Job005    user4    RUNNING 7:19     3:00:00      71    r4i2n[20-25],r6i6n[7-35],r6i7n[0-35]
 516895 standard Job006    user4    RUNNING 7:19     3:00:00      71    r4i2n[29-35],r4i3n[0-35],r4i4n[0-20]

To estimate when your jobs will start to run, use the squeue --start command with the JOBID.

$ squeue --start -j 509851,509852
 JOBID    PARTITION    NAME      USER      ST          START_TIME    NODES   SCHEDNODES   NODELIST(REASON)
 509851   short      test1.sh   hpcuser    PD                 N/A      100       (null)       (Dependency)
 509852   short      test2.sh   hpcuser    PD 2018-12-19T16:54:00        1      r1i6n35         (Priority)

scontrol

To get detailed information about your job before and while it runs, you may use scontrol show job with the JOBID. For example:

$ scontrol show job 522616
JobId=522616 JobName=myscript.sh
 UserId=hpcuser(123456) GroupId=hpcuser(123456) MCS_label=N/A
 Priority=43295364 Nice=0 Account=csc000 QOS=normal
 JobState=PENDING Reason=Dependency Dependency=afterany:522615

The scontrol command can also be used to modify pending and running jobs:

$ scontrol update jobid=526501 qos=high
$ sacct -j 526501 --format=jobid,partition,state,qos
       JobID  Partition      State        QOS
------------ ---------- ---------- ----------
526501            short    RUNNING       high
526501.exte+               RUNNING
526501.0                 COMPLETED

To pause a job: scontrol hold <JOBID>

To resume a job: scontrol resume <JOBID>

To cancel and rerun: scontrol requeue <JOBID>

Use scancel -i <jobID> for an interactive mode to confirm each job_id.step_id before performing the cancel operation. Use scancel --state=PENDING,RUNNING,SUSPENDED -u <userid> to cancel your jobs by STATE or scancel -u <userid> to cancel ALL of your jobs.

sinfo

Use sinfo to view cluster information:

$ sinfo -o %A
NODES(A/I)
1580/514

Above, sinfo shows nodes Allocated (A) and nodes idle (I) in the entire cluster.

To see specific node information use sinfo -n <node id> to show information about a single or list of nodes. You will see the partition to which the node can allocate as well as the node STATE.

$ sinfo -n r105u33,r2i4n27
PARTITION  AVAIL   TIMELIMIT NODES  STATE  NODELIST
short      up        4:00:00     1  drain   r2i4n27
short      up        4:00:00     1   down   r105u33
standard   up     2-00:00:00     1  drain   r2i4n27
standard   up     2-00:00:00     1   down   r105u33
long       up     10-00:00:0     1  drain   r2i4n27
long       up     10-00:00:0     1   down   r105u33
bigmem     up     2-00:00:00     1   down   r105u33
gpu        up     2-00:00:00     1   down   r105u33
bigscratch up     2-00:00:00     0    n/a
ddn        up     2-00:00:00     0    n/a

sacct

Use sacct to view accounting information about jobs AND job steps:

$ sacct -j 525198 --format=User,JobID,Jobname,partition,state,time,start,elapsed,nnodes,ncpus
     User        JobID    JobName  Partition      State  Timelimit               Start    Elapsed  NNodes    NCPUS
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ---------- ------- --------
  hpcuser 525198        acct_test      short  COMPLETED   00:01:00 2018-12-19T16:09:34   00:00:54       4      144
          525198.batch      batch             COMPLETED            2018-12-19T16:09:34   00:00:54       1       36
          525198.exte+     extern             COMPLETED            2018-12-19T16:09:34   00:00:54       4      144
          525198.0           bash             COMPLETED            2018-12-19T16:09:38   00:00:00       4        4

Use sacct -e to print a list of fields that can be specified with the --format option.

sprio

By default, sprio returns information for all pending jobs. Options exist to display specific jobs by JOBID and USER.

$ sprio -u hpcuser
  JOBID  PARTITION     USER  PRIORITY   AGE  JOBSIZE PARTITION       QOS
 526752      short  hpcuser  43383470  3733   179737         0  43200000

$ sprio -u hpcuser -n
  JOBID  PARTITION     USER    PRIORITY        AGE    JOBSIZE  PARTITION        QOS
 526752      short  hpcuser  0.01010100  0.0008642  0.0009747  0.0000000  0.1000000