
Running High-Throughput Jobs on Peregrine

You can use a tool called Nitro to wrap many short tasks into a single job. It creates a coordinator and a number of workers. The coordinator gives tasks to the workers until all the tasks have been run. This greatly reduces the scheduler overhead associated with a large number of very short jobs.

Nitro allows users to:

  • Keep a set of cores busy running serial, threaded or modestly parallel programs
  • Get real time status updates while a job is running
  • Find out which tasks failed after the job is complete
  • Restart a workload in another job and continue with the tasks that have not yet been executed

There are 480 core licenses available for Nitro, so a single job cannot use more than 480 cores.
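Since the license limit is a simple product of nodes and cores per node, it can be checked before submitting. The following is a minimal sketch; the function name check_nitro_cores and the variable NITRO_LICENSE_LIMIT are illustrative and not part of Nitro.

```shell
# Sketch: check a requested job size against the 480-core Nitro license pool.
# NITRO_LICENSE_LIMIT and check_nitro_cores are illustrative names, not Nitro features.
NITRO_LICENSE_LIMIT=480

check_nitro_cores() {
    req_nodes=$1
    req_ppn=$2
    req_cores=$((req_nodes * req_ppn))
    if [ "$req_cores" -gt "$NITRO_LICENSE_LIMIT" ]; then
        echo "Request of $req_cores cores exceeds the $NITRO_LICENSE_LIMIT-core license limit" >&2
        return 1
    fi
    echo "$req_cores"
}

check_nitro_cores 3 16    # nodes=3, ppn=16: prints 48
```

With 16 cores per node, as in this example, the largest request that passes the check is 30 nodes.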

Creating Jobs

First you will need to create a task file. This is a list of all the tasks you want to run. Each line will include a command to run, which is preceded by "cmd=". Everything after the = sign is treated as the command to be executed.

Task Files

Task files contain a list of commands to run. You can give each command a unique name (using "name=") and use the task name to create a unique output file name. For additional options, see the Nitro documentation.

 

Here's a task file that runs a trivial Hello world program.

name=k1 cmd=/home/icarpent/hello_world >& $NITROTASKNAME.out
name=k2 cmd=/home/icarpent/hello_world >& $NITROTASKNAME.out
name=k3 cmd=/home/icarpent/hello_world >& $NITROTASKNAME.out
name=k4 cmd=/home/icarpent/hello_world >& $NITROTASKNAME.out
name=k5 cmd=/home/icarpent/hello_world >& $NITROTASKNAME.out
name=k6 cmd=/home/icarpent/hello_world >& $NITROTASKNAME.out
name=k7 cmd=/home/icarpent/hello_world >& $NITROTASKNAME.out
name=k8 cmd=/home/icarpent/hello_world >& $NITROTASKNAME.out
name=k9 cmd=/home/icarpent/hello_world >& $NITROTASKNAME.out
name=k10 cmd=/home/icarpent/hello_world >& $NITROTASKNAME.out
name=k11 cmd=/home/icarpent/hello_world >& $NITROTASKNAME.out
name=k12 cmd=/home/icarpent/hello_world >& $NITROTASKNAME.out
name=k13 cmd=/home/icarpent/hello_world >& $NITROTASKNAME.out
name=k14 cmd=/home/icarpent/hello_world >& $NITROTASKNAME.out
name=k15 cmd=/home/icarpent/hello_world >& $NITROTASKNAME.out
name=k16 cmd=/home/icarpent/hello_world >& $NITROTASKNAME.out
name=k17 cmd=/home/icarpent/hello_world >& $NITROTASKNAME.out
name=k18 cmd=/home/icarpent/hello_world >& $NITROTASKNAME.out
name=k19 cmd=/home/icarpent/hello_world >& $NITROTASKNAME.out
name=k20 cmd=/home/icarpent/hello_world >& $NITROTASKNAME.out
name=k21 cmd=/home/icarpent/hello_world >& $NITROTASKNAME.out
name=k22 cmd=/home/icarpent/hello_world >& $NITROTASKNAME.out
name=k23 cmd=/home/icarpent/hello_world >& $NITROTASKNAME.out
name=k24 cmd=/home/icarpent/hello_world >& $NITROTASKNAME.out
name=k25 cmd=/home/icarpent/hello_world >& $NITROTASKNAME.out
name=k26 cmd=/home/icarpent/hello_world >& $NITROTASKNAME.out
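
A repetitive task file like this one does not need to be typed by hand. The loop below is a sketch that writes the same 26 lines; the output file name taskfile is an assumption, and the program path is the one from the example above.

```shell
# Sketch: write the 26-line hello_world task file shown above.
# "taskfile" is an illustrative output name; adjust the program path to your own.
taskfile=taskfile
: > "$taskfile"    # start with an empty file
i=1
while [ "$i" -le 26 ]; do
    # Single quotes keep $NITROTASKNAME literal in the file, exactly as in the
    # example above, instead of expanding it while the file is being generated.
    printf 'name=k%d cmd=/home/icarpent/hello_world >& $NITROTASKNAME.out\n' "$i" >> "$taskfile"
    i=$((i + 1))
done
```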
 

Here's an excerpt of a task file with multiple runs of the program Povray:

...
cmd=/home/$USER/bin/povray +H1125 +W858 -Ipyramid2.pov -Opyramid2_185.pov
cmd=/home/$USER/bin/povray +H1125 +W858 -Ipyramid2.pov -Opyramid2_186.pov
cmd=/home/$USER/bin/povray +H1125 +W858 -Ipyramid2.pov -Opyramid2_187.pov
cmd=/home/$USER/bin/povray +H1125 +W858 -Ipyramid2.pov -Opyramid2_188.pov
cmd=/home/$USER/bin/povray +H1125 +W858 -Ipyramid2.pov -Opyramid2_189.pov
cmd=/home/$USER/bin/povray +H1125 +W858 -Ipyramid2.pov -Opyramid2_190.pov
cmd=/home/$USER/bin/povray +H1125 +W858 -Ipyramid2.pov -Opyramid2_191.pov
...
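
Before submitting, it can help to confirm that a task file follows the rules described above: each line carries a "cmd=" entry, and task names, where given, are unique. The helper below is a sketch under those assumptions. check_taskfile is a hypothetical name, not a Nitro command, and the check is deliberately strict: it treats every non-blank line as a task and assumes "name=" starts the line, as in the hello_world example.

```shell
# Sketch: basic sanity checks on a task file before submitting.
# check_taskfile is an illustrative helper, not part of Nitro.
check_taskfile() {
    file=$1
    status=0

    # Count non-blank lines that have no cmd= entry.
    missing=$(grep -v '^[[:space:]]*$' "$file" | grep -vc 'cmd=' || true)
    if [ "$missing" -gt 0 ]; then
        echo "$missing line(s) have no cmd= entry" >&2
        status=1
    fi

    # Collect task names and report any that occur more than once.
    dupes=$(sed -n 's/^name=\([^[:space:]]*\).*/\1/p' "$file" | sort | uniq -d)
    if [ -n "$dupes" ]; then
        echo "duplicate task names:" $dupes >&2
        status=1
    fi

    return "$status"
}
```

Running check_taskfile on the path of your task file returns a nonzero status and prints a message if either check fails.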

Job Script

Next, create a job script, which you submit in the normal way. In this script, you tell the system where to find your task file by setting the NITRO_TASK_FILE environment variable. You can also use the NITRO_COORD_OPTIONS environment variable to set options that control how Nitro works, such as whether to create workers on the same node as the coordinator and how many tasks to give each worker at a time.

If your job runs on a single node, you need to run both the coordinator and the workers on that node. You tell Nitro to do this by including --run_local_worker in the NITRO_COORD_OPTIONS variable.

Finally, you start Nitro by executing launch_nitro.sh.

Sample Nitro job script

To run a job using the Povray task file, we can use this job script:

#!/bin/bash -l

#PBS -N nitro_test
#PBS -l nodes=3:ppn=16
#PBS -W x=GRES:nitro_core+48              #  hold the job until 48 core licenses are available
#PBS -l walltime=1:00:00
#PBS -q short
#PBS -A CSC000                            #  change this to your allocation handle

module load nitro

export NITRO_TASK_FILE=/home/$USER/povray/build/sample/POV/taskfile
export NITRO_COORD_OPTIONS=--assignment-size=80       # number of tasks the coordinator should pass to a worker at a time

launch_nitro.sh

A sample submission script and a task file can also be found on Peregrine at /nopt/nitro/sample.
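
For the single-node case described earlier, the sample script can be adapted along the following lines. This is a sketch, not a tested configuration: the file name nitro_single_node.sh and the task file path are placeholders, the request for 16 core licenses assumes the license count should match the 16 cores of one node (by analogy with 48 licenses for 3 nodes above), and the option is spelled exactly as it appears in the text above. The snippet writes the job script to a file so you can review it before submitting.

```shell
# Sketch: write a single-node variant of the sample job script to a file.
# nitro_single_node.sh and the task file path are illustrative placeholders.
cat > nitro_single_node.sh <<'EOF'
#!/bin/bash -l

#PBS -N nitro_single_node
#PBS -l nodes=1:ppn=16
#PBS -W x=GRES:nitro_core+16
#PBS -l walltime=1:00:00
#PBS -q short
#PBS -A CSC000

module load nitro

export NITRO_TASK_FILE=/home/$USER/path/to/taskfile
# Run workers on the same node as the coordinator.
export NITRO_COORD_OPTIONS=--run_local_worker

launch_nitro.sh
EOF
```

As with the multi-node sample, change the allocation handle to your own before submitting the generated file.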

Monitoring Jobs

Whether the job is running or has already finished, use the nitrostat command to check the status:

$ nitrostat *JOBID*

You can watch the job's progress while it is running with the -w option:

$ nitrostat *JOBID* -w

After your job finishes, you can see whether all of your tasks succeeded:

$ nitrostat *JOBID*

and, if they did not, which ones failed:

$ nitrostat *JOBID* -f

By default, Nitro stores log files in /home/$USER/nitro. You can change the location of the log files by setting the NITROJOBDIR environment variable. For example, to store the logs in your working directory, add the following to your job script:

export NITROJOBDIR=$PBS_O_WORKDIR

If you do this, you must also point nitrostat at that directory when checking the status of your job. For example, from the working directory:

$ nitrostat -d $PWD 

For more information, please see the Nitro Users Guide.