Running Batch Jobs on Eagle
Batch jobs are run on Eagle by submitting a job script to the scheduler. The script contains the commands needed to set up your environment and run your application.
To submit jobs on Eagle, the Slurm sbatch command should be used:
$ sbatch --account=<project-handle> <batch_script>
Scripts and program executables may reside in any file system, but input and output files should be read from and written to the /scratch file system. /scratch uses the Lustre file system, which is designed to take advantage of the parallel network fabric between Eagle nodes and provides much higher throughput for file operations.
Arguments to sbatch may be used to specify resource limits such as job duration (referred to as "walltime"), number of nodes, etc., as well as what hardware features you want your job to run with. These can also be supplied within the script itself by placing #SBATCH comment directives within the file.
For examples of implementations, please see our sample batch scripts.
Users familiar with job submissions to PBS on Peregrine may be interested in our PBS to Slurm Translation Sheet for quickly converting workflows over to Eagle. Also see the official Slurm Cheat Sheet produced by SchedMD, the developers of Slurm.
Note: Command-line arguments to sbatch must precede the batch script name or they will be ignored. Arguments supplied on the command line take precedence over duplicate #SBATCH directives in the script.
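As an illustration of the directive style, here is a minimal sketch of a batch script; the project handle, run directory, executable, and input file are placeholders you would replace with your own, not a prescribed workflow:

```bash
#!/bin/bash
#SBATCH --account=<project-handle>   # replace with your project handle
#SBATCH --time=01:00:00              # walltime of 1 hour
#SBATCH --nodes=2                    # request two nodes
#SBATCH --job-name=example_job

# Run from /scratch so input and output go to the Lustre file system
cd /scratch/$USER/example_job
srun <your-application> input.dat    # placeholder executable and input file
```

Any sbatch option given on the command line (for example, sbatch --time=02:00:00 example_job.sb) overrides the matching #SBATCH directive in the script.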
Job Submission Requirements
Parameter | How Specified | Example | Additional Information |
---|---|---|---|
Project Handle | --account or -A | --account=<project_handle> | Project handles are provided by HPC Operations at the beginning of an allocation cycle. |
Maximum Job Duration (Walltime) | --time or -t | --time=1-12:05:30 (1 day, 12 hours, 5 minutes, 30 seconds) | Recognized time formats: minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes, and days-hours:minutes:seconds. |
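For example, a submission that supplies both required parameters on the command line might look like the following (csc000 is just a sample project handle):

$ sbatch --account=csc000 --time=1-12:05:30 <batch_script>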
Resource Request Descriptions
Parameter | How Specified | Example | Additional Information |
---|---|---|---|
Nodes / Tasks / MPI Ranks | --nodes or -N, --ntasks or -n, --ntasks-per-node | --nodes=2 --ntasks=40 --ntasks-per-node=20 | If --ntasks is specified, it is still important to indicate the number of nodes requested; this helps the scheduler place jobs on the fewest possible number of Ecells. The maximum number of tasks per node is 36. Note: the --tasks flag is not mentioned in the official documentation, but exists as an alias for --ntasks-per-node. |
Memory | --mem, --mem-per-cpu | --mem=50000 or --mem=700GB | --mem requests memory per node; --mem-per-cpu requests memory per task (MPI rank). |
Local Disk (/tmp/scratch) | --tmp | --tmp=20TB, --tmp=1200GB, --tmp=1000000 | Requests /tmp/scratch space in megabytes (default), GB, or TB. 1000000 MB = 1 TB. |
GPUs | --gres | --gres=gpu:2 | There are 44 GPU nodes, each with 2 NVIDIA Tesla V100 GPUs. |
Licenses (planned functionality) | --licenses | --licenses=COMSOL:1 | Use scontrol show lic to display available licenses. This functionality is not yet set up on Eagle; the implementation may differ. |
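As a sketch of how these resource requests can be combined in a script header, consider the following; the memory, disk, and walltime values are arbitrary illustrations, the executable is a placeholder, and --gres=gpu:2 applies only on the GPU nodes:

```bash
#!/bin/bash
#SBATCH --account=csc000        # sample project handle
#SBATCH --time=04:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=36    # up to 36 tasks per node
#SBATCH --mem=150000            # memory per node, in MB by default
#SBATCH --tmp=1200GB            # node-local /tmp/scratch space
#SBATCH --gres=gpu:2            # two V100 GPUs per node (GPU nodes only)

srun <your-mpi-application>     # placeholder executable
```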
Job Management and Output
Parameter | How Specified | Example | Additional Information |
---|---|---|---|
High Priority | --qos | --qos=high | Jobs that request --qos=high will generally jump to the top of the queue. Time used is charged at twice the normal rate. |
Job Dependencies | --dependency | --dependency=afterok:<job_id> | You can submit jobs that will wait until a condition is met before running. Conditions include after, afterany, afterok, afternotok, and singleton. |
Job Name | --job-name | --job-name=myjob | Helps you recognize a particular job in the queue and differentiate it from other jobs. |
Email Notification | --mail-user, --mail-type | --mail-user=<email_address> --mail-type=ALL | Slurm will send email updates on job state changes. --mail-type specifies which states generate an email and may be BEGIN, END, FAIL, or ALL. |
Output | --output, --error | --output=job_stdout --error=job_stderr | stdout defaults to slurm-<jobid>.out. stderr is written to the same file as stdout unless --error specifies otherwise. |
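As a sketch of how these job-management options work together, the following submits a post-processing job that waits for an earlier job to complete successfully; the job ID, email address, and script names are hypothetical:

```bash
$ sbatch --job-name=preprocess preprocess.sb
Submitted batch job 521837
$ sbatch --job-name=postprocess --dependency=afterok:521837 \
         --mail-user=user@example.com --mail-type=END \
         --output=postprocess-%j.out postprocess.sb
```

In the --output filename, %j is replaced by the job ID, so each run writes to its own log file.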
Commonly Used Slurm Environment Variables
Environment Variable | Semantic Meaning | Sample Value |
---|---|---|
$LOCAL_SCRATCH | Absolute directory path for large per-node disk space (nodes do not have access to each other's local scratch). This should always be /tmp/scratch across all Eagle nodes. | /tmp/scratch |
$SLURM_CLUSTER_NAME | Identical to $NREL_CLUSTER on our systems; it is the name given to the cluster in the configuration of the master Slurm daemon. | eagle |
$SLURM_CPUS_ON_NODE | Hyperthreading is disabled, so this should always be 36 CPUs per node across all of Eagle. | 36 |
$SLURMD_NODENAME | The name (as configured in Slurm) of the individual node evaluating the variable, which should match the hostname on our systems. | r4i2n3 |
$SLURM_JOB_ACCOUNT | The account used to submit the job, which should be your project's handle. | csc000 |
$SLURM_JOB_CPUS_PER_NODE | If a value for --cpus-per-node is specified for the job, this will reflect that. This should be <= 36. | 36 |
$SLURM_JOB_ID or $SLURM_JOBID | The job ID assigned to the job, so you can identify output related to just that job. | 521837 |
$SLURM_JOB_NAME | If the job was named, this will hold that name; otherwise it defaults to the command that was run in the job. | sh |
$SLURM_JOB_NODELIST or $SLURM_NODELIST | The hostnames of the nodes that participated in your job. There will be at least one, and contiguous nodes are often given in Slurm's range syntax, as in the sample value. | r4i2n[2-6] |
$SLURM_JOB_NUM_NODES or $SLURM_NNODES | The number of nodes requested for the job. | 5 |
$SLURM_JOB_PARTITION | The partition(s) the job was placed in. | short |
$SLURM_JOB_QOS | The Quality of Service specified by the job. | high |
$SLURM_JOB_USER | The username of the user who submitted the job. | ttester |
$SLURM_NODEID | For each individual node in the job, this is its unique index in the list of nodes: a value from 0 up to one less than the number of nodes requested. | 0 |
$SLURM_STEP_ID or $SLURM_STEPID | From within a job, sequential srun commands are called job "steps." Each call to srun increments this variable, giving each step its own unique ID index. This may be helpful for debugging, for example to see which step a job fails at. | 0 |
$SLURM_STEP_NODELIST | From within a job, calls to srun can specify differing numbers of nodes for the step. If your job requested 5 nodes and you used srun -N 3, this variable contains the list of the 3 nodes (out of the 5 allocated) that participated in this job step. | r4i2n[2-4] |
$SLURM_STEP_NUM_NODES | Like above, except this contains the number of nodes requested for the job step. | 3 |
$SLURM_STEP_NUM_TASKS | The number of tasks requested for the job step; defaults to the task count of the job request. | 1 |
$SLURM_STEP_TASKS_PER_NODE | The value specified by --tasks-per-node for the job step; defaults to the tasks-per-node of the job request. | 1 |
$SLURM_SUBMIT_DIR | The absolute path of the directory the job was submitted from. | /projects/csc000 |
$SLURM_SUBMIT_HOST | The hostname of the system the job was submitted from. Should always be an Eagle login node or an Eagle DAV node. | el1 |
$SLURM_TASKS_PER_NODE | The value specified by --tasks-per-node in the job request. | 1 |
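As a sketch of how some of these variables might be used inside a batch script, consider the following; the application name is a placeholder, and the /scratch destination is only an example path:

```bash
#!/bin/bash
#SBATCH --account=csc000
#SBATCH --nodes=1
#SBATCH --time=00:30:00

# Report basic job information from Slurm's environment variables
echo "Job $SLURM_JOB_ID submitted by $SLURM_JOB_USER from $SLURM_SUBMIT_DIR"
echo "Running on $SLURM_JOB_NUM_NODES node(s): $SLURM_JOB_NODELIST"

# Work in the node-local scratch space, then copy results back to /scratch
WORKDIR=$LOCAL_SCRATCH/$SLURM_JOB_ID
mkdir -p "$WORKDIR"
cd "$WORKDIR"
srun <your-application>                          # placeholder executable
cp -r "$WORKDIR" /scratch/$USER/results_$SLURM_JOB_ID
```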