Eagle Utilization is Increasing!
June 04, 2019
Users may have become accustomed to nearly nonexistent queue wait times on Eagle over the past several months, while it ran concurrently alongside Peregrine. Now that projects have migrated their software stacks to Eagle and Peregrine is being phased out, the majority of HPC usage takes place on Eagle. As a result, your jobs may experience longer queue wait times, since the scheduler must arbitrate among more concurrent submissions competing for the same resources. Here are some command-line tools for checking Eagle occupancy and the queue status of your job(s):
- `shownodes` will show how many nodes are available, grouped by their specialized hardware features. This is useful if your job needs a particular hardware feature and you want to see how many matching nodes are free.
- `squeue -u $USER --start | grep -v N/A` will show an estimated start time for your jobs, filtering out those for which the scheduler has not yet produced an estimate.
- `sprio -o "%.10Y %.10i %.8u" | sort -r` will show the pending jobs sorted by descending priority, i.e., the order in which they are slated to launch.
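As a sketch, the checks above can be wrapped into one small script. `squeue` and `sprio` are standard Slurm commands and `shownodes` is the Eagle-specific tool mentioned above; the function name `queue_outlook` is our own, chosen for illustration:

```shell
#!/usr/bin/env bash
# Sketch: summarize Eagle occupancy and the outlook for your pending jobs.
# Assumes Slurm's squeue/sprio and the site-provided shownodes are on PATH.

queue_outlook() {
    echo "== Available nodes by hardware feature =="
    shownodes

    echo "== Estimated start times for my jobs =="
    # --start adds a START_TIME column; grep -v N/A drops jobs
    # the scheduler has not yet produced an estimate for.
    squeue -u "$USER" --start | grep -v 'N/A'

    echo "== Pending jobs by descending priority =="
    # %Y = priority, %i = job id, %u = user; because the priority
    # column comes first, sort -r puts the soonest-to-launch jobs on top.
    sprio -o "%.10Y %.10i %.8u" | sort -r
}
```

Running `queue_outlook` on a login node gives a quick picture of how busy the machine is and where your jobs sit in the queue.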
When your job is pending ("PD"), the "NODELIST(REASON)" column of your squeue output may state that "Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions". This is a typical reason for a job not yet running and can mean any of the following:
- Other jobs with higher priority are waiting to run and need to reserve nodes for themselves (otherwise jobs with lots of nodes would never run!)
- Your job is waiting for a hardware feature for which all respective nodes are in use (e.g. all GPU-nodes are taken)
- Your job conflicts with a reservation, most commonly because your requested walltime extends into a scheduled system time during which we reserve the entire cluster. You can use `scontrol show reservation` to see scheduled reservations. If left alone, your job will run once the reservation has passed.
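As a rough sketch of the reservation check above, the snippet below compares a proposed job end time against reservation start times. It assumes `scontrol show reservation` prints `StartTime=` fields in a `date`-parsable format, as Slurm typically does on Linux; the function name `walltime_clears_reservations` is our own, for illustration only:

```shell
#!/usr/bin/env bash
# Sketch: warn if a proposed job end time runs into any reservation.
# Assumes GNU date and Slurm's scontrol; parsing is illustrative only.

walltime_clears_reservations() {
    # $1: proposed job end time, in epoch seconds
    local job_end=$1
    scontrol show reservation 2>/dev/null \
        | grep -o 'StartTime=[^ ]*' \
        | cut -d= -f2 \
        | while read -r start; do
            res_start=$(date -d "$start" +%s)
            if [ "$job_end" -gt "$res_start" ]; then
                echo "conflict: walltime extends past reservation starting $start"
            fi
        done
}
```

For example, `walltime_clears_reservations "$(date -d '+48 hours' +%s)"` would flag any reservation beginning within the next two days.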
Rest assured, your job will run eventually. Slurm will typically refuse to enqueue a job at submission time if its specification contains an error that would prevent it from ever running.