Queue Wait Times
June 5, 2020
At the start of April, the 'queue depth' - the time it would take Eagle to complete all the jobs in the queue based on their wall time limits - increased significantly, from the historically common level of under 4 days to over 3 weeks. This isn't unprecedented, but it has persisted longer than in the past, and we realize this creates a challenge for users. Eagle is making progress and has reduced the backlog to a bit under 2 weeks. In light of the sustained high volume of work, we have taken the following actions:
Tuned scheduler performance based on longer wait times. Previously it was optimized for wait times of up to 5 days.
Decreased ability of standby jobs to accumulate priority. Previously they would accumulate priority, albeit slowly, and eventually could run if they waited long enough. This should no longer happen if there are any 'non-standby' jobs in the queue. Standby jobs may still run, using Slurm backfill scheduling, if they do not impact the start time of any priority jobs.
Tentatively delayed development of a new system image to next quarter to reduce scheduled downtime this quarter.
Implemented the published allocation reduction policy to bring remaining allocations into alignment with available hours.
Improved introspection of log data to monitor throughput and queue performance, to help with ongoing efforts to adapt to changing workloads. We plan to make this data visible through a web-based interface in the near future - look for an announcement in the coming weeks.
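For readers who want a feel for the numbers, two of the ideas above can be sketched roughly in Python: the 'queue depth' estimate and the condition under which a standby job can still be backfilled. This is a simplified illustration, not the Slurm implementation or the tooling used on Eagle. The Job class, both function names, and the node count of 2000 are hypothetical, and the backfill check considers only a single upcoming priority job.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    nodes: int             # number of nodes requested
    walltime_hours: float  # wall time limit, not the actual runtime


def queue_depth_days(queued: List[Job], total_nodes: int) -> float:
    """Days to drain the queue if every job runs to its wall time limit.

    Assumes perfect packing onto total_nodes and ignores running jobs,
    so it is only a rough estimate.
    """
    node_hours = sum(job.nodes * job.walltime_hours for job in queued)
    return node_hours / total_nodes / 24.0


def can_backfill(job: Job, idle_nodes: int, hours_until_priority_start: float) -> bool:
    """Simplified backfill test for a standby job.

    The job may start only if it fits on currently idle nodes and its wall
    time limit ends before the next priority job is expected to start, so
    that the priority job's start time is not delayed.
    """
    fits = job.nodes <= idle_nodes
    finishes_in_time = job.walltime_hours <= hours_until_priority_start
    return fits and finishes_in_time


if __name__ == "__main__":
    # Hypothetical queue and machine size, for illustration only.
    queue = [Job(500, 48), Job(1000, 24), Job(100, 240)]
    print(queue_depth_days(queue, total_nodes=2000))
    print(can_backfill(Job(10, 4), idle_nodes=16, hours_until_priority_start=6))
```

One practical consequence of the second function: because the check uses the wall time limit rather than the actual runtime, a job that requests a shorter, more accurate wall time is more likely to fit into a backfill window.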
The scheduling algorithm is sophisticated and designed to maximize the overall productivity of the machine, which sometimes makes it difficult to see why particular jobs start before others. The factors that go into calculating job priority are described on the following pages: