Stability of Large Jobs
June 5, 2020
In early May, HP and NREL formed a joint technical team, along with Mellanox, to investigate errors that appeared intermittently under heavy network load and reduced the productivity of large jobs on Eagle. Last Friday the team identified the root cause and is now working on a mitigation strategy along with an eventual long-term solution. Specifically, the issue lies in the version of OFED the system was delivered with. OFED ties together the Linux kernel, the InfiniBand adapters, the fabric, and the MPI software. We have identified a version of OFED and associated firmware to upgrade to so that large jobs run more reliably. Deploying these fixes requires creating a new system image for Eagle and taking a full system outage to install it, and we are sensitive to the large backlog of work in the Eagle queue. In the meantime, jobs placed on adjacent nodes appear to be less affected, so topology-aware placement may help.
Next steps include:
- Develop a new system image (see below) based on a new firmware / driver recipe provided by HP
- Enable topology-aware job placement / scheduling to mitigate impacts
- Deploy the new image and continue testing to confirm resolution of the problem
- Monitor system performance across the entire workload on an ongoing basis
- Operationalize lessons learned to optimize job placement and improve overall system performance
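
To illustrate what topology-aware placement means in practice, here is a minimal Python sketch. It is not the scheduler's actual algorithm or Eagle's configuration. The function name `place_job`, the switch and node names, and the input format (a mapping from leaf switch to its free nodes) are all hypothetical. The idea is to keep a job's nodes adjacent by spanning as few leaf switches as possible.

```python
from typing import Dict, List


def place_job(free_by_switch: Dict[str, List[str]], n_nodes: int) -> List[str]:
    """Choose n_nodes free nodes while spanning as few leaf switches as possible.

    free_by_switch is a hypothetical mapping: leaf switch name -> free node names.
    """
    if n_nodes <= 0:
        return []
    total_free = sum(len(nodes) for nodes in free_by_switch.values())
    if n_nodes > total_free:
        raise ValueError("not enough free nodes for this job")

    # Best fit: if one switch can hold the whole job, use the smallest such
    # switch so that larger free blocks stay available for larger jobs.
    fitting = [s for s, nodes in free_by_switch.items() if len(nodes) >= n_nodes]
    if fitting:
        best = min(fitting, key=lambda s: (len(free_by_switch[s]), s))
        return sorted(free_by_switch[best])[:n_nodes]

    # Otherwise fill from the switches with the most free nodes first,
    # which keeps the number of switches the job spans to a minimum.
    chosen: List[str] = []
    by_most_free = sorted(free_by_switch, key=lambda s: (-len(free_by_switch[s]), s))
    for switch in by_most_free:
        remaining = n_nodes - len(chosen)
        chosen.extend(sorted(free_by_switch[switch])[:remaining])
        if len(chosen) == n_nodes:
            break
    return chosen


# Hypothetical free-node inventory for three leaf switches.
free = {"sw1": ["n1", "n2"], "sw2": ["n3", "n4", "n5"], "sw3": ["n6"]}
small_job = place_job(free, 2)   # fits entirely under one switch
large_job = place_job(free, 4)   # must span two switches
```

In this sketch a two-node job lands entirely under one switch, while a four-node job spans two switches instead of three. A production scheduler would also weigh queue priority, fairness, and the full fabric hierarchy, which this example ignores.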