Eagle Year In Review
October 09, 2020
FY20 was a tough year for Eagle availability.
There were about 17 days of planned outages (12 is typical):
- October 1-7 2019: Addition of eCell #8. NREL purchased 72 nodes and 2 petabytes of storage. We also added power circuits to power eCell #9 (WETO) and eCell #10 (AMO) and performed patches and upgrades.
- February 10-16 2020: Patch/upgrade Lustre. Scan and repair filesystem anomalies.
- June 4: Integrate eCell #9 (WETO) and eCell #10 (AMO)
- July 27-28: Drain and refill water in Eagle. Deploy new OS image to address cyber and job stability issues.
There were about 22 days with unplanned outages (6 is typical):
- Nov 17-18: power outage. Double UPS failure.
- Dec 3: Lustre problem
- Jan 7-10: manifold connector leak prompted Lustre failure. Filesystem scan/repair could not complete because there were too many directories on /scratch and /projects.
- Jan 11, 13, 21-22, 23, 27, 29-30: Downtime is attributed to a known bug or to network communication problems caused by the OFED version. Both of these were fixed during the February planned outage.
- Feb 27-28: network communication issue left portions of Lustre unable to communicate with each other
- Mar 21-22: scheduling paused to manage metadata issues: too many files and directories on a single metadata server
- Jun 25: suspected lightning strike flipped main breaker serving Eagle
- July 6-7: Lustre problem - had to 'reboot' to get things working again
Eagle delivered 88% availability. Projects used 57M AUs, compared to the 56M AUs planned during the FY20 allocation cycle.
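As a rough cross-check of the day counts listed above, the outage lengths can be tallied in a few lines of Python. This is only a sketch: it assumes FY20 runs from October 1, 2019 through September 30, 2020, and it counts every listed date as a full day of downtime. Because it works at whole-day granularity, its estimate need not match the reported availability figure, which is presumably based on finer-grained accounting.

```python
from datetime import date

# Outage lengths in days, transcribed from the two lists above.
planned = {
    "Oct 1-7 2019": 7,
    "Feb 10-16 2020": 7,
    "Jun 4": 1,
    "Jul 27-28": 2,
}
unplanned = {
    "Nov 17-18": 2,
    "Dec 3": 1,
    "Jan 7-10": 4,
    "Jan 11, 13, 21-22, 23, 27, 29-30": 8,
    "Feb 27-28": 2,
    "Mar 21-22": 2,
    "Jun 25": 1,
    "Jul 6-7": 2,
}

planned_days = sum(planned.values())
unplanned_days = sum(unplanned.values())

# Assumption: the fiscal year spans Oct 1, 2019 to Sep 30, 2020 inclusive.
fy_days = (date(2020, 10, 1) - date(2019, 10, 1)).days

# Coarse estimate: treats each listed date as a full day of downtime.
day_based_availability = 1 - (planned_days + unplanned_days) / fy_days

print(f"planned: {planned_days} days, unplanned: {unplanned_days} days")
print(f"fiscal year length: {fy_days} days")
print(f"day-based availability estimate: {day_based_availability:.1%}")
```

The planned and unplanned totals come out to the 17 and 22 days quoted above.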
We hope and expect that FY20 was anomalous. In most years, HPC systems at NREL deliver more than 95% availability with far fewer interruptions. We have fixed hardware issues, patched and upgraded OS and software, and increased focus on communications. We do our very best to decrease the number and duration of both planned and unplanned outages that do occur.