March Eagle Status
March 03, 2020
During the system time of February 10-16, HPC Operations worked closely with HPE and DDN to apply patches, upgrades, and configuration changes intended to make the eaglefs filesystem (/scratch and /projects) more reliable. So far, those updates appear to be effective. We continue to monitor Eagle closely and to work with HPE and DDN.
One issue that affected some of you was GPU-equipped nodes failing in some situations, with the nodes reporting 'not enough power.' We added second power supplies to these nodes between February 20 and 24. This resolved the 'not enough power' message, which should make the GPU nodes more stable and reliable.
Large jobs that run openmpi with srun should now run more reliably. A patch applied to slurm/srun appears to improve the success rate.
New software has been installed on the interactive login nodes (el[1-4], ed[1-7]), which should prevent users from inadvertently overloading these nodes.
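For readers unfamiliar with the bracketed range notation used above, the following is a minimal sketch of how a name such as el[1-4] expands into individual short node names. The function name expand_hostlist is hypothetical and chosen only for illustration; it handles just the simple prefix[lo-hi] form and assumes nothing about the fully qualified hostnames, which this announcement does not state.

```python
import re


def expand_hostlist(pattern):
    """Expand a simple 'prefix[lo-hi]' range into individual short host names.

    Hypothetical helper for illustration only; it does not handle
    comma-separated lists or zero-padded ranges.
    """
    match = re.fullmatch(r"([A-Za-z][\w-]*?)\[(\d+)-(\d+)\]", pattern)
    if match is None:
        # Not a bracketed range: return the name unchanged.
        return [pattern]
    prefix = match.group(1)
    lo, hi = int(match.group(2)), int(match.group(3))
    return [f"{prefix}{n}" for n in range(lo, hi + 1)]


# The two login-node ranges named in the announcement.
login_nodes = expand_hostlist("el[1-4]") + expand_hostlist("ed[1-7]")
print(login_nodes)
```

On systems running Slurm, the command scontrol show hostnames performs a similar expansion for Slurm's own hostlist syntax.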
The external internet-facing DAV node (eagle-dav.nrel.gov) has been replaced with a new node that supports VirtualGL. For connecting to this external DAV node, please see Connecting to DAV Nodes Using FastX.