November 14, 2019
Over the summer we had a few issues with Lustre (/scratch, /projects, /shared-projects, and /datasets). A combination of a firmware issue and a hardware issue would cause one controller in a dual set to fail, and the second controller would then not take over. We applied a firmware fix during a previous system time, and during the October system time we replaced all the controllers. We also resolved the remaining issues with Lustre and returned to service the targets (the Lustre equivalent of disks) that had been made read-only and had their data migrated off them. Finally, we added more targets to the cluster.
We’ve been investigating issues with some MPI jobs at scale. Although Intel MPI generally functions well, we are currently working with multiple vendors to determine the underlying causes of the observed problems.