
Announcements

Read announcements for NREL’s high-performance computing (HPC) system users. 

Eagle FY 21 HPC Allocation Call and Process Webinar

June 05, 2020

NREL’s annual allocation cycle for running on Eagle in FY21 is now open. Users are encouraged to submit requests for currently funded projects, proposed research, and projects where a proposal is pending. This includes EERE projects, non-EERE DOE projects such as ARPA-E and Office of Science, other federal work, and SPP projects. Allocation requests are due July 1. For more information about allocations, please see https://www.nrel.gov/hpc/resource-allocation-requests.html

To help users, the NREL allocations team will host a webinar at 1:30 p.m. Mountain Time on Thursday, June 11. NREL HPC staff will walk through the process and answer questions.

FY21 NREL HPC Allocation Process Webinar June 11th

Thursday, June 11, 2020 1:30 pm | 1 hour 30 minutes | (UTC-06:00) Mountain Time (US & Canada)

https://meetingsamer7.webex.com/meetingsamer7/j.php?MTID=mf2b66669ba77b5c0ddf3192965a63289


Intro to HPC - Summer Series

June 05, 2020

HPC User Operations will host an Intro to HPC workshop series for new HPC users. The series covers introductory command-line navigation, connecting to NREL HPC systems, job submission strategies, Eagle file systems, and more. Note: an NREL HPC user account is required.

Sessions:

HPC Linux Basics:

Thursday, June 18th - 1:00PM - 2:30PM

-Set up your system to use HPC
-Cover the basics of using the Linux command line
-Basics of remote system access (SSH) and Linux permissions

Introduction to NREL HPC Systems

Thursday, June 25th - 10:30AM - 12:00PM

-Explore HPC File Systems
-Software environments and modules
-Introduction to Slurm
-Getting productive with Eagle

Interactive Workshop

Wednesday, July 1st - 1:00PM - 3:00PM

Bring your questions! Where do you need help? If we don't have topics, we may demo:

  • Installing applications on Eagle / conda environments
  • Maintaining your own environment modules
  • Bash tips and tricks
  • Version control - Git

Please RSVP to any or all sessions by sending an email to Jennifer.Southerland@nrel.gov. Please visit the HPC Training Calendar for updates.


Queue Wait Times

June 05, 2020

At the start of April, the 'queue depth' (the time it would take Eagle to complete all the jobs in the queue, based on wall time limits) increased significantly, from the historically common level of under 4 days to over 3 weeks. This isn't unprecedented, but it has persisted longer than in the past, and we realize this creates a challenge for users. Eagle is making progress and has reduced the backlog to a bit under 2 weeks. In light of the sustained high volume of work, we have taken the following actions:

  • Tuned scheduler performance for longer wait times. Previously it was optimized for wait times of up to 5 days.
  • Decreased the ability of standby jobs to accumulate priority. Previously they would accumulate priority, albeit slowly, and could eventually run if they waited long enough. This should no longer happen if there are any 'non-standby' jobs in the queue. Standby jobs may still run, using Slurm backfill scheduling, if they do not impact the start time of any priority jobs.
  • Tentatively delayed development of a new system image to next quarter to reduce scheduled downtime this quarter.
  • Implemented the published allocation reduction policy to bring remaining allocations into alignment with available hours.
  • Improved introspection of log data to monitor throughput and queue performance, to help with ongoing efforts to adapt to changing workloads. We plan to make this data visible through a web-based interface in the near future; look for an announcement in the coming weeks.
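
As a rough illustration of the 'queue depth' figure quoted above, the following Python sketch estimates how long a queue would take to drain if every job ran to its full wall time limit and the machine were perfectly packed. The class name, function name, job list, and 100-node system size are all hypothetical; this is not how NREL computes the number, and it ignores jobs that are already running.

```python
from dataclasses import dataclass


@dataclass
class QueuedJob:
    nodes: int             # number of nodes the job requests
    walltime_hours: float  # requested wall time limit, in hours


def queue_depth_days(jobs, total_nodes):
    """Estimate the days needed to drain the queue, assuming every job
    runs to its wall time limit and nodes are never left idle."""
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    # Total work in the queue, measured in node-hours.
    node_hours = sum(job.nodes * job.walltime_hours for job in jobs)
    # Spread that work over the whole machine and convert hours to days.
    return node_hours / total_nodes / 24.0


# Hypothetical queue on a hypothetical 100-node system.
jobs = [QueuedJob(10, 48.0), QueuedJob(50, 24.0), QueuedJob(4, 240.0)]
print(f"{queue_depth_days(jobs, 100):.2f} days")
```

Because jobs often finish before their wall time limit, an estimate of this kind tends to overstate the real wait.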

The scheduling algorithm is sophisticated and designed to maximize the overall productivity of the machine, which sometimes makes it difficult to see why particular jobs start before others. The factors that go into calculating job priority are described on the following pages:

https://www.nrel.gov/hpc/eagle-job-partitions-scheduling.html

https://www.nrel.gov/hpc/eagle-job-priorities.html
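
One reason a lower-priority job can start before others is backfill: a job may run on idle nodes as long as it does not delay the start of any higher-priority job. The Python sketch below is a deliberately simplified illustration of that rule, not Slurm's actual backfill algorithm. The function name and all parameters are hypothetical.

```python
def can_backfill(job_nodes, job_walltime_hours, idle_nodes,
                 hours_until_priority_start):
    """Return True if a job fits on the currently idle nodes and would
    finish, at its wall time limit, before the next priority job is
    scheduled to start on those nodes (simplified model)."""
    fits = job_nodes <= idle_nodes
    finishes_in_time = job_walltime_hours <= hours_until_priority_start
    return fits and finishes_in_time


# Hypothetical situation: 8 idle nodes, next priority job starts in 3 hours.
print(can_backfill(4, 2.0, 8, 3.0))   # short, small job
print(can_backfill(4, 6.0, 8, 3.0))   # wall time too long
print(can_backfill(16, 1.0, 8, 3.0))  # needs more nodes than are idle
```

Under this model, requesting a shorter, more accurate wall time makes a job easier to fit into gaps in the schedule.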


Stability of Large Jobs

June 05, 2020

In early May, HPE and NREL formed a joint technical team along with Mellanox to investigate errors affecting large-job productivity on Eagle that appeared intermittently under heavy network load. Last Friday the team identified the root cause and is working on a mitigation strategy along with an eventual long-term solution. Specifically, an issue has been identified with the version of OFED the system was delivered with. OFED ties together the Linux kernel, the InfiniBand adapters, the fabric, and the MPI software. We have identified a version of OFED and associated firmware to upgrade to in order to run large jobs more reliably. To deploy these fixes we will need to create a new image for Eagle. Deploying it will require a full system outage, and we are sensitive to the large backlog of work in the Eagle queue. In the meantime, it appears that jobs running on adjacent nodes are less affected, so topology-aware placement may help.
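
As a hedged sketch of what topology-aware placement could look like from the user side, the Python snippet below builds an `sbatch` command line using Slurm's `--switches=<count>[@<max-time>]` option, which asks the scheduler to place a job under at most a given number of leaf switches and to wait up to a maximum time for such a placement. Whether this option has any effect depends on how the scheduler's topology is configured, which is an assumption here. The helper name, script name, node count, and wait time are hypothetical.

```python
def sbatch_command(script, nodes, max_switches=1, max_wait="2:00:00"):
    """Build an sbatch argument list that requests placement under at
    most `max_switches` leaf switches, waiting up to `max_wait`
    (hours:minutes:seconds) for such a placement to become available."""
    return [
        "sbatch",
        f"--nodes={nodes}",
        f"--switches={max_switches}@{max_wait}",
        script,
    ]


# Hypothetical 64-node job script; the list could be passed to subprocess.run.
cmd = sbatch_command("large_job.sh", nodes=64)
print(" ".join(cmd))
```

Asking for fewer switches can lengthen the time a job waits in the queue, so the maximum wait is worth setting deliberately.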

Upgrade Image on Eagle

June 05, 2020

Eagle has been running a CentOS 7.4 image since its deployment in late 2018. The ACO team has been working on upgrading the image to CentOS 7.7, and the new image is anticipated to be rolled out this July.

The new image will run an upgraded kernel, Lustre client, OFED drivers, and InfiniBand HCA firmware. This update will address requests for a newer kernel, issues related to large job productivity, and vulnerabilities in the current kernel, and it will align Eagle with the new software/firmware stack recipe provided by HPE and Mellanox.

FastX, which runs on the user-facing Data Analysis and Visualization (DAV) nodes, will be upgraded to a newer release to address current issues with load balancing.


Cybersecurity on Eagle

June 05, 2020

The Advanced Computing Operations team has been working on a DOE-mandated cybersecurity assessment and mitigation effort, tied to a PEMP goal, that is due to be completed in mid-June. Reduction of patchable security vulnerabilities continues. The last day of reportable updates will be June 12, after which final reports will be generated, all data will be compiled, and the information will be sent to DOE.

Eagle Expansion Nodes for AMO and WETO

June 05, 2020

The Advanced Computing Operations team is working with Hewlett Packard Enterprise this week to install 432 additional nodes and 2 petabytes of storage on Eagle, purchased by AMO and WETO. The installation team includes about 10 people who are working to place, plumb, fill (with water), connect, configure, and test the new equipment. Installation activities started at 08:00 on Monday, June 1, and are expected to be complete by close of business on Friday, June 5. Following successful installation, the new hardware will be exercised and tested, then released to production next week.


Advanced Computing Operations

May 11, 2020

HPC Systems and Operations is now Advanced Computing Operations (ACO). The ACO team supports operation of the ESIF HPC User Facility and NREL projects using the Amazon Web Services (AWS) Cloud.

Intended Use of /projects and /scratch

May 11, 2020

/projects and /scratch are shared resources on Eagle. We encourage users to review the published Shared Storage Usage Policy.

/projects is intended for use by projects with an approved Eagle allocation and should contain only the critical information and programs necessary for the project to succeed, up to the capacity approved in the allocation request in Lex (https://lex.hpc.nrel.gov/projects/<Lex project number>/award/). We recommend regularly copying critical information in /projects to the Mass Storage System (MSS). We anticipate that quotas matching approved allocations for /projects will be implemented in the near future.

To see a project's usage on Eaglefs, you can run the following command, substituting your project name for csc000:

Continue reading