Running MPI Jobs on Eagle GPUs

To run MPI (Message Passing Interface) jobs on the Eagle system's NVIDIA GPUs, the MPI library must be "CUDA-aware," i.e., able to pass buffers that reside in GPU memory directly to MPI calls.

A suitable OpenMPI build is available via the openmpi/4.0.4/gcc+cuda module; it is currently provided in the test module tree.
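
To use it, add the test module tree to your module path and load the toolchain; these are the same lines used in the batch example below. As a quick check, ompi_info reports whether an OpenMPI build includes CUDA support:

ml use -a /nopt/nrel/apps/modules/test/modulefiles
ml gcc/8.4.0 cuda/10.2.89 openmpi/4.0.4/gcc+cuda

# A CUDA-aware build reports ...mpi_built_with_cuda_support:value:true
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value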

Interactive Use

srun does not work with this OpenMPI build when running interactively, so please use orterun instead. OpenMPI is, however, aware of the Slurm environment, so request the resources you need via srun (for example, the number of available "slots" is determined by the number of tasks requested via srun). Ranks are mapped round-robin to the GPUs on a node. nvidia-smi shows, for example,

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     24625      C   ./jacobi                                     803MiB |
|    0     24627      C   ./jacobi                                     803MiB |
|    1     24626      C   ./jacobi                                     803MiB |
+-----------------------------------------------------------------------------+

when oversubscribing 3 ranks onto the 2 GPUs via the commands

srun --nodes=1 --ntasks-per-node=3 --account=<allocation_id> --time=10:00 --gres=gpu:2 --pty $SHELL
...<getting node>...
orterun -np 3 ./jacobi

If more ranks are desired than were originally requested via srun, the OpenMPI flag --oversubscribe can be added to the orterun command.
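
For example, to place six ranks on the two GPUs of the single-node allocation requested above (more ranks than the three Slurm tasks), one could run

orterun -np 6 --oversubscribe ./jacobi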

Batch Use

An example batch script to run 4 MPI ranks across two nodes is as follows.

#!/bin/bash --login
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --time=2:00
#SBATCH --gres=gpu:2
#SBATCH --job-name=GPU_MPItest
#SBATCH --account=<allocation_id>
#SBATCH --error=%x-%j.err
#SBATCH --output=%x-%j.out

ml use -a /nopt/nrel/apps/modules/test/modulefiles
ml gcc/8.4.0 cuda/10.2.89 openmpi/4.0.4/gcc+cuda

cd $SLURM_SUBMIT_DIR
srun ./jacobi
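
If the script above were saved as, say, gpu_mpitest.sb (the filename is arbitrary), it would be submitted in the usual way:

sbatch gpu_mpitest.sb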

Multi-Process Service

To run multiple ranks per GPU, you may find it beneficial to use NVIDIA's Multi-Process Service (MPS). This process management service can increase GPU utilization, reduce on-GPU storage requirements, and reduce context switching. To do so, include the following functionality in your Slurm script or interactive session:

# MPS setup
export CUDA_MPS_PIPE_DIRECTORY=/tmp/scratch/nvidia-mps
if [ -d $CUDA_MPS_PIPE_DIRECTORY ]; then
    rm -rf $CUDA_MPS_PIPE_DIRECTORY
fi
mkdir $CUDA_MPS_PIPE_DIRECTORY

export CUDA_MPS_LOG_DIRECTORY=/tmp/scratch/nvidia-log
if [ -d $CUDA_MPS_LOG_DIRECTORY ]; then
    rm -rf $CUDA_MPS_LOG_DIRECTORY
fi
mkdir $CUDA_MPS_LOG_DIRECTORY

# Start user-space daemon
nvidia-cuda-mps-control -d
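# Optional: confirm the daemon started (the [n] keeps grep from matching itself)
ps -ef | grep [n]vidia-cuda-mps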

# Run OpenMPI job.
orterun ...

# To clean up afterward, shut down daemon, remove directories, and unset variables
echo quit | nvidia-cuda-mps-control
for i in $(env | grep CUDA_MPS | sed 's/=.*//'); do rm -rf "${!i}"; unset $i; done

For more information on MPS, see the NVIDIA MPS documentation.

