Running MPI Jobs on Eagle GPUs
To run MPI (Message Passing Interface) jobs on the Eagle system's NVIDIA GPUs, the MPI library must be "CUDA-aware."
A suitable OpenMPI build has been made available via the openmpi/4.0.4/gcc+cuda module. This module is currently in test.
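One way to confirm that a loaded OpenMPI build is CUDA-aware is to query ompi_info for its CUDA support parameter. The following is a minimal sketch, assuming ompi_info from the loaded module is on your PATH and that the build reports the mpi_built_with_cuda_support parameter, as OpenMPI 4.x builds generally do. The helper name is_cuda_aware is hypothetical.

```shell
#!/bin/bash
# Reads `ompi_info --parsable --all` output on stdin and succeeds only if
# the build reports CUDA support. (is_cuda_aware is a hypothetical helper name.)
is_cuda_aware() {
    grep -q 'mpi_built_with_cuda_support:value:true'
}

if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info --parsable --all | is_cuda_aware; then
        echo "This OpenMPI build is CUDA-aware"
    else
        echo "This OpenMPI build is NOT CUDA-aware"
    fi
else
    # Assumption: the module has not been loaded yet in this shell
    echo "ompi_info not found; load the OpenMPI module first"
fi
```

If the check reports no CUDA support, a different OpenMPI module is probably the one loaded in the current shell.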
Interactive Use
srun does not work with this OpenMPI build when running interactively, so please use orterun instead. OpenMPI is still aware of the Slurm environment, however, so request the resources you need via srun; for example, the number of available "slots" is determined by the number of tasks requested via srun. Ranks are mapped round-robin to the GPUs on a node. nvidia-smi shows, for example,
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     24625      C   ./jacobi                                     803MiB |
|    0     24627      C   ./jacobi                                     803MiB |
|    1     24626      C   ./jacobi                                     803MiB |
+-----------------------------------------------------------------------------+
when oversubscribing 3 ranks onto the 2 GPUs via the commands
srun --nodes=1 --ntasks-per-node=3 --account=<allocation_id> --time=10:00 --gres=gpu:2 --pty $SHELL
...<getting node>...
orterun -np 3 ./jacobi
If more ranks are desired than were originally requested via srun, the OpenMPI flag --oversubscribe can be added to the orterun command.
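The round-robin placement described above can be illustrated with a little arithmetic: a rank with local index r on a node with n GPUs lands on GPU r mod n. The sketch below only reproduces that arithmetic for the 3-rank, 2-GPU case; the actual device selection is made by the application and MPI runtime, and gpu_for_rank is a hypothetical helper name.

```shell
#!/bin/bash
# gpu_for_rank LOCAL_RANK NUM_GPUS
# Prints the GPU index a rank would get under round-robin mapping.
# (Hypothetical helper for illustration only.)
gpu_for_rank() {
    local rank=$1 ngpus=$2
    echo $(( rank % ngpus ))
}

# Three ranks oversubscribed onto two GPUs
for rank in 0 1 2; do
    echo "rank $rank -> GPU $(gpu_for_rank "$rank" 2)"
done
```

Two ranks end up on GPU 0 and one on GPU 1, which matches the nvidia-smi listing above.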
Batch Use
An example batch script to run 4 MPI ranks across two nodes is as follows.
#!/bin/bash --login
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --time=2:00
#SBATCH --gres=gpu:2
#SBATCH --job-name=GPU_MPItest
#SBATCH --account=<allocation_id>
#SBATCH --error=%x-%j.err
#SBATCH --output=%x-%j.out
ml use -a /nopt/nrel/apps/modules/test/modulefiles
ml gcc/8.4.0 cuda/10.2.89 openmpi/4.0.4/gcc+cuda
cd $SLURM_SUBMIT_DIR
srun ./jacobi
Multi-Process Service
To run multiple ranks per GPU, you may find it beneficial to run NVIDIA's Multi-Process Service (MPS). This process management service can increase GPU utilization, reduce on-GPU storage requirements, and reduce context switching. To use it, include the following commands in your Slurm script or interactive session:
# MPS setup
export CUDA_MPS_PIPE_DIRECTORY=/tmp/scratch/nvidia-mps
# Remove any stale pipe directory left by a previous run
if [ -d "$CUDA_MPS_PIPE_DIRECTORY" ]
then
    rm -rf "$CUDA_MPS_PIPE_DIRECTORY"
fi
mkdir -p "$CUDA_MPS_PIPE_DIRECTORY"
export CUDA_MPS_LOG_DIRECTORY=/tmp/scratch/nvidia-log
# Remove any stale log directory left by a previous run
if [ -d "$CUDA_MPS_LOG_DIRECTORY" ]
then
    rm -rf "$CUDA_MPS_LOG_DIRECTORY"
fi
mkdir -p "$CUDA_MPS_LOG_DIRECTORY"
# Start user-space daemon
nvidia-cuda-mps-control -d
# Run OpenMPI job.
orterun ...
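# To clean up afterward, shut down the daemon, remove the directories, and unset the variables
echo quit | nvidia-cuda-mps-control
# Match only variable names that begin with CUDA_MPS, not values that happen to contain it
for i in $(env | grep '^CUDA_MPS' | sed 's/=.*//'); do rm -rf "${!i}"; unset "$i"; done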
For more information on MPS, see the NVIDIA guide.
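The MPS setup and cleanup steps above can also be wrapped in two functions, so that cleanup can be registered with trap and still runs if the job exits early. This is a minimal sketch rather than a tested site recipe: the function names mps_setup_dirs and mps_cleanup are hypothetical, and the directory layout simply follows the setup shown earlier.

```shell
#!/bin/bash
# Create fresh MPS pipe and log directories under a base path
# (defaults to /tmp/scratch, as in the setup above).
mps_setup_dirs() {
    local base=${1:-/tmp/scratch}
    export CUDA_MPS_PIPE_DIRECTORY="$base/nvidia-mps"
    export CUDA_MPS_LOG_DIRECTORY="$base/nvidia-log"
    rm -rf "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"
    mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"
}

# Shut down the daemon (if the control binary is available), then remove
# the directories and unset the variables.
mps_cleanup() {
    if command -v nvidia-cuda-mps-control >/dev/null 2>&1; then
        echo quit | nvidia-cuda-mps-control
    fi
    [ -n "${CUDA_MPS_PIPE_DIRECTORY:-}" ] && rm -rf "$CUDA_MPS_PIPE_DIRECTORY"
    [ -n "${CUDA_MPS_LOG_DIRECTORY:-}" ] && rm -rf "$CUDA_MPS_LOG_DIRECTORY"
    unset CUDA_MPS_PIPE_DIRECTORY CUDA_MPS_LOG_DIRECTORY
}

# Typical use inside a job script (left commented out here):
#   mps_setup_dirs
#   trap mps_cleanup EXIT
#   nvidia-cuda-mps-control -d
#   orterun ...
```

Registering the cleanup with trap means the daemon is stopped and the directories are removed even when the MPI run fails partway through.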