Running R Statistical Computing Environment Software on the Peregrine System

Learn how to run the R statistical computing environment software on the Peregrine system.

What is R?

R is an open source programming language designed for statistical computing and graphics. It is the current standard for the development of new statistical methodologies and enjoys a large user base.

Please consult the R website for more information related to the R project.

Running R Interactively on Peregrine

R is most commonly used via an interactive shell. To do this on Peregrine, first request an interactive compute node (see Running Interactive Jobs on Peregrine) using the qsub -I command.
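For example, a request for a single node for one hour might look like the following (the allocation handle is a placeholder; adjust the node, processor, and walltime values to fit your job):

$ qsub -I -A <allocation> -l nodes=1:ppn=24,walltime=1:00:00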

Once on a compute node, R can be loaded using the appropriate module (see Environment Modules on the Peregrine System). The standard version of R on Peregrine is built using the GNU tool chain. To avoid possible conflicts, remove any Intel compiler modules before loading R. One way to do this is via the following:

$ module purge
$ module load R

To access the R interactive console, simply type R at the command line. You will be prompted with the familiar R console in your terminal window.

  $ R

R version 3.1.0 (2014-04-10) -- "Spring Dance"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

R will attempt to install into ~/R/library by default (if it exists).
If it doesn't exist, please make this directory and restart R.
Use R (or Rscript) --no-site-file to suppress this message.

>
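If you have not yet created this personal library directory, you can do so from the shell before launching R:

$ mkdir -p ~/R/library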

Running R Scripts

Since running R programs line by line in the interactive console can be a little tedious, it is often better to combine R commands into a single script and have R execute them all at once. R scripts are just text files containing R commands with file extension .R.

  [user@foo cur.dir]$ cat hello_world.R
#hello_world.R

message = "Hi there!"
nums = sample(1:100, 5)
cat(message, "\n")
cat("Here are some random numbers: ", paste(nums, sep = ", "),"\n")

There are several options for running R scripts:

The source() function executes R scripts from inside the interactive console.

> source("hello_world.R")
Hi there!
Here are some random numbers:  100, 41, 14, 82, 63

The Rscript command runs R scripts from the command line. Output is sent to stdout.

   [user@foo cur.dir]$ Rscript hello_world.R

 R will attempt to install into ~/R/library by default (if it exists).
 If it doesn't exist, please make this directory and restart R. Send
 an email to Ryan Elmore if you have questions. Use R (or Rscript)
 --no-site-file to suppress this message.

Hi there!
Here are some random numbers:  71, 37, 50, 24, 90

R CMD BATCH is an older interface that behaves similarly to Rscript. All output is written to a corresponding .Rout file.

  [user@foo cur.dir]$ R CMD BATCH --no-site-file hello_world.R
[user@foo cur.dir]$ cat hello_world.Rout

> #hello_world.R
>
> message = "Hi there!"
> nums = sample(1:100, 5)
> cat(message, "\n")
Hi there!
> cat("Here are some random numbers: ", paste(nums, sep = ", "),"\n")
Here are some random numbers:  41 51 61 70 43
>
> proc.time()
   user  system elapsed
  0.188   0.024   0.277

Submitting Jobs to Peregrine

Another option for using R on Peregrine is to submit R scripts as part of batch job scripts to be run on non-interactive nodes (see Requesting Different Node Types When Submitting Jobs on the Peregrine System).

An example job script for running the hello_world.R example is below.

  #!/bin/bash
#PBS -l walltime=3600 # WALLTIME LIMIT
#PBS -N helloworld # Name of job

cd $PBS_O_WORKDIR

module purge
module load R
Rscript hello_world.R

Versions and Packages

R is a popular open source language with an active development community, and new versions are released frequently. The most current version of R on Peregrine is 3.3.3, found in the candidate modulefile directory. Commands for making this version available are shown below:

[user@foo cur.dir]$ module use /nopt/nrel/apps/modules/candidate/modulefiles/R
[user@foo cur.dir]$ module avail R
------------------------- /nopt/nrel/apps/modules/candidate/modulefiles -------------------------
R/3.2.1        R/3.2.2-base   R/3.3.3
-------------------------- /nopt/nrel/apps/modules/default/modulefiles --------------------------
R/3.1.0        R/3.2.2(default)
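With the candidate modulefile directory on your module path, the newer version can then be loaded by name (here assuming R/3.3.3, as listed above):

[user@foo cur.dir]$ module load R/3.3.3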

Installing New Packages

The install.packages() command in R will download new packages from the CRAN source directory and install them for your account. These will not be visible to other users.

Checking Installed Packages

The installed.packages() command lists details about all packages that are installed and visible to the current R session.

Loading Packages

Packages are loaded into the current R environment through the library() function.
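As a minimal sketch of this workflow (using ggplot2 as the example package; any CRAN package name works the same way):

# Install ggplot2 from CRAN into your personal library (~/R/library)
install.packages("ggplot2", repos = "https://cran.r-project.org")

# Check that it is now among the installed, visible packages
"ggplot2" %in% rownames(installed.packages())

# Load it into the current session
library(ggplot2)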

Graphics

R is commonly used to produce high-quality graphics based on data. This capability is built in and can be extended through packages such as ggplot2. To produce graphics on Peregrine, the easiest method is to write graphical output to an appropriate file type (PDF, JPEG, etc.). The file can then be moved to your local machine using command-line tools such as scp or rsync.

#Example R script for graphics output

library(ggplot2)
set.seed(8675309)
numbers = rnorm(200, sd = 2)
more.numbers = rnorm(100, mean = 10, sd = 2)

df = data.frame(values = c(numbers, more.numbers))

p = ggplot(df, aes(x = values, y = ..density..)) +
  geom_histogram(fill = "dodgerblue",
                 colour = "black",
                 alpha = .5,
                 binwidth = .5) +
  geom_density(size = 1.5) +
  labs(y = "Density", x = "Value",
       title = "Histogram Example")

png(file = "histogram_example.png")
print(p)
dev.off()
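For example, assuming the script above was run in your home directory, something like the following (run from your local machine, with placeholder username and login hostname) would retrieve the image:

$ scp <username>@<peregrine-login>:~/histogram_example.png .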

Parallel Programming in R

Programming in R on Peregrine has two distinct advantages. First, running jobs on a remote system means you do not have to tie up your local machine, which is particularly useful for jobs that take considerable time and resources to run. Second, the increased computational capabilities of Peregrine provide an opportunity to improve performance through parallel processing. R code, like that of many programming languages, is typically written and executed serially, which means the added benefits of having multiple processing cores available are typically lost.

A major goal of the R community in recent years has been the development of specialized libraries and programming paradigms to better leverage modern HPC systems. The CRAN task view for High Performance Computing and Parallel Programming contains a detailed list of packages that address various aspects of these problems. Notable examples are:

  • parallel
  • foreach
  • multicore
  • snow
  • pbdR
  • Rmpi

Each package includes in-depth documentation and examples showing how to implement parallel processing in R code. Learning these packages does require a moderate investment of time, but for many large problems the improvement in computational efficiency dramatically outweighs that initial investment.
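As a small taste of the programming style, the following sketch uses the base parallel package to replace a serial lapply() with its forked multicore counterpart, mclapply(). The toy task and the core count are illustrative assumptions; mc.cores should not exceed the processors you requested on the node.

library(parallel)

# A toy task: estimate a mean from a large simulated sample
simulate.once <- function(i) mean(rnorm(1e6))

# Serial version
serial.results <- lapply(1:100, simulate.once)

# Parallel version: the work is forked across cores on a single node
parallel.results <- mclapply(1:100, simulate.once, mc.cores = 24)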

The pbdR project "enables high-level distributed data parallelism in R, so that it can easily utilize large HPC platforms with thousands of cores, making the R language scale to unparalleled heights." There are several packages within this project: pbdMPI for easy MPI work, pbdDMAT for distributed data matrices and associated functions, and pbdDEMO for a tutorial/vignette describing most of the project's details.

The following script is a simple helloworld.R example using the pbdMPI package.

  library(pbdMPI, quiet = TRUE)
init()
.comm.size <- comm.size()
.comm.rank <- comm.rank()
.hostname <- Sys.info()["nodename"]
msg <- sprintf("I am %d of %d on %s.\n", .comm.rank, .comm.size, .hostname)
comm.cat(msg, all.rank = TRUE, quiet = TRUE)
# print from a single randomly chosen rank (valid ranks are 0 to comm.size - 1)
comm.cat(msg, rank.print = sample(0:(.comm.size - 1), size = 1))
comm.cat(msg, rank.print = sample(0:(.comm.size - 1), size = 1), quiet = TRUE)
finalize()

You could run this interactively from a compute node or by submitting it to the job scheduler using a shell script similar to the one given below. For example, you would submit it with qsub helloworld.sh from a login node, provided you name the script accordingly.

  #!/bin/bash
#PBS -A XXXXXX
#PBS -l walltime=00:00:05:00 # WALLTIME limit
#PBS -l nodes=2:ppn=24 # Number of nodes and processes per node
# Specify :ppn=x in the previous line if you want
# to use x processors on each node
# if there are core/memory concerns
#PBS -M first.last@nrel.gov
#PBS -m be # (b) begin, (e) end, (a) abort

cd $PBS_O_WORKDIR # Runs the job in the current working directory
set -x

module purge
module load openmpi-gcc
module load R # Load the module for the current R version

INPUT_BASENAME=helloworld # JOB NAME - USER INPUT PARAMETER
JOB_FILE=$INPUT_BASENAME.R
OUT_FILE=$INPUT_BASENAME.Rout
mpirun -np 48 Rscript --no-site-file $JOB_FILE > $OUT_FILE

In either case (interactive or queue submission), the output produced from the simple helloworld.R script should look like this.

  I am 0 of 48 on n1094.
I am 1 of 48 on n1094.
I am 2 of 48 on n1094.
I am 3 of 48 on n1094.
I am 4 of 48 on n1094.
I am 5 of 48 on n1094.
I am 6 of 48 on n1094.
I am 7 of 48 on n1094.
I am 8 of 48 on n1094.
I am 9 of 48 on n1094.
... 

A more complicated example (dd-regression.R) involves computing the regression coefficients for a linear regression using distributed matrices. You can vary p to increase the dimension of the design matrix. 

  library(pbdDMAT, quiet = TRUE)
init.grid()

### Generate balanced fake data.
comm.set.seed(1234, diff = TRUE)
N <- 100
p <- 2

dx <- ddmatrix(rnorm(N * p), ncol = p)
dbeta <- ddmatrix(1:p, ncol = 1)
depsilon <- ddmatrix(rnorm(N), ncol = 1)
dy <- dx %*% dbeta + depsilon
dols <- solve(t(dx) %*% dx) %*% t(dx) %*% dy
ols <- as.matrix(dols, proc.dest = 0)
comm.cat("Straight matrix multiplication:\n", quiet = TRUE)
comm.print(ols, quiet = TRUE)

## alternatively
dres <- lm.fit(dx, dy)
res <- as.matrix(dres$coef, proc.dest = 0)
comm.cat("\nUsing lm.fit:\n", quiet = TRUE)
comm.print(res, quiet = TRUE)

## Undistribute and compute (on a single core)
x <- as.matrix(dx, proc.dest = 0)
y <- as.matrix(dy, proc.dest = 0)
if (comm.rank() == 0) {
  serial.coef <- lm(y ~ x - 1)$coef
}
comm.cat("\nSerial fit:\n", quiet = TRUE)
comm.print(serial.coef, quiet = TRUE)
finalize()

The output of running the dd-regression.R script:

  Using 8x6 for the default grid size

Straight matrix multiplication:
COMM.RANK = 0
          [,1]
[1,] 0.9422302
[2,] 2.0726872

Using lm.fit:
          [,1]
[1,] 0.9422302
[2,] 2.0726872

Serial fit:
       x1        x2
0.9422302 2.0726872

As part of the NREL LDRD project "A Framework for Comparison of Spatiotemporal and Time Series Datasets" (Dan Getman, PI), we investigated several scaling factors involved in running a principal components analysis (PCA) on Peregrine. Specifically, we recorded the execution time of PCA for datasets of varying size (7.6 megabytes [MB], 76 MB, 763 MB, and 7.45 gigabytes [GB]), differing numbers of cores (96, 144, 192, 240, and 288), and numbers of cores per node (16 and 24). The results discussed below are based on the average timing over ten independent trials.

Be careful when requesting the number of compute nodes for a given job; more nodes do not always yield better outcomes! For small datasets (< 100 MB), requesting fewer nodes yields better results. However, a substantial speedup is attained when using 288 cores on the 7+ GB datasets relative to just using 96 cores. 

The skeleton R script for performing this PCA experiment is given below. 

  library(pbdDMAT, quiet = TRUE)
init.grid()

## Prelims
comm.set.seed(19838, diff = TRUE)
kRow <- 10e3
kCols <- 10^c(2:5)

# Normal parms
kMean <- 0
kSD <- 1

# blocking
kBlock <- 4
# replications
kReps <- 10

for (i in 1:4) {

  comm.cat(sprintf("\n%d rows and %d columns: %s\n", kRow, kCols[i],
                   Sys.time()), quiet = TRUE)

  # benchmark data generation
  datatimes <- system.time(
    dx <- ddmatrix("rnorm",
                   nrow = kRow,
                   ncol = kCols[i],
                   bldim = kBlock,
                   mean = kMean,
                   sd = kSD,
                   ICTXT = 0))[3]
  datatimes <- allreduce(datatimes, op = 'max')

  size <- kCols[i] * kRow * 8 / 1024
  unit <- "kb"

  if (log10(size) > 3) {
    size <- size / 1024
    unit <- "mb"
  }

  if (log10(size) > 3) {
    size <- size / 1024
    unit <- "gb"
  }

  comm.cat(sprintf("\n%.2f %s of data generated in %.3f seconds\n\n",
                   size, unit, datatimes), quiet = TRUE)

  # benchmark the PCA itself, averaged over kReps replications
  times <- sapply(1:kReps, function(.) system.time(prcomp(dx))[3])
  total <- allreduce(sum(times), op = 'max')
  avg <- total / kReps

  bench <- data.frame(operation = "prcomp(dx)",
                      mean.runtime = avg,
                      total.runtime = total)
  row.names(bench) <- ""
  comm.print(bench, quiet = TRUE)
}

finalize()

The shell script used to run the PCA comparison is based on the following. Note that the openmpi-gcc module is necessary for these examples and is loaded in this script.

  #!/bin/bash
#PBS -A XXXXXX
#PBS -l walltime=00:06:00:00 # WALLTIME limit
#PBS -l nodes=6:ppn=24 # Number of nodes and processes per node
# Specify :ppn=x in the previous line if you want
# to use x processors on each node
# if there are core/memory concerns
#PBS -M first.last@nrel.gov
#PBS -m be # (b) begin, (e) end, (a) abort

cd $PBS_O_WORKDIR

module purge
module load openmpi-gcc
module load R

INPUT_BASENAME=pca # JOB NAME - USER INPUT PARAMETER
JOB_FILE=$INPUT_BASENAME.R
OUT_FILE=$INPUT_BASENAME-$PBS_NP.Rout
mpirun -np $PBS_NP Rscript $JOB_FILE > $OUT_FILE

A host of benchmarking examples comes with the pbdDEMO package; on Peregrine, they can be found in /nopt/nrel/apps/R/3.1.0-gcc/lib64/R/library/pbdDEMO/Benchmarks.

One of several ways to use R in parallel is the Rmpi library. For more information, see The Comprehensive R Archive Network and the University of Western Ontario's website.

To test parallel R in the interactive queue, start an interactive session with multiple nodes:

qsub -A ABC000  -q inter -I -l nodes=2:ppn=8,walltime=30:00

Once the resources are allocated, run the following to start R:

module load R
R -q

This starts an interactive R session. Once you are in R, load the Rmpi package and spawn the slave processes:

> library("Rmpi")
> mpi.spawn.Rslaves(nslaves=16)

Sample output:       
16 slaves are spawned successfully. 0 failed.
master  (rank 0 , comm 1) of size 17 is running on: rr14 
slave1  (rank 1 , comm 1) of size 17 is running on: rr14 
slave2  (rank 2 , comm 1) of size 17 is running on: rr14 
slave3  (rank 3 , comm 1) of size 17 is running on: rr14 
... ... ...
slave15 (rank 15, comm 1) of size 17 is running on: rr15 
slave16 (rank 16, comm 1) of size 17 is running on: rr15

> mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))

Sample output:
$slave1
[1] "I am 1 of 17"
$slave2
[1] "I am 2 of 17"
... ... ...
$slave16
[1] "I am 16 of 17"

> mpi.close.Rslaves()
> mpi.quit()

The mpi.close.Rslaves and mpi.quit commands stop the MPI processes and exit R.

To run the same example non-interactively, execute the script in batch mode:

R --no-save -q < Hello.R > Hello.out

Hello.R

# Taken from this site:
#   http://math.acadiau.ca/ACMMaC/Rmpi/sample.html
# Load the R MPI package if it is not already loaded.
if (!is.loaded("mpi_initialize")) {
    library("Rmpi")
    }
# Spawn as many slaves as possible:
#   mpi.spawn.Rslaves()
# The above option does not work for now, so specify the number
# of slaves explicitly as below.
mpi.spawn.Rslaves(nslaves=16)
#
# In case R exits unexpectedly, have it automatically clean up
# resources taken up by Rmpi (slaves, memory, etc...)
.Last <- function(){
    if (is.loaded("mpi_initialize")){
        if (mpi.comm.size(1) > 0){
            print("Please use mpi.close.Rslaves() to close slaves.")
            mpi.close.Rslaves()
        }
        print("Please use mpi.quit() to quit R")
        .Call("mpi_finalize")
    }
}
# Tell all slaves to return a message identifying themselves
mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))
# Tell all slaves to close down, and exit the program
mpi.close.Rslaves()
mpi.quit()

Create a PBS script file, myscript.pbs, as below.

#!/bin/bash
#PBS -l walltime=00:20:00      # WALLTIME
#PBS -l nodes=2:ppn=16         # NUMBER OF NODES 
#PBS -A <allocation>           # ALLOCATION ID
#PBS -N your_simulation        # NAME OF JOB
#PBS -j oe                     # JOIN stdout AND stderr TO SINGLE PIPE
#PBS -q short # USE short QUEUE

module load R                   # Load the module for the current R version

INPUT_BASENAME=Hello            # JOB NAME - USER INPUT PARAMETER
JOB_FILE=$INPUT_BASENAME.R      # INPUT FILE NAME - USER INPUT PARAMETER
OUT_FILE=$INPUT_BASENAME.Rout

# Run Job
echo "------- Running R ------"
date
R --no-save -q < $JOB_FILE > $OUT_FILE
date
echo "------- Finished R job ------"

Submit it to the queue:

qsub myscript.pbs

The status of the submitted job can be checked with:

qstat -u $USER

Additional Information

For general R or statistics questions, contact Bruce Bugbee. Additionally, NREL has an internal R Users Group that meets periodically to highlight interesting packages, problems, and share experiences related to R programming. Contact Daniel Inman for more details.

References

Rmpi: Interface Wrapper to MPI (Message-Passing Interface)

University of Western Ontario – Rmpi News

State of the Art in Parallel Computing with R