R Example script with a job array
This tutorial demonstrates the basics of how to create an R environment on ScienceCluster with specific packages of interest. It also demonstrates the use of a job array. This demonstration is for basic R usage without GPUs or large memory requirements.
Preparing the environment
To get started, log in to the cluster from a terminal, and load the mamba module:
module load mamba
You will use mamba to create a virtual environment for your R installation. Mamba is a drop-in replacement for conda.
To create a virtual environment called renv and install R within it, run
mamba create -n renv -c conda-forge r-base -y
To install a specific version of R, use this command
mamba create -n renv -c conda-forge "r-base==4.2.2" -y
To activate the virtual environment, from either the login node or a compute node, run
source activate renv
Once you've created the R environment, also consider whether you need other modules available on the cluster (e.g., GCC, which is provided as a module and can be loaded via a `module load gcc` command). Additionally, ensure that the packages you intend to use are available in your R environment.
To run your code interactively on a compute node instead of the login node, run
srun --pty -n 1 -c 4 --time=01:00:00 --mem=8G bash -l
Then, once on the compute node, run
module load mamba
source activate renv
R
Once the interactive R session has started, you can list the packages that are already installed on the system with the following command
installed.packages()
If the packages you need are listed after this command call, you can move forward to setting up your submission script. If you need more packages, you can install them from the interactive session with `install.packages()`. For instance, to use the `tictoc` package, from the interactive R session you would run
install.packages('tictoc')
Note
⚠️ Make sure to quit the interactive R session once you've successfully installed your packages of interest. To quit R, simply run `q()` and follow the prompts. You do not need to save your workspace image, so you can enter `n` when asked.
When you install packages for the first time, you will be asked whether you'd like to install them into a user library. You can answer `yes` to these prompts if the default user-library location suggested by R suits your needs. If you have a large number of packages to install and the space in your `home` directory may not be sufficient, instead consider passing another location in your `data` area via the `lib` argument of the `install.packages()` function. For example, to install the `tictoc` package into a directory called `rpackages` within your user `data` area, you would first create the directory from the login node command line with
mkdir -p /data/$USER/rpackages
Then specify it as the location for your R packages using the `lib` argument from within an interactive R session run from the login node, like so
username = Sys.getenv()['USER']
install.packages('tictoc', lib=paste("/data/",username,"/rpackages",sep=""))
Note
⚠️ Notice that this code snippet (and the one below) uses the `Sys.getenv()` function to retrieve the `$USER` variable from the cluster environment. The `paste()` function is then used to construct the full path as a string. After `username = Sys.getenv()['USER']` has been run, you can call `print(paste("/data/",username,"/rpackages",sep=""))` to see this string.
This new directory can be added to `.libPaths()` in R, or you can simply specify it when loading the package. For example, to load the package you just installed in this new location, you would use the following lines at the beginning of your R script. (Make sure to load your required packages within the R script that you submit.)
username = Sys.getenv()['USER']
library("tictoc", lib.loc=paste("/data/",username,"/rpackages",sep=""))
To deactivate the virtual environment, run
conda deactivate
Preparing the job submission script
Once you've installed the packages you'll use in your analysis, you're ready to prepare the job submission script. This particular submission script uses a job array. Job arrays are useful when you need to run the same analytical code across numerous datasets or parameter sets. The following command creates a submission script called `arrayscript.sh`.
cat << EOF > arrayscript.sh
#!/bin/bash
#SBATCH --job-name=arrayJob
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4GB
#SBATCH --output=arrayJob_%A_%a.out
#SBATCH --error=arrayJob_%A_%a.err
#SBATCH --array=1-3
module load mamba
source activate renv
# Print this sub-job's task ID
echo "My SLURM_ARRAY_TASK_ID: " \$SLURM_ARRAY_TASK_ID
Rscript --vanilla testarray.R \$SLURM_ARRAY_TASK_ID
EOF
To view the contents of the file, run `cat arrayscript.sh`.
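One detail of the `cat << EOF` pattern is worth highlighting: because the `EOF` delimiter is unquoted, the shell expands `$` variables while writing the file, which is why the Slurm variables above are escaped as `\$SLURM_ARRAY_TASK_ID`. A minimal sketch of the difference (the `VAR` name and `demo.txt` file are made up for illustration):

```shell
# With an unquoted EOF delimiter, $VAR expands when the heredoc is written,
# while \$VAR is written literally and only expands later, when the
# generated script itself runs (e.g., under Slurm).
VAR="now"
cat << EOF > demo.txt
expanded: $VAR
literal: \$VAR
EOF
cat demo.txt
```

The file therefore contains `expanded: now` and the untouched string `literal: $VAR`, which Slurm resolves at run time.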
There are a few aspects of this submission script to note:
- First, the `--output` and `--error` flags specify file names in a format that identifies them by both the overall Job ID and the sub-job ID. In other words, when you submit a job array you receive a single Job ID for the overall submission, and every individual job within the array receives a sub-job ID. The format, as specified in these lines, will be similar to `arrayJob_123456_1`, where `123456` is an example Job ID and `1` is an example sub-job ID. Note: to achieve this format, `%A` stands for the overall Job ID in the file name, and `%a` stands for the sub-job ID.
- Second, the `--array` flag specifies the array used for this job submission. Specifically, an array of `1-3` will expand so that there are 3 sub-jobs using 3 values (i.e., `1`, `2`, and `3`). The array value for each sub-job can be used in the job output and error files as well as within the submitted code script (see below). Other array values can be used; for example, `--array=1,2,5,19,27` would specify the values `1`, `2`, `5`, `19`, and `27`. Alternatively, `--array=1-7:2` would use the values between `1` and `7` with a step size of 2 (i.e., `1`, `3`, `5`, and `7`). Creative uses of the `--array` input and existing variables in the script allow you to use this functionality with great flexibility. For example, a simple set of array values (like `--array=1-3`), used as an index into a list object in the analysis code, could retrieve any data value of interest.
- Lastly, the line that reads `echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID` prints the task ID of each sub-job in its output file, which makes the output files easier to read. Moreover, adding `$SLURM_ARRAY_TASK_ID` to the end of the `Rscript` line passes the sub-job ID (in this case, a value of `1`, `2`, or `3`) to the R script as a command line argument (see below).
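The expansion behaviour described above can be sketched in plain Bash. This is only an illustration: the `123456` Job ID is a made-up example, and in a real run Slurm performs the expansion and sets `$SLURM_ARRAY_TASK_ID` itself:

```shell
# Emulate how --array=1-7:2 expands into task IDs (1, 3, 5, 7) and how the
# %A_%a pattern in --output maps onto the generated file names.
JOBID=123456                      # example overall Job ID (%A)
for TASK_ID in $(seq 1 2 7); do   # seq start step stop -> 1 3 5 7 (%a)
  echo "arrayJob_${JOBID}_${TASK_ID}.out"
done
```

Each printed name corresponds to the `.out` file one sub-job would produce under the `--output=arrayJob_%A_%a.out` setting.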
Preparing the code to be run
The analysis code used in this example is quite simple, but it uses the `$SLURM_ARRAY_TASK_ID` environment variable to demonstrate how such values can be passed into R code. To create the file `testarray.R`, run
cat << EOF > testarray.R
# Enable the use of command line argument inputs to the R script
args = commandArgs(TRUE)
input1 = args[1]
# Use the command line argument input in some way
fileName = paste0(input1,".csv")
integerValue = as.integer(input1)
write.csv(matrix(c(integerValue,integerValue,integerValue,integerValue), nrow=1), file=fileName, row.names=FALSE)
EOF
To view the contents of the file, run `cat testarray.R`.
As noted in the comments, the first section of the code reads the value of `$SLURM_ARRAY_TASK_ID`, which the submission script passes as a command line argument, via the `commandArgs()` function. This is equivalent to running, for example, `Rscript --vanilla testarray.R 1`. Once the array value has been passed to R, it can be used within the R script as a normal data variable. In this code, the array value is simply cast to an integer that is then written to a CSV. The expected output of this job submission is thus three output log files, three error log files, and three CSVs (matching the array values `1-3`) written to the directory from which the job was submitted. Each file should correspond to the array sub-job ID.
Submitting the job
To submit the job, ensure that both the submission script and the R script are in the same folder. Once these scripts are prepared, the modules have been loaded, and the R packages have been installed, simply run `sbatch arrayscript.sh`. When submitted, the console should print a message similar to
Submitted batch job <jobid>
where `<jobid>` is the numeric Job ID assigned by the Slurm batch system.
Understanding job outputs
To reiterate, this example array job will produce a set of outputs corresponding to the array values `1-3`. For every sub-job submitted from the array, you should receive a `.out` output file (which contains the printed output from the sub-job), a `.err` error file (which logs any errors from the sub-job), and a `.csv` file that uses the array sub-job ID as its name.
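As a rough sanity check of that layout, the three CSVs can be emulated in plain shell. This sketch only mimics what `testarray.R` is expected to write; the `"V1"`..`"V4"` header is what R's `write.csv()` emits for a 1x4 matrix with no column names:

```shell
# Sketch (assumption): reproduce the three CSVs the array job should leave
# behind, one per task ID, each holding the task ID repeated four times.
for i in 1 2 3; do
  printf '"V1","V2","V3","V4"\n%d,%d,%d,%d\n' "$i" "$i" "$i" "$i" > "${i}.csv"
done
ls 1.csv 2.csv 3.csv
```

If your real run produces files with different contents, compare them against this expected shape to spot which sub-job misbehaved.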