Create an R Environment and Run a Job Array¶
This tutorial demonstrates how to create an R environment on ScienceCluster and how to run a job array within the R environment. It is intended for basic R workflows without GPUs or large memory requirements.
Workflow Overview¶
Following the general Conda environment workflow on ScienceCluster, we recommend the same principles here for integrating an R environment into job submission scripts:
-  Set up your R environment: first start an interactive session. Then create your R environment using mamba/conda and install the necessary packages within that session.
-  Integrate the environment into your SLURM script: after installing the packages, exit the interactive session. Then, on a login node, integrate the R environment into your SLURM batch script and submit the script from the login node.
Prepare the R environment¶
R environments on ScienceCluster are created using mamba or conda. Because environment creation and package installation can be resource intensive, it is recommended to perform the setup in an interactive session on a compute node rather than on the login nodes.
To create an R environment:
srun --pty -n 1 -c 4 --time=01:00:00 --mem=8G bash -l
module load mamba
mamba create -n renv -c conda-forge r-base -y
For more details, refer to the generic Conda-based R environment guide.
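Many common R packages are also pre-built on conda-forge under names prefixed with r-, so you can optionally install them directly into the environment rather than from within R. For example, to add ggplot2 this way:
mamba install -n renv -c conda-forge r-ggplot2 -y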
To activate the environment:
source activate renv
Once the R environment is created, consider whether you need additional cluster modules, e.g., GCC for compiling packages from source, which can be loaded with module load gcc.
Once the R environment is activated, start an interactive R session by running in the terminal:
R
Using an interactive R session, you can check which packages are installed and install any additional packages you need (see the example below). You can install packages:
- in your default user library, or
- in a custom directory.
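For example, the following commands, run inside the R session, list the packages that are already installed and install one that is missing (ggplot2 serves purely as an illustration here):
# List the packages currently visible to this R installation
rownames(installed.packages())
# Install a package into the default user library
install.packages("ggplot2")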
Once you have verified and installed all packages required for your workflow, you can proceed to setting up your job submission script.
Install packages to /data directory¶
When installing R packages for the first time, you will be prompted to choose whether to install them in a user library. You can safely answer yes if the default user library location suggested by R (typically in your home directory) meets your needs.
If you plan to install a large number of packages or anticipate that your home directory may not have sufficient space, you can instead specify an alternative location in your /data space using the lib argument of the install.packages() function.
For example, to install the ggplot2 package into a directory called rpackages in your /data area, first create the directory:
mkdir -p /data/$USER/rpackages
Then, from an interactive R session, specify this directory as the installation location using the lib argument:
username <- Sys.getenv("USER")
install.packages("ggplot2", lib = paste("/data/", username, "/rpackages", sep = ""))
This snippet uses Sys.getenv() to retrieve your ScienceCluster $USER variable. The paste() function constructs the full path to your custom package directory.
You can either specify the directory with lib.loc explicitly when loading a package, or add this custom directory to .libPaths() in R.
- Use lib.loc when loading a package. For example, to load the ggplot2 package installed in your custom rpackages directory, include the following lines at the beginning of your R script (remember to load all required packages within the script you submit):
username <- Sys.getenv("USER")
library("ggplot2", lib.loc = paste("/data/", username, "/rpackages", sep = ""))
- Update .libPaths(). To make R automatically search your custom package directory without specifying lib.loc each time, append the directory to .libPaths() at the start of your R script:
username <- Sys.getenv("USER")
custom_lib <- paste("/data/", username, "/rpackages", sep = "")
.libPaths(c(.libPaths(), custom_lib))
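After updating .libPaths(), packages installed in the custom directory load with a plain library() call:
# ggplot2 now resolves without an explicit lib.loc
library("ggplot2")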
Prepare the job submission script¶
Once you've installed the packages you'll use in your analysis, you're ready to prepare the job submission script. This example uses a job array. Job arrays are useful when you need to run the same analytical code across many datasets or parameter sets. The following command creates a submission script called arrayscript.sh:
cat << EOF > arrayscript.sh
#!/usr/bin/bash -l
#SBATCH --job-name=arrayJob
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4GB
#SBATCH --output=arrayJob_%A_%a.out
#SBATCH --error=arrayJob_%A_%a.err
#SBATCH --array=1-3
module load mamba
source activate renv
# Print this sub-job's task ID
echo "My SLURM_ARRAY_TASK_ID: " \$SLURM_ARRAY_TASK_ID
Rscript --vanilla testarray.R \$SLURM_ARRAY_TASK_ID
EOF
To view the contents of the file, run cat arrayscript.sh.
There are a few aspects of this submission script to note:
-  First, the --output and --error flags specify file names in a format that identifies them by both the overall Job ID and the sub-job ID. In other words, when you submit a job array you receive a single Job ID for the overall submission, and every individual job within the array receives a sub-job ID. The resulting names will look like arrayJob_123456_1, where 123456 is an example Job ID and 1 is an example sub-job ID. Note: %A stands for the overall Job ID in the file name pattern, and %a stands for the sub-job ID.
-  Second, the --array flag specifies the array of task IDs used for this job submission. An array of 1-3 expands into 3 sub-jobs using 3 values (i.e., 1, 2, and 3). Each sub-job's array value can be used both in the job output and error files and within the submitted code script (see below). Other array specifications are possible; for example, --array=1,2,5,19,27 would use exactly the values 1, 2, 5, 19, and 27, while --array=1-7:2 would use the values between 1 and 7 with a step size of 2 (i.e., 1, 3, 5, and 7). Creative combinations of the --array input and existing variables in the script allow you to use this functionality with great flexibility. For example, a simple set of array values (like --array=1-3), used as an index into a list object in the analysis code, can retrieve any data value of interest; see the sketch after this list.
-  Lastly, the line echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID prints each sub-job's task ID in its output file for greater readability. Moreover, appending $SLURM_ARRAY_TASK_ID to the end of the Rscript line passes the sub-job's task ID (in this case, a value of 1, 2, or 3) as a command-line argument to the R script itself (see below).
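As an illustration of the indexing pattern mentioned above, here is a minimal sketch (the data file names are hypothetical) in which the task ID selects which dataset the R script processes:
# Read the task ID passed on the command line (1, 2, or 3)
args <- commandArgs(TRUE)
task_id <- as.integer(args[1])
# Hypothetical list of input files; each sub-job processes one of them
datasets <- c("cohortA.csv", "cohortB.csv", "cohortC.csv")
data <- read.csv(datasets[task_id])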
Prepare the R code to be run¶
The analysis code used in this example is quite simple, but it uses the $SLURM_ARRAY_TASK_ID environment variable to demonstrate how such values can be passed into R code. To create the file testarray.R, run
cat << EOF > testarray.R
# Enable the use of command line argument inputs to the R script
args = commandArgs(TRUE)
input1 = args[1]
# Use the command line argument input in some way
fileName = paste0(input1,".csv")
integerValue = as.integer(input1)
write.csv(matrix(c(integerValue,integerValue,integerValue,integerValue), nrow=1), file=fileName, row.names=FALSE)
EOF
To view the contents of the file, run cat testarray.R.
As noted in the comments, the first section of the code reads the value of $SLURM_ARRAY_TASK_ID, which the submission script passes as a command-line argument, via the commandArgs() function. For a given sub-job, this is equivalent to running, for example: Rscript --vanilla testarray.R 1. Once the array value has been passed to R, it can be used within the R script as a normal data variable. In this code, the array value is simply cast to an integer that is then written to a CSV. The expected output of this job submission is thus: three output log files, three error log files, and three CSV files written to the directory from which the job was submitted (one for each of the array values 1-3). Each file name should correspond to the array sub-job ID.
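For instance, 1.csv, produced by the first sub-job, should contain the default column names that R assigns to an unnamed matrix, followed by a single row of values:
"V1","V2","V3","V4"
1,1,1,1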
Submitting the job¶
To submit the script, ensure that both the submission script and the R script are in the same folder. Once these scripts are prepared, the modules have been loaded, and the R packages have been installed, simply run sbatch arrayscript.sh. When submitted, the console should print a message similar to
Submitted batch job <jobid>
where <jobid> is the Job ID numeric code assigned by the Slurm batch submission system.
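While the array is running, you can check its status from a login node with standard Slurm commands; for example (replace <jobid> with the ID printed above):
squeue -u $USER    # one row per pending or running sub-job
sacct -j <jobid>   # accounting summary for the whole array, including finished sub-jobs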
Understanding job outputs¶
To reiterate, this example array job will produce a set of outputs corresponding to the array values 1-3. For every sub-job submitted from the array, you should receive a .out output file (which contains the printed output from that sub-job), a .err error file (which logs any errors from that sub-job), and a .csv file whose name is the array sub-job ID.
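Listing the submission directory after all three sub-jobs have finished should therefore show files along these lines, with 123456 standing in for your actual Job ID:
1.csv  2.csv  3.csv
arrayJob_123456_1.err  arrayJob_123456_2.err  arrayJob_123456_3.err
arrayJob_123456_1.out  arrayJob_123456_2.out  arrayJob_123456_3.out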