R Example script with a job array
This tutorial demonstrates the basics of how to create an R environment on ScienceCluster with specific packages of interest. It also demonstrates the use of a job array. This demonstration is for basic R usage without GPUs or large memory requirements.
Preparing the environment
To get started, log in to the cluster from a terminal, and load the mamba module:
module load mamba
You will use mamba to create a virtual environment for your R installation. Mamba is a drop-in replacement for conda.
To create a virtual environment called renv and install R within it, run
mamba create -n renv -c conda-forge r-base -y
To install a specific version of R, use this command
mamba create -n renv -c conda-forge "r-base==4.2.2" -y
To activate the virtual environment, from either the login node or a compute node, run
source activate renv
Once you've created the R environment, also consider whether you need other modules available on the cluster (e.g., GCC, which is provided as a module and can be loaded via a `module load gcc` command). Additionally, ensure that the packages you intend to use are available in your R environment.
To run your code interactively on a compute node instead of the login node, run
srun --pty -n 1 -c 4 --time=01:00:00 --mem=8G bash -l
Then, once on the compute node, run
module load mamba
source activate renv
R
Once the interactive R session has started, you can list the packages that are already installed on the system with the following command
installed.packages()
If the packages you need are listed after this command call, you can move forward to setting up your submission script. If you need more packages, you can install them from the interactive session with `install.packages()`. For instance, to use the `tictoc` package, from the interactive R session you would run
install.packages('tictoc')
Note
⚠️ Make sure to quit the interactive R session once you've successfully installed your packages of interest. To quit R, simply run `q()` and follow the prompts. You do not need to save your workspace image, so you can enter `n` when asked.
When you install packages for the first time, you will be asked whether you'd like to install them into a user library. You can answer `yes` to these prompts if the default user-library location suggested by R suits your needs. If you have a large number of packages to install and the space in your `home` directory may not be sufficient, instead consider passing another location in your `data` area via the `lib` argument of the `install.packages()` function. For example, to install the `tictoc` package into a directory called `rpackages` within your user `data` area, you would first create the directory from the login node command line with
mkdir -p /data/$USER/rpackages
Then specify it as the location for your R packages using the `lib` argument from within an interactive R session run from the login node, like so
username = Sys.getenv()['USER']
install.packages('tictoc', lib=paste("/data/",username,"/rpackages",sep=""))
Note
⚠️ Notice that this code snippet (and the one below) uses the `Sys.getenv()` function to retrieve the `$USER` variable from the cluster environment. The `paste()` function is then used to construct the full path as a string. After `username = Sys.getenv()['USER']` has been run, you can call `print(paste("/data/",username,"/rpackages",sep=""))` to see this string.
This new directory can be added to `.libPaths()` in R, or you can simply specify it when loading the package. For example, to load the package you just installed in this new location, you would use the following lines at the beginning of your R script. (Make sure to load your required packages within the R script that you submit.)
username = Sys.getenv()['USER']
library("tictoc", lib.loc=paste("/data/",username,"/rpackages",sep=""))
To deactivate the virtual environment, run
conda deactivate
Preparing the job submission script
Once you've installed the packages you'll use in your analysis, you're ready to prepare the job submission script. This particular submission script uses a job array. Job arrays are useful when you need to run the same analytical code across numerous datasets or parameter sets. The following command creates a submission script called `arrayscript.sh`.
cat << EOF > arrayscript.sh
#!/bin/bash
#SBATCH --job-name=arrayJob
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4GB
#SBATCH --output=arrayJob_%A_%a.out
#SBATCH --error=arrayJob_%A_%a.err
#SBATCH --array=1-3
module load mamba
source activate renv
# Print this sub-job's task ID
echo "My SLURM_ARRAY_TASK_ID: " \$SLURM_ARRAY_TASK_ID
Rscript --vanilla testarray.R \$SLURM_ARRAY_TASK_ID
EOF
To view the contents of the file, run `cat arrayscript.sh`.
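One detail of the `cat << EOF` pattern is worth highlighting: because the `EOF` delimiter is unquoted, the shell expands `$` variables while writing the file, which is why the Slurm variables above are escaped as `\$SLURM_ARRAY_TASK_ID`. A minimal sketch of the difference (the `VAR` name and `demo.txt` file are made up for illustration):

```shell
# With an unquoted EOF delimiter, $VAR expands when the heredoc is written,
# while \$VAR is written literally and only expands later, when the
# generated script itself runs (e.g., under Slurm).
VAR="now"
cat << EOF > demo.txt
expanded: $VAR
literal: \$VAR
EOF
cat demo.txt
```

The file therefore contains `expanded: now` and the untouched string `literal: $VAR`, which Slurm resolves at run time.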
There are a few aspects of this submission script to note:
- First, the `--output` and `--error` flags specify file names in a format that identifies them by both the overall Job ID and the sub-job ID. In other words, when you submit a job array you receive a single Job ID for the overall submission, and every individual job within the array receives a sub-job ID. The format, as specified in these lines, will be similar to `arrayJob_123456_1`, where `123456` is an example Job ID and `1` is an example sub-job ID. Note: to achieve this format, `%A` stands for the overall Job ID in the file name, and `%a` stands for the sub-job ID.
- Second, the `--array` flag specifies the array used for this job submission. Specifically, an array of `1-3` will expand so that there are 3 sub-jobs using 3 values (i.e., `1`, `2`, and `3`). The array value for each sub-job can be used in the job output and error files as well as within the submitted code script (see below). Other array values can be used; for example, `--array=1,2,5,19,27` would specify the values `1`, `2`, `5`, `19`, and `27`. Alternatively, `--array=1-7:2` would use the values between `1` and `7` with a step size of 2 (i.e., `1`, `3`, `5`, and `7`). Creative uses of the `--array` input and existing variables in the script allow you to use this functionality with great flexibility. For example, a simple set of array values (like `--array=1-3`), used as an index into a list object in the analysis code, could retrieve any data value of interest.
- Lastly, the line that reads `echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID` prints the task ID of each sub-job in its output file, which makes the output files easier to read. Moreover, adding `$SLURM_ARRAY_TASK_ID` to the end of the `Rscript` line passes the sub-job ID (in this case, a value of `1`, `2`, or `3`) to the R script as a command line argument (see below).
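The expansion behaviour described above can be sketched in plain Bash. This is only an illustration: the `123456` Job ID is a made-up example, and in a real run Slurm performs the expansion and sets `$SLURM_ARRAY_TASK_ID` itself:

```shell
# Emulate how --array=1-7:2 expands into task IDs (1, 3, 5, 7) and how the
# %A_%a pattern in --output maps onto the generated file names.
JOBID=123456                      # example overall Job ID (%A)
for TASK_ID in $(seq 1 2 7); do   # seq start step stop -> 1 3 5 7 (%a)
  echo "arrayJob_${JOBID}_${TASK_ID}.out"
done
```

Each printed name corresponds to the `.out` file one sub-job would produce under the `--output=arrayJob_%A_%a.out` setting.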
Preparing the code to be run
The analysis code used in this example is quite simple, but it uses the `$SLURM_ARRAY_TASK_ID` environment variable to demonstrate how such values can be passed into R code. To create the file `testarray.R`, run
cat << EOF > testarray.R
# Enable the use of command line argument inputs to the R script
args = commandArgs(TRUE)
input1 = args[1]
# Use the command line argument input in some way
fileName = paste0(input1,".csv")
integerValue = as.integer(input1)
write.csv(matrix(c(integerValue,integerValue,integerValue,integerValue), nrow=1), file=fileName, row.names=FALSE)
EOF
To view the contents of the file, run `cat testarray.R`.
As noted in the comments, the first section of the code reads the value of `$SLURM_ARRAY_TASK_ID`, which the submission script passes as a command line argument, via the `commandArgs()` function. This is equivalent to running, for example, `Rscript --vanilla testarray.R 1`. Once the array value has been passed to R, it can be used within the R script as a normal data variable. In this code, the array value is simply cast to an integer that is then written to a CSV. The expected output of this job submission is thus three output log files, three error log files, and three CSVs (matching the array values `1-3`) written to the directory from which the job was submitted. Each file should correspond to the array sub-job ID.
Submitting the job
To submit the job, ensure that both the submission script and the R script are in the same folder. Once these scripts are prepared, the modules have been loaded, and the R packages have been installed, simply run `sbatch arrayscript.sh`. When submitted, the console should print a message similar to
Submitted batch job <jobid>
where `<jobid>` is the numeric Job ID assigned by the Slurm batch system.
Understanding job outputs
To reiterate, this example array job will produce a set of outputs corresponding to the array values `1-3`. For every sub-job submitted from the array, you should receive a `.out` output file (which contains the printed output from the sub-job), a `.err` error file (which logs any errors from the sub-job), and a `.csv` file that uses the array sub-job ID as its name.
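As a rough sanity check of that layout, the three CSVs can be emulated in plain shell. This sketch only mimics what `testarray.R` is expected to write; the `"V1"`..`"V4"` header is what R's `write.csv()` emits for a 1x4 matrix with no column names:

```shell
# Sketch (assumption): reproduce the three CSVs the array job should leave
# behind, one per task ID, each holding the task ID repeated four times.
for i in 1 2 3; do
  printf '"V1","V2","V3","V4"\n%d,%d,%d,%d\n' "$i" "$i" "$i" "$i" > "${i}.csv"
done
ls 1.csv 2.csv 3.csv
```

If your real run produces files with different contents, compare them against this expected shape to spot which sub-job misbehaved.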