Example Installations for Python-based Machine Learning Programming on GPU Nodes¶
This tutorial demonstrates the basics of how to create a Python environment on GPU compute nodes of ScienceCluster with specific packages of interest, in this case TensorFlow and PyTorch.
Creating an environment for TensorFlow on a GPU node¶
After connecting from a terminal to ScienceCluster, work through the following steps:
# load the gpu module
module load gpu
# request an interactive session, which allows the package installer to see the GPU hardware
srun --pty -n 1 -c 2 --time=01:00:00 --gpus=1 --mem=8G bash -l
# (optional) confirm the gpu is available. The output should show basic information about at least
# one GPU.
nvidia-smi
# use mamba (drop-in replacement for conda)
module load mamba
# create a virtual environment named 'tf' and install packages
mamba create -n tf -c conda-forge tensorflow cudatoolkit-dev
# activate the virtual environment
source activate tf
# confirm that the GPU is correctly detected
python -c 'import tensorflow as tf; print("Built with CUDA:", tf.test.is_built_with_cuda()); print("Num GPUs Available:", len(tf.config.list_physical_devices("GPU"))); print("TF version:", tf.__version__)'
# when finished with your test, exit the interactive cluster session
conda deactivate
exit
Creating an environment for PyTorch on a GPU node¶
After connecting from a terminal to ScienceCluster, work through the following steps:
# load the gpu module
module load gpu
# request an interactive session, which allows the package installer to see the GPU hardware
srun --pty -n 1 -c 2 --time=01:00:00 --gpus=1 --mem=8G bash -l
# (optional) confirm the gpu is available. The output should show basic information about at least
# one GPU.
nvidia-smi
# use mamba (drop-in replacement for conda)
module load mamba
# create a virtual environment named 'torch' and install packages
mamba create -n torch -c pytorch -c nvidia pytorch torchvision torchaudio pytorch-cuda
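# (optional) you can instead pin a specific CUDA build of PyTorch; the
# version below is illustrative only and should match the cluster's driver:
# mamba create -n torch -c pytorch -c nvidia pytorch torchvision torchaudio pytorch-cuda=11.8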
# activate the virtual environment
source activate torch
# confirm that the GPU is correctly detected
python -c 'import torch as t; print("is available: ", t.cuda.is_available()); print("device count: ", t.cuda.device_count()); print("current device: ", t.cuda.current_device()); print("cuda device: ", t.cuda.device(0)); print("cuda device name: ", t.cuda.get_device_name(0)); print("cuda version: ", t.version.cuda)'
# when finished with your test, exit the interactive cluster session
conda deactivate
exit
Using this virtual environment in ScienceApps¶
If you would like to use your TensorFlow or PyTorch environment with Jupyter on ScienceApps, see the documentation about installing the environment as an IPython kernel.
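For reference, registering an environment as a kernel typically involves the ipykernel package. A minimal sketch, assuming the 'tf' environment created above (the kernel name and display name are your choice; see the linked documentation for the authoritative steps):

# install ipykernel into the environment
mamba install -n tf -c conda-forge ipykernel
# activate the environment and register it as a Jupyter kernel for your user
source activate tf
python -m ipykernel install --user --name tf --display-name "Python (tf)"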
Preparing a job submission script¶
Single Node GPU Jobs¶
Once the virtual environment is created and the packages are installed, it can be activated from within the job submission script.
First, create a file called examplecode.py, in this case for TensorFlow, with the following command:
cat << EOF > examplecode.py
import tensorflow as tf
print("Built with CUDA:", tf.test.is_built_with_cuda())
print()
print("Tensorflow version:", tf.__version__)
print()
print(tf.config.list_physical_devices("GPU"))
print()
print("Preparing a test case...")
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.models.Sequential([
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(10)
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
print("Compiling a model...")
model.compile(optimizer='adam', loss=loss_fn, metrics=['accuracy'])
print("Training the model...")
model.fit(x_train, y_train, epochs=5)
print("Evaluating the model...")
model.evaluate(x_test, y_test, verbose=2)
print("Done")
EOF
Then, similarly create the submission script:
cat << 'EOF' > tfsubmission.sh
#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4GB
#SBATCH --gpus=1
module load mamba
source activate tf
export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/pkgs/cuda-toolkit
python examplecode.py
EOF
You can check the contents of these files with cat examplecode.py and cat tfsubmission.sh.
Important
The XLA_FLAGS variable is set to prevent the "libdevice not found" error that may occur during training in TensorFlow starting with v2.11. In our tests, this fix is sufficient. If you still get the error even with the XLA_FLAGS variable set, you can try the other approaches outlined in the official installation guide.
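As an optional sanity check (a diagnostic sketch, not part of the official fix), you can confirm that libdevice actually exists inside the activated environment at a path consistent with the XLA_FLAGS setting:

# search the active environment for libdevice; an empty result means
# the XLA_FLAGS path needs adjusting
find $CONDA_PREFIX -name 'libdevice*' 2>/dev/null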
Note
Please note that the --gpus=1 flag is included in this batch submission script. Without it, Slurm will not allocate a GPU for your job and the code will run only on CPUs.
To request more than 1 GPU on the same node, simply adjust --gpus=1 to --gpus=X, where X is your desired number of GPUs on the single node. This flag cannot request more than the maximum number of GPUs available on any single node.
Important
If you request more than 1 GPU, you must also ensure your specific code makes use of a multi-GPU environment. If it does not, requesting multiple GPUs will not make your code run faster or improve your workflow.
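For illustration, single-node multi-GPU training in TensorFlow is usually enabled with a distribution strategy. A minimal sketch, reusing the example model from above:

import tensorflow as tf
# MirroredStrategy replicates the model across all GPUs visible on the node
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)
with strategy.scope():
    # the model must be built and compiled inside the strategy scope
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10)
    ])
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])
# model.fit(...) then trains across all requested GPUs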
Multi Node GPU Jobs¶
Requesting multiple GPUs across different nodes for a single job requires special preparation, not only in the data analysis code but also in the Slurm submission script. For example, one can request and run an example PyTorch multi-node multi-GPU job with the following submission script:
#!/bin/bash
#SBATCH --job-name=jobname
#SBATCH --output=%x%j.out
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --gpus-per-node=2
#SBATCH --constraint=GPUMEM32GB
#SBATCH --mem-per-gpu=32G
#SBATCH --cpus-per-gpu=2
#SBATCH --time=00:10:00
module load mamba
source activate torch
# Node networking section
head_node_ip=$(hostname --ip-address)
echo Node IP: $head_node_ip
export LOGLEVEL=INFO
# Analytical code
srun torchrun \
--nnodes $SLURM_JOB_NUM_NODES \
--nproc_per_node $SLURM_GPUS_PER_NODE \
--rdzv_id $RANDOM \
--rdzv_backend c10d \
--rdzv_endpoint $head_node_ip:29500 \
~/data/examples/distributed/ddp-tutorial-series/multinode.py 50 10
This submission script has been adapted from this PyTorch Slurm example. You can clone the corresponding GitHub repo using git clone https://github.com/pytorch/examples.git; the submission script presumes the repo was cloned into the user's ~/data directory. It also assumes that you run module load multigpu before you submit the script.
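For example, the preparation steps might look like this (assuming the ~/data directory exists in your home):

# clone the PyTorch examples repository into ~/data, where the
# submission script expects to find it
cd ~/data
git clone https://github.com/pytorch/examples.git
# the script referenced above is then located at:
# ~/data/examples/distributed/ddp-tutorial-series/multinode.py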
You can find a video description as well as additional documentation on the multinode.py script here.
The Slurm submission script has the following notable inclusions:
- The recommended SBATCH flags for a multinode setup:
  - --nodes=2: select your total number of nodes.
  - --ntasks=2: this parameter should match the --nodes parameter.
  - --gpus-per-node=2: adjust this parameter to select how many GPUs you want on each node.
  - --mem-per-gpu=32G: this flag should match the amount of GPU memory on your selected GPU model.
  - --cpus-per-gpu=2: users should at minimum request 2 CPUs per GPU; only increase this number if your code is also CPU parallelized, otherwise you will pay for unused resources.
  - --constraint=GPUMEM32GB: other options include A100 (when requesting only A100 GPUs) or GPUMEM16GB (when specifically requesting 16GB V100 GPUs).
- Submission with these parameters will result in requesting a total of 4 V100 32GB GPUs across 2 nodes. Adjusting these parameters allows you to request an identical number of GPUs across multiple nodes. While it is possible to request uneven numbers of GPUs across nodes, the setup for such a submission is beyond the scope of this example.
- The torchrun command arguments include Slurm variables computed from the SBATCH parameter set:
  - $SLURM_JOB_NUM_NODES directs PyTorch to run across the requested number of nodes.
  - $SLURM_GPUS_PER_NODE directs PyTorch to run across the requested number of GPUs on each node.
- A node networking section, where a head node is appointed to coordinate the analysis across the other worker nodes.
  - Note: if someone else sharing your node also uses port 29500 (see the line --rdzv_endpoint $head_node_ip:29500), you may need to change this to an alternative port number (e.g., 29505) to ensure your traffic doesn't collide; one possible approach is sketched below.
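One way to reduce the chance of such collisions (an illustrative sketch, not an official recommendation) is to derive the rendezvous port from the job ID, which Slurm guarantees to be unique:

# in the submission script: derive a port in the 29500-30499 range
# from the unique Slurm job ID
port=$((29500 + SLURM_JOB_ID % 1000))
# ...and pass it to torchrun via: --rdzv_endpoint $head_node_ip:$port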
Keep in mind that you will need to adjust your data analysis code in ways specific to your chosen modelling framework (TensorFlow, PyTorch, etc.). This example, alongside the materials from PyTorch, serves as a guide for adapting your own code to Slurm on the ScienceCluster.
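As a minimal illustration of the kind of changes involved on the PyTorch side (a sketch of the standard DistributedDataParallel setup, not the full multinode.py script):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each process it spawns
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# wrap an ordinary model so that gradients are synchronized across
# all GPUs on all nodes during the backward pass
model = torch.nn.Linear(10, 1).to(local_rank)
model = DDP(model, device_ids=[local_rank])

# ... training loop as usual; each process works on its shard of the data ...

dist.destroy_process_group()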
Submitting the job¶
To submit this script for processing (after the modules have been loaded and the Conda environment has been created), simply run:
sbatch tfsubmission.sh
When submitted, the console should print a message similar to
Submitted batch job <jobid>
where <jobid> is the numeric job ID assigned by the Slurm batch submission system.
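You can then monitor the job with the standard Slurm commands, for example:

# list your queued and running jobs
squeue -u $USER
# show accounting information for a specific job
sacct -j <jobid>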
Understanding job outputs¶
When the job runs to completion (provided your submitted code does not produce any errors), any files written by your script should appear in their designated locations, and a file named slurm-<jobid>.out should exist in the directory from which you submitted the script, unless you specified otherwise. This file contains the printed output from your job. Examine the output to ensure that the training and evaluation were successful. In particular, you should see a message listing the loss and accuracy of the model towards the end of the output.
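For example, to inspect the end of the output file once the job has finished:

# print the last lines of the job output, where the evaluation
# results from examplecode.py appear
tail -n 20 slurm-<jobid>.out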