FAQs¶
What will happen to my queued jobs during maintenance?¶
To perform maintenance on ScienceCluster, Science IT admins will create a reservation for all computational nodes. The reservation ensures that no jobs are running during maintenance because software updates may interfere with running jobs.
Pending jobs that cannot finish before the reservation, based on their requested time, will remain pending until the reservation expires. The priority of jobs will not change during the maintenance, and they will be scheduled to run based on priority once the maintenance is over.
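To see which of your own jobs are pending and why, you can query the scheduler's reported reason; the following is a standard SLURM query (the format string is only an example):
squeue -u $USER -t PENDING -o "%.12i %.20j %.10T %r"
Jobs that are blocked by a maintenance reservation typically show a reason such as ReqNodeNotAvail.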
To see if there is a reservation for the next maintenance window, you can use the following command:
scontrol show reservations
The output will show the start time, end time, and affected nodes. For example:
ReservationName=s3it.uzh_24 StartTime=2024-06-05T06:00:00 EndTime=2024-06-05T18:00:00 Duration=12:00:00
Nodes=u20-compute-l[1-40],u20-compute-lmem[1,3-5],u20-compute-m[1-10],u20-compute-p[1-2],u20-compute-q1,u20-computegpu-[1-10],u20-computeib-hpc[1-12,14-18],u20-computeibmgpu-vesta[6-13,16-20],u20-computemgpu-vesta[14-15] NodeCnt=99 CoreCnt=3222 Features=(null) PartitionName=(null) Flags=MAINT,IGNORE_JOBS,SPEC_NODES,ALL_NODES
TRES=cpu=4174
Users=(null) Groups=(null) Accounts=s3it.uzh Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
MaxStartDelay=(null)
Typically, reservations last from 6:00 until 18:00 on the maintenance day. However, they may also start earlier and finish later. The end time may also be adjusted during the maintenance if necessary.
In addition to the SLURM reservation, access to the login nodes may also be restricted. In that case, you will see a special message about the reservation when you try to log in, and the login will fail.
During the maintenance, it is often necessary to reboot the login nodes. This means that all tmux and screen sessions will be terminated.
When will my job run?¶
We receive this question rather frequently. Unfortunately it has neither a definitive nor even an approximate answer due to the complexity of the scheduling algorithm and the highly dynamic nature of the environment.
On ScienceCluster, SLURM schedules jobs based on priority, which primarily depends on the user's previous resource consumption. This impact from consumption decays over time. Additionally, a job's priority increases the longer it sits in the queue, but it can also decrease if the user has other jobs running.
We have also enabled the job backfilling feature. This allows lower-priority jobs to run earlier as long as they do not delay higher-priority jobs. Backfilling is particularly beneficial for shorter jobs that require fewer resources, which is a good reason to request only the resources your job can actually use. However, it is still advisable to include a small buffer, especially for memory and runtime.
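As an illustration only (a minimal sketch; the job name, resource values, and command below are placeholders, not recommendations), a submission script with modest, realistic requests is easier for the backfill scheduler to fit into scheduling gaps:
#!/bin/bash
#SBATCH --job-name=short_job
#SBATCH --time=02:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
# realistic runtime and memory requests (plus a small buffer) make backfilling more likely
srun ./my_analysis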
High-priority jobs may lose their top position in the queue if another user submits jobs with higher priority. Given the large number of users on ScienceCluster, who may work at various hours, this can happen at any time. Although SLURM can provide an estimated completion time for your job, actual completion may be delayed due to newly submitted jobs. Conversely, jobs often finish earlier than their requested time suggests, which can help your job start sooner.
Even though there is no definitive answer to the question, there are several commands you can use to check the cluster's current load and your job's position in the queue.
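For example (all standard SLURM commands; the exact columns, and whether an estimated start time can be computed, depend on the current state of the queue):
squeue -u $USER --start    # expected start times of your pending jobs, when SLURM can estimate them
sprio -u $USER             # priority components of your pending jobs
sshare -u $USER            # your fair-share usage, which feeds into job priority
sinfo                      # overall state of partitions and nodes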
I am over-quota. How can I clean up my file storage?¶
Consider storing large files in your scalable storage folder, which is in your project space and can be found by running the command quota.
Folders that typically grow in size with cache or temporary files are .local and .cache. To find the storage used in all subfolders of your /home/$USER and /data/$USER folders, run:
du -h --max-depth=1 /home/$USER /data/$USER
In addition, you may want to check the number of files in your /home/$USER directory with:
quota
The total number of files and directories is shown as rentries and may not exceed 100,000.
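If you need a more fine-grained view, standard tools such as find can also help, for example:
find /home/$USER -type f -size +1G    # list files larger than 1 GB (adjust the threshold as needed)
find /home/$USER -type f | wc -l      # count the files under your home directory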
If you can no longer log in to the cluster, you can still connect using a terminal-only session:
ssh -t <shortname>@cluster.s3it.uzh.ch bash
Anaconda / Mamba¶
To clean up cached installation packages from Anaconda, run the following commands:
module load anaconda3
conda clean -a
pip cache purge
Or with Mamba:
module load mamba
mamba clean -a
pip cache purge
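If you would like to see what would be removed before deleting anything, conda clean also supports a dry run:
conda clean --all --dry-run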
Singularity¶
Singularity stores its cache by default in a user's home folder. To determine your cache folder for Singularity:
echo $SINGULARITY_CACHEDIR
To clean the Singularity cache, run:
module load singularityce
singularity cache clean
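To inspect what the cache contains and how much space it uses, you can also list it:
singularity cache list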
You can change your Singularity cache path with this command:
export SINGULARITY_CACHEDIR=/scratch/$USER/
Or add it to your .bashrc file so that it is set each time you log in:
echo "export SINGULARITY_CACHEDIR=/scratch/$USER/" >> ~/.bashrc
source ~/.bashrc
echo $SINGULARITY_CACHEDIR
Framework folders¶
Certain software frameworks (e.g., HuggingFace) cache files programmatically, which can be cleaned with their own commands. For example, with HuggingFace consider using:
huggingface-cli delete-cache
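To review what the HuggingFace cache currently contains (and how large each entry is), the same CLI also provides:
huggingface-cli scan-cache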
What to do if I have a broken conda or mamba environment?¶
There are a variety of possible reasons why a conda (or mamba) virtual environment might no longer function, even if it worked in the past, so there is no single answer that covers all cases. There are two general approaches: start over with a new environment, or repair the existing one.
Start fresh with a new environment¶
One approach, and generally the simplest and most reliable, is to create a new environment and start again following the methods outlined in this how-to article.
In some cases, that may not be sufficient. For example, if one has inadvertently installed packages using pip while not within an activated virtual environment, those packages may end up in .local, where they can conflict with packages inside a virtual environment. In that case, it may be necessary to clean up .local/lib and .local/bin. Check whether either of those directories exists with ls .local, then run ls .local/lib to see whether it contains folders or files with names containing "python". If so, one can clean these directories in a reversible way (to avoid deleting something that may be needed by another application) by renaming them instead of deleting them:
mv .local/lib .local/lib_bak
mv .local/bin .local/bin_bak
After renaming, run conda install pip (or mamba install pip) within your activated virtual environment before installing any packages with pip, so that pip installs into the environment rather than into .local. Do NOT modify .local/share, because that directory may contain important configuration settings for other applications.
Check version compatibility: sometimes a specific package requires an older (or newer) version of Python in order to work in a new environment; check that package's documentation. In that case, one can create a new environment with a specific Python version, e.g.:
conda create --name myenv python=3.10
conda install <package_name>=<version_number>
Repair the environment¶
Another approach, though not guaranteed to work, is to attempt to repair the virtual environment. Some possible steps (not a comprehensive guide) that may help in some cases are below. Update packages:
conda update --all
Remove and reinstall a problematic package:
conda remove <package_name>
conda install <package_name>
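Another option that sometimes helps is rolling the environment back to an earlier state recorded by conda:
conda list --revisions
conda install --revision <N>
where <N> is one of the revision numbers listed by the first command.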