The cluster is a shared resource available to all researchers in the College of Engineering. At any time there may be many users logged in and accessing the filesystem, hundreds of jobs running on the compute nodes, and a hundred more jobs queued up waiting for resources.
All users must:
- Avoid running jobs under their home directory. Doing so reduces the performance of the job and may also impact the responsiveness of the filesystem for others. Jobs must be run in the Lustre filesystem.
- Avoid too many simultaneous file transfers. This is especially true when transferring files outside the cluster as the network bandwidth available to others may suffer.
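The first rule above can be checked automatically before a job starts. The following is a minimal sketch, not a site-provided tool: the function name `is_under_home` is hypothetical, and because this page does not state the Lustre mount point, the check only compares a path against `$HOME` as a plain string (symbolic links are not resolved).

```shell
#!/bin/bash
# Return success (0) if the directory given as $1 is $HOME or lies below it.
# is_under_home is a hypothetical helper name, not a cluster command.
is_under_home() {
    local dir="${1%/}"      # strip one trailing slash, if any
    local home="${HOME%/}"
    case "$dir/" in
        "$home"/*) return 0 ;;
        *)         return 1 ;;
    esac
}

# Warn when the current working directory is inside the home directory.
if is_under_home "$PWD"; then
    echo "Warning: $PWD is under your home directory; run jobs from Lustre instead." >&2
fi
```

Placed near the top of a job script, a check like this could be changed to exit instead of warn, so that a job started from the wrong place stops before doing any work.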
Do not run on the Login Node
The login node is shared among all users of the facility. A single user running computational workloads on this node will negatively impact performance and responsiveness for other users. Should an administrator discover users using the login node in this way, they will kill these jobs without notice.
To run jobs on the compute nodes you will need to submit a batch job. A batch job is described by a small text file (a job submission script) that specifies the resources the job will require and the commands that must be run to complete it. The script is submitted to the queueing system, which decides where and when to run the job in order to most effectively utilise the available compute nodes.
Create a Job Submission Script (Single CPU jobs)
The example below shows a typical job submission script for a single CPU job. Example job submission scripts for different types of parallel jobs can be found lower down the page.
    #!/bin/bash

    # Set the name of the job
    # (this gets displayed when you get a list of jobs on the cluster)
    #SBATCH --job-name="My Wave Simulation"

    # Specify the maximum wall clock time your job can use
    # (Your job will be killed if it exceeds this)
    #SBATCH --time=3:00:00

    # Specify the amount of memory your job needs (in Mb)
    # (Your job will be killed if it exceeds this for a significant length of time)
    #SBATCH --mem-per-cpu=1024

    # Specify the number of cpu cores your job requires
    #SBATCH --ntasks=1

    # Set up the environment
    module load intel

    # Run the application
    echo My job is started
    ./wave_1d
    echo My job has finished
Submit your Job to the Queue
To submit a job to the cluster, use the sbatch command followed by the name of the job script file:
sbatch job_script
This will return your unique job id. This can then be used to query or control your job.
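When submitting from a script, it can be convenient to capture the job id in a variable. The sketch below is one possible way to do this, assuming the job script has been saved as `wave_job.sh` (a hypothetical file name) and using a hypothetical helper function `submit_job`. It relies on the `--parsable` option of sbatch, which prints just the job id, optionally followed by `;` and the cluster name.

```shell
#!/bin/bash
# Hypothetical file name; replace with the name of your own job script.
job_script="wave_job.sh"

# Submit the job script given as $1 and print only the numeric job id.
# --parsable makes sbatch print "jobid" (or "jobid;cluster") instead of
# the usual "Submitted batch job <jobid>" message.
submit_job() {
    local out
    out=$(sbatch --parsable "$1") || return 1
    echo "${out%%;*}"       # drop an optional ";cluster" suffix
}

# Only attempt a submission where sbatch and the script are actually present.
if command -v sbatch >/dev/null 2>&1 && [ -f "$job_script" ]; then
    job_id=$(submit_job "$job_script") && echo "Submitted job $job_id"
fi
```

The value stored in `job_id` can then be passed to the monitoring and cancelling commands described in the following sections.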
Monitoring your Job
The squeue command is used to query the jobs currently in the queue or running on the cluster. It can be used with various options to filter the list.
To list all jobs:
squeue
To list just your jobs:
squeue -u username
To get more specific information about a particular job, the scontrol command can be used:
scontrol show job=job_id
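For scripted monitoring, squeue can be polled until a job leaves the queue. The following is a rough sketch under stated assumptions rather than a recommended tool: `job_state` and `wait_for_job` are hypothetical function names, and the default 60 second polling interval is an arbitrary choice intended to avoid putting unnecessary load on the scheduler.

```shell
#!/bin/bash
# Print the state (e.g. PENDING, RUNNING) of job $1, or nothing if the job
# is no longer listed by squeue. Hypothetical helper name.
job_state() {
    squeue --noheader --jobs="$1" --format="%T" 2>/dev/null
}

# Poll until the job disappears from the queue.
# $1 = job id, $2 = seconds between polls (default 60, an arbitrary choice).
wait_for_job() {
    local job_id="$1" interval="${2:-60}" state
    while state=$(job_state "$job_id") && [ -n "$state" ]; do
        echo "Job $job_id is $state"
        sleep "$interval"
    done
    echo "Job $job_id is no longer in the queue"
}
```

For example, `wait_for_job 12345 120` would check job 12345 (a made-up id) every two minutes. Note that a job leaving the queue does not say whether it succeeded; check the job's output files for that.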
Killing and Removing your Job
If you have submitted a job to the queue, it can be removed using the scancel command with the job id:
scancel job_id
The job will be removed from the queue and if it has started running it will be killed.
Job Submission Scripts for Parallel Jobs
There are a number of permutations of supported compilers and flavours of MPI on the cluster:
- GNU compilers and OpenMPI
- GNU compilers and MVAPICH2 (InfiniBand version of MPICH)
- Intel compilers and MVAPICH2 (InfiniBand version of MPICH)
- The combination of Intel compiler and OpenMPI is not available at this time.
For each combination, the modules loaded when the software is executed must be the same as those that were loaded when it was built.
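One way to guard against a build/run mismatch is to record the loaded modules at build time and compare them in the job script. This is only a sketch: `record_modules` and `check_modules` are hypothetical helper names, and it assumes the module system on the cluster exports the colon-separated `LOADEDMODULES` environment variable, which has not been confirmed for this cluster.

```shell
#!/bin/bash
# Save the colon-separated list of currently loaded modules to file $1.
# Assumes the module system exports LOADEDMODULES.
record_modules() {
    printf '%s\n' "${LOADEDMODULES:-}" > "$1"
}

# Return success if the currently loaded modules match those recorded in $1;
# otherwise print both lists to stderr and return failure.
check_modules() {
    local recorded
    recorded=$(cat "$1") || return 1
    if [ "$recorded" != "${LOADEDMODULES:-}" ]; then
        echo "Module mismatch: built with '$recorded', running with '${LOADEDMODULES:-}'" >&2
        return 1
    fi
}
```

A possible usage is to call `record_modules build_modules.txt` (a hypothetical file name) straight after compiling, and `check_modules build_modules.txt || exit 1` in the job script after its `module load` lines. The comparison is order-sensitive, so modules should be loaded in the same order in both places.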
An example submission script for a code compiled using the Intel compilers and the MVAPICH2 libraries is shown below:
    #!/bin/bash

    # Set the name of the job
    # (this gets displayed when you get a list of jobs on the cluster)
    #SBATCH --job-name="My Heat Transfer"

    # Specify the maximum wall clock time your job can use
    # (Your job will be killed if it exceeds this)
    #SBATCH --time=3:00:00

    # Specify the amount of memory your job needs per cpu-core (in Mb)
    # (Your job will be killed if it exceeds this for a significant length of time)
    #SBATCH --mem-per-cpu=1024

    # Specify the number of cpu cores your job requires
    #SBATCH --ntasks=56

    # Set up the environment
    module load intel
    module load mvapich2

    # Run the application
    echo My job is started
    mpiexec ./heat_transfer
    echo My job has finished
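Because the memory request in the script above is per CPU core, the total memory requested grows with the number of tasks. The short calculation below illustrates this for the values used in the example, assuming one CPU core per task (the Slurm default when `--cpus-per-task` is not set).

```shell
#!/bin/bash
ntasks=56          # value of --ntasks in the example script
mem_per_cpu=1024   # value of --mem-per-cpu in the example script, in megabytes

# Assumes one CPU core per task, so total memory = tasks x memory per core.
total_mb=$((ntasks * mem_per_cpu))
echo "Total memory requested: ${total_mb} MB ($((total_mb / 1024)) GB)"
# prints "Total memory requested: 57344 MB (56 GB)"
```

Requesting more memory than the job needs can keep it waiting in the queue longer, so it is worth repeating this calculation whenever the task count changes.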