Running Software

The cluster is a shared resource available to all researchers in the College of Engineering. At any given time there may be many users logged in and accessing the filesystem, hundreds of jobs running on the compute nodes, and many more queued up waiting for resources.

All users must:

  • Avoid running jobs under their home directory: such jobs will suffer poor performance and may affect the responsiveness of the filesystem for others. Jobs must be run on the Lustre filesystem.
  • Avoid too many simultaneous file transfers. This is especially important when transferring files outside the cluster, as the network bandwidth available to others may suffer.
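As a sketch, setting up a job working directory on Lustre rather than under your home directory might look like the following. The scratch path is an assumption; use whatever Lustre area your site actually provides.

```shell
# Create and switch to a working directory on the Lustre filesystem
# rather than under $HOME. The base path here is an example stand-in;
# check your site's documentation for the real Lustre mount point.
lustre_base=${LUSTRE_BASE:-/tmp/lustre-demo}
workdir="$lustre_base/${USER:-demo}/wave_run"
mkdir -p "$workdir"
cd "$workdir"
pwd
```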

Do not run on the Login Node

The login node is shared among all users of the facility. A single user running computational workloads on it will degrade performance and responsiveness for everyone else. Administrators will kill any such jobs they find running on the login node, without notice.

Submitting Jobs

In order to run jobs on the compute nodes you will need to submit a batch job: a small text file that describes the resources the job requires and specifies the commands that must be run to complete it. This is submitted to the queueing system, which decides where and when to run the job so as to make the most effective use of the available compute nodes.

Create a Job Submission Script (Single CPU jobs)

The example below shows a typical job submission script for a single-CPU job. Example job submission scripts for different types of parallel job can be found further down the page.

#!/bin/bash

# Set the name of the job
# (this gets displayed when you get a list of jobs on the cluster)
#SBATCH --job-name="My Wave Simulation"

# Specify the maximum wall clock time your job can use
# (Your job will be killed if it exceeds this)
#SBATCH --time=3:00:00

# Specify the amount of memory your job needs (in MB)
# (Your job will be killed if it exceeds this for a significant length of time)
#SBATCH --mem-per-cpu=1024

# Specify the number of cpu cores your job requires
#SBATCH --ntasks=1

# Set up the environment
module load intel

# Run the application
echo "My job has started"
./wave_1d
echo "My job has finished"

Submit your Job to the Queue

To submit a job to the cluster, use the sbatch command along with the job script file

sbatch wave.job

This will return your unique job id, which can then be used to query or control your job.
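If you are scripting job submission, the job id can be captured from sbatch's "Submitted batch job <id>" message. A minimal sketch; the sample line below stands in for real sbatch output, since the real command can only run on the cluster:

```shell
# sbatch reports the new job's id on standard output, e.g.
#   Submitted batch job 123456
# On the cluster you would capture it with:
#   jobid=$(sbatch wave.job | awk '{print $4}')
# Here a sample line stands in for the real sbatch output:
sample="Submitted batch job 123456"
jobid=$(echo "$sample" | awk '{print $4}')
echo "$jobid"    # -> 123456
```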

Monitoring your Job

The squeue command is used to query the jobs currently in the queue or running on the cluster. It can be used with various options to filter the list.

To list all jobs

squeue

To list just your jobs

squeue --user=username

To get more detailed information about a particular job, use the scontrol command

scontrol show job=job_id
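Because scontrol prints its output as a series of Key=Value pairs, individual fields such as the job state can be extracted with standard tools. A sketch; the sample line below stands in for real scontrol output:

```shell
# scontrol prints fields as Key=Value pairs; JobState shows whether the
# job is PENDING, RUNNING, COMPLETED, etc. A sample line stands in for
# the real output of `scontrol show job=job_id`:
sample="JobId=123456 JobName=wave JobState=RUNNING RunTime=00:10:32"
state=$(echo "$sample" | grep -o 'JobState=[A-Z]*' | cut -d= -f2)
echo "$state"    # -> RUNNING
```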

Killing and Removing your Job

If you have submitted a job to the queue, it can be removed using the scancel command

scancel job_id

The job will be removed from the queue, and if it has already started running it will be killed.

Job Submission Scripts for Parallel Jobs

There are a number of supported combinations of compiler and MPI flavour on the cluster:

  • GNU compilers and OpenMPI
  • GNU compilers and MVAPICH2 (the InfiniBand version of MPICH)
  • Intel compilers and MVAPICH2 (the InfiniBand version of MPICH)

The combination of the Intel compilers and OpenMPI is not available at this time.

For each combination, it is important that the same modules loaded when building the software are also loaded when it is executed.
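One way to guard against a mismatch is to record the loaded modules at build time and compare against them at run time. A sketch; on the cluster you would capture the lists with `module list` (which normally writes to stderr), and the exact output format varies between sites, so sample lists stand in here:

```shell
# Record the modules loaded at build time and compare at run time.
# On the cluster you would capture them with something like
#   module list 2> build_modules.txt
# Sample lists stand in for real module output here:
echo "intel mvapich2" > build_modules.txt   # recorded when compiling
echo "intel mvapich2" > run_modules.txt     # recorded in the job script
if diff -q build_modules.txt run_modules.txt > /dev/null; then
    echo "module environments match"
else
    echo "WARNING: build and run module environments differ"
fi
```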

An example submission script for a code compiled using the Intel compilers and the MVAPICH2 libraries is shown below:

#!/bin/bash

# Set the name of the job
# (this gets displayed when you get a list of jobs on the cluster)
#SBATCH --job-name="My Heat Transfer"

# Specify the maximum wall clock time your job can use
# (Your job will be killed if it exceeds this)
#SBATCH --time=3:00:00

# Specify the amount of memory your job needs per cpu-core (in MB)
# (Your job will be killed if it exceeds this for a significant length of time)
#SBATCH --mem-per-cpu=1024

# Specify the number of cpu cores your job requires
#SBATCH --ntasks=56

# Set up the environment
module load intel
module load mvapich2

# Run the application
echo "My job has started"
mpiexec ./heat_transfer
echo "My job has finished"