How to Build and Run an Example Program on Metis

Building Your First Program

After you have loaded the necessary modules for the type of programs you will run on Metis, you are ready to build and run your first application.

We have created an example illustrating how to build and run a parallel program that uses both the MPI and CUDA libraries on Metis.

To get a copy of the examples, first log in to Metis and enter the following commands:

metis% mkdir examples
metis% cd examples
metis% rsync -av /home/examples/examples-metis/cuda-mpi-pbs ./

This will create a directory called cuda-mpi-pbs. Enter that directory.
Once there, you will find the source code (simpleCUDAMPI.cu, simpleMPI.c, simpleMPI.h),
the Makefile, the job description file cudaMPI.pbs, the documentation README, and the pre-compiled
program cudaMPI.

metis% cd cuda-mpi-pbs
metis% ls
cudaMPI  cudaMPI.pbs  Makefile  README  simpleCUDAMPI.cu  simpleMPI.c  simpleMPI.h

You should be able to re-build the cudaMPI program as shown below:

metis% module purge; module load openmpi/openmpi-4.1.5-gcc-11.4.0-cuda-11.8
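
If you want to confirm that the toolchain is in place before building, a couple of quick checks can be used (the reported paths will vary by installation):

metis% module list
metis% which nvcc
metis% which mpic++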

metis% make clean
rm -f cudaMPI simpleCUDAMPI.o simpleMPI.o

metis% make all
nvcc -c -g -G -I/opt/metis/el8/contrib/openmpi/openmpi-4.1.5-gcc-11.4.0-cuda-11.8/include simpleCUDAMPI.cu 
mpic++ -c -std=c++11 -g simpleMPI.c 
mpic++ -o cudaMPI -L/opt/metis/el8/contrib/cuda/cuda-11.8/lib64 -lcudart -lcuda simpleCUDAMPI.o simpleMPI.o

This will re-create the executable program file, cudaMPI.
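
If you want to verify that the rebuilt binary is linked against the CUDA and MPI libraries from the modules you loaded, a quick check is (output and library paths will vary):

metis% ldd ./cudaMPI | grep -Ei 'cuda|mpi'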

Back to Top

Running Your First Program

All the work you have done thus far on Metis has been on the metis.niu.edu computer that you logged on to. That computer is called the login node. In addition to serving as the entry point for users to log in to Metis, the login node is where users create, edit, store, and build their programs. The programs created and built on the Metis login node are executed on a different set of computers: the Metis cluster's compute nodes.

Before executing your program on the compute nodes, you must create a small file that describes the program you want to run and the resources it needs. This file is called a PBS (Portable Batch System) script. It is typically created only once for each application you want to run and is then re-used each time you run that program. The example directory already contains a PBS script that has been created for you and is ready to use (cudaMPI.pbs).
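
For orientation, the essential pieces of such a script look roughly like this (a condensed sketch of cudaMPI.pbs; the actual file, dissected in 'A Closer Look at a PBS File and Output' below, carries many more comments and options):

#!/bin/bash
#PBS -N cudaMPI                                        # job name
#PBS -j oe                                             # merge stdout and stderr into one output file
#PBS -l select=2:ncpus=8:mpiprocs=8:ngpus=1:mem=16gb   # requested resources
#PBS -l walltime=00:15:00                              # maximum run time
cd $PBS_O_WORKDIR                                      # change to the submission directory
module purge; module load openmpi/openmpi-4.1.5-gcc-11.4.0-cuda-11.8
mpirun ./cudaMPI                                       # run the program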

Once you have your PBS script, submit it to Metis using the 'qsub' command. The batch system scheduler reads the contents of the PBS script, creates what it calls a 'job' for executing your program, and places the newly created job in its queue.

You can submit the PBS script created for you in the example directory using the qsub command like this:

metis% qsub cudaMPI.pbs

10255.cm

metis%

In the example above, the Metis scheduler created a job based on the contents of the PBS script file cudaMPI.pbs and assigned that job a unique ID, 10255.

In most cases, jobs sit in the queue for a time while they wait for Metis compute nodes to become available; they are then loaded onto compute nodes for execution, and finally enter a brief completion stage while the Metis scheduler removes the application from the compute nodes. So most jobs go through three stages: queued, running, and completion. In some (lucky) cases, a submitted job might spend little or no time in the queue and start running immediately after submission.

To observe the status of the job over time, use the qstat command.

A job in the queue (Q) waiting to run on two nodes looks like this:

metis% qstat -a 10255

cm:
                                                                Req'd        Elap
Job ID      Username  Queue  Jobname  SessID  NDS  TSK  Memory  Time      S  Time
----------  --------  -----  -------  ------  ---  ---  ------  --------  -  --------
10255.cm    z123456   short  cudaMPI       0    2    4      --  00:15:00  Q  00:00:03
metis%

The output from qstat will continue to look like this for as long as the job is in the queue. Eventually, the job will be removed from the queue and loaded onto the Metis compute nodes for execution. When that happens, the status of the job in qstat's output changes from Q to R for "running", and the output will look like this:

metis% qstat -a 10255

cm:
                                                                Req'd        Elap
Job ID      Username  Queue  Jobname  SessID  NDS  TSK  Memory  Time      S  Time
----------  --------  -----  -------  ------  ---  ---  ------  --------  -  --------
10255.cm    z123456   short  cudaMPI       0    2    4      --  00:15:00  R  00:00:03
metis%

Once the job has completed, there will be a brief period during which Metis performs some cleanup operations. If you happen to run qstat during this cleanup phase, you will see that the job status has changed from R to C, and the output will look like this:

metis% qstat -a 10255

cm:
                                                                Req'd        Elap
Job ID      Username  Queue  Jobname  SessID  NDS  TSK  Memory  Time      S  Time
----------  --------  -----  -------  ------  ---  ---  ------  --------  -  --------
10255.cm    z123456   short  cudaMPI       0    2    4      --  00:15:00  C  00:00:03
metis%

Eventually, the cleanup phase will be complete, and the job will be removed from qstat's view altogether. When that happens, qstat's output will look like this:

metis% qstat -a 10255

qstat: Unknown Job Id 10255.cm

metis%
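
If you would rather not re-run qstat by hand, a simple shell loop can poll until the job leaves qstat's view (a sketch that relies on qstat returning a non-zero exit status for an unknown job, as shown above; the 30-second interval is arbitrary):

metis% while qstat -a 10255 > /dev/null 2>&1; do sleep 30; done; echo "job 10255 finished"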

Back to Top

Your First Output

When your program runs, a file will be created in the same directory from which you submitted the qsub command. Your program's output (stdout and stderr) will go into that file. The filename has the form <job name>.o<job ID>. In the example above, the output file created for this job is cudaMPI.o10255.

In this example, the PBS job script prints the PBS environment variables, loads the openmpi module, and runs the cudaMPI application, which does some calculations and prints its results, ending with the line 'Test PASSED'. The PBS script's details are explained in 'A Closer Look at a PBS File and Output' below.

metis% cat cudaMPI.o10255
The job working directory $PBS_O_WORKDIR is /home/z123456/examples/cuda-mpi-pbs
#============
PBS Environment variables, can be used in the job submission scripts as $PBS_VARNAME
PBS_ENVIRONMENT=PBS_BATCH
****************************************************
Job starting at: Wed Nov 29 19:08:07 CST 2023 at compute node cn14
****************************************************
Loading required environment modules
Loading openmpi/openmpi-4.1.5-gcc-11.4.0-cuda-11.8
Loading requirement: gcc/gcc-11.4.0 cuda/cuda-11.8
Currently Loaded Modulefiles:
1) gcc/gcc-11.4.0 3) openmpi/openmpi-4.1.5-gcc-11.4.0-cuda-11.8
2) cuda/cuda-11.8
Running the ./cudaMPI program using 16 mpi processes: mpirun ./cudaMPI
Running on 16 nodes, dataSizeTotal= 160 MB
Average of square roots is: 0.667268
Average of square roots 27987268.000000 over 41943040 is: 0.667268
Test PASSED
****************************************************
Job completed at: Wed Nov 29 19:08:19 CST 2023
****************************************************

metis%
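
To quickly confirm that the run succeeded without reading through the whole file, you can also just search for the final line (a convenience check, not part of the example):

metis% grep "Test PASSED" cudaMPI.o10255
Test PASSED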

Back to Top

A Closer Look at a PBS File and Output

The example's PBS file, cudaMPI.pbs, contains a thorough explanation of its contents.
Please download it, read all the comments, and pay attention to the details.
The vital PBS file properties and directives are also explained below.

The PBS file structure

#
#  - the very first line defines the shell interpreter to be used
#    "#!/bin/bash"  in this case, do not edit
#
#  - all lines starting with # are comments, except those
#    starting with '#PBS' and "#!"
#
#  - lines starting with '#PBS' are the batch system directives
#    (google for "PBS directives" for examples and tutorials)
#
#  - lines starting with '#--#PBS' are commented PBS directives 
#
#  - all other lines will be interpreted as commands, exactly
#    like in the terminal session

The PBS directives

#The name of this job
#PBS -N cudaMPI

#Tells PBS to place all the stdout and stderr in a single output file
#(in this example, cudaMPI.o10255)
#PBS -j oe

#Requests resources to run the job
#PBS -l select=2:ncpus=8:mpiprocs=8:ngpus=1:mem=16gb

#At Metis, a CPU core and a GPU card are the elementary units of computational resources. 
#The cluster has 32 nodes, each with 128 CPU cores, one GPU card, and either 251 GB (28 nodes) or 1024 GB (4 nodes) of RAM. 
#The resource requests are organized in "chunks." The smallest chunk includes one CPU and 2 GB of memory.
#The largest chunk includes 128 CPUs and 1007 GB of memory. A chunk can also request a GPU card 
#and specify the number of MPI tasks, usually equal to the number of CPU cores in a chunk. 
#A single chunk usually serves one MPI task or one application instance. 
#Small CPU-only chunks can occupy the same node - for example, up to 128 1-CPU chunks can run on a single node.
#To request a specific number of chunks, CPUs, MPI processes, and GPUs,
#use the command "#PBS -l select=Nchunks:ncpus=Ncpus:mpiprocs=NPmpi:ngpus=Ngpus:mem=Xgb"
#For CPU-only jobs use the command "#PBS -l select=Nchunks:ncpus=Ncpus:mpiprocs=NPmpi:mem=Xgb"
#
#Note:   
#              Nchunks<=32, for GPU chunks
#              Nchunks<=4096/Ncpus for CPU-only chunks
#              (run 'shownodes' command to find the number of free cpus) 
#              Ncpus<=128, the total number of CPUs per node is 128 
#              NPmpi<=Ncpus, the total number of CPUs allocated for MPI tasks, 
#                              request NPmpi=Ncpus for non-OPENMP jobs                           
#              Ngpus==1,  the total number of GPUs per node is 1    
#              X<=251,  28 of 32 Metis nodes have 251 gb of RAM                       
#                       special jobs can request up to 1007 gb of RAM (4 nodes)
#Above we request two chunks; each chunk needs
#8 CPUs, 8 MPI processes, 1 GPU card, and 16 GB RAM
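
#For illustration only (these lines are not in cudaMPI.pbs), the same syntax
#could be used to request, say, a CPU-only chunk or a whole node with its GPU:
#    #PBS -l select=1:ncpus=16:mpiprocs=16:mem=32gb
#    #PBS -l select=1:ncpus=128:mpiprocs=128:ngpus=1:mem=251gb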

#Request no more than 15 minutes to run the job (format hh:mm:ss)
#PBS -l walltime=00:15:00

#
#To estimate the time needed for long jobs:
# - Estimate the fraction of events, records, or iterations, which
#   your application can process during ~15 min
# - Extrapolate to find the time needed to process an entire dataset
# Example: the measured time to process 10 records is 10 sec;
#          one can expect that 1000 records will be processed in 1000 sec.
# Multiply the result by a factor of two to allow a safety margin.
# If the result exceeds 24-48 hours, think about how to split the job -
# running several short jobs can decrease the waiting time in the PBS queue. 
#
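#Continuing the example above (illustration only, not part of cudaMPI.pbs):
#1000 records at ~1 sec each, times a safety factor of two, is 2*1000 = 2000 sec,
#which corresponds to a walltime request of roughly 00:34:00.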

#When to send a status email
#("-m abe" sends e-mails at job abort, begin, and end)
#PBS -m ae

#Custom user's email; edit and uncomment
#(remove the leading "#--" to activate)
#--#PBS -M account@niu.edu

If you are affiliated with more than one project, uncomment this directive (i.e., delete the leading "#--") and replace 'project' with the project associated with this job.
#--#PBS -A project

PBS Script Command Section

When your job starts, the current directory will likely be your home directory.
You need to 'cd' to where your project executable files are to run them:
cd $PBS_O_WORKDIR

The $PBS_O_WORKDIR variable is set to the directory from which you executed the qsub command. Presumably, you would have done that from the directory where your application program lives, as shown above. If not, you can hard-code any directory that makes sense for your project in its place.
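
For example (the second path is only a hypothetical illustration, not part of the example script):

# Run from the directory where 'qsub' was executed, as cudaMPI.pbs does
cd $PBS_O_WORKDIR
# ...or hard-code the location of your executables instead
# cd /home/z123456/my_project/bin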

Next, the PBS script prints PBS environment variables, loads the openmpi module, and runs the cudaMPI program using the mpirun command.

This example also logs the times at which it starts and finishes using echo commands. Below, we print the command section of cudaMPI.pbs using the "tail" command.

metis% cat cudaMPI.pbs | tail -40

#===================================================================#
#==== Script Command  Section (executed on a remote node)===========#
# Use the "normal" bash script syntacsis (google for "bash tutorials")
# for example, https://linuxhint.com/30_bash_script_examples 
#===================================================================#
# Change to the directory where the 'qsub' command was executed.
# The $PBS_O_WORKDIR is always pointing to the job submission directory
echo "The job working directory \$PBS_O_WORKDIR is $PBS_O_WORKDIR"
cd $PBS_O_WORKDIR   
#Print out PBS environment variables
echo "#============="
echo "PBS Environment variables, can be used in the job submission scripts as \$PBS_VARNAME"
env | grep PBS
echo "#============="
echo "For example,we can find the number NPmpi of allocated MPI processes as"
echo "NPmpi=\"\$(cat \$PBS_NODEFILE | wc -l)\"" 
NPmpi="$(cat $PBS_NODEFILE | wc -l)" 
echo "NPmpi=$NPmpi"
#
# Print out when and where this job starts
echo '****************************************************'
echo "Job starting at: `date` at compute node `hostname`"
echo '****************************************************'
# Uncomment 'set -x' to enable a mode of the shell 
# where all executed commands are printed to the output file.
# (may help to visualize the control flow of the script if it is not functioning as expected)
#set -x 
#
echo "Loading required environment modules"
module purge; module load openmpi/openmpi-4.1.5-gcc-11.4.0-cuda-11.8
# List the loaded modules
module list
# Run the program 'cudaMPI', expected to be present in the submission folder
# ('./' is the path to the current directory)
echo "Running the ./cudaMPI program using $NPmpi mpi processes: mpirun ./cudaMPI"
mpirun ./cudaMPI
set +x
echo '****************************************************'
echo "Job completed at: `date`"
echo '****************************************************'
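
If you also want the output file to record whether mpirun itself exited cleanly, one option (not part of the example script) is to capture and print its exit code:

mpirun ./cudaMPI
status=$?
echo "mpirun exited with status $status"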

PBS output file

The output file contains a combination of output from the example program (e.g., 'Test PASSED'), from the Metis system, and from the print (i.e., 'echo') statements in the PBS script.

metis% cat cudaMPI.o10255

The job working directory $PBS_O_WORKDIR is /home/z123456/examples/cuda-mpi-pbs
#============
PBS Environment variables, can be used in the job submission scripts as $PBS_VARNAME
PBS_ENVIRONMENT=PBS_BATCH
****************************************************
Job starting at: Wed Nov 29 19:08:07 CST 2023 at compute node cn14
****************************************************
Loading required environment modules
Loading openmpi/openmpi-4.1.5-gcc-11.4.0-cuda-11.8
Loading requirement: gcc/gcc-11.4.0 cuda/cuda-11.8
Currently Loaded Modulefiles:
1) gcc/gcc-11.4.0 3) openmpi/openmpi-4.1.5-gcc-11.4.0-cuda-11.8
2) cuda/cuda-11.8

Running the ./cudaMPI program using 16 mpi processes: mpirun ./cudaMPI
Running on 16 nodes, dataSizeTotal= 160 MB
Average of square roots is: 0.667268
Average of square roots 27987268.000000 over 41943040 is: 0.667268
Test PASSED


****************************************************
Job completed at: Wed Nov 29 19:08:19 CST 2023
****************************************************

metis%


Back to top