Molcas Forum

Support and discussions for Molcas and OpenMolcas users and developers


#1 2015-11-06 16:07:36

Steven
Administrator
From: Lund
Registered: 2015-11-03
Posts: 95

What everyone should know about running Molcas in parallel.

Below is a set of guidelines for those who need or want to install Molcas on a cluster. You should read these before posting questions to this forum. They are meant to be an up-to-date reference, so feel free to point out any errors or inaccuracies so the post can be edited.

First, go read the Molcas parallelization efforts to learn which parts of Molcas can benefit from running in parallel.

To avoid confusion below: by "node" I mean a single machine in the computer cluster that is connected to the network, and by "process" I mean an MPI process.

Understand you won't be able to use all available cores

There is a common misunderstanding among users new to running parallel Molcas that parallel means using all cores. This is completely wrong. Some core Molcas modules (if not most of them) have bottlenecks other than CPU: they are limited by memory bandwidth and/or disk I/O. On most current architectures, cores have to share main memory bandwidth and disk I/O, so using more of them doesn't necessarily improve performance.
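A quick way to see how many sockets and physical cores a node actually has, and how much memory, is to inspect it directly (a minimal sketch, assuming a Linux node where lscpu and free are available):

# show sockets, cores per socket and threads per core
lscpu | grep -E 'Socket|Core|Thread'
# show total memory on the node, in GiB
free -g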

Preparation

Read relevant documentation

  • Molcas has a section on parallel installation in the user's manual.

  • Computer centers have different ways of setting up and running parallel/MPI jobs. Read their local documentation first and do not assume anything about the environment in which Molcas is running.

Use stable but up-to-date compilers/libraries

  • Use compiler/library versions Molcas was tested with (a quick way to check which versions you have is shown after this list).

  • Use the default/supported compilers/libraries of your computer center.

  • Try to install a basic Molcas first (no extra options besides MPI).
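Before building, it helps to record exactly which compilers and MPI library you are using (a minimal sketch; whether the module command and these wrappers are available depends on your computer center):

# list the currently loaded modules
module list
# print the versions behind the MPI compiler wrappers and launcher
mpicc --version
mpifort --version
mpirun --version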

Test before trying to run actual calculations

  • If using GA, run its tests first.

  • Test your submit script separately by running a simple command (e.g. echo hello) instead of molcas; see the sketch after this list.

  • Run the Molcas test suite in parallel, preferably on more than 1 node.
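For example, a throw-away submit script along these lines lets you check the scheduler, the module environment, and the MPI launcher before Molcas ever enters the picture (a minimal sketch for SLURM; the module name and launcher are assumptions that depend on your cluster):

#!/bin/bash
#SBATCH -J mpi-test
#SBATCH -t 00:05:00
#SBATCH -N 2
#SBATCH --tasks-per-node=2

# load the same MPI module you will later build Molcas with
module load openmpi
# every MPI process prints its host name, so you can verify that
# processes actually land on the nodes you asked for
mpirun hostname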

Building

The simplest way to use parallelization with Molcas is to use the standard, built-in DGA back-end.

When using the standard configure+make to build Molcas, you should probably use the flag '-mpi_wrappers'; otherwise you will need to specify the exact link line yourself, e.g.:

# load MPI module, adds path of the wrappers
module load openmpi
# verify the compiler wrappers
mpicc -show
mpifort -show
# build Molcas with configure+make
./configure -parallel ompi -mpi_wrappers -par_run /path/to/mpirun
make

With CMake (available in Molcas 8.1 and later), you only need to make sure that the MPI wrappers point to the correct compilers, then tell CMake to use MPI, e.g.:

# load MPI module, adds path of the wrappers
module load openmpi
# verify the compiler wrappers
mpicc -show
mpifort -show
# build Molcas with cmake
CC=mpicc FC=mpifort cmake -DMPI=ON /path/to/molcas
make

An example with a complete Intel tool set and HDF5 enabled would be:

# load MPI module, adds path of the wrappers
module load intel
module load impi
module load imkl
# build Molcas with cmake
CC=mpiicc FC=mpiifort cmake -DMPI=ON -DLINALG=MKL -DHDF5=ON /path/to/molcas
make

Running

The way Molcas is run in parallel is determined by the command stored in the 'RUNBINARY' variable in the 'molcas.rte' file. Check this variable and, if necessary, alter it to make sure it is correct for your system. By default, it will be something like:

RUNBINARY='/usr/bin/mpirun -n $MOLCAS_NPROCS $program'

Often, a computer center recommends its own launcher instead of 'mpirun', or you need to provide extra options to it, e.g. '-machinefile $NODEFILE', where the 'NODEFILE' variable will point to an actual file during a calculation.
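For example, on a cluster where SLURM's srun is the recommended launcher, RUNBINARY could be adapted along these lines (a hypothetical sketch; the exact launcher and options depend on your computer center):

# in molcas.rte, replace the default mpirun command with the local launcher
RUNBINARY='srun -n $MOLCAS_NPROCS $program'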

The last thing to decide is how to choose the number of MPI processes and OpenMP threads. The recommended number of MPI processes per node is equal to the number of physical CPUs (sockets) in that node. If you have, e.g., 4 nodes with 2 sockets (physical CPUs) per node and 32 GB of RAM per node, you could run Molcas with SLURM using the following job submit script:

#!/bin/bash
#SBATCH -J my-job-name
#SBATCH -t 06:00:00
#SBATCH -N 4
#SBATCH --tasks-per-node=2
#SBATCH --mem-per-cpu=16000
#SBATCH --exclusive
#SBATCH -o job-%j.stdout
#SBATCH -e job-%j.stderr

module load mkl/11.1.3.174
module load intel/14.0.3
module load impi/4.1.3.048

# Intel MPI settings
export I_MPI_PIN_DOMAIN=socket
export I_MPI_DEBUG=4

# OpenMP settings
export OMP_NUM_THREADS=1

#### set Molcas location
export MOLCAS=/home/username/molcas-8.1

export MOLCAS_MEM=12000
export MOLCAS_NODELIST=$SLURM_NODELIST
export MOLCAS_NNODES=$SLURM_NNODES
export MOLCAS_NPROCS=$SLURM_NPROCS

#### start the calculation ####

cd $SLURM_SUBMIT_DIR
molcas -f my-project.input

Since there is 32 GB of memory per node and we run 2 MPI processes per node, each MPI process has about 16 GB available. The recommended safety margin is to set MOLCAS_MEM to about 75% of that maximum, i.e. 12 GB or 12000 MB. Note that in the sbatch settings we still ask for the full 16 GB per CPU, since that is the maximum available memory.
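If you prefer not to hard-code these numbers, the same 75% rule can be computed inside the submit script from what SLURM actually grants (a minimal sketch, assuming --mem-per-cpu is used as above so that SLURM_MEM_PER_CPU is defined):

# memory granted per task in MB, as requested with --mem-per-cpu
echo "SLURM grants each task ${SLURM_MEM_PER_CPU} MB"
# keep a 25% safety margin for Molcas
export MOLCAS_MEM=$(( SLURM_MEM_PER_CPU * 75 / 100 ))
echo "MOLCAS_MEM set to ${MOLCAS_MEM} MB"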


Always check the orbitals.

