Dear all,
for a proposal we had to perform a scaling study and noticed a strange performance drop when using a certain number of cores (on 1 node) for the following input:
>> export OMP_NUM_THREADS=1
>> export MOLCAS_MEM = 6000
&GATEWAY
Title= Mo1
Coord= $CurrDir/Structure.Opt.xyz
Basis set
ANO-RCC-VTZP
Group= NoSym
&SEWARD
cholesky
End of input
&SCF
Title
Molecule
Iterations
50
End of Input
&RASSCF
Title
RASSCF_S3_I1
Symmetry
1
Spin
3
CIROOT
21 21 1
nActEl
8 0 0
Inactive
56
Ras2
6
Linear
Lumorb
THRS
1.0e-08 1.0e-04 1.0e-04
Levshft
1.50
Iteration
200 50
CIMX
200
SDAV
500
End of Input
&CASPT2
Title
CASPT2_S3_I1
MAXITER
25
Multistate
21 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
Imaginary Shift
0.1
End of Input
>> COPY $Project.JobMix $CurrDir/JobMix.13
>> COPY $CurrDir/JobMix.13 JOB001
&RASSI
Nr of JobIphs
1 21
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
Spin Orbit
Ejob
Omega
End of input
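For reference, a minimal sketch of how such a core-count scan could be launched, assuming the OpenMolcas pymolcas driver and the MOLCAS_NPROCS variable (the input file name and the loop are placeholders, not the actual submission script):

# Sketch: run the same input with different numbers of MPI processes (assumed setup)
export MOLCAS_MEM=6000              # memory per process, in MB
for n in 1 2 4 8 12 16; do
    export MOLCAS_NPROCS=$n
    pymolcas Mo1.input > scaling_${n}cores.log 2>&1
done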
In particular, the above-mentioned case deals with an excited triplet state and leads to the following wall clock times on 1 node:
Number of cores   Wall clock time [s]   Speed-up (w.r.t. 1-core baseline)
 1                   3454.15              1.00
 2                   2773.96              1.25
 4                   2575.82              1.34
 8                   1488.68              2.32
12                 237844.91              0.01
16                  92684.75              0.04
The same setup has been used for a different electronic configuration, i.e., excited quintet state calculations, which do not show this dramatic performance drop when more than 12 cores are used. Hence, we made several attempts to track down the performance drop:
1) To rule out that another user was using the same node on the HPC cluster while these calculations were running, we modified the batch script so that we "exclusively" use and fully occupy one node. We tested this with 16 cores per job on a node with 128 available cores. However, this did not resolve the performance drop (a minimal batch-script sketch of such an exclusive allocation is given after this list).
2) We compiled OpenMolcas on a separate virtual machine and observed more or less the same behaviour, with a remarkable wall clock time of 18348.67 s for a test with 12 cores per job, even though this is still approx. 12 times faster than the corresponding calculation on the HPC cluster, which took 237844.91 s (see above).
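For illustration, a minimal sketch of such an exclusive single-node allocation, assuming a SLURM scheduler (the cluster's batch system, option names, and input file name may differ):

#!/bin/bash
#SBATCH --nodes=1        # keep everything on a single node
#SBATCH --ntasks=16      # 16 cores/processes for this test
#SBATCH --exclusive      # no other jobs may share the node
export MOLCAS_NPROCS=16
export MOLCAS_MEM=6000
pymolcas Mo1.input > Mo1_16cores.log 2>&1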
A few more observations, which are probably obvious:
By looking at the *.status file of the corresponding job, we notice that the calculations take far too long in the CASPT2 module, while "solving CASPT2 eqs for state 1-21". We suspect this problem occurs only in specific cases, especially when more than 8 cores are requested, although the excited quintet state calculations with more than 8 cores did not show this performance drop. For the scaling study on the HPC cluster we used OpenMPI libraries together with Intel compilers, whereas on the separate virtual machine OpenMPI libraries and GNU compilers were used. If you need further information, we can also share the results for the excited quintet state calculations.
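A trivial way to follow which module is currently running is to watch the status file (the exact file name depends on the project name):

# Follow the job's status file while it runs (file name pattern assumed)
tail -f *.status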
What might be the reason for such a performance drop, especially in the case of the excited triplet state calculation? It is clear to us that parallelization has its limits, which in the worst case may even lead to "over-parallelization", but why does this performance drop only affect the excited triplet state calculation while the excited quintet calculation seems to be unaffected?
I am grateful for any hint!
Thanks in advance!
MPI parallelization works best with 1 core per node, because there is then no competition between the processes for the resources of the node. That is, assuming inter-node communication is negligible, which it probably isn't.
The performance drop may have something to do with pinning, or some of the options that can be controlled with the MPI launcher. But it could also be related to the main memory. Note that the memory specified with MOLCAS_MEM is per process, and you should allow for some overhead (both from the OS and [Open]Molcas). So if the node has less than 6*12=72 GB of RAM, you may be exhausting the memory and swapping.
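As a sketch of what can be checked, assuming an OpenMPI launcher (option names differ for other MPI implementations):

# Show how ranks get pinned with explicit core binding ('hostname' is just a stand-in command)
mpirun -np 12 --report-bindings --bind-to core --map-by core hostname
# Memory budget reminder: MOLCAS_MEM is per process (in MB), so 12 x 6000 MB = 72 GB plus overhead.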
Thanks for the reply, Ignacio!
Actually, that is something I already kept in mind: one node has approx. 256 GB of RAM, so I can exclude that the memory gets exhausted. The virtual machine also has around 125 GB of RAM, so any potential overhead from the calculation should be handled without any problem.
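(For what it's worth, memory and swap usage on the node can also be watched during a run, e.g.:)

# Refresh the memory/swap summary every 10 s while the job runs
watch -n 10 free -g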
However, how can it be explained that this performance drop only applies to the excited triplet state calculation while the excited quintet calculation seems to be unaffected? Could the pinning issue still have an effect on the performance drop in such a scenario?
Perhaps the problem sizes are different and behave differently with respect to parallelization. It could also be an I/O issue: all processes use the same disk, and concurrency may be a problem.
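If disk contention is the culprit, one thing to try is pointing the scratch area to node-local storage; a minimal sketch, assuming such a local disk exists (the path is a placeholder):

# Put the per-process scratch files on node-local disk instead of a shared filesystem
export MOLCAS_WORKDIR=/local/scratch/$USER   # site-specific placeholder path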
If I remember correctly, the main purpose of CASPT2 parallelization was not to speed up a calculation, but to allow combining resources (memory) from different nodes in order to perform a calculation that would not fit in a single node.
Okko, the poor parallel performance of CASPT2 is a well-known fact. It works only if MOLCAS_NPROCS is small and all calculations are done on one node (assuming that there is enough memory). The problem is not only known but pinpointed: there are several known places in the code which have to be rewritten.
I planned to run a project to fix these problems, but circumstances have prevented me from doing so. If your interest in the problem is serious, email me and I can give you some hints. But this is not a simple fix. Sorry.
And the starting point here is to have a good test case.
And BTW, there is no point in comparing CASPT2 performance for 1 vs. 2 processes, since one process is always sleeping in the current implementation. So the speed-up you see there comes only from the other modules, which should be excluded from the test.