Dear all,
for a proposal we had to perform a scaling study and noticed a strange performance drop when using a certain number of cores (on 1 node) for the following input:
>> export OMP_NUM_THREADS=1
>> export MOLCAS_MEM = 6000
&GATEWAY
Title= Mo1
Coord= $CurrDir/Structure.Opt.xyz
Basis set
ANO-RCC-VTZP
Group= NoSym
&SEWARD
cholesky
End of input
&SCF
Title
Molecule
Iterations
50
End of Input
&RASSCF
Title
RASSCF_S3_I1
Symmetry
1
Spin
3
CIROOT
21 21 1
nActEl
8 0 0
Inactive
56
Ras2
6
Linear
Lumorb
THRS
1.0e-08 1.0e-04 1.0e-04
Levshft
1.50
Iteration
200 50
CIMX
200
SDAV
500
End of Input
&CASPT2
Title
CASPT2_S3_I1
MAXITER
25
Multistate
21 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
Imaginary Shift
0.1
End of Input
>> COPY $Project.JobMix $CurrDir/JobMix.13
>> COPY $CurrDir/JobMix.13 JOB001
&RASSI
Nr of JobIphs
1 21
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
Spin Orbit
Ejob
Omega
End of input
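For reference, a minimal sketch of how such a core-count scan could be launched, assuming the OpenMolcas pymolcas driver and the MOLCAS_NPROCS variable (the input file name and the loop are placeholders, not the actual submission script):

# Sketch: run the same input with different numbers of MPI processes (assumed setup)
export MOLCAS_MEM=6000              # memory per process, in MB
for n in 1 2 4 8 12 16; do
    export MOLCAS_NPROCS=$n
    pymolcas Mo1.input > scaling_${n}cores.log 2>&1
done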
In particular, the above-mentioned case deals with an excited triplet state and leads to the following wall clock times on 1 node:
Number of cores   Wall clock time [s]   Speed-up (w.r.t. 1-core baseline)
 1                   3454.15              1.00
 2                   2773.96              1.25
 4                   2575.82              1.34
 8                   1488.68              2.32
12                 237844.91              0.01
16                  92684.75              0.04
The same setup has been used for a different electronic configuration, i.e., excited quintet state calculations, which do not show this dramatic performance drop when more than 12 cores are used. Hence, we made several attempts to track down the performance drop:
1) To rule out that another user was using the same node on the HPC cluster while these calculations were running, we modified the batch script so that we "exclusively" use and fully occupy one node. We tested this with 16 cores per job on a node with 128 available cores. However, this did not resolve the performance drop (a minimal batch-script sketch of such an exclusive allocation is given after this list).
2) We compiled OpenMolcas on a separate virtual machine and observed more or less the same behaviour, with a remarkable wall clock time of 18348.67 s for a test with 12 cores per job, even though this is still approx. 12 times faster than the corresponding calculation on the HPC cluster, which took 237844.91 s (see above).
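For illustration, a minimal sketch of such an exclusive single-node allocation, assuming a SLURM scheduler (the cluster's batch system, option names, and input file name may differ):

#!/bin/bash
#SBATCH --nodes=1        # keep everything on a single node
#SBATCH --ntasks=16      # 16 cores/processes for this test
#SBATCH --exclusive      # no other jobs may share the node
export MOLCAS_NPROCS=16
export MOLCAS_MEM=6000
pymolcas Mo1.input > Mo1_16cores.log 2>&1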
A few more observations, which are probably obvious:
By looking at the *.status file of the corresponding job, we notice that the calculations take far too long in the CASPT2 module, while "solving CASPT2 eqs for state 1-21". We suspect this problem occurs only in specific cases, especially when more than 8 cores are requested, although the excited quintet state calculations with more than 8 cores did not show this performance drop. For the scaling study on the HPC cluster we used OpenMPI libraries together with Intel compilers, whereas on the separate virtual machine OpenMPI libraries and GNU compilers were used. If you need further information, we can also share the results for the excited quintet state calculations.
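A trivial way to follow which module is currently running is to watch the status file (the exact file name depends on the project name):

# Follow the job's status file while it runs (file name pattern assumed)
tail -f *.status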
What might be the reason for such a performance drop, especially in the case of the excited triplet state calculation? It is clear to us that parallelization has its limits, which in the worst case may even lead to "over-parallelization", but why does this performance drop only affect the excited triplet state calculation while the excited quintet calculation seems to be unaffected?
I am grateful for any hint!
Thanks in advance!
MPI parallelization works best with 1 core per node, because there is then no competition between the processes for the resources of the node. That is, assuming inter-node communication is negligible, which it probably isn't.
The performance drop may have something to do with pinning, or some of the options that can be controlled with the MPI launcher. But it could also be related to the main memory. Note that the memory specified with MOLCAS_MEM is per process, and you should allow for some overhead (both from the OS and [Open]Molcas). So if the node has less than 6*12=72 GB of RAM, you may be exhausting the memory and swapping.
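As a sketch of what can be checked, assuming an OpenMPI launcher (option names differ for other MPI implementations):

# Show how ranks get pinned with explicit core binding ('hostname' is just a stand-in command)
mpirun -np 12 --report-bindings --bind-to core --map-by core hostname
# Memory budget reminder: MOLCAS_MEM is per process (in MB), so 12 x 6000 MB = 72 GB plus overhead.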
Thanks for the reply, Ignacio!
Actually, that is something I already kept in mind: one node has approx. 256 GB of RAM, so I can exclude that the memory gets exhausted. The virtual machine also has around 125 GB of RAM, so any potential overhead from the calculation should be handled without any problem.
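(For what it's worth, memory and swap usage on the node can also be watched during a run, e.g.:)

# Refresh the memory/swap summary every 10 s while the job runs
watch -n 10 free -g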
However, how can it be explained that this performance drop only applies to the excited triplet state calculation while the excited quintet calculation seems to be unaffected? Could the pinning issue still have an effect on the performance drop in such a scenario?
Perhaps the problem sizes are different and behave differently with respect to parallelization. It could also be an I/O issue: all processes use the same disk, and concurrency may be a problem.
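If disk contention is the culprit, one thing to try is pointing the scratch area to node-local storage; a minimal sketch, assuming such a local disk exists (the path is a placeholder):

# Put the per-process scratch files on node-local disk instead of a shared filesystem
export MOLCAS_WORKDIR=/local/scratch/$USER   # site-specific placeholder path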
If I remember correctly, the main purpose of CASPT2 parallelization was not to speed up a calculation, but to allow combining resources (memory) from different nodes in order to perform a calculation that would not fit in a single node.
Okko, the poor parallel performance of CASPT2 is a well-known fact. It works only if MOLCAS_NPROCS is small and all calculations are done on one node (assuming that there is enough memory). The problem is not only known but pinpointed: there are several known places in the code which have to be rewritten.
I planned to run a project to fix these problems, but circumstances have prevented me from doing so. If your interest in the problem is serious, email me and I can give you some hints. But this is not a simple fix. Sorry.
And the starting point here is to have a good test case.
And BTW, there is no point in comparing CASPT2 performance for 1 vs. 2 processes, since one process is always sleeping in the current implementation. So the speed-up you see there comes only from the other modules, which should be excluded from the test.