Support and discussions for Molcas and OpenMolcas users and developers
Hello!
I am in the process of shifting from working on a cluster to a supercomputer. I have compiled OpenMolcas on both. However, I quickly noticed that Seward is considerably faster on the cluster than it is on the supercomputer. I ran a simple test with a 45-atom molecule in serial (single thread) with cc-pVDZ (445 basis functions). On the cluster, the Seward module finishes in 23 minutes, while on the supercomputer it takes 2 hours 25 minutes. The reading/writing of ORDINT and TEMP01 appears to be the bottleneck.
I am trying to understand why this is happening, to see if there is a way to improve the performance on the supercomputer.
These are the specs of the nodes that I am using on each of the systems:
Cluster (Seward is 6 times faster here):
1 × Intel Xeon Gold 6130 processor (22 MB cache, 2.1 GHz, 16 cores)
Memory: 96 GB
Every compute node in the system contains a local disk. These disks are much more efficient than the home file system and are only accessible from within the node itself.
The scratch file system is located on this local disk.
Supercomputer:
2 × Intel Xeon E5-2690 v3 processors (30 MB cache, 2.6 GHz, 12 cores)
Memory: 64 GB
"/scratch-local/" behaves as if it were local to each node, whereas "/scratch-shared/" denotes the same location on every node. In fact, however, not even the /scratch-local/ directories are truly (physically) local.
Cluster (23 minutes):
configuration info
------------------
C Compiler ID: GNU
C flags: -std=gnu99 -fopenmp
Fortran Compiler ID: GNU
Fortran flags: -cpp -fno-aggressive-loop-optimizations -fdefault-integer-8 -fopenmp
Definitions: _MOLCAS_;_I8_;_LINUX_;_GA_;_MOLCAS_MPP_;SCALAPACK;_MKL_
Parallel: ON (GA=ON)
&GATEWAY
coord=$CurrDir/Geom.xyz
basis=cc-pVDZ
group=C1
&SEWARD
++ I/O STATISTICS
I. General I/O information
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Unit Name Flsize Write/Read MBytes Write/Read
(MBytes) Calls In/Out Time, sec.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1 RUNFILE 16.20 . 1316/ 8224 . 25.2/ 89.9 . 0/ 0
2 NQGRID 0.00 . 2/ 0 . 0.0/ 0.0 . 0/ 0
3 ONEINT 16.50 . 44/ 889 . 47.6/ 28.4 . 0/ 0
4 ORDINT 42801.78 . 5079503/ 480480 . 156050.0/ 120120.0 . 592/ 16
5 TEMP01 17965.38 . 4742742/ 143723 . 35930.3/ 17965.4 . 22/ 3
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
* TOTAL 60799.86 . 9823607/ 633316 . 192053.1/ 138203.7 . 615/ 20
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
II. I/O Access Patterns
- - - - - - - - - - - - - - - - - - - -
Unit Name % of random
Write/Read calls
- - - - - - - - - - - - - - - - - - - -
1 RUNFILE 28.6/ 10.6
2 NQGRID 50.0/ 0.0
3 ONEINT 93.2/ 1.7
4 ORDINT 65.0/ 64.4
5 TEMP01 63.0/ 100.0
- - - - - - - - - - - - - - - - - - - -
--
--- Stop Module: seward at Sun Mar 29 18:29:04 2020 /rc=_RC_ALL_IS_WELL_ ---
--- Module seward spent 23 minutes 17 seconds ---
Supercomputer (2 hours 25 minutes):
configuration info
------------------
C Compiler ID: GNU
C flags: -std=gnu99 -fopenmp
Fortran Compiler ID: GNU
Fortran flags: -fno-aggressive-loop-optimizations -cpp -fdefault-integer-8 -fopenmp
Definitions: _MOLCAS_;_I8_;_LINUX_;_GA_;_MOLCAS_MPP_;SCALAPACK;_MKL_
Parallel: ON (GA=ON)
++ --------- Input file ---------
&GATEWAY
coord=$CurrDir/Geom.xyz
basis=cc-pVDZ
group=C1
&SEWARD
-- ----------------------------------
++ I/O STATISTICS
I. General I/O information
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Unit Name Flsize Write/Read MBytes Write/Read
(MBytes) Calls In/Out Time, sec.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1 RUNFILE 16.20 . 1428/ 8775 . 26.0/ 93.9 . 0/ 0
2 NQGRID 0.00 . 2/ 0 . 0.0/ 0.0 . 0/ 0
3 ONEINT 23.31 . 62/ 889 . 66.6/ 28.4 . 0/ 0
4 ORDINT 41057.53 . 5285395/ 471585 . 155504.2/ 117896.2 . 723/ 5091
5 TEMP01 18804.12 . 4964239/ 150433 . 37608.1/ 18804.1 . 128/ 1853
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
* TOTAL 59901.17 .10251126/ 631682 . 193204.8/ 136822.7 . 852/ 6944
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
II. I/O Access Patterns
- - - - - - - - - - - - - - - - - - - -
Unit Name % of random
Write/Read calls
- - - - - - - - - - - - - - - - - - - -
1 RUNFILE 28.6/ 10.6
2 NQGRID 50.0/ 0.0
3 ONEINT 95.2/ 1.7
4 ORDINT 57.3/ 65.2
5 TEMP01 54.8/ 100.0
- - - - - - - - - - - - - - - - - - - -
--
--- Stop Module: seward at Mon Mar 30 03:02:25 2020 /rc=_RC_ALL_IS_WELL_ ---
--- Module seward spent 2 hours 25 minutes 22 seconds ---
From looking at this, my thoughts are that either
(A) the disk being truly local on the cluster provides a very substantial I/O performance boost, or
(B) there is something else that I might be able to change in order to boost the performance.
If someone has any tips on how I can navigate this issue, I would appreciate it very much! My experience with this is very limited, so even pointers to resources for learning the basics of I/O performance and benchmarking would be helpful.
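(As a first sanity check, a rough comparison of the raw scratch throughput on the two systems with dd might already be telling; just a sketch, using direct I/O to bypass the page cache, with a placeholder path:)

# write 4 GiB to the scratch file system, then read it back
dd if=/dev/zero of=/scratch-local/$USER/ddtest bs=1M count=4096 oflag=direct
dd if=/scratch-local/$USER/ddtest of=/dev/null bs=1M iflag=direct
rm /scratch-local/$USER/ddtest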
Thank you!
Max
Of course, a truly local scratch directory is what you want for reasonable efficiency. A non-local scratch directory is asking for trouble. As a possible workaround you could use RICD, which will reduce the I/O, but you'll still have an I/O problem with non-local scratch.
Thank you Ignacio. I have done some experimenting with RICD, and it has worked very well for me in XMS-CASPT2 single-point calculations. However, I am running into problems when trying to optimize MECPs. During my first attempt, Alaska triggered the numerical gradients. I then added "DoAnalytical" to Seward, and now Alaska calls MCLR, but on the second Alaska call it freezes (the last message that prints is: "A total of 11606515. entities were prescreened and 11606515. were kept."; this message does not appear without RICD). I have left the calculation running for over 10 hours now and it seems stuck (in my previous optimizations on the cluster, these Alaska calls without RICD took 5 minutes).
This is my input:
&GATEWAY
Coord=Geom.xyz
Basis=cc-pVDZ
Group=NoSym
RICD
Constraints
a = Ediff 1 2
Value
a = 0.000
End of Constraints
>>> EXPORT MOLCAS_MAXITER=300
>> Do While
&SEWARD
DoAnalytical
&RASSCF
Spin=1
Charge=0
CIRoot = 2 2 1
&ALASKA
PNEW
&SLAPAF
>>> EndDo
Do you have any suggestions as to what might be going on?
Thanks again!
Max
It could have the same I/O problems as SEWARD.
Thanks! Too bad things aren't as simple as I had hoped... Still, it is a lot of fun to learn. I got in touch with the sysadmins and they confirmed that the problem is due to the non-locality of the scratch and the heavy load on it. However, they did suggest a few ways to potentially overcome this issue. One of the main things they suggest is that I look into their "Lustre striping" setup:
"The Lustre file system uses 48 OSTs (=Object Storage Targets), each with multiple disks, to store all the data in parallel. The OSTs are connected with InfiniBand to the compute nodes. By default, the Lustre file system stores each file on a single OST, which works quite well for most situations where files are accessed from a single process and files are relatively small (<10GB). However, when very large files need to be read or written in parallel from multiple nodes, Lustre striping is needed to reach a good performance."
I am still not sure whether this would only improve performance when running on multiple nodes, or whether I can also use single-node multi-threading to access the multiple OSTs in parallel and improve the I/O performance. I am also not sure yet whether MOLCAS can use this system at all. I'll do some more reading/testing and hopefully report back with some positive news.
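(From what I gather so far, the striping is set per directory with the lfs tool before any files are created in it; a rough sketch, assuming a standard Lustre setup, with a placeholder stripe count and path:)

# make new files in this scratch directory stripe across 8 OSTs (placeholder count)
lfs setstripe -c 8 /scratch-shared/$USER/molcas_scratch
# show the default layout that new files in the directory will inherit
lfs getstripe -d /scratch-shared/$USER/molcas_scratch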