Support and discussions for Molcas and OpenMolcas users and developers
Test system: a 49-atom molecule with 474 basis functions, CAS(2,2)/cc-pVDZ
Single node, 24 cores, OpenMP (MKL)
When using OpenMolcas on a cluster with a Lustre parallel network scratch file system rather than a local disk, my SA-CASSCF geometry optimizations became very slow (first iteration: SEWARD 5 h, SCF 1.5 h, RASSCF 1 h, MCLR 3.5 h, ALASKA 8 min).
You helped me diagnose that this was due to the slow I/O and recommended the atomic compact Cholesky decomposition (acCD). This has been very helpful for single-point calculations, but it has introduced a new bottleneck in ALASKA (first iteration: SEWARD 39 s, SCF 30 min, RASSCF 20 min, MCLR 8 min, ALASKA > 25 h). The calculation of the 3-center ERI derivatives (Drvg1_3Center_RI) for my test molecule proceeds at ~80 integrals/second, so a single SA-CASSCF ALASKA gradient takes a whole day, whereas without RICD it takes 8 minutes.
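For reference, acCD is switched on with the RICD keyword in GATEWAY; a minimal sketch mirroring the full inputs pasted further down in this thread (the coordinate file name here is just a placeholder):

&GATEWAY
Coord=molecule.xyz
Basis=cc-pVDZ
Group=NoSym
RICD
&SEWARD
DoAnalytical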
I am trying to understand why this bottleneck occurs, to see whether I can fix it. Debug mode does not print the timings and I/O statistics at the end of ALASKA (is there a simple way to activate them?), but I suppose the bottleneck is again related to I/O.
Is slow I/O the likely cause of the bottleneck? Do you have any tips on how to overcome it? Is this step MPI-parallel, and if not, would it be relatively simple for a beginner to parallelize?
Thank you!
Max
Last edited by MaxParadiz (2020-05-26 19:42:59)
After a bit more testing I figured out that the I/O statistics were not being printed because I had the "PNEW" option in ALASKA. Removing it let me look at the statistics, and there does not appear to be an I/O bottleneck. This is good news, because it means the bottleneck is not caused by the non-local scratch disk. There is more hope!
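For anyone who runs into the same thing: with PNEW removed, the statistics appear at the end of the ALASKA output when verbose printing is requested, as in the inputs quoted below:

&ALASKA
VERBOSE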
So, what could it be?
According to the single-core benchmarks in [1] (Figure 3), this step should be quite fast.
[1] M. G. Delcey et al., "Analytical gradients of the state-average complete active space self-consistent field method with density fitting", J. Chem. Phys. 143, 044110 (2015).
That does not look right. Might it have something to do with the parallelization? I presume the 24 cores are only used with OpenMP and not with MPI (i.e. MOLCAS_NPROCS=1)?
I just tried on a medium-range desktop, 1 core, 1 thread: Saccharose with cc-pVDZ (45 atoms, 432 basis functions). SEWARD: 44 s, SCF: 5 min, MCLR (8 GB memory): 21.5 min, ALASKA (8 GB memory): 22.5 min.
Dear Ignacio,
Thank you again for looking into this. Yes, I am currently testing with a build that has OpenMP and no MPI, using MOLCAS_NPROCS=1 and OMP_NUM_THREADS=24. I am now testing with Saccharose as well. I compiled OpenMolcas on an 8-core desktop and got results similar to yours. Comparing the core usage between the desktop and the cluster showed some odd behavior, so I installed the OpenBLAS library locally instead of using the recommended pre-built MKL library to see if this improved things. The profile of core usage throughout the calculation is now very similar between the desktop and the cluster, but only 16 of the 24 cores are being used... Is this expected? The speed-up in MCLR is substantial (7 min vs 22 min), and I can see 16 cores running at full power, so the multi-threaded linear algebra library seems to be doing its job.
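For completeness, this is roughly how the threading is set up in these tests (the OPENBLAS_NUM_THREADS line is an assumption on my part; note that OpenBLAS also has a hard thread cap fixed at build time via its NUM_THREADS option, which might explain the 16-core ceiling):

# serial Molcas (no MPI), threaded linear algebra
export MOLCAS_NPROCS=1
export OMP_NUM_THREADS=24
# OpenBLAS-specific override; OpenBLAS never uses more threads than
# the NUM_THREADS value it was compiled with, whatever is set here
export OPENBLAS_NUM_THREADS=24
pymolcas INPUT.input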
I have also tested using a single core and one thread, and ALASKA still stalls forever on the cluster. In debug mode I can see that it is working; Drvg1_3Center_RI is just going very slowly. I am not sure what else it could be... I have also tried compiling against both MKL and OpenBLAS, and with both the Intel and GNU Fortran compilers. Another odd thing: I did a test with a 16-atom molecule with cc-pVQZ, 730 basis functions (5 times more prescreened and kept entities), and ALASKA took only 1 hour and 1 minute on the cluster.
EDIT
I figured out how to print the timings. The number of threads appears to have no influence on ALASKA, only on MCLR. For a smaller system (16 atoms, 170 basis functions) that runs to completion, the detailed timings with OMP_NUM_THREADS=1 are:
Step                     Wall (s)    CPU (s)
--------------------------------------------
Cho_get_grad                 2.08       1.98
Mult_Rijk_Qkl                0.01       0.01
Prepp                        0.13       0.02
Pget_ri2                    17.32       4.02
Pget_ri3                   354.47      77.96
Drvg1_ri                   377.82      86.16
Total density              374.01      83.99
2-center derivatives         0.15       0.07
3-center derivatives         2.99       1.73
Total derivatives            3.14       1.80
Derivative check             3.81       2.17
So Pget_ri3, Drvg1_ri, and Density are the slow steps.
Last edited by MaxParadiz (2020-05-29 01:44:42)
So, do you have a case that runs fast on a desktop and slowly on the cluster? It must be some compilation/library/hardware problem then.
I see you are using "55GB of memory"; I hope that much is actually available, otherwise you may be seeing some swapping. Maybe the memory handling is an issue anyway; try a smaller amount, like 8 GB. But if it works fine with a larger system, I don't know what it could be.
The cluster has 64 GB of memory per node, but I have rerun the tests with 12 GB on both machines and I get the same results. It does not seem to be a memory issue (unless there is some allocation problem, but htop shows the memory is free).
Saccharose on the desktop
version: 19.11
*************************************************
* pymolcas version py2.08 *
* build bed633bbe89b7744c23cdbfc3f64969c *
* (after the EMIL interpreter by V. Veryazov) *
*************************************************
configuration info
------------------
C Compiler ID: GNU
C flags: -std=gnu99 -fopenmp
Fortran Compiler ID: GNU
Fortran flags: -fno-aggressive-loop-optimizations -cpp -fdefault-integer-8 -fopenmp
Definitions: _MOLCAS_;_I8_;_LINUX_;_OPENBLAS_
Parallel: OFF (GA=OFF)
-------------------------------------------------------
|
| Project: INPUT
| Submitted from: /home/max/Molcas/Saccharose/Compare
| Scratch area: /tmp/INPUT
| Save outputs to: /home/max/Molcas/Saccharose/Compare
| Molcas: /opt/OpenMolcas/build
|
| Scratch area is empty
|
| MOLCAS_DRIVER = /home/max/bin/pymolcas
| MOLCAS_MAXITER = 500
| MOLCAS_NPROCS = 1
| MOLCAS_PRINT = VERBOSE
| MOLCAS_SOURCE = /opt/OpenMolcas
| MOLCAS_STRUCTURE = 0
|
-------------------------------------------------------
++ --------- Input file ---------
&GATEWAY
Coord=Saccharose.xyz
Basis=cc-pVDZ
Group=NoSym
RICD
&SEWARD
DoAnalytical
&SCF
&RASSCF
nactel = 2
ras2 = 2
Spin=1
Charge = 0
CIROOT = 2 2 1
&ALASKA
VERBOSE
.
. seward, scf, rasscf, alaska calls mclr
.
()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()
&ALASKA
only a single process is used
available to each process: 12 GB of memory, 8 threads
pid: 17895
()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()
RI-ERI gradients!
Wavefunction type: State average CASSCF
A total of 9602274. entities were prescreened and 9602274. were kept.
A total of 624391365. entities were prescreened and 403982237. were kept.
Step                     Wall (s)    CPU (s)
--------------------------------------------
Cho_get_grad                41.22     218.13
Mult_Rijk_Qkl                0.03       0.13
Prepp                        0.05       0.08
Pget_ri2                    62.35      47.69
Pget_ri3                  1438.78    2337.46
Drvg1_ri                  1571.93    2656.16
Total density             1542.43    2603.49
2-center derivatives         0.63       0.42
3-center derivatives        23.91      40.91
Total derivatives           24.54      41.33
Derivative check            29.50      52.67
**************************************************
* *
* Molecular gradients *
* *
**************************************************
Irreducible representation: a
------------------------------------------------------------------------------------------
X Y Z
------------------------------------------------------------------------------------------
gradient info
------------------------------------------------------------------------------------------
++ I/O STATISTICS
I. General I/O information
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Unit Name Flsize Write/Read MBytes Write/Read
(MBytes) Calls In/Out Time, sec.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1 RUNFILE 26.40 . 77/28996376 . 0.5/ 209773.2 . 0/ 42
2 CHRED 1.41 . 0/ 6 . 0.0/ 2.1 . 0/ 0
3 CHVEC1 1212.48 . 0/ 2 . 0.0/ 1212.5 . 0/ 0
4 CHORST 0.07 . 0/ 12 . 0.0/ 0.1 . 0/ 0
5 CHOMAP 0.05 . 0/ 0 . 0.0/ 0.0 . 0/ 0
6 CHODIAG 0.66 . 0/ 2 . 0.0/ 1.3 . 0/ 0
7 CHTA11 263.34 . 1/ 1 . 263.3/ 263.3 . 0/ 0
8 CHTB11 5.04 . 1/ 1 . 5.0/ 5.0 . 0/ 0
9 QVEC00 25.69 . 0/ 10 . 0.0/ 256.9 . 0/ 0
10 CVEA1 263.34 . 1/ 5571192 . 263.3/1288136.7 . 0/ 119
11 CVEB1 5.04 . 1/ 4224852 . 5.0/ 18701.1 . 0/ 7
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
* TOTAL 1803.51 . 81/38792454 . 537.3/1518352.3 . 0/ 170
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
II. I/O Access Patterns
- - - - - - - - - - - - - - - - - - - -
Unit Name % of random
Write/Read calls
- - - - - - - - - - - - - - - - - - - -
1 RUNFILE 28.6/ 7.4
2 CHRED 0.0/ 16.7
3 CHVEC1 0.0/ 0.0
4 CHORST 0.0/ 0.0
5 CHOMAP 0.0/ 0.0
6 CHODIAG 0.0/ 0.0
7 CHTA11 0.0/ 0.0
8 CHTB11 0.0/ 0.0
9 QVEC00 0.0/ 0.0
10 CVEA1 0.0/ 3.9
11 CVEB1 0.0/ 5.1
- - - - - - - - - - - - - - - - - - - -
--
--- Stop Module: alaska at Fri May 29 16:10:20 2020 /rc=_RC_ALL_IS_WELL_ ---
*** files: xmldump
saved to directory /home/max/Molcas/Saccharose/Compare
--- Module alaska spent 26 minutes 13 seconds ---
.##################.
.# Happy landing! #.
.##################.
Timing: Wall=5061.23 User=14509.15 System=1228.47
Saccharose on the cluster
version: 19.11
*************************************************
* pymolcas version py2.08 *
* build 15f26fa235c442e72564882f668400fc *
* (after the EMIL interpreter by V. Veryazov) *
*************************************************
configuration info
------------------
C Compiler ID: GNU
C flags: -std=gnu99 -fopenmp
Fortran Compiler ID: GNU
Fortran flags: -fno-aggressive-loop-optimizations -cpp -fdefault-integer-8 -fopenmp
Definitions: _MOLCAS_;_I8_;_LINUX_;_OPENBLAS_
Parallel: OFF (GA=OFF)
---------------------------------------------------------------------
|
| Project: INPUT
| Submitted from: /nfs/home2/paradiz/home/Molcas/Saccharose/Compare
| Scratch area: /scratch-local/paradiz.8273817/INPUT
| Save outputs to: /nfs/home2/paradiz/home/Molcas/Saccharose/Compare
| Molcas: /home/paradiz/git/OpenMolcas/build
|
| Scratch area is empty
|
| MOLCAS19 = /home/paradiz/git/OpenMolcas/build
| MOLCAS_DRIVER = /nfs/home2/paradiz/bin/pymolcas
| MOLCAS_MAXITER = 500
| MOLCAS_NPROCS = 1
| MOLCAS_PRINT = VERBOSE
| MOLCAS_SOURCE = /home/paradiz/git/OpenMolcas
| MOLCAS_STRUCTURE = 0
|
---------------------------------------------------------------------
++ --------- Input file ---------
&GATEWAY
Coord=Saccharose.xyz
Basis=cc-pVDZ
Group=NoSym
RICD
&SEWARD
DoAnalytical
&SCF
&RASSCF
nactel = 2
ras2 = 2
Spin=1
Charge = 0
CIROOT = 2 2 1
&ALASKA
VERBOSE
-- ----------------------------------
()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()
&ALASKA
only a single process is used
available to each process: 12 GB of memory, 24 threads
pid: 16533
()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()
No automatic utilization of translational and rotational invariance of the energy is employed.
RI-ERI gradients!
Wavefunction type: State average CASSCF
A total of 9602274. entities were prescreened and 9602274. were kept.
------ This runs for over a day (I have not seen it finish)
Uracil (132 basis functions) is an example that does finish on the cluster, but it takes longer than on the desktop. Using the same configuration as above, these are the timings:
Uracil on the desktop:
A total of 946369. entities were prescreened and 946369. were kept.
A total of 34369088. entities were prescreened and 27727172. were kept.
Step                     Wall (s)    CPU (s)
--------------------------------------------
Cho_get_grad                 0.67       2.65
Mult_Rijk_Qkl                0.00       0.01
Prepp                        0.02       0.02
Pget_ri2                     1.93       1.68
Pget_ri3                    26.64      24.22
Drvg1_ri                    30.78      29.98
Total density               29.26      28.58
2-center derivatives         0.04       0.05
3-center derivatives         1.29       0.96
Total derivatives            1.33       1.01
Derivative check             1.52       1.40
**************************************************
* *
* Molecular gradients *
* *
**************************************************
Irreducible representation: a
------------------------------------------------------------------------------------------
X Y Z
------------------------------------------------------------------------------------------
O1 -1.98851141800633E-02 -1.14970627422544E-02 2.12050069768441E-16
C2 2.84865583657403E-02 -4.80631924642658E-02 -3.96646399700776E-17
C3 -8.69113664295718E-03 -3.64423636090918E-02 -3.16954259465065E-16
C4 1.79977768096897E-02 9.64441213013232E-02 1.92238150917718E-16
H5 -8.45373739694377E-03 5.80609196165760E-04 2.50740075238916E-17
N6 4.92010425432912E-02 -1.16000520192880E-02 7.46549437561080E-18
H7 -9.12343056982667E-03 7.36100690135721E-03 4.40192337329983E-17
C8 -8.04589220797345E-02 1.54668253945878E-02 -1.12273219002112E-16
H9 -1.18165036388812E-02 -3.73286884604180E-02 -2.61603997364965E-17
O10 3.13200213047473E-02 -8.09400572426178E-03 7.00400081208646E-18
N11 -2.48766844857844E-02 1.02322388623231E-02 -5.59552615420773E-17
H12 3.63001299709840E-02 2.29405633639810E-02 6.31568225850947E-17
------------------------------------------------------------------------------------------
++ I/O STATISTICS
I. General I/O information
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Unit Name Flsize Write/Read MBytes Write/Read
(MBytes) Calls In/Out Time, sec.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1 RUNFILE 14.39 . 77/ 882572 . 0.5/ 6397.6 . 0/ 1
2 CHRED 0.14 . 0/ 6 . 0.0/ 0.2 . 0/ 0
3 CHVEC1 38.71 . 0/ 1 . 0.0/ 38.7 . 0/ 0
4 CHORST 0.02 . 0/ 12 . 0.0/ 0.0 . 0/ 0
5 CHOMAP 0.00 . 0/ 0 . 0.0/ 0.0 . 0/ 0
6 CHODIAG 0.07 . 0/ 2 . 0.0/ 0.1 . 0/ 0
7 CHTA11 7.53 . 1/ 1 . 7.5/ 7.5 . 0/ 0
8 CHTB11 0.49 . 1/ 1 . 0.5/ 0.5 . 0/ 0
9 QVEC00 2.55 . 0/ 10 . 0.0/ 25.5 . 0/ 0
10 CVEA1 7.53 . 1/ 250760 . 7.5/ 5237.0 . 0/ 0
11 CVEB1 0.49 . 1/ 218116 . 0.5/ 298.7 . 0/ 0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
* TOTAL 71.94 . 81/ 1351481 . 16.6/ 12005.9 . 0/ 2
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
II. I/O Access Patterns
- - - - - - - - - - - - - - - - - - - -
Unit Name % of random
Write/Read calls
- - - - - - - - - - - - - - - - - - - -
1 RUNFILE 28.6/ 7.4
2 CHRED 0.0/ 16.7
3 CHVEC1 0.0/ 0.0
4 CHORST 0.0/ 0.0
5 CHOMAP 0.0/ 0.0
6 CHODIAG 0.0/ 0.0
7 CHTA11 0.0/ 0.0
8 CHTB11 0.0/ 0.0
9 QVEC00 0.0/ 0.0
10 CVEA1 0.0/ 8.5
11 CVEB1 0.0/ 9.7
- - - - - - - - - - - - - - - - - - - -
--
--- Stop Module: alaska at Fri May 29 16:17:24 2020 /rc=_RC_ALL_IS_WELL_ ---
*** files: xmldump
saved to directory /home/max/Molcas/Uracil
--- Module alaska spent 30 seconds ---
.##################.
.# Happy landing! #.
.##################.
Timing: Wall=57.86 User=172.55 System=9.20
Uracil on the cluster
()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()
&ALASKA
only a single process is used
available to each process: 12 GB of memory, 24 threads
pid: 16628
()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()
Threshold for contributions to the gradient: 0.100E-06
********************************************
* Symmetry Adapted Cartesian Displacements *
********************************************
Irreducible representation : a
Basis function(s) of irrep: x, y, xy, Rz, z, xz, Ry, yz, Rx, I
Basis Label Type Center Phase
1 O1 x 1 1
2 O1 y 1 1
3 O1 z 1 1
4 C2 x 2 1
5 C2 y 2 1
6 C2 z 2 1
7 C3 x 3 1
8 C3 y 3 1
9 C3 z 3 1
10 C4 x 4 1
11 C4 y 4 1
12 C4 z 4 1
13 H5 x 5 1
14 H5 y 5 1
15 H5 z 5 1
16 N6 x 6 1
17 N6 y 6 1
18 N6 z 6 1
19 H7 x 7 1
20 H7 y 7 1
21 H7 z 7 1
22 C8 x 8 1
23 C8 y 8 1
24 C8 z 8 1
25 H9 x 9 1
26 H9 y 9 1
27 H9 z 9 1
28 O10 x 10 1
29 O10 y 10 1
30 O10 z 10 1
31 N11 x 11 1
32 N11 y 11 1
33 N11 z 11 1
34 H12 x 12 1
35 H12 y 12 1
36 H12 z 12 1
No automatic utilization of translational and rotational invariance of the energy is employed.
RI-ERI gradients!
Wavefunction type: State average CASSCF
A total of 946369. entities were prescreened and 946369. were kept.
A total of 34369088. entities were prescreened and 27727172. were kept.
Step                     Wall (s)    CPU (s)
--------------------------------------------
Cho_get_grad                 0.81       4.46
Mult_Rijk_Qkl                0.01       0.02
Prepp                        0.13       0.02
Pget_ri2                    10.46       2.45
Pget_ri3                   172.98      38.09
Drvg1_ri                   186.62      46.85
Total density              184.39      45.04
2-center derivatives         0.09       0.04
3-center derivatives         1.71       1.21
Total derivatives            1.80       1.25
Derivative check             2.23       1.81
**************************************************
* *
* Molecular gradients *
* *
**************************************************
Irreducible representation: a
------------------------------------------------------------------------------------------
X Y Z
------------------------------------------------------------------------------------------
O1 -1.98851141804481E-02 -1.14970627444819E-02 1.34153482050348E-16
C2 2.84865583651079E-02 -4.80631924610346E-02 3.93412174767494E-16
C3 -8.69113664611554E-03 -3.64423636099667E-02 -3.41544493186221E-16
C4 1.79977768112352E-02 9.64441213024152E-02 1.69789886280378E-16
H5 -8.45373739568611E-03 5.80609196099591E-04 -3.85830056140566E-16
N6 4.92010425435390E-02 -1.16000520185224E-02 1.75472094527186E-16
H7 -9.12343057081699E-03 7.36100690084696E-03 -4.64445304443001E-17
C8 -8.04589220796386E-02 1.54668253934322E-02 -2.62746169922673E-16
H9 -1.18165036385326E-02 -3.73286884605539E-02 5.98561466866316E-18
O10 3.13200213055254E-02 -8.09400572478403E-03 1.54070387109341E-16
N11 -2.48766844849779E-02 1.02322388622733E-02 -3.19222713738906E-17
H12 3.63001299709813E-02 2.29405633641160E-02 3.56038816642395E-17
------------------------------------------------------------------------------------------
++ I/O STATISTICS
I. General I/O information
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Unit Name Flsize Write/Read MBytes Write/Read
(MBytes) Calls In/Out Time, sec.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1 RUNFILE 14.39 . 77/ 882572 . 0.5/ 6397.6 . 0/ 2
2 CHRED 0.14 . 0/ 6 . 0.0/ 0.2 . 0/ 0
3 CHVEC1 38.71 . 0/ 1 . 0.0/ 38.7 . 0/ 0
4 CHORST 0.02 . 0/ 12 . 0.0/ 0.0 . 0/ 0
5 CHOMAP 0.00 . 0/ 0 . 0.0/ 0.0 . 0/ 0
6 CHODIAG 0.07 . 0/ 2 . 0.0/ 0.1 . 0/ 0
7 CHTA11 7.53 . 1/ 1 . 7.5/ 7.5 . 0/ 0
8 CHTB11 0.49 . 1/ 1 . 0.5/ 0.5 . 0/ 0
9 QVEC00 2.55 . 0/ 10 . 0.0/ 25.5 . 0/ 0
10 CVEA1 7.53 . 1/ 250760 . 7.5/ 5237.0 . 0/ 0
11 CVEB1 0.49 . 1/ 218116 . 0.5/ 298.7 . 0/ 0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
* TOTAL 71.94 . 81/ 1351481 . 16.6/ 12005.9 . 0/ 3
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
II. I/O Access Patterns
- - - - - - - - - - - - - - - - - - - -
Unit Name % of random
Write/Read calls
- - - - - - - - - - - - - - - - - - - -
1 RUNFILE 28.6/ 7.4
2 CHRED 0.0/ 16.7
3 CHVEC1 0.0/ 0.0
4 CHORST 0.0/ 0.0
5 CHOMAP 0.0/ 0.0
6 CHODIAG 0.0/ 0.0
7 CHTA11 0.0/ 0.0
8 CHTB11 0.0/ 0.0
9 QVEC00 0.0/ 0.0
10 CVEA1 0.0/ 8.5
11 CVEB1 0.0/ 9.7
- - - - - - - - - - - - - - - - - - - -
--
--- Stop Module: alaska at Fri May 29 16:25:38 2020 /rc=_RC_ALL_IS_WELL_ ---
*** files: xmldump
saved to directory /nfs/home2/paradiz/home/Molcas/Uracil/ccpVDZ
--- Module alaska spent 3 minutes 7 seconds ---
.##################.
.# Happy landing! #.
.##################.
Timing: Wall=224.14 User=173.88 System=68.89
At least now I can see unambiguously which steps slow down on the cluster. I will try to understand these steps and the source code to see if I can figure out precisely which operation is slow; I suppose there is a computational step that takes place in Pget_ri2, Pget_ri3, Drvg1_ri, and the density build but not in other parts of the code. One thing I plan to try is attaching a sampling profiler to the running process (see the sketch below). I tested this on several different nodes, with three different types of Intel Xeon processors (among them E5-2690 v3 and E5-2697A v4), so I don't suspect a hardware problem. My guess is that something is wrong with a library that both MKL and OpenBLAS rely on... I'll keep trying, thanks for your help!
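A profiling sketch, assuming perf is installed on the compute nodes and the kernel's perf_event_paranoid setting allows unprivileged sampling of one's own processes:

# sample user-space call stacks of the running alaska.exe for 60 s
perf record -g -p $(pgrep -n alaska.exe) -- sleep 60
# the hottest symbols should show whether the time goes to BLAS,
# to the integral code, or somewhere else entirely
perf report --sort symbol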
Last edited by MaxParadiz (2020-05-29 16:12:19)
How do you know it's not an I/O problem? Only because of the I/O statistics printed out by ALASKA? It could be that those are incomplete.
I have confirmed that virtually no data is being written to scratch by checking how the output of 'ls -s' evolves over time. This is consistent with the write section of the I/O statistics.
On my desktop, I followed an ALASKA run with 'iotop' and could not see much I/O activity while alaska.exe was running. I can't run iotop on the cluster because I don't have root access. Do you know of a way to test for I/O problems?
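One root-free check I can try myself: Linux exposes per-process I/O counters in /proc that the process owner may read. A sketch (the pgrep pattern assumes the process shows up as alaska.exe, as it does in top):

# poll the kernel's I/O counters for the running alaska.exe every 10 s
PID=$(pgrep -n alaska.exe)
while kill -0 "$PID" 2>/dev/null; do
    grep -E '^(read|write)_bytes' /proc/"$PID"/io
    sleep 10
done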
I have also tried Uracil with cc-pVQZ (12 atoms, 560 basis functions) on the cluster, and it finishes in about 30 minutes.
A total of 33873924. entities were prescreened and 33701764. were kept.
A total of 2431722654. entities were prescreened and 1448064781. were kept.
Step                     Wall (s)    CPU (s)
--------------------------------------------
Cho_get_grad                81.46     329.08
Mult_Rijk_Qkl                0.10       0.58
Prepp                        0.16       0.09
Pget_ri2                   122.34      48.78
Pget_ri3                  1552.08     530.30
Drvg1_ri                  1859.94    1038.50
Total density             1756.14     908.83
2-center derivatives         1.58       0.95
3-center derivatives        93.99     116.78
Total derivatives           95.57     117.73
Derivative check           103.80     129.67
++ I/O STATISTICS
I. General I/O information
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Unit Name Flsize Write/Read MBytes Write/Read
(MBytes) Calls In/Out Time, sec.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1 RUNFILE 35.12 . 77/ 6449270 . 0.5/ 46676.4 . 0/ 18
2 CHRED 2.41 . 0/ 6 . 0.0/ 3.6 . 0/ 0
3 CHVEC1 5144.65 . 0/ 4 . 0.0/ 5144.7 . 0/ 1
4 CHORST 0.16 . 0/ 12 . 0.0/ 0.2 . 0/ 0
5 CHOMAP 0.01 . 0/ 0 . 0.0/ 0.0 . 0/ 0
6 CHODIAG 1.19 . 0/ 2 . 0.0/ 2.4 . 0/ 0
7 CHTA11 99.34 . 3/ 1 . 99.3/ 99.3 . 0/ 0
8 CHTB11 3.68 . 3/ 1 . 3.7/ 3.7 . 0/ 0
9 QVEC00 141.46 . 0/ 10 . 0.0/ 1414.6 . 0/ 0
10 CVEA1 99.34 . 1/ 6130979 . 99.3/ 257020.9 . 0/ 39
11 CVEB1 3.68 . 1/ 5709809 . 3.7/ 8866.5 . 0/ 12
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
* TOTAL 5531.06 . 85/18290094 . 206.6/ 319232.2 . 0/ 72
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
II. I/O Access Patterns
- - - - - - - - - - - - - - - - - - - -
Unit Name % of random
Write/Read calls
- - - - - - - - - - - - - - - - - - - -
1 RUNFILE 28.6/ 7.4
2 CHRED 0.0/ 16.7
3 CHVEC1 0.0/ 0.0
4 CHORST 0.0/ 0.0
5 CHOMAP 0.0/ 0.0
6 CHODIAG 0.0/ 0.0
7 CHTA11 0.0/ 0.0
8 CHTB11 0.0/ 0.0
9 QVEC00 0.0/ 0.0
10 CVEA1 0.0/ 5.6
11 CVEB1 0.0/ 6.0
- - - - - - - - - - - - - - - - - - - -
--
--- Stop Module: alaska at Fri May 29 17:37:20 2020 /rc=_RC_ALL_IS_WELL_ ---
*** files: xmldump
saved to directory /nfs/home2/paradiz/home/Molcas/Uracil/ccpVQZ
--- Module alaska spent 31 minutes 0 seconds ---