Support and discussions for Molcas and OpenMolcas users and developers
Dear Users,
I have a peculiar problem with my Molcas calculations (build d7828899).
I have two machines, each with an identical set of inputs and an identical set of binaries (checked).
The systems are the same hardware, and run the same linux.
On one machine, everything is smooth.
On the other one, Seward returns two kinds of errors: "Program received signal SIGSEGV: Segmentation fault - invalid memory reference." or "Program aborted. Backtrace:".
In the output, it is either "non-zero return code" or " *** Error in SORT2A *** An inconsistency has been deteced nInts1#nInts2".
Even if Seward completes with rc=0, the other codes complain about NaNs, for instance, SCF: " !!! WARNING !!! NANs encountered".
Given this information, is it a disk problem, a memory problem, or something else?
I am certain that my system does not run out of memory, so I suspect a hardware fault.
Thank you in advance.
Best regards.
Andrew
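One way to do the "identical binaries (checked)" comparison is to hash every file in the install tree on each machine and compare the results. The following is only a minimal sketch: the function names and the idea of passing an install directory are assumptions for illustration, not part of Molcas.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_tree(root):
    """Map each file under root (relative path) to its SHA-256 digest.

    Assumption: root is the directory holding the installed binaries;
    the actual location depends on the local installation.
    """
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_checksums(a, b):
    """Return the sorted names that are missing on one side or differ."""
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

Running `checksum_tree` on each machine, saving the dictionaries, and passing both to `diff_checksums` gives an empty list when the trees match. Shared runtime libraries (for example the BLAS library) would need the same treatment, since they live outside the install tree.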
Offline
Recently I also encountered this issue, with the following message:
...
Basis functions 55 33 50 30 23 12 20 10
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
seward.exe 0000000000899E7D Unknown Unknown Unknown
libpthread-2.26.s 00007F89BD5882D0 Unknown Unknown Unknown
seward.exe 000000000048F1DC dcr_ 48 dcr.f
seward.exe 00000000005195E4 ppint_ 126 ppint.f
seward.exe 00000000006C57D0 oneel_ij_ 428 oneel_ij.f
seward.exe 00000000004A4E00 oneel_internal_ 217 oneel_internal.f
seward.exe 00000000004A2EDF oneel_ 107 oneel.f
seward.exe 0000000000411ABB drv1el_ 477 drv1el.f
seward.exe 000000000041CE79 seward_ 298 seward.f
seward.exe 0000000000407140 MAIN__ 23 main.f
seward.exe 00000000004070DE Unknown Unknown Unknown
libc-2.26.so 00007F89BD1DE34A __libc_start_main Unknown Unknown
seward.exe 0000000000406FE9 Unknown Unknown Unknown
--- Stop Module: seward at Mon Nov 9 13:16:30 2020 /rc=-1 ---
*** files: xmldump
saved to directory /pool/bogdanov/calcs/Sr2CuO3/molcas/sym_cas_2i2plus/test_basis
--- Module seward spent 1 second ---
.########################.
.# Non-zero return code #.
.########################.
Aborting...
Timing: Wall=2.17 User=1.96 System=0.06
The issue arises on a host with slightly different hardware and runtime libraries compared to the build host, which makes it hard to pin down...
Several times a change of the input filename caused the problem, which sounds like a voodoo drum to me...
Last edited by nikolay (2020-11-09 13:29:20)
Offline
Did you try to recompile?
In my case, the slight differences might as well be the reason.
Everything used to work before a system update.
I've updated both systems on the same day, and got this. The funny thing is that one of them works.
I'll try to recompile, or even better - try a new build.
Offline
I've compiled the most recent build with the most recent OpenBLAS, and got this:
M1xp
mat. size = 1x 15
573250.0000000000000000 124682.0000000000000000 38556.0000000000000000 13233.3999999999996362 4895.6000000000003638
1863.7000000000000455 729.9500000000000455 332.2300000000000182 125.6599999999999966 59.6109999999999971
29.1039999999999992 10.5899999999999999 5.0206000000000000 1.7962000000000000 0.8441200000000000
M1xp
mat. size = 1x 15
574860.0000000000000000 125310.0000000000000000 38821.0000000000000000 13344.0000000000000000 4942.1000000000003638
1882.9000000000000909 735.9700000000000273 333.9200000000000159 126.0400000000000063 59.5489999999999995
28.7459999999999987 10.6080000000000005 5.0860000000000003 1.8447000000000000 0.8846600000000000
M1xp
mat. size = 1x 15
573250.0000000000000000 124682.0000000000000000 38556.0000000000000000 13233.3999999999996362 4895.6000000000003638
1863.7000000000000455 729.9500000000000455 332.2300000000000182 125.6599999999999966 59.6109999999999971
29.1039999999999992 10.5899999999999999 5.0206000000000000 1.7962000000000000 0.8441200000000000
M1xp
mat. size = 1x 15
397010.0000000000000000 63067.0000000000000000 14702.0000000000000000 3955.1999999999998181 1221.9000000000000909
421.8700000000000045 157.9000000000000057 63.5090000000000003 26.5560000000000009 10.2409999999999997
4.0125000000000002 1.8339000000000001 0.7704600000000000 0.2324500000000000 0.2323900000000000
occupations not implemented
Edit: Ok, this one apparently is an issue with AIMPs implementation in build 5c26ab82.
Last edited by andrewshyichuk (2020-11-09 18:35:45)
Offline
With recently recompiled master the issue with that particular input is gone.
My fear is that it can eventually come back with something else.
Offline
I've recompiled the code with a newer OpenBLAS, and got the same errors. Notably, they seem random: a given job can finish with rc=-1 and a segfault, with rc=internal_error and the SORT2A error, or with rc=0 followed by NaNs in SCF.
I've recompiled again with the original OpenBLAS (i.e. the same exact configuration as before, just newly compiled), and am waiting for the results.
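When the same job fails in different ways from run to run, it can help to rerun it several times and tally the outcomes. A minimal sketch, assuming the output of each run has been saved as text: the marker strings are copied from the error messages quoted in this thread, while the labels and function names are made up for illustration.

```python
from collections import Counter

# Marker substrings taken from messages quoted in this thread.
# The labels on the left are hypothetical names chosen for this sketch.
SIGNATURES = [
    ("segfault", "SIGSEGV"),
    ("sort2a", "Error in SORT2A"),
    ("nan", "NANs encountered"),
    ("negative_two_electron", "negative two-electron energy"),
]


def classify_log(text):
    """Return the label of the first known marker found, else 'clean'."""
    for label, marker in SIGNATURES:
        if marker in text:
            return label
    return "clean"


def tally(logs):
    """Count how often each outcome occurs across several run outputs."""
    return Counter(classify_log(text) for text in logs)
```

If identical inputs and binaries produce a mix of labels on one machine but only "clean" on the other, that points away from the input and the code and toward the machine itself.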
Offline
Just in case, are you testing the pure serial version, without OpenMP?
Offline
Yes, serial OpenBLAS with serial OpenMolcas, 1 thread runs.
Offline
To rule out possible OpenBLAS issues, you could try with LINALG=Internal
Offline
Another update. In short: the same problem, this time on two different disks.
I.e. it's likely not a disk issue. And this time, with one of the "successful" Seward calculations, I got "WfCtl_SCF: negative two-electron energy" (-2.3466535415+278) in SCF.
Offline
Is the initial scratch directory clean in all cases?
Offline
The run script clears the scratch dir before running Seward.
Still, I also tried removing the old workdirs, or moving them to a new folder (to force the use of other disk areas), and the problem stayed the same.
I am running a test with the freshly recompiled OpenBLAS 0.3.9 and freshly recompiled OpenMolcas d7828899 (i.e. the same versions that used to work before the system update, and that still work on another (identical) machine, with the same updated system).
Edit: if this one fails, I will try the internal linalg.
Last edited by andrewshyichuk (2020-11-10 18:46:40)
Offline
Alrighty, I seem to have found the problem.
I ran memtester and found errors. So it was likely bad RAM from the start, and I should have tested the memory sooner.
Thank you for your comments.
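For readers who want a rough idea of what a memory test does, here is a crude user-space sketch of a pattern test: fill a buffer with a known byte, read it back, and count mismatches. It is an illustration only, with hypothetical function names and default sizes. It touches only the memory the operating system hands to the process and is far less thorough than memtester, so a zero result here proves nothing.

```python
def pattern_pass(size_bytes, pattern):
    """Fill a buffer with one byte value and count bytes that read back wrong."""
    buf = bytearray([pattern]) * size_bytes
    # bytearray.count(int) counts occurrences of that byte value.
    return size_bytes - buf.count(pattern)


def memory_check(size_mb=64, patterns=(0x00, 0xFF, 0xAA, 0x55), rounds=2):
    """Run several fill-and-verify passes; return the total mismatch count.

    Assumption: the size, patterns and round count are arbitrary defaults
    chosen for this sketch, not recommended values.
    """
    size = size_mb * 1024 * 1024
    errors = 0
    for _ in range(rounds):
        for pattern in patterns:
            errors += pattern_pass(size, pattern)
    return errors
```

On healthy hardware `memory_check()` returns 0. A dedicated tool remains the right choice for an actual diagnosis, as in the post above.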
Offline