Weird errors in SEWARD and other

andrewshyichuk · 2020-11-09 12:58:56

Dear Users,

I have a peculiar problem with my Molcas calculations (build d7828899).

I have two machines, each with the same set of inputs and the same binaries (checked).
Both have the same hardware and run the same Linux.
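
One way to do the "checked" step for the binaries is to compare cryptographic hashes of the files on the two machines. Below is a minimal sketch in Python, assuming the executables can be read as ordinary files; the function names `sha256_of` and `same_content` are illustrative and not part of Molcas.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large executables are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files hash to the same digest."""
    return sha256_of(path_a) == sha256_of(path_b)
```

Matching digests only show that the executables themselves are identical; shared libraries loaded at run time could still differ between the hosts.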

On one machine, everything is smooth.

On the other one, Seward returns two kinds of errors: "Program received signal SIGSEGV: Segmentation fault - invalid memory reference." or "Program aborted. Backtrace:".
In the output, it is either "non-zero return code" or " *** Error in SORT2A *** An inconsistency has been deteced nInts1#nInts2".
Even if Seward completes with rc=0, the other codes complain about NaNs, for instance, SCF: " !!! WARNING !!! NANs encountered".
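
Because the failure shows up under several different messages, it can help to sort the job outputs by which message they contain. A rough sketch in Python follows, using only the message fragments quoted in this thread; the labels and the function name `classify` are made up for illustration, and other failure messages are not covered.

```python
# Message fragments quoted in this thread, keyed by an illustrative label.
SIGNATURES = {
    "segfault": "SIGSEGV",
    "sort2a": "Error in SORT2A",
    "nan": "NANs encountered",
    "nonzero_rc": "Non-zero return code",
}


def classify(log_text):
    """Return the sorted labels whose fragment occurs in `log_text` (case-insensitive)."""
    lowered = log_text.lower()
    return sorted(
        label for label, needle in SIGNATURES.items() if needle.lower() in lowered
    )
```

An empty list means none of these fragments were found, which is not by itself proof that the run was numerically sound.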

Given this information, is this a disk problem, a memory problem, or something else?
I am certain that my system does not run out of memory, so I suspect a hardware fault.

Thank you in advance.
Best regards.
Andrew

nikolay · 2020-11-09 13:25:09

I also ran into this issue recently, with the following message:

...
      Basis functions           55   33   50   30   23   12   20   10
 
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
seward.exe         0000000000899E7D  Unknown               Unknown  Unknown
libpthread-2.26.s  00007F89BD5882D0  Unknown               Unknown  Unknown
seward.exe         000000000048F1DC  dcr_                       48  dcr.f
seward.exe         00000000005195E4  ppint_                    126  ppint.f
seward.exe         00000000006C57D0  oneel_ij_                 428  oneel_ij.f
seward.exe         00000000004A4E00  oneel_internal_           217  oneel_internal.f
seward.exe         00000000004A2EDF  oneel_                    107  oneel.f
seward.exe         0000000000411ABB  drv1el_                   477  drv1el.f
seward.exe         000000000041CE79  seward_                   298  seward.f
seward.exe         0000000000407140  MAIN__                     23  main.f
seward.exe         00000000004070DE  Unknown               Unknown  Unknown
libc-2.26.so       00007F89BD1DE34A  __libc_start_main     Unknown  Unknown
seward.exe         0000000000406FE9  Unknown               Unknown  Unknown
--- Stop Module: seward at Mon Nov  9 13:16:30 2020 /rc=-1 ---
*** files: xmldump
    saved to directory /pool/bogdanov/calcs/Sr2CuO3/molcas/sym_cas_2i2plus/test_basis
--- Module seward spent 1 second ---

.########################.
.# Non-zero return code #.
.########################.


Aborting...
    Timing: Wall=2.17 User=1.96 System=0.06

The issue arises on a host with slightly different hardware and runtime libraries compared to the build host, which makes it hard to pin down...
Several times a change of the input filename caused the problem, which sounds like a voodoo drum to me...

Last edited by nikolay (2020-11-09 13:29:20)

andrewshyichuk · 2020-11-09 14:55:17

Did you try to recompile?

In my case, the slight differences might as well be the reason.
Everything used to work before a system update.

I updated both systems on the same day and got this. The funny thing is that one of them still works.

I'll try to recompile, or even better - try a new build.

andrewshyichuk · 2020-11-09 16:30:25

I've compiled the most recent build with the most recent OpenBLAS, and got this:

  M1xp
  mat. size =     1x   15
   573250.0000000000000000 124682.0000000000000000  38556.0000000000000000  13233.3999999999996362   4895.6000000000003638
     1863.7000000000000455    729.9500000000000455    332.2300000000000182    125.6599999999999966     59.6109999999999971
       29.1039999999999992     10.5899999999999999      5.0206000000000000      1.7962000000000000      0.8441200000000000

  M1xp
  mat. size =     1x   15
   574860.0000000000000000 125310.0000000000000000  38821.0000000000000000  13344.0000000000000000   4942.1000000000003638
     1882.9000000000000909    735.9700000000000273    333.9200000000000159    126.0400000000000063     59.5489999999999995
       28.7459999999999987     10.6080000000000005      5.0860000000000003      1.8447000000000000      0.8846600000000000

  M1xp
  mat. size =     1x   15
   573250.0000000000000000 124682.0000000000000000  38556.0000000000000000  13233.3999999999996362   4895.6000000000003638
     1863.7000000000000455    729.9500000000000455    332.2300000000000182    125.6599999999999966     59.6109999999999971
       29.1039999999999992     10.5899999999999999      5.0206000000000000      1.7962000000000000      0.8441200000000000

  M1xp
  mat. size =     1x   15
   397010.0000000000000000  63067.0000000000000000  14702.0000000000000000   3955.1999999999998181   1221.9000000000000909
      421.8700000000000045    157.9000000000000057     63.5090000000000003     26.5560000000000009     10.2409999999999997
        4.0125000000000002      1.8339000000000001      0.7704600000000000      0.2324500000000000      0.2323900000000000
 occupations not implemented

Edit: OK, this one is apparently an issue with the AIMP implementation in build 5c26ab82.

Last edited by andrewshyichuk (2020-11-09 18:35:45)

nikolay · 2020-11-10 11:25:05

With recently recompiled master the issue with that particular input is gone.
My fear is that it can eventually come back with something else.

andrewshyichuk · 2020-11-10 13:45:45

I've recompiled the code with a newer OpenBLAS, and got the same errors. Notably, they seem random: a given job can end with rc=-1 and a segfault, with rc=internal_error and the SORT2A error, or with rc=0 followed by NaNs in SCF.
I've recompiled again with the original OpenBLAS (i.e. the same exact configuration as before, just newly compiled), and am waiting for the results.
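
To put a number on "random", the same job can be launched several times and the return codes counted. Here is a minimal sketch in Python, assuming the job can be started from a single command line; `tally_return_codes` is an illustrative name, and the command passed to it is whatever you normally use to launch the calculation.

```python
import subprocess
from collections import Counter


def tally_return_codes(command, runs=10):
    """Run `command` (a list of arguments) `runs` times and count the return codes."""
    counts = Counter()
    for _ in range(runs):
        # Output is discarded here; redirect to files instead if the logs are needed.
        result = subprocess.run(
            command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        counts[result.returncode] += 1
    return counts
```

If identical inputs and binaries give a mix of return codes in a serial run, the cause is more likely outside the program logic, for example in hardware or the runtime environment. The scratch directory should be cleaned between runs so that each one starts from the same state.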

nikolay · 2020-11-10 14:11:42

Just in case: did you test a purely serial version with no OpenMP?

andrewshyichuk · 2020-11-10 14:52:06

Yes, serial OpenBLAS with serial OpenMolcas, 1 thread runs.

Ignacio · 2020-11-10 16:51:25

To rule out possible OpenBLAS issues, you could try with LINALG=Internal

andrewshyichuk · 2020-11-10 17:06:43

Another update. In short: the same problem, this time on two different disks.
I.e. it's likely not a disk issue. And this time, with one of the "successful" Seward calculations, I got "WfCtl_SCF: negative two-electron energy" (-2.3466535415+278) in SCF.
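
The value "-2.3466535415+278" looks like Fortran-style output in which the exponent letter is dropped when the exponent needs three digits. Read that way it is about -2.35e278, far outside any sensible energy. The sketch below, in Python, parses such numbers under that assumption; `parse_fortran_float` is an illustrative helper, not a Molcas routine.

```python
import re

# A mantissa followed directly by a signed exponent, with no 'E' or 'D' in between.
_BARE_EXPONENT = re.compile(r"^([+-]?(?:\d+\.?\d*|\.\d+))([+-]\d+)$")


def parse_fortran_float(text):
    """Parse a float that may use a 'D' exponent letter or no exponent letter at all."""
    text = text.strip()
    match = _BARE_EXPONENT.match(text)
    if match:
        # Re-insert the missing exponent letter, e.g. '-2.3+278' -> '-2.3e+278'.
        return float(match.group(1) + "e" + match.group(2))
    return float(text.replace("D", "e").replace("d", "e"))
```

A magnitude check on the parsed value, such as flagging anything above 1e10 in absolute value, would catch this kind of corrupted energy even when the module finishes with rc=0; the threshold is an arbitrary choice.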

Ignacio · 2020-11-10 17:14:06

Is the initial scratch directory clean in all cases?

andrewshyichuk · 2020-11-10 18:39:28

The run script clears the scratch dir before running Seward.
Still, I also tried removing the old workdirs, or moving them to a new folder (to force the use of other disk areas), and got the same problem either way.

I am running a test with the freshly recompiled OpenBLAS 0.3.9 and freshly recompiled OpenMolcas d7828899 (i.e. the same versions that used to work before the system update, and that still work on another (identical) machine, with the same updated system).

Edit: if this one fails, I will try the internal linalg.

Last edited by andrewshyichuk (2020-11-10 18:46:40)

andrewshyichuk · 2020-11-12 16:13:00

Alrighty, I seem to have found the problem.
I ran memtester and it found errors. I.e. it was likely a bad RAM issue from the start, and I should have tried memory testing sooner.

Thank you for your comments.
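
For readers unfamiliar with how memory testers work, the basic idea is to write known patterns to memory and read them back. The toy Python sketch below illustrates only that idea. It is not a substitute for memtester: it cannot lock pages or choose physical addresses, and the function name and patterns are illustrative.

```python
def pattern_test(size_bytes, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Fill a buffer with each byte pattern, read it back, and count mismatching bytes."""
    buffer = bytearray(size_bytes)
    mismatches = 0
    for pattern in patterns:
        expected = bytes([pattern]) * size_bytes
        buffer[:] = expected  # write the pattern over the whole buffer
        if buffer != expected:
            # Only walk the buffer byte by byte when a difference was detected.
            mismatches += sum(
                1 for got, want in zip(buffer, expected) if got != want
            )
    return mismatches
```

On healthy hardware this returns 0. A nonzero count would be a strong hint of faulty RAM, but a zero result from such a small user-space test does not prove the memory is good.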

Molcas Forum
