Jobs hanging/freezing




Ben_Ellis1
Newbie
Posts: 1
Joined: Mon Mar 28, 2022 2:54 pm

Jobs hanging/freezing

#1 Post by Ben_Ellis1 » Fri Jul 04, 2025 1:20 pm

I am experiencing issues when running standard VASP (vasp_std) calculations. The jobs periodically freeze or hang and stop writing output. Ssh'ing onto the HPC node shows that the vasp_std instances are still running and that the RAM has not maxed out. This seems to occur randomly and does not depend on the number of nodes or number of cores. It also occurs independently of the VASP version; I have seen it with both VASP 6.1.1 and 6.4.2. It has happened for a range of materials, from metal oxides to metals and inorganics, and seems to be unaffected by INCAR settings (with and without KPAR).

The systems on which we have experienced this are both Red Hat Linux x86_64 platforms, with up to 64 or 128 cores per node. We are using the gcc/12.3.0 compiler with OpenMPI.

Has anyone seen this issue before, or have any suggestions as to why this is happening? I am happy to provide more information if needed.


andreas.singraber
Global Moderator
Posts: 371
Joined: Mon Apr 26, 2021 7:40 am

Re: Jobs hanging/freezing

#2 Post by andreas.singraber » Mon Jul 14, 2025 10:28 am

Hello!

Sorry for this very late reply! Since the problem you describe appears across many different systems, INCAR setups, and even VASP versions, it is hard to imagine that there is a direct cause inside the VASP source code. Can you be more specific about how you identify that the jobs "freeze"? If you log in via ssh and run top, can you still see the VASP instances running at 100%?

The file output is not a good indicator of whether a program actually hangs, because it is usually buffered by the OS. That means that even when the VASP source code has already issued a write statement, you may not see the file output immediately: the data is first written to an OS-controlled buffer rather than to the file, and only when the buffer is full does the OS finally write it to the corresponding files. On HPC systems this can take a considerable amount of time (depending on how much data is written), so it may look like the job is frozen. For example, the output in the OUTCAR file may stop abruptly (even in the middle of a line) and continue only much later, once enough output lines have accumulated in the OS buffer. Although annoying, this behavior is completely normal and there is no need to worry.
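
As an aside, a quick way to check from the command line whether the ranks are really busy could look like this (a minimal sketch; the node name is a placeholder and the binary may be vasp_std, vasp_gam, etc. on your system):

squeue -u $USER                     # find the job ID and the list of allocated nodes
ssh <node_name>                     # log in to one of the allocated compute nodes
top -b -n 1 | grep vasp             # busy ranks should show close to 100% CPU
ps -eo pid,stat,comm | grep vasp    # many ranks stuck in "D" state would hint at I/O or MPI waits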

A quick test to see whether there are any actual hang-ups in the middle of a VASP run is to look at the LOOP and LOOP+ timings provided in the OUTCAR file. Compare the output of different machines, one with and one without "freezing". Check whether you can spot unusually long iterations on the machines with "freezing", or whether the numbers are comparable.
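
For example, assuming the standard OUTCAR timing lines, something like the following lists one timing line per electronic and per ionic step, which makes unusually long steps easy to spot:

grep "LOOP:"  OUTCAR    # timing of each electronic step
grep "LOOP+:" OUTCAR    # timing of each ionic step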

All the best,
Andreas Singraber


ramon_bergua
Newbie
Posts: 9
Joined: Thu Jan 12, 2023 6:02 pm

Re: Jobs hanging/freezing

#3 Post by ramon_bergua » Fri Nov 07, 2025 8:09 am

Dear Support team,

I am experiencing the same issue when running my calculations on an HPC system. Troubleshooting is made more difficult by the randomness with which this error arises.

In my case I have several calculations (let us say 40); the system is composed of a certain surface with a platinum monomer deposited on it. Each calculation corresponds to a different deposition area on the surface.

I launch them with a batch script, and approximately half of them finish without problems, while the rest are affected by this issue at different stages of the calculation (Iteration 17(4), Iteration 18(8), Iteration 10(27), Iteration 1(33), Iteration 6(11)). However, as explained in the previous post, they are still shown as running in the Slurm queue. So I restarted them by hand, copying CONTCAR to POSCAR, and queued them again. In this second try, some of them were able to finish smoothly, while others were affected again by the same issue.
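
For reference, the manual restart amounts to something like the following (a minimal sketch; slurm_mnv is my submit script, shown below):

cp CONTCAR POSCAR    # continue the relaxation from the last ionic positions
sbatch slurm_mnv     # resubmit the job to the Slurm queue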

The batch script is shown below:
#SBATCH --cpus-per-task=1
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=112
module load hdf5/1.10.11
module load vasp/6.4.2
srun vasp_gam

And the INCAR file:
ISTART = 0
ICHARG=2
ENCUT = 400
IBRION = 2
ALGO = Fast
EDIFF = 1e-06
IVDW = 11
ISPIN = 2
ISMEAR = 0
LWAVE = .FALSE.
LCHARG = .FALSE.
SIGMA = 0.1
NSW = 800
NCORE=56
NELM=200
MAXMIX=40

I hope this helps to identify the issue.


ramon_bergua
Newbie
Posts: 9
Joined: Thu Jan 12, 2023 6:02 pm

Re: Jobs hanging/freezing

#4 Post by ramon_bergua » Fri Nov 07, 2025 2:33 pm

I can add something to my previous reply: in all the failed jobs the OUTCAR always stops after printing the 'SETDIJ cpu time' line and before printing the 'EDDAV cpu time' line, as follows:

--------------------------------------- Iteration 5( 4) ---------------------------------------

POTLOK: cpu time 0.3185: real time 0.3220
SETDIJ: cpu time 0.0329: real time 0.0333
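
A quick way to confirm this pattern across all runs is something like the following (a rough sketch; job_*/ stands for my calculation directories):

for d in job_*/ ; do
    echo "== $d"
    tail -n 4 "$d"/OUTCAR    # a hung job ends with the SETDIJ timing line
done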

I have also noticed that the jobs fail when using 3 or more nodes (with 112 cores per node), while for 1 or 2 nodes the jobs always finish.

I thought that setting NCORE=56 in the INCAR might be too large a value; however, the jobs keep failing when submitted on 3 or 4 nodes even with NCORE=28.
On the other hand, with NCORE=56 and nodes=1 or 2 in the batch script, the jobs finish without problems.


andreas.singraber
Global Moderator
Posts: 371
Joined: Mon Apr 26, 2021 7:40 am

Re: Jobs hanging/freezing

#5 Post by andreas.singraber » Mon Nov 10, 2025 8:57 am

Hello!

At first sight it is unclear to me whether the two problems described in this topic are related. In any case, the symptoms (hang-ups) are similar, so let's try to dive deeper into the issue. You mentioned that half of the jobs are affected, which is quite substantial and can likely not be explained by problems on the HPC system itself (there would be a lot of reports to the system administrators by now). However, I would not completely rule out an issue with the MPI libraries at this point. The most important step now is to collect as much information on your system and VASP setup as possible. Ultimately, if I can reproduce the hang-ups on my side, it is very likely that we can fix the issue. Hence, I would like to ask you for the following pieces of information:

  • All relevant input files (INCAR, POSCAR, POTCAR, KPOINTS, and whatever else VASP needs to start) for one of the simulations that gave you a hang-up and for one that was fine.

  • The submit script and a list of all modules that are loaded during execution (command: module list).

  • The relevant output files, typically OUTCAR and the submit script output, of a failing and of a successful run.

  • The makefile.include that was used to compile VASP and a list of the compilers and libraries with their respective version numbers. Typically this can be inferred from the loaded modules, but it would be great if you could also post the output of this command (ideally run as part of your job script):


    ldd /path/to/your/vasp_std
  • Specification of your hardware, i.e., CPU, memory, and node interconnect hardware (e.g., InfiniBand). You can get a lot of this by adding these commands to your submit script (if they exist on your HPC system):


    lscpu
    lsmem
    lstopo
    
  • Ideally, also try to find out which operating system the HPC is running (system libraries could also play a role); see the commands sketched right below. Is there any virtualization, i.e., are you running VASP on a cloud computing platform?
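
    For example, the following commands (where available) report the distribution, kernel, and whether the node is virtualized:

    cat /etc/os-release      # distribution name and version
    uname -r                 # kernel version
    systemd-detect-virt      # prints "none" on bare metal, otherwise the hypervisor type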

I am sorry that I have to ask for such detailed information, but because of the randomness involved in the problem, debugging without this precise data would only be guesswork and it is unlikely that we could find the origin of this issue. Thank you for your efforts!

All the best,
Andreas Singraber


ramon_bergua
Newbie
Posts: 9
Joined: Thu Jan 12, 2023 6:02 pm

Re: Jobs hanging/freezing

#6 Post by ramon_bergua » Tue Nov 11, 2025 10:45 am

Dear Andreas,

Thank you very much for your kind reply. I will try to gather all the requested info, as I am currently in contact with the HPC support team.
I attach two folders. One is for a job that finished (Pt*_d07); however, its OUTCAR file is too large: 58 MB, and 20 MB compressed, while the file size limit for uploads is 8 MB.
The other one (after many tries) still did not finish (Pt*_dx1).

In both folders the batch script is the 'slurm_mnv' file. Both calculations were launched on the same HPC with the same batch script, except for the number of nodes: in the successful run nodes=2, while in the failing run I tried several numbers of requested nodes. You will find the various attempts in the error.zip file if you run: head Pt1_0def.out*

Here is the requested info:

$user:


 ldd /apps/GPP/VASP/6.4.2/INTEL/IMPI/bin/vasp_gam
	linux-vdso.so.1 (0x00007fffe3527000)
	/opt/.snoopy/lib/libsnoopy.so (0x00007ffa11f8e000)
	libmkl_intel_lp64.so.2 => /gpfs/apps/MN5/GPP/ONEAPI/2023.2.0/mkl/2023.2.0/lib/intel64/libmkl_intel_lp64.so.2 (0x00007ffa10a92000)
	libmkl_sequential.so.2 => /gpfs/apps/MN5/GPP/ONEAPI/2023.2.0/mkl/2023.2.0/lib/intel64/libmkl_sequential.so.2 (0x00007ffa0f10e000)
	libmkl_core.so.2 => /gpfs/apps/MN5/GPP/ONEAPI/2023.2.0/mkl/2023.2.0/lib/intel64/libmkl_core.so.2 (0x00007ffa0ad96000)
	libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007ffa0ab61000)
	libmkl_scalapack_lp64.so.2 => /gpfs/apps/MN5/GPP/ONEAPI/2023.2.0/mkl/2023.2.0/lib/intel64/libmkl_scalapack_lp64.so.2 (0x00007ffa0a436000)
	libmkl_blacs_intelmpi_lp64.so.2 => /gpfs/apps/MN5/GPP/ONEAPI/2023.2.0/mkl/2023.2.0/lib/intel64/libmkl_blacs_intelmpi_lp64.so.2 (0x00007ffa0a3f2000)
	libhdf5_fortran.so.102 => /apps/GPP/HDF5/1.10.11/INTEL/IMPI/lib/libhdf5_fortran.so.102 (0x00007ffa0a394000)
	libmpifort.so.12 => /gpfs/apps/MN5/GPP/ONEAPI/2023.2.0/mpi/2021.10.0/lib/libmpifort.so.12 (0x00007ffa09fe2000)
	libmpi.so.12 => /gpfs/apps/MN5/GPP/ONEAPI/2023.2.0/mpi/2021.10.0/lib/release/libmpi.so.12 (0x00007ffa08511000)
	libm.so.6 => /lib64/libm.so.6 (0x00007ffa08436000)
	libc.so.6 => /lib64/libc.so.6 (0x00007ffa0822d000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007ffa08212000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007ffa0820d000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007ffa08206000)
	/lib64/ld-linux-x86-64.so.2 (0x00007ffa11f9f000)
	libhdf5.so.103 => /apps/GPP/HDF5/1.10.11/INTEL/IMPI/lib/libhdf5.so.103 (0x00007ffa07d47000)
	libz.so.1 => /lib64/libz.so.1 (0x00007ffa07d2d000)
	libifport.so.5 => /gpfs/apps/MN5/GPP/ONEAPI/2023.2.0/compiler/latest/linux/compiler/lib/intel64_lin/libifport.so.5 (0x00007ffa07d03000)
	libifcoremt.so.5 => /gpfs/apps/MN5/GPP/ONEAPI/2023.2.0/compiler/latest/linux/compiler/lib/intel64_lin/libifcoremt.so.5 (0x00007ffa07b89000)
	libimf.so => /gpfs/apps/MN5/GPP/ONEAPI/2023.2.0/compiler/latest/linux/compiler/lib/intel64_lin/libimf.so (0x00007ffa0779d000)
	libintlc.so.5 => /gpfs/apps/MN5/GPP/ONEAPI/2023.2.0/compiler/latest/linux/compiler/lib/intel64_lin/libintlc.so.5 (0x00007ffa07725000)
	libsvml.so => /gpfs/apps/MN5/GPP/ONEAPI/2023.2.0/compiler/latest/linux/compiler/lib/intel64_lin/libsvml.so (0x00007ffa060f6000)
	librt.so.1 => /lib64/librt.so.1 (0x00007ffa060f1000)
	libirng.so => /gpfs/apps/MN5/GPP/ONEAPI/2023.2.0/compiler/latest/linux/compiler/lib/intel64_lin/libirng.so (0x00007ffa05dda000)

Loading modules:
$user: module load hdf5/1.10.11
$user: module load vasp/6.4.2
$user: module list


Currently Loaded Modules:
  1) intel/2023.2.0   3) mkl/2023.2.0   5) oneapi/2023.2.0   7) hdf5/1.10.11
  2) impi/2021.10.0   4) ucx/1.15.0     6) bsc/1.0           8) vasp/6.4.2

$ lscpu


Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 57 bits virtual
  Byte Order:            Little Endian
CPU(s):                  224
  On-line CPU(s) list:   0-223
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8480+
    CPU family:          6
    Model:               143
    Thread(s) per core:  2
    Core(s) per socket:  56
    Socket(s):           2
    Stepping:            8
    CPU max MHz:         3800.0000
    CPU min MHz:         800.0000
    BogoMIPS:            4000.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx 
                         fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bt
                         s rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 ds
                         _cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
                          tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l
                         3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_s
                         hadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid 
                         cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_n
                         i avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mb
                         m_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts avx512vbmi um
                         ip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_v
                         popcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize
                          tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabili
                         ties
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   5.3 MiB (112 instances)
  L1i:                   3.5 MiB (112 instances)
  L2:                    224 MiB (112 instances)
  L3:                    210 MiB (2 instances)
NUMA:                    
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-55,112-167
  NUMA node1 CPU(s):     56-111,168-223
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected

$ lsmem


RANGE                                  SIZE  STATE REMOVABLE BLOCK
0x0000000000000000-0x000000007fffffff    2G online       yes     0
0x0000000100000000-0x000000407fffffff  254G online       yes 2-128

Memory block size:         2G
Total online memory:     256G
Total offline memory:      0B

$lstopo


-bash: lstopo: command not found

I found another command that can output the required information, "lstopo-no-graphics"; however, I reached the maximum character limit for this post, so I will continue in the next post.


ramon_bergua
Newbie
Posts: 9
Joined: Thu Jan 12, 2023 6:02 pm

Re: Jobs hanging/freezing

#7 Post by ramon_bergua » Tue Nov 11, 2025 10:54 am

$ lstopo-no-graphics



Machine (252GB total)
  Package L#0
    NUMANode L#0 (P#0 126GB)
    L3 L#0 (105MB)
      L2 L#0 (2048KB) + L1d L#0 (48KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#112)
      L2 L#1 (2048KB) + L1d L#1 (48KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#113)
      L2 L#2 (2048KB) + L1d L#2 (48KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#114)
      L2 L#3 (2048KB) + L1d L#3 (48KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#115)
      L2 L#4 (2048KB) + L1d L#4 (48KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#116)
      L2 L#5 (2048KB) + L1d L#5 (48KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#117)
      L2 L#6 (2048KB) + L1d L#6 (48KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#118)
      L2 L#7 (2048KB) + L1d L#7 (48KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#119)
      L2 L#8 (2048KB) + L1d L#8 (48KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#120)
      L2 L#9 (2048KB) + L1d L#9 (48KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#121)
      L2 L#10 (2048KB) + L1d L#10 (48KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#122)
      L2 L#11 (2048KB) + L1d L#11 (48KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#123)
      L2 L#12 (2048KB) + L1d L#12 (48KB) + L1i L#12 (32KB) + Core L#12
        PU L#24 (P#12)
        PU L#25 (P#124)
      L2 L#13 (2048KB) + L1d L#13 (48KB) + L1i L#13 (32KB) + Core L#13
        PU L#26 (P#13)
        PU L#27 (P#125)
      L2 L#14 (2048KB) + L1d L#14 (48KB) + L1i L#14 (32KB) + Core L#14
        PU L#28 (P#14)
        PU L#29 (P#126)
      L2 L#15 (2048KB) + L1d L#15 (48KB) + L1i L#15 (32KB) + Core L#15
        PU L#30 (P#15)
        PU L#31 (P#127)
      L2 L#16 (2048KB) + L1d L#16 (48KB) + L1i L#16 (32KB) + Core L#16
        PU L#32 (P#16)
        PU L#33 (P#128)
      L2 L#17 (2048KB) + L1d L#17 (48KB) + L1i L#17 (32KB) + Core L#17
        PU L#34 (P#17)
        PU L#35 (P#129)
      L2 L#18 (2048KB) + L1d L#18 (48KB) + L1i L#18 (32KB) + Core L#18
        PU L#36 (P#18)
        PU L#37 (P#130)
      L2 L#19 (2048KB) + L1d L#19 (48KB) + L1i L#19 (32KB) + Core L#19
        PU L#38 (P#19)
        PU L#39 (P#131)
      L2 L#20 (2048KB) + L1d L#20 (48KB) + L1i L#20 (32KB) + Core L#20
        PU L#40 (P#20)
        PU L#41 (P#132)
      L2 L#21 (2048KB) + L1d L#21 (48KB) + L1i L#21 (32KB) + Core L#21
        PU L#42 (P#21)
        PU L#43 (P#133)
      L2 L#22 (2048KB) + L1d L#22 (48KB) + L1i L#22 (32KB) + Core L#22
        PU L#44 (P#22)
        PU L#45 (P#134)
      L2 L#23 (2048KB) + L1d L#23 (48KB) + L1i L#23 (32KB) + Core L#23
        PU L#46 (P#23)
        PU L#47 (P#135)
      L2 L#24 (2048KB) + L1d L#24 (48KB) + L1i L#24 (32KB) + Core L#24
        PU L#48 (P#24)
        PU L#49 (P#136)
      L2 L#25 (2048KB) + L1d L#25 (48KB) + L1i L#25 (32KB) + Core L#25
        PU L#50 (P#25)
        PU L#51 (P#137)
      L2 L#26 (2048KB) + L1d L#26 (48KB) + L1i L#26 (32KB) + Core L#26
        PU L#52 (P#26)
        PU L#53 (P#138)
      L2 L#27 (2048KB) + L1d L#27 (48KB) + L1i L#27 (32KB) + Core L#27
        PU L#54 (P#27)
        PU L#55 (P#139)
      L2 L#28 (2048KB) + L1d L#28 (48KB) + L1i L#28 (32KB) + Core L#28
        PU L#56 (P#28)
        PU L#57 (P#140)
      L2 L#29 (2048KB) + L1d L#29 (48KB) + L1i L#29 (32KB) + Core L#29
        PU L#58 (P#29)
        PU L#59 (P#141)
      L2 L#30 (2048KB) + L1d L#30 (48KB) + L1i L#30 (32KB) + Core L#30
        PU L#60 (P#30)
        PU L#61 (P#142)
      L2 L#31 (2048KB) + L1d L#31 (48KB) + L1i L#31 (32KB) + Core L#31
        PU L#62 (P#31)
        PU L#63 (P#143)
      L2 L#32 (2048KB) + L1d L#32 (48KB) + L1i L#32 (32KB) + Core L#32
        PU L#64 (P#32)
        PU L#65 (P#144)
      L2 L#33 (2048KB) + L1d L#33 (48KB) + L1i L#33 (32KB) + Core L#33
        PU L#66 (P#33)
        PU L#67 (P#145)
      L2 L#34 (2048KB) + L1d L#34 (48KB) + L1i L#34 (32KB) + Core L#34
        PU L#68 (P#34)
        PU L#69 (P#146)
      L2 L#35 (2048KB) + L1d L#35 (48KB) + L1i L#35 (32KB) + Core L#35
        PU L#70 (P#35)
        PU L#71 (P#147)
      L2 L#36 (2048KB) + L1d L#36 (48KB) + L1i L#36 (32KB) + Core L#36
        PU L#72 (P#36)
        PU L#73 (P#148)
      L2 L#37 (2048KB) + L1d L#37 (48KB) + L1i L#37 (32KB) + Core L#37
        PU L#74 (P#37)
        PU L#75 (P#149)
      L2 L#38 (2048KB) + L1d L#38 (48KB) + L1i L#38 (32KB) + Core L#38
        PU L#76 (P#38)
        PU L#77 (P#150)
      L2 L#39 (2048KB) + L1d L#39 (48KB) + L1i L#39 (32KB) + Core L#39
        PU L#78 (P#39)
        PU L#79 (P#151)
      L2 L#40 (2048KB) + L1d L#40 (48KB) + L1i L#40 (32KB) + Core L#40
        PU L#80 (P#40)
        PU L#81 (P#152)
      L2 L#41 (2048KB) + L1d L#41 (48KB) + L1i L#41 (32KB) + Core L#41
        PU L#82 (P#41)
        PU L#83 (P#153)
      L2 L#42 (2048KB) + L1d L#42 (48KB) + L1i L#42 (32KB) + Core L#42
        PU L#84 (P#42)
        PU L#85 (P#154)
      L2 L#43 (2048KB) + L1d L#43 (48KB) + L1i L#43 (32KB) + Core L#43
        PU L#86 (P#43)
        PU L#87 (P#155)
      L2 L#44 (2048KB) + L1d L#44 (48KB) + L1i L#44 (32KB) + Core L#44
        PU L#88 (P#44)
        PU L#89 (P#156)
      L2 L#45 (2048KB) + L1d L#45 (48KB) + L1i L#45 (32KB) + Core L#45
        PU L#90 (P#45)
        PU L#91 (P#157)
      L2 L#46 (2048KB) + L1d L#46 (48KB) + L1i L#46 (32KB) + Core L#46
        PU L#92 (P#46)
        PU L#93 (P#158)
      L2 L#47 (2048KB) + L1d L#47 (48KB) + L1i L#47 (32KB) + Core L#47
        PU L#94 (P#47)
        PU L#95 (P#159)
      L2 L#48 (2048KB) + L1d L#48 (48KB) + L1i L#48 (32KB) + Core L#48
        PU L#96 (P#48)
        PU L#97 (P#160)
      L2 L#49 (2048KB) + L1d L#49 (48KB) + L1i L#49 (32KB) + Core L#49
        PU L#98 (P#49)
        PU L#99 (P#161)
      L2 L#50 (2048KB) + L1d L#50 (48KB) + L1i L#50 (32KB) + Core L#50
        PU L#100 (P#50)
        PU L#101 (P#162)
      L2 L#51 (2048KB) + L1d L#51 (48KB) + L1i L#51 (32KB) + Core L#51
        PU L#102 (P#51)
        PU L#103 (P#163)
      L2 L#52 (2048KB) + L1d L#52 (48KB) + L1i L#52 (32KB) + Core L#52
        PU L#104 (P#52)
        PU L#105 (P#164)
      L2 L#53 (2048KB) + L1d L#53 (48KB) + L1i L#53 (32KB) + Core L#53
        PU L#106 (P#53)
        PU L#107 (P#165)
      L2 L#54 (2048KB) + L1d L#54 (48KB) + L1i L#54 (32KB) + Core L#54
        PU L#108 (P#54)
        PU L#109 (P#166)
      L2 L#55 (2048KB) + L1d L#55 (48KB) + L1i L#55 (32KB) + Core L#55
        PU L#110 (P#55)
        PU L#111 (P#167)
    HostBridge
      PCIBridge
        PCIBridge
          PCI 02:00.0 (VGA)
      PCI 00:17.0 (SATA)
      PCI 00:18.0 (SATA)
      PCI 00:19.0 (SATA)
    HostBridge
      PCIBridge
        PCI 16:00.0 (Ethernet)
          Net "ens13f0np0"
        PCI 16:00.1 (Ethernet)
          Net "ens13f1np1"
        PCI 16:00.2 (Ethernet)
          Net "ens13f2np2"
        PCI 16:00.3 (Ethernet)
          Net "ens13f3np3"
    HostBridge
      PCIBridge
        PCI 38:00.0 (InfiniBand)
          Net "ib0"
          OpenFabrics "mlx5_0"
        PCI 38:00.1 (InfiniBand)
          Net "ib1"
          OpenFabrics "mlx5_1"
    HostBridge
      PCI 6b:00.0 (Co-Processor)
    HostBridge
      PCI 6d:00.0 (Co-Processor)
  Package L#1
    NUMANode L#1 (P#1 126GB)
    L3 L#1 (105MB)
      L2 L#56 (2048KB) + L1d L#56 (48KB) + L1i L#56 (32KB) + Core L#56
        PU L#112 (P#56)
        PU L#113 (P#168)
      L2 L#57 (2048KB) + L1d L#57 (48KB) + L1i L#57 (32KB) + Core L#57
        PU L#114 (P#57)
        PU L#115 (P#169)
      L2 L#58 (2048KB) + L1d L#58 (48KB) + L1i L#58 (32KB) + Core L#58
        PU L#116 (P#58)
        PU L#117 (P#170)
      L2 L#59 (2048KB) + L1d L#59 (48KB) + L1i L#59 (32KB) + Core L#59
        PU L#118 (P#59)
        PU L#119 (P#171)
      L2 L#60 (2048KB) + L1d L#60 (48KB) + L1i L#60 (32KB) + Core L#60
        PU L#120 (P#60)
        PU L#121 (P#172)
      L2 L#61 (2048KB) + L1d L#61 (48KB) + L1i L#61 (32KB) + Core L#61
        PU L#122 (P#61)
        PU L#123 (P#173)
      L2 L#62 (2048KB) + L1d L#62 (48KB) + L1i L#62 (32KB) + Core L#62
        PU L#124 (P#62)
        PU L#125 (P#174)
      L2 L#63 (2048KB) + L1d L#63 (48KB) + L1i L#63 (32KB) + Core L#63
        PU L#126 (P#63)
        PU L#127 (P#175)
      L2 L#64 (2048KB) + L1d L#64 (48KB) + L1i L#64 (32KB) + Core L#64
        PU L#128 (P#64)
        PU L#129 (P#176)
      L2 L#65 (2048KB) + L1d L#65 (48KB) + L1i L#65 (32KB) + Core L#65
        PU L#130 (P#65)
        PU L#131 (P#177)
      L2 L#66 (2048KB) + L1d L#66 (48KB) + L1i L#66 (32KB) + Core L#66
        PU L#132 (P#66)
        PU L#133 (P#178)
      L2 L#67 (2048KB) + L1d L#67 (48KB) + L1i L#67 (32KB) + Core L#67
        PU L#134 (P#67)
        PU L#135 (P#179)
      L2 L#68 (2048KB) + L1d L#68 (48KB) + L1i L#68 (32KB) + Core L#68
        PU L#136 (P#68)
        PU L#137 (P#180)
      L2 L#69 (2048KB) + L1d L#69 (48KB) + L1i L#69 (32KB) + Core L#69
        PU L#138 (P#69)
        PU L#139 (P#181)
      L2 L#70 (2048KB) + L1d L#70 (48KB) + L1i L#70 (32KB) + Core L#70
        PU L#140 (P#70)
        PU L#141 (P#182)
      L2 L#71 (2048KB) + L1d L#71 (48KB) + L1i L#71 (32KB) + Core L#71
        PU L#142 (P#71)
        PU L#143 (P#183)
      L2 L#72 (2048KB) + L1d L#72 (48KB) + L1i L#72 (32KB) + Core L#72
        PU L#144 (P#72)
        PU L#145 (P#184)
      L2 L#73 (2048KB) + L1d L#73 (48KB) + L1i L#73 (32KB) + Core L#73
        PU L#146 (P#73)
        PU L#147 (P#185)
      L2 L#74 (2048KB) + L1d L#74 (48KB) + L1i L#74 (32KB) + Core L#74
        PU L#148 (P#74)
        PU L#149 (P#186)
      L2 L#75 (2048KB) + L1d L#75 (48KB) + L1i L#75 (32KB) + Core L#75
        PU L#150 (P#75)
        PU L#151 (P#187)
      L2 L#76 (2048KB) + L1d L#76 (48KB) + L1i L#76 (32KB) + Core L#76
        PU L#152 (P#76)
        PU L#153 (P#188)
      L2 L#77 (2048KB) + L1d L#77 (48KB) + L1i L#77 (32KB) + Core L#77
        PU L#154 (P#77)
        PU L#155 (P#189)
      L2 L#78 (2048KB) + L1d L#78 (48KB) + L1i L#78 (32KB) + Core L#78
        PU L#156 (P#78)
        PU L#157 (P#190)
      L2 L#79 (2048KB) + L1d L#79 (48KB) + L1i L#79 (32KB) + Core L#79
        PU L#158 (P#79)
        PU L#159 (P#191)
      L2 L#80 (2048KB) + L1d L#80 (48KB) + L1i L#80 (32KB) + Core L#80
        PU L#160 (P#80)
        PU L#161 (P#192)
      L2 L#81 (2048KB) + L1d L#81 (48KB) + L1i L#81 (32KB) + Core L#81
        PU L#162 (P#81)
        PU L#163 (P#193)
      L2 L#82 (2048KB) + L1d L#82 (48KB) + L1i L#82 (32KB) + Core L#82
        PU L#164 (P#82)
        PU L#165 (P#194)
      L2 L#83 (2048KB) + L1d L#83 (48KB) + L1i L#83 (32KB) + Core L#83
        PU L#166 (P#83)
        PU L#167 (P#195)
      L2 L#84 (2048KB) + L1d L#84 (48KB) + L1i L#84 (32KB) + Core L#84
        PU L#168 (P#84)
        PU L#169 (P#196)
      L2 L#85 (2048KB) + L1d L#85 (48KB) + L1i L#85 (32KB) + Core L#85
        PU L#170 (P#85)
        PU L#171 (P#197)
      L2 L#86 (2048KB) + L1d L#86 (48KB) + L1i L#86 (32KB) + Core L#86
        PU L#172 (P#86)
        PU L#173 (P#198)
      L2 L#87 (2048KB) + L1d L#87 (48KB) + L1i L#87 (32KB) + Core L#87
        PU L#174 (P#87)
        PU L#175 (P#199)
      L2 L#88 (2048KB) + L1d L#88 (48KB) + L1i L#88 (32KB) + Core L#88
        PU L#176 (P#88)
        PU L#177 (P#200)
      L2 L#89 (2048KB) + L1d L#89 (48KB) + L1i L#89 (32KB) + Core L#89
        PU L#178 (P#89)
        PU L#179 (P#201)
      L2 L#90 (2048KB) + L1d L#90 (48KB) + L1i L#90 (32KB) + Core L#90
        PU L#180 (P#90)
        PU L#181 (P#202)
      L2 L#91 (2048KB) + L1d L#91 (48KB) + L1i L#91 (32KB) + Core L#91
        PU L#182 (P#91)
        PU L#183 (P#203)
      L2 L#92 (2048KB) + L1d L#92 (48KB) + L1i L#92 (32KB) + Core L#92
        PU L#184 (P#92)
        PU L#185 (P#204)
      L2 L#93 (2048KB) + L1d L#93 (48KB) + L1i L#93 (32KB) + Core L#93
        PU L#186 (P#93)
        PU L#187 (P#205)
      L2 L#94 (2048KB) + L1d L#94 (48KB) + L1i L#94 (32KB) + Core L#94
        PU L#188 (P#94)
        PU L#189 (P#206)
      L2 L#95 (2048KB) + L1d L#95 (48KB) + L1i L#95 (32KB) + Core L#95
        PU L#190 (P#95)
        PU L#191 (P#207)
      L2 L#96 (2048KB) + L1d L#96 (48KB) + L1i L#96 (32KB) + Core L#96
        PU L#192 (P#96)
        PU L#193 (P#208)
      L2 L#97 (2048KB) + L1d L#97 (48KB) + L1i L#97 (32KB) + Core L#97
        PU L#194 (P#97)
        PU L#195 (P#209)
      L2 L#98 (2048KB) + L1d L#98 (48KB) + L1i L#98 (32KB) + Core L#98
        PU L#196 (P#98)
        PU L#197 (P#210)
      L2 L#99 (2048KB) + L1d L#99 (48KB) + L1i L#99 (32KB) + Core L#99
        PU L#198 (P#99)
        PU L#199 (P#211)
      L2 L#100 (2048KB) + L1d L#100 (48KB) + L1i L#100 (32KB) + Core L#100
        PU L#200 (P#100)
        PU L#201 (P#212)
      L2 L#101 (2048KB) + L1d L#101 (48KB) + L1i L#101 (32KB) + Core L#101
        PU L#202 (P#101)
        PU L#203 (P#213)
      L2 L#102 (2048KB) + L1d L#102 (48KB) + L1i L#102 (32KB) + Core L#102
        PU L#204 (P#102)
        PU L#205 (P#214)
      L2 L#103 (2048KB) + L1d L#103 (48KB) + L1i L#103 (32KB) + Core L#103
        PU L#206 (P#103)
        PU L#207 (P#215)
      L2 L#104 (2048KB) + L1d L#104 (48KB) + L1i L#104 (32KB) + Core L#104
        PU L#208 (P#104)
        PU L#209 (P#216)
      L2 L#105 (2048KB) + L1d L#105 (48KB) + L1i L#105 (32KB) + Core L#105
        PU L#210 (P#105)
        PU L#211 (P#217)
      L2 L#106 (2048KB) + L1d L#106 (48KB) + L1i L#106 (32KB) + Core L#106
        PU L#212 (P#106)
        PU L#213 (P#218)
      L2 L#107 (2048KB) + L1d L#107 (48KB) + L1i L#107 (32KB) + Core L#107
        PU L#214 (P#107)
        PU L#215 (P#219)
      L2 L#108 (2048KB) + L1d L#108 (48KB) + L1i L#108 (32KB) + Core L#108
        PU L#216 (P#108)
        PU L#217 (P#220)
      L2 L#109 (2048KB) + L1d L#109 (48KB) + L1i L#109 (32KB) + Core L#109
        PU L#218 (P#109)
        PU L#219 (P#221)
      L2 L#110 (2048KB) + L1d L#110 (48KB) + L1i L#110 (32KB) + Core L#110
        PU L#220 (P#110)
        PU L#221 (P#222)
      L2 L#111 (2048KB) + L1d L#111 (48KB) + L1i L#111 (32KB) + Core L#111
        PU L#222 (P#111)
        PU L#223 (P#223)
    HostBridge
      PCIBridge
        PCI a8:00.0 (Ethernet)
          Net "ens4f0np0"
          OpenFabrics "mlx5_bond_0"
        PCI a8:00.1 (Ethernet)
          Net "ens4f1np1"
    HostBridge
      PCIBridge
        PCI c8:00.0 (NVMExp)
          Block(Disk) "nvme0n1"
    HostBridge
      PCI e8:00.0 (Co-Processor)
    HostBridge
      PCI ea:00.0 (Co-Processor)

Also, I am currently in parallel contact with the HPC support team and I have let them know about this forum thread. They are making their own attempts to fix the bug, and today they replied; here is the info they gave me:

Hello Ramon,

We’ve been running some tests, since an issue this global (as the VASP developers correctly pointed out) is usually linked to a bug in the MPI or MKL libraries. While reviewing one of the jobs that got stuck, we saw that it happened during an MKL call, so we tried the following:


module purge
module load intel impi mkl/2024.2
module load hdf5/1.10.11 ucx
module load vasp/6.4.2

With this setup, we launched a new batch of 9 jobs, and so far none of them have crashed. Could you try this combination yourself to see if your jobs run more stably?

I will try this new combination and I will let you know the results. Thank you very much, Andreas, for your kind help!

Best regards,

Ramon Bergua

