VASP 6.6.0 AMD GPU offload: CRAY_ACC_ERROR - Host region overlaps present region
I was asked to post this issue here by the HPE Cray support team at our site (EPCC, Edinburgh).
We have just compiled VASP 6.6.0 for our small number of AMD MI210 GPUs using the AMD GPU offload "makefile.include" supplied with VASP 6.6.0. At runtime, on a single node with 4 MI210 GPUs, a couple of benchmark cases we regularly use consistently fail with the following error:
ACC: libcrayacc/acc_present.c:762 CRAY_ACC_ERROR - Host region (7ffcb137c540 to 7ffcb20fc540) overlaps present region (7ffcae79c540 to 7ffcb1d9c540 index 237) but is not contained for 'cr(:)' from fft_base.f90:652
I have included the full error stack below.
Benchmarks available at:
https://github.com/aturner-epcc/2026-01 ... erformance
Modules loaded at compile/run time:
libfabric/1.12.1.2.2.0.0
craype-network-ofi
perftools-base/25.09.0
xpmem/0.2.119-1.3_0_gnoinfo
cce/20.0.0
craype/2.7.35
cray-dsmml/0.3.1
cray-mpich/9.0.1
cray-libsci/25.09.0
PrgEnv-cray/8.6.0
rocm/6.3.4
craype-accel-amd-gfx90a
craype-x86-milan
cray-fftw/3.3.10.11
HPE Cray support say:
This indicates that you're trying to map a variable into device memory but it's already partially there. Both OpenMP and OpenACC disallow this. Off the top of my head I can think of a couple situations where this can occur.
The first, as noted above, is when you transfer something like X(10:20) and then try to map X(1:15). It's possible you can also see this if you have implicit maps for compute regions. I believe OpenMP addresses this and allows certain things. I believe we handle this but it's possible you're hitting a bug. Is the error occurring on a compute region or data region?
The second is an issue with stack-based arrays. If you don't unmap a stack-based array before it goes out of scope then you end up with a bad present table entry. When the stack is reused it's possible to trigger this type of error.
If you generate the runtime debug output with CRAY_ACC_DEBUG=3, it will output when something is added to the present table. Since the error says index 237, I would search backwards from the error for something like "add to present table index 237". That would allow you to find which allocation it is conflicting with.
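The backward search suggested here can be scripted once the debug output has been captured to a file. Below is a minimal sketch in Python. It assumes the relevant debug lines contain wording like "add to present table index 237"; that phrase is taken from the support reply, not from verified CRAY_ACC_DEBUG output, so the pattern may need adjusting. The function name and the sample lines are hypothetical.

```python
import re


def find_present_table_add(log_lines, index, error_marker="CRAY_ACC_ERROR"):
    """Search backwards from the first error line for the line that added
    present-table entry `index`. Returns (line_number, text) or None."""
    # ASSUMPTION: the debug output uses wording like
    # "add to present table index N"; adjust the pattern if it differs.
    pattern = re.compile(rf"add to present table index {index}\b", re.IGNORECASE)

    # Position of the first error line (or end of log if there is none).
    error_pos = next(
        (i for i, line in enumerate(log_lines) if error_marker in line),
        len(log_lines),
    )

    # Walk backwards so the most recent matching entry is reported.
    for i in range(error_pos - 1, -1, -1):
        if pattern.search(log_lines[i]):
            return i + 1, log_lines[i].rstrip("\n")
    return None


# Made-up placeholder lines, only to show the call; real output will differ.
sample = [
    "<debug> add to present table index 23 <details>",
    "<debug> add to present table index 237 <details>",
    "<debug> unrelated message",
    "ACC: CRAY_ACC_ERROR - Host region (...) overlaps present region (... index 237)",
]
print(find_present_table_add(sample, 237))
```

For a real run, the lines would come from something like `open("debug.log").readlines()`. With several MPI ranks writing to the same stream the entries interleave, so it is probably easier to look at one rank's output at a time.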
A simple example of this error might be:
!$omp target enter data map(to: x(2:n-1))
!$omp target map(x) ! -> x(:) and mapped x(2:n-1) overlap -> error
x = x + 1
!$omp end target
Full error stack
ACC: libcrayacc/acc_present.c:762 CRAY_ACC_ERROR - Host region (7ffd827192c0 to 7ffd902d32c0) overlaps present region (7ffd827192c0 to 7ffd85e07ac0 index 177) but is not contained for 'cr(:)' from fft_base.f90:666
ACC: libcrayacc/acc_present.c:762 CRAY_ACC_ERROR - Host region (7ffe83083e80 to 7ffe90c3de80) overlaps present region (7ffe83083e80 to 7ffe86772680 index 177) but is not contained for 'cr(:)' from fft_base.f90:666
ACC: libcrayacc/acc_present.c:762 CRAY_ACC_ERROR - Host region (7ffef14a0300 to 7ffeff05a300) overlaps present region (7ffef14a0300 to 7ffef4b8eb00 index 177) but is not contained for 'cr(:)' from fft_base.f90:666
ACC: libcrayacc/acc_present.c:762 CRAY_ACC_ERROR - Host region (7ffc271b0e00 to 7ffc34d6ae00) overlaps present region (7ffc271b0e00 to 7ffc2a89f600 index 177) but is not contained for 'cr(:)' from fft_base.f90:666
running 4 mpi-ranks, with 1 threads/rank, on 1 nodes
distrk: each k-point on 4 cores, 1 groups
distr: one band on 4 cores, 1 groups
Offloading initialized ... 4 GPUs detected
RCCL MPI communication initialized ...
vasp.6.6.0 06Mar2026 (build Mar 30 2026 15:06:08) gamma-only
POSCAR found : 2 types and 1080 ions
scaLAPACK will be used
LDA part: xc-table for (Slater(with rela. corr.)+CA(PZ)), standard interpolation
POSCAR, INCAR and KPOINTS ok, starting setup
FFT: planning ... GRIDC
FFT: planning ... GRID_SOFT
FFT: planning ... GRID
WAVECAR not read
entering main loop
N E dE d eps ncg rms rms(c)
srun: error: nid200004: tasks 0-3: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=13093507.0
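To make the addresses in the error lines easier to compare, here is a small Python sketch that pulls the two ranges out of a CRAY_ACC_ERROR line and reports their sizes and how they overlap. The regular expression is written against the message format shown in this post only; the function name and the field names in the returned dictionary are my own.

```python
import re

# Matches the "Host region (...) overlaps present region (... index N)" part
# of the error lines quoted in this post.
ERROR_RE = re.compile(
    r"Host region \((?P<h0>[0-9a-f]+) to (?P<h1>[0-9a-f]+)\) overlaps present region "
    r"\((?P<p0>[0-9a-f]+) to (?P<p1>[0-9a-f]+) index (?P<idx>\d+)\)"
)


def describe_overlap(line):
    """Return sizes and overlap details for one error line, or None."""
    m = ERROR_RE.search(line)
    if m is None:
        return None
    h0, h1, p0, p1 = (int(m.group(k), 16) for k in ("h0", "h1", "p0", "p1"))
    return {
        "index": int(m.group("idx")),
        "host_bytes": h1 - h0,          # size of the range being mapped
        "present_bytes": p1 - p0,       # size of the existing entry
        "same_start": h0 == p0,         # both ranges begin at the same address
        "host_past_end": h1 - p1,       # bytes the host range extends beyond the entry
        "contained": p0 <= h0 and h1 <= p1,
    }


first_stack_line = (
    "ACC: libcrayacc/acc_present.c:762 CRAY_ACC_ERROR - Host region "
    "(7ffd827192c0 to 7ffd902d32c0) overlaps present region "
    "(7ffd827192c0 to 7ffd85e07ac0 index 177) but is not contained "
    "for 'cr(:)' from fft_base.f90:666"
)
print(describe_overlap(first_stack_line))
```

For all four ranks in the stack above this gives a host region of 230,400,000 bytes starting at the same address as a present region of 57,600,000 bytes, so the range being mapped is exactly four times the size of the existing entry. The addresses alone do not say whether that comes from a stale entry or from an earlier partial mapping of the same array.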