collective abort

Problems running VASP: crashes, internal errors, "wrong" results.

zhangyg
Newbie
Posts: 17
Joined: Wed Nov 07, 2007 3:36 am
License Nr.: 781

#1 Post by zhangyg » Fri Feb 29, 2008 7:02 am

Dear all:

I had VASP running smoothly on a home-made PC cluster built from AMD CPUs, with system sizes of 256 and 512 atoms. Recently we bought a new cluster of Intel CPUs from a company. There, VASP runs smoothly with 256 atoms, but as soon as the system has 512 atoms, VASP stops with the following message:
rank 3 in job 1 ... caused collective abort of all ranks
exit status of rank 3: killed by signal 11.
The error occurs at the very beginning, while the WAVECAR is being prepared.
It occurs even when we use only one node.
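
For readers unfamiliar with the message above: "killed by signal 11" refers to a POSIX signal number. A minimal sketch of how to look up the symbolic name follows; the helper name `describe_signal` is only illustrative and is not part of VASP or MPICH2.

```python
import signal


def describe_signal(number: int) -> str:
    """Return the symbolic name of a signal number as known to this OS."""
    return signal.Signals(number).name


# Signal 11 is the one reported in the abort message above.
print(describe_signal(11))  # prints SIGSEGV on Linux
```

SIGSEGV means the process made an invalid memory access. That is compatible with running out of memory, but it does not prove it.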

We have already changed dyna.f to 512 atoms.

It is a little strange that a system of 256 atoms runs while one of 512 atoms does not.

Please kindly help us.

admin
Administrator
Posts: 2922
Joined: Tue Aug 03, 2004 8:18 am
License Nr.: 458

#2 Post by admin » Fri Feb 29, 2008 1:46 pm

Maybe you ran out of memory (RAM)?

zhangyg
Newbie
Posts: 17
Joined: Wed Nov 07, 2007 3:36 am
License Nr.: 781

#3 Post by zhangyg » Sun Mar 02, 2008 1:53 am

Thanks for your reply.

I checked the memory. We have 4 GB of memory on each node, and VASP used 2.5 GB when 4 processes were running on one node, so there is enough memory.
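
The arithmetic behind this check can be written out as a small sketch. It assumes that the 2.5 GB figure is the total for all four processes on the node. The post does not say so explicitly, and the function name is purely illustrative.

```python
def memory_headroom_gb(node_memory_gb: float, used_gb: float) -> float:
    """Memory left on a node after the measured usage is subtracted."""
    return node_memory_gb - used_gb


NODE_MEMORY_GB = 4.0   # per node, from the post
USED_GB = 2.5          # assumed to be the total for all processes on the node
PROCESSES = 4          # processes running on that node

headroom = memory_headroom_gb(NODE_MEMORY_GB, USED_GB)
per_process = USED_GB / PROCESSES

print(f"headroom: {headroom} GB, average per process: {per_process} GB")
```

Under that assumption the node still has 1.5 GB free, which is why the memory looks sufficient.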

The run crashes even if I use only one process on one node.

We use Intel Xeon E5335 CPUs. They are 64-bit and have a cache size of 4 MB.

zhangyg
Newbie
Posts: 17
Joined: Wed Nov 07, 2007 3:36 am
License Nr.: 781

#4 Post by zhangyg » Sun Mar 02, 2008 2:31 am

I also checked systems of 300 and 343 atoms (all runs use the same starting CAR files, except of course the POSCAR). They all run smoothly! The problem occurs only with 512 atoms. But my old AMD system, which also has 4 GB of memory, can run 512 atoms. Strange!

Looking forward to your answer.

zhangyg
Newbie
Posts: 17
Joined: Wed Nov 07, 2007 3:36 am
License Nr.: 781

#5 Post by zhangyg » Sun Mar 02, 2008 7:13 am

I checked the system size systematically from 256 to 512 atoms. VASP runs smoothly from 256 to 508 atoms. Beyond that, it crashes while reading the WAVECAR. I increased the memory to 8 GB, which did not solve the problem.

zhangyg
Newbie
Posts: 17
Joined: Wed Nov 07, 2007 3:36 am
License Nr.: 781

#6 Post by zhangyg » Sun Mar 02, 2008 12:13 pm

I have one more question: when I run tests with a smaller system (256 atoms), performance drops if the number of nodes is larger than 4. With nodes=4 (32 ranks in total, 8 ranks per node), the CPUs run at full speed. But with nodes=6 or 8, CPU usage is almost zero.

I use InfiniBand and MPICH2. Each node has two quad-core Intel Xeons.

The same thing happens if I use 8 nodes: with only 4 ranks on each node (32 ranks in total), the CPUs are fully used, but once I run more ranks on each node, CPU usage drops.

It seems that the VASP I compiled for the new system does not like more than 32 ranks. It may not be a memory problem, as 4 nodes x 8 ranks and 8 nodes x 4 ranks both ran nicely. The phenomenon does not appear on my old home-made system, which uses AMD Opterons and Gigabit Ethernet.
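
The layouts described in this post can be tabulated in a short sketch. The 32-rank threshold is only the pattern observed here, not a documented limit of VASP or MPICH2. The 6-node and 8-node slow runs are assumed to have used 8 ranks per node, which the post implies but does not state outright.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    nodes: int
    ranks_per_node: int
    full_speed: bool  # as reported in the post, not measured here

    @property
    def total_ranks(self) -> int:
        return self.nodes * self.ranks_per_node


# Layouts taken from the post; ranks_per_node=8 for the slow runs is an assumption.
REPORTED = [
    Layout(nodes=4, ranks_per_node=8, full_speed=True),
    Layout(nodes=8, ranks_per_node=4, full_speed=True),
    Layout(nodes=6, ranks_per_node=8, full_speed=False),
    Layout(nodes=8, ranks_per_node=8, full_speed=False),
]

OBSERVED_LIMIT = 32  # hypothetical threshold suggested by the observations

for layout in REPORTED:
    status = "full speed" if layout.full_speed else "CPU usage near zero"
    print(f"{layout.nodes} nodes x {layout.ranks_per_node} ranks = "
          f"{layout.total_ranks} ranks: {status}")

# True if every reported run agrees with the "more than 32 ranks is slow" pattern.
consistent = all(
    (layout.total_ranks <= OBSERVED_LIMIT) == layout.full_speed
    for layout in REPORTED
)
print("consistent with a 32-rank threshold:", consistent)
```

Since both 32-rank layouts behave well regardless of how the ranks are spread over the nodes, the total rank count looks like the relevant variable rather than the per-node load.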

Any suggestions on how to look for the cause of this?
Thanks greatly!