nrn_timeout error

General issues of interest both for network and
individual cell parallelization.

Moderator: hines

jtmoyer
Posts: 14
Joined: Fri Jun 17, 2005 10:01 am
Location: Philadelphia, PA, USA

nrn_timeout error

Post by jtmoyer »

Hi -
I'm having a problem with a large network simulation that we are running on our Xserve cluster. I think it might be a memory error, but I'm not sure. We're running NEURON release 6.1.1 (1894), using the Inquiry system, on Mac OS 10.4 with 32 processors on 8 nodes. The simulation runs fine with 512 cells; when I increase the number of cells to 729 or 1000, I get this error:

nrn_timeout t=115
p0_12978: p4_error: : 0
p0_12978: (2061.670912) net_send: could not write to fd=5, errno = 32

The error occurs at different times, sooner with 1000 cells than with 729 cells. Is this error the result of the nodes running out of memory? If so, is there anything I can do to reduce the memory load?

Thanks!
hines
Site Admin
Posts: 1619
Joined: Wed May 18, 2005 3:32 pm

Post by hines »

Code:

nrn_timeout t=115
This indicates that the simulation got to 115 ms and then, for some reason, went 20 seconds without advancing in time. This generally means it is hanging on an MPI collective due to an internal error; the timeout prevents wasting a lot of supercomputer time. Or it may be a node failure on your machine.
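The watchdog idea described above can be illustrated with a minimal sketch. This is Python, not NEURON's actual C++ implementation, and the `Watchdog` class and its method names are illustrative only: a timer thread fires if the simulated time fails to advance within the timeout.

```python
import threading

class Watchdog:
    """Minimal sketch of a stall detector: if the simulated time t does
    not advance within `timeout` wall-clock seconds, report a timeout.
    (Illustrative only; not NEURON's nrn_timeout implementation.)"""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_t = None
        self.fired = False
        self._timer = None

    def check(self, t):
        # Call once per time step with the current simulated time.
        if t != self.last_t:  # time advanced: reset the countdown
            self.last_t = t
            if self._timer:
                self._timer.cancel()
            self._timer = threading.Timer(self.timeout, self._expire)
            self._timer.daemon = True
            self._timer.start()

    def _expire(self):
        self.fired = True
        print(f"nrn_timeout t={self.last_t}")  # mimic the forum's message

    def stop(self):
        if self._timer:
            self._timer.cancel()
```

As long as `check` is called with an advancing `t`, the countdown keeps resetting; a rank stuck in a hung collective stops advancing and the timer expires.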
Is it possible your simulation really takes more than 20 seconds to advance one time step? If so, change the

Code:

nrn_timeout(20);
in the src/nrniv/netpar.cpp file to a larger value.
But if you think it is a bug, send me <michael.hines@yale.edu> all the hoc, ses, and mod files in a zip file and I'll see if I can reproduce the problem on my 4-core workstation or 12-CPU cluster.
jtmoyer

Post by jtmoyer »

The problem turned out to be load imbalance. I was using variable dt (cvode.active(1)); one processor would finish much sooner than the others and would have to wait for them to catch up. Turning off variable dt solved the problem and also reduced simulation time. The imbalance seemed to stem primarily from the high input frequency (1000 Hz per cell, with 30+ cells per processor).

This code, courtesy of Dr. Hines, helped track down the error:

Code:

objref hines
hinest1 = startsw()  // wall-clock time at setup
hinest2 = startsw()  // wall-clock time of the previous report
// reset both timers at initialization, then start the reporting chain
hines = new FInitializeHandler(2, "hinest1=startsw() hinest2=startsw() hines1()")
// print, every 1 ms of simulated time, how long each rank took to get there
proc hines1() {
        printf("%d t=%g dt=%g dreal=%g treal=%g\n", \
                pnm.pc.id, t, dt, startsw()-hinest2, startsw()-hinest1)
        hinest2 = startsw()
        cvode.event(t + 1, "hines1()")  // schedule the next report
}
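Comparing the per-step wall-clock times (dreal) printed by each rank is what exposes the imbalance: under variable dt, the rank with the densest input takes far longer per interval while the others wait. A minimal sketch of that comparison, in Python rather than hoc; the function name and example numbers are illustrative, not any NEURON API:

```python
def imbalance_factor(step_times):
    """Ratio of the slowest rank's wall-clock step time to the mean.
    1.0 means perfectly balanced; values well above 1 mean some ranks
    sit idle waiting for the slowest one."""
    mean = sum(step_times) / len(step_times)
    return max(step_times) / mean

# Hypothetical per-rank times for one reporting interval:
balanced   = [1.0, 1.1, 0.9, 1.0]
imbalanced = [1.0, 1.0, 1.0, 6.0]  # one rank dominated by dense input
print(imbalance_factor(balanced))    # close to 1
print(imbalance_factor(imbalanced))  # well above 1
```

A ratio that grows over the course of the run, as here, points to load imbalance rather than a hung collective.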