nrn_timeout error

jackfolla

Post by jackfolla »

Dear all,
during my runs I get the nrn_timeout error.

Code:

nrn_timeout t=6525.15
[gozer3:21108] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 0
mpiexec noticed that job rank 1 with PID 21109 on node gozer3 exited on signal 15 (Terminated). 
6 additional processes aborted (not shown)

I tried changing nrn_timeout(20) to nrn_timeout(500) in /src/nrniv/netpar.cpp.

With nrn_timeout(20) the error occurred at t=3276.75 ms.

Code:

nrn_timeout t=3276.75
[gozer3:21108] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 0
mpiexec noticed that job rank 1 with PID 21109 on node gozer3 exited on signal 15 (Terminated). 
6 additional processes aborted (not shown)

I also tried nrn_timeout(1000), but then the following error occurred:

Code:

mpiexec noticed that job rank 1 with PID 18859 on node gozer3 exited on signal 15 (Terminated). 
6 additional processes aborted (not shown)
1 process killed (possibly by Open MPI)

The runs were performed on an 8-core machine (2 quad-core CPUs).

The problem seems to depend on one procedure in particular (if I comment out this proc, the problem does not occur):

Code:

proc a_record() {local j
	rec_time = new Vector()
	listrec_a = new List()
	rec_time.record(&t)
 	for (j=1; j<ncslist.count;j=j+2) {	// loop over possible target cells
		rec_a = new Vector()
		rec_a.record(&ncslist.o(j).weight[1])
		listrec_a.append(rec_a)
	}
}

Maybe the amount of data collected is too large...
hines

Post by hines »

The timeout is turned off if you set it to 0.
hines

Post by hines »

Assuming that dt = 0.025, the problem occurs when the Vectors reach a size of
oc>6525.15/dt
261006
oc>3276.75/dt
131070
I don't know how many NetCons are involved. It sounds like the time is being spent
copying the Vectors when they run out of space and twice the memory is reallocated.
Clearly, recording all the weights every time step is not very space efficient.
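
As a rough check of the arithmetic above, here is a minimal Python sketch (plain Python, no NEURON calls). It assumes dt = 0.025 ms as in the reply, 8 bytes per recorded sample, and a buffer that doubles at power-of-two sizes. That last point is an assumption about the growth policy, not something confirmed in this thread. The function names and the vector count of 1000 are hypothetical and used only for illustration.

```python
DT = 0.025            # ms; time step assumed in the reply above
BYTES_PER_SAMPLE = 8  # assumption: each recorded sample is one double


def samples_at(t_ms, dt=DT):
    """Number of samples recorded per Vector by time t_ms."""
    return round(t_ms / dt)


def next_power_of_two(n):
    """Smallest power of two that is >= n (n must be >= 1)."""
    return 1 << (n - 1).bit_length()


def peak_bytes_during_doubling(capacity, n_vectors):
    """Bytes held while a full buffer of `capacity` samples is copied into
    a new buffer of twice that size (old and new buffers coexist)."""
    return (capacity + 2 * capacity) * BYTES_PER_SAMPLE * n_vectors


if __name__ == "__main__":
    n_vectors = 1000  # hypothetical; the real number of NetCons is not given
    for t_fail in (3276.75, 6525.15):
        n = samples_at(t_fail)
        cap = next_power_of_two(n)
        peak = peak_bytes_during_doubling(cap, n_vectors)
        print(f"t={t_fail} ms: {n} samples, next power of two {cap}, "
              f"about {peak / 2**20:.0f} MiB held during the copy")
```

Both failure times fall just below a power of two (131070 against 2^17 = 131072, and 261006 against 2^18 = 262144), which fits the reallocation explanation if the buffers do grow by doubling at those sizes.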