ParallelNetManager Example does not work

General issues of interest both for network and
individual cell parallelization.

Moderator: hines

oren
Posts: 55
Joined: Fri Mar 22, 2013 1:03 am

ParallelNetManager Example does not work

Post by oren »

Hello,
We have a new cluster in our lab. It seems to work with some parallel code, but fails with other code.
For example, when I try to run this example:
http://www.neuron.yale.edu/neuron/stati ... NetManager

Code: Select all

load_file("stdrun.hoc")
tstop = 1000

load_file("netparmpi.hoc")
objref pnm
ncell = 128
pnm = new ParallelNetManager(ncell)
pnm.round_robin()

for i=0, ncell-1 if (pnm.gid_exists(i)) {
    pnm.register_cell(i, new IntFire1())
}
for i=0, ncell-1 {
    pnm.nc_append(i, (i+1)%ncell, -1, 1.1, 2)
}
// stimulate
objref stim, ncstim
if (pnm.gid_exists(4)) {
    stim = new NetStim(.5)
    ncstim = new NetCon(stim, pnm.pc.gid2obj(4))
    ncstim.weight = 1.1
    ncstim.delay = 0
    stim.number = 1
    stim.start = 1
}
pnm.set_maxstep(100)
pnm.want_all_spikes()

stdinit()
runtime = startsw()
print "Till herhe"
pnm.psolve(tstop)
print "Till here2"
runtime = startsw() - runtime

for i=0, pnm.spikevec.size-1 {
    print pnm.spikevec.x[i], pnm.idvec.x[i]
}


pnm.pc.runworker
pnm.pc.done
I receive this error:

Code: Select all

numprocs=123
NEURON -- Release 7.3 (849:5be3d097b917) 2013-04-11
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2013
See http://www.neuron.yale.edu/neuron/credits

	1 
	1 
	1 
	1 
	1 
.
.
.
.
.
Till herhe
Till herhe
Till herhe
nrn_timeout t=2
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 0.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec has exited due to process rank 0 with PID 47117 on
node illll-48 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------
As we can see, the simulation stops at pnm.psolve(tstop).

How can I try to diagnose the problem further?

Thank You.
oren
Posts: 55
Joined: Fri Mar 22, 2013 1:03 am

Re: ParallelNetManager Example does not work

Post by oren »

I have made some progress in pinning down the problem.
I tried running the simulation with different numbers of nodes/cells:

Code: Select all

mpiexec -n $NSLOTS ${NRNIV} -mpi -c "{ncell=$NSLOTS}" bstest.hoc
NSLOTS = [4,8,12,20,25,35,45,60,80,90,100,120,123]

In our cluster there are 228 nodes.
It seems that the code works for 4, 8, 12, 20, 25, and 35 nodes, but when I try 45, 60, 80, 90, 100, 120, or 123 nodes I receive this error:

Code: Select all

numprocs=100
NEURON -- Release 7.3 (849:5be3d097b917) 2013-04-11
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2013
See http://www.neuron.yale.edu/neuron/credits

	1 
	1 
.
.
.
.
Till herhe
Till herhe
Till herhe
Till herhe
nrn_timeout t=2
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 0.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec has exited due to process rank 0 with PID 24898 on
node illl-38 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------
Any ideas?

Thanks
ted
Site Admin
Posts: 6289
Joined: Wed May 18, 2005 4:50 pm
Location: Yale University School of Medicine

Re: ParallelNetManager Example does not work

Post by ted »

I think the problem is a misreading of the documentation. The code for the demonstration program is supposed to end after step "8. Print the results." That is, the last three lines in the program are supposed to be

Code: Select all

for i=0, pnm.spikevec.size-1 {
        print pnm.spikevec.x[i], pnm.idvec.x[i]
}
The two paragraphs that begin with the one that starts
A perhaps more flexible alternative is to separate the master from all the workers somewhere after item 4) and before item 8) using ParallelContext.runworker() . . .
are a very sketchy outline of a different program. Imagine that you are listening to the person who wrote the documentation of ParallelNetManager while he is thinking out loud, musing to himself. The statements in those paragraphs are not to be taken as detailed instructions for the implementation of that alternative program.

Two final comments:
1. I had to stop and read these paragraphs a couple of times myself, to make sure I understood their author's intent.
2. ParallelNetManager is hardly ever used. It is a tempting class because it has many potentially useful methods, but as of today I find only 12 model entries in ModelDB that use it, as opposed to 33 that use ParallelContext.
oren
Posts: 55
Joined: Fri Mar 22, 2013 1:03 am

Re: ParallelNetManager Example does not work

Post by oren »

Thank you Ted,
I removed the lines:

Code: Select all

pnm.pc.runworker
pnm.pc.done
But it still does not work; same error code...

And as I mentioned, the code works on the old cluster. But I forgot to mention that the old cluster runs

Code: Select all

NEURON -- VERSION 7.3 (728:52f3a2a66b5f) 2012-08-17
And the new cluster runs

Code: Select all

NEURON -- Release 7.3 (884:dfac2b0cef43) 2013-06-15
The behavior is really strange, because the code works for some node counts and gives a timeout error for others.
I wrote this Python code to diagnose the problem:
run.py

Code: Select all

import os
dir1  = os.getcwd()
queue = '1queue.q'
#'''
for i in [4,8,12,20,25,35,45,64,80,90,112,120,128]:
    os.system('qsub -pe pnrn ' + `i` + ' -o "' + dir1 + '/log/' + queue + 'Ncell' +`i` + '" -q ' + queue + ' "bstest.sh"')
bstest.sh

Code: Select all

#!/bin/bash
#$ -cwd
#$ -j y


export MPIEXEC_RSH=ssh
NRNIV=/opt/nrn/x86_64/bin/nrniv
TMPDIR=/tmp

mpiexec -n $NSLOTS ${NRNIV} -mpi -c "{ncell=$NSLOTS}" 'bstest.hoc'
exit 0 
and I removed the line ncell = 128 from the original code (I know that I could have just left the number of cells at 128) and also changed stim.start=1 to stim.start=15.

Now for [4,8,12,20,64,128] nodes it works,
but for [80,90,112,120] nodes it fails with nrn_timeout t=16 (i.e., just after the first spike),
and for [25] nodes it fails with nrn_timeout t=58.

Since the behavior on 25 nodes was a bit different, I changed for i in [4,8,12,20,25,35,45,64,80,90,112,120,128] to for i in [25,25,25,25,25,25,25,25,25], so each run uses 25 nodes. The result is that some of the runs work and others do not, and the failing runs always stop at nrn_timeout t=58.
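To avoid eyeballing every log by hand, a small helper (my own hypothetical script, not part of the thread's tooling) can classify each run's log: it returns the simulation time at which nrn_timeout fired, or None for a clean run.

```python
import re

def classify_log(text):
    """Return the simulation time at which nrn_timeout fired, or None
    if the log contains no timeout (i.e. the run completed)."""
    m = re.search(r"nrn_timeout t=([0-9.]+)", text)
    return float(m.group(1)) if m else None

# e.g. tabulate results per node count (log path pattern is hypothetical):
# {n: classify_log(open("log/1queue.qNcell%d" % n).read()) for n in counts}
```

Applied to the log directory produced by run.py, this makes the pattern across node counts easy to tabulate.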

So there are two options:
1. We have some kind of problem with the new cluster.
2. ParallelNetManager does not work with Release 7.3 (884:dfac2b0cef43) 2013-06-15.


[UPDATE]
I tried running the script on 156, 173, 182, 200, 222, 233, and 256 nodes.
It only worked for 256 nodes.
hines
Site Admin
Posts: 1682
Joined: Wed May 18, 2005 3:32 pm

Re: ParallelNetManager Example does not work

Post by hines »

The default timeout is 20 seconds, so what is happening is that during the simulation (at the time printed) there was no increase in t on rank 0 for that 20-second period.
You can turn off the timeout using the method
pc.timeout(0)
http://www.neuron.yale.edu/neuron/stati ... xt.timeout
but I would guess that the run would hang until you manually stopped the program or the time limit for the process was reached.
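For illustration only, the watchdog behavior described above can be sketched in plain Python. This is a simplified model of the mechanism, not NEURON's actual implementation, and all names in it are mine: a loop advances the simulation, and if simulation time t fails to move within the timeout window of wall-clock time, the run aborts.

```python
import time

def run_with_watchdog(step, get_t, tstop, timeout_s=20.0, clock=time.monotonic):
    """Advance a simulation via step(); raise if t stalls for timeout_s.

    step    -- callable that advances the simulation a little
    get_t   -- callable returning the current simulation time t
    clock   -- wall-clock source, injectable for testing
    """
    last_t = get_t()
    last_advance = clock()
    while get_t() < tstop:
        step()
        now = clock()
        if get_t() > last_t:          # t moved: reset the watchdog
            last_t = get_t()
            last_advance = now
        elif now - last_advance > timeout_s:
            # analogous to NEURON's "nrn_timeout t=..." abort on rank 0
            raise RuntimeError("nrn_timeout t=%g" % last_t)
```

In the failing runs above, t last advanced at 2 (or 16, or 58) and then sat there for the full window, which is exactly what the nrn_timeout message reports.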
I know of only one circumstance that gives a spurious timeout and that was fixed in
http://www.neuron.yale.edu/hg/neuron/nr ... 45790e5785

Anyway, I'm guessing that there is a problem with your cluster: the MPI_Allgather or MPI_Allgatherv collective does not complete with certain numbers of nodes.
Your simulation should run almost instantly. If the timeout occurs at the end of the first "minimum interprocessor NetCon delay" interval (the 2 is suggestive of that),
then it is very likely an MPI_Allgather problem.
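One way to test this hypothesis independently of NEURON is to exercise the allgather collective directly at the failing node counts. Here is a minimal sketch, assuming mpi4py is available on the cluster (the script name and the use of mpi4py are my additions, not part of the original thread):

```python
# allgather_smoke.py -- hypothetical standalone check of the allgather
# collective; launch with e.g.: mpiexec -n 90 python allgather_smoke.py
def check_allgather(comm):
    """Each rank contributes its rank number; every rank should receive
    the full list 0..size-1 back from allgather."""
    gathered = comm.allgather(comm.Get_rank())
    return gathered == list(range(comm.Get_size()))

if __name__ == "__main__":
    try:
        from mpi4py import MPI  # hypothetical dependency, needed only at run time
    except ImportError:
        MPI = None              # not under MPI; do nothing
    if MPI is not None:
        comm = MPI.COMM_WORLD
        ok = check_allgather(comm)
        print("rank %d of %d: allgather %s"
              % (comm.Get_rank(), comm.Get_size(), "ok" if ok else "FAILED"))
```

If this hangs or fails at the same node counts that break the NEURON run, the problem is in the MPI installation or interconnect rather than in NEURON.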