Bulletin board - updating context when using a Matrix


timrumbell
Posts: 3
Joined: Wed Apr 24, 2013 11:15 am

Post by timrumbell »

Hi,

I am having some trouble with updating a Matrix across worker nodes in my bulletin board style implementation.

The main use of my ParallelContext is to submit simulation runs with different parameters across all the nodes I have. This works fine. Then when these simulation runs return I do some processing on a Matrix, called 'pop', using the simulation results.

Then I want to use my worker nodes to do some more processing that requires access to the Matrix values that I just updated with the results of the simulation runs. This means that I need to make sure all workers have access to the latest version of the Matrix, which is now only on the master.

I have tried passing all of the entries of the matrix as parameters in the pc.submit() call using a small test simulation, and this works fine. However, in my real simulation the matrix could contain up to several thousand entries. I can pass up to about 12 parameters ok, but as soon as I pass more I get errors:
1 unpack size=448 upkpos=74 type[0]=1 datatype=0 type[1]=1 count=1
nrniv: /home/hines/neuron/nrn/src/nrnmpi/bbsmpipack.c:60: unpack: Assertion `type[0] == my_datatype' failed.
I get one of these errors for each worker. Then I get:
=====================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 6
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
I have come up with the following code, which is probably broken in some way, to send the matrix out to the workers in pieces (each column as a Vector), since I can't see any function for sharing a Matrix with the workers, only Vectors:

Code:

proc send_pop(){local j,key		localobj pop_col

	key = 123456

	pop_col = new Vector()
	
	for j=0,pop.ncol()-1 {
		pop_col = pop.getcol(j)
		pc.look_take(key)
		pc.pack(j)
		pc.pack(pop_col)
		pc.post(key)
		pc.context("pc.look(123456)\npop.setcol(pc.upkscalar(),pc.upkvec())\n")
	}
}
I get the following message when calling this:
some workers did not receive previous context
Presumably this is something to do with the for loop? The only column that gets set on the worker nodes is the final one, i.e. when j=pop.ncol-1. How would I ensure that the pc.context is set each time on every worker, and that each column is set?

Is there a better way than this to ensure I have a consistent context across all workers? Am I trying to use the bulletin board in a crazy way? It seems like I shouldn't be trying to pass a large amount of information between the workers (maybe the pack/unpack size has some limit?), but I need them to have access to the same matrix. Is it possible for the workers to look up entries in the version of the matrix held on the master node?

Thanks for any help,
Tim.

edit: I noticed I had pc.upkscalar() where I should have had pc.upkvec(). Fixing that solves the datatype error, but now I get the previous context error instead...
hines
Site Admin
Posts: 1682
Joined: Wed May 18, 2005 3:32 pm

Post by hines »

Sadly, as you found, only Vectors are easily transferred via the bulletin board with HOC. Python is much more flexible, since any picklable object can be transferred.
Anyway, I recommend you encode the matrix as a single Vector and transfer that. The following example gives the idea. The only practical limit on transfer size is available memory. I'm doing this with function calls. Each jobset fills m.x[0][0] with the jobset number, sends the full 1000x500 matrix to all the workers, submits $2 jobs, and waits for those jobs to finish. Note that each job has the proper matrix for the jobset it is involved in. I call the file 'transfer.hoc' and run it with
mpiexec -n 8 nrniv -mpi transfer.hoc

Code:

$ cat transfer.hoc
objref pc
{pc = new ParallelContext()}

objref m, vv
nrow = 1000
ncol = 500
m = new Matrix(nrow, ncol)

func get_from_master() { local i, j
  for i=0, nrow-1 {
    for j=0, ncol-1 {
      m.x[i][j] = $o1.x[i*ncol + j]
    }
  }
  return 0
}

func job() {local i, x
  printf("%d job %g,%g m.x[0][0]=%g\n", pc.id, $1, $2, m.x[0][0])
  for i=0, 10000000 { x = i } // waste time
  return 0
}

{pc.runworker()}

func send_from_master() { local i, j
  vv = new Vector(nrow*ncol)
  m.x[0][0] = $1
  for i=0, nrow-1 {
    for j=0, ncol-1 {
      vv.x[i*ncol + j] = m.x[i][j]
    }
  }
  pc.context("get_from_master", vv)
  return 0
}

func jobset() {local i
  send_from_master($1)
  for i=0, $2-1 { // hoc loop bounds are inclusive, so this submits exactly $2 jobs
    pc.submit("job", $1, i)
  }
  while(pc.working()) {
    pc.retval
  }
  return $1
}

for i=1, 5 {
  tt = startsw()
  jobset(i, 10)
  printf("master jobset #%d finished in %g s\n", i, startsw()-tt)
}

{pc.done()}
quit()
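
The two nested copy loops in transfer.hoc could probably be replaced by the Matrix methods to_vector() and from_vector(). The sketch below has not been run here and rests on a few assumptions: it reuses pc, m, and vv exactly as declared in the file above; both methods copy in column order, so the Vector layout differs from the row-major i*ncol + j indexing used above (harmless as long as both sides use the same pair of calls); and m already has the same nrow x ncol shape on every host.

```hoc
// Possible drop-in replacements for the two copy functions in transfer.hoc.
// Assumes pc, m, and vv are declared as in that file.

// Runs on each worker: refill the local Matrix from the Vector the master sent.
func get_from_master() {
  m.from_vector($o1)   // $o1 must hold exactly nrow*ncol values
  return 0
}

// Runs on the master: tag the Matrix, flatten it, and send it to all workers.
func send_from_master() {
  m.x[0][0] = $1       // jobset number, as in the original
  vv = m.to_vector()   // new Vector holding all nrow*ncol entries
  pc.context("get_from_master", vv)
  return 0
}
```

As in the original file, get_from_master would need to be defined before the pc.runworker() call so that the workers know about it, while send_from_master can stay after it because only the master calls it.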

Post by timrumbell »

Thanks for this - the code is very straightforward and I was able to adapt it to work with my existing code in no time.

Do you have any idea what mistake I am making when I call pc.context() in the for loop and get the error 'some workers did not receive previous context'? My guess is that the master does not wait for the workers to pick up the previous context before sending the next one in the loop. The outcome is that all workers only receive the context from the final iteration of the for loop. I don't need to use this any more thanks to your solution, but it would be interesting to know.

Post by hines »

You probably imagined that pc.context would not return until all workers had executed the context argument. However, that is not the case. The execution is asynchronous: pc.context returns immediately and the master starts the next iteration of the loop, which removes the key/value pair from the bulletin board. As a result, many of the workers get a return value of 0 when they execute pc.look, since the message no longer exists (or it has the wrong value, since you keep using the same key). It is likely you could fix the algorithm with a loop after the pc.context call that would look like
for i = 0, pc.nhost-2 {pc.take("done")}
and by letting the last part of the pc.context statement be pc.post("done").
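
Applied to the send_pop procedure from the first post, that handshake might look roughly like the following. This is an untested sketch that assumes pc and pop exist on every host as in the original code, that there are pc.nhost-1 workers, and that the "done" key is not used for anything else. After each pc.context call the master blocks until every worker has posted an acknowledgement, so the column message should still be on the board when the workers look for it.

```hoc
proc send_pop() { local i, j, key  localobj pop_col
  key = 123456
  for j = 0, pop.ncol()-1 {
    pop_col = pop.getcol(j)
    pc.look_take(key)          // clear the previous column message, if any
    pc.pack(j)
    pc.pack(pop_col)
    pc.post(key)
    // each worker copies the column, then posts an acknowledgement
    pc.context("pc.look(123456)\npop.setcol(pc.upkscalar(), pc.upkvec())\npc.post(\"done\")\n")
    // wait for one acknowledgement per worker before reusing the key
    for i = 0, pc.nhost-2 { pc.take("done") }
  }
}
```

For a matrix with several thousand entries, sending everything once as a single Vector, as in the earlier reply, needs far fewer messages than one context call per column.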

It is helpful when developing code that uses the bulletin board to freely use
printf("%d enter function name ...\n", pc.id, ...)
...
printf("%d leave function name...\n", pc.id, ...)

so you can see the sequence of actions. When I did this with the code I sent earlier, before I added the submit stuff to it, I noticed that the master's call of pc.context is generally one step ahead of the workers' execution of the request, i.e. the second pc.context call was made before any worker had started on the first request.

I must say that I much prefer the ease of using Python with the bulletin board, since there is no limit to what can be passed as arguments to a submitted callable and no limit to what can be passed back as the return value.

Post by timrumbell »

OK, thanks again. That makes sense: almost the very next action in my previous code is to take the key off the board again...