I am having some trouble with updating a Matrix across worker nodes in my bulletin board style implementation.
The main use of my ParallelContext is to submit simulation runs with different parameters across all the nodes I have. This works fine. Then when these simulation runs return I do some processing on a Matrix, called 'pop', using the simulation results.
Then I want to use my worker nodes to do some more processing that requires access to the Matrix values that I just updated with the results of the simulation runs. This means that I need to make sure all workers have access to the latest version of the Matrix, which is now only on the master.
I have tried passing all of the entries of the matrix as parameters in the pc.submit() call using a small test simulation, and this works fine. However, in my real simulation the matrix could contain up to several thousand entries. I can pass up to about 12 parameters ok, but as soon as I pass more I get errors:
1 unpack size=448 upkpos=74 type[0]=1 datatype=0 type[1]=1 count=1
nrniv: /home/hines/neuron/nrn/src/nrnmpi/bbsmpipack.c:60: unpack: Assertion `type[0] == my_datatype' failed.
I get one of these errors for each worker. Then I get:
=====================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 6
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
I have come up with the following code, which is probably broken in some way, to send the matrix out to the workers in pieces (i.e. each column as a Vector), as I can't see any functions to share a Matrix with the workers, only Vectors:
Code: Select all
proc send_pop(){local j,key localobj pop_col
    key = 123456
    pop_col = new Vector()
    for j=0,pop.ncol()-1 {
        pop_col = pop.getcol(j)
        pc.look_take(key)
        pc.pack(j)
        pc.pack(pop_col)
        pc.post(key)
        pc.context("pc.look(123456)\npop.setcol(pc.upkscalar(),pc.upkvec())\n")
    }
}
This gives the error:
some workers did not receive previous context
Presumably this is something to do with the for loop? The only column that gets set on the worker nodes is the final one, i.e. when j=pop.ncol()-1. How would I ensure that pc.context runs on every worker each time, and that every column is set?
Is there a better way than this to ensure I have a consistent context across all workers? Am I trying to use the bulletin board in a crazy way? It seems like I shouldn't be trying to pass a large amount of information between the workers (perhaps the pack/unpack size has some limit?), but I need them all to have access to the same matrix. Is it possible for the workers to look up entries in the version of the matrix held on the master node?
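One possible alternative would be to flatten the whole matrix into a single Vector on the master, so that only one message and one pc.context call are needed instead of one per column. The following is only a plain-Python sketch of the packing and unpacking logic, not a tested bulletin-board solution. It makes no NEURON calls, and flatten_matrix and rebuild_matrix are hypothetical names made up for illustration:

```python
# Plain-Python sketch (no NEURON calls): pack a matrix into one flat
# list, column by column, so it could travel as a single message.
# flatten_matrix and rebuild_matrix are hypothetical helper names.

def flatten_matrix(matrix):
    """Return (nrow, ncol, flat) with entries in column-major order."""
    nrow = len(matrix)
    ncol = len(matrix[0]) if nrow else 0
    flat = [matrix[i][j] for j in range(ncol) for i in range(nrow)]
    return nrow, ncol, flat


def rebuild_matrix(nrow, ncol, flat):
    """Inverse of flatten_matrix: rebuild the list-of-rows matrix."""
    if len(flat) != nrow * ncol:
        raise ValueError("flat vector length does not match nrow * ncol")
    return [[flat[j * nrow + i] for j in range(ncol)] for i in range(nrow)]


# Small stand-in for 'pop': 2 rows, 3 columns.
pop = [[1.0, 2.0, 3.0],
       [4.0, 5.0, 6.0]]
nrow, ncol, flat = flatten_matrix(pop)        # what the master would send
restored = rebuild_matrix(nrow, ncol, flat)   # what each worker would rebuild
```

In hoc the flat list would correspond to a single Vector packed once with pc.pack, with the dimensions either packed alongside it or already known on each worker. Whether this avoids the context error is still an open question.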
Thanks for any help,
Tim.
Edit: I noticed I had pc.upkscalar() where I should have had pc.upkvec(). Fixing that solves the datatype error, but now I get the previous context error instead.