Load balancing with heterogeneous morphologies and synapses
Moderator: hines
Load balancing with heterogeneous morphologies and synapses
Hello,
I have a large network model where each principal cell model has a unique morphology and distribution of synapses. Is multisplit the most appropriate way to load balance this network, and if so, is the multisplit version of the Traub model the principal example on how to do that? Thanks,
Ivan
I have a large network model where each principal cell model has a unique morphology and distribution of synapses. Is multisplit the most appropriate way to load balance this network, and if so, is the multisplit version of the Traub model the principal example on how to do that? Thanks,
Ivan

 Site Admin
 Posts: 5796
 Joined: Wed May 18, 2005 4:50 pm
 Location: Yale University School of Medicine
 Contact:
Re: Load balancing with heterogeneous morphologies and synap
The first question is: how bad is imbalance if you use the simplest approach to distributing the cells. If the number of principal cells is very large compared to the number of hosts, and the variance of their complexity is not very large, you may not have to do anything specialsimply deal out the principal cells to the hosts without splitting them.
Re: Load balancing with heterogeneous morphologies and synap
Hi Ted,
Using roundrobin assignment of cells to ranks, short simulations of 100300 ms have a load balancing factor of about 0.75 on 16384 ranks (1e6 principal cells in a model of the rat dentate gyrus), and 0.73 for a longer 1000 ms simulation. The difference in dendritic area and number of synapses can be large, about 4x difference between the smaller and larger cells, resulting in a range of 3000  12000 synapses. I am of course open to suggestions if there are more detailed measurements of cell complexity and load balancing I can do to determine if multisplit might be useful here.
Using roundrobin assignment of cells to ranks, short simulations of 100300 ms have a load balancing factor of about 0.75 on 16384 ranks (1e6 principal cells in a model of the rat dentate gyrus), and 0.73 for a longer 1000 ms simulation. The difference in dendritic area and number of synapses can be large, about 4x difference between the smaller and larger cells, resulting in a range of 3000  12000 synapses. I am of course open to suggestions if there are more detailed measurements of cell complexity and load balancing I can do to determine if multisplit might be useful here.
ted wrote:The first question is: how bad is imbalance if you use the simplest approach to distributing the cells. If the number of principal cells is very large compared to the number of hosts, and the variance of their complexity is not very large, you may not have to do anything specialsimply deal out the principal cells to the hosts without splitting them.
Re: Load balancing with heterogeneous morphologies and synap
a rough proxy for relative computation time is number of states (less accurate is number of compartments). Multisplit is useful if TotalNumberOfStatesForAllCells/nhost < NumberOfStatesOfLargestCell.
Here is a way to determine that:
Then call that after you setup your model. eg from a hoc file with
Then, when I run the 356 cell traub model with 8 cores and round robin distibution, I see
max_cx = 4093 average cx per rank = 120997
Clearly in this case there is no good reason to use mutlistplit.
Here is a way to determine that:
Code: Select all
~/models/nrntraub$ cat cx.py
from neuron import h
pc = h.ParallelContext()
nhost = int(pc.nhost())
rank = int(pc.id())
h.load_file("loadbal.hoc")
def cx():
lb = h.LoadBalance()
# all cell complexity
cell_cx = []
for sec in h.allsec():
if sec.parentseg() == None: # root section
cell_cx.append(lb.cell_complexity(sec=sec))
#local
max_cx = max(cell_cx) if len(cell_cx) > 0 else 0.0
sum_cx = sum(cell_cx)
#global
max_cx = pc.allreduce(max_cx, 2)
sum_cx = pc.allreduce(sum_cx, 1)
if rank is 0:
print ("max_cx = %g average cx per rank = %g\n" % (max_cx, sum_cx/nhost))
cx()
pc.barrier()
h.quit()
Code: Select all
nrnpython("import cx")
max_cx = 4093 average cx per rank = 120997
Clearly in this case there is no good reason to use mutlistplit.

 Site Admin
 Posts: 5796
 Joined: Wed May 18, 2005 4:50 pm
 Location: Yale University School of Medicine
 Contact:
Re: Load balancing with heterogeneous morphologies and synap
So the question becomes: is the simple round robin strategy for distributing cells over hosts good enough for this particular case, and if not, what alternative to try (e.g. round robin for all cell classes whose complexities are fixed, but maybe a "whole cell" variant of the least processing time algorithm (Korf1998, cited in discussion of load balance by Hines2008) for each cell class in which complexity varies from one instance to another).
@ARTICLE{Hines2008,
author = {Hines, M.L. and Markram, H. and Schürmann, F.},
title = {Fully implicit parallel simulation of single neurons},
journal = {Journal of Computational Neuroscience},
year = {2008},
volume = {25},
pages = {439448},
note = {PMC2760991},
pmid = {18379867},
}
downloadable from link at http://www.neuron.yale.edu/neuron/nrnpubs
@Article{Korf1998,
author = {Korf, R.E.},
title = {A complete anytime algorithm for number partitioning},
journal = {Artificial Intelligence},
year = {1998},
volume = {106},
number = {2},
pages = {181203},
note = {least processing time algorithm for load balance},
publisher = {Elsevier},
}
@ARTICLE{Hines2008,
author = {Hines, M.L. and Markram, H. and Schürmann, F.},
title = {Fully implicit parallel simulation of single neurons},
journal = {Journal of Computational Neuroscience},
year = {2008},
volume = {25},
pages = {439448},
note = {PMC2760991},
pmid = {18379867},
}
downloadable from link at http://www.neuron.yale.edu/neuron/nrnpubs
@Article{Korf1998,
author = {Korf, R.E.},
title = {A complete anytime algorithm for number partitioning},
journal = {Artificial Intelligence},
year = {1998},
volume = {106},
number = {2},
pages = {181203},
note = {least processing time algorithm for load balance},
publisher = {Elsevier},
}
Re: Load balancing with heterogeneous morphologies and synap
Thanks, Michael, I have added this code to the model and will calculate the cell complexity. Just to make sure I understand, the `allsec' function will return all sections of all cells on the current rank, right? I assumed from the code that it just needs to be run on the toplevel, but let me know if this is incorrect.
hines wrote:a rough proxy for relative computation time is number of states (less accurate is number of compartments). Multisplit is useful if TotalNumberOfStatesForAllCells/nhost < NumberOfStatesOfLargestCell.
Last edited by iraikov on Fri Mar 24, 2017 2:14 pm, edited 1 time in total.
Re: Load balancing with heterogeneous morphologies and synap
Hi Ted,
The simple round robin strategy for all cells results in a load balancing factor of 0.730.75 (when the stimulus is from homogeneous Poisson sources), which is not terrible but I think can be improved. Thanks for the references, I will look through the Korf paper and see if I can figure out how to apply this least processing time algorithm to load balance the cell population with variable complexity.
The simple round robin strategy for all cells results in a load balancing factor of 0.730.75 (when the stimulus is from homogeneous Poisson sources), which is not terrible but I think can be improved. Thanks for the references, I will look through the Korf paper and see if I can figure out how to apply this least processing time algorithm to load balance the cell population with variable complexity.
ted wrote:So the question becomes: is the simple round robin strategy for distributing cells over hosts good enough for this particular case, and if not, what alternative to try (e.g. round robin for all cell classes whose complexities are fixed, but maybe a "whole cell" variant of the least processing time algorithm (Korf1998, cited in discussion of load balance by Hines2008) for each cell class in which complexity varies from one instance to another).

 Site Admin
 Posts: 5796
 Joined: Wed May 18, 2005 4:50 pm
 Location: Yale University School of Medicine
 Contact:
Re: Load balancing with heterogeneous morphologies and synap
In pseudocode
In "Fully implicit parallel simulation of single neurons" see Fig. 3 and its legend, and imagine replacing the word "piece" with "cell". I would suggest using round robin for all cell classes whose members are of uniform size.
Code: Select all
REPEAT
place the most complex cell on the host that has the least total complexity
UNTIL no more cells remain to be placed
Re: Load balancing with heterogeneous morphologies and synap
Yes. I used the allsec as I did not know what you used for a cell list or a gid vector on each rank. It was just a dirty and not so quick way to figure out what wereI understand, the `allsec' function will return all sections of all cells on the current rank, right?
the cells. As written, it runs on import and quits but you can modify so that you can call the cx.cx() from anywhere and perhaps introduce a list of
your cells as an argument. The only requirement is that everyone participate in the call due to the use of the allreduce collectives. It had more of a throw away
purpose just to determine whether your model could benefit from multisplit.
For the overall whole cell load balance problem, you need to actually measure the load balance of a simulation, which I guess you have already done.
The current best practice for that takes advantage of the recent introduction of a barrier just before spike exchange so that the load balance measurement
is not just static load balance but includes dynamic effects as well. i.e
Code: Select all
def prun():
...
h.stdinit()
...
runtime=startsw()
pc.psolve(h.tstop)
runtime = startsw()  runtime
computation_time = pc.step_time()
cw_time = computation_time + pc.step_wait()
max_cw_time = pc.allreduce(cw_time, 2)
avg_comp_time = pc.allreduce(computation_time, 1)/nhost
load_balance = avg_comp_time/max_cw_time
return runtime, load_balance, avg_comp_time
Re: Load balancing with heterogeneous morphologies and synap
Here is a python module that prints existing expected load balance and what would be expected if LPT is used to distribute the cells.
To test with the nrntraub model (written in hoc, which happens to have a gidvec, I used the fragment after setup in the init.hoc file of
Code: Select all
$ cat cx.py
# usage:
# import cx
# test(gidvec)
# Where gidvec is the (round robin) distribution of gids on this rank
# and cell exists for each of those gids.
# The current expected load balance and expected load balance
# if the LPT distributon is used, will be printed.
from neuron import h
pc = h.ParallelContext()
nhost = int(pc.nhost())
rank = int(pc.id())
h.load_file("loadbal.hoc")
# given gids, what are the complexities
def cx(gidvec):
''' gidvec is Hoc Vector, return complexity as Hoc Vector '''
lb = h.LoadBalance()
cxvec = gidvec.c()
for i, gid in enumerate(gidvec):
cxvec.x[i] = lb.cell_complexity(pc.gid2cell(gid))
return cxvec
# for given cxvec on each rank what is the fractional load balance.
def ld_bal(cxvec):
sum_cx = sum(cxvec)
max_sum_cx = pc.allreduce(sum_cx, 2)
sum_cx = pc.allreduce(sum_cx, 1)
if rank is 0:
print ("expected load_balance %.2f" % (sum_cx/nhost / max_sum_cx))
#each rank has gidvec, cxvec,
# gather everything to rank 0 and do lpt algorithm and scatter
# proper gidvec, cxvec back to ranks and return the new balanced
# gidvec, cxvec. Note, perhaps should write a balance file as well.
def lpt_bal(gidvec, cxvec):
#gather gidvec, cxvec to rank 0
src = [None]*nhost
src[0] = (gidvec, cxvec)
dest = pc.py_alltoall(src)
del src
if rank == 0:
# organize dest into single allgidvec, allcxvec Hoc Vectors.
allgidvec = h.Vector()
allcxvec = h.Vector()
for pair in dest:
allgidvec.append(pair[0])
allcxvec.append(pair[1])
del dest
#rankvec specifies the rank where each cell should be
lb = h.LoadBalance()
rankvec = lb.lpt(allcxvec, nhost, 0) # third arg suppresses a print
#send back a balanced gidvec, cxvec to each rank
# start out with empty vectors
src = [(h.Vector(), h.Vector()) for _ in range(nhost)]
for i in range(len(allcxvec)):
pair = src[int(rankvec.x[i])]
pair[0].append(allgidvec.x[i])
pair[1].append(allcxvec.x[i])
del allgidvec
del allcxvec
else: # all other ranks send nothing
src = [None]*nhost
dest = pc.py_alltoall(src)
# dest[0] contains the balanced (gidvec, cxvec) pair
del src
balanced_gidvec = dest[0][0]
balanced_cxvec = dest[0][1]
del dest
return balanced_gidvec, balanced_cxvec
def test(gidvec):
# cell complexities (the cells exist)
cxvec = cx(gidvec)
# present expected balance
if rank == 0: print ("\ncurrent distribution")
ld_bal(cxvec)
# lpt balanced
balanced_gidvec, balanced_cxvec = lpt_bal(gidvec, cxvec)
# new expected balance
if rank == 0: print ("\nafter LPT distribution")
ld_bal(balanced_cxvec)
Code: Select all
{
nrnpython("import cx")
nrnpython("cx.test(cx.h.gidvec)")
pc.barrier()
quit()
}
prun()
Re: Load balancing with heterogeneous morphologies and synap
Hi Michael,
Thanks a lot, this is tremendously helpful. It is good to learn about the new load balancing additions. It looks like LPT is the way to go, as my cell complexity measurement showed that the largest cell still has much lower complexity than the average per rank. I will try out your code and see if there is an improvement.
Thanks a lot, this is tremendously helpful. It is good to learn about the new load balancing additions. It looks like LPT is the way to go, as my cell complexity measurement showed that the largest cell still has much lower complexity than the average per rank. I will try out your code and see if there is an improvement.
hines wrote:Here is a python module that prints existing expected load balance and what would be expected if LPT is used to distribute the cells.
Re: Load balancing with heterogeneous morphologies and synap
A better proxy for complexity (than a count of number of states) can be generated with
It is interesting to take a look at the relative computation times of the various mechanisms in the mcomplex.dat file
Then when you later want to balance using a LoadBalance instance, read the file with
lb.read_mcomplex()
Prior to using this more accurate proxy, I mentioned a performance puzzle in the paper
http://neuron.yale.edu/neuron/static/pa ... isplit.pdf
Code: Select all
from neuron import h
pc = h.ParallelContext()
h.load_file("loadbal.hoc")
lb = h.LoadBalance()
lb.ExperimentalMechComplex()
pc.barrier()
h.quit()
Then when you later want to balance using a LoadBalance instance, read the file with
lb.read_mcomplex()
Prior to using this more accurate proxy, I mentioned a performance puzzle in the paper
http://neuron.yale.edu/neuron/static/pa ... isplit.pdf
Take a look at the image at http://neuron.yale.edu/ftp/hines/expmechcomplex.pdfThough the
predicted load imbalance is 5% for the 8192 processor case the measured maximum
and average computation time exhibits a load imbalance of 18%. While
these results are already very satisfying, further speedup may be achievable
if the reasons for the deviation of the measured computation time balance
and the expected balance from the complexity proxy can be identified.
Re: Load balancing with heterogeneous morphologies and synap
Just to close this discussion, I'd like to mention that the hoc implementation of the lpt algorithm is not very efficient. It is much faster to use a python implemenation that make use of heapq.
Here is an lpt.py file that you can run with
python lpt.py
Here is an lpt.py file that you can run with
python lpt.py
Code: Select all
import heapq
def lpt(cx, npart):
''' from the list of (cx, gid) return a npart length list with each partition
being a total_cx followed by a list of (cx, gid).
'''
cx.sort(key=lambda x:x[0], reverse=True)
# initialize a priority queue for fast determination of current
# partition with least complexity. The priority queue always has
# npart items in it. At this time we do not care which partition will
# be associated with which rank so a partition on the heap is just
# (totalcx, [list of (cx, gid)]
h = []
for i in range(npart):
heapq.heappush(h, (0.0, []))
#each cx item goes into the current least complex partition
for c in cx:
lp = heapq.heappop(h) # least partition
lp[1].append(c)
heapq.heappush(h, (lp[0]+c[0], lp[1]))
parts = [heapq.heappop(h) for i in range(len(h))]
return parts
def statistics(parts):
npart = len(parts)
total_cx = 0
max_part_cx = 0
ncx = 0
max_cx = 0
for part in parts:
ncx += len(part[1])
total_cx += part[0]
if part[0] > max_part_cx:
max_part_cx = part[0]
for cx in part[1]:
if cx[0] > max_cx:
max_cx = cx[0]
avg_part_cx =total_cx/npart
loadbal = 1.0
if max_part_cx > 0.:
loadbal = avg_part_cx/max_part_cx
s = "loadbal=%g total_cx=%g npart=%d ncx=%d max_part_cx=%g max_cx=%g"%(loadbal,total_cx,npart,ncx,max_part_cx, max_cx)
return s
if __name__ == '__main__':
for cx in ([(i, i) for i in range(10)],[]):
print len(cx), ' complexity items ', cx
pinfo = lpt(cx, 3)
print len(pinfo), ' lpt partitions ', pinfo
print statistics(pinfo)
Re: Load balancing with heterogeneous morphologies and synap
Hi Michael,
Thanks a lot for the code and explanations, I just have one question: in the routine lpt_bal, is it really necessary to send back a balanced gidvec to each rank? I am thinking about a simple scheme where lpt_bal writes the file with gid to rank assignments, and then the simulation code can read it and allocate gids accordingly. Does that make sense, or am I misunderstanding lpt_bal? Thank you,
Ivan
Thanks a lot for the code and explanations, I just have one question: in the routine lpt_bal, is it really necessary to send back a balanced gidvec to each rank? I am thinking about a simple scheme where lpt_bal writes the file with gid to rank assignments, and then the simulation code can read it and allocate gids accordingly. Does that make sense, or am I misunderstanding lpt_bal? Thank you,
Ivan
hines wrote:Just to close this discussion, I'd like to mention that the hoc implementation of the lpt algorithm is not very efficient. It is much faster to use a python implemenation that make use of heapq.
Here is an lpt.py file that you can run with
python lpt.py
Re: Load balancing with heterogeneous morphologies and synap
The difference is indeed quite striking. So what do I do after the lb.read_mcomplex call? The current complexity estimate procedure calls lb.cell_complexity for each cell/gid. How would this change when using ExperimentalMechComplex and read_mcomplex?
hines wrote:A better proxy for complexity (than a count of number of states) can be generated withIt is interesting to take a look at the relative computation times of the various mechanisms in the mcomplex.dat fileCode: Select all
from neuron import h pc = h.ParallelContext() h.load_file("loadbal.hoc") lb = h.LoadBalance() lb.ExperimentalMechComplex() pc.barrier() h.quit()
Then when you later want to balance using a LoadBalance instance, read the file with
lb.read_mcomplex()