Load balancing with heterogeneous morphologies and synapses

General issues of interest both for network and individual cell parallelization.

Moderator: hines


Post by iraikov »

Hello,

I have a large network model where each principal cell model has a unique morphology and distribution of synapses. Is multisplit the most appropriate way to load balance this network, and if so, is the multisplit version of the Traub model the principal example of how to do that? Thanks,

-Ivan

Post by ted »

The first question is: how bad is the imbalance if you use the simplest approach to distributing the cells? If the number of principal cells is very large compared to the number of hosts, and the variance of their complexity is not very large, you may not have to do anything special--simply deal out the principal cells to the hosts without splitting them.
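
For reference, a minimal round robin distribution looks something like this (a sketch; ncell and make_cell are placeholders for your own network size and cell builder):

Code: Select all

from neuron import h
pc = h.ParallelContext()
rank = int(pc.id())
nhost = int(pc.nhost())

ncell = 10  # placeholder: total number of cells in the network

def make_cell(gid):
  # placeholder for a real cell builder
  return h.Section(name="cell_%d" % gid)

cells = {}
for gid in range(rank, ncell, nhost):  # deal gids out round robin
  pc.set_gid2node(gid, rank)
  cells[gid] = make_cell(gid)
# a real model would also associate a spike source with each gid
# via pc.cell(gid, netcon)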

Post by iraikov »

Hi Ted,

Using round-robin assignment of cells to ranks, short simulations of 100-300 ms have a load balancing factor of about 0.75 on 16384 ranks (1e6 principal cells in a model of the rat dentate gyrus), and 0.73 for a longer 1000 ms simulation. The difference in dendritic area and number of synapses can be large, about a 4x difference between the smallest and largest cells, resulting in a range of 3000-12000 synapses per cell. I am of course open to suggestions if there are more detailed measurements of cell complexity and load balancing I can do to determine whether multisplit might be useful here.
ted wrote:The first question is: how bad is the imbalance if you use the simplest approach to distributing the cells? If the number of principal cells is very large compared to the number of hosts, and the variance of their complexity is not very large, you may not have to do anything special--simply deal out the principal cells to the hosts without splitting them.

Post by hines »

A rough proxy for relative computation time is the number of states (a less accurate one is the number of compartments). Multisplit is useful if TotalNumberOfStatesForAllCells/nhost < NumberOfStatesOfLargestCell.

Here is a way to determine that:

Code: Select all

~/models/nrntraub$ cat cx.py
from neuron import h
pc = h.ParallelContext()
nhost = int(pc.nhost())
rank = int(pc.id())

h.load_file("loadbal.hoc")

def cx():
  lb = h.LoadBalance()

  # complexity of each cell on this rank
  cell_cx = []
  for sec in h.allsec():
    if sec.parentseg() is None: # root section, so one entry per cell
      cell_cx.append(lb.cell_complexity(sec=sec))

  # local
  max_cx = max(cell_cx) if cell_cx else 0.0
  sum_cx = sum(cell_cx)

  # global (1 sums across ranks, 2 takes the max)
  max_cx = pc.allreduce(max_cx, 2)
  sum_cx = pc.allreduce(sum_cx, 1)

  if rank == 0:
    print ("max_cx = %g  average cx per rank = %g\n" % (max_cx, sum_cx/nhost))

cx()

pc.barrier()
h.quit()
Then call that after you set up your model, e.g. from a hoc file with

Code: Select all

nrnpython("import cx")
Then, when I run the 356 cell traub model with 8 cores and round robin distribution, I see
max_cx = 4093 average cx per rank = 120997
Since the largest cell's complexity is far below the average complexity per rank, there is clearly no good reason to use multisplit in this case.

Post by ted »

So the question becomes: is the simple round robin strategy for distributing cells over hosts good enough for this particular case, and if not, what alternative to try (e.g. round robin for all cell classes whose complexities are fixed, but maybe a "whole cell" variant of the least processing time algorithm (Korf1998, cited in discussion of load balance by Hines2008) for each cell class in which complexity varies from one instance to another).

@ARTICLE{Hines2008,
author = {Hines, M.L. and Markram, H. and Schürmann, F.},
title = {Fully implicit parallel simulation of single neurons},
journal = {Journal of Computational Neuroscience},
year = {2008},
volume = {25},
pages = {439--448},
note = {PMC2760991},
pmid = {18379867},
}
downloadable from link at http://www.neuron.yale.edu/neuron/nrnpubs

@Article{Korf1998,
author = {Korf, R.E.},
title = {A complete anytime algorithm for number partitioning},
journal = {Artificial Intelligence},
year = {1998},
volume = {106},
number = {2},
pages = {181--203},
note = {least processing time algorithm for load balance},
publisher = {Elsevier},
}

Post by iraikov »

Thanks, Michael, I have added this code to the model and will calculate the cell complexity. Just to make sure I understand: the `allsec' function will return all sections of all cells on the current rank, right? I assumed from the code that it just needs to be run at the top level, but let me know if this is incorrect.
hines wrote:A rough proxy for relative computation time is the number of states (a less accurate one is the number of compartments). Multisplit is useful if TotalNumberOfStatesForAllCells/nhost < NumberOfStatesOfLargestCell.

Post by iraikov »

Hi Ted,

The simple round robin strategy for all cells results in a load balancing factor of 0.73-0.75 (when the stimulus is from homogeneous Poisson sources), which is not terrible but I think can be improved. Thanks for the references, I will look through the Korf paper and see if I can figure out how to apply this least processing time algorithm to load balance the cell population with variable complexity.
ted wrote:So the question becomes: is the simple round robin strategy for distributing cells over hosts good enough for this particular case, and if not, what alternative to try (e.g. round robin for all cell classes whose complexities are fixed, but maybe a "whole cell" variant of the least processing time algorithm (Korf1998, cited in discussion of load balance by Hines2008) for each cell class in which complexity varies from one instance to another).

Post by ted »

In pseudocode

Code: Select all

REPEAT
  place the most complex cell on the host that has the least total complexity
UNTIL no more cells remain to be placed
In "Fully implicit parallel simulation of single neurons" see Fig. 3 and its legend, and imagine replacing the word "piece" with "cell". I would suggest using round robin for all cell classes whose members are of uniform size.

Post by hines »

iraikov wrote:Just to make sure I understand: the `allsec' function will return all sections of all cells on the current rank, right?
Yes. I used allsec as I did not know what you used for a cell list or a gid vector on each rank. It was just a dirty and not-so-quick way to figure out what the cells were. As written, it runs on import and quits, but you can modify it so that cx.cx() can be called from anywhere, perhaps with a list of your cells as an argument. The only requirement is that every rank participate in the call, because of the allreduce collectives. It had more of a throwaway purpose, just to determine whether your model could benefit from multisplit.
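
For example, a variant that takes your cell list might look like this (a sketch, assuming each object in cells has a soma root section; adapt the attribute name to your cell class):

Code: Select all

def cx(cells):
  ''' cells: list of cell objects on this rank '''
  lb = h.LoadBalance()
  cell_cx = [lb.cell_complexity(sec=c.soma) for c in cells] # one root section per cell

  # local, then global reductions (every rank must make this call)
  max_cx = pc.allreduce(max(cell_cx) if cell_cx else 0.0, 2)
  sum_cx = pc.allreduce(sum(cell_cx), 1)
  return max_cx, sum_cx/nhost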

For the overall whole-cell load balance problem, you need to actually measure the load balance of a simulation, which I guess you have already done. The current best practice for that takes advantage of the recently introduced barrier just before spike exchange, so that the load balance measurement is not just static but includes dynamic effects as well, i.e.

Code: Select all

def prun():
  ...
  h.stdinit()
  ...
  runtime = h.startsw()
  pc.psolve(h.tstop)
  runtime = h.startsw() - runtime
  computation_time = pc.step_time()           # time this rank spent integrating
  cw_time = computation_time + pc.step_wait() # plus time waiting at the spike exchange barrier
  max_cw_time = pc.allreduce(cw_time, 2)
  avg_comp_time = pc.allreduce(computation_time, 1)/nhost
  load_balance = avg_comp_time/max_cw_time
  return runtime, load_balance, avg_comp_time
The load balance calculation is accurate even if there is significant dynamic spike-handling imbalance on a per-integration-interval basis.
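
A minimal usage sketch (assuming pc, nhost, and rank are defined as in the earlier fragments):

Code: Select all

runtime, load_balance, avg_comp_time = prun()
if rank == 0:
  print ("runtime %g s  load_balance %.2f" % (runtime, load_balance))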

Post by hines »

Here is a python module that prints the existing expected load balance and what would be expected if LPT were used to distribute the cells.

Code: Select all

$ cat cx.py
# usage:
# import cx
# test(gidvec)
# Where gidvec is the (round robin) distribution of gids on this rank
# and cell exists for each of those gids.
# The current expected load balance and expected load balance
# if the LPT distributon is used, will be printed.

from neuron import h
pc = h.ParallelContext()
nhost = int(pc.nhost())
rank = int(pc.id())

h.load_file("loadbal.hoc")

# given gids, what are the complexities
def cx(gidvec):
  ''' gidvec is Hoc Vector, return complexity as Hoc Vector '''
  lb = h.LoadBalance()
  cxvec = gidvec.c()
  for i, gid in enumerate(gidvec):
    cxvec.x[i] = lb.cell_complexity(pc.gid2cell(gid))
  return cxvec

# for given cxvec on each rank what is the fractional load balance.
def ld_bal(cxvec):
  sum_cx = sum(cxvec)
  max_sum_cx = pc.allreduce(sum_cx, 2)
  sum_cx = pc.allreduce(sum_cx, 1)
  if rank == 0:
    print ("expected load_balance %.2f" % (sum_cx/nhost / max_sum_cx))

#each rank has gidvec, cxvec,
# gather everything to rank 0 and do lpt algorithm and scatter
# proper gidvec, cxvec back to ranks and return the new balanced
# gidvec, cxvec. Note, perhaps should write a balance file as well.
def lpt_bal(gidvec, cxvec):
  #gather gidvec, cxvec to rank 0
  src = [None]*nhost
  src[0] = (gidvec, cxvec)
  dest = pc.py_alltoall(src)
  del src

  if rank == 0:
    # organize dest into single allgidvec, allcxvec Hoc Vectors.
    allgidvec = h.Vector()
    allcxvec = h.Vector()
    for pair in dest:
      allgidvec.append(pair[0])
      allcxvec.append(pair[1])
    del dest

    #rankvec specifies the rank where each cell should be
    lb = h.LoadBalance()
    rankvec = lb.lpt(allcxvec, nhost, 0) # third arg suppresses a print

    #send back a balanced gidvec, cxvec to each rank
    # start out with empty vectors
    src = [(h.Vector(), h.Vector()) for _ in range(nhost)]
    for i in range(len(allcxvec)):
      pair = src[int(rankvec.x[i])]
      pair[0].append(allgidvec.x[i])
      pair[1].append(allcxvec.x[i])
    del allgidvec
    del allcxvec
  else: # all other ranks send nothing
    src = [None]*nhost

  dest = pc.py_alltoall(src)
  # dest[0] contains the balanced (gidvec, cxvec) pair
  del src
  balanced_gidvec = dest[0][0]
  balanced_cxvec = dest[0][1]
  del dest

  return balanced_gidvec, balanced_cxvec

def test(gidvec):
  # cell complexities (the cells exist)
  cxvec = cx(gidvec)

  # present expected balance
  if rank == 0: print ("\ncurrent distribution")
  ld_bal(cxvec)

  # lpt balanced
  balanced_gidvec, balanced_cxvec = lpt_bal(gidvec, cxvec)

  # new expected balance
  if rank == 0: print ("\nafter LPT distribution")
  ld_bal(balanced_cxvec)
To test with the nrntraub model (which is written in hoc and happens to have a gidvec), I used this fragment in the init.hoc file, after model setup:

Code: Select all

{
nrnpython("import cx")
nrnpython("cx.test(cx.h.gidvec)")
pc.barrier()
quit()
}

prun()

Post by iraikov »

Hi Michael,
Thanks a lot, this is tremendously helpful. It is good to learn about the new load balancing additions. It looks like LPT is the way to go, as my cell complexity measurement showed that the largest cell still has much lower complexity than the average per rank. I will try out your code and see if there is an improvement.
hines wrote:Here is a python module that prints the existing expected load balance and what would be expected if LPT were used to distribute the cells.

Post by hines »

A better proxy for complexity (than a count of the number of states) can be generated with

Code: Select all

from neuron import h
pc = h.ParallelContext()
h.load_file("loadbal.hoc")
lb = h.LoadBalance()
lb.ExperimentalMechComplex() # times each mechanism and writes mcomplex.dat
pc.barrier()
h.quit()
It is interesting to take a look at the relative computation times of the various mechanisms in the mcomplex.dat file. Then, when you later want to balance using a LoadBalance instance, read the file with
lb.read_mcomplex()
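
For instance (a sketch, assuming mcomplex.dat from a prior ExperimentalMechComplex run is in the working directory, and gid is a gid whose cell exists on this rank):

Code: Select all

lb = h.LoadBalance()
lb.read_mcomplex() # load per-mechanism timing factors from mcomplex.dat
# complexities computed with this instance now weight each mechanism
# by its measured cost rather than just counting states
cx_of_cell = lb.cell_complexity(pc.gid2cell(gid))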

Prior to using this more accurate proxy, I mentioned a performance puzzle in the paper
http://neuron.yale.edu/neuron/static/pa ... isplit.pdf
"Though the predicted load imbalance is 5% for the 8192 processor case the measured maximum and average computation time exhibits a load imbalance of 18%. While these results are already very satisfying, further speedup may be achievable if the reasons for the deviation of the measured computation time balance and the expected balance from the complexity proxy can be identified."
Take a look at the image at http://neuron.yale.edu/ftp/hines/exp-mech-complex.pdf

Post by hines »

Just to close this discussion, I'd like to mention that the hoc implementation of the lpt algorithm is not very efficient. It is much faster to use a python implementation that makes use of heapq.
Here is an lpt.py file that you can run with
python lpt.py

Code: Select all

import heapq

def lpt(cx, npart):
  ''' from the list of (cx, gid) return an npart length list with each partition
      being a total_cx followed by a list of (cx, gid).
  '''
  cx.sort(key=lambda x: x[0], reverse=True)
  # initialize a priority queue for fast determination of the current
  # partition with least complexity. The priority queue always has
  # npart items in it. At this time we do not care which partition will
  # be associated with which rank so a partition on the heap is just
  # (totalcx, [list of (cx, gid)])
  h = []
  for i in range(npart):
    heapq.heappush(h, (0.0, []))
  # each cx item goes into the current least complex partition
  for c in cx:
    lp = heapq.heappop(h) # least loaded partition
    lp[1].append(c)
    heapq.heappush(h, (lp[0] + c[0], lp[1]))
  parts = [heapq.heappop(h) for i in range(len(h))]
  return parts

def statistics(parts):
  npart = len(parts)
  total_cx = 0
  max_part_cx = 0
  ncx = 0
  max_cx = 0
  for part in parts:
    ncx += len(part[1])
    total_cx += part[0]
    if part[0] > max_part_cx:
      max_part_cx = part[0]
    for cx in part[1]:
      if cx[0] > max_cx:
        max_cx = cx[0]
  avg_part_cx = total_cx/npart
  loadbal = 1.0
  if max_part_cx > 0.:
    loadbal = avg_part_cx/max_part_cx
  s = "loadbal=%g total_cx=%g npart=%d ncx=%d max_part_cx=%g max_cx=%g"%(loadbal,total_cx,npart,ncx,max_part_cx, max_cx)
  return s

if __name__ == '__main__':
  for cx in ([(i, i) for i in range(10)], []):
    print(len(cx), ' complexity items ', cx)
    pinfo = lpt(cx, 3)
    print(len(pinfo), ' lpt partitions ', pinfo)
    print(statistics(pinfo))
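
To turn the partitions into an explicit gid-to-rank assignment, a small helper could be appended (a sketch; how the mapping is then used is up to your setup code):

Code: Select all

def gid_to_rank(parts):
  ''' parts as returned by lpt(); returns a {gid: rank} dict '''
  rank_of = {}
  for r, (total_cx, items) in enumerate(parts):
    for c, gid in items:
      rank_of[gid] = r
  return rank_of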


Post by iraikov »

Hi Michael,

Thanks a lot for the code and explanations. I just have one question: in the routine lpt_bal, is it really necessary to send back a balanced gidvec to each rank? I am thinking of a simpler scheme where lpt_bal writes a file with gid-to-rank assignments, and the simulation code then reads it and allocates gids accordingly. Does that make sense, or am I misunderstanding lpt_bal? Thank you,
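
Concretely, something like this is what I have in mind (a sketch; the balance.dat name and format are just placeholders):

Code: Select all

# on rank 0, after lb.lpt() has produced rankvec in lpt_bal
with open("balance.dat", "w") as f:
  for i in range(len(allgidvec)):
    f.write("%d %d\n" % (int(allgidvec.x[i]), int(rankvec.x[i])))

# later, at simulation setup, every rank reads the file
# and keeps only its own gids
my_gids = []
with open("balance.dat") as f:
  for line in f:
    gid, r = [int(x) for x in line.split()]
    if r == rank:
      my_gids.append(gid)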

-Ivan
hines wrote:Just to close this discussion, I'd like to mention that the hoc implementation of the lpt algorithm is not very efficient. It is much faster to use a python implementation that makes use of heapq.
Here is an lpt.py file that you can run with
python lpt.py

Post by iraikov »

The difference is indeed quite striking. So what do I do after the lb.read_mcomplex call? The current complexity estimate procedure calls lb.cell_complexity for each cell/gid. How would this change when using ExperimentalMechComplex and read_mcomplex?
hines wrote:A better proxy for complexity (than a count of the number of states) can be generated with

Code: Select all

from neuron import h
pc = h.ParallelContext()
h.load_file("loadbal.hoc")
lb = h.LoadBalance()
lb.ExperimentalMechComplex()
pc.barrier()
h.quit()
It is interesting to take a look at the relative computation times of the various mechanisms in the mcomplex.dat file. Then, when you later want to balance using a LoadBalance instance, read the file with
lb.read_mcomplex()