Issue running model with 20 cores or more

Moderator: tom_morse

bwjmbb17
Posts: 16
Joined: Mon Apr 10, 2017 10:09 am

Issue running model with 20 cores or more

Post by bwjmbb17 » Tue Aug 08, 2017 1:56 pm

I have an issue when running my model with 20 or more cores. I can run with 12 cores just fine, but when I run with 20 I get this error when gathering data. Everything before "Gathering data" runs properly.

Gathering data...
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc

Has anyone run into this same issue?
The issue doesn't seem to depend on the number of cells, because it still happens whether I run a 100-cell or a 1000-cell model. I can add my code if needed.

salvadord
Posts: 57
Joined: Tue Aug 18, 2015 3:49 pm

Re: Issue running model with 20 cores or more

Post by salvadord » Tue Aug 15, 2017 11:18 pm

Yes, please share your code; also any info on the machine/software you are running this on would be useful -- e.g. OS, number of cores, NEURON version, netpyne version, etc.

We have tested netpyne simulations on up to 512 cores on supercomputers, but I will check running your code on 20+ cores and let you know if there are any issues.

Thanks

ted
Site Admin
Posts: 5106
Joined: Wed May 18, 2005 4:50 pm
Location: Yale University School of Medicine

Re: Issue running model with 20 cores or more

Post by ted » Wed Aug 16, 2017 12:35 pm

If the code amounts to more than 10-20 lines, it would be best to communicate it via direct email between the interested parties, or by posting a URL that points to a compressed source code file that resides somewhere else than this forum.

bwjmbb17
Posts: 16
Joined: Mon Apr 10, 2017 10:09 am

Re: Issue running model with 20 cores or more

Post by bwjmbb17 » Tue Aug 22, 2017 12:59 pm

Here is the link to my code. It includes my main file (100_Cell_LA.py), my template (LA_Template.py), my output file with the error (results*.out), and the batch file I used (batch_file.sh).

https://drive.google.com/open?id=0B0cn- ... nJfVVROM2M

The information about my machine is as follows:
OS: CentOS 7
Number of Cores I am trying to use: 24
NEURON Version: 7.4
NetPyNe Version: 0.7.1

Thanks

bwjmbb17
Posts: 16
Joined: Mon Apr 10, 2017 10:09 am

Re: Issue running model with 20 cores or more

Post by bwjmbb17 » Tue Oct 03, 2017 4:21 pm

Hi guys,

I am still running into this issue when I try to run on more than 20 cores, and in some cases the issue starts above 12 cores. I can run my programs fine with 12, but when I try to run with 18 I get the same error. I have updated my code a little and have attached a link here: https://drive.google.com/drive/folders/ ... sp=sharing. Please take a look if you get a chance. Also, here are some more details on the machine that I am using:

OS: CentOS 7
Number of Cores I am trying to use: 18
NEURON Version: 7.4
NetPyNe Version: 0.7.1
Modules I load: intel/intel-2016-update2, nrn/nrn-mpi-7.4, and openmpi/openmpi-2.0.1

Let me know if you need anything else from me.

Thanks

salvadord
Posts: 57
Joined: Tue Aug 18, 2015 3:49 pm

Re: Issue running model with 20 cores or more

Post by salvadord » Mon Oct 16, 2017 3:56 pm

I'm looking at this issue now, sorry for the delay. I tried running your code but it's missing LA_Template.py -- can you please send the missing file (or a full updated version of your model)?

I also noticed that you are trying to run the simulation for 276 seconds, which is a relatively long time -- do you get the same issue on 18 cores with a shorter duration (e.g. 1 sec)?

Something else you could try is updating NEURON to version 7.5 and NetPyNE to 0.7.4. I used an old LA_Template.py, reduced the cell types to just 'PYR_A', changed the sim time to 10 sec, and the model ran ok on Mac OS and Ubuntu machines using 18 and 24 cores. In any case, I can try your full model once you send the missing file.

thanks

bwjmbb17
Posts: 16
Joined: Mon Apr 10, 2017 10:09 am

Re: Issue running model with 20 cores or more

Post by bwjmbb17 » Mon Oct 16, 2017 4:11 pm

That is okay, thank you for taking a look. I apologize that the LA_Template.py file was missing. It is now in the folder that I shared with you. I have not tried running it for a shorter amount of time, but I will give it a try and see if that is the issue. I will also try to update to NEURON version 7.5 and NetPyNE version 0.7.4. Thanks for the input.

salvadord
Posts: 57
Joined: Tue Aug 18, 2015 3:49 pm

Re: Issue running model with 20 cores or more

Post by salvadord » Wed Oct 18, 2017 7:06 pm

FYI, I tested your full model using 24 cores and also got an MPI error during gathering for 276 sec and 50 sec sims, but it ran ok for 10 sec sims. So it seems to be a memory issue when gathering too much output data (long sims) from many cores. If you definitely need those long simulation times, the only solution seems to be to use fewer cores, or maybe to try larger machines with more memory. Otherwise, it should work ok with more cores and shorter sim durations.
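
To get a rough feel for why sim duration matters here, the raw size of the recorded traces can be estimated with a few lines of plain Python. This is only a back-of-envelope sketch: the function name is made up for illustration, and the 0.1 ms recording step and one recorded trace per cell for 1000 cells are assumptions, not values taken from the shared model. It also ignores spike times and any overhead from packing the data for transfer between nodes.

```python
def estimate_gathered_bytes(duration_ms, record_step_ms, n_traces, bytes_per_sample=8):
    """Rough size in bytes of the raw trace samples gathered at the end of a run.

    Hypothetical helper for illustration only; it is not part of NEURON or NetPyNE.
    Assumes every trace stores one 8-byte float per recording step.
    """
    # Number of samples in one trace, including the sample at t = 0.
    samples_per_trace = int(round(duration_ms / record_step_ms)) + 1
    return samples_per_trace * n_traces * bytes_per_sample


# Assumed values: 0.1 ms recording step, 1000 traces (one per cell).
for seconds in (10, 50, 276):
    total = estimate_gathered_bytes(seconds * 1000.0, 0.1, 1000)
    print(f"{seconds:>4} s sim: ~{total / 1e9:.1f} GB of raw trace samples")
```

Under these assumptions the estimate grows linearly with duration, so a 276 sec run needs about 27.6 times the memory of a 10 sec run for the same traces. Recording fewer traces or using a coarser recording step would reduce it in the same proportion.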
