Parallel NEURON on HPC help

harrisonZ
Posts: 8
Joined: Thu Jan 23, 2020 3:45 pm

Parallel NEURON on HPC help

Post by harrisonZ »

Hi,

I have a problem running NEURON in parallel on an HPC cluster. I configured NEURON 7.6 with '--with-paranrn'. On the cluster there are two MPI modules (impi and ompi), and I also have MPI installed in my conda environment. According to viewtopic.php?t=1711, I need to make sure that mpicc, mpic++, and mpicxx all come from the same directory, so I ran some tests.

Code: Select all

zhao1505@ln0005 [~] % module load python3
zhao1505@ln0005 [~] % source activate py3
(py3) zhao1505@ln0005 [~] % which mpicc
~/.conda/envs/py3/bin/mpicc
(py3) zhao1505@ln0005 [~] % which mpic++
~/.conda/envs/py3/bin/mpic++
(py3) zhao1505@ln0005 [~] % which mpicxx
~/.conda/envs/py3/bin/mpicxx
(py3) zhao1505@ln0005 [~] % cd neuron/nrn/src/parallel/
(py3) zhao1505@ln0005 [~/neuron/nrn/src/parallel] % mpiexec -n 3 nrniv -mpi test0.hoc
nrniv: symbol lookup error: /panfs/roc/intel/x86_64/2018/impi_msi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib/libmpifort.so.12: undefined symbol: i_realloc
nrniv: symbol lookup error: /panfs/roc/intel/x86_64/2018/impi_msi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib/libmpifort.so.12: undefined symbol: i_realloc
nrniv: symbol lookup error: /panfs/roc/intel/x86_64/2018/impi_msi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib/libmpifort.so.12: undefined symbol: i_realloc
It seems that NEURON wanted to load MPI from the impi module, so I loaded impi and tried again.

Code: Select all

(py3) zhao1505@ln0005 [~/neuron/nrn/src/parallel] % module load impi
(py3) zhao1505@ln0005 [~/neuron/nrn/src/parallel] % which mpicc
/panfs/roc/intel/x86_64/2018/impi_msi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/bin/mpicc
(py3) zhao1505@ln0005 [~/neuron/nrn/src/parallel] % which mpic++
~/.conda/envs/py3/bin/mpic++
(py3) zhao1505@ln0005 [~/neuron/nrn/src/parallel] % which mpicxx
/panfs/roc/intel/x86_64/2018/impi_msi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/bin/mpicxx
(py3) zhao1505@ln0005 [~/neuron/nrn/src/parallel] % mpiexec -n 3 nrniv -mpi test0.hoc
(py3) zhao1505@ln0005 [~/neuron/nrn/src/parallel] % mpiexec -n 3 nrniv -mpi test0.hoc
Fatal error in PMPI_Comm_dup: Invalid communicator, error stack:
PMPI_Comm_dup(192): MPI_Comm_dup(comm=0x9f771ce0, new_comm=0x7f0ea2fdd2d0) failed
PMPI_Comm_dup(144): Invalid communicator
Fatal error in PMPI_Comm_dup: Invalid communicator, error stack:
PMPI_Comm_dup(192): MPI_Comm_dup(comm=0xd25cdce0, new_comm=0x7f83d5e392d0) failed
PMPI_Comm_dup(144): Invalid communicator
Fatal error in PMPI_Comm_dup: Invalid communicator, error stack:
PMPI_Comm_dup(192): MPI_Comm_dup(comm=0x957bce0, new_comm=0x7f3f0cde72d0) failed
PMPI_Comm_dup(144): Invalid communicator
Then I realized that the 'impi' module does not provide mpic++; it only has mpicxx (which is why 'which mpic++' still pointed at the conda copy). So I tried again with 'ompi':

Code: Select all

(py3) zhao1505@ln0004 [~] % module load ompi
(py3) zhao1505@ln0004 [~] % which mpicc
/panfs/roc/msisoft/openmpi/el6/3.1.6/gnu-8.2.0/bin/mpicc
(py3) zhao1505@ln0004 [~] % which mpic++
/panfs/roc/msisoft/openmpi/el6/3.1.6/gnu-8.2.0/bin/mpic++
(py3) zhao1505@ln0004 [~] % which mpicxx
/panfs/roc/msisoft/openmpi/el6/3.1.6/gnu-8.2.0/bin/mpicxx
(py3) zhao1505@ln0004 [~] % cd neuron/nrn/src/parallel/
(py3) zhao1505@ln0004 [~/neuron/nrn/src/parallel] % mpiexec -n 3 nrniv -mpi test0.hoc
[ln0004:2884856] mca_base_component_repository_open: unable to open mca_plm_tm: libtorque.so.2: cannot open shared object file: No such file or directory (ignored)
[ln0004:2884856] mca_base_component_repository_open: unable to open mca_ras_tm: libtorque.so.2: cannot open shared object file: No such file or directory (ignored)
nrniv: symbol lookup error: /panfs/roc/intel/x86_64/2018/impi_msi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib/libmpifort.so.12: undefined symbol: i_realloc
nrniv: symbol lookup error: /panfs/roc/intel/x86_64/2018/impi_msi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib/libmpifort.so.12: undefined symbol: i_realloc
nrniv: symbol lookup error: /panfs/roc/intel/x86_64/2018/impi_msi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib/libmpifort.so.12: undefined symbol: i_realloc
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[55744,1],1]
  Exit code:    127
Now I don't understand why NEURON still loads MPI from the impi module, even with ompi loaded. I would really appreciate it if anybody could help me with this.
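In case it helps with diagnosis: I believe the libraries nrniv will actually load can be listed with ldd, which should show whether the Intel MPI path is baked into the binary itself (this is just the standard ldd check, nothing NEURON-specific):

Code: Select all

# list the MPI shared libraries the nrniv binary resolves at load time
ldd $(which nrniv) | grep -i mpi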

Thank you!
ramcdougal
Posts: 267
Joined: Fri Nov 28, 2008 3:38 pm
Location: Yale School of Public Health

Re: Parallel NEURON on HPC help

Post by ramcdougal »

Sounds like NEURON might have been built without dynamic MPI support, in which case nrniv is linked directly against whichever MPI was found at compile time (here, apparently the Intel MPI), no matter which module you load afterwards.
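If that's what happened, one option (a sketch from memory of the 7.x autotools build, untested on your cluster; the install prefix below is just an example) is to rebuild with dynamic MPI loading, so nrniv picks up whichever MPI is in your environment at launch instead of hard-linking one at build time:

Code: Select all

# from the nrn source directory; adjust the prefix to your install location
./configure --prefix=$HOME/neuron/install --with-paranrn=dynamic
make && make install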

Before we try debugging the compilation instructions for 7.6, have you tested whether the current release (8.2.1) binary works? Even if you ultimately want the older version or a build optimized for this specific HPC, it could still give useful debugging clues:

Code: Select all

pip3 install neuron
mpiexec -n 3 nrniv -mpi test0.hoc
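Another quick sanity check that doesn't depend on test0.hoc (assuming I remember the startup banner correctly) is to launch with -mpi and quit immediately; a working MPI build should report the number of ranks at startup:

Code: Select all

# with a working -mpi build, the banner should include a numprocs=3 line
mpiexec -n 3 nrniv -mpi -c 'quit()'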