I have a problem running NEURON in parallel on an HPC cluster. I configured NEURON 7.6 with '--with-paranrn'. The cluster provides two MPI modules (impi and ompi), and I also have MPI installed in my virtual environment. Following viewtopic.php?t=1711, I want to make sure that mpicc, mpic++ and mpicxx all come from the same directory, so I ran some tests.
Test 1, with only the conda environment (py3) active:
zhao1505@ln0005 [~] % module load python3
zhao1505@ln0005 [~] % source activate py3
(py3) zhao1505@ln0005 [~] % which mpicc
~/.conda/envs/py3/bin/mpicc
(py3) zhao1505@ln0005 [~] % which mpic++
~/.conda/envs/py3/bin/mpic++
(py3) zhao1505@ln0005 [~] % which mpicxx
~/.conda/envs/py3/bin/mpicxx
(py3) zhao1505@ln0005 [~] % cd neuron/nrn/src/parallel/
(py3) zhao1505@ln0005 [~/neuron/nrn/src/parallel] % mpiexec -n 3 nrniv -mpi test0.hoc
nrniv: symbol lookup error: /panfs/roc/intel/x86_64/2018/impi_msi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib/libmpifort.so.12: undefined symbol: i_realloc
nrniv: symbol lookup error: /panfs/roc/intel/x86_64/2018/impi_msi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib/libmpifort.so.12: undefined symbol: i_realloc
nrniv: symbol lookup error: /panfs/roc/intel/x86_64/2018/impi_msi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib/libmpifort.so.12: undefined symbol: i_realloc
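The error above names Intel MPI's libmpifort.so.12 even though every wrapper on PATH comes from the conda environment, which may mean that nrniv resolves a different MPI library at load time than the one launching it. A minimal sketch for checking this, assuming a Linux system with ldd available and nrniv on PATH; the helper name mpi_libs_of is made up for illustration:

```shell
#!/usr/bin/env bash
# List the MPI-related shared libraries a binary resolves at load time.
# mpi_libs_of is a hypothetical helper name, not part of NEURON or any MPI.
mpi_libs_of() {
    local bin="$1"
    # ldd prints the resolved path of every shared library; keep only MPI ones.
    ldd "$bin" 2>/dev/null | grep -i 'mpi' || true
}

nrniv_path="$(command -v nrniv || true)"
if [ -n "$nrniv_path" ]; then
    mpi_libs_of "$nrniv_path"
else
    echo "nrniv not found on PATH"
fi
```

If the paths printed point into the impi tree while mpiexec comes from conda or ompi, that mismatch would be consistent with the symbol lookup error. If nothing is printed, NEURON may be loading MPI dynamically, in which case this check does not apply.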
Test 2, after loading the impi module in the same session:
(py3) zhao1505@ln0005 [~/neuron/nrn/src/parallel] % module load impi
(py3) zhao1505@ln0005 [~/neuron/nrn/src/parallel] % which mpicc
/panfs/roc/intel/x86_64/2018/impi_msi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/bin/mpicc
(py3) zhao1505@ln0005 [~/neuron/nrn/src/parallel] % which mpic++
~/.conda/envs/py3/bin/mpic++
(py3) zhao1505@ln0005 [~/neuron/nrn/src/parallel] % which mpicxx
/panfs/roc/intel/x86_64/2018/impi_msi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/bin/mpicxx
(py3) zhao1505@ln0005 [~/neuron/nrn/src/parallel] % mpiexec -n 3 nrniv -mpi test0.hoc
Fatal error in PMPI_Comm_dup: Invalid communicator, error stack:
PMPI_Comm_dup(192): MPI_Comm_dup(comm=0x9f771ce0, new_comm=0x7f0ea2fdd2d0) failed
PMPI_Comm_dup(144): Invalid communicator
Fatal error in PMPI_Comm_dup: Invalid communicator, error stack:
PMPI_Comm_dup(192): MPI_Comm_dup(comm=0xd25cdce0, new_comm=0x7f83d5e392d0) failed
PMPI_Comm_dup(144): Invalid communicator
Fatal error in PMPI_Comm_dup: Invalid communicator, error stack:
PMPI_Comm_dup(192): MPI_Comm_dup(comm=0x957bce0, new_comm=0x7f3f0cde72d0) failed
PMPI_Comm_dup(144): Invalid communicator
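In the session above, mpicc and mpicxx resolve to the impi tree while mpic++ still resolves to the conda environment, so the three wrappers are not from the same directory. The repeated which calls can be collapsed into one check. This is only a sketch: check_mpi_tools is a hypothetical helper, and including mpiexec in the list is an assumption about what is worth comparing.

```shell
#!/usr/bin/env bash
# Print where each tool resolves on PATH and report whether all of the
# tools that were found live in the same directory.
# check_mpi_tools is a hypothetical helper name used for illustration.
check_mpi_tools() {
    local tool path n dirs=""
    for tool in "$@"; do
        path="$(command -v "$tool" 2>/dev/null || true)"
        if [ -z "$path" ]; then
            echo "$tool: not found"
            continue
        fi
        echo "$tool: $path"
        dirs="$dirs$(dirname "$path")"$'\n'
    done
    # Count the distinct directories among the tools that were found.
    n="$(printf '%s' "$dirs" | sort -u | grep -c .)"
    if [ "$n" -le 1 ]; then
        echo "consistent"
    else
        echo "MISMATCH: tools come from $n different directories"
        return 1
    fi
}

check_mpi_tools mpicc mpic++ mpicxx mpiexec || true
```

Running it once after each module load or source activate shows at a glance whether the environment mixes two MPI installations.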
Test 3, after loading the ompi module:
(py3) zhao1505@ln0004 [~] % module load ompi
(py3) zhao1505@ln0004 [~] % which mpicc
/panfs/roc/msisoft/openmpi/el6/3.1.6/gnu-8.2.0/bin/mpicc
(py3) zhao1505@ln0004 [~] % which mpic++
/panfs/roc/msisoft/openmpi/el6/3.1.6/gnu-8.2.0/bin/mpic++
(py3) zhao1505@ln0004 [~] % which mpicxx
/panfs/roc/msisoft/openmpi/el6/3.1.6/gnu-8.2.0/bin/mpicxx
(py3) zhao1505@ln0004 [~] % cd neuron/nrn/src/parallel/
(py3) zhao1505@ln0004 [~/neuron/nrn/src/parallel] % mpiexec -n 3 nrniv -mpi test0.hoc
[ln0004:2884856] mca_base_component_repository_open: unable to open mca_plm_tm: libtorque.so.2: cannot open shared object file: No such file or directory (ignored)
[ln0004:2884856] mca_base_component_repository_open: unable to open mca_ras_tm: libtorque.so.2: cannot open shared object file: No such file or directory (ignored)
nrniv: symbol lookup error: /panfs/roc/intel/x86_64/2018/impi_msi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib/libmpifort.so.12: undefined symbol: i_realloc
nrniv: symbol lookup error: /panfs/roc/intel/x86_64/2018/impi_msi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib/libmpifort.so.12: undefined symbol: i_realloc
nrniv: symbol lookup error: /panfs/roc/intel/x86_64/2018/impi_msi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib/libmpifort.so.12: undefined symbol: i_realloc
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[55744,1],1]
Exit code: 127
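Even with the Open MPI wrappers first on PATH, the failing library is still the Intel MPI libmpifort.so.12. One possible explanation is that an impi directory remains on LD_LIBRARY_PATH; another is that nrniv itself was linked against impi. A sketch for the first case, assuming bash and the usual libmpi*.so* naming; list_mpi_dirs is a hypothetical helper:

```shell
#!/usr/bin/env bash
# Print every directory in a colon-separated search path that contains
# an MPI shared library (files matching libmpi*.so*).
# list_mpi_dirs is a hypothetical helper name used for illustration.
list_mpi_dirs() {
    local IFS=':' dir
    for dir in $1; do
        [ -n "$dir" ] || continue
        if ls "$dir"/libmpi*.so* >/dev/null 2>&1; then
            echo "$dir"
        fi
    done
}

list_mpi_dirs "${LD_LIBRARY_PATH:-}"
```

If more than one MPI installation shows up in the output, the dynamic loader uses the first matching directory in the list, which may not be the one that matches mpiexec.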
Thank you!