--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA
parameter to true.

  Local host:     c404-001
  Local adapter:  hfi1_0
  Local port:     1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:    c404-001
  Local device:  hfi1_0
--------------------------------------------------------------------------
[c404-001.stampede2.tacc.utexas.edu:232572] mca_base_component_repository_open: unable to open mca_coll_basic: /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_coll_basic.so: failed to map segment from shared object (ignored)
[c404-001.stampede2.tacc.utexas.edu:232572] mca_base_component_repository_open: unable to open mca_coll_inter: /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_coll_inter.so: failed to map segment from shared object (ignored)
[c404-001.stampede2.tacc.utexas.edu:232572] mca_base_component_repository_open: unable to open mca_coll_libnbc: /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_coll_libnbc.so: failed to map segment from shared object (ignored)
--------------------------------------------------------------------------
Although some coll components are available on your system, none of
them said that they could be used for reduce_scatter_block on a new
communicator.

This is extremely unusual -- either the "basic", "libnbc" or "self"
components should be able to be chosen for any communicator.  As such,
this likely means that something else is wrong (although you should
double check that the "basic", "libnbc" and "self" coll components are
available on your system -- check the output of the "ompi_info"
command).

A coll module failed to finalize properly when a communicator that was
using it was destroyed.

This is somewhat unusual: the module itself may be at fault, or this
may be a symptom of another issue (e.g., a memory problem).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  mca_coll_base_comm_select(MPI_COMM_WORLD) failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[c404-001:232572] *** An error occurred in MPI_Init
[c404-001:232572] *** reported by process [1792147457,0]
[c404-001:232572] *** on a NULL communicator
[c404-001:232572] *** Unknown error
[c404-001:232572] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[c404-001:232572] ***    and potentially your MPI job)
Command exited with non-zero status 1
{"realtime":28.69,"usertime":33.23,"systime":33.73,"memmax":137680,"memavg":0}
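
The failure is reported inside MPI_Init itself, so it can be isolated from the original application with a minimal MPI program. The sketch below is illustrative only: the file name, executable name, and run command are assumptions, not taken from the original job. It also shows, as a comment, how the btl_openib_allow_ib MCA parameter mentioned in the first warning can be passed to mpirun; note that the "failed to map segment from shared object" lines suggest a separate library-loading problem that this parameter alone may not resolve.

/* hello_mpi.c - minimal reproducer for the MPI_Init failure above.
 *
 * Build and run (names are placeholders):
 *   mpicc hello_mpi.c -o hello_mpi
 *   mpirun -n 2 --mca btl_openib_allow_ib true ./hello_mpi
 * The MCA parameter can also be set via the environment:
 *   export OMPI_MCA_btl_openib_allow_ib=true
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* MPI_Init is where the log above fails
       (mca_coll_base_comm_select(MPI_COMM_WORLD) returned "Not found"). */
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d initialized\n", rank, size);

    MPI_Finalize();
    return 0;
}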