--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host: acn100
  Local adapter: mlx5_0
  Local port: 1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them). This is most certainly not what you wanted. Check your
cables, subnet manager configuration, etc. The openib BTL will be
ignored for this job.

  Local host: acn100
--------------------------------------------------------------------------
[acn100:1330072] *** Process received signal ***
[acn100:1330072] Signal: Segmentation fault (11)
[acn100:1330072] Signal code: Address not mapped (1)
[acn100:1330072] Failing at address: 0x50
[acn100:1330072] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f9435e4b090]
[acn100:1330072] [ 1] /usr/local/lib/libtorch_cpu.so(+0x46b846e)[0x7f940b42346e]
[acn100:1330072] [ 2] /usr/local/lib/libtorch_cpu.so(+0x46b9b70)[0x7f940b424b70]
[acn100:1330072] [ 3] /usr/local/lib/libtorch_cpu.so(+0x46c71fb)[0x7f940b4321fb]
[acn100:1330072] [ 4] /usr/local/lib/libtorch_cpu.so(_ZN5torch3jit16InterpreterState3runERSt6vectorIN3c106IValueESaIS4_EE+0x52)[0x7f940b41f212]
[acn100:1330072] [ 5] /usr/local/lib/libtorch_cpu.so(+0x46a71a6)[0x7f940b4121a6]
[acn100:1330072] [ 6] /usr/local/lib/libtorch_cpu.so(_ZNK5torch3jit6MethodclESt6vectorIN3c106IValueESaIS4_EERKSt13unordered_mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES4_St4hashISD_ESt8equal_toISD_ESaISt4pairIKSD_S4_EEE+0x180)[0x7f940b095170]
[acn100:1330072] [ 7] /scratch.global/bwaters/jobs/bwaters/job-d5a7251a-bb78-44e6-baa8-b8055902562e-007-fe3ed7df-4696-43de-89d4-9fa8f3a0f2c2/TE_136053930099_003-and-MO_781946209112_000-1711747208/staged_job_files/repository/md/TorchML__MD_173118614730_000/libkim-api-model-driver.so(_ZN12PytorchModel3RunERN3c106IValueE+0x381)[0x7f942bd07fb1]
[acn100:1330072] [ 8] /scratch.global/bwaters/jobs/bwaters/job-d5a7251a-bb78-44e6-baa8-b8055902562e-007-fe3ed7df-4696-43de-89d4-9fa8f3a0f2c2/TE_136053930099_003-and-MO_781946209112_000-1711747208/staged_job_files/repository/md/TorchML__MD_173118614730_000/libkim-api-model-driver.so(_ZN32TorchMLModelDriverImplementation3RunEPKN3KIM21ModelComputeArgumentsE+0x5c)[0x7f942bd05d7c]
[acn100:1330072] [ 9] /scratch.global/bwaters/jobs/bwaters/job-d5a7251a-bb78-44e6-baa8-b8055902562e-007-fe3ed7df-4696-43de-89d4-9fa8f3a0f2c2/TE_136053930099_003-and-MO_781946209112_000-1711747208/staged_job_files/repository/md/TorchML__MD_173118614730_000/libkim-api-model-driver.so(_ZN32TorchMLModelDriverImplementation7ComputeEPKN3KIM21ModelComputeArgumentsE+0xd)[0x7f942bd05e3d]
[acn100:1330072] [10] /scratch.global/bwaters/jobs/bwaters/job-d5a7251a-bb78-44e6-baa8-b8055902562e-007-fe3ed7df-4696-43de-89d4-9fa8f3a0f2c2/TE_136053930099_003-and-MO_781946209112_000-1711747208/staged_job_files/repository/md/TorchML__MD_173118614730_000/libkim-api-model-driver.so(_ZN18TorchMLModelDriver7ComputeEPKN3KIM12ModelComputeEPKNS0_21ModelComputeArgumentsE+0x33)[0x7f942bcfd8e3]
[acn100:1330072] [11] /usr/local/lib/libkim-api.so.2(_ZNK3KIM19ModelImplementation12ModelComputeEPKNS_16ComputeArgumentsE+0x4a0)[0x7f943578f900]
[acn100:1330072] [12] /usr/local/lib/libkim-api.so.2(_ZNK3KIM19ModelImplementation7ComputeEPKNS_16ComputeArgumentsE+0x667)[0x7f943579de67]
[acn100:1330072] [13] /usr/local/lib/liblammps.so.0(_ZN9LAMMPS_NS7PairKIM7computeEii+0x20d)[0x7f9436ad4c2d]
[acn100:1330072] [14] /usr/local/lib/liblammps.so.0(_ZN9LAMMPS_NS6Verlet5setupEi+0x3a2)[0x7f9436a1a982]
[acn100:1330072] [15] /usr/local/lib/liblammps.so.0(_ZN9LAMMPS_NS3Run7commandEiPPc+0xc7e)[0x7f94369afb8e]
[acn100:1330072] [16] /usr/local/lib/liblammps.so.0(_ZN9LAMMPS_NS5Input15execute_commandEv+0xc0f)[0x7f943682796f]
[acn100:1330072] [17] /usr/local/lib/liblammps.so.0(_ZN9LAMMPS_NS5Input4fileEv+0x175)[0x7f9436827c25]
[acn100:1330072] [18] lammps(+0x13fe)[0x564c5f73e3fe]
[acn100:1330072] [19] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f9435e2c083]
[acn100:1330072] [20] lammps(+0x148e)[0x564c5f73e48e]
[acn100:1330072] *** End of error message ***
Traceback (most recent call last):
  File "../../td/ClusterEnergyAndForces__TD_000043093022_003/runner", line 46, in run_lammps
    lammps_process = subprocess.check_call(
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['lammps', '-in', 'isolated_atom.lammps.Si.in']' died with .

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "../../td/ClusterEnergyAndForces__TD_000043093022_003/runner", line 217, in
    isolated_atom_energies[symbol] = get_isolated_atom_energy(
  File "../../td/ClusterEnergyAndForces__TD_000043093022_003/runner", line 72, in get_isolated_atom_energy
    run_lammps(templated_input, lammps_output)
  File "../../td/ClusterEnergyAndForces__TD_000043093022_003/runner", line 56, in run_lammps
    raise Exception("LAMMPS did not exit properly:\n" + extrainfo)
Exception: LAMMPS did not exit properly:
LAMMPS (2 Aug 2023 - Update 1)

Command exited with non-zero status 1
{"realtime":3.95,"usertime":2.51,"systime":0.97,"memmax":273780,"memavg":0}