--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host: acn07
  Local adapter: mlx5_0
  Local port: 1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them). This is most certainly not what you wanted. Check your
cables, subnet manager configuration, etc. The openib BTL will be
ignored for this job.

  Local host: acn07
--------------------------------------------------------------------------
Exception: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__.py", line 16, in forward
    contributions: Tensor) -> Tuple[Tensor, Optional[Tensor]]:
    model = self.model
    _0 = (model).forward(species, coords, edge_index0, edge_index1, edge_index2, contributions, )
         ~~~~~~~~~~~~~~ <--- HERE
    E, F, = _0
    n_atoms = torch.sum(torch.sub(contributions, 1))
  File "code/__torch__/mace/modules/models.py", line 31, in AD_sum_backward
    num_graphs = torch.numel(n_contributing)
    num_elements = self.num_elements
    node_attr = torch.one_hot(x, annotate(int, num_elements))
                ~~~~~~~~~~~~~ <--- HERE
    _2 = torch.to(node_attr, ops.prim.dtype(pos))
    node_attr0 = torch.to(_2, ops.prim.device(pos))

Traceback of TorchScript, original code (most recent call last):
  File "", line 40, in forward
    def forward(self, species, coords, edge_index0, edge_index1, edge_index2, contributions):
        E, F = self.model(species, coords, edge_index0, edge_index1, edge_index2, contributions)
               ~~~~~~~~~~ <--- HERE
        n_atoms = torch.sum(contributions - 1)
        E = E + self.si_ref * n_atoms
  File "/home/amit/Projects/COLABFIT/mace/mace/mace/modules/models.py", line 195, in AD_sum_backward
        num_graphs = n_contributing.numel()
        node_attr = torch.nn.functional.one_hot(x, num_classes=self.num_elements)
                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        node_attr = node_attr.to(pos.dtype).to(pos.device)
        node_attr.requires_grad_(True)
RuntimeError: Class values must be smaller than num_classes.

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
Traceback (most recent call last):
  File "../../td/ClusterEnergyAndForces__TD_000043093022_003/runner", line 46, in run_lammps
    lammps_process = subprocess.check_call(
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['lammps', '-in', 'isolated_atom.lammps.Si.in']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "../../td/ClusterEnergyAndForces__TD_000043093022_003/runner", line 217, in
    isolated_atom_energies[symbol] = get_isolated_atom_energy(
  File "../../td/ClusterEnergyAndForces__TD_000043093022_003/runner", line 72, in get_isolated_atom_energy
    run_lammps(templated_input, lammps_output)
  File "../../td/ClusterEnergyAndForces__TD_000043093022_003/runner", line 56, in run_lammps
    raise Exception("LAMMPS did not exit properly:\n" + extrainfo)
Exception: LAMMPS did not exit properly:
LAMMPS (2 Aug 2023 - Update 1)

Command exited with non-zero status 1
{"realtime":4.68,"usertime":3.14,"systime":1.14,"memmax":273936,"memavg":0}