--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:    acn100
  Local adapter: mlx5_0
  Local port:    1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   acn100
  Local device: mlx5_0
--------------------------------------------------------------------------
Exception: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__.py", line 16, in forward
    contributions: Tensor) -> Tuple[Tensor, Optional[Tensor]]:
    model = self.model
    _0 = (model).forward(species, coords, edge_index0, edge_index1, edge_index2, contributions, )
         ~~~~~~~~~~~~~~ <--- HERE
    E, F, = _0
    n_atoms = torch.sum(torch.sub(contributions, 1))
  File "code/__torch__/mace/modules/models.py", line 31, in batch_norm
    num_graphs = torch.numel(n_contributing)
    num_elements = self.num_elements
    node_attr = torch.one_hot(x, annotate(int, num_elements))
                ~~~~~~~~~~~~~ <--- HERE
    _2 = torch.to(node_attr, ops.prim.dtype(pos))
    node_attr0 = torch.to(_2, ops.prim.device(pos))

Traceback of TorchScript, original code (most recent call last):
  File "", line 40, in forward
    def forward(self, species, coords, edge_index0, edge_index1, edge_index2, contributions):
        E, F = self.model(species, coords, edge_index0, edge_index1, edge_index2, contributions)
               ~~~~~~~~~~ <--- HERE
        n_atoms = torch.sum(contributions - 1)
        E = E + self.si_ref * n_atoms
  File "/home/amit/Projects/COLABFIT/mace/mace/mace/modules/models.py", line 195, in batch_norm
    num_graphs = n_contributing.numel()
    node_attr = torch.nn.functional.one_hot(x, num_classes=self.num_elements)
                ~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    node_attr = node_attr.to(pos.dtype).to(pos.device)
    node_attr.requires_grad_(True)
RuntimeError: Class values must be smaller than num_classes.

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
Traceback (most recent call last):
  File "../../td/ClusterEnergyAndForces__TD_000043093022_003/runner", line 46, in run_lammps
    lammps_process = subprocess.check_call(
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['lammps', '-in', 'isolated_atom.lammps.Si.in']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "../../td/ClusterEnergyAndForces__TD_000043093022_003/runner", line 217, in
    isolated_atom_energies[symbol] = get_isolated_atom_energy(
  File "../../td/ClusterEnergyAndForces__TD_000043093022_003/runner", line 72, in get_isolated_atom_energy
    run_lammps(templated_input, lammps_output)
  File "../../td/ClusterEnergyAndForces__TD_000043093022_003/runner", line 56, in run_lammps
    raise Exception("LAMMPS did not exit properly:\n" + extrainfo)
Exception: LAMMPS did not exit properly:
LAMMPS (2 Aug 2023 - Update 1)

Command exited with non-zero status 1
{"realtime":3.91,"usertime":2.39,"systime":0.88,"memmax":273652,"memavg":0}