--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA
parameter to true.

  Local host:     acn29
  Local adapter:  mlx5_0
  Local port:     1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   acn29
  Local device: mlx5_0
--------------------------------------------------------------------------
Exception: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__.py", line 17, in forward
    contributions: Tensor) -> Tuple[Tensor, Optional[Tensor]]:
    model = self.model
    _0 = (model).forward(x, pos, edge_graph0, edge_graph1, edge_graph2, contributions, )
         ~~~~~~~~~~~~~~ <--- HERE
    scale_by = self.scale_by
    energy = torch.mul(_0, scale_by)
  File "code/__torch__/nequip/nn/_graph_mixin.py", line 26, in forward
    contributing: Tensor) -> Tensor:
    one_hot = self.one_hot
    x_embed = (one_hot).forward(torch.squeeze(x, -1), )
              ~~~~~~~~~~~~~~~~ <--- HERE
    _0 = ops.prim.dtype(pos)
    x_embed0 = torch.to(x_embed, ops.prim.device(pos), _0)
  File "code/__torch__/nequip/nn/embedding/_one_hot.py", line 13, in forward
    x: Tensor) -> Tensor:
    num_types = self.num_types
    return torch.one_hot(x, num_types)
           ~~~~~~~~~~~~~ <--- HERE

Traceback of TorchScript, original code (most recent call last):
  File "/home/amit/Projects/COLABFIT/nequip/Si/trained_model/what_worked.py", line 62, in forward
    def forward(self,x,pos,edge_graph0,edge_graph1,edge_graph2,contributions):
        energy = self.model(x,pos,edge_graph0,edge_graph1,edge_graph2,contributions) * self.scale_by
                 ~~~~~~~~~~ <--- HERE
        forces, = torch.autograd.grad([energy.sum()], [pos])
        energy = energy - self.Si_REF
  File "/home/amit/Projects/COLABFIT/nequip/nequip/nequip/nn/_graph_mixin.py", line 594, in forward
    # Embedding
    x_embed = self[0](x.squeeze(-1))
              ~~~~~~~ <--- HERE
    x_embed = x_embed.to(dtype=pos.dtype, device=pos.device)
    h = x_embed
  File "/home/amit/Projects/COLABFIT/nequip/nequip/nequip/nn/embedding/_one_hot.py", line 40, in forward
    def forward(self, x):
        one_hot = torch.nn.functional.one_hot(x, num_classes=self.num_types)
                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        return one_hot
RuntimeError: Class values must be smaller than num_classes.
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
Traceback (most recent call last):
  File "../../td/ClusterEnergyAndForces__TD_000043093022_003/runner", line 46, in run_lammps
    lammps_process = subprocess.check_call(
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['lammps', '-in', 'isolated_atom.lammps.Si.in']' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "../../td/ClusterEnergyAndForces__TD_000043093022_003/runner", line 217, in
    isolated_atom_energies[symbol] = get_isolated_atom_energy(
  File "../../td/ClusterEnergyAndForces__TD_000043093022_003/runner", line 72, in get_isolated_atom_energy
    run_lammps(templated_input, lammps_output)
  File "../../td/ClusterEnergyAndForces__TD_000043093022_003/runner", line 56, in run_lammps
    raise Exception("LAMMPS did not exit properly:\n" + extrainfo)
Exception: LAMMPS did not exit properly:
LAMMPS (2 Aug 2023 - Update 1)
Command exited with non-zero status 1
{"realtime":3.82,"usertime":2.37,"systime":0.68,"memmax":348388,"memavg":0}
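
The root cause is the final RuntimeError in the TorchScript traceback: torch.nn.functional.one_hot requires every class index to lie in the range 0 <= x < num_classes, and the traceback shows self.num_types being used as num_classes. A minimal sketch that reproduces the same message outside LAMMPS, under the hypothetical assumption of a single-species Si model with num_types = 1 receiving a 1-based atom type index:

    import torch

    num_types = 1          # assumption: the deployed model was trained with one atom type (Si)
    x = torch.tensor([1])  # assumption: a 1-based LAMMPS type index reaches the embedding
    torch.nn.functional.one_hot(x, num_classes=num_types)
    # RuntimeError: Class values must be smaller than num_classes.
    # torch.tensor([0]) would succeed: valid indices are 0 .. num_types - 1.

If that matches the deployment, the type indices handed to the model need to be remapped into 0 .. num_types - 1 (or the model redeployed with a num_types that covers the largest index encountered).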
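The two Open MPI banners at the top are warnings, not the failure: the abort happens because the Python exception makes LAMMPS exit with status 1, which in turn invokes MPI_ABORT. If the openib warning is unwanted noise, the log itself names the fix, the btl_openib_allow_ib MCA parameter. A sketch of setting it from a runner script via Open MPI's OMPI_MCA_<param> environment-variable convention (equivalent to mpirun --mca btl_openib_allow_ib true); the command line is taken from the traceback above:

    import os
    import subprocess

    # Assumption: exporting OMPI_MCA_btl_openib_allow_ib only silences the
    # openib warning; it has no bearing on the one_hot RuntimeError.
    env = dict(os.environ, OMPI_MCA_btl_openib_allow_ib="true")
    subprocess.check_call(["lammps", "-in", "isolated_atom.lammps.Si.in"], env=env)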