--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host: acn219
  Local adapter: mlx5_0
  Local port: 1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them). This is most certainly not what you wanted. Check your
cables, subnet manager configuration, etc. The openib BTL will be
ignored for this job.

  Local host: acn219
--------------------------------------------------------------------------
Exception: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__.py", line 17, in forward
    contributions: Tensor) -> Tuple[Tensor, Optional[Tensor]]:
    model = self.model
    _0 = (model).forward(x, pos, edge_graph0, edge_graph1, edge_graph2, contributions, )
         ~~~~~~~~~~~~~~ <--- HERE
    scale_by = self.scale_by
    energy = torch.mul(_0, scale_by)
  File "code/__torch__/nequip/nn/_graph_mixin.py", line 26, in forward
    contributing: Tensor) -> Tensor:
    one_hot = self.one_hot
    x_embed = (one_hot).forward(torch.squeeze(x, -1), )
              ~~~~~~~~~~~~~~~~ <--- HERE
    _0 = ops.prim.dtype(pos)
    x_embed0 = torch.to(x_embed, ops.prim.device(pos), _0)
  File "code/__torch__/nequip/nn/embedding/_one_hot.py", line 13, in forward
    x: Tensor) -> Tensor:
    num_types = self.num_types
    return torch.one_hot(x, num_types)
           ~~~~~~~~~~~~~ <--- HERE

Traceback of TorchScript, original code (most recent call last):
  File "/home/amit/Projects/COLABFIT/nequip/Si/trained_model/what_worked.py", line 62, in forward
    def forward(self,x,pos,edge_graph0,edge_graph1,edge_graph2,contributions):
        energy = self.model(x,pos,edge_graph0,edge_graph1,edge_graph2,contributions) * self.scale_by
                 ~~~~~~~~~~ <--- HERE
        forces, = torch.autograd.grad([energy.sum()], [pos])
        energy = energy - self.Si_REF
  File "/home/amit/Projects/COLABFIT/nequip/nequip/nequip/nn/_graph_mixin.py", line 594, in forward
        # Embedding
        x_embed = self[0](x.squeeze(-1))
                  ~~~~~~~ <--- HERE
        x_embed = x_embed.to(dtype=pos.dtype, device=pos.device)
        h = x_embed
  File "/home/amit/Projects/COLABFIT/nequip/nequip/nequip/nn/embedding/_one_hot.py", line 40, in forward
    def forward(self, x):
        one_hot = torch.nn.functional.one_hot(x, num_classes=self.num_types)
                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        return one_hot
RuntimeError: Class values must be smaller than num_classes.

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
Traceback (most recent call last):
  File "../../td/ClusterEnergyAndForces__TD_000043093022_003/runner", line 46, in run_lammps
    lammps_process = subprocess.check_call(
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['lammps', '-in', 'isolated_atom.lammps.Si.in']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "../../td/ClusterEnergyAndForces__TD_000043093022_003/runner", line 217, in <module>
    isolated_atom_energies[symbol] = get_isolated_atom_energy(
  File "../../td/ClusterEnergyAndForces__TD_000043093022_003/runner", line 72, in get_isolated_atom_energy
    run_lammps(templated_input, lammps_output)
  File "../../td/ClusterEnergyAndForces__TD_000043093022_003/runner", line 56, in run_lammps
    raise Exception("LAMMPS did not exit properly:\n" + extrainfo)
Exception: LAMMPS did not exit properly:
LAMMPS (2 Aug 2023 - Update 1)

Command exited with non-zero status 1
{"realtime":3.97,"usertime":2.20,"systime":0.94,"memmax":348720,"memavg":0}