# Torch ML Model Driver

The `TorchML` model driver is the base driver needed for machine-learning-based portable models. It provides an interface between the KIM-API and TorchScript models. It works with any TorchScript-compatible model whose `forward` method follows one of the following calling conventions:

1. `forward(self, species, coords, n_neigh, nlist, contributing)`
2. `forward(self, descriptor)`
3. `forward(self, species, coords, graph_layer1, graph_layer2, ..., contributing)`

Pattern 1 is for more conventional models that require raw information about the system, namely species, coordinates, number of neighbors, neighbor list, and contributing atoms. Pattern 2 is for descriptor-based models, where the descriptor is computed by the KIM-API and passed to the model; the descriptor computation is done by the `libdescriptor` library, a C++ library for computing descriptors and their gradients. Pattern 3 is for graph neural networks (GNNs), where the model takes the species, coordinates, and graph layers as input. The graph layers are computed by the model driver itself, which uses the staged-graphs approach for parallelization. For GNN models, only the PyTorch Geometric library is supported, as the Deep Graph Library (DGL) did not support TorchScript at the time this driver was released.

Models supported by this driver provide either energy, or energy and forces, as output. If the model does not provide forces, the model driver computes them from the energy. If the energy is not a scalar, the model driver sums the energy tensor to obtain the total energy and assigns the per-atom energies to the contributing atoms.

*(Diagram: flow of information in the model driver.)*

## Dependencies

This model driver depends on several libraries that must be provided by the user at runtime. The core requirement of the ML model driver is the `libtorch` library, which provides the interface between the driver's C++ API and TorchScript models. For GNNs, the Torch model uses the [PyTorch Geometric library](https://github.com/pyg-team/pytorch_geometric); the C++ API of PyTorch Geometric depends on the `torch-scatter` and `torch-sparse` libraries. The `libdescriptor` library is used for descriptor-based models.

Summary of dependencies:

- libtorch (CXX11 ABI, v1.13)
- KIM-API (v2.3)
- libdescriptor (0.0.7)
- Enzyme AD (0.0.94)
- libtorchscatter
- libtorchsparse

> If your compute environment does not contain these dependencies, they can be installed using the `install_dependencies.sh` script provided with the model driver source. This script installs all dependencies in the current working directory. To activate the environment, source the generated `env.sh` file (or copy its contents into your `.bashrc` file to automatically initialize the environment in the future). For more detailed instructions on installing dependencies, see below.

## Install

If all dependencies are met, installation should be as simple as calling the appropriate `kim-api-collections-management install` command. Your shell environment should provide the variables required for dependency resolution, namely:

1. `TORCH_ROOT`
2. `TorchScatter_ROOT`
3. `TorchSparse_ROOT`
4. `LIBDESCRIPTOR_ROOT`

`libtorch` is simple to install: download the libtorch binaries from the PyTorch website and place them in the appropriate system paths (i.e. `PATH`, `LD_LIBRARY_PATH`, and `INCLUDE`).
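For example, assuming the dependencies were installed under `$HOME/ml-deps` (an illustrative location; the directory names below are placeholders, not fixed by the driver), the environment setup and install step might look like:

```shell
# Illustrative locations; adjust to wherever the dependencies actually live
export TORCH_ROOT=$HOME/ml-deps/libtorch
export TorchScatter_ROOT=$HOME/ml-deps/torch-scatter
export TorchSparse_ROOT=$HOME/ml-deps/torch-sparse
export LIBDESCRIPTOR_ROOT=$HOME/ml-deps/libdescriptor

# Make the shared libraries discoverable at runtime
export LD_LIBRARY_PATH=$TORCH_ROOT/lib:$LIBDESCRIPTOR_ROOT/lib:$LD_LIBRARY_PATH

# Install the driver into the user collection
# ("TorchML__MD_000000000000_000" is a placeholder; use the driver's actual KIM ID)
kim-api-collections-management install user ./TorchML__MD_000000000000_000
```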
For GPU support, download the CUDA-enabled binaries for libtorch, along with the cuDNN library that libtorch depends on. cuDNN can be downloaded, after registering, from the [NVIDIA website](https://developer.nvidia.com/rdp/cudnn-archive).

## Environment Variables

The model driver uses the following environment variables for enhanced functionality:

### Compile Time Variables

1. `KIM_MODEL_MPI_AWARE` - If set to `yes` (*case-sensitive*) during driver installation, the model driver will be built with additional MPI support and will require a valid MPI environment to be present at installation time. With this additional MPI support enabled, the driver determines at runtime the number `n` of available GPUs on each node and the MPI rank of each parallel process, and assigns GPUs to the MPI ranks on the node in round-robin fashion. That is, if the simulator was launched with `m` MPI ranks (e.g. m=10 if the job was launched with `mpirun -np 10`) on a node with `n` GPUs, then each MPI rank is assigned GPU number [rank_id mod n] (where 0 <= rank_id < `m`).

2. `KIM_MODEL_DISABLE_GRAPH` - If this environment variable is defined (irrespective of its value), the model driver will be built without graph support. This means that at build time it will not try to find and link against the `libtorchscatter` and `libtorchsparse` libraries, and it will not support models with pattern 3.

### Runtime Variables

1. `KIM_MODEL_ELEMENTS_MAP` - If set to any value at runtime, enables mapping of elements to their atomic numbers.

2. `KIM_MODEL_EXECUTION_DEVICE` - If set to `cuda` at runtime, enables evaluation of the Torch model on the GPU.

```shell
export KIM_MODEL_EXECUTION_DEVICE="cuda"
# Set visible devices if needed
export CUDA_VISIBLE_DEVICES=0,1,2
```

The TorchML model driver is inherently compatible with LAMMPS domain decomposition, so enabling distributed GPU support simply involves running LAMMPS with multiple ranks. At present the Torch model resides on the GPU independently of LAMMPS, so the following points should be kept in mind:

1. You need not compile LAMMPS with GPU support, as the TorchML driver only interacts with LAMMPS via the KIM interface, which is CPU-only.
2. As every evaluation requires copying data from CPU to GPU and back, large system sizes may be needed to see the benefits of the GPU.

## Known Installation Issues

During installation, you might encounter the following error messages:

### 1. `Could not locate pthreads/Threads.cmake` etc.

Modern Linux installations come with a valid POSIX threads library. On some HPC systems with minimal or older Linux installations, you might get the above error because the default compiler is a minimal CC wrapper that cannot detect it. The easiest way forward is to provide valid C and C++ compilers, e.g.:

```shell
CC=mpicc CXX=mpic++ bash install_dependencies.sh
# or
CC=gcc CXX=g++ cmake ..
# etc.
```

### 2. `libcuda.so.1: cannot open shared object file`

CUDA installations come with two sets of libraries: `libcudart.so`, which is the actual CUDA runtime, and `libcuda.so`, the driver API library, for which the toolkit ships stub versions for linking on machines without a GPU driver. The easiest workarounds are to

1. compile your code on an execution node, or
2. symlink the stub `libcuda.so` as `libcuda.so.1` in a local location for compiling purposes (see the sketch below), or
3. set up the CUDA environment properly: the stubs are kept at `$CUDA_ROOT/lib64/stubs`; add that directory to `LD_LIBRARY_PATH`.
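As a sketch of workarounds 2 and 3, assuming `$CUDA_ROOT` points at your CUDA toolkit installation:

```shell
# Workaround 3: point the loader at the stubs shipped with the toolkit
export LD_LIBRARY_PATH=$CUDA_ROOT/lib64/stubs:$LD_LIBRARY_PATH

# Workaround 2: symlink the stub as libcuda.so.1 in a local directory
# used only while compiling (the stub must not be used at run time)
mkdir -p $HOME/cuda-stubs
ln -s $CUDA_ROOT/lib64/stubs/libcuda.so $HOME/cuda-stubs/libcuda.so.1
export LD_LIBRARY_PATH=$HOME/cuda-stubs:$LD_LIBRARY_PATH
```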
### 3. Compilation runs in an infinite loop

CMake versions newer than 3.18 have a bug that leads to infinite compilation loops in some cases. Use CMake <= 3.18.
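If your system provides only a newer CMake, one way to pin a compatible version (assuming `pip` is available; the 3.18 series is published on PyPI) is:

```shell
# Install a CMake from the 3.18 series and put it ahead of the system one
pip install --user 'cmake==3.18.*'
export PATH=$HOME/.local/bin:$PATH
cmake --version   # should now report 3.18.x
```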