Machine Learning Interatomic Potential (MLIP) project using the ChemDX database
🏆 Developed as part of the KRICT ChemDX Hackathon 2025
Can we systematically supply missing data that greatly improves MLIPs?
-
Many machine learning interatomic potentials (MLIPs) struggle with poor transferability and instability when simulations explore configurations far from equilibrium. Traditional datasets are heavily biased toward near-equilibrium or randomly-perturbed structures, leaving transition states and reaction pathways underrepresented.
-
This project investigates whether explicitly adding minimum energy path (MEP) data from Nudged Elastic Band (NEB) calculations can systematically enhance MLIP performance, particularly for dynamic simulations.
We focused on surface adsorption systems:
| System | Surface Type | Adsorbate | Stable Site |
|---|---|---|---|
| Au on Al(100) | Pure metal surface | Au | Hollow |
| Au on AlPd(100) | Alloy surface | Au | Hollow |
These systems provide well-defined diffusion pathways that are ideal for evaluating the impact of NEB data on MLIP performance.
Atomic configurations were generated using Atomic Simulation Environment (ASE) with three complementary sampling strategies:
| Method | Purpose | Configuration Type |
|---|---|---|
| Relaxation | Stable geometries | Energy minima |
| Molecular Dynamics (MD) | Thermal fluctuations | Near-equilibrium |
| NEB | Diffusion pathways | Transition states |
Two datasets were constructed:
| Dataset | Sampling Methods | Configuration Space Coverage |
|---|---|---|
| Set #1 | Relaxation + MD | Near-equilibrium only |
| Set #2 | Relaxation + MD + NEB | Near-equilibrium + Transition states |
We trained neural network potentials using the Atomistic Machine-learning Package (AMP).
| Model | Training Data | Purpose |
|---|---|---|
| Model #1 | Set #1 (Relax + MD) | Baseline MLIP |
| Model #2 | Set #2 (Relax + MD + NEB) | NEB-enhanced MLIP |
Both models used identical network architectures to ensure that performance differences arise from data quality, not model complexity.
Models were evaluated using ANN-driven Molecular Dynamics simulations.
Evaluation metrics included:
- Force prediction error
- Energy conservation during MD
- Structural stability under finite temperature simulations
Including NEB configurations significantly expands coverage of high-energy and transition-state regions, which are missing in conventional MD-only datasets.
The NEB-enhanced MLIP demonstrated:
- Lower force errors even for near-equilibrium MD trajectories without diffusion process
- Stable long MD trajectories
- Proper energy conservation
| Model Type | Force Error | Energy Conservation |
|---|---|---|
| Relax + MD | High | ❌ |
| Relax + MD + NEB | Low | ✅ |
- Tutorial: Au on Al(100) trained on set 1
- Tutorial: Au on Al(100) trained on set 2
- Tutorial: Au on AlPd(100) trained on set 1
- Tutorial: Au on AlPd(100) trained on set 2
Strategically adding transition-state data can be more effective than increasing model size.
This work highlights a data-centric pathway to improving MLIPs: identifying physically important but under-sampled regions (like saddle points) and systematically incorporating them into the training set.
| Tool | Role |
|---|---|
| ASE | Structure generation, Relaxation, MD, NEB |
| AMP | Neural network potential training |
| ChemDX Database | Initial structures and metadata |
| Python | Workflow orchestration |
| NumPy / Matplotlib | Data analysis and visualization |
jupyter_notebook/
├── Au_on_Al/
│ ├── 01_data_generation.ipynb
│ ├── 02_training_MD_set_1.ipynb
│ └── 02_training_MD_set_2.ipynb
│
├── Au_on_AlPd/
│ ├── 01_data_generation.ipynb
│ ├── 02_training_MD_set_1.ipynb
│ └── 02_training_MD_set_2.ipynb
docs/images/ # Figures and animations used in README
- Install dependencies
- Run data generation notebooks
- Train MLIP models for Set #1 and Set #2
- Run MD evaluation notebooks to compare stability and accuracy
Adding NEB-based transition-state data leads to substantial improvements in MLIP performance by:
- Expanding configuration space coverage
- Reducing force errors in off-equilibrium regions
- Enabling stable and physically reliable MD simulations
This demonstrates that targeted data augmentation is a powerful strategy for building more transferable and stable ML interatomic potentials.
Developed during the KRICT ChemDX Hackathon 2025.
We thank the ChemDX team for organizing the event and providing access to the database.
For questions or contributions, please contact:
- In Won Yeu (GitHub: @InWonYeu)
- Jaeseon Kim (GitHub: @kasee0)
- Seojin Jeon (GitHub: @sjjeon6783)




