Atompack: A Fast Storage Layer for Atomistic ML Training

Community Article Published June 11, 2026

Atompack is a storage format and library for atomistic ML datasets. It is designed for the workload that dominates training: repeated shuffled reads, multiprocessing dataloaders, large immutable snapshots, and exportable dataset releases.

We built Atompack as part of LeMaterial, where the same datasets need to move between curation, training, benchmarking, and public distribution. The goal is to write atomistic structures once, reopen them efficiently for training, and publish the same artifacts without converting through several intermediate formats.

The project combines:

  • a Python API
  • a Rust storage engine
  • an append-only .atp format
  • read-only mmap-backed access for serving static datasets
  • batch ingestion paths for NumPy and ASE (Atomic Simulation Environment)
  • Hugging Face upload and download support in the base package

Repository: https://github.com/LeMaterial/atompack.

Why We Built Atompack

Atomistic ML pipelines often start with tools that are a great fit for scientific workflows. But the requirements change once the dataset is feeding dataloaders and training loops: shuffled reads across many epochs become the workload that matters most.

This gets even more noticeable once datasets are distributed as large collections of shards. In practice, some training splits end up with thousands of files, for example more than 6,000 shards for the train split of OMAT24. On shared filesystems such as Lustre, many small files and random reads can create substantial metadata and I/O pressure. In many cases, that sharding pattern is partly an artifact of slow write paths and export workflows rather than something the training setup actually needs.

Atompack is aimed at that workload. The core storage unit is the whole molecule, with direct indexing into an immutable dataset snapshot. The main workflow is:

  1. write molecules or stacked array batches into an append-only .atp file
  2. call flush() to publish a new committed trailing index
  3. reopen in read-only mode with Database.open(...)
  4. read by molecule index, converting to ASE when needed

Under the hood, the file layout is simple:

  • two 4 KiB header slots
  • a data region containing molecule records
  • a trailing index written on flush()

That layout keeps appends straightforward while giving Atompack O(1) lookup through the committed index. For read-mostly datasets, Database.open(path) uses mmap-backed read-only mode by default.
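
To make the layout idea concrete, here is a toy sketch of an append-only file with a trailing index and mmap-backed reads. It is not the actual .atp format: the header fields, the index entry encoding, the use of only the first header slot, and the names write_toy_file and ToyReader are all assumptions made for illustration. It only shows why a committed index gives constant-time lookup by record number.

```python
import mmap
import os
import struct
import tempfile

HEADER_SLOT = 4096             # one 4 KiB slot, as in the layout described above
DATA_START = 2 * HEADER_SLOT   # the data region begins after the two slots
HEADER_FMT = "<QQ"             # hypothetical header: index offset, record count
ENTRY_FMT = "<QQ"              # hypothetical index entry: record offset, record length


def write_toy_file(path, records):
    """Append records, write a trailing index, then point the header at it."""
    entries = []
    with open(path, "wb") as f:
        f.write(b"\0" * DATA_START)                  # reserve both header slots
        for payload in records:
            entries.append((f.tell(), len(payload)))
            f.write(payload)                         # append-only data region
        index_offset = f.tell()
        for offset, length in entries:               # trailing index
            f.write(struct.pack(ENTRY_FMT, offset, length))
        f.seek(0)                                    # publish: header now locates the index
        f.write(struct.pack(HEADER_FMT, index_offset, len(entries)))


class ToyReader:
    """Read-only, mmap-backed access with constant-time lookup by record number."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._index_offset, self._count = struct.unpack_from(HEADER_FMT, self._map, 0)

    def __len__(self):
        return self._count

    def __getitem__(self, i):
        if not 0 <= i < self._count:
            raise IndexError(i)
        # One fixed-size index entry per record: its position is a simple multiplication.
        entry_pos = self._index_offset + i * struct.calcsize(ENTRY_FMT)
        offset, length = struct.unpack_from(ENTRY_FMT, self._map, entry_pos)
        return bytes(self._map[offset:offset + length])

    def close(self):
        self._map.close()
        self._file.close()


path = os.path.join(tempfile.mkdtemp(), "toy.bin")
write_toy_file(path, [b"H2O", b"CO2", b"CH4"])
reader = ToyReader(path)
third = reader[2]
reader.close()
```

Because every index entry has the same size, finding record i needs one index read and one data read, regardless of how many records the file holds.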

That focus also reflects the broader LeMaterial workflow. The project is not just about storing one dataset efficiently; it is about making large atomistic datasets easier to build, benchmark, publish, share, and reuse across a shared open-science ecosystem.

Try It From Hugging Face

Install from PyPI:

pip install atompack-db

The quickest way to try Atompack is to open one of the public datasets already packaged on the Hub:

import atompack

db = atompack.hub.open(
    repo_id="LeMaterial/Atompack",
    path_in_repo="lematbulk/pbe",
)

print(len(db))
mol = db[0]
print(mol.energy)
print(mol.positions.shape)

API for Dataset Workflows

Write a dataset and reopen it for reads:

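import atompack
import numpy as np

### Create a dataset with your data...
# 32 molecules of 64 atoms each: positions have shape (n_molecules, n_atoms, 3)
positions = np.random.rand(32, 64, 3).astype(np.float32)
# Atomic number 6 (carbon) for every atom
atomic_numbers = np.full((32, 64), 6, dtype=np.uint8)

# Write the batch and commit it with flush()
db = atompack.Database("train.atp", overwrite=True)
db.add_arrays_batch(positions, atomic_numbers)
db.flush()

# Reopen read-only and read molecules by index
db = atompack.Database.open("train.atp")
for i in range(4):
    mol = db[i]
    print(i, len(mol), mol.positions.shape)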

### ... Or use existing datasets from Hugging Face
remote_db = atompack.hub.open(
    repo_id="LeMaterial/Atompack",
    path_in_repo="omat/train",
)
print(len(remote_db))
print(remote_db[0].energy)
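
Since the database supports len() and integer indexing, as in the examples above, shuffled reads across several workers can be organised with plain index arithmetic. The following is a minimal sketch under that assumption. The helper names (shuffled_indices, worker_slice, iter_epoch) are hypothetical and not part of Atompack, and a plain list stands in for an opened database so that the snippet runs on its own.

```python
import random


def shuffled_indices(n_items, epoch, seed=0):
    """Return a reproducible permutation of range(n_items) for one epoch."""
    rng = random.Random(seed * 1_000_003 + epoch)
    order = list(range(n_items))
    rng.shuffle(order)
    return order


def worker_slice(order, worker_id, num_workers):
    """Give each worker a disjoint, strided share of the epoch's order."""
    return order[worker_id::num_workers]


def iter_epoch(dataset, epoch, worker_id=0, num_workers=1, seed=0):
    """Yield items from any object that supports len() and integer indexing."""
    order = shuffled_indices(len(dataset), epoch, seed)
    for i in worker_slice(order, worker_id, num_workers):
        yield dataset[i]


# Stand-in for a database opened read-only; a list exposes the same interface.
fake_db = [f"mol-{i}" for i in range(10)]
seen = []
for w in range(3):
    seen.extend(iter_epoch(fake_db, epoch=0, worker_id=w, num_workers=3))
```

Every worker derives the same permutation from the seed and epoch, so the slices are disjoint and together cover the dataset exactly once per epoch.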

If your pipeline already uses ASE, you can ingest structures directly:

import atompack
from ase import Atoms

structures = [
    Atoms("H2O", positions=[[0, 0, 0], [1, 0, 0], [0, 1, 0]]),
    Atoms("CO2", positions=[[0, 0, 0], [1.16, 0, 0], [-1.16, 0, 0]]),
]

db = atompack.Database("ase_data.atp", overwrite=True)
atompack.add_ase_batch(db, structures, batch_size=256)
db.flush()

To upload a dataset to Hugging Face:

import atompack

atompack.hub.upload(
    "exports/omat/train",
    repo_id="org/atompack-demo",
    path_in_repo="omat/train",
)

Performance on Read-Heavy Workloads

The benchmarks show that Atompack performs well on read-heavy dataset serving. Write throughput is strong with the native batch APIs, and artifact size stays close to HDF5 SOA while remaining much smaller than the LMDB and ASE baselines used in this repository.

Figure: Atompack read throughput benchmark

Benchmark setup: this slice uses synthetic fixed-size records with 64 atoms per molecule. The high-throughput read benchmark uses 1M generated molecules and times read loops on local NVMe storage (Samsung 990 EVO Plus SSD). The random/shuffled figure is measured on the single-worker shuffled-read path.

  • 646k mol/s on sequential reads
  • 446k mol/s on the random/shuffled read path (single worker)
  • 24.0x faster than HDF5 SOA on the random or shuffled path
  • 2.81x faster than LMDB Packed on the random or shuffled path
  • 3.82x faster than LMDB Pickle on the random or shuffled path
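
For a sense of scale, the speedup factors can be turned back into approximate baseline throughputs. This is only arithmetic on the rounded figures listed above, so the results are rough estimates rather than measured values.

```python
atompack_shuffled = 446_000  # mol/s, single-worker shuffled reads (from the list above)

# Speedup of Atompack over each baseline on the shuffled path (from the list above)
speedups = {
    "HDF5 SOA": 24.0,
    "LMDB Packed": 2.81,
    "LMDB Pickle": 3.82,
}

# Implied baseline throughput = Atompack throughput / speedup factor
implied = {name: atompack_shuffled / factor for name, factor in speedups.items()}

for name, rate in implied.items():
    print(f"{name}: about {rate / 1000:.0f}k mol/s")
# HDF5 SOA: about 19k mol/s
# LMDB Packed: about 159k mol/s
# LMDB Pickle: about 117k mol/s
```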

Figure: Atompack write throughput overview

Write throughput is strong as well. On the same 64-atom NVMe slice, Atompack reaches:

  • 105,473 mol/s for builtin-field writes
  • 77,193 mol/s when writing additional custom properties
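
As a back-of-the-envelope reading of these rates, the snippet below estimates how long writing one million molecules would take and how much throughput the custom properties cost. The one-million figure is borrowed from the read benchmark and is an assumption here. Real runs will vary with hardware and record size.

```python
n_molecules = 1_000_000  # assumed dataset size, borrowed from the read benchmark

builtin_rate = 105_473   # mol/s, builtin-field writes (from the list above)
custom_rate = 77_193     # mol/s, with additional custom properties (from the list above)

builtin_seconds = n_molecules / builtin_rate
custom_seconds = n_molecules / custom_rate
relative_drop = 1 - custom_rate / builtin_rate

print(f"builtin fields: {builtin_seconds:.1f} s")      # builtin fields: 9.5 s
print(f"custom properties: {custom_seconds:.1f} s")    # custom properties: 13.0 s
print(f"throughput drop: {relative_drop:.0%}")         # throughput drop: 27%
```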

Figure: Atompack write storage efficiency comparison

Storage footprint stays near the compact end of the comparison set:

  • HDF5 SOA: 0.96x Atompack size on builtins and 0.95x on the custom-property slice
  • Atompack: 1.00x
  • LMDB Packed: 2.34x builtins and 1.35x custom
  • LMDB Pickle: 2.35x builtins and 1.35x custom
  • ASE SQLite: 3.05x builtins and 2.08x custom
  • ASE LMDB: 4.69x builtins and 2.69x custom

While Atompack is not always the absolute smallest representation, the main result is that it stays in the compact-storage regime while pairing that with much stronger read behavior.

Similar behavior was observed on Lustre, NFS, and GPFS filesystems.

Figure: Atompack reads across filesystems

Public Datasets on Hugging Face

We also provide public datasets in the Atompack format on Hugging Face: https://huggingface.co/datasets/LeMaterial/Atompack. The dataset paths used in the examples above, lematbulk/pbe and omat/train, are among those exposed there.

If you use any of these datasets, please cite the original dataset authors. The Atompack repository is a packing and serving layer, not the original source of the data.

When to Use Atompack

Atompack is a good fit when the storage layer itself has become a bottleneck: large datasets, random reads, many worker processes, and repeated conversion or publish steps. It is not trying to replace the rest of the scientific Python ecosystem. It is focused on atomistic ML workloads that need a faster and simpler path between dataset creation, dataset serving, and publication. When the bottleneck is instead in graph construction, feature computation, or model training, existing tools are already a good fit. We built Atompack to fill that specific gap and hope it can support faster, more efficient training pipelines that push the state of the art in atomistic ML.
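
One rough way to tell which case applies is to time reads separately from per-item processing. The sketch below does that for any indexable dataset. The helper name time_share_of_reads and the stand-in featurizer are hypothetical, and a list of lists replaces a real database so that the snippet runs on its own.

```python
import time


def time_share_of_reads(dataset, indices, process):
    """Return (read_seconds, process_seconds) accumulated over the given indices."""
    read_s = 0.0
    proc_s = 0.0
    for i in indices:
        t0 = time.perf_counter()
        item = dataset[i]          # storage access
        t1 = time.perf_counter()
        process(item)              # everything downstream of the read
        t2 = time.perf_counter()
        read_s += t1 - t0
        proc_s += t2 - t1
    return read_s, proc_s


def fake_featurize(item):
    # Placeholder for graph construction or feature computation
    return sum(x * x for x in item)


fake_dataset = [list(range(200)) for _ in range(500)]
read_s, proc_s = time_share_of_reads(
    fake_dataset, range(len(fake_dataset)), fake_featurize
)
read_fraction = read_s / (read_s + proc_s)
print(f"share of time spent reading: {read_fraction:.1%}")
```

If the read share is small on a representative sample of shuffled indices, a faster storage layer is unlikely to change end-to-end training time much.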

Additional Resources

Citations

If you use one of the packaged datasets or the tools mentioned in this article, please cite the original authors.
