# Model Card for alex_mp_20_base

## Model Details

### Model Description
alex_mp_20_base is an unconditional generative model designed for the generation of valid inorganic crystal structures. It serves as a foundational pre-trained model for the CrystaLLM-pi framework. Based on a GPT-2 decoder-only architecture, it is trained on a massive combined corpus of Crystallographic Information Files (CIFs) to learn the syntax, symmetry, and chemical rules governing crystalline matter.
This model does not accept property conditioning vectors. It generates structures based on text prompts (e.g., chemical composition or space group) or unconditionally (ab-initio generation).
- Developed by: Bone et al. (University College London)
- Model type: Autoregressive Transformer (GPT-2)
- Language(s): CIF (Crystallographic Information File) syntax
- License: MIT
### Model Sources
- Repository: GitHub: CrystaLLM-pi
- Paper: Discovery and recovery of crystalline materials with property-conditioned transformers (arXiv:2511.21299)
- Dataset: HuggingFace: c-bone/alex_mp_20
## Uses

### Direct Use
The model is intended for:
- Unconditional Generation: Exploring the general chemical space of stable crystals with 20 atoms or fewer in the unit cell.
- Composition/Space Group Completion: Generating valid structures given a partial prompt (e.g., a chemical formula).
- Fine-tuning base: Serving as the pre-trained initialization for property-conditional models.
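CrystaLLM-style models take the opening lines of a CIF as the prompt and complete the rest autoregressively. A minimal sketch of prompt construction, assuming the `data_<formula>` header convention used by CrystaLLM (the function names here are illustrative, not part of the released code):

```python
# Illustrative sketch: building text prompts for composition completion
# and unconditional generation. Assumes CIFs start with a "data_<formula>"
# header line, which the model then completes token by token.

def make_composition_prompt(formula: str) -> str:
    """Prompt the model to complete a full CIF for a given cell formula."""
    return f"data_{formula}\n"

def make_unconditional_prompt() -> str:
    """Bare header for ab-initio generation (assumed convention)."""
    return "data_"

prompt = make_composition_prompt("Na2Cl2")  # "data_Na2Cl2\n"
```

The returned string is then tokenized and fed to the decoder, which samples the remainder of the CIF (cell parameters, space group, atomic sites).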
### Out-of-Scope Use
- Property Conditioning: This model cannot be steered by properties like band gap or density. Use the specific fine-tuned variants for those tasks.
- Large Unit Cells: The model is strictly trained on and intended for unit cells containing 20 atoms or fewer.
## Bias, Risks, and Limitations
- Training Distribution: The model reflects the biases present in the Alexandria and Materials Project datasets. It is heavily biased toward theoretical, DFT-relaxed inorganic compounds rather than experimentally synthesized disordered structures.
- Size Constraint Bias: Because it is trained exclusively on the alex_mp_20 subset, the model has a strong prior for generating small, highly symmetric unit cells (≤ 20 atoms) and will struggle to extrapolate to larger, more complex systems.
- Validity: While it learns CIF syntax robustly, it may still generate physically invalid structures (e.g., overlapping atoms) or chemically unstable compositions.
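The overlapping-atoms failure mode can be screened with a simple periodic distance check. A minimal sketch, assuming a cubic cell and an illustrative 0.5 Å threshold (neither the cell shape restriction nor the threshold comes from the paper):

```python
import math

def min_pairwise_distance(frac_coords, cell_length):
    """Minimum interatomic distance (Å) between sites in a cubic cell of
    edge `cell_length`, using the minimum-image convention for periodicity."""
    best = float("inf")
    n = len(frac_coords)
    for i in range(n):
        for j in range(i + 1, n):
            d2 = 0.0
            for a, b in zip(frac_coords[i], frac_coords[j]):
                df = abs(a - b) % 1.0
                df = min(df, 1.0 - df)          # nearest periodic image
                d2 += (df * cell_length) ** 2
            best = min(best, math.sqrt(d2))
    return best

def has_overlapping_atoms(frac_coords, cell_length, tol=0.5):
    """Flag structures whose closest atom pair is under `tol` Å."""
    return min_pairwise_distance(frac_coords, cell_length) < tol
```

For example, a rocksalt-like pair at (0, 0, 0) and (0.5, 0.5, 0.5) in a 4 Å cell is ~3.46 Å apart and passes, while two sites 0.04 Å apart are flagged. A production check would use a full structure toolkit (e.g., pymatgen) to handle general lattices.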
## Training Details

### Training Data
The model was pre-trained on the alex_mp_20 dataset (c-bone/alex_mp_20), a massive curated subset combining the Alexandria and Materials Project databases, restricted to crystal structures containing 20 atoms or fewer per unit cell.
- Source: Alexandria and Materials Project (via c-bone/alex_mp_20)
- Preprocessing: CIFs are filtered for size (≤ 20 atoms), deduplicated, augmented (with symmetry operations and fractional coordinate shifts), and tokenized.
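The filter-and-deduplicate step can be sketched as follows. This is an illustrative outline only: the real pipeline operates on full CIFs and uses a more careful structure-matching criterion than the crude composition/space-group key assumed here, and the record field names are invented for the example:

```python
def preprocess(records, max_atoms=20):
    """Illustrative sketch of the size filter and deduplication described
    above. Each record is assumed to be a dict with 'formula', 'spacegroup',
    and 'num_atoms' fields (hypothetical schema)."""
    seen, kept = set(), []
    for rec in records:
        if rec["num_atoms"] > max_atoms:           # enforce ≤ 20 atoms/cell
            continue
        key = (rec["formula"], rec["spacegroup"])  # crude duplicate key
        if key in seen:
            continue
        seen.add(key)
        kept.append(rec)
    return kept
```

Augmentation (symmetry operations, fractional coordinate shifts) would then be applied to the surviving structures before tokenization.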
### Training Procedure
- Architecture: GPT-2-style decoder-only Transformer, small configuration (~25.9M parameters).
- Objective: Causal Language Modeling (Next-token prediction).
- Loss Function: Cross-entropy with specific weighting for fixed syntax tokens to accelerate learning of the CIF format.
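Token-weighted cross-entropy can be sketched in plain Python as below. The weighting scheme (which tokens get which weights) is an assumption for illustration; the card only states that fixed syntax tokens are weighted differently:

```python
import math

def weighted_cross_entropy(logit_rows, targets, weights):
    """Per-token weighted cross-entropy: each position's negative
    log-likelihood is scaled by a weight (e.g., a different weight for
    fixed CIF syntax tokens than for content tokens), then averaged by
    total weight. Weight values here are illustrative."""
    total, wsum = 0.0, 0.0
    for logits, t, w in zip(logit_rows, targets, weights):
        z = max(logits)                               # for numerical stability
        log_norm = z + math.log(sum(math.exp(x - z) for x in logits))
        nll = log_norm - logits[t]                    # -log softmax(logits)[t]
        total += w * nll
        wsum += w
    return total / wsum
```

A sanity check: with uniform logits over a vocabulary of size V, every position's loss is ln V, so the weighted average is ln V regardless of the weights.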
## Evaluation

### Metrics
The model is evaluated based on:
- Validity: The rate at which generated sequences can be parsed as valid CIF files.
- Structural Consistency: Adherence to space group symmetry and reasonable bond lengths.
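The validity metric reduces to a parse-success rate over generated samples. A minimal sketch, using a deliberately crude syntactic proxy in place of a real CIF parser (a real evaluation would use a full parser such as pymatgen's `CifParser`):

```python
def looks_like_cif(text: str) -> bool:
    """Crude syntactic proxy for CIF validity: correct header plus a
    mandatory cell-parameter tag. Stands in for a real parser here."""
    return text.startswith("data_") and "_cell_length_a" in text

def validity_rate(samples) -> float:
    """Fraction of generated sequences that pass the (proxy) parse check."""
    if not samples:
        return 0.0
    return sum(looks_like_cif(s) for s in samples) / len(samples)
```

Structural consistency checks (space-group adherence, bond lengths) would then be applied only to the samples that parse.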
### Results
By leveraging the much larger chemical space of the combined Alexandria + MP datasets, the base model achieves high validity rates for small unit cells and generates chemically plausible, diverse structures. It provides a robust initialization for downstream tasks that require small unit cells and a high rate of novel, theoretically stable candidates.
## Citation
```bibtex
@misc{bone2025discoveryrecoverycrystallinematerials,
      title={Discovery and recovery of crystalline materials with property-conditioned transformers},
      author={Cyprien Bone and Matthew Walker and Kuangdai Leng and Luis M. Antunes and Ricardo Grau-Crespo and Amil Aligayev and Javier Dominguez and Keith T. Butler},
      year={2025},
      eprint={2511.21299},
      archivePrefix={arXiv},
      primaryClass={cond-mat.mtrl-sci},
      url={https://arxiv.org/abs/2511.21299},
}
```