# 🧬 MoLLaMA-Small
MoLLaMA-Small is a lightweight LLaMA-based causal language model (57.2M parameters) trained from scratch to generate valid chemical molecules as SMILES strings.
It uses DeepChem's SmilesTokenizer and was trained on a combined ZINC15 and MuMOInstruct dataset. It is designed for unconditional molecule generation.
## 📊 Model Performance
The model was evaluated on 30 randomly generated samples from the test set. It demonstrates perfect validity and high diversity in generating chemical structures.
| Metric | Score |
|---|---|
| Parameters | 57.2 M |
| Validity | 100.0% |
| Average QED | 0.6400 |
| Diversity | 0.8363 |
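The diversity metric above is typically computed as one minus the mean pairwise Tanimoto similarity over the generated set. A minimal sketch of that calculation, using toy bit-sets in place of real molecular fingerprints (an assumption for illustration; the card does not state which fingerprint was used):

```python
from itertools import combinations

def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity between two fingerprint bit-sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def diversity(fps: list) -> float:
    """One minus the mean pairwise Tanimoto similarity."""
    pairs = list(combinations(fps, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(tanimoto(a, b) for a, b in pairs) / len(pairs)

# Toy bit-sets standing in for fingerprints of three generated molecules
fps = [{1, 2, 3}, {2, 3, 4}, {5, 6}]
print(round(diversity(fps), 4))
```

In practice the fingerprints would come from a cheminformatics toolkit such as RDKit; a score near 0.84, as reported above, indicates the generated molecules are largely dissimilar to one another.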
## 🏗️ Model Architecture
A custom, scaled-down LLaMA architecture was used to optimize for chemical language modeling:
- Hidden Size: 768
- Intermediate Size: 2048
- Number of Hidden Layers: 8
- Number of Attention Heads: 8
- Max Position Embeddings: 1024
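These hyperparameters are consistent with the stated parameter count. A quick sanity check, assuming a standard LLaMA decoder block with no biases and no grouped-query attention (both assumptions, since the card does not say):

```python
hidden = 768
intermediate = 2048
layers = 8

# Per decoder block: 4 attention projections (q, k, v, o),
# a SwiGLU MLP (gate, up, down), and two RMSNorm weight vectors.
attn = 4 * hidden * hidden
mlp = 3 * hidden * intermediate
norms = 2 * hidden
block = attn + mlp + norms

transformer = layers * block
print(transformer)  # ~56.6M; token embeddings account for the rest of the 57.2M
```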
## 🚀 How to Use
You can load this model with the standard `transformers` library. To generate a SMILES string, prompt the model with the `[bos]` (beginning-of-sequence) token.
### Prerequisites

Make sure you have the required libraries installed:

```bash
pip install transformers torch deepchem
```
### Generation Code
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Load model and tokenizer
model_id = "jonghyunlee/MoLLaMA"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# 2. Prepare prompt for unconditional generation
prompt = "[bos]"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

# 3. Generate SMILES
model.eval()
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

# 4. Decode and remove the spaces the atom-level tokenizer inserts
generated_smiles = tokenizer.decode(outputs[0], skip_special_tokens=True).replace(" ", "")
print(f"Generated SMILES: {generated_smiles}")
```
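The `.replace(" ", "")` in the decode step is needed because DeepChem's SmilesTokenizer operates at the atom level, so decoding yields space-separated tokens rather than a contiguous SMILES string. A small standalone sketch of that post-processing, using a hypothetical token sequence and special-token names for illustration:

```python
def tokens_to_smiles(tokens, specials=("[bos]", "[eos]", "[pad]")):
    """Join atom-level tokens into a SMILES string, dropping special tokens."""
    return "".join(t for t in tokens if t not in specials)

# Hypothetical atom-level token sequence for acetic acid
print(tokens_to_smiles(["[bos]", "C", "C", "(", "=", "O", ")", "O", "[eos]"]))
# CC(=O)O
```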
## 📚 Training Details
- Dataset: ZINC15 + MuMOInstruct (Parquet format)
- Epochs: 5
- Batch Size: 512 (with gradient accumulation steps of 4)
- Learning Rate: 1e-4 (cosine scheduler, 10% warmup)
- Precision: bf16 (mixed precision)
- Early Stopping Patience: 5 epochs
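The learning-rate schedule above (linear warmup over the first 10% of steps, then cosine decay) can be sketched as follows; the function name and decay-to-zero endpoint are assumptions for illustration, not taken from the training code:

```python
import math

def lr_at(step, total_steps, peak_lr=1e-4, warmup_frac=0.1):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return peak_lr * step / max(warmup, 1)
    progress = (step - warmup) / max(total_steps - warmup, 1)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

total = 1000
print(lr_at(50, total))   # mid-warmup: half of peak
print(lr_at(100, total))  # end of warmup: peak learning rate
```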