🧬 MoLLaMA-Small

MoLLaMA-Small is a lightweight LLaMA-based causal language model (57.2M parameters) trained from scratch to generate valid chemical molecules using SMILES strings.

This model uses DeepChem's SmilesTokenizer and was trained on a combined dataset of ZINC15 and MuMOInstruct. It is designed for unconditional molecule generation.

πŸ“Š Model Performance

The model was evaluated on 30 randomly generated samples, achieving perfect validity and high diversity among the generated chemical structures.

Metric         Score
Parameters     57.2 M
Validity       100.0%
Average QED    0.6400
Diversity      0.8363

πŸ—οΈ Model Architecture

A custom, scaled-down LLaMA architecture tailored to chemical language modeling:

  • Hidden Size: 768
  • Intermediate Size: 2048
  • Number of Hidden Layers: 8
  • Number of Attention Heads: 8
  • Max Position Embeddings: 1024

πŸš€ How to Use

You can load this model with the standard transformers library. The model generates SMILES strings when prompted with the [bos] (beginning-of-sequence) token.

Prerequisites

Make sure you have the required libraries installed:

pip install transformers torch deepchem

Generation Code

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Load Model and Tokenizer
model_id = "jonghyunlee/MoLLaMA"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    device_map="auto"
)

# 2. Prepare Prompt for Unconditional Generation
prompt = "[bos]"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

# 3. Generate SMILES
model.eval()
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

# 4. Decode the output
generated_smiles = tokenizer.decode(outputs[0], skip_special_tokens=True).replace(" ", "")
print(f"Generated SMILES: {generated_smiles}")

πŸ“š Training Details

  • Dataset: ZINC15 + MuMOInstruct (Parquet format)
  • Epochs: 5
  • Batch Size: 512 (with gradient accumulation steps of 4)
  • Learning Rate: 1e-4 (Cosine scheduler, 10% Warmup)
  • Precision: bf16 (Mixed Precision)
  • Early Stopping Patience: 5 epochs
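The card does not say which training framework was used; if one were to reproduce the run with the transformers Trainer, the listed hyperparameters would map onto TrainingArguments roughly as below. The per-device batch size (128 × 4 accumulation = 512 effective, single device) is an assumption:

```python
from transformers import TrainingArguments

# Hypothetical mapping of the listed hyperparameters; not confirmed by the card.
args = TrainingArguments(
    output_dir="mollama-small",
    num_train_epochs=5,
    per_device_train_batch_size=128,  # assumption: 128 * 4 accum = 512 effective
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                 # 10% warmup
    bf16=True,                        # mixed precision
    eval_strategy="epoch",            # `evaluation_strategy` on older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,      # needed for EarlyStoppingCallback(patience=5)
)
```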