🧬 MoLLaMA-Small

MoLLaMA-Small is a lightweight LLaMA-based causal language model (57.2M parameters) trained from scratch to generate valid chemical molecules using SMILES strings.

This model uses DeepChem's SmilesTokenizer and was trained on a combined dataset of ZINC15 and MuMOInstruct. It is designed for unconditional molecule generation.

πŸ“Š Model Performance

The model was evaluated on 30 randomly generated samples, achieving perfect validity and high diversity among the generated chemical structures.

Metric         Score
Parameters     57.2 M
Validity       100.0%
Average QED    0.6400
Diversity      0.8363

πŸ—οΈ Model Architecture

A custom, scaled-down LLaMA architecture tailored to chemical language modeling:

  • Hidden Size: 768
  • Intermediate Size: 2048
  • Number of Hidden Layers: 8
  • Number of Attention Heads: 8
  • Max Position Embeddings: 1024

πŸš€ How to Use

You can load this model with the standard transformers library. The model generates SMILES strings when prompted with the [bos] (beginning-of-sequence) token.

Prerequisites

Make sure you have the required libraries installed:

pip install transformers torch deepchem

Generation Code

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Load Model and Tokenizer
model_id = "jonghyunlee/MoLLaMA"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    device_map="auto"
)

# 2. Prepare Prompt for Unconditional Generation
prompt = "[bos]"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

# 3. Generate SMILES
model.eval()
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

# 4. Decode the output
generated_smiles = tokenizer.decode(outputs[0], skip_special_tokens=True).replace(" ", "")
print(f"Generated SMILES: {generated_smiles}")

πŸ“š Training Details

  • Dataset: ZINC15 + MuMOInstruct (Parquet format)
  • Epochs: 5
  • Batch Size: 512 (with gradient accumulation steps of 4)
  • Learning Rate: 1e-4 (Cosine scheduler, 10% Warmup)
  • Precision: bf16 (Mixed Precision)
  • Early Stopping Patience: 5 epochs
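The card does not say which training framework was used; if one were to reproduce the run with the transformers Trainer, the listed hyperparameters would map onto TrainingArguments roughly as below. The per-device batch size (128 × 4 accumulation = 512 effective, single device) is an assumption:

```python
from transformers import TrainingArguments

# Hypothetical mapping of the listed hyperparameters; not confirmed by the card.
args = TrainingArguments(
    output_dir="mollama-small",
    num_train_epochs=5,
    per_device_train_batch_size=128,  # assumption: 128 * 4 accum = 512 effective
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                 # 10% warmup
    bf16=True,                        # mixed precision
    eval_strategy="epoch",            # `evaluation_strategy` on older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,      # needed for EarlyStoppingCallback(patience=5)
)
```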