---
license: cc-by-nc-4.0
tags:
- biology
- histopathology
- spatial-transcriptomics
- multimodal
- pathology
- gene-expression
- biobert
- vision-language
library_name: pytorch
---
# SpatialWhisperer
SpatialWhisperer is a trimodal embedding model that aligns hematoxylin & eosin (H&E) image patches, gene-expression profiles, and free-text descriptions in a shared 512-dimensional space. It enables zero-shot cell-type annotation of H&E patches and natural-language querying over histopathology and spatial transcriptomics data.
This checkpoint (seed 0) is from the ICML 2026 paper *[Transitive Representation Learning Enhances Histopathology Annotation](https://openreview.net/forum?id=Ze7U293Zw4)* (Schaefer et al., PMLR vol. 306). The paper refers to this configuration as the **Trimodal** model: three encoders trained on two types of paired-modality data (transcriptome↔text and image↔transcriptome) that together span three modalities.
- **Code & reproduction pipeline:** <https://github.com/zinagoodlab/spatialwhisperer>
- **Paper:** <https://openreview.net/forum?id=Ze7U293Zw4>
## Architecture
| Modality | Encoder | Status |
|----------|---------|--------|
| Image (H&E) | UNI2 (`MahmoodLab/UNI2-h`) | locked |
| Transcriptome | Geneformer-12L-30M | locked |
| Text | BioBERT v1.1 | trained |
Three projection heads map each encoder's pooled features into a shared 512-dimensional space. Only the text tower and the three projection heads are trained.
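As a rough illustration of the shared space, the sketch below projects pooled features of three different widths into 512 dimensions and L2-normalizes them. It uses NumPy with random weights as stand-ins. The single linear layer per head and the input widths (1536, 512, 768) are assumptions made for illustration, not the released head architecture.

```python
import numpy as np

SHARED_DIM = 512  # shared embedding size stated in this README

# Assumed pooled-feature widths per encoder (illustrative only).
ENCODER_DIMS = {"image": 1536, "transcriptome": 512, "text": 768}

rng = np.random.default_rng(0)

# One hypothetical linear head per modality, shape (encoder_dim, SHARED_DIM).
heads = {
    name: rng.normal(scale=dim ** -0.5, size=(dim, SHARED_DIM))
    for name, dim in ENCODER_DIMS.items()
}


def project(pooled: np.ndarray, modality: str) -> np.ndarray:
    """Map pooled features into the shared space and L2-normalize each row."""
    embed = pooled @ heads[modality]
    return embed / np.linalg.norm(embed, axis=-1, keepdims=True)


# Random stand-ins for a batch of 4 pooled feature vectors per modality.
embeds = {
    name: project(rng.normal(size=(4, dim)), name)
    for name, dim in ENCODER_DIMS.items()
}

# After projection, any two modalities can be compared with a dot product.
similarity = embeds["image"] @ embeds["text"].T  # shape (4, 4)
print(similarity.shape)
```

Whatever the encoder width, every modality ends up as a unit vector of the same length, which is what makes cross-modal cosine similarity meaningful.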
## Training data
Three paired-modality datasets, covering two modality pairings (image ↔ transcriptome and transcriptome ↔ text):
- **HEST-1K** — H&E ↔ spatial gene expression (Visium-style spots)
- **CellxGene Census** — gene expression ↔ free-text cell/sample metadata
- **ARCHS4/GEO** — gene expression ↔ free-text sample descriptions
Training: 4 epochs, AdamW (lr 1e-5), cosine schedule (3% warmup), batch size 512, single H100. This checkpoint is from epoch 3, global step 14624.
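The learning-rate schedule can be pictured with the short sketch below. It assumes linear warmup over the first 3% of steps followed by cosine decay to zero. The warmup shape, the final learning rate, and the `total_steps` value in the example are assumptions, since only the peak learning rate and the warmup fraction are stated here.

```python
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float = 1e-5,
               warmup_frac: float = 0.03) -> float:
    """Learning rate under linear warmup followed by cosine decay to zero."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Hypothetical run length, chosen only to make the example concrete.
total_steps = 10_000
for step in (0, 150, 300, 5_150, 10_000):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```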
## What this checkpoint contains
- `spatialwhisperer.ckpt` — Lightning state-dict (~530 MB, 236 tensors: trained BioBERT text tower + three projection heads) plus the `hyper_parameters` block. Optimizer/scheduler state is stripped.
**The locked foundation-model weights are NOT included.** UNI2 and Geneformer are re-instantiated at load time from their original providers. The `load_spatialwhisperer_model()` convenience wrapper fetches both on first call.
## Usage
Install the [code repository](https://github.com/zinagoodlab/spatialwhisperer) (pixi env), then:
```python
from spatialwhisperer import load_spatialwhisperer_model
model, tokenizer, transcriptome_processor, image_processor = load_spatialwhisperer_model()
# model: TranscriptomeTextDualEncoderLightning (frozen, eval mode, on CUDA if available)
```
First call downloads the SpatialWhisperer checkpoint plus UNI2 and Geneformer weights; subsequent calls load from cache.
Each `get_<modality>_features` call returns a tuple `(pooled_features, projected_embed_in_shared_space)`. The second element is the 512-dimensional shared-space embedding to use when comparing across modalities.
```python
import torch
prompts = ["cytotoxic T cells", "plasma cells"]
text_inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
with torch.no_grad():
    _, text_emb = model.model.get_text_features(normalize_embeds=True, **text_inputs)
print(text_emb.shape) # (2, 512)
```
Image and transcriptome embeddings follow the same pattern; see the [GitHub README](https://github.com/zinagoodlab/spatialwhisperer#use-the-model) for complete examples covering all three modalities.
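Because all three modalities land in the same space, zero-shot annotation reduces to a nearest-prompt lookup. The following is a minimal sketch of that step, using random NumPy arrays in place of real image and text embeddings. The helper name `zero_shot_annotate` and the softmax `temperature` value are hypothetical and not part of the package.

```python
import numpy as np


def zero_shot_annotate(image_emb, text_emb, labels, temperature=0.07):
    """Assign each image embedding the label of its most similar prompt.

    Both inputs are assumed to be L2-normalized, so the dot product
    equals cosine similarity.
    """
    sims = image_emb @ text_emb.T                  # (n_images, n_prompts)
    logits = sims / temperature
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    predicted = [labels[i] for i in probs.argmax(axis=1)]
    return predicted, probs


def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


rng = np.random.default_rng(0)
labels = ["cytotoxic T cells", "plasma cells"]
text_emb = l2_normalize(rng.normal(size=(2, 512)))
# Stand-in image embeddings: noisy copies of prompts 1, 0, 1.
image_emb = l2_normalize(text_emb[[1, 0, 1]] + 0.01 * rng.normal(size=(3, 512)))

predicted, probs = zero_shot_annotate(image_emb, text_emb, labels)
print(predicted)
```

With real model outputs, the embeddings are PyTorch tensors and would need `.cpu().numpy()` before being passed to a NumPy helper like this one.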
## Foundation-model setup
UNI2 (`MahmoodLab/UNI2-h`) is a gated HuggingFace model. Before first use:
1. Accept the terms at <https://huggingface.co/MahmoodLab/UNI2-h>.
2. Make a read token visible to your environment — the loader checks `HF_TOKEN` / `HUGGINGFACE_TOKEN`, otherwise falls back to `huggingface_hub`'s on-disk cache. The simplest setup is:
```bash
huggingface-cli login # paste your read token
```
Geneformer downloads without gating.
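A quick preflight check can show which token source will be picked up before the first download. The sketch below only inspects the two environment variables named above. It does not reproduce the loader's own logic; the helper name `find_hf_token` and the precedence of `HF_TOKEN` over `HUGGINGFACE_TOKEN` are assumptions.

```python
import os


def find_hf_token(env=None):
    """Return (name, value) for the first non-empty token variable, else None."""
    env = os.environ if env is None else env
    # Assumed precedence: HF_TOKEN first, then HUGGINGFACE_TOKEN.
    for name in ("HF_TOKEN", "HUGGINGFACE_TOKEN"):
        value = env.get(name, "").strip()
        if value:
            return name, value
    return None


found = find_hf_token()
if found is None:
    print("No token in the environment; a cached `huggingface-cli login` is needed.")
else:
    print(f"Found a token in ${found[0]}")
```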
## Evaluation
The accompanying [code repository](https://github.com/zinagoodlab/spatialwhisperer) reproduces every paper benchmark with one command: `pixi run snakemake -j N paper_all`. Verified end-to-end on a fresh ILC ampere4 clone (2026-05-31):
| Benchmark | Macro AUROC (this ckpt, seed 0) |
|-----------|-----------------|
| PathoCell CRC (13-class) | 0.630 |
| Lizard (3-class reduced) | 0.764 |
| PanNuke (4-class reduced) | 0.689 |
| Kriegsmann Skin Conditions (16-class, clinical labels) | 0.698 |
These match the paper's reported Trimodal seed-0 numbers exactly. Comparisons to CONCH, PLIP, OmiCLIP, and other baselines are in the paper.
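For readers unfamiliar with the metric, macro AUROC averages one-vs-rest AUROC over classes with equal weight. The sketch below illustrates that definition in NumPy using the rank-sum (Mann-Whitney) formulation, with tied scores sharing average ranks. It is not the repository's evaluation code, and the function names are hypothetical.

```python
import numpy as np


def binary_auroc(y_true, scores):
    """AUROC via the rank-sum (Mann-Whitney U) statistic; ties get mid-ranks."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    n_pos, n_neg = int(y_true.sum()), int((~y_true).sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    # Average 1-based ranks so that tied scores share the same rank.
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    end_ranks = np.cumsum(counts)
    mid_ranks = end_ranks - (counts - 1) / 2.0
    ranks = mid_ranks[inverse]
    u_stat = ranks[y_true].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u_stat / (n_pos * n_neg))


def macro_auroc(labels, class_scores):
    """Mean one-vs-rest AUROC; class_scores has shape (n_samples, n_classes)."""
    labels = np.asarray(labels)
    class_scores = np.asarray(class_scores, dtype=float)
    per_class = [
        binary_auroc(labels == k, class_scores[:, k])
        for k in range(class_scores.shape[1])
    ]
    return float(np.mean(per_class))


# Toy example: 3 classes, with the true class always scored highest.
labels = np.array([0, 1, 2, 0, 1, 2])
class_scores = np.eye(3)[labels] + 0.1
print(macro_auroc(labels, class_scores))
```

This sketch raises an error when a class has no positive or no negative samples. How the paper's pipeline treats such classes is not stated here.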
## Intended use & limitations
**Intended:** research on multimodal histopathology, zero-shot cell-type annotation, spatial transcriptomics analysis, and natural-language querying over H&E and gene-expression data.
**Not intended:** clinical diagnosis or treatment decisions. The model was trained on academic datasets and is not validated for clinical use.
**Limitations:**
- Trained on Visium-scale spots (~55 μm); finer-grained image–expression alignment is not guaranteed.
- BioBERT vocabulary constrains the text tower; rare technical terms may be out-of-distribution.
- The image tower (UNI2) is locked; performance on tissue types poorly represented in UNI2's pretraining data is likely to be lower.
## License
**CC BY-NC 4.0** (research use). Foundation-model weights (UNI2, Geneformer, BioBERT) carry their own licenses; consult the upstream repositories.
## Citation
```bibtex
@inproceedings{schaefer2026spatialwhisperer,
title = {Transitive Representation Learning Enhances Histopathology Annotation},
author = {Schaefer, Moritz and Piran, Zoe and Walter, Nils Philipp and Awasthi, Animesh and Bock, Christoph and Leskovec, Jure and Good, Zinaida},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
series = {Proceedings of Machine Learning Research},
volume = {306},
publisher = {PMLR},
address = {Seoul, South Korea},
month = jul,
year = {2026},
url = {https://openreview.net/forum?id=Ze7U293Zw4}
}
```