| --- |
| license: cc-by-nc-4.0 |
| tags: |
| - biology |
| - histopathology |
| - spatial-transcriptomics |
| - multimodal |
| - pathology |
| - gene-expression |
| - biobert |
| - vision-language |
| library_name: pytorch |
| --- |
| |
| # SpatialWhisperer |
|
|
| A trimodal embedding model that aligns hematoxylin & eosin (H&E) image patches, gene-expression profiles, and free-text descriptions in a shared 512-dimensional space. It enables zero-shot cell-type annotation of H&E patches and natural-language querying over histopathology and spatial transcriptomics data. |
|
|
| This checkpoint (seed 0) is from the ICML 2026 paper *[Transitive Representation Learning Enhances Histopathology Annotation](https://openreview.net/forum?id=Ze7U293Zw4)* (Schaefer et al., PMLR vol. 306). The paper refers to this configuration as the **Trimodal** model: three encoders aligned using two kinds of paired-modality data (transcriptome↔text and image↔transcriptome) that together span three modalities. |
|
|
| - **Code & reproduction pipeline:** <https://github.com/zinagoodlab/spatialwhisperer> |
| - **Paper:** <https://openreview.net/forum?id=Ze7U293Zw4> |
|
|
| ## Architecture |
|
|
| | Modality | Encoder | Status | |
| |----------|---------|--------| |
| | Image (H&E) | UNI2 (`MahmoodLab/UNI2-h`) | locked | |
| | Transcriptome | Geneformer-12L-30M | locked | |
| | Text | BioBERT v1.1 | trained | |
|
|
| Three projection heads, one per encoder, map the pooled encoder features into a shared 512-dimensional space. Only the text tower and the three projection heads are trained; the image and transcriptome encoders stay locked. |
|
|
| ## Training data |
|
|
| Three datasets covering two modality pairings (image↔transcriptome and transcriptome↔text): |
|
|
| - **HEST-1K** — H&E ↔ spatial gene expression (Visium-style spots) |
| - **CellxGene Census** — gene expression ↔ free-text cell/sample metadata |
| - **ARCHS4/GEO** — gene expression ↔ free-text sample descriptions |
|
|
| Training: 4 epochs, AdamW (lr 1e-5), cosine schedule (3% warmup), batch size 512, single H100. This checkpoint is from epoch 3, global step 14624. |
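
To make the schedule concrete, here is a minimal sketch of a cosine learning-rate schedule with a 3% warmup and a peak of 1e-5. The linear warmup shape, the decay to zero, the function name `lr_at_step`, and the total step count used in the example are all assumptions for illustration; the exact scheduler used in training is defined in the code repository.

```python
import math


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 1e-5, warmup_frac: float = 0.03) -> float:
    """Learning rate at a given step (assumed: linear warmup, cosine decay to 0)."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical step count, not the real training length
    for s in (0, 15, 30, 500, 1000):
        print(s, lr_at_step(s, total))
```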
|
|
| ## What this checkpoint contains |
|
|
| - `spatialwhisperer.ckpt` — Lightning state-dict (~530 MB, 236 tensors: trained BioBERT text tower + three projection heads) plus the `hyper_parameters` block. Optimizer/scheduler state is stripped. |
|
|
| **The locked foundation-model weights are NOT included.** UNI2 and Geneformer are re-instantiated at load time from their original providers. The `load_spatialwhisperer_model()` convenience wrapper fetches both on first call. |
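
If you want to check what the file holds before loading the full model, a small helper like the one below can summarize it. This is a sketch, not part of the repository: `summarize_checkpoint` is a hypothetical name, and the parameter names in the demo are made up. It relies only on the standard Lightning checkpoint layout (`state_dict`, `hyper_parameters`, and `optimizer_states` when optimizer state is kept).

```python
from collections import Counter
from typing import Any, Mapping


def summarize_checkpoint(ckpt: Mapping[str, Any]) -> dict:
    """Summarize a Lightning-style checkpoint dict without touching the weights."""
    state_dict = ckpt.get("state_dict", {})
    # Group parameter names by their first two dotted components.
    groups = Counter(".".join(name.split(".")[:2]) for name in state_dict)
    return {
        "num_tensors": len(state_dict),
        "groups": dict(groups),
        "hyper_parameter_keys": sorted(ckpt.get("hyper_parameters", {})),
        "has_optimizer_state": bool(ckpt.get("optimizer_states")),
    }


if __name__ == "__main__":
    # Toy stand-in with hypothetical parameter names. For the real file, something like
    #   ckpt = torch.load("spatialwhisperer.ckpt", map_location="cpu")
    # should work; recent PyTorch versions may also need weights_only=False.
    toy = {
        "state_dict": {"model.text_model.weight": 0, "model.text_projection.weight": 0},
        "hyper_parameters": {"lr": 1e-5},
    }
    print(summarize_checkpoint(toy))
```

On the released checkpoint you would expect `num_tensors` to be 236 and `has_optimizer_state` to be `False`, per the description above.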
|
|
| ## Usage |
|
|
| Install the [code repository](https://github.com/zinagoodlab/spatialwhisperer) (pixi env), then: |
|
|
| ```python |
| from spatialwhisperer import load_spatialwhisperer_model |
| |
| model, tokenizer, transcriptome_processor, image_processor = load_spatialwhisperer_model() |
| # model: TranscriptomeTextDualEncoderLightning (frozen, eval mode, on CUDA if available) |
| ``` |
|
|
| First call downloads the SpatialWhisperer checkpoint plus UNI2 and Geneformer weights; subsequent calls load from cache. |
|
|
| Each `get_<modality>_features` call returns `(pooled_features, projected_embed_in_shared_space)`. The second element is the 512-D shared-space embedding to compare across modalities. |
|
|
| ```python |
| import torch |
| |
| prompts = ["cytotoxic T cells", "plasma cells"] |
| text_inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device) |
| |
| # Inference only: no gradients needed. |
| with torch.no_grad(): |
|     _, text_emb = model.model.get_text_features(normalize_embeds=True, **text_inputs) |
| print(text_emb.shape)  # torch.Size([2, 512]) |
| ``` |
|
|
| Image and transcriptome embeddings follow the same pattern; see the [GitHub README](https://github.com/zinagoodlab/spatialwhisperer#use-the-model) for complete examples covering all three modalities. |
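
Once you have shared-space embeddings for patches and prompts, zero-shot annotation comes down to picking the prompt with the highest cosine similarity. The sketch below shows that step on plain Python lists so it runs without any model weights. The function names and the toy 2-D vectors are illustrative assumptions, not part of the `spatialwhisperer` API.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def zero_shot_labels(image_embs: Sequence[Sequence[float]],
                     text_embs: Sequence[Sequence[float]],
                     prompts: Sequence[str]) -> List[Tuple[str, float]]:
    """For each image embedding, return the best-matching prompt and its cosine similarity."""
    text_unit = [l2_normalize(t) for t in text_embs]
    results = []
    for img in image_embs:
        img_unit = l2_normalize(img)
        # Cosine similarity equals the dot product of unit vectors.
        sims = [sum(a * b for a, b in zip(img_unit, t)) for t in text_unit]
        best = max(range(len(prompts)), key=sims.__getitem__)
        results.append((prompts[best], sims[best]))
    return results


if __name__ == "__main__":
    prompts = ["cytotoxic T cells", "plasma cells"]
    text_embs = [[1.0, 0.0], [0.0, 1.0]]    # toy stand-ins for 512-D text embeddings
    image_embs = [[0.9, 0.1], [0.2, 0.8]]   # toy stand-ins for 512-D image embeddings
    for label, sim in zero_shot_labels(image_embs, text_embs, prompts):
        print(label, round(sim, 3))
```

With the tensors returned by the model, and both sides normalized, the same ranking reduces to `(image_emb @ text_emb.T).argmax(dim=1)`.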
|
|
| ## Foundation-model setup |
|
|
| UNI2 (`MahmoodLab/UNI2-h`) is a gated HuggingFace model. Before first use: |
|
|
| 1. Accept the terms at <https://huggingface.co/MahmoodLab/UNI2-h>. |
| 2. Make a read token visible to your environment — the loader checks `HF_TOKEN` / `HUGGINGFACE_TOKEN`, otherwise falls back to `huggingface_hub`'s on-disk cache. The simplest setup is: |
| ```bash |
| huggingface-cli login # paste your read token |
| ``` |
|
|
| Geneformer downloads without gating. |
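
The token lookup described in step 2 can be approximated with a few lines of standard-library Python, which is handy for checking your environment before the first download. This is a sketch of the described behavior, not the loader's actual code: `resolve_hf_token` is a hypothetical helper, and the order in which the two variables are checked is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first non-empty token found in the environment, else None."""
    for name in ("HF_TOKEN", "HUGGINGFACE_TOKEN"):  # assumed lookup order
        token = env.get(name, "").strip()
        if token:
            return token
    # None means: rely on the credentials stored by `huggingface-cli login`.
    return None


if __name__ == "__main__":
    if resolve_hf_token() is None:
        print("No token in the environment; falling back to the stored login, if any.")
    else:
        print("Found a token in the environment.")
```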
|
|
| ## Evaluation |
|
|
| The accompanying [code repository](https://github.com/zinagoodlab/spatialwhisperer) reproduces every paper benchmark with one command: `pixi run snakemake -j N paper_all`. Verified end-to-end on a fresh ILC ampere4 clone (2026-05-31): |
|
|
| | Benchmark | Macro AUROC (this ckpt, seed 0) | |
| |-----------|-----------------| |
| | PathoCell CRC (13-class) | 0.630 | |
| | Lizard (3-class reduced) | 0.764 | |
| | PanNuke (4-class reduced) | 0.689 | |
| | Kriegsmann Skin Conditions (16-class, clinical labels) | 0.698 | |
|
|
| These match the paper's reported Trimodal seed-0 numbers exactly. Comparisons to CONCH, PLIP, OmiCLIP, and other baselines are in the paper. |
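
For readers unfamiliar with the metric, macro AUROC is commonly computed as the unweighted mean of one-vs-rest AUROCs over classes. The dependency-free sketch below follows that common definition, with per-class scores standing in for patch-to-prompt similarities. It is an assumption that the paper's pipeline matches this definition in every detail (for example in how ties or absent classes are handled); the function names are illustrative.

```python
from typing import List, Sequence


def binary_auroc(labels: Sequence[bool], scores: Sequence[float]) -> float:
    """AUROC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auroc(y_true: Sequence[int], score_matrix: Sequence[Sequence[float]],
                num_classes: int) -> float:
    """Unweighted mean of one-vs-rest AUROCs; score_matrix[i][c] scores sample i for class c."""
    per_class: List[float] = []
    for c in range(num_classes):
        labels = [y == c for y in y_true]
        scores = [row[c] for row in score_matrix]
        per_class.append(binary_auroc(labels, scores))
    return sum(per_class) / num_classes


if __name__ == "__main__":
    y_true = [0, 1, 2, 0]
    scores = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]
    print(macro_auroc(y_true, scores, num_classes=3))
```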
|
|
| ## Intended use & limitations |
|
|
| **Intended:** research on multimodal histopathology, zero-shot cell-type annotation, spatial transcriptomics analysis, and natural-language querying over H&E and gene-expression data. |
|
|
| **Not intended:** clinical diagnosis or treatment decisions. The model was trained on academic datasets and is not validated for clinical use. |
|
|
| **Limitations:** |
| - Trained on Visium-scale spots (~55 μm); finer-grained image–expression alignment is not guaranteed. |
| - BioBERT vocabulary constrains the text tower; rare technical terms may be out-of-distribution. |
| - The image tower (UNI2) is locked, so performance is likely to be lower on tissue types that are poorly represented in UNI2's pretraining data. |
|
|
| ## License |
|
|
| **CC BY-NC 4.0** (research use). Foundation-model weights (UNI2, Geneformer, BioBERT) carry their own licenses; consult the upstream repositories. |
|
|
| ## Citation |
|
|
| ```bibtex |
| @inproceedings{schaefer2026spatialwhisperer, |
| title = {Transitive Representation Learning Enhances Histopathology Annotation}, |
| author = {Schaefer, Moritz and Piran, Zoe and Walter, Nils Philipp and Awasthi, Animesh and Bock, Christoph and Leskovec, Jure and Good, Zinaida}, |
| booktitle = {Proceedings of the 43rd International Conference on Machine Learning}, |
| series = {Proceedings of Machine Learning Research}, |
| volume = {306}, |
| publisher = {PMLR}, |
| address = {Seoul, South Korea}, |
| month = jul, |
| year = {2026}, |
| url = {https://openreview.net/forum?id=Ze7U293Zw4} |
| } |
| ``` |
|
|