---
license: cc-by-nc-4.0
tags:
- biology
- histopathology
- spatial-transcriptomics
- multimodal
- pathology
- gene-expression
- biobert
- vision-language
library_name: pytorch
---

# SpatialWhisperer

A trimodal embedding model that aligns hematoxylin & eosin (H&E) image patches, gene-expression profiles, and free-text descriptions in a shared 512-dimensional space. It enables zero-shot cell-type annotation of H&E patches and natural-language querying over histopathology and spatial transcriptomics data.

This checkpoint (seed 0) is from the ICML 2026 paper *[Transitive Representation Learning Enhances Histopathology Annotation](https://openreview.net/forum?id=Ze7U293Zw4)* (Schaefer et al., PMLR vol. 306). The paper refers to this configuration as the **Trimodal** model: three encoders trained on two modality pairings (transcriptome↔text and image↔transcriptome) that together span three modalities.

- **Code & reproduction pipeline:** <https://github.com/zinagoodlab/spatialwhisperer>
- **Paper:** <https://openreview.net/forum?id=Ze7U293Zw4>

## Architecture

| Modality | Encoder | Status |
|----------|---------|--------|
| Image (H&E) | UNI2 (`MahmoodLab/UNI2-h`) | locked |
| Transcriptome | Geneformer-12L-30M | locked |
| Text | BioBERT v1.1 | trained |

Three projection heads, one per encoder, map the pooled features into the shared 512-dimensional space. Only the text tower and the three projection heads are trained.

## Training data

Three paired-modality datasets:

- **HEST-1K**: H&E ↔ spatial gene expression (Visium-style spots)
- **CellxGene Census**: gene expression ↔ free-text cell/sample metadata
- **ARCHS4/GEO**: gene expression ↔ free-text sample descriptions

Training: 4 epochs, AdamW (lr 1e-5), cosine schedule (3% warmup), batch size 512, single H100. This checkpoint is from epoch 3, global step 14624.

## What this checkpoint contains

- `spatialwhisperer.ckpt`: Lightning state-dict (~530 MB, 236 tensors: trained BioBERT text tower + three projection heads) plus the `hyper_parameters` block. Optimizer/scheduler state is stripped.

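
To check these contents after downloading, the checkpoint can be summarized by grouping its state-dict keys by prefix. The following is a minimal sketch, not part of the repository: `summarize_checkpoint` and the key names in the toy dictionary are hypothetical, and it assumes the standard Lightning layout with top-level `state_dict`, `hyper_parameters`, and (when present) `optimizer_states` entries.

```python
from collections import Counter
from typing import Any, Dict, Mapping


def summarize_checkpoint(ckpt: Mapping[str, Any], depth: int = 2) -> Dict[str, Any]:
    """Summarize a Lightning-style checkpoint dict (hypothetical helper).

    State-dict entries are grouped by the first `depth` dot-separated
    components of their key, which roughly separates towers from heads.
    """
    state = ckpt["state_dict"]
    groups = Counter(".".join(name.split(".")[:depth]) for name in state)
    return {
        "num_tensors": len(state),
        "groups": dict(groups),
        "hyper_parameter_keys": sorted(ckpt.get("hyper_parameters", {})),
        # Lightning stores optimizer state under "optimizer_states" when kept.
        "has_optimizer_state": "optimizer_states" in ckpt,
    }


# Toy stand-in with made-up key names. For the real file one would load the
# dict first, e.g. torch.load("spatialwhisperer.ckpt", map_location="cpu").
toy_ckpt = {
    "state_dict": {
        "model.text_model.embeddings.weight": None,
        "model.text_model.encoder.weight": None,
        "model.text_projection.weight": None,
    },
    "hyper_parameters": {"lr": 1e-5},
}
summary = summarize_checkpoint(toy_ckpt)
print(summary)
```

On the released file, the description above suggests `num_tensors` should be 236 and `has_optimizer_state` should be `False`. Recent PyTorch versions may require `weights_only=False` in `torch.load` to read the `hyper_parameters` block; use that only for files you trust.
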

**The locked foundation-model weights are NOT included.** UNI2 and Geneformer are re-instantiated at load time from their original providers. The `load_spatialwhisperer_model()` convenience wrapper fetches both on first call.

## Usage

Install the [code repository](https://github.com/zinagoodlab/spatialwhisperer) (pixi env), then:

```python
from spatialwhisperer import load_spatialwhisperer_model

model, tokenizer, transcriptome_processor, image_processor = load_spatialwhisperer_model()
# model: TranscriptomeTextDualEncoderLightning (frozen, eval mode, on CUDA if available)
```

The first call downloads the SpatialWhisperer checkpoint plus the UNI2 and Geneformer weights; subsequent calls load from cache.

Each `get_*_features` call (for example `get_text_features`) returns `(pooled_features, projected_embed_in_shared_space)`. The second element is the 512-D shared-space embedding to compare across modalities.

```python
import torch

prompts = ["cytotoxic T cells", "plasma cells"]
# Tokenize the prompts and move the tensors to the model's device.
text_inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
with torch.no_grad():
    # The first return value (pooled features) is not needed here.
    _, text_emb = model.model.get_text_features(normalize_embeds=True, **text_inputs)
print(text_emb.shape)  # torch.Size([2, 512])
```

Image and transcriptome embeddings follow the same pattern; see the [GitHub README](https://github.com/zinagoodlab/spatialwhisperer#use-the-model) for complete examples covering all three modalities.

## Foundation-model setup

UNI2 (`MahmoodLab/UNI2-h`) is a gated Hugging Face model. Before first use:

1. Accept the terms on the `MahmoodLab/UNI2-h` model page on Hugging Face.
2. Make a read token visible to your environment: the loader checks `HF_TOKEN` / `HUGGINGFACE_TOKEN`, otherwise it falls back to `huggingface_hub`'s on-disk cache. The simplest setup is:

   ```bash
   huggingface-cli login  # paste your read token
   ```

Geneformer downloads without gating.

## Evaluation

The accompanying [code repository](https://github.com/zinagoodlab/spatialwhisperer) reproduces every paper benchmark with one command, where `N` is the number of parallel Snakemake jobs: `pixi run snakemake -j N paper_all`.

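
The benchmark results in this section are macro AUROC values. As a rough illustration of how such a number can be computed from shared-space embeddings, the sketch below scores each patch against one text embedding per class by cosine similarity, then averages the one-vs-rest AUROC over classes. It is not the repository's evaluation code: the function names and the 2-D toy embeddings are hypothetical, and the actual prompts and class definitions are those of the paper and its pipeline.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def auroc(scores: Sequence[float], positives: Sequence[bool]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair (Mann-Whitney U formulation).
    """
    pos = [s for s, p in zip(scores, positives) if p]
    neg = [s for s, p in zip(scores, positives) if not p]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auroc(
    patch_embs: Sequence[Sequence[float]],
    class_embs: Sequence[Sequence[float]],
    labels: Sequence[int],
) -> float:
    """Unweighted mean of one-vs-rest AUROC over classes."""
    per_class: List[float] = []
    for c, class_emb in enumerate(class_embs):
        scores = [cosine(x, class_emb) for x in patch_embs]
        per_class.append(auroc(scores, [y == c for y in labels]))
    return sum(per_class) / len(per_class)


# Toy 2-D "embeddings": two class prompts and four patches, the last of
# which lies close to the wrong class on purpose.
class_embs = [[1.0, 0.0], [0.0, 1.0]]
patch_embs = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.9], [0.9, 0.2]]
labels = [0, 0, 1, 1]
result = macro_auroc(patch_embs, class_embs, labels)
print(f"macro AUROC: {result:.3f}")
```

With real data, `patch_embs` would hold the projected image embeddings and `class_embs` the projected text embeddings of the class prompts. The pairwise AUROC is quadratic in the number of samples, so a rank-based implementation is preferable for large benchmarks.
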

Verified end-to-end on a fresh ILC ampere4 clone (2026-05-31):

| Benchmark | Macro AUROC (this ckpt, seed 0) |
|-----------|-----------------|
| PathoCell CRC (13-class) | 0.630 |
| Lizard (3-class reduced) | 0.764 |
| PanNuke (4-class reduced) | 0.689 |
| Kriegsmann Skin Conditions (16-class, clinical labels) | 0.698 |

These match the paper's reported Trimodal seed-0 numbers exactly. Comparisons to CONCH, PLIP, OmiCLIP, and other baselines are in the paper.

## Intended use & limitations

**Intended:** research on multimodal histopathology, zero-shot cell-type annotation, spatial transcriptomics analysis, and natural-language querying over H&E and gene-expression data.

**Not intended:** clinical diagnosis or treatment decisions. The model was trained on academic datasets and is not validated for clinical use.

**Limitations:**

- Trained on Visium-scale spots (~55 μm); finer-grained image–expression alignment is not guaranteed.
- BioBERT vocabulary constrains the text tower; rare technical terms may be out-of-distribution.
- The image tower (UNI2) is locked; performance on tissue types poorly represented in UNI2's pretraining will be lower.

## License

**CC BY-NC 4.0** (research use). Foundation-model weights (UNI2, Geneformer, BioBERT) carry their own licenses; consult the upstream repositories.

## Citation

```bibtex
@inproceedings{schaefer2026spatialwhisperer,
  title     = {Transitive Representation Learning Enhances Histopathology Annotation},
  author    = {Schaefer, Moritz and Piran, Zoe and Walter, Nils Philipp and Awasthi, Animesh and Bock, Christoph and Leskovec, Jure and Good, Zinaida},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  series    = {Proceedings of Machine Learning Research},
  volume    = {306},
  publisher = {PMLR},
  address   = {Seoul, South Korea},
  month     = jul,
  year      = {2026},
  url       = {https://openreview.net/forum?id=Ze7U293Zw4}
}
```