	---
	license: cc-by-nc-4.0
	tags:
	- biology
	- histopathology
	- spatial-transcriptomics
	- multimodal
	- pathology
	- gene-expression
	- biobert
	- vision-language
	library_name: pytorch
	---

	# SpatialWhisperer

	A trimodal embedding model that aligns hematoxylin & eosin (H&E) image patches, gene-expression profiles, and free-text descriptions in a shared 512-dimensional space. It enables zero-shot cell-type annotation of H&E patches and natural-language querying over histopathology and spatial transcriptomics data.

	This checkpoint (seed 0) is from the ICML 2026 paper [Transitive Representation Learning Enhances Histopathology Annotation](https://openreview.net/forum?id=Ze7U293Zw4) (Schaefer et al., PMLR vol. 306). The paper refers to this configuration as the Trimodal model: three encoders trained on two kinds of paired-modality data (transcriptome↔text and image↔transcriptome) that together span three modalities.

	- Code & reproduction pipeline: <https://github.com/zinagoodlab/spatialwhisperer>
	- Paper: <https://openreview.net/forum?id=Ze7U293Zw4>

	## Architecture

	| Modality | Encoder | Status |
	|----------|---------|--------|
	| Image (H&E) | UNI2 (`MahmoodLab/UNI2-h`) | locked |
	| Transcriptome | Geneformer-12L-30M | locked |
	| Text | BioBERT v1.1 | trained |

	Three projection heads map each encoder's pooled features into a shared 512-dimensional space. Only the text tower and the three projection heads are trained.

	## Training data

	Three paired-modality datasets:

	- HEST-1K — H&E ↔ spatial gene expression (Visium-style spots)
	- CellxGene Census — gene expression ↔ free-text cell/sample metadata
	- ARCHS4/GEO — gene expression ↔ free-text sample descriptions

	Training: 4 epochs, AdamW (lr 1e-5), cosine schedule (3% warmup), batch size 512, single H100. This checkpoint is from epoch 3, global step 14624.
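
The learning-rate schedule above can be sketched in a few lines. This is a minimal illustration, assuming linear warmup from zero over the first 3% of steps followed by cosine decay to zero; the decay floor, the warmup shape, and the `TOTAL_STEPS` value are assumptions for illustration and are not taken from the training code.

```python
import math


def lr_at(step, total_steps, peak_lr=1e-5, warmup_frac=0.03):
    """Learning rate at `step` for linear warmup followed by cosine decay.

    Assumed shape: ramp from 0 to peak_lr over the warmup steps, then
    follow half a cosine period down to 0 at total_steps.
    """
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Hypothetical run length, chosen only to make the printout readable.
TOTAL_STEPS = 10_000

for step in (0, 150, 300, 5_150, 10_000):
    print(f"step {step:>6}: lr = {lr_at(step, TOTAL_STEPS):.3e}")
```

With a peak of 1e-5, the rate tops out at the end of warmup and is halved at the midpoint of the decay phase.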

	## What this checkpoint contains

	- `spatialwhisperer.ckpt` — Lightning state-dict (~530 MB, 236 tensors: trained BioBERT text tower + three projection heads) plus the `hyper_parameters` block. Optimizer/scheduler state is stripped.

	The locked foundation-model weights are NOT included. UNI2 and Geneformer are re-instantiated at load time from their original providers. The `load_spatialwhisperer_model()` convenience wrapper fetches both on first call.
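
To check what the checkpoint holds, one option is to group the state-dict entries by their leading module path. The helper below is a small sketch; the tensor names in `example_names` are made up for illustration, and the real checkpoint's names may differ.

```python
from collections import Counter


def count_by_prefix(tensor_names, depth=2):
    """Count state-dict entries per leading module path (first `depth` parts)."""
    counts = Counter()
    for name in tensor_names:
        prefix = ".".join(name.split(".")[:depth])
        counts[prefix] += 1
    return dict(counts)


# Hypothetical key names, for illustration only.
example_names = [
    "model.text_model.embeddings.word_embeddings.weight",
    "model.text_model.encoder.layer.0.attention.self.query.weight",
    "model.text_projection.weight",
    "model.image_projection.weight",
    "model.transcriptome_projection.weight",
]
summary = count_by_prefix(example_names)
print(summary)

# With the real file (requires torch; not executed here):
#   import torch
#   ckpt = torch.load("spatialwhisperer.ckpt", map_location="cpu", weights_only=False)
#   print(len(ckpt["state_dict"]))            # number of stored tensors
#   print(count_by_prefix(ckpt["state_dict"]))
#   print(ckpt["hyper_parameters"])
```

Passing `weights_only=False` is an assumption here: recent PyTorch versions default to loading tensors only, which can reject non-tensor objects stored in the `hyper_parameters` block. Only use it with checkpoints you trust.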

	## Usage

	Install the [code repository](https://github.com/zinagoodlab/spatialwhisperer) (pixi env), then:

	```python
	from spatialwhisperer import load_spatialwhisperer_model

	model, tokenizer, transcriptome_processor, image_processor = load_spatialwhisperer_model()
	# model: TranscriptomeTextDualEncoderLightning (frozen, eval mode, on CUDA if available)
	```

	First call downloads the SpatialWhisperer checkpoint plus UNI2 and Geneformer weights; subsequent calls load from cache.

	Each `get_<modality>_features` call returns `(pooled_features, projected_embed_in_shared_space)`. The second element is the 512-D shared-space embedding to compare across modalities.

	```python
	import torch

	prompts = ["cytotoxic T cells", "plasma cells"]
	text_inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

	with torch.no_grad():
	    _, text_emb = model.model.get_text_features(normalize_embeds=True, **text_inputs)
	print(text_emb.shape) # (2, 512)
	```

	Image and transcriptome embeddings follow the same pattern; see the [GitHub README](https://github.com/zinagoodlab/spatialwhisperer#use-the-model) for complete examples covering all three modalities.
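
Once embeddings from two modalities are in the shared space, zero-shot annotation reduces to a similarity lookup. The sketch below is kept dependency-free so the arithmetic is visible; with torch tensors the similarity matrix is a single matrix product. It uses toy 4-dimensional vectors in place of real 512-dimensional embeddings, and the softmax `temperature` of 0.07 is an assumed CLIP-style value, not the model's learned logit scale.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_scores(patch_emb, prompt_embs, temperature=0.07):
    """Return cosine similarities and softmax probabilities over the prompts."""
    patch = l2_normalize(patch_emb)
    sims = [
        sum(a * b for a, b in zip(patch, l2_normalize(prompt)))
        for prompt in prompt_embs
    ]
    return sims, softmax([s / temperature for s in sims])


prompts = ["cytotoxic T cells", "plasma cells"]
# Toy stand-ins for the shared-space text and image embeddings.
prompt_embs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
patch_emb = [0.9, 0.1, 0.0, 0.0]

sims, probs = zero_shot_scores(patch_emb, prompt_embs)
best = prompts[probs.index(max(probs))]
print(best, [round(p, 3) for p in probs])
```

Because the embeddings are normalized, the ranking of prompts depends only on cosine similarity; the temperature changes how peaked the probabilities are, not which prompt wins.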

	## Foundation-model setup

	UNI2 (`MahmoodLab/UNI2-h`) is a gated HuggingFace model. Before first use:

	1. Accept the terms at <https://huggingface.co/MahmoodLab/UNI2-h>.
	2. Make a read token visible to your environment — the loader checks `HF_TOKEN` / `HUGGINGFACE_TOKEN`, otherwise falls back to `huggingface_hub`'s on-disk cache. The simplest setup is:
	```bash
	huggingface-cli login # paste your read token
	```

	Geneformer downloads without gating.
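
Before the first load, it can help to confirm which token source will be picked up. The check below is a hypothetical pre-flight helper, not the loader's actual code; the lookup order (`HF_TOKEN` before `HUGGINGFACE_TOKEN`) is an assumption.

```python
import os

# Environment variables named in this README; the precedence is assumed.
TOKEN_ENV_VARS = ("HF_TOKEN", "HUGGINGFACE_TOKEN")


def find_hf_token(environ=None):
    """Return (variable_name, token) for the first non-empty variable, else None."""
    env = os.environ if environ is None else environ
    for name in TOKEN_ENV_VARS:
        value = env.get(name, "").strip()
        if value:
            return name, value
    return None


found = find_hf_token()
if found is None:
    print("No token variable set; relying on the huggingface-cli login cache.")
else:
    print(f"Using token from {found[0]}")
```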

	## Evaluation

	The accompanying [code repository](https://github.com/zinagoodlab/spatialwhisperer) reproduces every paper benchmark with one command: `pixi run snakemake -j N paper_all`. Verified end-to-end on a fresh ILC ampere4 clone (2026-05-31):

	| Benchmark | Macro AUROC (this ckpt, seed 0) |
	|-----------|---------------------------------|
	| PathoCell CRC (13-class) | 0.630 |
	| Lizard (3-class reduced) | 0.764 |
	| PanNuke (4-class reduced) | 0.689 |
	| Kriegsmann Skin Conditions (16-class, clinical labels) | 0.698 |

	These match the paper's reported Trimodal seed-0 numbers exactly. Comparisons to CONCH, PLIP, OmiCLIP, and other baselines are in the paper.
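
For readers unfamiliar with the metric, macro AUROC is the unweighted mean of per-class AUROC values. The sketch below assumes a one-vs-rest formulation and skips classes that have no positive or no negative samples; the paper's pipeline may handle such cases differently, and the toy labels and scores are invented for illustration.

```python
def binary_auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Returns None if a class is missing.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_auroc(y_true, y_score, num_classes):
    """Average one-vs-rest AUROC over the classes where it is defined."""
    per_class = []
    for c in range(num_classes):
        labels = [1 if y == c else 0 for y in y_true]
        scores = [row[c] for row in y_score]
        auc = binary_auroc(labels, scores)
        if auc is not None:
            per_class.append(auc)
    return sum(per_class) / len(per_class)


# Toy data: 4 samples, 3 classes; each row holds per-class scores.
y_true = [0, 1, 2, 0]
y_score = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.1, 0.5, 0.4],
]
result = macro_auroc(y_true, y_score, num_classes=3)
print(result)
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; a rank-based implementation is preferable for large evaluation sets.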

	## Intended use & limitations

	Intended: research on multimodal histopathology, zero-shot cell-type annotation, spatial transcriptomics analysis, and natural-language querying over H&E and gene-expression data.

	Not intended: clinical diagnosis or treatment decisions. The model was trained on academic datasets and is not validated for clinical use.

	Limitations:
	- Trained on Visium-scale spots (~55 μm); finer-grained image–expression alignment is not guaranteed.
	- BioBERT vocabulary constrains the text tower; rare technical terms may be out-of-distribution.
	- The image tower (UNI2) is locked; performance on tissue types poorly represented in UNI2's pretraining data is likely to be lower.

	## License

	CC BY-NC 4.0 (research use). Foundation-model weights (UNI2, Geneformer, BioBERT) carry their own licenses; consult the upstream repositories.

	## Citation

	```bibtex
	@inproceedings{schaefer2026spatialwhisperer,
	title = {Transitive Representation Learning Enhances Histopathology Annotation},
	author = {Schaefer, Moritz and Piran, Zoe and Walter, Nils Philipp and Awasthi, Animesh and Bock, Christoph and Leskovec, Jure and Good, Zinaida},
	booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
	series = {Proceedings of Machine Learning Research},
	volume = {306},
	publisher = {PMLR},
	address = {Seoul, South Korea},
	month = jul,
	year = {2026},
	url = {https://openreview.net/forum?id=Ze7U293Zw4}
	}
	```