Add model card for LatentUM

#1 by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +80 -0
README.md ADDED
---
license: apache-2.0
pipeline_tag: image-to-image
---

# LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model

**LatentUM** unifies all modalities within a shared semantic latent space, enabling interleaved cross-modal reasoning without pixel-space mediation. Unlike existing unified models, which require pixel decoding as a bridge between understanding and generation, LatentUM reasons directly over its own generated visual content.

This repository contains the **Pixel Decoder**, an optional diffusion-based decoder (built on Stable Diffusion 3.5 Medium) that renders pixel-space images from the shared semantic latents.

- **Paper:** [LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model](https://huggingface.co/papers/2604.02097)
- **Repository:** [https://github.com/SJTU-DENG-Lab/LatentUM](https://github.com/SJTU-DENG-Lab/LatentUM)
## Sample Usage

To use this model, please follow the installation instructions in the [official repository](https://github.com/SJTU-DENG-Lab/LatentUM).
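As a minimal setup sketch (the repository URL is taken from this card, but the `requirements.txt` file name is an assumption; defer to the repository's own instructions if they differ):

```shell
# Clone the official repository (URL from this model card).
git clone https://github.com/SJTU-DENG-Lab/LatentUM
cd LatentUM

# Install dependencies; the requirements file name is an assumption --
# follow the repository's installation instructions if they differ.
pip install -r requirements.txt
```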
### Image Understanding

```python
import torch

from model.latentum import LatentUMModel

dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Base",
    device=device,
    dtype=dtype,
)
answer = model.answer(
    "asset/blue_apple.png",
    "Describe this image.",
)
print(answer)
```

### Image Generation

```python
import torch

from model.decoder import LatentUMDecoderModel
from model.latentum import LatentUMModel

dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Base",  # alternative: "SJTU-DENG-Lab/LatentUM-GenEval"
    device=device,
    dtype=dtype,
)
decoder = LatentUMDecoderModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Decoder",
    device=device,
    dtype=dtype,
)
images = model.generate_images(
    "a photo of a cute dog",
    decoder=decoder,
    show_progress=True,
)
images[0].save("generated.png")
```

## Citation

```bibtex
@article{jin2026latentum,
  title   = {LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model},
  author  = {Jiachun Jin and Zetong Zhou and Xiao Yang and Hao Zhang and Pengfei Liu and Jun Zhu and Zhijie Deng},
  journal = {arXiv preprint arXiv:2604.02097},
  year    = {2026},
  url     = {https://arxiv.org/abs/2604.02097}
}
```