Add model card for LatentUM
Hi! I'm Niels, part of the community science team at Hugging Face. I'm opening this PR to add a model card to your repository.
This model card includes:
- A link to the [original paper](https://huggingface.co/papers/2604.02097).
- A link to the official GitHub repository.
- Metadata for the `apache-2.0` license and the `image-to-image` pipeline tag.
- Sample usage code snippets for image understanding and image generation as found in your README.
Feel free to merge this if it looks good!
README.md
ADDED
---
license: apache-2.0
pipeline_tag: image-to-image
---

# LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model

**LatentUM** unifies all modalities within a shared semantic latent space, enabling interleaved cross-modal reasoning without pixel-space mediation. Unlike existing unified models that require pixel decoding as a bridge between understanding and generation, LatentUM reasons directly over its own generated visual content.

This repository specifically contains the **Pixel Decoder**, an optional diffusion-based decoder (based on Stable Diffusion 3.5 Medium) designed to render pixel-space images from the shared semantic latents.
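At a glance, the pipeline this decoder plugs into (a schematic sketch based on the description above, not an official diagram):

```text
prompt / image ──► LatentUM (shared semantic latent space;
                   understanding and generation both happen here)
                        │ semantic latents
                        ▼
             Pixel Decoder (this repo; optional,
             based on Stable Diffusion 3.5 Medium)
                        │
                        ▼
                 pixel-space image
```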

- **Paper:** [LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model](https://huggingface.co/papers/2604.02097)
- **Repository:** [https://github.com/SJTU-DENG-Lab/LatentUM](https://github.com/SJTU-DENG-Lab/LatentUM)

## Sample Usage

To use this model, please follow the installation instructions in the [official repository](https://github.com/SJTU-DENG-Lab/LatentUM).

### Image Understanding

```python
import torch

from model.latentum import LatentUMModel

dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Base",
    device=device,
    dtype=dtype,
)
answer = model.answer(
    "asset/blue_apple.png",
    "Describe this image.",
)
print(answer)
```
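The snippets here default to `torch.bfloat16`, which assumes a recent CUDA GPU. When experimenting on other hardware, a defensive device/dtype pick such as the following may help (a heuristic sketch using standard PyTorch checks; not part of the LatentUM API):

```python
import torch

# Pick device and dtype defensively: bfloat16 where supported,
# float16 on older GPUs, float32 as a safe CPU fallback.
# (Heuristic sketch; adjust to your setup.)
if torch.cuda.is_available():
    device = "cuda"
    dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
else:
    device = "cpu"
    dtype = torch.float32

print(device, dtype)
```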

### Image Generation

```python
import torch

from model.decoder import LatentUMDecoderModel
from model.latentum import LatentUMModel

dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Base",  # alternative: "SJTU-DENG-Lab/LatentUM-GenEval"
    device=device,
    dtype=dtype,
)
decoder = LatentUMDecoderModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Decoder",
    device=device,
    dtype=dtype,
)
images = model.generate_images(
    "a photo of a cute dog",
    decoder=decoder,
    show_progress=True,
)
images[0].save("generated.png")
```

## Citation

```bibtex
@article{jin2026latentum,
  title   = {LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model},
  author  = {Jiachun Jin and Zetong Zhou and Xiao Yang and Hao Zhang and Pengfei Liu and Jun Zhu and Zhijie Deng},
  journal = {arXiv preprint arXiv:2604.02097},
  year    = {2026},
  url     = {https://arxiv.org/abs/2604.02097}
}
```