nielsr HF Staff commited on
Commit
554e32c
·
verified ·
1 Parent(s): 0cc47ec

Add model card for LatentUM


Hi! I'm Niels, part of the community science team at Hugging Face. I'm opening this PR to add a model card to your repository.

This model card includes:
- A link to the [original paper](https://huggingface.co/papers/2604.02097).
- A link to the official GitHub repository.
- Metadata for the `apache-2.0` license and the `image-to-image` pipeline tag.
- Sample usage code snippets for image understanding and image generation as found in your README.

Feel free to merge this if it looks good!

Files changed (1):
  1. README.md (new file, +80 −0)
---
license: apache-2.0
pipeline_tag: image-to-image
---

# LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model

**LatentUM** unifies all modalities within a shared semantic latent space, enabling interleaved cross-modal reasoning without pixel-space mediation. Unlike existing unified models that require pixel decoding as a bridge between understanding and generation, LatentUM reasons directly over its own generated visual content.

This repository specifically contains the **Pixel Decoder**, an optional diffusion-based decoder (based on Stable Diffusion 3.5 Medium) that renders pixel-space images from the shared semantic latents.

- **Paper:** [LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model](https://huggingface.co/papers/2604.02097)
- **Repository:** [https://github.com/SJTU-DENG-Lab/LatentUM](https://github.com/SJTU-DENG-Lab/LatentUM)

## Sample Usage

To use this model, please follow the installation instructions in the [official repository](https://github.com/SJTU-DENG-Lab/LatentUM) first.

### Image Understanding

```python
import torch

from model.latentum import LatentUMModel

# Use bfloat16 and run on GPU when available.
dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Base",
    device=device,
    dtype=dtype,
)

# Ask a question about a local image.
answer = model.answer(
    "asset/blue_apple.png",
    "Describe this image.",
)
print(answer)
```

### Image Generation

```python
import torch

from model.decoder import LatentUMDecoderModel
from model.latentum import LatentUMModel

dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

# The base model produces semantic latents; the Pixel Decoder renders them to pixels.
model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Base",  # alternative: "SJTU-DENG-Lab/LatentUM-GenEval"
    device=device,
    dtype=dtype,
)
decoder = LatentUMDecoderModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Decoder",
    device=device,
    dtype=dtype,
)

images = model.generate_images(
    "a photo of a cute dog",
    decoder=decoder,
    show_progress=True,
)
images[0].save("generated.png")
```

## Citation

```bibtex
@article{jin2026latentum,
  title   = {LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model},
  author  = {Jiachun Jin and Zetong Zhou and Xiao Yang and Hao Zhang and Pengfei Liu and Jun Zhu and Zhijie Deng},
  journal = {arXiv preprint arXiv:2604.02097},
  year    = {2026},
  url     = {https://arxiv.org/abs/2604.02097}
}
```