Add model card for LatentUM

#1 by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +80 -0
README.md ADDED
---
license: apache-2.0
pipeline_tag: image-to-image
---

# LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model

**LatentUM** unifies all modalities within a shared semantic latent space, enabling interleaved cross-modal reasoning without pixel-space mediation. Unlike existing unified models, which require pixel decoding as a bridge between understanding and generation, LatentUM reasons directly over its own generated visual content.

This repository contains the **Pixel Decoder**, an optional diffusion-based decoder (built on Stable Diffusion 3.5 Medium) that renders pixel-space images from the shared semantic latents.

- **Paper:** [LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model](https://huggingface.co/papers/2604.02097)
- **Repository:** [https://github.com/SJTU-DENG-Lab/LatentUM](https://github.com/SJTU-DENG-Lab/LatentUM)
## Sample Usage

To use this model, please follow the installation instructions in the [official repository](https://github.com/SJTU-DENG-Lab/LatentUM).
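As a minimal setup sketch (the repository URL is taken from this card, but the `requirements.txt` file name is an assumption; defer to the repository's own instructions if they differ):

```shell
# Clone the official repository (URL from this model card).
git clone https://github.com/SJTU-DENG-Lab/LatentUM
cd LatentUM

# Install dependencies; the requirements file name is an assumption --
# follow the repository's installation instructions if they differ.
pip install -r requirements.txt
```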
### Image Understanding

```python
import torch

from model.latentum import LatentUMModel

dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Base",
    device=device,
    dtype=dtype,
)
answer = model.answer(
    "asset/blue_apple.png",
    "Describe this image.",
)
print(answer)
```

### Image Generation

```python
import torch

from model.decoder import LatentUMDecoderModel
from model.latentum import LatentUMModel

dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Base",  # alternative: "SJTU-DENG-Lab/LatentUM-GenEval"
    device=device,
    dtype=dtype,
)
decoder = LatentUMDecoderModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Decoder",
    device=device,
    dtype=dtype,
)
images = model.generate_images(
    "a photo of a cute dog",
    decoder=decoder,
    show_progress=True,
)
images[0].save("generated.png")
```

## Citation

```bibtex
@article{jin2026latentum,
  title   = {LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model},
  author  = {Jiachun Jin and Zetong Zhou and Xiao Yang and Hao Zhang and Pengfei Liu and Jun Zhu and Zhijie Deng},
  journal = {arXiv preprint arXiv:2604.02097},
  year    = {2026},
  url     = {https://arxiv.org/abs/2604.02097}
}
```