nielsr (HF Staff) committed
Commit 6e6facd · verified · 1 Parent(s): 8b5917a

Add paper and code links to model card


Hi! I'm Niels from the Hugging Face community science team.

I'm opening this PR to improve your model card by adding links to the associated research paper, the official GitHub repository, and the project blog. Providing these links helps researchers and users easily find technical details and the implementation codebase.

Files changed (1)
  1. README.md +48 -44
README.md CHANGED
@@ -1,44 +1,48 @@
- ---
- datasets:
- - Skylion007/openwebtext
- language:
- - en
- library_name: transformers
- license: apache-2.0
- metrics:
- - perplexity
- pipeline_tag: text-generation
- ---
-
- # LangFlow
-
- LangFlow is a continuous diffusion language model that operates in embedding space. Unlike discrete diffusion models (MDLM, SEDD, DUO), LangFlow performs diffusion directly on continuous token embeddings, enabling smoother denoising dynamics.
-
- ## Using LangFlow
-
- To use the pre-trained model for text generation, use the following snippet:
-
- ```python
- from transformers import AutoModelForMaskedLM, AutoTokenizer
-
- tokenizer = AutoTokenizer.from_pretrained('gpt2')
- model = AutoModelForMaskedLM.from_pretrained('chumengl/langflow-owt', trust_remote_code=True)
-
- # Generate samples
- samples = model.generate_samples(num_samples=5, num_steps=128)
- texts = tokenizer.batch_decode(samples, skip_special_tokens=True)
- for text in texts:
-     print(text)
- ```
-
- ## Model Details
-
- - **Architecture**: DiT (Diffusion Transformer) backbone with adaptive layer normalization
- - **Context Length**: 1024 tokens
- - **Parameters**: ~130M non-embedding parameters (similar to GPT-2 medium)
- - **Training**: 1M steps on OpenWebText corpus
- - **Tokenizer**: GPT-2 tokenizer (50,257 vocab size)
-
- ## Model Card Contact
-
- Chumeng Liang (chumengl@illinois.edu)
+ ---
+ datasets:
+ - Skylion007/openwebtext
+ language:
+ - en
+ library_name: transformers
+ license: apache-2.0
+ metrics:
+ - perplexity
+ pipeline_tag: text-generation
+ ---
+
+ # LangFlow
+
+ LangFlow is a continuous diffusion language model that operates in embedding space. Unlike discrete diffusion models (MDLM, SEDD, DUO), LangFlow performs diffusion directly on continuous token embeddings, enabling smoother denoising dynamics. It is the first continuous DLM to rival discrete diffusion models on standard language modeling benchmarks like LM1B and OpenWebText.
+
+ - **Paper:** [LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling](https://huggingface.co/papers/2604.11748)
+ - **Code:** [GitHub Repository](https://github.com/nealchen2003/LangFlow)
+ - **Project Blog:** [LangFlow Blog Post](https://caradryanl.github.io/blog/2026/langflow/)
+
+ ## Using LangFlow
+
+ To use the pre-trained model for text generation, use the following snippet:
+
+ ```python
+ from transformers import AutoModelForMaskedLM, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained('gpt2')
+ model = AutoModelForMaskedLM.from_pretrained('chumengl/langflow-owt', trust_remote_code=True)
+
+ # Generate samples
+ samples = model.generate_samples(num_samples=5, num_steps=128)
+ texts = tokenizer.batch_decode(samples, skip_special_tokens=True)
+ for text in texts:
+     print(text)
+ ```
+
+ ## Model Details
+
+ - **Architecture**: DiT (Diffusion Transformer) backbone with adaptive layer normalization
+ - **Context Length**: 1024 tokens
+ - **Parameters**: ~130M non-embedding parameters (similar to GPT-2 medium)
+ - **Training**: 1M steps on OpenWebText corpus
+ - **Tokenizer**: GPT-2 tokenizer (50,257 vocab size)
+
+ ## Model Card Contact
+
+ Chumeng Liang (chumengl@illinois.edu)
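
As a conceptual aside on the card's description of "diffusion directly on continuous token embeddings": generation in such models amounts to starting from Gaussian noise in embedding space and integrating a learned velocity field toward clean embeddings. The toy NumPy sketch below illustrates that integration loop only; it is not LangFlow's implementation, and `toy_velocity`, `euler_sample`, and the 4-dimensional `target` vector are made-up stand-ins for the learned network and real token embeddings.

```python
import numpy as np

def toy_velocity(x, t, target):
    # Stand-in for the learned velocity network: for a linear
    # interpolation path, the conditional velocity pointing from the
    # current state x toward the clean embedding is (target - x) / (1 - t).
    return (target - x) / (1.0 - t)

def euler_sample(target, num_steps=128, seed=0):
    # Start from Gaussian noise in embedding space and integrate
    # dx/dt = v(x, t) with fixed-step Euler from t = 0 to t = 1.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + toy_velocity(x, t, target) * dt
    return x

# A fake 4-dimensional "clean embedding" for illustration.
target = np.array([1.0, -2.0, 0.5, 3.0])
sample = euler_sample(target, num_steps=128)
print(np.round(sample, 3))
```

The `num_steps` knob here plays the same role as the `num_steps=128` argument in the card's `generate_samples` call: more steps means a finer discretization of the denoising trajectory.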