| --- |
| datasets: |
| - Skylion007/openwebtext |
| papers: |
| - arxiv: 2604.11748 |
| language: |
| - en |
| library_name: transformers |
| license: apache-2.0 |
| metrics: |
| - perplexity |
| pipeline_tag: text-generation |
| --- |
| |
| # LangFlow |
|
|
| LangFlow is a continuous diffusion language model that operates in embedding space. Unlike discrete diffusion models such as MDLM, SEDD, and DUO, LangFlow runs the diffusion process directly on continuous token embeddings rather than on discrete token states, which enables smoother denoising dynamics. |
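
To make "diffusion on continuous token embeddings" concrete, here is a minimal toy sketch. It is not the LangFlow implementation: the 2-D embedding table, the straight-line interpolation between Gaussian noise and the clean embedding, and the nearest-neighbor decoding step are all assumptions chosen for illustration. The actual noise schedule and decoding procedure are described in the paper.

```python
import math
import random

# Toy vocabulary with hand-picked 2-D embeddings (hypothetical values).
EMBEDDINGS = {
    "the": (1.0, 0.0),
    "cat": (0.0, 1.0),
    "sat": (-1.0, 0.0),
    "down": (0.0, -1.0),
}


def interpolate(clean, noise, t):
    """Point on a straight path from pure noise (t=0) to the clean embedding (t=1)."""
    return tuple((1.0 - t) * n + t * c for c, n in zip(clean, noise))


def nearest_token(vector):
    """Map a continuous vector back to the token with the closest embedding."""
    return min(EMBEDDINGS, key=lambda tok: math.dist(EMBEDDINGS[tok], vector))


rng = random.Random(0)
tokens = ["the", "cat", "sat", "down"]
# One Gaussian noise vector per token position.
noise = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in tokens]

# Decode at three points along the path; only t=1.0 is guaranteed to
# recover the original tokens, since the noise term vanishes there.
for t in (0.0, 0.5, 1.0):
    noisy = [interpolate(EMBEDDINGS[tok], n, t) for tok, n in zip(tokens, noise)]
    decoded = [nearest_token(v) for v in noisy]
    print(f"t={t:.1f}: {decoded}")
```

The point of the sketch is that every intermediate state is a real-valued vector, so the model can move it by small amounts at each step, whereas a discrete diffusion model moves between token states.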
|
|
| For more details, please see our paper: [LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling](https://arxiv.org/abs/2604.11748). |
|
|
|
|
| ## Using LangFlow |
|
|
| The following snippet loads the pre-trained model and generates text samples: |
|
|
| ```python |
| from transformers import AutoModelForMaskedLM, AutoTokenizer |
| |
| # LangFlow uses the GPT-2 tokenizer |
| tokenizer = AutoTokenizer.from_pretrained('gpt2') |
| # trust_remote_code=True lets transformers run the custom model code shipped with the checkpoint |
| model = AutoModelForMaskedLM.from_pretrained('chumengl/langflow-owt', trust_remote_code=True) |
| |
| # Generate 5 samples with 128 steps |
| samples = model.generate_samples(num_samples=5, num_steps=128) |
| # Decode the generated token IDs back into text |
| texts = tokenizer.batch_decode(samples, skip_special_tokens=True) |
| for text in texts: |
|     print(text) |
| ``` |
|
|
| ## Model Details |
|
|
| - **Architecture**: DiT (Diffusion Transformer) backbone with adaptive layer normalization |
| - **Context Length**: 1024 tokens |
| - **Parameters**: ~130M non-embedding parameters (comparable in scale to GPT-2 small) |
| - **Training**: 1M steps on the OpenWebText corpus |
| - **Tokenizer**: GPT-2 tokenizer (vocabulary size 50,257) |
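
The adaptive layer normalization mentioned under Architecture can be sketched as follows. This is a minimal pure-Python illustration of the general DiT-style modulation, `layer_norm(x) * (1 + scale) + shift`, and not code taken from the LangFlow repository. The function names and the numeric values are hypothetical, and how the model actually derives `shift` and `scale` from its conditioning signal is an assumption noted in the comments.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaptive_layer_norm(x, shift, scale):
    """Modulate the normalized vector with conditioning-dependent shift and scale."""
    return [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]


hidden = [0.5, -1.0, 2.0, 0.0]

# With zero shift and scale the layer reduces to plain layer normalization.
plain = adaptive_layer_norm(hidden, shift=[0.0] * 4, scale=[0.0] * 4)

# Hypothetical modulation values. In a DiT-style block they would typically be
# produced by a small network applied to the conditioning signal (for example
# a timestep embedding); that wiring is assumed here, not shown.
modulated = adaptive_layer_norm(hidden, shift=[0.1] * 4, scale=[0.5] * 4)

print(plain)
print(modulated)
```

Because the shift and scale depend on the conditioning rather than being fixed learned constants, the same block can behave differently at different noise levels.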
|
|
| ## Citation |
|
|
| ``` |
| @article{chen2026langflow, |
| title={LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling}, |
| author={Chen, Yuxin and Liang, Chumeng and Sui, Hangke and Guo, Ruihan and Cheng, Chaoran and You, Jiaxuan and Liu, Ge}, |
| journal={arXiv preprint arXiv:2604.11748}, |
| year={2026} |
| } |
| ``` |
|
|
| ## Model Card Contact |
|
|
| Chumeng Liang (chumengl@illinois.edu) |
|
|
|
|