Abstract
Micro language models enable instant on-device response initiation with cloud-based continuation, achieving low-latency interactive AI through asymmetric collaboration between edge and cloud computing.
Edge devices such as smartwatches and smart glasses cannot continuously run even the smallest 100M-1B-parameter language models due to power and compute constraints, yet cloud inference introduces multi-second latencies that break the illusion of a responsive assistant. We introduce micro language models (μLMs): ultra-compact models (8M-30M parameters) that instantly generate the first 4-8 words of a contextually grounded response on-device while a cloud model completes it, thus masking the cloud latency. We show that useful language generation survives at this extreme scale, with our models matching several existing 70M-256M-class models. We design a collaborative generation framework that reframes the cloud model as a continuator rather than a respondent, achieving seamless mid-sentence handoffs and graceful recovery via three structured error-correction methods when the local opener goes wrong. Empirical results show that μLMs can initiate responses that larger models complete seamlessly, demonstrating that orders-of-magnitude asymmetric collaboration is achievable and unlocking responsive AI for extremely resource-constrained devices. The model checkpoint and demo are available at https://github.com/Sensente/micro_language_model_swen_project.
Community
In this paper, we explore a practical way to improve the responsiveness of AI assistants by using micro language models to generate the first few words locally before handing off to a larger cloud model. We focus on perceived latency, which is often overlooked when language models are evaluated only by final answer quality. We believe this collaborative design is especially useful for edge devices, where fast interaction matters but full on-device inference is still too costly.
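To make the collaborative design concrete, here is a minimal sketch of the opener/continuator handoff the comment describes. The model calls are stubs: the paper's actual μLM and cloud APIs are not given on this page, so `local_open` and `cloud_continue` are hypothetical names, and the fixed strings stand in for real generation.

```python
# Hypothetical sketch of the local-opener / cloud-continuator handoff.
# local_open and cloud_continue are assumed names, not the paper's API.

def local_open(prompt: str) -> str:
    """On-device μLM: instantly emit the first few words of the reply (stubbed)."""
    return "Sure, here is a quick"

def cloud_continue(prompt: str, opener: str) -> str:
    """Cloud model as *continuator*: it receives the opener as a committed
    prefix and must finish the sentence rather than restart it (stubbed)."""
    return opener + " summary of what changed in your inbox today."

def respond(prompt: str) -> tuple[str, str]:
    opener = local_open(prompt)            # shown to the user immediately
    full = cloud_continue(prompt, opener)  # arrives later, masked by the opener
    return opener, full

opener, full = respond("Summarize my inbox")
assert full.startswith(opener)  # seamless mid-sentence handoff
```

The key design point is that the cloud prompt treats the local opener as an immutable prefix, so perceived latency is just the on-device time to produce those first words.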
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- SpecSteer: Synergizing Local Context and Global Reasoning for Efficient Personalized Generation (2026)
- Balancing Latency and Accuracy of Code Completion via Local-Cloud Model Cascading (2026)
- Local-Splitter: A Measurement Study of Seven Tactics for Reducing Cloud LLM Token Usage on Coding-Agent Workloads (2026)
- Efficient Reasoning on the Edge (2026)
- How Small Can 6G Reason? Scaling Tiny Language Models for AI-Native Networks (2026)
- Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM (2026)
- HUOZIIME: An On-Device LLM-enhanced Input Method for Deep Personalization (2026)