arxiv:2509.21892

Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts

Published on May 11

Abstract

Elastic Mixture-of-Experts (EMoE) addresses the inference-time scaling wall by enabling dynamic expert activation through collaborative training and improved routing mechanisms.

Mixture-of-Experts (MoE) models typically fix the number of activated experts k at both training and inference. However, real-world deployments often face heterogeneous hardware, fluctuating workloads, and diverse quality-latency requirements, while training separate models for each scenario is costly. Considering that MoE models already operate with sparse activation, adjusting the number of activated experts offers a natural path to serving diverse budgets with a single model. Yet, we find that activating more experts k' (> k) at inference does not yield the expected gains. Instead, performance degrades rapidly after only a slight increase, a phenomenon we term the inference-time scaling wall. Further investigation reveals that this degradation stems from a lack of learned collaboration among experts. To address this, we introduce Elastic Mixture-of-Experts (EMoE), a novel training framework that enables MoE models to elastically vary the number of activated experts at inference. By simultaneously training experts to collaborate in diverse combinations and encouraging the router to make high-quality selections, EMoE ensures robust performance across inference budgets. Extensive experiments across four MoE architectures (7B-21B) and nine benchmarks show that EMoE significantly expands the effective scaling range to 2-3× the training-time k, while also achieving higher peak performance.
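
To make the idea of changing the number of activated experts at inference concrete, here is a minimal NumPy sketch of a toy top-k MoE layer in which k is an ordinary call-time argument. It only illustrates what "activating k' experts" means mechanically; it is not the EMoE training procedure, and the function name `moe_forward`, the tensor shapes, the linear experts, and the choice to renormalize the softmax over the selected experts are all assumptions made for illustration.

```python
import numpy as np

def moe_forward(x, router_w, expert_ws, k):
    """Route one token through the top-k experts of a toy MoE layer.

    Hypothetical shapes: x is (d,), router_w is (d, n_experts),
    expert_ws is (n_experts, d, d) with one linear map per expert.
    """
    logits = x @ router_w                    # one router score per expert
    top = np.argsort(logits)[::-1][:k]       # indices of the k highest-scoring experts
    gate = np.exp(logits[top] - logits[top].max())
    gate /= gate.sum()                       # softmax renormalized over the selected experts (assumed)
    out = sum(g * (x @ expert_ws[i]) for g, i in zip(gate, top))
    return out, top, gate

rng = np.random.default_rng(0)
d, n_experts = 8, 16
x = rng.normal(size=d)
router_w = rng.normal(size=(d, n_experts))
expert_ws = rng.normal(size=(n_experts, d, d)) / np.sqrt(d)

# Same weights, different inference budgets: only k changes between calls.
for k in (2, 4, 6):
    out, top, gate = moe_forward(x, router_w, expert_ws, k)
    print(k, top.tolist(), np.round(gate, 3).tolist())
```

In this sketch a larger k simply appends lower-ranked experts to the mixture and shrinks the weight on the original ones. The abstract's observation is that a model trained at a fixed k has never seen these larger combinations, which is the gap the proposed training framework targets.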
