REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression

GLM-4.7-REAP-25P-W4A16

Note:

The next version is brewing right now, as I ran into some issues with the math in this version.

I'm currently running the observations procedure on a wider range of datasets. The process is on iteration 5 now ;) meaning the previous versions didn't quite cut it. Stay tuned; I'll upload the next version both with and without quantization.

Summary

A 25% expert-pruned GLM-4.7 optimized for coding, function calling, and agentic workflows. This version works well with Roo Code.

Created using REAP (Router-weighted Expert Activation Pruning) by Cerebras:

  • 268B parameters: 25% of MoE experts pruned (120 of 160 kept per layer) - fits perfectly on 2x RTX 6000 Pro with 117k tokens of fp16 context
  • Calibrated for coding with the agent_calibration_mix_v2.jsonl dataset
  • Works with vLLM

Acknowledgments


Model Specifications

  • Base Model: zai/glm-4.7
  • Architecture: Sparse Mixture-of-Experts (SMoE)
  • Original Parameters: 358B
  • Pruned Parameters: 268B
  • Compression: 25% of experts removed
  • Experts per Layer: 120 (was 160)
  • MoE Layers: 92
  • Activated Experts: 8 per token
  • Precision: W4A16
  • Disk Size: ~135 GB
  • VRAM Required: ~180 GB with 117k tokens of fp16 context

Calibration Dataset

The quality of a REAP-pruned model strongly depends on the calibration dataset used to detect the most and least activated experts.

This detection procedure is called "observations" and is quite computationally expensive, especially when hardware is limited. I have less VRAM than the original model size, so it had to be offloaded to RAM and even partially to NVMe storage. The original observations procedure for 1,024 samples took more than 100 hours on my hardware, so I modified the original observer to test all experts on a per-layer basis. This significantly lowered the VRAM requirements (I was able to complete it with ~24GB of VRAM) and reduced the time needed from ~100 hours to ~10 hours.
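To make the per-layer idea concrete, here is a minimal sketch of what the observations step accumulates. It is not the actual Cerebras observer: the function names, the score layout, and the simplified saliency (router weight only, whereas REAP also weights by expert output magnitude) are all assumptions for illustration. The top_k of 8 matches the 8 activated experts per token listed in the specs above.

import torch

def update_expert_scores(scores, router_logits, top_k=8):
    # router_logits: [num_tokens, num_experts] captured from ONE MoE layer
    gate = torch.softmax(router_logits.float(), dim=-1)   # router weights per token
    top_w, top_i = gate.topk(top_k, dim=-1)               # the experts actually activated
    routed = torch.zeros_like(gate)
    routed.scatter_(-1, top_i, top_w)                      # keep only the routed weight
    return scores + routed.sum(dim=0)                      # accumulate per-expert totals

num_layers, num_experts = 92, 160
scores = [torch.zeros(num_experts) for _ in range(num_layers)]

# Inside the calibration loop, after each forward pass over a batch:
# for i, logits in enumerate(captured_router_logits):   # one tensor per MoE layer
#     scores[i] = update_expert_scores(scores[i], logits)

torch.save(scores, "observations.pt")  # reusable for any pruning ratio later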

Once the observations file is generated, it can be reused to perform multiple REAP operations (e.g., 10%, 30%, or any other ratio you want).
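As a sketch of that reuse (the file name and score layout follow the illustration above, so they are assumptions rather than the actual REAP file format):

import torch

def experts_to_prune(layer_scores, ratio):
    # drop the lowest-scoring experts in this layer
    n_prune = int(len(layer_scores) * ratio)
    return torch.argsort(layer_scores)[:n_prune].tolist()

scores = torch.load("observations.pt")   # generated once, during the observation step
for ratio in (0.10, 0.25, 0.30):
    drop = [experts_to_prune(s, ratio) for s in scores]
    # at ratio 0.25 each layer keeps 120 of its 160 experts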

I used a mix of four datasets to create the 1024 samples:

  • glaiveai/glaive-function-calling-v2
  • theblackcat102/evol-codealpaca-v1
  • nampdn-ai/tiny-codes
  • HuggingFaceH4/ultrachat_200k
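For reference, a rough sketch of how such a mix can be assembled with the Hugging Face datasets library. The split names, the even 4 x 256 split, and keeping rows as raw per-source records are assumptions; the published dataset below is the authoritative version.

import json
import random
from datasets import load_dataset

SOURCES = [
    ("glaiveai/glaive-function-calling-v2", "train"),
    ("theblackcat102/evol-codealpaca-v1", "train"),
    ("nampdn-ai/tiny-codes", "train"),
    ("HuggingFaceH4/ultrachat_200k", "train_sft"),
]

samples = []
for repo, split in SOURCES:
    ds = load_dataset(repo, split=split).shuffle(seed=42)
    # each source has its own schema; a real mix would map rows to one chat/text format
    samples.extend(dict(row) for row in ds.select(range(256)))  # 4 x 256 = 1024 samples

random.seed(42)
random.shuffle(samples)

with open("agent_calibration_mix_v2.jsonl", "w") as f:
    for row in samples:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")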

Resulting Dataset

AImhotep/agent_calibration_mix_v2
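The published mix can be pulled straight from the Hub if you want to rerun the observations yourself (the split name here is an assumption):

from datasets import load_dataset

# the 1,024 calibration samples used for the observations step
calib = load_dataset("AImhotep/agent_calibration_mix_v2", split="train")
print(len(calib), calib[0])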


Running in vLLM (example)

Example of glm47/generation_config.json

{
  "_from_model_config": true,
  "do_sample": true,
  "pad_token_id": 151329,
  "eos_token_id": [
    151329,
    151336,
    151338
  ],
  "top_p": 0.95,
  "temperature": 0.8,
  "repetition_penalty": 1.05,
  "top_k": 40,
  "min_p": 0.0,
  "transformers_version": "4.57.3"
}
#!/bin/bash
# Launch script for serving the pruned model with vLLM on 2x RTX 6000 Pro (tensor parallel)

source .venv/bin/activate

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export TORCH_ALLOW_TF32=1
export PYTORCH_CUDA_ALLOC_CONF=""

export VLLM_ATTENTION_BACKEND="FLASHINFER"
export TORCH_CUDA_ARCH_LIST="12.0"
export CUDA_VISIBLE_DEVICES=1,2 #both RTX6000
export VLLM_MARLIN_USE_ATOMIC_ADD=1
export SAFETENSORS_FAST_GPU=1
export OMP_NUM_THREADS=62 # Epyc - tune it to your needs

export VLLM_FLASHINFER_MOE_BACKEND=latency

# without this, vLLM keeps a couple of cores at 100% even when idle - idle power goes from ~178 W to ~250 W
export VLLM_SLEEP_WHEN_IDLE=1

export NCCL_ALGO=Ring
export NCCL_PROTO=Simple
export NCCL_MIN_NCHANNELS=4
export NCCL_MAX_NCHANNELS=8
export NCCL_BUFFSIZE=8388608

vllm serve AImhotep/GLM-4.7-REAP-25P-W4A16 \
    --tensor-parallel-size 2 \
    --uvicorn-log-level info \
    --trust-remote-code \
    --gpu-memory-utilization 0.92 \
    --max-num-seqs 1 \
    --seed 42 \
    --max-model-len 117000 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --enable-sleep-mode \
    --compilation-config '{"level": 3, "cudagraph_capture_sizes": [1]}' \
    # glm47 is the folder holding the custom generation_config.json shown above - optional, but lets you customize temperature etc.
    --generation-config glm47 \
    --host 0.0.0.0 \
    --port 11110
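Once the server is up, it exposes the standard vLLM OpenAI-compatible API on port 11110. A minimal smoke test could look like this (the prompt and timeout are arbitrary; the model name must match the served repo path):

import requests

resp = requests.post(
    "http://localhost:11110/v1/chat/completions",
    json={
        "model": "AImhotep/GLM-4.7-REAP-25P-W4A16",
        "messages": [{"role": "user", "content": "Write a Python function that reverses a string."}],
        "max_tokens": 256,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])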

Citation

@article{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025},
  url={https://arxiv.org/abs/2510.13999}
}