Model Overview

  • Model Architecture: GLM-5
    • Input: Text
    • Output: Text
  • Supported Hardware Microarchitecture: AMD MI350/MI355
  • ROCm: 7.1.0
  • Operating System(s): Linux
  • Inference Engine: vLLM
  • Model Optimizer: AMD-Quark (V0.11.1)
    • moe
      • Weight quantization: MOE-only, OCP MXFP4, Static
      • Activation quantization: MOE-only, OCP MXFP4, Dynamic
  • Calibration Dataset: Pile

This model was built by applying AMD-Quark MXFP4 quantization to the GLM-5 model.
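To make the format concrete, here is a minimal pure-Python sketch of OCP MXFP4 block quantization: each block of up to 32 values shares one power-of-two scale (E8M0), and each element is stored as a 4-bit E2M1 value. It is an illustration only and is not AMD-Quark's implementation; the function names are hypothetical, and ties are rounded toward the smaller magnitude for simplicity, whereas real kernels typically round to nearest even. With static quantization (weights) the scales are computed once offline; with dynamic quantization (activations) they are computed at inference time.

```python
import math

# Magnitudes representable by FP4 E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32   # elements that share one scale in the MX formats
E2M1_EMAX = 2     # exponent of the largest E2M1 value: 6 = 1.5 * 2**2

# Storage cost: 32 four-bit elements plus one 8-bit shared scale per block.
BITS_PER_ELEMENT = (4 * BLOCK_SIZE + 8) / BLOCK_SIZE


def quantize_block(values):
    """Quantize one block (at most BLOCK_SIZE floats).

    Returns (shared_exponent, fp4_values), where each fp4 value lies on the
    signed E2M1 grid and the block scale is 2**shared_exponent.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0.0] * len(values)
    # floor(log2(max_abs)) computed exactly through frexp.
    shared_exp = (math.frexp(max_abs)[1] - 1) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    quantized = []
    for v in values:
        # Scale into the E2M1 range and clamp to the largest magnitude (6.0).
        magnitude = min(abs(v) / scale, FP4_E2M1_GRID[-1])
        nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - magnitude))
        quantized.append(math.copysign(nearest, v))
    return shared_exp, quantized


def dequantize_block(shared_exp, fp4_values):
    """Reconstruct approximate floats from a quantized block."""
    scale = 2.0 ** shared_exp
    return [q * scale for q in fp4_values]
```

For example, `quantize_block([12.0, 1.0])` picks a shared exponent of 1 (scale 2.0) and stores the elements as 6.0 and 0.5, which dequantize back to 12.0 and 1.0 exactly. Values that fall between grid points are rounded, which is the source of the small accuracy loss reported in the evaluation.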

Model Quantization

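The model was quantized from zai-org/GLM-5 using AMD-Quark. Only the MoE expert layers are quantized: their weights (static) and activations (dynamic) use OCP MXFP4, while the layers matched by exclude_layers in the script below are left unquantized.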

Quantization script:

from quark.torch import LLMTemplate, ModelQuantizer

# --- Register GLM-5 template ---
GLM5_template = LLMTemplate(
    model_type="glm_moe_dsa",
    kv_layers_name=["*kv_a_proj_with_mqa", "*kv_b_proj"],
    q_layer_name="*q_a_proj",
    exclude_layers_name=["lm_head"],
)
LLMTemplate.register_template(GLM5_template)
print(f"[INFO]: Registered template '{GLM5_template.model_type}'")

# --- Configuration ---
model_dir = "zai-org/GLM-5"
output_dir = "amd/GLM-5-MXFP4"
quant_scheme = "mxfp4"
exclude_layers = [
    "*self_attn*",
    "*mlp.gate",
    "*lm_head",
    "*mlp.gate_proj",
    "*mlp.up_proj",
    "*mlp.down_proj",
    "*shared_experts*",
]

# --- Build quant config from template ---
template = LLMTemplate.get("glm_moe_dsa")
quant_config = template.get_config(scheme=quant_scheme, exclude_layers=exclude_layers)

# --- File-to-file quantization (memory-efficient, no full model loading) ---
quantizer = ModelQuantizer(quant_config)
quantizer.direct_quantize_checkpoint(
    pretrained_model_path=model_dir,
    save_path=output_dir,
)

print(f"[INFO]: Quantization complete. Output saved to {output_dir}")
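
The effect of the exclude_layers patterns can be checked with a small sketch. It assumes the patterns behave like shell-style globs (as in Python's fnmatch); AMD-Quark's actual matching logic may differ. The layer names are hypothetical examples in a DeepSeek-style naming scheme and are not read from the real checkpoint.

```python
from fnmatch import fnmatchcase

# Same patterns as exclude_layers in the quantization script.
EXCLUDE_PATTERNS = [
    "*self_attn*",
    "*mlp.gate",
    "*lm_head",
    "*mlp.gate_proj",
    "*mlp.up_proj",
    "*mlp.down_proj",
    "*shared_experts*",
]

# Hypothetical layer names, for illustration only.
LAYER_NAMES = [
    "model.layers.0.self_attn.q_a_proj",          # attention projection
    "model.layers.0.mlp.gate_proj",               # dense (non-MoE) MLP
    "model.layers.3.mlp.gate",                    # MoE router
    "model.layers.3.mlp.shared_experts.up_proj",  # shared expert
    "model.layers.3.mlp.experts.0.gate_proj",     # routed expert
    "model.layers.3.mlp.experts.0.down_proj",     # routed expert
    "lm_head",
]


def is_excluded(name, patterns=EXCLUDE_PATTERNS):
    """Return True if any glob pattern matches the full layer name."""
    return any(fnmatchcase(name, pattern) for pattern in patterns)


# Layers that would still be quantized under the glob assumption.
quantized_layers = [name for name in LAYER_NAMES if not is_excluded(name)]
```

Under these assumptions only the two routed-expert projections remain in `quantized_layers`, which is consistent with the "MOE-only" entries in the overview: attention, the router, dense MLPs, shared experts, and lm_head are all excluded.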

Deployment

Use with vLLM

This model can be deployed efficiently using the vLLM backend.

Evaluation

The model was evaluated on the GSM8K benchmark.

Accuracy

Benchmark                  GLM-5   GLM-5-MXFP4 (this model)   Recovery
GSM8K (flexible-extract)   95.45   95.00                      99.53%
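
The Recovery column appears to be the quantized score expressed as a percentage of the baseline score. A short sketch of that arithmetic, using a hypothetical helper name:

```python
def recovery_percent(baseline_score, quantized_score):
    """Quantized accuracy as a percentage of the baseline accuracy."""
    return round(100.0 * quantized_score / baseline_score, 2)


# GSM8K (flexible-extract): GLM-5 = 95.45, GLM-5-MXFP4 = 95.00
gsm8k_recovery = recovery_percent(95.45, 95.00)  # 99.53
```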

Reproduction

The GSM8K results were obtained with the lm-evaluation-harness framework inside the Docker image rocm/pytorch-private:vllm_glm5_0225, with vLLM and lm-eval compiled and installed from source. The image contains the vLLM code modifications needed to support this model.

Launching server

export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_FP8BMM=0
export VLLM_ROCM_USE_AITER_FP4BMM=0
vllm serve amd/GLM-5-MXFP4 \
  -tp 8 \
  --block-size 1 \
  --trust-remote-code \
  --max-model-len 4096
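
Once the server is up, it can be smoke-tested before running the full evaluation. The sketch below assumes the server listens on localhost port 8000 (the address used in the lm_eval command) and exposes the OpenAI-compatible completions endpoint; the helper names, prompt, and sampling settings are illustrative choices, not part of the original setup.

```python
import json
import urllib.request

# Assumed endpoint: same base URL as in the lm_eval command.
BASE_URL = "http://localhost:8000/v1/completions"
MODEL = "amd/GLM-5-MXFP4"


def build_request(prompt, max_tokens=64):
    """Build a POST request for the OpenAI-compatible completions API."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding for a repeatable check
    }
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, max_tokens=64, timeout=120):
    """Send the prompt to the running server and return the generated text."""
    request = build_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Calling `complete("Question: What is 12 * 7?\nAnswer:")` from a second terminal should return a short completion if the server started correctly. Keep prompt plus max_tokens within the 4096-token limit set by --max-model-len.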

Evaluating model in a new terminal

lm_eval \
  --model local-completions \
  --model_args '{"model": "amd/GLM-5-MXFP4", "base_url": "http://localhost:8000/v1/completions", "num_concurrent": 32, "max_retries": 10, "max_gen_toks": 2048, "tokenizer_backend":"None","tokenized_requests":"False" }' \
  --tasks gsm8k \
  --batch_size auto \
  --num_fewshot 5 \
  --trust_remote_code
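
The accuracy table reports the flexible-extract score, which scores the last number found in the model's response rather than requiring a fixed answer format. The sketch below is a simplified approximation of that idea with a hypothetical regex and function name; the harness's actual filter may use a different pattern and normalization.

```python
import re

# Integers or decimals, optionally negative, with optional thousands separators.
NUMBER_PATTERN = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_last_number(text):
    """Return the last number in the text as a string without commas, or None."""
    matches = NUMBER_PATTERN.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


response = "She starts with 3 boxes and buys 1,200 more, so she has 1,203 in total."
predicted = extract_last_number(response)  # "1203"
```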

License

Modifications Copyright(c) 2025 Advanced Micro Devices, Inc. All rights reserved.
