Model Overview

  • Model Architecture: GLM-4.7
    • Input: Text
    • Output: Text
  • Supported Hardware Microarchitecture: AMD MI350/MI355
  • ROCm: 7.0
  • Operating System(s): Linux
  • Inference Engine: vLLM
  • Model Optimizer: AMD-Quark (V0.11)
    • MoE
      • Weight quantization: MoE-only, OCP MXFP4, Static
      • Activation quantization: MoE-only, OCP MXFP4, Dynamic
    • KV cache quantization: OCP FP8, Static
  • Calibration Dataset: Pile

This model was built from the GLM-4.7 model by applying AMD-Quark for MXFP4 quantization.
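
For intuition, MXFP4 is the OCP microscaling format in which each block of 32 values shares a single power-of-two (E8M0) scale and each element is stored as FP4 (E2M1). The NumPy sketch below illustrates that blocking scheme; it is a simplified illustration of the format, not AMD-Quark's implementation, and the rounding details are assumptions.

import numpy as np

# Magnitudes representable by the FP4 (E2M1) element format used in OCP MXFP4.
FP4_E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize_block(block: np.ndarray):
    """Quantize one 32-element block: a shared power-of-two scale plus FP4 elements.
    Illustrative sketch only; not the AMD-Quark kernel."""
    assert block.size == 32
    amax = float(np.abs(block).max())
    # Shared E8M0 scale: exponent of the block maximum minus the element format's
    # maximum exponent (2 for E2M1, whose largest value is 6 = 1.5 * 2**2).
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2) if amax > 0 else 1.0
    scaled = block / scale
    # Round each scaled value to the nearest representable FP4 magnitude
    # (which also saturates anything above 6), keeping the sign.
    idx = np.abs(np.abs(scaled)[:, None] - FP4_E2M1_GRID).argmin(axis=1)
    quantized = np.sign(scaled) * FP4_E2M1_GRID[idx]
    return quantized, scale  # dequantize as quantized * scale

block = np.random.randn(32).astype(np.float32)
q, s = mxfp4_quantize_block(block)
print("max abs error:", np.abs(block - q * s).max())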

Model Quantization

The model was quantized from zai-org/GLM-4.7 using AMD-Quark. The weights and activations are quantized to MXFP4. AMD-Quark was installed from source inside the Docker image rocm/vllm-private:vllm_dev_base_mxfp4_20260122.

Quantization scripts:

Note that GLM-4.7 is not in the built-in model template list of Quark V0.11, so it has to be registered before quantization.

  • Step 1: Register the model template: create the file Quark/examples/torch/language_modeling/llm_ptq/quantize_glm.py
import runpy
from quark.torch import LLMTemplate

# Register GLM-4 MoE template
glm4_moe_template = LLMTemplate(
    model_type="glm4_moe",
    kv_layers_name=["*k_proj", "*v_proj"],
    q_layer_name="*q_proj",
    exclude_layers_name=["lm_head","*mlp.gate","*self_attn*","*shared_experts.*","*mlp.down_proj","*mlp.gate_proj","*mlp.up_proj"],
)
LLMTemplate.register_template(glm4_moe_template)
print(f"[INFO]: Registered template '{glm4_moe_template.model_type}'")

# Run quantize_quark.py as __main__ so it consumes the CLI arguments
# passed to this wrapper script (see Step 2).
quantize_script = "/app/Quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py"

runpy.run_path(quantize_script, run_name="__main__")
  • Step 2: Quantize with quantize_glm.py
export CUDA_VISIBLE_DEVICES=0,1,2,3
export MODEL_DIR=zai-org/GLM-4.7
export output_dir=amd/GLM-4.7-MXFP4

exclude_layers="*self_attn* *mlp.gate lm_head *mlp.gate_proj *mlp.up_proj *mlp.down_proj *shared_experts.*"
python3 quantize_glm.py --model_dir $MODEL_DIR \
                        --quant_scheme mxfp4 \
                        --num_calib_data 128 \
                        --exclude_layers $exclude_layers \
                        --kv_cache_dtype fp8 \
                        --model_export hf_format \
                        --output_dir $output_dir \
                        --multi_gpu
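
After the export finishes, the output directory should contain HF-format weights plus a config.json. A quick sanity check, assuming the export writes the quantization settings into a quantization_config entry in config.json, is:

python3 -c "import json; print(json.dumps(json.load(open('amd/GLM-4.7-MXFP4/config.json')).get('quantization_config', {}), indent=2))"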

Deployment

Use with vLLM

This model can be deployed efficiently using the vLLM backend.
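
Besides serving (see the launch command under Reproduction below), the checkpoint can also be loaded through vLLM's offline API. The snippet below is a minimal sketch; the parallelism and KV-cache settings mirror the serving command, and the prompt and sampling parameters are only placeholders.

from vllm import LLM, SamplingParams

# Load the quantized checkpoint across 4 GPUs with an FP8 KV cache,
# mirroring the serving configuration used for evaluation.
llm = LLM(
    model="amd/GLM-4.7-MXFP4",
    tensor_parallel_size=4,
    kv_cache_dtype="fp8",
)

sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Explain MXFP4 quantization in one paragraph."], sampling_params)
print(outputs[0].outputs[0].text)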

Evaluation

The model was evaluated on the GSM8K benchmark.

Accuracy

Benchmark              GLM-4.7    GLM-4.7-MXFP4 (this model)    Recovery
GSM8K (strict-match)   94.16      93.63                         99.44%

Recovery is the quantized score as a percentage of the baseline score (93.63 / 94.16 ≈ 99.44%).

Reproduction

The GSM8K results were obtained using the lm-evaluation-harness framework, based on the Docker image rocm/vllm-private:vllm_dev_base_mxfp4_20260122, with vLLM, lm-eval and amd-quark compiled and installed from source inside the image.
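
A typical way to start the container is shown below; the device mappings and shared-memory size are the usual ROCm flags, and the volume mount path is only an example.

docker run -it --rm \
    --network=host \
    --ipc=host \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 32G \
    -v /path/to/models:/models \
    rocm/vllm-private:vllm_dev_base_mxfp4_20260122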

Launching server

vllm serve amd/GLM-4.7-MXFP4 \
    --tensor-parallel-size 4 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --kv-cache-dtype fp8
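
Once the server reports it is ready, the OpenAI-compatible endpoint can be checked from another terminal:

curl http://0.0.0.0:8000/v1/models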

Evaluating model in a new terminal

lm_eval \
  --model local-completions \
  --model_args "model=amd/GLM-4.7-MXFP4,base_url=http://0.0.0.0:8000/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size 1

License

Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
