
GLM-4.7-Flash-PRISM

A version of Z.AI's GLM-4.7-Flash with over-refusal, propaganda, and bias mechanisms removed using our Advanced PRISM Pipeline.

☕ Support Our Work

If you find this model useful, consider supporting us on Ko-fi!


| Option | Description |
| --- | --- |
| PRISM VIP Membership | Access to all PRISM models |
| One-Time Support | Support this model |

Model Highlights

  • PRISM Ablation — State-of-the-art technique that removes over-refusal behaviors while preserving model capabilities
  • 30B-A3B MoE Architecture — 30 billion total parameters with ~3 billion active per token for fast, efficient inference
  • 128K Context Window — Extended context for complex tasks and large codebases
  • Interleaved Thinking — Multi-turn reasoning that persists across conversations, with per-turn thinking control (see the sketch after this list)
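
Per-turn thinking control is normally toggled through the chat template. A minimal sketch, assuming the template accepts an enable_thinking flag the way other recent hybrid-reasoning releases do — the flag name is an assumption, not confirmed for this model; check the bundled chat template for the exact keyword:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Ex0bit/GLM-4.7-Flash-PRISM")

messages = [{"role": "user", "content": "Summarize this model card in one sentence."}]

# Turn with reasoning enabled (the usual default for thinking models).
with_thinking = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
    enable_thinking=True,   # assumption: extra kwargs are forwarded to the Jinja template
)

# Turn with reasoning disabled for a faster, direct answer.
without_thinking = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
    enable_thinking=False,  # assumption: see note above
)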

Benchmarks

| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B-Thinking-2507 | GPT-OSS-20B |
| --- | --- | --- | --- |
| AIME 2025 | 91.6 | 85.0 | 91.7 |
| GPQA | 75.2 | 73.4 | 71.5 |
| LCB v6 | 64.0 | 66.0 | 61.0 |
| HLE | 14.4 | 9.8 | 10.9 |
| SWE-bench Verified | 59.2 | 22.0 | 34.0 |
| τ²-Bench | 79.5 | 49.0 | 47.7 |
| BrowseComp | 42.8 | 2.29 | 28.3 |

Usage

Transformers

Install the latest transformers from source:

pip install git+https://github.com/huggingface/transformers.git

Run inference:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "Ex0bit/GLM-4.7-Flash-PRISM"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Hello!"}]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
output_text = tokenizer.decode(generated_ids[0][inputs.input_ids.shape[1]:])
print(output_text)
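
For interactive use you can stream tokens as they are generated instead of waiting for the full completion. A minimal sketch using transformers' built-in TextStreamer, reusing model, tokenizer, and inputs from the snippet above:

from transformers import TextStreamer

# Prints decoded tokens to stdout as they arrive, skipping the prompt and special tokens.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(**inputs, max_new_tokens=128, do_sample=False, streamer=streamer)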

vLLM

Install vLLM nightly:

pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
pip install git+https://github.com/huggingface/transformers.git

Serve the model:

vllm serve Ex0bit/GLM-4.7-Flash-PRISM \
     --tensor-parallel-size 4 \
     --speculative-config.method mtp \
     --speculative-config.num_speculative_tokens 1 \
     --tool-call-parser glm47 \
     --reasoning-parser glm45 \
     --enable-auto-tool-choice \
     --served-model-name glm-4.7-flash-prism
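
The server exposes an OpenAI-compatible API (port 8000 by default). A minimal client sketch using the openai Python package; the prompt is only illustrative:

from openai import OpenAI

# Local vLLM endpoint; no real API key is needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="glm-4.7-flash-prism",  # matches --served-model-name
    messages=[{"role": "user", "content": "Write a haiku about fast inference."}],
    temperature=1.0,
    top_p=0.95,
    max_tokens=512,
)
print(response.choices[0].message.content)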

SGLang

Install SGLang:

uv pip install sglang==0.3.2.dev9039+pr-17247.g90c446848 --extra-index-url https://sgl-project.github.io/whl/pr/
uv pip install git+https://github.com/huggingface/transformers.git@76732b4e7120808ff989edbd16401f61fa6a0afa

Launch the server:

python3 -m sglang.launch_server \
  --model-path Ex0bit/GLM-4.7-Flash-PRISM \
  --tp-size 4 \
  --tool-call-parser glm47  \
  --reasoning-parser glm45 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.8 \
  --served-model-name glm-4.7-flash-prism \
  --host 0.0.0.0 \
  --port 8000

Note: For Blackwell GPUs, add --attention-backend triton --speculative-draft-attention-backend triton to your SGLang launch command.
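
SGLang serves the same OpenAI-compatible API on the host and port given above, so client code mirrors the vLLM example. A short sketch of a streaming request (the prompt is only illustrative):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Stream the completion token by token from the SGLang server.
stream = client.chat.completions.create(
    model="glm-4.7-flash-prism",  # matches --served-model-name
    messages=[{"role": "user", "content": "Explain speculative decoding in two sentences."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()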

Recommended Parameters

| Use Case | Temperature | Top-P | Max New Tokens |
| --- | --- | --- | --- |
| Default | 1.0 | 0.95 | 131072 |
| Code (SWE-bench) | 0.7 | 1.0 | 16384 |
| Agentic Tasks | 0.0 | | 16384 |
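
These settings map onto transformers' generation arguments roughly as follows. A sketch reusing model, tokenizer, and inputs from the Transformers section, with the "Default" values from the table; the smaller max_new_tokens here is just to keep the example cheap:

# Default use case: temperature 1.0, top-p 0.95.
generated_ids = model.generate(
    **inputs,
    do_sample=True,       # sampling must be enabled for temperature/top_p to take effect
    temperature=1.0,
    top_p=0.95,
    max_new_tokens=4096,  # raise toward 131072 if the task and hardware allow
)

# Temperature 0.0 (agentic tasks) corresponds to greedy decoding.
agentic_ids = model.generate(**inputs, do_sample=False, max_new_tokens=16384)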

License

This model is released under the PRISM Research License.

Citation

@misc{elbaz2026glm47flashPrism,
  author = {Elbaz, Eric},
  title = {Elbaz-GLM-4.7-Flash-PRISM: Unchained GLM-4.7-Flash-PRISM Model},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Ex0bit/Elbaz-GLM-4.7-Flash-PRISM}}
}

Acknowledgments

Based on GLM-4.7-Flash by Z.AI. See the technical report for more details on the base model.
