Instructions to use zai-org/GLM-4.7 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use zai-org/GLM-4.7 with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

# device_map="auto" spreads the (very large) model across the available GPUs
pipe = pipeline("text-generation", model="zai-org/GLM-4.7", device_map="auto", torch_dtype="auto")
messages = [
    {"role": "user", "content": "Who are you?"},
]
print(pipe(messages, max_new_tokens=40))
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7")
model = AutoModelForCausalLM.from_pretrained("zai-org/GLM-4.7", device_map="auto", torch_dtype="auto")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

Local Apps

vLLM

How to use zai-org/GLM-4.7 with vLLM:

Install from pip and serve model

```shell
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "zai-org/GLM-4.7"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "zai-org/GLM-4.7",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
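The same OpenAI-compatible request can be sent from Python using only the standard library. This is a minimal sketch, assuming the server started by `vllm serve` is listening on the default `http://localhost:8000`; `build_chat_request` and `send` are hypothetical helper names, not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, content: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-style responses carry the reply in choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(send(build_chat_request("http://localhost:8000", "zai-org/GLM-4.7", "What is the capital of France?")))` sends the same request as the curl call.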

Use Docker

```shell
docker model run hf.co/zai-org/GLM-4.7
```

SGLang

How to use zai-org/GLM-4.7 with SGLang:

Install from pip and serve model

```shell
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "zai-org/GLM-4.7" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "zai-org/GLM-4.7",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker images

```shell
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "zai-org/GLM-4.7" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "zai-org/GLM-4.7",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Docker Model Runner
How to use zai-org/GLM-4.7 with Docker Model Runner:
```
docker model run hf.co/zai-org/GLM-4.7
```

Deploying the open-source GLM-4.7 locally and using it with the claude command: just one last step to go!!!

#32

by wiesencao - opened Jan 14

Discussion

wiesencao

Jan 14 (edited Jan 19)

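Hi everyone,
For the past few days I have been working on deploying GLM-4.7 locally to replace my company's currently expensive Claude Code usage. Chat tests through Cherry Studio work without any problems, but as soon as I connect the model to the claude command, it never calls any tools. Here are the details of my deployment: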
I. Server hardware
GPUs: 8× H20 141 GB
RAM: 2 TB
CPU: Intel(R) Xeon(R) Platinum 8558P, 96 cores
Storage: SSD
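Whether the model weights fit on a given number of these 141 GB cards can be estimated with simple arithmetic. This is a rough sketch under stated assumptions: about 355 billion parameters (an assumed figure, check the model card) stored in BF16 at 2 bytes each, and a usable fraction of 0.90 per GPU to match the `--gpu-memory-utilization` setting in this deployment. It ignores KV cache, activations, and runtime overhead, so "fits" here only means the weights alone fit.

```python
def weight_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def usable_gb(num_gpus: int, gb_per_gpu: float, utilization: float = 0.90) -> float:
    """Memory the inference engine may use across all GPUs."""
    return num_gpus * gb_per_gpu * utilization


def weights_fit(num_params: float, num_gpus: int, gb_per_gpu: float = 141.0) -> bool:
    """True if the weights alone fit; KV cache and activations need extra room."""
    return weight_gb(num_params) < usable_gb(num_gpus, gb_per_gpu)


ASSUMED_PARAMS = 355e9  # assumption: total parameter count, check the model card

for gpus in (4, 8):
    verdict = "fits" if weights_fit(ASSUMED_PARAMS, gpus) else "does not fit"
    print(f"{gpus} GPUs: {usable_gb(gpus, 141.0):.0f} GB usable vs "
          f"{weight_gb(ASSUMED_PARAMS):.0f} GB of weights -> {verdict}")
```

Under these assumptions, 4 cards provide about 508 GB against roughly 710 GB of weights, while 8 cards provide about 1015 GB, leaving a few hundred GB for the KV cache.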
II. Model deployment environment
Deployment method: Docker container
Inference framework: vLLM
vLLM image: vllm/vllm-openai:nightly
The docker-compose-glm-4.7.yaml configuration file is as follows:

```yaml
services:
    vllm-glm-4.7:
      image: docker.1ms.run/vllm/vllm-openai:nightly
      container_name: vllm-glm-4.7
      restart: always
      # GPU configuration: device_ids could limit the visible GPUs; none are set here, so all GPUs are reserved
      deploy:
        resources:
          reservations:
            devices:
              - driver: nvidia
                capabilities: [gpu]
      shm_size: '16gb'
      ipc: host
      environment:
        - NCCL_DEBUG=INFO
        - NCCL_P2P_DISABLE=1
        - NCCL_IB_DISABLE=1
        - NCCL_SOCKET_IFNAME=eth0
        - VLLM_USE_MODELSCOPE=True
        - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
        - VLLM_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
      ports:
        - "8080:8000"
      volumes:
        - /etc/localtime:/etc/localtime:ro
        - /etc/timezone:/etc/timezone:ro
        - /root/.cache/modelscope/hub/models/ZhipuAI/GLM-4.7:/workspace/models
      command:
        - "/workspace/models"
        - "--tensor-parallel-size"
        - "8"
        - "--served-model-name"
        - "GLM-4.7"
        - "--gpu-memory-utilization"
        - "0.90"
        - "--enable-auto-tool-choice"
        - "--tool-call-parser"
        - "glm47"
        - "--reasoning-parser"
        - "glm45"
        - "--speculative-config.method"
        - "mtp"
        - "--speculative-config.num_speculative_tokens"
        - "1"
        - "--max-model-len"
        - "202752"
```
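Before involving Claude Code, it can help to check that the vLLM server itself returns structured tool calls through its OpenAI-compatible API, since `--enable-auto-tool-choice` and `--tool-call-parser` act on that API. This is a minimal sketch, assuming the host port 8080, the served model name `GLM-4.7`, and the `VLLM_API_KEY` from the compose file above (the key is a placeholder here); the `get_weather` tool and the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # host port 8080 maps to the container's port 8000
API_KEY = "sk-xxxx"  # placeholder: use the value of VLLM_API_KEY

# Hypothetical tool definition in OpenAI "function" format
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(prompt: str) -> urllib.request.Request:
    """POST /chat/completions with a tools list so the server may reply with tool_calls."""
    payload = {
        "model": "GLM-4.7",  # must match --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "tools": [WEATHER_TOOL],
        "tool_choice": "auto",
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {API_KEY}"},
        method="POST",
    )


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (name, arguments) pairs; OpenAI format sends arguments as a JSON string."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [(c["function"]["name"], json.loads(c["function"]["arguments"])) for c in calls]
```

To run the check, pass `json.load(urllib.request.urlopen(build_tool_request("What's the weather in Paris?")))` to `extract_tool_calls`. A non-empty list suggests the server side handles tool calls and the problem lies between the server and the client; an empty list points at the server configuration.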

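III. Current status
1. Text generation works.
With all 8 H20 cards the model starts normally (with 4 cards it ran out of memory; with 6 cards it failed because the embedding count is not divisible by 6), and conversations through Cherry Studio work normally.
2. After connecting my deployed model to Claude Code, it cannot invoke any tools.
My goal is to connect my GLM-4.7 to Claude Code the same way the official hosted GLM-4.7 is connected. With Zhipu's official hosted model the configuration works without any issues and is as smooth to use as Claude itself. I followed this guide: https://mp.weixin.qq.com/s/pEPweAhpzZb1ef5oMvaDDA, first using npx @z_ai/coding-helper to connect the hosted model to Claude,
and then modified settings.json as follows:

As soon as I configure my local model in Claude, the following problem appears: it only generates text, never calls any tools, and I see no obvious errors in the logs.

This is Claude's log file (less .claude/debug/latest):

This is the vLLM container log: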

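Later, someone online recommended putting claude code router in between as a forwarding proxy. I spent a long time configuring it and still hit the same problem. I then tried deploying GLM-4.7 with SGLang instead, which was even worse: even text generation had problems.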
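A forwarding proxy only helps if Claude Code sends its requests to the proxy rather than directly to vLLM. This is a minimal sketch that merges an `env` block into Claude Code's settings file, using the `ANTHROPIC_BASE_URL`, `ANTHROPIC_AUTH_TOKEN`, and `ANTHROPIC_MODEL` variables. The proxy address `http://127.0.0.1:3456` (assumed to be where claude code router listens by default), the token, and whether the proxy honors the model name are all assumptions to adjust for your own setup; `point_claude_code_at` is a hypothetical helper.

```python
import json
from pathlib import Path


def point_claude_code_at(settings_path: Path, base_url: str, token: str, model: str) -> dict:
    """Merge proxy settings into settings.json's 'env' block and save it, keeping other keys."""
    settings = json.loads(settings_path.read_text()) if settings_path.exists() else {}
    env = settings.setdefault("env", {})
    env["ANTHROPIC_BASE_URL"] = base_url  # Anthropic-style endpoint (the proxy), not vLLM's /v1
    env["ANTHROPIC_AUTH_TOKEN"] = token   # credential the proxy expects
    env["ANTHROPIC_MODEL"] = model        # assumption: routing may instead be set in the proxy's config
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2))
    return settings
```

For example, `point_claude_code_at(Path.home() / ".claude" / "settings.json", "http://127.0.0.1:3456", "placeholder-token", "GLM-4.7")` updates the user-level settings file while leaving unrelated keys in place.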
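IV. Request for help
Has anyone completed this full integration? If you have run into similar problems, I would be grateful for any advice.
Here are my guesses at possible causes:
1. GLM-4.7 was open-sourced less than a month ago, so there are very few reports of successful deployments, and the vLLM image I am using is the unstable nightly build. Could there be a bug?
2. The official hosted model is integrated into Claude with a configuration tool, and I don't know what that tool actually changes. I only modified the settings in settings.json. Is that not enough? What else needs to change for Claude to work with my open-source GLM-4.7?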

wiesencao changed discussion status to closed Jan 15

scalaview

Jan 18

How did you solve it? Could you share?

wiesencao

Jan 19

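My problem is now solved. The cause: models deployed with vLLM or SGLang expose an OpenAI-style API by default, but for tool calling Claude Code only accepts the Anthropic-style API. So an API translation layer has to sit between Claude and vLLM. For personal use you can choose claude code router (my initial configuration of it was wrong); if you want a unified enterprise AI gateway that manages all models centrally, you can choose LiteLLM.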
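The kind of translation such a layer performs for tool calling can be sketched in a few functions. This is a minimal illustration of the idea, not how claude code router or LiteLLM are actually implemented: it converts Anthropic-style tool definitions (`input_schema`) into OpenAI-style `function` tools, converts an OpenAI-style `tool_calls` reply into Anthropic `tool_use` content blocks, and converts an Anthropic `tool_result` block into an OpenAI `tool` message. Streaming, system prompts, reasoning content, and error handling are omitted.

```python
import json

# OpenAI finish_reason -> Anthropic stop_reason
STOP_REASONS = {"tool_calls": "tool_use", "length": "max_tokens", "stop": "end_turn"}


def anthropic_tools_to_openai(tools: list[dict]) -> list[dict]:
    """Anthropic {name, description, input_schema} -> OpenAI {type: function, function: {...}}."""
    return [
        {
            "type": "function",
            "function": {
                "name": t["name"],
                "description": t.get("description", ""),
                "parameters": t["input_schema"],
            },
        }
        for t in tools
    ]


def openai_message_to_anthropic(completion: dict) -> dict:
    """Turn an OpenAI chat completion into an Anthropic-style assistant message."""
    choice = completion["choices"][0]
    message = choice["message"]
    blocks = []
    if message.get("content"):
        blocks.append({"type": "text", "text": message["content"]})
    for call in message.get("tool_calls") or []:
        blocks.append({
            "type": "tool_use",
            "id": call["id"],
            "name": call["function"]["name"],
            # OpenAI sends arguments as a JSON string; Anthropic expects an object
            "input": json.loads(call["function"]["arguments"] or "{}"),
        })
    stop_reason = STOP_REASONS.get(choice.get("finish_reason"), "end_turn")
    return {"type": "message", "role": "assistant", "content": blocks, "stop_reason": stop_reason}


def anthropic_tool_result_to_openai(block: dict) -> dict:
    """Anthropic tool_result content block -> OpenAI 'tool' role message."""
    content = block.get("content", "")
    if isinstance(content, list):  # list of content blocks: keep only the text parts
        content = "".join(p.get("text", "") for p in content if p.get("type") == "text")
    return {"role": "tool", "tool_call_id": block["tool_use_id"], "content": content}
```

Reusing the OpenAI call id as the Anthropic `tool_use` id keeps the later `tool_result` (which references `tool_use_id`) matched to the original call when it is translated back.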


wiesencao changed discussion status to open Jan 28
