---
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
base_model:
- MiniMaxAI/MiniMax-M2.5
tags:
- neuralmagic
- redhat
- llmcompressor
- quantized
- INT8
---

# MiniMax-M2.5-quantized.w8a8

## Model Overview
- **Model Architecture:** MiniMaxM2ForCausalLM
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT8
  - **Activation quantization:** INT8
- **Intended Use Cases:**
  - Reasoning.
  - Function calling.
  - Subject matter experts via fine-tuning.
  - Multilingual instruction following.
  - Translation.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- **Release Date:** 05/05/2025
- **Version:** 1.0
- **Model Developers:** Red Hat (Neural Magic)

### Model Optimizations

This model was obtained by quantizing the weights and activations of [MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) to the INT8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
Weight quantization also reduces disk size requirements by approximately 50%.

Only the weights and activations of the linear operators within the transformer blocks are quantized.
Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme.
A combination of the [SmoothQuant](https://arxiv.org/abs/2211.10438) and [GPTQ](https://arxiv.org/abs/2210.17323) algorithms is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
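
For intuition, the snippet below sketches the arithmetic these two schemes imply. It is a minimal illustration in plain PyTorch, not the llm-compressor implementation: one static scale per weight output channel, and one scale per activation token computed on the fly at inference time.

```python
import torch

def quantize_weights_per_channel(w: torch.Tensor):
    # Symmetric static per-channel: one scale per output channel (row),
    # computed once offline from the weight tensor itself.
    scales = (w.abs().amax(dim=1, keepdim=True) / 127.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scales), -127, 127).to(torch.int8)
    return q, scales

def quantize_activations_per_token(x: torch.Tensor):
    # Symmetric dynamic per-token: one scale per token (row), recomputed
    # for every input at inference time.
    scales = (x.abs().amax(dim=-1, keepdim=True) / 127.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(x / scales), -127, 127).to(torch.int8)
    return q, scales

w = torch.randn(4096, 4096)   # (out_features, in_features)
x = torch.randn(8, 4096)      # a batch of 8 tokens
qw, w_scales = quantize_weights_per_channel(w)
qx, x_scales = quantize_activations_per_token(x)

# Real INT8 kernels accumulate in int32; the scales are re-applied afterwards.
y = (qx.to(torch.int32) @ qw.t().to(torch.int32)).float() * x_scales * w_scales.t()
print((y - x @ w.t()).abs().max())  # small quantization error vs. the FP reference
```
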
## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/MiniMax-M2.5-quantized.w8a8"
number_gpus = 1

sampling_params = SamplingParams(temperature=1.0, top_p=0.95, top_k=40, min_p=0, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving; see the [documentation](https://docs.vllm.ai/en/latest/) for more details.
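
As a minimal sketch (the port is vLLM's default and the sampling values below are illustrative, not requirements), the endpoint started by `vllm serve RedHatAI/MiniMax-M2.5-quantized.w8a8` can be queried with the standard `openai` client:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on port 8000 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/MiniMax-M2.5-quantized.w8a8",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=1.0,
    top_p=0.95,
    max_tokens=256,
)
print(response.choices[0].message.content)
```
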
## Creation

<details>
<summary>Creation details</summary>

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

MODEL_ID = "inference-optimization/MiniMax-M2.5-BF16"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(MODEL_ID)

NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load the calibration dataset.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)

# Preprocess the data into the format the model is trained with.
def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

ds = ds.map(preprocess)

# Tokenize the data (be careful with bos tokens: use add_special_tokens=False
# since the chat template already adds one).
def tokenize(sample):
    return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)

ds = ds.map(tokenize, remove_columns=ds.column_names)

# Configure the quantization algorithm to run.
recipe = GPTQModifier(
    scheme="W8A8",
    weight_observer="mse",
    targets=[
        r"re:.*block_sparse_moe\.experts\.\d+\.w[1-3]$",
        r"re:.*mlp\.experts\.\d+\.(gate|up|gate_up|down)_proj$",
    ],
    ignore=["re:.*self_attn.*", "lm_head"],
)

# Apply quantization.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    processor=processor,
)

# Save the compressed model to disk.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + ".w8a8"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

</details>

## Evaluation

The model was evaluated on the `ifeval`, `mmlu_pro`, and `gsm8k_platinum` tasks using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), and on reasoning tasks using [lighteval](https://github.com/neuralmagic/lighteval/tree/reasoning).
[vLLM](https://docs.vllm.ai/en/stable/) was used for all evaluations.

<details>
<summary>Evaluation details</summary>

Deploy using vLLM to create an OpenAI-compatible API endpoint:

```shell
vllm serve RedHatAI/MiniMax-M2.5.w8a8 --max-model-len 262144 --reasoning-parser deepseek_r1
```

**lm-evaluation-harness**

```shell
lm_eval --model local-chat-completions \
  --tasks mmlu_pro_chat \
  --model_args "model=RedHatAI/MiniMax-M2.5.w8a8,max_length=262144,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=64,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=1200" \
  --num_fewshot 0 \
  --apply_chat_template \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=40,min_p=0.0,max_gen_toks=64000"
```

```shell
lm_eval --model local-chat-completions \
  --tasks ifeval \
  --model_args "model=RedHatAI/MiniMax-M2.5.w8a8,max_length=262144,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=64,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=1200" \
  --num_fewshot 0 \
  --apply_chat_template \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=40,min_p=0.0,max_gen_toks=64000"
```

```shell
lm_eval --model local-chat-completions \
  --tasks gsm8k_platinum_cot_llama \
  --model_args "model=RedHatAI/MiniMax-M2.5.w8a8,max_length=262144,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=64,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=1200" \
  --num_fewshot 0 \
  --apply_chat_template \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=40,min_p=0.0,max_gen_toks=64000"
```

**lighteval**

`lighteval_model_arguments.yaml`:

```yaml
model_parameters:
  model_name: RedHatAI/MiniMax-M2.5.w8a8
  dtype: auto
  gpu_memory_utilization: 0.9
  max_model_length: 40960
  generation_parameters:
    temperature: 1.0
    top_k: 40
    min_p: 0.0
    top_p: 0.95
    max_new_tokens: 64000
```

```shell
lighteval endpoint litellm lighteval_model_arguments.yaml \
  "aime25|0,math_500|0,gpqa:diamond|0"
```

</details>

### Accuracy

| Benchmark | inference-optimization/MiniMax-M2.5-BF16 | inference-optimization/MiniMax-M2.5.w8a8 | Recovery (%) |
|-----------|------------------------------------------|-------------------------------------------|--------------|
| GSM8K Platinum (0-shot) | 95.15 | 96.36 | 101.27 |
| IFEval (0-shot) | 88.17 | 85.58 | 97.06 |
| AIME 2025 | 87.50 | 84.17 | 96.19 |
| GPQA Diamond | 83.67 | 84.51 | 101.01 |
| MATH-500 | 87.33 | 87.60 | 100.31 |
| MMLU-Pro (chat) | 80.83 | 81.25 | 100.51 |
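
Recovery is computed as 100 × (w8a8 score ÷ BF16 score); for GSM8K Platinum, for example, 100 × 96.36 / 95.15 ≈ 101.27%.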