ChibuUkachi committed on
Commit
14e25cb
·
verified ·
1 Parent(s): 57bc6f0

Create README.md

  1. README.md +217 -0
README.md ADDED
@@ -0,0 +1,217 @@
1
+ ---
2
+ library_name: transformers
3
+ license: apache-2.0
4
+ pipeline_tag: text-generation
5
+ base_model:
6
+ - MiniMaxAI/MiniMax-M2.5
7
+ tags:
8
+ - neuralmagic
9
+ - redhat
10
+ - llmcompressor
11
+ - quantized
12
+ - INT4
13
+ ---
14
+
15
+ # MiniMax-M2.5-quantized.w4a16
16
+
17
+ ## Model Overview
18
+ - **Model Architecture:** MiniMaxM2ForCausalLM
19
+ - **Input:** Text
20
+ - **Output:** Text
21
+ - **Model Optimizations:**
22
+   - **Weight quantization:** INT4
22
+ - **Intended Use Cases:**
23
+   - Reasoning.
24
+   - Function calling.
25
+   - Subject matter experts via fine-tuning.
26
+   - Multilingual instruction following.
27
+   - Translation.
29
+ - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws).
30
+ - **Release Date:** 05/05/2025
31
+ - **Version:** 1.0
32
+ - **Model Developers:** Red Hat (Neural Magic)
33
+
34
+ ### Model Optimizations
35
+
36
+ This model was obtained by quantizing the weights of [MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) to the INT4 data type.
37
+ This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
38
+
39
+ Only the weights of the linear operators within transformer blocks are quantized.
39
+ Weights are quantized using an asymmetric per-group scheme, with group size 64.
41
+ The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
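To make the per-group scheme concrete, here is a minimal sketch of asymmetric INT4 quantization with one scale and one zero point per group of 64 weights. It uses plain round-to-nearest, which is an assumption made for illustration: GPTQ additionally adjusts the remaining weights to compensate for rounding error, and the helper names (`quantize_group`, `quantize_per_group`, `dequantize`) are hypothetical and not part of llm-compressor.

```python
import math
from typing import List, Tuple

# One quantized group: (integer codes, scale, zero point).
Group = Tuple[List[int], float, int]


def quantize_group(values: List[float], num_bits: int = 4) -> Group:
    """Asymmetric round-to-nearest quantization of a single group."""
    qmax = 2 ** num_bits - 1  # 15 for INT4, codes live in [0, 15]
    # Extend the range to include 0.0 so the zero point is a valid code.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    codes = [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def quantize_per_group(row: List[float], group_size: int = 64) -> List[Group]:
    """Split a weight row into consecutive groups and quantize each one."""
    return [
        quantize_group(row[i:i + group_size])
        for i in range(0, len(row), group_size)
    ]


def dequantize(groups: List[Group]) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [(c - zp) * scale for codes, scale, zp in groups for c in codes]


# Illustrative data only: a synthetic row of 128 weights (two groups of 64).
row = [0.1 * math.sin(i) for i in range(128)]
groups = quantize_per_group(row, group_size=64)
restored = dequantize(groups)
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(f"{len(groups)} groups, max abs error {max_error:.4f}")
```

Because each group carries its own scale and zero point, smaller groups track the local weight range more closely at the cost of storing more metadata.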
42
+
43
+
44
+ ## Deployment
45
+
46
+ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
47
+
48
+ ```python
49
+ from vllm import LLM, SamplingParams
50
+ from transformers import AutoTokenizer
51
+
52
+ model_id = "RedHatAI/MiniMax-M2.5-quantized.w4a16"
53
+ number_gpus = 1
54
+ sampling_params = SamplingParams(temperature=1.0, top_p=0.95, top_k=40, min_p=0, max_tokens=256)
55
+
56
+ # Example user prompt.
57
+ prompt = "Give me a short introduction to large language models."
58
+ messages = [{"role": "user", "content": prompt}]
59
+
60
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
61
+
62
+ # Render the conversation with the model's chat template (as text, not token IDs).
63
+
64
+ prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
65
+
66
+ llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
67
+
68
+ outputs = llm.generate(prompts, sampling_params)
69
+
70
+ generated_text = outputs[0].outputs[0].text
71
+ print(generated_text)
72
+ ```
73
+
74
+ vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
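As a rough illustration of OpenAI-compatible serving, the sketch below builds a chat-completions request using only the Python standard library. It assumes a server was started separately with `vllm serve` and is reachable at `http://localhost:8000`; the host, port, and the helper names `build_payload` and `chat` are assumptions for this example, not part of the model card.

```python
import json
import urllib.request

# Assumed endpoint of a locally running `vllm serve` process.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL_ID = "RedHatAI/MiniMax-M2.5-quantized.w4a16"


def build_payload(prompt: str, max_tokens: int = 256) -> dict:
    """Build a chat-completions request body for a single user turn."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }


def chat(prompt: str, url: str = BASE_URL) -> str:
    """Send the request and return the generated text (needs a live server)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Building the payload does not require a running server.
payload = build_payload("Give me a short introduction to large language models.")
print(json.dumps(payload, indent=2))
```

Calling `chat(...)` only works once the server is up; the payload itself can be inspected offline.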
75
+
76
+ ## Creation
77
+
78
+ <details>
79
+ <summary>Creation details</summary>
80
+ This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.
81
+
82
+
83
+ ```python
84
+ from datasets import load_dataset
85
+ from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
86
+ from llmcompressor import oneshot
87
+ from llmcompressor.modifiers.quantization import GPTQModifier
88
+
89
+ MODEL_ID = "inference-optimization/MiniMax-M2.5-BF16"
90
+
91
+ # Load model.
92
+ model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", trust_remote_code=True)
93
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
94
+ processor = AutoProcessor.from_pretrained(MODEL_ID)
95
+
96
+
97
+ NUM_CALIBRATION_SAMPLES = 512
98
+ MAX_SEQUENCE_LENGTH = 2048
99
+
100
+ # Load dataset.
101
+ ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]", trust_remote_code=True)
102
+ ds = ds.shuffle(seed=42)
103
+
104
+ # Preprocess the data into the format the model is trained with.
105
+ def preprocess(example):
106
+     return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
107
+
108
+ ds = ds.map(preprocess)
109
+
110
+ # Tokenize the data (be careful with bos tokens - we need add_special_tokens=False since the chat_template already added it).
111
+ def tokenize(sample):
112
+     return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
113
+ ds = ds.map(tokenize, remove_columns=ds.column_names)
114
+
115
+ # Configure the quantization algorithm to run.
116
+ recipe = GPTQModifier(scheme="W4A16", weight_observer="mse", targets=[r"re:.*block_sparse_moe\.experts\.\d+\.w[1-3]$", r"re:.*mlp\.experts\.\d+\.(gate|up|gate_up|down)_proj$"], ignore=["re:.*self_attn.*", "lm_head"])
117
+
118
+
119
+ # Apply quantization.
120
+ oneshot(
121
+     model=model, dataset=ds,
121
+     recipe=recipe,
122
+     max_seq_length=MAX_SEQUENCE_LENGTH,
123
+     processor=processor,
124
+     num_calibration_samples=NUM_CALIBRATION_SAMPLES,
126
+ )
127
+
128
+ # Save to disk compressed.
129
+ SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + ".w4a16"
130
+ model.save_pretrained(SAVE_DIR, save_compressed=True)
131
+ tokenizer.save_pretrained(SAVE_DIR)
132
+ ```
133
+ </details>
134
+
135
+
136
+
137
+
138
+ ## Evaluation
139
+
140
+ The model was evaluated on ifeval, mmlu_pro, and gsm8k_platinum using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), and on the reasoning tasks aime25, math_500, and gpqa:diamond using [lighteval](https://github.com/neuralmagic/lighteval/tree/reasoning).
141
+ [vLLM](https://docs.vllm.ai/en/stable/) was used for all evaluations.
142
+
143
+
144
+ <details>
145
+ <summary>Evaluation details</summary>
146
+
147
+ Deploy the model with vLLM to create an OpenAI-compatible API endpoint:
148
+
149
+ - vLLM:
150
+ ```shell
151
+ vllm serve RedHatAI/MiniMax-M2.5.w4a16 --max-model-len 262144 --reasoning-parser deepseek_r1
152
+ ```
153
+
154
+ **lm-evaluation-harness**
155
+ ```
156
+ lm_eval --model local-chat-completions \
157
+ --tasks mmlu_pro_chat \
158
+ --model_args "model=RedHatAI/MiniMax-M2.5.w4a16,max_length=262144,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=64,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=1200" \
159
+ --num_fewshot 0 \
160
+ --apply_chat_template \
161
+ --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=40,min_p=0.0,max_gen_toks=64000"
162
+ ```
163
+
164
+ ```
165
+ lm_eval --model local-chat-completions \
166
+ --tasks ifeval \
167
+ --model_args "model=RedHatAI/MiniMax-M2.5.w4a16,max_length=262144,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=64,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=1200" \
168
+ --num_fewshot 0 \
169
+ --apply_chat_template \
170
+ --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=40,min_p=0.0,max_gen_toks=64000"
171
+ ```
172
+
173
+ ```
174
+ lm_eval --model local-chat-completions \
175
+ --tasks gsm8k_platinum_cot_llama \
176
+ --model_args "model=RedHatAI/MiniMax-M2.5.w4a16,max_length=262144,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=64,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=1200" \
177
+ --num_fewshot 0 \
178
+ --apply_chat_template \
179
+ --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=40,min_p=0.0,max_gen_toks=64000"
180
+ ```
181
+
182
+ **lighteval**
183
+
184
+ lighteval_model_arguments.yaml
185
+ ```yaml
186
+ model_parameters:
187
+   model_name: RedHatAI/MiniMax-M2.5.w4a16
187
+   dtype: auto
188
+   gpu_memory_utilization: 0.9
189
+   max_model_length: 40960
190
+   generation_parameters:
191
+     temperature: 1.0
192
+     top_k: 40
193
+     min_p: 0.0
194
+     top_p: 0.95
195
+     max_new_tokens: 64000
197
+ ```
198
+
199
+ ```
200
+ lighteval endpoint litellm lighteval_model_arguments.yaml \
201
+ "aime25|0,math_500|0,gpqa:diamond|0"
202
+ ```
203
+
204
+
205
+ </details>
206
+
207
+ ### Accuracy
208
+
209
+
210
+ | Benchmark | inference-optimization/MiniMax-M2.5-BF16 | inference-optimization/MiniMax-M2.5.w4a16 | Recovery (%) |
211
+ |-----------|------------------------------------------|-------------------------------------------|--------------|
212
+ | GSM8K Platinum (0-shot) | 95.15 | 96.36 | 101.27 |
212
+ | IFEval (0-shot) | 88.17 | 85.58 | 97.06 |
213
+ | AIME 2025 | 87.50 | 84.17 | 96.19 |
214
+ | GPQA Diamond | 83.67 | 84.51 | 101.01 |
215
+ | MATH-500 | 87.33 | 87.60 | 100.31 |
216
+ | MMLU-Pro Chat | 80.83 | 81.25 | 100.51 |
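The Recovery column appears to be the quantized score expressed as a percentage of the BF16 baseline score. The sketch below assumes that definition (the card does not state the formula) and applies it to a few rows of the table; the `recovery` helper is hypothetical. Because the table may have been computed from unrounded scores, a recomputed value can differ from the published one in the last digit.

```python
def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return round(100.0 * quantized / baseline, 2)


# (baseline BF16 score, quantized w4a16 score) copied from the table.
scores = {
    "gsm8k_platinum": (95.15, 96.36),
    "ifeval": (88.17, 85.58),
    "aime25": (87.50, 84.17),
}

for task, (baseline, quantized) in scores.items():
    print(f"{task}: {recovery(baseline, quantized):.2f}%")
```

A value above 100 means the quantized model scored higher than the baseline on that benchmark, which with sampled decoding can be within run-to-run variation.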