Instructions for using TheBlokeAI/Mixtral-tiny-GPTQ with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use TheBlokeAI/Mixtral-tiny-GPTQ with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="TheBlokeAI/Mixtral-tiny-GPTQ")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("TheBlokeAI/Mixtral-tiny-GPTQ")
model = AutoModelForCausalLM.from_pretrained("TheBlokeAI/Mixtral-tiny-GPTQ")
```
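The snippets above only load the model; here is a minimal generation sketch using the same pipeline. The prompt and sampling arguments are illustrative, and loading a GPTQ checkpoint through Transformers additionally needs the optimum and auto-gptq packages installed:

```python
from transformers import pipeline

# GPTQ checkpoints require the optimum and auto-gptq packages as well
pipe = pipeline("text-generation", model="TheBlokeAI/Mixtral-tiny-GPTQ")

# Illustrative settings; tune max_new_tokens and temperature as needed
output = pipe("Once upon a time,", max_new_tokens=64, do_sample=True, temperature=0.7)
print(output[0]["generated_text"])
```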
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use TheBlokeAI/Mixtral-tiny-GPTQ with vLLM:
Install from pip and serve the model:
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "TheBlokeAI/Mixtral-tiny-GPTQ"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "TheBlokeAI/Mixtral-tiny-GPTQ",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```
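Because the server exposes an OpenAI-compatible API, the same completion can be requested from Python. A sketch assuming the openai package (v1+) is installed and the server is running on the default port:

```python
from openai import OpenAI

# Point the client at the local vLLM server; the API key is unused but required
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="TheBlokeAI/Mixtral-tiny-GPTQ",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(completion.choices[0].text)
```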
- SGLang
How to use TheBlokeAI/Mixtral-tiny-GPTQ with SGLang:
Install from pip and serve the model:
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "TheBlokeAI/Mixtral-tiny-GPTQ" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "TheBlokeAI/Mixtral-tiny-GPTQ",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```
Use Docker images
```shell
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "TheBlokeAI/Mixtral-tiny-GPTQ" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "TheBlokeAI/Mixtral-tiny-GPTQ",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```
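The SGLang endpoint speaks the same OpenAI-compatible protocol, so the curl call above translates directly to Python. A sketch assuming the requests package and a server on port 30000:

```python
import requests

# Mirror the curl request against the OpenAI-compatible SGLang endpoint
response = requests.post(
    "http://localhost:30000/v1/completions",
    json={
        "model": "TheBlokeAI/Mixtral-tiny-GPTQ",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5,
    },
)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```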
- Docker Model Runner
How to use TheBlokeAI/Mixtral-tiny-GPTQ with Docker Model Runner:
```shell
docker model run hf.co/TheBlokeAI/Mixtral-tiny-GPTQ
```
It seems like the GPTQ versions are broken.
For the bigger models I get:
```
RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 1, 0] because the unspecified dimension size -1 can be any value and is ambiguous in self.gate...
```
For this test one I get:
```
...
  File "/home/nepe/.local/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 708, in forward
    router_logits = self.gate(hidden_states)
  File "/home/nepe/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/nepe/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nepe/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/nepe/.local/lib/python3.10/site-packages/auto_gptq/nn_modules/qlinear/qlinear_cuda.py", line 227, in forward
    zeros = zeros.reshape(self.scales.shape)
RuntimeError: shape '[8, 8]' is invalid for input of size 0
```
The non-GPTQ version of the test model works perfectly.
Yeah, see the READMEs of the proper GPTQs for how to load them - you still need an AutoGPTQ PR at the moment
I tried both the old and the fix branches, same error. I even tried to quantize this model myself, same error.
As far as I understand, there are still a few more things to do. Based on this:
https://github.com/PanQiWei/AutoGPTQ/pull/480
you have to apply this:
https://github.com/huggingface/transformers/pull/27956
and maybe this one too:
https://github.com/huggingface/optimum/pull/1585
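For anyone wanting to try those unmerged branches, one possible route is installing straight from the pull-request refs. This is only a sketch, assuming pip can fetch GitHub's read-only refs/pull/&lt;N&gt;/head refs; the PR numbers are the ones linked above:

```shell
# Sketch only: install the linked PR branches directly from GitHub.
# refs/pull/<N>/head is GitHub's read-only ref for a pull request.
pip install "git+https://github.com/PanQiWei/AutoGPTQ.git@refs/pull/480/head"
pip install "git+https://github.com/huggingface/transformers.git@refs/pull/27956/head"
pip install "git+https://github.com/huggingface/optimum.git@refs/pull/1585/head"
```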
My mistake: I tried it with `AutoModelForCausalLM.from_pretrained` instead of `AutoGPTQForCausalLM.from_quantized`.
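For reference, a minimal loading sketch along those lines, assuming auto_gptq is installed, a single CUDA GPU, and safetensors weights in the repo (exact arguments may vary by AutoGPTQ version):

```python
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_id = "TheBlokeAI/Mixtral-tiny-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# from_quantized, not from_pretrained, is the AutoGPTQ entry point
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",       # assumes a single CUDA GPU
    use_safetensors=True,  # assumes safetensors weights in the repo
)

inputs = tokenizer("Once upon a time,", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```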