Clarification on the way the tokenizer should be used
For more detail, see this post: https://github.com/huggingface/transformers/issues/31513#issuecomment-2393320476
In essence, if you use the snippet from the model card, this is what you get:
In [10]: import torch
...: from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
...:
...: tokenizer = AutoTokenizer.from_pretrained("utter-project/EuroLLM-1.7B-Instruct", padding_side='left')
...: prompt = f"<|im_start|>user\nTranslate the following text from English into German.\nEnglish: Hello world\nGerman:<|im_end|>\n<|im_start|>assistant\n"
...: input_ids = tokenizer(prompt, return_tensors="pt", padding=True, max_length=256, truncation=True).input_ids.cuda()
...: print(input_ids)
...: print(prompt)
...: outputs = tokenizer.batch_decode(input_ids, skip_special_tokens=False)
...: print(outputs)
tensor([[ 1, 3, 15236, 271, 31702, 31817, 557, 5302, 6001,
1061, 6771, 2023, 5256, 119735, 271, 31601, 119782, 97849,
4437, 271, 60457, 119782, 4, 119715, 271, 3, 58406,
271]], device='cuda:0')
<|im_start|>user
Translate the following text from English into German.
English: Hello world
German:<|im_end|>
<|im_start|>assistant
['<s><|im_start|> user\nTranslate the following text from English into German.\nEnglish: Hello world\nGerman:<|im_end|> \n<|im_start|> assistant\n']
It is hard to see, but spaces are added before user, before assistant, and between <|im_end|> and "\n".
The tokenizer also uses specific tokens for user (15236) and assistant (58406).
Now if you add the following flag to the tokenizer:
In [11]: import torch
...: from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
...:
...: tokenizer = AutoTokenizer.from_pretrained("utter-project/EuroLLM-1.7B-Instruct", padding_side='left', add_prefix_space=False)
...: prompt = f"<|im_start|>user\nTranslate the following text from English into German.\nEnglish: Hello world\nGerman:<|im_end|>\n<|im_start|>assistant\n"
...: input_ids = tokenizer(prompt, return_tensors="pt", padding=True, max_length=256, truncation=True).input_ids.cuda()
...: print(input_ids)
...: print(prompt)
...: outputs = tokenizer.batch_decode(input_ids, skip_special_tokens=False)
...: print(outputs)
tensor([[ 1, 3, 13676, 271, 31702, 31817, 557, 5302, 6001,
1061, 6771, 2023, 5256, 119735, 271, 31601, 119782, 97849,
4437, 271, 60457, 119782, 4, 271, 3, 788, 35441,
271]], device='cuda:0')
<|im_start|>user
Translate the following text from English into German.
English: Hello world
German:<|im_end|>
<|im_start|>assistant
['<s><|im_start|>user\nTranslate the following text from English into German.\nEnglish: Hello world\nGerman:<|im_end|>\n<|im_start|>assistant\n']
You can see the spaces are no longer there and tokens are not the same.
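To pin down exactly where the two encodings diverge, the two ID sequences printed above can be compared directly. This is only a minimal sketch using Python's standard difflib module: the IDs are copied from the two outputs above, and the names WITH_PREFIX_SPACE, WITHOUT_PREFIX_SPACE, and diff_ids are made up for illustration.

```python
from difflib import SequenceMatcher

# Token IDs copied from the two outputs above:
# default tokenizer settings vs. add_prefix_space=False.
WITH_PREFIX_SPACE = [
    1, 3, 15236, 271, 31702, 31817, 557, 5302, 6001,
    1061, 6771, 2023, 5256, 119735, 271, 31601, 119782, 97849,
    4437, 271, 60457, 119782, 4, 119715, 271, 3, 58406,
    271,
]
WITHOUT_PREFIX_SPACE = [
    1, 3, 13676, 271, 31702, 31817, 557, 5302, 6001,
    1061, 6771, 2023, 5256, 119735, 271, 31601, 119782, 97849,
    4437, 271, 60457, 119782, 4, 271, 3, 788, 35441,
    271,
]


def diff_ids(a, b):
    """Return (tag, ids_from_a, ids_from_b) for every span where a and b differ."""
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


for tag, left, right in diff_ids(WITH_PREFIX_SPACE, WITHOUT_PREFIX_SPACE):
    print(f"{tag:8} default={left}  add_prefix_space=False={right}")
```

On these two sequences the differences are confined to the user token, the extra token after <|im_end|>, and the assistant token. Everything else is identical.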
So the question is: which tokenizer did you use at training? If it was the HF one, can you please specify the token IDs and flags, so that we can use the same ones at inference?
Thanks
Vincent
BTW: the issue is similar with Tower.
Hi Vincent,
Thanks for pointing out this issue. We are aware of this and we may find an alternative solution in future models (same for Tower). For now, please use the tokenizer as in the first snippet. For example, these are the tokens the model sees during training for "<|im_start|>user\n": [3, 15236, 271]. This is a strange issue indeed because [tokenizer.decode([num]) for num in [3, 15236, 271]] yields ['<|im_start|>', 'user', '\n'].
We may experiment with training future iterations with the add_prefix_space option set to False.
P.S.: Please note that, for best results, EuroLLM requires adding a system placeholder even if the system message is "".
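As an illustration of the system placeholder, here is a minimal sketch of a prompt builder that always emits a system turn, even when the system message is empty. It assumes the system turn uses the same <|im_start|>role\n...<|im_end|>\n layout as the user turn in the prompts above. build_prompt is a hypothetical helper, not part of transformers.

```python
def build_prompt(user_message: str, system_message: str = "") -> str:
    """Build a ChatML-style prompt that always includes a system turn.

    Assumption: the system turn follows the same layout as the user turn.
    """
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


prompt = build_prompt(
    "Translate the following text from English into German.\n"
    "English: Hello world\n"
    "German:"
)
print(prompt)
```

The resulting string would then be tokenized with the default tokenizer settings, as recommended above.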
You also realize that the first snippet produces the sequence [4, 119715, 271], with a useless character between <|im_end|> and \n?
Yes, indeed. That was also seen during training. This character is usually triggered before numbers and \n.
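The two training-time sequences mentioned in this thread, [3, 15236, 271] for "<|im_start|>user\n" and [4, 119715, 271] after <|im_end|>, can serve as a quick sanity check on whatever tokenizer configuration is used at inference. The following is a minimal sketch in plain Python: find_subsequence is a hypothetical helper, and the ID list is copied from the first output in this thread.

```python
def find_subsequence(ids, pattern):
    """Return the first index at which pattern occurs in ids, or -1 if absent."""
    n, m = len(ids), len(pattern)
    for start in range(n - m + 1):
        if ids[start:start + m] == pattern:
            return start
    return -1


# IDs produced by the default tokenizer settings (first output in this thread).
DEFAULT_IDS = [
    1, 3, 15236, 271, 31702, 31817, 557, 5302, 6001,
    1061, 6771, 2023, 5256, 119735, 271, 31601, 119782, 97849,
    4437, 271, 60457, 119782, 4, 119715, 271, 3, 58406,
    271,
]

# Sequences reported as seen during training.
USER_TURN = [3, 15236, 271]      # "<|im_start|>user\n"
END_OF_TURN = [4, 119715, 271]   # "<|im_end|>", extra character, "\n"

for name, pattern in [("user turn", USER_TURN), ("end of turn", END_OF_TURN)]:
    position = find_subsequence(DEFAULT_IDS, pattern)
    print(f"{name}: {'found at index ' + str(position) if position >= 0 else 'missing'}")
```

To check a live tokenizer instead of the copied list, pass tokenizer(prompt).input_ids as ids. A result of -1 for either pattern would suggest that the settings differ from those used in training.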