Clarification on the way the tokenizer should be used

by vince62s - opened Oct 4, 2024

Discussion

vince62s

Oct 4, 2024

For more detail you can read this post: https://github.com/huggingface/transformers/issues/31513#issuecomment-2393320476

In essence, if you use the snippet from the model card, you get this:

In [10]: import torch
    ...: from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
    ...: 
    ...: tokenizer = AutoTokenizer.from_pretrained("utter-project/EuroLLM-1.7B-Instruct", padding_side='left')
    ...: prompt = f"<|im_start|>user\nTranslate the following text from English into German.\nEnglish: Hello world\nGerman:<|im_end|>\n<|im_start|>assistant\n"
    ...: input_ids = tokenizer(prompt, return_tensors="pt", padding=True, max_length=256, truncation=True).input_ids.cuda()
    ...: print(input_ids)
    ...: print(prompt)
    ...: outputs = tokenizer.batch_decode(input_ids, skip_special_tokens=False)
    ...: print(outputs)
tensor([[     1,      3,  15236,    271,  31702,  31817,    557,   5302,   6001,
           1061,   6771,   2023,   5256, 119735,    271,  31601, 119782,  97849,
           4437,    271,  60457, 119782,      4, 119715,    271,      3,  58406,
            271]], device='cuda:0')
<|im_start|>user
Translate the following text from English into German.
English: Hello world
German:<|im_end|>
<|im_start|>assistant

['<s><|im_start|> user\nTranslate the following text from English into German.\nEnglish: Hello world\nGerman:<|im_end|> \n<|im_start|> assistant\n']

It is hard to see, but spaces are added before user and assistant, and between <|im_end|> and "\n".
It also uses specific tokens for user (15236) and assistant (58406).

Now if you add the following flag to the tokenizer:

In [11]: import torch
    ...: from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
    ...: 
    ...: tokenizer = AutoTokenizer.from_pretrained("utter-project/EuroLLM-1.7B-Instruct", padding_side='left', add_prefix_space=False)
    ...: prompt = f"<|im_start|>user\nTranslate the following text from English into German.\nEnglish: Hello world\nGerman:<|im_end|>\n<|im_start|>assistant\n"
    ...: input_ids = tokenizer(prompt, return_tensors="pt", padding=True, max_length=256, truncation=True).input_ids.cuda()
    ...: print(input_ids)
    ...: print(prompt)
    ...: outputs = tokenizer.batch_decode(input_ids, skip_special_tokens=False)
    ...: print(outputs)
tensor([[     1,      3,  13676,    271,  31702,  31817,    557,   5302,   6001,
           1061,   6771,   2023,   5256, 119735,    271,  31601, 119782,  97849,
           4437,    271,  60457, 119782,      4,    271,      3,    788,  35441,
            271]], device='cuda:0')
<|im_start|>user
Translate the following text from English into German.
English: Hello world
German:<|im_end|>
<|im_start|>assistant

['<s><|im_start|>user\nTranslate the following text from English into German.\nEnglish: Hello world\nGerman:<|im_end|>\n<|im_start|>assistant\n']

You can see the spaces are no longer there and tokens are not the same.
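To make the difference easier to spot, here is a minimal sketch that diffs the two ID sequences printed above using only the standard library. The IDs are copied from the two outputs; the names `default_ids`, `no_prefix_space_ids` and `diff_token_ids` are made up for this illustration.

```python
from difflib import SequenceMatcher

# Token IDs copied from the two tensors printed above.
default_ids = [1, 3, 15236, 271, 31702, 31817, 557, 5302, 6001,
               1061, 6771, 2023, 5256, 119735, 271, 31601, 119782, 97849,
               4437, 271, 60457, 119782, 4, 119715, 271, 3, 58406, 271]
no_prefix_space_ids = [1, 3, 13676, 271, 31702, 31817, 557, 5302, 6001,
                       1061, 6771, 2023, 5256, 119735, 271, 31601, 119782, 97849,
                       4437, 271, 60457, 119782, 4, 271, 3, 788, 35441, 271]


def diff_token_ids(a, b):
    """Return (op, a_slice, b_slice) for every span where a and b differ."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (op, a[i1:i2], b[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


differences = diff_token_ids(default_ids, no_prefix_space_ids)
for op, left, right in differences:
    print(f"{op:8} default={left}  add_prefix_space=False={right}")
```

The differing spans are the user token (15236 vs 13676), the extra 119715 after <|im_end|>, and the assistant token (58406 vs the pair 788, 35441).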

So the question is: what tokenizer did you use at training? If it was the HF one, can you please specify the token IDs and flags so that we can use the same ones at inference?

Thanks
Vincent

BTW: the issue is similar with Tower.

nunonmg

Oct 4, 2024

Hi Vincent,
Thanks for pointing out this issue. We are aware of this and we may find an alternative solution in future models (same for Tower). For now, please use the tokenizer as in the first snippet. For example, these are the tokens the model sees during training for "<|im_start|>user\n": [3, 15236, 271]. This is a strange issue indeed because [tokenizer.decode([num]) for num in [3, 15236, 271]] yields ['<|im_start|>', 'user', '\n'].
We may experiment with training further iterations with the add_prefix_space option set to False.
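A quick way to confirm that a given tokenizer setup reproduces the training-time IDs quoted above is sketched below. The helper names are hypothetical, and `check_hf_tokenizer` assumes that `add_special_tokens=False` keeps the BOS token (ID 1 in the outputs earlier in the thread) out of the result; it downloads the tokenizer files, so it is not called here.

```python
# IDs reported above as what the model saw for "<|im_start|>user\n".
TRAINING_USER_PREFIX = [3, 15236, 271]


def matches_training_prefix(encode, text="<|im_start|>user\n"):
    """Return True if encode(text) yields the IDs reported for training."""
    return list(encode(text)) == TRAINING_USER_PREFIX


def check_hf_tokenizer(**tokenizer_kwargs):
    """Load the Hugging Face tokenizer (needs network access) and run the check.

    Extra keyword arguments, e.g. add_prefix_space=False, are forwarded to
    AutoTokenizer.from_pretrained so different setups can be compared.
    """
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained(
        "utter-project/EuroLLM-1.7B-Instruct", **tokenizer_kwargs
    )
    # Assumption: add_special_tokens=False prevents the BOS token from being added.
    return matches_training_prefix(
        lambda text: tok(text, add_special_tokens=False).input_ids
    )
```

Calling `check_hf_tokenizer()` and `check_hf_tokenizer(add_prefix_space=False)` side by side should show which setup agrees with the IDs from training.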

P.S.: Please note that, for best results, EuroLLM requires adding a system placeholder even if the system message is "".
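As an illustration of the system placeholder, the sketch below builds a prompt in the same ChatML style as the snippets earlier in the thread, always emitting a system turn. The function name `build_chatml_prompt` is made up, and the exact layout of the empty system turn is an assumption based on the user and assistant turns shown above, not something confirmed in this thread.

```python
def build_chatml_prompt(user_message, system_message=""):
    """Build a ChatML-style prompt that always contains a system turn.

    Assumption: an empty system turn is written as
    "<|im_start|>system\n<|im_end|>\n", mirroring the other turns.
    """
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


prompt = build_chatml_prompt(
    "Translate the following text from English into German.\n"
    "English: Hello world\nGerman:"
)
print(prompt)
```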

vince62s

Oct 4, 2024

Do you also realize that the first snippet triggers the sequence [4, 119715, 271], with this useless character between <|im_end|> and \n?

nunonmg

Oct 4, 2024

Yes, indeed. That was also seen during training. This character is usually triggered before numbers and \n.
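To check whether a tokenized prompt contains this sequence, a small subsequence search is enough. This is only a sketch: `find_subsequence` is a hypothetical helper, and the IDs are the last eight values from the first and second outputs earlier in the thread.

```python
def find_subsequence(ids, pattern):
    """Return every start index at which pattern occurs in ids."""
    n = len(pattern)
    return [i for i in range(len(ids) - n + 1) if ids[i:i + n] == pattern]


# Last eight IDs of the first snippet's output (default tokenizer settings).
default_tail = [60457, 119782, 4, 119715, 271, 3, 58406, 271]
# Last eight IDs of the second snippet's output (add_prefix_space=False).
no_prefix_space_tail = [60457, 119782, 4, 271, 3, 788, 35441, 271]

pattern = [4, 119715, 271]
positions_default = find_subsequence(default_tail, pattern)
positions_no_prefix = find_subsequence(no_prefix_space_tail, pattern)
print(positions_default, positions_no_prefix)
```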

phmartins changed discussion status to closed Oct 18, 2024
