Instructions to use codellama/CodeLlama-70b-Instruct-hf with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use codellama/CodeLlama-70b-Instruct-hf with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="codellama/CodeLlama-70b-Instruct-hf")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-70b-Instruct-hf")
model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-70b-Instruct-hf", device_map="auto")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use codellama/CodeLlama-70b-Instruct-hf with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "codellama/CodeLlama-70b-Instruct-hf"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "codellama/CodeLlama-70b-Instruct-hf",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/codellama/CodeLlama-70b-Instruct-hf

SGLang

How to use codellama/CodeLlama-70b-Instruct-hf with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "codellama/CodeLlama-70b-Instruct-hf" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "codellama/CodeLlama-70b-Instruct-hf",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "codellama/CodeLlama-70b-Instruct-hf" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "codellama/CodeLlama-70b-Instruct-hf",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use codellama/CodeLlama-70b-Instruct-hf with Docker Model Runner:
```
docker model run hf.co/codellama/CodeLlama-70b-Instruct-hf
```

Context length?

by turboderp - opened Jan 29, 2024

Discussion

turboderp

Jan 29, 2024

Is this really 2k seq length? The base 70b seems to be 16k, is there something up with the config?

amgadhasan

Jan 29, 2024

cc @osanseviero

juewang

Jan 29, 2024

Same question here. The blog shows that both the Instruct and Python models are long-context fine-tuned.

michaelfeil

Jan 29, 2024

2048: https://huggingface.co/codellama/CodeLlama-70b-Instruct-hf/blob/5c0e18bec97099ebf50649c002631054e1b9725e/config.json#L13

yard1

Jan 29, 2024

Actually it should be 4096; the config.json seems to be wrong (my guess is that the conversion script needs to be updated). I confirmed this with a Meta engineer, and you can also see it in the reference implementation: https://github.com/facebookresearch/codellama/blob/1af62e1f43db1fa5140fa43cb828465a603a48f3/llama/model.py#L277 (self.params.max_seq_len * 2, where self.params.max_seq_len == 2048).
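
For readers following the max_seq_len * 2 argument, here is a minimal pure-Python sketch of how a rotary position table with that many rows could be precomputed. It is not the code from the linked repository: the function name `precompute_rotary_table` is hypothetical, and `head_dim=128` and `theta=10000.0` are illustrative assumptions rather than values confirmed in this thread. The only point is that the table ends up with 2 * 2048 = 4096 position rows.

```python
import cmath


def precompute_rotary_table(head_dim, num_positions, theta=10000.0):
    """Build a table of unit complex rotations, one row per position.

    head_dim and theta are illustrative assumptions, not confirmed values.
    """
    # One inverse frequency per pair of channels in the attention head.
    inv_freqs = [1.0 / (theta ** (i / head_dim)) for i in range(0, head_dim, 2)]
    # Row `pos` holds exp(1j * pos * freq) for every frequency.
    return [
        [cmath.rect(1.0, pos * freq) for freq in inv_freqs]
        for pos in range(num_positions)
    ]


max_seq_len = 2048  # value quoted in the post above
table = precompute_rotary_table(head_dim=128, num_positions=max_seq_len * 2)
print(len(table))  # number of positions covered by the table
```

Under these assumptions the table covers 4096 positions, which matches the figure given in the post, although the table size alone does not prove what length the model was trained on.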

lmg-anon

Jan 29, 2024

edited Jan 29, 2024

The README says this is a model with 16k context, corroborating turboderp's findings:

Code Llama is an auto-regressive language model that uses an optimized transformer architecture. It was fine-tuned with up to 16k tokens. This variant does not support long context of up to 100k tokens.

Although I guess it could be wrong too.

turboderp

Jan 30, 2024

@yard1 Thanks.

It's a real shame that the instruct and python versions were nerfed like this, but I guess 4096 is a better starting point than 2048 at least. :(
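
If you want to try the 4096 value locally, one option is to patch a copy of the configuration before loading the model. The sketch below works on a plain dict shaped like config.json and assumes the value under discussion is the `max_position_embeddings` field. The helper name `with_context_length` is hypothetical, the sample dict is a stand-in for the real file, and whether 4096 is the right value rests on the discussion above, not on this code.

```python
import json


def with_context_length(config: dict, context_length: int) -> dict:
    """Return a copy of a config.json-style dict with the context length replaced."""
    if context_length <= 0:
        raise ValueError("context_length must be positive")
    patched = dict(config)  # shallow copy so the original dict is left untouched
    patched["max_position_embeddings"] = context_length
    return patched


# Illustrative stand-in for the relevant part of config.json (not the full file).
original = {"model_type": "llama", "max_position_embeddings": 2048}
patched = with_context_length(original, 4096)
print(json.dumps(patched, indent=2))
```

Writing the patched dict back to a local copy of config.json would be one way to apply it; the original dict is left unchanged so the two values can be compared.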

mohdsoci

Jan 30, 2024

4096 for a coding model is painfully small.

viktor-ferenczi

Feb 1, 2024

Without 16k context length it is basically useless as a coding model.

viktor-ferenczi

Feb 1, 2024

I guess we need to wait for the instruct fine-tuned 16k versions created by others. Maybe Phind will make one, we'll see.

iphann

Feb 11, 2024

I guess we need to wait for the instruct fine-tuned 16k versions created by others. Maybe Phind will make one, we'll see.

Go, Phind!

sbnc

Apr 28, 2025

How come all the smaller models of the same series (34B, 13B, 7B) have a context length of 16k, but the largest one only 4k? It doesn't make much sense. Also, all the documentation states that these models were trained on 16k inputs. It looks most like a typo in config.json, which is also strange: how come no one noticed or fixed it? Also, everywhere they say it supports up to 100k context. Is that a theoretical maximum, or what?

michaelfeil

Apr 28, 2025

@sbnc This model is quite outdated now, so who still cares!

sbnc

Apr 28, 2025

I am new to the space, so I was experimenting with different models. But you are probably right...
