Instructions to use stepfun-ai/Step-3.5-Flash-Base-Midtrain with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use stepfun-ai/Step-3.5-Flash-Base-Midtrain with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="stepfun-ai/Step-3.5-Flash-Base-Midtrain", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("stepfun-ai/Step-3.5-Flash-Base-Midtrain", trust_remote_code=True, dtype="auto")

Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use stepfun-ai/Step-3.5-Flash-Base-Midtrain with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "stepfun-ai/Step-3.5-Flash-Base-Midtrain"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "stepfun-ai/Step-3.5-Flash-Base-Midtrain",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/stepfun-ai/Step-3.5-Flash-Base-Midtrain

SGLang

How to use stepfun-ai/Step-3.5-Flash-Base-Midtrain with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "stepfun-ai/Step-3.5-Flash-Base-Midtrain" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "stepfun-ai/Step-3.5-Flash-Base-Midtrain",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "stepfun-ai/Step-3.5-Flash-Base-Midtrain" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "stepfun-ai/Step-3.5-Flash-Base-Midtrain",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use stepfun-ai/Step-3.5-Flash-Base-Midtrain with Docker Model Runner:
```
docker model run hf.co/stepfun-ai/Step-3.5-Flash-Base-Midtrain
```

Question about benchmark results

by tarruda - opened Mar 3

Discussion

tarruda

Mar 3

Are the benchmarks in the README for the final 3.5 flash?

The reason I ask is that I ran some benchmarks on a IQ4_XS quant (https://huggingface.co/AesSedai/Step-3.5-Flash-GGUF) of Step 3.5 Flash and in some benchmarks seem to have gotten better results than what you published.

If these benchmarks are just for midtrain, did you publish the final benchmarks anywhere? I'd love to compare with the quantized version.

|                 Tasks                 |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|---------------------------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmlu                                   |      2|none  |      |acc   |   |0.8238|±  |0.0031|
| - humanities                          |      2|none  |     0|acc   |↑  |0.7543|±  |0.0060|
|  - formal_logic                       |      1|none  |     0|acc   |↑  |0.7063|±  |0.0407|
|  - high_school_european_history       |      1|none  |     0|acc   |↑  |0.8788|±  |0.0255|
|  - high_school_us_history             |      1|none  |     0|acc   |↑  |0.9314|±  |0.0177|
|  - high_school_world_history          |      1|none  |     0|acc   |↑  |0.9283|±  |0.0168|
|  - international_law                  |      1|none  |     0|acc   |↑  |0.9008|±  |0.0273|
|  - jurisprudence                      |      1|none  |     0|acc   |↑  |0.8796|±  |0.0315|
|  - logical_fallacies                  |      1|none  |     0|acc   |↑  |0.8466|±  |0.0283|
|  - moral_disputes                     |      1|none  |     0|acc   |↑  |0.8295|±  |0.0202|
|  - moral_scenarios                    |      1|none  |     0|acc   |↑  |0.6011|±  |0.0164|
|  - philosophy                         |      1|none  |     0|acc   |↑  |0.8778|±  |0.0186|
|  - prehistory                         |      1|none  |     0|acc   |↑  |0.8889|±  |0.0175|
|  - professional_law                   |      1|none  |     0|acc   |↑  |0.6656|±  |0.0120|
|  - world_religions                    |      1|none  |     0|acc   |↑  |0.9123|±  |0.0217|
| - other                               |      2|none  |     0|acc   |↑  |0.8626|±  |0.0060|
|  - business_ethics                    |      1|none  |     0|acc   |↑  |0.8100|±  |0.0394|
|  - clinical_knowledge                 |      1|none  |     0|acc   |↑  |0.8943|±  |0.0189|
|  - college_medicine                   |      1|none  |     0|acc   |↑  |0.7919|±  |0.0310|
|  - global_facts                       |      1|none  |     0|acc   |↑  |0.6900|±  |0.0465|
|  - human_aging                        |      1|none  |     0|acc   |↑  |0.8251|±  |0.0255|
|  - management                         |      1|none  |     0|acc   |↑  |0.8641|±  |0.0339|
|  - marketing                          |      1|none  |     0|acc   |↑  |0.9573|±  |0.0133|
|  - medical_genetics                   |      1|none  |     0|acc   |↑  |0.8900|±  |0.0314|
|  - miscellaneous                      |      1|none  |     0|acc   |↑  |0.9298|±  |0.0091|
|  - nutrition                          |      1|none  |     0|acc   |↑  |0.9052|±  |0.0168|
|  - professional_accounting            |      1|none  |     0|acc   |↑  |0.7979|±  |0.0240|
|  - professional_medicine              |      1|none  |     0|acc   |↑  |0.8897|±  |0.0190|
|  - virology                           |      1|none  |     0|acc   |↑  |0.5904|±  |0.0383|
| - social sciences                     |      2|none  |     0|acc   |↑  |0.9012|±  |0.0053|
|  - econometrics                       |      1|none  |     0|acc   |↑  |0.7544|±  |0.0405|
|  - high_school_geography              |      1|none  |     0|acc   |↑  |0.9242|±  |0.0189|
|  - high_school_government_and_politics|      1|none  |     0|acc   |↑  |0.9793|±  |0.0103|
|  - high_school_macroeconomics         |      1|none  |     0|acc   |↑  |0.8923|±  |0.0157|
|  - high_school_microeconomics         |      1|none  |     0|acc   |↑  |0.9328|±  |0.0163|
|  - high_school_psychology             |      1|none  |     0|acc   |↑  |0.9541|±  |0.0090|
|  - human_sexuality                    |      1|none  |     0|acc   |↑  |0.8855|±  |0.0279|
|  - professional_psychology            |      1|none  |     0|acc   |↑  |0.8709|±  |0.0136|
|  - public_relations                   |      1|none  |     0|acc   |↑  |0.8182|±  |0.0369|
|  - security_studies                   |      1|none  |     0|acc   |↑  |0.8490|±  |0.0229|
|  - sociology                          |      1|none  |     0|acc   |↑  |0.9055|±  |0.0207|
|  - us_foreign_policy                  |      1|none  |     0|acc   |↑  |0.9600|±  |0.0197|
| - stem                                |      2|none  |     0|acc   |↑  |0.8138|±  |0.0067|
|  - abstract_algebra                   |      1|none  |     0|acc   |↑  |0.7000|±  |0.0461|
|  - anatomy                            |      1|none  |     0|acc   |↑  |0.8444|±  |0.0313|
|  - astronomy                          |      1|none  |     0|acc   |↑  |0.9211|±  |0.0219|
|  - college_biology                    |      1|none  |     0|acc   |↑  |0.9514|±  |0.0180|
|  - college_chemistry                  |      1|none  |     0|acc   |↑  |0.6000|±  |0.0492|
|  - college_computer_science           |      1|none  |     0|acc   |↑  |0.8300|±  |0.0378|
|  - college_mathematics                |      1|none  |     0|acc   |↑  |0.6600|±  |0.0476|
|  - college_physics                    |      1|none  |     0|acc   |↑  |0.7549|±  |0.0428|
|  - computer_security                  |      1|none  |     0|acc   |↑  |0.8200|±  |0.0386|
|  - conceptual_physics                 |      1|none  |     0|acc   |↑  |0.8766|±  |0.0215|
|  - electrical_engineering             |      1|none  |     0|acc   |↑  |0.8345|±  |0.0310|
|  - elementary_mathematics             |      1|none  |     0|acc   |↑  |0.8836|±  |0.0165|
|  - high_school_biology                |      1|none  |     0|acc   |↑  |0.9258|±  |0.0149|
|  - high_school_chemistry              |      1|none  |     0|acc   |↑  |0.8079|±  |0.0277|
|  - high_school_computer_science       |      1|none  |     0|acc   |↑  |0.8900|±  |0.0314|
|  - high_school_mathematics            |      1|none  |     0|acc   |↑  |0.6481|±  |0.0291|
|  - high_school_physics                |      1|none  |     0|acc   |↑  |0.7550|±  |0.0351|
|  - high_school_statistics             |      1|none  |     0|acc   |↑  |0.8333|±  |0.0254|
|  - machine_learning                   |      1|none  |     0|acc   |↑  |0.5982|±  |0.0465|

|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |   |0.8238|±  |0.0031|
| - humanities     |      2|none  |     0|acc   |↑  |0.7543|±  |0.0060|
| - other          |      2|none  |     0|acc   |↑  |0.8626|±  |0.0060|
| - social sciences|      2|none  |     0|acc   |↑  |0.9012|±  |0.0053|
| - stem           |      2|none  |     0|acc   |↑  |0.8138|±  |0.0067|

|        Tasks        |Version|Filter|n-shot| Metric |   |Value|   |Stderr|
|---------------------|------:|------|-----:|--------|---|----:|---|-----:|
|gpqa_diamond_zeroshot|      1|none  |     0|acc     |↑  |0.399|±  |0.0349|
|                     |       |none  |     0|acc_norm|↑  |0.399|±  |0.0349|

|          Tasks          |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-------------------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gpqa_diamond_cot_zeroshot|      1|flexible-extract|     0|exact_match|↑  |0.7525|±  |0.0307|
|                         |       |strict-match    |     0|exact_match|↑  |0.6667|±  |0.0336|

|  Tasks  |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot|      3|flexible-extract|     0|exact_match|↑  |0.9310|±  |0.0070|
|         |       |strict-match    |     0|exact_match|↑  |0.8567|±  |0.0097|

ilintar

Mar 4

I have a feeling, judging from my experience on self-hosting said IQ4_XS quant and running the OpenRouter API version, that Llama.cpp's implementation on StepFun3.5 is just better than the vLLM one, hence giving better results. There seems to be some error in the vLLM / Transformers implementation that causes some performance degradation and the reported reasoning loops - I have not encountered them a single time and I've been using StepFun3.5 as my daily driver for production work on my Java system for the past few weeks now.

bobzhuyb

StepFun org Mar 5

Hi @tarruda , I am not 100% sure I understand your question. This repo is 3.5 Flash's midtrain checkpoint, which means it has not gone through SFT or RL. The performance is expected to be lower than the final version (published in Feb). In addition, we will release SFT data soon. With the SFT data users can reproduce a model whose performance is much closer to the final version we published. The reason we release these checkpoints and data is to give back to the community and academia for reproducibility and ease customization. E.g., someone can mix their own SFT data for a given domain with our SFT data, to get a model similar to our published final version while have a given aspect enhanced.

Regarding the comparison for quantization, I am sorry we did not run a very throughout evaluation for the 4-bit quantized variants. We only had a single sanity check :(
https://huggingface.co/stepfun-ai/Step-3.5-Flash-GGUF-Q4_K_S/discussions/9

Let me see whether we can get some resources to do that.

ilintar

Mar 5

@bobzhuyb I think what he means is that we've been running the Llama.cpp 4bit quant for some time and it feels that it's better than the official version (at least the one accessible via OpenRouter) - no looping on reasoning among other things - and now he's also done a benchmark that confirms the results for that 4bit quant are better than the ones in your official reference sheet. Which probably means there's some bug in the Transformers / vLLM implementation that was fixed during the Llama.cpp port.

tarruda

Mar 5

@bobzhuyb it was a mistake on my part. I saw the table and assumed the benchmarks were for the SFT version.

@ilintar TBH I did reproduce infinite looping with @ubergarm 's IQ4_XS, but only happened once (I don't recall which prompt I used). With @AesSedai version (which is even smaller) it still hasn't happened yet, and I've been using it a lot with pi coding agent. It has been working pretty much perfectly, I feel like for the first time I have a truly good local model for agentic coding.

huangguanzhe

StepFun org Mar 9

Hi @ilintar , if possible, could you share some reproducible test cases?

tarruda

Mar 9

@huangguanzhe Do you mean for infinite reasoning? There's an example here: https://github.com/ggml-org/llama.cpp/pull/19283#issuecomment-3870132727

dehnhaide

24 days ago

•

edited 24 days ago

I have a feeling, judging from my experience on self-hosting said IQ4_XS quant and running the OpenRouter API version, that Llama.cpp's implementation on StepFun3.5 is just better than the vLLM one, hence giving better results.

Till today, this is still an issue in llama.cpp, even with the newer model version (stepfun-ai/Step-3.5-Flash-Base-Midtrain). I am not sure if it's chat template related or not... I will be trying your autoparser version and see if it comes right.

And this is with "aessedai/Step-3.5-Flash-Base-Midtrain-Q5_K_M" ... BUT... the error only strikes with specific TUIs (e.g. omp) while with Opencode and ClaudeCode all is smooth.

dehnhaide

24 days ago

I am a bit at unease with these serving tunning recommendations:

"Recommended Inference Parameters
For general chat domain, we suggest: temperature=0.6, top_p=0.95
For reasoning / agent scenario, we recommend: temperature=1.0, top_p=0.95."

and I've decided to use, at least for TUIs code assisting scope, to go for: "temperature=0.8, top_p=0.9"

switched to ik_lllama server where "-khad" and "-vhad" help with rotation applied to K/V and improved the PPL.

So far, the looping is gone.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment