Question about benchmark results
Are the benchmarks in the README for the final Step 3.5 Flash?
The reason I ask is that I ran some benchmarks on an IQ4_XS quant (https://huggingface.co/AesSedai/Step-3.5-Flash-GGUF) of Step 3.5 Flash, and on some of them I seem to have gotten better results than what you published.
If these benchmarks are just for the midtrain checkpoint, did you publish the final benchmarks anywhere? I'd love to compare them against the quantized version.
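For context, here is roughly how the numbers below were produced: lm-evaluation-harness pointed at a llama.cpp server. A minimal sketch of the invocation (the endpoint URL, served model name, and concurrency settings are placeholders for my local setup, not anything official):

```python
# Minimal sketch: score MMLU against a llama.cpp server that exposes an
# OpenAI-compatible /v1/completions endpoint (placeholder URL and model name).
import lm_eval

results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args=(
        "base_url=http://localhost:8080/v1/completions,"
        "model=step-3.5-flash,"  # whatever name your server reports
        "num_concurrent=4,tokenized_requests=False"
    ),
    tasks=["mmlu"],
    num_fewshot=0,
)
print(results["results"]["mmlu"])
```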
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|---------------------------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmlu | 2|none | |acc | |0.8238|± |0.0031|
| - humanities | 2|none | 0|acc |↑ |0.7543|± |0.0060|
| - formal_logic | 1|none | 0|acc |↑ |0.7063|± |0.0407|
| - high_school_european_history | 1|none | 0|acc |↑ |0.8788|± |0.0255|
| - high_school_us_history | 1|none | 0|acc |↑ |0.9314|± |0.0177|
| - high_school_world_history | 1|none | 0|acc |↑ |0.9283|± |0.0168|
| - international_law | 1|none | 0|acc |↑ |0.9008|± |0.0273|
| - jurisprudence | 1|none | 0|acc |↑ |0.8796|± |0.0315|
| - logical_fallacies | 1|none | 0|acc |↑ |0.8466|± |0.0283|
| - moral_disputes | 1|none | 0|acc |↑ |0.8295|± |0.0202|
| - moral_scenarios | 1|none | 0|acc |↑ |0.6011|± |0.0164|
| - philosophy | 1|none | 0|acc |↑ |0.8778|± |0.0186|
| - prehistory | 1|none | 0|acc |↑ |0.8889|± |0.0175|
| - professional_law | 1|none | 0|acc |↑ |0.6656|± |0.0120|
| - world_religions | 1|none | 0|acc |↑ |0.9123|± |0.0217|
| - other | 2|none | 0|acc |↑ |0.8626|± |0.0060|
| - business_ethics | 1|none | 0|acc |↑ |0.8100|± |0.0394|
| - clinical_knowledge | 1|none | 0|acc |↑ |0.8943|± |0.0189|
| - college_medicine | 1|none | 0|acc |↑ |0.7919|± |0.0310|
| - global_facts | 1|none | 0|acc |↑ |0.6900|± |0.0465|
| - human_aging | 1|none | 0|acc |↑ |0.8251|± |0.0255|
| - management | 1|none | 0|acc |↑ |0.8641|± |0.0339|
| - marketing | 1|none | 0|acc |↑ |0.9573|± |0.0133|
| - medical_genetics | 1|none | 0|acc |↑ |0.8900|± |0.0314|
| - miscellaneous | 1|none | 0|acc |↑ |0.9298|± |0.0091|
| - nutrition | 1|none | 0|acc |↑ |0.9052|± |0.0168|
| - professional_accounting | 1|none | 0|acc |↑ |0.7979|± |0.0240|
| - professional_medicine | 1|none | 0|acc |↑ |0.8897|± |0.0190|
| - virology | 1|none | 0|acc |↑ |0.5904|± |0.0383|
| - social sciences | 2|none | 0|acc |↑ |0.9012|± |0.0053|
| - econometrics | 1|none | 0|acc |↑ |0.7544|± |0.0405|
| - high_school_geography | 1|none | 0|acc |↑ |0.9242|± |0.0189|
| - high_school_government_and_politics| 1|none | 0|acc |↑ |0.9793|± |0.0103|
| - high_school_macroeconomics | 1|none | 0|acc |↑ |0.8923|± |0.0157|
| - high_school_microeconomics | 1|none | 0|acc |↑ |0.9328|± |0.0163|
| - high_school_psychology | 1|none | 0|acc |↑ |0.9541|± |0.0090|
| - human_sexuality | 1|none | 0|acc |↑ |0.8855|± |0.0279|
| - professional_psychology | 1|none | 0|acc |↑ |0.8709|± |0.0136|
| - public_relations | 1|none | 0|acc |↑ |0.8182|± |0.0369|
| - security_studies | 1|none | 0|acc |↑ |0.8490|± |0.0229|
| - sociology | 1|none | 0|acc |↑ |0.9055|± |0.0207|
| - us_foreign_policy | 1|none | 0|acc |↑ |0.9600|± |0.0197|
| - stem | 2|none | 0|acc |↑ |0.8138|± |0.0067|
| - abstract_algebra | 1|none | 0|acc |↑ |0.7000|± |0.0461|
| - anatomy | 1|none | 0|acc |↑ |0.8444|± |0.0313|
| - astronomy | 1|none | 0|acc |↑ |0.9211|± |0.0219|
| - college_biology | 1|none | 0|acc |↑ |0.9514|± |0.0180|
| - college_chemistry | 1|none | 0|acc |↑ |0.6000|± |0.0492|
| - college_computer_science | 1|none | 0|acc |↑ |0.8300|± |0.0378|
| - college_mathematics | 1|none | 0|acc |↑ |0.6600|± |0.0476|
| - college_physics | 1|none | 0|acc |↑ |0.7549|± |0.0428|
| - computer_security | 1|none | 0|acc |↑ |0.8200|± |0.0386|
| - conceptual_physics | 1|none | 0|acc |↑ |0.8766|± |0.0215|
| - electrical_engineering | 1|none | 0|acc |↑ |0.8345|± |0.0310|
| - elementary_mathematics | 1|none | 0|acc |↑ |0.8836|± |0.0165|
| - high_school_biology | 1|none | 0|acc |↑ |0.9258|± |0.0149|
| - high_school_chemistry | 1|none | 0|acc |↑ |0.8079|± |0.0277|
| - high_school_computer_science | 1|none | 0|acc |↑ |0.8900|± |0.0314|
| - high_school_mathematics | 1|none | 0|acc |↑ |0.6481|± |0.0291|
| - high_school_physics | 1|none | 0|acc |↑ |0.7550|± |0.0351|
| - high_school_statistics | 1|none | 0|acc |↑ |0.8333|± |0.0254|
| - machine_learning | 1|none | 0|acc |↑ |0.5982|± |0.0465|
| Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
|------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmlu | 2|none | |acc | |0.8238|± |0.0031|
| - humanities | 2|none | 0|acc |↑ |0.7543|± |0.0060|
| - other | 2|none | 0|acc |↑ |0.8626|± |0.0060|
| - social sciences| 2|none | 0|acc |↑ |0.9012|± |0.0053|
| - stem | 2|none | 0|acc |↑ |0.8138|± |0.0067|
| Tasks |Version|Filter|n-shot| Metric | |Value| |Stderr|
|---------------------|------:|------|-----:|--------|---|----:|---|-----:|
|gpqa_diamond_zeroshot| 1|none | 0|acc |↑ |0.399|± |0.0349|
| | |none | 0|acc_norm|↑ |0.399|± |0.0349|
| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|-------------------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gpqa_diamond_cot_zeroshot| 1|flexible-extract| 0|exact_match|↑ |0.7525|± |0.0307|
| | |strict-match | 0|exact_match|↑ |0.6667|± |0.0336|
| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot| 3|flexible-extract| 0|exact_match|↑ |0.9310|± |0.0070|
| | |strict-match | 0|exact_match|↑ |0.8567|± |0.0097|
I have a feeling, judging from my experience self-hosting said IQ4_XS quant and running the OpenRouter API version, that llama.cpp's implementation of Step 3.5 Flash is simply better than the vLLM one, hence the better results. There seems to be some error in the vLLM / Transformers implementation that causes performance degradation and the reported reasoning loops. I have not encountered a single loop, and I've been using Step 3.5 Flash as my daily driver for production work on my Java system for the past few weeks.
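For what it's worth, I've mostly been eyeballing transcripts for the loops, but a crude automated check is easy to sketch (a hypothetical helper with arbitrary thresholds, not anything from the model repos):

```python
# Crude sketch of a "reasoning loop" detector: flag a generation when a single
# repeated n-gram dominates the tail of the output. Thresholds are arbitrary.
from collections import Counter

def looks_looped(text: str, n: int = 8, window: int = 400, threshold: float = 0.5) -> bool:
    toks = text.split()[-window:]  # only inspect the tail of the generation
    if len(toks) < 2 * n:
        return False
    ngrams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    _, top_count = Counter(ngrams).most_common(1)[0]
    # If one n-gram covers a large share of the window, the model is repeating itself.
    return top_count * n / len(toks) >= threshold

print(looks_looped("the answer is " * 200))   # True: degenerate repetition
print(looks_looped("a short, varied reply"))  # False
```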
Hi @tarruda, I am not 100% sure I understand your question. This repo is Step 3.5 Flash's midtrain checkpoint, which means it has not gone through SFT or RL. Its performance is expected to be lower than the final version (published in Feb). In addition, we will release the SFT data soon. With the SFT data, users can reproduce a model whose performance is much closer to the final version we published. The reason we release these checkpoints and data is to give back to the community and academia, for reproducibility and to ease customization. E.g., someone can mix their own SFT data for a given domain with our SFT data to get a model similar to our published final version while having a given aspect enhanced.
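To illustrate the mixing idea (a hypothetical sketch only; the file names and JSONL format are assumptions for illustration, not the actual format of the upcoming SFT release):

```python
# Hypothetical sketch: interleave your own domain SFT examples with the
# released SFT data, upsampling the domain you want enhanced.
import json
import random

with open("released_sft.jsonl") as f:
    base = [json.loads(line) for line in f]
with open("my_domain_sft.jsonl") as f:
    domain = [json.loads(line) for line in f]

mixed = base + domain * 3  # the upsampling factor is a knob to tune
random.shuffle(mixed)

with open("mixed_sft.jsonl", "w") as f:
    for ex in mixed:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```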
Regarding the comparison for quantization, I am sorry we did not run a very thorough evaluation of the 4-bit quantized variants. We only did a single sanity check :(
https://huggingface.co/stepfun-ai/Step-3.5-Flash-GGUF-Q4_K_S/discussions/9
Let me see whether we can get some resources to do that.
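One cheap proxy before any full benchmark sweep is the per-token KL divergence between full-precision and quantized logits. A rough sketch (the model name is a placeholder, and it uses a bitsandbytes 4-bit load as a stand-in, since GGUF logits would have to come from llama.cpp instead):

```python
# Rough sketch: mean per-token KL divergence between full-precision and
# 4-bit logits on sample text, as a proxy for quantization loss.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-org/your-model"  # placeholder
tok = AutoTokenizer.from_pretrained(model_id)
full = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
quant = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

ids = tok("Some representative evaluation text.", return_tensors="pt").input_ids

with torch.no_grad():
    logp_full = F.log_softmax(full(ids.to(full.device)).logits.float(), dim=-1)
    logp_quant = F.log_softmax(quant(ids.to(quant.device)).logits.float(), dim=-1)

# KL(full || quant) per position, averaged over the sequence.
kl = F.kl_div(logp_quant.cpu(), logp_full.cpu(), log_target=True, reduction="none")
print("mean per-token KL:", kl.sum(-1).mean().item())
```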
@bobzhuyb I think what he means is that we've been running the llama.cpp 4-bit quant for some time and it feels better than the official version (at least the one accessible via OpenRouter), with no looping during reasoning, among other things. Now he has also run a benchmark confirming that the results for that 4-bit quant are better than the ones in your official reference sheet, which probably means there is some bug in the Transformers / vLLM implementation that was fixed during the llama.cpp port.
@bobzhuyb it was a mistake on my part. I saw the table and assumed the benchmarks were for the SFT version.
@ilintar TBH I did reproduce the infinite looping with @ubergarm's IQ4_XS, but it only happened once (I don't recall which prompt I used). With @AesSedai's version (which is even smaller) it still hasn't happened, and I've been using it a lot with the pi coding agent. It has been working pretty much perfectly; I feel like for the first time I have a truly good local model for agentic coding.