yehya PRO
Recent Activity
CUDA Version -- Min requirement?
Inference Settings
@alfredo-ottomate I seriously thought I was missing a huge breakthrough when reading that lol. I mean, even the 4 active experts won't fit in the claimed 1.5GB of RAM, and even if we go further and assume disk offloading with a high-end Gen3 NVMe SSD in the Pi, I'd still expect sub 1 tok/s.
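Rough numbers, just to show why the math doesn't work (the active parameter count, bytes-per-weight, and SSD bandwidth below are my own assumptions, not measurements):

```python
# Back-of-the-envelope check: can gpt-oss-20b's active experts fit in 1.5 GB,
# and what throughput would NVMe offloading give? All figures are assumptions.

ACTIVE_PARAMS = 3.6e9      # assumed active params per token for gpt-oss-20b (4 of 32 experts)
BYTES_PER_PARAM = 0.55     # assumed ~4.4 bits/param effective (MXFP4 weights + overhead)
RAM_BUDGET_GB = 1.5        # the claimed RAM budget
GEN3_NVME_GBPS = 3.0       # assumed sustained sequential read of a good Gen3 NVMe SSD

active_weights_gb = ACTIVE_PARAMS * BYTES_PER_PARAM / 1e9
print(f"active expert weights per token: ~{active_weights_gb:.1f} GB "
      f"(budget: {RAM_BUDGET_GB} GB)")

# If the active weights don't fit in RAM, they must be streamed from disk every
# token, so disk bandwidth bounds generation speed.
tok_per_s_upper_bound = GEN3_NVME_GBPS / active_weights_gb
print(f"SSD-bound upper limit: ~{tok_per_s_upper_bound:.1f} tok/s")
```

And that ~1.5 tok/s is a best case assuming perfectly sequential reads and zero compute time, so sub 1 tok/s in practice.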
@SeaWolf-AI you have a nicely structured approach to benchmarking, covering different kinds of variables and metrics, but honestly a lot of the info in this is flawed. Also, the Qwen3.5 models underperforming the Qwen3 ones is unexpected; are you sure you used the recommended generation parameters for each model? Slight variations can lead to totally different outputs, especially on the metrics you're looking at.
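For example, this is what I mean by pinning each model's params explicitly instead of trusting framework defaults (a minimal sketch assuming vLLM; the sampling values and the Qwen3.5 repo id are placeholders, take the real ones from each model card):

```python
# Sketch: give every benchmarked model its own recommended sampling params.
# Values and the Qwen3.5 repo id below are illustrative placeholders.
from vllm import LLM, SamplingParams

RECOMMENDED = {
    "Qwen/Qwen3-8B": SamplingParams(temperature=0.6, top_p=0.95, top_k=20, max_tokens=1024),
    "Qwen/Qwen3.5-8B": SamplingParams(temperature=0.7, top_p=0.8, top_k=20, max_tokens=1024),  # placeholder id
}

for model_name, params in RECOMMENDED.items():
    # In practice run each model in its own process so GPU memory is freed between runs.
    llm = LLM(model=model_name)
    outputs = llm.generate(["Benchmark prompt here"], params)
    print(model_name, outputs[0].outputs[0].text[:200])
```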
FP8 Version for running on vLLM with hardware optimizations from Ada+ generation GPUs
gpt-oss-20b on 1.5GB RAM? Which inference framework are you using for that? llama.cpp?
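If it is llama.cpp, my guess is the 1.5GB figure is just resident memory, with the weights mmap'd and paged in from disk on demand. A minimal sketch of that setup, assuming the llama-cpp-python bindings and a placeholder GGUF filename:

```python
# Sketch (assuming llama-cpp-python): with use_mmap=True the weights are
# memory-mapped from disk rather than fully loaded, so resident RAM can look
# far smaller than the model size while the OS page cache does the work.
from llama_cpp import Llama

llm = Llama(
    model_path="gpt-oss-20b-Q4_K_M.gguf",  # placeholder GGUF filename
    n_ctx=2048,
    n_gpu_layers=0,    # CPU-only, like a Pi
    use_mmap=True,     # map weights from disk, pages loaded on demand
    use_mlock=False,   # don't pin pages, so RSS stays low (and speed suffers)
)
out = llm("Hello", max_tokens=16)
print(out["choices"][0]["text"])
```

RSS stays small that way, but every token still has to pull the active expert weights through the page cache, which brings us back to the SSD-bound speeds above.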