Running on DGX Spark 🥳
#3
by dionode - opened
This quantization is ideal for the DGX Spark. With a context window on the order of 32K tokens, you can reach 60 tokens/s.
Throughput in tokens/s drops as the amount of data in the cache grows, but performance on the Spark is still solid.
I'm still testing, but I'm optimistic about this quantization.
On the Spark:
- FP8 runs out of memory
- NVFP4, the ideal format in theory, causes a lot of trouble during installation. I hope this changes once NVIDIA releases newer versions of vLLM on its container registry
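For anyone wondering why FP8 can run out of memory while a 4-bit quantization fits, here is a rough back-of-the-envelope sketch. It only counts the weights and ignores quantization scales, the KV cache, and activations, so real usage is higher. `N_PARAMS` and `MEMORY_GIB` are hypothetical placeholders, not figures for this model or measurements from the Spark; plug in your own.

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


# Hypothetical values for illustration only; replace with your own.
N_PARAMS = 120e9   # assumed parameter count, NOT taken from this model
MEMORY_GIB = 128   # assumed memory budget for the machine

for name, bits in [("FP8", 8), ("4-bit", 4)]:
    need = weight_gib(N_PARAMS, bits)
    # Whatever is left after the weights must hold the KV cache and activations.
    headroom = MEMORY_GIB - need
    print(f"{name}: weights ~{need:.1f} GiB, headroom ~{headroom:.1f} GiB")
```

The less headroom remains after loading the weights, the less room there is for the KV cache, which is what grows with the context window.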
Please share what you learn about running on Spark