Running on DGX Spark 🥳

#3
by dionode - opened

This quantization is ideal for the DGX Spark: with a context window on the order of 32K you can reach 60 tokens/s.
Throughput in tokens/s drops as the amount of data in the cache grows, but performance on the Spark is still solid.
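For anyone who wants to compare numbers, here is a minimal sketch of how a tokens/s figure could be measured. It is not tied to any particular server: `tokens_per_second` is a hypothetical helper, and `stream` stands for whatever iterable of generated tokens your client exposes (an assumption, not something from this setup).

```python
import time
from typing import Callable, Iterable


def tokens_per_second(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Count the tokens yielded by `stream` and divide by the elapsed wall time."""
    start = clock()
    count = 0
    for _ in stream:  # consume the stream token by token
        count += 1
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return count / elapsed


if __name__ == "__main__":
    # Stand-in for a real token stream: 50 tokens with a small artificial delay.
    def fake_stream(n: int = 50, delay: float = 0.01):
        for i in range(n):
            time.sleep(delay)
            yield f"tok{i}"

    print(f"{tokens_per_second(fake_stream()):.1f} tokens/s")
```

Note that with a lazy stream the timer also covers the wait for the first token, so prompt processing is included in the figure. Running the measurement at several prompt lengths would show how throughput changes as the cache fills.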

I'm still testing, but I'm optimistic about this quantization.

On the Spark:

  • FP8 runs out of memory
  • NVFP4, in theory the ideal format, causes a lot of trouble during installation. I hope this changes once NVIDIA releases newer versions of vLLM on its container registry.

Please share what you learn about running on Spark 🤗
