Running on DGX Spark 🥳

by dionode - opened Apr 1

Apr 1

This quantization is ideal for the DGX Spark: with a context window on the order of 32K, you can reach 60 tokens/s.
Throughput in tokens/s drops as the amount of data in the cache grows, but performance on the Spark stays solid.
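
If anyone wants to compare numbers, here is a minimal sketch of how tokens/s could be computed from timed runs at different cache sizes. The helper name `tokens_per_second` and the sample runs are assumptions made up for illustration; they are placeholders, not measurements from the Spark.

```python
def tokens_per_second(generated_tokens: int, elapsed_seconds: float) -> float:
    """Decode throughput for a single timed generation."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return generated_tokens / elapsed_seconds


# Hypothetical placeholder runs: (tokens already in cache, tokens generated, seconds).
# These are NOT real measurements; replace them with your own timings.
runs = [
    (1_000, 512, 8.0),
    (16_000, 512, 10.0),
    (32_000, 512, 12.8),
]

for cached, generated, seconds in runs:
    rate = tokens_per_second(generated, seconds)
    print(f"{cached:>6} cached tokens: {rate:.1f} tokens/s")
```

Timing only the generation phase, separately from prompt processing, keeps the figure comparable across context sizes.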

I'm still testing, but I'm optimistic about this quantization.

On the Spark:

FP8 runs out of memory.
NVFP4, the theoretical ideal, causes a lot of trouble during installation. I hope this changes once NVIDIA releases newer versions of vLLM on its container registry.

Please share what you've learned from running on the Spark 🤗
