Context Length for 2X6000 Pros (2x96 = 192GB VRAM)
On M2.5 I can easily get the full context length; however, on this M2.7 version I cannot get more than ~88K of context length.
These are the commands (the two runs are identical except for --model-path; a rough KV-cache budget check follows after them):
For M2.5:
docker run --rm -it \
--gpus '"device=0,2"' \
--shm-size 32g \
-p 10002:8000 \
-v /media/mukul/data/models:/models \
-e PYTORCH_ALLOC_CONF=expandable_segments:True \
lmsysorg/sglang:latest \
python -m sglang.launch_server \
--model-path /models/nvidia/MiniMax-M2.5-NVFP4 \
--served-model-name jarvis-thinker \
--tp-size 2 \
--quantization modelopt_fp4 \
--tool-call-parser minimax-m2 \
--reasoning-parser minimax \
--host 0.0.0.0 \
--port 8000 \
--trust-remote-code \
--dtype auto \
--mem-fraction-static 0.90 \
--context-length 196608 \
--max-running-requests 16 \
--chunked-prefill-size 16384 \
--sleep-on-idle
And for M2.7:
docker run --rm -it \
--gpus '"device=0,2"' \
--shm-size 32g \
-p 10002:8000 \
-v /media/mukul/data/models:/models \
-e PYTORCH_ALLOC_CONF=expandable_segments:True \
lmsysorg/sglang:latest \
python -m sglang.launch_server \
--model-path /models/nvidia/MiniMax-M2.7-NVFP4 \
--served-model-name jarvis-thinker \
--tp-size 2 \
--quantization modelopt_fp4 \
--tool-call-parser minimax-m2 \
--reasoning-parser minimax \
--host 0.0.0.0 \
--port 8000 \
--trust-remote-code \
--dtype auto \
--mem-fraction-static 0.90 \
--context-length 196608 \
--max-running-requests 16 \
--chunked-prefill-size 16384 \
--sleep-on-idle
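Since the two runs are otherwise identical, my working theory is that the M2.7 weights simply take more VRAM, leaving less of the --mem-fraction-static 0.90 budget for KV cache. Here's a rough back-of-envelope check; the layer count, KV heads, head dim, and bytes per element below are made-up placeholders, so read the real values from the checkpoint's config.json before trusting the output:

# per-token KV bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element
layers=62; kv_heads=8; head_dim=128; bpe=2   # placeholder values, not from the real config
ctx=196608
per_token=$((2 * layers * kv_heads * head_dim * bpe))
echo "KV cache: ${per_token} bytes/token, ~$((per_token * ctx / 1024 / 1024 / 1024)) GiB at ${ctx} tokens"
# with --tp-size 2 the KV heads are sharded, so each GPU holds roughly half of this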
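Once the server is up, it's also worth confirming what context length it actually registered (this assumes the /get_server_info endpoint that recent sglang builds expose; the max_total_num_tokens line in the startup log tells the same story):

curl -s http://localhost:10002/get_server_info | python3 -m json.tool | grep -i context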