Context Length for 2x RTX 6000 Pros (2x96 GB = 192 GB VRAM)


On M2.5 I can easily get the full context length, but on this M2.7 version I cannot get more than ~88K of context.
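For reference, this is the rough math I use to reason about the ceiling: whatever VRAM is left under --mem-fraction-static after the weights load has to hold the KV cache, so a bigger checkpoint means fewer tokens. A minimal sketch; all model dimensions below are placeholders, not the real MiniMax-M2 config (the actual values are in the checkpoint's config.json):

GIB = 1024**3

# Hardware and server settings from the commands below.
total_vram = 2 * 96 * GIB            # 2x RTX 6000 Pro
mem_fraction_static = 0.90           # --mem-fraction-static

# Everything below is a PLACEHOLDER, not the real MiniMax-M2 config;
# substitute the values from the checkpoint's config.json.
weights_bytes = 120 * GIB            # placeholder: loaded NVFP4 weight footprint
num_layers = 60                      # placeholder
num_kv_heads = 8                     # placeholder (GQA)
head_dim = 128                       # placeholder
kv_dtype_bytes = 2                   # bf16 KV cache

# Bytes per token = 2 (K and V) * layers * kv_heads * head_dim * dtype size.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * kv_dtype_bytes

kv_budget = total_vram * mem_fraction_static - weights_bytes
print(f"~{int(kv_budget / bytes_per_token):,} tokens of KV cache fit")

With tp-size 2 the cache is sharded across both GPUs, but the aggregate budget works out the same, so if M2.7's weights take more room than M2.5's, the max context drops accordingly.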

These are the commands:

For M2.5:

docker run --rm -it \
    --gpus '"device=0,2"' \
    --shm-size 32g \
    -p 10002:8000 \
    -v /media/mukul/data/models:/models \
    -e PYTORCH_ALLOC_CONF=expandable_segments:True \
    lmsysorg/sglang:latest \
    python -m sglang.launch_server \
        --model-path /models/nvidia/MiniMax-M2.5-NVFP4 \
        --served-model-name jarvis-thinker \
        --tp-size 2 \
        --quantization modelopt_fp4 \
        --tool-call-parser minimax-m2 \
        --reasoning-parser minimax \
        --host 0.0.0.0 \
        --port 8000 \
        --trust-remote-code \
        --dtype auto \
        --mem-fraction-static 0.90 \
        --context-length 196608 \
        --max-running-requests 16 \
        --chunked-prefill-size 16384 \
        --sleep-on-idle
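Once the server is up, this is how I sanity-check what it actually allocated, using SGLang's native info endpoints (the exact endpoint names and response fields may vary between releases):

import requests

BASE = "http://localhost:10002"  # host port mapped to container port 8000 above

# Native SGLang inspection endpoints (present in recent versions;
# names/fields may differ across releases).
print(requests.get(f"{BASE}/get_model_info").json())
print(requests.get(f"{BASE}/get_server_info").json())  # includes the token budget the scheduler settled on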

And for M2.7:

docker run --rm -it \
    --gpus '"device=0,2"' \
    --shm-size 32g \
    -p 10002:8000 \
    -v /media/mukul/data/models:/models \
    -e PYTORCH_ALLOC_CONF=expandable_segments:True \
    lmsysorg/sglang:latest \
    python -m sglang.launch_server \
        --model-path /models/nvidia/MiniMax-M2.7-NVFP4 \
        --served-model-name jarvis-thinker \
        --tp-size 2 \
        --quantization modelopt_fp4 \
        --tool-call-parser minimax-m2 \
        --reasoning-parser minimax \
        --host 0.0.0.0 \
        --port 8000 \
        --trust-remote-code \
        --dtype auto \
        --mem-fraction-static 0.90 \
        --context-length 196608 \
        --max-running-requests 16 \
        --chunked-prefill-size 16384 \
        --sleep-on-idle
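To see where the memory goes on M2.7, I compare per-GPU usage once the weights have loaded. A quick sketch using pynvml (from the nvidia-ml-py package):

import pynvml

pynvml.nvmlInit()
# Devices 0 and 2, matching --gpus '"device=0,2"' above.
for idx in (0, 2):
    handle = pynvml.nvmlDeviceGetHandleByIndex(idx)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {idx}: {mem.used / 1024**3:.1f} GiB used "
          f"/ {mem.total / 1024**3:.1f} GiB total")
pynvml.nvmlShutdown()

If the M2.7 weights leave noticeably less headroom per GPU than M2.5's, that would explain why the same --context-length 196608 no longer fits.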
