Doradus-AI PRO
Recent Activity
posted an update 1 day ago
Tonight we validated a small upstream vLLM fix that brings GLM-5.1-REAP-478B back into our consumer-Blackwell rotation pool.
Sleep/wake on 4× RTX PRO 6000 (SM_120) had a CuMemAllocator race that retired GLM in April. cuMemUnmap runs synchronously from the host the moment a pool-backed tensor's refcount hits zero, but kernels can still be in flight against that storage. The result: CUDA_ERROR_ILLEGAL_ADDRESS faults accumulate until the engine is unrecoverable.
vllm-project/vllm#43020 is a one-line torch.cuda.synchronize() at the top of _python_free_callback. Steady-state inference is unaffected: only cumem frees pay the synchronization cost.
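To make the ordering problem concrete, here is a toy model in plain Python. It is not vLLM or CUDA code: `ToyDevice`, `free_unsafe`, and `free_synced` are hypothetical names, and the "device" simply queues kernels until `synchronize()` is called. It only illustrates why draining in-flight work before unmapping removes the fault.

```python
class ToyDevice:
    """Toy stand-in for a GPU: launched kernels are queued until synchronize()."""

    def __init__(self):
        self.mapped = set()   # handles of buffers that are currently mapped
        self.pending = []     # queued kernels, each recorded as the handle it touches
        self.errors = 0       # simulated illegal-address faults

    def map(self, handle):
        self.mapped.add(handle)

    def unmap(self, handle):
        # Host-side and immediate, like an unmap issued straight from a free callback.
        self.mapped.discard(handle)

    def launch(self, handle):
        # Asynchronous: the kernel is only queued here.
        self.pending.append(handle)

    def synchronize(self):
        # Run every queued kernel; touching unmapped storage counts as a fault.
        for handle in self.pending:
            if handle not in self.mapped:
                self.errors += 1
        self.pending.clear()


def free_unsafe(dev, handle):
    """Unmap immediately, even if kernels are still queued against the buffer."""
    dev.unmap(handle)


def free_synced(dev, handle):
    """Drain in-flight work first, then unmap (the shape of the one-line fix)."""
    dev.synchronize()
    dev.unmap(handle)


def run(free_fn):
    dev = ToyDevice()
    dev.map("pool_tensor")
    dev.launch("pool_tensor")     # kernel queued, not yet executed
    free_fn(dev, "pool_tensor")   # host-side refcount reaches zero
    dev.synchronize()             # drain anything still queued
    return dev.errors


if __name__ == "__main__":
    print("unsafe free faults:", run(free_unsafe))
    print("synced free faults:", run(free_synced))
```

The synchronization sits only on the free path, which matches the claim that steady-state inference does not pay for it.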
We caught the unpatched bug live during validation:
```
CUDA Error: invalid argument at /build/vllm/csrc/cumem_allocator.cpp:146
```
That's the exact failure class #43020 fixes. With the patch bind-mounted in: the Q3.6-27B sleep/wake cycle ran clean (25.8 GiB VRAM released on /sleep level=1, engine alive, post-wake chat coherent), and the GLM 30-request stress test came back 30/30 PASS with 0 CUDA errors. GLM is back in rotation.
Side win: we're also submitting a generic Triton autotune shmem-budget helper upstream. It replaces hand-rolled check_shared_mem() ? [64,128] : [32,64] bucket switches with per-config precision via Triton's existing prune_configs_by={"early_config_prune": ...} hook. Zero change to the H100/H200 fast path. Submitted: vllm-project/vllm#43047
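A minimal sketch of what per-config pruning against a shared-memory budget can look like. This is not the code in #43047. `Config` is a stand-in for `triton.Config` so the snippet runs without Triton, `estimate_shmem_bytes` and `make_shmem_pruner` are hypothetical names, and the footprint formula assumes a matmul-style kernel that stages one A tile and one B tile per pipeline stage. The 101 KiB budget comes from the hardware line in this post; 227 KiB is assumed here for the H100 class.

```python
from dataclasses import dataclass

KIB = 1024


@dataclass
class Config:
    """Stand-in for triton.Config: only the fields the pruner reads."""
    kwargs: dict
    num_stages: int = 2
    num_warps: int = 4


def estimate_shmem_bytes(cfg, elem_bytes=2):
    # ASSUMPTION: one BLOCK_M x BLOCK_K tile and one BLOCK_K x BLOCK_N tile
    # are held in shared memory per pipeline stage.
    bm = cfg.kwargs["BLOCK_M"]
    bn = cfg.kwargs["BLOCK_N"]
    bk = cfg.kwargs["BLOCK_K"]
    return (bm * bk + bk * bn) * elem_bytes * cfg.num_stages


def make_shmem_pruner(budget_bytes, elem_bytes=2):
    """Build a callback with the (configs, named_args, **kwargs) shape that
    Triton's early_config_prune hook is called with."""

    def early_config_prune(configs, named_args, **kwargs):
        kept = [c for c in configs
                if estimate_shmem_bytes(c, elem_bytes) <= budget_bytes]
        if kept:
            return kept
        # Never hand the autotuner an empty list: fall back to the smallest config.
        return [min(configs, key=lambda c: estimate_shmem_bytes(c, elem_bytes))]

    return early_config_prune


CONFIGS = [
    Config({"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 64}, num_stages=3),   # 96 KiB
    Config({"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 64}, num_stages=4),   # 128 KiB
    Config({"BLOCK_M": 64, "BLOCK_N": 64, "BLOCK_K": 64}, num_stages=4),     # 64 KiB
    Config({"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 128}, num_stages=3),  # 192 KiB
]

if __name__ == "__main__":
    for name, budget in [("101 KiB", 101 * KIB), ("227 KiB", 227 * KIB)]:
        kept = make_shmem_pruner(budget)(CONFIGS, {})
        print(name, "keeps", len(kept), "of", len(CONFIGS), "configs")
```

With a real kernel the callback would be passed as `prune_configs_by={"early_config_prune": make_shmem_pruner(budget)}` to `triton.autotune`. Under a large budget nothing is pruned, so the search space on bigger parts stays the same.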
Full writeup with byte math + stress-test logs + the bind-mount overlay pattern: https://doradusresearch.ai/blog/sleep-mode-on-blackwell-part-2/
Hardware: 4× NVIDIA RTX PRO 6000 Blackwell Workstation Edition (SM_120, 95 GiB per GPU, 101 KiB per-block opt-in shmem).
Image stack documented in the writeup!
liked a model 3 months ago
Sehyo/Qwen3.5-122B-A10B-NVFP4
updated a model 5 months ago
Doradus-AI/RnJ-1-Instruct-FP8