Doradus-AI PRO

Recent Activity

commented on a paper 22 days ago

Deeper is Not Always Better: Mitigating the Alignment Tax via Confident Layer Decoding

Paper • 2606.21906 • Published 26 days ago • 25

updated a model about 1 month ago

Doradus-AI/EvoQuality-IQA-GGUF

Image-Text-to-Text • 8B • Updated Jun 13 • 338

published a model about 1 month ago

Doradus-AI/EvoQuality-IQA-GGUF

Image-Text-to-Text • 8B • Updated Jun 13 • 338

New activity in prefeitura-rio/Rio-3.5-Open-397B about 1 month ago

Gguf

👍 5

#1 opened about 1 month ago by

cdani

New activity in stepfun-ai/Step-3.7-Flash about 2 months ago

Nice~!

👍 1

#7 opened about 2 months ago by

Doradus-AI

posted an update about 2 months ago

Tonight we validated a small upstream vLLM fix that brings GLM-5.1-REAP-478B back into our consumer-Blackwell rotation pool.

Sleep/wake on 4× RTX PRO 6000 (SM_120) had a CuMemAllocator race that retired GLM in April: cuMemUnmap runs synchronously from the host the moment a pool-backed tensor's refcount hits zero, but kernels can still be in flight against that storage. The result was accumulating CUDA_ERROR_ILLEGAL_ADDRESS errors until the engine became unrecoverable.

vllm-project/vllm#43020 is a one-line torch.cuda.synchronize() at the top of _python_free_callback. Steady-state inference is unaffected (only cumem frees pay the cost).
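
To make the ordering problem concrete, here is a toy, pure-Python model of the race. It is a sketch for illustration only, not vLLM's allocator code: ToyDevice, its methods, and run() are hypothetical names, and a simple queue stands in for asynchronous kernel execution. It shows why draining outstanding work before the unmap removes the illegal access.

```python
from collections import deque


class ToyDevice:
    """Toy stand-in for a GPU stream: kernels are queued, then run later."""

    def __init__(self):
        self.mapped = set()        # names of buffers that are still mapped
        self.in_flight = deque()   # queued kernels, each reading one buffer
        self.illegal_accesses = 0  # kernels that ran against unmapped memory

    def launch(self, buffer):
        # Asynchronous launch: the host returns before the kernel runs.
        self.in_flight.append(buffer)

    def synchronize(self):
        # Drain the queue, like waiting for all outstanding device work.
        while self.in_flight:
            buffer = self.in_flight.popleft()
            if buffer not in self.mapped:
                self.illegal_accesses += 1

    def free_callback(self, buffer, sync_first):
        # Host-side free: unmaps immediately unless we synchronize first.
        if sync_first:
            self.synchronize()
        self.mapped.discard(buffer)


def run(sync_first):
    """Return how many kernels touched unmapped memory in one free cycle."""
    dev = ToyDevice()
    dev.mapped.add("kv_cache")
    dev.launch("kv_cache")                     # kernel still pending
    dev.free_callback("kv_cache", sync_first)  # refcount hit zero on the host
    dev.synchronize()                          # pending work finally executes
    return dev.illegal_accesses
```

Calling run(False) models the unpatched path, where the unmap lands while a kernel is still queued; run(True) models synchronizing at the top of the free callback. The cost of the extra synchronize is paid only inside free_callback, which mirrors the claim that only cumem frees are affected.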

We caught the unpatched bug live during validation:

CUDA Error: invalid argument at /build/vllm/csrc/cumem_allocator.cpp:146

That's the exact failure class #43020 fixes. With the fix bind-mounted in: Q3.6-27B sleep/wake cycle clean (25.8 GiB VRAM released on /sleep level=1, engine alive, post-wake chat coherent), GLM 30-request stress test 30/30 PASS, 0 CUDA errors. Back into rotation.

Side win: we also submitted a generic Triton autotune shmem-budget helper upstream. It replaces hand-rolled check_shared_mem() ? [64,128] : [32,64] bucket switches with per-config precision via Triton's existing prune_configs_by={"early_config_prune": ...} hook. Zero change to the H100/H200 fast path. Submitted: vllm-project/vllm#43047
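
As a rough illustration of per-config pruning against a shared-memory budget, the sketch below filters candidate tile configs by an estimated footprint. It is not the code in the PR. Config is a stand-in dataclass for triton.Config, the cost model (one A tile plus one B tile per pipeline stage, 2-byte elements) is an assumption, and estimate_shmem_bytes and prune_by_shmem are hypothetical names. The function takes (configs, named_args, **kwargs) on the assumption that this matches how Triton calls an early_config_prune hook.

```python
from dataclasses import dataclass

SHMEM_BUDGET_BYTES = 101 * 1024  # 101 KiB per-block opt-in limit on SM_120


@dataclass
class Config:
    """Stand-in for triton.Config: tile sizes in kwargs plus num_stages."""
    kwargs: dict
    num_stages: int = 2


def estimate_shmem_bytes(cfg, elem_bytes=2):
    # Assumed cost model: one A tile and one B tile per pipeline stage.
    m = cfg.kwargs["BLOCK_M"]
    n = cfg.kwargs["BLOCK_N"]
    k = cfg.kwargs["BLOCK_K"]
    return (m * k + k * n) * elem_bytes * cfg.num_stages


def prune_by_shmem(configs, named_args, **kwargs):
    # Keep only configs whose estimated footprint fits the budget.
    return [c for c in configs if estimate_shmem_bytes(c) <= SHMEM_BUDGET_BYTES]


candidates = [
    Config({"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 64}, num_stages=3),
    Config({"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 64}, num_stages=4),
    Config({"BLOCK_M": 64, "BLOCK_N": 64, "BLOCK_K": 32}, num_stages=4),
    Config({"BLOCK_M": 128, "BLOCK_N": 256, "BLOCK_K": 64}, num_stages=3),
]
kept = prune_by_shmem(candidates, named_args={})
```

Under this cost model the first config needs 98,304 bytes and fits the 103,424-byte budget, while the same tile with four stages needs 131,072 bytes and is pruned. A coarse bucket switch would keep or drop both together. On a device with a larger budget every candidate passes the filter, so the candidate list is unchanged there.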

Full writeup with byte math + stress-test logs + the bind-mount overlay pattern: https://doradusresearch.ai/blog/sleep-mode-on-blackwell-part-2/

Hardware: 4× NVIDIA RTX PRO 6000 Blackwell Workstation Edition (SM_120, 95 GiB per GPU, 101 KiB per-block opt-in shmem).

Image stack documented in the writeup!