AzeezIsh committed on May 6

Commit

e19a44c

1 Parent(s): d7150f4

Atlas Inference org card

- Static-SDK markdown landing page linking to atlasinference.io,
GitHub (Avarok-Cybersecurity/atlas), Docker Hub, and Discord
- Demo video (atlas-demo.mov) as hero, tracked via LFS
- Launch announcement link to X
- Tagline: Pure Rust LLM Inference.

Files changed (3)

.gitattributes +3 -0
README.md +156 -5
atlas-demo.mov +3 -0

.gitattributes CHANGED

@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+*.mov filter=lfs diff=lfs merge=lfs -text
+*.mp4 filter=lfs diff=lfs merge=lfs -text
+*.webm filter=lfs diff=lfs merge=lfs -text
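
To see which files the three new patterns route through LFS, here is a minimal sketch. It approximates gitattributes matching with `fnmatchcase` on the file's basename, which is adequate for simple `*.ext` patterns but is not Git's full matching logic. The helper name `is_lfs_tracked` is hypothetical.

```python
from fnmatch import fnmatchcase
from posixpath import basename

# The three patterns added to .gitattributes in this commit.
LFS_PATTERNS = ["*.mov", "*.mp4", "*.webm"]


def is_lfs_tracked(path: str) -> bool:
    """Approximate check: does the basename match one of the new patterns?

    Patterns without a slash apply to the file name in any directory,
    so only the basename is compared. This is a simplification of
    Git's real attribute matching.
    """
    name = basename(path)
    return any(fnmatchcase(name, pattern) for pattern in LFS_PATTERNS)


for candidate in ["atlas-demo.mov", "README.md", "media/clip.webm"]:
    print(candidate, is_lfs_tracked(candidate))
```

On a real checkout, `git check-attr filter -- atlas-demo.mov` is the authoritative way to confirm the attribute.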

README.md CHANGED

@@ -1,10 +1,161 @@
 ---
-title: README
-emoji: 🐠
-colorFrom: purple
+title: Atlas Inference
+emoji: 🚀
+colorFrom: red
 colorTo: yellow
-sdk: docker
+sdk: static
 pinned: false
+license: agpl-3.0
+short_description: Pure Rust LLM Inference.
 ---
-Edit this `README.md` markdown file to author your organization card.
+<p align="center">
+  <video src="https://huggingface.co/spaces/Atlas-Inference/README/resolve/main/atlas-demo.mov" controls muted playsinline width="820"></video>
+</p>
+<p align="center">
+  <a href="https://x.com/AIshaqui81766/status/2052121270506930276"><strong>📣 Read the launch announcement on X →</strong></a>
+</p>
+<p align="center">
+  <h1 align="center">Atlas Inference</h1>
+  <p align="center">
+    <strong>Pure Rust LLM Inference.</strong>
+  </p>
+  <p align="center">
+    <a href="https://atlasinference.io"><img alt="Website" src="https://img.shields.io/badge/web-atlasinference.io-orange?style=flat-square"></a>
+    <a href="https://github.com/Avarok-Cybersecurity/atlas"><img alt="GitHub" src="https://img.shields.io/badge/source-github-181717?style=flat-square&logo=github&logoColor=white"></a>
+    <a href="https://hub.docker.com/r/avarok/atlas-gb10"><img alt="Docker Hub" src="https://img.shields.io/badge/Docker%20Hub-avarok%2Fatlas--gb10-2496ED?style=flat-square&logo=docker&logoColor=white"></a>
+    <a href="https://discord.gg/DwF3brBMpw"><img alt="Discord" src="https://img.shields.io/badge/community-discord-5865F2?style=flat-square&logo=discord&logoColor=white"></a>
+    <a href="https://github.com/Avarok-Cybersecurity/atlas/blob/master/LICENSE"><img alt="License: AGPLv3" src="https://img.shields.io/badge/license-AGPLv3-yellow?style=flat-square"></a>
+  </p>
+</p>
+---
+## What is Atlas?
+Atlas is a from-scratch Rust + CUDA inference engine built for the next decade of LLM deployment. No Python interpreter. No PyTorch. No 20 GB Docker image. One ~2.5 GB binary that boots in under two minutes and pins the bandwidth ceiling on every supported (Hardware × Model × Quantization) target.
+We started on NVIDIA's DGX Spark (GB10 / SM121) with twelve hand-tuned model targets and a plug-and-play architecture designed so AMD, Intel, and Apple Silicon can land as community contributions, and so the next round of model families slot in the same way the Qwens did this quarter.
+## Why Atlas
+|                              | Atlas                | vLLM (same hardware) |
+| ---------------------------- | -------------------- | -------------------- |
+| Image size                   | **~2.5 GB**          | 20+ GB               |
+| Cold start                   | **<2 min**           | ~10 min              |
+| Runtime                      | **Rust + CUDA**      | Python + PyTorch     |
+| Dependencies                 | **None**             | 200+ packages        |
+| Peak Qwen3.5-35B (NVFP4)     | **130 tok/s**        | ~38 tok/s            |
+| Average across workloads     | **111 tok/s (3.0×)** | 37 tok/s             |
+Same hardware. Same model weights. Bring your own benchmark — `scripts/sweep_all_models.sh` is in the repo and we publish the vLLM baseline command alongside ours so you can verify both. If you reproduce a faster vLLM number, file an issue. We would rather be measured than congratulated.
+## Quick Start
+```bash
+docker pull avarok/atlas-gb10:latest
+docker run --gpus all --ipc=host -p 8888:8888 \
+  -v ~/.cache/huggingface:/root/.cache/huggingface \
+  avarok/atlas-gb10:latest \
+  serve Sehyo/Qwen3.5-35B-A3B-NVFP4 --speculative --mtp-quantization nvfp4
+```
+Anything OpenAI- or Anthropic-compatible — `curl`, the OpenAI SDK, opencode, Claude Code, Cline, Open WebUI — points at port 8888:
+```bash
+curl http://localhost:8888/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model":"atlas","messages":[{"role":"user","content":"Hello!"}],"max_tokens":256}'
+```
+Per-model recipes (vision, MoE, multi-node EP=2, single-GPU 122B with the tighter budget) live in [`QUICKSTART.md`](https://github.com/Avarok-Cybersecurity/atlas/blob/master/QUICKSTART.md).
+## What Ships Today
+Twelve hand-tuned (Hardware × Model × Quantization) targets across Qwen3 / Qwen3.5 / Qwen3.6 / Qwen3-Next / Qwen3-VL / Gemma-4 / Mistral / MiniMax / Nemotron-H families. Every supported model runs off one multi-model binary; the right kernel set is selected at startup from the model's `config.json`.
+| Model                       | Params / active     | Quant       | Architecture                  | Throughput      |
+| --------------------------- | ------------------- | ----------- | ----------------------------- | --------------- |
+| Qwen3.5-35B-A3B (MTP K=2)   | 35B / 3B            | NVFP4       | GDN + Attention + MoE         | **~130 tok/s**  |
+| Qwen3-VL-30B-A3B            | 30B / 3B            | NVFP4       | Vision + Attention + MoE      | ~97 tok/s       |
+| Nemotron-3-Nano-30B-A3B     | 30B / 3.5B          | NVFP4 / FP8 | Mamba-2 + Attention + MoE     | ~88 tok/s       |
+| Qwen3-Next-80B-A3B          | 80B / 3B            | NVFP4       | SSM + Attention + MoE         | ~74–87 tok/s    |
+| Qwen3.6-35B-A3B             | 35B / 3B            | FP8         | GDN + Attention + MoE + ViT   | ~71 tok/s       |
+| Gemma-4-26B-A4B             | 26B / 4B            | NVFP4       | Attention + MoE (GeGLU)       | ~67 tok/s       |
+| Qwen3.5-122B-A10B (EP=2)    | 122B / 10B          | NVFP4       | GDN + Attention + MoE         | ~46 tok/s       |
+| Mistral-Small-4-119B        | 119B / 6.5B         | NVFP4       | MLA + MoE                     | ~33 tok/s       |
+| Nemotron-3-Super-120B-A12B  | 120B / 12B          | NVFP4 / FP8 | Mamba-2 + Attention + MoE     | ~24 tok/s       |
+| MiniMax-M2.7 (EP=2)         | 229B / ~10B         | NVFP4       | Attention + 256-expert MoE    | ~15 tok/s       |
+| Qwen3.5-27B (dense hybrid)  | 27B                 | NVFP4       | Hybrid SSM + Attention        | ~13 tok/s       |
+| Gemma-4-31B                 | 31B                 | NVFP4       | Attention (sliding + full)    | ~9–11 tok/s     |
+Full HuggingFace IDs, methodology, and the kernel-by-kernel comparison against PyTorch eager live in the [GitHub README](https://github.com/Avarok-Cybersecurity/atlas#readme).
+## What Works Today
+| Component | Status |
+|---|---|
+| OpenAI- and Anthropic-compatible HTTP API (streaming + non-streaming) | ✅ |
+| Tool calling (Hermes, Qwen3-Coder, Mistral formats) with grammar-constrained decoding | ✅ |
+| Reasoning / thinking tokens with budget cap | ✅ |
+| Concurrent batched decode + per-batch CUDA graphs | ✅ |
+| MTP speculative decoding (K=2, pipelined verify) | ✅ |
+| Prefix caching via radix tree (RadixAttention) + SSM snapshot cache (Marconi) — 10× warm-cache TTFT | ✅ |
+| KV cache dtypes — BF16, FP8, NVFP4, turbo3, turbo4 | ✅ |
+| MoE routing up to 512 experts | ✅ |
+| Vision encoder (Qwen3-VL, Qwen3.6 ViT) | ✅ |
+| Multi-GPU expert parallelism (EP=2 over RoCEv2) | ✅ |
+| SLO-aware scheduling, chunked prefill, active context compaction | ✅ |
+| High-speed NVMe KV swap (sliding-window aware) | ✅ |
+| Auto OOM pre-flight + UVM fallback on host OOM | ✅ |
+## Plug & Play Architecture
+Atlas is built around a small set of Rust traits and a kernel registry. Each entry marked with 🔌 below is an abstraction boundary where a new integration plugs in without touching anything above or below it:
+| Plug Point | What It Abstracts | To Add Support |
+|---|---|---|
+| 🔌 `trait ModelWeightLoader` | HuggingFace → layer translation | Implement one struct + add a match arm in `factory.rs` |
+| 🔌 `trait TransformerLayer` | Per-layer compute (attn, SSM, MoE, FFN) | Compose existing primitives or implement a new layer type |
+| 🔌 `trait GpuBackend` | All GPU memory and kernel ops | Swap CUDA for another accelerator backend |
+| 🔌 `kernels/<hw>/<model>/<quant>/` | Hardware-tuned CUDA kernels | Drop a directory with `MODEL.toml` + `.cu` files; `build.rs` auto-discovers it |
+| 🔌 `trait CommBackend` | Multi-GPU collectives | Implement for MPI, GDR, custom interconnects |
+| 🔌 `trait StorageBackend` | NVMe KV-cache offload I/O | Implement for CXL, RDMA, other storage tiers |
+A `MockGpuBackend` in `spark-runtime` lets you write and test the entire scaffold without owning the hardware — every layer above the GPU trait is hardware-agnostic.
+## What the Community is Saying
+> *"103 tok/s sustained on the 35B, startup in 15 seconds. Night and day compared to vLLM's 10-minute torch.compile cycle. Then tried the 122B, 43.8 tok/s with MTP, a 41% speedup over our vLLM hybrid, same hardware, 2-minute startup."*
+> — **ronald_15496**, [Discord #general](https://discord.gg/DwF3brBMpw)
+> *"Testing atlas-qwen3.5-35b for over an hour on a PNY DGX Spark in an agentic workflow. Super impressed. Spark is actually awesome with Atlas."*
+> — **PersonWhoThinks**, [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/1rmvxo3/)
+> *"I've grown tired of vLLM and have been hoping for something. I was really surprised and impressed. I'm so glad I bought Spark because I came across this."*
+> — **tetsuro59**, [Discord #general](https://discord.gg/DwF3brBMpw)
+## Citations
+We did not invent the kernels we ship. We picked the right ideas from the right papers, fused them together, and tuned them for one chip until they pinned the bandwidth ceiling. Direct intellectual debts: **FlashAttention-2** (Dao, 2024), **FlashAttention-4** (Shah et al., 2025), **FlashInfer** (Ye et al., MLSys 2025), **SageAttention 3** (Zhang et al., NeurIPS 2025), **LeanAttention** (Roy et al., 2024). Full references in the [GitHub README](https://github.com/Avarok-Cybersecurity/atlas#citations).
+## License & Enterprise Edition
+Atlas operates under a **dual-license** model. Both are real and intentional.
+1. **Community Edition — AGPLv3.** Free, open, copyleft. Use it on your own hardware for research, hobby, side-projects, hosted demos.
+2. **Enterprise Edition — commercial license.** Ship Atlas inside a closed-source product, run it as a SaaS backend without inheriting the AGPLv3 source-disclosure obligation, get a support relationship with the people who wrote the kernels, and prioritized model and hardware ports. Reach us via the [website](https://atlasinference.io) or Discord.
+The commercial license keeps us building Atlas full-time; the AGPL community license keeps the project honest. What is in this repository is what we run.
+---
+<p align="center">
+  <a href="https://atlasinference.io">atlasinference.io</a> ·
+  <a href="https://github.com/Avarok-Cybersecurity/atlas">GitHub</a> ·
+  <a href="https://hub.docker.com/r/avarok/atlas-gb10">Docker Hub</a> ·
+  <a href="https://discord.gg/DwF3brBMpw">Discord</a>
+</p>
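
The Quick Start `curl` call in the README above can also be made from Python's standard library. This is a minimal sketch, not an official client: the helper names `build_chat_request` and `chat` are hypothetical, and reading `choices[0].message.content` assumes the server returns the usual OpenAI chat-completions response shape, which the README implies but does not show.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8888", max_tokens=256):
    """Build the same POST request as the README's curl example."""
    payload = {
        "model": "atlas",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant text.

    Assumes an OpenAI-style response body (an assumption, see above).
    """
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the container from the Quick Start running, `print(chat("Hello!"))` would send the request to port 8888. Building the request separately from sending it keeps the payload easy to inspect without a live server.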

atlas-demo.mov ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:edcd4a7da124c271cd13e33e5656a7df32805834ed1b81ae2086248aeee08f13
+size 20198132
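
The three added lines are a Git LFS pointer, not the video itself. As a rough sketch, the pointer can be read as space-separated key/value lines; the function name `parse_lfs_pointer` and the returned dictionary keys are hypothetical, and the parser handles only the three fields shown in this diff.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each line is "<key> <value>", split on the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid value has the form "<hash algorithm>:<hex digest>".
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer content from this commit, without the diff's leading "+".
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:edcd4a7da124c271cd13e33e5656a7df32805834ed1b81ae2086248aeee08f13
size 20198132
"""

info = parse_lfs_pointer(POINTER)
print(info["hash_algo"], info["size_bytes"])
print(f"{info['size_bytes'] / 2**20:.2f} MiB")
```

The `size` field is the byte length of the real `atlas-demo.mov` stored in LFS, roughly 19 MiB here.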