JiRack 10B FP32, INT8, INT4

A fast and efficient coding assistant with a clean, modern built-in web UI.

Powered by Meta Llama 3.1 8B Instruct weights and a fully refactored architecture optimized for a 10B-scale model. The model was specifically designed for high-performance tuning with advanced quantization options.

The next version will be JiRack Ternary 10B — a highly optimized ternary model delivering exceptional speed and efficiency using Microsoft ONNX Runtime.

JiRack is a cloud-ready model that helps reduce cloud costs. It can serve as an expert model in cloud RAG deployments, with the ONNX JiRack Java server as an alternative.

Subscription: $1 per month per user (updated license for non-company use).
Corp Subscription: $3 per month per user (updated license for company use).
The model works without a subscription but shows a message about subscribing.

JiRack Android Client DEMO:
https://www.youtube.com/watch?v=SaO6Jfb8R68

CMS Manhattan RAG & Email Reply + Document and Email Analytics:
https://www.youtube.com/watch?v=KRu2nLEh_6g&t=78s

"So I don’t read my emails — I just ask JiRack to tell me the news!"

Welcome to the CMS Manhattan AI Front Office Solution.

Training the Model

The model can be trained on a single Blackwell GPU with 96 GB of VRAM.
Fine-tuning and QLoRA do not need a data center; they work well on consumer-grade GPUs.

Let me know if you need the training code!

Quick Start

Watch JiRack 10B in action and run it with Docker.

Run with Docker

Default CPU INT8

docker run -d \
  --name jirack_10b \
  -p 7869:7869 \
  --restart unless-stopped \
  cmsmanhattan/jirack_10b_int8:latest

Default CPU INT4

docker run -d \
  --name jirack_10b \
  -p 7869:7869 \
  --restart unless-stopped \
  cmsmanhattan/jirack_10b_int4:latest

Multi-core CPU (limited to 12 CPUs and 20 GB of memory)

docker run -d \
  --name jirack_10b \
  -p 7869:7869 \
  --restart unless-stopped \
  --memory=20g \
  --cpus=12 \
  cmsmanhattan/jirack_10b_int8:latest

GPU

docker run -d \
  --name jirack_10b \
  -p 7869:7869 \
  --gpus all \
  --restart unless-stopped \
  cmsmanhattan/jirack_10b_gpu_int8:latest
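
The commands above all use the same container name (jirack_10b) and the same host port (7869), so Docker will refuse to start a second variant while a container from an earlier command still exists. A minimal sketch of a helper that removes the old container before starting another image is shown here. The function name switch_jirack_image is hypothetical and not part of JiRack; the image names and options are the ones used above.

```shell
# Replace the existing JiRack container with another image variant.
# Usage: switch_jirack_image IMAGE [extra "docker run" options...]
#   switch_jirack_image cmsmanhattan/jirack_10b_int4:latest
#   switch_jirack_image cmsmanhattan/jirack_10b_gpu_int8:latest --gpus all
switch_jirack_image() {
  if [ "$#" -lt 1 ]; then
    echo "usage: switch_jirack_image IMAGE [docker run options...]" >&2
    return 2
  fi
  image="$1"
  shift

  # Remove the old container if it exists; ignore the error when it does not.
  docker rm -f jirack_10b >/dev/null 2>&1 || true

  # Same options as the commands above; extra options go before the image name.
  docker run -d \
    --name jirack_10b \
    -p 7869:7869 \
    --restart unless-stopped \
    "$@" \
    "$image"
}
```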

Docker Compose Example

services:
  jirack:
    image: cmsmanhattan/jirack_10b_int8:1.0.2
    container_name: jirack_onnx_service
    ports:
      - "7869:7869"
    volumes:
      - .:/app
      - ./web:/app/web
    environment:
      - MAX_TOKENS=1024
      - TEMPERATURE=0.7
      - TOP_P=0.9
      - DEFAULT_STREAM=False
      - INTRA_THREADS=4
      - USE_ENV_ALLOCATOR=1
    deploy:
      resources:
        limits:
          memory: 16g

Access the UI

Once the container is running, open your browser and go to:

http://localhost:7869

This opens the JiRack UI — a clean, modern web interface for chat.
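
The page may not respond until the container has finished loading the model. A minimal sketch of a readiness check with curl follows. The helper name wait_for_ui is hypothetical, and the check assumes that the UI answers a plain GET on the root path with a success status.

```shell
# Poll the UI address until it answers or the attempts run out.
# Usage: wait_for_ui [URL] [ATTEMPTS]
wait_for_ui() {
  url="${1:-http://localhost:7869}"
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -s: silent, -f: fail on HTTP error status, -o /dev/null: discard the body
    if curl -sf -o /dev/null "$url"; then
      echo "JiRack UI is reachable at $url"
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  echo "JiRack UI did not respond at $url" >&2
  return 1
}
```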

Changing the Port

The listening port can be easily changed from the Settings panel inside the JiRack Chat UI.

Licensing

The JiRack 10B model is provided under a commercial enterprise license.
All JiRack UI clients are provided under a commercial license.
UI clients can be used for free when running with the official JiRack Docker containers, as long as they are not redistributed separately.

Subscription Plans

JiRack Enterprise: $36 per user per year
JiRack Private: $12 per user per year

For commercial licensing, cluster deployment, performance tuning, or enterprise use, please contact us.

JiRack Android Chat Client (voice + Ollama API):
https://huggingface.co/kgrabko/JiRackTernary_1b/resolve/main/app-release.apk
Also available on Google Play.

JiRack Windows 11 Desktop Client (Ollama API):
https://huggingface.co/kgrabko/JiRackTernary_1b/resolve/main/jirack-chat.zip

Live email chat: support@cmsmanhattan.com

Hardware Recommendations for AMD Systems

Recommended Hardware for JiRack 10B INT8 (single Docker container)

Use Case	CPU	GPU (ROCm)	VRAM	Expected Speed	Recommendation
Recommended	Ryzen 7 7700 / 9700X	RX 7900 XTX / 7900 XT	24 GB	50–75 tokens/s	Best choice
High Performance	Ryzen 9 7950X / 9950X	RX 7900 XTX	24 GB+	65–90 tokens/s	Excellent
Enterprise	EPYC 7003/9004 series	MI300X or 2x RX 7900 XTX	48 GB+	90–140 tokens/s	For 32B model
Budget Option	Ryzen 5 7600 / 9600X	RX 7800 XT (16 GB)	16 GB	35–50 tokens/s	Acceptable

Important Memory Notes

Although the 10B INT8 model itself takes approximately 8–9 GB, we recommend at least 24 GB of VRAM to leave headroom for:

KV-cache consumption during generation, especially with long context
ONNX Runtime overhead and temporary buffers
System stability and avoiding out-of-memory errors
Support for larger context windows

Minimum recommended: 24GB VRAM (RX 7900 series)
Ideal: 24–32GB VRAM
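
The KV-cache item in the list above can be made concrete with a rough estimate. The sketch below is a small shell function, kv_cache_mib (a hypothetical name), that computes the cache size as 2 x layers x KV heads x head dimension x bytes per value x context tokens. The defaults (32 layers, 8 KV heads, head dimension 128) are the published Llama 3.1 8B values with a 2-byte FP16 cache. Both are assumptions here, because JiRack's refactored 10B architecture and its cache precision may differ.

```shell
# Estimate KV-cache size in MiB for a given context length.
# Usage: kv_cache_mib CONTEXT_TOKENS [LAYERS] [KV_HEADS] [HEAD_DIM] [BYTES_PER_VALUE]
# Defaults are ASSUMED from Llama 3.1 8B with an FP16 cache, not confirmed for JiRack.
kv_cache_mib() {
  tokens="$1"
  layers="${2:-32}"
  kv_heads="${3:-8}"
  head_dim="${4:-128}"
  bytes="${5:-2}"
  # The leading 2 counts one key tensor and one value tensor per layer.
  echo $(( 2 * layers * kv_heads * head_dim * bytes * tokens / 1024 / 1024 ))
}

kv_cache_mib 8192     # prints 1024  (1 GiB at 8K context)
kv_cache_mib 131072   # prints 16384 (16 GiB at 128K context)
```

Under these assumptions the cache grows by 128 KiB per token, so a long context can take more memory than the INT8 weights themselves.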

For pure CPU inference, we recommend at least 64GB system RAM.

I added the default model in full FP32 precision (~62 GB). This serves as the base for quantization to find the best balance between size and performance.

📧 Contact & Licensing

For joint ventures, hardware integration, or licensing inquiries:

Email: grabko@cmsmanhattan.com
Phone: +1 (516) 777-0945
Location: New York, USA
