# Stack X Ultimate

The ultimate 3B parameter model for sovereign AI deployment

Stack X Ultimate is a high-performance 3B parameter language model designed for sovereign AI deployment. It is optimized for edge computing, on-premise infrastructure, and air-gapped environments, and delivers strong performance in a compact footprint suitable for consumer hardware and enterprise deployment.
## Hardware Requirements

| Quantization | Minimum GPU | VRAM / RAM | Total Model Size |
|---|---|---|---|
| FP16 (full precision) | RTX 3060+ | ~6 GB | ~6 GB |
| Q8_0 | RTX 3060 | ~3 GB | ~3 GB |
| Q4_K_M | Any modern GPU | ~1.8 GB | ~1.8 GB |
| Q3_K_M | Integrated GPU | ~1.2 GB | ~1.2 GB |
| Q2_K | None (CPU + 8 GB RAM) | ~900 MB | ~900 MB |
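The sizes above follow from simple bits-per-weight arithmetic. The sketch below reproduces them; the bits-per-weight values are approximate averages for each GGUF scheme, not exact figures from this repository:

```python
# Approximate GGUF size: bytes ≈ parameter count × bits-per-weight / 8.
# Bits-per-weight values are rough averages per scheme (assumption).
PARAMS = 3e9  # 3B parameters

bits_per_weight = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
    "Q3_K_M": 3.9,
    "Q2_K": 2.6,
}

for name, bits in bits_per_weight.items():
    gib = PARAMS * bits / 8 / 1024**3
    print(f"{name:>7}: ~{gib:.1f} GiB")
```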
### Minimum Requirements (Q3_K and below)

- GPU: None required (CPU inference supported)
- RAM: 8GB system RAM
- Storage: 2GB+ free space
### Recommended Requirements

- GPU: NVIDIA RTX 3060 (12GB) or better
- RAM: 16GB system RAM
- Storage: 4GB+ free space for multiple quantizations
## Edge Deployment

| Platform | Quantization | Requirements |
|---|---|---|
| NVIDIA Jetson Orin | Q4_K_M | 8GB RAM, 15W TDP |
| Raspberry Pi 5 + GPU | Q2_K | 8GB RAM, external GPU |
| Apple Silicon (M1/M2/M3) | Q4_K_M | 16GB unified memory |
| Intel Arc GPU | Q4_K_M | Intel Arc A770 |
## File Sizes

| Quantization | File Size |
|---|---|
| FP16 | ~6.0 GB |
| Q8_0 | ~3.0 GB |
| Q4_K_M | ~1.8 GB |
| Q3_K_M | ~1.2 GB |
| Q2_K | ~900 MB |
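To fetch a single quantization programmatically rather than through the browser, `huggingface_hub` works as usual; the GGUF filename below is an assumption and may differ from the actual files in the repo:

```python
# Download one GGUF file from the repo; the filename is assumed —
# check the repo's file listing for the actual names.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="my-ai-stack/Stack-X-Ultimate",
    filename="stack-x-ultimate-q4_k_m.gguf",  # assumed filename
)
print(path)
```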
## Use Cases

### Best Suited Tasks

- Code Generation: Multi-language code writing, refactoring, and debugging
- Text Generation: Creative writing, documentation, content creation
- Question Answering: Information retrieval, knowledge base queries
- Summarization: Document summarization, abstract generation
- Classification: Text classification, sentiment analysis
- Translation: Cross-language text translation
- Embedded Systems: On-device AI, IoT applications
### Industries & Domains

| Industry | Use Case |
|---|---|
| Healthcare | HIPAA-compliant AI assistants, clinical documentation |
| Finance | SOC2-compliant automation, risk assessment |
| Legal | Contract analysis, case law research |
| Government | Classified environment AI, secure documentation |
| Manufacturing | Edge AI for quality control, predictive maintenance |
| Retail | On-premise customer service, inventory optimization |
| Education | Offline learning assistants, classroom AI |
## Quick Start

### Python (Transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "my-ai-stack/Stack-X-Ultimate"
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Build a chat prompt and generate a response
prompt = "Explain the concept of sovereignty in AI systems and why it matters for enterprise deployment."
messages = [
    {"role": "system", "content": "You are Stack X Ultimate, a helpful and knowledgeable AI assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.95,
        do_sample=True,
    )

# Decode only the newly generated tokens
response = tokenizer.decode(
    outputs[0][inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)
print(response)
```
### llama.cpp

```bash
# Download the GGUF model file
# Visit: https://huggingface.co/my-ai-stack/Stack-X-Ultimate/tree/main

# Run with GPU offload (all layers)
./main -m stack-x-ultimate-q4_k_m.gguf \
  -ngl 99 \
  -n 512 \
  -t 8 \
  -c 131072 \
  --temp 0.7 \
  --top-p 0.95 \
  -p "Write a Python function to implement the quicksort algorithm."

# Run on CPU only (no layers offloaded to the GPU)
./main -m stack-x-ultimate-q4_k_m.gguf \
  -ngl 0 \
  -n 512 \
  -t 8 \
  -c 131072 \
  -p "Explain the differences between sovereign AI and cloud-based AI solutions."

# Compare quantizations
./main -m stack-x-ultimate-q2_k.gguf -n 256 --temp 0.5
./main -m stack-x-ultimate-q4_k_m.gguf -n 256 --temp 0.5
./main -m stack-x-ultimate-q8_0.gguf -n 256 --temp 0.5
```
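The same GGUF files can also be driven from Python through the `llama-cpp-python` bindings instead of the CLI; a minimal sketch, assuming the Q4_K_M file has been downloaded locally:

```python
# Minimal llama-cpp-python equivalent of the CLI runs above (sketch).
from llama_cpp import Llama

llm = Llama(
    model_path="stack-x-ultimate-q4_k_m.gguf",
    n_ctx=8192,       # well below the 128K maximum to keep RAM modest
    n_gpu_layers=-1,  # offload all layers to GPU; set to 0 for CPU-only
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function to implement quicksort."}],
    max_tokens=512,
    temperature=0.7,
    top_p=0.95,
)
print(result["choices"][0]["message"]["content"])
```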
### Ollama

```bash
# Pull the model
ollama pull stack-x-ultimate

# Run with a one-shot prompt
ollama run stack-x-ultimate "Write a Python function to implement binary search."

# Sampling parameters are set inside an interactive session
# (the `ollama run` CLI has no temperature/top_p flags):
ollama run stack-x-ultimate
>>> /set parameter temperature 0.9
>>> /set parameter top_p 0.95
>>> Write a short story about an AI that becomes self-aware in an air-gapped facility.

# Or bake parameters into a reusable variant for factual answers
# and long documents. Modelfile:
#   FROM stack-x-ultimate
#   PARAMETER temperature 0.2
#   PARAMETER top_p 0.9
#   PARAMETER num_ctx 65536
ollama create stack-x-ultimate-factual -f Modelfile
ollama run stack-x-ultimate-factual "Explain quantum computing and its applications in cryptography."
ollama run stack-x-ultimate-factual "Summarize the following research paper: [PASTE TEXT]"
```
## Model Architecture

| Attribute | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-3B |
| Parameters | 3B |
| Fine-tuning | Full fine-tuning + LoRA |
| Context Length | 131,072 tokens (128K) |
| Vocabulary Size | 151,936 tokens |
| Hidden Size | 2,048 |
| Attention Heads | 16 |
| Num Key-Value Heads | 2 |
| Transformer Layers | 36 |
| Activation Function | SiLU |
| RoPE Scaling | NTK (factor: 4.0) |
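One practical consequence of grouped-query attention (2 key-value heads against 16 attention heads) is a much smaller KV cache at long context. A back-of-the-envelope estimate from the figures above:

```python
# FP16 KV-cache estimate at the full 128K context, using the table above.
layers = 36
kv_heads = 2
head_dim = 2048 // 16  # hidden size / attention heads = 128
ctx = 131_072
fp16_bytes = 2

kv_bytes = 2 * layers * kv_heads * head_dim * ctx * fp16_bytes  # keys + values
print(f"KV cache at 128K: ~{kv_bytes / 1024**3:.1f} GiB")  # ≈ 4.5 GiB
```

With full multi-head attention (16 KV heads) the same cache would be eight times larger, which is why GQA matters for long context on consumer hardware.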
## Training Details

- Base Model: Qwen2.5-3B
- Training Approach: Combined full fine-tuning + LoRA
- Fine-tuning Data: Diverse high-quality corpus
- Focus Areas: General understanding, code generation, instruction following
- Special Training: Sovereign deployment optimization, edge computing efficiency
- Context Length: 128K tokens
- License: Apache 2.0
- Release Date: April 2026
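The training code itself is not published here; for orientation, a minimal PEFT-style LoRA setup on the Qwen2.5 base would look like the following, with all hyperparameters illustrative rather than the values actually used:

```python
# Illustrative LoRA configuration (PEFT); these hyperparameters are
# examples, not the values used to train Stack X Ultimate.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B")
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()
```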
## Performance Notes

### Inference Speed (Q4_K_M)

| Device | Tokens/sec | Latency (512 tokens) |
|---|---|---|
| RTX 4090 | ~55 | ~9.3s |
| RTX 3090 | ~42 | ~12.2s |
| RTX 3060 | ~25 | ~20.5s |
| Apple M2 Pro | ~35 | ~14.6s |
| CPU (i9-13900K) | ~10 | ~51.2s |
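Throughput depends on backend, context length, and batch size; a crude way to reproduce a tokens/sec figure with the Transformers setup from the Quick Start (assumes `model`, `tokenizer`, and `inputs` are already in scope):

```python
# Crude tokens/sec measurement using the Quick Start objects (assumed in scope).
import time

start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/sec over {elapsed:.1f}s")
```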
## Deployment Scenarios

### Single User (Interactive)

```python
config = {
    "max_new_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.95,
    "batch_size": 1,
}
```

### Multi-User (Server)

```python
config = {
    "max_new_tokens": 256,
    "temperature": 0.5,
    "top_p": 0.9,
    "batch_size": 4,
    "use_kv_cache": True,
}
```

### Offline/Edge

```python
config = {
    "max_new_tokens": 128,
    "temperature": 0.3,
    "top_p": 0.85,
    "quantization": "q4_k_m",
}
```
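These presets are plain dictionaries: `max_new_tokens`, `temperature`, and `top_p` map directly onto `generate()` keyword arguments, while `batch_size`, `use_kv_cache`, and `quantization` are deployment-level choices handled outside the call. A hypothetical helper that applies a preset:

```python
# Hypothetical helper: filter a preset down to generate() kwargs.
GENERATION_KEYS = {"max_new_tokens", "temperature", "top_p"}

def generate_with_preset(model, inputs, preset):
    gen_kwargs = {k: v for k, v in preset.items() if k in GENERATION_KEYS}
    if "temperature" in gen_kwargs:
        gen_kwargs["do_sample"] = True  # sampling params only apply when sampling
    return model.generate(**inputs, **gen_kwargs)
```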
## Security & Sovereignty

Stack X Ultimate is designed for secure, sovereign deployment:
- Air-Gapped Operation: No internet connection required
- Data Privacy: All data stays within your infrastructure
- Compliance Ready: SOC2, HIPAA, GDPR compatible
- Audit Trail: Full inference logging capabilities
- On-Premise Only: No cloud dependencies
### Enterprise Security Features

| Feature | Description |
|---|---|
| VPC Deployment | Deploy within your private network |
| TLS/SSL | Encrypted communication |
| Authentication | OAuth2, LDAP, SSO support |
| Rate Limiting | Prevent abuse and overuse |
| Audit Logging | Complete inference history |
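Audit logging is an application-side concern rather than a feature of the model weights; a minimal sketch that records one JSON line per inference, hashing prompts and responses so the log itself never leaks data:

```python
# Minimal application-side audit trail (illustrative, not shipped with the model).
import hashlib
import json
import time

def log_inference(prompt, response, path="audit.jsonl", user="anonymous"):
    record = {
        "timestamp": time.time(),
        "user": user,
        # Hashes allow tamper-evident auditing without storing raw text.
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "response_sha256": hashlib.sha256(response.encode()).hexdigest(),
        "prompt_chars": len(prompt),
        "response_chars": len(response),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```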
## Limitations

- Model Size: At 3B parameters, less capable than larger models for complex reasoning
- Specialized Tasks: May require fine-tuning for domain-specific tasks
- Multi-modal: Text-only; does not support images or audio
- Hallucinations: May occasionally generate incorrect information; verification recommended
## Quick Links

- GitHub Repository
- HuggingFace Organization
- Model Hub
- Documentation
- Discord Community
- Enterprise Contact
## Citation

```bibtex
@misc{stackxultimate2026,
  author = {Walid Sobhi},
  title = {Stack X Ultimate: 3B Parameter Model for Sovereign AI Deployment},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/my-ai-stack/Stack-X-Ultimate}
}
```
## Evaluation Results

- pass@k (self-reported): 0.880

Built with love for developers

Discord · GitHub · HuggingFace