Stack 4.0 Qwen 3B Agentic
Fine-tuned 3B parameter model optimized for tool-calling, RAG, and multi-step agentic workflows
Stack 4.0 Qwen 3B Agentic is a fine-tuned version of Qwen2.5-Coder-3B, specialized for agentic AI workflows. It excels at function calling, tool use, multi-turn conversation, and autonomous task execution, and is designed for regulated environments that require sovereign AI deployment.
Hardware Requirements
| Quantization | GPU Required | VRAM | Total Model Size |
| --- | --- | --- | --- |
| FP16 (full precision) | RTX 3060+ | ~6 GB | ~6 GB |
| Q8_0 | RTX 3060 | ~3 GB | ~3 GB |
| Q4_K_M | Any modern GPU | ~1.8 GB | ~1.8 GB |
| Q3_K_M | Integrated GPU | ~1.2 GB | ~1.2 GB |
| Q2_K | CPU + 8 GB RAM | ~900 MB | ~900 MB |
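To fetch a single quantization rather than the whole repository, the `huggingface_hub` client can download one GGUF file at a time. A minimal sketch; the exact filename is assumed to follow the pattern used in the llama.cpp example below, so check the repo's file list:

from huggingface_hub import hf_hub_download

# Download only the Q4_K_M file (~1.8 GB); the filename is an assumption,
# verify it against the repository's file listing.
path = hf_hub_download(
    repo_id="my-ai-stack/Stack-4.0-Qwen-3B-Agentic",
    filename="stack-4.0-qwen-3b-agentic-q4_k_m.gguf",
)
print(path)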
Minimum Requirements (Q3_K and below)
- GPU: None required (CPU inference supported)
- RAM: 8GB system RAM
- Storage: 2GB+ free space
Recommended Requirements
- GPU: NVIDIA RTX 3060 (12GB) or better
- RAM: 16GB system RAM
- Storage: 4GB+ free space for multiple quantizations
Use Cases
Best Suited Tasks
- Tool-Calling Agents: Autonomous agents that call external functions and APIs
- RAG Systems: Retrieval-augmented generation with context-aware tool selection
- Multi-Step Reasoning: Complex tasks requiring planning and sequential execution
- Code Assistance: Code generation, debugging, and refactoring
- Conversation Agents: Multi-turn dialog with state management
- Workflow Automation: Task orchestration and process automation
Industries & Domains
| Industry | Use Case |
| --- | --- |
| Software Development | AI coding assistants, automated code review |
| Customer Support | Autonomous support agents, ticket routing |
| Data Analysis | Data pipeline automation, report generation |
| DevOps | Infrastructure automation, CI/CD optimization |
| Legal | Document automation, case research |
| Healthcare | Clinical decision support, appointment scheduling |
| Finance | Portfolio management, fraud detection |
Quick Start
Python (Transformers)
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "my-ai-stack/Stack-4.0-Qwen-3B-Agentic"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

# Tool schemas follow the OpenAI function-calling format.
tool_schema = [
    {
        "type": "function",
        "function": {
            "name": "search_code",
            "description": "Search for code patterns in the repository",
            "parameters": {
                "type": "object",
                "properties": {
                    "pattern": {"type": "string", "description": "Regex pattern to search"},
                    "path": {"type": "string", "description": "Directory path to search"}
                },
                "required": ["pattern"]
            }
        }
    }
]

prompt = "Search for all functions containing 'async' in the src directory."

messages = [
    {"role": "system", "content": "You are Stack 4.0, an agentic AI assistant with tool-calling capabilities."},
    {"role": "user", "content": prompt}
]

# Pass the tool schemas so the chat template renders them into the prompt.
text = tokenizer.apply_chat_template(
    messages,
    tools=tool_schema,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.2,
        top_p=0.95,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(
    outputs[0][inputs.input_ids.shape[1]:],
    skip_special_tokens=True
)
print(response)
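Qwen-family models conventionally wrap tool invocations in `<tool_call>` tags containing a JSON object with `name` and `arguments` fields; those tags are special tokens, so decode with `skip_special_tokens=False` when you need to recover them. A minimal parsing sketch, where `run_search_code` is a hypothetical local implementation of the `search_code` tool:

import json
import re

# Re-decode without stripping special tokens so the <tool_call> markers survive.
raw = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)

def run_search_code(pattern, path="."):
    """Hypothetical local implementation of the search_code tool."""
    ...

match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", raw, re.DOTALL)
if match:
    call = json.loads(match.group(1))  # e.g. {"name": "search_code", "arguments": {...}}
    if call["name"] == "search_code":
        result = run_search_code(**call["arguments"])
        # Return the result to the model for the next turn via the "tool" role.
        messages.append({"role": "tool", "content": json.dumps(result)})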
llama.cpp
(Newer llama.cpp builds ship the CLI as `llama-cli`; the examples below use the older `./main` binary name.)
./main -m stack-4.0-qwen-3b-agentic-q4_k_m.gguf \
-n 512 \
-t 8 \
-c 131072 \
--temp 0.2 \
--top-p 0.95 \
-p "Write a Python function that searches for code patterns using regex."
To constrain output to valid tool arguments, llama.cpp can enforce a JSON schema at generation time:

./main -m stack-4.0-qwen-3b-agentic-q4_k_m.gguf \
  --json-schema '{
    "type": "object",
    "properties": {
      "search": {
        "type": "object",
        "properties": {
          "pattern": {"type": "string"},
          "path": {"type": "string"}
        }
      }
    }
  }' \
  -p "Produce search_code arguments for finding async functions in ./src."
Ollama
ollama pull stack-4.0-qwen-3b-agentic
ollama run stack-4.0-qwen-3b-agentic "Search for all async functions in the src directory."

Sampling parameters and context size are not `ollama run` flags; set them per request through the REST API (or persistently with PARAMETER lines in a Modelfile):

curl http://localhost:11434/api/generate -d '{
  "model": "stack-4.0-qwen-3b-agentic",
  "prompt": "Create a Python script that implements a multi-step data pipeline with error handling.",
  "options": {"temperature": 0.1, "top_p": 0.9, "num_ctx": 131072}
}'

For tool calling, use the chat endpoint with a `tools` array in OpenAI function-calling format:

curl http://localhost:11434/api/chat -d '{
  "model": "stack-4.0-qwen-3b-agentic",
  "messages": [{"role": "user", "content": "Find all function definitions in ./src."}],
  "tools": [{"type": "function", "function": {"name": "search_code", "parameters": {"type": "object", "properties": {"pattern": {"type": "string"}, "path": {"type": "string"}}, "required": ["pattern"]}}}],
  "stream": false
}'
Agentic Capabilities
Stack 4.0 Qwen 3B Agentic is specifically trained for autonomous agent workflows:
Tool Calling
- Native function calling with structured JSON output
- Support for tool schemas in OpenAI format
- Multi-tool selection and chaining
Multi-Step Reasoning
- Plan-and-execute workflows
- Intermediate step tracking
- Self-correction on failure (a minimal driver loop combining these is sketched below)
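These behaviors compose into a small driver loop. The sketch below is illustrative rather than the model's internal mechanism; `generate_step`, `parse_tool_call`, and `dispatch` are hypothetical helpers wrapping the quick-start generation call, the `<tool_call>` parsing shown earlier, and the tool registry shown under Available Tools:

import json

def generate_step(messages):
    """Hypothetical wrapper around the quick-start generate/decode calls."""
    ...

def parse_tool_call(text):
    """Hypothetical wrapper around the <tool_call> parsing above; returns None if no call."""
    ...

def dispatch(name, arguments):
    """Hypothetical tool lookup; see the registry sketch under Available Tools."""
    ...

MAX_STEPS = 8  # hard cap so a confused agent cannot loop forever

messages = [{"role": "user", "content": "Refactor every async function in ./src."}]
for _ in range(MAX_STEPS):
    reply = generate_step(messages)
    messages.append({"role": "assistant", "content": reply})
    call = parse_tool_call(reply)
    if call is None:          # no tool call means the plan is complete
        print(reply)
        break
    try:
        result = dispatch(call["name"], call["arguments"])
    except Exception as exc:  # feed errors back so the model can self-correct
        result = {"error": str(exc)}
    messages.append({"role": "tool", "content": json.dumps(result)})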
Available Tools (72+ Built-in)
A representative subset of the built-in tool set:

| Category | Tools |
| --- | --- |
| File Operations | file_read, file_write, file_edit, file_delete |
| Code Search | grep, glob, grep_count |
| Task Management | task_create, task_list, task_update, task_delete |
| Agent Orchestration | agent_spawn, team_create, team_assign |
| Web Operations | web_search, web_fetch |
| Scheduling | cron_create, cron_list |
| Skills | skill_execute, skill_chain |
| Messaging | message_send, message_channel |
| MCP Integration | mcp_call, mcp_list_servers |
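On the application side, these tool names map naturally onto a dispatch table. A minimal sketch with placeholder implementations for two of the tools above (the bodies here are illustrative, not the shipped implementations):

import subprocess
from pathlib import Path

def file_read(path):
    """Return the contents of a file (placeholder for the built-in file_read tool)."""
    return Path(path).read_text()

def grep(pattern, path="."):
    """Recursive text search via the system grep (placeholder for the built-in grep tool)."""
    proc = subprocess.run(["grep", "-rn", pattern, path], capture_output=True, text=True)
    return proc.stdout

TOOL_REGISTRY = {"file_read": file_read, "grep": grep}

def dispatch(name, arguments):
    """Call the tool the model selected, with the arguments it produced."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    return TOOL_REGISTRY[name](**arguments)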
Model Architecture
| Attribute | Value |
| --- | --- |
| Base Model | Qwen/Qwen2.5-Coder-3B |
| Parameters | 3B |
| Fine-tuning | LoRA (Rank 8) |
| Context Length | 131,072 tokens (128K) |
| Vocabulary Size | 151,936 tokens |
| Hidden Size | 1,536 |
| Attention Heads | 12 |
| Key/Value Heads | 2 |
| Transformer Layers | 28 |
| Activation Function | SiLU |
| RoPE Scaling | NTK (factor: 4.0) |
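Once the model is downloaded, these values can be checked programmatically; a quick sketch using the standard Transformers config attributes for Qwen2-family models:

from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "my-ai-stack/Stack-4.0-Qwen-3B-Agentic",
    trust_remote_code=True,
)
print(config.hidden_size)              # hidden size
print(config.num_attention_heads)      # attention heads
print(config.num_key_value_heads)      # key/value heads (GQA)
print(config.num_hidden_layers)        # transformer layers
print(config.vocab_size)               # vocabulary size
print(config.max_position_embeddings)  # maximum context length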
Training Details
- Base Model: Qwen2.5-Coder-3B
- Training Method: LoRA (Low-Rank Adaptation)
- LoRA Rank: 8
- LoRA Alpha: 16
- Target Modules: All linear layers (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj)
- Training Data: Multi-turn tool conversations, function-calling examples, enterprise workflow patterns
- Focus Areas: Tool selection, function arguments, multi-step planning
- Context Length: 128K tokens
- License: Apache 2.0
- Release Date: April 2026
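For reproduction, the stated hyperparameters correspond to a peft configuration along these lines (a sketch; dropout and task type are assumptions the card does not specify):

from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                     # LoRA rank, as stated above
    lora_alpha=16,           # LoRA alpha, as stated above
    target_modules=[         # all linear layers listed in the card
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,       # assumption: not specified in the card
    task_type="CAUSAL_LM",   # assumption: standard for decoder-only fine-tunes
)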
Performance Notes
Inference Speed (Q4_K_M)
| GPU | Tokens/sec |
| --- | --- |
| RTX 4090 | ~45 |
| RTX 3090 | ~35 |
| RTX 3060 | ~20 |
| CPU (i9-13900K) | ~8 |
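To reproduce these figures on your own hardware, time one generation and divide by the number of new tokens; a minimal sketch reusing the `model` and `tokenizer` objects from the quick start:

import time
import torch

bench_inputs = tokenizer(["Write a Python function that parses JSON."], return_tensors="pt").to(model.device)
if torch.cuda.is_available():
    torch.cuda.synchronize()  # start timing only after pending GPU work finishes
start = time.perf_counter()
with torch.no_grad():
    out = model.generate(**bench_inputs, max_new_tokens=256, do_sample=False)
if torch.cuda.is_available():
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start
new_tokens = out.shape[1] - bench_inputs.input_ids.shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")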
Memory Usage During Inference
# Settings that keep peak memory low for single-request inference.
config = {
    "batch_size": 1,               # one request at a time
    "use_kv_cache": True,          # reuse attention keys/values across decode steps
    "max_new_tokens": 512,         # bounds KV-cache growth
    "torch_dtype": torch.float16,  # half precision halves weight memory vs FP32
}
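Actual peak usage is easy to verify with PyTorch's CUDA allocator statistics (reusing the `model` and `bench_inputs` objects from the timing sketch above):

import torch

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model.generate(**bench_inputs, max_new_tokens=512)
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak VRAM: {peak_gb:.2f} GB")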
Limitations
- Model Size: At 3B parameters, less capable than larger models for complex reasoning
- Training Data: Optimized for English; other languages may have reduced quality
- Tool Accuracy: May occasionally call incorrect tools; verification recommended
- Long Context: Performance may degrade beyond 64K tokens in some scenarios
Citation
@misc{stack-4-0-qwen-3b-agentic,
  author = {Walid Sobhi},
  title = {Stack 4.0 Qwen 3B Agentic: Fine-tuned for Tool-Calling and Agentic Workflows},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/my-ai-stack/Stack-4.0-Qwen-3B-Agentic}
}
Built with love for developers
Discord · GitHub · HuggingFace