# Stack 2.9 Inference API Documentation

REST API for code generation using the Stack 2.9 fine-tuned Qwen model.

## Quick Start

### 1. Install Dependencies

```bash
pip install -r requirements_api.txt
pip install -r requirements.txt  # Core dependencies (transformers, torch, etc.)
```

### 2. Set Model Path

```bash
# Option A: Environment variable
export MODEL_PATH=/path/to/your/merged/model

# Option B: Inline variable for a single command
MODEL_PATH=/path/to/model uvicorn inference_api:app --port 8000
```

### 3. Start the Server

```bash
# Basic usage
uvicorn inference_api:app --host 0.0.0.0 --port 8000

# With auto-reload (development)
uvicorn inference_api:app --reload --port 8000

# Using Python directly
python inference_api.py
```

### 4. Verify It's Running

```bash
curl http://localhost:8000/health
```

Expected response:

```json
{
  "status": "healthy",
  "model_loaded": true,
  "model_path": "base_model_qwen7b",
  "device": "cuda",
  "cuda_available": true
}
```

---

## API Endpoints

### `GET /health`

Health check endpoint to verify API and model status.

**Response:**

```json
{
  "status": "healthy",
  "model_loaded": true,
  "model_path": "/path/to/model",
  "device": "cuda",
  "cuda_available": true
}
```

---

### `GET /model-info`

Get information about the currently loaded model.

**Response:**

```json
{
  "model_path": "/path/to/model",
  "device": "cuda:0",
  "dtype": "torch.float16"
}
```

---

### `POST /generate`

Generate code completion for a prompt.


**Request Body:**

```json
{
  "prompt": "def two_sum(nums, target):\n    \"\"\"Return indices of two numbers that add up to target.\"\"\"\n",
  "max_tokens": 128,
  "temperature": 0.2,
  "top_p": 0.95,
  "do_sample": true,
  "repetition_penalty": 1.1,
  "num_return_sequences": 1
}
```

**Parameters:**

| Parameter | Type | Default | Range | Description |
|-----------|------|---------|-------|-------------|
| `prompt` | string | required | - | Input prompt to complete |
| `max_tokens` | int | 512 | 1-4096 | Maximum tokens to generate |
| `temperature` | float | 0.2 | 0.0-2.0 | Sampling temperature (higher = more creative) |
| `top_p` | float | 0.95 | 0.0-1.0 | Nucleus sampling threshold |
| `do_sample` | bool | true | - | Whether to use sampling vs greedy |
| `repetition_penalty` | float | 1.1 | 1.0-2.0 | Penalize repeated tokens |
| `num_return_sequences` | int | 1 | 1-10 | Number of sequences to generate |

**Response:**

```json
{
  "generated_text": "    seen = {}\n    for i, num in enumerate(nums):\n        complement = target - num\n        if complement in seen:\n            return [seen[complement], i]\n        seen[num] = i\n    return []",
  "prompt": "def two_sum(nums, target):\n    \"\"\"Return indices of two numbers that add up to target.\"\"\"\n",
  "model": "base_model_qwen7b",
  "num_tokens": 45,
  "finish_reason": "stop"
}
```

**Example with curl:**

```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "def fibonacci(n):\n    \"\"\"Return first n Fibonacci numbers.\"\"\"\n",
    "max_tokens": 100,
    "temperature": 0.2
  }'
```

---

### `POST /chat`

Conversational interface for multi-turn interactions.

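
The request body for this endpoint carries earlier turns in `messages`, which suggests the client is responsible for keeping the conversation history and resending it on each call. The following is a minimal client-side sketch of that bookkeeping, not part of the API itself: the `ChatHistory` class and its method names are hypothetical, and the check on the final role mirrors the `400` error documented under Error Handling.

```python
from typing import Dict, List

VALID_ROLES = {"user", "assistant"}  # the two roles this document describes


class ChatHistory:
    """Hypothetical client-side holder for a multi-turn conversation."""

    def __init__(self) -> None:
        self.messages: List[Dict[str, str]] = []

    def add(self, role: str, content: str) -> None:
        """Append one turn, rejecting roles the API does not document."""
        if role not in VALID_ROLES:
            raise ValueError(f"Unsupported role: {role!r}")
        self.messages.append({"role": role, "content": content})

    def to_payload(self, max_tokens: int = 128, temperature: float = 0.2,
                   top_p: float = 0.95) -> dict:
        """Build a /chat request body; fail early if the last turn is not the user's."""
        if not self.messages or self.messages[-1]["role"] != "user":
            raise ValueError("Last message must be from user")
        return {
            "messages": list(self.messages),  # copy so later turns do not mutate it
            "max_tokens": max_tokens,
            "temperature": temperature,
            "top_p": top_p,
        }


history = ChatHistory()
history.add("user", "Write a function to reverse a string in Python")
history.add("assistant", "def reverse_string(s):\n    return s[::-1]")
history.add("user", "Make it recursive instead")
payload = history.to_payload()
```

The resulting `payload` can be posted to `/chat` as JSON; adding the returned `message` back with `history.add(...)` keeps the history current for the next turn.
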

**Request Body:**

```json
{
  "messages": [
    {"role": "user", "content": "Write a function to reverse a string in Python"},
    {"role": "assistant", "content": "def reverse_string(s):\n    return s[::-1]"},
    {"role": "user", "content": "Make it recursive instead"}
  ],
  "max_tokens": 128,
  "temperature": 0.2,
  "top_p": 0.95
}
```

**Message Roles:**

- `user` - User's message
- `assistant` - Model's previous response (for conversation history)

**Response:**

```json
{
  "message": {
    "role": "assistant",
    "content": "def reverse_string(s):\n    if len(s) <= 1:\n        return s\n    return s[-1] + reverse_string(s[:-1])"
  },
  "model": "base_model_qwen7b",
  "num_tokens": 67,
  "finish_reason": "stop"
}
```

**Example with curl:**

```bash
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a binary search function"}
    ],
    "max_tokens": 150
  }'
```

---

### `POST /generate/raw`

Same as `/generate` but returns raw output without extracting code from markdown blocks.

**Example with curl:**

```bash
curl -X POST http://localhost:8000/generate/raw \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "def quick_sort(arr):",
    "max_tokens": 200
  }'
```

---

### `POST /extract-code`

Extract code from a text response that may contain markdown code blocks.

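
To make the idea of extraction concrete, here is a rough sketch of how code could be pulled out of a fenced markdown block. It is an illustration only and not the server's actual logic: the `extract_code` function, the regular expression, and the choice to return the first block (or the text unchanged when no fence is found) are all assumptions.

```python
import re

FENCE = "`" * 3  # three backticks, built at runtime to keep this example fence-safe

# Opening fence, optional language tag, newline, body (non-greedy), closing fence.
_BLOCK_RE = re.compile(FENCE + r"[ \t]*[\w+-]*[ \t]*\n(.*?)\n?" + FENCE, re.DOTALL)


def extract_code(text: str) -> str:
    """Return the body of the first fenced block, or the text itself if none exists."""
    match = _BLOCK_RE.search(text)
    if match is None:
        return text  # assumption: unfenced input is treated as code already
    return match.group(1)


reply = FENCE + 'python\ndef hello():\n    print("world")\n' + FENCE
code = extract_code(reply)
print(code)
```

A real implementation may differ, for example in how it treats several blocks in one response.
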

**Request Body:**

```json
{
  "prompt": "```python\ndef hello():\n    print(\"world\")\n```"
}
```

**Response:**

```json
{
  "code": "def hello():\n    print(\"world\")"
}
```

---

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_PATH` | `base_model_qwen7b` | Path to model directory |
| `DEVICE` | `cuda` (if available) | Device to use: `cuda` or `cpu` |
| `PORT` | `8000` | Server port |
| `HOST` | `0.0.0.0` | Server host |
| `RELOAD` | `false` | Enable auto-reload for development |
| `DEFAULT_MAX_TOKENS` | `512` | Default max tokens |
| `DEFAULT_TEMPERATURE` | `0.2` | Default temperature |
| `DEFAULT_TOP_P` | `0.95` | Default top_p |

---

## Usage Examples

### Python Client

```python
import requests

API_URL = "http://localhost:8000"

# Health check
health = requests.get(f"{API_URL}/health").json()
print(f"Model loaded: {health['model_loaded']}")

# Code completion
response = requests.post(
    f"{API_URL}/generate",
    json={
        "prompt": "def merge_sort(arr):\n    \"\"\"Return sorted array.\"\"\"\n",
        "max_tokens": 200,
        "temperature": 0.3,
    }
).json()
print(response["generated_text"])
```

### JavaScript/Node.js Client

```javascript
const API_URL = "http://localhost:8000";

// Code completion
async function generate(prompt) {
  const response = await fetch(`${API_URL}/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      prompt,
      max_tokens: 128,
      temperature: 0.2,
    }),
  });
  return response.json();
}

const result = await generate("def binary_search(arr, target):");
console.log(result.generated_text);
```

### Using with OpenAI SDK (with base_url replacement)

```python
from openai import OpenAI

client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:8000"
)

# Note: the SDK sends this call to {base_url}/completions, which is not one of
# the endpoints documented above, so adapter code is likely needed for it to
# work and for full OpenAI compatibility
response = client.completions.create(
    model="stack-2.9",
    prompt="def factorial(n):",
    max_tokens=100,
)
```

---

## Performance Tips

1. **GPU Recommended**: For fastest inference, run on a CUDA-capable GPU
2. **Batch Processing**: Send multiple prompts as sequential requests; the model is loaded once and reused across them
3. **Memory**: Ensure adequate GPU memory; reduce `max_tokens` if needed
4. **Temperature**: Use a lower temperature (0.1-0.3) for more predictable code and a higher one for creative tasks; set `do_sample` to `false` for greedy decoding

---

## Error Handling

**503 Service Unavailable**: Model not loaded or loading failed

```json
{"detail": "Model not loaded. Check /health for status."}
```

**500 Internal Server Error**: Generation failed

```json
{"detail": "Generation failed: "}
```

**400 Bad Request**: Invalid input

```json
{"detail": "Last message must be from user"}
```

---

## Architecture Notes

- **Single Model Instance**: Model is loaded once at startup and reused
- **Synchronous Generation**: Uses `torch.no_grad()` for inference
- **CORS Enabled**: Accepts requests from any origin (configure for production)
- **No Authentication**: Add middleware (e.g., API key) for production deployments