| # Stack 2.9 Inference API Documentation |
|
|
| REST API for code generation using the Stack 2.9 fine-tuned Qwen model. |
|
|
| ## Quick Start |
|
|
| ### 1. Install Dependencies |
|
|
| ```bash |
| pip install -r requirements_api.txt |
| pip install -r requirements.txt # Core dependencies (transformers, torch, etc.) |
| ``` |
|
|
| ### 2. Set Model Path |
|
|
| ```bash |
| # Option A: Environment variable |
| export MODEL_PATH=/path/to/your/merged/model |
| |
| # Option B: Direct parameter |
| MODEL_PATH=/path/to/model uvicorn inference_api:app --port 8000 |
| ``` |
|
|
| ### 3. Start the Server |
|
|
| ```bash |
| # Basic usage |
| uvicorn inference_api:app --host 0.0.0.0 --port 8000 |
| |
| # With auto-reload (development) |
| uvicorn inference_api:app --reload --port 8000 |
| |
| # Using Python directly |
| python inference_api.py |
| ``` |
|
|
| ### 4. Verify It's Running |
|
|
| ```bash |
| curl http://localhost:8000/health |
| ``` |
|
|
| Expected response: |
| ```json |
| { |
| "status": "healthy", |
| "model_loaded": true, |
| "model_path": "base_model_qwen7b", |
| "device": "cuda", |
| "cuda_available": true |
| } |
| ``` |
|
|
| --- |
|
|
| ## API Endpoints |
|
|
| ### `GET /health` |
|
|
| Health check endpoint to verify API and model status. |
|
|
| **Response:** |
| ```json |
| { |
| "status": "healthy", |
| "model_loaded": true, |
| "model_path": "/path/to/model", |
| "device": "cuda", |
| "cuda_available": true |
| } |
| ``` |
|
|
| --- |
|
|
| ### `GET /model-info` |
|
|
| Get information about the currently loaded model. |
|
|
| **Response:** |
| ```json |
| { |
| "model_path": "/path/to/model", |
| "device": "cuda:0", |
| "dtype": "torch.float16" |
| } |
| ``` |
|
|
| --- |
|
|
| ### `POST /generate` |
|
|
| Generate code completion for a prompt. |
|
|
| **Request Body:** |
| ```json |
| { |
| "prompt": "def two_sum(nums, target):\n \"\"\"Return indices of two numbers that add up to target.\"\"\"\n", |
| "max_tokens": 128, |
| "temperature": 0.2, |
| "top_p": 0.95, |
| "do_sample": true, |
| "repetition_penalty": 1.1, |
| "num_return_sequences": 1 |
| } |
| ``` |
|
|
| **Parameters:** |
| | Parameter | Type | Default | Range | Description | |
| |-----------|------|---------|-------|-------------| |
| | `prompt` | string | required | - | Input prompt to complete | |
| | `max_tokens` | int | 512 | 1-4096 | Maximum tokens to generate | |
| | `temperature` | float | 0.2 | 0.0-2.0 | Sampling temperature (higher = more creative) | |
| | `top_p` | float | 0.95 | 0.0-1.0 | Nucleus sampling threshold | |
| | `do_sample` | bool | true | - | Whether to use sampling vs greedy | |
| | `repetition_penalty` | float | 1.1 | 1.0-2.0 | Penalize repeated tokens | |
| | `num_return_sequences` | int | 1 | 1-10 | Number of sequences to generate | |
|
|
| **Response:** |
| ```json |
| { |
| "generated_text": " seen = {}\n for i, num in enumerate(nums):\n complement = target - num\n if complement in seen:\n return [seen[complement], i]\n seen[num] = i\n return []", |
| "prompt": "def two_sum(nums, target):\n \"\"\"Return indices of two numbers that add up to target.\"\"\"\n", |
| "model": "base_model_qwen7b", |
| "num_tokens": 45, |
| "finish_reason": "stop" |
| } |
| ``` |
|
|
| **Example with curl:** |
| ```bash |
| curl -X POST http://localhost:8000/generate \ |
| -H "Content-Type: application/json" \ |
| -d '{ |
| "prompt": "def fibonacci(n):\n \"\"\"Return first n Fibonacci numbers.\"\"\"\n", |
| "max_tokens": 100, |
| "temperature": 0.2 |
| }' |
| ``` |
|
|
| --- |
|
|
| ### `POST /chat` |
|
|
| Conversational interface for multi-turn interactions. |
|
|
| **Request Body:** |
| ```json |
| { |
| "messages": [ |
| {"role": "user", "content": "Write a function to reverse a string in Python"}, |
| {"role": "assistant", "content": "def reverse_string(s):\n return s[::-1]"}, |
| {"role": "user", "content": "Make it recursive instead"} |
| ], |
| "max_tokens": 128, |
| "temperature": 0.2, |
| "top_p": 0.95 |
| } |
| ``` |
|
|
| **Message Roles:** |
| - `user` - User's message |
| - `assistant` - Model's previous response (for conversation history) |
|
|
| **Response:** |
| ```json |
| { |
| "message": { |
| "role": "assistant", |
| "content": "def reverse_string(s):\n if len(s) <= 1:\n return s\n return s[-1] + reverse_string(s[:-1])" |
| }, |
| "model": "base_model_qwen7b", |
| "num_tokens": 67, |
| "finish_reason": "stop" |
| } |
| ``` |
|
|
| **Example with curl:** |
| ```bash |
| curl -X POST http://localhost:8000/chat \ |
| -H "Content-Type: application/json" \ |
| -d '{ |
| "messages": [ |
| {"role": "user", "content": "Write a binary search function"} |
| ], |
| "max_tokens": 150 |
| }' |
| ``` |
|
|
| --- |
|
|
| ### `POST /generate/raw` |
|
|
| Same as `/generate` but returns raw output without extracting code from markdown blocks. |
|
|
| **Example with curl:** |
| ```bash |
| curl -X POST http://localhost:8000/generate/raw \ |
| -H "Content-Type: application/json" \ |
| -d '{ |
| "prompt": "def quick_sort(arr):", |
| "max_tokens": 200 |
| }' |
| ``` |
|
|
| --- |
|
|
| ### `POST /extract-code` |
|
|
| Extract code from a text response that may contain markdown code blocks. |
|
|
| **Request Body:** |
| ```json |
| { |
| "prompt": "```python\ndef hello():\n print(\"world\")\n```" |
| } |
| ``` |
|
|
| **Response:** |
| ```json |
| { |
| "code": "def hello():\n print(\"world\")" |
| } |
| ``` |
|
|
| --- |
|
|
| ## Environment Variables |
|
|
| | Variable | Default | Description | |
| |----------|---------|-------------| |
| | `MODEL_PATH` | `base_model_qwen7b` | Path to model directory | |
| | `DEVICE` | `cuda` (if available) | Device to use: `cuda` or `cpu` | |
| | `PORT` | `8000` | Server port | |
| | `HOST` | `0.0.0.0` | Server host | |
| | `RELOAD` | `false` | Enable auto-reload for development | |
| | `DEFAULT_MAX_TOKENS` | `512` | Default max tokens | |
| | `DEFAULT_TEMPERATURE` | `0.2` | Default temperature | |
| | `DEFAULT_TOP_P` | `0.95` | Default top_p | |
| |
| --- |
| |
| ## Usage Examples |
| |
| ### Python Client |
| |
| ```python |
| import requests |
| |
| API_URL = "http://localhost:8000" |
|
|
| # Health check |
| health = requests.get(f"{API_URL}/health").json() |
| print(f"Model loaded: {health['model_loaded']}") |
|
|
| # Code completion |
| response = requests.post( |
| f"{API_URL}/generate", |
| json={ |
| "prompt": "def merge_sort(arr):\n \"\"\"Return sorted array.\"\"\"\n", |
| "max_tokens": 200, |
| "temperature": 0.3, |
| } |
| ).json() |
| |
| print(response["generated_text"]) |
| ``` |
| |
| ### JavaScript/Node.js Client |
| |
| ```javascript |
| const API_URL = "http://localhost:8000"; |
|
|
| // Code completion |
| async function generate(prompt) { |
| const response = await fetch(`${API_URL}/generate`, { |
| method: "POST", |
| headers: { "Content-Type": "application/json" }, |
| body: JSON.stringify({ |
| prompt, |
| max_tokens: 128, |
| temperature: 0.2, |
| }), |
| }); |
| return response.json(); |
| } |
| |
| const result = await generate("def binary_search(arr, target):"); |
| console.log(result.generated_text); |
| ``` |
| |
| ### Using with OpenAI SDK (with base_url replacement) |
| |
| ```python |
| from openai import OpenAI |
|
|
| client = OpenAI( |
| api_key="not-needed", |
| base_url="http://localhost:8000" |
| ) |
| |
| # Note: This works for basic completions but may need adapter code |
| # for full OpenAI compatibility |
| response = client.completions.create( |
| model="stack-2.9", |
| prompt="def factorial(n):", |
| max_tokens=100, |
| ) |
| ``` |
| |
| --- |
|
|
| ## Performance Tips |
|
|
| 1. **GPU Recommended**: For fastest inference, run on GPU with CUDA |
| 2. **Batch Processing**: For multiple prompts, process sequentially (model is loaded once) |
| 3. **Memory**: Ensure adequate GPU memory; reduce `max_tokens` if needed |
| 4. **Temperature**: Use lower temperature (0.1-0.3) for deterministic code, higher for creative tasks |
|
|
| --- |
|
|
| ## Error Handling |
|
|
| **503 Service Unavailable**: Model not loaded or loading failed |
| ```json |
| {"detail": "Model not loaded. Check /health for status."} |
| ``` |
|
|
| **500 Internal Server Error**: Generation failed |
| ```json |
| {"detail": "Generation failed: <error message>"} |
| ``` |
|
|
| **400 Bad Request**: Invalid input |
| ```json |
| {"detail": "Last message must be from user"} |
| ``` |
|
|
| --- |
|
|
| ## Architecture Notes |
|
|
| - **Single Model Instance**: Model is loaded once at startup and reused |
| - **Synchronous Generation**: Uses `torch.no_grad()` for inference |
| - **CORS Enabled**: Accepts requests from any origin (configure for production) |
| - **No Authentication**: Add middleware (e.g., API key) for production deployments |
|
|