
Using Stack 2.9 with Together AI

This guide explains how to use Stack 2.9 with Together AI as the model provider.

Overview

Together AI provides high-performance, cloud-hosted models at competitive prices. Stack 2.9 supports Together AI through its OpenAI-compatible API, allowing you to use models such as:

  • togethercomputer/meta-llama-3-70b-instruct
  • togethercomputer/CodeLlama-34b-instruct
  • togethercomputer/Qwen2.5-Coder-32B-Instruct (recommended for Stack 2.9)
  • And many others from Together's model library

Prerequisites

  1. Together AI Account: Sign up at together.ai
  2. API Key: Obtain your API key from the Together dashboard
  3. OpenAI Python Package: Install openai>=1.0.0 (required for Together client)
pip install openai

Environment Variables

Configure your environment with the following variables:

# Required: Together AI API key
export TOGETHER_API_KEY="your-together-api-key-here"

# Optional: Model selection (default: togethercomputer/Qwen2.5-Coder-32B-Instruct)
export TOGETHER_MODEL="togethercomputer/Qwen2.5-Coder-32B-Instruct"

# Optional: Provider configuration (for auto-detection)
export MODEL_PROVIDER="together"
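To fail fast when the key is missing, a small helper can read these variables and apply the documented defaults (a sketch; the variable names and defaults match those above):

```python
import os

def together_config():
    """Read Together AI settings from the environment, applying the documented defaults."""
    api_key = os.environ.get("TOGETHER_API_KEY")
    if not api_key:
        raise ValueError("TOGETHER_API_KEY is not set")
    model = os.environ.get("TOGETHER_MODEL", "togethercomputer/Qwen2.5-Coder-32B-Instruct")
    provider = os.environ.get("MODEL_PROVIDER", "together")
    return {"api_key": api_key, "model": model, "provider": provider}
```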

Setting up in Shell

Add these lines to your ~/.zshrc, ~/.bashrc, or shell profile:

# Together AI configuration
export TOGETHER_API_KEY="tog-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
export TOGETHER_MODEL="togethercomputer/Qwen2.5-Coder-32B-Instruct"

Then reload your shell:

source ~/.zshrc  # or ~/.bashrc

Using a .env file (recommended for development)

Create a .env file in your project root:

TOGETHER_API_KEY=tog-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TOGETHER_MODEL=togethercomputer/Qwen2.5-Coder-32B-Instruct
MODEL_PROVIDER=together

Then load it with python-dotenv:

pip install python-dotenv

And in your Python script:

from dotenv import load_dotenv
load_dotenv()  # loads .env file

Usage Examples

Command Line

Use the built-in CLI with Together provider:

# Using the default model (togethercomputer/Qwen2.5-Coder-32B-Instruct)
python stack.py --provider together "Write a Python function to reverse a string"

# Using a specific model (override env var)
TOGETHER_MODEL=togethercomputer/Qwen2.5-Coder-32B-Instruct python stack.py --provider together "def factorial(n):"

Python API

from stack_2_9_eval.model_client import create_model_client

# Create Together client (reads TOGETHER_API_KEY from env)
client = create_model_client(provider="together")

# Or specify explicitly
client = create_model_client(
    provider="together",
    model="togethercomputer/Qwen2.5-Coder-32B-Instruct",
    api_key="your-api-key"
)

# Generate code
result = client.generate(
    prompt="Write a Python function to sort a list using quicksort",
    temperature=0.2,
    max_tokens=1024
)

print(result.text)

Chat Mode

from stack_2_9_eval.model_client import create_model_client, ChatMessage

client = create_model_client(provider="together")

messages = [
    ChatMessage(role="system", content="You are an expert Python programmer."),
    ChatMessage(role="user", content="How do I read a JSON file in Python?"),
]

result = client.chat(messages, temperature=0.2, max_tokens=512)
print(result.text)

Using with Tool Calls

tools = [
    {
        "type": "function",
        "function": {
            "name": "FileReadTool",
            "description": "Read file contents",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "File path"}
                },
                "required": ["path"]
            }
        }
    }
]

messages = [
    ChatMessage(role="user", content="Read the file 'config.yaml' and tell me what's in it")
]

result = client.chat(messages, temperature=0.2, tools=tools)
print(result.text)
# Check result.raw_response for tool_calls if model requested a tool
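When the model requests a tool, the call details have to be pulled out of the raw response. The sketch below assumes raw_response is a dict in the OpenAI chat-completions shape (choices[0].message.tool_calls); adapt the field access if your client returns a response object instead:

```python
import json

def extract_tool_calls(raw_response):
    """Collect tool-call requests from an OpenAI-compatible chat completion.

    Assumes raw_response is a plain dict following the OpenAI
    chat-completions schema; tool-call arguments arrive as a JSON string.
    """
    message = raw_response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        calls.append({
            "name": call["function"]["name"],
            "arguments": json.loads(call["function"]["arguments"]),
        })
    return calls
```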

Recommended Models

For Stack 2.9 use cases (coding + tool use), these Together models are recommended:

Primary Recommendation

togethercomputer/qwen2.5-coder-32b-instruct

  • Matches Stack 2.9's base model (Qwen2.5-Coder-32B)
  • Excellent code generation
  • Strong tool-calling capabilities
  • Cost-effective: ~$0.22 / 1M tokens (input)
  • Use this for production Stack 2.9 deployments

Alternatives

togethercomputer/meta-llama-3-70b-instruct

  • Larger model (70B) with strong reasoning
  • Slightly higher cost but excellent quality
  • Good for complex problem-solving

togethercomputer/codellama-34b-instruct

  • Code-specialized Llama 34B
  • Good performance, lower cost than 70B models

togethercomputer/qwen2.5-72b-instruct

  • 72B variant of Qwen2.5 (if you need maximum quality)
  • Higher cost and latency

Model Selection Tips

  • Match training distribution: Use Qwen models for Stack 2.9 pattern compatibility
  • Budget: 34B models offer best price/performance for coding tasks
  • Latency: Smaller models (7B-13B) are faster but less capable
  • Throughput: Consider batching for large-scale usage
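The batching tip above can be sketched with a thread pool that bounds concurrency; this assumes client.generate follows the signature shown in the Python API examples and that your worker count stays well under your plan's rate limit:

```python
from concurrent.futures import ThreadPoolExecutor

def generate_batch(client, prompts, max_workers=4):
    """Run several prompts concurrently, preserving input order.

    max_workers bounds in-flight requests; keep it conservative
    relative to your Together rate limit.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(client.generate, p, temperature=0.2, max_tokens=1024)
                   for p in prompts]
        return [f.result() for f in futures]
```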

Cost Estimation

Together AI pricing (as of 2025, check their site for current rates):

| Model             | Input ($/1M tokens) | Output ($/1M tokens) |
|-------------------|---------------------|----------------------|
| Qwen2.5-Coder-32B | ~0.22               | ~0.22                |
| Meta-Llama-3-70B  | ~0.70               | ~0.70                |
| CodeLlama-34B     | ~0.22               | ~0.22                |
| Qwen2.5-72B       | ~0.70               | ~0.70                |

Example Cost Calculation

Assume typical usage of:

  • 100 queries/day
  • Average 2,000 tokens per query (input + output)
  • Using Qwen2.5-Coder-32B

Daily cost: (100 * 2,000 / 1,000,000) * $0.22 β‰ˆ $0.044
Monthly cost: ~$1.32

Very affordable for development and light production use.
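The arithmetic above generalizes to a small estimator (assumes a flat per-token price, as in the table, where input and output rates match):

```python
def monthly_cost(queries_per_day, avg_tokens_per_query, price_per_million_tokens, days=30):
    """Estimated monthly spend from average daily usage at a flat per-token price."""
    daily = queries_per_day * avg_tokens_per_query / 1_000_000 * price_per_million_tokens
    return daily * days

# The example above: 100 queries/day, 2,000 tokens each, ~$0.22 per 1M tokens
estimate = monthly_cost(100, 2000, 0.22)  # matches the ~$1.32/month figure
```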

Performance Considerations

  • Latency: Expect roughly 100-500 ms to first token, depending on model size and prompt complexity; full generations take proportionally longer
  • Rate Limits: Together provides generous rate limits (check your plan)
  • Throughput: Use concurrent requests for batch processing (respect rate limits)
  • Streaming: Together supports streaming; use stream=True in client for long generations

Error Handling

Implement robust error handling for production:

from stack_2_9_eval.model_client import create_model_client
import os
import time

def generate_with_retry(client, prompt, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = client.generate(prompt, temperature=0.2, max_tokens=1024)
            return result
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            wait = 2 ** attempt  # exponential backoff
            print(f"Error: {e}. Retrying in {wait}s...")
            time.sleep(wait)

client = create_model_client(provider="together", api_key=os.getenv("TOGETHER_API_KEY"))
result = generate_with_retry(client, "Write a function to calculate prime numbers")

Comparison with Other Providers

| Feature          | Together AI        | Ollama (local)       | OpenAI         | Anthropic      |
|------------------|--------------------|----------------------|----------------|----------------|
| Cost (32B class) | Low (~$0.22/M)     | Free (your hardware) | High (~$3/M)   | High (~$3/M)   |
| Qwen2.5-Coder    | βœ… Supported       | βœ… Via pull          | ❌ No          | ❌ No          |
| Privacy          | Cloud (check TOS)  | Full local           | Cloud          | Cloud          |
| Latency          | Medium             | Fast (local)         | Medium         | Medium         |
| Setup complexity | Low (API key)      | Medium (install)     | Low            | Low            |
| Rate limits      | Generous           | Unlimited            | Pay-as-you-go  | Pay-as-you-go  |
| Tool calling     | βœ… Yes             | βœ… Yes               | βœ… Yes         | βœ… Yes         |

Best for Stack 2.9: Together AI when you need cloud access and Qwen models without running locally.

Troubleshooting

API Key Errors

ValueError: Together AI API key required.

Solution: Set TOGETHER_API_KEY environment variable or pass api_key param.

Model Not Found

openai.BadRequestError: The model '...' does not exist

Solution: Check model name spelling. Browse available models at Together Models. Use full model ID like togethercomputer/qwen2.5-coder-32b-instruct.

Rate Limit Exceeded

Solution: Add retry logic with exponential backoff. Consider upgrading your Together plan.

Import Errors

ImportError: openai package required

Solution: pip install openai (version 1.0+)

Advanced Configuration

Custom Base URL

If you need to use a custom endpoint (e.g., for regional deployments):

client = create_model_client(
    provider="together",
    model="togethercomputer/qwen2.5-coder-32b-instruct",
    base_url="https://your-custom-endpoint.together.ai/v1"
)

Timeouts and Retries

import os

# Assumes TogetherClient is exported from the same module as create_model_client
from stack_2_9_eval.model_client import TogetherClient

client = TogetherClient(
    model="togethercomputer/qwen2.5-coder-32b-instruct",
    api_key=os.getenv("TOGETHER_API_KEY"),
    timeout=300  # 5-minute request timeout
)

Streaming Responses

For long generations, use streaming (requires modifying the client or using the OpenAI library directly):

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("TOGETHER_API_KEY"),
    base_url="https://api.together.xyz/v1",
)

stream = client.chat.completions.create(
    model="togethercomputer/qwen2.5-coder-32b-instruct",
    messages=[{"role": "user", "content": "Write a detailed explanation of binary search"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Integration with Stack 2.9 CLI

To make Together AI the default provider:

# Set environment variables permanently
echo 'export MODEL_PROVIDER="together"' >> ~/.zshrc
echo 'export TOGETHER_MODEL="togethercomputer/qwen2.5-coder-32b-instruct"' >> ~/.zshrc
source ~/.zshrc

Now stack.py will automatically use Together AI without --provider flag.

Security Best Practices

  1. Never commit API keys to version control. Use .env files or environment variables.
  2. Rotate keys periodically from Together dashboard.
  3. Use minimal permissions - Together API keys have full access; protect them.
  4. Enable billing alerts to avoid unexpected charges.
  5. Review Together's TOS for data usage and privacy policies.

Support


Last Updated: 2025-04-02
Compatible Stack 2.9 Version: 2.9.0+