fix(agent): default back to gpt-oss-120b, pin Llama 3.3 to Together

Groq (HF Router's :fastest pick for Llama 3.3 70B) validates tool calls
very strictly and was rejecting the editor's 18-tool registry with
"Failed to call a function. Please adjust your prompt." Fireworks has
deprecated the Llama 3.3 checkpoint; Nscale doesn't support the `tools`
parameter at all. Together serves the same model with full
tool-calling support, so pin Llama 3.3 to `:together`.

Restore openai/gpt-oss-120b as the default - it's HF's trending
tool-calling model and the reasoning-strip fix already covers the
Cerebras round-trip case.

Co-authored-by: Cursor <cursoragent@cursor.com>
- backend/.env.example +7 -2
- backend/src/agent/chat.ts +13 -1
- backend/src/agent/stream-handler.ts +7 -1
- docs/SPECIFICATION.md +2 -2
backend/.env.example
CHANGED
@@ -87,8 +87,13 @@ OAUTH_CLIENT_SECRET=
 # Override the default model id used by the chat agent. The list of
 # supported models is in backend/src/agent/chat.ts (AVAILABLE_MODELS), but
 # any model exposed by HF Inference Providers with tool-calling support
-# works. Defaults to "
-#
+# works. Defaults to "openai/gpt-oss-120b".
+# You can also pin a specific provider by appending ":<provider>" to the
+# model id, e.g. "meta-llama/Llama-3.3-70B-Instruct:together" - useful
+# to avoid providers (such as Groq or Nscale) that reject the editor's
+# wide tool registry with "Failed to call a function" or
+# "tools parameter not supported".
+# HF_INFERENCE_MODEL=openai/gpt-oss-120b
 
 # -----------------------------------------------------------------------------
 # Publishing
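To illustrate the `<model>:<provider>` format described in the comment above, here is a minimal sketch of reading the override and splitting off the suffix. The helpers `resolveModel` and `parseModelId` and the constant `FALLBACK_MODEL` are hypothetical names, not part of this commit; as far as the diff shows, the backend passes the suffixed id straight to HF Router, so parsing like this would only be useful for logging or display.

```typescript
// Hypothetical helpers for illustration only; not part of this commit.
export interface ModelRef {
  model: string;     // e.g. "meta-llama/Llama-3.3-70B-Instruct"
  provider?: string; // e.g. "together"; undefined means the router's default policy
}

// Mirrors the default documented in .env.example.
export const FALLBACK_MODEL = "openai/gpt-oss-120b";

// Read the override from an env-like object, falling back when unset or blank.
export function resolveModel(env: Record<string, string | undefined>): string {
  const raw = env.HF_INFERENCE_MODEL?.trim();
  return raw ? raw : FALLBACK_MODEL;
}

// Split "org/model:provider" into its two parts.
export function parseModelId(id: string): ModelRef {
  const idx = id.lastIndexOf(":");
  // No colon, or a colon that appears before the org/model slash: unsuffixed id.
  if (idx === -1 || idx < id.indexOf("/")) return { model: id };
  const provider = id.slice(idx + 1);
  const model = id.slice(0, idx);
  return provider ? { model, provider } : { model };
}
```

With `HF_INFERENCE_MODEL=meta-llama/Llama-3.3-70B-Instruct:together`, `parseModelId(resolveModel(env))` would yield the bare model id plus the provider `together`.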
backend/src/agent/chat.ts
CHANGED
@@ -8,6 +8,18 @@ import type { Request, Response } from "express";
  * Face Inference Providers (`https://router.huggingface.co/v1`) and
  * support function/tool calling - the agent loop won't work without it.
  *
+ * Note about provider suffixes (`:provider`):
+ * HF Router defaults to the `:fastest` provider for a given model.
+ * That's usually fine, but a few providers don't fit the editor's
+ * workload:
+ *   - Groq enforces strict tool-call validation and tends to reject
+ *     our 18-tool registry with `Failed to call a function`.
+ *   - Nscale + a few others reject the `tools` parameter outright.
+ *   - Fireworks has deprecated several Llama 3.x checkpoints.
+ * We pin `Llama-3.3-70B` to Together, which serves the model with
+ * full tool-calling support. Unsuffixed ids use the default :fastest
+ * policy.
+ *
  * Discover more conversational models here:
  * https://huggingface.co/models?inference_provider=all&other=conversational
  *
@@ -16,9 +28,9 @@ import type { Request, Response } from "express";
  * own rates, see the docs for the source of truth.
  */
 export const AVAILABLE_MODELS = [
-  { id: "meta-llama/Llama-3.3-70B-Instruct", label: "Llama 3.3 70B", context: "128K", cost: "$" },
   { id: "openai/gpt-oss-120b", label: "GPT-OSS 120B", context: "131K", cost: "$$" },
   { id: "openai/gpt-oss-20b", label: "GPT-OSS 20B", context: "131K", cost: "$" },
+  { id: "meta-llama/Llama-3.3-70B-Instruct:together", label: "Llama 3.3 70B", context: "128K", cost: "$" },
   { id: "Qwen/Qwen3-Coder-480B-A35B-Instruct", label: "Qwen3 Coder 480B", context: "262K", cost: "$$" },
   { id: "deepseek-ai/DeepSeek-V3.1", label: "DeepSeek V3.1", context: "128K", cost: "$$" },
 ];
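One consequence of changing the Llama entry's id is that a client may still send the old unsuffixed id. The sketch below shows one way a request handler could map such an id onto the pinned entry. `pickModel` is a hypothetical helper and the pass-through for unknown ids is an assumption; the diff does not show how chat.ts actually resolves the requested model.

```typescript
interface ModelOption {
  id: string;
  label: string;
  context: string;
  cost: string;
}

// Copied from the new AVAILABLE_MODELS list in this commit.
const AVAILABLE_MODELS: ModelOption[] = [
  { id: "openai/gpt-oss-120b", label: "GPT-OSS 120B", context: "131K", cost: "$$" },
  { id: "openai/gpt-oss-20b", label: "GPT-OSS 20B", context: "131K", cost: "$" },
  { id: "meta-llama/Llama-3.3-70B-Instruct:together", label: "Llama 3.3 70B", context: "128K", cost: "$" },
  { id: "Qwen/Qwen3-Coder-480B-A35B-Instruct", label: "Qwen3 Coder 480B", context: "262K", cost: "$$" },
  { id: "deepseek-ai/DeepSeek-V3.1", label: "DeepSeek V3.1", context: "128K", cost: "$$" },
];

const DEFAULT_MODEL = "openai/gpt-oss-120b";

// Hypothetical helper: choose the model id to send to HF Router.
function pickModel(requested: string | undefined): string {
  if (!requested) return DEFAULT_MODEL;
  // An exact match (including any provider suffix) wins.
  const exact = AVAILABLE_MODELS.find((m) => m.id === requested);
  if (exact) return exact.id;
  // Otherwise, upgrade an unsuffixed id to its pinned variant if one is listed.
  const pinned = AVAILABLE_MODELS.find((m) => m.id.split(":")[0] === requested);
  // Unknown ids pass through unchanged (assumption: any tool-calling model may be used).
  return pinned ? pinned.id : requested;
}
```

Here `pickModel("meta-llama/Llama-3.3-70B-Instruct")` resolves to the `:together` entry, so a stale client selection would still avoid the Groq route.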
backend/src/agent/stream-handler.ts
CHANGED
@@ -4,7 +4,13 @@ import { createOpenAICompatible } from "@ai-sdk/openai-compatible";
 import type { Request, Response } from "express";
 import { extractToken } from "../auth.js";
 
-
+/**
+ * `openai/gpt-oss-120b` is HF's trending tool-calling model with a
+ * strong reputation on the editor's 18-tool agent loop. The `:fastest`
+ * policy routes it to Cerebras, which we've validated end-to-end
+ * (multi-turn, with the reasoning-strip below).
+ */
+export const DEFAULT_MODEL = "openai/gpt-oss-120b";
 
 /**
  * Hugging Face Inference Providers exposes an OpenAI-compatible chat
docs/SPECIFICATION.md
CHANGED
@@ -127,7 +127,7 @@ flowchart LR
 
 ### 4.6 AI Agent
 
-- Provider: Hugging Face Inference Providers (`https://router.huggingface.co/v1`), default model `meta-llama/Llama-3.3-70B-Instruct`
+- Provider: Hugging Face Inference Providers (`https://router.huggingface.co/v1`), default model `openai/gpt-oss-120b`. Model ids may be suffixed with `:<provider>` (e.g. `meta-llama/Llama-3.3-70B-Instruct:together`) to bypass providers that enforce overly strict tool-call validation (notably Groq) or that don't support the `tools` parameter (Nscale, etc.).
 - Auth: per-request bearer token resolved from the editor's OAuth cookie when available, falling back to the server-side `HF_TOKEN`. On a HF Space with `inference-api` scope, no extra secret is needed - the logged-in user pays for their own inference under their HF quota.
 - Streaming via Vercel AI SDK `streamText` over `@ai-sdk/openai-compatible`
 - Reasoning parts from prior assistant turns are stripped before re-sending the history: providers like Cerebras reject `reasoning_content` on round-trip, and the model doesn't need to see its own past reasoning to continue the conversation.
@@ -283,7 +283,7 @@ The publisher reads these same CSS files server-side and injects them inline int
 | `OAUTH_SCOPES` | No (default `openid profile`) | OAuth scopes. Add `manage-repos` for dataset persistence and `inference-api` to power AI features with the user's token |
 | `HF_DATASET_ID` | No | Override dataset name (default: `{SPACE_ID}-data`) |
 | `HF_TOKEN` | For AI chat in local dev | Fallback Hub token for HF API + Inference Providers. Needs the "Make calls to Inference Providers" permission |
-| `HF_INFERENCE_MODEL` | No (default `
+| `HF_INFERENCE_MODEL` | No (default `openai/gpt-oss-120b`) | Default chat-completion model id served by HF Inference Providers. May be suffixed with `:<provider>` to pin a specific routing |
 | `ENABLE_PDF` | No (default true) | Toggle PDF/thumbnail generation |
 
 ### 6.4 Local Development
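The reasoning-strip mentioned in section 4.6 of the spec (and relied on by the Cerebras route for `openai/gpt-oss-120b`) can be sketched roughly as below. The `ChatMessage` and `Part` shapes are simplified assumptions made for this example; the Vercel AI SDK's real message types are richer, and the actual implementation in stream-handler.ts is not shown in this diff.

```typescript
// Simplified message shapes, assumed for illustration only.
type Part =
  | { type: "text"; text: string }
  | { type: "reasoning"; text: string }
  | { type: "tool-call"; toolName: string; input: unknown };

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string | Part[];
}

// Return a copy of the history with reasoning parts removed from assistant turns.
// Other roles and plain-string contents are returned untouched.
function stripReasoning(history: ChatMessage[]): ChatMessage[] {
  return history.map((msg) => {
    if (msg.role !== "assistant" || typeof msg.content === "string") return msg;
    return { ...msg, content: msg.content.filter((p) => p.type !== "reasoning") };
  });
}
```

The function builds new message objects instead of mutating the input, so the caller's full history (including reasoning) remains available, for example for display in the UI.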