Abstract
SEAR is a schema-based system for evaluating and routing LLM responses that uses structured signals derived from LLM reasoning to enable accurate, interpretable routing decisions across multiple providers.
Evaluating production LLM responses and routing requests across providers in LLM gateways requires fine-grained quality signals and operationally grounded decisions. To address this need, we present SEAR, a schema-based evaluation and routing system for multi-model, multi-provider LLM gateways. SEAR defines an extensible relational schema covering both LLM evaluation signals (context, intent, response characteristics, issue attribution, and quality scores) and gateway operational metrics (latency, cost, throughput), with cross-table consistency links across around one hundred typed, SQL-queryable columns. To populate the evaluation signals reliably, SEAR proposes self-contained signal instructions, in-schema reasoning, and multi-stage generation that produces database-ready structured outputs. Because signals are derived through LLM reasoning rather than shallow classifiers, SEAR captures complex request semantics, enables human-interpretable routing explanations, and unifies evaluation and routing in a single query layer. Across thousands of production sessions, SEAR achieves strong signal accuracy on human-labeled data and supports practical routing decisions, including large cost reductions with comparable quality.
Community
LLM gateways route requests across multiple models and providers, but evaluating response quality and making routing decisions still relies on shallow heuristics or black-box logic. SEAR changes that.
SEAR defines an extensible relational schema with ~100 typed, SQL-queryable columns covering evaluation signals (intent, context, response characteristics, issue attribution, quality scores) and operational metrics (latency, cost, throughput). It uses LLM-based reasoning to populate these signals, capturing complex request semantics and producing human-interpretable, structured outputs. Because all signals live in a unified SQL layer, evaluation insights can be aggregated, joined, and queried at scale, bringing a big-data approach to LLM quality analysis and routing.
Evaluated on thousands of production sessions from https://infron.ai/, SEAR achieves strong signal accuracy against human labels and supports practical routing decisions, including large cost reductions with comparable quality.
SEAR turns model evaluation and routing from a static, opaque process into a data-driven loop that continuously improves cost, quality, and reliability.
With SEAR, production LLM evaluation and routing reduce to SQL queries over signals produced by reasoning LLMs rather than shallow classifiers, giving companies routing decisions that are fully human-interpretable.
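To make the schema concrete, here is a minimal sketch in Python's built-in `sqlite3`, using a small subset of the table and column names that appear in the example queries below. The real schema has roughly one hundred typed columns and cross-table consistency links; the columns, sample values, and the single inserted session here are hypothetical and illustrative only.

```python
import sqlite3

# Minimal SEAR-style schema sketch (column subset is hypothetical).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gateway_metrics (
    id INTEGER PRIMARY KEY,
    model_id TEXT,
    provider_id TEXT,
    prompt_tokens INTEGER,
    completion_tokens INTEGER,
    is_failed INTEGER,
    created_at TEXT
);
CREATE TABLE context_info (
    id INTEGER PRIMARY KEY,
    request_task_type TEXT,            -- e.g. 'coding'
    context_domain_category TEXT
);
CREATE TABLE llm_response_info (
    context_id INTEGER REFERENCES context_info(id),
    gateway_metrics_id INTEGER REFERENCES gateway_metrics(id)
);
CREATE TABLE issue_attribution (
    context_id INTEGER REFERENCES context_info(id),
    issue_caused_by_code_task TEXT     -- 'llm' | 'user' | 'both' | NULL
);
CREATE TABLE evaluation (
    context_id INTEGER REFERENCES context_info(id),
    severity_of_code_task TEXT,        -- 'major' | 'minor' | NULL
    overall_task_type_quality TEXT     -- 'high' | 'medium' | 'low'
);
""")

# One evaluated session: a coding request served by a hypothetical model.
conn.execute("INSERT INTO gateway_metrics VALUES "
             "(1, 'model-a', 'prov-x', 120, 300, 0, '2025-01-01')")
conn.execute("INSERT INTO context_info VALUES (1, 'coding', 'web_dev')")
conn.execute("INSERT INTO llm_response_info VALUES (1, 1)")
conn.execute("INSERT INTO issue_attribution VALUES (1, 'llm')")
conn.execute("INSERT INTO evaluation VALUES (1, 'minor', 'medium')")

# Evaluation signals and gateway metrics join in one query layer.
row = conn.execute("""
    SELECT ctx.request_task_type, gw.model_id, ev.overall_task_type_quality
    FROM context_info ctx
    JOIN llm_response_info llm ON llm.context_id = ctx.id
    JOIN gateway_metrics gw ON gw.id = llm.gateway_metrics_id
    JOIN evaluation ev ON ev.context_id = ctx.id
""").fetchone()
print(row)  # -> ('coding', 'model-a', 'medium')
```

Because every signal is a typed column, evaluation and routing questions become ordinary joins and aggregations, as the examples below show.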
Evaluation Example
Surface LLM-caused issues across models and domains for coding tasks over the last 30 days
SELECT gw.model_id,
ctx.context_domain_category AS domain,
COUNT(*) AS n,
AVG(CASE WHEN ia.issue_caused_by_code_task
IN ('llm','both')
THEN 1 ELSE 0 END) AS llm_issue_rate,
AVG(CASE eval.severity_of_code_task
WHEN 'major' THEN 2 WHEN 'minor' THEN 1
ELSE 0 END) AS avg_severity
FROM context_info ctx
JOIN issue_attribution ia ON ia.context_id = ctx.id
JOIN evaluation eval ON eval.context_id = ctx.id
JOIN llm_response_info llm ON llm.context_id = ctx.id
JOIN gateway_metrics gw ON gw.id = llm.gateway_metrics_id
WHERE ctx.request_task_type = 'coding'
AND gw.created_at >= NOW() - INTERVAL '30 days'
GROUP BY 1, 2
ORDER BY llm_issue_rate DESC, avg_severity DESC;
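The aggregation above can be exercised end to end with a simplified, runnable sketch in Python's `sqlite3`. The date filter and gateway join are dropped for brevity, and the four coding sessions are hypothetical data; the column names follow the example query.

```python
import sqlite3

# Simplified issue-attribution rollup over hypothetical coding sessions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE context_info (id INTEGER PRIMARY KEY, request_task_type TEXT,
                           context_domain_category TEXT);
CREATE TABLE issue_attribution (context_id INTEGER,
                                issue_caused_by_code_task TEXT);
CREATE TABLE evaluation (context_id INTEGER, severity_of_code_task TEXT);
""")
sessions = [  # (id, domain, issue cause, severity) -- all hypothetical
    (1, "web_dev", "llm",  "major"),
    (2, "web_dev", "user", None),
    (3, "web_dev", "both", "minor"),
    (4, "data",    "user", None),
]
for cid, dom, cause, sev in sessions:
    conn.execute("INSERT INTO context_info VALUES (?, 'coding', ?)", (cid, dom))
    conn.execute("INSERT INTO issue_attribution VALUES (?, ?)", (cid, cause))
    conn.execute("INSERT INTO evaluation VALUES (?, ?)", (cid, sev))

result = conn.execute("""
    SELECT ctx.context_domain_category AS domain,
           COUNT(*) AS n,
           AVG(CASE WHEN ia.issue_caused_by_code_task IN ('llm', 'both')
                    THEN 1 ELSE 0 END) AS llm_issue_rate,
           AVG(CASE ev.severity_of_code_task
               WHEN 'major' THEN 2 WHEN 'minor' THEN 1
               ELSE 0 END) AS avg_severity
    FROM context_info ctx
    JOIN issue_attribution ia ON ia.context_id = ctx.id
    JOIN evaluation ev ON ev.context_id = ctx.id
    WHERE ctx.request_task_type = 'coding'
    GROUP BY domain
    ORDER BY llm_issue_rate DESC
""").fetchall()
print(result)
# web_dev: 2 of 3 sessions LLM-attributed, avg severity (2+0+1)/3 = 1.0
```

Sorting by `llm_issue_rate` descending surfaces the model/domain cells where the LLM is most often at fault.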
Routing Example
Route to the cheapest model whose quality is within 90% of the best
WITH model_perf AS (
SELECT gw.model_id,
AVG(CASE eval.overall_task_type_quality
WHEN 'high' THEN 3 WHEN 'medium' THEN 2
WHEN 'low' THEN 1 ELSE 0 END) AS avg_quality,
AVG((gw.prompt_tokens * mp.input_cost_per_million_token
+ gw.completion_tokens
* mp.output_cost_per_million_token) / 1e6) AS avg_cost
FROM gateway_metrics gw
JOIN llm_response_info llm ON llm.gateway_metrics_id = gw.id
JOIN evaluation eval ON eval.context_id = llm.context_id
JOIN model_provider mp ON mp.model_id = gw.model_id
AND mp.provider_id = gw.provider_id
WHERE gw.is_failed = FALSE
GROUP BY 1
HAVING COUNT(*) >= 30
)
SELECT model_id, avg_quality, avg_cost
FROM model_perf
WHERE avg_quality >= 0.9 * (SELECT MAX(avg_quality) FROM model_perf)
ORDER BY avg_cost;
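The final routing rule (cheapest model within 90% of the best quality) is easy to sanity-check in plain Python. The `(model_id, avg_quality, avg_cost)` tuples below are hypothetical aggregates of the kind the `model_perf` CTE would produce.

```python
# Hypothetical per-model aggregates: (model_id, avg_quality, avg_cost in $).
model_perf = [
    ("model-a", 2.8, 0.0120),  # best quality, most expensive
    ("model-b", 2.6, 0.0020),  # cheaper, quality within 90% of best
    ("model-c", 2.1, 0.0008),  # cheapest, but quality below threshold
]

# Keep models whose quality is within 90% of the best, then take the cheapest.
best_quality = max(q for _, q, _ in model_perf)
candidates = [row for row in model_perf if row[1] >= 0.9 * best_quality]
routed = min(candidates, key=lambda row: row[2])
print(routed[0])  # -> 'model-b'
```

The quality floor (`0.9 * best`) prevents the cost sort from degenerating into always picking the cheapest model, mirroring the `WHERE avg_quality >= 0.9 * MAX(...)` filter in the SQL.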