How We Built a Semantic Highlight Model To Save Token Cost for RAG

Community Article · Published January 15, 2026

Introduction

We trained and open-sourced a bilingual Semantic Highlight model that achieves state-of-the-art performance on both English and Chinese. The model automatically identifies and highlights semantically relevant sentences in retrieved documents based on semantic understanding.

Model Release:

  • HuggingFace: zilliz/semantic-highlight-bilingual-v1
  • License: MIT (commercial-friendly)
  • Architecture: 0.6B Encoder-Only model based on BGE-M3 Reranker v2
  • Context Window: 8192 tokens
  • Supported Languages: English and Chinese

[Figure: HuggingFace model card for zilliz/semantic-highlight-bilingual-v1]

In this article, we'll share our technical approach.


The Problem: RAG Token Cost and Quality

In production RAG systems, a typical query retrieves 10 documents with several thousand tokens each, consuming tens of thousands of tokens per query. The problem: only a few dozen sentences actually contain relevant information, while the rest is noise that increases costs and degrades answer quality.

This creates a pressing need for a targeted highlight model that highlights and retains only the contextually relevant sentences while pruning away irrelevant noise, a technique also widely known as context pruning.

Traditional keyword-based highlighting can't solve this problem. When a user asks, "How to improve Python code execution efficiency?", traditional systems can only highlight words like "Python" and "efficiency." But the truly useful content—"Use numpy vectorized operations instead of loops"—contains none of the query terms and gets ignored.

This problem becomes even more severe in AI Agent scenarios, where queries are complex instructions produced after reasoning and task decomposition. Traditional highlighting mechanically marks matching words but misses the truly valuable analytical conclusions.

[Figure: Semantic highlighting in a RAG workflow]

Semantic Highlighting solves this problem. It identifies sentences that semantically answer the query, even without keyword matches. This approach offers:

  1. 70-80% token cost reduction by sending only highlighted sentences to the LLM
  2. Improved answer quality as the LLM focuses on relevant content
  3. System interpretability showing why documents were retrieved and which sentences matter
  4. Easier debugging for engineers to trace retrieval issues

What we need is a lightweight, fast, and cost-effective small model (hundreds of MB, millisecond-level inference) deployable on search servers for real-time computation.
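To make the intended usage concrete, here is a minimal sketch of where such a highlight model sits in a RAG pipeline. The retriever, the highlight scoring function, and the LLM call are hypothetical placeholders, not a released API:

```python
# Hypothetical sketch: semantic highlighting as a pruning step between
# retrieval and generation. All object interfaces here are illustrative.

def answer_with_highlighting(query, retriever, highlighter, llm,
                             threshold: float = 0.5) -> str:
    # 1. Retrieve candidate chunks (hypothetical retriever interface).
    docs = retriever.search(query, top_k=10)

    # 2. Keep only sentences the highlight model scores above the threshold.
    pruned = []
    for doc in docs:
        scored = highlighter.score_sentences(query, doc.text)  # [(sentence, score), ...]
        kept = [sent for sent, score in scored if score >= threshold]
        if kept:
            pruned.append(" ".join(kept))

    # 3. Only the highlighted sentences reach the LLM, instead of full chunks,
    #    which is where the token savings come from.
    prompt = (
        "Answer the question using the context.\n\n"
        "Context:\n" + "\n\n".join(pruned) + f"\n\nQuestion: {query}"
    )
    return llm.generate(prompt)
```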


The Dilemma of Existing Models

We investigated existing solutions but found they didn't quite meet our requirements.

OpenSearch's Model: Limited Context Window

OpenSearch released opensearch-semantic-highlighter-v1, a model specifically for semantic highlighting.

[Figure: OpenSearch semantic highlighter model page]

However, it's based on the BERT architecture with a 512-token limit—roughly 400-500 English words, which is not enough for real-world scenarios.

Provence/XProvence: Multilingual Trade-offs

Naver's Provence model series was trained for Context Pruning—a task technically similar to Semantic Highlighting.

[Figure: Provence paper on arXiv]

Provence is a monolingual English model with strong performance. XProvence extends this to over a dozen languages, but multilingual models typically show performance degradation compared to their monolingual counterparts.

There's also a licensing consideration: both use the CC BY-NC 4.0 license, which restricts commercial use.

Open Provence: Open-source but Only English and Japanese

Open Provence is an outstanding open-source project that fully reproduces Provence's training pipeline.

[Figure: Open Provence project page]

It includes training scripts, data processing tools, evaluation frameworks, and pre-trained models at different scales—all under an MIT license.

However, it currently supports only English and Japanese.

Our Choice: Train a Bilingual Model

No existing model can meet all our needs:

  • Supports both English and Chinese
  • Large enough context window
  • Good out-of-domain generalization
  • Good performance in Semantic Highlight scenarios
  • Friendly license (MIT or Apache 2.0)

Since no suitable model exists on the market, we decided to train one ourselves.


Our Technical Approach

Training a model in this scenario isn't inherently difficult; what's challenging is training a good model that overcomes all the above problems and achieves near-SOTA performance. Our approach:

On the model side, we use the classic Encoder-Only small model architecture for fast inference performance.

On the data side, higher-quality training datasets lead to better training results. We use reasoning LLMs to generate high-quality data and leverage local model inference frameworks to accelerate and scale data generation.

Model Architecture: The Provence Approach

We adopted the Provence approach, which uses a lightweight Encoder-Only model that frames context pruning as a token-level scoring task.

Why Encoder-Only?

Although BERT-like Encoder-Only architectures are no longer cutting-edge, they are significantly faster and more efficient than modern LLMs. Their key property is that a single forward pass produces a score for every token position, so all token scores are computed in parallel during both training and inference.

Inference Process:

The inference process is straightforward (a minimal code sketch follows the figure below):

  1. Concatenate inputs as [BOS] + Query + Context
  2. Score each token in the context (between 0 and 1)
  3. Average token scores within each sentence to obtain sentence scores
  4. Highlight sentences with high scores while removing those with low scores

[Figure: Semantic highlighting with sentence-level filtering]
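As a rough illustration of steps 1-4, the sketch below assumes a per-token scoring function (an encoder with a token-level head) and shows the sentence-level aggregation and thresholding; the exact tokenization and head details of the released model may differ:

```python
import re
from typing import Callable, List, Tuple

def highlight(
    query: str,
    context: str,
    token_scorer: Callable[[str, str], List[Tuple[Tuple[int, int], float]]],
    threshold: float = 0.5,
) -> List[str]:
    """Sentence-level highlighting from per-token relevance scores.

    `token_scorer` is an assumed callable that encodes [BOS] + query + context
    (step 1) and returns, for each context token, its character span in
    `context` and a relevance score in [0, 1] (step 2).
    """
    # Naive sentence segmentation; production systems would use a proper splitter.
    sentences, spans, start = [], [], 0
    for sent in re.split(r"(?<=[.!?])\s+", context):
        begin = context.find(sent, start)
        sentences.append(sent)
        spans.append((begin, begin + len(sent)))
        start = begin + len(sent)

    token_scores = token_scorer(query, context)

    highlighted = []
    for sent, (s_begin, s_end) in zip(sentences, spans):
        # Step 3: average the scores of tokens falling inside this sentence.
        in_sent = [score for (t_begin, t_end), score in token_scores
                   if t_begin >= s_begin and t_end <= s_end]
        sent_score = sum(in_sent) / len(in_sent) if in_sent else 0.0
        # Step 4: keep sentences whose average score clears the threshold.
        if sent_score >= threshold:
            highlighted.append(sent)
    return highlighted
```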

Base Model: BGE-M3 Reranker v2

We selected BGE-M3 Reranker v2 as our base model for several reasons (a minimal architecture sketch follows the list):

  1. It employs an Encoder architecture suitable for token and sentence scoring
  2. Supports multiple languages with optimization for both English and Chinese
  3. Provides an 8192-token context window appropriate for longer RAG documents
  4. Maintains 0.6B parameters—strong enough without being computationally heavy
  5. Ensures sufficient world knowledge in the base model
  6. Trained for reranking, which closely aligns with relevance judgment tasks
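As an illustration of this architecture choice (not our exact training code), one way to put a token-level scoring head on the BGE-M3 reranker backbone is to load it as a token-classification model; the freshly initialized head then plays the role of the pruning head once trained:

```python
# Illustrative sketch: attach a token-scoring ("pruning") head to the
# BGE-M3 reranker backbone. This mirrors the idea, not the exact recipe.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-v2-m3")
# num_labels=1 gives one logit per token; sigmoid maps it to a [0, 1] score.
# The token-classification head is newly initialized here and only becomes
# a useful pruning head after training.
model = AutoModelForTokenClassification.from_pretrained(
    "BAAI/bge-reranker-v2-m3", num_labels=1
)

query = "How to improve Python code execution efficiency?"
context = "Use numpy vectorized operations instead of loops."

# Query and context are encoded together so context tokens attend to the query.
inputs = tokenizer(query, context, return_tensors="pt",
                   truncation=True, max_length=8192)
with torch.no_grad():
    logits = model(**inputs).logits.squeeze(-1)  # (1, seq_len)
token_scores = torch.sigmoid(logits)             # per-token relevance in [0, 1]
```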

Training Data: LLM Annotation with Reasoning Process

The key to our success was data construction. We had the LLM (Qwen3 8B) output its complete reasoning process during annotation. The annotation workflow is as follows:

[Figure: Annotation generation workflow]

Each training sample includes not only Query, Context, and Sentence Spans fields, but also a Think Process field that records the LLM's reasoning.
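A hypothetical annotated sample might look like the following; the exact field names in the released dataset may differ:

```python
# Hypothetical shape of one annotated training sample; actual field names
# in the released dataset may differ.
sample = {
    "query": "How to improve Python code execution efficiency?",
    "context": "Python is an interpreted language. Use numpy vectorized "
               "operations instead of loops. The weather was nice that day.",
    # Character spans (start inclusive, end exclusive) of relevant sentences.
    "relevant_sentence_spans": [[35, 84]],
    # The LLM's reasoning, kept for self-verification and later debugging.
    "think_process": "The query asks about execution efficiency. The second "
                     "sentence gives a concrete optimization technique, so it "
                     "is relevant; the other sentences are background or noise.",
}
```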

Why include the reasoning process?

This approach provides several benefits:

  1. Higher annotation quality: Writing the reasoning process serves as self-verification, reducing errors
  2. Observable and debuggable: We can see why specific sentences were selected
  3. Enables debugging: Reveals whether incorrect annotations stem from prompt issues or knowledge gaps
  4. Data reusability: Provides reference explanation patterns for future re-annotation with different models

Why Qwen3 8B?

We used Qwen3 8B for annotation because it naturally supports a thinking mode with <think> outputs. The 8B size strikes the right balance—smaller models lack stability, while larger ones are too slow and expensive.

We ran annotation using a local vLLM service rather than cloud APIs, which gives high concurrent throughput and keeps costs down by trading local GPU time for API token spend.
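The sketch below shows roughly how such batch annotation could be run with vLLM's offline inference API; the prompt template and output parsing are simplified placeholders for the actual annotation pipeline:

```python
# Rough sketch of batch annotation with a local vLLM instance.
# The prompt template and parsing are simplified placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B")  # Qwen3's thinking mode is enabled by default
params = SamplingParams(temperature=0.6, max_tokens=2048)

def build_prompt(query: str, context: str) -> str:
    return (
        "Given the query and the document, think step by step and then list "
        "the sentences that answer the query.\n"
        f"Query: {query}\nDocument: {context}"
    )

pairs = [("How to improve Python code execution efficiency?",
          "Use numpy vectorized operations instead of loops. ...")]

outputs = llm.chat(
    [[{"role": "user", "content": build_prompt(q, c)}] for q, c in pairs],
    params,
)
for out in outputs:
    text = out.outputs[0].text  # contains the <think>...</think> reasoning
    # ...parse the reasoning and the selected sentence spans here...
```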

Dataset Scale:

Ultimately, we constructed nearly 5 million bilingual training samples, split evenly between English and Chinese.

  • English data came from MS MARCO, Natural Questions, and GooAQ
  • Chinese data came from DuReader, Chinese Wikipedia, and mmarco_chinese

Some of the data came from Open Provence and similar sources and was re-annotated; other portions were generated from raw corpora by first generating queries and contexts and then annotating them.

All annotated training data is also available on HuggingFace for community development and training reference: https://huggingface.co/zilliz/datasets

[Figure: Zilliz datasets on HuggingFace]

Training Process

With the model architecture and dataset prepared, we trained on 8 A100 GPUs for 3 epochs over approximately 9 hours.

The training focused on the Pruning Head for the Semantic Highlight task without training the Rerank Head, which helped us achieve better performance on this specific task.
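As a minimal illustration of what training the Pruning Head might look like, the sketch below assumes per-token binary labels derived from the annotated sentence spans (tokens inside relevant sentences labeled 1, everything else 0); it is an assumption-laden sketch, not our exact training loop:

```python
# Illustrative token-level objective for the pruning head (not the exact recipe):
# each context token gets a 0/1 relevance label and we minimize binary
# cross-entropy over the context tokens only.
import torch
import torch.nn.functional as F

def pruning_head_loss(token_logits: torch.Tensor,
                      token_labels: torch.Tensor,
                      context_mask: torch.Tensor) -> torch.Tensor:
    """
    token_logits:  (batch, seq_len) raw scores from the pruning head
    token_labels:  (batch, seq_len) 0/1 relevance labels for context tokens
    context_mask:  (batch, seq_len) 1 for context tokens, 0 for query/special/padding
    """
    loss = F.binary_cross_entropy_with_logits(
        token_logits, token_labels.float(), reduction="none"
    )
    # Only context tokens contribute; query and padding positions are masked out.
    return (loss * context_mask).sum() / context_mask.sum().clamp(min=1)
```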


Evaluation Results: Achieving SOTA Performance

We compared different models' performance across multiple datasets, including:

  • English multi-span QA dataset (multispanqa)
  • Wikipedia out-of-domain dataset (wikitext2)
  • Chinese multi-span QA dataset (multispanqa_zh)
  • Chinese version of Wikipedia out-of-domain dataset (wikitext2_zh)

Evaluated models include the Open Provence series, Naver's Provence/XProvence series, OpenSearch's semantic-highlighter, and our trained bilingual model.

[Figure: Model evaluation results across the four datasets]

Key findings:

  1. Our model ranks first across all four evaluation datasets
  2. It's the only model that demonstrates strong performance on both English and Chinese
  3. Other models either support only English or show significant performance degradation on Chinese text

Real-World Case Study: Precisely Identifying Core Sentences

Beyond benchmark scores, let's examine a more interesting example to intuitively demonstrate our model's performance in practical applications.

Question: "Who wrote The Killing of a Sacred Deer?"

Text (5 sentences total):

1. The Killing of a Sacred Deer is a 2017 psychological horror film directed by Yorgos Lanthimos,
   with a screenplay by Lanthimos and Efthymis Filippou.

2. The film stars Colin Farrell, Nicole Kidman, Barry Keoghan, Raffey Cassidy,
   Sunny Suljic, Alicia Silverstone, and Bill Camp.

3. The story is based on the ancient Greek playwright Euripides' play Iphigenia in Aulis.

4. The film tells the story of a cardiac surgeon (Farrell) who secretly
   befriends a teenager (Keoghan) connected to his past.

5. He introduces the boy to his family, who then mysteriously fall ill.

Correct Answer: Sentence 1 (explicitly states "screenplay by Lanthimos and Efthymis Filippou")

This example has a trap: Sentence 3 mentions that "Euripides" wrote the original play. But the question asks "who wrote the film The Killing of a Sacred Deer," and the answer should be the film's screenwriters, not the Greek playwright from thousands of years ago.

Model Performance:

Model          Found Correct Answer   Prediction
Our Model      Yes                    Selected sentences 1 (correct) and 3
XProvence v1   No                     Only selected sentence 3, missed the correct answer
XProvence v2   No                     Only selected sentence 3, missed the correct answer

Key Sentence Score Comparison:

Sentence                                       Our Model   XProvence v1   XProvence v2
Sentence 1 (film screenplay, correct answer)   0.915       0.133          0.081
Sentence 3 (original play, distractor)         0.719       0.947          0.802

The results are revealing:

XProvence's Problem:

  • Strongly attracted to "Euripides" and "play," giving sentence 3 near-perfect scores (0.947 and 0.802)
  • Completely ignores the actual answer (sentence 1), giving extremely low scores (0.133 and 0.081)
  • Even when lowering the threshold from 0.5 to 0.2, it still can't find the correct answer

Our Model's Performance:

  • Gives the correct answer (sentence 1) a high score of 0.915, clearly identifying the film screenwriters
  • Also gives sentence 3 some score (0.719) since it mentions information related to the play
  • The distinction is clear: 0.915 vs 0.719, with a gap of nearly 0.2

This example demonstrates our model's key strength: understanding the true intent of questions.

In the context of a film encyclopedia, "Who wrote The Killing of a Sacred Deer" clearly asks about the film's screenwriters. Although the text contains both screenplay and original play information, our model accurately identifies what the user is looking for.


Standing on the Shoulders of Giants

This model's development builds on significant prior work, and we want to acknowledge the contributions that made our work possible:

  1. Provence's theoretical foundation: Proposed an elegant approach of using lightweight Encoder models for context pruning
  2. Open Provence codebase: Provided well-implemented training pipelines, data processing, and model heads with open-source licensing

Building on these foundations, we contributed several innovations:

  1. LLM annotation with reasoning processes to improve data quality
  2. Nearly 5 million bilingual training samples covering English and Chinese scenarios aligned with practical needs
  3. Selection of a base model more suitable for RAG scenarios (BGE-M3 Reranker v2)
  4. Focused training on the Pruning Head for the Semantic Highlight task

We sincerely thank the Provence team and Open Provence project contributors for their foundational work.


Open Source Release and Getting Started

We're now open-sourcing our model under the MIT license, making it safe for commercial use.

Model Download: https://huggingface.co/zilliz/semantic-highlight-bilingual-v1

Training Data: https://huggingface.co/zilliz/datasets
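For a quick start, usage might look roughly like the following; the exact loading and inference API is documented on the model card, so treat the process()-style call here (borrowed from the Provence / Open Provence convention) as an assumption:

```python
# Hypothetical quick-start sketch. The real API is defined on the model card;
# the process()-style call below follows the Provence / Open Provence
# convention and may differ from the released interface.
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "zilliz/semantic-highlight-bilingual-v1", trust_remote_code=True
)

query = "How to improve Python code execution efficiency?"
context = (
    "Python is an interpreted language. "
    "Use numpy vectorized operations instead of loops. "
    "The weather was nice that day."
)

result = model.process(query, context)  # assumed API: returns kept sentences/scores
print(result)
```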

Additionally, we're working on wrapping model inference as a service and integrating it into Milvus as a Semantic Highlight interface. This will be available soon.


Conclusion

In this article, we shared our journey from identifying the token cost problem in production RAG systems to building a state-of-the-art bilingual Semantic Highlighting model:

  1. We analyzed the limitations of traditional keyword-based highlighting across different scenarios
  2. We evaluated existing solutions and identified their shortcomings
  3. We developed a novel training methodology using LLM annotation with reasoning processes
  4. We achieved SOTA performance on both English and Chinese datasets
  5. We open-sourced our model and training data under the MIT license for the community

This model addresses multiple real-world production requirements: strong bilingual performance, a sufficient context window, good generalization, and a commercially friendly open-source license.

We hope this model helps developers build RAG/Agent systems at lower cost and higher quality, while improving their debuggability and interpretability. It can also be extended to other text retrieval systems, such as recommendation systems, as a semantic highlighting feature. Feel free to try it out and share your feedback.


Related Links

Community

TIL of semantic highlights, thanks a lot for this blog and models!

Congratulations on the release, great work!!
I have a doubt and would love some clarification.
Why isn’t top-K reranking sufficient for token cost reduction in production RAG systems, and in which scenarios does semantic highlighting provide the biggest advantage over rerankers?
Additionally, I wanted to ask:
Can the semantic highlight model further break down or split sentences into smaller, more fine-grained relevant spans (instead of selecting full sentences), or is sentence-level pruning the intended granularity?

Article author:

A reranker selects a few chunks from a large topK chunk set. It sends entire chunks (100–3000 tokens each) to the LLM, but most tokens within a chunk are irrelevant.

Semantic highlight breaks the chunk into smaller spans (you can design the model to score sentences vs. sub-sentence spans). It chooses a few sentences (10-100 tokens) from every document.

Can this highlight approach be used on the final answer generated by the LLM?

Most of the time I find the LLM-generated final answer too verbose. Highlighting within the answer would give users the ability to see the parts that actually matter.


It seems the final answer can be treated as a chunk, so we can highlight it the same way.
