How We Built a Semantic Highlight Model To Save Token Cost for RAG

Community Article · Published January 15, 2026

Introduction

We trained and open-sourced a bilingual Semantic Highlight model that achieves state-of-the-art performance on both English and Chinese. The model automatically identifies and highlights semantically relevant sentences in retrieved documents based on semantic understanding.

Model Release:

  • HuggingFace: zilliz/semantic-highlight-bilingual-v1
  • License: MIT (commercial-friendly)
  • Architecture: 0.6B Encoder-Only model based on BGE-M3 Reranker v2
  • Context Window: 8192 tokens
  • Supported Languages: English and Chinese

[Figure: HuggingFace model card for zilliz/semantic-highlight-bilingual-v1]

In this article, we'll share our technical approach.


The Problem: RAG Token Cost and Quality

In production RAG systems, a typical query retrieves 10 documents with several thousand tokens each, consuming tens of thousands of tokens per query. The problem: only a few dozen sentences actually contain relevant information, while the rest is noise that increases costs and degrades answer quality.

This creates a pressing need for a targeted highlight model that highlights and retains only the contextually relevant sentences while pruning away irrelevant noise, a technique also widely known as context pruning.

Traditional keyword-based highlighting can't solve this problem. When a user asks, "How to improve Python code execution efficiency?", traditional systems can only highlight words like "Python" and "efficiency." But the truly useful content—"Use numpy vectorized operations instead of loops"—contains none of the query terms and gets ignored.

This problem becomes even more severe in AI Agent scenarios, where queries are complex instructions produced after reasoning and task decomposition. Traditional highlighting mechanically marks matching words but misses the truly valuable analytical conclusions.

[Figure: Semantic highlighting in a RAG workflow]

Semantic Highlighting solves this problem. It identifies sentences that semantically answer the query, even without keyword matches. This approach offers:

  1. 70-80% token cost reduction by sending only highlighted sentences to the LLM
  2. Improved answer quality as the LLM focuses on relevant content
  3. System interpretability showing why documents were retrieved and which sentences matter
  4. Easier debugging for engineers to trace retrieval issues

What we need is a lightweight, fast, and cost-effective small model (hundreds of MB, millisecond-level inference) deployable on search servers for real-time computation.
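To make the intended usage concrete, here is a minimal sketch of where such a highlight model sits in a RAG pipeline. The retriever, the highlight scoring function, and the LLM call are hypothetical placeholders, not a released API:

```python
# Hypothetical sketch: semantic highlighting as a pruning step between
# retrieval and generation. All object interfaces here are illustrative.

def answer_with_highlighting(query, retriever, highlighter, llm,
                             threshold: float = 0.5) -> str:
    # 1. Retrieve candidate chunks (hypothetical retriever interface).
    docs = retriever.search(query, top_k=10)

    # 2. Keep only sentences the highlight model scores above the threshold.
    pruned = []
    for doc in docs:
        scored = highlighter.score_sentences(query, doc.text)  # [(sentence, score), ...]
        kept = [sent for sent, score in scored if score >= threshold]
        if kept:
            pruned.append(" ".join(kept))

    # 3. Only the highlighted sentences reach the LLM, instead of full chunks,
    #    which is where the token savings come from.
    prompt = (
        "Answer the question using the context.\n\n"
        "Context:\n" + "\n\n".join(pruned) + f"\n\nQuestion: {query}"
    )
    return llm.generate(prompt)
```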


The Dilemma of Existing Models

We investigated existing solutions but found they didn't quite meet our requirements.

OpenSearch's Model: Limited Context Window

OpenSearch released opensearch-semantic-highlighter-v1, a model specifically for semantic highlighting.

[Figure: OpenSearch semantic highlighter model page]

However, it's based on the BERT architecture with a 512-token limit—roughly 400-500 English words, which is not enough for real-world scenarios.

Provence/XProvence: Multilingual Trade-offs

Naver's Provence model series was trained for Context Pruning—a task technically similar to Semantic Highlighting.

[Figure: Provence paper on arXiv]

Provence is a monolingual English model with strong performance. XProvence extends this to over a dozen languages, but multilingual models typically show performance degradation compared to their monolingual counterparts.

There's also a licensing consideration: both use the CC BY-NC 4.0 license, which restricts commercial use.

Open Provence: Open-source but Only English and Japanese

Open Provence is an outstanding open-source project that fully reproduces Provence's training pipeline.

[Figure: Open Provence project page]

It includes training scripts, data processing tools, evaluation frameworks, and pre-trained models at different scales—all under an MIT license.

However, it currently supports only English and Japanese.

Our Choice: Train a Bilingual Model

No existing model can meet all our needs:

  • Supports both English and Chinese
  • Large enough context window
  • Good out-of-domain generalization
  • Good performance in Semantic Highlight scenarios
  • Friendly license (MIT or Apache 2.0)

Since no suitable model exists on the market, we decided to train one ourselves.


Our Technical Approach

Training a model in this scenario isn't inherently difficult; what's challenging is training a good model that overcomes all the above problems and achieves near-SOTA performance. Our approach:

On the model side, we use the classic Encoder-Only small model architecture for fast inference performance.

On the data side, higher-quality training datasets lead to better training results. We use reasoning LLMs to generate high-quality data and leverage local model inference frameworks to accelerate and scale data generation.

Model Architecture: The Provence Approach

We adopted the Provence approach, which uses a lightweight Encoder-Only model that frames context pruning as a token-level scoring task.

Why Encoder-Only?

Although BERT-like Encoder-Only architectures are no longer cutting-edge, they are significantly faster and more efficient than modern LLMs. Their key property is that a single forward pass produces a score for every token position, so all token scores are computed in parallel during both training and inference.

Inference Process:

The inference process is straightforward (a minimal code sketch follows the figure below):

  1. Concatenate inputs as [BOS] + Query + Context
  2. Score each token in the context (between 0 and 1)
  3. Average token scores within each sentence to obtain sentence scores
  4. Highlight sentences with high scores while removing those with low scores

[Figure: Semantic highlighting with sentence-level filtering]
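As a rough illustration of steps 1-4, the sketch below assumes a per-token scoring function (an encoder with a token-level head) and shows the sentence-level aggregation and thresholding; the exact tokenization and head details of the released model may differ:

```python
import re
from typing import Callable, List, Tuple

def highlight(
    query: str,
    context: str,
    token_scorer: Callable[[str, str], List[Tuple[Tuple[int, int], float]]],
    threshold: float = 0.5,
) -> List[str]:
    """Sentence-level highlighting from per-token relevance scores.

    `token_scorer` is an assumed callable that encodes [BOS] + query + context
    (step 1) and returns, for each context token, its character span in
    `context` and a relevance score in [0, 1] (step 2).
    """
    # Naive sentence segmentation; production systems would use a proper splitter.
    sentences, spans, start = [], [], 0
    for sent in re.split(r"(?<=[.!?])\s+", context):
        begin = context.find(sent, start)
        sentences.append(sent)
        spans.append((begin, begin + len(sent)))
        start = begin + len(sent)

    token_scores = token_scorer(query, context)

    highlighted = []
    for sent, (s_begin, s_end) in zip(sentences, spans):
        # Step 3: average the scores of tokens falling inside this sentence.
        in_sent = [score for (t_begin, t_end), score in token_scores
                   if t_begin >= s_begin and t_end <= s_end]
        sent_score = sum(in_sent) / len(in_sent) if in_sent else 0.0
        # Step 4: keep sentences whose average score clears the threshold.
        if sent_score >= threshold:
            highlighted.append(sent)
    return highlighted
```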

Base Model: BGE-M3 Reranker v2

We selected BGE-M3 Reranker v2 as our base model for several reasons (a minimal architecture sketch follows the list):

  1. It employs an Encoder architecture suitable for token and sentence scoring
  2. Supports multiple languages with optimization for both English and Chinese
  3. Provides an 8192-token context window appropriate for longer RAG documents
  4. Maintains 0.6B parameters—strong enough without being computationally heavy
  5. Ensures sufficient world knowledge in the base model
  6. Trained for reranking, which closely aligns with relevance judgment tasks
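As an illustration of this architecture choice (not our exact training code), one way to put a token-level scoring head on the BGE-M3 reranker backbone is to load it as a token-classification model; the freshly initialized head then plays the role of the pruning head once trained:

```python
# Illustrative sketch: attach a token-scoring ("pruning") head to the
# BGE-M3 reranker backbone. This mirrors the idea, not the exact recipe.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-v2-m3")
# num_labels=1 gives one logit per token; sigmoid maps it to a [0, 1] score.
# The token-classification head is newly initialized here and only becomes
# a useful pruning head after training.
model = AutoModelForTokenClassification.from_pretrained(
    "BAAI/bge-reranker-v2-m3", num_labels=1
)

query = "How to improve Python code execution efficiency?"
context = "Use numpy vectorized operations instead of loops."

# Query and context are encoded together so context tokens attend to the query.
inputs = tokenizer(query, context, return_tensors="pt",
                   truncation=True, max_length=8192)
with torch.no_grad():
    logits = model(**inputs).logits.squeeze(-1)  # (1, seq_len)
token_scores = torch.sigmoid(logits)             # per-token relevance in [0, 1]
```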

Training Data: LLM Annotation with Reasoning Process

The key to our success was data construction. We had the LLM (Qwen3 8B) output its complete reasoning process during annotation. The annotation workflow is as follows:

[Figure: Annotation generation workflow]

Each training sample includes not only Query, Context, and Sentence Spans fields, but also a Think Process field that records the LLM's reasoning.
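A hypothetical annotated sample might look like the following; the exact field names in the released dataset may differ:

```python
# Hypothetical shape of one annotated training sample; actual field names
# in the released dataset may differ.
sample = {
    "query": "How to improve Python code execution efficiency?",
    "context": "Python is an interpreted language. Use numpy vectorized "
               "operations instead of loops. The weather was nice that day.",
    # Character spans (start inclusive, end exclusive) of relevant sentences.
    "relevant_sentence_spans": [[35, 84]],
    # The LLM's reasoning, kept for self-verification and later debugging.
    "think_process": "The query asks about execution efficiency. The second "
                     "sentence gives a concrete optimization technique, so it "
                     "is relevant; the other sentences are background or noise.",
}
```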

Why include the reasoning process?

This approach provides several benefits:

  1. Higher annotation quality: Writing the reasoning process serves as self-verification, reducing errors
  2. Observable and debuggable: We can see why specific sentences were selected
  3. Enables debugging: Reveals whether incorrect annotations stem from prompt issues or knowledge gaps
  4. Data reusability: Provides reference explanation patterns for future re-annotation with different models

Why Qwen3 8B?

We used Qwen3 8B for annotation because it naturally supports a thinking mode with <think> outputs. The 8B size strikes the right balance—smaller models lack stability, while larger ones are too slow and expensive.

We ran annotation using a local vLLM service rather than cloud APIs, which gives high concurrent throughput and keeps costs down by trading local GPU time for API token spend.
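The sketch below shows roughly how such batch annotation could be run with vLLM's offline inference API; the prompt template and output parsing are simplified placeholders for the actual annotation pipeline:

```python
# Rough sketch of batch annotation with a local vLLM instance.
# The prompt template and parsing are simplified placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B")  # Qwen3's thinking mode is enabled by default
params = SamplingParams(temperature=0.6, max_tokens=2048)

def build_prompt(query: str, context: str) -> str:
    return (
        "Given the query and the document, think step by step and then list "
        "the sentences that answer the query.\n"
        f"Query: {query}\nDocument: {context}"
    )

pairs = [("How to improve Python code execution efficiency?",
          "Use numpy vectorized operations instead of loops. ...")]

outputs = llm.chat(
    [[{"role": "user", "content": build_prompt(q, c)}] for q, c in pairs],
    params,
)
for out in outputs:
    text = out.outputs[0].text  # contains the <think>...</think> reasoning
    # ...parse the reasoning and the selected sentence spans here...
```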

Dataset Scale:

Ultimately, we constructed nearly 5 million bilingual training samples, split evenly between English and Chinese.

  • English data came from MS MARCO, Natural Questions, and GooAQ
  • Chinese data came from DuReader, Chinese Wikipedia, and mmarco_chinese

Some of the data came from Open Provence and similar sources and was re-annotated; other portions were generated from raw corpora by first generating queries and contexts and then annotating them.

All annotated training data is also available on HuggingFace for community development and training reference: https://huggingface.co/zilliz/datasets

[Figure: Zilliz datasets on HuggingFace]

Training Process

With the model architecture and dataset prepared, we trained on 8 A100 GPUs for 3 epochs over approximately 9 hours.

The training focused on the Pruning Head for the Semantic Highlight task without training the Rerank Head, which helped us achieve better performance on this specific task.
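As a minimal illustration of what training the Pruning Head might look like, the sketch below assumes per-token binary labels derived from the annotated sentence spans (tokens inside relevant sentences labeled 1, everything else 0); it is an assumption-laden sketch, not our exact training loop:

```python
# Illustrative token-level objective for the pruning head (not the exact recipe):
# each context token gets a 0/1 relevance label and we minimize binary
# cross-entropy over the context tokens only.
import torch
import torch.nn.functional as F

def pruning_head_loss(token_logits: torch.Tensor,
                      token_labels: torch.Tensor,
                      context_mask: torch.Tensor) -> torch.Tensor:
    """
    token_logits:  (batch, seq_len) raw scores from the pruning head
    token_labels:  (batch, seq_len) 0/1 relevance labels for context tokens
    context_mask:  (batch, seq_len) 1 for context tokens, 0 for query/special/padding
    """
    loss = F.binary_cross_entropy_with_logits(
        token_logits, token_labels.float(), reduction="none"
    )
    # Only context tokens contribute; query and padding positions are masked out.
    return (loss * context_mask).sum() / context_mask.sum().clamp(min=1)
```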


Evaluation Results: Achieving SOTA Performance

We compared different models' performance across multiple datasets, including:

  • English multi-span QA dataset (multispanqa)
  • Wikipedia out-of-domain dataset (wikitext2)
  • Chinese multi-span QA dataset (multispanqa_zh)
  • Chinese version of Wikipedia out-of-domain dataset (wikitext2_zh)

Evaluated models include the Open Provence series, Naver's Provence/XProvence series, OpenSearch's semantic-highlighter, and our trained bilingual model.

[Figure: Model evaluation results across the four datasets]

Key findings:

  1. Our model ranks first across all four evaluation datasets
  2. It's the only model that demonstrates strong performance on both English and Chinese
  3. Other models either support only English or show significant performance degradation on Chinese text

Real-World Case Study: Precisely Identifying Core Sentences

Beyond benchmark scores, let's examine a more interesting example to intuitively demonstrate our model's performance in practical applications.

Question: "Who wrote The Killing of a Sacred Deer?"

Text (5 sentences total):

1. The Killing of a Sacred Deer is a 2017 psychological horror film directed by Yorgos Lanthimos,
   with a screenplay by Lanthimos and Efthymis Filippou.

2. The film stars Colin Farrell, Nicole Kidman, Barry Keoghan, Raffey Cassidy,
   Sunny Suljic, Alicia Silverstone, and Bill Camp.

3. The story is based on the ancient Greek playwright Euripides' play Iphigenia in Aulis.

4. The film tells the story of a cardiac surgeon (Farrell) who secretly
   befriends a teenager (Keoghan) connected to his past.

5. He introduces the boy to his family, who then mysteriously fall ill.

Correct Answer: Sentence 1 (explicitly states "screenplay by Lanthimos and Efthymis Filippou")

This example has a trap: Sentence 3 mentions that "Euripides" wrote the original play. But the question asks "who wrote the film The Killing of a Sacred Deer," and the answer should be the film's screenwriters, not the Greek playwright from thousands of years ago.

Model Performance:

Model          Found Correct Answer   Prediction
Our Model      Yes                    Selected sentences 1 (correct) and 3
XProvence v1   No                     Only selected sentence 3, missed the correct answer
XProvence v2   No                     Only selected sentence 3, missed the correct answer

Key Sentence Score Comparison:

Sentence                                       Our Model   XProvence v1   XProvence v2
Sentence 1 (film screenplay, correct answer)   0.915       0.133          0.081
Sentence 3 (original play, distractor)         0.719       0.947          0.802

The results are revealing:

XProvence's Problem:

  • Strongly attracted to "Euripides" and "play," giving sentence 3 near-perfect scores (0.947 and 0.802)
  • Completely ignores the actual answer (sentence 1), giving extremely low scores (0.133 and 0.081)
  • Even when lowering the threshold from 0.5 to 0.2, it still can't find the correct answer

Our Model's Performance:

  • Gives the correct answer (sentence 1) a high score of 0.915, clearly identifying the film screenwriters
  • Also gives sentence 3 some score (0.719) since it mentions information related to the play
  • The distinction is clear: 0.915 vs 0.719, with a gap of nearly 0.2

This example demonstrates our model's key strength: understanding the true intent of questions.

In the context of a film encyclopedia, "Who wrote The Killing of a Sacred Deer" clearly asks about the film's screenwriters. Although the text contains both screenplay and original play information, our model accurately identifies what the user is looking for.


Standing on the Shoulders of Giants

This model's development builds on significant prior work, and we want to acknowledge the contributions that made our work possible:

  1. Provence's theoretical foundation: Proposed an elegant approach of using lightweight Encoder models for context pruning
  2. Open Provence codebase: Provided well-implemented training pipelines, data processing, and model heads with open-source licensing

Building on these foundations, we contributed several innovations:

  1. LLM annotation with reasoning processes to improve data quality
  2. Nearly 5 million bilingual training samples covering English and Chinese scenarios aligned with practical needs
  3. Selection of a base model more suitable for RAG scenarios (BGE-M3 Reranker v2)
  4. Focused training on the Pruning Head for the Semantic Highlight task

We sincerely thank the Provence team and Open Provence project contributors for their foundational work.


Open Source Release and Getting Started

We're now open-sourcing our model under the MIT license, making it safe for commercial use.

Model Download: https://huggingface.co/zilliz/semantic-highlight-bilingual-v1

Training Data: https://huggingface.co/zilliz/datasets
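For a quick start, usage might look roughly like the following; the exact loading and inference API is documented on the model card, so treat the process()-style call here (borrowed from the Provence / Open Provence convention) as an assumption:

```python
# Hypothetical quick-start sketch. The real API is defined on the model card;
# the process()-style call below follows the Provence / Open Provence
# convention and may differ from the released interface.
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "zilliz/semantic-highlight-bilingual-v1", trust_remote_code=True
)

query = "How to improve Python code execution efficiency?"
context = (
    "Python is an interpreted language. "
    "Use numpy vectorized operations instead of loops. "
    "The weather was nice that day."
)

result = model.process(query, context)  # assumed API: returns kept sentences/scores
print(result)
```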

Additionally, we're working on wrapping model inference as a service and integrating it into Milvus as a Semantic Highlight interface. This will be available soon.


Conclusion

In this article, we shared our journey from identifying the token cost problem in production RAG systems to building a state-of-the-art bilingual Semantic Highlighting model:

  1. We analyzed the limitations of traditional keyword-based highlighting across different scenarios
  2. We evaluated existing solutions and identified their shortcomings
  3. We developed a novel training methodology using LLM annotation with reasoning processes
  4. We achieved SOTA performance on both English and Chinese datasets
  5. We open-sourced our model and training data under the MIT license for the community

This model addresses multiple real-world production requirements: strong bilingual performance, a sufficient context window, good generalization, and a commercially friendly open-source license.

We hope this model helps developers build RAG/Agent systems at lower cost and higher quality, while improving their debuggability and interpretability. It can also be extended to other text retrieval systems, such as recommendation systems, as a semantic highlighting feature. Feel free to try it out and share your feedback.


Related Links

Community

TIL of semantic highlights, thanks a lot for this blog and models!

Congratulations on the release, great work!!
I have a doubt and would love some clarification.
Why isn’t top-K reranking sufficient for token cost reduction in production RAG systems, and in which scenarios does semantic highlighting provide the biggest advantage over rerankers?
Additionally, I wanted to ask:
Can the semantic highlight model further break down or split sentences into smaller, more fine-grained relevant spans (instead of selecting full sentences), or is sentence-level pruning the intended granularity?

Article author:

A reranker selects a few chunks from a large topK chunk set. It sends entire chunks (100–3000 tokens each) to the LLM, but most tokens within a chunk are irrelevant.

Semantic highlight breaks the chunk into smaller spans (you can design the model to score sentences vs. sub-sentence spans). It chooses a few sentences (10-100 tokens) from every document.

Can this highlight approach be used on the final answer generated by the LLM?

Most of the time I find the LLM-generated final answer too verbose. Highlighting within the answer would give users the ability to see the parts that actually matter.


It seems the final answer can be treated as a chunk, so we can highlight it the same way.
