KU-DFI/TelecomGPT-R1

wbhVince829 committed on May 20

Commit

42e93e9

1 Parent(s): fd7747b

update v1 0519

Files changed (1)

README.md +75 -104


@@ -3,147 +3,116 @@ license: apache-2.0
 ---
 # TelecomGPT-R1: The Best Open-Source Telecom Large Language Model
-> A 27B open model that ranks **#1 on the GSMA Open Telco Leaderboard**, **#1 among all open-source models by a 27-point margin**, and **beats GPT-5 on 6 of 7 benchmarks**.
 ---
 ## 1 — A New State of the Art for Telecom LLMs
-**TelecomGPT-R1 is the strongest publicly available large language model for telecommunications.** On the public **[GSMA Open Telco Leaderboard](https://huggingface.co/spaces/GSMA/open-telco-leaderboard)** — the standard benchmark suite that aggregates seven public telecom benchmarks across knowledge QA, protocol understanding, fault analysis, and modeling & computation — a single 27B open-source policy ranks **#1 overall out of 86 evaluated models**, **#1 among all open-source models by a 27-point margin**, **wins 6 of 7 benchmark match-ups against GPT-5**, and **leads the leaderboard's hardest axis (TeleTables) by +29.8 points over every other model on the board — open or closed.** No prior open model — and no general-purpose frontier model from OpenAI, Google, or Anthropic — comes close to this combination of breadth and depth on telecom tasks.
-![radar_chart_v0](https://cdn-uploads.huggingface.co/production/uploads/6882f57510e86d9f80580702/-ZkxlB0p1XHmJCEDS6MKb.png)
-**Figure 1 | TelecomGPT-R1 vs frontier closed-source models on the GSMA Open Telco Leaderboard.** *Each spoke is one benchmark (plus the overall average), normalized by its per-axis leaderboard best so that `1.0` = best score on that benchmark. Our 27B open-source policy reaches `1.0` on four of eight axes (3GPP-TSG, TeleLogs, TeleTables, Average) and stays at or above `0.89` on every other axis — visibly tracing the outer edge of the radar where no other model can match it on all axes simultaneously.*
-### Three takeaways
-> **1. #1 among all open-source models by +25.9 points.** TelecomGPT-R1 beats the next-best open model (DeepSeek-V3 at 685B parameters) by a margin larger than DeepSeek's own margin over a 14B base. Our 27B is the new floor for open-source telecom LLMs.
->
-> **2. #1 on the leaderboard's hardest axis by +26.2 points.** TeleTables is the axis where every frontier closed-source model collapses below 50%. The self-rubric reward (introduced in §5) was designed for exactly this regime — and it shows.
->
-> **3. #1 overall — past every operator-internal model too.** TelecomGPT-R1 edges out AT&T's operator-internal OTel-LLM-8.3B on the leaderboard's overall average by **+0.6 points**, carried by a **+29.8-point lead on TeleTables** that more than offsets per-benchmark losses on three knowledge-heavy axes. Open, closed, or operator-trained — no model on the GSMA Open Telco Leaderboard ranks above us.
 ---
 ## 2 — Toward Universal Telecom Reasoning
-Large language models have entered a new era. **General-purpose frontier models** — GPT-5, Claude-Opus-4.6, Gemini-3.1-Pro — write code, prove theorems, and solve olympiad-level problems; their long-CoT reasoning ability sets the modern bar. But step into telecommunications and these same models stumble: a workflow that asks an engineer to *recall a 3GPP clause, follow a multi-step procedure, read a log, and close a link-budget derivation in the same session* breaks them, because what looks like "one telecom problem" is really **four very different kinds of thinking layered on top of dense domain knowledge that their pretraining covers only thinly and never targeted with reasoning-grade supervision**. **Telecom-specialized LLMs** — TelecomGPT, Tele-LLMs, and operator-internal models like AT&T's OTel-LLM and SoftBank's LTM — narrow this gap by training on domain corpora, but they treat each task in isolation, supervise only on *extractive / classification* outputs (answer-the-MCQ, label-the-Tdoc, fill-the-equation, summarize-the-code), and remain closed-source. The reasoning engineers actually do day-to-day — *chain three procedure steps, trace a KPI dip to a root cause, derive capacity bounds, write a MATLAB beamformer that compiles* — is exactly the part current telecom AI cannot do.
-**TelecomGPT-R1 closes this gap with a single open-source model built on top of [Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B).** It inherits the TelecomGPT substrate and extends it into the **reasoning** regime across four telecom axes — knowledge QA, protocol understanding, fault analysis, and modeling & computation — under one policy.
 ![four_axes_radia](https://cdn-uploads.huggingface.co/production/uploads/6882f57510e86d9f80580702/IYs4rpe9Ij1e6KJf5qKy5.png)
 **Figure 2 | The four kinds of reasoning a telecom engineer juggles.** *Each scope shows one axis of telecom work — knowledge QA (15.3%), protocol understanding (22.7%), fault analysis (18.5%), modeling & computation (43.5%) — and the share of the 158,915-example TelecomGPT-R1 training corpus that targets it. The cross-axis distribution explains why we train one unified policy rather than four specialists: a real workflow mixes all four in the same session.*
-### What a telecom engineer actually does
-A telecom engineer's day cuts across four very different kinds of thinking — and a useful AI has to fluidly switch between them:
 | Job | What it looks like in practice |
 |---|---|
 | **Knowledge QA** | "What does 3GPP Release 18 say about [feature]? What's the typical PRACH timing budget?" |
 | **Protocol understanding** | Reading 3GPP / ITU / IETF specs and following multi-step procedure flows |
 | **Fault analysis** | Looking at a PCAP, RAN log, or KPI dashboard and finding the root cause |
-| **Modeling & Computation** | Closing a link-budget or queueing-theory derivation; reconstructing a system-model equation from a paper; writing a MATLAB beamformer that actually runs |
-### Why TelecomGPT-R1 matters
-> **1. Not just retrieves — reasons.** Existing telecom LLMs stop at extractive answers; TelecomGPT-R1 chains multi-step 3GPP procedures, traces KPI dips to root causes from raw logs, closes link-budget and Shannon-capacity derivations end to end, and writes srsRAN-style code that compiles. It thinks through a telecom problem instead of pulling the closest paragraph.
->
-> **2. An autonomous-agent core, not a chat sidekick.** Under one 27B policy, the model covers all four axes a telecom engineer rotates through in a day — knowledge QA, protocol understanding, fault analysis, modeling & computation. That makes it deployable as the reasoning core of an autonomous NOC operator, a spec-compliance bot, or a fault-triage copilot — automating slices of the engineer's workflow rather than living alongside it as another chatbot.
->
-> **3. The strongest open brain for telecom — built to be extended.** TelecomGPT-R1 is **#1 on the GSMA Open Telco Leaderboard**, **+27 points clear of every other open model**, and runnable on a single H100 with weights, recipe, and training data all public. This is the open foundation that operator-specific fine-tunes, downstream telecom agents, and standards-grade drafting tools can build on — without routing operator-confidential traffic through a closed-API black box.
 ---
 ## 3 — How We Did It: The Recipe at a Glance
-The recipe rests on two design decisions: **(i)** treat the entire training corpus as **one unified whole** — 158,915 examples flowing through one shared eight-step curation pipeline before being indexed by axis, never as a stack of benchmark-specific subsets; and **(ii)** post-train with a **three-pillar GRPO reinforcement-learning recipe** that combines DAPO stabilization, an offline difficulty-mined curriculum with multi-stage continual KL anchoring, and a self-rubric reward that decomposes each rollout's score over a set of teacher- or reference-derived rubrics covering structure, logic, format, and key facts — never reducing to a single 0/1 outcome signal. The first decision makes ablations and reweighting modular; the second makes reinforcement learning *survive* on derivation-heavy axes where outcome-only rewards starve the gradient.
-![recipe_4stage_v0.png-2026-05-19-00-20-52-309](https://cdn-uploads.huggingface.co/production/uploads/6882f57510e86d9f80580702/v0pnV58Y3uu3hqPQ6ZQeu.png)
-**Figure 3 | The TelecomGPT-R1 three-stage post-training recipe.** *Stage ① curates heterogeneous telecom sources through an eight-step pipeline into one axis-indexed 158,915-example corpus. Stage ② installs cross-axis long-CoT reasoning on [Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) via LoRA-SFT. Stage ③ runs a single GRPO loop in which **DAPO** stabilizes the gradient, a **difficulty-mined curriculum** advances the prompt distribution from easy to hard, and a **self-rubric reward** — rubrics generated by a teacher LLM or projected from the expert reference, then scored as a decomposed sum of per-rubric binary indicators — replaces the sparse 0/1 outcome signal, yielding the final TelecomGPT-R1 27B open-weight policy.*
-Two stages, one idea:
-> **Treat all four reasoning axes as slices of one corpus and one policy — not as four specialists glued together at inference time.** Everything that follows is engineered around that single-policy claim.
 ---
-## 4 — Stage 1: One Corpus, Eight Steps
-Rather than train one specialist per benchmark, we curate a **single** telecom post-training corpus that passes through **one shared eight-step pipeline** before being indexed by the four reasoning axes.
 ### The eight pipeline steps at a glance
 | Step | What it does | Why it matters |
 |---|---|---|
-| **S1** — Source-grounded extraction | Modality-specific extractors (AST for code, VLM PDF parsing for textbooks, working-group label projection for specs, row-window slicing for tables, formula masking for math papers, engineering-feature aggregation for raw logs) | Different sources, **one common output schema** |
-| **S2** — Long-CoT generation | Three trace generators chosen by **reasoning type**: teacher LLM (with self-validation) for QA, executable-Python-grounded CoT for derivations, deterministic rule-replay CoT for diagnosis | Right tool for each reasoning type — not one teacher for everything |
-| **S3** — Multi-pass verification | Axis-matched verifiers (exact match / unit-tolerant numeric closeness / rule-replay accuracy / on-policy re-answering) | Bad CoTs never enter the corpus |
-| **S4** — Augmentation | Variable resampling 5×–20×; prefix/suffix decomposition into intermediate-target + final-target pairs | One seed → many supervised rows, with intermediate-step supervision |
-| **S5** — Leakage prevention | Cross-benchmark dedup vs. all public eval splits; SHA-256-archived test sets; "no implicit reference" prompt guards | No train→eval leakage; no hallucinated citations |
-| **S6** — Difficulty stratification | Offline difficulty mining; per-axis class rebalancing | The **same** difficulty pass that filters SFT data also feeds the RL curriculum (Stage 2) |
-| **S7** — Format unification | One `{system, user, assistant}` chat schema; fixed answer-format vocabulary; `meta.axis` and `meta.source_track` tags on every row | Train and ablate the corpus **as one whole**, with axis-/source-wise reweighting |
-| **S8** — Style mixing | Small fraction of general-domain long-CoT mixed in | Preserve reflective markers — *"wait…"*, *"hmm…"*, self-correction — that pure telecom traces lack |
-### The result: one corpus, indexed by axis
-All eight steps converge on **one 158,915-example corpus indexed by reasoning axis** — and those same four axes are exactly the lens through which the seven public benchmarks on the GSMA Open Telco Leaderboard evaluate every telecom LLM.
-#### Benchmark → axis mapping
-| Benchmark | Tests |
-|---|---|
-| **TeleQnA** (10k MCQ) | Knowledge QA |
-| **3GPP-TSG** (working-group classification) | Protocol understanding |
-| **ORANBench** (1.5k MCQ on O-RAN specs) | Protocol understanding |
-| **TeleTables** (table-grounded MCQ on 3GPP tables) | Protocol understanding |
-| **TeleLogs** (5G RAN root-cause analysis) | Fault analysis |
-| **TeleMath** (telecom math problem solving) | Modeling & computation |
-| **srsRANBench** (5G code understanding) | Modeling & computation |
-![data_radar_v0](https://cdn-uploads.huggingface.co/production/uploads/6882f57510e86d9f80580702/82qavUtMTYeOs3fEGcIWm.png)
-*Figure 4. The 158,915-example unified telecom corpus, broken down by source track. Outer ring: five source families that fold into the four reasoning axes. Middle ring: individual sub-corpora. Inner radial bars: per-corpus row counts on a log scale.*
-> **Why this matters.** Because every row carries the same chat schema, the same answer-format vocabulary, and the same axis tag, **the corpus is trained and ablated as a single whole** with axis-aware reweighting — not as a stack of benchmark-specific subsets. That single-whole property is what makes the downstream RL recipe modular.
----
-## 5 — Stage 2: Three Reinforcement-Learning Pillars
-After supervised fine-tuning (LoRA adapters on a 27B open base), we refine the model with reinforcement learning. Vanilla RL on long-trace mixed-domain training breaks in well-known ways. We stack **three orthogonal fixes**, each tackling a different failure mode:
-| Pillar | What it fixes | What it does, in plain language |
-|---|---|---|
-| **1 — DAPO Stabilization** | Entropy collapse · zero-gradient groups · long-trace dilution | Asymmetric trust region (lets rare-but-good tokens through), dynamic-sampling filter (drops groups where every attempt is right or every attempt is wrong), token-level loss (long reasoning traces stop being underweighted). Plus: keep a small KL anchor to the SFT model so the structured output layout doesn't drift. |
-| **2 — Difficulty-Mined Curriculum** | Easy axes saturate too fast · hard axes never get gradient | Pre-filter prompts by their pass rate against the SFT model — keep only the ones where the model gets it right *some* of the time. Then train in stages, where each stage is anchored to the *previous* stage rather than to the static SFT model. Anchoring to the latest capable policy avoids catastrophic forgetting when harder slices are introduced. |
-| **3 — Self-Rubric Reward** | A 0/1 outcome reward starves derivation-heavy axes of any signal | For each training prompt, **pre-generate a set of fine-grained rubrics** with a strong teacher LLM (or projected directly from the expert reference solution), jointly covering *structure*, *logic*, *format*, and *key factual content*. At training time the current policy samples K responses; each response is scored as a **decomposed sum of per-rubric binary indicators**, yielding a dense per-attempt reward instead of a sparse 0/1 outcome signal. **GRPO** then consumes this rubric-decomposed reward as its group-relative advantage and updates the policy parameters — propagating gradient through the model even when no rollout reaches the correct final answer. |
-### How the three pillars compose into one training loop
-```
-prompt
-  → keep only if difficulty is "just right"        (Pillar 2)
-  → pre-generate rubrics per prompt                (Pillar 3, offline)
-       teacher LLM or expert reference solution
-       cover: structure · logic · format · facts
-  → current policy samples K responses
-  → score each response against its rubrics        (Pillar 3)
-       per-rubric binary indicator → summed dense reward
-  → drop groups with no learning signal            (Pillars 1 + 2 step-level)
-  → compute group-relative advantage
-  → stabilized GRPO update                          (Pillar 1)
-  → repeat
-```
-Each pillar plugs into the next. Pillar 2 supplies the prompt distribution, Pillar 3 shapes the per-attempt signal, Pillar 1 turns the signal into a stable gradient step. **Remove any one of the three and final accuracy measurably drops** — the lessons in §6 quantify this.
-### Three properties that make Pillar 3 the "universal" enabler
-| Property | Plain-language explanation |
-|---|---|
-| **Dense, decomposed credit** | A response that satisfies even a subset of its rubrics still receives non-zero reward, so groups of rollouts that all fail the final answer continue to produce a usable gradient — escaping the early "all rollouts wrong → no gradient" trap that plagues outcome-only RL. |
-| **Multi-dimensional supervision in one reward** | The rubrics per prompt jointly score *structure*, *logic*, *format*, and *key factual content* — so a single reward simultaneously shapes everything a long-CoT telecom response must satisfy, with no separate format-preservation or factuality-loss terms hanging off the loss. |
-| **Reference-grounded, runtime-cheap** | Rubrics are authored once per prompt — by a strong teacher LLM or projected from the expert reference solution — and at training time the per-rubric checks reduce to lightweight binary indicators. This buys teacher-quality grading criteria without paying a full LLM-judge inference on every rollout. |
 ---
@@ -151,35 +120,27 @@ Each pillar plugs into the next. Pillar 2 supplies the prompt distribution, Pill
 | # | Lesson | Why it matters |
 |:---:|---|---|
-| **1** | **Domain knowledge — not reasoning ability — is the bottleneck.** | A strong general reasoner produces well-formed chains operating on **wrong telecom facts**. RL cannot manufacture knowledge that was never in the model. Invest in SFT data curation *first*. |
-| **2** | **Self-rubric reward is what makes the model universal.** | Without rubric-decomposed credit, a 27B base produces zero correct rollouts on derivation-heavy axes for hundreds of training steps, and RL gets no gradient. Pillar 3 is the difference between "knowledge-QA specialist that guesses on hard axes" and "universal reasoner". |
-| **3** | **Verifier rigor matters as much as reward weights.** | A permissive verifier silently rewards lucky digit matches and penalizes correct reasoning in the wrong format. Unit normalization, tolerance bands, symbolic equivalence, and code execution were all as important as choosing the reward weights themselves. |
-| **4** | **Difficulty-mined curriculum prevents axis collapse.** | Easy axes (RFC-style knowledge QA) saturate within hundreds of RL steps; hard axes (math, code, complex logs) keep improving. Without curriculum, easy axes hog the gradient and stall the rest. |
-| **5** | **Mixing general-domain CoT preserves reasoning style.** | Reflective markers — *"wait…"*, *"hmm…"*, self-correction — are too thin in pure telecom traces. A small general-domain mix improves both naturalness and hardest-axis accuracy. |
 ---
-## 7 — In One Line
-> **For vertical-domain LLMs, knowledge curation deserves at least as much attention as the choice of RL algorithm.**
-TelecomGPT-R1 is what happens when you treat telecom reasoning as **one universal capability** — one corpus, one policy, four axes — and engineer the recipe around that single-whole property end-to-end.
-The model doesn't just *quote* the standard. It *reasons* through it.
----
-## 8 — What This Opens Up Next
-TelecomGPT-R1 is a foundation, not an endpoint. Three directions are within immediate reach, and one is the long-term ambition:
-1. **Production telecom copilots.** Incident-response assistants for NOC operators, real-time fault-diagnosis bots over live log feeds, and spec-compliance automation for vendor implementations — all benefit from a single model that *reasons over heterogeneous evidence* (logs + tables + math + code) instead of stitching together a RAG pipeline per task. With an open-weight 27B reasoner that already leads TeleLogs and TeleTables, the path from research artifact to operations tooling is short.
-2. **Scaling the recipe — bigger model, more modalities.** The unified-corpus + three-pillar-GRPO recipe is parameter- and modality-agnostic. The same eight-step pipeline scales naturally to a 70B / 200B telecom reasoner; the axis-indexed corpus extends naturally to KPI dashboards, network-topology graphs, RF spectrum images, and protocol message-flow diagrams. Both extensions are mechanical to engineer once the single-whole property of the corpus is in place.
-3. **A transferable recipe for any structured-derivation vertical.** The pattern — *heterogeneous sources curated into one axis-indexed corpus, then trained with self-rubric rewards decomposed across structure / logic / format / facts* — is not telecom-specific. Power-grid operations, semiconductor manufacturing, clinical decision support, automotive safety analysis, and other infrastructure verticals all share the same shape (heterogeneous tasks with structured intermediate sub-goals) and should be directly amenable to it.
-4. **Standards-grade co-drafting (the long bet).** Once a reasoning model can simulate a 3GPP procedure flow, verify an equation derivation, and flag cross-spec inconsistencies, the line between *AI that learns from telecom* and *AI that contributes to telecom* begins to blur. We see a plausible path where a future descendant of TelecomGPT-R1 sits inside a 3GPP / IEEE / IETF working group as a drafting assistant — detecting protocol-flow ambiguities, suggesting equation simplifications, and surfacing inconsistencies across releases. This is the direction we are most excited about, and it is what motivates keeping the recipe open: standards are a public good, and the AI that helps draft them should be too.
 ---
@@ -195,7 +156,7 @@ TelecomGPT-R1 is a foundation, not an endpoint. Three directions are within imme
 @inproceedings{wang2026telecomgptr1,
   title     = {TelecomGPT-R1: Post-Training Recipes for Universal Reasoning in Telecom},
   author    = {Wang, Bohao and Wu, Chenwei and Li, Haoyu and Zou, Hang and Tian, Yu
-               and Barial, Lina and Huang, Chongwen and Shen, Zhang, Zhaoyang and Debbah, M\'{e}rouane},
   booktitle = {[Venue coming soon!]},
   year      = {2026}
 }
@@ -206,9 +167,19 @@ TelecomGPT-R1 is a foundation, not an endpoint. Three directions are within imme
   year      ={2025},
   publisher ={IEEE}
 }
 ```
 ### Acknowledgements
 This work was supported by the Digital Future Institute of Khalifa University; the College of Information Science and Electronic Engineering, Zhejiang University; the College of Computer Science and Technology, Zhejiang University; and the Research Computing team of Khalifa University.

 ---
 # TelecomGPT-R1: The Best Open-Source Telecom Large Language Model
+> A 27B open model that ranks **#1 on the GSMA Open Telco Leaderboard**, **#1 among all open-source models by a 27-point margin**, and **beats GPT-5 on 6 of 7 benchmarks**.
 ---
 ## 1 — A New State of the Art for Telecom LLMs
+**TelecomGPT-R1 is the strongest publicly available large language model (LLM) for telecommunications.**
+On the public **[GSMA Open Telco Leaderboard](https://huggingface.co/spaces/GSMA/open-telco-leaderboard)**, the complete benchmark suite that aggregates seven public telecom benchmarks across knowledge QA, protocol understanding, fault analysis, and modeling & computation (see Figure 1), TelecomGPT-R1 is:
+- Ranked #1 Open-Source Globally: TelecomGPT-R1 takes the #1 overall spot among all 86 evaluated models on the leaderboard and leads every other open-source model by a 27-point margin.
+- Outperforming General-Domain Giants: In head-to-head match-ups against GPT-5, our 27B open policy wins 6 out of 7 benchmarks.
+- Cracking the Hardest Axis: On **TeleLogs**, the leaderboard's most difficult axis (multi-step root-cause analysis over dense RAN engineering features and drive-test measurements), TelecomGPT-R1 lifts the score to **97%**, a **+55-point** leap over the strongest open-source baseline (DeepSeek-V3, 685B), and stays ahead of every frontier closed-source generalist.
+![radar_chart_v0](https://cdn-uploads.huggingface.co/production/uploads/6882f57510e86d9f80580702/-ZkxlB0p1XHmJCEDS6MKb.png)
+**Figure 1 | TelecomGPT-R1 vs frontier closed-source models on the GSMA Open Telco Leaderboard.** *Each spoke is one benchmark (plus the overall average), normalized by its per-axis leaderboard best so that `1.0` = best score on that benchmark. Our 27B open-source policy reaches `1.0` on four of eight axes (3GPP-TSG, TeleLogs, TeleTables, Average) and stays at or above `0.89` on every other axis — visibly tracing the outer edge of the radar where no other model can match it on all axes simultaneously.*
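The per-axis normalization described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the plotting code behind the figure: the function name `normalize_by_axis_best` and the toy scores are hypothetical placeholders and are not leaderboard results.

```python
def normalize_by_axis_best(scores):
    """Divide every score by the best score on its axis, so the best model gets 1.0.

    scores: {model_name: {axis_name: raw_score}}
    Assumes the best score on every axis is strictly positive.
    """
    axes = {axis for per_model in scores.values() for axis in per_model}
    best = {
        axis: max(per_model[axis] for per_model in scores.values() if axis in per_model)
        for axis in axes
    }
    return {
        model: {axis: value / best[axis] for axis, value in per_model.items()}
        for model, per_model in scores.items()
    }


# Placeholder numbers for illustration only; NOT taken from the leaderboard.
toy_scores = {
    "model_a": {"axis_1": 80.0, "axis_2": 50.0},
    "model_b": {"axis_1": 40.0, "axis_2": 100.0},
}
normalized = normalize_by_axis_best(toy_scores)
```

With this convention a spoke value of `1.0` means "best on that axis", and any other value is that model's fraction of the best score.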
 ---
 ## 2 — Toward Universal Telecom Reasoning
+The telecommunications sector does not communicate in a single data language. As shown in Figure 2, true telecom intelligence demands working across highly diverse tasks and data modalities: interpreting the legalistic text of 3GPP standards, navigating the highly structured layout of configuration tables, designing the algorithmic logic of MATLAB code, and debugging from the messy strings of raw network logs.
+Until now, general-purpose AI giants have stumbled when confronted with these diverse domain-specific data landscapes, despite their powerful native reasoning abilities. Meanwhile, most existing telecom-domain LLMs have focused on narrow tasks such as log classification or domain-knowledge question answering, leaving them unable to perform complex real-life tasks such as root-cause diagnostics.
+TelecomGPT-R1 represents a definitive leap forward. Built on top of an open-source 27B-parameter base, it establishes a new state of the art as the industry's first true universal reasoning model capable of fluent, polymathic intelligence across diverse telecom tasks and data types (**knowledge QA, protocol understanding, fault analysis, and modeling & computation**) under a single unified policy.
 ![four_axes_radia](https://cdn-uploads.huggingface.co/production/uploads/6882f57510e86d9f80580702/IYs4rpe9Ij1e6KJf5qKy5.png)
 **Figure 2 | The four kinds of reasoning a telecom engineer juggles.** *Each scope shows one axis of telecom work — knowledge QA (15.3%), protocol understanding (22.7%), fault analysis (18.5%), modeling & computation (43.5%) — and the share of the 158,915-example TelecomGPT-R1 training corpus that targets it. The cross-axis distribution explains why we train one unified policy rather than four specialists: a real workflow mixes all four in the same session.*
+<!-- A telecom engineer's day cuts across four very different kinds of thinking — and a useful AI has to fluidly switch between them:
 | Job | What it looks like in practice |
 |---|---|
 | **Knowledge QA** | "What does 3GPP Release 18 say about [feature]? What's the typical PRACH timing budget?" |
 | **Protocol understanding** | Reading 3GPP / ITU / IETF specs and following multi-step procedure flows |
 | **Fault analysis** | Looking at a PCAP, RAN log, or KPI dashboard and finding the root cause |
+| **Modeling & Computation** | Closing a link-budget or queueing-theory derivation; reconstructing a system-model equation from a paper; writing a MATLAB beamformer that actually runs | -->
 ---
 ## 3 — How We Did It: The Recipe at a Glance
+To train a unified model capable of navigating this diverse data landscape, we had to rethink both data curation and post-training choices. The resulting recipe rests on two foundational design decisions:
+1) Instead of training separate models on disconnected datasets for standards QA, logs, tables, math, and code, we curate all sources into a single unified telecom reasoning corpus and train one policy over the whole space. This matters because telecom concepts do not stay inside one format. A scheduling rule may appear as prose in a standard, as a row in a configuration table, as a constraint in an equation, as a pattern in logs, or as logic inside code. TelecomGPT-R1 is trained on a 158,915-example unified corpus constructed through an eight-step pipeline. Each example is converted into the same chat format, tagged by reasoning axis and source type, verified with task-specific checks, and prepared for both supervised fine-tuning (SFT) and reinforcement learning (RL).
+2) We post-train with SFT followed by a **three-pillar RL recipe** that combines Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) for stable training across diverse tasks and data types, a difficulty-mined multi-stage curriculum, and dense reward signals from a self-rubric reward on highly complex, derivation-heavy tasks.
+![recipe_4stage_v0.png-2026-05-19-00-20-52-309](https://cdn-uploads.huggingface.co/production/uploads/6882f57510e86d9f80580702/v0pnV58Y3uu3hqPQ6ZQeu.png)
+**Figure 3 | The TelecomGPT-R1 three-stage post-training recipe.** *Stage ① curates heterogeneous telecom sources through an eight-step pipeline into one axis-indexed 158,915-example corpus. Stage ② installs cross-axis long-CoT reasoning on [Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) via LoRA-SFT. Stage ③ combines three components: **DAPO**, which stabilizes the gradient; a **difficulty-mined curriculum**, which advances the prompt distribution from easy to hard; and a **self-rubric reward** (rubrics generated by an LLM or projected from the expert reference, then scored as a decomposed sum of per-rubric binary indicators), which densifies the sparse 0/1 outcome signal. Together they yield the final TelecomGPT-R1 27B policy.*
 ---
+## 4 — Stage 1: Training Data Curation
+TelecomGPT-R1 begins with corpus construction. We collect heterogeneous telecom sources — standards, technical documents, tables, logs, code, math problems, and modeling examples — and convert them into a unified training format.
+The pipeline has eight steps:
 ### The eight pipeline steps at a glance
 | Step | What it does | Why it matters |
 |---|---|---|
+| **S1** — Source-grounded extraction | Modality-specific extractors (AST for code, VLM PDF parsing for textbooks, working-group label projection for specs, row-window slicing for tables, formula masking for math papers, engineering-feature aggregation for raw logs) | Converts heterogeneous telecom data into a common schema. |
+| **S2** — Long-CoT generation | Three trace generators chosen by reasoning type: multiple teacher LLMs (with self-validation) for QA, executable-Python-grounded CoT for derivations, deterministic rule-replay CoT for diagnosis | Right tool for each reasoning type — not one teacher for everything. |
+| **S3** — Multi-pass verification | Axis-matched verifiers (exact match / unit-tolerant numeric closeness / rule-replay accuracy / on-policy re-answering) | Filters incorrect or ungrounded reasoning before training. |
+| **S4** — Augmentation | Variable resampling 5×–20×; prefix/suffix decomposition into intermediate-target + final-target pairs | Expands data coverage and diversity while preserving structured reasoning. |
+| **S5** — Leakage prevention | Cross-benchmark dedup vs. all public eval splits; SHA-256-archived test sets | Ensures leaderboard gains reflect learned capability rather than benchmark contamination. |
+| **S6** — Difficulty stratification | Estimate example difficulty using model pass rates and verifier outcomes. | Provides the difficulty signal later used by the RL curriculum. |
+| **S7** — Format unification | One `{system, user, assistant}` chat schema; fixed answer-format vocabulary; `meta.axis` and `meta.source_track` tags on every row | Makes the corpus trainable, searchable, reweightable, and ablatable as one whole. |
+| **S8** — Reasoning-style mixing | Mix in a small amount of general long-reasoning data from the student model before SFT. | Preserves self-correction and reflective reasoning patterns. |
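To make step S7 concrete, the sketch below shows what one unified row could look like and how axis tags allow reweighting the corpus as a whole. Only the `{system, user, assistant}` fields and the `meta.axis` / `meta.source_track` tags come from the table above; the axis label strings, the helper names `make_row` and `axis_sampling_weights`, and the multiplier scheme are assumptions made for illustration.

```python
# Hypothetical axis labels; the actual tag vocabulary is not given in the README.
AXES = (
    "knowledge_qa",
    "protocol_understanding",
    "fault_analysis",
    "modeling_computation",
)


def make_row(system, user, assistant, axis, source_track):
    """Build one training row in a single chat schema with axis and source tags."""
    if axis not in AXES:
        raise ValueError(f"unknown axis: {axis}")
    return {
        "system": system,
        "user": user,
        "assistant": assistant,
        "meta": {"axis": axis, "source_track": source_track},
    }


def axis_sampling_weights(rows, axis_boost):
    """Per-row sampling weights: each row takes its axis multiplier (default 1.0),
    then the weights are normalized to sum to 1."""
    raw = [axis_boost.get(row["meta"]["axis"], 1.0) for row in rows]
    total = sum(raw)
    return [weight / total for weight in raw]


rows = [
    make_row("You are a telecom assistant.", "Q1", "A1", "fault_analysis", "logs"),
    make_row("You are a telecom assistant.", "Q2", "A2", "knowledge_qa", "standards"),
]
weights = axis_sampling_weights(rows, {"fault_analysis": 3.0})
```

Because every row shares one schema, an ablation or reweighting only has to touch the `meta` tags rather than a per-benchmark data loader.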
+![data_radar_v0](https://cdn-uploads.huggingface.co/production/uploads/6882f57510e86d9f80580702/82qavUtMTYeOs3fEGcIWm.png)
+*Figure 4. The 158,915-example unified telecom corpus, broken down by source track. Outer ring: five source families that fold into the four reasoning axes. Middle ring: individual sub-corpora. Inner radial bars: per-corpus row counts on a log scale.*
+---
+## 5 — Stage 2: Post-Training Algorithms
+We first perform supervised fine-tuning (LoRA adapters on a Qwen3.5 27B base model). This stage teaches the base model to speak the language of telecom reasoning: how to interpret standards, follow protocol constraints, reason over tables, analyze logs, solve derivations, and produce structured answers. This stage is critical because vertical-domain RL cannot create missing knowledge from nothing. A strong base model may know how to reason in general, but if it lacks the relevant telecom facts and conventions, RL can amplify fluent but wrong reasoning.
+Reinforcement learning is appealing for telecom because many tasks have clear correctness signals: a table question has a right option, a derivation has a right result, and a code problem either satisfies the expected behavior or not. But naïve RL fails quickly in this setting.
+The first issue is sparse feedback. Telecom problems are structured but unforgiving: one wrong unit, condition, protocol branch, or table row can make the final answer wrong, even if most of the reasoning is useful. A pure final-answer reward turns these cases into zeros and gives the model little guidance about what went right.
+The second issue is uneven difficulty and learning progress across domains. In unified training, knowledge QA may improve early, while table reasoning, log analysis, math derivation, and code understanding often need much longer. If all domains are trained uniformly, the training will be long and inefficient.
+The third issue is shortcut learning. On benchmark-style tasks, a model can sometimes guess from answer priors, exploit formatting artifacts, or produce plausible explanations without using the right telecom evidence. For a domain model, this is unacceptable: we want grounded reasoning, not better guessing.
+TelecomGPT-R1 addresses these problems with three RL ingredients.
+1. DAPO-style stabilization
+We use DAPO-style optimization to make GRPO training more stable on long telecom reasoning traces. After SFT, the model already knows telecom terminology, answer formats, and response style. RL should improve the reasoning policy without destroying that structure. DAPO helps here through training mechanics: it reduces wasted updates from uninformative rollout groups, improves token-level credit assignment for long responses, and keeps policy updates from becoming too aggressive. In practice, this helps prevent format drift, repetitive reasoning, and over-optimization to lucky final answers.
+2. Difficulty-aware training
+Different telecom skills become learnable at different times. DAPO-style dynamic sampling focuses updates on prompt groups with meaningful reward variation: not groups where every rollout is already correct, and not groups where every rollout fails identically. Combined with a difficulty-mined curriculum, this lets slower domains (tables, logs, math, and code) keep receiving useful training signal instead of being drowned out by easier QA examples. In this sense, DAPO is not only a stabilizer. It also helps manage asynchronous capability progression across heterogeneous telecom reasoning skills.
+3. Self-rubric reward
+Final-answer rewards are sparse. Self-rubric reward makes the signal denser and more trustworthy. For each task group, we accumulate prior experience and use LLM self-analysis or expert references to define criteria covering logic, evidence use, format, key facts, and final correctness. This gives partial credit when the model follows the right reasoning path but misses a local detail. It also reduces guessing and reward hacking: a response that jumps to the right option without using the right standard, table row, log evidence, or equation no longer receives the same credit as a grounded solution.
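
The interaction between the rubric reward and dynamic sampling can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the training code: the criterion weights in `RUBRIC_WEIGHTS` are hypothetical (the criteria are named above, their weights are not), per-criterion scores are assumed to lie in [0, 1], and the advantage is the usual GRPO-style within-group standardization.

```python
from statistics import mean, pstdev

# Hypothetical weights over the rubric criteria; they sum to 1.
RUBRIC_WEIGHTS = {
    "logic": 0.2,
    "evidence": 0.2,
    "format": 0.1,
    "key_facts": 0.2,
    "final_correct": 0.3,
}


def rubric_reward(scores):
    """Weighted sum of per-criterion scores; missing criteria count as 0."""
    return sum(w * scores.get(name, 0.0) for name, w in RUBRIC_WEIGHTS.items())


def is_informative(rewards, eps=1e-8):
    """Dynamic sampling: keep a prompt group only if its rollouts differ in reward."""
    return pstdev(rewards) > eps


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one hard prompt; none reaches the right final answer.
rollouts = [
    {"logic": 1.0, "evidence": 1.0, "format": 1.0, "key_facts": 1.0},  # near miss
    {"logic": 0.5, "format": 1.0},
    {"format": 1.0},
    {},
]

final_only = [r.get("final_correct", 0.0) for r in rollouts]
rubric = [rubric_reward(r) for r in rollouts]

print(is_informative(final_only))  # False: all zeros, the group would be skipped
print(is_informative(rubric))      # True: partial credit separates the rollouts
print(group_advantages(rubric))
```

With a final-answer reward alone, this group has zero variance and contributes no gradient. The rubric scores rank the near miss above the empty response, so the group passes the dynamic-sampling filter and yields non-zero advantages.
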
 ---
 | # | Lesson | Why it matters |
 |:---:|---|---|
+| **1** | **Domain knowledge is the biggest bottleneck.** | A strong general reasoner produces well-formed chains operating on **wrong telecom facts**. RL cannot manufacture knowledge that was never in the model. Invest in SFT data curation *first*. |
+| **2** | **Self-rubric reward is what makes the model universal.** | Without rubric-decomposed credit, a 27B base produces zero correct rollouts on derivation-heavy axes for hundreds of training steps, and RL gets no gradient. |
+| **3** | **Verifier rigor matters as much as reward weights.** | A general verifier (e.g., a math verifier applied directly to telecom math reasoning) silently rewards lucky digit matches and penalizes correct reasoning in the wrong format. Unit normalization, tolerance bands, symbolic equivalence, and code execution were all as important as choosing the reward weights themselves. |
+| **4** | **Difficulty-mined curriculum prevents axis collapse.** | Easy axes (knowledge QA) saturate within hundreds of RL steps; hard axes (math, code, complex logs) keep improving. Without a curriculum, easy axes crowd out the training signal for the rest. |
+| **5** | **Mixing general-domain CoT preserves reasoning style.** | CoTs distilled from teacher models contain few of the student's own reasoning and self-reflective style words, or use different ones. A small mix helps preserve self-correction and reflective reasoning patterns throughout SFT. |
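
Lesson 3 is easy to see with a toy unit-aware checker. The sketch below converts both answers to a base unit before comparing them within a relative tolerance, so `3.5 GHz` matches `3500 MHz` while a bare digit match in the wrong unit is rejected. It is a simplified illustration, not the verifier used here: the `UNITS` table, the `<number> <unit>` answer format, and the default tolerance are assumptions, and symbolic equivalence and code execution are out of scope.

```python
import math
import re

# Hypothetical unit table: symbol -> (multiplier into the base unit, dimension).
# Keys are case-sensitive so that "mW" (milliwatt) is never confused with "MW".
UNITS = {
    "Hz": (1.0, "frequency"),
    "kHz": (1e3, "frequency"),
    "MHz": (1e6, "frequency"),
    "GHz": (1e9, "frequency"),
    "W": (1.0, "power"),
    "mW": (1e-3, "power"),
}

_ANSWER = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*([A-Za-z]+)\s*$")


def to_base(answer: str):
    """Parse '<number> <unit>' into (value in base unit, dimension), or None."""
    match = _ANSWER.match(answer)
    if match is None or match.group(2) not in UNITS:
        return None
    scale, dimension = UNITS[match.group(2)]
    return float(match.group(1)) * scale, dimension


def verify(prediction: str, reference: str, rel_tol: float = 1e-3) -> bool:
    """True if both answers parse, share a dimension, and agree within rel_tol."""
    pred, ref = to_base(prediction), to_base(reference)
    if pred is None or ref is None or pred[1] != ref[1]:
        return False
    return math.isclose(pred[0], ref[0], rel_tol=rel_tol)


print(verify("3.5 GHz", "3500 MHz"))   # True: same quantity, different unit
print(verify("3500 GHz", "3500 MHz"))  # False: digits match, unit does not
print(verify("3.5 W", "3.5 GHz"))      # False: different dimension
```

A plain string or digit comparison would get the first two cases backwards, which is the failure mode the lesson describes.
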
 ---
+## 7 — Toward Telecom's Cognitive Architecture
+*TelecomGPT-R1 is the cognitive core. The next steps stack vertically on top of it — perception, action, world model — each docked into the brain, not bolted around it.*
+1. **The cognitive core is reasoning, not retrieval.** Telecom decisions stitch evidence across specs, tables, logs, equations, and code — no RAG pipeline can compose this cross-modal evidence into a single chain. TelecomGPT-R1 makes long-CoT reasoning the central primitive over which every other telecom capability can be layered. Everything that follows is plugged into this core, not bolted around it.
+2. **Senses: docking RF-GPT to read the spectrum.** Today the core reads telecom — specs, tables, logs, code. The substance of telecom, however, is waveforms. [RF-GPT](https://arxiv.org/abs/2602.14833) [[Zou et al., 2026](#cite-rfgpt)], our group's recent foundation model, encodes IQ samples as RF tokens that a decoder-only LLM can natively consume; fusing it with TelecomGPT-R1 yields a single reasoning chain that crosses the protocol–physical boundary — *spectrum capture → standard clause → log evidence → configuration fix*.
+3. **Hands: agents that act on the network, not just talk about it.** A reasoning core that only produces text is still a passive observer. Wrapped as a tool-using agent — invoking simulators, SDR rigs, srsRAN runtimes, OSS/BSS APIs, and digital-twin replays — TelecomGPT-R1 becomes an operator: a NOC co-pilot that pulls live KPIs, simulates a config change, and drafts the change ticket end-to-end. The interface is no longer "ask the model"; it is "deploy the model."
+4. **The long bet: a network that runs its own world model.** Close the loop — brain, senses, hands — onto a continuously updated *cognitive twin* that the policy can simulate against and reason about counterfactually. Predicting a failure hours before it happens is not a smarter monitor; it is a network that **thinks about itself**. This is the step from *"5G with AI features"* to a telecom architecture that is **categorically** intelligent — and it is the only item on this list we cannot promise; we can only commit to building toward it openly. The same brain–senses–hands–world template should generalize to any heterogeneous infrastructure vertical that reasons over standards, telemetry, and physical signals — but telecom is where it has to be proven first.
 ---
 @inproceedings{wang2026telecomgptr1,
   title     = {TelecomGPT-R1: Post-Training Recipes for Universal Reasoning in Telecom},
   author    = {Wang, Bohao and Wu, Chenwei and Li, Haoyu and Zou, Hang and Tian, Yu
+               and Bariah, Lina and Huang, Chongwen and Shen, Yongliang and Zhang, Zhaoyang and Debbah, M\'{e}rouane},
   booktitle = {[Venue coming soon!]},
   year      = {2026}
 }
   year      ={2025},
   publisher ={IEEE}
 }
+@article{zou2026rfgpt,
+  title     = {RF-GPT: Teaching AI to See the Wireless World},
+  author    = {Zou, Hang and Tian, Yu and Wang, Bohao and Bariah, Lina
+               and Lasaulce, Samson and Huang, Chongwen and Debbah, M\'{e}rouane},
+  journal   = {arXiv preprint arXiv:2602.14833},
+  year      = {2026},
+  url       = {https://arxiv.org/abs/2602.14833}
+}
 ```
 ### Acknowledgements
 This work was supported by the Digital Future Institute of Khalifa University; the College of Information Science and Electronic Engineering, Zhejiang University; the College of Computer Science and Technology, Zhejiang University; and the Research Computing team of Khalifa University.