KU-DFI/TelecomGPT-R1


wbhVince829 committed 17 days ago

Commit 3aa9f37 (1 parent: 5843459)

update figure 3, teletables, and prepared for outreach

Files changed (1)

README.md +10 -14


@@ -9,20 +9,18 @@ license: apache-2.0
 ## 1 — A New State of the Art for Telecom LLMs
-**TelecomGPT-R1 (27B) reaches state-of-the-art (SOTA) performance on the [GSMA Open Telco Leaderboard](https://huggingface.co/spaces/GSMA/open-telco-leaderboard) at 89.6% average, matching or leading every open-source and closed-source entrant across both general-purpose and telecom-specialized categories.** The leaderboard aggregates 7 benchmarks spanning 4 evaluation axes (telecom knowledge QA, 3GPP protocol comprehension, fault and log diagnosis, and RF/network modeling), as reported in Figure 1.
 - **Among open-source models**, TelecomGPT-R1 leads DeepSeek-V3-0324 (685B) by **+30.3**, LLaMA-3.3-70B by **+34.9**, and Qwen2.5-72B by **+35.6**, while operating at roughly **25× fewer active parameters than the next-best open entrant**.
 - **Among closed-source models**, TelecomGPT-R1 reaches SOTA performance across both the general-purpose frontier tier and the telecom-specialized tier, as detailed in the two bullets below.
 - **Among general-purpose frontier models**, TelecomGPT-R1 leads Gemini-3.1-Pro by **+14.0**, Claude-Opus-4.6 by **+16.3**, and GPT-5 by **+17.7**. These systems sit at the **trillion-parameter-class frontier** (active-parameter counts are not publicly disclosed but are widely reported as orders of magnitude larger than 27B), making the margin a parameter-efficiency result as much as an accuracy result.
-- **Among telecom-specialized models**, TelecomGPT-R1 is **on par with the leading closed operator-internal telecom model AT&T's OTel-LLM-8.3B-QnA**, and leads SoftBank LTM by **+16.0**, demonstrating that an open telecom reasoning model can reach SOTA performance alongside top operator-internal baselines on the GSMA Open Telco Leaderboard. [^teletables]
 **In one line: TelecomGPT-R1 demonstrates that an open 27B telecom reasoning model can reach SOTA performance across the full breadth of the GSMA Open Telco Leaderboard.**
 ![Figure 1. TelecomGPT-R1 vs frontier closed-source models on the GSMA Open Telco Leaderboard](https://cdn-uploads.huggingface.co/production/uploads/6882f57510e86d9f80580702/soASkkHnjJ7nm7iAMY5tv.png)
-**Figure 1 | TelecomGPT-R1 vs frontier closed-source models on the GSMA Open Telco Leaderboard.** *Each spoke is one benchmark (plus the overall average), normalized by its per-axis leaderboard best so that `1.0` = best score on that benchmark. Our 27B open-source policy reaches `1.0` on **five of eight axes** (3GPP-TSG, srsRANBench, TeleLogs, TeleTables, Average) and stays at or above `0.95` on every other axis, visibly tracing the outer edge of the radar where no other model, open or closed, matches it on all axes simultaneously.*
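The per-benchmark normalization described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, assuming raw scores are held in a nested dictionary; the function name `normalize_by_best`, the model and benchmark names, and all numbers are hypothetical placeholders, not leaderboard data.

```python
def normalize_by_best(scores):
    """Divide every raw score by the best score on the same benchmark.

    scores: {model_name: {benchmark_name: raw_score}}
    Returns the same nested shape, where 1.0 marks the best model
    on that benchmark.
    """
    # Best raw score per benchmark across all models that report it.
    best = {}
    for per_model in scores.values():
        for bench, value in per_model.items():
            best[bench] = max(best.get(bench, value), value)

    normalized = {}
    for model, per_model in scores.items():
        normalized[model] = {}
        for bench, value in per_model.items():
            if best[bench] <= 0:
                # Cannot normalize against a non-positive best score.
                raise ValueError(f"non-positive best score on {bench}")
            normalized[model][bench] = value / best[bench]
    return normalized


# Placeholder numbers for illustration only (NOT leaderboard results).
example_scores = {
    "model_a": {"bench_x": 80.0, "bench_y": 50.0},
    "model_b": {"bench_x": 60.0, "bench_y": 100.0},
}
normalized = normalize_by_best(example_scores)
```

Under this scheme, a model that reaches `1.0` on a spoke holds the best raw score on that benchmark among the models being compared, so the radar shows relative standing rather than absolute accuracy.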
 ---
@@ -98,9 +96,9 @@ vllm serve KU-DFI/TelecomGPT-R1 \
 **Hardware**: Following the official [Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) deployment guidance, TelecomGPT-R1 (27B, bf16) runs on a single **A100 80GB** (or equivalent **H100 80GB** / **MI300X**) with the default settings above. Multi-GPU nodes allow longer contexts and larger batches behind an operator firewall.
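As a rough back-of-the-envelope check on the single-GPU statement, the weight footprint can be estimated from the parameter count and precision. This sketch assumes exactly 27 billion parameters at 2 bytes each (bf16) and a nominal 80 GB device; usable memory is lower in practice, and the KV cache, activations, and runtime overhead all have to fit in whatever remains. The function and variable names are illustrative only.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 27_000_000_000   # assumption: nominal "27B" parameter count
BF16_BYTES = 2                # bf16 stores each parameter in 2 bytes
GPU_MEMORY_GB = 80            # assumption: nominal capacity of an 80 GB card

weights_gb = weight_memory_gb(NUM_PARAMS, BF16_BYTES)
headroom_gb = GPU_MEMORY_GB - weights_gb

print(f"weights: {weights_gb:.1f} GB, nominal headroom: {headroom_gb:.1f} GB")
```

The weights alone come to about 54 GB, which is why the remaining headroom on one 80 GB device limits context length and batch size, and why multi-GPU nodes allow more of both.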
 **Smaller TelecomGPT-R1 variants — coming soon.** Lighter checkpoints better suited to **edge / device-side** inference are currently in training, extending the family from data-center GPU deployment down toward on-device telecom intelligence.
 ---
@@ -131,10 +129,9 @@ Therefore, the industry needs an **open-source telecom reasoner** that can be:
 ### 2.4 — What TelecomGPT-R1 improves
-TelecomGPT-R1 represents a definitive leap forward: a **27B open-weights base** trained to perform **universal reasoning across knowledge QA, 3GPP protocol comprehension, fault/log diagnosis, and RF/network modeling under a single unified policy**. Rather than stitching together specialized heads per task, one model handles the full four-axis surface evaluated by the GSMA Open Telco Leaderboard (producing the leaderboard result reported in §1), while remaining small enough to **self-host, fine-tune, and audit inside an operator environment**.
 ![Diverse reasoning tasks and data modalities a telecom engineer may encounter in day-to-day work.](https://cdn-uploads.huggingface.co/production/uploads/6882f57510e86d9f80580702/M6SlDTzpx4W6wvAGE0eTp.png)
-**Figure 2 | The four kinds of reasoning a telecom engineer juggles.** *Each scope shows one axis of telecom work (knowledge QA 15.3%, protocol understanding 22.7%, fault analysis 18.5%, modeling & computation 43.5%) and the share of the 158,915-example TelecomGPT-R1 training corpus that targets it. The cross-axis distribution explains why we train one unified policy rather than four specialists: a real workflow mixes all four in the same session.*
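The percentage shares in the Figure 2 caption can be turned into approximate per-axis example counts. This is a sketch only: the published shares are rounded to one decimal place, so the counts below are estimates rather than the corpus's exact split, and the dictionary and variable names are illustrative.

```python
CORPUS_SIZE = 158_915  # total examples, as stated in the caption

# Shares from the caption, in tenths of a percent, to keep the
# arithmetic in integers until the final division.
SHARES_PER_MILLE = {
    "knowledge QA": 153,
    "protocol understanding": 227,
    "fault analysis": 185,
    "modeling & computation": 435,
}

# The four shares should cover the whole corpus.
assert sum(SHARES_PER_MILLE.values()) == 1000

approx_counts = {
    axis: round(CORPUS_SIZE * share / 1000)
    for axis, share in SHARES_PER_MILLE.items()
}

for axis, count in approx_counts.items():
    print(f"{axis}: ~{count} examples")
```

On these estimates, modeling & computation accounts for roughly 69 thousand examples, more than knowledge QA and fault analysis combined.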
 <!-- A telecom engineer's day cuts across four very different kinds of thinking — and a useful AI has to fluidly switch between them:
@@ -155,10 +152,10 @@ The challenges in §2 (heterogeneous modalities, missing telecom domain knowledg
 **A multi-stage post-training procedure that grounds general reasoning in telecom facts.** Supervised fine-tuning installs the telecom "language" (how to read standards, follow protocol constraints, walk a log, close a derivation) that subsequent reinforcement learning then sharpens. Without this grounding step, RL amplifies *fluent wrong reasoning*: well-formed chains that happen to operate on hallucinated 3GPP clauses, mis-read log features, or unit-dropped derivations. The RL stage targets the three failure modes that naïve outcome-reward training suffers on heterogeneous telecom data (sparse final-answer signal, uneven learning progress across axes, and reward gaming via shortcut answers), with the full algorithmic details described in the accompanying paper.
-The combined effect is what §1 reports: a single 27B open policy that reaches **89.0% average on the GSMA Open Telco Leaderboard**, leading every open-source, frontier-closed, and operator-internal entrant.
-![TelecomGPT-R1 end-to-end recipe](https://cdn-uploads.huggingface.co/production/uploads/6882f57510e86d9f80580702/WkwUMMWWFpS1EAJls6jhO.png)
-**Figure 3 | The simplified end-to-end TelecomGPT-R1 recipe.** *Heterogeneous telecom sources → a fine-grained dataset processing pipeline → one unified, axis-indexed corpus of 158,915 examples → supervised fine-tuning of [Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) → experience-pool-differentiated GRPO, yielding the final TelecomGPT-R1 27B policy.*
 ---
@@ -208,7 +205,6 @@ KU/DFI's role is to build that open commons. The program now spans the key layer
 - **Model weights.** [KU-DFI/TelecomGPT-R1](https://huggingface.co/KU-DFI/TelecomGPT-R1/tree/main)
 - **Unified benchmark.** [GSMA Open Telco Leaderboard](https://huggingface.co/spaces/GSMA/open-telco-leaderboard)
 ### Citation
 ```bibtex
@@ -236,4 +232,4 @@ This work was supported by the Digital Future Institute of Khalifa University; t
 ---
-[^teletables]: On TeleTables, we follow the original paper's evaluation protocol by attaching the table content directly to the prompt — a table-grounded reasoning setup rather than retrieval without table id or content.

 ## 1 — A New State of the Art for Telecom LLMs
+**TelecomGPT-R1 (27B) reaches state-of-the-art (SOTA) performance on the [GSMA Open Telco Leaderboard](https://huggingface.co/spaces/GSMA/open-telco-leaderboard) at 89.6% average, matching or leading every open-source and closed-source entrant across both general-purpose and telecom-specialized categories.** The leaderboard aggregates 7 benchmarks spanning 4 reasoning axes — **protocol understanding** (3GPP/O-RAN normative prose), **knowledge QA** (vendor and operator facts), **modeling & computation** (RF/queueing derivations), and **fault analysis** (RAN drive-test logs) — as reported in Figure 1.
 - **Among open-source models**, TelecomGPT-R1 leads DeepSeek-V3-0324 (685B) by **+30.3**, LLaMA-3.3-70B by **+34.9**, and Qwen2.5-72B by **+35.6**, while operating at roughly **25× fewer active parameters than the next-best open entrant**.
 - **Among closed-source models**, TelecomGPT-R1 reaches SOTA performance across both the general-purpose frontier tier and the telecom-specialized tier, as detailed in the two bullets below.
 - **Among general-purpose frontier models**, TelecomGPT-R1 leads Gemini-3.1-Pro by **+14.0**, Claude-Opus-4.6 by **+16.3**, and GPT-5 by **+17.7**. These systems sit at the **trillion-parameter-class frontier** (active-parameter counts are not publicly disclosed but are widely reported as orders of magnitude larger than 27B), making the margin a parameter-efficiency result as much as an accuracy result.
+- **Among telecom-specialized models**, TelecomGPT-R1 is **on par with the leading closed operator-internal telecom model AT&T's OTel-LLM-8.3B-QnA**, and leads SoftBank LTM by **+16.0**, demonstrating that an open telecom reasoning model can reach SOTA performance alongside top operator-internal baselines on the GSMA Open Telco Leaderboard.<sup>†</sup>
 **In one line: TelecomGPT-R1 demonstrates that an open 27B telecom reasoning model can reach SOTA performance across the full breadth of the GSMA Open Telco Leaderboard.**
 ![Figure 1. TelecomGPT-R1 vs frontier closed-source models on the GSMA Open Telco Leaderboard](https://cdn-uploads.huggingface.co/production/uploads/6882f57510e86d9f80580702/soASkkHnjJ7nm7iAMY5tv.png)
+**Figure 1 | TelecomGPT-R1 vs frontier closed-source models on the GSMA Open Telco Leaderboard.** *Each spoke is one benchmark (plus the overall average), normalized by its per-axis leaderboard best so that `1.0` = best score on that benchmark. Our 27B open-source policy reaches `1.0` on **six of eight axes** (3GPP-TSG, srsRANBench, TeleLogs, TeleQnA, TeleTables, Average) and stays at or above `0.94` on every other axis, visibly tracing the outer edge of the radar where no other model, open or closed, matches it on all axes simultaneously.*
 ---
 **Hardware**: Following the official [Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) deployment guidance, TelecomGPT-R1 (27B, bf16) runs on a single **A100 80GB** (or equivalent **H100 80GB** / **MI300X**) with the default settings above. Multi-GPU nodes allow longer contexts and larger batches behind an operator firewall.
 **Smaller TelecomGPT-R1 variants — coming soon.** Lighter checkpoints better suited to **edge / device-side** inference are currently in training, extending the family from data-center GPU deployment down toward on-device telecom intelligence.
 ---
 ### 2.4 — What TelecomGPT-R1 improves
+TelecomGPT-R1 represents a definitive leap forward: a **27B open-weights base** trained to perform **universal reasoning across protocol understanding, knowledge QA, modeling & computation, and fault analysis under a single unified policy**. Rather than stitching together specialized heads per task, one model handles the full four-axis surface evaluated by the GSMA Open Telco Leaderboard (producing the leaderboard result reported in §1), while remaining small enough to **self-host, fine-tune, and audit inside an operator environment**.
 ![Diverse reasoning tasks and data modalities a telecom engineer may encounter in day-to-day work.](https://cdn-uploads.huggingface.co/production/uploads/6882f57510e86d9f80580702/M6SlDTzpx4W6wvAGE0eTp.png)
+**Figure 2 | The four kinds of reasoning a telecom engineer juggles.** *Each scope shows one axis of telecom work (protocol understanding 22.7%, knowledge QA 15.3%, modeling & computation 43.5%, fault analysis 18.5%) and the share of the 158,915-example TelecomGPT-R1 training corpus that targets it. The cross-axis distribution explains why we train one unified policy rather than four specialists: a real workflow mixes all four in the same session.*
 <!-- A telecom engineer's day cuts across four very different kinds of thinking — and a useful AI has to fluidly switch between them:
 **A multi-stage post-training procedure that grounds general reasoning in telecom facts.** Supervised fine-tuning installs the telecom "language" (how to read standards, follow protocol constraints, walk a log, close a derivation) that subsequent reinforcement learning then sharpens. Without this grounding step, RL amplifies *fluent wrong reasoning*: well-formed chains that happen to operate on hallucinated 3GPP clauses, mis-read log features, or unit-dropped derivations. The RL stage targets the three failure modes that naïve outcome-reward training suffers on heterogeneous telecom data (sparse final-answer signal, uneven learning progress across axes, and reward gaming via shortcut answers), with the full algorithmic details described in the accompanying paper.
+The combined effect is what §1 reports: a single 27B open policy that reaches **89.6% average on the GSMA Open Telco Leaderboard**, leading every open-source, frontier-closed, and operator-internal entrant.
+![TelecomGPT-R1 end-to-end recipe](https://cdn-uploads.huggingface.co/production/uploads/6882f57510e86d9f80580702/pYUNVVVjsE3vH8zik8woz.png)
+**Figure 3 | The end-to-end TelecomGPT-R1 recipe.** *Frame ① distills four families of heterogeneous telecom material — standards documents, network telemetry and drive-test logs, math papers and code, and Q&A seeds and glossaries — into a single axis-balanced curated corpus of 158,915 examples across four reasoning axes (protocol, knowledge, modeling, fault). Frame ② then drives the corpus through a three-stage post-training progression — domain grounding, policy stabilization, and verifiable reasoning refinement under axis-aligned signals — yielding TelecomGPT-R1.*
 ---
 - **Model weights.** [KU-DFI/TelecomGPT-R1](https://huggingface.co/KU-DFI/TelecomGPT-R1/tree/main)
 - **Unified benchmark.** [GSMA Open Telco Leaderboard](https://huggingface.co/spaces/GSMA/open-telco-leaderboard)
 ### Citation
 ```bibtex
 ---
+<small>**†** On TeleTables, we follow the original paper's evaluation protocol by attaching the table content directly to the prompt — a table-grounded reasoning setup rather than retrieval without table id or content.</small>