Commit 3aa9f37 · Parent(s): 5843459
update figure 3, teletables, and prepared for outreach
README.md
CHANGED
|
@@ -9,20 +9,18 @@ license: apache-2.0
|
|
| 9 |
|
| 10 |
## 1 — A New State of the Art for Telecom LLMs
|
| 11 |
|
| 12 |
-
**TelecomGPT-R1 (27B) reaches state-of-the-art (SOTA) performance on the [GSMA Open Telco Leaderboard](https://huggingface.co/spaces/GSMA/open-telco-leaderboard) at 89.6% average, matching or leading every open-source and closed-source entrant across both general-purpose and telecom-specialized categories.** The leaderboard aggregates 7 benchmarks spanning 4
|
| 13 |
|
| 14 |
- **Among open-source models**, TelecomGPT-R1 leads DeepSeek-V3-0324 (685B) by **+30.3**, LLaMA-3.3-70B by **+34.9**, and Qwen2.5-72B by **+35.6**, while using roughly **25× fewer total parameters than the next-best open entrant**.
|
| 15 |
- **Among closed-source models**, TelecomGPT-R1 reaches SOTA performance across both the general-purpose frontier tier and the telecom-specialized tier, as detailed in the two bullets below.
|
| 16 |
- **Among general-purpose frontier models**, TelecomGPT-R1 leads Gemini-3.1-Pro by **+14.0**, Claude-Opus-4.6 by **+16.3**, and GPT-5 by **+17.7**. These systems sit at the **trillion-parameter-class frontier** (active-parameter counts are not publicly disclosed but are widely reported as orders of magnitude larger than 27B), making the margin a parameter-efficiency result as much as an accuracy result.
|
| 17 |
-
- **Among telecom-specialized models**, TelecomGPT-R1 is **on par with the leading closed operator-internal telecom model AT&T's OTel-LLM-8.3B-QnA**, and leads SoftBank LTM by **+16.0**, demonstrating that an open telecom reasoning model can reach SOTA performance alongside top operator-internal baselines on the GSMA Open Telco Leaderboard.
|
| 18 |
|
| 19 |
**In one line: TelecomGPT-R1 demonstrates that an open 27B telecom reasoning model can reach SOTA performance across the full breadth of the GSMA Open Telco Leaderboard.**
|
| 20 |
|
| 21 |
|
| 22 |

|
| 23 |
-
**Figure 1 | TelecomGPT-R1 vs frontier closed-source models on the GSMA Open Telco Leaderboard.** *Each spoke is one benchmark (plus the overall average), normalized by its per-axis leaderboard best so that `1.0` = best score on that benchmark. Our 27B open-source policy reaches `1.0` on **
|
| 24 |
-
|
| 25 |
-
|
| 26 |
|
| 27 |
|
| 28 |
---
|
|
@@ -98,9 +96,9 @@ vllm serve KU-DFI/TelecomGPT-R1 \
|
|
| 98 |
|
| 99 |
**Hardware**: Following the official [Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) deployment guidance, TelecomGPT-R1 (27B, bf16) runs on a single **A100 80GB** (or equivalent **H100 80GB** / **MI300X**) with the default settings above. Multi-GPU nodes allow longer contexts and larger batches behind an operator firewall.
|
| 100 |
|
| 101 |
-
|
| 102 |
**Smaller TelecomGPT-R1 variants — coming soon.** Lighter checkpoints better suited to **edge / device-side** inference are currently in training, extending the family from data-center GPU deployment down toward on-device telecom intelligence.
|
| 103 |
|
|
|
|
| 104 |
---
|
| 105 |
|
| 106 |
|
|
@@ -131,10 +129,9 @@ Therefore, the industry needs an **open-source telecom reasoner** that can be:
|
|
| 131 |
|
| 132 |
### 2.4 — What TelecomGPT-R1 improves
|
| 133 |
|
| 134 |
-
TelecomGPT-R1 represents a definitive leap forward: a **27B open-weights base** trained to perform **universal reasoning across knowledge QA,
|
| 135 |
-
|
| 136 |

|
| 137 |
-
**Figure 2 | The four kinds of reasoning a telecom engineer juggles.** *Each scope shows one axis of telecom work (
|
| 138 |
|
| 139 |
<!-- A telecom engineer's day cuts across four very different kinds of thinking — and a useful AI has to fluidly switch between them:
|
| 140 |
|
|
@@ -155,10 +152,10 @@ The challenges in §2 (heterogeneous modalities, missing telecom domain knowledg
|
|
| 155 |
|
| 156 |
**A multi-stage post-training procedure that grounds general reasoning in telecom facts.** Supervised fine-tuning installs the telecom "language" (how to read standards, follow protocol constraints, walk a log, close a derivation) that subsequent reinforcement learning then sharpens. Without this grounding step, RL amplifies *fluent wrong reasoning*: well-formed chains that happen to operate on hallucinated 3GPP clauses, mis-read log features, or unit-dropped derivations. The RL stage targets the three failure modes that naïve outcome-reward training suffers on heterogeneous telecom data (sparse final-answer signal, uneven learning progress across axes, and reward gaming via shortcut answers), with the full algorithmic details described in the accompanying paper.
|
| 157 |
|
| 158 |
-
The combined effect is what §1 reports: a single 27B open policy that reaches **89.
|
| 159 |
|
| 160 |
-

|
| 209 |
- **Unified benchmark.** [GSMA Open Telco Leaderboard](https://huggingface.co/spaces/GSMA/open-telco-leaderboard)
|
| 210 |
|
| 211 |
-
|
| 212 |
### Citation
|
| 213 |
|
| 214 |
```bibtex
|
|
@@ -236,4 +232,4 @@ This work was supported by the Digital Future Institute of Khalifa University; t
|
|
| 236 |
|
| 237 |
---
|
| 238 |
|
| 239 |
-
|
|
|
|
| 9 |
|
| 10 |
## 1 — A New State of the Art for Telecom LLMs
|
| 11 |
|
| 12 |
+
**TelecomGPT-R1 (27B) reaches state-of-the-art (SOTA) performance on the [GSMA Open Telco Leaderboard](https://huggingface.co/spaces/GSMA/open-telco-leaderboard) at 89.6% average, matching or leading every open-source and closed-source entrant across both general-purpose and telecom-specialized categories.** The leaderboard aggregates 7 benchmarks spanning 4 reasoning axes — **protocol understanding** (3GPP/O-RAN normative prose), **knowledge QA** (vendor and operator facts), **modeling & computation** (RF/queueing derivations), and **fault analysis** (RAN drive-test logs) — as reported in Figure 1.
|
| 13 |
|
| 14 |
- **Among open-source models**, TelecomGPT-R1 leads DeepSeek-V3-0324 (685B) by **+30.3**, LLaMA-3.3-70B by **+34.9**, and Qwen2.5-72B by **+35.6**, while using roughly **25× fewer total parameters than the next-best open entrant**.
|
| 15 |
- **Among closed-source models**, TelecomGPT-R1 reaches SOTA performance across both the general-purpose frontier tier and the telecom-specialized tier, as detailed in the two bullets below.
|
| 16 |
- **Among general-purpose frontier models**, TelecomGPT-R1 leads Gemini-3.1-Pro by **+14.0**, Claude-Opus-4.6 by **+16.3**, and GPT-5 by **+17.7**. These systems sit at the **trillion-parameter-class frontier** (active-parameter counts are not publicly disclosed but are widely reported as orders of magnitude larger than 27B), making the margin a parameter-efficiency result as much as an accuracy result.
|
| 17 |
+
- **Among telecom-specialized models**, TelecomGPT-R1 is **on par with the leading closed operator-internal telecom model, AT&T's OTel-LLM-8.3B-QnA**, and leads SoftBank LTM by **+16.0**, demonstrating that an open telecom reasoning model can reach SOTA performance alongside top operator-internal baselines on the GSMA Open Telco Leaderboard.<sup>†</sup>
|
| 18 |
|
| 19 |
**In one line: TelecomGPT-R1 demonstrates that an open 27B telecom reasoning model can reach SOTA performance across the full breadth of the GSMA Open Telco Leaderboard.**
|
| 20 |
|
| 21 |
|
| 22 |

|
| 23 |
+
**Figure 1 | TelecomGPT-R1 vs frontier closed-source models on the GSMA Open Telco Leaderboard.** *Each spoke is one benchmark (plus the overall average), normalized by its per-axis leaderboard best so that `1.0` = best score on that benchmark. Our 27B open-source policy reaches `1.0` on **six of eight axes** (3GPP-TSG, srsRANBench, TeleLogs, TeleQnA, TeleTables, Average) and stays at or above `0.94` on every other axis, visibly tracing the outer edge of the radar where no other model, open or closed, matches it on all axes simultaneously.*
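The per-axis normalization described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the script used to draw the figure: the function name, the model names, and the scores are hypothetical placeholders, and it assumes every score is positive.

```python
def normalize_by_axis_best(scores):
    """Divide each score by the best score on the same axis.

    `scores` maps model name -> {axis name -> raw score}.
    After normalization, 1.0 marks the best model on that axis.
    """
    axes = {axis for per_model in scores.values() for axis in per_model}
    # Best raw score per axis across all models that report that axis.
    best = {
        axis: max(per_model[axis] for per_model in scores.values() if axis in per_model)
        for axis in axes
    }
    return {
        model: {axis: value / best[axis] for axis, value in per_model.items()}
        for model, per_model in scores.items()
    }


# Hypothetical placeholder scores, NOT leaderboard numbers.
example_scores = {
    "model_a": {"bench_x": 80.0, "bench_y": 45.0},
    "model_b": {"bench_x": 64.0, "bench_y": 50.0},
}
normalized = normalize_by_axis_best(example_scores)
for model, per_axis in normalized.items():
    print(model, {axis: round(value, 2) for axis, value in sorted(per_axis.items())})
```

Because each axis is scaled independently, a model can sit at `1.0` on several spokes at once, which is what the radar plot visualizes.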
|
|
|
|
|
|
|
| 24 |
|
| 25 |
|
| 26 |
---
|
|
|
|
| 96 |
|
| 97 |
**Hardware**: Following the official [Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) deployment guidance, TelecomGPT-R1 (27B, bf16) runs on a single **A100 80GB** (or equivalent **H100 80GB** / **MI300X**) with the default settings above. Multi-GPU nodes allow longer contexts and larger batches behind an operator firewall.
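A back-of-the-envelope check shows why a single 80 GB card is plausible for bf16 weights. This is a rough sketch under stated assumptions, not a measured footprint: it treats the parameter count as exactly 27 billion, counts 2 bytes per bf16 parameter, uses decimal gigabytes, and ignores the memory the runtime itself reserves.

```python
BYTES_PER_BF16_PARAM = 2      # bf16 stores each parameter in 16 bits
NOMINAL_PARAMS = 27e9         # assumption: "27B" taken as exactly 27 billion
GPU_MEMORY_GB = 80            # A100 80GB / H100 80GB, in decimal GB


def weight_memory_gb(num_params, bytes_per_param=BYTES_PER_BF16_PARAM):
    """Approximate memory needed for the weights alone, in decimal GB."""
    return num_params * bytes_per_param / 1e9


weights_gb = weight_memory_gb(NOMINAL_PARAMS)
# Whatever is left must cover the KV cache, activations, and runtime overhead.
headroom_gb = GPU_MEMORY_GB - weights_gb

print(f"weights: ~{weights_gb:.0f} GB, headroom: ~{headroom_gb:.0f} GB")
```

Under these assumptions the weights take about 54 GB, leaving roughly 26 GB for the KV cache and activations. That limited headroom is consistent with multi-GPU nodes being the route to longer contexts and larger batches.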
|
| 98 |
|
|
|
|
| 99 |
**Smaller TelecomGPT-R1 variants — coming soon.** Lighter checkpoints better suited to **edge / device-side** inference are currently in training, extending the family from data-center GPU deployment down toward on-device telecom intelligence.
|
| 100 |
|
| 101 |
+
|
| 102 |
---
|
| 103 |
|
| 104 |
|
|
|
|
| 129 |
|
| 130 |
### 2.4 — What TelecomGPT-R1 improves
|
| 131 |
|
| 132 |
+
TelecomGPT-R1 represents a definitive leap forward: a **27B open-weights base** trained to perform **universal reasoning across protocol understanding, knowledge QA, modeling & computation, and fault analysis under a single unified policy**. Rather than stitching together specialized heads per task, one model handles the full four-axis surface evaluated by the GSMA Open Telco Leaderboard (producing the leaderboard result reported in §1), while remaining small enough to **self-host, fine-tune, and audit inside an operator environment**.
|
|
|
|
| 133 |

|
| 134 |
+
**Figure 2 | The four kinds of reasoning a telecom engineer juggles.** *Each scope shows one axis of telecom work (protocol understanding 22.7%, knowledge QA 15.3%, modeling & computation 43.5%, fault analysis 18.5%) and the share of the 158,915-example TelecomGPT-R1 training corpus that targets it. The cross-axis distribution explains why we train one unified policy rather than four specialists: a real workflow mixes all four in the same session.*
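The shares in the Figure 2 caption can be turned into approximate per-axis example counts. This is illustrative arithmetic only: the percentages are rounded to one decimal place in the caption, so the derived counts are estimates rather than the exact corpus statistics.

```python
CORPUS_SIZE = 158_915  # total training examples, from the Figure 2 caption

# Per-axis shares in percent, as stated in the caption.
AXIS_SHARE_PERCENT = {
    "protocol understanding": 22.7,
    "knowledge QA": 15.3,
    "modeling & computation": 43.5,
    "fault analysis": 18.5,
}

# Approximate counts; exact values depend on the unrounded shares.
approx_counts = {
    axis: round(CORPUS_SIZE * share / 100)
    for axis, share in AXIS_SHARE_PERCENT.items()
}

for axis, count in approx_counts.items():
    print(f"{axis}: ~{count:,} examples")
print(f"shares sum to {sum(AXIS_SHARE_PERCENT.values()):.1f}%")
```

The four shares add up to 100%, and modeling & computation is the largest slice of the corpus.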
|
| 135 |
|
| 136 |
<!-- A telecom engineer's day cuts across four very different kinds of thinking — and a useful AI has to fluidly switch between them:
|
| 137 |
|
|
|
|
| 152 |
|
| 153 |
**A multi-stage post-training procedure that grounds general reasoning in telecom facts.** Supervised fine-tuning installs the telecom "language" (how to read standards, follow protocol constraints, walk a log, close a derivation) that subsequent reinforcement learning then sharpens. Without this grounding step, RL amplifies *fluent wrong reasoning*: well-formed chains that happen to operate on hallucinated 3GPP clauses, mis-read log features, or unit-dropped derivations. The RL stage targets the three failure modes that naïve outcome-reward training suffers on heterogeneous telecom data (sparse final-answer signal, uneven learning progress across axes, and reward gaming via shortcut answers), with the full algorithmic details described in the accompanying paper.
|
| 154 |
|
| 155 |
+
The combined effect is what §1 reports: a single 27B open policy that reaches **89.6% average on the GSMA Open Telco Leaderboard**, matching or leading every open-source, frontier-closed, and operator-internal entrant.
|
| 156 |
|
| 157 |
+

|
| 158 |
+
**Figure 3 | The end-to-end TelecomGPT-R1 recipe.** *Frame ① distills four families of heterogeneous telecom material — standards documents, network telemetry and drive-test logs, math papers and code, and Q&A seeds and glossaries — into a single axis-balanced curated corpus of 158,915 examples across four reasoning axes (protocol, knowledge, modeling, fault). Frame ② then drives the corpus through a three-stage post-training progression — domain grounding, policy stabilization, and verifiable reasoning refinement under axis-aligned signals — yielding TelecomGPT-R1.*
|
| 159 |
|
| 160 |
---
|
| 161 |
|
|
|
|
| 205 |
- **Model weights.** [KU-DFI/TelecomGPT-R1](https://huggingface.co/KU-DFI/TelecomGPT-R1/tree/main)
|
| 206 |
- **Unified benchmark.** [GSMA Open Telco Leaderboard](https://huggingface.co/spaces/GSMA/open-telco-leaderboard)
|
| 207 |
|
|
|
|
| 208 |
### Citation
|
| 209 |
|
| 210 |
```bibtex
|
|
|
|
| 232 |
|
| 233 |
---
|
| 234 |
|
| 235 |
+
<small>**†** On TeleTables, we follow the original paper's evaluation protocol and attach the table content directly to the prompt. This is a table-grounded reasoning setup, not a retrieval setup in which neither the table ID nor its content is provided.</small>
|