wbhVince829 committed on
Commit
3aa9f37
·
1 Parent(s): 5843459

update figure 3, teletables, and prepared for outreach

Files changed (1)
  1. README.md +10 -14
README.md CHANGED
@@ -9,20 +9,18 @@ license: apache-2.0
9
 
10
  ## 1 — A New State of the Art for Telecom LLMs
11
 
12
- **TelecomGPT-R1 (27B) reaches state-of-the-art (SOTA) performance on the [GSMA Open Telco Leaderboard](https://huggingface.co/spaces/GSMA/open-telco-leaderboard) at 89.6% average, matching or leading every open-source and closed-source entrant across both general-purpose and telecom-specialized categories.** The leaderboard aggregates 7 benchmarks spanning 4 evaluation axes (telecom knowledge QA, 3GPP protocol comprehension, fault and log diagnosis, and RF/network modeling), as reported in Figure 1.
13
 
14
  - **Among open-source models**, TelecomGPT-R1 leads DeepSeek-V3-0324 (685B) by **+30.3**, LLaMA-3.3-70B by **+34.9**, and Qwen2.5-72B by **+35.6**, while operating at roughly **25× fewer active parameters than the next-best open entrant**.
15
  - **Among closed-source models**, TelecomGPT-R1 reaches SOTA performance across both the general-purpose frontier tier and the telecom-specialized tier, as detailed in the two bullets below.
16
  - **Among general-purpose frontier models**, TelecomGPT-R1 leads Gemini-3.1-Pro by **+14.0**, Claude-Opus-4.6 by **+16.3**, and GPT-5 by **+17.7**. These systems sit at the **trillion-parameter-class frontier** (active-parameter counts are not publicly disclosed but are widely reported as orders of magnitude larger than 27B), making the margin a parameter-efficiency result as much as an accuracy result.
17
- - **Among telecom-specialized models**, TelecomGPT-R1 is **on par with the leading closed operator-internal telecom model AT&T's OTel-LLM-8.3B-QnA**, and leads SoftBank LTM by **+16.0**, demonstrating that an open telecom reasoning model can reach SOTA performance alongside top operator-internal baselines on the GSMA Open Telco Leaderboard. [^teletables]
18
 
19
  **In one line: TelecomGPT-R1 demonstrates that an open 27B telecom reasoning model can reach SOTA performance across the full breadth of the GSMA Open Telco Leaderboard.**
20
 
21
 
22
  ![Figure 1. TelecomGPT-R1 vs frontier closed-source models on the GSMA Open Telco Leaderboard](https://cdn-uploads.huggingface.co/production/uploads/6882f57510e86d9f80580702/soASkkHnjJ7nm7iAMY5tv.png)
23
- **Figure 1 | TelecomGPT-R1 vs frontier closed-source models on the GSMA Open Telco Leaderboard.** *Each spoke is one benchmark (plus the overall average), normalized by its per-axis leaderboard best so that `1.0` = best score on that benchmark. Our 27B open-source policy reaches `1.0` on **five of eight axes** (3GPP-TSG, srsRANBench, TeleLogs, TeleTables, Average) and stays at or above `0.95` on every other axis, visibly tracing the outer edge of the radar where no other model, open or closed, matches it on all axes simultaneously.*
24
-
25
-
26
 
27
 
28
  ---
@@ -98,9 +96,9 @@ vllm serve KU-DFI/TelecomGPT-R1 \
98
 
99
  **Hardware**: Following the official [Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) deployment guidance, TelecomGPT-R1 (27B, bf16) runs on a single **A100 80GB** (or equivalent **H100 80GB** / **MI300X**) with the default settings above. Multi-GPU nodes allow longer contexts and larger batches behind an operator firewall.
100
 
101
-
102
  **Smaller TelecomGPT-R1 variants — coming soon.** Lighter checkpoints better suited to **edge / device-side** inference are currently in training, extending the family from data-center GPU deployment down toward on-device telecom intelligence.
103
 
 
104
  ---
105
 
106
 
@@ -131,10 +129,9 @@ Therefore, the industry needs an **open-source telecom reasoner** that can be:
131
 
132
  ### 2.4 — What TelecomGPT-R1 improves
133
 
134
- TelecomGPT-R1 represents a definitive leap forward: a **27B open-weights base** trained to perform **universal reasoning across knowledge QA, 3GPP protocol comprehension, fault/log diagnosis, and RF/network modeling under a single unified policy**. Rather than stitching together specialized heads per task, one model handles the full four-axis surface evaluated by the GSMA Open Telco Leaderboard (producing the leaderboard result reported in §1), while remaining small enough to **self-host, fine-tune, and audit inside an operator environment**.
135
-
136
  ![Diverse reasoning tasks and data modalities a telecom engineer may encounter in day-to-day work.](https://cdn-uploads.huggingface.co/production/uploads/6882f57510e86d9f80580702/M6SlDTzpx4W6wvAGE0eTp.png)
137
- **Figure 2 | The four kinds of reasoning a telecom engineer juggles.** *Each scope shows one axis of telecom work (knowledge QA 15.3%, protocol understanding 22.7%, fault analysis 18.5%, modeling & computation 43.5%) and the share of the 158,915-example TelecomGPT-R1 training corpus that targets it. The cross-axis distribution explains why we train one unified policy rather than four specialists: a real workflow mixes all four in the same session.*
138
 
139
  <!-- A telecom engineer's day cuts across four very different kinds of thinking — and a useful AI has to fluidly switch between them:
140
 
@@ -155,10 +152,10 @@ The challenges in §2 (heterogeneous modalities, missing telecom domain knowledg
155
 
156
  **A multi-stage post-training procedure that grounds general reasoning in telecom facts.** Supervised fine-tuning installs the telecom "language" (how to read standards, follow protocol constraints, walk a log, close a derivation) that subsequent reinforcement learning then sharpens. Without this grounding step, RL amplifies *fluent wrong reasoning*: well-formed chains that happen to operate on hallucinated 3GPP clauses, mis-read log features, or unit-dropped derivations. The RL stage targets the three failure modes that naïve outcome-reward training suffers on heterogeneous telecom data (sparse final-answer signal, uneven learning progress across axes, and reward gaming via shortcut answers), with the full algorithmic details described in the accompanying paper.
157
 
158
- The combined effect is what §1 reports: a single 27B open policy that reaches **89.0% average on the GSMA Open Telco Leaderboard**, leading every open-source, frontier-closed, and operator-internal entrant.
159
 
160
- ![TelecomGPT-R1 end-to-end recipe](https://cdn-uploads.huggingface.co/production/uploads/6882f57510e86d9f80580702/WkwUMMWWFpS1EAJls6jhO.png)
161
- **Figure 3 | The simplified end-to-end TelecomGPT-R1 recipe.** *Heterogeneous telecom sources → a fine-grained dataset processing pipeline → one unified, axis-indexed corpus of 158,915 examples → supervised fine-tuning of [Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) → experience-pool-differentiated GRPO, yielding the final TelecomGPT-R1 27B policy.*
162
 
163
  ---
164
 
@@ -208,7 +205,6 @@ KU/DFI's role is to build that open commons. The program now spans the key layer
208
  - **Model weights.** [KU-DFI/TelecomGPT-R1](https://huggingface.co/KU-DFI/TelecomGPT-R1/tree/main)
209
  - **Unified benchmark.** [GSMA Open Telco Leaderboard](https://huggingface.co/spaces/GSMA/open-telco-leaderboard)
210
 
211
-
212
  ### Citation
213
 
214
  ```bibtex
@@ -236,4 +232,4 @@ This work was supported by the Digital Future Institute of Khalifa University; t
236
 
237
  ---
238
 
239
- [^teletables]: On TeleTables, we follow the original paper's evaluation protocol by attaching the table content directly to the prompt — a table-grounded reasoning setup rather than retrieval without table id or content.
 
9
 
10
  ## 1 — A New State of the Art for Telecom LLMs
11
 
12
+ **TelecomGPT-R1 (27B) reaches state-of-the-art (SOTA) performance on the [GSMA Open Telco Leaderboard](https://huggingface.co/spaces/GSMA/open-telco-leaderboard) at 89.6% average, matching or leading every open-source and closed-source entrant across both general-purpose and telecom-specialized categories.** As reported in Figure 1, the leaderboard aggregates 7 benchmarks spanning 4 reasoning axes: **protocol understanding** (3GPP/O-RAN normative prose), **knowledge QA** (vendor and operator facts), **modeling & computation** (RF/queueing derivations), and **fault analysis** (RAN drive-test logs).
13
 
14
  - **Among open-source models**, TelecomGPT-R1 leads DeepSeek-V3-0324 (685B) by **+30.3**, LLaMA-3.3-70B by **+34.9**, and Qwen2.5-72B by **+35.6**, while operating at roughly **25× fewer active parameters than the next-best open entrant**.
15
  - **Among closed-source models**, TelecomGPT-R1 reaches SOTA performance across both the general-purpose frontier tier and the telecom-specialized tier, as detailed in the two bullets below.
16
  - **Among general-purpose frontier models**, TelecomGPT-R1 leads Gemini-3.1-Pro by **+14.0**, Claude-Opus-4.6 by **+16.3**, and GPT-5 by **+17.7**. These systems sit at the **trillion-parameter-class frontier** (active-parameter counts are not publicly disclosed but are widely reported as orders of magnitude larger than 27B), making the margin a parameter-efficiency result as much as an accuracy result.
17
+ - **Among telecom-specialized models**, TelecomGPT-R1 is **on par with the leading closed operator-internal telecom model AT&T's OTel-LLM-8.3B-QnA**, and leads SoftBank LTM by **+16.0**, demonstrating that an open telecom reasoning model can reach SOTA performance alongside top operator-internal baselines on the GSMA Open Telco Leaderboard.<sup>†</sup>
18
 
19
  **In one line: TelecomGPT-R1 demonstrates that an open 27B telecom reasoning model can reach SOTA performance across the full breadth of the GSMA Open Telco Leaderboard.**
20
 
21
 
22
  ![Figure 1. TelecomGPT-R1 vs frontier closed-source models on the GSMA Open Telco Leaderboard](https://cdn-uploads.huggingface.co/production/uploads/6882f57510e86d9f80580702/soASkkHnjJ7nm7iAMY5tv.png)
23
+ **Figure 1 | TelecomGPT-R1 vs frontier closed-source models on the GSMA Open Telco Leaderboard.** *Each spoke is one benchmark (plus the overall average), normalized by its per-axis leaderboard best so that `1.0` = best score on that benchmark. Our 27B open-source policy reaches `1.0` on **six of eight axes** (3GPP-TSG, srsRANBench, TeleLogs, TeleQnA, TeleTables, Average) and stays at or above `0.94` on every other axis, visibly tracing the outer edge of the radar where no other model, open or closed, matches it on all axes simultaneously.*
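The per-benchmark normalization described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, assuming all scores are positive; the function name `normalize_by_best` and the model, benchmark, and score values are hypothetical placeholders, not leaderboard data.

```python
from typing import Dict


def normalize_by_best(scores: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Divide each score by the best score any model reached on that benchmark."""
    # Collect every benchmark name that appears for any model.
    benchmarks = {b for per_model in scores.values() for b in per_model}
    # Best (maximum) score per benchmark across all models that report it.
    best = {b: max(m[b] for m in scores.values() if b in m) for b in benchmarks}
    return {
        model: {b: s / best[b] for b, s in per_model.items()}
        for model, per_model in scores.items()
    }


# Hypothetical scores for illustration only (NOT leaderboard values).
example = {
    "model_a": {"bench_x": 80.0, "bench_y": 45.0},
    "model_b": {"bench_x": 60.0, "bench_y": 50.0},
}
normalized = normalize_by_best(example)
for model, per_bench in normalized.items():
    print(model, {b: round(v, 3) for b, v in sorted(per_bench.items())})
```

Under this scheme the best model on a benchmark sits at exactly `1.0` on that spoke, so a model that traces the outer edge of the radar is the per-benchmark leader (or tied for it) on each of those spokes.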
 
 
24
 
25
 
26
  ---
 
96
 
97
  **Hardware**: Following the official [Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) deployment guidance, TelecomGPT-R1 (27B, bf16) runs on a single **A100 80GB** (or equivalent **H100 80GB** / **MI300X**) with the default settings above. Multi-GPU nodes allow longer contexts and larger batches behind an operator firewall.
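As a rough sanity check on the single-GPU claim, the weight footprint can be estimated from the parameter count. The sketch below assumes a nominal 27e9 parameters at 2 bytes each (bf16) and an 80 GB device; the helper name `weight_memory_gb` is illustrative, and the estimate ignores KV cache, activations, and framework overhead, which all draw on the remaining headroom.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: nominal 27B parameters; the exact checkpoint count may differ.
weights_gb = weight_memory_gb(27e9)   # bf16 -> 2 bytes per parameter
headroom_gb = 80.0 - weights_gb       # left for KV cache, activations, overhead

print(f"weights: ~{weights_gb:.0f} GB, headroom on an 80 GB GPU: ~{headroom_gb:.0f} GB")
```

The headroom figure is what bounds context length and batch size on one device, which is why multi-GPU nodes allow longer contexts and larger batches.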
98
 
 
99
  **Smaller TelecomGPT-R1 variants — coming soon.** Lighter checkpoints better suited to **edge / device-side** inference are currently in training, extending the family from data-center GPU deployment down toward on-device telecom intelligence.
100
 
101
+
102
  ---
103
 
104
 
 
129
 
130
  ### 2.4 — What TelecomGPT-R1 improves
131
 
132
+ TelecomGPT-R1 represents a definitive leap forward: a **27B open-weights base** trained to perform **universal reasoning across protocol understanding, knowledge QA, modeling & computation, and fault analysis under a single unified policy**. Rather than stitching together specialized heads per task, one model handles the full four-axis surface evaluated by the GSMA Open Telco Leaderboard (producing the leaderboard result reported in §1), while remaining small enough to **self-host, fine-tune, and audit inside an operator environment**.
 
133
  ![Diverse reasoning tasks and data modalities a telecom engineer may encounter in day-to-day work.](https://cdn-uploads.huggingface.co/production/uploads/6882f57510e86d9f80580702/M6SlDTzpx4W6wvAGE0eTp.png)
134
+ **Figure 2 | The four kinds of reasoning a telecom engineer juggles.** *Each scope shows one axis of telecom work (protocol understanding 22.7%, knowledge QA 15.3%, modeling & computation 43.5%, fault analysis 18.5%) and the share of the 158,915-example TelecomGPT-R1 training corpus that targets it. The cross-axis distribution explains why we train one unified policy rather than four specialists: a real workflow mixes all four in the same session.*
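The shares in the Figure 2 caption can be turned into approximate per-axis example counts. This is a small arithmetic sketch: the percentages are the rounded values from the caption, so the resulting counts are estimates rather than the exact corpus split.

```python
TOTAL_EXAMPLES = 158_915

# Percent shares as stated in the Figure 2 caption (rounded to one decimal).
shares_pct = {
    "protocol understanding": 22.7,
    "knowledge QA": 15.3,
    "modeling & computation": 43.5,
    "fault analysis": 18.5,
}

# The four shares should cover the whole corpus.
total_pct = sum(shares_pct.values())

# Approximate counts; exact values depend on the unrounded shares.
approx_counts = {
    axis: round(TOTAL_EXAMPLES * pct / 100) for axis, pct in shares_pct.items()
}

print(f"shares sum to {total_pct:.1f}%")
for axis, count in approx_counts.items():
    print(f"{axis}: ~{count:,} examples")
```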
135
 
136
  <!-- A telecom engineer's day cuts across four very different kinds of thinking — and a useful AI has to fluidly switch between them:
137
 
 
152
 
153
  **A multi-stage post-training procedure that grounds general reasoning in telecom facts.** Supervised fine-tuning installs the telecom "language" (how to read standards, follow protocol constraints, walk a log, close a derivation) that subsequent reinforcement learning then sharpens. Without this grounding step, RL amplifies *fluent wrong reasoning*: well-formed chains that happen to operate on hallucinated 3GPP clauses, mis-read log features, or unit-dropped derivations. The RL stage targets the three failure modes that naïve outcome-reward training suffers on heterogeneous telecom data (sparse final-answer signal, uneven learning progress across axes, and reward gaming via shortcut answers), with the full algorithmic details described in the accompanying paper.
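The "sparse final-answer signal" failure mode can be illustrated with the group-relative advantage used in GRPO-style training, which the earlier Figure 3 caption names. This is a minimal sketch of the generic textbook formulation, not the authors' experience-pool variant (which is described only in the accompanying paper). The function name, the `eps` value, and the choice of population standard deviation are assumptions; implementations differ on these details.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize outcome rewards within one group of sampled answers to a prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes: correct answers get positive advantage, wrong ones negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Every sampled answer is wrong: all advantages are zero, so the group
# contributes no learning signal under a pure outcome reward.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print("mixed:", [round(a, 3) for a in mixed])
print("all wrong:", all_wrong)
```

When a prompt is hard enough that every sample fails, the group carries no gradient information, which is one way a pure outcome reward becomes sparse on difficult telecom tasks.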
154
 
155
+ The combined effect is what §1 reports: a single 27B open policy that reaches **89.6% average on the GSMA Open Telco Leaderboard**, matching or leading every open-source, frontier-closed, and operator-internal entrant.
156
 
157
+ ![TelecomGPT-R1 end-to-end recipe](https://cdn-uploads.huggingface.co/production/uploads/6882f57510e86d9f80580702/pYUNVVVjsE3vH8zik8woz.png)
158
+ **Figure 3 | The end-to-end TelecomGPT-R1 recipe.** *Frame ① distills four families of heterogeneous telecom material (standards documents, network telemetry and drive-test logs, math papers and code, and Q&A seeds and glossaries) into a single axis-balanced curated corpus of 158,915 examples across four reasoning axes (protocol, knowledge, modeling, fault). Frame ② then drives the corpus through a three-stage post-training progression (domain grounding, policy stabilization, and verifiable reasoning refinement under axis-aligned signals), yielding TelecomGPT-R1.*
159
 
160
  ---
161
 
 
205
  - **Model weights.** [KU-DFI/TelecomGPT-R1](https://huggingface.co/KU-DFI/TelecomGPT-R1/tree/main)
206
  - **Unified benchmark.** [GSMA Open Telco Leaderboard](https://huggingface.co/spaces/GSMA/open-telco-leaderboard)
207
 
 
208
  ### Citation
209
 
210
  ```bibtex
 
232
 
233
  ---
234
 
235
+ <small>**†** On TeleTables, we follow the original paper's evaluation protocol by attaching the table content directly to the prompt — a table-grounded reasoning setup rather than retrieval without table id or content.</small>