---
language:
- uz
- en
license: apache-2.0
base_model: Qwen/Qwen3-4B
tags:
- uzbek
- qwen3
- language-model
- text-generation
- nlp
- central-asia
- low-resource
- tokenizer-optimization
datasets:
- behbudiy/alpaca-cleaned-uz
- NeuronUz/uzbek-spelling-mcq
pipeline_tag: text-generation
model-index:
- name: NeuronAI-Uzbek
  results:
  - task:
      type: text-generation
      name: Uzbek Language Understanding
    dataset:
      name: UzLiB Benchmark
      type: uzlib
    metrics:
    - type: accuracy
      value: 0.662
      name: Overall Accuracy
---

<div align="center">

# 🇺🇿 NeuronAI-Uzbek

### The Most Advanced Open-Source Language Model for Uzbek

[Model on Hugging Face](https://huggingface.co/NeuronUz/NeuronAI-Uzbek) · [License: Apache 2.0](https://opensource.org/licenses/Apache-2.0) · [Base Model: Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B)

**🏆 4th Place Globally | 🥇 1st Place in Uzbekistan on UzLiB Benchmark**

*Outperforming GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Flash on Uzbek language tasks*

</div>

---

## Key Results

<div align="center">

| Achievement | Value |
|-------------|-------|
| **UzLiB Overall Score** | **0.662** |
| **Global Ranking** | **#4** |
| **Regional Ranking** | **#1 in Uzbekistan** |
| **Tokenizer Efficiency Improvement** | **+22.5%** vs Qwen3-4B |

</div>

---

## UzLiB Benchmark Performance

NeuronAI-Uzbek achieves exceptional performance on the [UzLiB Benchmark](https://github.com/tahrirchi/uzlib/blob/main/LEADERBOARD.md), a comprehensive evaluation suite for Uzbek language understanding.

### Leaderboard Position

[View the full UzLiB leaderboard](https://github.com/tahrirchi/uzlib/blob/main/LEADERBOARD.md)

> **Note**: NeuronAI-Uzbek is the **smallest model** in the top 10, with only **4B parameters**, while competing against models with 100B+ parameters.
|
### Performance Comparison vs the Original Qwen3-4B

| Metric | Qwen3-4B (Original) | NeuronAI-Uzbek | Improvement |
|--------|:-------------------:|:--------------:|:-----------:|
| **Overall (All)** | 0.345 | **0.662** | **+91.9%** |
| Correct Word | 0.351 | 0.718 | +104.6% |
| Meaning | 0.309 | 0.466 | +50.8% |
| Meaning in Context | 0.347 | 0.333 | -4.0% |
| Fill-in | 0.327 | 0.385 | +17.7% |
|
---

## Tokenizer Efficiency

We optimized the tokenizer specifically for Uzbek, achieving significantly better tokenization efficiency: a lower fertility rate means fewer tokens per word, which translates into faster inference and lower cost.

### Fertility Rate Comparison

| Model | Fertility Rate | Std Dev | Vocab Size | Improvement vs Qwen3 |
|-------|:--------------:|:-------:|:----------:|:--------------------:|
| **NeuronAI-Uzbek (Ours)** | **2.67** | 0.15 | 180,000 | **+22.5%** |
| Gemma 2-9B | 3.15 | 0.22 | 256,000 | +8.3% |
| LLaMA 3.1-8B | 3.32 | 0.22 | 128,256 | +3.7% |
| DeepSeek-V3 | 3.32 | 0.21 | 128,815 | +3.4% |
| Qwen3-4B (Original) | 3.44 | 0.22 | 151,669 | - |

> **Fertility Rate**: the average number of tokens per word. Lower is better for efficiency.

<div align="center">
<img src="assets/fertility_comparison_chart.png" alt="Tokenizer Fertility Rate Comparison" width="700"/>
</div>
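
Fertility is straightforward to estimate yourself: split a corpus into words, tokenize each word, and average the token counts. A minimal sketch of that calculation (`toy_tokenize` is a hypothetical stand-in, not this model's tokenizer):

```python
from statistics import mean, stdev

def fertility(tokenize, words):
    """Mean and standard deviation of tokens-per-word for a tokenizer."""
    counts = [len(tokenize(w)) for w in words]
    return mean(counts), stdev(counts)

# Hypothetical stand-in tokenizer: splits a word into 4-character chunks.
def toy_tokenize(word):
    return [word[i:i + 4] for i in range(0, len(word), 4)]

words = ["kitob", "o'qituvchi", "maktab", "til"]
rate, sd = fertility(toy_tokenize, words)
print(round(rate, 2))  # 2.0
```

With a real tokenizer from `transformers`, the same function can be called as `fertility(lambda w: tokenizer.encode(w, add_special_tokens=False), words)` over a representative Uzbek word list.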

### What This Means

- **22.5% fewer tokens** needed to represent Uzbek text
- **Faster inference** due to shorter sequences
- **Lower API costs** when deployed
- **Better context utilization**: fit more content into the same context window

---

## Model Details

### Architecture

| Property | Value |
|----------|-------|
| **Base Model** | Qwen3-4B |
| **Parameters** | 4 Billion |
| **Vocabulary Size** | 180,000 tokens |
| **Context Length** | 32,768 tokens |
| **Architecture** | Transformer (decoder-only) |
| **Precision** | BFloat16 |
|
### Training Methodology

1. **Tokenizer surgery**: extended the vocabulary with 40,000 Uzbek-optimized tokens
2. **Embedding initialization**: new-token embeddings initialized semantically from their subword compositions
3. **Continual pretraining**: trained on a 2B-token corpus of Uzbek and English text
4. **Instruction fine-tuning**: aligned on Uzbek and English instruction datasets
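
Step 2 can be sketched concretely: instead of random initialization, each newly added token starts at the mean of the embeddings of the subwords the original tokenizer would split it into. A minimal illustration under that assumption (the table, ids, and dimensions are toy values; in practice this operates on the model's token-embedding matrix):

```python
def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def init_new_embedding(old_embeddings, subword_ids):
    """Semantic initialization: a new token's embedding is the mean of the
    embeddings of the subwords the *old* tokenizer splits it into."""
    return mean_vector([old_embeddings[i] for i in subword_ids])

# Toy 4-token embedding table with 3-dimensional rows.
old_emb = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0], [9.0, 10.0, 11.0]]
# Hypothetical: the old tokenizer splits the new token into subword ids [1, 3].
print(init_new_embedding(old_emb, [1, 3]))  # [6.0, 7.0, 8.0]
```

This keeps the new rows close to regions of embedding space the pretrained model already understands, which tends to stabilize the early phase of continual pretraining.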

### Training Data

| Dataset | Type | Purpose |
|---------|------|---------|
| Uzbek Web Corpus | Pretraining | Language modeling |
| behbudiy/alpaca-cleaned-uz | SFT | Uzbek instructions |
| NeuronUz/uzbek-spelling-mcq | SFT | Benchmark-targeted training |
| vicgalle/alpaca-gpt4 | SFT | English capability retention |

---

## Quick Start

### Installation

```bash
pip install transformers torch
```
|
### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NeuronUz/NeuronAI-Uzbek"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# "Give brief information about Uzbekistan."
prompt = "O'zbekiston haqida qisqacha ma'lumot bering."

messages = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
|
### With Thinking Mode (Chain-of-Thought)

```python
# "Find natural numbers below 100 that are divisible by 3 (five of them)."
messages = [
    {"role": "user", "content": "5 ta 3 ga bo'linuvchi 100 dan kichik natural sonlarni toping."}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # enable step-by-step reasoning
)
# Generate from `text` exactly as in the Basic Usage example.
```
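
In thinking mode, the decoded completion contains the reasoning as well as the final answer; Qwen3-style models wrap the reasoning in a `<think>…</think>` block. A small helper to separate the two (the sample string below is illustrative, not real model output):

```python
def split_thinking(text):
    """Split a decoded completion into (reasoning, answer).

    Assumes Qwen3-style output where the chain-of-thought ends at `</think>`.
    """
    marker = "</think>"
    if marker in text:
        thinking, answer = text.split(marker, 1)
        return thinking.replace("<think>", "").strip(), answer.strip()
    return "", text.strip()

raw = "<think>3, 6, 9, 12 va 15 hammasi 3 ga bo'linadi.</think>Javob: 3, 6, 9, 12, 15."
thinking, answer = split_thinking(raw)
print(answer)  # Javob: 3, 6, 9, 12, 15.
```

When thinking mode is disabled, or the marker is absent, the helper simply returns the whole completion as the answer.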

---

## Use Cases

NeuronAI-Uzbek excels at:

- **Text Generation**: Creative writing and content creation in Uzbek
- **Question Answering**: Answering questions about Uzbek culture, history, and general knowledge
- **Reading Comprehension**: Understanding and analyzing Uzbek texts
- **Grammar & Spelling**: Uzbek language correctness tasks
- **Translation Assistance**: Uzbek-English language tasks
- **Conversational AI**: Building Uzbek chatbots and assistants

---

## ⚠️ Limitations

- **Knowledge Cutoff**: The training data has a fixed knowledge cutoff date, so the model may be unaware of recent events
- **Hallucinations**: May generate plausible-sounding but incorrect information
- **Bias**: May reflect biases present in the training data
- **Not for Critical Applications**: Should not be used for medical, legal, or safety-critical applications without human oversight

---

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

---
|
## Acknowledgments

- **Qwen Team** at Alibaba for the excellent Qwen3-4B base model
- **UzLiB Benchmark** creators for the comprehensive evaluation framework
- **Uzbek NLP Community** for datasets and linguistic resources

---
|
## Citation

```bibtex
@misc{neuronai-uzbek-2025,
  title={NeuronAI-Uzbek: An Optimized Language Model for Uzbek},
  author={NeuronAI Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/NeuronUz/NeuronAI-Uzbek}
}
```

---
|
<div align="center">

**Built with ❤️ in Uzbekistan by [NeuronUz](https://huggingface.co/NeuronUz)**

</div>
|