DatarrX/myX-Semantic

@@ -2,149 +2,148 @@
 license: apache-2.0
 language:
 - my
-pipeline_tag: feature-extraction
 tags:
 - myanmar
 - burmese
 - nlp
-- embeddings
-- semantic
-- fasttext
-library_name: fasttext
 datasets:
 - DatarrX/myX-Mega-Corpus
 ---
-# myX-Semantic: A High-Performance Burmese Word Embedding Model
-## 1. Introduction
-**myX-Semantic** is a word embedding model that turns the semantic relationships of the Burmese language into numeric vectors. It is built on FastText (Skip-gram) so that it can capture the contextual relationships and semantic similarity of words in Burmese text.
-## 2. Developer Information
-This model is published by [**DatarrX (Myanmar Open Source NGO)**](https://huggingface.co/DatarrX) and was primarily developed by [**Khant Sint Heinn (Kalix Louis)**](https://huggingface.co/kalixlouiis). It was created to make Natural Language Processing (NLP) resources for the Burmese language more widely available.
-## 3. Intended Use
-myX-Semantic can serve as a foundation for the following NLP tasks:
-* **Semantic Search:** Finding texts that share a meaning even when their spelling does not match exactly.
-* **Text Classification:** Sorting texts into categories.
-* **Sentiment Analysis:** Classifying the sentiment expressed in texts.
-* **Foundation for LLMs:** Serving as a semantic foundation for Large Language Models.
-## 4. Technical Details
-The model was trained with the following technical settings:
-* **Architecture:** FastText (Skip-gram)
-* **Training Data:** [**myX-Mega-Corpus**](https://huggingface.co/datasets/DatarrX/myX-Mega-Corpus), more than 16 million sentences (about 5.3 GB)
-* **Tokenizer:** myX-Tokenizer (64,000 vocabulary size)
-* **Vector Dimension:** 100
-* **Min Count:** 20
-* **Window Size:** 5
-* **Epochs:** 3
-## 5. Limitations and License
-### 5.1 Limitations
-* This model performs best only on text written in the Unicode standard.
-* Biases in the training data may affect the model's results.
-### 5.2 License
-This model is released under the **Apache License 2.0**. It may be used freely in commercial and research work, but the original creator must be credited as specified.
-## 6. How to Use
-This model can be used in a Python environment by following the steps below.
-### 6.1 Installation
-First, install the libraries needed to download the model from Hugging Face and load it.
-```BASH
-pip install fasttext huggingface_hub
 ```
-### 6.2 Loading the Model
-Use the following code to download the model directly from the Hugging Face Hub and load it.
-```Python
-import fasttext
-from huggingface_hub import hf_hub_download
-# Download the model file from Hugging Face
-model_path = hf_hub_download(repo_id="DatarrX/myX-Semantic", filename="myX-Semantic.bin")
-# Load the model with fasttext
-model = fasttext.load_model(model_path)
-```
-### 6.3 Basic Operations
-Once the model is loaded, you can try the following NLP operations.
-- a) Finding Nearest Neighbors
-To find the 10 words closest in meaning to a given word:
-```Python
-# Find the words nearest to 'နည်းပညာ' (technology)
-neighbors = model.get_nearest_neighbors("နည်းပညာ")
-for score, neighbor in neighbors:
-    print(f"{neighbor}: {score:.4f}")
 ```
-- b) Calculating Similarity Score
-To compute how close two words are in meaning:
-```Python
-import numpy as np
-def get_similarity(w1, w2):
-    v1 = model.get_word_vector(w1)
-    v2 = model.get_word_vector(w2)
-    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
-score = get_similarity("ပျော်တယ်", "ဝမ်းသာတယ်")
-print(f"Similarity Score: {score:.4f}")
-```
-- c) Getting Sentence Vector
-To convert a whole sentence into a vector (useful for Text Classification or Semantic Search):
-```Python
-sentence_vector = model.get_sentence_vector("မြန်မာနိုင်ငံ၏ နည်းပညာ ကဏ္ဍ တိုးတက်လာပုံ")
-print(sentence_vector)
 ```
-## 7. Training Procedure Summary
-The model was trained in two stages:
-* **Stage 1 - Tokenization:** [myX-Tokenizer](https://huggingface.co/DatarrX/myX-Tokenizer) was used to split more than 16 million sentences into subword units. Multiprocessing was used to speed up this step.
-* **Stage 2 - FastText Training:** The resulting tokens were trained with the FastText (Skip-gram) algorithm at dimension 100. A window size of 5 and negative sampling were used to capture more precise contexts.
-## 8. Training Code
-For reproducibility and transparency, the complete code that was used is available at the following GitHub link:
-👉 [https://github.com/DatarrX/myX-Semantic](https://github.com/DatarrX/myX-Semantic)
-## ၉။ မော်ဒယ်ဆိုင်ရာ အချက်အလက်များ (Model File Info)
-* **Model Version:** 1.0
-* **File Format:** Binary (.bin)
-* **File Size:** ~851.71 MB
-* **Vector Dimension:** 100
-* **Architecture:** FastText (Skip-gram)
-## 10. About DatarrX
-[**DatarrX**](https://huggingface.co/DatarrX) is an open-source NGO that creates advanced Natural Language Processing resources for the Burmese language. It was formed to make AI and open data more widely available in Myanmar's digital technology sector, and to let everyone use Burmese language datasets and models free of charge.
-## 11. Citation
-If you use this model in your research or projects, please cite it as follows:
-### APA Style
-Khant Sint Heinn. (2026). *myX-Semantic: A Burmese word embedding model for NLP tasks* [Computer software]. DatarrX. https://huggingface.co/DatarrX/myX-Semantic
-### BibTeX
 ```bibtex
 @software{khantsintheinn2026myxsemantic,
   author = {Khant Sint Heinn},
-  title = {myX-Semantic: A Burmese Word Embedding Model for NLP Tasks},
   year = {2026},
   publisher = {DatarrX},
-  url = {https://huggingface.co/DatarrX/myX-Semantic},
-  note = {Myanmar Open Source NGO}
 }
 ```
-## 12. Intended Language
-This model was built for the **Burmese language** only. Good results are not guaranteed when it is used for other languages.

 license: apache-2.0
 language:
 - my
+pipeline_tag: sentence-similarity
 tags:
+- sentence-transformers
+- sentence-similarity
+- feature-extraction
+- dense
+- generated_from_trainer
 - myanmar
 - burmese
 - nlp
+library_name: sentence-transformers
+dataset_size: 1000000
+loss: MSELoss
+base_model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
+widget:
+- source_sentence: >-
+    ▁ထို အလုပ်ရုံ သည် ▁ကျနော် ၏ ▁ကိုယ်ရေး အချက်အလက် များကို ဖတ်ရှု ကာ
+    ▁မေးခွန်းများ ▁မေး ကာ ▁ကျနော့်ကို ▁ဝယ် လိုက်ပါတော့သည်။
+  sentences:
+  - >-
+    ▁ထုံးတမ်းစဉ်လာ ▁လေး ပါး တွင် ▁ကံ ▁၊ ▁တရား ▁၊ ▁သ မ် စာ ▁၊ ▁မော သံ ▁နှင့်
+    ▁ယောဂ ▁အမျိုးအစား ▁အမျိုးမျိုး တို့ ▁ပါဝင် သည်။
+  - >-
+    ▁ကိုယ်ပိုင် ဟန် ၊ ▁ကိုယ်ပိုင် ဒီဇိုင်း ၊ ▁ကိုယ်ပိုင် စိတ်ကူး ၊ ▁ကိုယ်ပိုင်
+    ဖန်တီး မှုကို ▁ပြသ သည့် ▁ဝတ်စုံ များကို ▁ဒီဇိုင်နာ ▁မ မီး မီး က ▁ပန်းချီကား
+    တစ်ချပ် သဖွယ် ▁ဖန်တီး သူဖြစ်သည်။
 datasets:
 - DatarrX/myX-Mega-Corpus
 ---
+# 📝 myX-Semantic: A Burmese Sentence Embedding Model
+## Model Description
+**myX-Semantic** is a sentence-transformer model fine-tuned for the Burmese (Myanmar) language. It maps sentences and paragraphs into a **768-dimensional dense vector space**.
+This model is built with **knowledge distillation**. The student, `paraphrase-multilingual-MiniLM-L12-v2`, is trained to reproduce the embeddings of a larger teacher model (`paraphrase-multilingual-mpnet-base-v2`). Because the teacher outputs 768-dimensional vectors, a dedicated Dense layer projects the student's native 384 dimensions into the final 768-dimensional space.
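The distillation setup described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the training code: the helper names `dense` and `mse` and the toy 2-to-3 dimensions are assumptions chosen for readability, standing in for the real 384-to-768 projection and the `MSELoss` objective.

```python
def dense(vec, weight, bias):
    """Linear layer with identity activation: out[j] = sum_i vec[i] * weight[i][j] + bias[j]."""
    return [
        sum(v * row[j] for v, row in zip(vec, weight)) + bias[j]
        for j in range(len(bias))
    ]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy sizes (2 -> 3) stand in for the real 384 -> 768 projection.
student_vec = [1.0, 2.0]            # hypothetical pooled student embedding
weight = [[1.0, 0.0, 1.0],
          [0.0, 1.0, 1.0]]          # hypothetical Dense weights
bias = [0.0, 0.0, 0.5]
teacher_vec = [1.0, 2.0, 3.0]       # hypothetical teacher embedding

projected = dense(student_vec, weight, bias)  # student mapped into teacher space
loss = mse(projected, teacher_vec)            # quantity the student is trained to minimize
print(projected, round(loss, 4))
```

Training would lower this loss by updating both the student and the projection weights, so that the projected student vector moves toward the teacher vector for the same sentence.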
+### Key Applications
+*   **Semantic Textual Similarity (STS):** Measuring how similar two sentences are in meaning.
+*   **Semantic Search:** Retrieving relevant documents based on intent rather than keywords.
+*   **Text Classification & Clustering:** Grouping similar Burmese texts based on their semantic vectors.
+*   **Information Retrieval:** Finding answers or paraphrases in large Burmese datasets.
+## Development & Distribution
+*   **Developed by:** [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis)
+*   **Published by:** [DatarrX (Myanmar Open Source NGO)](https://huggingface.co/DatarrX)
+*   **Training Dataset:** [DatarrX/myX-Mega-Corpus](https://huggingface.co/datasets/DatarrX/myX-Mega-Corpus) (1 Million Rows)
+*   **Tokenization:** Processed using [DatarrX/myX-Tokenizer](https://huggingface.co/DatarrX/myX-Tokenizer).
+## Technical Specifications
+- **Base Model:** `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
+- **Max Sequence Length:** 512 tokens
+- **Output Dimension:** 768 dimensions
+- **Similarity Function:** Cosine Similarity
+- **Loss Function:** MSELoss (Mean Squared Error)
+### Model Architecture
+```text
+SentenceTransformer(
+  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertModel'})
+  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_mean_tokens': True})
+  (2): Dense({'in_features': 384, 'out_features': 768, 'bias': True, 'activation_function': 'Identity'})
+)
 ```
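To make the `Pooling` step in the printout concrete, here is a small dependency-free sketch of mean pooling with an attention mask. The function name `mean_pool` and the toy vectors are hypothetical; the real module works on batched 384-dimensional token embeddings.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m]
    count = max(len(kept), 1)  # avoid division by zero for an all-padding input
    return [sum(vec[i] for vec in kept) / count for i in range(dim)]


# Three token vectors; the last one is padding and must not affect the mean.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

sentence_vec = mean_pool(tokens, mask)
print(sentence_vec)  # [2.0, 3.0]
```

The pooled vector is what the `Dense` module then projects from 384 to 768 dimensions.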
+## Usage
+### Installation
+```bash
+pip install -U sentence-transformers
 ```
+### Direct Usage (Inference)
+```python
+from sentence_transformers import SentenceTransformer
+# Load the model
+model = SentenceTransformer("DatarrX/myX-Semantic")
+# Define sentences
+sentences = [
+    "သူနှင့် ကျွန်မ ခဏ ငြိမ်နေလိုက်၏။",
+    "ကျွန်တော်တို့ အတူတူ ထိုင်နေကြသည်။",
+    "နည်းပညာသည် လူသားတို့အတွက် အရေးကြီးသည်။"
+]
+# Compute embeddings
+embeddings = model.encode(sentences)
+# Compute similarity scores
+similarities = model.similarity(embeddings, embeddings)
+print(similarities)
 ```
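For semantic search, embeddings such as those returned by `model.encode` can be ranked against a query embedding. The sketch below uses plain Python lists and hypothetical helper names (`cosine`, `rank`) with toy 2-dimensional vectors; with real embeddings one would likely convert them with `.tolist()` or rely on `model.similarity` instead.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query_vec, doc_vecs, top_k=3):
    """Return (index, score) pairs for the top_k documents most similar to the query."""
    scored = [(i, cosine(query_vec, doc)) for i, doc in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for 768-dimensional sentence embeddings.
query = [1.0, 0.0]
docs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

top = rank(query, docs, top_k=2)
print(top)  # document 1 first, then document 2
```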
+## Implementation Guidelines (Thresholds)
+When using this model for similarity detection or semantic search, the choice of a similarity threshold is crucial for balancing precision and recall. Based on empirical testing:
+*   **Recommended Threshold:** A Cosine Similarity score of **0.60 or higher** is recommended to determine a strong semantic match.
+*   **Comparison:** Compared to lighter models (e.g., 500K-row variants), this 1M-row model produces more confident similarity scores. While those lighter models might need a threshold around 0.40, **myX-Semantic** separates matches from non-matches more distinctly at the 0.60 level.
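As a sketch of how the recommended threshold might be applied, the helper below scans a square similarity matrix and keeps the pairs scoring at or above 0.60. The name `matches_above` and the sample scores are illustrative assumptions, not measured results; a matrix from `model.similarity` could be converted with `.tolist()` first.

```python
def matches_above(similarities, threshold=0.60):
    """Return (i, j, score) for each distinct pair whose score meets the threshold."""
    pairs = []
    for i, row in enumerate(similarities):
        # Only look above the diagonal: skips self-matches and mirrored duplicates.
        for j in range(i + 1, len(row)):
            if row[j] >= threshold:
                pairs.append((i, j, row[j]))
    return pairs


# Hypothetical cosine similarities for three sentences.
sims = [
    [1.00, 0.72, 0.31],
    [0.72, 1.00, 0.45],
    [0.31, 0.45, 1.00],
]

print(matches_above(sims))  # [(0, 1, 0.72)]
```

Lowering the threshold to 0.40 on the same matrix would also admit the (1, 2) pair, which illustrates the precision and recall trade-off the guideline refers to.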
+## Training Details
+*   **Samples:** 1,000,000 training pairs.
+*   **Batch Size:** 64
+*   **Learning Rate:** 3e-5
+*   **Optimizer:** AdamW
+*   **Multi-dataset batch sampler:** `round_robin`
+*   **Teacher Model:** `paraphrase-multilingual-mpnet-base-v2` (768-dim).
+### Training Logs
+| Epoch | Step | Training Loss |
+| :--- | :--- | :--- |
+| 0.06 | 500 | 0.0086 |
+| 0.25 | 2000 | 0.0045 |
+| 0.64 | 5000 | 0.0031 |
+| 0.96 | 7500 | 0.0028 |
+## Limitations & Bias
+*   **Language:** This model is specifically optimized for Unicode Burmese. It may not perform accurately with Zawgyi-encoded text.
+*   **Data Bias:** The model reflects the patterns and biases found in the `myX-Mega-Corpus`. Users should validate results for specific sensitive domains.
+## License
+This model is licensed under the **Apache License 2.0**. You are free to use it for research and commercial purposes, provided appropriate credit is given.
+## Citation
+If you find this model useful in your project, please cite it:
 ```bibtex
 @software{khantsintheinn2026myxsemantic,
   author = {Khant Sint Heinn},
+  title = {myX-Semantic: A Burmese Sentence Embedding Model},
   year = {2026},
   publisher = {DatarrX},
+  url = {https://huggingface.co/DatarrX/myX-Semantic}
 }
 ```
+## About the Author
+**Khant Sint Heinn**, working under the name **Kalix Louis**, is a **Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development**. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.
+He is currently the **Lead Developer at DatarrX**, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.
+Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.
+His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.
+**Connect with the Author:**
+[GitHub](https://github.com/kalixlouiis) | [Hugging Face](https://huggingface.co/kalixlouiis) | [Kaggle](https://www.kaggle.com/organizations/kalixlouiis)