Vanilla Transformer + Morpheme Tokenizer v4 — EViRAL
Cross-lingual retrieval model:
Ede query → Vietnamese passage retrieval
Files
| File | Description |
|---|---|
| `mlm.pt` | MLM pre-trained encoder |
| `align.pt` | Cross-lingual aligned encoder |
| `finetune.pt` | Contrastive fine-tuned encoder (best) |
| `eval_results.csv` | Full val/test metrics |
| `eval_100queries.csv` | Metrics on 100 sampled test queries |
| `morph_tokenizer/vocab.json` | Token → id mapping (size: 57,029) |
| `morph_tokenizer/synonym_graph.json` | Ede–Ede synonym graph via Vi-pivot (IDF-weighted) |
| `morph_tokenizer/mi_table.json` | Global MI boundary scores (Upgrade A) |
| `morph_tokenizer/tokenizer_config.json` | Special tokens, IDs, hyperparams, model dims |
Tokenizer
Morpheme tokenizer v4:
- Global MI boundary scoring (`is_plausible_boundary` + corpus-wide MI table)
- IDF-weighted Ede–Ede synonym graph via Vi pivot
- Gumbel stochastic segmentation at train, MAP (deterministic) at eval
- Vi side: whitespace passthrough (syllable-level)
- Morpheme-aware vocab: `MIN_FREQ_MORPH=3` for fragments (≤4 chars), `MIN_FREQ_WORD=2` for whole words
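
The train/eval split in segmentation can be illustrated with the Gumbel-max trick. The sketch below is not the tokenizer's actual code: the function names, the `temperature` parameter, the toy word, and the candidate scores are all assumptions made for illustration. It only shows the general idea of picking the highest-scoring segmentation at eval (MAP) and a score-proportional random one at train.

```python
import math
import random

def gumbel_noise(rng):
    # Gumbel(0, 1) sample via the inverse CDF: -log(-log(U)), U ~ Uniform(0, 1).
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
    return -math.log(-math.log(u))

def choose_segmentation(candidates, training, rng=None, temperature=1.0):
    """Pick one segmentation from a list of (segmentation, score) pairs.

    Hypothetical helper: scores stand in for whatever the tokenizer derives
    from its MI table. Eval is deterministic (MAP); train perturbs each score
    with Gumbel noise, which samples from softmax(score / temperature).
    """
    if not training:
        return max(candidates, key=lambda c: c[1])[0]
    rng = rng or random.Random()
    return max(candidates, key=lambda c: c[1] / temperature + gumbel_noise(rng))[0]

# Toy candidates for a made-up word "abcde"; the scores are invented.
candidates = [
    (("abc", "de"), 2.0),
    (("ab", "cde"), 1.5),
    (("abcde",), 0.1),
]

map_choice = choose_segmentation(candidates, training=False)
sampled = choose_segmentation(candidates, training=True, rng=random.Random(0))
print("eval :", map_choice)
print("train:", sampled)
```

Sampling at train exposes the encoder to several plausible segmentations of the same word, while the deterministic eval path keeps retrieval results reproducible.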
Architecture
- Vanilla Transformer encoder
- `d_model=512`, `n_heads=8`, `n_layers=6`, `d_ff=2048`, `dropout=0.1`
- `max_length=128`
- Special tokens: `['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]']`
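
For a rough sense of scale, the dimensions above can be turned into a parameter estimate. This is only back-of-the-envelope arithmetic under assumptions that the model card does not confirm: learned token and positional embeddings, biases on every linear layer, two LayerNorms per layer, and no extra projection or output heads. The function name is hypothetical, and the real checkpoints may differ.

```python
def encoder_param_count(vocab_size=57029, d_model=512, n_layers=6,
                        d_ff=2048, max_length=128):
    """Estimate parameters of a standard Transformer encoder (assumed layout)."""
    # Assumption: learned token embeddings plus learned positional embeddings.
    embeddings = vocab_size * d_model + max_length * d_model
    # Q, K, V and output projections: four d_model x d_model matrices with biases.
    # n_heads does not change this count; heads only split d_model.
    attention = 4 * (d_model * d_model + d_model)
    # Feed-forward: d_model -> d_ff -> d_model, with biases.
    feed_forward = d_model * d_ff + d_ff + d_ff * d_model + d_model
    # Two LayerNorms per layer, each with a scale and a shift vector.
    layer_norms = 2 * 2 * d_model
    per_layer = attention + feed_forward + layer_norms
    return {
        "embeddings": embeddings,
        "per_layer": per_layer,
        "total": embeddings + n_layers * per_layer,
    }

counts = encoder_param_count()
for name, value in counts.items():
    print(f"{name:>10}: {value:,}")
```

Under these assumptions the token embedding table accounts for more than half of the total, which follows directly from the 57,029-entry vocabulary.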
Training pipeline
- MLM pre-training (Ede + Vi)
- Cross-lingual alignment (Ede query ↔ Vi query, InfoNCE)
- Contrastive fine-tuning on (Ede query, Vi passage) pairs
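
The alignment stage names InfoNCE as its objective. A minimal pure-Python sketch of that loss follows, assuming in-batch negatives, cosine similarity, and a temperature of 0.05. None of those three choices is stated in this card, and the function names are illustrative rather than taken from the training code.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

def info_nce(queries, targets, temperature=0.05):
    """Mean InfoNCE loss where queries[i] is paired with targets[i].

    Every other target in the batch acts as a negative for queries[i]
    (assumption: in-batch negatives, cosine similarity).
    """
    q = [l2_normalize(x) for x in queries]
    t = [l2_normalize(x) for x in targets]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, tj) / temperature for tj in t]
        # Numerically stable log-sum-exp over the row of similarities.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching target as the correct class.
        total += log_z - logits[i]
    return total / len(q)

# Toy 2-d embeddings: matching pairs give a near-zero loss, swapped pairs a large one.
ede = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(ede, [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce(ede, [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned: {aligned:.6f}  swapped: {swapped:.2f}")
```

The same form would apply to the fine-tuning stage if it also uses in-batch negatives, with Vi passage embeddings in place of Vi query embeddings; the card only describes that stage as contrastive.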
Reload tokenizer
import json

def load_json(path):
    # Close the file promptly and read as UTF-8 (the vocab holds non-ASCII text).
    with open(path, encoding="utf-8") as f:
        return json.load(f)

vocab = load_json("morph_tokenizer/vocab.json")
synonym_graph = load_json("morph_tokenizer/synonym_graph.json")
cfg = load_json("morph_tokenizer/tokenizer_config.json")

# mi_table: decode "word|cut_idx" -> (word, int(cut_idx))
mi_raw = load_json("morph_tokenizer/mi_table.json")
mi_table = {}
for key, score in mi_raw.items():
    word, cut_idx = key.rsplit("|", 1)
    mi_table[(word, int(cut_idx))] = score
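
Once `vocab` is loaded, already-segmented tokens can be mapped to ids. The sketch below uses a toy vocabulary so that it runs on its own. The `encode` helper, the `[CLS] ... [SEP]` framing, and the toy ids are assumptions for illustration; the real special-token ids are stored in `tokenizer_config.json`.

```python
def encode(tokens, vocab, max_length=128):
    """Map tokens to ids: [CLS] + tokens + [SEP], then pad to max_length.

    Hypothetical helper. Unknown tokens fall back to [UNK]; the attention
    mask is 1 for real positions and 0 for padding.
    """
    unk, pad = vocab["[UNK]"], vocab["[PAD]"]
    # Truncate so the two framing tokens still fit inside max_length.
    body = [vocab.get(tok, unk) for tok in tokens][: max_length - 2]
    ids = [vocab["[CLS]"]] + body + [vocab["[SEP]"]]
    attention_mask = [1] * len(ids) + [0] * (max_length - len(ids))
    ids = ids + [pad] * (max_length - len(ids))
    return ids, attention_mask

# Toy vocabulary with made-up ids; replace with the loaded `vocab` in practice.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
             "ab": 5, "cd": 6}

ids, mask = encode(["ab", "zz", "cd"], toy_vocab, max_length=8)
print(ids)
print(mask)
```

Here `"zz"` is absent from the toy vocabulary, so it maps to the `[UNK]` id, and the sequence is padded to the requested length.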