Vanilla Transformer + Morpheme Tokenizer v4 — EViRAL

Cross-lingual retrieval model: given a query in Ede, retrieve relevant passages in Vietnamese (Vi).

Files

File                                   Description
mlm.pt                                 MLM pre-trained encoder
align.pt                               Cross-lingual aligned encoder
finetune.pt                            Contrastive fine-tuned encoder (best)
eval_results.csv                       Full val/test metrics
eval_100queries.csv                    Metrics on 100 sampled test queries
morph_tokenizer/vocab.json             Token → id mapping (size: 57,029)
morph_tokenizer/synonym_graph.json     Ede–Ede synonym graph via Vi-pivot (IDF-weighted)
morph_tokenizer/mi_table.json          Global MI boundary scores (Upgrade A)
morph_tokenizer/tokenizer_config.json  Special tokens, IDs, hyperparams, model dims

Tokenizer

Morpheme tokenizer v4:

  • Global MI boundary scoring (is_plausible_boundary + corpus-wide MI table)
  • IDF-weighted Ede–Ede synonym graph via Vi pivot
  • Gumbel stochastic segmentation during training; deterministic MAP segmentation at evaluation
  • Vi side: whitespace passthrough (syllable-level)
  • Morpheme-aware vocab: MIN_FREQ_MORPH=3 for fragments (≤4 chars), MIN_FREQ_WORD=2 for whole words

Architecture

  • Vanilla Transformer encoder
  • d_model=512, n_heads=8, n_layers=6, d_ff=2048, dropout=0.1
  • max_length=128
  • Special tokens: ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]']
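To make the listed dimensions concrete, the sketch below collects them in a config object and derives the per-head dimension and rough parameter counts. It assumes a standard encoder layer (biased linear projections, two LayerNorms per layer) and a token embedding table of vocab size × d_model; positional embeddings and any output head are not counted because the card does not describe them. `EncoderConfig` and its methods are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EncoderConfig:
    vocab_size: int = 57_029   # size of morph_tokenizer/vocab.json
    d_model: int = 512
    n_heads: int = 8
    n_layers: int = 6
    d_ff: int = 2048
    dropout: float = 0.1
    max_length: int = 128

    @property
    def head_dim(self) -> int:
        # Multi-head attention splits d_model evenly across heads.
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")
        return self.d_model // self.n_heads

    def embedding_params(self) -> int:
        return self.vocab_size * self.d_model

    def layer_params(self) -> int:
        d, f = self.d_model, self.d_ff
        attention = 4 * (d * d + d)           # Q, K, V and output projections with biases
        feed_forward = d * f + f + f * d + d  # two linear layers with biases
        layer_norms = 2 * 2 * d               # two LayerNorms, weight + bias each
        return attention + feed_forward + layer_norms

    def encoder_params(self) -> int:
        return self.n_layers * self.layer_params()

config = EncoderConfig()
print(config.head_dim, config.embedding_params(), config.encoder_params())
```

Under these assumptions the token embedding table holds more parameters than the six encoder layers combined, which is typical for a model of this width with a vocabulary of this size.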

Training pipeline

  1. MLM pre-training (Ede + Vi)
  2. Cross-lingual alignment (Ede query ↔ Vi query, InfoNCE)
  3. Contrastive fine-tuning on (Ede query, Vi passage) pairs
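Stages 2 and 3 both rely on a contrastive objective. The sketch below shows one common form of InfoNCE with in-batch negatives, written in plain Python for readability. It assumes cosine similarity and a temperature of 0.05; neither value is stated in this card, and the real training code may differ (for example, by using a symmetric loss or hard negatives).

```python
import math

def l2_normalize(vec):
    # Scale to unit length so the dot product equals cosine similarity.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def info_nce(queries, passages, temperature=0.05):
    """Mean InfoNCE loss; queries[i] and passages[i] form the positive pair.

    Every other passage in the batch serves as a negative for query i.
    """
    q = [l2_normalize(v) for v in queries]
    p = [l2_normalize(v) for v in passages]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # equals -log softmax(logits)[i]
    return total / len(q)

# Toy embeddings: aligned pairs should score a lower loss than mismatched ones.
ede = [[1.0, 0.0], [0.0, 1.0]]
vi = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(ede, vi)
mismatched_loss = info_nce(ede, vi[::-1])
```

The same function covers both stages under this reading: stage 2 would pass Vi query embeddings as the second argument, stage 3 would pass Vi passage embeddings.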

Reload tokenizer

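import json

def load_json(path):
    # Read as UTF-8 (Ede and Vietnamese text is non-ASCII) and close the file promptly.
    with open(path, encoding="utf-8") as f:
        return json.load(f)

vocab         = load_json("morph_tokenizer/vocab.json")  # token -> id
synonym_graph = load_json("morph_tokenizer/synonym_graph.json")
cfg           = load_json("morph_tokenizer/tokenizer_config.json")

# mi_table keys are stored as "word|cut_idx"; decode them to (word, int(cut_idx)).
# rsplit on the last "|" keeps any "|" inside the word itself intact.
mi_raw = load_json("morph_tokenizer/mi_table.json")
mi_table = {}
for key, score in mi_raw.items():
    word, cut_idx = key.rsplit("|", 1)
    mi_table[(word, int(cut_idx))] = score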
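Once `vocab` is loaded, already-segmented morphemes can be mapped to model inputs. The following is a hedged sketch only: it assumes the encoder expects `[CLS] … [SEP]` framing with `[PAD]` padding up to max_length=128, which this card does not confirm, and it uses a toy vocabulary so the example runs on its own. The `encode` helper is hypothetical and does not perform segmentation.

```python
def encode(tokens, vocab, max_length=128):
    """Map morpheme tokens to fixed-length ids plus an attention mask."""
    body = list(tokens)[: max_length - 2]  # leave room for [CLS] and [SEP]
    pieces = ["[CLS]"] + body + ["[SEP]"]
    unk_id = vocab["[UNK]"]
    ids = [vocab.get(tok, unk_id) for tok in pieces]  # unseen tokens -> [UNK]
    attention_mask = [1] * len(ids)
    pad = max_length - len(ids)
    ids += [vocab["[PAD]"]] * pad
    attention_mask += [0] * pad
    return ids, attention_mask

# Toy vocabulary with made-up ids; the real ids live in vocab.json.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
             "ab": 5, "cd": 6}
ids, mask = encode(["ab", "zz", "cd"], toy_vocab, max_length=8)
```

With the real files, pass the loaded `vocab` and check `cfg` for the actual special-token ids before relying on this framing.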