DualEmbLM

A masked language model trained from scratch on Old East Slavic and Old Church Slavonic texts, with dual character-level + word-level embeddings.

Architecture

DualEmbLM combines:

  • Character-level tokenisation (1 character = 1 token) — enables precise lacuna restoration at the character level
  • Word-level context embeddings — provides morphological and lexical context via a 50k word vocabulary
  • Transformer encoder (BERT architecture, trained from scratch) — 6 layers, hidden size 512, 8 attention heads

The dual embeddings are concatenated and projected into the shared hidden space before being passed to the transformer encoder.
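The concatenate-and-project step can be illustrated with a small framework-free sketch. It assumes that each character token is paired with the embedding of the whitespace-delimited word it belongs to; the card does not say how characters are aligned to words, so `char_tokens_with_word_ids` and `concat_and_project` are hypothetical helpers with toy dimensions, not the model's actual code.

```python
def char_tokens_with_word_ids(text):
    """Split text into 1-character tokens and record, for each character,
    the index of the whitespace-delimited word it belongs to (-1 for spaces).
    The alignment rule is an assumption made for illustration."""
    chars, word_ids = [], []
    word_index = -1
    in_word = False
    for ch in text:
        if ch.isspace():
            in_word = False
            word_ids.append(-1)
        else:
            if not in_word:
                # First character of a new word: advance the word counter.
                word_index += 1
                in_word = True
            word_ids.append(word_index)
        chars.append(ch)
    return chars, word_ids


def concat_and_project(char_vec, word_vec, weight, bias):
    """Concatenate a character embedding with a word embedding and apply
    a linear projection. `weight` has one row per hidden dimension and
    len(char_vec) + len(word_vec) columns."""
    joined = char_vec + word_vec  # list concatenation
    return [
        sum(w * x for w, x in zip(row, joined)) + b
        for row, b in zip(weight, bias)
    ]


chars, word_ids = char_tokens_with_word_ids("се азъ")
# Toy sizes: 2-dim character embedding, 1-dim word embedding, 2-dim hidden space.
hidden = concat_and_project(
    [1.0, 2.0], [3.0],
    weight=[[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]],
    bias=[0.0, 0.5],
)
```

In the real model the projection would map into the 512-dimensional hidden space described above; the sizes of the two embedding tables before projection are not given in the card.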

Training

The model was trained on a corpus of Old Russian and Church Slavonic texts assembled from the following sources:

| Source | Language | Word tokens | Link |
| --- | --- | --- | --- |
| Birchbark manuscripts | Old Novgorodian (mostly) | 21,464 | gramoty.ru |
| Epigraphy | Old Church Slavonic (mostly) | 8,102 | epigraphica.ru |
| DIACU | Old Church Slavonic; Church Slavonic (Old Russian, Middle Bulgarian, Serbian, Resava recensions); Middle Russian | 1,683,307 | ACL Anthology |
| TOROT | Old Russian; Church Slavonic | 682,430 | torottreebank.github.io |
| Bible (Ponomar) | Church Slavonic | 603,047 | GitHub |
| Byliny | Old Russian (XI–XVII c.) | 430,103 | rusneb.ru |
| Pushkin House | Old Russian | 256,503 | lib2.pushkinskijdom.ru |
| Military Statute (Part 2) | Old Russian | 49,787 | rusneb.ru |
| NKRYA (historical) | Old Russian; Old Rus (XI–XVIII c.) | 42,412 | ruscorpora.ru |

Masking details: MLM probability 8%, span masking, edge masking, random gap augmentation.
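The card names the masking strategies without defining them, so the following is only a sketch of one plausible reading. It assumes span masking hides short contiguous runs until about 8% of the characters are covered, and that edge masking hides a run at the start or end of a fragment to imitate a broken margin. The `MASK` placeholder, the `max_span` parameter, and both function names are hypothetical; random gap augmentation is not sketched because its behaviour cannot be inferred from the card.

```python
import random

MASK = "[MASK]"  # placeholder; the model's real mask token is not stated in the card


def span_mask(chars, mask_prob=0.08, max_span=3, rng=None):
    """Hide contiguous spans until round(len(chars) * mask_prob) characters
    (at least one) are masked. Returns the masked copy and the hidden positions."""
    rng = rng or random.Random()
    n = len(chars)
    if n == 0:
        return [], []
    budget = max(1, round(n * mask_prob))
    hidden = set()
    while len(hidden) < budget:
        # Never draw a span longer than the remaining budget.
        span = min(rng.randint(1, max_span), budget - len(hidden))
        start = rng.randrange(0, n - span + 1)
        hidden.update(range(start, start + span))
    masked = [MASK if i in hidden else ch for i, ch in enumerate(chars)]
    return masked, sorted(hidden)


def edge_mask(chars, width, side="left"):
    """Hide `width` characters at one edge of the sequence."""
    width = min(width, len(chars))
    if side == "left":
        positions = list(range(width))
    else:
        positions = list(range(len(chars) - width, len(chars)))
    masked = list(chars)
    for i in positions:
        masked[i] = MASK
    return masked, positions


masked, hidden = span_mask(list("x" * 100), rng=random.Random(0))
right_masked, right_positions = edge_mask(list("abcde"), 2, side="right")
```

Passing a seeded `random.Random` keeps the masking reproducible, which is convenient when comparing augmentation settings.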

Usage

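from transformers import AutoModelForMaskedLM

# trust_remote_code=True lets transformers run the custom model code
# shipped in the repository; review that code before enabling it.
model = AutoModelForMaskedLM.from_pretrained(
    "MaximEremeev/DualEmb-slav",
    trust_remote_code=True,
)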

Tasks

  • Generated lacunae restoration (Test A Hit@1: 0.817, CER: 0.183)
  • Real lacunae restoration (Test B char Hit@1: 0.466, span Hit@1: 0.222)
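For readers unfamiliar with the metrics, here is a minimal sketch of how Hit@1 and CER are commonly computed. It assumes Hit@1 is exact match of the top-ranked prediction and CER is Levenshtein distance divided by reference length; the card does not state the evaluation procedure, so these definitions and function names are illustrative only.

```python
def hit_at_1(predictions, references):
    """Fraction of items whose top-ranked prediction equals the reference."""
    if not references:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def cer(predictions, references):
    """Character error rate: total edit distance over total reference length."""
    total_edits = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars if total_chars else 0.0


# Character-level scoring: one predicted character per masked position.
char_hit = hit_at_1(["а", "б", "в", "г"], ["а", "б", "в", "д"])
char_cer = cer(["а", "б", "в", "г"], ["а", "б", "в", "д"])
# Span-level scoring: the whole restored span must match.
span_hit = hit_at_1(["азъ", "ес"], ["азъ", "есмь"])
```

When every prediction is a single character, each miss is exactly one substitution, so CER equals 1 minus Hit@1 under these definitions. The Test A figures (0.817 and 0.183) sum to 1, which is consistent with that reading, though the card does not confirm it.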

Contact

Maxim Eremeev, maeremeev@edu.hse.ru

Model size: 29M params (Safetensors, tensor type F32)