LazuriMT — Turkish ↔ Laz (Lazuri) Translation

🇬🇧 English — LazuriMT is an open-source machine translation adapter for Turkish ↔ Laz (Lazuri), an endangered Kartvelian language spoken in northeastern Türkiye and parts of Georgia.

🇹🇷 Türkçe — LazuriMT, Türkçe ↔ Lazca (Lazuri) arasında çeviri için açık kaynaklı bir adaptördür. Lazca, Türkiye'nin kuzeydoğusu ve Gürcistan'ın bazı bölgelerinde konuşulan, nesli tükenmekte olan bir Kartvel dilidir.

🌊 Lazuri — LazuriMT, Turkuli do Lazuri nenape şeni açikkaynaki tercüme modeli ren. Lazuri, Turkias do Gurcistanis na isinapunan Kartveluri nena ren.

LoRA adapter for Gemma 4 E4B. v0.2 research preview.

⚠️ Status: research preview, not production-quality

chrF on 200 held-out test pairs (TR→LZ): 26.97 (v0.1 was 24.66)
Produces genuine Laz for everyday sentences, but quality is uneven on rare vocabulary, and dialect conditioning has little effect on the output.
Built for endangered-language preservation, research, and community use.
Full training pipeline + iteration log: https://github.com/CidQu/lazca_ai

Quick start

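from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the 4-bit base model, then attach the LazuriMT LoRA adapter on top of it.
base = "unsloth/gemma-4-e4b-it-unsloth-bnb-4bit"
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto", load_in_4bit=True)
model = PeftModel.from_pretrained(model, "CidQuLimited/LazuriMT")
tok = AutoTokenizer.from_pretrained("CidQuLimited/LazuriMT")

def translate(text, to="lzz"):
    # to="lzz" translates Turkish -> Laz; any other value translates Laz -> Turkish.
    prompt = (f"Translate this Turkish sentence into Laz (Lazuri):\n\n{text}"
              if to == "lzz"
              else f"Translate this Laz (Lazuri) sentence into Turkish:\n\n{text}")
    # return_dict=True yields a mapping with input_ids and attention_mask,
    # regardless of the installed library version's default return type.
    inputs = tok.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=True, add_generation_prompt=True,
        return_tensors="pt", return_dict=True,
    ).to(model.device)
    # Beam-search settings match those used for the reported chrF scores.
    out = model.generate(
        **inputs, max_new_tokens=128, do_sample=False,
        no_repeat_ngram_size=3, repetition_penalty=1.15, num_beams=4,
    )
    # Decode only the newly generated tokens, not the echoed prompt.
    prompt_len = inputs["input_ids"].shape[1]
    return tok.decode(out[0][prompt_len:], skip_special_tokens=True).strip()

print(translate("Su içmek istiyorum."))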

Pin to a specific release with revision="v0.2" (or "v0.1" for the older one):

model = PeftModel.from_pretrained(model, "CidQuLimited/LazuriMT", revision="v0.2")

Performance

chrF computed on 200 held-out TR→LZ pairs from the corpus's test split (5%), with beam-search decoding (no_repeat_ngram_size=3, repetition_penalty=1.15, num_beams=4).

Version	chrF (TR→LZ)	Notes
baseline Gemma 4 E4B (no adapter)	≈ 0	does not translate Laz
v0.1	24.66	LoRA r=32, 10,500 masked-loss steps (~2.15 epochs), Kaggle T4
v0.2 (this release)	26.97	LoRA r=64, 18,000 steps (3 epochs), A100, cosine-restart LR, 3× dialect upweight

As a rough rule of thumb (not a formal scale), chrF scores correspond to the following quality levels:

~10: garbled
~20–30: readable but flawed
~40+: useful translations
~50+: professional-level

LazuriMT v0.2 is in the "readable but flawed" range — a real but early baseline for a language with almost no prior MT.
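
To make the metric concrete, here is a minimal pure-Python sketch of a sentence-level chrF-style score: character n-gram precision and recall are averaged over orders 1 to 6 and combined into an F-score with β = 2. It is a simplified illustration, not the scorer used for the numbers above. The function names `char_ngrams` and `chrf_sketch` are hypothetical, and established implementations differ in details such as corpus-level aggregation, so values from this sketch will not match the reported scores exactly.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring whitespace (a simplification)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf_sketch(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            # One side is too short to contain n-grams of this order; skip it.
            continue
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta = 2 weights recall more heavily than precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    # Toy strings only; these are not Laz sentences.
    print(chrf_sketch("abc", "abc"))   # identical strings
    print(chrf_sketch("ab", "abcd"))   # hypothesis covers part of the reference
    print(chrf_sketch("abc", "xyz"))   # no shared characters
```

Because the score works on characters rather than words, a hypothesis with the right stem but a wrong suffix still earns partial credit, which is one reason chrF is a common choice for morphologically rich languages.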

Training setup

Base model: unsloth/gemma-4-e4b-it-unsloth-bnb-4bit (Gemma 4 E4B, pre-quantized to 4-bit)
Adapter: LoRA on language layers (attention + MLP), r=64, α=64, dropout 0
Trainable params: 146,800,640 of 8,142,957,088 (1.80 %)
Loss masking: response-only (loss computed on Laz output tokens, instruction prompt masked)
Optimizer: 8-bit AdamW, lr=2e-4, cosine-with-restarts (2 cycles), warmup_ratio 0.03, bf16
Batch: 16 effective (8 per-device × 2 grad-accum)
Steps: 18,000 (3 epochs over 102,461 conversations, incl. 3× dialect upweighting + grammar examples)
Hardware: 1× NVIDIA A100-40GB (Modal), Unsloth runtime
Training time: ~8 h (full run, no timeout)
Bidirectional: every TR↔LZ pair is presented in both directions during training
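
The bidirectional presentation and response-only loss masking listed above can be sketched as follows. This is an illustration under assumptions, not the project's training code: the prompt templates are copied from the Quick start, while `expand_bidirectional`, `mask_prompt`, the character-level `toy_tokenize`, and the placeholder Laz string are hypothetical. The only external convention relied on is that a label of -100 is ignored by PyTorch's cross-entropy loss by default.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss

# Prompt templates as they appear in the Quick start.
TR_TO_LZ = "Translate this Turkish sentence into Laz (Lazuri):\n\n{text}"
LZ_TO_TR = "Translate this Laz (Lazuri) sentence into Turkish:\n\n{text}"


def expand_bidirectional(turkish, laz):
    """Turn one parallel pair into two (prompt, response) training examples."""
    return [
        (TR_TO_LZ.format(text=turkish), laz),
        (LZ_TO_TR.format(text=laz), turkish),
    ]


def mask_prompt(prompt_ids, response_ids):
    """Concatenate token ids and hide the prompt positions from the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


def toy_tokenize(text):
    # Hypothetical stand-in for the real tokenizer: one id per character.
    return [ord(ch) for ch in text]


if __name__ == "__main__":
    # The Laz side is a placeholder, not a real translation.
    examples = expand_bidirectional("Su içmek istiyorum.", "<Laz reference>")
    for prompt, response in examples:
        input_ids, labels = mask_prompt(toy_tokenize(prompt), toy_tokenize(response))
        supervised = sum(1 for label in labels if label != IGNORE_INDEX)
        print(f"{len(input_ids)} tokens, {supervised} contribute to the loss")
```

In a real pipeline the prompt would also include the chat-template tokens, and those would be masked along with the instruction text.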

Known limitations (and v0.3 roadmap)

Dialect conditioning still doesn't differentiate output. "Atina (Pazar)" vs "Xopa (Hopa)" prompts produce near-identical translations. v0.2 attempted a fix — 3× upweighting of dialect-tagged pairs plus a front-loaded [Laz dialect: X] label in the prompt — but it did not meaningfully change behavior. The likely cause: even at 3×, dialect-tagged pairs are only ~9 % of the training mix, so the model defaults to general-form Laz. v0.3 will try a dialect-balanced sampler (equal exposure per dialect rather than blunt upweighting) plus additional dialect-tagged parallel data.
Single-word queries often collapse onto plausible but wrong tokens (e.g. a dictionary-style Turkish word can yield the wrong Laz lemma). Vocabulary entries still make up the largest slice of the corpus, yet they teach vocabulary lookup only imperfectly.
Long, content-dense sentences degrade — they can diverge substantially from the reference (more a coverage/data-volume issue than a decoding one).
Vocabulary edge cases — some real Laz words are mistranslated (model emits a wrong-but-plausible Laz word).
Single dialect bias in output — the corpus is mostly general-form Laz with the largest single-dialect contribution being Atina (Pazar); expect output to lean general / Atina.
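
As a rough illustration of the dialect-balanced sampling idea mentioned for v0.3, the sketch below picks a dialect uniformly at random first and only then picks a pair within that dialect, so each dialect gets equal expected exposure regardless of how many pairs it has. This is one possible reading of "equal exposure per dialect", not the planned implementation. The function name `dialect_balanced_batches`, the record layout, the "general" label, and the toy data are all assumptions.

```python
import random
from collections import Counter, defaultdict


def dialect_balanced_batches(pairs, batch_size, num_batches, seed=0):
    """Yield batches in which every dialect is equally likely per drawn example."""
    rng = random.Random(seed)
    by_dialect = defaultdict(list)
    for pair in pairs:
        by_dialect[pair["dialect"]].append(pair)
    dialects = sorted(by_dialect)  # fixed order keeps runs reproducible
    for _ in range(num_batches):
        # Choose the dialect first, then an example within that dialect.
        yield [rng.choice(by_dialect[rng.choice(dialects)]) for _ in range(batch_size)]


def make_toy_pairs():
    # Hypothetical, heavily imbalanced toy corpus with placeholder text.
    sizes = {"general": 90, "Atina (Pazar)": 5, "Xopa (Hopa)": 5}
    return [
        {"dialect": dialect, "tr": f"tr-{dialect}-{i}", "lz": f"lz-{dialect}-{i}"}
        for dialect, size in sizes.items()
        for i in range(size)
    ]


if __name__ == "__main__":
    batches = dialect_balanced_batches(make_toy_pairs(), batch_size=16, num_batches=200)
    counts = Counter(pair["dialect"] for batch in batches for pair in batch)
    total = sum(counts.values())
    for dialect, count in sorted(counts.items()):
        print(f"{dialect}: {count / total:.1%} of sampled examples")
```

The trade-off is that small dialect subsets are repeated far more often than general-form pairs, so a sampler like this would likely need to be paired with the additional dialect-tagged data the roadmap mentions to limit overfitting.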

Bias and intended use

Intended for: Laz language preservation, research, language-learning aids, accessibility tools, community projects. Not a replacement for human translators in any setting where accuracy matters (legal, medical, etc.).
Bias: trained on a mix of written sources (dictionaries, school textbooks, folktales, news articles). Will reflect the registers and dialects of those sources.
Out of scope: code translation, modern colloquial / internet Laz (the corpus is mostly literary/educational).

License

The adapter is a derivative work of Gemma 4 and inherits the Gemma Terms of Use, which are commercial-friendly but carry acceptable-use restrictions. Downstream users must comply with Gemma's terms.

The training corpus mixes open-license sources (Wikipedia CC-BY-SA, Mozilla Common Voice CC-BY, GPL-3.0 Wiktionary, public-domain Lazuri Paramitepe 1982) with academically attributed sources used under fair use for endangered-language research. The adapter weights are released for research and community use under these combined terms.

Citation

@misc{lazurimt2026,
  title  = {LazuriMT: A Turkish-Laz Machine Translation Adapter for an Endangered Kartvelian Language},
  author = {Yavuz Selimhan Kaya},
  year   = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/CidQuLimited/LazuriMT}},
  note   = {v0.2 research preview, chrF 26.97 on 200 TR→LZ test pairs}
}

Acknowledgments

The Lazuri community: İsmail Avcı Bucaklişi, Hasan Uzunhasanoğlu (Lazuri.Com), Ali İhsan Aksamaz, Özlem Durmaz (translator of Anadolu Dillerinde Küçük Prens), the Laz Institute, contributors to lazcasozluk.org and the Lazuri Wiktionary GitHub project, the Ministry of National Education of Türkiye, the broader Laz language preservation community, and every Laz speaker who has kept this language alive.

Tools: Unsloth for the QLoRA training stack, Google's Gemma 4 as the base model, the Hugging Face ecosystem, and Kaggle for the GPU compute.

Full reproduction code, training data sources, and iteration history: https://github.com/CidQu/lazca_ai
