RSLM Tokenizer 65K

CPU-safe Byte-Level BPE tokenizer for RSLM.

Training data

Dataset: turkish-nlp-suite/BellaTurca

Subsets:

  • AkademikDerlem
  • OzenliDerlem
  • temiz-OSCAR
  • temiz-mC4

Column: text

Target size: about 700,000,000 estimated tokens in total, roughly 175,000,000 per subset.
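The per-subset figure is the total split evenly across the four subsets. The sketch below shows that arithmetic and one way such a budget could be enforced while sampling. The helper `take_until_budget` and its `estimate_tokens` callback are hypothetical names introduced for illustration; the card does not say how tokens were actually estimated.

```python
from typing import Callable, Iterable, Iterator

TOTAL_TARGET_TOKENS = 700_000_000
SUBSETS = ["AkademikDerlem", "OzenliDerlem", "temiz-OSCAR", "temiz-mC4"]

# Even split across the four subsets: 700M / 4 = 175M each.
PER_SUBSET_BUDGET = TOTAL_TARGET_TOKENS // len(SUBSETS)


def take_until_budget(
    texts: Iterable[str],
    budget: int,
    estimate_tokens: Callable[[str], int],
) -> Iterator[str]:
    """Yield texts until the running token estimate reaches the budget.

    Hypothetical helper: the estimator is supplied by the caller because
    the estimation method used for this tokenizer is not documented.
    """
    used = 0
    for text in texts:
        if used >= budget:
            break
        used += estimate_tokens(text)
        yield text
```

Because the check happens before each document is added, the last document can push the estimate slightly past the budget, which fits the "approximately" in the figures above.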

Vocab

  • Requested vocab size: 65,536
  • Actual vocab size: 65,536
  • BPE min frequency: 3
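A vocabulary of 65,536 entries is exactly 2^16, so every token ID (0 through 65,535) fits in an unsigned 16-bit integer. The sketch below illustrates this with the standard-library `array` module; `pack_ids` is a hypothetical helper, and storing IDs this way is an assumption about downstream use, not something the card prescribes.

```python
from array import array
from typing import Iterable

VOCAB_SIZE = 65_536
MAX_TOKEN_ID = VOCAB_SIZE - 1  # 65,535, the largest ID the vocabulary can produce


def pack_ids(ids: Iterable[int]) -> array:
    """Pack token IDs into an unsigned 16-bit array (type code "H").

    Raises OverflowError if an ID falls outside the range 0..65535.
    """
    return array("H", ids)
```

On common platforms each packed ID takes 2 bytes, half the space of a 32-bit ID.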

Special tokens

  • <|pad|>
  • <|bos|>
  • <|eos|>
  • <|unk|>
  • <|system|>
  • <|user|>
  • <|assistant|>
  • <|answer|>
  • <|end|>
  • <think>
  • </think>
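
The settings listed above map onto a byte-level BPE training call roughly as follows. This is a minimal sketch using the Hugging Face `tokenizers` library, assuming it is installed; it is not the script used to build this tokenizer, and the function name `train_byte_level_bpe` is hypothetical.

```python
from typing import Iterable

VOCAB_SIZE = 65_536
MIN_FREQUENCY = 3
SPECIAL_TOKENS = [
    "<|pad|>", "<|bos|>", "<|eos|>", "<|unk|>",
    "<|system|>", "<|user|>", "<|assistant|>", "<|answer|>", "<|end|>",
    "<think>", "</think>",
]


def train_byte_level_bpe(texts: Iterable[str]):
    """Train a byte-level BPE tokenizer with the settings from this card.

    Illustrative only: the actual training pipeline is not documented here.
    """
    # Imported lazily so the constants can be inspected without the dependency.
    from tokenizers import ByteLevelBPETokenizer

    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train_from_iterator(
        texts,
        vocab_size=VOCAB_SIZE,        # requested size; small corpora may yield fewer entries
        min_frequency=MIN_FREQUENCY,  # pairs seen fewer than 3 times are not merged
        special_tokens=SPECIAL_TOKENS,
    )
    return tokenizer
```

On a small corpus the resulting vocabulary can come out below the requested size, because merging stops once no pair meets the minimum frequency. For this tokenizer the requested and actual sizes match.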