rtferraz
/
domainTokenizer

  • 1 contributor
History: 48 commits
rtferraz
Fix label leakage: temporal split - use first 70% of events as input, predict purchase in last 30%. Remove n_purchases/purchase_rate from features.
e4d8561 verified 1 day ago
  • docs
    Add e-commerce pre-training report - successful demo, behavioral clusters found, future improvements noted 1 day ago
  • examples
    Phase 3.0: Pipeline validation demo on mindweave/bank-transactions-us - ALL 10 CHECKS PASSED 7 days ago
  • notebooks
    Fix label leakage: temporal split - use first 70% of events as input, predict purchase in last 30%. Remove n_purchases/purchase_rate from features. 1 day ago
  • src
    CRITICAL FIX: Switch from ByteLevel to Whitespace pre-tokenizer - fixes 42% UNK rate on domain token sequences 2 days ago
  • tests
    Add fine-tuning test suite - 15 tests covering dataset, batching, forward/backward, Trainer smoke, multiclass 8 days ago
  • .gitattributes
    1.52 kB
    initial commit 8 days ago
  • .gitignore
    452 Bytes
    Add .gitignore - Python, Jupyter, training artifacts, IDE files 7 days ago
  • README.md
    8.46 kB
    Update README v0.3.0 - add usage example, update roadmap status, add implementation report link 8 days ago