
Phase 2A–2D Implementation Report

domainTokenizer v0.4.0 — Core library complete: tokenizers, models, pre-training, fine-tuning

139 tests passing (72 tokenizer + 33 model + 19 pre-training + 15 fine-tuning)

April 2026


Overview

Phase 2 implements the complete domainTokenizer library — everything needed to go from raw domain events (financial transactions, e-commerce actions, clinical encounters) to a fine-tuned downstream prediction model. The implementation directly follows the validated patterns from Nubank's nuFormer (arXiv:2507.23267), with architecture decisions grounded in 6 audited reference papers.

The library is organized as four layers, each built and tested independently before composing into the next:

```
Phase 2A: Tokenizers  →  Phase 2B: Models  →  Phase 2C: Pre-training  →  Phase 2D: Fine-tuning
(schema → tokens)        (tokens → loss)      (CLM on sequences)         (joint fusion on labels)
```

Phase 2A: Domain Tokenizer Library (Weeks 1–3)

What Was Built

A declarative schema system and 5 per-field tokenizers that convert raw domain events into HuggingFace-compatible token sequences.

| Component | Purpose | Output |
|---|---|---|
| `DomainSchema` + `FieldSpec` | Declarative event definition — fields, types, bin counts | Schema object |
| `SignTokenizer` | Credit/debit sign (+/−) | `79.99 → [AMT_SIGN_POS]` |
| `MagnitudeBucketTokenizer` | Quantile-based numerical bins (fits on data) | `79.99 → [AMT_15]` |
| `CalendarTokenizer` | Timestamp → month/dow/dom/hour decomposition | Mar 15 2pm → 4 tokens |
| `CategoricalTokenizer` | Fixed category mapping with UNK fallback | `"purchase" → [EVT_001]` |
| `DiscreteNumericalTokenizer` | Small integers with overflow | `3 → [QTY_03]`, `15 → [QTY_OVER]` |
| `DomainTokenizerBuilder` | Assembles per-field tokenizers → HF `PreTrainedTokenizerFast` | HF tokenizer |

Three predefined schemas ship out of the box:

  • FINANCE_SCHEMA — 97 domain tokens (Nubank-compatible: sign + amount bins + calendar)
  • ECOMMERCE_SCHEMA — event type + price + quantity + category + calendar + product title
  • HEALTHCARE_SCHEMA — clinical event type + cost + severity + provider + calendar + description

Key Technical Decisions

  1. Hybrid vocabulary: special tokens + BPE. Following Nubank exactly, structured fields (amounts, dates, categories) become single special tokens while free-text fields (descriptions, product titles) use standard BPE. This compresses each event to ~14 tokens vs ~35-50 with pure text serialization, tripling the number of events that fit in a 2048-token context window.

  2. Quantile-based magnitude binning (not linear). The MagnitudeBucketTokenizer computes bin edges at evenly spaced quantiles of the absolute values, not uniform-width intervals. Financial data is heavily skewed (many small transactions, few large ones), so uniform bins would dump most events into the first few buckets and leave the rest nearly empty. Quantile bins give each bin roughly equal representation in the training data, maximizing the model's ability to distinguish between common transaction sizes.
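The equal-mass property of quantile binning can be demonstrated standalone with NumPy. This is an illustrative sketch of the idea, not the library's API; `fit_quantile_edges` and `bucketize` are hypothetical names.

```python
import numpy as np

def fit_quantile_edges(amounts, n_bins=32):
    """Compute bin edges so each bin holds ~equal mass of the training data.

    Sketch of the quantile-binning idea; the real MagnitudeBucketTokenizer
    API differs.
    """
    magnitudes = np.abs(np.asarray(amounts, dtype=float))
    # Interior edges at evenly spaced percentiles of |amount|.
    qs = np.linspace(0, 100, n_bins + 1)[1:-1]
    return np.percentile(magnitudes, qs)

def bucketize(amount, edges):
    # searchsorted maps a magnitude to its quantile bin index (0..n_bins-1).
    return int(np.searchsorted(edges, abs(amount), side="right"))

# Skewed data, as in finance: many small values, few large ones.
rng = np.random.default_rng(0)
amounts = rng.lognormal(mean=3.0, sigma=1.5, size=10_000)
edges = fit_quantile_edges(amounts, n_bins=32)
counts = np.bincount([bucketize(a, edges) for a in amounts], minlength=32)
```

With uniform-width bins on the same lognormal data, nearly all mass lands in the first few buckets; with quantile edges, every bucket holds close to 1/32 of the samples.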

  3. Separate sign and magnitude tokenization. Following Nubank's ϕ_sign + ϕ_amt pattern, the sign (credit/debit) is tokenized independently from the magnitude. This lets the model learn that "a $500 inflow" and "a $500 outflow" share magnitude semantics but differ in direction — without wasting bins on both positive and negative ranges.

  4. Schema-driven factory pattern. Field tokenizers are created automatically from FieldSpec declarations via create_field_tokenizer(). Adding a new domain requires only defining a DomainSchema — no code changes to the tokenizer pipeline. This enables rapid domain iteration (finance → e-commerce → healthcare) without engineering overhead.
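The factory dispatch can be sketched in a few lines. This is a simplified stand-in: the real `FieldSpec` fields, `FieldType` members, and tokenizer classes differ, and the stubs here do no tokenization.

```python
from dataclasses import dataclass
from enum import Enum, auto

class FieldType(Enum):
    SIGN = auto()
    MAGNITUDE = auto()
    CATEGORICAL = auto()

@dataclass
class FieldSpec:
    name: str
    field_type: FieldType

# Stub classes standing in for the real tokenizer implementations.
class SignTokenizer: ...
class MagnitudeBucketTokenizer: ...
class CategoricalTokenizer: ...

_REGISTRY = {
    FieldType.SIGN: SignTokenizer,
    FieldType.MAGNITUDE: MagnitudeBucketTokenizer,
    FieldType.CATEGORICAL: CategoricalTokenizer,
}

def create_field_tokenizer(spec: FieldSpec):
    # Dispatch purely on the declared field type: adding a new domain
    # means writing a schema, not touching the pipeline.
    try:
        return _REGISTRY[spec.field_type]()
    except KeyError:
        raise ValueError(f"no tokenizer registered for {spec.field_type}")
```

Because the pipeline only ever sees `FieldSpec` declarations, a new domain is just a new list of specs.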

  5. Data-dependent tokenizers require explicit fitting. MagnitudeBucketTokenizer must be .fit() on training data before use. Calling .build() on an unfitted schema raises RuntimeError. This prevents a subtle bug where bin edges are computed on test data, leaking information.

  6. HuggingFace-native output. The DomainTokenizerBuilder.build() method produces a standard PreTrainedTokenizerFast — the same type returned by AutoTokenizer.from_pretrained(). This means zero adaptation for HF Trainer, push_to_hub(), save_pretrained(), ONNX export, etc.

Test Results

72 tests passing covering: field spec validation, all 5 tokenizer types (including edge cases: NaN, None, overflow, unknown categories), predefined schemas (including Nubank 97-token compatibility check), builder fit/build/tokenize/encode pipeline, and full end-to-end sequence encoding.


Phase 2B: Model Architecture (Weeks 3–5)

What Was Built

A GPT-style causal decoder Transformer registered as a HuggingFace PreTrainedModel, plus numerical embeddings and joint fusion components.

| Component | Purpose | Based On |
|---|---|---|
| `DomainTransformerConfig` | HF-compatible config with presets (`"24m"`, `"85m"`, `"330m"`) | Nubank nuFormer sizes |
| `DomainTransformerForCausalLM` | Causal decoder: NoPE, pre-norm, SDPA attention, weight tying | NoPE (arXiv:2305.19466) + GPT-2 |
| `PeriodicLinearReLU` | Learned sin/cos embeddings for numerical features | Gorishniy et al. (arXiv:2203.05556) |
| `DCNv2` + `JointFusionModel` | Transformer + tabular feature fusion for fine-tuning | Nubank + DCN V2 (arXiv:2008.13535) |

Key Technical Decisions

  1. NoPE (No Positional Encoding). Following Kazemnejad et al. (NeurIPS 2023), the model uses zero positional encoding — no absolute, no RoPE, no ALiBi. NoPE outperforms all PE schemes on length generalization benchmarks. For domain sequences where users have vastly different history lengths (20 to 2000+ events), length generalization is critical. The model implicitly learns relative position from the causal attention mask pattern.

  2. F.scaled_dot_product_attention with is_causal=True, not nn.MultiheadAttention. PyTorch's nn.MultiheadAttention(is_causal=True) has a known bug requiring an explicit attn_mask even when is_causal=True is set. We implement attention directly using F.scaled_dot_product_attention, which auto-dispatches to FlashAttention/cuDNN when available on CUDA, and uses an efficient C++ kernel on CPU.
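A minimal causal attention module built on `F.scaled_dot_product_attention` might look like the following. This is a sketch under the decisions above, not the library's exact implementation; layer names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Causal self-attention via SDPA (sketch; names are illustrative)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x, attn_mask=None):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, C) -> (B, n_heads, T, head_dim)
        shape = (B, T, self.n_heads, C // self.n_heads)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))
        # SDPA forbids passing both attn_mask and is_causal=True, so
        # is_causal is enabled only when no explicit mask is supplied.
        out = F.scaled_dot_product_attention(
            q, k, v, attn_mask=attn_mask, is_causal=attn_mask is None
        )
        out = out.transpose(1, 2).reshape(B, T, C)
        return self.proj(out)
```

A quick property check: perturbing the last token must leave all earlier outputs unchanged, which is exactly what the causal mask guarantees.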

  3. HF attention mask conversion. HuggingFace Trainer sends attention masks as (B, T) long tensors (1=attend, 0=pad). PyTorch SDPA requires either None (use is_causal) or a float mask where masked positions are -inf. The attention module handles this conversion: when a mask is provided, it's expanded to (B, 1, 1, T), converted to float, and inverted (0 → -inf, 1 → 0.0). When no mask is provided, is_causal=True handles causality for free.
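The conversion described above can be sketched as a small helper. The function name is illustrative; the library's internal helper may differ.

```python
import torch

def hf_mask_to_sdpa(attention_mask: torch.Tensor, dtype=torch.float32):
    """Convert an HF-style (B, T) 0/1 mask to an additive SDPA float mask.

    1 -> 0.0 (attend), 0 -> -inf (masked); the (B, 1, 1, T) shape
    broadcasts over heads and query positions.
    """
    expanded = attention_mask[:, None, None, :]          # (B, 1, 1, T)
    float_mask = torch.zeros(expanded.shape, dtype=dtype)
    return float_mask.masked_fill(expanded == 0, float("-inf"))
```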

  4. Weight tying via HF v5.7+ dict format. The _tied_weights_keys API changed from a list to a dict in transformers 5.7. We use {"lm_head.weight": "model.embed_tokens.weight"} with proper get/set_input_embeddings and get/set_output_embeddings implementations. post_init() handles the actual tying.

  5. Pre-norm architecture (LayerNorm before attention/FFN). GPT-2 and most modern LLMs use pre-norm. This makes training more stable than post-norm, especially at the 24M–330M scale where we don't have the luxury of extensive hyperparameter tuning.

  6. get_user_embedding() method on the CausalLM class. For downstream tasks (classification, joint fusion), we need a single vector representing the user's transaction history. This method extracts the hidden state at the last non-padding position — the standard approach for decoder-only models. It uses attention_mask.sum(dim=1) - 1 to find the last real token position per sequence.
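The gather described above can be sketched as follows, assuming right-padded sequences; the function name is illustrative.

```python
import torch

def last_token_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor):
    """Return the hidden state at the last non-padding position per sequence.

    hidden_states: (B, T, d_model); attention_mask: (B, T) with 1=real, 0=pad.
    """
    # Index of the last real token in each row: (#real tokens) - 1.
    last_idx = attention_mask.sum(dim=1) - 1                 # (B,)
    batch_idx = torch.arange(hidden_states.size(0))
    return hidden_states[batch_idx, last_idx]                # (B, d_model)
```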

  7. PLR frequencies and phases are learned parameters. Unlike fixed Fourier features, PLR initializes frequencies and phases as trainable nn.Parameter tensors. This lets the model discover the most informative frequency decomposition for each numerical feature during training — crucial for financial data where relevant scales span 4+ orders of magnitude.
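A compact sketch of the PLR idea (Gorishniy et al. 2022): trainable frequencies feed paired sin/cos activations (which together play the role of learned phases), followed by Linear + ReLU. Parameter names and defaults here are illustrative, not the library's.

```python
import torch
import torch.nn as nn

class PeriodicLinearReLU(nn.Module):
    """PLR numerical embeddings, sketch version."""

    def __init__(self, n_features: int, n_frequencies: int = 8, d_embed: int = 16):
        super().__init__()
        # One set of trainable frequencies per numerical feature.
        self.freqs = nn.Parameter(torch.randn(n_features, n_frequencies))
        self.linear = nn.Linear(2 * n_frequencies, d_embed)

    def forward(self, x):                         # x: (B, n_features)
        # (B, F, 1) * (F, K) -> (B, F, K) periodic activations.
        angles = 2 * torch.pi * self.freqs * x[..., None]
        periodic = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
        return torch.relu(self.linear(periodic))  # (B, F, d_embed)
```

Because `freqs` is an `nn.Parameter`, gradients flow into the frequency decomposition itself, which is what lets the model pick scales spanning several orders of magnitude.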

Test Results

33 tests passing covering: config presets and serialization, base model forward shapes, CausalLM with/without labels, loss differentiability, weight tying verification, user embedding extraction (with and without mask), parameter counts for tiny and 24M configs, gradient checkpointing, causal masking verification, PLR shapes and gradients, DCNv2 cross layers, JointFusion binary and multiclass, and full tokenizer→model→loss integration.


Phase 2C: Pre-training Pipeline (Weeks 5–7)

What Was Built

A data pipeline and training harness that connects the tokenizer and model layers into a complete CLM pre-training workflow.

| Component | Purpose |
|---|---|
| `tokenize_user_sequences()` | Converts lists of user event sequences → variable-length token ID lists |
| `pack_sequences()` | Packs variable-length sequences into fixed-length blocks (`run_clm.py` pattern) |
| `prepare_clm_dataset()` | Convenience pipeline: user events → tokenize → pack → HF `Dataset` |
| `pretrain_domain_model()` | Pre-trains via HF Trainer with `DataCollatorForLanguageModeling`, cosine schedule |

Key Technical Decisions

  1. Sequence packing, not padding. Following the official HF run_clm.py pattern, all tokenized user sequences are concatenated into one long stream and split into fixed-length blocks. This achieves 100% token utilization — every position in every training example is a real token contributing gradient signal. Padding wastes 30-70% of tokens for variable-length sequences, which is unacceptable when training data is finite (typical business scenario). The trade-off: cross-sequence boundaries exist within blocks. For domain events delimited by [BOS]/[EOS]/[SEP_EVENT] tokens, this is benign — the model learns to handle delimiters naturally.
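The packing step itself is small enough to sketch in full. This mirrors the `run_clm.py` pattern described above, but is a simplified stand-in for the library's `pack_sequences()`.

```python
def pack_sequences(token_lists, block_size):
    """Concatenate tokenized sequences into one stream and split it into
    fixed-length blocks, dropping the trailing remainder."""
    stream = [tok for seq in token_lists for tok in seq]
    n_blocks = len(stream) // block_size
    if n_blocks == 0:
        raise ValueError("not enough tokens for a single block")
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

blocks = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=4)
# blocks == [[1, 2, 3, 4], [5, 6, 7, 8]]; the trailing [9] is dropped
```

Note that block boundaries fall wherever they fall, so a block can start or end mid-sequence; the event delimiter tokens are what keep this benign.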

  2. DataCollatorForLanguageModeling(mlm=False) handles label creation. The HF Trainer does NOT auto-inject labels. The data collator does: it clones input_ids, sets labels = input_ids, and masks any padding positions (token_id == pad_token_id) with -100 so they don't contribute to loss. Our packed sequences have no padding, so labels == input_ids exactly — every token is a training target.

  3. processing_class parameter (not tokenizer). HuggingFace Trainer v5.7 renamed the tokenizer argument of Trainer.__init__() to processing_class; the old name now raises TypeError. The break surfaces only at runtime, not at import or lint time, and was caught and fixed during testing.

  4. Cosine learning rate schedule with warmup. Following Nubank and standard GPT pre-training practice. The cosine schedule decays smoothly from peak LR to near-zero, avoiding the abrupt drops of step schedules. Warmup prevents early training instability when loss gradients are large and noisy.

  5. disable_tqdm=True and logging_strategy="steps". For cloud/headless execution, tqdm progress bars are useless (they produce thousands of \r characters in log files). Plain text step-by-step logging (loss=X.XXX, grad_norm=Y.YYY, lr=Z.ZZZ) is greppable and parseable by monitoring tools.

  6. Dataset yields only {"input_ids": [...]}. The collator adds labels and attention_mask. The Trainer's remove_unused_columns=True (default) auto-drops any extra columns not in the model's forward() signature. This means you can safely store metadata (user IDs, sequence lengths) in the dataset — they're dropped before batching.

Smoke Test Results

24-step training on CPU with a tiny model (64-dim, 2 layers) confirmed the full pipeline:

```
Step  1: loss=5.419  grad_norm=7.227  lr=1.000e-03
Step 12: loss=4.510  grad_norm=3.668  lr=5.653e-04
Step 24: loss=4.322  grad_norm=3.636  lr=4.278e-06
```

Loss decreased monotonically from 5.42 to 4.32 with cosine decay — the tokenizer→packing→collator→model→loss→optimizer pipeline is end-to-end functional.

Test Results

19 tests passing covering: tokenization of user sequences (variable lengths, BOS/EOS presence), packing (fixed blocks, concatenation, remainder dropping, error on insufficient data), full dataset preparation, DataCollator behavior (label creation, shapes, all-ones attention mask for packed data), integration forward pass with backward, Trainer smoke test (24 steps), and validation that missing pad_token raises correctly.


Phase 2D: Fine-tuning Pipeline (Weeks 7–9)

What Was Built

A supervised fine-tuning pipeline for the JointFusionModel — the nuFormer-style architecture that combines a pre-trained transaction Transformer with DCNv2(PLR) tabular features for downstream prediction tasks.

| Component | Purpose |
|---|---|
| `DomainFinetuneDataset` | Per-user torch `Dataset` yielding `{input_ids, attention_mask, tabular_features, labels}` |
| `prepare_finetune_dataset()` | Convenience constructor with validation and logging |
| `finetune_domain_model()` | Fine-tunes `JointFusionModel` via HF Trainer — zero subclassing needed |

Key Technical Decisions

  1. HF Trainer Pattern A — zero custom code required. The critical discovery: HuggingFace Trainer inspects JointFusionModel.forward(self, input_ids, attention_mask, tabular_features, labels) via inspect.signature(). Because tabular_features is a named parameter in the forward signature, the Trainer auto-keeps it from the dataset and passes it to the model. No compute_loss override, no remove_unused_columns=False, no Trainer subclass. This was verified empirically on transformers 5.7.0 — the Trainer's _set_signature_columns_if_needed() method builds the allowed column list directly from the model's forward() parameters, and this works identically for plain nn.Module and PreTrainedModel.
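The signature-based column filtering can be demonstrated directly. This is a simplified re-creation of the mechanism (the real Trainer also special-cases label column aliases); the toy model and the `user_id` column are illustrative.

```python
import inspect

class ToyJointFusionModel:
    """Stand-in with the same forward signature shape as JointFusionModel."""
    def forward(self, input_ids, attention_mask, tabular_features, labels=None):
        ...

# What the Trainer effectively does: read forward()'s parameter names
# and keep only matching dataset columns.
signature_columns = list(
    inspect.signature(ToyJointFusionModel.forward).parameters
)[1:]  # drop `self`

dataset_columns = ["input_ids", "attention_mask", "tabular_features",
                   "labels", "user_id"]
kept = [c for c in dataset_columns if c in signature_columns]
# kept == ["input_ids", "attention_mask", "tabular_features", "labels"]
```

Because `tabular_features` appears in the signature, it survives column pruning and reaches the model, while metadata like `user_id` is silently dropped.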

  2. Per-user padding, not packing. Unlike pre-training (which packs sequences for 100% token utilization), fine-tuning uses per-user padded sequences. The reason: each training sample needs its own label. In pre-training, the "label" is the next token — shared across the packed block. In fine-tuning, the label is a user-level outcome (e.g., "will this user activate a product?") — each user is a separate sample with its own label. Padding tokens are masked in the attention via attention_mask, so they don't affect the user embedding extracted by get_user_embedding().
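The per-user layout can be sketched as a tiny padding helper, assuming right-padding and head truncation; the function name and defaults are illustrative.

```python
def pad_to_max_length(token_ids, max_length, pad_token_id=0):
    """Right-pad (or truncate) one user's token sequence and build the
    matching attention mask: 1 for real tokens, 0 for padding."""
    ids = token_ids[:max_length]
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_token_id] * n_pad
    attention_mask = [1] * len(ids) + [0] * n_pad
    return input_ids, attention_mask

ids, mask = pad_to_max_length([5, 6, 7], max_length=5)
# ids == [5, 6, 7, 0, 0], mask == [1, 1, 1, 0, 0]
```

The zeros in the mask are what keep padding out of both the attention computation and the pooled user embedding.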

  3. Dataset returns tensors directly, no custom collator. The DomainFinetuneDataset.__getitem__() returns pre-tokenized, pre-padded torch tensors. The default PyTorch DataLoader collation (stack tensors into batches) is sufficient. No DataCollatorForLanguageModeling needed — that's pre-training only. This simplifies the pipeline and avoids double-padding issues.

  4. save_strategy is configurable (not hardcoded). During testing, we discovered that saving JointFusionModel checkpoints via safetensors fails because the wrapped DomainTransformerForCausalLM has tied weights (lm_head ↔ embed_tokens), and safetensors rejects shared tensor storage by default. The fix: save_strategy is exposed as a parameter so users can set "no" during experimentation or use custom saving logic for production. This is a known HF issue with wrapper models containing tied-weight sub-models.

  5. Binary and multiclass via n_classes parameter. The same JointFusionModel and finetune_domain_model() handle both binary classification (n_classes=1, BCE loss) and multiclass (n_classes>1, CE loss). The loss function switches automatically based on n_classes. Labels are float for binary and long for multiclass — the dataset returns float32 by default, and the caller casts to long for multiclass.
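The loss switch can be sketched as a small helper; this is an illustrative stand-in, not the library's exact code.

```python
import torch
import torch.nn.functional as F

def classification_loss(logits, labels, n_classes):
    """BCE-with-logits for binary (n_classes=1, float labels),
    cross-entropy for multiclass (integer labels)."""
    if n_classes == 1:
        return F.binary_cross_entropy_with_logits(
            logits.squeeze(-1), labels.float()
        )
    return F.cross_entropy(logits, labels.long())

# With all-zero logits: BCE gives ln(2) ≈ 0.6931, CE over 3 classes ln(3) ≈ 1.0986.
binary = classification_loss(torch.zeros(4, 1), torch.tensor([0., 1., 1., 0.]), 1)
multi = classification_loss(torch.zeros(4, 3), torch.tensor([0, 1, 2, 1]), 3)
```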

Smoke Test Results

5-step fine-tuning on CPU with a tiny model confirmed the full pipeline:

```
Step 1: loss=0.750  grad_norm=7.158  lr=1.000e-03
Step 3: loss=0.996  grad_norm=3.771  lr=6.545e-04
Step 5: loss=0.818  grad_norm=2.681  lr=9.549e-05
Train loss: 0.752 (5 steps, 20 samples, batch=4)
```

Both the Transformer branch and PLR+DCNv2 tabular branch received gradients — end-to-end joint training is functional.

Test Results

15 tests passing covering: dataset creation (length, keys, shapes, padding correctness, attention mask alignment, dtypes, length mismatch error, stats), DataLoader batching, forward pass on real dataset batches, backward gradient flow through both branches, multiclass classification, HF Trainer smoke test (5 steps), and the prepare_finetune_dataset convenience function.


Cumulative Test Summary

| Phase | Tests | Coverage |
|---|---|---|
| 2A: Tokenizers | 72 | Schema validation, 5 field tokenizers (edge cases: NaN, None, overflow, UNK), predefined schemas, builder pipeline, end-to-end encoding |
| 2B: Models | 33 | Config presets, forward pass shapes, loss computation, weight tying, user embeddings, param counts, gradient checkpointing, causal masking, PLR, DCNv2, JointFusion, tokenizer→model integration |
| 2C: Pre-training | 19 | Tokenization, packing, collation, DataCollator behavior, forward+backward integration, 24-step Trainer smoke test, error handling |
| 2D: Fine-tuning | 15 | Dataset creation/validation, batching, forward/backward through JointFusion, 5-step Trainer smoke test, multiclass, convenience function |
| **Total** | **139** | All passing |

Library API Summary (v0.4.0)

```python
from domain_tokenizer import (
    # Schemas
    DomainSchema, FieldSpec, FieldType,
    # Tokenizers
    DomainTokenizerBuilder,
    # Models
    DomainTransformerConfig, DomainTransformerForCausalLM,
    PeriodicLinearReLU, JointFusionModel, DCNv2,
    # Pre-training
    prepare_clm_dataset, pretrain_domain_model,
    # Fine-tuning
    DomainFinetuneDataset, prepare_finetune_dataset, finetune_domain_model,
)
from domain_tokenizer.schemas import FINANCE_SCHEMA, ECOMMERCE_SCHEMA, HEALTHCARE_SCHEMA
```

End-to-End Usage: Pre-training → Fine-tuning

```python
# 1. Build tokenizer from schema
builder = DomainTokenizerBuilder(FINANCE_SCHEMA)
builder.fit(all_events)
hf_tokenizer = builder.build(text_corpus=descriptions, bpe_vocab_size=8000)

# 2. Prepare packed training data
dataset = prepare_clm_dataset(user_sequences, builder, hf_tokenizer, block_size=512)

# 3. Create and pre-train model
config = DomainTransformerConfig.from_preset("24m", vocab_size=hf_tokenizer.vocab_size)
model = DomainTransformerForCausalLM(config)
pretrain_domain_model(
    model, hf_tokenizer, dataset,
    hub_model_id="org/finance-24m",
    num_epochs=10, learning_rate=3e-4, bf16=True,
)

# 4. Create joint fusion model for fine-tuning
fusion = JointFusionModel(
    transformer_model=model,        # pre-trained, unfrozen
    n_tabular_features=291,         # hand-crafted tabular features
    n_classes=1,                    # binary: will user activate product?
)

# 5. Prepare fine-tuning data
ft_dataset = prepare_finetune_dataset(
    user_sequences, tabular_features, labels,
    builder, hf_tokenizer, max_length=512,
)

# 6. Fine-tune
finetune_domain_model(
    fusion, ft_dataset,
    num_epochs=5, learning_rate=1e-4, bf16=True,
)
```