# Reverse-Engineering Nubank's nuFormer: A Transaction Foundation Model

> **How Nubank built a domain tokenizer for 100M+ customers and O(100 billion) transactions — and how to replicate this for finance, e-commerce, and other domains.**
>
> *Analysis based on: arXiv:2507.23267 ("Your Spending Needs Attention"), the Building Nubank blog series, and all referenced academic papers.*

---

## Table of Contents

1. [Why This Matters for domainTokenizer](#1-why-this-matters-for-domaintokenizer)
2. [The Nubank Blog Series: Complete Inventory](#2-the-nubank-blog-series-complete-inventory)
3. [The nuFormer Architecture: Full Reconstruction](#3-the-nuformer-architecture-full-reconstruction)
   - 3.1 [Step 1: The Domain Tokenizer — Transactions → Tokens](#31-step-1-the-domain-tokenizer--transactions--tokens)
   - 3.2 [Step 2: The Transaction Transformer — Pre-training](#32-step-2-the-transaction-transformer--pre-training)
   - 3.3 [Step 3: Joint Fusion — Combining Sequences + Tabular Features](#33-step-3-joint-fusion--combining-sequences--tabular-features)
4. [The Four Academic Pillars](#4-the-four-academic-pillars)
   - 4.1 [RecFormer: Items as Sentences, Not IDs](#41-recformer-items-as-sentences-not-ids)
   - 4.2 [PLR Embeddings: Making Numbers First-Class Citizens](#42-plr-embeddings-making-numbers-first-class-citizens)
   - 4.3 [DCN V2: Explicit Feature Crossing](#43-dcn-v2-explicit-feature-crossing)
   - 4.4 [NoPE: No Positional Encoding Needed](#44-nope-no-positional-encoding-needed)
5. [Results & Scaling Laws](#5-results--scaling-laws)
6. [Connection to domainTokenizer Research](#6-connection-to-domaintokenizer-research)
7. [The Playbook: How to Walk Nubank's Path](#7-the-playbook-how-to-walk-nubanks-path)
8. [Complete Reference List](#8-complete-reference-list)

---

## 1. Why This Matters for domainTokenizer

Nubank didn't just build a model — they built **exactly what domainTokenizer envisions**: a domain-specific tokenizer that converts financial transactions into tokens, trains a small Transformer on those tokens, and uses it as a foundation model for downstream business tasks.

**The connection is direct:**

| domainTokenizer Concept | Nubank's Implementation |
|------------------------|------------------------|
| Domain tokens (not words) | Special tokens for amount, date, sign + BPE for descriptions |
| Small models that understand domain data | 24M and 330M parameter Transformers |
| Pre-training on domain sequences | Next-token prediction on transaction sequences |
| Fine-tuning for business tasks | Product recommendation (binary: will user activate?) |
| Beating traditional ML baselines | +1.25% relative AUC over LightGBM = 3× launch threshold |

Nubank **validated** the domainTokenizer thesis at production scale (100M+ users, 100B+ transactions) and published both the recipe and results. This is our blueprint.

---

## 2. The Nubank Blog Series: Complete Inventory

Nubank published a comprehensive series on its Building Nubank engineering blog documenting the foundation-model journey:

| # | Title | Focus | URL |
|---|-------|-------|-----|
| 1 | **Unlocking financial insights: How Nubank powers personalized experiences with foundation models** | Overview & motivation | [building.nubank.com/unlocking-financial-insights...](https://building.nubank.com/unlocking-financial-insights-how-nubank-powers-personalized-experiences-with-foundation-models/) |
| 2 | **Defining an interface between transaction data and foundation models** | The tokenizer design | [Braithwaite & Udagawa, 2025a] |
| 3 | **Fine-tuning transaction user models** | nuFormer fine-tuning recipe | [Braithwaite, Cavalcanti & Udagawa, 2025b] |
| 4 | **Understanding our customers' finances through foundation models** | Application layer & results | [Braithwaite & Udagawa, 2025c] |
| 5 | **Optimizing user narratives for foundation models** | Context window optimization | [Foust, 2025] |
| 6 | **Building foundation models into Nubank's AI platform** | MLOps & infrastructure | [Udagawa, 2025] |

**The arXiv paper** consolidating all technical details:
- **"Your spending needs attention: Modeling financial habits with transformers"** — [arXiv:2507.23267](https://arxiv.org/abs/2507.23267) (Braithwaite et al., July 2025)

---

## 3. The nuFormer Architecture: Full Reconstruction

### 3.1 Step 1: The Domain Tokenizer — Transactions → Tokens

This is the **core innovation** and the part most relevant to domainTokenizer. Nubank's tokenizer converts raw financial transactions into discrete token sequences.

#### Raw Transaction Data

Each transaction has three raw fields:
```
{
  "amount": 79.99,                      // float (positive or negative)
  "date": "2025-03-15T14:23:00",        // timestamp
  "description": "AMAZON MARKETPLACE"   // free text
}
```

#### The Tokenization Decision

Nubank explicitly considered two extremes, rejected both, and settled on a hybrid:

1. ❌ **Pure text serialization** (JSON stringification → BPE): Too many tokens per transaction. A JSON string like `{"amount": 79.99, "date": "2025-03-15", "desc": "AMAZON MARKETPLACE"}` would consume ~30-50 BPE tokens per transaction, leaving room for only ~40-60 transactions in a 2048-token context window.

2. ❌ **Pure numerical encoding** (all fields as embeddings, no text): Loses the rich information in transaction descriptions (merchant names, payment categories, etc.).

3. ✅ **Hybrid: Special tokens for structured fields + BPE for text**: Best of both worlds.

#### The Special Token Vocabulary

Each structured field gets its own small, fixed vocabulary of **special tokens**:

| Field | Tokenizer Function | Vocabulary Size | Example |
|-------|-------------------|-----------------|---------|
| **Amount Sign** | `ϕ_sign : ℝ → V_sign` | **2 tokens** | `[CREDIT]` or `[DEBIT]` |
| **Amount Bucket** | `ϕ_amt : ℝ → V_amt` (quantized bins) | **21 tokens** | `[AMT_BIN_14]` (e.g., $50-$100 range) |
| **Month** | `ϕ_month : date → V_month` | **12 tokens** | `[MARCH]` |
| **Day of Week** | `ϕ_dow : date → V_dow` | **7 tokens** | `[WEDNESDAY]` |
| **Day of Month** | `ϕ_dom : date → V_dom` | **31 tokens** | `[DAY_15]` |
| **Hour** | `ϕ_hour : date → V_hour` | **24 tokens** | `[HOUR_14]` |

**Total special tokens:** 2 + 21 + 12 + 7 + 31 + 24 = **97 special tokens**

The text description field uses standard **BPE tokenization**, producing a variable number of subword tokens.
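
As a concrete illustration, here is a minimal sketch of the per-field tokenizer functions (`ϕ_sign`, `ϕ_amt`, and the calendar tokenizers). The quantile-based bin edges, the bracketed token names, and the sign convention are assumptions for illustration; the sources state only that amounts are quantized into 21 bins:

```python
import calendar
import numpy as np

class SpecialTokenizer:
    """Hypothetical implementation of the special-token functions."""

    def __init__(self, train_amounts, n_bins=21):
        # 20 interior quantile edges over absolute amounts -> 21 roughly
        # equal-mass buckets (the exact binning scheme is not public).
        qs = np.linspace(0, 1, n_bins + 1)[1:-1]
        self.edges = np.quantile(np.abs(train_amounts), qs)

    def sign_token(self, amount):    # ϕ_sign : ℝ → V_sign (2 tokens)
        return "[CREDIT]" if amount > 0 else "[DEBIT]"

    def amount_token(self, amount):  # ϕ_amt : ℝ → V_amt (21 tokens)
        return f"[AMT_BIN_{np.searchsorted(self.edges, abs(amount))}]"

    def calendar_tokens(self, ts):   # ϕ_month, ϕ_dow, ϕ_dom, ϕ_hour
        return [f"[{calendar.month_name[ts.month].upper()}]",
                f"[{calendar.day_name[ts.weekday()].upper()}]",
                f"[DAY_{ts.day}]", f"[HOUR_{ts.hour}]"]
```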

#### Combined Vocabulary

```
V = V_special (97 tokens) ∪ V_BPE (standard BPE vocabulary)
```

#### Token Sequence Layout Per Transaction

```
Transaction t_i = [
    AMT_SIGN_TOKEN,    # 1 token: CREDIT or DEBIT
    AMT_BUCKET_TOKEN,  # 1 token: one of 21 quantized bins
    MONTH_TOKEN,       # 1 token: Jan–Dec
    DOW_TOKEN,         # 1 token: Mon–Sun
    DOM_TOKEN,         # 1 token: 1–31
    HOUR_TOKEN,        # 1 token: 0–23
    desc_tok_1,        # variable: BPE tokens for "AMAZON"
    desc_tok_2,        # "MARKET"
    desc_tok_3,        # "PLACE"
    ...
]
```

**Average: ~14 tokens per transaction.**

This means a **2048-token context window holds approximately 146 transactions** — enough to capture several months of financial behavior for a typical consumer.

#### User Sequence Construction

For each user, transactions are ordered chronologically:
```
user_sequence = [t_1, t_2, t_3, ..., t_N]
```
where N varies per user (sequences are truncated to fit the context window, keeping the most recent transactions).
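
The truncation policy can be sketched as a simple token-budget loop. The mechanics below are an assumption; the sources say only that sequences keep the most recent transactions that fit the context window:

```python
def build_user_sequence(transactions, tokenize_txn, max_tokens=2048):
    """Pack the most recent transactions into a fixed token budget."""
    kept, budget = [], max_tokens
    for txn in reversed(transactions):   # walk newest-first
        tokens = tokenize_txn(txn)
        if len(tokens) > budget:         # next transaction no longer fits
            break
        kept.append(tokens)
        budget -= len(tokens)
    kept.reverse()                       # restore chronological order
    return [tok for txn_tokens in kept for tok in txn_tokens]
```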

#### Why This Design Wins

| Metric | Pure Text | Pure Embedding | Nubank Hybrid |
|--------|-----------|----------------|---------------|
| Tokens per transaction | ~35-50 | 1 (but fixed-dim) | **~14** |
| Transactions in 2048 context | ~40-60 | 2048 | **~146** |
| Captures description text | ✅ | ❌ | ✅ |
| Captures numerical structure | ❌ (fragmented) | ✅ | ✅ |
| Captures temporal patterns | ❌ | Partial | ✅ |
| Works with standard Transformer | ✅ | Needs custom arch | ✅ |

### 3.2 Step 2: The Transaction Transformer — Pre-training

#### Architecture Choice: GPT-style Causal Decoder

Nubank chose a **decoder-only, GPT-style causal Transformer**, not BERT-style bidirectional. Reasons:

1. **Industry precedent:** State-of-the-art sequential recommendation systems (Pinterest PinnerFormer, Meta NxtPost) use causal architectures
2. **No autoregressive generation needed:** At inference, the model produces a single user embedding from the full sequence — no token-by-token generation required
3. **Better for long-range dependencies:** Causal attention naturally models temporal ordering

#### No Positional Encoding (NoPE)

Based on Kazemnejad et al. (2023), nuFormer uses **no explicit positional encoding**. The finding: NoPE outperforms RoPE, ALiBi, and learned absolute position embeddings on length generalization. Since users have varying transaction history lengths, length generalization is critical.

#### Model Sizes

| Variant | Parameters | Hidden Dim | Layers | Heads | Context |
|---------|-----------|------------|--------|-------|---------|
| **nuFormer-Small** | **24M** | 256 | 24 | 16 | 2048 |
| **nuFormer-Large** | **330M** | 1024 | 24 | 16 | 2048 |

Both share the same depth (24 layers, 16 heads) — they differ only in embedding dimension.

#### Pre-training Objective

**Causal Language Modeling (CLM):** Standard next-token prediction on the flattened transaction token sequences.

Given a user's transaction sequence tokenized as `[w_1, w_2, ..., w_T]`, the loss is:

```
L = -Σ_{t=1}^{T} log P(w_t | w_1, ..., w_{t-1})
```

This is the same objective as GPT — but instead of predicting the next word in a sentence, the model predicts the next token in a transaction sequence. This could be the next amount bucket, the next merchant name token, or the next month token.
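
In PyTorch terms, the objective is ordinary next-token cross-entropy with the targets shifted by one position (a generic CLM loss, not Nubank-specific code):

```python
import torch.nn.functional as F

def clm_loss(logits, token_ids):
    # logits: [batch, seq_len, vocab_size] from the causal transformer
    # token_ids: [batch, seq_len]; targets are the same ids shifted by one
    preds = logits[:, :-1].reshape(-1, logits.size(-1))  # predictions at 0..T-2
    targets = token_ids[:, 1:].reshape(-1)               # true next tokens at 1..T-1
    return F.cross_entropy(preds, targets)
```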

#### Pre-training Data

- **20M user rows** for baseline experiments
- Up to **203M labeled rows** for fine-tuning experiments
- Data spans credit card, debit card, open finance, wires, transfers, and bill items
- **O(100 billion) total transactions** across Nubank's 100M+ member base

### 3.3 Step 3: Joint Fusion — Combining Sequences + Tabular Features

Nubank explored three fusion strategies for combining the transaction transformer with traditional tabular features:

#### Strategy A: Early Fusion (Extract → Downstream)
```
Transaction Sequence → Pre-trained Transformer → User Embedding (frozen)
                                                        ↓
                                    Feed into LightGBM with other features
```
Fastest to iterate, but loses end-to-end gradients.

#### Strategy B: Late Fusion (Concatenate → Joint Head)
```
Transaction Sequence → Transformer → User Embedding ─┐
                                                      ├─→ MLP Head → Prediction
Tabular Features (291) → Simple Embedding ────────────┘
```
Better than early fusion, but the tabular branch is underparameterized.

#### Strategy C: Joint Fusion = nuFormer (Best)
```
Transaction Sequence → Transformer → User Embedding ─────────────────┐
                                                                      ├─→ Shared MLP → Prediction
Tabular Features (291) → PLR Embeddings → DCNv2 → Feature Embedding ─┘
```

**This is the production architecture.** The key insight: the tabular branch needs its own powerful backbone (DCNv2) to match the expressiveness of the transformer branch. Joint end-to-end training allows both branches to co-adapt.

#### The Tabular Branch: DCNv2 + PLR

**291 hand-crafted features** (numerical + categorical), processed as follows:

1. **Numerical features:** Transformed via PLR (Periodic → Linear → ReLU) embeddings:
   ```
   PLR(x) = ReLU(Linear([sin(2πw₁x + b₁), cos(2πw₁x + b₁), ..., sin(2πwₙx + bₙ), cos(2πwₙx + bₙ)]))
   ```
   where the frequencies `w` and phases `b` are **learned parameters**. This maps scalars to high-dimensional dense vectors that capture both magnitude and periodicity.

2. **Categorical features:** Standard embedding lookup tables.

3. **Feature interaction:** DCN V2 (Deep Cross Network V2) models explicit feature interactions:
   ```
   x_{l+1} = x₀ ⊙ (W_l · x_l + b_l) + x_l
   ```
   Full-rank weight matrices enable capturing all pairwise and higher-order feature interactions (a runnable sketch appears in Section 4.3).

4. **Regularization:** L2 regularization on the DCNv2 cross-layer weights to prevent overfitting.

---

## 4. The Four Academic Pillars

Nubank's architecture stands on four papers. Understanding them is essential for replication.

### 4.1 RecFormer: Items as Sentences, Not IDs

**Paper:** "Text Is All You Need: Learning Language Representations for Sequential Recommendation"
**Authors:** Li et al. (UCSD + Amazon) | **KDD 2023** | [arXiv:2305.13731](https://arxiv.org/abs/2305.13731) | [GitHub 130⭐](https://github.com/aaronheee/recformer)

**Core idea:** Abolish item IDs entirely. Represent each item as a key-value attribute dictionary flattened into text:
```
Item: {Color: Black, Brand: Nike, Category: Shoes}
→ Tokens: ["Color", "Black", "Brand", "Nike", "Category", "Shoes"]
```

A user's interaction sequence becomes a sequence of these "item sentences."

**Four-embedding architecture:**
```
E_token = LayerNorm(A_token + B_position + C_type + D_item_position)
```
- A = token embedding (shared vocabulary)
- B = token position in full sequence
- C = token type (key vs. value vs. special)
- D = item position (which item in the user sequence)
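
A minimal sketch of this four-way sum as a module (dimensions are illustrative, not RecFormer's exact configuration):

```python
import torch.nn as nn

class RecFormerEmbedding(nn.Module):
    def __init__(self, vocab, max_pos, n_types, max_items, d=256):
        super().__init__()
        self.token = nn.Embedding(vocab, d)         # A: shared token vocabulary
        self.pos = nn.Embedding(max_pos, d)         # B: position in the full sequence
        self.type = nn.Embedding(n_types, d)        # C: key / value / special
        self.item_pos = nn.Embedding(max_items, d)  # D: which item in the user sequence
        self.norm = nn.LayerNorm(d)

    def forward(self, tok, pos, typ, item):
        return self.norm(self.token(tok) + self.pos(pos)
                         + self.type(typ) + self.item_pos(item))
```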

**What Nubank took:** The key-value flattening philosophy, but modified it with special tokens for structured fields (amount, date) to reduce tokens per transaction from ~35 to ~14.

### 4.2 PLR Embeddings: Making Numbers First-Class Citizens

**Paper:** "On Embeddings for Numerical Features in Tabular Deep Learning"
**Authors:** Gorishniy et al. (Yandex) | **NeurIPS 2022** | [arXiv:2203.05556](https://arxiv.org/abs/2203.05556) | [GitHub](https://github.com/yandex-research/tabular-dl-num-embeddings)

**Core idea:** Raw scalar features fed into MLPs/Transformers are poorly optimized. **Lifting scalars into high-dimensional periodic embeddings** dramatically improves performance.

**PLR (Periodic → Linear → ReLU):**
```python
import math
import torch
import torch.nn.functional as F

def plr_embedding(x, frequencies, phases, linear):
    # x: scalar feature value (0-dim tensor)
    # frequencies, phases: LEARNED parameters, shape [n_frequencies]
    # linear: a learned torch.nn.Linear(2 * n_frequencies, d_embed)
    periodic = torch.cat([
        torch.sin(2 * math.pi * frequencies * x + phases),
        torch.cos(2 * math.pi * frequencies * x + phases),
    ])
    return F.relu(linear(periodic))
```

**Key result:** With PLR embeddings, a plain MLP can match attention-based Transformers on tabular benchmarks. PLR is what lets DCNv2 beat LightGBM.

**What Nubank took:** PLR embeddings for all 291 numerical tabular features in the joint fusion branch. This was the critical ingredient:

| Model | Relative AUC vs. LightGBM |
|-------|--------------------------|
| DCNv2 (without PLR) | -0.09% |
| DCNv2 + PLR | **+0.06%** ← first to beat GBDT |
| DCNv2 + PLR + L2 | +0.08% |
| **nuFormer (full)** | **+0.31% to +0.52%** |

### 4.3 DCN V2: Explicit Feature Crossing

**Paper:** "DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-Scale Learning to Rank Systems"
**Authors:** Wang et al. (Google) | **WebConf 2021** | [arXiv:2008.13535](https://arxiv.org/abs/2008.13535) | **Production at Google**

**Core idea:** Explicitly model feature interactions (crosses) via specialized cross layers with full-rank weight matrices:
```
x_{l+1} = x₀ ⊙ (W_l · x_l + b_l) + x_l    # element-wise product with input anchor
```

This captures feature interactions of degree L+1 for an L-layer cross network. DCNv2 improves on DCN (2017) by using full-rank matrices instead of rank-1.
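
A single cross layer is a few lines of PyTorch (a generic DCNv2 sketch, not Nubank's code):

```python
import torch.nn as nn

class CrossLayerV2(nn.Module):
    """x_{l+1} = x0 * (W_l @ x_l + b_l) + x_l, with a full-rank W_l."""

    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)  # full-rank weight plus bias

    def forward(self, x0, xl):
        # Element-wise product with the input anchor x0, plus residual
        return x0 * self.linear(xl) + xl

# Stacking: x = x0; for layer in cross_layers: x = layer(x0, x)
# L stacked layers capture interactions up to degree L + 1.
```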

**What Nubank took:** DCNv2 as the backbone for the tabular feature branch (291 features). Combined with PLR embeddings, it forms the "tabular half" of the joint fusion nuFormer architecture.

### 4.4 NoPE: No Positional Encoding Needed

**Paper:** "The Impact of Positional Encoding on Length Generalization in Transformers"
**Authors:** Kazemnejad et al. (McGill/Mila) | **NeurIPS 2023** | [arXiv:2305.19466](https://arxiv.org/abs/2305.19466) | [HF Paper](https://huggingface.co/papers/2305.19466)

**Core finding:** Decoder-only Transformers with **no positional encoding** (NoPE) outperform those with RoPE, ALiBi, and absolute position embeddings on length generalization tasks.

**Why it works (theoretically):**
- **Theorem 1:** The first layer of a NoPE causal Transformer can recover absolute positions from causal attention patterns alone
- **Theorem 2:** Subsequent layers can implement relative PE via learned query-key interactions
- **Empirically:** NoPE's learned attention patterns converge to T5's relative PE — it gets relative PE "for free"

**What Nubank took:** No positional encoding in the transaction Transformer. Since users have vastly different transaction history lengths (some have 20 transactions, some have 2000+), length generalization is critical for production deployment.
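
Architecturally, NoPE just means omitting the positional table entirely: token embeddings feed straight into causally masked self-attention. A minimal sketch in generic PyTorch (not Nubank's implementation):

```python
import torch.nn as nn

class NoPEDecoder(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_heads=16, n_layers=24):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # tokens only, no position table
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)

    def forward(self, token_ids):  # token_ids: [batch, seq_len]
        # Ordering information reaches the model only through the causal mask.
        causal = nn.Transformer.generate_square_subsequent_mask(
            token_ids.size(1)).to(token_ids.device)
        return self.blocks(self.embed(token_ids), mask=causal)
```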

---

## 5. Results & Scaling Laws

### Production Results

| Model | Relative AUC vs. LightGBM |
|-------|--------------------------|
| MLP (raw features) | -0.44% |
| MLP + PLR | -0.23% |
| DCNv2 | -0.09% |
| LightGBM (baseline) | 0.00% |
| DCNv2 + PLR | +0.06% |
| DCNv2 + PLR + L2 | +0.08% |
| **nuFormer-Small (24M, Joint Fusion)** | **+0.31%** |
| **nuFormer-Large (330M, Joint Fusion)** | **+0.52%** |

**Final production deployment: +1.25% relative AUC improvement** — cited as **3× the typical model launch threshold** at Nubank. This is a massive result for a production recommendation system.

### Scaling Laws

Nubank observed clear scaling laws across three dimensions:

**Model size scaling:**

| Model | Parameters | AUC Improvement |
|-------|-----------|-----------------|
| nuFormer-Small | 24M | +0.31% |
| nuFormer-Large | 330M | +0.52% |

**Context length scaling:**

| Context | Transactions Covered | Effect |
|---------|---------------------|--------|
| 512 tokens | ~36 transactions | Baseline |
| 1024 tokens | ~73 transactions | Better |
| 2048 tokens | ~146 transactions | **Best** (monotonic improvement) |

Larger models benefit more from longer context — the 330M model extracts more value from additional transaction history than the 24M model.

**Fine-tuning data scaling:**

| Training Rows | Effect |
|--------------|--------|
| 5M | Baseline |
| 20M | Better |
| 40M | Better still |
| 100M | Best |

Again, larger models show steeper improvement with more data.

### Data Source Ablation (Critical Insight)

Nubank tested three anonymized data sources (A, B, C — likely credit card, debit, open finance):

| Sources | AUC vs. ABC Baseline |
|---------|---------------------|
| A alone | +0.72 |
| B alone | -8.21 |
| C alone | -20.52 |
| **AB** | **+0.91 (best!)** |
| BC | -12.24 |
| AC | -0.27 |
| ABC (all) | 0.00 (baseline) |

**Key insight:** More data sources can **hurt** performance. Sources B and C carry less information per token — when they crowd out high-signal transactions (source A) in the fixed 2048-token context window, overall performance drops. **AB outperforms ABC**, meaning the debit/open-finance data was actually diluting the credit card signal.

**Implication for domainTokenizer:** The context window is a **resource allocation problem**. You must carefully choose which data to include, not just maximize volume.

---

## 6. Connection to domainTokenizer Research

### Direct Mapping to Our Framework

| Our Research Report Section | Nubank's Implementation |
|---------------------------|------------------------|
| §4.1 Semantic ID Tokenization | Not used — Nubank uses special tokens instead of RQ-VAE |
| §4.2 Action Sequence Tokenization (ActionPiece) | Partially analogous — the BPE-on-descriptions is similar, but no cross-field merging |
| §4.3 Financial Transaction Tokenization | **Exact match** — special tokens for amount/date + BPE for text |
| §4.4 Tabular Feature Tokenization (PLR) | **Exact match** — PLR embeddings for the 291 tabular features |
| §6.1 Quantization-Based (RQ-VAE) | Not used |
| §6.2 BPE-Inspired Merging | Only for text descriptions, not for structured fields |
| §6.3 Magnitude & Binning | **Exact match** — amount quantized to 21 bins |
| §6.5 Serialization-Based | Explicitly rejected as too token-hungry |

### What Nubank Validates

1. ✅ **Domain tokens work better than text tokens** — the special token vocabulary is the key innovation
2. ✅ **Small models (24M-330M) are sufficient** — you don't need 7B+ parameter LLMs
3. ✅ **Self-supervised pre-training transfers** — the pre-trained transaction Transformer improves downstream tasks
4. ✅ **Hybrid tokenization wins** — special tokens for structured data + BPE for text
5. ✅ **GPT-style causal modeling works for event sequences** — not just BERT-style masking

### What Nubank Didn't Do (Opportunities for domainTokenizer)

1. ❌ **No Semantic IDs (RQ-VAE):** Nubank tokenizes merchant descriptions via BPE but doesn't create learned codebook-based product/merchant IDs. This could be a significant improvement — merchants that always appear together could share semantic ID prefixes.

2. ❌ **No cross-field composite tokens (ActionPiece-style):** Each field is tokenized independently. A BPE-like merging of `{amount_bin + category + time_of_day}` into composite tokens could further compress the sequence and capture higher-order patterns.

3. ❌ **No continual learning (HOPE-style):** nuFormer is frozen after pre-training. The Nested Learning / HOPE paradigm could enable continuous adaptation to new spending patterns, new merchants, and seasonal shifts.

4. ❌ **No multi-resolution memory (CMS):** All tokens are treated equally in the attention window. A Continuum Memory System with different update frequencies could better handle the difference between recent transactions (high signal) and historical patterns (persistent knowledge).

### Nubank's Recipe = Our Blueprint for Phase 2

Nubank's exact pipeline maps to domainTokenizer's planned implementation:

```
domainTokenizer Phase 2 Implementation Plan
(directly following Nubank's validated recipe)

1. Schema Analysis → Identify field types
   [Nubank: amount(float), date(timestamp), description(text)]

2. Per-Field Tokenizer Construction
   [Nubank: ϕ_sign(2), ϕ_amt(21), ϕ_month(12), ϕ_dow(7), ϕ_dom(31), ϕ_hour(24), BPE(text)]
   [Us: same pattern, extensible to any domain schema]

3. Pre-train GPT-style Causal Transformer (NoPE)
   [Nubank: 24M-330M params, 2048 context, CLM objective]
   [Us: configurable sizes, same objective]

4. Joint Fusion Fine-tuning
   [Nubank: Transformer embeddings + DCNv2(PLR) on tabular features]
   [Us: pluggable fusion with any tabular backbone]
```

---

## 7. The Playbook: How to Walk Nubank's Path

### For Finance (Replicating Nubank)

**Step 1: Define your transaction schema**
```python
schema = {
    "amount": {"type": "numerical", "tokenizer": "sign_bucket", "sign_vocab": 2, "bucket_vocab": 21},
    "timestamp": {"type": "temporal", "tokenizer": "calendar",
                  "fields": ["month(12)", "dow(7)", "dom(31)", "hour(24)"]},
    "description": {"type": "text", "tokenizer": "bpe"},
    # Extensions beyond Nubank:
    "merchant_category": {"type": "categorical", "tokenizer": "vocab", "vocab_size": 50},
    "channel": {"type": "categorical", "tokenizer": "vocab", "vocab_size": 10},
}
```

**Step 2: Build the tokenizer (97 special tokens + BPE)**
```python
class TransactionTokenizer:
    def __init__(self, schema):
        self.special_tokens = build_special_vocab(schema)            # ~97-150 tokens
        self.bpe_tokenizer = AutoTokenizer.from_pretrained("...")    # for text fields

    def tokenize_transaction(self, txn):
        tokens = []
        tokens.append(self.sign_token(txn.amount))                   # 1 token
        tokens.append(self.amount_bucket(txn.amount))                # 1 token
        tokens.extend(self.calendar_tokens(txn.timestamp))           # 4 tokens
        tokens.extend(self.bpe_tokenizer.tokenize(txn.description))  # ~8 tokens avg
        return tokens                                                # ~14 tokens total
```
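
Hypothetical usage, reusing the sample transaction from Section 3.1 (the token strings shown are illustrative; Nubank's actual token names and sign convention are not public):

```python
from datetime import datetime
from types import SimpleNamespace

txn = SimpleNamespace(amount=79.99,
                      timestamp=datetime(2025, 3, 15, 14, 23),
                      description="AMAZON MARKETPLACE")
print(tokenizer.tokenize_transaction(txn))
# e.g. ['[DEBIT]', '[AMT_BIN_14]', '[MARCH]', '[SATURDAY]', '[DAY_15]',
#       '[HOUR_14]', 'AMAZON', 'MARKET', 'PLACE']
```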

**Step 3: Pre-train (24M params, CLM)**
```python
model = GPTCausalLM(                       # any GPT-style causal LM implementation
    vocab_size=len(special_tokens) + bpe_vocab_size,
    d_model=256, n_layers=24, n_heads=16,
    max_seq_len=2048,
    positional_encoding=None,              # NoPE!
)
# Pre-train with next-token prediction on the flattened transaction sequences
train_clm(model, transaction_sequences, epochs=...)
```

**Step 4: Joint Fusion Fine-tuning**
```python
class NuFormer(nn.Module):
    def __init__(self, txn_transformer, tabular_features):
        super().__init__()                 # required for nn.Module subclasses
        self.txn_branch = txn_transformer  # pre-trained, unfrozen
        self.tab_branch = DCNv2(
            input_dim=len(tabular_features),
            num_embeddings=PLREmbed(n_frequencies=64),
            cross_layers=3, deep_layers=3,
        )
        self.head = MLP(txn_dim + tab_dim, hidden, 1)  # dims elided in this sketch

    def forward(self, txn_tokens, tabular_features):
        txn_embed = self.txn_branch(txn_tokens)[:, -1, :]  # last-token embedding
        tab_embed = self.tab_branch(tabular_features)
        combined = torch.cat([txn_embed, tab_embed], dim=-1)
        return self.head(combined)
```

### For E-Commerce (Adapting Nubank's Recipe)

**The adaptation is straightforward — replace transaction fields with e-commerce event fields:**

| Finance (Nubank) | E-Commerce (Adaptation) |
|------------------|----------------------|
| amount (float) | price (float) — same ϕ_amt tokenizer |
| amount sign (credit/debit) | event_type (view/cart/purchase/return) — expand to 4+ tokens |
| timestamp (month/dow/dom/hour) | timestamp — same calendar tokens |
| description (merchant text) | product_title (BPE) — same approach |
| — | category (hierarchical) — add special tokens |
| — | brand — add special tokens or BPE |
| — | quantity — small fixed vocab (1-10+) |

**E-commerce special token vocabulary:**
```python
e_commerce_special_tokens = {
    "event_type": 5,     # view, cart, purchase, return, wishlist
    "price_bucket": 21,  # same binning as Nubank
    "quantity": 11,      # 1-10, 10+
    "category_l1": 30,   # top-level categories
    "category_l2": 200,  # subcategories
    "month": 12,
    "dow": 7,
    "dom": 31,
    "hour": 24,
}
# Total: ~341 special tokens + BPE for product titles
# ~16 tokens per event → 2048 context ≈ 128 events
```

**Pre-training objectives (same as Nubank):**
- Causal LM: predict the next token in the event sequence
- Downstream: next-purchase prediction, churn, product recommendation, customer segmentation

### For Healthcare (Same Pattern)

```python
healthcare_special_tokens = {
    "event_type": 10,    # diagnosis, procedure, lab, medication, visit, ...
    "icd_category": 50,  # top-level ICD-10 groups
    "cpt_category": 40,  # procedure categories
    "cost_bucket": 21,   # same binning
    "provider_type": 15, # PCP, specialist, ER, ...
    "month": 12, "dow": 7, "dom": 31,
}
# Description fields: BPE on clinical notes / medication names
```

---

## 8. Complete Reference List

### Nubank Sources

| Ref | Authors | Title | Link |
|-----|---------|-------|------|
| **Primary** | Braithwaite et al. | Your spending needs attention: Modeling financial habits with transformers | [arXiv:2507.23267](https://arxiv.org/abs/2507.23267) |
| Blog 1 | — | Unlocking financial insights: How Nubank powers personalized experiences | [building.nubank.com](https://building.nubank.com/unlocking-financial-insights-how-nubank-powers-personalized-experiences-with-foundation-models/) |
| Blog 2 | Braithwaite & Udagawa | Defining an interface between transaction data and foundation models | Building Nubank, 2025a |
| Blog 3 | Braithwaite, Cavalcanti & Udagawa | Fine-tuning transaction user models | Building Nubank, 2025b |
| Blog 4 | Braithwaite & Udagawa | Understanding our customers' finances through foundation models | Building Nubank, 2025c |
| Blog 5 | Foust | Optimizing user narratives for foundation models | Building Nubank, 2025 |
| Blog 6 | Udagawa | Building foundation models into Nubank's AI platform | Building Nubank, 2025 |

### Academic References (Used by nuFormer)

| Paper | Authors | Year | arXiv | Role in nuFormer |
|-------|---------|------|-------|-----------------|
| **RecFormer** | Li et al. | 2023 | [2305.13731](https://arxiv.org/abs/2305.13731) | Tokenization philosophy: items as key-value text |
| **PLR Embeddings** | Gorishniy et al. | 2022 | [2203.05556](https://arxiv.org/abs/2203.05556) | Numerical feature → periodic embeddings |
| **DCN V2** | Wang et al. | 2021 | [2008.13535](https://arxiv.org/abs/2008.13535) | Tabular feature cross-interaction backbone |
| **NoPE** | Kazemnejad et al. | 2023 | [2305.19466](https://arxiv.org/abs/2305.19466) | No positional encoding for length generalization |
| **FlashAttention** | Dao et al. | 2022 | [2205.14135](https://arxiv.org/abs/2205.14135) | Efficient attention computation |
| **Banking TF** | Delestre & Sola | 2024 | [2410.08243](https://arxiv.org/abs/2410.08243) | Parallel work: French bank transaction tokenizer |

### Related Papers from domainTokenizer Research

| Paper | Year | arXiv | Connection |
|-------|------|-------|-----------|
| **TIGER** | 2023 | [2305.05065](https://arxiv.org/abs/2305.05065) | Alternative: RQ-VAE Semantic IDs (Nubank didn't use) |
| **ActionPiece** | 2025 | [2502.13581](https://arxiv.org/abs/2502.13581) | Alternative: BPE-like merging of action features (Nubank didn't use) |
| **Nested Learning (HOPE)** | 2025 | [2512.24695](https://arxiv.org/abs/2512.24695) | Future: continual learning for domain models |

---

*This analysis reconstructs Nubank's full pipeline from public sources. The actual production system may have additional proprietary components not disclosed in the blog series or arXiv paper.*