1.1: The Autoregressive Loop and the Redundancy Problem - LLM Inference


What Does "Autoregressive" Mean?

Large language models generate text one token at a time. The term "autoregressive" means the model uses its own previous outputs as inputs for generating the next output. This creates a sequential dependency chain: to generate token 5, you need token 4; to generate token 4, you need token 3; and so on.

Here's the generation loop in its simplest form:

input_sequence = prompt_tokens            # the user's prompt, already tokenized

while not done:                           # e.g. until an end-of-sequence token or a length limit
    next_token = model(input_sequence)    # run a full forward pass over the entire sequence
    input_sequence.append(next_token)     # feed the generated token back in as input

This seems straightforward, but there's a devastating inefficiency hiding in this loop. To see it, we need to trace through exactly what happens inside model(input_sequence) at each step.
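For something you can actually run, here is a hedged, concrete version of the same loop using the Hugging Face transformers library, with the cache explicitly disabled so every iteration really does rerun the full forward pass (this assumes torch, transformers, and the small gpt2 checkpoint are available, and uses greedy decoding for simplicity):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tok("The cat sat", return_tensors="pt").input_ids    # the prompt

with torch.no_grad():
    for _ in range(4):                                           # generate 4 tokens
        logits = model(input_ids, use_cache=False).logits        # full forward pass, every step
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy "sampling"
        input_ids = torch.cat([input_ids, next_token], dim=-1)   # append and repeat

print(tok.decode(input_ids[0]))   # the prompt plus 4 generated tokens (model-dependent)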


Tracing Through a Concrete Example

Let's walk through generating text with a specific example. Suppose the user's prompt is "The cat sat" (3 tokens), and we want to generate " on the mat." (4 tokens: "on", "the", "mat", ".").

Generation Step 1: Generating "on"

Input to model: ["The", "cat", "sat"] (3 tokens)

Inside each attention layer, here's what happens:

  1. Project to Q, K, V: Each token's embedding gets projected into Query, Key, and Value vectors.

    Token "The" (position 0) → Q₀, K₀, V₀
    Token "cat" (position 1) → Q₁, K₁, V₁
    Token "sat" (position 2) → Q₂, K₂, V₂
    
  2. Compute attention: Each position attends to itself and all previous positions (causal masking).

  3. Get output: The model produces logits, we sample, and get token "on".

Computation performed: We computed K and V for positions 0, 1, 2.
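In code, the projection step is just a few matrix multiplies. Here's a toy NumPy sketch with random weights and a tiny hidden size (an illustration of the shapes involved, not a trained model):

import numpy as np

rng = np.random.default_rng(0)
d = 8                                     # toy hidden size
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

X = rng.normal(size=(3, d))               # embeddings for ["The", "cat", "sat"]
Q, K, V = X @ W_Q, X @ W_K, X @ W_V       # row i holds Q_i / K_i / V_i for position i
print(Q.shape, K.shape, V.shape)          # (3, 8) each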


Generation Step 2: Generating "the"

Input to model: ["The", "cat", "sat", "on"] (4 tokens)

Inside each attention layer:

  1. Project to Q, K, V:

    Token "The" (position 0) → Q₀, K₀, V₀    ← RECOMPUTED
    Token "cat" (position 1) → Q₁, K₁, V₁    ← RECOMPUTED
    Token "sat" (position 2) → Q₂, K₂, V₂    ← RECOMPUTED
    Token "on"  (position 3) → Q₃, K₃, V₃    ← NEW
    
  2. Compute attention and get output: We sample token "the".

Computation performed: We computed K and V for positions 0, 1, 2, 3. But wait—K₀, V₀, K₁, V₁, K₂, V₂ are exactly the same values we computed in Step 1. We just threw away that work and did it again!


Generation Step 3: Generating "mat"

Input to model: ["The", "cat", "sat", "on", "the"] (5 tokens)

Token "The" (position 0) → Q₀, K₀, V₀    ← RECOMPUTED (3rd time!)
Token "cat" (position 1) → Q₁, K₁, V₁    ← RECOMPUTED (3rd time!)
Token "sat" (position 2) → Q₂, K₂, V₂    ← RECOMPUTED (3rd time!)
Token "on"  (position 3) → Q₃, K₃, V₃    ← RECOMPUTED (2nd time!)
Token "the" (position 4) → Q₄, K₄, V₄    ← NEW

The pattern is becoming painfully clear.


Generation Step 4: Generating "." (end)

Input to model: ["The", "cat", "sat", "on", "the", "mat"] (6 tokens)

Token "The" (position 0) → K₀, V₀    ← RECOMPUTED (4th time!)
Token "cat" (position 1) → K₁, V₁    ← RECOMPUTED (4th time!)
Token "sat" (position 2) → K₂, V₂    ← RECOMPUTED (4th time!)
Token "on"  (position 3) → K₃, V₃    ← RECOMPUTED (3rd time!)
Token "the" (position 4) → K₄, V₄    ← RECOMPUTED (2nd time!)
Token "mat" (position 5) → K₅, V₅    ← NEW

Visualizing the Redundancy

Here's a visual representation of K and V computations across all generation steps. Each cell represents computing K and V for one token at one generation step. Red cells are redundant recomputation.

                Generation Steps
                Step 1   Step 2   Step 3   Step 4
               (gen on) (gen the)(gen mat)(gen .)
              ┌────────┬────────┬────────┬────────┐
"The" (pos 0) │ COMPUTE│  REDO  │  REDO  │  REDO  │
              ├────────┼────────┼────────┼────────┤
"cat" (pos 1) │ COMPUTE│  REDO  │  REDO  │  REDO  │
              ├────────┼────────┼────────┼────────┤
"sat" (pos 2) │ COMPUTE│  REDO  │  REDO  │  REDO  │
              ├────────┼────────┼────────┼────────┤
"on"  (pos 3) │        │ COMPUTE│  REDO  │  REDO  │
              ├────────┼────────┼────────┼────────┤
"the" (pos 4) │        │        │ COMPUTE│  REDO  │
              ├────────┼────────┼────────┼────────┤
"mat" (pos 5) │        │        │        │ COMPUTE│
              └────────┴────────┴────────┴────────┘

Legend:  COMPUTE = necessary work
         REDO    = redundant recomputation (wasted!)

Count the cells:

  • Necessary computations: 6 (each token's K and V computed exactly once, the first time it appears)
  • Redundant computations: 3 + 3 + 3 + 2 + 1 = 12

We did 12 redundant computations to avoid storing just 6 results. And this is a tiny example!
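If you want to double-check these counts (or scale them up), a small helper that counts per generation step gives the same totals (the names here are just for illustration, not from any library):

def count_cells(prompt_len, gen_len):
    compute = redo = 0
    for step in range(1, gen_len + 1):
        seq_len = prompt_len + step - 1               # tokens fed to the model this step
        new = prompt_len if step == 1 else 1          # tokens seen for the first time
        compute += new
        redo += seq_len - new                         # everything else is recomputation
    return compute, redo

print(count_cells(3, 4))   # (6, 12), matching the grid above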


Why Are K and V Reusable? (The Crucial Insight)

You might wonder: "How do we know K₀ and V₀ don't change when we add more tokens?"

This is guaranteed by the causal structure of decoder-only transformers. Let me explain:

The Causal Masking Guarantee

In causal (decoder-only) attention, position i can only attend to positions 0 through i. This is enforced by the causal mask, which blocks attention to future positions: their attention scores are set to −∞ before the softmax, so their attention weights come out as zero.

This creates a key property: the hidden state at position i depends only on tokens 0 through i.

Let's trace through why:

  1. Layer 1 input: The embedding for token at position i is determined solely by that token and its position.
  2. Layer 1 attention output: Position i's output depends only on attending to positions 0 through i (causal mask). Future tokens don't exist yet from position i's perspective.
  3. Layer 1 → Layer 2: The hidden state passed to layer 2 at position i depends only on tokens 0 through i.
  4. This propagates through all layers: At every layer, position i's hidden state is determined solely by tokens 0 through i.

Therefore: When we compute K_i = hidden_state_i @ W_K and V_i = hidden_state_i @ W_V, these values depend only on tokens 0 through i. Adding token i+1 or i+2 or i+100 to the sequence cannot change K_i or V_i.

This is why the redundancy is so wasteful—we're recomputing values that are mathematically guaranteed to be identical!
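You can verify this numerically. The following minimal NumPy sketch (toy sizes, random weights, not a real model) builds one causal self-attention layer and shows that the outputs for the first three positions are identical whether the input has 3 tokens or 6, so anything a later layer computes from them, including K and V, cannot change either:

import numpy as np

rng = np.random.default_rng(0)
d = 8                                                         # toy hidden size
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

def causal_attention(X):
    """One causal self-attention layer over X with shape (seq_len, d)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)    # True above the diagonal = future positions
    scores[mask] = -np.inf                                    # causal mask: block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # row-wise softmax
    return weights @ V

X = rng.normal(size=(6, d))               # embeddings for 6 tokens
out_short = causal_attention(X[:3])       # "The cat sat"
out_long = causal_attention(X)            # "The cat sat on the mat"

print(np.allclose(out_short, out_long[:3]))   # True: earlier positions are unaffected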


Quantifying the Waste

Let's put numbers to this problem.

Scenario: You have a prompt of p tokens and want to generate g tokens.

Naive Approach: Total K,V Computations

At each generation step t (for t = 1 to g), you compute K and V for all (p + t - 1) tokens:

Step 1: p tokens
Step 2: p + 1 tokens
Step 3: p + 2 tokens
...
Step g: p + g - 1 tokens

Total K,V computations:

p + (p+1) + (p+2) + ... + (p+g-1)
= g×p + (0 + 1 + 2 + ... + (g-1))
= g×p + g(g-1)/2
= g×p + g²/2 - g/2

For large g, this is approximately O(g×p + g²).
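A quick sanity check that the closed form matches the explicit per-step sum:

for p, g in [(3, 4), (100, 50), (500, 200)]:
    explicit = sum(p + t - 1 for t in range(1, g + 1))   # sum over generation steps
    closed = g * p + g * (g - 1) // 2                    # closed form derived above
    assert explicit == closed, (p, g)
print("closed form matches the sum")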

Optimal Approach: Minimum Necessary Computations

If we could reuse previous computations:

  • Compute K,V for p prompt tokens once
  • Compute K,V for g generated tokens (each computed once when generated)

Total K,V computations: p + g = O(p + g)

The Waste Factor

Waste ratio = Naive / Optimal = (g×p + g(g−1)/2) / (p + g)

Let's plug in realistic numbers:

Prompt (p)   Generated (g)   Naive Computations   Optimal   Waste Factor
   100            50                  6,225          150        41.5×
   500           200                119,900          700         171×
 1,000           500                624,750        1,500         416×
 2,000         1,000              2,499,500        3,000         833×

For a typical chat interaction (500 prompt tokens, 200 generated tokens), the naive approach does roughly 171× the necessary work for the K and V projections alone!
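These numbers are straightforward to reproduce from the two formulas above; here is a short sketch (the function name is just for illustration):

def waste(p, g):
    naive = g * p + g * (g - 1) // 2      # recompute K,V for the whole sequence each step
    optimal = p + g                       # compute each token's K,V exactly once
    return naive, optimal, naive / optimal

for p, g in [(100, 50), (500, 200), (1000, 500), (2000, 1000)]:
    n, o, w = waste(p, g)
    print(f"p={p:5d}  g={g:5d}  naive={n:>9,}  optimal={o:>5,}  waste={w:6.1f}×")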


The Scaling Crisis

This isn't just about raw computation—it's about what happens as models and sequences get larger.

The Quadratic Growth Problem

Notice that the naive computation grows as O(g²) in the number of generated tokens. This means:

  • Generating 2× more tokens costs 4× more compute
  • Generating 10× more tokens costs 100× more compute

For long-form generation (stories, reports, code), this becomes catastrophic.
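A quick numerical check of that claim, ignoring the prompt term (p = 0) so only the quadratic part remains:

naive = lambda p, g: g * p + g * (g - 1) // 2
print(naive(0, 1000) / naive(0, 500))    # ≈ 4:   2× more tokens, ~4× more compute
print(naive(0, 5000) / naive(0, 500))    # ≈ 100: 10× more tokens, ~100× more compute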

Per-Layer, Per-Head Multiplication

Remember that modern LLMs have:

  • Many layers: GPT-3 has 96 layers, LLaMA-70B has 80 layers
  • Many attention heads: Often 32-128 heads per layer

The redundancy happens in every one of those layers (the K and V projections in a layer produce the keys and values for all of its heads at once). For a 96-layer model, every redundant K,V computation counted above is repeated 96 times over: the waste ratio stays the same, but the absolute amount of wasted work grows with model depth and hidden size.

This Is Why KV Cache Exists

The solution—which we'll cover in Artifact 2—is conceptually simple: store the K and V values after computing them once, and reuse them in subsequent generation steps.

This storage is called the KV cache. It transforms:

  • O(g×p + g²) → O(p + g) for K,V computation
  • Trades compute for memory

But this trade-off creates its own challenges, which fundamentally shape how LLM inference works. Understanding the KV cache is the key to understanding everything about inference optimization.
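As a hedged preview of Artifact 2, here is the same greedy loop from the start of this article, but with the cache enabled through the transformers past_key_values mechanism, so every step after the first feeds the model only the newest token (again assuming torch, transformers, and the gpt2 checkpoint are available):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tok("The cat sat", return_tensors="pt").input_ids
past = None                                        # the KV cache starts empty
generated = []

with torch.no_grad():
    for _ in range(4):
        out = model(input_ids, past_key_values=past, use_cache=True)
        past = out.past_key_values                 # K,V for every token seen so far
        next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_token.item())
        input_ids = next_token                     # feed only the new token next time

print(tok.decode(generated))   # the same greedy continuation as the naive loop, with far less work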


Summary: The Problem We Need to Solve

  1. Autoregressive generation means generating one token at a time, feeding outputs back as inputs.
  2. The naive approach recomputes K and V for all tokens at every generation step.
  3. This is wasteful because K and V for previous positions are mathematically guaranteed not to change (due to causal masking).
  4. The waste is massive: For typical requests, we do 100-800× more K,V computation than necessary.
  5. The waste grows quadratically in the number of generated tokens, making long-form generation especially expensive.

Next up, Artifact 2: we'll see exactly how the KV cache solves this problem, what gets stored, and what new challenges this creates.


Check Your Understanding

Before moving on, make sure you can answer:

  1. Why does adding token 5 to a sequence not change the K and V values for tokens 0-4?
    • In a decoder-only transformer, causal masking prevents position i from attending to any future positions (i+1, i+2, ...).
    • So the hidden states for tokens 0–4 depend only on tokens up to their own positions, and future tokens cannot influence them.
    • Since Kᵢ = hᵢ·W_K and Vᵢ = hᵢ·W_V, and hᵢ does not change, Kᵢ and Vᵢ do not change either.
  2. If you generate 100 tokens after a 100-token prompt, roughly how many times does the naive approach compute K₀ and V₀?
    • About 100 times. In the naive loop, each of the 100 generation steps reruns the forward pass over the entire sequence, so token 0’s K and V get recomputed once per step.
  3. Why does the waste factor grow as sequences get longer?
    • With naive decoding, each step recomputes K,V for all previous tokens, so total work grows like:
      • Naive: g×p + g(g−1)/2 (includes a quadratic term in g)
      • Optimal: p + g (linear)
    • As g increases, the quadratic term dominates, so the ratio (waste factor) increases rapidly with longer generations.
