What If LLMs Could Imagine Consequences?
A weekend rabbit hole that might be nothing—or might be everything
I've been nerd-sniped.
Last week I fell down a rabbit hole thinking about why LLMs are fundamentally limited at planning. They're incredible at generating text, but ask one to predict what happens three steps into a workflow? It starts hallucinating. It loses track. It confuses states.
And then it hit me: LLMs don't actually model the world. They model language about the world.
That's a huge difference.
The JEPA Brain Worm
If you haven't been following Yann LeCun's work on JEPA (Joint Embedding Predictive Architecture), here's the TL;DR: instead of predicting pixels or tokens, you predict embeddings. You learn what matters about a state, not its surface representation.
Video models that predict pixels waste compute on irrelevant details—the exact shade of a shadow, the texture of a wall. JEPA models predict what changes in an abstract space. They learn physics, not photography.
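To make that concrete, here's roughly the shape of the objective. This is not I-JEPA's actual training code (no masking, no EMA target encoder; plain linear layers stand in for real encoders), just the difference between reconstructing the input and predicting its embedding:

```python
# Minimal sketch of the JEPA-style objective, not the I-JEPA implementation.
# Linear layers stand in for real context/target encoders and the predictor.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim_in, dim_emb = 768, 128
context_encoder = nn.Linear(dim_in, dim_emb)  # encodes the visible part of the input
target_encoder = nn.Linear(dim_in, dim_emb)   # encodes the hidden/future part (an EMA copy in practice)
predictor = nn.Linear(dim_emb, dim_emb)       # maps context embedding -> predicted target embedding

context_view = torch.randn(32, dim_in)
target_view = torch.randn(32, dim_in)

# Generative objective: reconstruct every detail of the target (shadows, textures, all of it)
decoder = nn.Linear(dim_emb, dim_in)
recon_loss = F.mse_loss(decoder(context_encoder(context_view)), target_view)

# JEPA objective: predict only the target's *embedding*; irrelevant detail never enters the loss
with torch.no_grad():
    target_emb = target_encoder(target_view)
pred_emb = predictor(context_encoder(context_view))
jepa_loss = F.mse_loss(pred_emb, target_emb)
```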
So I started wondering: what if we made decoder-only LLMs work like JEPA?
The Experiment
The idea is stupid simple:
Normal LLM:
Input: tokens → Output: next token probabilities
JEPA-style LLM:
Input: (state embedding + action embedding) → Output: next state embedding
That's it. You swap the vocabulary head for a state prediction head. You train with MSE loss instead of cross-entropy. The model learns state dynamics instead of text generation.
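In code, the swap is small. A minimal sketch, with a toy transformer backbone standing in for the real decoder-only LLM and made-up dimensions:

```python
# Toy version of the head swap: the output head predicts the next *state embedding*
# and the loss is MSE, not cross-entropy over a vocabulary.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StatePredictor(nn.Module):
    def __init__(self, dim=128, n_layers=2, n_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.state_head = nn.Linear(dim, dim)   # replaces the vocabulary head

    def forward(self, state_emb, action_emb):
        # treat (state, action) as a two-token sequence and predict from the last position
        seq = torch.stack([state_emb, action_emb], dim=1)
        return self.state_head(self.backbone(seq)[:, -1])

model = StatePredictor()
state, action, next_state = (torch.randn(32, 128) for _ in range(3))
loss = F.mse_loss(model(state, action), next_state)   # MSE instead of cross-entropy
loss.backward()
```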
I hacked together three prototypes over the weekend:
- Sentence encoder approach — use off-the-shelf embeddings, train a tiny predictor (rough sketch after this list)
- LLM hidden states — use GPT-2's internal representations as the state space
- Full autoencoder — learn domain-specific state embeddings end-to-end
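Here's roughly what the first approach looks like. The variable names and the tiny MLP are mine, not the repo's; the point is that the sentence encoder stays frozen and only the predictor trains:

```python
# Rough shape of the sentence-encoder approach (illustrative, not the repo's code).
# A frozen off-the-shelf encoder turns state/action text into embeddings; a small
# MLP learns to map (state, action) -> next state in that embedding space.
import torch
import torch.nn as nn
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")  # 384-dim embeddings, frozen

states      = ["document is in draft"]
actions     = ["user submits document for review"]
next_states = ["document is pending review"]

s  = encoder.encode(states, convert_to_tensor=True)
a  = encoder.encode(actions, convert_to_tensor=True)
ns = encoder.encode(next_states, convert_to_tensor=True)

predictor = nn.Sequential(nn.Linear(2 * 384, 512), nn.ReLU(), nn.Linear(512, 384))
pred = predictor(torch.cat([s, a], dim=-1))
loss = F.mse_loss(pred, ns)   # only the predictor gets gradients
loss.backward()
```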
All three work. Like, actually work. On synthetic enterprise workflow data, they learn to predict "if user submits document for review, state changes from draft to pending." They chain predictions. They do multi-step rollouts.
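Chaining is just feeding the prediction back in as the next state. Continuing the sketch above (same `encoder`, `predictor`, and `s`):

```python
# Multi-step rollout: the predicted next-state embedding becomes the current
# state for the following action. Reuses `encoder`, `predictor`, `s` from above.
import torch

def rollout(predictor, state_emb, action_embs):
    trajectory = [state_emb]
    for act in action_embs:
        state_emb = predictor(torch.cat([state_emb, act], dim=-1))
        trajectory.append(state_emb)
    return trajectory

acts = encoder.encode(["submit for review", "approve document"], convert_to_tensor=True)
states = rollout(predictor, s, [acts[0:1], acts[1:2]])   # draft -> pending -> (predicted) approved
```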
Why This Might Matter for Enterprise AI
Here's where my CTO brain kicks in.
At Writer, we're building AI for enterprises. And enterprise workflows are stateful. Documents go through approval chains. Projects have phases. Customer tickets escalate. Everything is a state machine.
Current LLMs handle this by... generating text about state machines. They describe what might happen. But they don't model the transitions. They can't reliably simulate five steps ahead and tell you where you'll end up.
A JEPA-style approach could change that:
- Workflow prediction: Given current project state + proposed action → predict outcome
- Planning: Search through action sequences to reach desired states (toy planner sketched after this list)
- Anomaly detection: "This state transition has never been seen before"
- What-if analysis: "If we skip the legal review, what's the probability of reaching the 'approved' state?"
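To show what the planning bullet could mean in practice, here's a toy sketch. Everything in it is hypothetical (the dynamics model is an untrained stand-in with the same (state, action) -> next-state interface as the predictor above); the point is that planning turns into search over latent transitions:

```python
# Toy latent-space planner (hypothetical, not from the prototype repo): try every
# action sequence up to `horizon` steps, roll the dynamics model forward, and keep
# the sequence whose predicted end state lands closest to the goal embedding.
import itertools
import torch
import torch.nn as nn
import torch.nn.functional as F

def plan(dynamics, state_emb, action_embs, goal_emb, horizon=3):
    best_seq, best_score = None, -float("inf")
    with torch.no_grad():
        for seq in itertools.product(range(len(action_embs)), repeat=horizon):
            s = state_emb
            for i in seq:
                s = dynamics(torch.cat([s, action_embs[i:i + 1]], dim=-1))  # one predicted transition
            score = F.cosine_similarity(s, goal_emb, dim=-1).item()
            if score > best_score:
                best_seq, best_score = seq, score
    return best_seq, best_score

# Usage with an untrained stand-in dynamics model and random embeddings
dim = 384
dynamics = nn.Sequential(nn.Linear(2 * dim, 512), nn.ReLU(), nn.Linear(512, dim))
best, score = plan(dynamics, torch.randn(1, dim), torch.randn(4, dim), torch.randn(1, dim))
```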
The model learns the physics of your enterprise domain. Not text patterns. Physics.
Why This Might Be Nothing
I'm not gonna pretend I've solved AGI in a weekend.
A few obvious problems:
- State representation is hard — what even is the state of a complex enterprise workflow?
- Data collection — you need (state, action, next_state) triplets, not just text
- Scaling questions — does this approach even make sense at 70B parameters?
- Integration — how do you combine this with actual text generation?
Maybe the right answer is that diffusion models or test-time training or some other paradigm handles this better. Maybe this is a dead end.
But my gut says there's something here.
The Test-Time Scaling Angle
Here's what's been rattling around my head: the big unlock of 2024 was test-time compute. More thinking at inference time = better answers.
What if the next unlock is test-time world modeling?
Instead of just "think longer," it's "imagine the consequences." Run forward simulations. Evaluate trajectories. Plan in latent space.
That's basically what humans do. We don't generate token-by-token action sequences. We imagine outcomes, evaluate them, iterate.
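Concretely, the compute knob stops being "more reasoning tokens" and becomes "more imagined futures." A sketch of what I mean (speculative framing, not an established recipe; the dynamics model is again an untrained stand-in):

```python
# Test-time world modeling as a compute knob: sample N candidate action sequences,
# roll the latent dynamics model forward, score each imagined trajectory against a
# goal, keep the best. More compute = bigger n_samples / longer horizon.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 384
dynamics = nn.Sequential(nn.Linear(2 * dim, 512), nn.ReLU(), nn.Linear(512, dim))  # stand-in

def imagine(dynamics, state, action_bank, goal, n_samples=64, horizon=5):
    best_score, best_plan = -float("inf"), None
    with torch.no_grad():
        for _ in range(n_samples):
            idx = torch.randint(len(action_bank), (horizon,)).tolist()  # one random action sequence
            s = state
            for i in idx:
                s = dynamics(torch.cat([s, action_bank[i:i + 1]], dim=-1))
            score = F.cosine_similarity(s, goal, dim=-1).item()
            if score > best_score:
                best_score, best_plan = score, idx
    return best_plan, best_score

best_plan, best_score = imagine(dynamics, torch.randn(1, dim), torch.randn(8, dim), torch.randn(1, dim))
```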
JEPA-style LLMs could be the architecture that enables this. Or could be a toy that breaks at scale. Only one way to find out.
Try It Yourself
I put the prototypes on Hugging Face. Three approaches, synthetic data, ~30 minutes to train each on a free GPU.
🔗 JEPA-Style LLM Prototypes on Hugging Face
If you're thinking about enterprise AI, world models, or just want to see a decoder-only transformer do something weird, give it a spin.
Maybe you'll find the same thing I did: that watching a model learn to predict state transitions instead of tokens feels different. Like it's actually reasoning about the world, not just talking about it.
Or maybe you'll find bugs in my code. That's also valuable.
This is very much a "thinking out loud" post, not a "we've productionized this" post. But sometimes the fun experiments become the important ones. And sometimes they're just fun.
Either way, I learned something. That's enough for a weekend.
—Waseem
Coauthors: Writer Agent & OpenCode
Links:
- Our experiment: huggingface.co/wassemgtk/jepa_llm_prototypes
- JEPA paper: arxiv.org/abs/2301.08243
- V-JEPA 2: github.com/facebookresearch/vjepa2
- COCONUT (latent reasoning): github.com/facebookresearch/coconut