---
license: mit
datasets:
- svjack/pokemon-blip-captions-en-zh
pipeline_tag: unconditional-image-generation
tags:
- diffusion
- tiny
- pokemon
- U-Net
- from_scratch
- 9m
- pokepixels
- pixels
- diff
- diffusers
---
# PokéPixels1-9M (CPU)
A minimal diffusion model trained **from scratch on CPU**.

This project explores the lower limits of diffusion models:
**How small and simple can a diffusion model be while still producing recognizable images?**

---

Here are some "Fakemons" generated by the model (64x64 resolution):


## 🧠 Overview
TinyPokemonDiffusion is a lightweight DDPM-based generative model trained on Pokémon images.
Despite its small size and CPU-only training, the model learns:
- Color distributions
- Basic shapes
- Early-stage object structure
---
## ⚙️ Specifications
| Component | Value |
|------------------|------|
| Parameters | ~9M |
| Resolution | 64x64 |
| Training Device | CPU (Ryzen 5 5600G) |
| Training Time | ~5.5 hours |
| Dataset | svjack/pokemon-blip-captions-en-zh |
| Architecture | Custom UNet |
| Precision | float32 |
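
As a rough sanity check on the numbers in the table, the weight memory follows from the parameter count and the precision. This is only back-of-the-envelope arithmetic: the table gives the parameter count as approximately 9M, so the result is approximate too, and it ignores activations and optimizer state.

```python
# Approximate weight memory for the model described in the table.
N_PARAMS = 9_000_000     # "~9M" from the table; the exact count is not stated
BYTES_PER_PARAM = 4      # float32 uses 4 bytes per value

weight_bytes = N_PARAMS * BYTES_PER_PARAM
print(f"{weight_bytes / 1e6:.1f} MB ({weight_bytes / 2**20:.1f} MiB)")
# → 36.0 MB (34.3 MiB)
```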
---
## 🧪 Features
- Full DDPM implementation from scratch
- Custom UNet with attention blocks
- CPU-optimized training
- Deterministic sampling (seed support)
- Config-driven architecture
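
To make the DDPM part concrete, here is a minimal plain-Python sketch of the forward (noising) process that a noise-prediction model is trained against. It is not the code from `train.py`: the helper names `linear_alpha_bar` and `q_sample` are hypothetical, and the linear schedule with 1000 steps from 1e-4 to 0.02 is an assumed common default, not a value confirmed by this model's config.

```python
import math
import random


def linear_alpha_bar(timesteps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed values)."""
    alpha_bar, running = [], 1.0
    for i in range(timesteps):
        beta = beta_start + (beta_end - beta_start) * i / (timesteps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def q_sample(x0, t, alpha_bar, rng):
    """Forward process: x_t = sqrt(ab_t) * x0 + sqrt(1 - ab_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    signal = math.sqrt(alpha_bar[t])
    noise = math.sqrt(1.0 - alpha_bar[t])
    x_t = [signal * x + noise * e for x, e in zip(x0, eps)]
    return x_t, eps  # eps is what a noise-prediction network learns to recover


alpha_bar = linear_alpha_bar()
rng = random.Random(42)
x0 = [0.5, -0.25, 1.0]  # toy "pixels" scaled to [-1, 1]
x_t, eps = q_sample(x0, 500, alpha_bar, rng)
print(x_t)
```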
---
## 🖼️ Results
The model generates:
- Coherent color palettes
- Recognizable Pokémon-like silhouettes
- Early-stage structure formation

Limitations:
- Blurry outputs
- Weak spatial consistency
- No semantic understanding
---

**Note:** The initial idea was to train a student U-Net from a teacher U-Net. This was discontinued because the teacher was initialized with random weights, which would have killed the student's learning.

## 🚀 Usage
### Generate images
```python
import torch
from PIL import Image

# train.py from this repository must be importable (same directory or on PYTHONPATH)
from train import StudentUNet, DDPMScheduler, Config

# ===== CONFIG =====
CHECKPOINT = "model.pt"
N_IMAGES = 8
STEPS = 50
SEED = 42
OUT = "generated.png"

# ===== LOAD =====
torch.manual_seed(SEED)  # fixed seed makes sampling reproducible
ckpt = torch.load(CHECKPOINT, map_location="cpu")
cfg = ckpt.get("config", Config())
model = StudentUNet(cfg)
model.load_state_dict(ckpt["model_state"])
model.eval()
scheduler = DDPMScheduler(cfg.timesteps, cfg.beta_start, cfg.beta_end)


# ===== SAMPLING =====
@torch.no_grad()
def sample(model, scheduler, n, steps):
    """Strided ancestral sampling over evenly spaced timesteps, highest first."""
    x = torch.randn(n, 3, cfg.image_size, cfg.image_size)
    step_size = scheduler.T // steps
    timesteps = list(range(0, scheduler.T, step_size))[::-1]
    for t_val in timesteps:
        t = torch.full((n,), t_val, dtype=torch.long)
        noise_pred = model(x, t)
        if t_val > 0:
            ab = scheduler.alpha_bar[t_val]
            prev_t = max(t_val - step_size, 0)
            ab_prev = scheduler.alpha_bar[prev_t]
            # Effective beta/alpha for the jump from t_val to prev_t
            beta_t = 1.0 - (ab / ab_prev)
            alpha_t = 1.0 - beta_t
            mean = (1.0 / alpha_t.sqrt()) * (
                x - (beta_t / (1.0 - ab).sqrt()) * noise_pred
            )
            x = mean + beta_t.sqrt() * torch.randn_like(x)
        else:
            # Last step: take the predicted clean image directly
            x = scheduler.predict_x0(x, noise_pred, t)
    return x.clamp(-1, 1)


samples = sample(model, scheduler, N_IMAGES, STEPS)

# ===== SAVE =====
samples = (samples + 1) / 2  # map [-1, 1] to [0, 1]
samples = (samples * 255).byte().permute(0, 2, 3, 1).numpy()  # NCHW -> NHWC uint8
grid = Image.new("RGB", (cfg.image_size * N_IMAGES, cfg.image_size))
for i, img in enumerate(samples):
    grid.paste(Image.fromarray(img), (i * cfg.image_size, 0))
grid.save(OUT)
print(f"✅ Saved to {OUT}")
```
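
The sampler above visits only every `step_size`-th timestep and derives an effective `beta_t` from the ratio of two `alpha_bar` values. The plain-Python sketch below illustrates what that ratio means: the effective alpha for one jump equals the product of the per-step alphas it skips over. The schedule values (linear, 1000 steps, 1e-4 to 0.02) are assumptions for illustration; the real ones come from the checkpoint's config.

```python
import math

# Assumed schedule values for illustration only.
T, BETA_START, BETA_END, STEPS = 1000, 1e-4, 0.02, 50

betas = [BETA_START + (BETA_END - BETA_START) * i / (T - 1) for i in range(T)]
alpha_bar, running = [], 1.0
for b in betas:
    running *= 1.0 - b
    alpha_bar.append(running)

# Same timestep selection as sample(): evenly strided, highest first.
step_size = T // STEPS
timesteps = list(range(0, T, step_size))[::-1]


def effective_beta(t_val):
    """Beta for the jump from t_val down to t_val - step_size (floored at 0)."""
    prev_t = max(t_val - step_size, 0)
    return 1.0 - alpha_bar[t_val] / alpha_bar[prev_t]


# The jump 980 -> 960 covers the single-step alphas for t = 961..980.
skipped = math.prod(1.0 - betas[i] for i in range(961, 981))
print(timesteps[:3], "...", timesteps[-3:])
print(1.0 - effective_beta(980), skipped)  # the two values agree
```

Because each jump covers many original steps, the effective beta is much larger than the single-step beta at the same timestep, which is one reason fewer steps tend to give noisier samples.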
```bash
python generate.py \
--checkpoint model.pt \
--n_images 8 \
--steps 50 \
--seed 42
```
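
The command above implies that `generate.py` accepts these four flags. A minimal sketch of how such flags could be parsed with `argparse` is shown below. The function name `parse_args` and the defaults (taken from the constants in the script above) are assumptions; the actual `generate.py` may be organized differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags shown in the command above (hypothetical layout)."""
    parser = argparse.ArgumentParser(description="Generate samples from a checkpoint")
    parser.add_argument("--checkpoint", default="model.pt", help="path to the model checkpoint")
    parser.add_argument("--n_images", type=int, default=8, help="number of images in the grid")
    parser.add_argument("--steps", type=int, default=50, help="number of sampling steps")
    parser.add_argument("--seed", type=int, default=42, help="random seed for reproducibility")
    return parser.parse_args(argv)


# Passing a list lets the parser be exercised without touching sys.argv.
args = parse_args(["--n_images", "4", "--seed", "7"])
print(args.checkpoint, args.n_images, args.steps, args.seed)
```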
## 📁 Output
Generated images are saved as a horizontal grid:
`outputs/generated.png`
## ⚠️ Limitations
- Unconditional model (no prompts)
- Limited dataset diversity
- Early training stage
- No DDIM (yet)
## 🔬 Research Direction
This project demonstrates that diffusion models can learn meaningful visual structure even at extremely small scales.

Future work:
- Conditional generation (class-based)
- Text-to-image (v2.0)
- DDIM sampling
- Larger model variants
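
For context on the DDIM item, here is a minimal plain-Python sketch of one deterministic DDIM update (eta = 0) on a toy list of pixel values. It is not part of this repository: `ddim_step` is a hypothetical helper, and the `alpha_bar` values are made up for the example.

```python
import math


def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    """One deterministic DDIM update (eta = 0) from alpha_bar ab_t to ab_prev."""
    sqrt_ab_t = math.sqrt(ab_t)
    sqrt_one_minus_ab_t = math.sqrt(1.0 - ab_t)
    # Clean-image estimate implied by the current noise prediction
    x0_pred = [(x - sqrt_one_minus_ab_t * e) / sqrt_ab_t for x, e in zip(x_t, eps_pred)]
    # Re-noise that estimate to the earlier timestep, reusing the predicted noise
    sqrt_ab_prev = math.sqrt(ab_prev)
    sqrt_one_minus_ab_prev = math.sqrt(1.0 - ab_prev)
    return [sqrt_ab_prev * x0 + sqrt_one_minus_ab_prev * e for x0, e in zip(x0_pred, eps_pred)]


# Toy check with a "perfect" noise prediction (illustrative values only).
x0 = [0.5, -0.25, 1.0]
eps = [0.3, -1.2, 0.7]
ab_t, ab_prev = 0.5, 0.8
x_t = [math.sqrt(ab_t) * a + math.sqrt(1.0 - ab_t) * e for a, e in zip(x0, eps)]
x_prev = ddim_step(x_t, eps, ab_t, ab_prev)
print(x_prev)
```

Unlike the ancestral sampler used in the usage script, this update adds no fresh noise, so the output is fully determined by the starting noise and the model.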
## 💡 Motivation
Most diffusion research focuses on scaling up.
This project explores the opposite direction:
**What is the minimum viable diffusion model?**
## 📜 License
MIT
## 🙌 Acknowledgments
- Hugging Face datasets
- PyTorch
- The open-source AI community

⭐ If you like this project, give it a star and follow its evolution to v2.0 (conditional) 🚀