---
license: mit
datasets:
- svjack/pokemon-blip-captions-en-zh
pipeline_tag: unconditional-image-generation
tags:
- diffusion
- tiny
- pokemon
- U-Net
- from_scratch
- 9m
- pokepixels
- pixels
- diff
- diffusers
---

# PokéPixels1-9M (CPU)

A minimal diffusion model trained **from scratch on CPU**.

This project explores the lower limits of diffusion models:
**How small and simple can a diffusion model be while still producing recognizable images?**

---

Here are some "Fakemons" generated by the model (64×64 resolution):




## Overview

PokéPixels1-9M is a lightweight DDPM-based generative model trained on Pokémon images.

Despite its small size and CPU-only training, the model learns:

- Color distributions
- Basic shapes
- Early-stage object structure

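For context, the DDPM recipe behind this is simple: corrupt a clean image with Gaussian noise at a random timestep and train the U-Net to predict that noise. Below is a minimal sketch of one training step; the linear schedule constants and the `model(x_t, t)` signature mirror the usage script further down, but this is not the exact code from `train.py`:

```python
import torch

# Linear beta schedule (illustrative values, as in the original DDPM paper).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def ddpm_training_step(model, x0):
    """One DDPM training step: predict the noise added at a random timestep."""
    n = x0.shape[0]
    t = torch.randint(0, T, (n,))
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    # Forward process: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps
    eps_pred = model(x_t, t)
    return torch.nn.functional.mse_loss(eps_pred, eps)
```

Repeating this step over the dataset is all the "training loop" a model at this scale needs.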
---

## Specifications

| Component       | Value                 |
|-----------------|-----------------------|
| Parameters      | ~9M                   |
| Resolution      | 64×64                 |
| Training Device | CPU (Ryzen 5 5600G)   |
| Training Time   | ~5.5 hours            |
| Dataset         | pokemon-blip-captions |
| Architecture    | Custom U-Net          |
| Precision       | float32               |

---

## Features

- Full DDPM implementation from scratch
- Custom U-Net with attention blocks
- CPU-optimized training
- Deterministic sampling (seed support)
- Config-driven architecture

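"Config-driven" means the hyperparameters live in a single `Config` object that is saved inside the checkpoint and used to rebuild the network. The exact fields are defined in `train.py`; the sketch below only grounds the fields the generation script actually reads (`image_size`, `timesteps`, `beta_start`, `beta_end`) and the remaining names and defaults are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Config:
    # Fields referenced by the generation script; defaults are illustrative.
    image_size: int = 64
    timesteps: int = 1000
    beta_start: float = 1e-4
    beta_end: float = 0.02
    # Architecture knobs (assumed names) keeping the U-Net around ~9M params.
    base_channels: int = 64
    channel_mults: tuple = (1, 2, 4)
```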
---

## Results

The model generates:

- Coherent color palettes
- Recognizable Pokémon-like silhouettes
- Early-stage structure formation

Limitations:

- Blurry outputs
- Weak spatial consistency
- No semantic understanding

---

## Usage

### Generate images

```python
import torch
from PIL import Image

# ===== CONFIG =====
CHECKPOINT = "model.pt"
N_IMAGES = 8
STEPS = 50
SEED = 42
OUT = "generated.png"

# ===== IMPORT MODEL =====
from train import StudentUNet, DDPMScheduler, Config

# ===== LOAD =====
torch.manual_seed(SEED)

ckpt = torch.load(CHECKPOINT, map_location="cpu")
cfg = ckpt.get("config", Config())

model = StudentUNet(cfg)
model.load_state_dict(ckpt["model_state"])
model.eval()

scheduler = DDPMScheduler(cfg.timesteps, cfg.beta_start, cfg.beta_end)

# ===== SAMPLING =====
@torch.no_grad()
def sample(model, scheduler, n, steps):
    x = torch.randn(n, 3, cfg.image_size, cfg.image_size)

    # Stride through the full schedule so `steps` denoising steps cover all T timesteps.
    step_size = scheduler.T // steps
    timesteps = list(range(0, scheduler.T, step_size))[::-1]

    for t_val in timesteps:
        t = torch.full((n,), t_val, dtype=torch.long)

        noise_pred = model(x, t)

        if t_val > 0:
            ab = scheduler.alpha_bar[t_val]
            prev_t = max(t_val - step_size, 0)
            ab_prev = scheduler.alpha_bar[prev_t]

            # Effective beta/alpha for the strided step.
            beta_t = 1.0 - (ab / ab_prev)
            alpha_t = 1.0 - beta_t

            mean = (1.0 / alpha_t.sqrt()) * (
                x - (beta_t / (1.0 - ab).sqrt()) * noise_pred
            )

            x = mean + beta_t.sqrt() * torch.randn_like(x)
        else:
            # Final step: reconstruct x0 directly from the noise estimate.
            x = scheduler.predict_x0(x, noise_pred, t)

    return x.clamp(-1, 1)

samples = sample(model, scheduler, N_IMAGES, STEPS)

# ===== SAVE =====
samples = (samples + 1) / 2
samples = (samples * 255).byte().permute(0, 2, 3, 1).numpy()

grid = Image.new("RGB", (cfg.image_size * N_IMAGES, cfg.image_size))

for i, img in enumerate(samples):
    grid.paste(Image.fromarray(img), (i * cfg.image_size, 0))

grid.save(OUT)

print(f"✅ Saved to {OUT}")
```

```bash
python generate.py \
  --checkpoint model.pt \
  --n_images 8 \
  --steps 50 \
  --seed 42
```

### Output

Generated images are saved as a horizontal grid:

`generated.png`

## Limitations

- Unconditional model (no prompts)
- Limited dataset diversity
- Early training stage
- No DDIM (yet)

## Research Direction

This project demonstrates that diffusion models can learn meaningful visual structure even at extremely small scales.

Future work:

- Conditional generation (class-based)
- Text-to-image (v2.0)
- DDIM sampling
- Larger model variants
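DDIM sampling would reuse the same `alpha_bar` schedule as the DDPM sampler above but drop the stochastic term, stepping deterministically. A hedged sketch of what that could look like for this model (not yet part of the repo; the `model(x, t)` signature is assumed to match `train.py`):

```python
import torch

@torch.no_grad()
def ddim_sample(model, alpha_bar, n, image_size, steps=50):
    """Deterministic (eta = 0) DDIM sampling over a strided subset of timesteps."""
    T = alpha_bar.shape[0]
    x = torch.randn(n, 3, image_size, image_size)
    timesteps = list(range(0, T, T // steps))[::-1]

    for i, t_val in enumerate(timesteps):
        t = torch.full((n,), t_val, dtype=torch.long)
        eps = model(x, t)
        ab = alpha_bar[t_val]
        # Predict x0 from the current noisy sample and the noise estimate.
        x0 = (x - (1.0 - ab).sqrt() * eps) / ab.sqrt()
        if i + 1 < len(timesteps):
            ab_prev = alpha_bar[timesteps[i + 1]]
            # Deterministic DDIM update: re-noise x0 toward the previous timestep.
            x = ab_prev.sqrt() * x0 + (1.0 - ab_prev).sqrt() * eps
        else:
            x = x0

    return x.clamp(-1, 1)
```

Because no fresh noise is injected, the same seed always yields the same image, and far fewer steps are usually needed than with ancestral DDPM sampling.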

## Motivation

Most diffusion research focuses on scaling up. This project explores the opposite direction:

**What is the minimum viable diffusion model?**

## License

MIT

## Acknowledgments

- Hugging Face Datasets
- PyTorch
- The open-source AI community

If you like this project, give it a star and follow the evolution to v2.0 (conditional).

## Other

The initial idea was to distill a student U-Net from a teacher U-Net, but this was discontinued because the teacher was initialized with random weights, which would have prevented the student from learning anything useful.