---
language:
- en
license: other
license_name: compvis-research-license
license_link: LICENSE.md
pipeline_tag: image-to-image
tags:
- novel-view-synthesis
- nvs
- 3d
- self-supervised
- scaling-laws
---

# RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video

[![Project Page](https://img.shields.io/badge/Project-Page-blue)](https://compvis.github.io/rayder/)
[![Paper](https://img.shields.io/badge/arXiv-paper-b31b1b)](https://arxiv.org/abs/2605.31535)
[![Hugging Face Papers](https://img.shields.io/badge/Huggingface-Papers-yellow)](https://huggingface.co/papers/2605.31535)
[![GitHub](https://img.shields.io/badge/GitHub-Code-black)](https://github.com/CompVis/rayder)
[![Weights](https://img.shields.io/badge/HuggingFace-Weights-orange)](https://huggingface.co/CompVis/rayder)


RayDer is a self-supervised novel view synthesis (NVS) model that unifies camera estimation and view synthesis in a single transformer. Unlike prior self-supervised NVS approaches, which are bottlenecked by scarce static-scene data, RayDer is trained on general, dynamic real-world video. Its performance scales predictably with data, model size, and compute, following power-law relationships (R² > 0.99) analogous to those observed in LLMs.

## Paper and Abstract

The RayDer model was presented in the paper [RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video](https://arxiv.org/abs/2605.31535).

Self-supervised novel view synthesis methods are fundamentally *data-limited*: they require static-scene training data, which is scarce. RayDer removes this bottleneck by enabling stable training on general, dynamic real-world video. By consolidating three separate networks into one unified transformer, introducing dynamic state prediction with dropout, and improving pose learning through autoregressive training, RayDer's performance scales predictably with data, model size, and compute.

![RayDer enables training NVS from abundant general video](https://compvis.github.io/rayder/static/images/paper-svg/teaser-concept.svg)

*Existing approaches rely on scarce data sources: supervised NVS requires posed multi-view images, while prior self-supervised methods require unposed videos of static scenes. RayDer instead trains from generic unposed videos that may contain dynamic objects, enabling learning from the dominant form of visual data and unlocking improved scaling with dataset size.*

![RayDer architecture overview](https://compvis.github.io/rayder/static/images/paper-svg/architecture-overview.svg)

*A single transformer unifies camera estimation and novel view synthesis, replacing the three separate networks used by prior self-supervised NVS pipelines.*

## Usage

To integrate RayDer into your own codebase, copy `rayder/model.py` from the [GitHub repository](https://github.com/CompVis/rayder) and instantiate the model as:

```python
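import torch
from rayder.model import RayDer_L

# Build the large variant and load the released 576² checkpoint.
model = RayDer_L()
# weights_only=True restricts unpickling to tensors and plain containers;
# map_location="cpu" loads the checkpoint without requiring a GPU.
state_dict = torch.load("rayder_l_576.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)

# Inference only: freeze all parameters and switch to eval mode.
model.requires_grad_(False)
model.eval()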
```

The `RayDer` class exposes two high-level inference methods:
- `predict_cameras(x)`: estimate camera parameters from a set of input views (trained for 8 views, but the models extrapolate quite well).
- `predict_views(x_in, cam_in, cam_target)`: synthesize novel views at target camera poses (trained for 1–7 input views, arbitrarily many output views).

Images are channels-last `(b, t, h, w, 3)` with pixel values in [-1, 1]. Camera extrinsics use the camera-to-world (c2w) convention, and the focal length `f` is normalized by the shorter image side (`f = f_pixels / min(h-1, w-1)`).
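The input conventions above can be sketched as a small preprocessing step. This is a minimal illustration, not code from the repository: the helper names `to_model_range` and `normalize_focal` are hypothetical, and it assumes the source frames are `uint8` arrays in [0, 255] and that the focal length is given in pixels.

```python
import numpy as np


def to_model_range(frames_uint8: np.ndarray) -> np.ndarray:
    """Map uint8 frames of shape (b, t, h, w, 3) from [0, 255] to float32 in [-1, 1].

    Hypothetical helper; assumes channels-last uint8 input.
    """
    if frames_uint8.ndim != 5 or frames_uint8.shape[-1] != 3:
        raise ValueError("expected channels-last frames of shape (b, t, h, w, 3)")
    return frames_uint8.astype(np.float32) / 127.5 - 1.0


def normalize_focal(f_pixels: float, h: int, w: int) -> float:
    """Normalize a pixel focal length as f = f_pixels / min(h - 1, w - 1)."""
    return f_pixels / min(h - 1, w - 1)


# Dummy batch: 1 sequence of 2 frames at 256x256, red channel saturated.
frames = np.zeros((1, 2, 256, 256, 3), dtype=np.uint8)
frames[..., 0] = 255

x = to_model_range(frames)            # float32, values in [-1, 1]
f = normalize_focal(255.0, 256, 256)  # 255 / min(255, 255)
```

The resulting array can be converted with `torch.from_numpy(x)` before being passed to the model. How the normalized focal length and the c2w extrinsics are packed into the camera arguments is defined by `rayder/model.py` and is not shown here.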

See the [GitHub repository](https://github.com/CompVis/rayder) for `generate_video.py` (smooth view-interpolation videos from a set of input images) and `app.py` (Gradio demo).

## Models

We currently release the following model variants:

| Variant | Width | Depth | Params | Resolution | File |
| :------ | ----: | ----: | -----: | ---------: | :--- |
| RayDer-L | 1024 | 24 | ~743M | 256² | `rayder_l.pt` |
| RayDer-L-576² | 1024 | 24 | ~743M | 576² | `rayder_l_576.pt` |

Additional model variants and licensing options are available upon request.

## License

This model is released under a license for personal and scientific non-commercial research purposes; see [LICENSE.md](LICENSE.md) for the full terms. For any commercial use or exploitation, please contact <license.compvis@ifi.lmu.de>.

## Citation

If you find our model or code useful, please cite our paper:

```bibtex
@misc{prestel2026rayderscalableselfsupervisednovel,
    title={RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video},
    author={Ulrich Prestel and Stefan Andreas Baumann and Nick Stracke and Björn Ommer},
    year={2026},
    eprint={2605.31535},
    archivePrefix={arXiv},
}
```