arxiv:2605.31535

RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video

Published on May 29

· Submitted by

Stefan Baumann on Jun 1

CompVis

Upvote

Authors:

Stefan Andreas Baumann ,

Abstract

RayDer is a unified feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering for self-supervised novel view synthesis, enabling stable training on real-world video through dynamic state absorption and demonstrating clean scaling behavior.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on realistic videos and the hard-to-predict scaling behavior of multi-network system designs. We introduce RayDer, a unified, feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, turning self-supervised NVS into a well-posed single-model scaling problem. A minimal dynamic state, treated as a nuisance factor, absorbs time-varying content and enables stable training on unconstrained real-world video. Importantly, RayDer keeps static-scene NVS as its target task: dynamic content is leveraged purely as scalable supervision, not reconstructed as in dynamic-scene (4D) NVS. Across multiple model sizes and orders of magnitude in data, RayDer exhibits clean power-law scaling with data and compute, and outperforms static-scene data mixtures. On a large number of benchmarks, RayDer achieves strong zero-shot open-set performance competitive with state-of-the-art supervised approaches. Project Page: https://compvis.github.io/rayder

View arXiv page View PDF Project page GitHub 18 Add to collection

Community

stefan-baumann

Paper author Paper submitter 1 day ago

Self-supervised novel view synthesis methods are fundamentally data-limited: they require static-scene training data, which is scarce. RayDer removes this bottleneck by enabling stable training on general, dynamic real-world video. By consolidating three separate networks into one unified transformer, introducing dynamic state prediction with dropout, and improving pose learning through autoregressive training, RayDer's performance scales predictably with data, model size, and compute – following power-law scaling relationships (R² > 0.99) analogous to those observed in LLMs.

librarian-bot

about 15 hours ago

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2605.31535

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 1

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2605.31535 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2605.31535 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.