InSpace: Structure-Aware 3D Indoor Scene Generation from a Single 360° Image

Model Name: InSpace

Venue: ECCV 2026

Paper: Coming soon

Repository: Coming soon

Project Page: Coming soon

Introduction

InSpace is a structure-aware framework that generates a complete, asset-aware 3D indoor scene from a single 360° equirectangular (ERP) panorama. The output is a full-room mesh together with the individual, separable furniture meshes and their PBR materials.

Existing single-image-to-3D methods focus on asset-level generation and neglect the structural layout, which is essential for grounding assets in space. A single perspective image also lacks the field of view needed to recover a coherent global layout. InSpace addresses both problems by operating on a 360° ERP image and generating the scene in three cascaded flow-matching stages: (1) estimating Partial Scene Geometry (PSG) as a spatial prior, (2) generating coarse scene structure with view-selective cross-attention, and (3) producing detailed layout and asset geometry with textures through global-local hybrid attention.
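For readers unfamiliar with the ERP format, the sketch below shows how a pixel in an equirectangular image corresponds to a viewing direction on the sphere, and back. The axis convention (+z forward, +x right, +y up, longitude increasing from left to right) and the function names are assumptions made for illustration; they are not taken from the InSpace code.

```python
import math


def erp_pixel_to_direction(u, v, width, height):
    """Map an ERP pixel centre (u, v) to a unit direction (x, y, z).

    Assumed convention: longitude runs from -pi (left edge) to +pi (right edge),
    latitude runs from +pi/2 (top edge) to -pi/2 (bottom edge).
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


def direction_to_erp_pixel(direction, width, height):
    """Inverse mapping: direction (any non-zero length) to ERP pixel coordinates."""
    x, y, z = direction
    norm = math.sqrt(x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    lon = math.atan2(x, z)
    lat = math.asin(max(-1.0, min(1.0, y / norm)))
    u = (lon / (2.0 * math.pi) + 0.5) * width - 0.5
    v = (0.5 - lat / math.pi) * height - 0.5
    return (u, v)
```

Under this convention the image centre looks straight along +z, and every image row is a circle of constant latitude, which is why a single ERP image covers the whole room.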

InSpace is built on the TRELLIS.2 O-Voxel representation. This repository hosts only the four InSpace-finetuned components; the base VAEs and decoders are pulled automatically from microsoft/TRELLIS.2-4B and microsoft/TRELLIS-image-large at run time.

Model Details

Developed by: Gwanhyeong Koo, Hyunsu Kim, Youngji Kim, Taejae Lee, Siwoo Lim, Sunjae Yoon, Suyong Yeon, Chang D. Yoo (KAIST, NAVER LABS, Chung-Ang University)
Model Type: Three-stage flow-matching framework on O-Voxel structured latents, with a CenterPoint-based 3D bounding-box estimator
Input: Single 360° equirectangular (ERP) panorama
Output: Complete 3D indoor scene (structural layout + separable, textured asset meshes with PBR materials)
Base Model: TRELLIS.2-4B (sparse-structure / shape / texture VAEs and decoders)

Key Features

Structure-aware scene generation: Recovers a coherent global layout from a single 360° image, not just isolated assets.
Asset-aware output: The scene is decomposed into a layout (floor and walls) and individual objects, each exported as its own mesh rather than a single fused blob.
View-selective cross-attention: The panorama is unwrapped into 6 cubemap faces (FOV 120°), and each voxel attends only to the faces visible from its 3D position.
Layout-Guided Structure Inversion (optional): A monocular-depth (Depth-Anything-2) point cloud, the Partial Scene Geometry, seeds coarse generation via SDEdit-style noise inversion for better room-scale fidelity.
PBR materials: Base color, roughness, metallic, and opacity, inherited from the TRELLIS.2 texture decoder.
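The view-selective idea above can be illustrated with a plain frustum test: for a 3D point, check which of the 6 cubemap faces contain it inside their field of view. This is only a sketch under stated assumptions. The face names and axes are hypothetical, the camera is assumed to sit at the origin, and occlusion is ignored; how InSpace actually determines visibility is not described in this card.

```python
import math

# Hypothetical face layout: (forward, right, up) axes per face.
# The actual face ordering and orientation used by InSpace are not specified here.
FACES = {
    "front": ((0, 0, 1), (1, 0, 0), (0, 1, 0)),
    "back": ((0, 0, -1), (-1, 0, 0), (0, 1, 0)),
    "right": ((1, 0, 0), (0, 0, -1), (0, 1, 0)),
    "left": ((-1, 0, 0), (0, 0, 1), (0, 1, 0)),
    "up": ((0, 1, 0), (1, 0, 0), (0, 0, -1)),
    "down": ((0, -1, 0), (1, 0, 0), (0, 0, 1)),
}


def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def visible_faces(point, fov_deg=120.0):
    """Return the names of the faces whose square frustum contains `point`.

    `point` is given relative to the panorama centre (assumed camera origin).
    """
    limit = math.tan(math.radians(fov_deg) / 2.0)
    visible = []
    for name, (forward, right, up) in FACES.items():
        depth = dot(point, forward)
        if depth <= 0.0:
            continue  # point is behind (or exactly beside) this face's camera
        inside_h = abs(dot(point, right)) <= limit * depth
        inside_v = abs(dot(point, up)) <= limit * depth
        if inside_h and inside_v:
            visible.append(name)
    return visible


print(visible_faces((1.0, 0.0, 1.0)))                # diagonal point, 120° faces
print(visible_faces((0.5, 0.0, 1.0), fov_deg=90.0))  # same test with 90° faces
```

With 120° faces the frusta overlap, so a point near a face boundary falls inside more than one face. With 90° faces the frusta tile the sphere and a point lies in exactly one face, apart from points exactly on a boundary.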

Checkpoints

Folder	Component	Role	Size
`erp_ss_flow_img_dit_L_16l8_bf16_spatial/`	Coarse geometry	Coarse scene structure (sparse-structure flow, view-selective cross-attention)	~4.9 GB
`bbox_centerpoint/`	3D BBox	Per-asset oriented bounding-box estimator (CenterPoint)	~48 MB
`erp_slat_flow_img2shape_asset_aware_bf16/`	Asset shape	Asset-aware shape generation	~4.9 GB
`erp_slat_flow_imgshape2tex_asset_aware_bf16/`	Asset texture	Asset-aware texture generation (PBR)	~4.9 GB

Each folder holds the EMA weights under ckpts/. Model configs ship with the code repository (under configs/), so no config.json is needed here.
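After downloading, it can be useful to confirm that all four component folders are present before launching the pipeline. The following is a minimal sketch, assuming the checkpoints/<folder>/ckpts/*.pt layout described in the Usage section; the function name find_checkpoints is hypothetical and is not part of the InSpace repository.

```python
from pathlib import Path

# Folder names as listed in the checkpoint table of this card.
COMPONENT_FOLDERS = [
    "erp_ss_flow_img_dit_L_16l8_bf16_spatial",
    "bbox_centerpoint",
    "erp_slat_flow_img2shape_asset_aware_bf16",
    "erp_slat_flow_imgshape2tex_asset_aware_bf16",
]


def find_checkpoints(root="checkpoints"):
    """Return {folder: [sorted .pt paths]} and raise if any component is missing."""
    root = Path(root)
    found = {}
    for folder in COMPONENT_FOLDERS:
        files = sorted((root / folder / "ckpts").glob("*.pt"))
        if not files:
            raise FileNotFoundError(
                f"no .pt file found under {root / folder / 'ckpts'}"
            )
        found[folder] = files
    return found
```

Calling find_checkpoints("checkpoints") right after the download step fails early with a clear message if a folder is missing or incomplete.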

Requirements

System: Tested on Linux.
Hardware: An NVIDIA GPU with at least 24 GB of memory (verified on NVIDIA A100 and H100).
Software:
- The CUDA Toolkit (recommended 12.4).
- Conda for managing dependencies.
- Python 3.8 or higher.

Usage

Please refer to the official GitHub repository for installation. InSpace is run through the repository's scripts (demo/app_inspace.py, eval/pipeline/eval_pipeline.py), which load these checkpoints and chain the multi-stage pipeline together.

# 1. Get the code and set up the environment (same env as TRELLIS.2)
git clone <this-repo-url> --recursive && cd InSpace
. ./setup.sh --new-env --basic --flash-attn --nvdiffrast --nvdiffrec --cumesh --o-voxel --flexgemm

# 2. Download these checkpoints into checkpoints/
pip install -U "huggingface_hub[cli]"
hf download GwanHyeong/InSpace --local-dir checkpoints/

# 3a. Interactive demo (pick a scene, run the pipeline stage by stage)
python demo/app_inspace.py --port 7860

# 3b. Batch inference over the test set
python eval/pipeline/eval_pipeline.py \
    --data_dir datasets/ERP_3D_FRONT_test \
    --noise_mode sdedit --sdedit_alpha 0.5 --bbox_mode predicted --enable_texture

The inference code loads each checkpoint from checkpoints/<folder>/ckpts/*.pt; the matching model config is read from the code repository's configs/ directory.
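The --noise_mode sdedit and --sdedit_alpha flags in the batch command appear to correspond to the Layout-Guided Structure Inversion listed under Key Features. The general SDEdit idea for a flow-matching model can be sketched as follows: instead of starting from pure noise, blend the prior with noise at an intermediate level and integrate only the remaining part of the trajectory. The linear interpolation path, the reading of alpha as the starting noise level, and both function names are assumptions for illustration; the exact formulation in InSpace may differ.

```python
import random


def sdedit_start(prior, alpha, rng):
    """Blend a prior latent with Gaussian noise at noise level alpha in [0, 1].

    Assumes the linear path x_t = (1 - t) * x_0 + t * noise, evaluated at t = alpha.
    alpha = 0 returns the prior unchanged; alpha = 1 returns pure noise.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * p + alpha * rng.gauss(0.0, 1.0) for p in prior]


def remaining_timesteps(alpha, num_steps):
    """Uniform timesteps from t = alpha down to t = 0 for the sampler to integrate."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return [alpha * (1.0 - i / num_steps) for i in range(num_steps + 1)]


rng = random.Random(0)
prior = [1.0, 0.0, 1.0, 1.0]  # toy stand-in for a flattened occupancy prior
x_start = sdedit_start(prior, alpha=0.5, rng=rng)
schedule = remaining_timesteps(alpha=0.5, num_steps=5)
```

Under this reading, a smaller alpha keeps the result closer to the depth-derived prior, while a larger alpha gives the generator more freedom to depart from it.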

Dataset

InSpace is trained on ERP-FRONT-30K, a paired ERP-Image-to-3D indoor scene dataset built on 3D-FRONT, with 26.5K training and 2.5K test ERP-image-mesh pairs (~30K total). Each room is paired with 360° ERP observations rendered from inside the scene and covers a wide range of room sizes.

hf download GwanHyeong/ERP-FRONT-30K --repo-type dataset --local-dir datasets/

Known Limitations

Domain of training data: InSpace is trained on ERP-FRONT-30K, which consists of synthetic 3D-FRONT scenes. Results on real captured panoramas may vary; for real images the pipeline relies on monocular depth to build the Partial Scene Geometry.
Raw mesh artifacts: As with TRELLIS.2, generated raw meshes may occasionally contain small holes or minor topological discontinuities; mesh post-processing (hole-filling, remeshing) is provided.

Citation

InSpace has been accepted to ECCV 2026. The official citation will be added here soon.

License

Released under the MIT License. This work builds on TRELLIS.2 (MIT, Microsoft). Some dependencies (e.g. nvdiffrast, nvdiffrec) carry their own licenses.
