# CityWalker (2000hr)

HuggingFace port of the CityWalker waypoint-prediction model (CVPR 2025), trained on ~2000 hours of web-scale urban walking and driving videos. This repo contains the converted weights of `CityWalker_2000hr.ckpt` (originally a PyTorch Lightning checkpoint) re-packaged as a `transformers.PreTrainedModel` so it can be loaded with `AutoModel.from_pretrained`.

Upstream training dataset: ai4ce/CityWalker. Our port (model wrapper + ckpt converter + benchmark integration) lives in ai4ce/wanderland-benchmark.
## Architecture

```
images (B, 5, 3, H, W) → center_crop(400) + resize(392) + ImageNet norm
                       → DINOv2 (ViT-B/14) → obs tokens (B, 5, 768)
coords (B, 6, 2)       → PolarEmbedding + Linear → goal token (B, 1, 768)

concat → (B, 6, 768)
       → TransformerEncoder (8 heads, 16 layers)
       → MLP head
       → waypoints (B, 5, 2)
       → arrive_logits (B, 1)
```
- `context_size` = 5 past RGB frames.
- `len_traj_pred` = 5 future XY waypoints.
- The 6 coord rows are the 5 past poses + 1 target pose, all expressed in the current-pose-relative frame and divided by the per-video `step_scale` (so the model consumes dimensionless units, not meters).
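The relative-frame scaling can be sketched as follows. This is an illustrative reconstruction, not the repo's actual API: the helper name `build_coords`, its arguments, and the rotation convention are assumptions.

```python
import numpy as np

# Hypothetical sketch: build the (6, 2) coords input from 5 past world-frame
# XY poses plus 1 target pose, expressed relative to the current pose and
# divided by step_scale so the model sees dimensionless units.
def build_coords(past_xy, target_xy, current_xy, current_yaw, step_scale):
    """past_xy: (5, 2), target_xy: (2,), current_xy: (2,); returns (6, 2)."""
    c, s = np.cos(-current_yaw), np.sin(-current_yaw)
    R = np.array([[c, -s], [s, c]])        # world -> current-pose rotation
    pts = np.vstack([past_xy, target_xy])  # stack into (6, 2)
    rel = (pts - current_xy) @ R.T         # translate + rotate into local frame
    return rel / step_scale                # dimensionless units

past = np.zeros((5, 2))
coords = build_coords(past, np.array([4.0, 0.0]), np.zeros(2), 0.0, 2.0)
# last row is the target pose, now [2.0, 0.0] after dividing by step_scale
```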
## Usage

```python
from transformers import AutoModel
from wanderland_lab.models.citywalker import CityWalkerModel  # registers with AutoModel

model = AutoModel.from_pretrained("ai4ce/citywalker")
model.load_obs_encoder()  # fetches DINOv2 via torch.hub on first call
model.eval()
```
The DINOv2 backbone is not bundled with the weights to avoid redistributing Meta's pretrained checkpoint; `load_obs_encoder()` pulls it via `torch.hub`.
## Inputs / Outputs

| Name | Shape | Notes |
|---|---|---|
| `images` | `(B, 5, 3, H, W)` float32 | RGB in [0, 1]; the model applies center_crop(400) → resize(392) → ImageNet normalize internally |
| `coords` | `(B, 6, 2)` float32 | 5 past poses + 1 target pose in the current-pose-relative frame, scaled by 1 / `step_scale` |
| `waypoints` (out) | `(B, 5, 2)` float32 | Predicted XY waypoints in the current-pose-relative frame, in `step_scale` units; multiply by `step_scale` to recover meters |
| `arrive_logits` (out) | `(B, 1)` float32 | Pre-sigmoid logit for the "arrived at target" binary classifier |
The model predicts 2D XY waypoints only. It does not output a heading or
yaw. Downstream controllers that need (vx, vy, yaw_rate) derive yaw from
the predicted waypoint direction (e.g. atan2(wp_y, wp_x)).
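The yaw derivation mentioned above is a one-liner; the waypoint values below are illustrative.

```python
import math

# The waypoint is in the current-pose frame, so atan2 of its Y over X gives
# the turn angle toward it.
def yaw_from_waypoint(wp_x, wp_y):
    return math.atan2(wp_y, wp_x)

yaw = yaw_from_waypoint(1.0, 1.0)  # waypoint diagonally ahead: pi/4 rad
```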
## Policy wrapper

For robot-control use (per-episode position history, `step_scale` estimation from recent motion, lookahead along a reference path, and conversion of the waypoint to a body-frame velocity command), see `CityWalkerPolicy` in ai4ce/wanderland-benchmark.
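The last step, turning a waypoint into a body-frame velocity command, might look like the sketch below. This is not the actual `CityWalkerPolicy` code: the function name, gains, and timing constants are illustrative assumptions.

```python
import math

# Hedged sketch: rescale the first predicted waypoint to meters, cap the
# speed, and steer toward the waypoint direction.
def waypoint_to_cmd(wp, step_scale, dt=0.5, max_speed=1.0, yaw_gain=1.0):
    x_m, y_m = wp[0] * step_scale, wp[1] * step_scale  # back to meters
    yaw_err = math.atan2(y_m, x_m)                     # heading to waypoint
    dist = math.hypot(x_m, y_m)
    speed = min(dist / dt, max_speed)                  # reach it in ~dt, capped
    vx = speed * math.cos(yaw_err)
    vy = speed * math.sin(yaw_err)
    return vx, vy, yaw_gain * yaw_err

vx, vy, wz = waypoint_to_cmd((1.0, 0.0), step_scale=0.5)
# waypoint 0.5 m straight ahead -> pure forward command, no turn
```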
## Citation

```bibtex
@inproceedings{liu2025citywalker,
  title={Citywalker: Learning embodied urban navigation from web-scale videos},
  author={Liu, Xinhao and Li, Jintong and Jiang, Yicheng and Sujay, Niranjan and Yang, Zhicheng and Zhang, Juexiao and Abanes, John and Zhang, Jing and Feng, Chen},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={6875--6885},
  year={2025}
}
```
## License

Apache-2.0, matching the upstream ai4ce/CityWalker repository.