# CityWalker (2000hr)

HuggingFace port of the CityWalker waypoint-prediction model (CVPR 2025), trained on ~2000 hours of web-scale urban walking and driving videos. This repo contains the converted weights of `CityWalker_2000hr.ckpt` (originally a PyTorch Lightning checkpoint) re-packaged as a `transformers.PreTrainedModel` so it can be loaded with `AutoModel.from_pretrained`.

Upstream training dataset: ai4ce/CityWalker. Our port (model wrapper + ckpt converter + benchmark integration) lives in ai4ce/wanderland-benchmark.
## Architecture

```
images (B, 5, 3, H, W) → center_crop(400) + resize(392) + ImageNet norm
                       → DINOv2 (ViT-B/14) → obs tokens (B, 5, 768)
coords (B, 6, 2)       → PolarEmbedding + Linear → goal token (B, 1, 768)

concat → (B, 6, 768)
       → TransformerEncoder (8 heads, 16 layers)
       → MLP head
       → waypoints (B, 5, 2)
       → arrive_logits (B, 1)
```
- `context_size` = 5 past RGB frames.
- `len_traj_pred` = 5 future XY waypoints.
- The 6 coord rows are the 5 past poses + 1 target pose, all expressed in the current-pose-relative frame and divided by the per-video `step_scale` (so the model consumes dimensionless units, not meters).
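The relative-frame scaling can be sketched as follows. This is an illustrative reconstruction, not the repo's actual API: the helper name `build_coords`, its arguments, and the rotation convention are assumptions.

```python
import numpy as np

# Hypothetical sketch: build the (6, 2) coords input from 5 past world-frame
# XY poses plus 1 target pose, expressed relative to the current pose and
# divided by step_scale so the model sees dimensionless units.
def build_coords(past_xy, target_xy, current_xy, current_yaw, step_scale):
    """past_xy: (5, 2), target_xy: (2,), current_xy: (2,); returns (6, 2)."""
    c, s = np.cos(-current_yaw), np.sin(-current_yaw)
    R = np.array([[c, -s], [s, c]])        # world -> current-pose rotation
    pts = np.vstack([past_xy, target_xy])  # stack into (6, 2)
    rel = (pts - current_xy) @ R.T         # translate + rotate into local frame
    return rel / step_scale                # dimensionless units

past = np.zeros((5, 2))
coords = build_coords(past, np.array([4.0, 0.0]), np.zeros(2), 0.0, 2.0)
# last row is the target pose, now [2.0, 0.0] after dividing by step_scale
```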
## Usage

```python
from transformers import AutoModel
from wanderland_lab.models.citywalker import CityWalkerModel  # registers with AutoModel

model = AutoModel.from_pretrained("ai4ce/citywalker")
model.load_obs_encoder()  # fetches DINOv2 via torch.hub on first call
model.eval()
```
The DINOv2 backbone is not bundled with the weights to avoid redistributing Meta's pretrained checkpoint; `load_obs_encoder()` pulls it via `torch.hub`.
## Inputs / Outputs

| Name | Shape | Notes |
|---|---|---|
| `images` | `(B, 5, 3, H, W)` float32 | RGB in [0, 1]; the model applies center_crop(400) → resize(392) → ImageNet normalize internally |
| `coords` | `(B, 6, 2)` float32 | 5 past poses + 1 target pose in the current-pose-relative frame, scaled by 1 / `step_scale` |
| `waypoints` (out) | `(B, 5, 2)` float32 | Predicted XY waypoints in the current-pose-relative frame, in `step_scale` units; multiply by `step_scale` to recover meters |
| `arrive_logits` (out) | `(B, 1)` float32 | Pre-sigmoid logit for the "arrived at target" binary classifier |
The model predicts 2D XY waypoints only. It does not output a heading or
yaw. Downstream controllers that need (vx, vy, yaw_rate) derive yaw from
the predicted waypoint direction (e.g. atan2(wp_y, wp_x)).
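The yaw derivation mentioned above is a one-liner; the waypoint values below are illustrative.

```python
import math

# The waypoint is in the current-pose frame, so atan2 of its Y over X gives
# the turn angle toward it.
def yaw_from_waypoint(wp_x, wp_y):
    return math.atan2(wp_y, wp_x)

yaw = yaw_from_waypoint(1.0, 1.0)  # waypoint diagonally ahead: pi/4 rad
```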
## Policy wrapper

For robot-control use (per-episode position history, `step_scale` estimation from recent motion, lookahead along a reference path, and conversion of the waypoint to a body-frame velocity command), see `CityWalkerPolicy` in ai4ce/wanderland-benchmark.
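The last step, turning a waypoint into a body-frame velocity command, might look like the sketch below. This is not the actual `CityWalkerPolicy` code: the function name, gains, and timing constants are illustrative assumptions.

```python
import math

# Hedged sketch: rescale the first predicted waypoint to meters, cap the
# speed, and steer toward the waypoint direction.
def waypoint_to_cmd(wp, step_scale, dt=0.5, max_speed=1.0, yaw_gain=1.0):
    x_m, y_m = wp[0] * step_scale, wp[1] * step_scale  # back to meters
    yaw_err = math.atan2(y_m, x_m)                     # heading to waypoint
    dist = math.hypot(x_m, y_m)
    speed = min(dist / dt, max_speed)                  # reach it in ~dt, capped
    vx = speed * math.cos(yaw_err)
    vy = speed * math.sin(yaw_err)
    return vx, vy, yaw_gain * yaw_err

vx, vy, wz = waypoint_to_cmd((1.0, 0.0), step_scale=0.5)
# waypoint 0.5 m straight ahead -> pure forward command, no turn
```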
## Citation

```bibtex
@inproceedings{liu2025citywalker,
  title={Citywalker: Learning embodied urban navigation from web-scale videos},
  author={Liu, Xinhao and Li, Jintong and Jiang, Yicheng and Sujay, Niranjan and Yang, Zhicheng and Zhang, Juexiao and Abanes, John and Zhang, Jing and Feng, Chen},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={6875--6885},
  year={2025}
}
```
## License

Apache-2.0, matching the upstream ai4ce/CityWalker repository.