SCOPE

Scale-Consistent One-Pass Estimation of 3D Geometry

1 The University of Hong Kong, 2 Alibaba Group, 3 Horizon Robotics, 4 Ant Group
SIGGRAPH Conference Papers 2026
SCOPE generates temporally consistent and scale-invariant 3D geometry from monocular videos with superior accuracy across extended sequences.



Abstract

We present SCOPE (Scale-Consistent One-Pass Estimation of 3D Geometry), a novel approach for estimating 3D geometry from extended monocular video sequences, where existing methods struggle to maintain both geometric accuracy and temporal consistency across hundreds of frames. Our approach generates affine-invariant 3D point maps with shared parameters across entire sequences, enabling consistent scale-invariant representations. We introduce three key innovations: viewpoint-invariant geometry aligning multi-perspective points in a unified reference frame; appearance-invariant learning enforcing consistency across exponential timescales; and frequency-modulated positioning enabling extrapolation to sequences vastly exceeding training length. Experiments across diverse datasets demonstrate significant improvements, reducing relative point map error by 24.2% and temporal alignment error by 34.9% on ScanNet compared to state-of-the-art methods. Our approach handles challenging scenarios with complex camera trajectories and lighting variations while efficiently processing extended sequences in a single pass.
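To illustrate what "shared parameters across entire sequences" can mean for affine-invariant point maps, the sketch below fits a single scale `s` and shift `t` to all frames of a sequence at once, then reports a relative point error. The function names, the least-squares fit, and the error definition (Euclidean error divided by the ground-truth point norm) are assumptions made for illustration; they are not taken from the SCOPE paper or its evaluation code.

```python
import math

def fit_shared_scale_shift(pred_frames, gt_frames):
    """Least-squares scale s and shift t with s * p + t ~ q, shared by all frames.

    Each argument is a list of frames; each frame is a list of (x, y, z) points.
    Hypothetical helper: one (s, t) pair is fitted for the whole sequence.
    """
    pred = [p for frame in pred_frames for p in frame]
    gt = [q for frame in gt_frames for q in frame]
    n = len(pred)
    p_mean = [sum(p[k] for p in pred) / n for k in range(3)]
    q_mean = [sum(q[k] for q in gt) / n for k in range(3)]
    # Closed-form solution of min over (s, t) of sum ||s * p + t - q||^2.
    num = sum((p[k] - p_mean[k]) * (q[k] - q_mean[k])
              for p, q in zip(pred, gt) for k in range(3))
    den = sum((p[k] - p_mean[k]) ** 2 for p in pred for k in range(3))
    s = num / den
    t = [q_mean[k] - s * p_mean[k] for k in range(3)]
    return s, t

def relative_point_error(pred_frames, gt_frames, s, t):
    """Mean of ||s * p + t - q|| / ||q|| over all points (assumed metric)."""
    errs = []
    for pred_frame, gt_frame in zip(pred_frames, gt_frames):
        for p, q in zip(pred_frame, gt_frame):
            aligned = [s * p[k] + t[k] for k in range(3)]
            errs.append(math.dist(aligned, q) / math.hypot(*q))
    return sum(errs) / len(errs)
```

Because the same `s` and `t` are applied to every frame, any frame-to-frame drift in scale shows up directly in the error instead of being absorbed by a per-frame alignment.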

Framework

Overview of SCOPE. Top-Left: SCOPE consists of a ViT backbone that processes the input video frames, followed by a temporal decoder with cross-attention and RoPE with dynamic NTK scaling, which produces scale-invariant point maps. Top-Right: Cross-frame geometric consistency is enforced at global and local geometric levels (G1, G2) to maintain structural coherence across frames. Bottom-Left: RoPE with dynamic NTK scaling extends the sequence context, using frequency scaling that adaptively weights dimensions based on the scale factor, and train-time sequence stretching that creates a virtual extended sequence from which positions are sampled. Bottom-Right: Hierarchical temporal consistency constraints are applied at multiple temporal strides (δ = 1, 2, 4, 8) to enforce smooth, consistent point map predictions across time.
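Two of the components named in the caption can be sketched in a few lines. The first function uses the dynamic NTK rule commonly applied to RoPE in language models, where the rotary base is enlarged once the sequence exceeds the training length; whether SCOPE uses exactly this rule is an assumption. The second function is one plausible form of a multi-stride temporal constraint: it compares temporal differences of a per-frame prediction with those of a target at strides δ = 1, 2, 4, 8. All function names, the `base` and `factor` defaults, and the use of scalar per-frame values are hypothetical choices for illustration.

```python
import math

def dynamic_ntk_inv_freq(dim, seq_len, train_len, base=10000.0, factor=1.0):
    """Inverse RoPE frequencies; the base grows once seq_len exceeds train_len."""
    if seq_len > train_len:
        # Dynamic NTK rule as commonly used for language models (assumed here).
        ratio = (factor * seq_len / train_len) - (factor - 1)
        base = base * ratio ** (dim / (dim - 2))
    return [1.0 / base ** (2 * i / dim) for i in range(dim // 2)]

def rotate_pair(x0, x1, pos, inv_freq):
    """Rotate one (x0, x1) feature pair by the angle pos * inv_freq."""
    angle = pos * inv_freq
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c

def multi_stride_loss(pred, target, strides=(1, 2, 4, 8)):
    """Mean absolute mismatch of temporal differences, averaged over strides.

    pred and target are per-frame scalars (e.g. one coordinate of one tracked
    point); this simplification is an assumption, not the paper's loss.
    """
    per_stride = []
    for d in strides:
        n = len(pred) - d
        if n <= 0:
            continue  # sequence too short for this stride
        diffs = [abs((pred[t + d] - pred[t]) - (target[t + d] - target[t]))
                 for t in range(n)]
        per_stride.append(sum(diffs) / n)
    return sum(per_stride) / len(per_stride) if per_stride else 0.0
```

With this rule the highest frequency (index 0) is left unchanged while lower frequencies are slowed, so nearby frames keep their resolution and distant positions stay within the range of angles seen during training.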

Comparison with VGGT on Open-World Videos

Comparison with MoGe on Open-World Videos

Portrait Video Processing

4D Scene Reconstruction

Long-Range Temporal Inference

BibTeX

@inproceedings{zhang2026scope,
  title     = {SCOPE: Scale-Consistent One-Pass Estimation of 3D Geometry},
  author    = {Zhang, Zheng and Yang, Lihe and Yang, Tianyu and Yu, Chaohui and Lao, Yixing and Guo, Xiaoyang and Gong, Biao and Wang, Fan and Zhao, Hengshuang},
  booktitle = {SIGGRAPH Conference Papers},
  year      = {2026}
}