SCOPE

Scale-Consistent One-Pass Estimation of 3D Geometry

Zheng Zhang¹, Lihe Yang¹, Tianyu Yang², Chaohui Yu², Yixing Lao¹, Xiaoyang Guo³, Biao Gong⁴, Fan Wang², Hengshuang Zhao¹

¹The University of Hong Kong ²Alibaba Group ³Horizon Robotics ⁴Ant Group

SIGGRAPH Conference Papers 2026


SCOPE generates temporally consistent and scale-invariant 3D geometry from monocular videos with superior accuracy across extended sequences.

Abstract

We present SCOPE (Scale-Consistent One-Pass Estimation of 3D Geometry), a novel approach for estimating 3D geometry from extended monocular video sequences, where existing methods struggle to maintain both geometric accuracy and temporal consistency across hundreds of frames. Our approach generates affine-invariant 3D point maps with shared parameters across entire sequences, enabling consistent scale-invariant representations. We introduce three key innovations: viewpoint-invariant geometry aligning multi-perspective points in a unified reference frame; appearance-invariant learning enforcing consistency across exponential timescales; and frequency-modulated positioning enabling extrapolation to sequences vastly exceeding training length. Experiments across diverse datasets demonstrate significant improvements, reducing relative point map error by 24.2% and temporal alignment error by 34.9% on ScanNet compared to state-of-the-art methods. Our approach handles challenging scenarios with complex camera trajectories and lighting variations while efficiently processing extended sequences in a single pass.
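To make "shared parameters across entire sequences" concrete, the sketch below fits a single scale and a single 3D shift to all points pooled from a sequence, then reports a relative point error. It is an illustration under stated assumptions, not the authors' evaluation code. The closed-form least-squares fit, the function names `fit_shared_scale_shift` and `relative_point_error`, and the error definition (aligned distance divided by ground-truth norm) are all hypothetical choices made here.

```python
from math import sqrt

def fit_shared_scale_shift(pred, gt):
    """Least-squares fit of ONE scale and ONE 3D shift for a whole sequence.

    pred, gt: equal-length lists of (x, y, z) tuples, pooled over all frames.
    Minimises sum ||scale * p + shift - g||^2 (assumed alignment objective).
    """
    n = len(pred)
    mean_p = [sum(p[k] for p in pred) / n for k in range(3)]
    mean_g = [sum(g[k] for g in gt) / n for k in range(3)]
    num = 0.0
    den = 0.0
    for p, g in zip(pred, gt):
        for k in range(3):
            dp = p[k] - mean_p[k]
            num += dp * (g[k] - mean_g[k])
            den += dp * dp
    scale = num / den
    shift = [mean_g[k] - scale * mean_p[k] for k in range(3)]
    return scale, shift

def relative_point_error(pred, gt, scale, shift):
    """Mean of ||aligned - g|| / ||g|| over all points (assumed metric form)."""
    total = 0.0
    for p, g in zip(pred, gt):
        aligned = [scale * p[k] + shift[k] for k in range(3)]
        err = sqrt(sum((aligned[k] - g[k]) ** 2 for k in range(3)))
        total += err / sqrt(sum(c * c for c in g))
    return total / len(pred)

# Toy sequence: ground truth is the prediction scaled by 2 and shifted.
pred_points = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (0.0, 1.0, 3.0), (1.0, 1.0, 4.0)]
true_shift = (1.0, -1.0, 0.5)
gt_points = [tuple(2.0 * p[k] + true_shift[k] for k in range(3)) for p in pred_points]

scale, shift = fit_shared_scale_shift(pred_points, gt_points)
error = relative_point_error(pred_points, gt_points, scale, shift)
```

Because the alignment is fitted once per sequence rather than once per frame, any frame-to-frame drift in predicted scale shows up as error instead of being absorbed by the fit.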

Framework

Overview of SCOPE. Top-Left: SCOPE consists of a ViT backbone that processes the input video frames, followed by a temporal decoder with cross-attention and RoPE with dynamic NTK scaling, which produces scale-invariant point maps. Top-Right: Cross-frame geometric consistency is enforced at the global and local geometric levels (G₁, G₂) to maintain structural coherence across frames. Bottom-Left: RoPE with dynamic NTK scaling extends the sequence context, using frequency scaling that adaptively weights dimensions based on the scale factor, and train-time sequence stretching that creates a virtual extended sequence from which positions are sampled. Bottom-Right: Hierarchical temporal consistency constraints are applied at multiple temporal strides (δ = 1, 2, 4, 8) to enforce smooth, consistent point map predictions across time.
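As a rough illustration of the dynamic NTK idea in the caption, the sketch below uses the widely used formulation in which the RoPE base grows once the sequence exceeds the training length. This may differ from the exact scaling SCOPE applies. The function names, the default base of 10000, and the `factor` parameter are assumptions for illustration only.

```python
def rope_inv_freq(dim, base=10000.0):
    """Standard RoPE inverse frequencies, one per pair of channels."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

def dynamic_ntk_base(seq_len, train_len, dim, base=10000.0, factor=1.0):
    """Enlarge the RoPE base only when seq_len exceeds the training length.

    Common dynamic-NTK rule (assumed here, not taken from the paper):
        base' = base * (factor * seq_len / train_len - (factor - 1)) ** (dim / (dim - 2))
    """
    if seq_len <= train_len:
        return base
    ratio = factor * seq_len / train_len - (factor - 1.0)
    return base * ratio ** (dim / (dim - 2))

dim = 64          # per-head rotary dimension (hypothetical)
train_len = 32    # frames seen per sequence during training (hypothetical)
test_len = 128    # frames at inference, 4x longer than training

base_short = dynamic_ntk_base(train_len, train_len, dim)
base_long = dynamic_ntk_base(test_len, train_len, dim)

freq_short = rope_inv_freq(dim, base_short)
freq_long = rope_inv_freq(dim, base_long)
```

With `factor=1.0`, the highest-frequency channel is left unchanged while the lowest-frequency channel is slowed by the length ratio (4x here), so nearby-frame resolution is preserved and long-range positions stay within the rotation range seen in training.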