ECCV 2026 · Malmö, Sweden

ICDepth: Taming Video Diffusion Models for Video Depth Estimation via In-Context Conditioning

Xuanhua He^* Jiaxin Xie^* Mingzhe Zheng Qifeng Chen^†

The Hong Kong University of Science and Technology, Hong Kong SAR, China

^* Equal contribution. ^† Corresponding author.

ICDepth overview: robust depth estimation across diverse scenarios

Robust depth estimation across diverse scenarios — fog, night, underwater, animation, and high-resolution video.

Abstract

Monocular video depth estimation requires temporal consistency, geometric accuracy, and generalization across diverse scenarios — yet existing methods struggle to achieve all three simultaneously. Discriminative models excel at per-frame accuracy but suffer from temporal drift due to limited context windows, while generative methods improve consistency and generalization at the cost of extensive training data (10M+ samples) and lack of geometric precision.

We introduce ICDepth, a framework that adapts pre-trained text-to-video diffusion transformers for video depth estimation via In-Context Conditioning (ICC), leveraging their rich spatial-temporal priors. To address key challenges in transferring ICC from generation to dense prediction, we propose:

SAND-Attention (Spatial-temporal Aligned, Noise-Decoupled Attention), which ensures precise spatial-temporal alignment via shared RoPE and enforces unidirectional attention to prevent noise contamination.
SRFM (Semantic- and Resolution-Aware Feature Modulation), which injects DINOv2 semantic and resolution priors to enhance geometric precision.

ICDepth achieves state-of-the-art results on multiple benchmarks with remarkable data efficiency, trained on only 0.8M frames (6–13× less than competing generative methods), while demonstrating strong zero-shot generalization to diverse domains.

Method

ICDepth builds upon the Wan 2.1 VDiT and treats RGB and depth as a unified token sequence processed by native attention.

SAND-Attention

Establishes precise spatial-temporal alignment between clean RGB conditions and noisy depth latents through Shared RoPE, and enforces Unidirectional Attention so noisy depth tokens can query clean RGB tokens while blocking the reverse flow.

SRFM

Injects powerful external priors into the generative model by leveraging DINOv2 features for semantic understanding and Resolution Embeddings to support multi-scale inference up to 1080p.

Qualitative Comparison

Comparison against DepthCrafter, Depth Any Video, and Video Depth Anything. Click tabs to change scene type; use arrows to browse multiple cases per scene.

Night
Foggy
Underwater
3D Games
Animation
FHD Video
Vertical

Case 1 / 2

Quantitative Results

Table 1. Zero-shot video depth estimation on Sintel, ScanNet, KITTI, and Bonn. Red bold = best, underline = second best. Metrics follow the official DepthCrafter evaluation protocol.

Method	Sintel (50 frames)		ScanNet (90 frames)		KITTI (110 frames)		Bonn (110 frames)		Venue	Data Size
Method	AbsRel ↓	δ₁ ↑	AbsRel ↓	δ₁ ↑	AbsRel ↓	δ₁ ↑	AbsRel ↓	δ₁ ↑	Venue	Data Size
Depth Anything V2	0.403	0.547	0.123	0.852	0.102	0.910	0.084	0.947	NeurIPS'24	62.62M
Video Depth Anything	0.383	0.629	0.075	0.954	0.078	0.950	0.053	0.975	CVPR'25	1.35M
ChronoDepth	0.587	0.486	0.159	0.783	0.167	0.759	0.100	0.911	CVPR'25	—
DepthCrafter	0.313	0.680	0.142	0.803	0.105	0.898	0.066	0.971	CVPR'25	10.5M
Depth Any Video	0.300	0.643	0.119	0.865	0.098	0.925	0.063	0.963	ICLR'25	6M
ICDepth (Ours)	0.250	0.749	0.076	0.952	0.061	0.968	0.053	0.979	—	0.8M

Ablation Study

Table 2. Component analysis on the Sintel dataset.

Method Variant	AbsRel ↓	RMSE ↓	δ₁ ↑
Full model (Ours)	0.250	5.155	0.749
In-Context Conditioning (ICC)
Replace with channel concat	0.367	5.956	0.654
SAND-Attention
Replace with full attention	0.413	6.078	0.443
w/o RoPE Alignment	0.410	5.999	0.450
w/o Decoupled Attention	0.262	5.196	0.710
Semantic-Resolution Aware Feature Modulation (SRFM)
w/o SRFM	0.306	6.325	0.696
w/o DINOv2 features	0.269	5.730	0.709
w/o Resolution Embedding	0.264	5.565	0.728

Visual comparison of ablation variants on KITTI.

KITTI Case 1

KITTI Case 2

BibTeX

@inproceedings{he2026icdepth,
  title     = {ICDepth: Taming Video Diffusion Models for Video Depth Estimation via In-Context Conditioning},
  author    = {He, Xuanhua and Xie, Jiaxin and Zheng, Mingzhe and Chen, Qifeng},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}