Overview of DepthCues
We introduce a new benchmark, DepthCues, consisting of six depth-related tasks, each designed to test a given model's ability to estimate a visual cue that is ubiquitous in human depth perception.
Large-scale pre-trained vision models are becoming increasingly prevalent, offering expressive and generalizable visual representations that benefit various downstream tasks. Recent studies of the emergent properties of these models have revealed their high-level geometric understanding, particularly in the context of depth perception. However, it remains unclear how depth perception arises in these models without explicit depth supervision during pre-training. To investigate this, we examine whether monocular depth cues, similar to those used by the human visual system, emerge in these models. We introduce a new benchmark, DepthCues, designed to evaluate depth cue understanding, and present findings across 20 diverse and representative pre-trained vision models. Our analysis shows that human-like depth cues emerge in more recent, larger models. We also explore enhancing depth perception in large vision models by fine-tuning on DepthCues, and find that even without dense depth supervision, this improves depth estimation. To support further research on depth perception in vision models, our benchmark and evaluation code will be made publicly available.
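To make the evaluation setup concrete, the sketch below shows one common way to probe a frozen pre-trained backbone for a depth-cue decision: features are extracted without gradients, and only a lightweight linear head is trained. This is a minimal illustrative sketch, not the paper's exact protocol; the DINOv2 backbone, the binary "which is closer" label format, and the dummy data are all assumptions made for the example.

# Illustrative linear-probing sketch for a binary depth-cue task.
# The backbone choice and the dummy data are placeholders, not the paper's exact setup.
import torch
import torch.nn as nn

# Frozen pre-trained backbone (DINOv2 ViT-S/14 via torch.hub, as an example).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

probe = nn.Linear(backbone.embed_dim, 2)  # linear head, e.g. "which object is closer"
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 224, 224)  # placeholder batch; real runs load benchmark images
labels = torch.randint(0, 2, (8,))    # placeholder cue labels

for step in range(100):
    with torch.no_grad():
        feats = backbone(images)      # (B, embed_dim) global image features
    loss = criterion(probe(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Because the backbone stays frozen, any probe accuracy above the trivial baseline reflects information already present in the pre-trained representation rather than capability learned during probing.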
DepthCues Benchmark Results. We evaluate 20 vision models with diverse pre-training settings (indicated by color) on the DepthCues benchmark, which assesses six monocular depth cues (one per row) that are ubiquitous in human vision. Models are ranked by their average performance across the six cues. We include an end-to-end trained baseline (blue dotted line) as an oracle and a trivial baseline (red dotted line) to mark floor performance. Depth estimation linear-probing results on NYUv2 are shown in the bottom row.
A strong correlation is observed between depth cue understanding and depth estimation performance.
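Such a correlation can be summarized with a rank-correlation coefficient across models. The snippet below is a hypothetical illustration: the score lists are made-up placeholders, and the choice of Spearman correlation is an assumption for the example rather than the statistic reported in the paper.

# Hypothetical illustration: correlate per-model DepthCues scores with
# NYUv2 depth-probe scores. All numbers below are made-up placeholders.
from scipy.stats import spearmanr

depthcues_avg = [0.62, 0.55, 0.71, 0.48, 0.66]  # placeholder average cue scores per model
nyuv2_probe = [0.58, 0.50, 0.69, 0.45, 0.64]    # placeholder depth probe scores per model

rho, pval = spearmanr(depthcues_avg, nyuv2_probe)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")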
Failure cases of the top five vision models on DepthCues. Each column shows two examples for a given depth cue.
@article{danier2024depthcues,
  author  = {Danier, Duolikun and Aygün, Mehmet and Li, Changjian and Bilen, Hakan and Mac Aodha, Oisin},
  title   = {DepthCues: Evaluating Monocular Depth Perception in Large Vision Models},
  journal = {arXiv preprint arXiv:2411.17385},
  year    = {2024},
}
Funding was provided by ELIAI (the Edinburgh Laboratory for Integrated Artificial Intelligence) and the EPSRC (grant no. EP/W002876/1).