Unsupervised object discovery, the task of identifying and localizing objects in images without human-annotated labels, remains a significant challenge and a growing focus in computer vision. DADO (Depth-Attention self-supervised technique for Discovering unseen Objects) combines an attention mechanism and a depth model to identify potential objects in images in a fully unsupervised manner.
To address challenges such as noisy attention maps or complex scenes with varying depth planes, DADO employs dynamic weighting to adaptively emphasize attention or depth features based on the global characteristics of each image. We evaluated DADO on standard benchmarks, where it outperforms state-of-the-art methods in object discovery accuracy and robustness without the need for fine-tuning.
DADO leverages self-supervised learning (SSL) to extract attention features and depth estimation techniques to capture structural features of the scene. Our pipeline takes a single RGB input image and computes both a depth estimate and attention features. It then segments the scene into discrete depth layers and, in parallel, constructs a global attention map. By combining each depth layer with the attention map, DADO isolates candidate objects at varying depth ranges.
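The layer-and-attention combination described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the equal-width depth binning, and the fixed attention threshold are all assumptions made for illustration, and the depth and attention maps are taken as given (plain nested lists of the same shape).

```python
from typing import List

Grid = List[List[float]]
Mask = List[List[bool]]


def split_into_depth_layers(depth: Grid, num_layers: int) -> List[Mask]:
    """Quantize a depth map into equal-width bins, one boolean mask per bin.

    Equal-width binning is an assumption; the source only says the scene
    is segmented into discrete depth layers.
    """
    flat = [d for row in depth for d in row]
    lo, hi = min(flat), max(flat)
    # Fall back to a width of 1.0 when the depth map is completely flat.
    width = (hi - lo) / num_layers or 1.0
    masks = []
    for k in range(num_layers):
        mask = [
            [min(int((d - lo) / width), num_layers - 1) == k for d in row]
            for row in depth
        ]
        masks.append(mask)
    return masks


def candidate_masks(depth: Grid, attention: Grid,
                    num_layers: int = 3,
                    attn_threshold: float = 0.5) -> List[Mask]:
    """Keep, per depth layer, the pixels whose attention reaches the threshold."""
    candidates = []
    for layer in split_into_depth_layers(depth, num_layers):
        candidates.append([
            [in_layer and a >= attn_threshold
             for in_layer, a in zip(layer_row, attn_row)]
            for layer_row, attn_row in zip(layer, attention)
        ])
    return candidates
```

Each returned mask marks the attended pixels inside one depth range, so two salient objects at different distances end up in different masks even if they overlap in the image.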
Datasets: To facilitate a comparable evaluation, our work uses Pascal VOC 2007 (VOC07) and VOC 2012 (VOC12).
Evaluation metrics: We evaluate our unsupervised object discovery method using two complementary metrics: Correct Localization (CorLoc) and object discovery Average Precision (odAP). CorLoc measures the percentage of images where the dominant object is correctly localized, based on an Intersection over Union (IoU) threshold of 0.5. It only assesses whether at least one object is detected per image, without considering multiple object retrieval. To address this limitation, we also report results using odAP, which extends standard Average Precision to the unsupervised setting. odAP evaluates both the accuracy of object localization and the ability to retrieve all relevant objects, summarizing performance via precision-recall curves. We compare our method against state-of-the-art approaches: MOST, LOST, and TokenCut for CorLoc, and MOST, rOSD, and LOD for odAP.
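As a concrete reading of the CorLoc definition, the sketch below computes IoU between axis-aligned boxes and counts an image as correct when its single predicted box overlaps any ground-truth box by more than the threshold. The box format `(x_min, y_min, x_max, y_max)`, the function names, and the strict comparison are assumptions; this is not the evaluation code used for the reported numbers.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(predictions: Sequence[Box],
           ground_truths: Sequence[List[Box]],
           threshold: float = 0.5) -> float:
    """Percentage of images whose predicted box matches any ground-truth box.

    predictions[i] is the single box predicted for image i;
    ground_truths[i] is the list of annotated boxes for image i.
    """
    hits = sum(
        1
        for pred, gts in zip(predictions, ground_truths)
        if any(iou(pred, gt) > threshold for gt in gts)
    )
    return 100.0 * hits / len(predictions)
```

Because only one match per image is required, CorLoc cannot distinguish a method that finds every object from one that finds just the most salient one, which is why odAP is reported alongside it.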
Results: DADO achieves competitive results in both single-object discovery and multi-object discovery. Table 1 shows the performance of our method, with an odAP of 6.2 on VOC07 and 5.9 on VOC12. For CorLoc, DADO achieves 78.3% on VOC07 and 74.2% on VOC12, as shown in Table 2.
| Method | Year | VOC07 | VOC12 |
|---|---|---|---|
| rOSD | 2020 | 4.3 | 5.27 |
| LOD | 2021 | 4.5 | 5.34 |
| MOST | 2023 | 6.4 | – |
| DADO (ours) | 2025 | 6.2 | 5.9 |
Table 1: Object Discovery Average Precision (odAP) evaluation on Multiple Object Localization.
| Method | Year | VOC07 | VOC12 |
|---|---|---|---|
| rOSD | 2020 | 54.5 | 55.3 |
| LOD | 2021 | 53.6 | 55.1 |
| LOST | 2021 | 61.9 | 64.0 |
| TokenCut | 2023 | 68.8 | 72.1 |
| MOST | 2023 | 74.8 | 77.4 |
| DADO (ours) | 2025 | 78.3 | 74.2 |
Table 2: CorLoc performance comparison of DADO with related work.
One of the major challenges in object discovery is the absence of labels, which also implies a lack of semantics. Although the features obtained from models like DINO are of high quality, the lack of semantic information complicates the distinction of individual objects. Objects can appear in images as independent and isolated entities, as parts of larger composite objects, adjacent to others with no space in between, or in front of or behind one another, leading to occlusions at different scales.
(a) Independent and isolated objects are effectively discovered by both attention mechanisms and depth cues.
(b) Objects positioned in front of or behind others can be accurately separated using depth layers; in such cases, attention provides limited additional information.
(c) Composite objects, such as the horse and rider, are very difficult to separate when they lie on the same plane; this is the main weakness of our model.
(d) Pascal VOC ground truth does not include the 'goat' class, but contains 'sheep'. DADO finds instances of both objects.
(e) Separating two objects that are adjacent and on the same plane is particularly challenging for our model.
The figure above shows examples of these situations and how DADO leverages attention and depth to resolve them. Note that our method successfully discovers both large and small objects, whether they are in the foreground or further in the background, and is also capable of separating them when they are overlapping (as in the case of the cows). However, it may also produce false positives in some situations, as shown in the fifth row (two people and a train).
Our approach, which incorporates depth layers, allows us to address some of these challenges, particularly occlusions. The core idea is that by leveraging depth information, we can achieve better separation of objects. Depth layers can help identify object boundaries and contours. This leads to more accurate object representation in the image, which ultimately improves feature quality and model performance.
Our findings reveal that object-centric images tend to produce attention maps of sufficient quality to detect unseen objects. In contrast, complex scenes containing important non-centric objects typically yield noisy and less discriminative attention maps. This noise arises because attention mechanisms prioritize the most salient visual features, which in cluttered environments may belong to background textures or secondary objects. We observed that in such challenging cases, depth representations can significantly improve localization performance.
Depth cues provide spatial information that helps distinguish foreground objects from the background, even when visual attention alone is ambiguous. Separating the image into distinct depth planes and applying attention to each plane allows for the individualization of overlapping objects.
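Once a depth plane has been masked by attention, the remaining pixels still have to be turned into object proposals. The source does not state how DADO does this; one common option, shown here purely as an assumption, is 4-connected component labeling with one bounding box per component. The function name and the pixel-inclusive box format are hypothetical.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), pixel-inclusive


def boxes_from_mask(mask: List[List[bool]]) -> List[Box]:
    """Return one bounding box per 4-connected region of True pixels."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cy, cx = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((x0, y0, x1, y1))
    return boxes
```

Running this separately on each depth plane yields separate boxes for objects that overlap in the image but sit at different depths. Objects that touch within the same plane merge into one component, which matches the limitation noted for cases (c) and (e).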
By employing dynamic weights that adapt based on either attention features or depth layer representations, we further improve model quality. A higher number of objects in an image typically results in a more dispersed attention map, making object identification harder. Dynamic weights allow the model to adjust its focus on attention or depth.
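To make the idea of dispersion-driven weighting concrete, here is a minimal sketch under stated assumptions. It measures how spread out the attention map is with normalized Shannon entropy and shifts weight from attention to depth as dispersion grows. The entropy measure, the linear weighting rule, and all function names are hypothetical; the source does not give DADO's actual weighting formula.

```python
import math
from typing import List, Tuple

Grid = List[List[float]]


def attention_dispersion(attention: Grid) -> float:
    """Normalized Shannon entropy of the attention map, in [0, 1].

    Values near 0 mean attention is concentrated on few pixels;
    values near 1 mean it is spread almost uniformly.
    """
    flat = [max(a, 0.0) for row in attention for a in row]
    total = sum(flat)
    if total == 0 or len(flat) < 2:
        # No usable attention signal: treat the map as maximally dispersed.
        return 1.0
    entropy = -sum((a / total) * math.log(a / total) for a in flat if a > 0)
    return entropy / math.log(len(flat))


def dynamic_weights(attention: Grid) -> Tuple[float, float]:
    """Return (w_attention, w_depth); the two weights sum to 1."""
    dispersion = attention_dispersion(attention)
    return 1.0 - dispersion, dispersion


def fuse(attention_score: float, depth_score: float, attention: Grid) -> float:
    """Blend a candidate's attention and depth scores with image-level weights."""
    w_attention, w_depth = dynamic_weights(attention)
    return w_attention * attention_score + w_depth * depth_score
```

With this rule, an object-centric image with a sharply peaked attention map relies mostly on attention, while a cluttered scene with diffuse attention leans on the depth layers instead.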
These findings suggest that combining depth information with attention-based discovery methods could be a promising direction for improving unsupervised object discovery in complex real-world scenes.
Limitations and Future Work
Our method has difficulty separating adjacent objects that lie in the same plane and lack visible gaps.
Tokenization-based approaches, such as TokenCut and MOST, handle such cases more effectively. Additionally, low-level, non-semantic techniques, such as edge detection, super-pixel grouping, and optimization-based methods such as graph cuts or watershed algorithms, can provide useful boundary cues without requiring semantic information. In future work, we aim to incorporate such strategies to improve performance in scenes with tightly packed objects.
The paper is available here.
@inproceedings{dado-caip-2025,
title={DADO: A Depth-Attention framework for Object Discovery},
author={Gonzalez, Federico and Talavera, Estefania and Radeva, Petia},
booktitle={Computer Analysis of Images and Patterns (CAIP)},
year={2025},
organization={Springer}
}

The code is available in the code folder and can be installed as follows:
For any questions, suggestions, or issues, please send an email to fedegonzal@gmail.com
