IROS Conference 2024 Conference Paper
Cross-Architecture Auxiliary Feature Space Translation for Efficient Few-Shot Personalized Object Detection
- Francesco Barbato
- Umberto Michieli
- Jijoong Moon
- Pietro Zanuttigh
- Mete Ozay
Recent years have seen robotic object detection systems deployed in several personal devices (e.g., home robots and appliances). This has highlighted a challenge in their design, i.e., they cannot efficiently update their knowledge to distinguish between general classes and user-specific instances (e.g., a dog vs. the user's dog). We refer to this challenging task as Instance-level Personalized Object Detection (IPOD). The personalization task requires many samples for model tuning and optimization in a centralized server, raising privacy concerns. An alternative is provided by approaches based on recent large-scale Foundation Models, but their compute costs preclude on-device applications. In our work we tackle both problems at the same time, designing a Few-Shot IPOD strategy called AuXFT. We introduce a conditional coarse-to-fine few-shot learner to refine the coarse predictions made by an efficient object detector, and show that using an off-the-shelf model leads to poor personalization due to neural collapse. Therefore, we introduce a Translator block that generates an auxiliary feature space where features generated by a self-supervised model (e.g., DINOv2) are distilled without impacting the performance of the detector. We validate AuXFT on three publicly available datasets and one in-house benchmark designed for the IPOD task, achieving remarkable gains in all considered scenarios with an excellent time-complexity trade-off: AuXFT reaches 80% of its upper-bound performance at just 32% of the inference time, 13% of the VRAM and 19% of the model size.
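To make the coarse-to-fine idea concrete, the following is a minimal illustrative sketch, not the AuXFT implementation. It assumes a nearest-prototype few-shot learner with cosine similarity, conditioned on the coarse class predicted by the detector; the abstract does not specify the learner, so the class name `CoarseToFinePrototypes`, its methods, the similarity threshold, and the toy feature vectors are all hypothetical. In the method described above, the features would come from the auxiliary feature space produced by the Translator block; here plain Python lists stand in for them.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class CoarseToFinePrototypes:
    """Hypothetical conditional few-shot learner (an assumption, not AuXFT itself).

    Personal instances are stored per coarse class, so a query detected as
    "dog" is only ever compared against the user's dog prototypes.
    """

    def __init__(self, threshold=0.5):
        # Minimum similarity required to override the coarse label (assumed value).
        self.threshold = threshold
        # coarse class -> personal label -> list of support features
        self._support = defaultdict(lambda: defaultdict(list))

    def add_shot(self, coarse, personal, feature):
        """Register one labelled support sample (one 'shot')."""
        self._support[coarse][personal].append(list(feature))

    @staticmethod
    def _prototype(features):
        """Mean of the support features for one personal instance."""
        count = len(features)
        return [sum(column) / count for column in zip(*features)]

    def refine(self, coarse, feature):
        """Return a personal label if one matches well enough, else the coarse label."""
        best_label, best_sim = coarse, self.threshold
        for personal, features in self._support.get(coarse, {}).items():
            sim = cosine(feature, self._prototype(features))
            if sim > best_sim:
                best_label, best_sim = personal, sim
        return best_label


learner = CoarseToFinePrototypes(threshold=0.8)
learner.add_shot("dog", "my_dog", [1.0, 0.0, 0.1])
learner.add_shot("dog", "my_dog", [0.9, 0.1, 0.0])

print(learner.refine("dog", [0.95, 0.05, 0.05]))  # close to the stored prototype
print(learner.refine("dog", [0.0, 1.0, 0.0]))     # a different dog
print(learner.refine("cat", [0.95, 0.05, 0.05]))  # no personal cats registered
```

In this toy setup, the first query is refined to the personal label, while the other two fall back to the coarse detector label. Conditioning on the coarse class keeps the comparison set small, which is one plausible reason such a refinement step can stay cheap enough for on-device use.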