ICRA 2025 Conference Paper
Cohere3D: Exploiting Temporal Coherence for Unsupervised Representation Learning of Vision-Based Autonomous Driving
- Yichen Xie
- Hongge Chen
- Gregory P. Meyer
- Yong Jae Lee
- Eric M. Wolff
- Masayoshi Tomizuka
- Wei Zhan
- Yuning Chai
Multi-frame temporal inputs are important for vision-based autonomous driving. Observations from different angles enable the recovery of 3D object states from 2D images, as long as we can identify the same instance across input frames. However, the dynamic nature of driving scenes leads to significant variance in instance appearance and shape captured by the cameras at different time steps. To this end, we propose a novel contrastive learning algorithm, Cohere3D, to learn coherent instance representations that are robust to changes in distance and perspective over a long-term temporal sequence, without any human annotations. In the pretraining stage, raw point clouds from LiDAR sensors are used to construct instance-wise long-term temporal correspondences, which guide the extraction of instance-level representations from the vision-based bird's-eye-view (BEV) feature map. Cohere3D encourages consistent representations for the same instance across frames while distinguishing between different instances. We validate the effectiveness and generalizability of our algorithm by finetuning the pretrained model on key downstream autonomous driving tasks: perception, mapping, prediction, and planning. Results show notable improvements in both data efficiency and final performance across all these tasks.
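The paper does not publish its loss in this abstract, but the stated objective (pull the same instance together across frames, push different instances apart) is the standard InfoNCE contrastive formulation. The sketch below is a hypothetical minimal version of such a loss, assuming each row of `feats_t1` and `feats_t2` holds the embedding of the same instance at two different time steps; the function name and temperature value are illustrative, not from the paper.

```python
import numpy as np

def instance_contrastive_loss(feats_t1, feats_t2, temperature=0.1):
    """InfoNCE-style loss: row i of feats_t1 and row i of feats_t2 are
    the same instance at two frames (positive pair); all other rows are
    negatives. Inputs are (N instances x D dims) embedding matrices."""
    # L2-normalize so similarity is cosine similarity
    a = feats_t1 / np.linalg.norm(feats_t1, axis=1, keepdims=True)
    b = feats_t2 / np.linalg.norm(feats_t2, axis=1, keepdims=True)
    logits = a @ b.T / temperature          # pairwise similarity matrix
    # numerically stable log-softmax over each row; the diagonal entries
    # (matching instances) are the targets
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

When the two views of each instance agree (diagonal similarities dominate), the loss approaches zero; for unrelated embeddings it sits near `log(N)`, the entropy of a uniform guess over N instances.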