Arrow Research

Author name cluster

Zhe Ren

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

2 papers
1 author row

Possible papers (2)

AAAI Conference 2026 (Conference Paper)

CoT-VLNBench: A Benchmark for Visual Chain-of-Thought Reasoning in Vision-Language-Navigation Robots

  • Xiao Zhao
  • Chang Liu
  • Ruiteng Ji
  • Zheyuan Zhang
  • Mingxu Zhu
  • Linna Song
  • Zhe Ren
  • Luo Qingliang

Recent advances in vision-language models (VLMs) have demonstrated remarkable potential in embodied navigation tasks. However, existing robot-centric datasets primarily focus on traditional 3D tasks such as perception and prediction, and lack adequate support for vision-language tasks. Vision-language navigation (VLN) is a key capability for achieving human-like, interpretable navigation in complex environments. In this study, we present CoT-VLNBench, the first large-scale benchmark and dataset designed for chain-of-thought (CoT) reasoning in quadruped robot navigation. Our dataset encompasses a diverse range of indoor and outdoor scenes, multi-step navigation trajectories, and rich natural language instructions, all annotated with fine-grained CoT reasoning traces. Specifically, it contains 175K frames, 5.25M 3D bounding boxes, and 875K vision-question-answer (VQA) pairs. This comprehensive resource enables thorough evaluation of embodied agents' perceptual and step-by-step reasoning abilities. Furthermore, we propose CoT-VLN, a state-of-the-art 7B VLN model that integrates visual, linguistic, and reasoning modules to facilitate interpretable and effective navigation. Extensive experiments demonstrate that our approach significantly outperforms existing non-VLM baselines on the new benchmark, underscoring the importance of CoT reasoning in embodied navigation. We hope that CoT-VLNBench will serve as a valuable resource to advance research at the intersection of robotics, vision, language, and reasoning.

AAAI Conference 2017 (Conference Paper)

Unsupervised Deep Learning for Optical Flow Estimation

  • Zhe Ren
  • Junchi Yan
  • Bingbing Ni
  • Bin Liu
  • Xiaokang Yang
  • Hongyuan Zha

Recent work has shown that optical flow estimation can be formulated as a supervised learning problem, and convolutional networks have been successfully applied to this task. However, supervised flow learning is hampered by the shortage of labeled training data; as a consequence, existing methods have to turn to large synthetic datasets whose ground truth can be generated easily by computer. In this work, we explore whether a deep network for flow estimation can be trained without supervision. Using image warping by the estimated flow, we devise a simple yet effective unsupervised method for learning optical flow that directly minimizes a photometric consistency loss. We demonstrate that a flow network can be trained end-to-end with our unsupervised scheme. In some cases, our results come tantalizingly close to the performance of methods trained with full supervision.
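
The core recipe described in this abstract, warping the second frame back toward the first with the estimated flow and penalizing the photometric difference, can be sketched in a few lines. The PyTorch snippet below is a minimal illustration of that idea, not the authors' implementation: the helper names `warp_with_flow` and `photometric_loss` are hypothetical, and the smoothness regularization and training details of the actual paper are omitted.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(img2, flow):
    """Warp the second frame toward the first using the estimated flow.

    img2: (B, C, H, W) second frame
    flow: (B, 2, H, W) flow in pixels; channel 0 is the x offset, channel 1 the y offset
    """
    b, _, h, w = flow.shape
    # Base grid of pixel coordinates for every location in the first frame.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=flow.dtype, device=flow.device),
        torch.arange(w, dtype=flow.dtype, device=flow.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0).expand(b, -1, -1, -1)
    # Displace each pixel by the flow, then normalize coordinates to [-1, 1]
    # as required by grid_sample.
    coords = grid + flow
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(img2, sample_grid, align_corners=True, padding_mode="border")

def photometric_loss(img1, img2, flow):
    """L1 photometric error between frame 1 and the flow-warped frame 2."""
    return (img1 - warp_with_flow(img2, flow)).abs().mean()
```

Because the warping is differentiable, a loss of this form can be backpropagated through a flow network directly, which is what makes the end-to-end unsupervised training described in the abstract possible.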