Arrow Research search

Author name cluster

Zhao Cao

Papers possibly associated with this exact author name in Arrow. This page groups case-insensitive exact name matches; it is not a full identity-disambiguation profile.

4 papers
1 author row

Possible papers (4)

NeurIPS 2025 Conference Paper

MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval

  • Huaying Yuan
  • Jian Ni
  • Zheng Liu
  • Yueze Wang
  • Junjie Zhou
  • Zhengyang Liang
  • Bo Zhao
  • Zhao Cao

Accurately locating key moments within long videos is crucial for solving long video understanding (LVU) tasks. However, existing benchmarks are either severely limited in terms of video length and task diversity, or they focus solely on end-to-end LVU performance, making them inappropriate for evaluating whether key moments can be accurately accessed. To address this challenge, we propose MomentSeeker, a novel benchmark for long-video moment retrieval (LVMR), distinguished by the following features. First, it is created from long and diverse videos, averaging over 1,200 seconds in duration and collected from various domains, e.g., movie, anomaly, egocentric, and sports. Second, it covers a variety of real-world scenarios at three levels: global-level, event-level, and object-level, spanning common tasks like action recognition, object localization, and causal reasoning. Third, it incorporates rich forms of queries, including text-only queries, image-conditioned queries, and video-conditioned queries. On top of MomentSeeker, we conduct comprehensive experiments for both generation-based approaches (directly using MLLMs) and retrieval-based approaches (leveraging video retrievers). Our results reveal the significant challenges in long-video moment retrieval in terms of accuracy and efficiency, despite improvements from the latest long-video MLLMs and task-specific fine-tuning. We have publicly released MomentSeeker to facilitate future research in this area.
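For illustration only, here is a minimal sketch of how a retrieval-based approach to moment retrieval might be scored: rank candidate clips by cosine similarity to a query embedding and report recall@k. The function name, embedding dimensions, and random placeholder data are assumptions, not the benchmark's official evaluation protocol.

```python
import numpy as np

def recall_at_k(query_embs, clip_embs, gold_clip_ids, k=5):
    """Rank candidate clips by cosine similarity to each query and return the
    fraction of queries whose annotated moment appears in the top-k results."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    c = clip_embs / np.linalg.norm(clip_embs, axis=1, keepdims=True)
    scores = q @ c.T                                  # (num_queries, num_clips)
    topk = np.argsort(-scores, axis=1)[:, :k]         # indices of the k best clips
    hits = [gold in row for gold, row in zip(gold_clip_ids, topk)]
    return float(np.mean(hits))

# Toy usage with random vectors standing in for a video retriever's embeddings.
rng = np.random.default_rng(0)
queries = rng.normal(size=(8, 128))    # text-, image-, or video-conditioned queries
clips = rng.normal(size=(200, 128))    # candidate moments cut from a long video
gold = rng.integers(0, 200, size=8)    # annotated target moment per query
print(recall_at_k(queries, clips, gold, k=5))
```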

AAAI 2024 Conference Paper

Combining Multiple Supervision for Robust Zero-Shot Dense Retrieval

  • Yan Fang
  • Qingyao Ai
  • Jingtao Zhan
  • Yiqun Liu
  • Xiaolong Wu
  • Zhao Cao

Recently, dense retrieval (DR) models, which represent queries and documents with fixed-width vectors and retrieve relevant ones via nearest neighbor search, have drawn increasing attention from the IR community. However, previous studies have shown that the effectiveness of DR critically relies on sufficient training signals, which leads to severe performance degradation when applied in out-of-domain scenarios, where large-scale training data are usually unavailable. To solve this problem, existing studies adopt a data-augmentation-plus-joint-training paradigm to construct weak/pseudo supervision on the target domain and combine it with the large-scale human-annotated data on the source domain to train the DR models. However, they do not explicitly distinguish the data and the supervision signals in the training process and simply assume that the DR models are mighty enough to capture and memorize different domain knowledge and relevance matching patterns without guidance, which, as shown in this paper, is not true. Based on this observation, we propose a Robust Multi-Supervision Combining strategy (RMSC) that decouples the domain and supervision signals by explicitly telling the DR models how the domain data and supervision signals are combined in the training data with specially designed soft tokens. With the extra soft tokens to store the domain-specific and supervision-specific knowledge, RMSC allows the DR models to conduct retrieval based on human-like relevance matching patterns and target-specific language distribution on the target domain without human annotations. Extensive experiments on zero-shot DR benchmarks show that RMSC significantly improves the ranking performance on the target domain compared to strong DR baselines and domain adaptation methods, while remaining stable during training and being compatible with query generation or second-stage pre-training.
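As a rough illustration of the soft-token idea described above, the sketch below prepends learnable domain- and supervision-specific embeddings to an input sequence before encoding. The class name, hidden size, and number of domains/supervision types are assumptions for demonstration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SoftTokenPrefix(nn.Module):
    """Learnable soft tokens, selected by (domain id, supervision id), prepended
    to a sequence of input embeddings so that domain- and supervision-specific
    knowledge can be stored in dedicated parameters."""
    def __init__(self, hidden_dim, num_domains, num_supervisions):
        super().__init__()
        self.domain_tokens = nn.Embedding(num_domains, hidden_dim)
        self.supervision_tokens = nn.Embedding(num_supervisions, hidden_dim)

    def forward(self, token_embs, domain_id, supervision_id):
        # token_embs: (batch, seq_len, hidden_dim); ids: (batch,) long tensors
        d = self.domain_tokens(domain_id).unsqueeze(1)        # (batch, 1, hidden)
        s = self.supervision_tokens(supervision_id).unsqueeze(1)
        return torch.cat([d, s, token_embs], dim=1)           # two extra positions

# Toy usage: two sequences from different domains, same supervision type.
prefix = SoftTokenPrefix(hidden_dim=768, num_domains=4, num_supervisions=3)
embs = torch.randn(2, 16, 768)
out = prefix(embs, torch.tensor([0, 2]), torch.tensor([1, 1]))
print(out.shape)  # torch.Size([2, 18, 768])
```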

AAAI 2023 Conference Paper

Learning from the Wisdom of Crowds: Exploiting Similar Sessions for Session Search

  • Yuhang Ye
  • Zhonghua Li
  • Zhicheng Dou
  • Yutao Zhu
  • Changwang Zhang
  • Shangquan Wu
  • Zhao Cao

Search engines are essential internet services, enabling users to efficiently find the information they need. Session search employs users' session logs of queries to solve complex retrieval tasks, in which users search multiple times until the documents of interest are found. Most existing session search models focus on the contextual information within the current search, ignoring evidence from historical search sessions. Considering that many ongoing retrieval tasks have likely already been carried out by other users with similar intents, we argue that historical sessions with similar intents can help improve the accuracy of the current search task. We propose a novel Similar Session-enhanced Ranking (SSR) model to improve session search performance using historical sessions with similar intents. Specifically, candidate historical sessions are matched by query-level and session-level semantic similarity; query-level neighbor behaviors are then aggregated by a Query-guided GNN (QGNN), while session-level neighbor behaviors are aggregated using an attention mechanism. Finally, we integrate the refined and aggregated historical neighbor information into the current search session. Experimental results on the AOL and Tiangong-ST datasets show that our SSR model significantly outperforms state-of-the-art models.
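A simplified sketch of the session-level attention aggregation mentioned above: neighbor session vectors are weighted by their similarity to the current session representation and fused back in. The function name, vector sizes, and scaling factor are illustrative assumptions, and the query-level QGNN part is omitted.

```python
import torch
import torch.nn.functional as F

def aggregate_similar_sessions(current_session, neighbor_sessions):
    """Attention-weighted aggregation of historical neighbor sessions, using the
    current session representation as the attention query."""
    # current_session: (dim,), neighbor_sessions: (num_neighbors, dim)
    scores = neighbor_sessions @ current_session                 # (num_neighbors,)
    weights = F.softmax(scores / current_session.numel() ** 0.5, dim=0)
    aggregated = weights @ neighbor_sessions                     # (dim,)
    return torch.cat([current_session, aggregated], dim=0)       # enriched representation

# Toy usage: one current session and five retrieved similar sessions.
cur = torch.randn(256)
neighbors = torch.randn(5, 256)
print(aggregate_similar_sessions(cur, neighbors).shape)  # torch.Size([512])
```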

NeurIPS 2022 Conference Paper

BMU-MoCo: Bidirectional Momentum Update for Continual Video-Language Modeling

  • Yizhao Gao
  • Nanyi Fei
  • Haoyu Lu
  • Zhiwu Lu
  • Hao Jiang
  • Yijie Li
  • Zhao Cao

Video-language models suffer from forgetting old/learned knowledge when trained with streaming data. In this work, we thus propose a continual video-language modeling (CVLM) setting, where models are sequentially trained on five widely-used video-text datasets with different data distributions. Although most existing continual learning methods have achieved great success by exploiting extra information (e.g., memory data of past tasks) or dynamically extended networks, they cause enormous resource consumption when transferred to our CVLM setting. To overcome the challenges (i.e., catastrophic forgetting and heavy resource consumption) in CVLM, we propose a novel cross-modal MoCo-based model with bidirectional momentum update (BMU), termed BMU-MoCo. Concretely, our BMU-MoCo has two core designs: (1) Different from the conventional MoCo, we apply the momentum update not only to the momentum encoders but also to the encoders (i.e., bidirectionally) at each training step, which enables the model to review the learned knowledge retained in the momentum encoders. (2) To further enhance our BMU-MoCo by utilizing earlier knowledge, we additionally maintain a pair of global momentum encoders (initialized only at the very beginning) with the same BMU strategy. Extensive results show that our BMU-MoCo remarkably outperforms recent competitors w.r.t. video-text retrieval performance and forgetting rate, even without using any extra data or dynamic networks.
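A minimal sketch of what one bidirectional momentum step could look like for an encoder/momentum-encoder pair: the momentum encoder tracks the encoder via an exponential moving average (as in standard MoCo), and the encoder is in turn pulled slightly back toward the momentum encoder. The helper name and momentum coefficients are assumptions, not values from the paper.

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def bidirectional_momentum_update(encoder, momentum_encoder, m=0.999, m_back=0.001):
    """One bidirectional momentum step over matching parameter pairs."""
    for p, p_m in zip(encoder.parameters(), momentum_encoder.parameters()):
        p_m.data = m * p_m.data + (1.0 - m) * p.data          # forward EMA update
        p.data = (1.0 - m_back) * p.data + m_back * p_m.data  # backward pull toward old knowledge

# Toy usage: a tiny encoder and its momentum copy.
encoder = nn.Linear(16, 16)
momentum_encoder = copy.deepcopy(encoder)
bidirectional_momentum_update(encoder, momentum_encoder)
```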