Author name cluster

Wenhao Wu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

19 papers

2 author rows

NeurIPS Conference 2025 Conference Paper

Mixture-of-Experts Meets In-Context Reinforcement Learning

Wenhao Wu
Fuhong Liu
Haoru Li
Zican Hu
Daoyi Dong
Chunlin Chen
Zhi Wang

In-context reinforcement learning (ICRL) has emerged as a promising paradigm for adapting RL agents to downstream tasks through prompt conditioning. However, two notable challenges remain in fully harnessing in-context learning within RL domains: the intrinsic multi-modality of the state-action-reward data and the diverse, heterogeneous nature of decision tasks. To tackle these challenges, we propose T2MIR (Token- and Task-wise MoE for In-context RL), an innovative framework that introduces architectural advances of mixture-of-experts (MoE) into transformer-based decision models. T2MIR substitutes the feedforward layer with two parallel layers: a token-wise MoE that captures distinct semantics of input tokens across multiple modalities, and a task-wise MoE that routes diverse tasks to specialized experts for managing a broad task distribution with alleviated gradient conflicts. To enhance task-wise routing, we introduce a contrastive learning method that maximizes the mutual information between the task and its router representation, enabling more precise capture of task-relevant information. The outputs of the two MoE components are concatenated and fed into the next layer. Comprehensive experiments show that T2MIR significantly facilitates in-context learning capacity and outperforms various types of baselines. We bring the potential and promise of MoE to ICRL, offering a simple and scalable architectural enhancement to advance ICRL one step closer toward achievements in language and vision communities. Our code is available at https://github.com/NJU-RL/T2MIR.

NeurIPS Conference 2025 Conference Paper

Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search

Huanjin Yao
Jiaxing Huang
Wenhao Wu
Jingyi Zhang
Yibo Wang
Shunyu Liu
Yingjie Wang
YuXin Song

In this work, we aim to develop an MLLM that understands and solves questions by learning to create each intermediate step of the reasoning involved till the final answer. To this end, we propose Collective Monte Carlo Tree Search (CoMCTS), a new learning-to-reason method for MLLMs, which introduces the concept of collective learning into "tree search" for effective and efficient reasoning-path searching and learning. The core idea of CoMCTS is to leverage collective knowledge from multiple models to collaboratively conjecture, search and identify effective reasoning paths toward correct answers via four iterative operations including Expansion, Simulation and Error Positioning, Backpropagation, and Selection. Using CoMCTS, we construct Mulberry-260k, a multimodal dataset with a tree of rich, explicit and well-defined reasoning nodes for each question. With Mulberry-260k, we perform collective SFT to train our model, Mulberry, a series of MLLMs with o1-like step-by-step Reasoning and Reflection capabilities. Extensive experiments demonstrate the superiority of our proposed methods on various benchmarks. Code is available at https://github.com/HJYao00/Mulberry.

NeurIPS Conference 2025 Conference Paper

R1-ShareVL: Incentivizing Reasoning Capabilities of Multimodal Large Language Models via Share-GRPO

Huanjin Yao
Qixiang Yin
Jingyi Zhang
Min Yang
Yibo Wang
Wenhao Wu
Fei Su
Li Shen

In this work, we aim to incentivize the reasoning ability of Multimodal Large Language Models (MLLMs) via reinforcement learning (RL) and develop an effective approach that mitigates the sparse reward and advantage vanishing issues during RL. To this end, we propose Share-GRPO, a novel RL approach that tackles these issues by exploring and sharing diverse reasoning trajectories over an expanded question space. Specifically, Share-GRPO first expands the question space for a given question via data transformation techniques, then encourages the MLLM to effectively explore diverse reasoning trajectories over the expanded question space, and shares the discovered reasoning trajectories across the expanded questions during RL. In addition, Share-GRPO also shares reward information during advantage computation, which estimates solution advantages hierarchically across and within question variants, allowing more accurate estimation of relative advantages and improving the stability of policy training. Extensive evaluations over 6 widely-used reasoning benchmarks showcase the superior performance of our method. Code is available at https://github.com/HJYao00/R1-ShareVL.
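The hierarchical advantage estimate described in this abstract can be illustrated with a small sketch. It is not taken from the paper or its repository: the function names, the mean/standard-deviation normalization (borrowed from the usual group-relative formulation), and the choice to simply add the two levels are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def normalize(rewards, eps=1e-6):
    """Group-relative normalization: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def hierarchical_advantages(rewards_by_variant):
    """Combine advantages estimated across and within question variants.

    rewards_by_variant: one list of response rewards per question variant.
    ASSUMPTION: the two levels are summed; the paper may weight them differently.
    """
    # Level 1: normalize over every response of every variant.
    all_rewards = [r for group in rewards_by_variant for r in group]
    global_adv = normalize(all_rewards)

    combined, offset = [], 0
    for group in rewards_by_variant:
        # Level 2: normalize inside a single variant.
        local_adv = normalize(group)
        combined.append(
            [global_adv[offset + i] + local_adv[i] for i in range(len(group))]
        )
        offset += len(group)
    return combined


# Variant 2 has identical rewards, so its within-variant advantages vanish,
# but the across-variant level still provides a learning signal.
advantages = hierarchical_advantages([[1.0, 0.0], [0.0, 0.0]])
```

In this toy input the second variant would contribute zero advantage under within-group normalization alone, which is the "advantage vanishing" situation the abstract mentions.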

ICLR Conference 2025 Conference Paper

Retrieval Head Mechanistically Explains Long-Context Factuality

Wenhao Wu
Yizhong Wang
Guangxuan Xiao
Hao Peng 0018
Yao Fu

Despite the recent progress in long-context language models, it remains elusive how transformer-based models exhibit the capability to retrieve relevant information from arbitrary locations within the long context. This paper aims to address this question. Our systematic investigation across a wide spectrum of models reveals that a special type of attention head, which we dub retrieval heads, is largely responsible for retrieving information. We identify intriguing properties of retrieval heads: (1) universal: all the explored models with long-context capability have a set of retrieval heads. (2) sparse: only a small portion (less than 5%) of the attention heads are retrieval heads. (3) intrinsic: retrieval heads already exist in models pretrained with short context. When extending the context length by continual pretraining, it is still the same set of heads that perform information retrieval. (4) dynamically activated: take Llama-2 7B for example, 12 retrieval heads always attend to the required information no matter how the context is changed. The rest of the retrieval heads are activated in different contexts. (5) causal: completely pruning retrieval heads leads to failure in retrieving relevant information and results in hallucination, while pruning random non-retrieval heads does not affect the model's retrieval ability. We further show that retrieval heads strongly influence chain-of-thought (CoT) reasoning, where the model needs to frequently refer back to the question and previously generated context. Conversely, tasks where the model directly generates the answer using its intrinsic knowledge are less impacted by masking out retrieval heads. These observations collectively explain which internal part of the model seeks information from the input tokens. We believe our insights will foster future research on reducing hallucination, improving reasoning, and compressing the KV cache.

NeurIPS Conference 2025 Conference Paper

Text-to-Decision Agent: Offline Meta-Reinforcement Learning from Natural Language Supervision

Shilin Zhang
Zican Hu
Wenhao Wu
Xinyi Xie
Jianxiang Tang
Chunlin Chen
Daoyi Dong
Yu Cheng

Offline meta-RL usually tackles generalization by inferring task beliefs from high-quality samples or warmup explorations. The restricted form limits their generality and usability since these supervision signals are expensive and even infeasible to acquire in advance for unseen tasks. Learning directly from the raw text about decision tasks is a promising alternative to leverage a much broader source of supervision. In this paper, we propose Text-to-Decision Agent (T2DA), a simple and scalable framework that supervises offline meta-RL with natural language. We first introduce a generalized world model to encode multi-task decision data into a dynamics-aware embedding space. Then, inspired by CLIP, we predict which textual description goes with which decision embedding, effectively bridging their semantic gap via contrastive language-decision pre-training and aligning the text embeddings to comprehend the environment dynamics. After training the text-conditioned generalist policy, the agent can directly realize zero-shot text-to-decision generation in response to language instructions. Comprehensive experiments on MuJoCo and Meta-World benchmarks show that T2DA facilitates high-capacity zero-shot generalization and outperforms various types of baselines. Our code is available at https://github.com/NJU-RL/T2DA.

NeurIPS Conference 2024 Conference Paper

Automated Multi-level Preference for MLLMs

Mengxi Zhang
Wenhao Wu
Yu Lu
YuXin Song
Kang Rong
Huanjin Yao
Jianbo Zhao
Fanglong Liu

Current multimodal Large Language Models (MLLMs) suffer from "hallucination", occasionally generating responses that are not grounded in the input images. To tackle this challenge, one promising path is to utilize reinforcement learning from human feedback (RLHF), which steers MLLMs towards learning superior responses while avoiding inferior ones. We rethink the common practice of using binary preferences (i.e., superior, inferior), and find that adopting multi-level preferences (e.g., superior, medium, inferior) is better for two benefits: 1) It narrows the gap between adjacent levels, thereby encouraging MLLMs to discern subtle differences. 2) It further integrates cross-level comparisons (beyond adjacent-level comparisons), thus providing a broader range of comparisons with hallucination examples. To verify our viewpoint, we present the Automated Multi-level Preference (AMP) framework for MLLMs. To facilitate this framework, we first develop an automated dataset generation pipeline that provides high-quality multi-level preference datasets without any human annotators. Furthermore, we design the Multi-level Direct Preference Optimization (MDPO) algorithm to robustly conduct complex multi-level preference learning. Additionally, we propose a new hallucination benchmark, MRHal-Bench. Extensive experiments across public hallucination and general benchmarks, as well as our MRHal-Bench, demonstrate the effectiveness of our proposed method. Code is available at https://github.com/takomc/amp.
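The difference between adjacent-level and cross-level comparisons can be made concrete with a short sketch. It only enumerates the preference pairs implied by a ranked list of responses; the function name and the tuple layout are hypothetical, and it does not reproduce the MDPO loss itself.

```python
from itertools import combinations


def preference_pairs(ranked_responses):
    """Enumerate (chosen, rejected, level_gap) tuples from a best-first ranking.

    level_gap == 1 marks an adjacent-level comparison;
    level_gap > 1 marks a cross-level comparison.
    """
    pairs = []
    for (i, better), (j, worse) in combinations(enumerate(ranked_responses), 2):
        pairs.append((better, worse, j - i))
    return pairs


pairs = preference_pairs(["superior", "medium", "inferior"])
adjacent = [p for p in pairs if p[2] == 1]
cross_level = [p for p in pairs if p[2] > 1]
```

With three levels this yields two adjacent-level pairs and one cross-level pair, whereas a binary preference setup yields a single pair per prompt.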

NeurIPS Conference 2024 Conference Paper

Dense Connector for MLLMs

Huanjin Yao
Wenhao Wu
Taojiannan Yang
YuXin Song
Mengxi Zhang
Haocheng Feng
Yifan Sun
Zhiheng Li

Do we fully leverage the potential of visual encoder in Multimodal Large Language Models (MLLMs)? The recent outstanding performance of MLLMs in multimodal understanding has garnered broad attention from both academia and industry. In the current MLLM rat race, the focus seems to be predominantly on the linguistic side. We witness the rise of larger and higher-quality instruction datasets, as well as the involvement of larger-sized LLMs. Yet, scant attention has been directed towards the visual signals utilized by MLLMs, often assumed to be the final high-level features extracted by a frozen visual encoder. In this paper, we introduce the Dense Connector - a simple, effective, and plug-and-play vision-language connector that significantly enhances existing MLLMs by leveraging multi-layer visual features, with minimal additional computational overhead. Building on this, we also propose the Efficient Dense Connector, which achieves performance comparable to LLaVA-v1.5 with only 25% of the visual tokens. Furthermore, our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well. Experimental results across various vision encoders, image resolutions, training dataset scales, varying sizes of LLMs (2.7B→70B), and diverse architectures of MLLMs (e.g., LLaVA-v1.5, LLaVA-NeXT and Mini-Gemini) validate the versatility and scalability of our approach, achieving state-of-the-art performance across 19 image and video benchmarks. We hope that this work will provide valuable experience and serve as a basic module for future MLLM development. Code is available at https://github.com/HJYao00/DenseConnector.

JBHI Journal 2024 Journal Article

DETHACDA: A Dual-View Edge and Topology Hybrid Attention Model for CircRNA-Disease Associations Prediction

Wenjing Yin
Shudong Wang
Sibo Qiao
Yawu Zhao
Wenhao Wu
Shanchen Pang
Zhihan Lv

There is growing evidence that circRNAs are involved in the physiological processes and pathogenesis of many complex diseases and may serve as critical therapeutic targets. Identifying disease-associated circRNAs through biological experiments is time-consuming, and designing an intelligent, precise calculation model is essential. Recently, many models based on graph technology have been proposed to predict circRNA-disease association. However, most existing methods only capture the neighborhood topology of the association network and ignore the complex semantic information. Therefore, we propose a Dual-view Edge and Topology Hybrid Attention model for predicting CircRNA-Disease Associations (DETHACDA), effectively capturing the neighborhood topology and various semantics of circRNA and disease nodes in a heterogeneous network. The 5-fold cross-validation experiments on circRNADisease indicate that the proposed DETHACDA achieves an area under the receiver operating characteristic curve of 0.9882, better than four state-of-the-art calculation methods.

NeurIPS Conference 2024 Conference Paper

Meta-DT: Offline Meta-RL as Conditional Sequence Modeling with World Model Disentanglement

Zhi Wang
Li Zhang
Wenhao Wu
Yuanheng Zhu
Dongbin Zhao
Chunlin Chen

A longstanding goal of artificial general intelligence is highly capable generalists that can learn from diverse experiences and generalize to unseen tasks. The language and vision communities have seen remarkable progress toward this trend by scaling up transformer-based models trained on massive datasets, while reinforcement learning (RL) agents still suffer from poor generalization capacity under such paradigms. To tackle this challenge, we propose Meta Decision Transformer (Meta-DT), which leverages the sequential modeling ability of the transformer architecture and robust task representation learning via world model disentanglement to achieve efficient generalization in offline meta-RL. We pretrain a context-aware world model to learn a compact task representation, and inject it as a contextual condition to the causal transformer to guide task-oriented sequence generation. Then, we subtly utilize history trajectories generated by the meta-policy as a self-guided prompt to exploit the architectural inductive bias. We select the trajectory segment that yields the largest prediction error on the pretrained world model to construct the prompt, aiming to encode task-specific information complementary to the world model maximally. Notably, the proposed framework eliminates the requirement of any expert demonstration or domain knowledge at test time. Experimental results on MuJoCo and Meta-World benchmarks across various dataset types show that Meta-DT exhibits superior few- and zero-shot generalization capacity compared to strong baselines while being more practical with fewer prerequisites. Our code is available at https://github.com/NJU-RL/Meta-DT.
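The prompt-selection rule in this abstract (pick the trajectory segment with the largest world-model prediction error) can be sketched as a sliding-window search. This is an illustration, not the Meta-DT code: the function name is hypothetical, the per-step errors are assumed to be precomputed, and scoring a segment by the sum of its per-step errors is an assumption.

```python
def select_prompt_segment(step_errors, segment_len):
    """Return (start, end) of the window with the largest summed prediction error.

    step_errors: per-step world-model prediction errors for one trajectory
                 (assumed to be computed elsewhere).
    segment_len: number of steps in the prompt segment.
    """
    if not 0 < segment_len <= len(step_errors):
        raise ValueError("segment_len must be in 1..len(step_errors)")

    # Score of the first window, then slide one step at a time.
    window = sum(step_errors[:segment_len])
    best_start, best_score = 0, window
    for start in range(1, len(step_errors) - segment_len + 1):
        window += step_errors[start + segment_len - 1] - step_errors[start - 1]
        if window > best_score:
            best_start, best_score = start, window
    return best_start, best_start + segment_len


segment = select_prompt_segment([0.1, 0.2, 0.9, 0.8, 0.1], segment_len=2)
```

The intuition, as the abstract puts it, is that steps the world model predicts poorly carry task information complementary to what the world model already encodes.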

JBHI Journal 2024 Journal Article

MOSGAT: Uniting Specificity-Aware GATs and Cross Modal-Attention to Integrate Multi-Omics Data for Disease Diagnosis

Wenhao Wu
Shudong Wang
Yuanyuan Zhang
Wenjing Yin
Yawu Zhao
Shanchen Pang

With the advancement of sequencing methodologies, the acquisition of vast amounts of multi-omics data presents a significant opportunity for comprehending the intricate biological mechanisms underlying diseases and achieving precise diagnosis and treatment for complex disorders. However, as diverse omics data are integrated, extracting sample-specific features within each omics modality and exploring potential correlations among different modalities while avoiding mutual interference becomes a critical challenge in multi-omics data integration research. In this study, we propose a framework that unites specificity-aware GATs and cross-modal attention to integrate different omics data (MOSGAT). Specifically, we devise Graph Attention Networks (GATs) tailored for each omics modality to perform feature extraction on samples. Additionally, an adaptive confidence attention weighting technique is incorporated to enhance the confidence in the extracted features. Finally, a cross-modal attention mechanism is devised based on multi-head self-attention, thoroughly uncovering potential correlations between different omics data. Extensive experiments were conducted on four publicly available medical datasets, highlighting the superiority of the proposed framework when compared to state-of-the-art methodologies, particularly in classification tasks. The experimental results underscore MOSGAT's effectiveness in extracting features and exploring potential inter-omics associations.

ICLR Conference 2024 Conference Paper

PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training

Dawei Zhu
Nan Yang 0002
Liang Wang 0046
Yifan Song 0002
Wenhao Wu
Furu Wei
Sujian Li

Large Language Models (LLMs) are trained with a pre-defined context length, restricting their use in scenarios requiring long inputs. Previous efforts for adapting LLMs to a longer length usually require fine-tuning with this target length (Full-length fine-tuning), incurring intensive training cost. To decouple train length from target length for efficient context window extension, we propose Positional Skip-wisE (PoSE) training that smartly simulates long inputs using a fixed context window. This is achieved by first dividing the original context window into several chunks, then designing distinct skipping bias terms to manipulate the position indices of each chunk. These bias terms and the lengths of each chunk are altered for every training example, allowing the model to adapt to all positions within the target length. Experimental results show that PoSE greatly reduces memory and time overhead compared with Full-length fine-tuning, with minimal impact on performance. Leveraging this advantage, we have successfully extended the LLaMA model to 128k tokens using a 2k training context window. Furthermore, we empirically confirm that PoSE is compatible with all RoPE-based LLMs and position interpolation strategies. Notably, our method can potentially support infinite length, limited only by memory usage in inference. With ongoing progress for efficient inference, we believe PoSE can further scale the context window beyond 128k.
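The chunk-and-skip idea in this abstract can be illustrated with a short sketch that builds position indices for one training example. It is a minimal illustration under stated assumptions rather than the authors' implementation: the function name, the uniform sampling of chunk boundaries and biases, and fixing the first chunk's bias at zero are all choices made here.

```python
import random


def pose_position_ids(train_len, target_len, num_chunks=2, rng=None):
    """Build train_len position ids that lie inside [0, target_len).

    The training window is cut into num_chunks chunks; each chunk gets a
    skipping bias added to its positions, so a short window can touch
    positions far beyond train_len.
    """
    rng = rng or random.Random()

    # Chunk lengths vary per example: sample distinct cut points.
    cuts = sorted(rng.sample(range(1, train_len), num_chunks - 1))
    bounds = [0] + cuts + [train_len]

    # Non-decreasing biases keep positions strictly increasing.
    # ASSUMPTION: the first chunk is left unshifted.
    budget = target_len - train_len
    biases = [0] + sorted(rng.randint(0, budget) for _ in range(num_chunks - 1))

    position_ids = []
    for chunk in range(num_chunks):
        start, end = bounds[chunk], bounds[chunk + 1]
        position_ids.extend(p + biases[chunk] for p in range(start, end))
    return position_ids


# Simulate a 32-position target window with only 8 training tokens.
ids = pose_position_ids(train_len=8, target_len=32, rng=random.Random(0))
```

Because the boundaries and biases are resampled for every example, repeated calls cover different parts of the target range while the sequence length stays fixed at `train_len`.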

EAAI Journal 2024 Journal Article

Transformer Autoencoder for K-means Efficient clustering

Wenhao Wu
Weiwei Wang
Xixi Jia
Xiangchu Feng

As a fundamental unsupervised learning task, clustering has been widely applied in exploratory data analysis in the fields of computer vision, pattern recognition, and data mining. Among existing clustering methods, K-means is the most popular one due to its simplicity and computational efficiency. However, the ubiquitous high dimensionality challenges the effectiveness and the efficiency of the K-means algorithm. Fortunately, the deep neural network provides a powerful solution for learning low-dimensional features. To optimize the feature learning and the K-means clustering jointly, we present a new deep clustering network called Transformer AutoEncoder for K-means Efficient clustering (TAKE). It consists of two modules: the Transformer AutoEncoder (TAE) for feature learning and the KNet for clustering. The TAE incorporates the transformer structure to learn global features and the contrastive learning mechanism to enhance feature discrimination. The KNet is constructed by unrolling the accelerated projected gradient descent iterations of the relaxed K-means model. The network is trained in two phases: pretraining and clustering. In pretraining, the TAE is optimized by minimizing the cosine similarity-based reconstruction loss, the contrastive loss (CL) and the convex combination loss (CCL). The CCL encourages features of augmented neighbor data to lie in a convex hull, making them K-means friendly. In the clustering phase, the TAE and the KNet are optimized jointly by minimizing the reconstruction loss and the K-means clustering loss. The clustering results are obtained by the forward inference of the KNet. Extensive experiments show that our proposed method is highly effective in unsupervised representation learning and clustering.

AAAI Conference 2023 Conference Paper

AdaCM: Adaptive ColorMLP for Real-Time Universal Photo-Realistic Style Transfer

Tianwei Lin
Honglin Lin
Fu Li
Dongliang He
Wenhao Wu
Meiling Wang
Xin Li
Yong Liu

Photo-realistic style transfer aims at migrating the artistic style from an exemplar style image to a content image, producing a result image without spatial distortions or unrealistic artifacts. Impressive results have been achieved by recent deep models. However, deep neural network based methods are too expensive to run in real-time. Meanwhile, bilateral grid based methods are much faster but still contain artifacts like overexposure. In this work, we propose the Adaptive ColorMLP (AdaCM), an effective and efficient framework for universal photo-realistic style transfer. First, we find the complex non-linear color mapping between input and target domain can be efficiently modeled by a small multi-layer perceptron (ColorMLP) model. Then, in AdaCM, we adopt a CNN encoder to adaptively predict all parameters for the ColorMLP conditioned on each input content and style image pair. Experimental results demonstrate that AdaCM can generate vivid and high-quality stylization results. Meanwhile, our AdaCM is ultrafast and can process a 4K resolution image in 6ms on one V100 GPU.

JAIR Journal 2023 Journal Article

FactGen: Faithful Text Generation by Factuality-aware Pre-training and Contrastive Ranking Fine-tuning

ZhiBin Lan
Wei Li
Jinsong Su
Xinyan Xiao
Jiachen Liu
Wenhao Wu
Yajuan Lyu

Conditional text generation is supposed to generate a fluent and coherent target text that is faithful to the source text. Although pre-trained models have achieved promising results, they still suffer from the crucial factuality problem. To deal with this issue, we propose a factuality-aware pretraining-finetuning framework named FactGen, which fully considers factuality during two training stages. Specifically, at the pre-training stage, we utilize a natural language inference model to construct target texts that are entailed by the source texts, resulting in a more factually consistent pre-training objective. Then, during the fine-tuning stage, we further introduce a contrastive ranking loss to encourage the model to generate factually consistent text with higher probability. Extensive experiments on three conditional text generation tasks demonstrate the effectiveness and generality of our training framework.

AAAI Conference 2023 Conference Paper

Revisiting Classifier: Transferring Vision-Language Models for Video Recognition

Wenhao Wu
Zhun Sun
Wanli Ouyang

Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research. Along with the growth of computational capacity, we now have open-source vision-language pre-trained models at large scales of model architecture and amount of data. In this study, we focus on transferring knowledge for video classification tasks. Conventional methods randomly initialize the linear classifier head for vision classification, but they leave the usage of the text encoder for downstream visual recognition tasks unexplored. In this paper, we revise the role of the linear classifier and replace the classifier with different knowledge from the pre-trained model. We utilize the well-pretrained language model to generate good semantic targets for efficient transfer learning. The empirical study shows that our method improves both the performance and the training speed of video classification, with a negligible change in the model. Our simple yet effective tuning paradigm achieves state-of-the-art performance and efficient training on various video recognition scenarios, i.e., zero-shot, few-shot, and general recognition. In particular, our paradigm achieves the state-of-the-art accuracy of 87.8% on Kinetics-400, and also surpasses previous methods by 20~50% absolute top-1 accuracy under zero-shot and few-shot settings on five video datasets. Code and models are available at https://github.com/whwu95/Text4Vis.
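The general idea of using text-derived class targets in place of a randomly initialized linear head can be sketched as nearest-text-embedding classification. This is a generic illustration, not the Text4Vis code: the function names are hypothetical, the toy two-dimensional vectors stand in for encoder outputs, and cosine similarity is an assumed scoring rule.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(video_embedding, class_text_embeddings):
    """Pick the class whose text embedding is most similar to the video embedding.

    class_text_embeddings: dict mapping class name -> text embedding
                           (assumed to come from a pre-trained text encoder).
    """
    return max(
        class_text_embeddings,
        key=lambda name: cosine(video_embedding, class_text_embeddings[name]),
    )


# Toy embeddings standing in for encoder outputs.
text_targets = {"running": [1.0, 0.0], "swimming": [0.0, 1.0]}
predicted = classify([0.9, 0.1], text_targets)
```

Because the class targets come from text, new class names can be added without training a new classifier row, which is what makes zero-shot evaluation possible in this style of setup.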

IROS Conference 2023 Conference Paper

WSCFER: Improving Facial Expression Representations by Weak Supervised Contrastive Learning

Wei Nie
Bowen Chen 0004
Wenhao Wu
Xiu Xu
Weihong Ren
Honghai Liu 0001

The major challenge of Facial Expression Recognition (FER) is to learn class discriminative representations, and the existing works mainly address it by designing various classification networks from class level. However, learning representations at class level is limited due to the inconspicuous class discrimination among different facial expressions. Thus, in this paper, we propose a Weak Supervised Contrastive learning FER (WSCFER) method to improve facial expression representations by simultaneously learning instance-level representations which are highly complementary to the general class-level representations. Specifically, our proposed WSCFER consists of three components: a major task for FER classification, an auxiliary task for Weak Supervised Contrastive (WSC) learning which pulls augmented samples of the same image together while pushing apart instance samples from different classes, and a Partial Consistency Loss (PCL) for optimizing the two embedding spaces from both the class level and the instance level. We compare WSC with some state-of-the-art contrastive methods and find that it can efficiently learn instance-level representations but avoid overemphasizing irrelevant parts, which is crucial for FER. WSCFER achieves superior performance on several in-the-wild databases, and it also shows the promising potential for learning representations under noisy annotations.

AAAI Conference 2022 Conference Paper

Temporal Action Proposal Generation with Background Constraint

Haosen Yang
Wenhao Wu
Lining Wang
Sheng Jin
Boyang Xia
Hongxun Yao
Hujie Huang

Temporal action proposal generation (TAPG) is a challenging task that aims to locate action instances in untrimmed videos with temporal boundaries. To evaluate the confidence of proposals, the existing works typically predict the action score of proposals, supervised by the temporal Intersection-over-Union (tIoU) between the proposal and the ground truth. In this paper, we innovatively propose a general auxiliary Background Constraint idea to further suppress low-quality proposals, by utilizing the background prediction score to restrict the confidence of proposals. In this way, the Background Constraint concept can be easily plug-and-played into existing TAPG methods (e.g., BMN, GTAD). From this perspective, we propose the Background Constraint Network (BC-Net) to further take advantage of the rich information of action and background. Specifically, we introduce an Action-Background Interaction module for reliable confidence evaluation, which models the inconsistency between action and background by attention mechanisms at the frame and clip levels. Extensive experiments are conducted on two popular benchmarks, i.e., ActivityNet-1.3 and THUMOS14. The results demonstrate that our method outperforms state-of-the-art methods. Equipped with the existing action classifier, our method also achieves remarkable performance on the temporal action localization task.
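The temporal Intersection-over-Union used as the supervision signal here is the one-dimensional analogue of box IoU. A minimal sketch follows; the function name and the (start, end) tuple representation are illustrative choices, not taken from the paper.

```python
def temporal_iou(proposal, ground_truth):
    """tIoU between two temporal segments given as (start, end) in seconds."""
    p_start, p_end = proposal
    g_start, g_end = ground_truth

    # Overlap is zero when the segments do not intersect.
    intersection = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - intersection
    return intersection / union if union > 0 else 0.0


overlap = temporal_iou((0.0, 10.0), (5.0, 15.0))  # 5 s shared out of 15 s covered
```

A proposal identical to the ground truth scores 1.0 and a disjoint one scores 0.0, which is why tIoU serves as a regression target for proposal confidence.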

AAAI Conference 2021 Conference Paper

MVFNet: Multi-View Fusion Network for Efficient Video Recognition

Wenhao Wu
Dongliang He
Tianwei Lin
Fu Li
Chuang Gan
Errui Ding

Conventionally, the spatiotemporal modeling network and its complexity are the two most concentrated research topics in video action recognition. Existing state-of-the-art methods have achieved excellent accuracy regardless of the complexity, while efficient spatiotemporal modeling solutions are slightly inferior in performance. In this paper, we attempt to acquire both efficiency and effectiveness simultaneously. First of all, besides traditionally treating H × W × T video frames as a space-time signal (viewing from the Height-Width spatial plane), we propose to also model video from the other two Height-Time and Width-Time planes, to capture the dynamics of video thoroughly. Secondly, our model is designed based on 2D CNN backbones and model complexity is well kept in mind by design. Specifically, we introduce a novel multi-view fusion (MVF) module to exploit video dynamics using separable convolution for efficiency. It is a plug-and-play module and can be inserted into off-the-shelf 2D CNNs to form a simple yet effective model called MVFNet. Moreover, MVFNet can be thought of as a generalized video modeling framework and it can specialize to be existing methods such as C2D, SlowOnly, and TSM under different settings. Extensive experiments are conducted on popular benchmarks (i.e., Something-Something V1 & V2, Kinetics, UCF-101, and HMDB-51) to show its superiority. The proposed MVFNet can achieve state-of-the-art performance but maintain 2D CNN’s complexity.

IJCAI Conference 2021 Conference Paper

Weakly-Supervised Spatio-Temporal Anomaly Detection in Surveillance Video

Jie Wu
Wei Zhang
Guanbin Li
Wenhao Wu
Xiao Tan
Yingying Li
Errui Ding
Liang Lin

In this paper, we introduce a novel task, referred to as Weakly-Supervised Spatio-Temporal Anomaly Detection (WSSTAD) in surveillance video. Specifically, given an untrimmed video, WSSTAD aims to localize a spatio-temporal tube (i.e., a sequence of bounding boxes at consecutive times) that encloses the abnormal event, with only coarse video-level annotations as supervision during training. To address this challenging task, we propose a dual-branch network which takes as input the proposals with multi-granularities in both spatial-temporal domains. Each branch employs a relationship reasoning module to capture the correlation between tubes/videolets, which can provide rich contextual information and complex entity relationships for the concept learning of abnormal behaviors. A Mutually-guided Progressive Refinement framework is set up to employ dual-path mutual guidance in a recurrent manner, iteratively sharing auxiliary supervision information across branches. It impels the learned concepts of each branch to serve as a guide for its counterpart, which progressively refines the corresponding branch and the whole framework. Furthermore, we contribute two datasets, i.e., ST-UCF-Crime and STRA, consisting of videos containing spatio-temporal abnormal annotations to serve as the benchmarks for WSSTAD. We conduct extensive qualitative and quantitative evaluations to demonstrate the effectiveness of the proposed approach and analyze the key factors that contribute most to handling this task.
