Arrow Research

Author name cluster

Limin Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

32 papers
1 author row

Possible papers

32

AAAI Conference 2026 Conference Paper

Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment

  • Yang Chen
  • Xiaowei Xu
  • Shuai Wang
  • Chenhui Zhu
  • Ruxue Wen
  • Xubin Li
  • Tiezheng Ge
  • Limin Wang

Normalizing Flows (NFs) are a class of generative models distinguished by a mathematically invertible architecture, where the forward pass transforms data into a latent space for density estimation, and the reverse pass generates new samples from this space. This characteristic creates an intrinsic synergy between representation learning and data generation. However, the generative quality of standard NFs is limited by the poor semantic representations learned from log-likelihood optimization. To remedy this, we propose a novel alignment strategy that creatively leverages the invertibility of NFs: instead of regularizing the forward pass, we align the intermediate features of the generative (reverse) pass with representations from a powerful vision foundation model, demonstrating superior effectiveness over naive alignment. We also introduce a novel training-free, test-time optimization algorithm for classification, which provides a more intrinsic evaluation of the NF's embedded semantic knowledge. Comprehensive experiments demonstrate that our approach accelerates the training of NFs by over 3.3×, while simultaneously delivering significant improvements in both generative quality and classification accuracy. New state-of-the-art results for NFs are established on ImageNet 64×64 and 256×256.
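
For reference, density estimation in the forward pass rests on the standard change-of-variables identity below; the second formula is a hedged sketch of how a reverse-pass alignment term could enter the training objective. The weight λ, the cosine form, the projection g, and the layer choice ℓ are illustrative assumptions, not the paper's exact loss (φ denotes the frozen vision foundation model).

    % Standard NF log-likelihood (change of variables):
    \log p_X(x) = \log p_Z\big(f_\theta(x)\big)
                + \log\Big|\det \frac{\partial f_\theta(x)}{\partial x}\Big|
    % Hedged sketch of a combined objective with reverse-pass alignment,
    % where h^{rev}_\ell are intermediate features of the reverse pass:
    \mathcal{L}(\theta) = -\mathbb{E}_x\big[\log p_X(x)\big]
        + \lambda\,\mathbb{E}_x\Big[\, 1 - \cos\big(g(h^{\mathrm{rev}}_{\ell}(x)),\ \phi(x)\big) \Big]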

EAAI Journal 2026 Journal Article

Intelligent pose correction of shield machines via an integrated convolutional long short-term memory Kolmogorov-Arnold network and model reference adaptive control

  • Xiangyu Li
  • Xuanyu Liu
  • Limin Wang
  • Yudong Wang
  • He Zhang
  • Yueyang Huang
  • Junzhi Lu

In underground tunnel construction, Earth Pressure Balance shield machines are required to advance along a designed alignment. However, complex geological conditions and equipment-related disturbances often lead to pose deviations, which can compromise construction quality. This study proposes an integrated intelligent pose correction framework that combines pose prediction with adaptive control. First, key input variables are selected through Pearson correlation analysis and denoised using a hybrid Complete Ensemble Empirical Mode Decomposition with Adaptive Noise-wavelet transform method. A pose prediction model is then developed based on a Convolutional Long Short-Term Memory Kolmogorov-Arnold Network (CL-KAN), which replaces the fully connected layers of a Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) with KAN layers to enhance nonlinear feature representation. Experimental results show that the CL-KAN model achieves high prediction accuracy, with root mean squared error values ranging from 0.88 to 1.68 mm for vertical deviations and coefficients of determination ranging from 0.90 to 0.97 for the key pose parameters. Compared with a baseline CNN-LSTM, the CL-KAN model reduces the root mean squared error by 12.3-18.6% while requiring fewer trainable parameters. To bridge prediction and control, a context-aware perturbation importance analysis (CA-PIA) method is employed to identify influential control features, which subsequently guide the parameter optimization of a model reference adaptive control (MRAC) strategy. Field validation under complex working conditions demonstrates that the proposed framework confines pose deviations within ±7 mm, showing strong robustness and practical applicability for intelligent pose correction in tunnel engineering based on artificial intelligence techniques.
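
As a concrete illustration of the Pearson-based input screening step, here is a minimal Python sketch that keeps only operating-data channels whose absolute correlation with the pose-deviation target clears a threshold. The variable names and the 0.3 cutoff are assumptions for illustration, not values from the paper.

    import numpy as np

    def select_by_pearson(X, y, threshold=0.3):
        """X: (n_samples, n_features) operating records; y: (n_samples,) pose deviation."""
        r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
        keep = np.flatnonzero(np.abs(r) >= threshold)   # indices of retained inputs
        return keep, r[keep]

    # Toy example: feature 3 drives the target, so it survives the screen.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 20))
    y = 0.8 * X[:, 3] + 0.2 * rng.standard_normal(500)
    idx, corrs = select_by_pearson(X, y)
    print(idx, np.round(corrs, 2))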

AAAI Conference 2026 Conference Paper

VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning

  • Zikang Wang
  • Boyu Chen
  • Zhengrong Yue
  • Yi Wang
  • Yu Qiao
  • Limin Wang
  • Yali Wang

Recent advances in video understanding have been driven by multimodal large language models (MLLMs). These MLLMs are good at analyzing short videos but have difficulty understanding videos with a longer context. To address this difficulty, several agent paradigms have recently been proposed, using MLLMs as agents to retrieve extra contextual knowledge from a long video. However, most existing agents ignore the key fact that a long video is composed of multiple shots; that is, to answer a user question about a long video, it is critical to deeply understand the relevant shots, as a human would. Without such insight, these agents often mistakenly retrieve redundant or even noisy temporal context, restricting their capacity for long video understanding. To fill this gap, we propose VideoChat-A1, a novel long video agent paradigm. Different from previous works, our VideoChat-A1 can deeply think with long videos via a distinct chain-of-shot reasoning paradigm. More specifically, it progressively selects the shots relevant to the user question and looks into these shots in a coarse-to-fine partition. By multi-modal reasoning along the shot chain, VideoChat-A1 can effectively mimic the step-by-step human thinking process, allowing interactive discovery of the preferable temporal context for thoughtful understanding of long videos. Extensive experiments show that VideoChat-A1 achieves state-of-the-art performance on mainstream long video QA benchmarks, e.g., 77.0 on VideoMME (w/ subs) and 70.1 on EgoSchema, outperforming its strong baselines (e.g., InternVL2.5-8B and InternVideo2.5-8B) by up to 10.1% and 6.2%. Compared to the leading closed-source GPT-4o and Gemini 1.5 Pro, VideoChat-A1 offers competitive accuracy with only 7% of the input frames and 12% of the inference time on average.

EAAI Journal 2025 Journal Article

A binary linear predictive evolutionary algorithm with feature analysis for multiobjective feature selection in classification

  • Ting Zhou
  • Limin Wang
  • Xuming Han
  • Zhiquan Liu
  • Minghan Gao

Multiobjective feature selection (MOFS), which aims to obtain a set of Pareto optimal feature subsets by simultaneously maximizing classification accuracy and minimizing the number of selected features, has attracted considerable attention recently. However, most existing studies still face the challenge of locating well-distributed Pareto optimal feature subsets, especially for high-dimensional complex datasets. In response, this paper proposes a binary linear predictive evolutionary algorithm with feature analysis (MBLPE) for MOFS. A feature analysis-based selection method is proposed to carry effective solutions into the next generation. Concretely, two subset evaluation indicators are designed to efficiently handle duplicated feature subsets during evolution. Then, a fitness allocation is constructed to select effective solutions, improving population diversity. Moreover, a Fisher score-based initialization scheme is designed for handling high-dimensional complex datasets. The proposed scheme effectively removes irrelevant and redundant features from the search space by identifying features with strong discriminative power in advance, thereby reducing computational cost. In comparison with seven state-of-the-art algorithms on 18 classification datasets with different characteristics, the proposed MBLPE finds more diverse feature subsets with better convergence when solving MOFS problems.
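
The Fisher score mentioned for initialization ranks each feature by between-class scatter over within-class scatter. Below is a minimal sketch of that score and of biasing an initial binary population toward high-scoring features; the keep ratio and the sampling scheme are illustrative assumptions, not MBLPE's exact procedure.

    import numpy as np

    def fisher_score(X, y):
        """X: (n_samples, n_features); y: integer class labels."""
        mean_all = X.mean(axis=0)
        num = np.zeros(X.shape[1])
        den = np.zeros(X.shape[1])
        for c in np.unique(y):
            Xc = X[y == c]
            num += len(Xc) * (Xc.mean(axis=0) - mean_all) ** 2   # between-class
            den += len(Xc) * Xc.var(axis=0)                      # within-class
        return num / (den + 1e-12)

    def init_population(X, y, pop_size, keep_ratio=0.2, seed=0):
        rng = np.random.default_rng(seed)
        top = np.argsort(fisher_score(X, y))[::-1][: max(1, int(keep_ratio * X.shape[1]))]
        pop = np.zeros((pop_size, X.shape[1]), dtype=bool)
        for ind in pop:   # each individual selects a random subset of top features
            ind[rng.choice(top, size=rng.integers(1, len(top) + 1), replace=False)] = True
        return pop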

NeurIPS Conference 2025 Conference Paper

Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models

  • Guo Chen
  • Zhiqi Li
  • Shihao Wang
  • Jindong Jiang
  • Yicheng Liu
  • Lidong Lu
  • De-An Huang
  • Wonmin Byeon

We introduce Eagle 2.5, a frontier vision-language model (VLM) for long-context multimodal learning. Our work addresses the challenges of long video comprehension and high-resolution image understanding, introducing a generalist framework for both tasks. The proposed training framework incorporates Automatic Degrade Sampling and Image Area Preservation, two techniques that preserve contextual integrity and visual details. The framework also includes numerous efficiency optimizations in the pipeline for long-context data training. Finally, we propose Eagle-Video-110K, a novel dataset that integrates both story-level and clip-level annotations, facilitating long-video understanding. Eagle 2.5 demonstrates substantial improvements on long-context multimodal benchmarks, providing a robust solution to the limitations of existing VLMs. Notably, our best model, Eagle 2.5-8B, achieves 72.4% on Video-MME with 512 input frames, matching the results of top-tier commercial models such as GPT-4o and large-scale open-source models like Qwen2.5-VL-72B and InternVL2.5-78B.

NeurIPS Conference 2025 Conference Paper

LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization

  • Zhenpeng Huang
  • Jiaqi Li
  • Zihan Jia
  • Xinhao Li
  • Desen Meng
  • Lingxue Song
  • Xi Chen
  • Liang Li

We present LongVPO, a novel two-stage Direct Preference Optimization (DPO) framework that enables short-context vision-language models to robustly understand ultra-long videos without any long-video annotations. In Stage 1, we synthesize preference triples by anchoring questions to individual short clips, interleaving them with distractors, and applying visual-similarity and question-specificity filtering to mitigate positional bias and ensure unambiguous supervision. We also approximate the reference model's scoring over long contexts by evaluating only the anchor clip, reducing computational overhead. In Stage 2, we employ a recursive captioning pipeline on long videos to generate scene-level metadata, and then use a large language model to craft multi-segment reasoning queries and dispreferred responses, aligning the model's preferences through multi-segment reasoning tasks. With only 16K synthetic examples and no costly human labels, LongVPO outperforms state-of-the-art open-source models on multiple long-video benchmarks while maintaining strong short-video performance (e.g., on MVBench), offering a scalable paradigm for efficient long-form video understanding.
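
For readers unfamiliar with DPO, the sketch below shows the standard preference loss the two stages build on, given summed token log-probabilities from the policy and the frozen reference model; per the abstract, Stage 1 computes the reference scores on the anchor clip only. This is the generic objective, not LongVPO's exact code.

    import torch
    import torch.nn.functional as F

    def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
        """Each input: (B,) sequence log-probs for preferred/dispreferred answers."""
        margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
        return -F.logsigmoid(beta * margin).mean()

    loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-14.0]),
                    torch.tensor([-11.0]), torch.tensor([-12.0]))
    print(float(loss))   # ~0.55: the policy already prefers the chosen answer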

NeurIPS Conference 2025 Conference Paper

MotionRAG: Motion Retrieval-Augmented Image-to-Video Generation

  • Chenhui Zhu
  • Yilu Wu
  • Shuai Wang
  • Gangshan Wu
  • Limin Wang

Image-to-video generation has made remarkable progress with the advancements in diffusion models, yet generating videos with realistic motion remains highly challenging. This difficulty arises from the complexity of accurately modeling motion, which involves capturing physical constraints, object interactions, and domain-specific dynamics that are not easily generalized across diverse scenarios. To address this, we propose MotionRAG, a retrieval-augmented framework that enhances motion realism by adapting motion priors from relevant reference videos through Context-Aware Motion Adaptation (CAMA). The key technical innovations include: (i) a retrieval-based pipeline extracting high-level motion features using a video encoder and specialized resamplers to distill semantic motion representations; (ii) an in-context learning approach for motion adaptation implemented through a causal transformer architecture; (iii) an attention-based motion injection adapter that seamlessly integrates transferred motion features into pretrained video diffusion models. Extensive experiments demonstrate that our method achieves significant improvements across multiple domains and various base models, all with negligible computational overhead during inference. Furthermore, our modular design enables zero-shot generalization to new domains by simply updating the retrieval database without retraining any components. This research enhances the core capability of video generation systems by enabling the effective retrieval and transfer of motion priors, facilitating the synthesis of realistic motion dynamics.

NeurIPS Conference 2025 Conference Paper

MUVR: A Multi-Modal Untrimmed Video Retrieval Benchmark with Multi-Level Visual Correspondence

  • Yue Feng
  • Jinwei Hu
  • Qijia Lu
  • Jiawei Niu
  • Li Tan
  • Shuo Yuan
  • Ziyi Yan
  • Yizhen Jia

We propose the Multi-modal Untrimmed Video Retrieval task, along with a new benchmark (MUVR), to advance video retrieval for long-video platforms. MUVR aims to retrieve untrimmed videos containing relevant segments using multi-modal queries. It has the following features: 1) Practical retrieval paradigm: MUVR supports video-centric multi-modal queries, expressing fine-grained retrieval needs through long text descriptions, video tag prompts, and mask prompts. It adopts a one-to-many retrieval paradigm and focuses on untrimmed videos, tailored for long-video platform applications. 2) Multi-level visual correspondence: To cover common video categories (e.g., news, travel, dance) and precisely define retrieval matching criteria, we construct multi-level visual correspondence based on the core video content (e.g., news events, travel locations, dance moves) that users are interested in and want to retrieve. It covers six levels: copy, event, scene, instance, action, and others. 3) Comprehensive evaluation criteria: We develop three versions of MUVR (i.e., Base, Filter, QA). MUVR-Base/Filter evaluates retrieval models, while MUVR-QA assesses MLLMs in a question-answering format. We also propose a Reranking Score to evaluate the reranking ability of MLLMs. MUVR consists of 53K untrimmed videos from the video platform Bilibili, with 1,050 multi-modal queries and 84K matches. Extensive evaluations of 3 state-of-the-art video retrieval models, 6 image-based VLMs, and 10 MLLMs are conducted. MUVR reveals the limitations of retrieval methods in processing untrimmed videos and multi-modal queries, as well as of MLLMs in multi-video understanding and reranking. Our code and benchmark are available at https://github.com/debby-0527/MUVR.

NeurIPS Conference 2025 Conference Paper

StreamForest: Efficient Online Video Understanding with Persistent Event Memory

  • Xiangyu Zeng
  • Kefan Qiu
  • Qingyu Zhang
  • Xinhao Li
  • Jing Wang
  • Jiaxin Li
  • Ziang Yan
  • Kun Tian

Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in video understanding. However, their effectiveness in real-time streaming scenarios remains limited due to storage constraints on historical visual features and insufficient real-time spatiotemporal reasoning. To address these challenges, we propose StreamForest, a novel architecture specifically designed for streaming video understanding. Central to StreamForest is the Persistent Event Memory Forest, a memory mechanism that adaptively organizes video frames into multiple event-level tree structures. This process is guided by penalty functions based on temporal distance, content similarity, and merge frequency, enabling efficient long-term memory retention under limited computational resources. To enhance real-time perception, we introduce a Fine-grained Spatiotemporal Window, which captures detailed short-term visual cues to improve current scene perception. Additionally, we present OnlineIT, an instruction-tuning dataset tailored for streaming video tasks. OnlineIT significantly boosts MLLM performance in both real-time perception and future prediction. To evaluate generalization in practical applications, we introduce ODV-Bench, a new benchmark focused on real-time streaming video understanding in autonomous driving scenarios. Experimental results demonstrate that StreamForest achieves state-of-the-art performance, with accuracies of 77.3% on StreamingBench, 60.5% on OVBench, and 55.6% on OVO-Bench. In particular, even under extreme visual token compression (limited to 1024 tokens), the model retains 96.8% of its average accuracy across eight benchmarks relative to the default setting. These results underscore the robustness, efficiency, and generalizability of StreamForest for streaming video understanding.
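
A minimal sketch of the kind of penalty-guided merging the Persistent Event Memory Forest performs: adjacent memory nodes are scored on temporal distance, content dissimilarity, and how often they have already been merged, and the lowest-penalty pair is merged until memory fits the budget. The functional form and the weights below are assumptions, not the paper's.

    import torch
    import torch.nn.functional as F

    def merge_penalty(feat_a, feat_b, t_a, t_b, merges_a, merges_b,
                      w_time=1.0, w_sim=1.0, w_freq=0.5):
        dissim = 1.0 - F.cosine_similarity(feat_a, feat_b, dim=-1)   # unlike content
        time_gap = torch.log1p(torch.tensor(float(abs(t_b - t_a))))  # distant frames
        freq = torch.tensor(float(merges_a + merges_b))              # over-merged nodes
        return w_time * time_gap + w_sim * dissim + w_freq * freq

    p = merge_penalty(torch.randn(256), torch.randn(256), t_a=3, t_b=4,
                      merges_a=0, merges_b=1)
    print(float(p))   # lower penalty -> better merge candidate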

NeurIPS Conference 2025 Conference Paper

VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception

  • Ziang Yan
  • Yinan He
  • Xinhao Li
  • Zhengrong Yue
  • Xiangyu Zeng
  • Yali Wang
  • Yu Qiao
  • Limin Wang

Inducing reasoning in multimodal large language models (MLLMs) is critical for achieving human-level perception and understanding. Existing methods mainly leverage LLM reasoning to analyze parsed visuals and are often limited by static perception stages. This paper introduces Visual Test-Time Scaling (VTTS), a novel approach to enhance MLLMs' reasoning via iterative perception during inference. VTTS mimics humans' hierarchical attention by progressively refining focus on high-confidence spatio-temporal regions, guided by updated textual predictions. Specifically, VTTS employs an Iterative Perception (ITP) mechanism, incorporating reinforcement learning with spatio-temporal supervision to optimize reasoning. To support this paradigm, we also present VTTS-80K, a dataset tailored for iterative perception. These designs allow an MLLM to enhance its performance by increasing its perceptual compute. Extensive experiments validate VTTS's effectiveness and generalization across diverse tasks and benchmarks. Our newly introduced VideoChat-R1.5 model achieves remarkable improvements, with an average increase of over 5% compared to robust baselines such as Qwen2.5-VL-3B and -7B, across more than 15 benchmarks encompassing video conversation, video reasoning, and spatio-temporal perception.

NeurIPS Conference 2024 Conference Paper

AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation

  • Yuhan Zhu
  • Yuyang Ji
  • Zhiyu Zhao
  • Gangshan Wu
  • Limin Wang

Pre-trained vision-language models (VLMs) have shown impressive results in various visual classification tasks. However, we often fail to fully unleash their potential when adapting them for new concept understanding due to limited information on new classes. To address this limitation, we introduce a novel adaptation framework, AWT (Augment, Weight, then Transport). AWT comprises three key components: augmenting inputs with diverse visual perspectives and enriched class descriptions through image transformations and language models; dynamically weighting inputs based on the prediction entropy; and employing optimal transport to mine semantic correlations in the vision-language space. AWT can be seamlessly integrated into various VLMs, enhancing their zero-shot capabilities without additional training and facilitating few-shot learning through an integrated multimodal adapter module. We verify AWT in multiple challenging scenarios, including zero-shot and few-shot image classification, zero-shot video action recognition, and out-of-distribution generalization. AWT consistently outperforms the state-of-the-art methods in each setting. In addition, our extensive studies further demonstrate AWT's effectiveness and adaptability across different VLMs, architectures, and scales.
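
The entropy-based weighting step can be illustrated in a few lines: augmented views whose predictions have lower entropy (i.e., more confident ones) receive larger weights. The temperature and the softmax-over-negative-entropy form are illustrative assumptions rather than AWT's exact formula.

    import torch

    def entropy_weights(logits_per_view, tau=0.5):
        """logits_per_view: (V, C) class logits for V augmented views of one input."""
        probs = logits_per_view.softmax(dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # (V,)
        return torch.softmax(-entropy / tau, dim=0)  # confident views weigh more

    weights = entropy_weights(torch.randn(8, 100))
    print(weights.sum())   # tensor(1.) -- a convex combination over views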

NeurIPS Conference 2024 Conference Paper

Does Video-Text Pretraining Help Open-Vocabulary Online Action Detection?

  • Qingsong Zhao
  • Yi Wang
  • Jilan Xu
  • Yinan He
  • Zifan Song
  • Limin Wang
  • Yu Qiao
  • Cairong Zhao

Video understanding relies on accurate action detection for temporal analysis. However, existing mainstream methods have limitations in real-world applications due to their offline and closed-set evaluation approaches, as well as their dependence on manual annotations. To address these challenges and enable real-time action understanding in open-world scenarios, we propose OV-OAD, a zero-shot online action detector that leverages vision-language models and learns solely from text supervision. By introducing an object-centered decoder unit into a Transformer-based model, we aggregate frames with similar semantics using video-text correspondence. Extensive experiments on four action detection benchmarks demonstrate that OV-OAD outperforms other advanced zero-shot methods. Specifically, it achieves 37.5% mean average precision on THUMOS'14 and 73.8% calibrated average precision on TVSeries. This research establishes a robust baseline for zero-shot transfer in online action detection, enabling scalable solutions for open-world temporal understanding. The code will be available for download at https://github.com/OpenGVLab/OV-OAD.

NeurIPS Conference 2024 Conference Paper

Exploring DCN-like architecture for fast image generation with arbitrary resolution

  • Shuai Wang
  • Zexian Li
  • Tianhui Song
  • Xubin Li
  • Tiezheng Ge
  • Bo Zheng
  • Limin Wang

Arbitrary-resolution image generation remains a challenging task in AIGC, as it requires handling varying resolutions and aspect ratios while maintaining high visual quality. Existing transformer-based diffusion methods suffer from quadratic computation cost and limited resolution extrapolation capabilities, making them less effective for this task. In this paper, we propose FlowDCN, a purely convolution-based generative model with linear time and memory complexity that can efficiently generate high-quality images at arbitrary resolutions. Equipped with a new design of learnable group-wise deformable convolution block, our FlowDCN yields higher flexibility and capability to handle different resolutions with a single model. FlowDCN achieves a state-of-the-art 4.30 sFID on the 256×256 ImageNet benchmark and comparable resolution extrapolation results, surpassing transformer-based counterparts in terms of convergence speed (only 1/5 of the images), visual quality, parameters (8% reduction), and FLOPs (20% reduction). We believe FlowDCN offers a promising solution to scalable and flexible image synthesis.
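
As a rough illustration of the deformable-convolution building block, the sketch below predicts per-location sampling offsets and applies torchvision's deform_conv2d. FlowDCN's group-wise design and other details are not reproduced; this is a generic deformable block under assumed dimensions.

    import torch
    import torch.nn as nn
    from torchvision.ops import deform_conv2d

    class DeformBlock(nn.Module):
        def __init__(self, channels, k=3):
            super().__init__()
            self.k = k
            self.weight = nn.Parameter(torch.randn(channels, channels, k, k) * 0.02)
            # Predict (dy, dx) offsets for each of the k*k sampling locations.
            self.offset = nn.Conv2d(channels, 2 * k * k, kernel_size=3, padding=1)

        def forward(self, x):
            return deform_conv2d(x, self.offset(x), self.weight, padding=self.k // 2)

    x = torch.randn(1, 32, 16, 16)   # convolutional, so any resolution works
    print(DeformBlock(32)(x).shape)  # torch.Size([1, 32, 16, 16])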

NeurIPS Conference 2024 Conference Paper

VFIMamba: Video Frame Interpolation with State Space Models

  • Guozhen Zhang
  • Chunxu Liu
  • Yutao Cui
  • Xiaotong Zhao
  • Kai Ma
  • Limin Wang

Inter-frame modeling is pivotal in generating intermediate frames for video frame interpolation (VFI). Current approaches predominantly rely on convolution or attention-based models, which often either lack sufficient receptive fields or entail significant computational overheads. Recently, Selective State Space Models (S6) have emerged, tailored specifically for long sequence modeling, offering both linear complexity and data-dependent modeling capabilities. In this paper, we propose VFIMamba, a novel frame interpolation method for efficient and dynamic inter-frame modeling by harnessing the S6 model. Our approach introduces the Mixed-SSM Block (MSB), which initially rearranges tokens from adjacent frames in an interleaved fashion and subsequently applies multi-directional S6 modeling. This design facilitates the efficient transmission of information across frames while upholding linear complexity. Furthermore, we introduce a novel curriculum learning strategy that progressively cultivates proficiency in modeling inter-frame dynamics across varying motion magnitudes, fully unleashing the potential of the S6 model. Experimental findings showcase that our method attains state-of-the-art performance across diverse benchmarks, particularly excelling in high-resolution scenarios. In particular, on the X-TEST dataset, VFIMamba demonstrates a noteworthy improvement of 0.80 dB for 4K frames and 0.96 dB for 2K frames.
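
The interleaved rearrangement at the heart of the MSB can be shown in a few lines: tokens from the two adjacent frames are alternated along the sequence before multi-directional S6 scanning, so inter-frame dependencies become neighboring positions. Shapes here are illustrative.

    import torch

    def interleave_frames(tok_a, tok_b):
        """tok_a, tok_b: (B, L, C) token sequences from two adjacent frames."""
        B, L, C = tok_a.shape
        mixed = torch.stack([tok_a, tok_b], dim=2)   # (B, L, 2, C)
        return mixed.reshape(B, 2 * L, C)            # a0, b0, a1, b1, ...

    a = torch.arange(6.0).reshape(1, 3, 2)
    b = -a
    print(interleave_frames(a, b)[0])   # rows alternate between the two frames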

AAAI Conference 2023 Conference Paper

CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets

  • Jiange Yang
  • Sheng Guo
  • Gangshan Wu
  • Limin Wang

Current RGB-D scene recognition approaches often train two standalone backbones for RGB and depth modalities with the same Places or ImageNet pre-training. However, the pre-trained depth network is still biased by RGB-based models, which may result in a suboptimal solution. In this paper, we present a single-model self-supervised hybrid pre-training framework for RGB and depth modalities, termed CoMAE. Our CoMAE presents a curriculum learning strategy to unify the two popular self-supervised representation learning algorithms: contrastive learning and masked image modeling. Specifically, we first build a patch-level alignment task to pre-train a single encoder shared by the two modalities via cross-modal contrastive learning. Then, the pre-trained contrastive encoder is passed to a multi-modal masked autoencoder to capture finer context features from a generative perspective. In addition, our single-model design, which requires no fusion module, is very flexible and robust in generalizing to unimodal scenarios in both training and testing phases. Extensive experiments on the SUN RGB-D and NYUDv2 datasets demonstrate the effectiveness of our CoMAE for RGB and depth representation learning. In addition, our experimental results reveal that CoMAE is a data-efficient representation learner. Although we only use the small-scale and unlabeled training set for pre-training, our CoMAE pre-trained models are still competitive with state-of-the-art methods that use extra large-scale, supervised RGB dataset pre-training. Code will be released at https://github.com/MCG-NJU/CoMAE.

NeurIPS Conference 2023 Conference Paper

JourneyDB: A Benchmark for Generative Image Understanding

  • Keqiang Sun
  • Junting Pan
  • Yuying Ge
  • Hao Li
  • Haodong Duan
  • Xiaoshi Wu
  • Renrui Zhang
  • Aojun Zhou

While recent advancements in vision-language models have had a transformative impact on multi-modal comprehension, the extent to which these models can comprehend generated images remains uncertain. Synthetic images, in comparison to real data, encompass a higher level of diversity in both content and style, thereby presenting significant challenges for the models to fully grasp. In light of this challenge, we introduce a comprehensive dataset, referred to as JourneyDB, that caters to the domain of generative images within the context of multi-modal visual understanding. Our meticulously curated dataset comprises 4 million distinct and high-quality generated images, each paired with the text prompt that was employed in its creation. Furthermore, we introduce an external subset with results from another 22 text-to-image generative models, which makes JourneyDB a comprehensive benchmark for evaluating the comprehension of generated images. On our dataset, we have devised four benchmarks to assess generated-image comprehension in relation to both content and style interpretation: prompt inversion, style retrieval, image captioning, and visual question answering. Lastly, we evaluate the performance of state-of-the-art multi-modal models on JourneyDB, providing a comprehensive analysis of their strengths and limitations in comprehending generated content. We anticipate that the proposed dataset and benchmarks will facilitate further research in the field of generative content understanding. The dataset is publicly available at https://journeydb.github.io.

NeurIPS Conference 2023 Conference Paper

MixFormerV2: Efficient Fully Transformer Tracking

  • Yutao Cui
  • Tianhui Song
  • Gangshan Wu
  • Limin Wang

Transformer-based trackers have achieved strong accuracy on the standard benchmarks. However, their efficiency remains an obstacle to practical deployment on both GPU and CPU platforms. In this paper, to overcome this issue, we propose a fully transformer tracking framework, coined MixFormerV2, without any dense convolutional operation or complex score prediction module. Our key design is to introduce four special prediction tokens and concatenate them with the tokens from the target template and search areas. Then, we apply the unified transformer backbone to this mixed token sequence. These prediction tokens are able to capture the complex correlation between target template and search area via mixed attentions. Based on them, we can easily predict the tracking box and estimate its confidence score through simple MLP heads. To further improve the efficiency of MixFormerV2, we present a new distillation-based model reduction paradigm, including dense-to-sparse distillation and deep-to-shallow distillation. The former transfers knowledge from the dense-head MixViT to our fully transformer tracker, while the latter prunes some layers of the backbone. We instantiate two types of MixFormerV2: MixFormerV2-B achieves an AUC of 70.6% on LaSOT and an AUC of 56.7% on TNL2k with a high GPU speed of 165 FPS, and MixFormerV2-S surpasses FEAR-L by 2.7% AUC on LaSOT at real-time CPU speed.
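
A toy sketch of the prediction-token idea: learnable tokens are concatenated with template and search tokens, processed by a shared transformer, and read out by plain MLP heads. The tiny nn.TransformerEncoder backbone and all dimensions here are stand-in assumptions, not MixFormerV2's architecture.

    import torch
    import torch.nn as nn

    class TinyTracker(nn.Module):
        def __init__(self, dim=256, n_pred=4):
            super().__init__()
            self.pred_tokens = nn.Parameter(torch.zeros(1, n_pred, dim))
            layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, num_layers=2)
            self.box_head = nn.Linear(n_pred * dim, 4)     # (cx, cy, w, h)
            self.score_head = nn.Linear(n_pred * dim, 1)   # confidence

        def forward(self, template_tok, search_tok):       # assumes n_pred=4
            B = template_tok.size(0)
            tokens = torch.cat([self.pred_tokens.expand(B, -1, -1),
                                template_tok, search_tok], dim=1)
            out = self.backbone(tokens)[:, :4].flatten(1)  # read back pred tokens
            return self.box_head(out).sigmoid(), self.score_head(out).sigmoid()

    box, score = TinyTracker()(torch.randn(2, 64, 256), torch.randn(2, 256, 256))
    print(box.shape, score.shape)   # torch.Size([2, 4]) torch.Size([2, 1])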

AAAI Conference 2022 Conference Paper

DCAN: Improving Temporal Action Detection via Dual Context Aggregation

  • Guo Chen
  • Yin-Dong Zheng
  • Limin Wang
  • Tong Lu

Temporal action detection aims to locate the boundaries of actions in video. Current methods based on boundary matching enumerate and calculate all possible boundary matchings to generate proposals. However, these methods neglect long-range context aggregation in boundary prediction. At the same time, due to the similar semantics of adjacent matchings, local semantic aggregation of densely generated matchings cannot improve semantic richness and discrimination. In this paper, we propose an end-to-end proposal generation method named Dual Context Aggregation Network (DCAN) that aggregates context at two levels, namely the boundary level and the proposal level, to generate high-quality action proposals and thereby improve the performance of temporal action detection. Specifically, we design Multi-Path Temporal Context Aggregation (MTCA) to achieve smooth context aggregation at the boundary level and precise evaluation of boundaries. For matching evaluation, Coarse-to-fine Matching (CFM) is designed to aggregate context at the proposal level and refine the matching map from coarse to fine. We conduct extensive experiments on ActivityNet v1.3 and THUMOS-14. DCAN obtains an average mAP of 35.39% on ActivityNet v1.3 and reaches 54.14% mAP at IoU@0.5 on THUMOS-14, which demonstrates that DCAN can generate high-quality proposals and achieve state-of-the-art performance. We release the code at https://github.com/cg1177/DCAN.

JBHI Journal 2022 Journal Article

Joint Landmark and Structure Learning for Automatic Evaluation of Developmental Dysplasia of the Hip

  • Xindi Hu
  • Limin Wang
  • Xin Yang
  • Xu Zhou
  • Wufeng Xue
  • Yan Cao
  • Shengfeng Liu
  • Yuhao Huang

The ultrasound (US) screening of the infant hip is vital for the early diagnosis of developmental dysplasia of the hip (DDH). The US diagnosis of DDH involves measuring alpha and beta angles that quantify hip joint development. These two angles are calculated from key anatomical landmarks and structures of the hip. However, this measurement process is not trivial for sonographers and usually requires a thorough understanding of complex anatomical structures. In this study, we propose a multi-task framework to learn the relationships among landmarks and structures jointly and automatically evaluate DDH. Our multi-task networks are equipped with three novel modules. Firstly, we adopt Mask R-CNN as the basic framework to detect and segment key anatomical structures and add one landmark detection branch to form a new multi-task framework. Secondly, we propose a novel shape similarity loss to refine the incomplete anatomical structure prediction robustly and accurately. Thirdly, we further incorporate the landmark-structure consistency prior to ensure the consistency of the bony rim estimated from the segmented structure and the detected landmark. In our experiments, 1231 US images of the infant hip from 632 patients are collected, of which 247 images from 126 patients are used for testing. The average errors in the alpha and beta angles are 2.221° and 2.899°. About 93% of alpha-angle and 85% of beta-angle estimates have errors of less than 5 degrees. Experimental results demonstrate that the proposed method can accurately and robustly realize the automatic evaluation of DDH, showing great potential for clinical application.

AAAI Conference 2022 Conference Paper

Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding

  • Zhenzhi Wang
  • Limin Wang
  • Tao Wu
  • Tianhao Li
  • Gangshan Wu

Temporal grounding aims to localize a video moment that is semantically aligned with a given natural language query. Existing methods typically apply a detection or regression pipeline on the fused representation, with the research focus on designing complicated prediction heads or fusion strategies. Instead, viewing temporal grounding as a metric-learning problem, we present a Mutual Matching Network (MMN) to directly model the similarity between language queries and video moments in a joint embedding space. This new metric-learning framework enables fully exploiting negative samples from two new aspects: constructing negative cross-modal pairs in a mutual matching scheme and mining negative pairs across different videos. These new negative samples enhance the joint representation learning of the two modalities via cross-modal mutual matching to maximize their mutual information. Experiments show that our MMN achieves highly competitive performance compared with state-of-the-art methods on four video grounding benchmarks. Based on MMN, we present the winning solution for the HC-STVG challenge of the 3rd PIC workshop. This suggests that metric learning is still a promising method for temporal grounding, via capturing the essential cross-modal correlation in a joint embedding space. Code is available at https://github.com/MCG-NJU/MMN.
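
The cross-video negative mining the abstract credits can be illustrated with a symmetric InfoNCE-style loss over a batch: every non-matching moment-query pair acts as a negative in the joint embedding space. This shows the general recipe, not necessarily MMN's exact formulation.

    import torch
    import torch.nn.functional as F

    def mutual_matching_loss(moment_emb, query_emb, tau=0.07):
        """moment_emb, query_emb: (B, D); row i of each forms the matching pair."""
        m = F.normalize(moment_emb, dim=-1)
        q = F.normalize(query_emb, dim=-1)
        sim = m @ q.t() / tau                 # (B, B) cross-modal similarities
        labels = torch.arange(sim.size(0))
        # Off-diagonal entries are negatives mined across videos in the batch.
        return 0.5 * (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels))

    print(float(mutual_matching_loss(torch.randn(8, 128), torch.randn(8, 128))))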

NeurIPS Conference 2022 Conference Paper

PointTAD: Multi-Label Temporal Action Detection with Learnable Query Points

  • Jing Tan
  • Xiaotong Zhao
  • Xintian Shi
  • Bin Kang
  • Limin Wang

Traditional temporal action detection (TAD) usually handles untrimmed videos with a small number of action instances from a single label (e.g., ActivityNet, THUMOS). However, this setting can be unrealistic, as different classes of actions often co-occur in practice. In this paper, we focus on the task of multi-label temporal action detection, which aims to localize all action instances in a multi-label untrimmed video. Multi-label TAD is more challenging, as it requires fine-grained class discrimination within a single video and precise localization of co-occurring instances. To address this, we extend the sparse query-based detection paradigm from traditional TAD and propose the multi-label TAD framework PointTAD. Specifically, our PointTAD introduces a small set of learnable query points to represent the important frames of each action instance. This point-based representation provides a flexible mechanism to localize the discriminative frames at boundaries as well as the important frames inside the action. Moreover, we perform the action decoding process with a Multi-level Interactive Module to capture both point-level and instance-level action semantics. Finally, our PointTAD employs an end-to-end trainable framework based simply on RGB input for easy deployment. We evaluate the proposed method on two popular benchmarks and introduce the new metric of detection-mAP for multi-label TAD. Our model outperforms all previous methods by a large margin under the detection-mAP metric and also achieves promising results under the segmentation-mAP metric.

NeurIPS Conference 2022 Conference Paper

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

  • Zhan Tong
  • Yibing Song
  • Jue Wang
  • Limin Wang

Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). We are inspired by the recent ImageMAE and propose customized video tube masking with an extremely high ratio. This simple design makes video reconstruction a more challenging and meaningful self-supervision task, thus encouraging the extraction of more effective video representations during pre-training. We obtain three important findings with VideoMAE: (1) An extremely high masking ratio (i.e., 90% to 95%) still yields favorable performance for VideoMAE. The temporally redundant video content enables a higher masking ratio than for images. (2) VideoMAE achieves impressive results on very small datasets (i.e., around 3k-4k videos) without using any extra data. This is partially ascribed to the challenging task of video reconstruction enforcing high-level structure learning. (3) VideoMAE shows that data quality is more important than data quantity for SSVP. Domain shift between pre-training and target datasets is an important factor. Notably, our VideoMAE with the vanilla ViT backbone achieves 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51, without using any extra data. Code is available at https://github.com/MCG-NJU/VideoMAE.
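
The tube-masking design is simple enough to sketch directly: one spatial patch mask is sampled and repeated across all time steps, so a masked patch stays hidden in every frame and cannot be recovered from temporally redundant neighbors. The patch-grid sizes below are illustrative.

    import torch

    def tube_mask(batch, frames, h_patches, w_patches, ratio=0.9):
        n = h_patches * w_patches
        n_mask = int(n * ratio)
        mask = torch.zeros(batch, n, dtype=torch.bool)
        for b in range(batch):
            mask[b, torch.randperm(n)[:n_mask]] = True
        # Repeat the same spatial mask along time -> space-time "tubes".
        return mask.unsqueeze(1).expand(batch, frames, n).reshape(batch, -1)

    m = tube_mask(2, 8, 14, 14)
    print(m.shape, round(m.float().mean().item(), 3))   # (2, 1568), ~0.9 masked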

TCS Journal 2021 Journal Article

Approximation algorithms for the dynamic k-level facility location problems

  • Limin Wang
  • Zhao Zhang
  • Chenchen Wu
  • Dachuan Xu
  • Xiaoyan Zhang

In this paper, we first consider a dynamic k-level facility location problem, which generalizes the k-level facility location problem by taking the time factor into account. We present a combinatorial primal-dual approximation algorithm for this problem that finds a constant-factor approximate solution. Then, we investigate the dynamic k-level facility location problem with submodular penalties and outliers, which extends the existing problem on two fronts: from static to dynamic, and from disallowing penalties (outliers) to allowing them. Based on the primal-dual technique and the triangle inequality property, we also give constant-factor approximation algorithms for the dynamic problem with submodular penalties and with outliers, respectively.
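
For orientation, one common way to write the dynamic k-level objective adds a time index t to the static formulation: in each period, clients are served through a path p of open facilities, one per level. The notation below is illustrative, not the paper's exact model.

    % f_i^t: cost of opening facility i in period t; y_i^t: opening indicator.
    % c_{jp}^t: cost of connecting client j via facility path p in period t.
    \min \sum_{t}\sum_{i} f_i^t\, y_i^t
       + \sum_{t}\sum_{j \in D^t}\sum_{p} c_{jp}^t\, x_{jp}^t
    \quad \text{s.t.}\quad
    \sum_{p} x_{jp}^t \ge 1 \ \ \forall j \in D^t,\ t; \qquad
    x_{jp}^t \le y_i^t \ \ \forall j,\ t,\ p \ni i .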

IS Journal 2021 Journal Article

Saliency Detection With a Three-Stage Hierarchical Network

  • Dongjing Shan
  • Xiongwei Zhang
  • Limin Wang
  • Tieyong Cao
  • Chao Zhang

Deep learning approaches for saliency detection have attracted much attention and have been widely exploited in recent years. In this article, we propose a three-stage hierarchical neural network to model the detection. Initially, Fast R-CNN is used to extract features for each superpixel, and the high-level prior information of traditional models is incorporated to weight the deep learning features. Next, in the regional stage, a self-attention mechanism is used to expand the receptive field from one superpixel to its surrounding and relevant regions. Last, saliency scores are sampled via the Gumbel-Softmax trick in a global regression model. In the experiments, we compare our models, including two variations (networks without self-attention or prior weights), with 12 previous methods and test them on several benchmark datasets. Different kinds of evaluation strategies are also adopted, and the results demonstrate that our method achieves excellent performance.
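
The Gumbel-Softmax trick used in the global regression stage can be sketched in a few lines: Gumbel noise added to the logits plus a temperature-scaled softmax yields differentiable approximate samples (PyTorch also ships this as torch.nn.functional.gumbel_softmax). The two-class per-superpixel setup below is illustrative.

    import torch
    import torch.nn.functional as F

    def gumbel_softmax_sample(logits, tau=1.0):
        # Sample Gumbel(0, 1) noise via -log(-log(U)), then relax the argmax.
        gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
        return F.softmax((logits + gumbel) / tau, dim=-1)

    logits = torch.randn(5, 2)              # 5 superpixels, salient vs. not
    print(gumbel_softmax_sample(logits))    # differentiable "samples"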

AAAI Conference 2020 Conference Paper

Finding Action Tubes with a Sparse-to-Dense Framework

  • Yuxi Li
  • Weiyao Lin
  • Tao Wang
  • John See
  • Rui Qian
  • Ning Xu
  • Limin Wang
  • Shugong Xu

The task of spatial-temporal action detection has attracted increasing attention among researchers. Existing dominant methods solve this problem by relying on short-term information and dense serial-wise detection on individual frames or clips. Despite their effectiveness, these methods make inadequate use of long-term information and are prone to inefficiency. In this paper, we propose, for the first time, an efficient framework that generates action tube proposals from video streams with a single forward pass in a sparse-to-dense manner. There are two key characteristics in this framework: (1) both long-term and short-term sampled information are explicitly utilized in our spatiotemporal network, and (2) a new dynamic feature sampling module (DTS) is designed to effectively approximate the tube output while keeping the system tractable. We evaluate the efficacy of our model on the UCF101-24, JHMDB-21 and UCFSports benchmark datasets, achieving promising results that are competitive with state-of-the-art methods. The proposed sparse-to-dense strategy renders our framework about 7.6 times more efficient than the nearest competitor.

AAAI Conference 2020 Conference Paper

Knowledge Integration Networks for Action Recognition

  • Shiwen Zhang
  • Sheng Guo
  • Limin Wang
  • Weilin Huang
  • Matthew Scott

In this work, we propose Knowledge Integration Networks (referred to as KINet) for video action recognition. KINet is capable of aggregating meaningful context features that are of great importance for identifying an action, such as human information and scene context. We design a three-branch architecture consisting of a main branch for action recognition and two auxiliary branches for human parsing and scene recognition, which allow the model to encode knowledge of humans and scenes for action recognition. We explore two pre-trained models as teacher networks to distill human and scene knowledge for training the auxiliary tasks of KINet. Furthermore, we propose a two-level knowledge encoding mechanism which contains a Cross Branch Integration (CBI) module for encoding the auxiliary knowledge into medium-level convolutional features, and an Action Knowledge Graph (AKG) for effectively fusing high-level context information. This results in an end-to-end trainable framework where the three tasks can be trained collaboratively, allowing the model to compute strong context knowledge efficiently. The proposed KINet achieves state-of-the-art performance on the large-scale action recognition benchmark Kinetics-400, with a top-1 accuracy of 77.8%. We further demonstrate the strong transferability of KINet: the Kinetics-trained model obtains 97.8% top-1 accuracy on UCF-101.

AAAI Conference 2020 Conference Paper

TEINet: Towards an Efficient Architecture for Video Recognition

  • Zhaoyang Liu
  • Donghao Luo
  • Yabiao Wang
  • Limin Wang
  • Ying Tai
  • Chengjie Wang
  • Jilin Li
  • Feiyue Huang

Efficiency is an important issue in designing video architectures for action recognition. 3D CNNs have witnessed remarkable progress in action recognition from videos. However, compared with their 2D counterparts, 3D convolutions often introduce a large number of parameters and cause high computational cost. To relieve this problem, we propose an efficient temporal module, termed the Temporal Enhancement-and-Interaction (TEI) module, which can be plugged into existing 2D CNNs (the resulting network is denoted TEINet). The TEI module presents a different paradigm for learning temporal features by decoupling the modeling of channel correlation and temporal interaction. First, it contains a Motion Enhanced Module (MEM), which enhances motion-related features while suppressing irrelevant information (e.g., background). Then, it introduces a Temporal Interaction Module (TIM), which supplements temporal contextual information in a channel-wise manner. This two-stage modeling scheme is not only able to capture temporal structure flexibly and effectively but is also efficient for model inference. We conduct extensive experiments to verify the effectiveness of TEINet on several benchmarks (e.g., Something-Something V1&V2, Kinetics, UCF101 and HMDB51). Our proposed TEINet achieves good recognition accuracy on these datasets while preserving high efficiency.
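
A minimal sketch in the spirit of the channel-wise temporal interaction: a depthwise 1D convolution along the time axis supplements each channel with context from neighboring frames at low cost. The kernel size and placement are assumptions, not TEINet's exact module.

    import torch
    import torch.nn as nn

    class TemporalInteraction(nn.Module):
        def __init__(self, channels, k=3):
            super().__init__()
            # groups=channels -> one temporal filter per channel (channel-wise).
            self.conv = nn.Conv1d(channels, channels, k, padding=k // 2, groups=channels)

        def forward(self, x):                       # x: (B, T, C, H, W)
            B, T, C, H, W = x.shape
            y = x.permute(0, 3, 4, 2, 1).reshape(B * H * W, C, T)
            return self.conv(y).reshape(B, H, W, C, T).permute(0, 4, 3, 1, 2)

    x = torch.randn(2, 8, 16, 7, 7)
    print(TemporalInteraction(16)(x).shape)         # torch.Size([2, 8, 16, 7, 7])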

IJCAI Conference 2019 Conference Paper

Dynamically Visual Disambiguation of Keyword-based Image Search

  • Yazhou Yao
  • Zeren Sun
  • Fumin Shen
  • Li Liu
  • Limin Wang
  • Fan Zhu
  • Lizhong Ding
  • Gangshan Wu

Due to the high cost of manual annotation, learning directly from the web has attracted broad attention. One issue that limits the performance of such methods is visual polysemy. To address this issue, we present an adaptive multi-model framework that resolves polysemy by visual disambiguation. Compared to existing methods, the primary advantage of our approach is that it can adapt to dynamic changes in the search results. Our proposed framework consists of two major steps: we first discover and dynamically select text queries according to the image search results, then we employ the proposed saliency-guided deep multi-instance learning network to remove outliers and learn classification models for visual disambiguation. Extensive experiments demonstrate the superiority of our proposed approach.

AAAI Conference 2019 Conference Paper

StNet: Local and Global Spatial-Temporal Modeling for Action Recognition

  • Dongliang He
  • Zhichao Zhou
  • Chuang Gan
  • Fu Li
  • Xiao Liu
  • Yandong Li
  • Limin Wang
  • Shilei Wen

Despite the success of deep learning for static image understanding, it remains unclear what the most effective network architectures are for spatial-temporal modeling in videos. In this paper, in contrast to existing CNN+RNN or pure 3D convolution based approaches, we explore a novel spatial-temporal network (StNet) architecture for both local and global modeling in videos. Particularly, StNet stacks N successive video frames into a super-image with 3N channels and applies 2D convolution on super-images to capture local spatial-temporal relationships. To model global spatial-temporal structure, we apply temporal convolution on the local spatial-temporal feature maps. Specifically, a novel temporal Xception block is proposed in StNet, which employs separate channel-wise and temporal-wise convolutions over the feature sequence of a video. Extensive experiments on the Kinetics dataset demonstrate that our framework outperforms several state-of-the-art approaches in action recognition and can strike a satisfying trade-off between recognition accuracy and model complexity. We further demonstrate the generalization performance of the learned video representations on the UCF101 dataset.
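
The super-image construction is concrete enough to sketch: N consecutive RGB frames are stacked channel-wise into a single 3N-channel tensor, so an ordinary 2D convolution mixes local spatial-temporal information. The frame counts and conv configuration below are illustrative.

    import torch
    import torch.nn as nn

    B, T, N = 2, 12, 3                        # T frames grouped into T/N super-images
    frames = torch.randn(B, T, 3, 224, 224)
    supers = frames.reshape(B, T // N, 3 * N, 224, 224)   # (B, T/N, 3N, H, W)

    conv2d = nn.Conv2d(3 * N, 64, kernel_size=7, stride=2, padding=3)
    feats = conv2d(supers.flatten(0, 1))      # 2D conv over each super-image
    print(feats.shape)                        # torch.Size([8, 64, 112, 112])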