Arrow Research search

Author name cluster

Wei Ji

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

30 papers
2 author rows

Possible papers (30)

AAAI Conference 2026 Conference Paper

Evolving Generalist Virtual Agents with Generative and Associative Memory

  • Zhenkui Zhang
  • Wendong Bu
  • Kaihang Pan
  • Bingchen Miao
  • Wenqiao Zhang
  • Guoming Wang
  • Wei Ji
  • Rui Tang

Generalist Virtual Agents (GVAs) powered by Multimodal Large Language Models (MLLMs) exhibit impressive capabilities. However, their long-term learning is hampered by a core limitation: a failure to evolve beyond existing trajectories. This stems from memory systems that treat experiences as isolated fragments and rely on brittle semantic retrieval, preventing the synthesis of novel solutions from disparate knowledge. To address this, we introduce CA3Mem, a framework inspired by the human hippocampus that organizes experiences into a structured memory graph. Leveraging this graph, CA3Mem features two key innovations: 1) a generative memory recombination mechanism that synthesizes novel solutions to drive agent evolution, and 2) an associative retrieval algorithm that employs spreading activation to recall a comprehensive and contextually-aware set of experiences. Experiments on OSWorld and WebArena demonstrate that CA3Mem significantly enhances agent capabilities, leading to marked improvements in long-horizon planning, compositional generalization for novel tasks, and continuous adaptation from experience.
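
The associative retrieval above relies on spreading activation over the memory graph. As a rough illustration (the graph schema, decay factor, and activation threshold here are assumptions for the sketch, not CA3Mem's actual implementation), such a recall step could look like:

```python
from collections import defaultdict

def spreading_activation(graph, seeds, decay=0.5, threshold=0.05, max_hops=3):
    """Propagate activation from seed memories over a weighted memory graph.

    graph: dict mapping node -> list of (neighbor, edge_weight in [0, 1])
    seeds: dict mapping seed node -> initial activation
    Returns node -> accumulated activation, i.e. the recalled experience set.
    """
    activation = defaultdict(float, seeds)
    frontier = dict(seeds)
    for _ in range(max_hops):
        next_frontier = defaultdict(float)
        for node, act in frontier.items():
            for neighbor, weight in graph.get(node, []):
                spread = act * weight * decay
                if spread >= threshold:
                    next_frontier[neighbor] += spread
        if not next_frontier:
            break
        for node, act in next_frontier.items():
            activation[node] += act
        frontier = next_frontier
    return dict(activation)

# Toy memory graph: experiences linked by semantic/temporal edges.
memory_graph = {
    "open_browser": [("search_flights", 0.9), ("check_email", 0.4)],
    "search_flights": [("compare_prices", 0.8)],
    "compare_prices": [("book_ticket", 0.7)],
}
print(spreading_activation(memory_graph, {"open_browser": 1.0}))
```

Seeding activation at the memories matching the current observation and letting it decay over hops recalls contextually related experiences rather than only the nearest semantic neighbors, which is the contrast with brittle semantic retrieval drawn above.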

AAAI Conference 2026 Conference Paper

Towards Unified Vision-Language Models with Incomplete Multi-Modal Inputs

  • Xiang Fang
  • Wanlong Fang
  • Changshuo Wang
  • Keke Tang
  • Daizong Liu
  • Siyi Wang
  • Wei Ji

Video-Language Models (VLMs) have demonstrated impressive multi-modal reasoning capabilities across diverse computer vision applications. However, these VLMs are task-specific and assume that both video and language inputs are complete. Real-world VLM applications, by contrast, might face challenges due to deactivated sensors (e.g., cameras unavailable for data-privacy reasons), yielding modality-incomplete data and leading to inconsistency between training and testing data. While naively handling incomplete inputs can hurt generalization ability and even lead to training failure, the potential risks to VLMs regarding safety and trustworthiness have been largely neglected. To this end, we make the first attempt to propose a unified incomplete video-language model to process incomplete multi-modal inputs. Extensive experimental results show that our method can serve as a plug-and-play module for previous works to improve their performance in various multi-modal tasks.

NeurIPS Conference 2025 Conference Paper

Counterfactual Evolution of Multimodal Datasets via Visual Programming

  • Minghe Gao
  • Zhongqi Yue
  • Wenjie Yan
  • Yihao Hu
  • Wei Ji
  • Siliang Tang
  • Jun Xiao
  • Tat-Seng Chua

The rapid development of Multimodal Large Language Models (MLLMs) poses increasing demands on the diversity and complexity of multimodal datasets. Yet manual annotation pipelines can no longer keep pace. Existing augmentation methods often follow fixed rules and lack verifiable control over sample diversity and reasoning complexity. To address this, we introduce Scalable COunterfactual Program Evolution (SCOPE), a framework that uses symbolic Visual Programming to guide program evolution via counterfactual reasoning. SCOPE performs the three steps of counterfactual inference: (1) Abduction, by generating verifiable programs to model reasoning associations; (2) Action, by intervening on program structure along three axes—reasoning path, visual context, and cross-instance composition; and (3) Prediction, by categorizing evolved instances by difficulty, structure, and input multiplicity. Based on this process, we build SCOPE-Train and SCOPE-Test, evolving benchmarks with expert validation. To support training, we propose MAP, a curriculum learning strategy that aligns model capacity with sample difficulty. Experiments show that SCOPE improves reasoning performance, exposes model blind spots, and enhances visual dialog capabilities.

NeurIPS Conference 2025 Conference Paper

Degradation-Aware Dynamic Schrödinger Bridge for Unpaired Image Restoration

  • Jingjun Yi
  • Qi Bi
  • Hao Zheng
  • Huimin Huang
  • Yixian Shen
  • Haolan Zhan
  • Wei Ji
  • Yawen Huang

Image restoration is a fundamental task in computer vision and machine learning, which learns a mapping between clear images and degraded images under various conditions (e.g., blur, low-light, haze). Yet, most existing image restoration methods are highly restricted by the requirement of degraded and clear image pairs, which limits their generalization and feasibility in the many real-world scenarios without paired images. To address this bottleneck, we propose a Degradation-aware Dynamic Schrödinger Bridge (DDSB) for unpaired image restoration. Its general idea is to learn a Schrödinger Bridge between the clear and degraded image distributions, while at the same time emphasizing physical degradation priors to reduce the accumulation of errors during the restoration process. A Degradation-aware Optimal Transport (DOT) learning scheme is accordingly devised. Training a degradation model to learn the inverse restoration process is particularly challenging, as it must be applicable across different stages of the iterative restoration process. A Dynamic Transport with Consistency (DTC) learning objective is further proposed to reduce the loss of image details in the early iterations and thereby refine the degradation model. Extensive experiments on multiple image degradation tasks show its state-of-the-art performance over the prior arts.

AAAI Conference 2025 Conference Paper

DGFamba: Learning Flow Factorized State Space for Visual Domain Generalization

  • Qi Bi
  • Jingjun Yi
  • Hao Zheng
  • Haolan Zhan
  • Wei Ji
  • Yawen Huang
  • Yuexiang Li

Domain generalization aims to learn a representation from the source domain that can be generalized to arbitrary unseen target domains. A fundamental challenge for visual domain generalization is the domain gap caused by dramatic style variation while the image content stays stable. Selective state space models, exemplified by VMamba, demonstrate a global receptive field for representing content. However, how to exploit the domain-invariant property of selective state spaces is rarely explored. In this paper, we propose a novel Flow Factorized State Space model, dubbed DGFamba, for visual domain generalization. To maintain domain consistency, we innovatively map the style-augmented and the original state embeddings by flow factorization. In this latent flow space, each state embedding from a certain style is specified by a latent probability path. By aligning these probability paths in the latent space, the state embeddings are able to represent the same content distribution regardless of style differences. Extensive experiments conducted on various visual domain generalization settings show its state-of-the-art performance.

NeurIPS Conference 2025 Conference Paper

EvolvedGRPO: Unlocking Reasoning in LVLMs via Progressive Instruction Evolution

  • Zhebei Shen
  • Qifan Yu
  • Juncheng Li
  • Wei Ji
  • Qizhi Chen
  • Siliang Tang
  • Yueting Zhuang

Recent advances in reinforcement learning (RL) methods such as Grouped Relative Policy Optimization (GRPO) have strengthened the reasoning capabilities of Large Vision-Language Models (LVLMs). However, due to the inherent entanglement between visual and textual modalities, applying GRPO to LVLMs often leads to reward convergence across different responses to the same sample as training progresses, hindering effective gradient updates and causing the enhancement of chain-of-thought reasoning to stagnate or even collapse. To address this issue, we propose a progressive instruction evolution framework, EvolvedGRPO, to gradually generate more complex questions via editing instructions in an adversarial way, progressively aligned with the model’s evolving capabilities. Specifically, we design two instruction editing strategies across modalities, incorporating incrementally increasing editing instructions and RL-based adversarial data augmentation to improve the effectiveness of model training. To address GRPO's limitations on overly difficult problems, we first train on basic subproblem versions of complex multi-modal questions in both the visual and textual modalities, progressively increasing difficulty to enable prefix-style process rewards, effectively combining the strengths of both process rewards and group-wise relative rewards. Finally, EvolvedGRPO achieves state-of-the-art performance among open-source RL models on multi-modal reasoning tasks, even approaching the closed-source GPT-4o in reasoning capabilities, and demonstrates better performance on unseen LVLM general benchmarks. The code for EvolvedGRPO is available at https://github.com/SHENZHEBEI/EvolvedGRPO.
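
For context on the reward-convergence issue: GRPO-style training normalizes each sampled response's reward against its group, so when all responses to a sample earn nearly identical rewards the advantages collapse toward zero and the policy gradient carries almost no signal. A minimal sketch of that group-relative normalization (generic GRPO background, not the paper's code):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each response's reward against its own group (GRPO-style).

    rewards: array of shape (num_prompts, group_size), one scalar reward
    per sampled response.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# When a group's rewards converge (second row), the advantages vanish,
# which is the stagnation EvolvedGRPO's instruction evolution targets.
print(group_relative_advantages([[0.1, 0.9, 0.5, 0.3],
                                 [0.70, 0.70, 0.71, 0.70]]))
```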

AAAI Conference 2025 Conference Paper

Few-Shot Incremental Learning via Foreground Aggregation and Knowledge Transfer for Audio-Visual Semantic Segmentation

  • Jingqiao Xiu
  • Mengze Li
  • Zongxin Yang
  • Wei Ji
  • Yifang Yin
  • Roger Zimmermann

Audio-Visual Semantic Segmentation (AVSS) has gained significant attention in the multi-modal domain, aiming to segment video objects that produce specific sounds in the corresponding audio. Despite notable progress, existing methods still struggle to handle new classes not included in the original training set. To this end, we introduce Few-Shot Incremental Learning (FSIL) to the AVSS task, which seeks to seamlessly integrate new classes with limited incremental samples while preserving the knowledge of old classes. Two challenges arise in this new setting: (1) To reduce labeling costs, old classes within the incremental samples are treated as background, similar to silent objects. Training the model directly with background annotations may worsen the loss of distinctive knowledge about old classes, such as their outlines and sounds. (2) Most existing models adopt early cross-modal fusion with a single-tower design, incorporating more characteristics into class representations, which impedes knowledge transfer between classes based on similarity. To address these issues, we propose a Few-shot Incremental learning framework via class-centric foregrouNd aggreGation and dual-tower knowlEdge tRansfer (FINGER) for the AVSS task, which comprises two targeted modules: (1) The class-centric foreground aggregation gathers class-specific features for each foreground class while disregarding background features. The background class is excluded during training and inferred from the foreground predictions. (2) The dual-tower knowledge transfer postpones cross-modal fusion to separately conduct knowledge transfer for each modality. Extensive experiments validate the effectiveness of the FINGER model, significantly surpassing state-of-the-art methods.

AAAI Conference 2025 Conference Paper

Learning Fine-grained Domain Generalization via Hyperbolic State Space Hallucination

  • Qi Bi
  • Jingjun Yi
  • Haolan Zhan
  • Wei Ji
  • Gui-Song Xia

Fine-grained domain generalization (FGDG) aims to learn a fine-grained representation that can be well generalized to unseen target domains when only trained on the source domain data. Compared with generic domain generalization, FGDG is particularly challenging in that the fine-grained categories can only be discerned by some subtle and tiny patterns. Such patterns are particularly fragile under the cross-domain style shifts caused by illumination, color, etc. To push this frontier, this paper presents a novel Hyperbolic State Space Hallucination (HSSH) method. It consists of two key components, namely, state space hallucination (SSH) and hyperbolic manifold consistency (HMC). SSH enriches the style diversity of the state embeddings by first extrapolating and then hallucinating the source images. Then, the pre- and post-hallucination state embeddings are projected into the hyperbolic manifold. The hyperbolic state space models the high-order statistics and allows better discernment of the fine-grained patterns. Finally, the hyperbolic distance is minimized, so that the impact of style variation on fine-grained patterns can be eliminated. Experiments on three FGDG benchmarks demonstrate its state-of-the-art performance.
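
The hyperbolic distance being minimized is not written out in the abstract; in the Poincaré-ball model commonly used for such embeddings, the geodesic distance has the closed form sketched below (background only, not necessarily HSSH's exact formulation):

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points inside the unit Poincare ball."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    sq_u, sq_v = np.sum(u * u), np.sum(v * v)
    sq_diff = np.sum((u - v) ** 2)
    return np.arccosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v) + eps))

# Pre- vs. post-hallucination embeddings projected into the ball:
print(poincare_distance([0.10, 0.20], [0.12, 0.18]))   # small: styles aligned
print(poincare_distance([0.10, 0.20], [0.70, -0.50]))  # large: style-sensitive
```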

NeurIPS Conference 2025 Conference Paper

MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs

  • Tianhao Peng
  • Haochen Wang
  • Yuanxing Zhang
  • Noah Wang
  • Zili Wang
  • Ge Zhang
  • Jian Yang
  • Shihao Li

The advent of Multimodal Large Language Models (MLLMs) has expanded AI capabilities to visual modalities, yet existing evaluation benchmarks remain limited to single-video understanding, overlooking the critical need for multi-video understanding in real-world scenarios (e.g., sports analytics and autonomous driving). To address this significant gap, we introduce MVU-Eval, the first comprehensive benchmark for evaluating Multi-Video Understanding for MLLMs. Specifically, our MVU-Eval mainly assesses eight core competencies through 1,824 meticulously curated question-answer pairs spanning 4,959 videos from diverse domains, addressing both fundamental perception tasks and high-order reasoning tasks. These capabilities are rigorously aligned with real-world applications such as multi-sensor synthesis in autonomous systems and cross-angle sports analytics. Through extensive evaluation of state-of-the-art open-source and closed-source models, we reveal significant performance discrepancies and limitations in current MLLMs' ability to perform understanding across multiple videos. The benchmark will be made publicly available to foster future research.

NeurIPS Conference 2025 Conference Paper

On Efficiency-Effectiveness Trade-off of Diffusion-based Recommenders

  • Wenyu Mao
  • Jiancan Wu
  • Guoqing Hu
  • Zhengyi Yang
  • Wei Ji
  • Xiang Wang

Diffusion models have emerged as a powerful paradigm for generative sequential recommendation, which typically generate the next items to recommend guided by user interaction histories with a multi-step denoising process. However, the multi-step process relies on discrete approximations, introducing discretization error that creates a trade-off between computational efficiency and recommendation effectiveness. To address this trade-off, we propose TA-Rec, a two-stage framework that achieves one-step generation by smoothing the denoising function during pretraining while alleviating trajectory deviation by aligning with user preferences during fine-tuning. Specifically, to improve the efficiency without sacrificing the recommendation performance, TA-Rec pretrains the denoising model with Temporal Consistency Regularization (TCR), enforcing the consistency between the denoising results across adjacent steps. Thus, we can smooth the denoising function to map the noise to oracle items in one step with bounded error. To further enhance effectiveness, TA-Rec introduces Adaptive Preference Alignment (APA) that aligns the denoising process with user preference adaptively based on preference pair similarity and timesteps. Extensive experiments prove that TA-Rec’s two-stage objective effectively mitigates the discretization-error-induced trade-off, enhancing both the efficiency and effectiveness of diffusion-based recommenders. Our code is available at https://github.com/maowenyu-11/TA-Rec.
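
The abstract characterizes TCR only as enforcing agreement between denoising results at adjacent steps. A generic consistency-style objective along those lines might look as follows (the interface and the detached later-step target are illustrative assumptions, not TA-Rec's published loss):

```python
import torch
import torch.nn.functional as F

def temporal_consistency_loss(denoiser, x_t, x_next, t, t_next, cond):
    """Encourage the denoiser's outputs at adjacent noise levels to agree.

    denoiser(x, t, cond) -> predicted clean item embedding.
    The later-step prediction is detached so only the earlier step is pulled in.
    """
    pred = denoiser(x_t, t, cond)
    with torch.no_grad():
        target = denoiser(x_next, t_next, cond)
    return F.mse_loss(pred, target)

# Toy check with a stand-in denoiser that just rescales its input.
toy_denoiser = lambda x, t, cond: x * (1.0 - 0.01 * t)
x_t, x_next = torch.randn(8, 64), torch.randn(8, 64)
print(temporal_consistency_loss(toy_denoiser, x_t, x_next, 5.0, 6.0, None))
```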

ICML Conference 2025 Conference Paper

SpikeVideoFormer: An Efficient Spike-Driven Video Transformer with Hamming Attention and O(T) Complexity

  • Shihao Zou
  • Qingfeng Li
  • Wei Ji
  • Jingjing Li
  • Yongkui Yang
  • Guoqi Li
  • Chao Dong

Spiking Neural Networks (SNNs) have shown competitive performance to Artificial Neural Networks (ANNs) in various vision tasks, while offering superior energy efficiency. However, existing SNN-based Transformers primarily focus on single-image tasks, emphasizing spatial features while not effectively leveraging SNNs’ efficiency in video-based vision tasks. In this paper, we introduce SpikeVideoFormer, an efficient spike-driven video Transformer, featuring linear temporal complexity $\mathcal{O}(T)$. Specifically, we design a spike-driven Hamming attention (SDHA) which provides a theoretically guided adaptation from traditional real-valued attention to spike-driven attention. Building on SDHA, we further analyze various spike-driven space-time attention designs and identify an optimal scheme that delivers appealing performance for video tasks, while maintaining only linear temporal complexity. The generalization ability and efficiency of our model are demonstrated across diverse downstream video tasks, including classification, human pose tracking, and semantic segmentation. Empirical results show our method achieves state-of-the-art (SOTA) performance compared to existing SNN approaches, with over 15% improvement on the latter two tasks. Additionally, it matches the performance of recent ANN-based methods while offering significant efficiency gains, achieving $\times 16$, $\times 10$ and $\times 5$ improvements on the three tasks. https://github.com/JimmyZou/SpikeVideoFormer
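
Since spike features are binary, attention scores can be derived from Hamming distance with simple bit arithmetic; the sketch below shows one plausible form of such a similarity (the normalization is an illustrative assumption, not the paper's exact SDHA):

```python
import numpy as np

def hamming_attention_scores(q_spikes, k_spikes):
    """Attention-style similarity between binary spike features.

    q_spikes: (num_queries, dim) array of {0, 1}
    k_spikes: (num_keys, dim) array of {0, 1}
    Score = 1 - normalized Hamming distance, so identical spike patterns
    score 1.0 and fully complementary patterns score 0.0.
    """
    q = np.asarray(q_spikes, dtype=np.int8)
    k = np.asarray(k_spikes, dtype=np.int8)
    mismatches = (q[:, None, :] ^ k[None, :, :]).sum(axis=-1)  # XOR popcount
    return 1.0 - mismatches / q.shape[-1]

q = np.array([[1, 0, 1, 1], [0, 0, 1, 0]])
k = np.array([[1, 0, 1, 1], [1, 1, 0, 0]])
print(hamming_attention_scores(q, k))
```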

NeurIPS Conference 2025 Conference Paper

VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models

  • Haidong Xu
  • Guangwei Xu
  • Zhedong Zheng
  • Xiatian Zhu
  • Wei Ji
  • Xiangtai Li
  • Ruijie Guo
  • Meishan Zhang

This paper introduces VimoRAG, a novel video-based retrieval-augmented motion generation framework for motion large language models (LLMs). As motion LLMs face severe out-of-domain/out-of-vocabulary issues due to limited annotated data, VimoRAG leverages large-scale in-the-wild video databases to enhance 3D motion generation by retrieving relevant 2D human motion signals. While video-based motion RAG is nontrivial, we address two key bottlenecks: (1) developing an effective motion-centered video retrieval model that distinguishes human poses and actions, and (2) mitigating the issue of error propagation caused by suboptimal retrieval results. We design the Gemini Motion Video Retriever mechanism and the Motion-centric Dual-alignment DPO Trainer, enabling effective retrieval and generation processes. Experimental results show that VimoRAG significantly boosts the performance of motion LLMs constrained to text-only input.

NeurIPS Conference 2024 Conference Paper

Learning Frequency-Adapted Vision Foundation Model for Domain Generalized Semantic Segmentation

  • Qi Bi
  • Jingjun Yi
  • Hao Zheng
  • Haolan Zhan
  • Yawen Huang
  • Wei Ji
  • Yuexiang Li
  • Yefeng Zheng

The emerging vision foundation model (VFM) has inherited the ability to generalize to unseen images. Nevertheless, the key challenge of domain-generalized semantic segmentation (DGSS) lies in the domain gap attributed to cross-domain styles, i.e., the variance of urban landscape and environment dependencies. Hence, maintaining the style-invariant property under varying domain styles becomes the key bottleneck in harnessing VFMs for DGSS. The frequency space after the Haar wavelet transformation provides a feasible way to decouple the style information from the domain-invariant content, since the content and style information are retained in the low- and high-frequency components of the space, respectively. To this end, we propose a novel Frequency-Adapted (FADA) learning scheme to advance the frontier. Its overall idea is to separately tackle the content and style information by frequency tokens throughout the learning process. Particularly, the proposed FADA consists of two branches, i.e., low- and high-frequency branches. The former stabilizes the scene content, while the latter learns the scene styles and eliminates their impact on DGSS. Experiments conducted on various DGSS settings show the state-of-the-art performance of our FADA and its versatility to a variety of VFMs. Source code is available at https://github.com/BiQiWHU/FADA.
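
For intuition about the frequency decoupling: a single-level Haar transform splits an image into one low-frequency (content-carrying) band and three high-frequency detail bands. A minimal NumPy version of that split is sketched below; FADA itself works with frequency tokens throughout learning, so treat this pixel-level version purely as background.

```python
import numpy as np

def haar_dwt2(img):
    """One-level 2D Haar transform of a single-channel image.

    Returns (LL, LH, HL, HH): the low-frequency approximation and three
    high-frequency detail bands. Height and width must be even.
    """
    img = np.asarray(img, dtype=np.float64)
    a = img[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = img[0::2, 1::2]  # top-right
    c = img[1::2, 0::2]  # bottom-left
    d = img[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # low-frequency (content) band
    lh = (a - b + c - d) / 2.0  # high-frequency detail bands
    hl = (a + b - c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return ll, lh, hl, hh

ll, lh, hl, hh = haar_dwt2(np.random.rand(4, 4))
print(ll.shape, lh.shape)  # (2, 2) (2, 2)
```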

AAAI Conference 2024 Conference Paper

Learning Generalized Medical Image Segmentation from Decoupled Feature Queries

  • Qi Bi
  • Jingjun Yi
  • Hao Zheng
  • Wei Ji
  • Yawen Huang
  • Yuexiang Li
  • Yefeng Zheng

Domain generalized medical image segmentation requires models to learn from multiple source domains and generalize well to arbitrary unseen target domains. Such a task is both technically challenging and clinically practical, due to the domain shift problem (i.e., images are collected from different hospitals and scanners). Existing methods focused on either learning shape-invariant representations or reaching consensus among the source domains. An ideal generalized representation is supposed to show similar pattern responses within the same channel for cross-domain images. However, to deal with the significant distribution discrepancy, the network tends to capture similar patterns by multiple channels, while different cross-domain patterns are also allowed to rest in the same channel. To address this issue, we propose to leverage channel-wise decoupled deep features as queries. With the aid of the cross-attention mechanism, the long-range dependency between deep and shallow features can be fully mined via self-attention and then guides the learning of generalized representations. Besides, a relaxed deep whitening transformation is proposed to learn channel-wise decoupled features in a feasible way. The proposed decoupled feature query (DFQ) scheme can be seamlessly integrated into the Transformer segmentation model in an end-to-end manner. Extensive experiments show its state-of-the-art performance, notably outperforming the runner-up by 1.31% and 1.98% with the DSC metric on generalized fundus and prostate benchmarks, respectively. Source code is available at https://github.com/BiQiWHU/DFQ.

AAAI Conference 2024 Conference Paper

MedSegDiff-V2: Diffusion-Based Medical Image Segmentation with Transformer

  • Junde Wu
  • Wei Ji
  • Huazhu Fu
  • Min Xu
  • Yueming Jin
  • Yanwu Xu

The Diffusion Probabilistic Model (DPM) has recently gained popularity in the field of computer vision, thanks to its image generation applications, such as Imagen, Latent Diffusion Models, and Stable Diffusion, which have demonstrated impressive capabilities and sparked much discussion within the community. Recent investigations have further unveiled the utility of DPM in the domain of medical image analysis, as underscored by the commendable performance exhibited by medical image segmentation models across various tasks. Although these models were originally underpinned by a UNet architecture, there exists a potential avenue for enhancing their performance through the integration of vision transformer mechanisms. However, we discovered that simply combining these two models resulted in subpar performance. To effectively integrate these two cutting-edge techniques for medical image segmentation, we propose a novel Transformer-based Diffusion framework, called MedSegDiff-V2. We verify its effectiveness on 20 medical image segmentation tasks with different image modalities. Through comprehensive evaluation, our approach demonstrates superiority over prior state-of-the-art (SOTA) methodologies. Code is released at https://github.com/KidsWithTokens/MedSegDiff.

AAAI Conference 2024 Conference Paper

Panoptic Scene Graph Generation with Semantics-Prototype Learning

  • Li Li
  • Wei Ji
  • Yiming Wu
  • Mengze Li
  • You Qin
  • Lina Wei
  • Roger Zimmermann

Panoptic Scene Graph Generation (PSG) parses objects and predicts their relationships (predicates) to connect human language and visual scenes. However, different language preferences of annotators and semantic overlaps between predicates lead to biased predicate annotations in the dataset, i.e., different predicates for the same object pairs. Biased predicate annotations make PSG models struggle in constructing a clear decision plane among predicates, which greatly hinders the real application of PSG models. To address the intrinsic bias above, we propose a novel framework named ADTrans to adaptively transfer biased predicate annotations to informative and unified ones. To ensure consistency and accuracy during the transfer process, we propose to observe the invariance degree of representations in each predicate class, and learn unbiased prototypes of predicates with different intensities. Meanwhile, we continuously measure the distribution changes between each representation and its prototype, and constantly screen potentially biased data. Finally, with the unbiased predicate-prototype representation embedding space, biased annotations are easily identified. Experiments show that ADTrans significantly improves the performance of benchmark models, achieving a new state-of-the-art performance, and shows great generalization and effectiveness on multiple datasets. Our code is released at https://github.com/lili0415/PSG-biased-annotation.

NeurIPS Conference 2024 Conference Paper

Samba: Severity-aware Recurrent Modeling for Cross-domain Medical Image Grading

  • Qi Bi
  • Jingjun Yi
  • Hao Zheng
  • Wei Ji
  • Haolan Zhan
  • Yawen Huang
  • Yuexiang Li
  • Yefeng Zheng

Disease grading is a crucial task in medical image analysis. Due to the continuous progression of diseases, i.e., the variability within the same level and the similarity between adjacent stages, accurate grading is highly challenging. Furthermore, in real-world scenarios, models trained on limited source domain datasets should also be capable of handling data from unseen target domains. Due to the cross-domain variants, the feature distribution between source and unseen target domains can be dramatically different, leading to a substantial decrease in model performance. To address these challenges in cross-domain disease grading, we propose a Severity-aware Recurrent Modeling (Samba) method in this paper. As the core objective of most staging tasks is to identify the most severe lesions, which may only occupy a small portion of the image, we propose to encode image patches in a sequential and recurrent manner. Specifically, a state space model is tailored to store and transport the severity information by hidden states. Moreover, to mitigate the impact of cross-domain variants, an Expectation-Maximization (EM) based state recalibration mechanism is designed to map the patch embeddings into a more compact space. We model the feature distributions of different lesions through the Gaussian Mixture Model (GMM) and reconstruct the intermediate features based on learnable severity bases. Extensive experiments show the proposed Samba outperforms the VMamba baseline by an average accuracy of 23.5%, 5.6% and 4.1% on the cross-domain grading of fatigue fracture, breast cancer and diabetic retinopathy, respectively. Source code is available at https://github.com/BiQiWHU/Samba.

ICLR Conference 2024 Conference Paper

Towards Robust Multi-Modal Reasoning via Model Selection

  • Xiangyan Liu
  • Rongxue Li
  • Wei Ji
  • Tao Lin

The reasoning capabilities of LLMs (Large Language Models) are widely acknowledged in recent research, inspiring studies on tool learning and autonomous agents. The LLM serves as the ``brain'' of the agent, orchestrating multiple tools for collaborative multi-step task solving. Unlike methods that invoke tools such as calculators or weather APIs for straightforward tasks, multi-modal agents excel by integrating diverse AI models for complex challenges. However, current multi-modal agents neglect the significance of model selection: they primarily focus on the planning and execution phases, and will only invoke predefined task-specific models for each subtask, making the execution fragile. Meanwhile, other traditional model selection methods are either incompatible with or suboptimal for multi-modal agent scenarios, since they ignore the dependencies among subtasks that arise from multi-step reasoning. To this end, we identify the key challenges therein and propose the $\textbf{\textit{M}}^\textbf{\textit{3}}$ framework as a plug-in with negligible runtime overhead at test time. This framework improves model selection and bolsters the robustness of multi-modal agents in multi-step reasoning. In the absence of suitable benchmarks, we create MS-GQA, a new dataset specifically designed to investigate the model selection challenge in multi-modal agents. Our experiments reveal that our framework enables dynamic model selection, considering both user inputs and subtask dependencies, thereby robustifying the overall reasoning process. Our code and benchmark: https://github.com/LINs-lab/M3.

NeurIPS Conference 2024 Conference Paper

Unleashing Multispectral Video's Potential in Semantic Segmentation: A Semi-supervised Viewpoint and New UAV-View Benchmark

  • Wei Ji
  • Jingjing Li
  • Wenbo Li
  • Yilin Shen
  • Li Cheng
  • Hongxia Jin

Thanks to the rapid progress in RGB & thermal imaging, also known as multispectral imaging, the task of multispectral video semantic segmentation, or MVSS in short, has recently drawn significant attention. Noticeably, it offers new opportunities for improving segmentation performance under unfavorable visual conditions such as poor light or overexposure. Unfortunately, there are currently very few datasets available; the MVSeg dataset, for example, focuses purely on eye-level views and features sparse annotations due to the intensive demands of the labeling process. To address these key challenges of the MVSS task, this paper presents two major contributions: the introduction of MVUAV, a new MVSS benchmark dataset, and the development of a dedicated semi-supervised MVSS baseline, SemiMV. Our MVUAV dataset is captured via Unmanned Aerial Vehicles (UAVs), which offers a unique oblique bird’s-eye view complementary to the existing MVSS datasets; it also encompasses a broad range of day/night lighting conditions and over 30 semantic categories. In the meantime, to better leverage the sparse annotations and extra unlabeled RGB-thermal videos, a semi-supervised learning baseline, SemiMV, is proposed to enforce consistency regularization through a dedicated Cross-collaborative Consistency Learning (C3L) module and a denoised temporal aggregation strategy. Comprehensive empirical evaluations on both the MVSeg and MVUAV benchmark datasets showcase the efficacy of our SemiMV baseline.
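
The C3L module is not detailed in the abstract; as generic background on the consistency-regularization idea it builds on, a semi-supervised loss on unlabeled RGB-thermal frames typically penalizes disagreement between predictions for a clean and a perturbed view of the same input, roughly as sketched here (the noise augmentation and KL loss are illustrative choices, not SemiMV's):

```python
import torch
import torch.nn.functional as F

def unlabeled_consistency_loss(model, rgb, thermal):
    """Generic consistency regularization for an unlabeled RGB-thermal frame.

    model(rgb, thermal) -> per-pixel class logits of shape (B, C, H, W).
    The clean-view prediction serves as a (detached) soft pseudo-label.
    """
    with torch.no_grad():
        pseudo = model(rgb, thermal).softmax(dim=1)
    noisy_rgb = rgb + 0.1 * torch.randn_like(rgb)  # stand-in strong perturbation
    noisy_logits = model(noisy_rgb, thermal)
    return F.kl_div(noisy_logits.log_softmax(dim=1), pseudo, reduction="batchmean")
```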

NeurIPS Conference 2023 Conference Paper

DVSOD: RGB-D Video Salient Object Detection

  • Jingjing Li
  • Wei Ji
  • Size Wang
  • Wenbo Li
  • Li Cheng

Salient object detection (SOD) aims to identify standout elements in a scene, with recent advancements primarily focused on integrating depth data (RGB-D) or temporal data from videos to enhance SOD in complex scenes. However, the unison of two types of crucial information remains largely underexplored due to data constraints. To bridge this gap, we in this work introduce the DViSal dataset, fueling further research in the emerging field of RGB-D video salient object detection (DVSOD). Our dataset features 237 diverse RGB-D videos alongside comprehensive annotations, including object and instance-level markings, as well as bounding boxes and scribbles. These resources enable a broad scope for potential research directions. We also conduct benchmarking experiments using various SOD models, affirming the efficacy of multimodal video input for salient object detection. Lastly, we highlight some intriguing findings and promising future research avenues. To foster growth in this field, our dataset and benchmark results are publicly accessible at: https://dvsod.github.io/.

AAAI Conference 2023 Conference Paper

FakeSV: A Multimodal Benchmark with Rich Social Context for Fake News Detection on Short Video Platforms

  • Peng Qi
  • Yuyan Bu
  • Juan Cao
  • Wei Ji
  • Ruihao Shui
  • Junbin Xiao
  • Danding Wang
  • Tat-Seng Chua

Short video platforms have become an important channel for news sharing, but also a new breeding ground for fake news. To mitigate this problem, research on fake news video detection has recently received a lot of attention. Existing works face two roadblocks: the scarcity of comprehensive and large-scale datasets and insufficient utilization of multimodal information. Therefore, in this paper, we construct the largest Chinese short video dataset about fake news, named FakeSV, which includes news content, user comments, and publisher profiles simultaneously. To understand the characteristics of fake news videos, we conduct exploratory analysis of FakeSV from different perspectives. Moreover, we provide a new multimodal detection model named SV-FEND, which exploits the cross-modal correlations to select the most informative features and utilizes the social context information for detection. Extensive experiments evaluate the superiority of the proposed method and provide detailed comparisons of different methods and modalities for future works. Our dataset and codes are available at https://github.com/ICTMCG/FakeSV.

AAAI Conference 2023 Conference Paper

Video-Audio Domain Generalization via Confounder Disentanglement

  • Shengyu Zhang
  • Xusheng Feng
  • Wenyan Fan
  • Wenjing Fang
  • Fuli Feng
  • Wei Ji
  • Shuo Li
  • Li Wang

Existing video-audio understanding models are trained and evaluated in an intra-domain setting, facing performance degeneration in real-world applications where multiple domains and distribution shifts naturally exist. The key to video-audio domain generalization (VADG) lies in alleviating spurious correlations over multi-modal features. To achieve this goal, we resort to causal theory and attribute such correlation to confounders affecting both video-audio features and labels. We propose a DeVADG framework that conducts uni-modal and cross-modal deconfounding through back-door adjustment. DeVADG performs cross-modal disentanglement and obtains fine-grained confounders at both class-level and domain-level using half-sibling regression and unpaired domain transformation, which essentially identifies domain-variant factors and class-shared factors that cause spurious correlations between features and false labels. To promote VADG research, we collect a VADG-Action dataset for video-audio action recognition with over 5,000 video clips across four domains (e.g., cartoon and game) and ten action classes (e.g., cooking and riding). We conduct extensive experiments, i.e., multi-source DG, single-source DG, and qualitative analysis, validating the rationality of our causal analysis and the effectiveness of the DeVADG framework.
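
For reference, the back-door adjustment that this deconfounding relies on replaces the observational conditional with an interventional one by marginalizing over the confounder: $P(Y \mid do(X)) = \sum_{z} P(Y \mid X, Z = z)\, P(Z = z)$, where $X$ denotes the video-audio features, $Y$ the action label, and $Z$ ranges over the class-level and domain-level confounders the framework estimates. This is the standard formula rather than DeVADG's specific estimator.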

NeurIPS Conference 2023 Conference Paper

VPGTrans: Transfer Visual Prompt Generator across LLMs

  • Ao Zhang
  • Hao Fei
  • Yuan Yao
  • Wei Ji
  • Li Li
  • Zhiyuan Liu
  • Tat-Seng Chua

Since developing a new multimodal LLM (MLLM) by pre-training on tremendous image-text pairs from scratch can be exceedingly resource-consuming, connecting an existing LLM with a comparatively lightweight visual prompt generator (VPG) becomes a feasible paradigm. However, further tuning the VPG component of the MLLM still incurs significant computational costs, such as thousands of GPU hours and millions of training data points. An alternative solution is transferring an existing VPG from one MLLM to the target MLLM. In this work, we investigate VPG transferability across LLMs for the first time, aiming to reduce the cost of VPG training. Specifically, we explore VPG transfer across different LLM sizes (e.g., small-to-large) and types. We identify key factors to maximize transfer efficiency, based on which we develop a simple yet highly effective two-stage transfer framework, called VPGTrans. Notably, it enables VPG transfer from BLIP-2 OPT 2.7B to BLIP-2 OPT 6.7B with less than 10% of the GPU hours using only 10.7% of the training data compared to training a VPG for OPT 6.7B from scratch. Furthermore, we provide a series of intriguing findings and discuss potential explanations behind them. Finally, we showcase the practical value of our VPGTrans approach, by customizing two novel MLLMs, including VL-LLaMA and VL-Vicuna, with recently released LLaMA and Vicuna LLMs.

AAAI Conference 2022 Conference Paper

Content-Variant Reference Image Quality Assessment via Knowledge Distillation

  • Guanghao Yin
  • Wei Wang
  • Zehuan Yuan
  • Chuchu Han
  • Wei Ji
  • Shouqian Sun
  • Changhu Wang

Generally, humans are more skilled at perceiving differences between high-quality (HQ) and low-quality (LQ) images than at directly judging the quality of a single LQ image. This situation also applies to image quality assessment (IQA). Although recent no-reference (NR-IQA) methods have made great progress in predicting image quality free from the reference image, they still have the potential to achieve better performance since HQ image information is not fully exploited. In contrast, full-reference (FR-IQA) methods tend to provide more reliable quality evaluation, but their practicability is affected by the requirement for pixel-level aligned reference images. To address this, we first propose the content-variant reference method via knowledge distillation (CVRKD-IQA). Specifically, we use non-aligned reference (NAR) images to introduce various prior distributions of high-quality images. Comparing the distribution differences between HQ and LQ images helps our model better assess image quality. Further, knowledge distillation transfers more HQ-LQ distribution difference information from the FR-teacher to the NAR-student and stabilizes CVRKD-IQA's performance. Moreover, to fully mine the local-global combined information while achieving faster inference speed, our model directly processes multiple image patches from the input with the MLP-mixer. Cross-dataset experiments verify that our model can outperform all NAR/NR-IQA SOTAs and even reach performance comparable to FR-IQA methods on some occasions. Since content-variant and non-aligned reference HQ images are easy to obtain, our model can support more IQA applications with its relative robustness to content variations. Our code and more detailed elaborations in the supplement are available at: https://github.com/guanghaoyin/CVRKD-IQA.
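
The FR-teacher to NAR-student transfer is only summarized above; a generic distillation objective of the kind described might combine the student's quality-regression loss with terms matching the teacher's HQ-LQ difference features and predicted score, as in this sketch (the loss forms and weight are assumptions, not CVRKD-IQA's exact objective):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, mos, alpha=0.5):
    """Generic FR-teacher -> NAR-student distillation for quality assessment.

    student_out / teacher_out: (score, feature) pairs, where `feature` is the
    HQ-LQ distribution-difference representation each network extracts.
    mos: ground-truth mean-opinion quality scores.
    """
    s_score, s_feat = student_out
    t_score, t_feat = teacher_out
    task = F.l1_loss(s_score, mos)                # fit human quality labels
    mimic = F.mse_loss(s_feat, t_feat.detach())   # match teacher's difference features
    soft = F.l1_loss(s_score, t_score.detach())   # match teacher's predicted score
    return task + alpha * (mimic + soft)

# Toy usage with random tensors standing in for network outputs.
student = (torch.rand(4, 1), torch.randn(4, 128))
teacher = (torch.rand(4, 1), torch.randn(4, 128))
print(distillation_loss(student, teacher, mos=torch.rand(4, 1)))
```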

AAAI Conference 2022 Conference Paper

Rethinking the Two-Stage Framework for Grounded Situation Recognition

  • Meng Wei
  • Long Chen
  • Wei Ji
  • Xiaoyu Yue
  • Tat-Seng Chua

Grounded Situation Recognition (GSR), i.e., recognizing the salient activity (or verb) category in an image (e.g., buying) and detecting all corresponding semantic roles (e.g., agent and goods), is an essential step towards “human-like” event understanding. Since each verb is associated with a specific set of semantic roles, all existing GSR methods resort to a two-stage framework: predicting the verb in the first stage and detecting the semantic roles in the second stage. However, there are obvious drawbacks in both stages: 1) The widely-used cross-entropy (XE) loss for object recognition is insufficient for verb classification due to the large intra-class variation and high inter-class similarity among daily activities. 2) All semantic roles are detected in an autoregressive manner, which fails to model the complex semantic relations between different roles. To this end, we propose a novel SituFormer for GSR which consists of a Coarse-to-Fine Verb Model (CFVM) and a Transformer-based Noun Model (TNM). CFVM is a two-step verb prediction model: a coarse-grained model trained with XE loss first proposes a set of verb candidates, and then a fine-grained model trained with triplet loss re-ranks these candidates with enhanced verb features (not only separable but also discriminative). TNM is a transformer-based semantic role detection model, which detects all roles in parallel. Owing to the global relation modeling ability and flexibility of the transformer decoder, TNM can fully explore the statistical dependency of the roles. Extensive validations on the challenging SWiG benchmark show that SituFormer achieves a new state-of-the-art performance with significant gains under various metrics. Code is available at https://github.com/kellyiss/SituFormer.

AAAI Conference 2022 Conference Paper

Video as Conditional Graph Hierarchy for Multi-Granular Question Answering

  • Junbin Xiao
  • Angela Yao
  • Zhiyuan Liu
  • Yicong Li
  • Wei Ji
  • Tat-Seng Chua

Video question answering requires models to understand and reason about both the complex video and language data to correctly derive the answers. Existing efforts have focused on designing sophisticated cross-modal interactions to fuse the information from the two modalities, while encoding the video and question holistically as frame and word sequences. Despite their success, these methods essentially revolve around the sequential nature of video and question contents, providing little insight into the problem of question answering and lacking interpretability as well. In this work, we argue that while video is presented as a frame sequence, the visual elements (e.g., objects, actions, activities and events) are not sequential but rather hierarchical in semantic space. To align with the multi-granular essence of linguistic concepts in language queries, we propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner, with the guidance of corresponding textual cues. Despite the simplicity, our extensive experiments demonstrate the superiority of such a conditional hierarchical graph architecture, with clear performance improvements over prior methods and also better generalization across different types of questions. Further analyses also demonstrate the model’s reliability, as it shows meaningful visual-textual evidence for the predicted answers.

AAAI Conference 2021 Conference Paper

Boundary Proposal Network for Two-stage Natural Language Video Localization

  • Shaoning Xiao
  • Long Chen
  • Songyang Zhang
  • Wei Ji
  • Jian Shao
  • Lu Ye
  • Jun Xiao

We aim to address the problem of Natural Language Video Localization (NLVL) — localizing the video segment corresponding to a natural language description in a long and untrimmed video. State-of-the-art NLVL methods are almost all one-stage, and can typically be grouped into two categories: 1) anchor-based approaches, which first pre-define a series of video segment candidates (e.g., by sliding window) and then classify each candidate; 2) anchor-free approaches, which directly predict the probability of each video frame being a boundary or intermediate frame inside the positive segment. However, both kinds of one-stage approaches have inherent drawbacks: the anchor-based approach is susceptible to heuristic rules, further limiting its capability of handling videos of variable length, while the anchor-free approach fails to exploit segment-level interaction and thus achieves inferior results. In this paper, we propose a novel Boundary Proposal Network (BPNet), a universal two-stage framework that gets rid of the issues mentioned above. Specifically, in the first stage, BPNet utilizes an anchor-free model to generate a group of high-quality candidate video segments with their boundaries. In the second stage, a visual-language fusion layer is proposed to jointly model the multi-modal interaction between the candidate and the language query, followed by a matching score rating layer that outputs the alignment score for each candidate. We evaluate our BPNet on three challenging NLVL benchmarks (i.e., Charades-STA, TACoS and ActivityNet-Captions). Extensive experiments and ablative studies on these datasets demonstrate that BPNet outperforms the state-of-the-art methods.

NeurIPS Conference 2021 Conference Paper

Joint Semantic Mining for Weakly Supervised RGB-D Salient Object Detection

  • Jingjing Li
  • Wei Ji
  • Qi Bi
  • Cheng Yan
  • Miao Zhang
  • Yongri Piao
  • Huchuan Lu
  • Li Cheng

Training saliency detection models with weak supervision, e.g., image-level tags or captions, is appealing as it removes the costly demand of per-pixel annotations. Despite the rapid progress of RGB-D saliency detection in the fully-supervised setting, it remains an unexplored territory when only weak supervision signals are available. This paper is set to tackle the problem of weakly-supervised RGB-D salient object detection. The key insight in this effort is the idea of maintaining per-pixel pseudo-labels with iterative refinements by reconciling the multimodal input signals in our joint semantic mining (JSM). Considering the large variations in the raw depth map and the lack of explicit pixel-level supervision, we propose spatial semantic modeling (SSM) to capture saliency-specific depth cues from the raw depth and produce depth-refined pseudo-labels. Moreover, tags and captions are incorporated via fill-in-the-blank training in our textual semantic modeling (TSM) to estimate the confidences of competing pseudo-labels. At test time, our model involves only a light-weight sub-network of the training pipeline, i.e., it requires only an RGB image as input, thus allowing efficient inference. Extensive evaluations demonstrate the effectiveness of our approach under the weakly-supervised setting. Importantly, our method can also be adapted to work in both fully-supervised and unsupervised paradigms. In each of these scenarios, our approach attains superior performance compared to dedicated state-of-the-art methods. As a by-product, a CapS dataset is constructed by augmenting the existing benchmark training set with additional image tags and captions.

IJCAI Conference 2018 Conference Paper

Semantic Locality-Aware Deformable Network for Clothing Segmentation

  • Wei Ji
  • Xi Li
  • Yueting Zhuang
  • Omar El Farouk Bourahla
  • Yixin Ji
  • Shihao Li
  • Jiabao Cui

Clothing segmentation is a challenging vision problem typically implemented within a fine-grained semantic segmentation framework. Different from conventional segmentation, clothing segmentation has some domain-specific properties such as texture richness, diverse appearance variations, non-rigid geometry deformations, and small-sample learning. To deal with these issues, we propose a semantic locality-aware segmentation model, which adaptively attaches an original clothing image to a semantically similar (e.g., in appearance or pose) auxiliary exemplar found by search. By considering the interactions between the clothing image and its exemplar, more intrinsic knowledge about the locality manifold structures of clothing images is discovered, making the learning process for the small-sample problem more stable and tractable. Furthermore, we present a CNN model based on deformable convolutions to extract non-rigid geometry-aware features for clothing images. Experimental results demonstrate the effectiveness of the proposed model against the state-of-the-art approaches.