Arrow Research

Author name cluster

Yunchao Wei

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

35 papers
2 author rows

Possible papers (35)

NeurIPS Conference 2025 Conference Paper

A Unified Reasoning Framework for Holistic Zero-Shot Video Anomaly Analysis

  • Dongheng Lin
  • Mengxue Qu
  • Kunyang Han
  • Jianbo Jiao
  • Xiaojie Jin
  • Yunchao Wei

Most video-anomaly research stops at frame-wise detection, outputting only frame-wise anomaly scores without spatial or semantic context and offering little insight into why an event is abnormal. Recent video anomaly localization and video anomaly understanding methods improve explainability but remain data-dependent and task-specific. We propose a unified reasoning framework that bridges the gap between temporal detection, spatial localization, and textual explanation. Our approach is built upon a chained test-time reasoning process that sequentially connects these tasks, enabling holistic zero-shot anomaly analysis without any additional training. Specifically, it leverages intra-task reasoning to refine temporal detections and inter-task chaining for spatial and semantic understanding, yielding improved interpretability and generalization. Without any additional data or gradients, our method achieves state-of-the-art zero-shot performance across multiple video anomaly detection, localization, and explanation benchmarks. The results demonstrate that careful prompt design with task-wise chaining can unlock the reasoning power of foundation models, enabling practical, interpretable video anomaly analysis in a fully zero-shot manner. Project Page: https://rathgrith.github.io/UnifiedFrameVAA/.
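
The chained test-time reasoning described above lends itself to a compact sketch. The snippet below is an illustrative outline only, assuming a hypothetical `vlm(prompt, frames)` callable that wraps any vision-language model; the prompts and the three-stage split are assumptions, not the paper's actual prompt set.

```python
def analyze_anomalies(frames, vlm):
    """Illustrative sketch of chained test-time reasoning: temporal detection
    is refined within-task, then chained into localization and explanation.
    `vlm` is a hypothetical (prompt, frames) -> str callable."""
    # Stage 1: temporal detection, followed by intra-task refinement.
    scores = vlm("Rate how abnormal each frame is on a 0-1 scale.", frames)
    scores = vlm(f"Re-examine these scores and fix inconsistencies: {scores}",
                 frames)
    # Stage 2: inter-task chaining -- localize within the detected frames.
    regions = vlm(f"Given abnormal frames {scores}, describe where in each "
                  "frame the anomaly occurs.", frames)
    # Stage 3: chain detection + localization into a textual explanation.
    return vlm(f"Explain why the events at {regions} are anomalous.", frames)
```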

AAAI Conference 2025 Conference Paper

Attend and Enrich: Enhanced Visual Prompt for Zero-Shot Learning

  • Man Liu
  • Huihui Bai
  • Feng Li
  • Chunjie Zhang
  • Yunchao Wei
  • Tat-Seng Chua
  • Yao Zhao

Zero-shot learning (ZSL) endeavors to transfer knowledge from seen categories to recognize unseen categories, relying mostly on the semantic-visual interactions between image and attribute tokens. Recently, prompt learning has emerged in ZSL and demonstrated significant potential, as it allows the zero-shot transfer of diverse visual concepts to downstream tasks. However, current methods explore a fixed adaptation of the learnable prompt on the seen domains, which makes them over-emphasize the primary visual features observed during training and limits their generalization to unseen domains. In this work, we propose AENet, which endows the visual prompt with semantic information to distill a semantic-enhanced prompt for visual representation enrichment, enabling effective knowledge transfer for ZSL. AENet comprises two key steps: 1) exploring concept-harmonized tokens for the visual and attribute modalities, grounded on a modal-sharing token that represents consistent visual-semantic concepts; and 2) yielding the semantic-enhanced prompt via a visual residual refinement unit with attribute consistency supervision. The result is further integrated with primary visual features to attend to semantic-related information, thus strengthening transferability. Experimental results on three benchmarks show that our AENet outperforms existing state-of-the-art ZSL methods.

AAAI Conference 2025 Conference Paper

C2P-CLIP: Injecting Category Common Prompt in CLIP to Enhance Generalization in Deepfake Detection

  • Chuangchuang Tan
  • Renshuai Tao
  • Huan Liu
  • Guanghua Gu
  • Baoyuan Wu
  • Yao Zhao
  • Yunchao Wei

This work focuses on AIGC detection, aiming to develop universal detectors capable of identifying various types of forged images. Recent studies have found that large pre-trained models such as CLIP, paired with linear classifiers, are effective for generalizable deepfake detection. However, two critical issues remain unresolved: 1) understanding why CLIP features are effective for deepfake detection through a linear classifier; and 2) exploring the full detection potential of CLIP. In this study, we delve into the underlying mechanisms of CLIP's detection capabilities by decoding its detection features into text and performing word frequency analysis. Our findings indicate that CLIP detects deepfakes by recognizing similar concepts. Building on this insight, we introduce Category Common Prompt CLIP (C2P-CLIP), which integrates a category common prompt into the text encoder to inject category-related concepts into the image encoder, thereby enhancing detection performance. Our method achieves a 12.4% improvement in detection accuracy compared to the original CLIP.
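
As context for the abstract's starting point (frozen CLIP features plus a linear classifier), here is a minimal sketch of that baseline, not of C2P-CLIP itself; it assumes the open_clip and scikit-learn packages and a user-supplied `train_set` of (PIL image, 0/1 label) pairs.

```python
# Frozen-CLIP + linear-probe deepfake baseline (the setup the paper analyzes),
# NOT the full C2P-CLIP method. Assumes open_clip and scikit-learn.
import torch
import open_clip
from sklearn.linear_model import LogisticRegression

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai")
model.eval()

@torch.no_grad()
def clip_features(images):
    batch = torch.stack([preprocess(im) for im in images])
    feats = model.encode_image(batch)                 # frozen CLIP features
    return torch.nn.functional.normalize(feats, dim=-1).cpu().numpy()

def fit_linear_probe(train_set):
    images, labels = zip(*train_set)                  # labels: 0 real, 1 fake
    clf = LogisticRegression(max_iter=1000)
    clf.fit(clip_features(list(images)), labels)
    return clf                # predict with clf.predict(clip_features(...))
```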

ICLR Conference 2025 Conference Paper

ClassDiffusion: More Aligned Personalization Tuning with Explicit Class Guidance

  • Jiannan Huang 0002
  • Jun Hao Liew
  • Hanshu Yan
  • Yuyang Yin
  • Yao Zhao 0001
  • Humphrey Shi
  • Yunchao Wei

Recent text-to-image customization works have proven successful in generating images of given concepts by fine-tuning diffusion models on a few examples. However, tuning-based methods inherently tend to overfit the concepts, resulting in failure to create the concept under multiple conditions (*e.g.*, the headphone is missing when generating "a <sks> dog wearing a headphone"). Interestingly, we notice that the base model before fine-tuning exhibits the capability to compose the base concept with other elements (*e.g.*, "a dog wearing a headphone"), implying that the compositional ability only disappears after personalization tuning. We observe a semantic shift in the customized concept after fine-tuning, indicating that the personalized concept is not aligned with the original concept, and further show through theoretical analyses that this semantic shift leads to increased difficulty in sampling the joint conditional probability distribution, resulting in the loss of the compositional ability. Inspired by this finding, we present **ClassDiffusion**, a technique that leverages a **semantic preservation loss** to explicitly regulate the concept space when learning a new concept. Although simple, this approach effectively prevents semantic drift during the fine-tuning process of the target concepts. Extensive qualitative and quantitative experiments demonstrate that the use of the semantic preservation loss effectively improves the compositional abilities of fine-tuned models. Lastly, we also extend ClassDiffusion to personalized video generation, demonstrating its flexibility.
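
A minimal sketch of the semantic preservation idea, under the assumption that it can be expressed as keeping the personalized prompt's text embedding close to the base-class prompt's embedding; the paper's exact loss may differ, and `text_encoder`/`tokenize` stand in for any CLIP-style text pipeline.

```python
import torch
import torch.nn.functional as F

def semantic_preservation_loss(text_encoder, tokenize,
                               personalized_prompt, class_prompt):
    """Penalize drift of the personalized concept ("a <sks> dog") away from
    its base class ("a dog") in text-embedding space. A sketch of the idea;
    the paper's exact formulation may differ."""
    e_personal = text_encoder(tokenize([personalized_prompt]))
    e_class = text_encoder(tokenize([class_prompt])).detach()  # frozen target
    return 1.0 - F.cosine_similarity(e_personal, e_class, dim=-1).mean()
```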

ICLR Conference 2025 Conference Paper

Collapsed Language Models Promote Fairness

  • Jingxuan Xu
  • Wuyang Chen 0001
  • Linyi Li
  • Yao Zhao 0001
  • Yunchao Wei

To mitigate societal biases implicitly encoded in recent successful pretrained language models, a diverse array of approaches has been proposed to encourage model fairness, focusing on prompting, data augmentation, regularized fine-tuning, and more. Despite this progress, it is nontrivial to reach a principled understanding of fairness or an effective algorithm that can consistently debias language models. In this work, through rigorous evaluations of Neural Collapse -- a learning phenomenon that occurs in the last-layer representations and classifiers of deep networks -- on fairness-related words, we find that debiased language models exhibit collapsed alignment between token representations and word embeddings. More importantly, this observation inspires us to design a principled fine-tuning method that can effectively improve fairness across a wide range of debiasing methods, while still preserving the performance of language models on standard natural language understanding tasks. Our code is available at https://github.com/Xujxyang/Fairness-NC-main.
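
The alignment observation suggests a simple probe. The sketch below, with an illustrative model choice and word pair, compares a token's last-layer hidden state with its input word embedding using Hugging Face transformers; it is an assumed reading of the measurement, not the paper's evaluation protocol.

```python
# Probe: how aligned is a fairness-related token's last-layer representation
# with its input word embedding? Model name and word pair are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def alignment(word, sentence):
    ids = tok(sentence, return_tensors="pt")
    word_id = tok.convert_tokens_to_ids(word)
    pos = ids.input_ids[0].tolist().index(word_id)
    hidden = model(**ids).last_hidden_state[0, pos]      # last-layer rep
    emb = model.get_input_embeddings().weight[word_id]   # input embedding
    return torch.cosine_similarity(hidden, emb, dim=0).item()

print(alignment("he", "he works as a nurse."),
      alignment("she", "she works as a nurse."))
```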

NeurIPS Conference 2025 Conference Paper

DCI: Dual-Conditional Inversion for Boosting Diffusion-Based Image Editing

  • Zixiang Li
  • Haoyu Wang
  • Wei Wang
  • Chuangchuang Tan
  • Yunchao Wei
  • Yao Zhao

Diffusion models have achieved remarkable success in image generation and editing tasks. Inversion within these models aims to recover the latent noise representation for a real or generated image, enabling reconstruction, editing, and other downstream tasks. However, to date, most inversion approaches suffer from an intrinsic trade-off between reconstruction accuracy and editing flexibility. This limitation arises from the difficulty of maintaining both semantic alignment and structural consistency during the inversion process. In this work, we introduce Dual-Conditional Inversion (DCI), a novel framework that jointly conditions on the source prompt and reference image to guide the inversion process. Specifically, DCI formulates the inversion process as a dual-condition fixed-point optimization problem, minimizing both the latent noise gap and the reconstruction error under the joint guidance. This design anchors the inversion trajectory in both semantic and visual space, leading to more accurate and editable latent representations. Our novel setup brings new understanding to the inversion process. Extensive experiments demonstrate that DCI achieves state-of-the-art performance across multiple editing tasks, significantly improving both reconstruction quality and editing precision. Furthermore, we also demonstrate that our method achieves strong results in reconstruction tasks, implying a degree of robustness and generalizability approaching the ultimate goal of the inversion process. Our code is available at https://github.com/Lzxhh/Dual-Conditional-Inversion.

NeurIPS Conference 2025 Conference Paper

FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction

  • Siyu Jiao
  • Gengwei Zhang
  • Yinlong Qian
  • Jiancheng Huang
  • Yao Zhao
  • Humphrey Shi
  • Lin Ma
  • Yunchao Wei

This work challenges the residual prediction paradigm in visual autoregressive modeling and presents FlexVAR, a new Flexible Visual AutoRegressive image generation paradigm. FlexVAR facilitates autoregressive learning with ground-truth prediction, enabling each step to independently produce plausible images. This simple, intuitive approach swiftly learns visual distributions and makes the generation process more flexible and adaptable. Trained solely on low-resolution images (< 256px), FlexVAR can: (1) generate images of various resolutions and aspect ratios, even exceeding the resolution of the training images; (2) support various image-to-image tasks, including image refinement, in/out-painting, and image expansion; and (3) adapt to various autoregressive steps, allowing for faster inference with fewer steps or enhanced image quality with more steps. Our 1.0B model outperforms its VAR counterpart on the ImageNet 256 × 256 benchmark. Moreover, when the image generation process is zero-shot transferred to 13 steps, performance further improves to 2.08 FID, outperforming the state-of-the-art autoregressive models AiM/VAR by 0.25/0.28 FID and the popular diffusion models LDM/DiT by 1.52/0.19 FID, respectively. When transferring our 1.0B model to the ImageNet 512 × 512 benchmark in a zero-shot manner, FlexVAR achieves competitive results compared to the VAR 2.3B model, a fully supervised model trained at 512 × 512 resolution.

ICLR Conference 2025 Conference Paper

Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention

  • Weitai Kang
  • Mengxue Qu
  • Jyoti Kini
  • Yunchao Wei
  • Mubarak Shah
  • Yan Yan 0002

In real-life scenarios, humans seek out objects in the 3D world to fulfill their daily needs or intentions. This inspires us to introduce 3D intention grounding, a new 3D object detection task on RGB-D data driven by human intention, such as "I want something to support my back." The closely related task of 3D visual grounding instead focuses on understanding human reference: to achieve detection, it relies on humans to observe the scene, reason out the target that aligns with their intention ("pillow" in this case), and finally provide a reference to the AI system, such as "a pillow on the couch". In contrast, 3D intention grounding challenges AI agents to automatically observe, reason, and detect the desired target solely based on human intention. To tackle this challenge, we introduce the new Intent3D dataset, consisting of 44,990 intention texts associated with 209 fine-grained classes from 1,042 scenes of the ScanNet dataset. We also establish several baselines based on different language-based 3D object detection models on our benchmark. Finally, we propose IntentNet, our unique approach to this intention-based detection problem. It focuses on three key aspects: intention understanding, reasoning to identify object candidates, and cascaded adaptive learning that leverages the intrinsic priority logic of different losses for multi-objective optimization.

NeurIPS Conference 2025 Conference Paper

Martian World Model: Controllable Video Synthesis with Physically Accurate 3D Reconstructions

  • Longfei Li
  • Zhiwen Fan
  • Wenyan Cong
  • Xinhang Liu
  • Yuyang Yin
  • Matt Foutter
  • Panwang Pan
  • Chenyu You

The synthesis of realistic Martian landscape videos, essential for mission rehearsal and robotic simulation, presents unique challenges. These primarily stem from the scarcity of high-quality Martian data and the significant domain gap relative to terrestrial imagery. To address these challenges, we introduce a holistic solution comprising two main components: 1) a data curation framework, Multimodal Mars Synthesis (M3arsSynth), which processes stereo navigation images to render high-fidelity 3D video sequences; and 2) a video-based Martian terrain generator (MarsGen), which utilizes multimodal conditioning data to accurately synthesize novel, 3D-consistent frames. Our data are sourced from NASA's Planetary Data System (PDS), covering diverse Martian terrains and dates, enabling the production of physics-accurate 3D surface models at metric-scale resolution. During inference, MarsGen is conditioned on an initial image frame and can be guided by specified camera trajectories or textual prompts to generate new environments. Experimental results demonstrate that our solution surpasses video synthesis approaches trained on terrestrial data, achieving superior visual quality and 3D structural consistency.

AAAI Conference 2025 Conference Paper

Memory Efficient Matting with Adaptive Token Routing

  • Yiheng Lin
  • Yihan Hu
  • Chenyi Zhang
  • Ting Liu
  • Xiaochao Qu
  • Luoqi Liu
  • Yao Zhao
  • Yunchao Wei

Transformer-based models have recently achieved outstanding performance in image matting. However, their application to high-resolution images remains challenging due to the quadratic complexity of global self-attention. To address this issue, we propose MEMatte, a memory-efficient matting framework for processing high-resolution images. MEMatte incorporates a router before each global attention block, directing informative tokens to the global attention while routing other tokens to a Lightweight Token Refinement Module (LTRM). Specifically, the router employs a local-global strategy to predict the routing probability of each token, and the LTRM utilizes efficient modules to simulate global attention. Additionally, we introduce a Batch-constrained Adaptive Token Routing (BATR) mechanism, which allows each router to dynamically route tokens based on image content and the stage of the attention block in the network. Furthermore, we construct an ultra high-resolution image matting dataset, UHR-395, comprising 35,500 training images and 1,000 test images, with an average resolution of 4872 × 6017. This dataset is created by compositing 395 different alpha mattes across 11 categories onto various backgrounds, all with high-quality manual annotation. Extensive experiments demonstrate that MEMatte outperforms existing methods on both high-resolution and real-world datasets, reducing memory usage by approximately 88% and latency by 50% on the Composition-1K benchmark.
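
The routing mechanism can be pictured with a short sketch. The module below is a loose, assumed reading of the abstract (top-k routing to full attention, a cheap MLP standing in for the LTRM); shapes, ratios, and the residual wiring are illustrative, not MEMatte's actual design.

```python
import torch
import torch.nn as nn

class RoutedAttentionBlock(nn.Module):
    """Sketch: a router scores tokens, the top-k go through full
    self-attention, the rest only through a cheap refinement path."""
    def __init__(self, dim, num_heads=8, keep_ratio=0.25):
        super().__init__()
        self.router = nn.Linear(dim, 1)                    # per-token score
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.light = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                   nn.Linear(dim, dim))    # stands in for LTRM
        self.keep_ratio = keep_ratio

    def forward(self, x):                                  # x: (B, N, C)
        B, N, C = x.shape
        k = max(1, int(N * self.keep_ratio))
        scores = self.router(x).squeeze(-1)                # (B, N)
        idx = scores.topk(k, dim=1).indices                # informative tokens
        gather = idx.unsqueeze(-1).expand(-1, -1, C)
        picked = x.gather(1, gather)                       # (B, k, C)
        attended, _ = self.attn(picked, picked, picked)
        out = x + self.light(x)                            # cheap path for all
        return out.scatter(1, gather, picked + attended)   # overwrite top-k
```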

NeurIPS Conference 2025 Conference Paper

PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling

  • Xiao Yu
  • Yan Fang
  • Yao Zhao
  • Yunchao Wei

Audio-visual event parsing plays a crucial role in understanding multimodal video content, but existing methods typically rely on offline processing of entire videos with huge model sizes, limiting their real-time applicability. We introduce Online Audio-Visual Event Parsing (On-AVEP), a novel paradigm for parsing audio, visual, and audio-visual events by sequentially analyzing incoming video streams. The On-AVEP task necessitates models with two key capabilities: (1) accurate online inference, to effectively distinguish events with unclear and limited context in online settings, and (2) real-time efficiency, to balance high performance with computational constraints. To cultivate these, we propose the $\textbf{Pre}$dictive $\textbf{F}$uture $\textbf{M}$odeling (PreFM) framework, featuring (a) predictive multimodal future modeling to infer and integrate beneficial future audio-visual cues, thereby enhancing contextual understanding, and (b) modality-agnostic robust representation along with focal temporal prioritization to improve precision and generalization. Extensive experiments on the UnAV-100 and LLP datasets show PreFM significantly outperforms state-of-the-art methods by a large margin with significantly fewer parameters, offering an insightful approach for real-time multimodal video understanding.

NeurIPS Conference 2024 Conference Paper

Bridge the Points: Graph-based Few-shot Segment Anything Semantically

  • Anqi Zhang
  • Guangyu Gao
  • Jianbo Jiao
  • Chi H. Liu
  • Yunchao Wei

The recent advancements in large-scale pre-training techniques have significantly enhanced the capabilities of vision foundation models, notably the Segment Anything Model (SAM), which can generate precise masks based on point and box prompts. Recent studies extend SAM to Few-shot Semantic Segmentation (FSS), focusing on prompt generation for SAM-based automatic semantic segmentation. However, these methods struggle with selecting suitable prompts, require specific hyperparameter settings for different scenarios, and experience prolonged one-shot inference times due to the overuse of SAM, resulting in low efficiency and limited automation ability. To address these issues, we propose a simple yet effective approach based on graph analysis. In particular, a Positive-Negative Alignment module dynamically selects the point prompts for generating masks, especially uncovering the potential of the background context as the negative reference. A subsequent Point-Mask Clustering module aligns the granularity of masks and selected points as a directed graph, based on mask coverage over points. These points are then aggregated by efficiently decomposing the weakly connected components of the directed graph, constructing distinct natural clusters. Finally, positive and overshooting gating, benefiting from graph-based granularity alignment, aggregates high-confidence masks and filters out false-positive masks for the final prediction, reducing the need for additional hyperparameters and redundant mask generation. Extensive experimental analysis across standard FSS, One-shot Part Segmentation, and Cross-Domain FSS datasets validates the effectiveness and efficiency of the proposed approach, surpassing state-of-the-art generalist models with a mIoU of 58.7% on COCO-20^i and 35.2% on LVIS-92^i. The project page of this work is https://andyzaq.github.io/GF-SAM/.
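
The point-mask clustering step reduces to weakly connected components of a directed coverage graph, which the following sketch implements with SciPy; input shapes and the one-mask-per-point assumption are illustrative, not the paper's exact pipeline.

```python
# Sketch of point-mask clustering: link each point to the points covered by
# its mask, then split the directed graph into weakly connected components.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def cluster_points(masks, points):
    """masks: (P, H, W) bool, one mask generated per point prompt;
    points: (P, 2) int (y, x) coordinates. Returns a cluster id per point."""
    P = len(points)
    adj = np.zeros((P, P), dtype=bool)
    ys, xs = points[:, 0], points[:, 1]
    for i, m in enumerate(masks):
        adj[i] = m[ys, xs]      # directed edge i -> j if mask_i covers point_j
    _, labels = connected_components(csr_matrix(adj), directed=True,
                                     connection="weak")
    return labels               # points in one component form one cluster
```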

NeurIPS Conference 2024 Conference Paper

Diffusion4D: Fast Spatial-temporal Consistent 4D generation via Video Diffusion Models

  • Hanwen Liang
  • Yuyang Yin
  • Dejia Xu
  • Hanxue Liang
  • Zhangyang Wang
  • Konstantinos N. Plataniotis
  • Yao Zhao
  • Yunchao Wei

The availability of large-scale multimodal datasets and advancements in diffusion models have significantly accelerated progress in 4D content generation. Most prior approaches rely on multiple image or video diffusion models, utilizing score distillation sampling for optimization or generating pseudo novel views for direct supervision. However, these methods are hindered by slow optimization speeds and multi-view inconsistency issues. Spatial and temporal consistency in 4D geometry has been extensively explored in 3D-aware diffusion models and traditional monocular video diffusion models, respectively. Building on this foundation, we propose a strategy to migrate the temporal consistency in video diffusion models to the spatial-temporal consistency required for 4D generation. Specifically, we present a novel framework, \textbf{Diffusion4D}, for efficient and scalable 4D content generation. Leveraging a meticulously curated dynamic 3D dataset, we develop a 4D-aware video diffusion model capable of synthesizing orbital views of dynamic 3D assets. To control the dynamic strength of these assets, we introduce a 3D-to-4D motion magnitude metric as guidance. Additionally, we propose a novel motion magnitude reconstruction loss and 3D-aware classifier-free guidance to refine the learning and generation of motion dynamics. After obtaining orbital views of the 4D asset, we perform explicit 4D construction with Gaussian splatting in a coarse-to-fine manner. Extensive experiments demonstrate that our method surpasses prior state-of-the-art techniques in terms of generation efficiency and 4D geometry consistency across various prompt modalities.

AAAI Conference 2024 Conference Paper

Frequency-Aware Deepfake Detection: Improving Generalizability through Frequency Space Domain Learning

  • Chuangchuang Tan
  • Yao Zhao
  • Shikui Wei
  • Guanghua Gu
  • Ping Liu
  • Yunchao Wei

This research addresses the challenge of developing a universal deepfake detector that can effectively identify unseen deepfake images despite limited training data. Existing frequency-based paradigms have relied on frequency-level artifacts introduced during the up-sampling in GAN pipelines to detect forgeries. However, the rapid advancements in synthesis technology have led to specific artifacts for each generation model. Consequently, these detectors have exhibited a lack of proficiency in learning the frequency domain and tend to overfit to the artifacts present in the training data, leading to suboptimal performance on unseen sources. To address this issue, we introduce a novel frequency-aware approach called FreqNet, centered around frequency domain learning, specifically designed to enhance the generalizability of deepfake detectors. Our method forces the detector to continuously focus on high-frequency information, exploiting high-frequency representation of features across spatial and channel dimensions. Additionally, we incorporate a straightforward frequency domain learning module to learn source-agnostic features. It involves convolutional layers applied to both the phase spectrum and amplitude spectrum between the Fast Fourier Transform (FFT) and Inverse Fast Fourier Transform (iFFT). Extensive experimentation involving 17 GANs demonstrates the effectiveness of our proposed method, showcasing state-of-the-art performance (+9.8%) while requiring fewer parameters. The code is available at https://github.com/chuangchuangtan/FreqNet-DeepfakeDetection.
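
A minimal PyTorch sketch of the frequency-domain learning module described above (convolutions on the amplitude and phase spectra between FFT and iFFT); channel counts and kernel sizes are assumptions, not FreqNet's actual configuration.

```python
import torch
import torch.nn as nn

class FreqConv(nn.Module):
    """Sketch: FFT the feature map, convolve the amplitude and phase spectra
    separately, recombine, and invert back to the spatial domain."""
    def __init__(self, channels):
        super().__init__()
        self.amp_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.pha_conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):                       # x: (B, C, H, W)
        f = torch.fft.fft2(x, norm="ortho")
        amp = self.amp_conv(torch.abs(f))       # learn on amplitude spectrum
        pha = self.pha_conv(torch.angle(f))     # learn on phase spectrum
        f2 = torch.polar(amp, pha)              # recombine amplitude & phase
        return torch.fft.ifft2(f2, norm="ortho").real
```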

NeurIPS Conference 2023 Conference Paper

Learning Mask-aware CLIP Representations for Zero-Shot Segmentation

  • Siyu Jiao
  • Yunchao Wei
  • Yaowei Wang
  • Yao Zhao
  • Humphrey Shi

Recently, pre-trained vision-language models have been increasingly used to tackle the challenging zero-shot segmentation task. Typical solutions follow the paradigm of first generating mask proposals and then adopting CLIP to classify them. To maintain CLIP's zero-shot transferability, previous practice favours freezing CLIP during training. However, in this paper, we reveal that CLIP is insensitive to different mask proposals and tends to produce similar predictions for various mask proposals of the same image. This insensitivity results in numerous false positives when classifying mask proposals and mainly relates to the fact that CLIP is trained with image-level supervision. To alleviate this issue, we propose a simple yet effective method named Mask-aware Fine-tuning (MAFT). Specifically, an Image-Proposals CLIP Encoder (IP-CLIP Encoder) is proposed to handle arbitrary numbers of images and mask proposals simultaneously. Then, a mask-aware loss and a self-distillation loss are designed to fine-tune the IP-CLIP Encoder, ensuring CLIP is responsive to different mask proposals without sacrificing transferability. In this way, mask-aware representations can be easily learned to make the true positives stand out. Notably, our solution can seamlessly plug into most existing methods without introducing any new parameters during the fine-tuning process. We conduct extensive experiments on the popular zero-shot benchmarks. With MAFT, the performance of state-of-the-art methods is promoted by a large margin: 50.4% (+8.2%) on COCO, 81.8% (+3.2%) on Pascal-VOC, and 8.7% (+4.3%) on ADE20K in terms of mIoU for unseen classes. Code is available at https://github.com/jiaosiyu1999/MAFT.git.

AAAI Conference 2023 Conference Paper

Progressive Neighborhood Aggregation for Semantic Segmentation Refinement

  • Ting Liu
  • Yunchao Wei
  • Yanning Zhang

Multi-scale features from backbone networks have been widely applied to recover object details in segmentation tasks. Generally, multi-level features are fused in a certain manner for further pixel-level dense prediction, whereas the spatial structure information is not fully explored; that is, similar nearby pixels can be used to complement each other. In this paper, we investigate a progressive neighborhood aggregation (PNA) framework to refine semantic segmentation predictions, resulting in an end-to-end solution that performs coarse prediction and refinement in a unified network. Specifically, we first present a neighborhood aggregation module, in which neighborhood similarity matrices for each pixel are estimated on multi-scale features and then used to progressively aggregate the high-level feature to recover spatial structure. In addition, to further integrate high-resolution details into the aggregated feature, we apply a self-aggregation module to the low-level features to emphasize important semantic information and recover lost spatial details. Extensive experiments on five segmentation datasets, including Pascal VOC 2012, CityScapes, COCO-Stuff 10k, DeepGlobe, and Trans10k, demonstrate that the proposed framework can be cascaded onto existing segmentation models, providing consistent improvements. In particular, our method achieves new state-of-the-art performance on two challenging datasets, DeepGlobe and Trans10k. The code is available at https://github.com/liutinglt/PNA.

NeurIPS Conference 2023 Conference Paper

RIO: A Benchmark for Reasoning Intention-Oriented Objects in Open Environments

  • Mengxue Qu
  • Yu Wu
  • Wu Liu
  • Xiaodan Liang
  • Jingkuan Song
  • Yao Zhao
  • Yunchao Wei

Intention-oriented object detection aims to detect desired objects based on specific intentions or requirements. For instance, when we desire to "lie down and rest", we instinctively seek out a suitable option such as a "bed" or a "sofa" that can fulfill our needs. Previous work in this area is limited either by the number of intention descriptions or by the affordance vocabulary available for intention objects. These limitations make it challenging to handle intentions in open environments effectively. To facilitate this research, we construct a comprehensive dataset called Reasoning Intention-Oriented Objects (RIO). In particular, RIO is specifically designed to incorporate diverse real-world scenarios and a wide range of object categories. It offers the following key features: 1) intention descriptions in RIO are represented as natural sentences rather than a mere word or verb phrase, making them more practical and meaningful; 2) the intention descriptions are contextually relevant to the scene, enabling a broader range of potential functionalities associated with the objects; 3) the dataset comprises a total of 40,214 images and 130,585 intention-object pairs. With the proposed RIO, we evaluate the ability of some existing models to reason intention-oriented objects in open environments.

NeurIPS Conference 2023 Conference Paper

SegRefiner: Towards Model-Agnostic Segmentation Refinement with Discrete Diffusion Process

  • Mengyu Wang
  • Henghui Ding
  • Jun Hao Liew
  • Jiajun Liu
  • Yao Zhao
  • Yunchao Wei

In this paper, we explore a principled way to enhance the quality of object masks produced by different segmentation models. We propose a model-agnostic solution called SegRefiner, which offers a novel perspective on this problem by interpreting segmentation refinement as a data generation process. As a result, the refinement process can be smoothly implemented through a series of denoising diffusion steps. Specifically, SegRefiner takes coarse masks as inputs and refines them using a discrete diffusion process. By predicting the label and corresponding state-transition probabilities for each pixel, SegRefiner progressively refines the noisy masks in a conditional denoising manner. To assess the effectiveness of SegRefiner, we conduct comprehensive experiments on various segmentation tasks, including semantic segmentation, instance segmentation, and dichotomous image segmentation. The results demonstrate the superiority of our SegRefiner from multiple aspects. Firstly, it consistently improves both the segmentation metrics and boundary metrics across different types of coarse masks. Secondly, it outperforms previous model-agnostic refinement methods by a significant margin. Lastly, it exhibits a strong capability to capture extremely fine details when refining high-resolution images. The source code and trained models are available at SegRefiner.git.

NeurIPS Conference 2022 Conference Paper

Mask Matching Transformer for Few-Shot Segmentation

  • Siyu Jiao
  • Gengwei Zhang
  • Shant Navasardyan
  • Ling Chen
  • Yao Zhao
  • Yunchao Wei
  • Humphrey Shi

In this paper, we tackle the challenging few-shot segmentation task from a new perspective. Typical methods follow the paradigm of first learning prototypical features from support images and then matching query features at the pixel level to obtain segmentation results. However, to obtain satisfactory segments, such a paradigm needs to couple the learning of the matching operations with heavy segmentation modules, limiting the flexibility of design and increasing the learning complexity. To alleviate this issue, we propose the Mask Matching Transformer (MM-Former), a new paradigm for the few-shot segmentation task. Specifically, MM-Former first uses a class-agnostic segmenter to decompose the query image into multiple segment proposals. Then, a simple matching mechanism is applied to merge the related segment proposals into the final mask, guided by the support images. The advantages of our MM-Former are two-fold. First, MM-Former follows the paradigm of 'decompose first and then blend', allowing our method to benefit from advanced potential-object segmenters to produce high-quality mask proposals for query images. Second, the mission of the prototypical features is relaxed to learning coefficients that fuse the correct proposals within a proposal pool, making MM-Former generalize well to complex scenarios and cases. We conduct extensive experiments on the popular COCO-$20^i$ and Pascal-$5^i$ benchmarks. Competitive results demonstrate the effectiveness and generalization ability of our MM-Former. Code is available at https://github.com/Picsart-AI-Research/Mask-Matching-Transformer.

NeurIPS Conference 2022 Conference Paper

Mining Unseen Classes via Regional Objectness: A Simple Baseline for Incremental Segmentation

  • Zekang Zhang
  • Guangyu Gao
  • Zhiyuan Fang
  • Jianbo Jiao
  • Yunchao Wei

Incremental or continual learning has been extensively studied for image classification tasks to alleviate catastrophic forgetting, a phenomenon in which earlier learned knowledge is forgotten when learning new concepts. For class-incremental semantic segmentation, this phenomenon often becomes much worse due to the semantic shift of the background class, i.e., some concepts learned at previous stages are assigned to the background class at the current training stage, significantly reducing the performance of these old concepts. To address this issue, we propose a simple yet effective method named Mining unseen Classes via Regional Objectness (MicroSeg). MicroSeg is based on the assumption that background regions with strong objectness possibly belong to concepts from historical or future stages. Therefore, to avoid forgetting old knowledge at the current training stage, MicroSeg first splits the given image into hundreds of segment proposals with a proposal generator. Segment proposals with strong objectness from the background are then clustered and assigned newly defined labels during optimization. In this way, the distribution characteristics of old concepts in the feature space can be better perceived, relieving the catastrophic forgetting caused by the semantic shift of the background class. We conduct extensive experiments on Pascal VOC and ADE20K, and competitive results demonstrate the effectiveness of our MicroSeg. Code is available at https://github.com/zkzhang98/MicroSeg.

NeurIPS Conference 2021 Conference Paper

Associating Objects with Transformers for Video Object Segmentation

  • Zongxin Yang
  • Yunchao Wei
  • Yi Yang

This paper investigates how to realize better and more efficient embedding learning to tackle semi-supervised video object segmentation under challenging multi-object scenarios. State-of-the-art methods learn to decode features with a single positive object and thus have to match and segment each target separately under multi-object scenarios, consuming multiple times the computing resources. To solve this problem, we propose an Associating Objects with Transformers (AOT) approach to match and decode multiple objects uniformly. In detail, AOT employs an identification mechanism to associate multiple targets into the same high-dimensional embedding space, so the matching and segmentation decoding of multiple objects can be processed as efficiently as for a single object. To sufficiently model multi-object association, a Long Short-Term Transformer is designed for constructing hierarchical matching and propagation. We conduct extensive experiments on both multi-object and single-object benchmarks to examine AOT variant networks with different complexities. Particularly, our R50-AOT-L outperforms all state-of-the-art competitors on three popular benchmarks, i.e., YouTube-VOS (84.1% J&F), DAVIS 2017 (84.9%), and DAVIS 2016 (91.1%), while keeping a more than 3X faster multi-object run-time. Meanwhile, our AOT-T can maintain real-time multi-object speed on the above benchmarks. Based on AOT, we ranked 1st in the 3rd Large-scale VOS Challenge.

NeurIPS Conference 2021 Conference Paper

Few-Shot Segmentation via Cycle-Consistent Transformer

  • Gengwei Zhang
  • Guoliang Kang
  • Yi Yang
  • Yunchao Wei

Few-shot segmentation aims to train a segmentation model that can quickly adapt to novel classes with few exemplars. The conventional training paradigm is to learn to make predictions on query images conditioned on features from support images. Previous methods only utilized the semantic-level prototypes of support images as conditional information; they cannot exploit all the pixel-wise support information for query predictions, which is nevertheless critical for the segmentation task. In this paper, we focus on utilizing pixel-wise relationships between support and target images to facilitate few-shot semantic segmentation. We design a novel Cycle-Consistent Transformer (CyCTR) module to aggregate pixel-wise support features into query features. CyCTR performs cross-attention between features from different images, i.e., support and query images. We observe that there may exist unexpected irrelevant pixel-level support features; directly performing cross-attention may aggregate these features into the query and bias the query features. Thus, we propose a novel cycle-consistent attention mechanism to filter out possibly harmful support features and encourage query features to attend to the most informative pixels from support images. Experiments on all few-shot segmentation benchmarks demonstrate that our proposed CyCTR leads to remarkable improvements over previous state-of-the-art methods. Specifically, on the Pascal-5^i and COCO-20^i datasets, we achieve 66.6% and 45.6% mIoU for 5-shot segmentation, outperforming the previous state of the art by 4.6% and 7.1%, respectively.
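
The cycle-consistency idea can be sketched as a forward-backward argmax walk over the support-query affinity matrix; this is a simplification of how CyCTR actually biases its attention, with flattened per-pixel features as an assumed input format.

```python
import torch
import torch.nn.functional as F

def cycle_consistent_mask(support_feat, query_feat, support_label):
    """Sketch of the cycle-consistency filter: a support pixel is kept if the
    support->query->support walk over feature affinities returns to a pixel
    with the same label. support_feat: (Ns, C), query_feat: (Nq, C),
    support_label: (Ns,)."""
    s = F.normalize(support_feat, dim=-1)
    q = F.normalize(query_feat, dim=-1)
    affinity = s @ q.t()                        # (Ns, Nq) cosine affinities
    to_query = affinity.argmax(dim=1)           # most similar query pixel
    back = affinity.argmax(dim=0)               # (Nq,) best support per query
    returned = back[to_query]                   # support -> query -> support
    return support_label[returned] == support_label   # consistent pixels only
```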

NeurIPS Conference 2020 Conference Paper

Consistent Structural Relation Learning for Zero-Shot Segmentation

  • Peike Li
  • Yunchao Wei
  • Yi Yang

Zero-shot semantic segmentation aims to recognize the semantics of pixels from unseen categories with zero training samples. Previous practice [1] proposed to train classifiers for unseen categories using visual features generated from semantic word embeddings. However, the generator is learned only on the seen categories, and no constraint is applied to the unseen categories, leading to poor generalization. In this work, we propose a Consistent Structural Relation Learning (CSRL) approach that constrains the generation of unseen visual features by exploiting the structural relations between seen and unseen categories. We observe that different categories usually exhibit similar relations in both the semantic word embedding space and the visual feature space. This observation motivates us to harness the category-level relations in the semantic word embedding space to learn a better visual feature generator. Concretely, by exploring pair-wise and list-wise structures, we impose the relations of generated visual features to be consistent with their counterparts in the semantic word embedding space. In this way, the relations between seen and unseen categories are transferred to implicitly constrain the generator to produce relation-consistent unseen visual features. We conduct extensive experiments on the Pascal-VOC and Pascal-Context benchmarks. The proposed CSRL significantly outperforms existing state-of-the-art methods by a large margin, with gains of roughly 7-12% on Pascal-VOC and 2-5% on Pascal-Context.
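
The pair-wise relation consistency constraint admits a compact sketch: match the cosine-relation matrix of generated visual features to the one computed in the word-embedding space. The list-wise term of the paper is omitted, and the MSE form is an assumption.

```python
import torch
import torch.nn.functional as F

def relation_consistency_loss(gen_visual, word_emb):
    """Sketch of the pair-wise structural constraint. gen_visual: (K, Dv)
    generated visual features, word_emb: (K, Ds) word embeddings, one row
    per category; both relation matrices are K x K cosine similarities."""
    v = F.normalize(gen_visual, dim=-1)
    s = F.normalize(word_emb, dim=-1)
    rel_visual = v @ v.t()                  # relations in visual space
    rel_semantic = s @ s.t()                # relations in embedding space
    return F.mse_loss(rel_visual, rel_semantic)
```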

NeurIPS Conference 2020 Conference Paper

Pixel-Level Cycle Association: A New Perspective for Domain Adaptive Semantic Segmentation

  • Guoliang Kang
  • Yunchao Wei
  • Yi Yang
  • Yueting Zhuang
  • Alexander Hauptmann

Domain adaptive semantic segmentation aims to train a model that makes satisfactory pixel-level predictions on the target domain with only out-of-domain (source) annotations. The conventional solution is to minimize the discrepancy between source and target to enable effective knowledge transfer. Previous domain-discrepancy-minimization methods are mainly based on adversarial training; they tend to consider the domain discrepancy globally, which ignores pixel-wise relationships and is less discriminative. In this paper, we propose to build pixel-level cycle associations between source and target pixel pairs and contrastively strengthen their connections to diminish the domain gap and make the features more discriminative. To the best of our knowledge, this is a new perspective for tackling such a challenging task. Experimental results on two representative domain adaptation benchmarks, i.e., GTAV $\rightarrow$ Cityscapes and SYNTHIA $\rightarrow$ Cityscapes, verify the effectiveness of our proposed method and demonstrate that it performs favorably against previous state-of-the-art methods. Our method can be trained end-to-end in one stage and introduces no additional parameters; it is expected to serve as a general framework and help ease future research in domain adaptive semantic segmentation. Code is available at https://github.com/kgl-prml/Pixel-Level-Cycle-Association.

AAAI Conference 2020 Conference Paper

Reliability Does Matter: An End-to-End Weakly Supervised Semantic Segmentation Approach

  • Bingfeng Zhang
  • Jimin Xiao
  • Yunchao Wei
  • Mingjie Sun
  • Kaizhu Huang

Weakly supervised semantic segmentation is a challenging task, as it uses only image-level information as supervision for training but produces pixel-level predictions at testing. To address this task, most recent state-of-the-art approaches adopt two-step solutions, i.e., 1) learning to generate pseudo pixel-level masks, and 2) engaging FCNs to train the semantic segmentation networks with the pseudo masks. However, two-step solutions usually employ many bells and whistles in producing high-quality pseudo masks, making such methods complicated and inelegant. In this work, we harness image-level labels to produce reliable pixel-level annotations and design a fully end-to-end network that learns to predict segmentation maps. Concretely, we first leverage an image classification branch to generate class activation maps for the annotated categories, which are further pruned into confident yet tiny object/background regions. Such reliable regions then directly serve as ground-truth labels for a parallel segmentation branch, where a newly designed dense energy loss function is adopted for optimization. Despite its apparent simplicity, our one-step solution achieves competitive mIoU scores (val: 62.6, test: 62.9) on Pascal VOC compared with two-step state-of-the-art methods. By extending our one-step method to two steps, we obtain a new state-of-the-art performance on Pascal VOC (val: 66.3, test: 66.5).

IJCAI Conference 2019 Conference Paper

Boundary Perception Guidance: A Scribble-Supervised Semantic Segmentation Approach

  • Bin Wang
  • Guojun Qi
  • Sheng Tang
  • Tianzhu Zhang
  • Yunchao Wei
  • Linghui Li
  • Yongdong Zhang

Semantic segmentation suffers from the fact that densely annotated masks are expensive to obtain. To tackle this problem, we aim to learn to segment by leveraging only scribbles, which are much easier to collect, for supervision. To fully exploit the limited pixel-level annotations from scribbles, we present a novel Boundary Perception Guidance (BPG) approach, which consists of two basic components, i.e., prediction refinement and boundary regression. Specifically, the prediction refinement progressively produces better segmentation by adopting an iterative upsampling and semantic feature enhancement strategy. In the boundary regression, we employ class-agnostic edge maps for supervision to effectively guide the segmentation network in localizing the boundaries between different semantic regions, producing finer-grained feature maps for semantic segmentation. Experimental results on PASCAL VOC 2012 demonstrate that the proposed BPG achieves an mIoU of 73.2% without a fully connected Conditional Random Field (CRF) and 76.0% with CRF, setting up a new state of the art in the literature.

AAAI Conference 2019 Conference Paper

Devil in the Details: Towards Accurate Single and Multiple Human Parsing

  • Tao Ruan
  • Ting Liu
  • Zilong Huang
  • Yunchao Wei
  • Shikui Wei
  • Yao Zhao

Human parsing has received considerable interest due to its wide application potential. Nevertheless, it is still unclear how to develop an accurate human parsing system in an efficient and elegant way. In this paper, we identify several useful properties, including feature resolution, global context information, and edge details, and perform rigorous analyses to reveal how to leverage them to benefit the human parsing task. The advantages of these useful properties finally result in a simple yet effective Context Embedding with Edge Perceiving (CE2P) framework for single human parsing. Our CE2P is end-to-end trainable and can be easily adopted for multiple human parsing. Benefiting from the superiority of CE2P, we won 1st place on all three human parsing tracks in the 2nd Look into Person (LIP) Challenge. Without any bells and whistles, we achieved 56.50% (mIoU), 45.31% (mean AP^r), and 33.34% (AP^p at 0.5) in Track 1, Track 2, and Track 5, outperforming the state of the art by more than 2.06%, 3.81%, and 1.87%, respectively. We hope our CE2P will serve as a solid baseline and help ease future research in single/multiple human parsing. Code has been made available at https://github.com/liutinglt/CE2P.

AAAI Conference 2019 Conference Paper

Horizontal Pyramid Matching for Person Re-Identification

  • Yang Fu
  • Yunchao Wei
  • Yuqian Zhou
  • Honghui Shi
  • Gao Huang
  • Xinchao Wang
  • Zhiqiang Yao
  • Thomas Huang

Despite the remarkable progress in person re-identification (Re-ID), such approaches still suffer from failure cases in which discriminative body parts are missing. To mitigate this type of failure, we propose a simple yet effective Horizontal Pyramid Matching (HPM) approach to fully exploit various partial information of a given person, so that correct person candidates can be identified even when some key parts are missing. With HPM, we make the following contributions toward more robust feature representations for the Re-ID task: 1) we learn to classify using partial feature representations at different horizontal pyramid scales, which successfully enhances the discriminative capability of various person parts; 2) we exploit average and max pooling strategies to account for person-specific discriminative information in a global-local manner. To validate the effectiveness of our proposed HPM, extensive experiments are conducted on three popular datasets: Market-1501, DukeMTMC-ReID, and CUHK03. We achieve mAP scores of 83.1%, 74.5%, and 59.7% on these challenging benchmarks, respectively, setting new state-of-the-art results.
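
The horizontal pyramid pooling at the core of HPM is easy to sketch: slice the feature map into horizontal strips at several scales and pool each strip with both average and max pooling. The 1/2/4/8 scale set follows the common HPM setup but is an assumption here.

```python
import torch
import torch.nn.functional as F

def horizontal_pyramid_features(feat, scales=(1, 2, 4, 8)):
    """Sketch of horizontal pyramid matching. feat: (B, C, H, W) backbone
    feature map; returns (B, C, sum(scales)) part-level descriptors, one
    column per horizontal strip at each pyramid scale."""
    parts = []
    for s in scales:
        strips = F.adaptive_avg_pool2d(feat, (s, 1)) + \
                 F.adaptive_max_pool2d(feat, (s, 1))   # (B, C, s, 1)
        parts.append(strips.flatten(2))                # s strip vectors
    return torch.cat(parts, dim=2)
```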

AAAI Conference 2019 Conference Paper

STA: Spatial-Temporal Attention for Large-Scale Video-Based Person Re-Identification

  • Yang Fu
  • Xiaoyang Wang
  • Yunchao Wei
  • Thomas Huang

In this work, we propose a novel Spatial-Temporal Attention (STA) approach to tackle the large-scale video-based person re-identification task. Unlike most existing methods, which simply compute representations of video clips by frame-level aggregation (e.g., average pooling), the proposed STA adopts a more effective way to produce robust clip-level feature representations. Concretely, STA fully exploits the discriminative parts of one target person in both the spatial and temporal dimensions, yielding a 2-D attention score matrix, computed via inter-frame regularization, that measures the importance of spatial parts across different frames. A more robust clip-level feature representation can then be generated through a weighted sum operation guided by the mined 2-D attention score matrix. In this way, challenging cases for video-based person re-identification, such as pose variation and partial occlusion, can be well handled by STA. We conduct extensive experiments on two large-scale benchmarks, i.e., MARS and DukeMTMC-VideoReID. In particular, the mAP reaches 87.7% on MARS, significantly outperforming the state of the art by a large margin of more than 11.6%.
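
A rough sketch of the 2-D attention score matrix and the weighted fusion. Scoring parts by feature norms is a common proxy and an assumption here, and the paper's inter-frame regularization is only approximated by a softmax across frames.

```python
import torch

def sta_clip_feature(frame_feats, num_parts=4):
    """Sketch of the STA idea: score each horizontal part of each frame,
    normalize scores across frames into a T x P attention matrix, and fuse
    frames by weighted sum. frame_feats: (T, C, H, W), H divisible by
    num_parts; returns the (C, P) clip feature and the attention matrix."""
    T, C, H, W = frame_feats.shape
    parts = frame_feats.reshape(T, C, num_parts, H // num_parts, W)
    p = parts.mean(dim=(-2, -1))                 # (T, C, P) part descriptors
    scores = p.norm(dim=1)                       # (T, P) 2-D attention matrix
    attn = torch.softmax(scores, dim=0)          # normalize across frames
    clip = (p * attn.unsqueeze(1)).sum(dim=0)    # (C, P) weighted fusion
    return clip, attn
```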

AAAI Conference 2019 Conference Paper

Weakly Supervised Scene Parsing with Point-Based Distance Metric Learning

  • Rui Qian
  • Yunchao Wei
  • Honghui Shi
  • Jiachen Li
  • Jiaying Liu
  • Thomas Huang

Semantic scene parsing suffers from the fact that pixel-level annotations are hard to collect. To tackle this issue, we propose Point-based Distance Metric Learning (PDML) in this paper. PDML does not require densely annotated masks and leverages only several labeled points, which are much easier to obtain, to guide the training process. Concretely, we exploit the semantic relationships among the annotated points by encouraging the feature representations of intra- and inter-category points to stay consistent, i.e., points within the same category should have more similar feature representations than those from different categories. We formulate this characteristic into a simple distance metric loss, which collaborates with the point-wise cross-entropy loss to optimize the deep neural networks. Furthermore, to fully exploit the limited annotations, distance metric learning is conducted across different training images rather than in a simply image-dependent manner. We conduct extensive experiments on two challenging scene parsing benchmarks, PASCAL-Context and ADE20K, to validate the effectiveness of our PDML, and competitive mIoU scores are achieved.
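
The distance metric loss can be sketched as a margin-based pull-push objective over features gathered at the labeled points (possibly across images, as the abstract notes); the margin value and squared form are assumptions.

```python
import torch
import torch.nn.functional as F

def point_metric_loss(feats, labels, margin=1.0):
    """Sketch of the point-based metric loss: pull same-category point
    features together, push different categories apart by a margin.
    feats: (P, C) features at labeled points, labels: (P,)."""
    d = torch.cdist(feats, feats)                       # pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=feats.device)
    pull = d[same & ~eye].pow(2).mean()                 # intra-category
    push = F.relu(margin - d[~same]).pow(2).mean()      # inter-category
    return pull + push
```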

NeurIPS Conference 2018 Conference Paper

Self-Erasing Network for Integral Object Attention

  • Qibin Hou
  • PengTao Jiang
  • Yunchao Wei
  • Ming-Ming Cheng

Recently, adversarial erasing for weakly-supervised object attention has been deeply studied due to its capability of localizing integral object regions. However, such a strategy raises a key problem: attention regions gradually expand to non-object regions as training iterations continue, which significantly decreases the quality of the produced attention maps. To tackle this issue and promote the quality of object attention, we introduce a simple yet effective Self-Erasing Network (SeeNet) that prohibits attention from spreading to unexpected background regions. In particular, SeeNet leverages two self-erasing strategies to encourage networks to use reliable object and background cues for learning attention. In this way, integral object regions can be effectively highlighted without including many background regions. To test the quality of the generated attention maps, we employ the mined object regions as heuristic cues for learning semantic segmentation models. Experiments on Pascal VOC demonstrate the superiority of our SeeNet over other state-of-the-art methods.

AAAI Conference 2018 Conference Paper

Transferable Semi-Supervised Semantic Segmentation

  • Huaxin Xiao
  • Yunchao Wei
  • Yu Liu
  • Maojun Zhang
  • Jiashi Feng

The performance of deep learning based semantic segmentation models heavily depends on sufficient data with careful annotations. However, even the largest public datasets only provide samples with pixel-level annotations for rather limited semantic categories. Such data scarcity critically limits the scalability and applicability of semantic segmentation models in real applications. In this paper, we propose a novel transferable semi-supervised semantic segmentation model that can transfer the learned segmentation knowledge from a few strong categories with pixel-level annotations to unseen weak categories with only image-level annotations, significantly broadening the applicable territory of deep segmentation models. In particular, the proposed model consists of two complementary and learnable components: a Label transfer Network (L-Net) and a Prediction transfer Network (P-Net). The L-Net learns to transfer segmentation knowledge from the strong categories to images in the weak categories and produces coarse pixel-level semantic maps by effectively exploiting the similar appearance shared across categories. Meanwhile, the P-Net tailors the transferred knowledge through a carefully designed adversarial learning strategy and produces refined segmentation results with better details. Integrating the L-Net and P-Net achieves 96.5% and 89.4% of the performance of the fully-supervised baseline using 50% and 0% of categories with pixel-level annotations, respectively, on PASCAL VOC 2012. With such a novel transfer mechanism, our proposed model is easily generalizable to a variety of new categories, requiring only image-level annotations, and offers appealing scalability in real applications.

AAAI Conference 2016 Conference Paper

Deep Learning with S-Shaped Rectified Linear Activation Units

  • Xiaojie Jin
  • Chunyan Xu
  • Jiashi Feng
  • Yunchao Wei
  • Junjun Xiong
  • Shuicheng Yan

Rectified linear activation units are important components of state-of-the-art deep convolutional networks. In this paper, we propose a novel S-shaped rectified linear activation unit (SReLU) to learn both convex and non-convex functions, imitating the multiple function forms given by two fundamental laws in psychophysics and neural sciences, namely the Weber-Fechner law and the Stevens law. Specifically, SReLU consists of three piecewise linear functions formulated by four learnable parameters. SReLU is learned jointly with the training of the whole deep network through backpropagation. During the training phase, to initialize SReLU in different layers, we propose a "freezing" method that degenerates SReLU into a predefined leaky rectified linear unit for the initial training epochs and then adaptively learns good initial values. SReLU can be used universally in existing deep networks with negligible additional parameters and computational cost. Experiments with two popular CNN architectures, Network in Network and GoogLeNet, on benchmarks of various scales, including CIFAR10, CIFAR100, MNIST, and ImageNet, demonstrate that SReLU achieves remarkable improvements over other activation functions.
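
The abstract fully specifies SReLU's form (three linear pieces controlled by four learnable parameters), so a faithful PyTorch sketch is straightforward; only the per-channel layout and initial values, chosen here to mimic the leaky-ReLU "freezing" initialization, are assumptions.

```python
import torch
import torch.nn as nn

class SReLU(nn.Module):
    """S-shaped rectified linear unit: identity between two learnable
    thresholds (t_l, t_r), linear pieces with learnable slopes (a_l, a_r)
    outside them. One parameter set per channel."""
    def __init__(self, channels):
        super().__init__()
        self.t_left = nn.Parameter(torch.zeros(channels))
        self.a_left = nn.Parameter(torch.full((channels,), 0.2))  # leaky init
        self.t_right = nn.Parameter(torch.ones(channels))
        self.a_right = nn.Parameter(torch.ones(channels))

    def forward(self, x):                      # x: (B, C, H, W)
        tl = self.t_left.view(1, -1, 1, 1)
        tr = self.t_right.view(1, -1, 1, 1)
        al = self.a_left.view(1, -1, 1, 1)
        ar = self.a_right.view(1, -1, 1, 1)
        y = torch.where(x < tl, tl + al * (x - tl), x)       # left piece
        return torch.where(x > tr, tr + ar * (x - tr), y)    # right piece
```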

TIST Journal 2016 Journal Article

Modality-Dependent Cross-Media Retrieval

  • Yunchao Wei
  • Yao Zhao
  • Zhenfeng Zhu
  • Shikui Wei
  • Yanhui Xiao
  • Jiashi Feng
  • Shuicheng Yan

In this article, we investigate cross-media retrieval between images and text, that is, using an image to search for text (I2T) and using text to search for images (T2I). Existing cross-media retrieval methods usually learn one couple of projections, by which the original features of images and text can be projected into a common latent space to measure content similarity. However, using the same projections for the two different retrieval tasks (I2T and T2I) may lead to a trade-off between their respective performances rather than their best performances. Different from previous works, we propose a modality-dependent cross-media retrieval (MDCR) model, in which two couples of projections are learned for the different cross-media retrieval tasks instead of one. Specifically, by jointly optimizing the correlation between images and text and the linear regression from one modal space (image or text) to the semantic space, two couples of mappings are learned to project images and text from their original feature spaces into two common latent subspaces (one for I2T and the other for T2I). Extensive experiments show the superiority of the proposed MDCR compared with other methods. In particular, based on the 4,096-dimensional convolutional neural network (CNN) visual feature and the 100-dimensional Latent Dirichlet Allocation (LDA) textual feature, the proposed method achieves an mAP score of 41.5%, a new state-of-the-art performance on the Wikipedia dataset.
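
A toy sketch of the I2T branch as described (a correlation term plus a regression of image features onto the semantic space); the objective weights, the gradient-based solver, and the exact term structure are assumptions, as the paper may use a closed-form or regularized formulation.

```python
import torch

def learn_mdcr_projections(X, Y, S, lam=1.0, steps=500):
    """Sketch of the I2T branch of MDCR: learn projections U (image) and V
    (text) that (a) correlate the two modalities in a shared latent space
    and (b) regress the image features onto the semantic matrix S; the T2I
    branch would swap the regression term onto Y. X: (N, Dx) image features,
    Y: (N, Dy) text features, S: (N, K) semantic (e.g., one-hot) matrix."""
    K = S.shape[1]
    U = torch.randn(X.shape[1], K, requires_grad=True)
    V = torch.randn(Y.shape[1], K, requires_grad=True)
    opt = torch.optim.Adam([U, V], lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        corr = ((X @ U - Y @ V) ** 2).mean()     # cross-modal correlation
        reg = ((X @ U - S) ** 2).mean()          # regression to semantics
        (corr + lam * reg).backward()
        opt.step()
    return U.detach(), V.detach()
```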

AAAI Conference 2014 Conference Paper

Proximal Iteratively Reweighted Algorithm with Multiple Splitting for Nonconvex Sparsity Optimization

  • Canyi Lu
  • Yunchao Wei
  • Zhouchen Lin
  • Shuicheng Yan

This paper proposes the Proximal Iteratively REweighted (PIRE) algorithm for solving a general problem that covers a large body of nonconvex sparse and structured-sparse problems. Compared with previous iterative solvers for nonconvex sparse problems, PIRE is much more general and efficient: its computational cost in each iteration is usually as low as that of state-of-the-art convex solvers. We further propose PIRE with Parallel Splitting (PIRE-PS) and PIRE with Alternative Updating (PIRE-AU) to handle multi-variable problems. In theory, we prove that our proposed methods converge and that any limit solution is a stationary point. Extensive experiments on both synthetic and real datasets demonstrate that our methods achieve comparable learning performance while being much more efficient than previous nonconvex solvers.
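
A sketch of the proximal iteratively reweighted idea for one common instance, l_p-regularized least squares, min 0.5||Ax-b||^2 + lam * sum|x_i|^p; the reweighting rule and step size follow the standard recipe and are assumptions about this particular instantiation, not PIRE's full generality.

```python
import numpy as np

def pire_lp(A, b, lam=0.1, p=0.5, iters=200, eps=1e-6):
    """Iteratively reweight the l1 penalty by lam*p*(|x_i|+eps)^(p-1),
    then take one proximal-gradient (soft-thresholding) step."""
    x = np.zeros(A.shape[1])
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz const. of gradient
    for _ in range(iters):
        w = lam * p * (np.abs(x) + eps) ** (p - 1)   # per-coordinate weights
        g = x - (A.T @ (A @ x - b)) / L              # gradient step
        x = np.sign(g) * np.maximum(np.abs(g) - w / L, 0.0)  # weighted shrink
    return x
```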