Author name cluster

Yanwei Fu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

33 papers
1 author row

Possible papers

AAAI Conference 2026 Conference Paper

One-Step Generative Policies with Q-Learning: A Reformulation of MeanFlow

  • Zeyuan Wang
  • Da Li
  • Yulin Chen
  • Ye Shi
  • Liang Bai
  • Tianyuan Yu
  • Yanwei Fu

We introduce a one-step generative policy for offline reinforcement learning that maps *noise* directly to *actions* via a *residual reformulation* of MeanFlow, making it compatible with Q-learning. While one-step Gaussian policies enable fast inference, they struggle to capture complex, multimodal action distributions. Existing flow-based methods improve expressivity but typically rely on distillation and two-stage training when trained with Q-learning. To overcome these limitations, we propose to reformulate MeanFlow to enable *direct noise-to-action generation* by integrating the velocity field and noise-to-action transformation into a single policy network—eliminating the need for separate velocity estimation. We explore several reformulation variants and identify an effective *residual formulation* that supports expressive and stable policy learning. Our method offers three key advantages: 1) efficient one-step noise-to-action generation, 2) expressive modelling of multimodal action distributions, and 3) efficient and stable policy learning via Q-learning in a single-stage training setup. Extensive experiments on 73 tasks across the OGBench and D4RL benchmarks demonstrate that our method achieves strong performance in both offline and offline-to-online reinforcement learning settings.

AAAI Conference 2026 Conference Paper

Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination

  • Mingqi Wu
  • Zhihao Zhang
  • Qiaole Dong
  • Zhiheng Xi
  • Jun Zhao
  • Senjie Jin
  • Xiaoran Fan
  • Yuhao Zhou

Reasoning in large language models has long been a central research focus, and recent studies employing reinforcement learning (RL) have introduced diverse methods that yield substantial performance gains with minimal or even no external supervision. Surprisingly, some studies even suggest that random or incorrect reward signals can enhance performance. However, these breakthroughs are predominantly observed for the mathematically strong Qwen2.5 series on benchmarks such as MATH-500, AMC, and AIME, and seldom transfer to models like Llama, which warrants a more in-depth investigation. In this work, our empirical analysis reveals that pre-training on massive web-scale corpora leaves Qwen2.5 susceptible to data contamination in widely used benchmarks. Consequently, conclusions derived from contaminated benchmarks on Qwen2.5 series may be unreliable. To obtain trustworthy evaluation results, we introduce a generator that creates fully clean arithmetic problems of arbitrary length and difficulty, dubbed RandomCalculation. Using this leakage-free dataset, we show that only accurate reward signals yield steady improvements that surpass the base model’s performance boundary in mathematical reasoning, whereas random or incorrect rewards do not. Moreover, we conduct more fine-grained analyses to elucidate the factors underlying the different performance observed on the MATH-500 and RandomCalculation benchmarks. Consequently, we recommend that future studies evaluate models on uncontaminated benchmarks and, when feasible, test various model series to ensure trustworthy conclusions about RL and related methods.
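
The idea of a leakage-free arithmetic generator can be sketched as follows. This is a minimal illustration, not the authors' RandomCalculation code: the function names `make_problem` and `exact_match_reward`, the operator set, and the operand range are all assumptions made here for the example.

```python
import random


def make_problem(num_operands, max_value=100, seed=None):
    """Return (expression_text, exact_answer) for a random arithmetic chain.

    Hypothetical helper: problem length is controlled by num_operands,
    and fresh random operands make memorized answers unlikely.
    """
    rng = random.Random(seed)
    operands = [rng.randint(1, max_value) for _ in range(num_operands)]
    operators = [rng.choice("+-*") for _ in range(num_operands - 1)]

    # Interleave operands and operators: "a op b op c ..."
    tokens = [str(operands[0])]
    for op, value in zip(operators, operands[1:]):
        tokens += [op, str(value)]
    expression = " ".join(tokens)

    # The string contains only integers and + - *, so evaluating it
    # with builtins disabled yields the exact integer answer.
    answer = eval(expression, {"__builtins__": {}})
    return expression, answer


def exact_match_reward(prediction, answer):
    """An 'accurate' reward signal: 1.0 only for the exact answer."""
    return 1.0 if prediction == answer else 0.0


if __name__ == "__main__":
    expr, ans = make_problem(num_operands=6, seed=0)
    print(expr, "=", ans)
```

Because each problem is sampled at evaluation time, a model cannot have seen it during pre-training, which is the property the abstract relies on.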

AAAI Conference 2026 Conference Paper

SwiftVideo: A Unified Framework for Few-Step Video Generation Through Trajectory-Distribution Alignment

  • Yanxiao Sun
  • Jiafu Wu
  • Yun Cao
  • Chengming Xu
  • Yabiao Wang
  • Weijian Cao
  • Donghao Luo
  • Chengjie Wang

Diffusion-based or flow-based models have achieved significant progress in video synthesis but require multiple iterative sampling steps, which incurs substantial computational overhead. While many distillation methods that are solely based on trajectory-preserving or distribution-matching have been developed to accelerate video generation models, these approaches often suffer from performance breakdown or increased artifacts in few-step settings. To address these limitations, we propose SwiftVideo, a unified and stable distillation framework that combines the advantages of trajectory-preserving and distribution-matching strategies. Our approach introduces continuous-time consistency distillation to ensure precise preservation of ODE trajectories. Subsequently, we propose a dual-perspective alignment encompassing distribution alignment between synthetic and real data along with trajectory alignment across different inference steps. Our method maintains high-quality video generation while substantially reducing the number of inference steps. Quantitative evaluations on the OpenVid-1M benchmark demonstrate that our method significantly outperforms existing approaches in few-step video generation.

AAAI Conference 2025 Conference Paper

ContextualStory: Consistent Visual Storytelling with Spatially-Enhanced and Storyline Context

  • Sixiao Zheng
  • Yanwei Fu

Visual storytelling involves generating a sequence of coherent frames from a textual storyline while maintaining consistency in characters and scenes. Existing autoregressive methods, which rely on previous frame-sentence pairs, struggle with high memory usage, slow generation speeds, and limited context integration. To address these issues, we propose ContextualStory, a novel framework designed to generate coherent story frames and extend frames for visual storytelling. ContextualStory utilizes Spatially-Enhanced Temporal Attention to capture spatial and temporal dependencies, handling significant character movements effectively. Additionally, we introduce a Storyline Contextualizer to enrich context in storyline embedding, and a StoryFlow Adapter to measure scene changes between frames for guiding the model. Extensive experiments on PororoSV and FlintstonesSV datasets demonstrate that ContextualStory significantly outperforms existing SOTA methods in both story visualization and continuation.

IJCAI Conference 2025 Conference Paper

CrossVTON: Mimicking the Logic Reasoning on Cross-Category Virtual Try-On Guided by Tri-Zone Priors

  • Donghao Luo
  • Yujie Liang
  • Xu Peng
  • Xiaobin Hu
  • Boyuan Jiang
  • Chengming Xu
  • Taisong Jin
  • Chengjie Wang

Despite remarkable progress in image-based virtual try-on systems, generating realistic and robust fitting images for cross-category virtual try-on remains a challenging task. The primary difficulty arises from the absence of human-like reasoning, which involves addressing size mismatches between garments and models while recognizing and leveraging the distinct functionalities of various regions within the model images. To address this issue, we draw inspiration from human cognitive processes and disentangle the complex reasoning required for cross-category try-on into a structured framework. This framework systematically decomposes the model image into three distinct regions: try-on, reconstruction, and imagination zones. Each zone plays a specific role in accommodating the garment and facilitating realistic synthesis. To endow the model with robust reasoning capabilities for cross-category scenarios, we propose an iterative data constructor. This constructor encompasses diverse scenarios, including intra-category try-on, any-to-dress transformations (replacing any garment category with a dress), and dress-to-any transformations (replacing a dress with another garment category). Utilizing the generated dataset, we introduce a tri-zone priors generator that intelligently predicts the try-on, reconstruction, and imagination zones by analyzing how the input garment is expected to align with the model image. Guided by these tri-zone priors, our proposed method, CrossVTON, achieves state-of-the-art performance, surpassing existing baselines in both qualitative and quantitative evaluations. Notably, it demonstrates superior capability in handling cross-category virtual try-on, meeting the complex demands of real-world applications.

NeurIPS Conference 2025 Conference Paper

Domain-RAG: Retrieval-Guided Compositional Image Generation for Cross-Domain Few-Shot Object Detection

  • Yu Li
  • Xingyu Qiu
  • Yuqian Fu
  • Jie Chen
  • Tianwen Qian
  • Xu Zheng
  • Danda Pani Paudel
  • Yanwei Fu

Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel objects with only a handful of labeled samples from previously unseen domains. While data augmentation and generative methods have shown promise in few-shot learning, their effectiveness for CD-FSOD remains unclear due to the need for both visual realism and domain alignment. Existing strategies, such as copy-paste augmentation and text-to-image generation, often fail to preserve the correct object category or produce backgrounds coherent with the target domain, making them non-trivial to apply directly to CD-FSOD. To address these challenges, we propose Domain-RAG, a training-free, retrieval-guided compositional image generation framework tailored for CD-FSOD. Domain-RAG consists of three stages: domain-aware background retrieval, domain-guided background generation, and foreground-background composition. Specifically, the input image is first decomposed into foreground and background regions. We then retrieve semantically and stylistically similar images to guide a generative model in synthesizing a new background, conditioned on both the original and retrieved contexts. Finally, the preserved foreground is composed with the newly generated domain-aligned background to form the generated image. Without requiring any additional supervision or training, Domain-RAG produces high-quality, domain-consistent samples across diverse tasks, including CD-FSOD, remote sensing FSOD, and camouflaged FSOD. Extensive experiments show consistent improvements over strong baselines and establish new state-of-the-art results. Codes will be released upon acceptance. The source code and instructions are available at https://github.com/LiYu0524/Domain-RAG.

NeurIPS Conference 2025 Conference Paper

PPMStereo: Pick-and-Play Memory Construction for Consistent Dynamic Stereo Matching

  • WANG Yun
  • Junjie Hu
  • Qiaole Dong
  • Yongjian Zhang
  • Yanwei Fu
  • Tin Lun Lam
  • Dapeng Wu

Temporally consistent depth estimation from stereo video is critical for real-world applications such as augmented reality, where inconsistent depth estimation disrupts the immersion of users. Despite its importance, this task remains challenging due to the difficulty in modeling long-term temporal consistency in a computationally efficient manner. Previous methods attempt to address this by aggregating spatio-temporal information but face a fundamental trade-off: limited temporal modeling provides only modest gains, whereas capturing long-range dependencies significantly increases computational cost. To address this limitation, we introduce a memory buffer for modeling long-range spatio-temporal consistency while achieving efficient dynamic stereo matching. Inspired by the two-stage decision-making process in humans, we propose a Pick-and-Play Memory (PPM) construction module for dynamic stereo matching, dubbed PPMStereo. PPM consists of a pick process that identifies the most relevant frames and a play process that weights the selected frames adaptively for spatio-temporal aggregation. This two-stage collaborative process maintains a compact yet highly informative memory buffer while achieving temporally consistent information aggregation. Extensive experiments validate the effectiveness of PPMStereo, demonstrating state-of-the-art performance in both accuracy and temporal consistency. Codes are available at https://github.com/cocowy1/PPMStereo.
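
The pick-then-play pattern described above can be sketched in a few lines. This is only an illustration under stated assumptions: dot-product relevance, top-k selection, and softmax weighting are choices made here, and the function name `pick_and_play` is hypothetical. The abstract does not specify how PPMStereo scores or weights frames.

```python
import numpy as np


def pick_and_play(query, memory, k):
    """Aggregate a memory of per-frame features for one query feature.

    query:  (d,) feature of the current frame
    memory: (n, d) features of stored frames
    Returns the aggregated (d,) feature and the indices of picked frames.
    """
    # Relevance of every stored frame (assumed: plain dot product).
    scores = memory @ query

    # "Pick": keep only the k most relevant frames.
    picked = np.argsort(scores)[-k:]

    # "Play": weight the picked frames adaptively (assumed: softmax).
    logits = scores[picked] - scores[picked].max()  # for numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()

    return weights @ memory[picked], picked


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    aggregated, picked = pick_and_play(rng.normal(size=8), rng.normal(size=(16, 8)), k=4)
    print(picked, aggregated.shape)
```

Keeping only k frames bounds the aggregation cost regardless of how long the video is, which matches the compact-buffer motivation in the abstract.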

NeurIPS Conference 2025 Conference Paper

Towards Reliable and Holistic Visual In-Context Learning Prompt Selection

  • Wenxiao Wu
  • Jing-Hao Xue
  • Chengming Xu
  • Chen Liu
  • Xinwei Sun
  • Changxin Gao
  • Nong Sang
  • Yanwei Fu

Visual In-Context Learning (VICL) has emerged as a prominent approach for adapting visual foundation models to novel tasks, by effectively exploiting contextual information embedded in in-context examples, which can be formulated as a global ranking problem of potential candidates. Current VICL methods, such as Partial2Global and VPR, are grounded in the similarity-priority assumption that images more visually similar to a query image serve as better in-context examples. This foundational assumption, while intuitive, lacks sufficient justification for its efficacy in selecting optimal in-context examples. Furthermore, Partial2Global constructs its global ranking from a series of randomly sampled pairwise preference predictions. Such a reliance on random sampling can lead to incomplete coverage and redundant samplings of comparisons, thus further adversely impacting the final global ranking. To address these issues, this paper introduces an enhanced variant of Partial2Global designed for reliable and holistic selection of in-context examples in VICL. Our proposed method, dubbed RH-Partial2Global, leverages a jackknife conformal prediction-guided strategy to construct reliable alternative sets and a covering design-based sampling approach to ensure comprehensive and uniform coverage of pairwise preferences. Extensive experiments demonstrate that RH-Partial2Global achieves excellent performance and outperforms Partial2Global across diverse visual tasks.

NeurIPS Conference 2025 Conference Paper

TP-MDDN: Task-Preferenced Multi-Demand-Driven Navigation with Autonomous Decision-Making

  • Shanshan Li
  • Da Huang
  • Yu He
  • Yanwei Fu
  • Yu-Gang Jiang
  • Xiangyang Xue

In daily life, people often move through spaces to find objects that meet their needs, posing a key challenge in embodied AI. Traditional Demand-Driven Navigation (DDN) handles one need at a time but does not reflect the complexity of real-world tasks involving multiple needs and personal choices. To bridge this gap, we introduce Task-Preferenced Multi-Demand-Driven Navigation (TP-MDDN), a new benchmark for long-horizon navigation involving multiple sub-demands with explicit task preferences. To solve TP-MDDN, we propose AWMSystem, an autonomous decision-making system composed of three key modules: BreakLLM (instruction decomposition), LocateLLM (goal selection), and StatusMLLM (task monitoring). For spatial memory, we design MASMap, which combines 3D point cloud accumulation with 2D semantic mapping for accurate and efficient environmental understanding. Our Dual-Tempo action generation framework integrates zero-shot planning with policy-based fine control, and is further supported by an Adaptive Error Corrector that handles failure cases in real time. Experiments demonstrate that our approach outperforms state-of-the-art baselines in both perception accuracy and navigation robustness.

AAAI Conference 2024 Conference Paper

HybridGait: A Benchmark for Spatial-Temporal Cloth-Changing Gait Recognition with Hybrid Explorations

  • Yilan Dong
  • Chunlin Yu
  • Ruiyang Ha
  • Ye Shi
  • Yuexin Ma
  • Lan Xu
  • Yanwei Fu
  • Jingya Wang

Existing gait recognition benchmarks mostly include minor clothing variations in the laboratory environments, but lack persistent changes in appearance over time and space. In this paper, we propose the first in-the-wild benchmark CCGait for cloth-changing gait recognition, which incorporates diverse clothing changes, indoor and outdoor scenes, and multi-modal statistics over 92 days. To further address the coupling effect of clothing and viewpoint variations, we propose a hybrid approach HybridGait that exploits both temporal dynamics and the projected 2D information of 3D human meshes. Specifically, we introduce a Canonical Alignment Spatial-Temporal Transformer (CA-STT) module to encode human joint position-aware features, and fully exploit 3D dense priors via a Silhouette-guided Deformation with 3D-2D Appearance Projection (SilD) strategy. Our contributions are twofold: we provide a challenging benchmark CCGait that captures realistic appearance changes over expanded time and space, and we propose a hybrid framework HybridGait that outperforms prior works on CCGait and Gait3D benchmarks. Our project page is available at https://github.com/HCVLab/HybridGait.

TMLR Journal 2024 Journal Article

LEA: Learning Latent Embedding Alignment Model for fMRI Decoding and Encoding

  • Xuelin Qian
  • Yikai Wang
  • Xinwei Sun
  • Yanwei Fu
  • Xiangyang Xue
  • Jianfeng Feng

The connection between brain activity and visual stimuli is crucial to understanding the human brain. Although deep generative models have shown advances in recovering brain recordings by generating images conditioned on fMRI signals, it is still challenging to generate consistent semantics. Moreover, predicting fMRI signals from visual stimuli remains a hard problem. In this paper, we introduce a unified framework that addresses both fMRI decoding and encoding. We train two latent spaces to represent and reconstruct fMRI signals and visual images, respectively. By aligning these two latent spaces, we seamlessly transform between the fMRI signal and visual stimuli. Our model, called Latent Embedding Alignment (LEA), can recover visual stimuli from fMRI signals and predict brain activity from images. LEA outperforms existing methods on multiple fMRI decoding and encoding benchmarks. It offers a comprehensive solution for modeling the relationship between fMRI signals and visual stimuli. The codes are available at https://github.com/naiq/LEA.

NeurIPS Conference 2024 Conference Paper

MVInpainter: Learning Multi-View Consistent Inpainting to Bridge 2D and 3D Editing

  • Chenjie Cao
  • Chaohui Yu
  • Fan Wang
  • Xiangyang Xue
  • Yanwei Fu

Novel View Synthesis (NVS) and 3D generation have recently achieved prominent improvements. However, these works mainly focus on confined categories or synthetic 3D assets, which hinders generalization to challenging in-the-wild scenes and prevents direct use with 2D synthesis. Moreover, these methods depend heavily on camera poses, limiting their real-world applications. To overcome these issues, we propose MVInpainter, re-formulating 3D editing as a multi-view 2D inpainting task. Specifically, MVInpainter partially inpaints multi-view images with reference guidance rather than intractably generating an entirely novel view from scratch, which largely reduces the difficulty of in-the-wild NVS and leverages unmasked clues instead of explicit pose conditions. To ensure cross-view consistency, MVInpainter is enhanced by video priors from motion components and appearance guidance from concatenated reference key and value attention. Furthermore, MVInpainter incorporates slot attention to aggregate high-level optical flow features from unmasked regions to control the camera movement with pose-free training and inference. Extensive scene-level experiments on both object-centric and forward-facing datasets verify the effectiveness of MVInpainter across diverse tasks, such as multi-view object removal, synthesis, insertion, and replacement. The project page is https://ewrfcas.github.io/MVInpainter/.

TMLR Journal 2024 Journal Article

Repositioning the Subject within Image

  • Yikai Wang
  • Chenjie Cao
  • Ke Fan
  • Qiaole Dong
  • Yifan Li
  • Xiangyang Xue
  • Yanwei Fu

Current image manipulation primarily centers on static manipulation, such as replacing specific regions within an image or altering its overall style. In this paper, we introduce an innovative dynamic manipulation task, subject repositioning. This task involves relocating a user-specified subject to a desired position while preserving the image's fidelity. Our research reveals that the fundamental sub-tasks of subject repositioning, which include filling the void left by the repositioned subject, reconstructing obscured portions of the subject and blending the subject to be consistent with surrounding areas, can be effectively reformulated as a unified, prompt-guided inpainting task. Consequently, we can employ a single diffusion generative model to address these sub-tasks using various task prompts learned through our proposed task inversion technique. Additionally, we integrate pre-processing and post-processing techniques to further enhance the quality of subject repositioning. These elements together form our SEgment-gEnerate-and-bLEnd (SEELE) framework. To assess SEELE's effectiveness in subject repositioning, we assemble a real-world subject repositioning dataset called ReS. Results of SEELE on ReS demonstrate its efficacy. Code and ReS dataset are available at https://yikai-wang.github.io/seele/.

NeurIPS Conference 2024 Conference Paper

Towards Global Optimal Visual In-Context Learning Prompt Selection

  • Chengming Xu
  • Chen Liu
  • Yikai Wang
  • Yuan Yao
  • Yanwei Fu

Visual In-Context Learning (VICL) is a prevailing way to transfer visual foundation models to new tasks by leveraging contextual information contained in in-context examples to enhance learning and prediction of the query sample. The fundamental problem in VICL is how to select the best prompt to activate its power as much as possible, which is equivalent to a ranking problem: test the in-context behavior of each candidate in the alternative set and select the best one. To utilize a more appropriate ranking metric and leverage more comprehensive information among the alternative set, we propose a novel in-context example selection framework to approximately identify the global optimal prompt, i.e., choosing the best performing in-context examples from all alternatives for each query sample. Our method, dubbed Partial2Global, adopts a transformer-based list-wise ranker to provide a more comprehensive comparison within several alternatives, and a consistency-aware ranking aggregator to generate a globally consistent ranking. The effectiveness of Partial2Global is validated through experiments on foreground segmentation, single object detection and image colorization, demonstrating that Partial2Global selects consistently better in-context examples compared with other methods, and thus establishes a new state of the art.

NeurIPS Conference 2024 Conference Paper

Unified Lexical Representation for Interpretable Visual-Language Alignment

  • Yifan Li
  • Yikai Wang
  • Yanwei Fu
  • Dongyu Ru
  • Zheng Zhang
  • Tong He

Visual-Language Alignment (VLA) has gained a lot of attention since CLIP's groundbreaking work. Although CLIP performs well, the typical direct latent feature alignment lacks clarity in its representation and similarity scores. On the other hand, a lexical representation, a vector whose elements represent the similarity between the sample and a word from the vocabulary, is naturally sparse and interpretable, providing exact matches for individual words. However, lexical representations are difficult to learn due to the lack of ground-truth supervision and false-discovery issues, and thus require complex design to train effectively. In this paper, we introduce LexVLA, a more interpretable VLA framework that learns a unified lexical representation for both modalities without complex design. We use DINOv2 as our visual model for its local-inclined features and Llama 2, a generative language model, to leverage its in-context lexical prediction ability. To avoid false discovery, we propose an overuse penalty that keeps the lexical representation from frequently and falsely activating meaningless words. We demonstrate that these two pre-trained uni-modal models can be well-aligned by fine-tuning on a modest multi-modal dataset, avoiding intricate training configurations. On cross-modal retrieval benchmarks, LexVLA, trained on the CC-12M multi-modal dataset, outperforms baselines fine-tuned on larger datasets (e.g., YFCC15M) and those trained from scratch on even bigger datasets (e.g., 1.1B data, including CC-12M). We conduct extensive experiments to analyze LexVLA. Codes are available at https://github.com/Clementine24/LexVLA.

AAAI Conference 2023 Conference Paper

RankDNN: Learning to Rank for Few-Shot Learning

  • Qianyu Guo
  • Gong Haotong
  • Xujun Wei
  • Yanwei Fu
  • Yizhou Yu
  • Wenqiang Zhang
  • Weifeng Ge

This paper introduces a new few-shot learning pipeline that casts relevance ranking for image retrieval as binary ranking relation classification. In comparison to image classification, ranking relation classification is sample efficient and domain agnostic. Besides, it provides a new perspective on few-shot learning and is complementary to state-of-the-art methods. The core component of our deep neural network is a simple MLP, which takes as input an image triplet encoded as the difference between two vector-Kronecker products, and outputs a binary relevance ranking order. The proposed RankMLP can be built on top of any state-of-the-art feature extractors, and our entire deep neural network is called the ranking deep neural network, or RankDNN. Meanwhile, RankDNN can be flexibly fused with other post-processing methods. During the meta test, RankDNN ranks support images according to their similarity with the query samples, and each query sample is assigned the class label of its nearest neighbor. Experiments demonstrate that RankDNN can effectively improve the performance of its baselines based on a variety of backbones and it outperforms previous state-of-the-art algorithms on multiple few-shot learning benchmarks, including miniImageNet, tieredImageNet, Caltech-UCSD Birds, and CIFAR-FS. Furthermore, experiments on the cross-domain challenge demonstrate the superior transferability of RankDNN. The code is available at: https://github.com/guoqianyu-alberta/RankDNN.
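
One plausible reading of the triplet encoding is sketched below: the query feature is paired with each of the two candidate features through a Kronecker product, and the two products are subtracted. The pairing and the function name `triplet_encoding` are assumptions made for illustration; the abstract does not state which vectors enter each product.

```python
import numpy as np


def triplet_encoding(query, first, second):
    """Encode (query, first, second) as a difference of Kronecker products.

    Assumed pairing: kron(query, first) - kron(query, second).
    The result has length len(query) * len(first) and would be the
    input to a binary classifier deciding which candidate ranks higher.
    """
    return np.kron(query, first) - np.kron(query, second)


if __name__ == "__main__":
    q = np.array([1.0, 2.0])
    a = np.array([3.0, 4.0])
    b = np.array([1.0, 1.0])
    print(triplet_encoding(q, a, b))
```

Under this pairing the encoding is antisymmetric: swapping the two candidates flips the sign of the vector, which fits a binary "which one ranks higher" label. By bilinearity of the Kronecker product it also equals `np.kron(query, first - second)`.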

TIST Journal 2023 Journal Article

Recent Few-shot Object Detection Algorithms: A Survey with Performance Comparison

  • Tianying Liu
  • Lu Zhang
  • Yang Wang
  • Jihong Guan
  • Yanwei Fu
  • Jiajia Zhao
  • Shuigeng Zhou

The generic object detection (GOD) task has been successfully tackled by recent deep neural networks, trained by an avalanche of annotated training samples from some common classes. However, it is still non-trivial to generalize these object detectors to the novel long-tailed object classes, which have only a few labeled training samples. To this end, Few-Shot Object Detection (FSOD) has been topical recently, as it mimics the human ability of learning to learn and intelligently transfers the learned generic object knowledge from the common heavy-tailed to the novel long-tailed object classes. In particular, research in this emerging field has been flourishing in recent years with various benchmarks, backbones, and methodologies proposed. To review these FSOD works, there are several insightful FSOD survey articles [58, 59, 74, 78] that systematically study and compare them as the groups of fine-tuning/transfer learning and meta-learning methods. In contrast, we review the existing FSOD algorithms from a new perspective under a new taxonomy based on their contributions, i.e., data-oriented, model-oriented, and algorithm-oriented. Thus, a comprehensive survey with performance comparison is conducted on recent achievements of FSOD. Furthermore, we also analyze the technical challenges, the merits and demerits of these methods, and envision the future directions of FSOD. Specifically, we give an overview of FSOD, including the problem definition, common datasets, and evaluation protocols. The taxonomy is then proposed that groups FSOD methods into three types. Following this taxonomy, we provide a systematic review of the advances in FSOD. Finally, further discussions on performance, challenges, and future directions are presented.

TMLR Journal 2023 Journal Article

Worst-case Feature Risk Minimization for Data-Efficient Learning

  • Jingshi Lei
  • Da Li
  • Chengming Xu
  • Liming Fang
  • Timothy Hospedales
  • Yanwei Fu

Deep learning models typically require massive amounts of annotated data to train a strong model for a task of interest. However, data annotation is time-consuming and costly. How to use labeled data from a related but distinct domain, or just a few samples to train a satisfactory model are thus important questions. To achieve this goal, models should resist overfitting to the specifics of the training data in order to generalize well to new data. This paper proposes a novel Worst-case Feature Risk Minimization (WFRM) method that helps improve model generalization. Specifically, we tackle a minimax optimization problem in feature space at each training iteration. Given the input features, we seek the feature perturbation that maximizes the current training loss and then minimizes the training loss of the worst-case features. By incorporating our WFRM during training, we significantly improve model generalization under distributional shift – Domain Generalization (DG) and in the low-data regime – Few-shot Learning (FSL). We theoretically analyze WFRM and find the key reason why it works better than ERM – it induces an empirical risk-based semi-adaptive $L_{2}$ regularization of the classifier weights, enabling a better risk-complexity trade-off. We evaluate WFRM on two data-efficient learning tasks, including three standard DG benchmarks of PACS, VLCS, OfficeHome and the most challenging FSL benchmark Meta-Dataset. Despite the simplicity, our method consistently improves various DG and FSL methods, leading to the new state-of-the-art performances in all settings. Codes & models will be released at https://github.com/jslei/WFRM.
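
The minimax problem described above can be written schematically as follows. This is a sketch under assumptions, not the paper's exact objective: the symbols (features $z_i$, classifier $h_\theta$, loss $\ell$) and in particular the norm-ball constraint with radius $\epsilon$ on the perturbation are introduced here for illustration.

```latex
\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \;
\max_{\|\delta_i\|_2 \le \epsilon} \;
\ell\big( h_{\theta}(z_i + \delta_i),\, y_i \big)
```

The inner maximization finds the worst-case feature perturbation $\delta_i$ for the current parameters, and the outer minimization trains on those perturbed features, matching the two steps named in the abstract.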

TMLR Journal 2022 Journal Article

Exploring Efficient Few-shot Adaptation for Vision Transformers

  • Chengming Xu
  • Siqian Yang
  • Yabiao Wang
  • Zhanxiong Wang
  • Yanwei Fu
  • Xiangyang Xue

The task of Few-shot Learning (FSL) aims to do inference on novel categories containing only a few labeled examples, with the help of knowledge learned from base categories containing abundant labeled training samples. While there are numerous works on the FSL task, Vision Transformers (ViTs) have rarely been taken as the backbone for FSL, with the few attempts focusing on naive finetuning of the whole backbone or the classification layer. Essentially, although ViTs have been shown to enjoy comparable or even better performance on other vision tasks, it is still very nontrivial to efficiently finetune ViTs in real-world FSL scenarios. To this end, we propose a novel efficient Transformer Tuning (eTT) method that facilitates finetuning ViTs in FSL tasks. The key novelties come from the newly presented Attentive Prefix Tuning (APT) and Domain Residual Adapter (DRA) for the task and backbone finetuning, respectively. Specifically, in APT, the prefix is projected to new key and value pairs that are attached to each self-attention layer to provide the model with task-specific information. Moreover, we design the DRA in the form of learnable offset vectors to handle the potential domain gaps between base and novel data. To ensure the APT does not deviate much from the initial task-specific information, we further propose a novel prototypical regularization, which minimizes the similarity between the projected distribution of prefix and initial prototypes, regularizing the update procedure. Our method achieves outstanding performance on the challenging Meta-Dataset. We conduct extensive experiments to show the efficacy of our model. Our model and codes will be released.

TMLR Journal 2022 Journal Article

MVSFormer: Multi-View Stereo by Learning Robust Image Features and Temperature-based Depth

  • Chenjie Cao
  • Xinlin Ren
  • Yanwei Fu

Feature representation learning is the key recipe for learning-based Multi-View Stereo (MVS). As the common feature extractor of learning-based MVS, vanilla Feature Pyramid Networks (FPNs) suffer from discouraged feature representations for reflection and texture-less areas, which limits the generalization of MVS. Even FPNs worked with pre-trained Convolutional Neural Networks (CNNs) fail to tackle these issues. On the other hand, Vision Transformers (ViTs) have achieved prominent success in many 2D vision tasks. Thus we ask whether ViTs can facilitate feature learning in MVS? In this paper, we propose a pre-trained ViT enhanced MVS network called MVSFormer, which can learn more reliable feature representations benefited by informative priors from ViT. The finetuned MVSFormer with hierarchical ViTs of efficient attention mechanisms can achieve prominent improvement based on FPNs. Besides, the alternative MVSFormer with frozen ViT weights is further proposed. This largely alleviates the training cost with competitive performance strengthened by the attention map from the self-distillation pre-training. MVSFormer can be generalized to various input resolutions with efficient multi-scale training strengthened by gradient accumulation. Moreover, we discuss the merits and drawbacks of classification and regression-based MVS methods, and further propose to unify them with a temperature-based strategy. MVSFormer achieves state-of-the-art performance on the DTU dataset. Particularly, MVSFormer ranks as Top-1 on both intermediate and advanced sets of the highly competitive Tanks-and-Temples leaderboard. Codes and models are released in https://github.com/ewrfcas/MVSFormer.

NeurIPS Conference 2022 Conference Paper

Self-supervised Amodal Video Object Segmentation

  • Jian Yao
  • Yuxin Hong
  • Chiyu Wang
  • Tianjun Xiao
  • Tong He
  • Francesco Locatello
  • David P Wipf
  • Yanwei Fu

Amodal perception requires inferring the full shape of an object that is partially occluded. This task is particularly challenging on two levels: (1) it requires more information than what is contained in the instant retina or imaging sensor, (2) it is difficult to obtain enough well-annotated amodal labels for supervision. To this end, this paper develops a new framework of Self-supervised amodal Video object segmentation (SaVos). Our method efficiently leverages the visual information of video temporal sequences to infer the amodal mask of objects. The key intuition is that the occluded part of an object can be explained away if that part is visible in other frames, possibly deformed as long as the deformation can be reasonably learned. Accordingly, we derive a novel self-supervised learning paradigm that efficiently utilizes the visible object parts as the supervision to guide the training on videos. In addition to learning type prior to complete masks for known types, SaVos also learns the spatiotemporal prior, which is also useful for the amodal task and could generalize to unseen types. The proposed framework achieves the state-of-the-art performance on the synthetic amodal segmentation benchmark FISHBOWL and the real world benchmark KINS-Video-Car. Further, it lends itself well to being transferred to novel distributions using test-time adaptation, outperforming existing models even after the transfer to a new distribution.

AAAI Conference 2021 Conference Paper

Learning a Few-shot Embedding Model with Contrastive Learning

  • Chen Liu
  • Yanwei Fu
  • Chengming Xu
  • Siqian Yang
  • Jilin Li
  • Chengjie Wang
  • Li Zhang

Few-shot learning (FSL) aims to recognize target classes by adapting the prior knowledge learned from source classes. Such knowledge usually resides in a deep embedding model for a general matching purpose of the support and query image pairs. The objective of this paper is to repurpose contrastive learning for such matching to learn a few-shot embedding model. We make the following contributions: (i) We investigate contrastive learning with Noise Contrastive Estimation (NCE) in a supervised manner for training a few-shot embedding model; (ii) We propose a novel contrastive training scheme dubbed infoPatch, exploiting the patch-wise relationship to substantially improve the popular infoNCE; (iii) We show that the embedding learned by the proposed infoPatch is more effective; (iv) Our model is thoroughly evaluated on the few-shot recognition task, and demonstrates state-of-the-art results on miniImageNet and appealing performance on tieredImageNet and Fewshot-CIFAR100 (FC-100).
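For readers unfamiliar with the infoNCE objective that the abstract above builds on, here is a minimal sketch of the standard loss for one query, one positive, and a set of negatives. It is the generic formulation, not the infoPatch scheme itself; dot-product similarity, the temperature value, and the toy embeddings are assumptions chosen for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive, negatives, temperature=0.1):
    # Logit 0 belongs to the positive pair; the rest belong to negatives.
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    # Stable log-sum-exp for the denominator of the softmax.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_denominator - logits[0]


# A query aligned with its positive gives a small loss ...
loss_aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# ... while a query closer to a negative gives a large one.
loss_misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In the supervised setting the abstract mentions, positives would be drawn from the same class as the query and negatives from other classes.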

IJCAI Conference 2021 Conference Paper

Regularising Knowledge Transfer by Meta Functional Learning

  • Pan Li
  • Yanwei Fu
  • Shaogang Gong

Machine learning classifiers’ capability is largely dependent on the scale of available training data and limited by the model overfitting in data-scarce learning tasks. To address this problem, this work proposes a novel Meta Functional Learning (MFL) by meta-learning a generalisable functional model from data-rich tasks whilst simultaneously regularising knowledge transfer to data-scarce tasks. The MFL computes meta-knowledge on functional regularisation generalisable to different learning tasks by which functional training on limited labelled data promotes more discriminative functions to be learned. Moreover, we adopt an Iterative Update strategy on MFL (MFL-IU). This improves knowledge transfer regularisation from MFL by progressively learning the functional regularisation in knowledge transfer. Experiments on three Few-Shot Learning (FSL) benchmarks (miniImageNet, CIFAR-FS and CUB) show that meta functional learning for regularisation knowledge transfer can benefit improving FSL classifiers.

NeurIPS Conference 2021 Conference Paper

The Image Local Autoregressive Transformer

  • Chenjie Cao
  • Yuxin Hong
  • Xiang Li
  • Chengrong Wang
  • Chengming Xu
  • Yanwei Fu
  • Xiangyang Xue

Recently, AutoRegressive (AR) models for the whole image generation empowered by transformers have achieved comparable or even better performance compared to Generative Adversarial Networks (GANs). Unfortunately, directly applying such AR models to edit/change local image regions, may suffer from the problems of missing global information, slow inference speed, and information leakage of local guidance. To address these limitations, we propose a novel model -- image Local Autoregressive Transformer (iLAT), to better facilitate the locally guided image synthesis. Our iLAT learns the novel local discrete representations, by the newly proposed local autoregressive (LA) transformer of the attention mask and convolution mechanism. Thus iLAT can efficiently synthesize the local image regions by key guidance information. Our iLAT is evaluated on various locally guided image syntheses, such as pose-guided person image synthesis and face editing. Both quantitative and qualitative results show the efficacy of our model.

AAAI Conference 2020 Conference Paper

Feature Deformation Meta-Networks in Image Captioning of Novel Objects

  • Tingjia Cao
  • Ke Han
  • Xiaomei Wang
  • Lin Ma
  • Yanwei Fu
  • Yu-Gang Jiang
  • Xiangyang Xue

This paper studies the task of image captioning with novel objects, which only exist in testing images. Intrinsically, this task can reflect the generalization ability of models in understanding and captioning the semantic meanings of visual concepts and objects unseen in the training set, sharing similarity with one/zero-shot learning. The critical difficulty thus comes from the fact that no paired images and sentences of the novel objects can be used to help train the captioning model. Inspired by recent work (Chen et al. 2019b) that boosts one-shot learning by learning to generate various image deformations, we propose learning meta-networks for deforming features for novel object captioning. To this end, we introduce the feature deformation meta-networks (FDM-net), which is trained on source data and learns to adapt to the novel object features detected by the auxiliary detection model. FDM-net includes two sub-nets: feature deformation, and scene graph sentence reconstruction, which produce the augmented image features and corresponding sentences, respectively. Thus, rather than directly deforming images, FDM-net can efficiently and dynamically enlarge the paired images and texts by learning to deform image features. Extensive experiments are conducted on the widely used novel object captioning dataset, and the results show the effectiveness of our FDM-net. Ablation study and qualitative visualization further give insights into our model.

JBHI Journal 2020 Journal Article

M$^3$Lung-Sys: A Deep Learning System for Multi-Class Lung Pneumonia Screening From CT Imaging

  • Xuelin Qian
  • Huazhu Fu
  • Weiya Shi
  • Tao Chen
  • Yanwei Fu
  • Fei Shan
  • Xiangyang Xue

To counter the outbreak of COVID-19, the accurate diagnosis of suspected cases plays a crucial role in timely quarantine, medical treatment, and preventing the spread of the pandemic. Considering the limited training cases and resources (e.g., time and budget), we propose a Multi-task Multi-slice Deep Learning System (M$^3$Lung-Sys) for multi-class lung pneumonia screening from CT imaging, which only consists of two 2D CNN networks, i.e., slice- and patient-level classification networks. The former aims to seek the feature representations from abundant CT slices instead of limited CT volumes, and for the overall pneumonia screening, the latter could recover the temporal information by feature refinement and aggregation between different slices. In addition to distinguishing COVID-19 from Healthy, H1N1, and CAP cases, our M$^3$Lung-Sys is also able to locate the areas of relevant lesions, without any pixel-level annotation. To further demonstrate the effectiveness of our model, we conduct extensive experiments on a chest CT imaging dataset with a total of 734 patients (251 healthy people, 245 COVID-19 patients, 105 H1N1 patients, and 133 CAP patients). The quantitative results with plenty of metrics indicate the superiority of our proposed model on both slice- and patient-level classification tasks. More importantly, the generated lesion location maps make our system interpretable and more valuable to clinicians.

AAAI Conference 2019 Conference Paper

Image Block Augmentation for One-Shot Learning

  • Zitian Chen
  • Yanwei Fu
  • Kaiyu Chen
  • Yu-Gang Jiang

Given one or a few training instances of novel classes, the one-shot learning task requires that the classifier generalizes to these novel classes. Directly training a one-shot classifier may suffer from insufficient training instances in one-shot learning. Previous one-shot learning works investigate meta-learning or metric-based algorithms; in contrast, this paper proposes a Self-Training Jigsaw Augmentation (Self-Jig) method for one-shot learning. Particularly, we solve one-shot learning by directly augmenting the training images through leveraging the vast unlabeled instances. Precisely, our proposed Self-Jig algorithm can synthesize new images from the labeled probe and unlabeled gallery images. The labels of gallery images are predicted to help the augmentation process, which can be taken as a self-training scheme. Intrinsically, we argue that we provide a very useful way of directly generating massive amounts of training images for novel classes. Extensive experiments and an ablation study not only evaluate the efficacy but also reveal the insights of the proposed Self-Jig method.

NeurIPS Conference 2019 Conference Paper

Meta-Reinforced Synthetic Data for One-Shot Fine-Grained Visual Recognition

  • Satoshi Tsutsui
  • Yanwei Fu
  • David Crandall

This paper studies the task of one-shot fine-grained recognition, which suffers from the problem of data scarcity of novel fine-grained classes. To alleviate this problem, an off-the-shelf image generator can be applied to synthesize additional images to help one-shot learning. However, such synthesized images may not be helpful in one-shot fine-grained recognition, due to a large domain discrepancy between synthesized and original images. To this end, this paper proposes a meta-learning framework to reinforce the generated images by original images so that these images can facilitate one-shot learning. Specifically, the generic image generator is updated by a few training instances of novel classes, and a Meta Image Reinforcing Network (MetaIRNet) is proposed to conduct one-shot fine-grained recognition as well as image reinforcement. The model is trained in an end-to-end manner, and our experiments demonstrate consistent improvement over the baseline on one-shot fine-grained image classification benchmarks.

IJCAI Conference 2018 Conference Paper

Harnessing Synthesized Abstraction Images to Improve Facial Attribute Recognition

  • Keke He
  • Yanwei Fu
  • Wuhao Zhang
  • Chengjie Wang
  • Yu-Gang Jiang
  • Feiyue Huang
  • Xiangyang Xue

Facial attribute recognition is an important and yet challenging research topic. Different from most previous approaches which predict attributes only based on the whole images, this paper leverages facial parts locations for better attribute prediction. A facial abstraction image which contains both local facial parts and facial texture information is introduced. This abstraction image is generated by a Generative Adversarial Network (GAN). Then we build a dual-path facial attribute recognition network to utilize features from the original face images and facial abstraction images. Empirically, the features of facial abstraction images are complementary to features of original face images. With the facial parts localized by the abstraction images, our method improves facial attributes recognition, especially the attributes located on small face regions. Extensive evaluations conducted on CelebA and LFWA benchmark datasets show that state-of-the-art performance is achieved.

IJCAI Conference 2018 Conference Paper

Soft Filter Pruning for Accelerating Deep Convolutional Neural Networks

  • Yang He
  • Guoliang Kang
  • Xuanyi Dong
  • Yanwei Fu
  • Yi Yang

This paper proposed a Soft Filter Pruning (SFP) method to accelerate the inference procedure of deep Convolutional Neural Networks (CNNs). Specifically, the proposed SFP enables the pruned filters to be updated when training the model after pruning. SFP has two advantages over previous works: (1) Larger model capacity. Updating previously pruned filters provides our approach with larger optimization space than fixing the filters to zero. Therefore, the network trained by our method has a larger model capacity to learn from the training data. (2) Less dependence on the pretrained model. Large capacity enables SFP to train from scratch and prune the model simultaneously. In contrast, previous filter pruning methods should be conducted on the basis of the pre-trained model to guarantee their performance. Empirically, SFP from scratch outperforms the previous filter pruning methods. Moreover, our approach has been demonstrated effective for many advanced CNN architectures. Notably, on ILSVRC-2012, SFP reduces more than 42% FLOPs on ResNet-101 with even 0.2% top-5 accuracy improvement, which has advanced the state-of-the-art. Code is publicly available on GitHub: https://github.com/he-y/softfilter-pruning
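The core idea in the abstract above, that pruned filters are zeroed but remain trainable, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the released code: ranking filters by L2 norm is an assumption (the abstract does not name the selection criterion), and `soft_prune`, `sgd_step`, and the example weights and gradients are hypothetical.

```python
import math


def l2_norm(filt):
    return math.sqrt(sum(w * w for w in filt))


def soft_prune(filters, prune_rate):
    # Zero the filters with the smallest L2 norm (assumed criterion),
    # but keep them in the layer so later updates can change them again.
    num_pruned = int(len(filters) * prune_rate)
    ranked = sorted(range(len(filters)), key=lambda i: l2_norm(filters[i]))
    pruned = set(ranked[:num_pruned])
    zeroed = [[0.0] * len(f) if i in pruned else list(f)
              for i, f in enumerate(filters)]
    return zeroed, pruned


def sgd_step(filters, grads, lr):
    # Plain gradient step applied to every filter, including zeroed ones.
    return [[w - lr * g for w, g in zip(f, gf)]
            for f, gf in zip(filters, grads)]


# Four toy filters with two weights each.
filters = [[0.1, -0.1], [1.0, 2.0], [0.5, 0.5], [3.0, 0.0]]
zeroed, pruned = soft_prune(filters, prune_rate=0.5)

# Made-up gradients: the zeroed filters receive updates like any other.
grads = [[1.0, 1.0] for _ in filters]
updated = sgd_step(zeroed, grads, lr=0.1)
```

A hard-pruning variant would instead remove or freeze the filters in `pruned`, so they could never recover. In the soft variant, the prune step would be repeated during training (for example after each epoch, which is an assumption here), with the final selection applied at the end.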

NeurIPS Conference 2018 Conference Paper

Stacked Semantics-Guided Attention Model for Fine-Grained Zero-Shot Learning

  • Yunlong Yu
  • Zhong Ji
  • Yanwei Fu
  • Jichang Guo
  • Yanwei Pang
  • Zhongfei (Mark) Zhang

Zero-Shot Learning (ZSL) is generally achieved via aligning the semantic relationships between the visual features and the corresponding class semantic descriptions. However, using the global features to represent fine-grained images may lead to sub-optimal results since they neglect the discriminative differences of local regions. Besides, different regions contain distinct discriminative information. The important regions should contribute more to the prediction. To this end, we propose a novel stacked semantics-guided attention (S2GA) model to obtain semantic relevant features by using individual class semantic features to progressively guide the visual features to generate an attention map for weighting the importance of different local regions. Feeding both the integrated visual features and the class semantic features into a multi-class classification architecture, the proposed framework can be trained end-to-end. Extensive experimental results on CUB and NABird datasets show that the proposed approach has a consistent improvement on both fine-grained zero-shot classification and retrieval tasks.

AAAI Conference 2016 Conference Paper

Learning to Generate Posters of Scientific Papers

  • Yuting Qiang
  • Yanwei Fu
  • Yanwen Guo
  • Zhi-Hua Zhou
  • Leonid Sigal

Researchers often summarize their work in the form of posters. Posters provide a coherent and efficient way to convey core ideas from scientific papers. Generating a good scientific poster, however, is a complex and time consuming cognitive task, since such posters need to be readable, informative, and visually aesthetic. In this paper, for the first time, we study the challenging problem of learning to generate posters from scientific papers. To this end, a data-driven framework, that utilizes graphical models, is proposed. Specifically, given content to display, the key elements of a good poster, including panel layout and attributes of each panel, are learned and inferred from data. Then, given inferred layout and attributes, composition of graphical elements within each panel is synthesized. To learn and validate our model, we collect and make public a Poster-Paper dataset, which consists of scientific papers and corresponding posters with exhaustively labelled panels and attributes. Qualitative and quantitative results indicate the effectiveness of our approach.