Arrow Research search

Author name cluster

Fei Yu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

24 papers
2 author rows

Possible papers

AAAI Conference 2026 Conference Paper

M3SR: Multi-Scale Multi-Perceptual Mamba for Efficient Spectral Reconstruction

  • Yuze Zhang
  • Lingjie Li
  • Qiuzhen Lin
  • Zhong Ming
  • Fei Yu
  • Victor C. M. Leung

The Mamba architecture has been widely applied to various low-level vision tasks due to its exceptional adaptability and strong performance. Although the Mamba architecture has been adopted for spectral reconstruction, it still faces the following two challenges: (1) Single spatial perception limits the ability to fully understand and analyze hyperspectral images; (2) Single-scale feature extraction struggles to capture the complex structures and fine details present in hyperspectral images. To address these issues, we propose a multi-scale, multi-perceptual Mamba architecture for the spectral reconstruction task, called M3SR. Specifically, we design a multi-perceptual fusion block to enhance the ability of the model to comprehensively understand and analyze the input features. By integrating the multi-perceptual fusion block into a U-Net structure, M3SR can effectively extract and fuse global, intermediate, and local features, thereby enabling accurate reconstruction of hyperspectral images at multiple scales. Extensive quantitative and qualitative experiments demonstrate that the proposed M3SR outperforms existing state-of-the-art methods while incurring a lower computational cost.

IROS Conference 2025 Conference Paper

A Two-Stage Lightweight Framework for Efficient Land-Air Bimodal Robot Autonomous Navigation

  • Yongjie Li
  • Zhou Liu
  • Wenshuai Yu
  • Zhangji Lu
  • Chenyang Wang
  • Fei Yu
  • Qingquan Li

Land-air bimodal robots (LABR) are gaining attention for autonomous navigation, combining the high mobility of aerial vehicles with the long endurance of ground vehicles. However, existing LABR navigation methods are limited by suboptimal trajectories from mapping-based approaches and the excessive computational demands of learning-based methods. To address this, we propose a two-stage lightweight framework that integrates global keypoint prediction with local trajectory refinement to generate efficient and reachable trajectories. In the first stage, a Global Keypoint Prediction Network (GKPN) generates a hybrid land-air keypoint path. The GKPN includes a Sobel Perception Network (SPN) for improved obstacle detection and a Lightweight Attention Planning Network (LAPN) that improves predictive ability by capturing contextual information. In the second stage, the global path is segmented based on the predicted keypoints and refined using a mapping-based planner to create smooth, collision-free trajectories. Experiments conducted on our LABR platform show that our framework reduces network parameters by 14% and energy consumption during land-air transitions by 35% compared to existing approaches. The framework achieves real-time navigation without GPU acceleration and enables zero-shot transfer from simulation to reality during deployment.

AAAI Conference 2025 Conference Paper

CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation

  • Yuxuan Wang
  • Yijun Liu
  • Fei Yu
  • Chen Huang
  • Kexin Li
  • Zhiguo Wan
  • Wanxiang Che
  • Hongyang Chen

Despite the rapid development of Chinese vision-language models (VLMs), most existing Chinese vision-language (VL) datasets are constructed on Western-centric images from existing English VL datasets. The cultural bias in the images makes these datasets unsuitable for evaluating VLMs in the context of Chinese culture. To remedy this issue, we present a new Chinese Vision-Language Understanding Evaluation (CVLUE) benchmark dataset, where the selection of object categories and images is entirely driven by Chinese native speakers, ensuring that the source images are representative of Chinese culture. The benchmark covers four distinct VL tasks: image-text retrieval, visual question answering, visual grounding, and visual dialogue. We present a detailed statistical analysis of CVLUE and provide a baseline performance analysis with several open-source multilingual VLMs on CVLUE and its English counterparts to reveal the performance gap between English and Chinese. Our in-depth category-level analysis reveals a lack of Chinese cultural knowledge in existing VLMs. We also find that fine-tuning on Chinese culture-related VL datasets effectively enhances VLMs' understanding of Chinese culture.

IROS Conference 2025 Conference Paper

JAM: Keypoint-Guided Joint Prediction after Classification-Aware Marginal Proposal for Multi-Agent Interaction

  • Fangze Lin
  • Ying He
  • Fei Yu
  • Hong Zhang

Predicting the future motion of road participants is a critical task in autonomous driving. In this work, we address the challenge of low-quality generation of low-probability modes in multi-agent joint prediction. To tackle this issue, we propose a two-stage multi-agent interactive prediction framework named keypoint-guided joint prediction after classification-aware marginal proposal (JAM). The first stage is modeled as a marginal prediction process, which classifies queries by trajectory type to encourage the model to learn all categories of trajectories, providing comprehensive mode information for the joint prediction module. The second stage is modeled as a joint prediction process, which takes the scene context and the marginal proposals from the first stage as inputs to learn the final joint distribution. We explicitly introduce key waypoints to guide the joint prediction module in better capturing and leveraging the critical information from the initial predicted trajectories. We conduct extensive experiments on the real-world Waymo Open Motion Dataset interactive prediction benchmark. The results show that our approach achieves competitive performance. In particular, in the framework comparison experiments, the proposed JAM outperforms other prediction frameworks and achieves state-of-the-art performance in interactive trajectory prediction. The code is available at https://github.com/LinFunster/JAM to facilitate future research.

IROS Conference 2025 Conference Paper

MobiExo: GPS-SLAM Fusion for Seamless Indoor-Outdoor Mobile Manipulation with Hand-Foot Coordination

  • Jianpeng Wang
  • Zhen Tian
  • Wenlong Chen
  • Dian Yuan
  • Zhou Zhou
  • Ming Cen
  • Xia Hua
  • Fei Yu

Teleoperation systems for mobile robots face significant challenges in achieving seamless coordination across dynamic environments. We present MobiExo, a teleoperation system that unlocks seamless indoor-outdoor mobile manipulation. Our approach tackles two fundamental challenges: robust cross-environment localization and intuitive full-body control. A novel self-adaptive federated filter unifies GPS and SLAM, delivering continuous centimeter-level positioning (4.5±0.8 cm indoor, 6.8±1.2 cm outdoor) and eliminating transition errors. Simultaneously, an integrated hand-foot coordination framework translates the operator’s natural gait and gestures into fluid robot actions, maintaining remarkable millimeter-level end-effector precision (3.5±0.4 mm) during navigation. Extensive field trials validate our design, demonstrating high task success (96.7% indoor, 94.3% outdoor) and a 5.9× efficiency improvement in multi-location tasks over stationary setups. Code is available at: https://github.com/wangjianpeng200/MobiExo.git
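
The abstract does not detail the filter's internals, but the continuous indoor-outdoor hand-off can be illustrated with a minimal inverse-variance fusion sketch in Python; all names and variance figures below are hypothetical, and MobiExo's self-adaptive federated filter is considerably more elaborate.

```python
# Minimal inverse-variance fusion of GPS and SLAM position estimates.
# This only illustrates the continuous hand-off idea; it is not MobiExo's
# actual algorithm, and all values below are hypothetical.
import numpy as np

def fuse_position(gps_xy, gps_var, slam_xy, slam_var):
    """Weight each 2D estimate by its inverse variance, so the more
    confident source dominates and transitions stay smooth."""
    w_gps, w_slam = 1.0 / gps_var, 1.0 / slam_var
    return (w_gps * np.asarray(gps_xy) + w_slam * np.asarray(slam_xy)) / (w_gps + w_slam)

# Indoors GPS degrades (large variance), so the fused estimate tracks SLAM:
indoor = fuse_position([10.2, 5.1], 25.0, [10.05, 5.02], 0.002)
# Outdoors GPS is reliable again, and the weights shift back automatically:
outdoor = fuse_position([103.4, 48.9], 0.04, [103.9, 49.3], 0.5)
print(indoor, outdoor)
```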

NeurIPS Conference 2025 Conference Paper

QFFT, Question-Free Fine-Tuning for Adaptive Reasoning

  • Wanlong Liu
  • Junxiao Xu
  • Fei Yu
  • Yukang Lin
  • Ke Ji
  • Wenyu Chen
  • Lifeng Shang
  • Yasheng Wang

Recent advancements in Long Chain-of-Thought (CoT) reasoning models have improved performance on complex tasks, but they suffer from overthinking, generating redundant reasoning steps, especially for simple questions. This paper revisits the reasoning patterns of Long and Short CoT models, observing that Short CoT patterns offer concise and efficient reasoning, while Long CoT patterns excel in challenging scenarios where Short CoT patterns struggle. To enable models to leverage both patterns, we propose Question-Free Fine-Tuning (QFFT), a fine-tuning approach that removes the input question during training and learns exclusively from Long CoT responses. This approach enables the model to adaptively employ both reasoning patterns: it prioritizes Short CoT patterns and activates Long CoT patterns only when necessary. Experiments on various mathematical datasets demonstrate that QFFT reduces average response length by more than 50%, while achieving performance comparable to Supervised Fine-Tuning (SFT). Additionally, QFFT exhibits superior performance compared to SFT in noisy, out-of-domain, and low-resource scenarios.
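
A minimal sketch of the data-construction idea, assuming a simple prompt/completion format (the paper's exact training template is not given in this abstract):

```python
# Hypothetical sketch of QFFT-style data construction: standard SFT would
# train on (question -> response); here the question is removed and the
# loss is computed only on the Long CoT response. Field names are assumed.
def build_qfft_example(long_cot_response: str) -> dict:
    return {
        "prompt": "",                     # question dropped during training
        "completion": long_cot_response,  # supervise the reasoning trace alone
    }

response = "<think>17*24 = 17*20 + 17*4 = 340 + 68 = 408</think>\nThe answer is 408."
print(build_qfft_example(response))
```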

NeurIPS Conference 2025 Conference Paper

ReDit: Reward Dithering for Improved LLM Policy Optimization

  • Chenxing Wei
  • Jiarui Yu
  • Ying He
  • Hande Dong
  • Yao Shu
  • Fei Yu

DeepSeek-R1 has successfully enhanced Large Language Model (LLM) reasoning capabilities through its rule-based reward system. While such a "perfect" reward system effectively mitigates reward hacking, the resulting reward functions are often discrete. Our experimental observations suggest that discrete rewards can lead to gradient anomalies, unstable optimization, and slow convergence. To address this issue, we propose ReDit (Reward Dithering), a method that dithers the discrete reward signal by adding simple random noise. With this perturbed reward, exploratory gradients are continuously provided throughout the learning process, enabling smoother gradient updates and accelerating convergence. The injected noise also introduces stochasticity into flat reward regions, encouraging the model to explore novel policies and escape local optima. Experiments across diverse tasks demonstrate the effectiveness and efficiency of ReDit. On average, ReDit achieves performance comparable to vanilla GRPO with only approximately 10% of the training steps, and still exhibits a 4% performance improvement over vanilla GRPO when trained for a similar duration. Visualizations confirm significant mitigation of gradient issues with ReDit. Moreover, theoretical analyses are provided to further validate these advantages.
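
A minimal sketch of the dithering step, assuming zero-mean Gaussian noise with a hand-picked scale (the abstract does not specify the noise family or magnitude):

```python
# Hypothetical sketch of reward dithering: perturb a discrete rule-based
# reward with zero-mean noise so policy-gradient updates see a smooth,
# unbiased signal. The noise scale sigma is an assumed hyperparameter.
import random

def redit_reward(rule_based_reward: float, sigma: float = 0.05) -> float:
    return rule_based_reward + random.gauss(0.0, sigma)

# A verifier that returns 1.0 for a correct answer and 0.0 otherwise now
# yields rewards like 1.03, 0.97, ... -- noisy but centered on the original.
print([round(redit_reward(1.0), 2) for _ in range(4)])
```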

NeurIPS Conference 2025 Conference Paper

RoMa: A Robust Model Watermarking Scheme for Protecting IP in Diffusion Models

  • Yingsha Xie
  • Rui Min
  • Zeyu Qin
  • Fei Ma
  • Li Shen
  • Fei Yu
  • Xiaochun Cao

Preserving intellectual property (IP) within a pre-trained diffusion model is critical for protecting the model's copyright and preventing unauthorized model deployment. In this regard, model watermarking is a common practice for IP protection that embeds traceable information within models and allows for later verification. Nevertheless, existing watermarking schemes often face challenges due to their vulnerability to fine-tuning, limiting their practical application in general pre-training and fine-tuning paradigms. Inspired by the use of mode connectivity to analyze model performance between pairs of connected models, we investigate watermark vulnerability by leveraging Linear Mode Connectivity (LMC) as a proxy to analyze the fine-tuning dynamics of watermark performance. Our results show that existing watermarked models tend to converge to sharp minima in the loss landscape, making them vulnerable to fine-tuning. To tackle this challenge, we propose RoMa, a Robust Model watermarking scheme that improves the robustness of watermarks against fine-tuning. Specifically, RoMa decomposes watermarking into two components: Embedding Functionality, which preserves reliable watermark detection capability, and Path-specific Smoothness, which enhances the smoothness along the watermark-connected path to improve robustness. Extensive experiments on the benchmark datasets MS-COCO-2017 and CUB-200-2011 demonstrate that RoMa significantly improves watermark robustness against fine-tuning while maintaining generation quality, outperforming baselines. The code is available at https://github.com/xiekks/RoMa.

NeurIPS Conference 2025 Conference Paper

Safe-Sora: Safe Text-to-Video Generation via Graphical Watermarking

  • Zihan Su
  • Xuerui Qiu
  • Hongbin Xu
  • Tangyu Jiang
  • Jun-hao Zhuang
  • Chun Yuan
  • Ming Li
  • Shengfeng He

The explosive growth of generative video models has amplified the demand for reliable copyright protection of AI-generated content. Despite its popularity in image synthesis, invisible generative watermarking remains largely underexplored in video generation. To address this gap, we propose Safe-Sora, the first framework to embed graphical watermarks directly into the video generation process. Motivated by the observation that watermarking performance is closely tied to the visual similarity between the watermark and the cover content, we introduce a hierarchical coarse-to-fine adaptive matching mechanism. Specifically, the watermark image is divided into patches, each assigned to the most visually similar video frame and then localized to the optimal spatial region for seamless embedding. To enable spatiotemporal fusion of watermark patches across video frames, we develop a 3D wavelet transform-enhanced Mamba architecture with a novel scanning strategy, effectively modeling long-range dependencies during watermark embedding and retrieval. To the best of our knowledge, this is the first attempt to apply state space models to watermarking, opening new avenues for efficient and robust watermark protection. Extensive experiments demonstrate that Safe-Sora achieves state-of-the-art performance in terms of video quality, watermark fidelity, and robustness, largely attributable to our proposed designs. Code and additional supporting materials are provided in the supplementary.
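
The coarse matching step can be illustrated with a short sketch that assigns watermark patches to their most visually similar frames under cosine similarity; the feature extractor and the fine-grained spatial localization are omitted, and the function below is an assumption rather than Safe-Sora's implementation:

```python
# Hypothetical sketch of coarse patch-to-frame assignment: each watermark
# patch goes to the video frame whose features it most resembles.
import numpy as np

def assign_patches_to_frames(patch_feats: np.ndarray,
                             frame_feats: np.ndarray) -> np.ndarray:
    """patch_feats: (P, d) watermark patch features; frame_feats: (F, d)
    per-frame features. Returns, for each patch, the index of the most
    similar frame under cosine similarity."""
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    return (p @ f.T).argmax(axis=1)

rng = np.random.default_rng(0)
assign = assign_patches_to_frames(rng.normal(size=(16, 64)),
                                  rng.normal(size=(8, 64)))
print(assign)  # e.g. patch 0 -> frame 3, patch 1 -> frame 5, ...
```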

NeurIPS Conference 2025 Conference Paper

Universal Visuo-Tactile Video Understanding for Embodied Interaction

  • Yifan Xie
  • Mingyang Li
  • Shoujie Li
  • Xingting Li
  • Guangyu Chen
  • Fei Ma
  • Fei Yu
  • Wenbo Ding

Tactile perception is essential for embodied agents to understand the physical attributes of objects that cannot be determined through visual inspection alone. While existing methods have made progress in visual and language modalities for physical understanding, they fail to effectively incorporate tactile information that provides crucial haptic feedback for real-world interaction. In this paper, we present VTV-LLM, the first multi-modal large language model that enables universal Visuo-Tactile Video (VTV) understanding, bridging the gap between tactile perception and natural language. To address the challenges of cross-sensor and cross-modal integration, we contribute VTV150K, a comprehensive dataset comprising 150,000 video frames from 100 diverse objects captured across three different tactile sensors (GelSight Mini, DIGIT, and Tac3D), annotated with four fundamental tactile attributes (hardness, protrusion, elasticity, and friction). We develop a novel three-stage training paradigm that includes VTV enhancement for robust visuo-tactile representation, VTV-text alignment for cross-modal correspondence, and text prompt finetuning for natural language generation. Our framework enables sophisticated tactile reasoning capabilities including feature assessment, comparative analysis, and scenario-based decision-making. Extensive experimental evaluations demonstrate that VTV-LLM achieves superior performance in tactile reasoning tasks, establishing a foundation for more intuitive human-machine interaction in tactile domains.

IJCAI Conference 2025 Conference Paper

VideoHumanMIB: Unlocking Appearance Decoupling for Video Human Motion In-betweening

  • Haiwei Xue
  • Zhensong Zhang
  • Minglei Li
  • Zonghong Dai
  • Fei Yu
  • Fei Ma
  • Zhiyong Wu

We propose VideoHumanMIB, a novel framework for Video Human Motion In-betweening that enables seamless transitions between different motion video clips, facilitating the generation of longer and more natural digital human videos. While existing video frame interpolation methods work well for similar motions in adjacent frames, they often struggle with complex human movements, resulting in artifacts and unrealistic transitions. To address these challenges, we introduce a two-stage approach: First, we design an Appearance Reconstruction AutoEncoder to decouple appearance and motion information, extracting robust appearance-invariant features. Second, we develop an enhanced diffusion pretrained network that leverages both motion optical flow and human pose as guidance conditions, enabling the model to learn comprehensive latent distributions of possible motions. Rather than operating directly in pixel space, our model works in a learned latent space, allowing it to better capture the underlying motion dynamics. The framework is optimized with a dual-frame constraint loss and a motion flow loss to ensure temporal consistency and natural movement transitions. Extensive experiments demonstrate that our approach generates highly realistic transition sequences that significantly outperform existing methods, particularly in challenging scenarios with large motion variations. The proposed VideoHumanMIB establishes a new baseline for human motion synthesis and enables more natural and controllable digital human animation.

AAAI Conference 2024 Conference Paper

A Huber Loss Minimization Approach to Byzantine Robust Federated Learning

  • Puning Zhao
  • Fei Yu
  • Zhiguo Wan

Federated learning systems are susceptible to adversarial attacks. To combat this, we introduce a novel aggregator based on Huber loss minimization and provide a comprehensive theoretical analysis. Under the independent and identically distributed (i.i.d.) assumption, our approach has several advantages over existing methods. First, it has optimal dependence on epsilon, the fraction of attacked clients. Second, it does not require precise knowledge of epsilon. Third, it allows different clients to have unequal data sizes. We then broaden our analysis to non-i.i.d. data, where clients may have slightly different distributions.
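
As a simplified illustration, a Huber-loss-minimizing aggregate can be computed by iteratively reweighted averaging: updates within a threshold receive full (mean-like) weight, while distant, potentially Byzantine updates are down-weighted (median-like). The threshold and stopping rule below are assumptions, not the paper's exact procedure.

```python
# Minimal sketch of a Huber-loss-minimizing aggregator via iteratively
# reweighted averaging. Minimizes sum_i huber_delta(||theta - update_i||).
import numpy as np

def huber_aggregate(updates: np.ndarray, delta: float = 1.0,
                    iters: int = 50, tol: float = 1e-6) -> np.ndarray:
    """updates: (n_clients, dim) array of client model updates.
    Inliers (distance <= delta) get weight 1; outliers get delta/distance,
    which bounds the influence any single Byzantine client can exert."""
    theta = np.median(updates, axis=0)  # robust initialization
    for _ in range(iters):
        dists = np.linalg.norm(updates - theta, axis=1)
        w = np.where(dists <= delta, 1.0, delta / np.maximum(dists, 1e-12))
        new_theta = (w[:, None] * updates).sum(axis=0) / w.sum()
        if np.linalg.norm(new_theta - theta) < tol:
            break
        theta = new_theta
    return theta

# Toy usage: 9 honest clients near the origin, 1 attacker far away.
honest = np.random.default_rng(0).normal(0, 0.1, size=(9, 3))
print(huber_aggregate(np.vstack([honest, [[100., 100., 100.]]])))
```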

IJCAI Conference 2024 Conference Paper

ABM: Attention before Manipulation

  • Fan Zhuo
  • Ying He
  • Fei Yu
  • Pengteng Li
  • Zheyi Zhao
  • Xilong Sun

Vision-language models (VLMs) show promising generalization and zero-shot capabilities, offering a potential solution to the impracticality and cost of enabling robots to comprehend diverse human instructions and scene semantics in the real world. Most existing approaches directly integrate the semantic representations from pre-trained VLMs with policy learning. However, these methods are limited to the labeled data they are trained on, resulting in poor generalization to unseen instructions and objects. To address this limitation, we propose a simple method called "Attention before Manipulation" (ABM), which fully leverages the object knowledge encoded in CLIP to extract information about the target object in the image. It constructs an Object Mask Field that serves as a better representation of the target object, allowing the model to separate visual grounding from action prediction and acquire specific manipulation skills effectively. We train ABM on 8 RLBench tasks and 2 real-world tasks via behavior cloning. Extensive experiments show that our method significantly outperforms the baselines in both zero-shot and compositional generalization settings.

AAAI Conference 2024 Conference Paper

MDGNN: Multi-Relational Dynamic Graph Neural Network for Comprehensive and Dynamic Stock Investment Prediction

  • Hao Qian
  • Hongting Zhou
  • Qian Zhao
  • Hao Chen
  • Hongxiang Yao
  • Jingwei Wang
  • Ziqi Liu
  • Fei Yu

The stock market is a crucial component of the financial system, but predicting the movement of stock prices is challenging due to the dynamic and intricate relations arising from various aspects such as economic indicators, financial reports, global news, and investor sentiment. Traditional sequential methods and graph-based models have been applied to stock movement prediction, but they have limitations in capturing the multifaceted and temporal influences on stock price movements. To address these challenges, we propose the Multi-relational Dynamic Graph Neural Network (MDGNN) framework, which utilizes a discrete dynamic graph to comprehensively capture multifaceted relations among stocks and their evolution over time. The representation generated from the graph offers a complete perspective on the interrelationships among stocks and associated entities. Additionally, the power of the Transformer structure is leveraged to encode the temporal evolution of multiplex relations, providing a dynamic and effective approach to predicting stock investment. Our proposed MDGNN framework achieves the best performance on public datasets compared with state-of-the-art stock investment methods.

IJCAI Conference 2022 Conference Paper

Region-Aware Metric Learning for Open World Semantic Segmentation via Meta-Channel Aggregation

  • Hexin Dong
  • Zifan Chen
  • Mingze Yuan
  • Yutong Xie
  • Jie Zhao
  • Fei Yu
  • Bin Dong
  • Li Zhang

As one of the most challenging and practical segmentation tasks, open-world semantic segmentation requires the model to segment anomaly regions in images and incrementally learn to segment out-of-distribution (OOD) objects, especially under a few-shot condition. The current state-of-the-art (SOTA) method, Deep Metric Learning Network (DMLNet), relies on pixel-level metric learning, with which it is difficult to identify similar regions that have different semantics. Therefore, we propose a method called region-aware metric learning (RAML), which first separates the regions of the images and generates region-aware features for further metric learning. RAML improves the integrity of the segmented anomaly regions. Moreover, we propose a novel meta-channel aggregation (MCA) module to further separate anomaly regions, forming high-quality sub-region candidates and thereby improving the model performance for OOD objects. To evaluate the proposed RAML, we have conducted extensive experiments and ablation studies on the Lost and Found and Road Anomaly datasets for anomaly segmentation and on the Cityscapes dataset for incremental few-shot learning. The results show that the proposed RAML achieves SOTA performance in both stages of open-world segmentation. Our code and appendix are available at https://github.com/czifan/RAML.

AAAI Conference 2021 Conference Paper

DAST: Unsupervised Domain Adaptation in Semantic Segmentation Based on Discriminator Attention and Self-Training

  • Fei Yu
  • Mo Zhang
  • Hexin Dong
  • Sheng Hu
  • Bin Dong
  • Li Zhang

Unsupervised domain adaptation has recently been used to reduce the domain shift, which ultimately improves the performance of semantic segmentation on unlabeled real-world data. In this paper, we follow this trend and propose a novel method to reduce the domain shift using strategies of discriminator attention and self-training. The discriminator attention strategy contains a two-stage adversarial learning process, which explicitly distinguishes the well-aligned (domain-invariant) and poorly-aligned (domain-specific) features and then guides the model to focus on the latter. The self-training strategy adaptively improves the decision boundary of the model for the target domain, which implicitly facilitates the extraction of domain-invariant features. By combining the two strategies, we find a more effective way to reduce the domain shift. Extensive experiments demonstrate the effectiveness of our proposed method on numerous benchmark datasets.

AAAI Conference 2021 Conference Paper

ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs

  • Fei Yu
  • Jiji Tang
  • Weichong Yin
  • Yu Sun
  • Hao Tian
  • Hua Wu
  • Haifeng Wang

We propose a knowledge-enhanced approach, ERNIE-ViL, which incorporates structured knowledge obtained from scene graphs to learn joint vision-language representations. ERNIE-ViL builds detailed semantic connections (objects, attributes of objects, and relationships between objects) across vision and language, which are essential to vision-language cross-modal tasks. Utilizing scene graphs of visual scenes, ERNIE-ViL constructs Scene Graph Prediction tasks, i.e., Object Prediction, Attribute Prediction, and Relationship Prediction, in the pre-training phase. Specifically, these prediction tasks are implemented by predicting nodes of different types in the scene graph parsed from the sentence. Thus, ERNIE-ViL can learn joint representations characterizing the alignment of detailed semantics across vision and language. After pre-training on large-scale image-text aligned datasets, we validate the effectiveness of ERNIE-ViL on 5 cross-modal downstream tasks. ERNIE-ViL achieves state-of-the-art performance on all these tasks and ranks first on the VCR leaderboard with an absolute improvement of 3.7%.

IJCAI Conference 2015 Conference Paper

Solving the Partial Label Learning Problem: An Instance-Based Approach

  • Min-Ling Zhang
  • Fei Yu

In partial label learning, each training example is associated with a set of candidate labels, among which only one is valid. An intuitive strategy for learning from partial label examples is to treat all candidate labels equally and make predictions by averaging their modeling outputs. Nonetheless, this strategy may suffer from the problem that the modeling output of the valid label is overwhelmed by those of the false positive labels. In this paper, an instance-based approach named IPAL is proposed that directly disambiguates the candidate label set. Briefly, IPAL tries to identify the valid label of each partial label example via an iterative label propagation procedure, and then classifies each unseen instance based on minimum-error reconstruction from its nearest neighbors. Extensive experiments show that IPAL compares favorably against existing instance-based as well as other state-of-the-art partial label learning approaches.
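
A minimal sketch of the propagation core, assuming a row-normalized kNN affinity matrix; the neighbor weighting and the final minimum-error-reconstruction classifier are simplified away:

```python
# Hypothetical sketch of iterative label propagation over candidate label
# sets, in the spirit of IPAL; the weighting scheme here is a simplification.
import numpy as np

def propagate_partial_labels(W: np.ndarray, candidates: list,
                             n_labels: int, alpha: float = 0.95,
                             iters: int = 100) -> np.ndarray:
    """W: row-normalized kNN affinity matrix (n x n).
    candidates[i]: candidate label set of example i (one label is valid).
    Returns per-example label confidences, masked to each candidate set."""
    n = W.shape[0]
    F = np.zeros((n, n_labels))
    for i, cs in enumerate(candidates):       # uniform init over candidates
        for c in cs:
            F[i, c] = 1.0 / len(cs)
    F0 = F.copy()
    for _ in range(iters):
        F = alpha * (W @ F) + (1 - alpha) * F0  # propagate, keep prior
        for i, cs in enumerate(candidates):     # disambiguate: re-mask and
            mask = np.zeros(n_labels)           # renormalize to candidates
            for c in cs:
                mask[c] = 1.0
            F[i] *= mask
            s = F[i].sum()
            if s > 0:
                F[i] /= s
    return F

# Toy usage: 3 examples, 2 labels; example 2 has a singleton candidate set.
W = np.array([[0., 1., 0.], [.5, 0., .5], [0., 1., 0.]])
F = propagate_partial_labels(W, [{0, 1}, {0, 1}, {1}], n_labels=2)
print(F.argmax(axis=1))  # predicted labels after disambiguation
```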

IS Journal 2012 Journal Article

Trends & Controversies

  • Anton Nijholt
  • Ronald C. Arkin
  • Sebastien Brault
  • Richard Kulpa
  • Franck Multon
  • Benoit Bideau
  • David Traum
  • Hayley Hung

Many applications require knowledge about how to deceive, including those related to safety, security, and warfare. Speech and text analysis can help detect deception, as can cameras, microphones, physiological sensors, and intelligent software. Models of deception and noncooperation can make a virtual or mixed-reality training environment more realistic, improve immersion, and thus make it more suitable for training military or security personnel. Robots might need to operate in physical and nontraining environments where they must perform military activity, including misleading the enemy. The contributions to this installment of Trends & Controversies present state-of-the-art research approaches to the analysis and generation of noncooperative and deceptive behavior in virtual humans, agents, and robots; the analysis of multiparty interaction in the context of deceptive behavior; and methods to detect misleading information in texts and computer-mediated communication. Articles include: "Computational Deception and Noncooperation," by Anton Nijholt; "Robots that Need to Mislead: Biologically-Inspired Machine Deception," by Ronald C. Arkin; "Deception in Sports Using Immersive Environments," by Sébastien Brault, Richard Kulpa, Franck Multon, and Benoit Bideau; "Non-Cooperative and Deceptive Virtual Agents," by David Traum; "Deception Detection in Multiparty Contexts," by Hayley Hung; "Deception Detection, Human Reasoning, and Deception Intent," by Eugene Santos Jr., Deqing Li, and Fei Yu; and "Automatic Deception Detection in Computer-Mediated Communication," by Lina Zhou and Dongsong Zhang.