Arrow Research search

Author name cluster

Wei Ma

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers
2 author rows

Possible papers

AAAI Conference 2026 Conference Paper

DualScope: Capturing Critical Spatial and Temporal Cues for Distracted Driving Activity Recognition

  • Zhijie Qiu
  • Shuaibo Li
  • Laixin Zhang
  • Xuming Hu
  • Wei Ma

Accurately recognizing distracted driving activities in real-world scenarios is essential for improving road and pedestrian safety. However, existing approaches are prone to attending to irrelevant scene context and are susceptible to interference from redundant frames, compromising their robustness in complex driving environments. To overcome these limitations, we propose DualScope, a novel framework that captures behaviorally critical information from both spatial and temporal perspectives. In the spatial domain, we introduce a Synergistic Behavior-Centric Distillation mechanism that leverages two key information sources: (1) position-aware knowledge derived from the SAM model, which enhances the perception of critical regions and their semantic interaction structures; and (2) fine-grained visual details obtained from cropped key regions, which improve the model's ability to capture detailed patterns within behavior-relevant areas. In the temporal domain, we present the Saliency-Aware Fine-to-Coarse Temporal Modeling module, comprising three components: a Fine-Grained Motion Encoder for capturing local inter-frame dependencies; a Dynamic Difference Extractor for generating salient motion dynamics; and a Saliency-Aware Temporal Pyramid Mamba for integrating these representations to enable multi-scale temporal modeling. This design effectively captures both short-term motions and long-term behavioral patterns. Furthermore, incorporating salient dynamics enhances the model's focus on significant behavioral variations. Extensive experiments on seven publicly available DDAR datasets demonstrate that DualScope consistently outperforms state-of-the-art methods, validating its effectiveness in capturing behavioral cues across spatial and temporal dimensions.

AAAI Conference 2026 Conference Paper

From Sequential to Recursive: Enhancing Decision-Focused Learning with Bidirectional Feedback

  • Xinyu Wang
  • Jinxiao Du
  • Yiyang Peng
  • Wei Ma

Decision-focused learning (DFL) has emerged as a powerful end-to-end alternative to conventional predict-then-optimize (PTO) pipelines by directly optimizing predictive models through downstream decision losses. However, existing DFL frameworks follow a strictly sequential structure, referred to as sequential DFL (S-DFL), which fails to capture the bidirectional feedback between prediction and optimization in complex interaction scenarios. In view of this, we propose, for the first time, recursive decision-focused learning (R-DFL), a novel framework that introduces bidirectional feedback between downstream optimization and upstream prediction. We further extend two distinct differentiation methods, explicit unrolling via automatic differentiation and implicit differentiation based on fixed-point methods, to facilitate efficient gradient propagation in R-DFL. We rigorously prove that both methods achieve comparable gradient accuracy, with the implicit method offering superior computational efficiency. Extensive experiments on both synthetic and real-world datasets, including the newsvendor problem and the bipartite matching problem, demonstrate that R-DFL not only substantially enhances the final decision quality over sequential baselines but also exhibits robust adaptability across diverse scenarios in closed-loop decision-making problems.

AAAI Conference 2026 Conference Paper

RealNet: Efficient and Unsupervised Detection of AI-Generated Images via Real-Only Representation Learning

  • Shuaibo Li
  • Laixin Zhang
  • Wei Ma
  • Jianwei Guo
  • Shibiao Xu
  • Zhijie Qiu
  • Hongbin Zha

Detecting AI-generated images remains a persistent challenge, as existing detectors often struggle to generalize to forgeries produced by previously unseen generative models. This generalization gap mainly stems from entanglement with semantic content and overfitting to model-specific artifacts. Moreover, many state-of-the-art methods rely on large pre-trained backbones or computationally intensive pipelines, which limit their applicability in real-world, resource-constrained environments. We propose RealNet, a lightweight and unsupervised framework that constructs a disentangled, forgery-aware representation space using only real images. RealNet first extracts semantic-agnostic representations through a dual adversarial denoising mechanism, producing compact features with low intra-class variance. These representations are then perturbed in feature space to generate pseudo-negative samples, which are combined with the original real features to train a lightweight discriminator, enabling robust detection without any dependence on synthetic images during training. Comprehensive evaluations across GAN, diffusion, and emerging VAR-based paradigms demonstrate that RealNet achieves superior cross-model generalization and robustness. RealNet surpasses previous state-of-the-art approaches by 4.51% in accuracy and 3.93% in average precision, while maintaining significantly lower computational cost. Furthermore, we introduce a medically relevant synthetic image dataset and show RealNet remains effective under severe distribution shifts, highlighting its potential for deployment in high-stakes real-world scenarios. Together, these advantages position RealNet as a practical, scalable and socially impactful solution for robust AI-generated image detection.

AAAI Conference 2025 Conference Paper

3SAT: A Simple Self-Supervised Adversarial Training Framework

  • Jiang Fang
  • Haonan He
  • Jiyan Sun
  • Jiadong Fu
  • Zhaorui Guo
  • Yinlong Liu
  • Wei Ma

The combination of self-supervised learning and adversarial training (AT) can significantly improve the adversarial robustness of self-supervised models. However, the robustness of self-supervised adversarial training (self-AT) still lags behind that of state-of-the-art (SOTA) supervised AT (sup-AT), even though the performance of current self-supervised learning models has already matched or even surpassed that of SOTA supervised learning models. This issue raises concerns about the secure application of self-supervised learning models. The inclusion of adversarial training turns self-AT into a challenging joint optimization problem, and recent studies have shown that the data augmentation methods necessary for constructing positive pairs in self-supervised learning negatively impact the robustness improvement in self-AT. Inspired by this, we propose 3SAT, a simple self-supervised adversarial training framework. 3SAT conducts adversarial training on original, unaugmented samples, reducing the difficulty of optimizing the adversarial training subproblem and fundamentally eliminating the negative impact of data augmentation on robustness improvement. Additionally, 3SAT introduces a dynamic training objective scheduling strategy to address the issue of model training collapse during the joint optimization process when using original samples directly. 3SAT is not only structurally simple and computationally efficient, reducing self-AT training time by half, but it also improves the SOTA self-AT robustness accuracy by 16.19% and standard accuracy by 11.41% under Auto-Attack on the CIFAR-10 dataset. Even more impressively, 3SAT surpasses the SOTA sup-AT method in robust accuracy by a significant margin of 11.25%. This marks the first time that self-AT has outperformed SOTA sup-AT in robustness, indicating that self-AT is a superior method for improving model robustness.

AAAI Conference 2025 Conference Paper

Geolocation Representation from Large Language Models Are Generic Enhancers for Spatio-Temporal Learning

  • Junlin He
  • Tong Nie
  • Wei Ma

In the geospatial domain, universal representation models are significantly less prevalent than in natural language processing and computer vision, where they are used extensively. This discrepancy arises primarily from the high costs associated with the input of existing representation models, which often require street views and mobility data. To address this, we develop a novel, training-free method that leverages large language models (LLMs) and auxiliary map data from OpenStreetMap to derive geolocation representations (LLMGeovec). LLMGeovec can represent geographic semantics at city, country, and global scales, and acts as a generic enhancer for spatio-temporal learning. Specifically, by direct feature concatenation, we introduce a simple yet effective paradigm for enhancing multiple spatio-temporal tasks including geographic prediction (GP), long-term time series forecasting (LTSF), and graph-based spatio-temporal forecasting (GSTF). LLMGeovec can seamlessly integrate into a wide spectrum of spatio-temporal learning models, providing immediate enhancements. Experimental results demonstrate that LLMGeovec achieves global coverage and significantly boosts the performance of leading GP, LTSF, and GSTF models.

ICRA Conference 2025 Conference Paper

Sampling-Based Grasp and Collision Prediction for Assisted Teleoperation

  • Simon Manschitz
  • Berk Gueler
  • Wei Ma
  • Dirk Ruiken

Shared autonomy allows for combining the global planning capabilities of a human operator with the strengths of a robot such as repeatability and accurate control. In a real-time teleoperation setting, one possibility for shared autonomy is to let the human operator decide on the rough movement and to let the robot do fine adjustments, e.g., when the view of the operator is occluded. We present a learning-based concept for shared autonomy that aims at supporting the human operator in a real-time teleoperation setting. At every step, our system tracks the target pose set by the human operator as accurately as possible while at the same time satisfying a set of constraints which influence the robot's behavior. An important characteristic is that the constraints can be dynamically activated and deactivated, which allows the system to provide task-specific assistance. Since the system must generate robot commands in real-time, solving an optimization problem in every iteration is not feasible. Instead, we sample potential target configurations and use neural networks for predicting the constraint costs for each configuration. By evaluating each configuration in parallel, our system is able to select, with minimal delay, the target configuration which satisfies the constraints and has the minimum distance to the operator's target pose. We evaluate the framework with a pick and place task on a bi-manual setup with two Franka Emika Panda robot arms with Robotiq grippers.

IJCAI Conference 2025 Conference Paper

Training-free Fourier Phase Diffusion for Style Transfer

  • Siyuan Zhang
  • Wei Ma
  • Libin Liu
  • Zheng Li
  • Hongbin Zha

Diffusion models have shown significant potential for image style transfer tasks. However, achieving effective stylization while preserving content in a training-free setting remains a challenging issue due to the tightly coupled representation space and inherent randomness of the models. In this paper, we propose a Fourier phase diffusion model that addresses this challenge. Given that the Fourier phase spectrum encodes an image's edge structures, we propose modulating the intermediate diffusion samples with the Fourier phase of a content image to conditionally guide the diffusion process. This ensures content retention while fully utilizing the diffusion model's style generation capabilities. To implement this, we introduce a content phase spectrum incorporation method that aligns with the characteristics of the diffusion process, preventing interference with generative stylization. To further enhance content preservation, we integrate homomorphic semantic features extracted from the content image at each diffusion stage. Extensive experimental results demonstrate that our method outperforms state-of-the-art models in both content preservation and stylization. Code is available at https://github.com/zhang2002forwin/Fourier-Phase-Diffusion-for-Style-Transfer.
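The abstract's premise that the Fourier phase spectrum carries an image's structure can be illustrated with the classic phase/amplitude swap. The sketch below is not the paper's diffusion method: the function name `swap_phase` and the random test arrays are hypothetical, and it only shows how a content image's phase can be combined with another signal's amplitude using NumPy.

```python
import numpy as np


def swap_phase(content: np.ndarray, other: np.ndarray) -> np.ndarray:
    """Combine the Fourier phase of `content` with the Fourier amplitude of `other`.

    Illustrative helper (hypothetical name), not the paper's phase incorporation method.
    Both inputs are real 2-D arrays of the same shape.
    """
    content_phase = np.angle(np.fft.fft2(content))    # structure (edges, layout)
    other_amplitude = np.abs(np.fft.fft2(other))      # energy per frequency
    mixed = other_amplitude * np.exp(1j * content_phase)
    # The mixed spectrum of two real images is conjugate-symmetric up to
    # rounding error, so the imaginary part of the inverse is negligible.
    return np.real(np.fft.ifft2(mixed))


# Made-up stand-ins for a content image and a second image.
rng = np.random.default_rng(0)
content = rng.random((8, 8))
other = rng.random((8, 8))
result = swap_phase(content, other)
```

Swapping an image's phase with its own amplitude reproduces the image, which is a quick sanity check that the decomposition is lossless.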

NeurIPS Conference 2024 Conference Paper

Preventing Dimensional Collapse in Self-Supervised Learning via Orthogonality Regularization

  • Junlin He
  • Jinxiao Du
  • Wei Ma

Self-supervised learning (SSL) has rapidly advanced in recent years, approaching the performance of its supervised counterparts through the extraction of representations from unlabeled data. However, dimensional collapse, where a few large eigenvalues dominate the eigenspace, poses a significant obstacle for SSL. When dimensional collapse occurs on features (e.g., hidden features and representations), it prevents features from representing the full information of the data; when dimensional collapse occurs on weight matrices, their filters are self-related and redundant, limiting their expressive power. Existing studies have predominantly concentrated on the dimensional collapse of representations, neglecting whether this can sufficiently prevent the dimensional collapse of the weight matrices and hidden features. To this end, we propose, for the first time, a mitigation approach employing orthogonal regularization (OR) across the encoder, targeting both convolutional and linear layers during pretraining. OR promotes orthogonality within weight matrices, thus safeguarding against the dimensional collapse of weight matrices, hidden features, and representations. Our empirical investigations demonstrate that OR significantly enhances the performance of SSL methods across diverse benchmarks, yielding consistent gains with both CNNs and Transformer-based architectures.
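The abstract does not state which orthogonality regularizer is used. As a minimal sketch, the widely used soft orthogonality penalty, the squared Frobenius norm of the Gram matrix minus the identity, is assumed here; the function name `soft_orthogonality_penalty` is hypothetical, and convolutional kernels are flattened to a 2-D matrix before the penalty is computed.

```python
import numpy as np


def soft_orthogonality_penalty(weight: np.ndarray) -> float:
    """Squared Frobenius norm of (W W^T - I) for a flattened weight tensor.

    Assumed form of the regularizer (not confirmed by the abstract).
    `weight` has shape (out_features, ...); trailing axes are flattened.
    """
    w = weight.reshape(weight.shape[0], -1)
    # Use the smaller Gram matrix so that an exactly orthonormal set of
    # rows (or columns) can reach a penalty of zero.
    if w.shape[0] > w.shape[1]:
        w = w.T
    gram = w @ w.T
    return float(np.sum((gram - np.eye(w.shape[0])) ** 2))


# An orthogonal matrix incurs no penalty; a rank-1 matrix is penalized.
penalty_identity = soft_orthogonality_penalty(np.eye(3))
penalty_rank_one = soft_orthogonality_penalty(np.ones((2, 2)))
```

In training, a term like this would typically be summed over the regularized layers and added to the SSL loss with a small coefficient; that coefficient is a tuning choice not specified here.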

NeurIPS Conference 2024 Conference Paper

Preventing Model Collapse in Deep Canonical Correlation Analysis by Noise Regularization

  • Junlin He
  • Jinxiao Du
  • Susu Xu
  • Wei Ma

Multi-View Representation Learning (MVRL) aims to learn a unified representation of an object from multi-view data. Deep Canonical Correlation Analysis (DCCA) and its variants share simple formulations and demonstrate state-of-the-art performance. However, with extensive experiments, we observe the issue of model collapse, i.e., the performance of DCCA-based methods will drop drastically when training proceeds. The model collapse issue could significantly hinder the wide adoption of DCCA-based methods because it is challenging to decide when to stop early. To this end, we develop NR-DCCA, which is equipped with a novel noise regularization approach to prevent model collapse. Theoretical analysis shows that the Correlation Invariant Property is the key to preventing model collapse, and our noise regularization forces the neural network to possess such a property. A framework to construct synthetic data with different common and complementary information is also developed to compare MVRL methods comprehensively. The developed NR-DCCA outperforms baselines stably and consistently in both synthetic and real-world datasets, and the proposed noise regularization approach can also be generalized to other DCCA-based methods such as DGCCA.

TIST Journal 2023 Journal Article

Adversarial Attacks on Deep Reinforcement Learning-based Traffic Signal Control Systems with Colluding Vehicles

  • Ao Qu
  • Yihong Tang
  • Wei Ma

The rapid advancements of Internet of Things (IoT) and Artificial Intelligence (AI) have catalyzed the development of adaptive traffic control systems (ATCS) for smart cities. In particular, deep reinforcement learning (DRL) models produce state-of-the-art performance and have great potential for practical applications. In the existing DRL-based ATCS, the controlled signals collect traffic state information from nearby vehicles, and then optimal actions (e.g., switching phases) can be determined based on the collected information. The DRL models fully “trust” that vehicles are sending the true information to the traffic signals, making the ATCS vulnerable to adversarial attacks with falsified information. In view of this, this article formulates, for the first time, a novel task in which a group of vehicles can cooperatively send falsified information to “cheat” DRL-based ATCS in order to save their total travel time. To solve the proposed task, we develop CollusionVeh, a generic and effective vehicle-colluding framework composed of a road situation encoder, a vehicle interpreter, and a communication mechanism. We employ our framework to attack established DRL-based ATCS and demonstrate that the total travel time for the colluding vehicles can be significantly reduced with a reasonable number of learning episodes, and the colluding effect will decrease if the number of colluding vehicles increases. Additionally, insights and suggestions for the real-world deployment of DRL-based ATCS are provided. The research outcomes could help improve the reliability and robustness of the ATCS and better protect the smart mobility systems.

NeurIPS Conference 2019 Conference Paper

Missing Not at Random in Matrix Completion: The Effectiveness of Estimating Missingness Probabilities Under a Low Nuclear Norm Assumption

  • Wei Ma
  • George Chen

Matrix completion is often applied to data with entries missing not at random (MNAR). For example, consider a recommendation system where users tend to only reveal ratings for items they like. In this case, a matrix completion method that relies on entries being revealed at uniformly sampled row and column indices can yield overly optimistic predictions of unseen user ratings. Recently, various papers have shown that we can reduce this bias in MNAR matrix completion if we know the probabilities of different matrix entries being missing. These probabilities are typically modeled using logistic regression or naive Bayes, which make strong assumptions and lack guarantees on the accuracy of the estimated probabilities. In this paper, we suggest a simple approach to estimating these probabilities that avoids these shortcomings. Our approach follows from the observation that missingness patterns in real data often exhibit low nuclear norm structure. We can then estimate the missingness probabilities by feeding the (always fully-observed) binary matrix specifying which entries are revealed to an existing nuclear-norm-constrained matrix completion algorithm by Davenport et al. [2014]. Thus, we tackle MNAR matrix completion by solving a different matrix completion problem first that recovers missingness probabilities. We establish finite-sample error bounds for how accurate these probability estimates are and how well these estimates debias standard matrix completion losses for the original matrix to be completed. Our experiments show that the proposed debiasing strategy can improve a variety of existing matrix completion algorithms, and achieves downstream matrix completion accuracy at least as good as logistic regression and naive Bayes debiasing baselines that require additional auxiliary information.
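The debiasing idea in this abstract, reweighting a matrix completion loss by the probability that each entry is revealed, can be sketched with inverse-propensity weighting. This is an assumption about the form of the debiased loss, not the paper's estimator: the function name `ipw_squared_error` is hypothetical, the propensities are supplied by hand rather than estimated with a nuclear-norm-constrained method, and the numbers are made up.

```python
import numpy as np


def ipw_squared_error(
    truth: np.ndarray,
    prediction: np.ndarray,
    revealed: np.ndarray,
    propensity: np.ndarray,
) -> float:
    """Inverse-propensity-weighted mean squared error over all matrix entries.

    `revealed` is a boolean mask of observed entries; `propensity` holds the
    (assumed known) probability that each entry is revealed. Unobserved
    entries of `truth` are ignored and may hold any value, including NaN.
    """
    squared_error = (truth - prediction) ** 2
    # Each observed entry is up-weighted by 1 / propensity so that rarely
    # revealed entries count as much as frequently revealed ones.
    weighted = np.where(revealed, squared_error / propensity, 0.0)
    return float(weighted.sum() / revealed.size)


# Made-up 2x2 example: two entries revealed, each with probability 0.5.
truth = np.array([[1.0, 2.0], [3.0, 4.0]])
prediction = np.zeros((2, 2))
revealed = np.array([[True, False], [False, True]])
propensity = np.array([[0.5, 1.0], [1.0, 0.5]])
loss = ipw_squared_error(truth, prediction, revealed, propensity)
```

When every entry is revealed with probability one, the weighted loss reduces to the ordinary mean squared error, which is a useful sanity check.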