Arrow Research search

Author name cluster

Yuxuan Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

13 papers
2 author rows

Possible papers

13

AAAI Conference 2026 Conference Paper

DenoDet V2: Phase-Amplitude Cross Denoising for SAR Object Detection

  • Kang Ni
  • Minrui Zou
  • Yuxuan Li
  • Xiang Li
  • Kehua Guo
  • Ming-Ming Cheng
  • Yimian Dai

One of the primary challenges in Synthetic Aperture Radar (SAR) object detection lies in the pervasive influence of coherent noise. As a common practice, most existing methods, whether handcrafted approaches or deep learning-based methods, employ the analysis or enhancement of object spatial-domain characteristics to achieve implicit denoising. In this paper, we propose DenoDet V2, which explores a completely novel and different perspective to deconstruct and modulate the features in the transform domain via a carefully designed attention architecture. Compared to DenoDet V1, DenoDet V2 is a major advancement that exploits the complementary nature of amplitude and phase information through a band-wise mutual modulation mechanism, which enables a reciprocal enhancement between phase and amplitude spectra. Extensive experiments on various SAR datasets demonstrate the state-of-the-art performance of DenoDet V2. Notably, DenoDet V2 achieves a significant 0.8% improvement on SARDet-100K dataset compared to DenoDet V1, while reducing the model complexity by half.

AAAI Conference 2026 Conference Paper

SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection

  • Yuxuan Li
  • Xiang Li
  • Yunheng Li
  • Yicheng Zhang
  • Yimian Dai
  • Qibin Hou
  • Ming-Ming Cheng
  • Jian Yang

With the rapid advancement of remote sensing technology, high-resolution multi-modal imagery is now more widely accessible. Conventional object detection models are trained on a single dataset, often restricted to a specific imaging modality and annotation format. However, such an approach overlooks the valuable shared knowledge across multi-modalities and limits the model’s applicability in more versatile scenarios. This paper introduces a new task called Multi-Modal Datasets and Multi-Task Object Detection (M2Det) for remote sensing, designed to accurately detect horizontal or oriented objects from any sensor modality. This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization. To address these, we establish a benchmark dataset and propose a unified model, SM3Det (Single Model for Multi-Modal datasets and Multi-Task object Detection). SM3Det leverages a grid-level sparse MoE backbone to enable joint knowledge learning while preserving distinct feature representations for different modalities. Furthermore, we propose a novel consistency and synchronization optimization mechanism, allowing it to effectively handle varying levels of learning difficulty across modalities and tasks. Extensive experiments demonstrate SM3Det's effectiveness and generalizability, consistently outperforming the combination of specialized models on individual datasets.

AAAI Conference 2026 Conference Paper

Strip R-CNN: Large Strip Convolution for Remote Sensing Object Detection

  • Xinbin Yuan
  • Zhaohui Zheng
  • Yuxuan Li
  • Xialei Liu
  • Li Liu
  • Xiang Li
  • Qibin Hou
  • Ming-Ming Cheng

In this paper, we show that current approaches using large square kernels or transformer-based global modeling aggregate contextual information uniformly across spatial dimensions, leading to feature dilution and localization errors for elongated targets. To mitigate this issue, we propose Strip R-CNN, the first work to systematically explore large strip convolutions for remote sensing object detection. Our key insight is that strip convolutions enable directional feature aggregation along the dominant spatial dimension of slender objects, reducing background interference while preserving essential geometric information. We design two core components: (i) StripNet, a backbone network employing sequential orthogonal large strip convolutions to capture anisotropic spatial patterns, and (ii) Strip Head, which enhances localization precision by incorporating strip convolutions into the detection head. Unlike previous large-kernel approaches that suffer from computational redundancy and isotropic limitations, our method achieves superior performance with remarkable efficiency. Extensive experiments on multiple benchmarks (DOTA, FAIR1M, HRSC2016, and DIOR) demonstrate significant improvements, with our 30M parameter model achieving 82.75% mAP on DOTA-v1.0, establishing a new state-of-the-art record while providing new insights into anisotropic feature learning for remote sensing applications.

AAAI Conference 2026 Conference Paper

TW-CRL: Time-Weighted Contrastive Reward Learning for Efficient Inverse Reinforcement Learning

  • Yuxuan Li
  • Yicheng Gao
  • Ning Yang
  • Stephen Xia

Episodic tasks in Reinforcement Learning (RL) often pose challenges due to sparse reward signals and high-dimensional state spaces, which hinder efficient learning. Additionally, these tasks often feature hidden “trap states”—irreversible failures that prevent task completion but do not provide explicit negative rewards to guide agents away from repeated errors. To address these issues, we propose Time-Weighted Contrastive Reward Learning (TW-CRL), an Inverse Reinforcement Learning (IRL) framework that leverages both successful and failed demonstrations. By incorporating temporal information, TW-CRL learns a dense reward function that identifies critical states associated with success or failure. This approach not only enables agents to avoid trap states but also encourages meaningful exploration beyond simple imitation of expert trajectories. Empirical evaluations on navigation tasks and robotic manipulation benchmarks demonstrate that TW-CRL surpasses state-of-the-art methods, achieving improved efficiency and robustness.

NeurIPS Conference 2025 Conference Paper

Accident Anticipation via Temporal Occurrence Prediction

  • Tianhao Zhao
  • Yiyang Zou
  • Zihao Mao
  • Peilun Xiao
  • Yulin Huang
  • Hongda Yang
  • Yuxuan Li
  • Tracy Li

Accident anticipation aims to predict potential collisions in an online manner, enabling timely alerts to enhance road safety. Existing methods typically predict frame-level risk scores as indicators of hazard. However, these approaches rely on ambiguous binary supervision—labeling all frames in accident videos as positive—despite the fact that risk varies continuously over time, leading to unreliable learning and false alarms. To address this, we propose a novel paradigm that shifts the prediction target from current-frame risk scoring to directly estimating accident scores at multiple future time steps (e. g. , 0. 1s–2. 0s ahead), leveraging precisely annotated accident timestamps as supervision. Our method employs a snippet-level encoder to jointly model spatial and temporal dynamics, and a Transformer-based temporal decoder that predicts accident scores for all future horizons simultaneously using dedicated temporal queries. Furthermore, we introduce a refined evaluation protocol that reports Time-to-Accident (TTA) and recall—evaluated at multiple pre-accident intervals (0. 5s, 1. 0s, and 1. 5s)—only when the false alarm rate (FAR) remains within an acceptable range, ensuring practical relevance. Experiments show that our method achieves superior performance in both recall and TTA under realistic FAR constraints. Project page: https: //happytianhao. github. io/TOP/

AAAI Conference 2025 Conference Paper

Discovering Conceptual Knowledge with Analytic Ontology Templates for Articulated Objects

  • Jianhua Sun
  • Yuxuan Li
  • Longfei Xu
  • Jiude Wei
  • Liang Chai
  • Cewu Lu

Human cognition can leverage fundamental conceptual knowledge, like geometry and kinematic ones, to appropriately perceive, comprehend and interact with novel objects. Motivated by this finding, we aim to endow machine intelligence with an analogous capability through performing at the conceptual level, in order to understand and then interact with articulated objects, especially for those in novel categories, which is challenging due to the intricate geometric structures and diverse joint types of articulated objects. To achieve this goal, we propose Analytic Ontology Template (AOT), a parameterized and differentiable program description of generalized conceptual ontologies. A baseline approach called AOTNet driven by AOTs is designed accordingly to equip intelligent agents with these generalized concepts, and then empower the agents to effectively discover the conceptual knowledge on the structure and affordance of articulated objects. The AOT-driven approach yields benefits in three key perspectives: i) enabling concept-level understanding of articulated objects without relying on any real training data, ii) providing analytic structure information, and iii) introducing rich affordance information indicating proper ways of interaction. We conduct exhaustive experiments and the results demonstrate the superiority of our approach in understanding and then interacting with articulated objects.

ICML Conference 2025 Conference Paper

Finding Wasserstein Ball Center: Efficient Algorithm and The Applications in Fairness

  • Yuntao Wang
  • Yuxuan Li
  • Qingyuan Yang
  • Hu Ding 0003

Wasserstein Barycenter (WB) is a fundamental geometric optimization problem in machine learning, whose objective is to find a representative probability measure that minimizes the sum of Wasserstein distances to given distributions. WB has a number of applications in various areas. However, WB may lead to unfair outcome towards underrepresented groups in some applications (e. g. , a "minority” distribution may be far away from the obtained WB under Wasserstein distance). To address this issue, we propose an alternative objective called "Wasserstein Ball Center (WBC)”. Specifically, WBC is a distribution that encompasses all input distributions within the minimum Wasserstein distance, which can be formulated as a “minmax” optimization problem. We show that the WBC problem with fixed support is equivalent to solving a large-scale linear programming (LP) instance, which is quite different from the previously studied LP model for WB. By incorporating some novel observations on the induced normal equation, we propose an efficient algorithm that accelerates the interior point method by $O(\min(N^2m, Nm^2, m^4))$ times ("$N$” is the number of distributions and "$m$” is the support size). Finally, we conduct a set of experiments on both synthetic and real-world datasets, demonstrating the computational efficiency of our algorithm, and showing its ability to provide more fairness for input distributions.

ICLR Conference 2025 Conference Paper

Interactive Adjustment for Human Trajectory Prediction with Individual Feedback

  • Jianhua Sun 0003
  • Yuxuan Li
  • Liang Chai
  • Cewu Lu

Human trajectory prediction is fundamental for autonomous driving and service robot. The research community has studied various important aspects of this task and made remarkable progress recently. However, there is an essential perspective which is not well exploited in previous research all along, namely individual feedback. Individual feedback exists in the sequential nature of trajectory prediction, where earlier predictions of a target can be verified over time by his ground-truth trajectories to obtain feedback which provides valuable experience for subsequent predictions on the same agent. In this paper, we show such feedback can reveal the strengths and weaknesses of the model's predictions on a specific target and heuristically guide to deliver better predictions on him. We present an interactive adjustment network to effectively model and leverage the feedback. This network first exploits the feedback from previous predictions to dynamically generate an adjuster which then interactively makes appropriate adjustments to current predictions for more accurate ones. We raise a novel displacement expectation loss to train this interactive architecture. Through experiments on representative prediction methods and widely-used benchmarks, we demonstrate the great value of individual feedback and the superior effectiveness of proposed interactive adjustment network.

AAAI Conference 2025 Conference Paper

Multi-clue Consistency Learning to Bridge Gaps Between General and Oriented Object in Semi-supervised Detection

  • Chenxu Wang
  • Chunyan Xu
  • Xiang Li
  • Yuxuan Li
  • Xu Guo
  • Ziqi Gu
  • Zhen Cui

While existing semi-supervised object detection (SSOD) methods perform well in general scenes, they encounter challenges in handling oriented objects in aerial images. We experimentally find three gaps between general and oriented object detection in semi-supervised learning: 1) Sampling inconsistency: the common center sampling is not suitable for oriented objects with larger aspect ratios when selecting positive labels from labeled data. 2) Assignment inconsistency: balancing the precision and localization quality of oriented pseudo-boxes poses greater challenges which introduces more noise when selecting positive labels from unlabeled data. 3) Confidence inconsistency: there exists more mismatch between the predicted classification and localization qualities when considering oriented objects, affecting the selection of pseudo-labels. Therefore, we propose a Multi-clue Consistency Learning (MCL) framework to bridge gaps between general and oriented objects in semi-supervised detection. Specifically, considering various shapes of rotated objects, the Gaussian Center Assignment is specially designed to select the pixel-level positive labels from labeled data. We then introduce the Scale-aware Label Assignment to select pixel-level pseudo-labels instead of unreliable pseudo-boxes, which is a divide-and-rule strategy suited for objects with various scales. The Consistent Confidence Soft Label is adopted to further boost the detector by maintaining the alignment of the predicted results. Comprehensive experiments on DOTA-v1.5 and DOTA-v1.0 benchmarks demonstrate that our proposed MCL can achieve state-of-the-art performance in the semi-supervised oriented object detection task.

NeurIPS Conference 2024 Conference Paper

ConceptFactory: Facilitate 3D Object Knowledge Annotation with Object Conceptualization

  • Jianhua Sun
  • Yuxuan Li
  • Longfei Xu
  • Nange Wang
  • Jiude Wei
  • Yining Zhang
  • Cewu Lu

We present ConceptFactory, a novel scope to facilitate more efficient annotation of 3D object knowledge by recognizing 3D objects through generalized concepts (i. e. object conceptualization), aiming at promoting machine intelligence to learn comprehensive object knowledge from both vision and robotics aspects. This idea originates from the findings in human cognition research that the perceptual recognition of objects can be explained as a process of arranging generalized geometric components (e. g. cuboids and cylinders). ConceptFactory consists of two critical parts: i) ConceptFactory Suite, a unified toolbox that adopts Standard Concept Template Library (STL-C) to drive a web-based platform for object conceptualization, and ii) ConceptFactory Asset, a large collection of conceptualized objects acquired using ConceptFactory suite. Our approach enables researchers to effortlessly acquire or customize extensive varieties of object knowledge to comprehensively study different object understanding tasks. We validate our idea on a wide range of benchmark tasks from both vision and robotics aspects with state-of-the-art algorithms, demonstrating the high quality and versatility of annotations provided by our approach. Our website is available at https: //apeirony. github. io/ConceptFactory.

ICML Conference 2024 Conference Paper

Longitudinal Targeted Minimum Loss-based Estimation with Temporal-Difference Heterogeneous Transformer

  • Toru Shirakawa
  • Yi Li
  • Yulun Wu
  • Sky Qiu
  • Yuxuan Li
  • Mingduo Zhao
  • Hiroyasu Iso
  • Mark J. van der Laan

We propose Deep Longitudinal Targeted Minimum Loss-based Estimation (Deep LTMLE), a novel approach to estimate the counterfactual mean of outcome under dynamic treatment policies in longitudinal problem settings. Our approach utilizes a transformer architecture with heterogeneous type embedding trained using temporal-difference learning. After obtaining an initial estimate using the transformer, following the targeted minimum loss-based likelihood estimation (TMLE) framework, we statistically corrected for the bias commonly associated with machine learning algorithms. Furthermore, our method also facilitates statistical inference by enabling the provision of 95% confidence intervals grounded in asymptotic statistical theory. Simulation results demonstrate our method’s superior performance over existing approaches, particularly in complex, long time-horizon scenarios. It remains effective in small-sample, short-duration contexts, matching the performance of asymptotically efficient estimators. To demonstrate our method in practice, we applied our method to estimate counterfactual mean outcomes for standard versus intensive blood pressure management strategies in a real-world cardiovascular epidemiology cohort study.

NeurIPS Conference 2024 Conference Paper

SARDet-100K: Towards Open-Source Benchmark and ToolKit for Large-Scale SAR Object Detection

  • Yuxuan Li
  • Xiang Li
  • Weijie Li
  • Qibin Hou
  • Li Liu
  • Ming-Ming Cheng
  • Jian Yang

Synthetic Aperture Radar (SAR) object detection has gained significant attention recently due to its irreplaceable all-weather imaging capabilities. However, this research field suffers from both limited public datasets (mostly comprising <2K images with only mono-category objects) and inaccessible source code. To tackle these challenges, we establish a new benchmark dataset and an open-source method for large-scale SAR object detection. Our dataset, SARDet-100K, is a result of intense surveying, collecting, and standardizing 10 existing SAR detection datasets, providing a large-scale and diverse dataset for research purposes. To the best of our knowledge, SARDet-100K is the first COCO-level large-scale multi-class SAR object detection dataset ever created. With this high-quality dataset, we conducted comprehensive experiments and uncovered a crucial challenge in SAR object detection: the substantial disparities between the pretraining on RGB datasets and finetuning on SAR datasets in terms of both data domain and model structure. To bridge these gaps, we propose a novel Multi-Stage with Filter Augmentation (MSFA) pretraining framework that tackles the problems from the perspective of data input, domain transition, and model migration. The proposed MSFA method significantly enhances the performance of SAR object detection models while demonstrating exceptional generalizability and flexibility across diverse models. This work aims to pave the way for further advancements in SAR object detection. The dataset and code is available at \url{https: //github. com/zcablii/SARDet_100K}.

TMLR Journal 2023 Journal Article

Representations and Computations in Transformers that Support Generalization on Structured Tasks

  • Yuxuan Li
  • James McClelland

Transformers have shown remarkable success in natural language processing and computer vision, serving as the foundation of large language and multimodal models. These networks can capture nuanced context sensitivity across high-dimensional language tokens or image pixels, but it remains unclear how highly structured behavior and systematic generalization can arise in these systems. Here, we explore the solution process a causal transformer discovers as it learns to solve a set of algorithmic tasks involving copying, sorting, and hierarchical compositions of these operations. We search for the minimal layer and head configuration sufficient to solve these tasks and unpack the roles of the attention heads, as well as how token representations are reweighted across layers to complement these roles. Our results provide new insights into how attention layers in transformers support structured computation within and across tasks: 1) Replacing fixed position labels with labels sampled from a larger set enables strong length generalization and faster learning. The learnable embeddings of these labels develop different representations, capturing sequence order if necessary, depending on task demand. 2) Two-layer transformers can learn reliable solutions to the multi-level problems we explore. The first layer tends to transform the input representation to allow the second layer to share computation across repeated components within a task or across related tasks. 3) We introduce an analysis pipeline that quantifies how the representation space in a given layer prioritizes different aspects of each item. We show that these representations prioritize information needed to guide attention relative to information that only requires downstream readout.