Arrow Research search

Author name cluster

Xiang Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

179 papers
2 author rows

Possible papers (179)

AAAI Conference 2026 Conference Paper

Analyze–Compose–Execute: A Dynamic Dialogue Framework for Multi-Agent Debate

  • Wenyuan Gu
  • Haowen Wang
  • Jiale Han
  • Xiang Li
  • Zhixuan Wu
  • Hongru Xiao
  • Bo Cheng

Multi-Agent Debate (MAD) is an emerging paradigm that leverages the reasoning abilities of Large Language Models (LLMs) by encouraging them to collaboratively solve problems through human-like discussions. However, current MAD methods typically constrain agents to follow fixed discussion pipelines, repeatedly applying the same discussion act for a predetermined number of rounds, which limits their effectiveness and adaptability in complex and diverse tasks. To address this limitation, we propose Analyze–Compose–Execute (ACE), a novel debate framework in which agents dynamically execute the discussion actions according to the dialogue context. By analyzing the current responses of agents, ACE selects appropriate acts from a predefined Atomic Discussion Acts Library (ADAL), which are composed into a discussion action to be executed in the next round, to enable truly dynamic debate. We conduct extensive experiments on the challenging Big-Bench Hard (BBH) benchmark. ACE achieves state-of-the-art results on 17 out of 23 tasks, with an average performance gain of 8.5% across all tasks, demonstrating the effectiveness and robustness of our approach.

AAAI Conference 2026 Conference Paper

Beyond Adapter Retrieval: Latent Geometry-Preserving Composition via Sparse Task Projection

  • Pengfei Jin
  • Peng Shu
  • Sifan Song
  • Sekeun Kim
  • Qing Xiao
  • Cheng Chen
  • Tianming Liu
  • Xiang Li

Recent advances in parameter-efficient transfer learning have demonstrated the utility of composing LoRA adapters from libraries of pretrained modules. However, most existing approaches rely on simple retrieval heuristics or uniform averaging, which overlook the latent structure of task relationships in representation space. We propose a new framework for adapter reuse that moves beyond retrieval, formulating adapter composition as a geometry-aware sparse reconstruction problem. Specifically, we represent each task by a latent prototype vector derived from the base model’s encoder and aim to approximate the target task prototype as a sparse linear combination of retrieved reference prototypes, under an L1-regularized optimization objective. The resulting combination weights are then used to blend the corresponding LoRA adapters, yielding a composite adapter tailored to the target task. This formulation not only preserves the local geometric structure of the task representation manifold, but also promotes interpretability and efficient reuse by selecting a minimal set of relevant adapters. We demonstrate the effectiveness of our approach across multiple domains—including medical image segmentation, medical report generation and image synthesis. Our results highlight the benefit of coupling retrieval with latent geometry-aware optimization for improved zero-shot generalization.
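
A minimal sketch of the sparse-reconstruction step described above, assuming numpy prototypes and a toy dictionary of LoRA factors; the helper name compose_adapter, the "A"/"B" keys, and the choice of scikit-learn's Lasso as the L1 solver are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def compose_adapter(target_proto, ref_protos, ref_adapters, l1=0.05):
    """Approximate the target task prototype as a sparse linear combination of
    reference prototypes, then blend the matching LoRA adapters accordingly.

    target_proto : (d,) latent prototype of the target task
    ref_protos   : (k, d) prototypes of the k reference tasks
    ref_adapters : list of k dicts, layer name -> {"A": (r, d), "B": (d, r)}
    """
    # L1-regularized reconstruction: target ~= ref_protos.T @ w, with w sparse.
    lasso = Lasso(alpha=l1, fit_intercept=False, max_iter=10000)
    lasso.fit(ref_protos.T, target_proto)          # design matrix is (d, k)
    w = lasso.coef_                                 # (k,) sparse combination weights

    blended = {}
    for name in ref_adapters[0]:
        # Blend the full low-rank updates Delta W = B @ A (one of several
        # possible blending choices; shown here only for concreteness).
        delta = sum(w[i] * (ref_adapters[i][name]["B"] @ ref_adapters[i][name]["A"])
                    for i in range(len(w)))
        blended[name] = delta
    return blended, w

# Tiny illustration with random data: 4 reference tasks, 16-dim prototypes,
# one LoRA layer "q_proj" with rank-2 factors.
rng = np.random.default_rng(0)
protos = rng.normal(size=(4, 16))
target = 0.7 * protos[0] + 0.3 * protos[2]
adapters = [{"q_proj": {"A": rng.normal(size=(2, 16)), "B": rng.normal(size=(16, 2))}}
            for _ in range(4)]
blended, w = compose_adapter(target, protos, adapters)
print(w.round(2), blended["q_proj"].shape)
```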

AAAI Conference 2026 Conference Paper

DenoDet V2: Phase-Amplitude Cross Denoising for SAR Object Detection

  • Kang Ni
  • Minrui Zou
  • Yuxuan Li
  • Xiang Li
  • Kehua Guo
  • Ming-Ming Cheng
  • Yimian Dai

One of the primary challenges in Synthetic Aperture Radar (SAR) object detection lies in the pervasive influence of coherent noise. As a common practice, most existing methods, whether handcrafted approaches or deep learning-based methods, employ the analysis or enhancement of object spatial-domain characteristics to achieve implicit denoising. In this paper, we propose DenoDet V2, which explores a completely novel and different perspective to deconstruct and modulate the features in the transform domain via a carefully designed attention architecture. Compared to DenoDet V1, DenoDet V2 is a major advancement that exploits the complementary nature of amplitude and phase information through a band-wise mutual modulation mechanism, which enables a reciprocal enhancement between phase and amplitude spectra. Extensive experiments on various SAR datasets demonstrate the state-of-the-art performance of DenoDet V2. Notably, DenoDet V2 achieves a significant 0.8% improvement on SARDet-100K dataset compared to DenoDet V1, while reducing the model complexity by half.
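
The abstract works in the transform domain on amplitude and phase spectra; the sketch below only shows that decomposition and its inverse with numpy's FFT, as a hedged illustration of the representation DenoDet V2 modulates. The attention-based band-wise mutual modulation itself is omitted.

```python
import numpy as np

def amplitude_phase_split(feat):
    """Split a 2-D feature map into amplitude and phase spectra."""
    spec = np.fft.fft2(feat)
    return np.abs(spec), np.angle(spec)

def reconstruct(amplitude, phase):
    # Recombine the (possibly modulated) spectra and return to the spatial domain.
    return np.fft.ifft2(amplitude * np.exp(1j * phase)).real

feat = np.random.randn(64, 64)
amp, pha = amplitude_phase_split(feat)
assert np.allclose(reconstruct(amp, pha), feat)   # lossless round trip
```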

AAAI Conference 2026 Conference Paper

Efficient Transcoder Adaptation for Fine-Tuned Models: Revealing Medical Reasoning Mechanisms in Large Language Models

  • Zhouxing Tan
  • Hanlin Xue
  • Yulong Wan
  • Ruochong Xiong
  • Xu Chu
  • Xiang Li
  • Junfei Liu

Large language models (LLMs) suffer from a lack of decision-making transparency, limiting their deployment in high-stakes domains such as healthcare. We propose a mechanistic interpretability framework that introduces two novel paradigms: Medical Fine-Tuning with Frozen Attention Layers (FTFA) and Posterior Adaptation Transcoders (PAT). FTFA freezes attention layers while fine-tuning only feed-forward network (FFN) parameters, enabling PAT to efficiently adapt pre-trained transcoders on the same data. This approach achieves over 1000× efficiency improvement compared to training transcoders from scratch. We theoretically justify this methodology and demonstrate its cost-effectiveness for cross-domain transfer. Transcoders are sparse autoencoders that replace MLP layers to provide interpretable feature representations. By substituting MLP layers of both base Gemma2-2b and its medical fine-tuned variant with per-layer transcoders, we enable feature-level attribution analysis. Through systematic pruning and node merging of resulting attribution graphs, we construct human-interpretable decision pathways. Our analysis reveals that LLMs employ two parallel mechanisms for medical diagnosis: pattern matching and multi-hop reasoning, with fine-tuned models demonstrating enhanced correct reasoning patterns. This work provides a practical framework for training transcoders on fine-tuned models at minimal cost, enabling broader application of mechanistic interpretability across domains and potentially guiding model training through transcoder-based analysis.

AAAI Conference 2026 Conference Paper

Ego-PMOVE: Prompt-aware Mixture of View Experts Network for Egocentric Gaze Prediction

  • Heqian Qiu
  • Lanxiao Wang
  • Taijin Zhao
  • Zhaofeng Shi
  • Xiang Li
  • Linfeng Xu
  • Hongliang Li

Egocentric gaze prediction serves as a critical indicator for decoding human visual attention and cognitive processes, but its inherently limited field of view creates prediction challenges. Although exo-view data provides supplementary contextual information, it exhibits significant spatial and semantic gaps. Existing methods focus solely on isolated feature encoding in single-view paradigms, neglecting cross-view gaze correlations. To bridge this gap, we make the first exploration of the cross-view gaze relationship for egocentric gaze prediction, and propose Ego-PMOVE, a novel Prompt-aware Mixture of View Experts network. Unlike prior cross-view studies that forcibly align cross-view features thereby introducing inference noise, we leverage the popular Mixture-of-Experts (MoE) and a set of flexible prompts to disentangle features from different views into three parallel experts: a view-shared expert directly modeling common semantic relationships, a view-discrepancy expert adaptively adjusting the spatial position, scale and shifts based on different view-specific features, and an egocentric expert extracting independent features to compensate for the case of missing exocentric data. To balance these experts, we further design a soft router to dynamically weight them for mining useful information while suppressing noise. A view-query gaze decoder then generates view-specific gaze attention maps, jointly optimized by gaze-heatmap and cross-view contrastive losses that regularize both shared and divergent features for accurate gaze prediction. Extensive experiments across the multi-view EgoMe dataset and single-view Ego4D and EGTEA Gaze++ datasets demonstrate the effectiveness and generalizability of our approach.

AAAI Conference 2026 Conference Paper

GeoBayes: Probabilistic Image Geo-Localization Inference via Sequential Bayesian Updating

  • Weimin Shi
  • Xiang Li
  • Kaige Li
  • Junhao Fang
  • Qiang Zhou
  • Qichuan Geng
  • Zhong Zhou

Image geo-localization aims to determine the geographic location of a query image. While Multimodal Large Language Models (MLLMs) show potential for this task due to their rich world knowledge and explainable abilities, they often struggle with confirmation bias, i.e., committing to early, potentially incorrect guesses driven by visual clues with varied geographic likelihoods. In this paper, we propose GeoBayes, a novel training-free framework that formulates geo-localization as a Maximum a Posteriori (MAP) estimation task over multiple geographic hypotheses and performs probabilistic inference via sequential Bayesian reasoning. GeoBayes treats each visual object and its associated geographic clues as probabilistic evidence, integrating them iteratively through a Hypothesize–Verify–Update loop. At each step, it evaluates how new evidence supports existing hypotheses and updates their posterior probabilities, gradually converging on the most probable location. This allows GeoBayes to explicitly quantify and fuse the varied geographic probabilities implied by various visual elements, reducing the risk of overcommitting to misleading clues. Furthermore, considering the natural hierarchy of geographic labels (e.g., country, city), GeoBayes introduces a state memory mechanism that stores hypotheses, inference context, and evidence scores across levels. This design enables the framework to propagate prior knowledge across levels of the geographic hierarchy and incorporate geographic structural constraints into the Bayesian update process, achieving a coarse-to-fine geo-localization. Experiments on IM2GPS3k and YFCC4K show that GeoBayes improves MLLM-based geo-localization accuracy without extra training. This demonstrates the effectiveness of probabilistic reasoning for robust and interpretable geo-localization.
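
A toy illustration of the Hypothesize-Verify-Update idea as plain Bayesian updating over a fixed hypothesis set; the city names and likelihood numbers below are invented for illustration, whereas in GeoBayes the likelihoods would come from the MLLM's verification of each visual clue.

```python
import numpy as np

def bayes_update(prior, likelihoods):
    """One update step: multiply the prior over location hypotheses by the
    evidence likelihoods and renormalize."""
    posterior = prior * likelihoods
    return posterior / posterior.sum()

hypotheses = ["Lisbon", "Porto", "Seville"]
belief = np.full(3, 1 / 3)                            # uniform prior
for clue_likelihood in [np.array([0.6, 0.3, 0.1]),    # e.g. tram style
                        np.array([0.5, 0.4, 0.1])]:   # e.g. tiled facade
    belief = bayes_update(belief, clue_likelihood)

print(dict(zip(hypotheses, belief.round(3))))          # posterior over hypotheses
```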

AAAI Conference 2026 Conference Paper

GigaMoE: Sparsity-Guided Mixture of Experts for Efficient Gigapixel Object Detection

  • Xiang Li
  • Wenxi Li
  • Yuetong Wang
  • Chenyang Lyu
  • Haozhe Lin
  • Guiguang Ding
  • Yuchen Guo

Object detection in High-Resolution Wide (HRW) shots, or gigapixel images, presents unique challenges due to extreme object sparsity and vast scale variations. State-of-the-art methods like SparseFormer have pioneered sparse processing by selectively focusing on important regions, yet they apply a uniform computational model to all selected regions, overlooking their intrinsic complexity differences. This leads to a suboptimal trade-off between performance and efficiency. In this paper, we introduce GigaMoE, a novel backbone architecture that pioneers adaptive computation for this domain by replacing the standard Feed-Forward Networks (FFNs) with a Mixture-of-Experts (MoE) module. Our architecture first employs a shared expert to provide a robust feature baseline for all selected regions. Upon this foundation, our core innovation, a novel Sparsity-Guided Routing mechanism, insightfully repurposes importance scores from the sparse backbone to provide a "computational bonus," dynamically engaging a variable number of specialized experts based on content complexity. The entire system is trained efficiently via a loss-free load-balancing technique, eliminating the need for cumbersome auxiliary losses. Extensive experiments show that GigaMoE sets a new state-of-the-art on the PANDA benchmark, improving detection accuracy by 1.1% over SparseFormer while simultaneously reducing the computational cost (FLOPs) by a remarkable 32.3%.
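
A rough sketch of what a sparsity-guided router could look like: every region gets the shared expert, and the number of extra specialists grows with the backbone's importance score (the "computational bonus"). The routing rule, function names, and score range are assumptions for illustration, not GigaMoE's actual mechanism.

```python
import numpy as np

def route_regions(gate_logits, importance, max_extra=3):
    """Toy router: pick a top-k of specialist experts per region, where k grows
    with the region's importance score in [0, 1]."""
    probs = np.exp(gate_logits - gate_logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    routes = []
    for p, s in zip(probs, importance):
        k = 1 + int(round(s * max_extra))          # more experts for harder regions
        experts = np.argsort(p)[::-1][:k]          # top-k specialist experts
        weights = p[experts] / p[experts].sum()    # renormalized gate weights
        routes.append((experts, weights))
    return routes

gate = np.random.randn(4, 8)                       # 4 regions, 8 specialist experts
imp = np.array([0.1, 0.4, 0.7, 1.0])               # backbone importance scores
for i, (e, w) in enumerate(route_regions(gate, imp)):
    print(f"region {i}: experts {e.tolist()} weights {w.round(2).tolist()}")
```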

AAAI Conference 2026 Conference Paper

Human Cognition Inspired RAG with Knowledge Graph for Complex Problem Solving

  • Yao Cheng
  • Yibo Zhao
  • Jiapeng Zhu
  • Yao Liu
  • Xing Sun
  • Xiang Li

Large Language Models (LLMs) have demonstrated significant potential across various domains. However, they often struggle with integrating external knowledge and performing complex reasoning, leading to hallucinations and unreliable outputs. Retrieval Augmented Generation (RAG) has emerged as a promising paradigm to mitigate these issues by incorporating external knowledge. Yet, conventional RAG approaches, especially those based on vector similarity, fail to effectively capture relational dependencies and support multi-step reasoning. In this work, we propose CogGRAG, a human cognition-inspired, graph-based RAG framework designed for Knowledge Graph Question Answering (KGQA). CogGRAG models the reasoning process as a tree-structured mind map that decomposes the original problem into interrelated subproblems and explicitly encodes their semantic relationships. This structure not only provides a global view to guide subsequent retrieval and reasoning but also enables self-consistent verification across reasoning paths. The framework operates in three stages: (1) top-down problem decomposition via mind map construction, (2) structured retrieval of both local and global knowledge from external Knowledge Graphs (KGs), and (3) bottom-up reasoning with dual-process self-verification. Unlike previous tree-based decomposition methods such as MindMap or Graph-CoT, CogGRAG unifies problem decomposition, knowledge retrieval, and reasoning under a single graph-structured cognitive framework, allowing early integration of relational knowledge and adaptive verification. Extensive experiments demonstrate that CogGRAG achieves superior accuracy and reliability compared to existing methods.

AAAI Conference 2026 Conference Paper

LPPG-RL: Lexicographically Projected Policy Gradient Reinforcement Learning with Subproblem Exploration

  • Ruiyu Qiu
  • Rui Wang
  • Guanghui Yang
  • Xiang Li
  • Zhijiang Shao

Lexicographic multi-objective problems, which consist of multiple conflicting subtasks with explicit priorities, are common in real-world applications. Despite the advantages of Reinforcement Learning (RL) in single tasks, extending conventional RL methods to prioritized multiple objectives remains challenging. In particular, traditional Safe RL and Multi-Objective RL (MORL) methods have difficulty enforcing priority orderings efficiently. Therefore, Lexicographic Multi-Objective RL (LMORL) methods have been developed to address these challenges. However, existing LMORL methods either rely on heuristic threshold tuning with prior knowledge or are restricted to discrete domains. To overcome these limitations, we propose Lexicographically Projected Policy Gradient RL (LPPG-RL), a novel LMORL framework which leverages sequential gradient projections to identify feasible policy update directions, thereby making LPPG-RL broadly compatible with all policy gradient algorithms in continuous spaces. LPPG-RL reformulates the projection step as an optimization problem, and utilizes Dykstra's projection rather than generic solvers to deliver great speedups, especially for small- to medium-scale instances. In addition, LPPG-RL introduces Subproblem Exploration (SE) to prevent gradient vanishing, accelerate convergence and enhance stability. We provide theoretical guarantees for convergence and establish a lower bound on policy improvement. Finally, through extensive experiments in a 2D navigation environment, we demonstrate the effectiveness of LPPG-RL, showing that it outperforms existing state-of-the-art continuous LMORL methods.
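
The projection step can be pictured as finding the closest update direction that does not decrease any higher-priority objective. Below is a small Dykstra's-projection sketch onto an intersection of halfspaces, a simplified stand-in under that reading; the constraint form and example gradients are assumptions, not the paper's exact formulation.

```python
import numpy as np

def project_halfspace(x, a):
    """Euclidean projection of x onto the halfspace {d : a @ d >= 0}."""
    v = a @ x
    return x if v >= 0 else x - (v / (a @ a)) * a

def dykstra_project(grad, constraints, n_iter=200):
    """Dykstra's algorithm: project `grad` onto the intersection of the
    halfspaces {d : g_i @ d >= 0}, i.e. the closest update direction that keeps
    all higher-priority objectives non-decreasing (to first order)."""
    x = grad.copy()
    increments = [np.zeros_like(grad) for _ in constraints]
    for _ in range(n_iter):
        for i, g in enumerate(constraints):
            y = project_halfspace(x + increments[i], g)
            increments[i] = x + increments[i] - y
            x = y
    return x

grad = np.array([-0.5, -1.0])                            # low-priority update direction
higher = [np.array([1.0, 0.0]), np.array([1.0, 1.0])]    # higher-priority gradients
d = dykstra_project(grad, higher)
print(d.round(3), [float(g @ d) >= -1e-9 for g in higher])   # direction respects both constraints
```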

AAAI Conference 2026 Conference Paper

Multiplex Heterogeneous Graph Neural Networks with Euclidean-Riemannian Mutual Space Synergy

  • Xiang Li
  • Yuan Cao
  • Zhongying Zhao
  • Guoqing Chao
  • Yanwei Yu

Multiplex heterogeneous networks are common in real-world scenarios, where entities interact through diverse types of relations across multiple semantic layers. Recent advances in multiplex heterogeneous graph neural networks have achieved remarkable results by incorporating node and relation types into message passing and designing relation-aware architectures. However, most existing methods either decouple relations and risk losing complex semantics or require handcrafted relation patterns, which limit scalability. Moreover, prevailing models are typically restricted to Euclidean space, making it difficult to capture non-Euclidean topologies and to distinguish complex interactions among heterogeneous nodes and relations. Standard GNN message passing, grounded in the homophily assumption, also proves inadequate for the intricate, coupled structures in multiplex heterogeneous graphs. To address these challenges, we propose MRiemGNN, a novel multiplex heterogeneous graph neural network that synergizes Euclidean and Riemannian spaces through a geometry-aware, relation-specific message passing scheme and cross-space mutual learning. Experiments on multiple real-world datasets show that MRiemGNN achieves superior performance, efficiency, and scalability on both node classification and link prediction tasks.

AAAI Conference 2026 Conference Paper

SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection

  • Yuxuan Li
  • Xiang Li
  • Yunheng Li
  • Yicheng Zhang
  • Yimian Dai
  • Qibin Hou
  • Ming-Ming Cheng
  • Jian Yang

With the rapid advancement of remote sensing technology, high-resolution multi-modal imagery is now more widely accessible. Conventional object detection models are trained on a single dataset, often restricted to a specific imaging modality and annotation format. However, such an approach overlooks the valuable shared knowledge across multi-modalities and limits the model’s applicability in more versatile scenarios. This paper introduces a new task called Multi-Modal Datasets and Multi-Task Object Detection (M2Det) for remote sensing, designed to accurately detect horizontal or oriented objects from any sensor modality. This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization. To address these, we establish a benchmark dataset and propose a unified model, SM3Det (Single Model for Multi-Modal datasets and Multi-Task object Detection). SM3Det leverages a grid-level sparse MoE backbone to enable joint knowledge learning while preserving distinct feature representations for different modalities. Furthermore, we propose a novel consistency and synchronization optimization mechanism, allowing it to effectively handle varying levels of learning difficulty across modalities and tasks. Extensive experiments demonstrate SM3Det's effectiveness and generalizability, consistently outperforming the combination of specialized models on individual datasets.

AAAI Conference 2026 Conference Paper

SpatioTemporal Difference Network for Video Depth Super-Resolution

  • Zhengxue Wang
  • Yuan Wu
  • Xiang Li
  • Zhiqiang Yan
  • Jian Yang

Depth super-resolution has achieved impressive performance, and the incorporation of multi-frame information further enhances reconstruction quality. Nevertheless, statistical analyses reveal that video depth super-resolution remains affected by pronounced long-tailed distributions, with the long-tailed effects primarily manifesting in spatial non-smooth regions and temporal variation zones. To address these challenges, we propose a novel SpatioTemporal Difference Network (STDNet) comprising two core branches: a spatial difference branch and a temporal difference branch. In the spatial difference branch, we introduce a spatial difference mechanism to mitigate the long-tailed issues in spatial non-smooth regions. This mechanism dynamically aligns RGB features with learned spatial difference representations, enabling intra-frame RGB-D aggregation for depth calibration. In the temporal difference branch, we further design a temporal difference strategy that preferentially propagates temporal variation information from adjacent RGB and depth frames to the current depth frame, leveraging temporal difference representations to achieve precise motion compensation in temporal long-tailed areas. Extensive experimental results across multiple datasets demonstrate the effectiveness of our STDNet, outperforming existing approaches.

AAAI Conference 2026 Conference Paper

Strip R-CNN: Large Strip Convolution for Remote Sensing Object Detection

  • Xinbin Yuan
  • Zhaohui Zheng
  • Yuxuan Li
  • Xialei Liu
  • Li Liu
  • Xiang Li
  • Qibin Hou
  • Ming-Ming Cheng

In this paper, we show that current approaches using large square kernels or transformer-based global modeling aggregate contextual information uniformly across spatial dimensions, leading to feature dilution and localization errors for elongated targets. To mitigate this issue, we propose Strip R-CNN, the first work to systematically explore large strip convolutions for remote sensing object detection. Our key insight is that strip convolutions enable directional feature aggregation along the dominant spatial dimension of slender objects, reducing background interference while preserving essential geometric information. We design two core components: (i) StripNet, a backbone network employing sequential orthogonal large strip convolutions to capture anisotropic spatial patterns, and (ii) Strip Head, which enhances localization precision by incorporating strip convolutions into the detection head. Unlike previous large-kernel approaches that suffer from computational redundancy and isotropic limitations, our method achieves superior performance with remarkable efficiency. Extensive experiments on multiple benchmarks (DOTA, FAIR1M, HRSC2016, and DIOR) demonstrate significant improvements, with our 30M parameter model achieving 82.75% mAP on DOTA-v1.0, establishing a new state-of-the-art record while providing new insights into anisotropic feature learning for remote sensing applications.
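
A minimal PyTorch sketch of the strip-convolution idea: sequential 1xK and Kx1 depthwise convolutions plus a pointwise mix, so context is aggregated along one spatial axis at a time rather than over a KxK square. The block structure, kernel size, and residual placement are illustrative choices, not the exact StripNet design.

```python
import torch
import torch.nn as nn

class StripConvBlock(nn.Module):
    """Large strip convolution: horizontal then vertical depthwise kernels."""
    def __init__(self, channels, k=19):
        super().__init__()
        self.horizontal = nn.Conv2d(channels, channels, (1, k),
                                    padding=(0, k // 2), groups=channels)
        self.vertical = nn.Conv2d(channels, channels, (k, 1),
                                  padding=(k // 2, 0), groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)   # channel mixing

    def forward(self, x):
        return x + self.pointwise(self.vertical(self.horizontal(x)))

x = torch.randn(1, 32, 64, 64)
print(StripConvBlock(32)(x).shape)   # torch.Size([1, 32, 64, 64])
```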

AAAI Conference 2026 Conference Paper

TCoT: Trajectory Chain-of-Thoughts for Robotic Manipulation with Failure Recovery in Vision-Language-Action Model

  • Xiang Li
  • Ya-Li Li
  • Yuan Wang
  • Huaqiang Wang
  • Shengjin Wang

Recent advances in vision-language-action (VLA) models have demonstrated impressive generalization for robotic manipulation. However, these models often operate by directly mapping visual and linguistic inputs to subsequent actions, lacking intermediate task planning as well as failure detection and recovery abilities. These limitations prevent them from effectively decomposing complex tasks, recognizing problems, and correcting erroneous actions, ultimately resulting in complete task failure. This significantly hinders both their ability to perform long-horizon tasks and their ability to generalize. To this end, we introduce TCoT: Trajectory Chain-of-Thought, a unified VLA framework that enhances this direct mapping with trajectory planning as well as failure detection and recovery. TCoT leverages hierarchical trajectories as a precise and compact representation of CoT reasoning for manipulation: global planning provides a high-level, goal-oriented trajectory to guide the robot toward its task objective, while local planning focuses on real-time adjustments to address dynamic changes. Moreover, we design a Global-Local Switching Recovery algorithm that detects failures and effectively recovers from them. Experimental results reveal that TCoT surpasses the state-of-the-art methods across both real and simulated scenarios and exhibits superior generalization capabilities.

AAAI Conference 2026 Conference Paper

Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models

  • Meng Cao
  • Pengfei Hu
  • Yingyao Wang
  • Jihao Gu
  • Haoran Tang
  • Haoze Zhao
  • Chen Wang
  • Jiahua Dong

Recent advancements in Large Video Language Models (LVLMs) have highlighted their potential for multi-modal understanding, yet evaluating their factual grounding in videos remains a critical unsolved challenge. To address this gap, we introduce Video SimpleQA, the first comprehensive benchmark tailored for factuality evaluation in video contexts. Our work differs from existing video benchmarks through the following key features: 1) Knowledge required: demanding integration of external knowledge beyond the video’s explicit narrative; 2) Multi-hop fact-seeking questions: Each question involves multiple explicit facts and requires strict factual grounding without hypothetical or subjective inferences. We include per-hop single-fact-based sub-QAs alongside final QAs to enable fine-grained, step-by-step evaluation; 3) Short-form definitive answer: Answers are crafted as unambiguous and definitively correct in a short format with minimal scoring variance; 4) Temporal grounding required: Requiring answers to rely on one or more temporal segments in videos, rather than single frames. We extensively evaluate 33 state-of-the-art LVLMs and summarize key findings as follows: 1) Current LVLMs exhibit notable deficiencies in factual adherence, with the best-performing model o3 merely achieving an F-score of 66.3%; 2) Most LVLMs are overconfident in what they generate, with self-stated confidence exceeding actual accuracy; 3) Retrieval-Augmented Generation demonstrates consistent improvements at the cost of additional inference time overhead; 4) Multi-hop QA demonstrates substantially degraded performance compared to single-hop sub-QAs, with first-hop object/event recognition emerging as the primary bottleneck. We position Video SimpleQA as the cornerstone benchmark for video factuality assessment, aiming to steer LVLM development toward verifiable grounding in real-world contexts.

JBHI Journal 2026 Journal Article

WGB-GLFI: A Novel Graph-Based Global-Local Feature Interaction Framework for Automated Seizure Detection

  • Xiang Li
  • Mingxing Zhu
  • Chuqi Yang
  • Ke Zhang
  • Xin Wang
  • Sunday Timothy Aboyeji
  • Fei Chen
  • Chen Yao

Epilepsy detection faces significant challenges due to unpredictable seizures, ranging from brief awareness lapses to severe convulsions, posing risks to patients' safety and quality of life. In recent years, deep learning has become a mainstream approach in this field, leveraging advanced computational resources and EEG datasets. However, a key challenge remains: existing methods often lack unified spatial modeling and struggle to effectively handle local detailed features, thereby limiting their accuracy and robustness. To address these issues, we propose the Weighted Graph Building Global-Local Feature Interaction (WGB-GLFI) framework, which integrates spatial connectivity and dynamic patterns through a Weighted Graph Building (WGB) module and a Global-Local Feature Interaction (GLFI) module. This approach excels by comprehensively capturing the dynamic spatial relationships during epileptic seizures and achieving seamless global-local feature integration, significantly enhancing seizure detection performance. Its effectiveness has been validated across multiple datasets, including CHB-MIT, Siena Scalp, and private datasets, demonstrating robust and reliable results. Evaluated on these datasets, our model achieves accuracy rates of 99.28%, 99.21%, and 99.30%, respectively. The reliability and robustness of our framework provide epilepsy patients with faster and more reliable seizure detection, which helps to intervene in a timely manner and improve the quality of life of patients.

ECAI Conference 2025 Conference Paper

ASMA-Tune: Unlocking LLMs' Assembly Code Comprehension via Structural-Semantic Instruction Tuning

  • Xinyi Wang
  • Jiashui Wang
  • Jinbo Su
  • Ke Wang
  • Peng Chen
  • Yanming Liu
  • Long Liu
  • Xiang Li

Assembly code analysis and comprehension play critical roles in applications like reverse engineering, yet they face substantial challenges due to low information density and a lack of explicit syntactic structures. While traditional masked language modeling (MLM) approaches do not explicitly focus on natural language interaction, emerging decoder-focused large language models (LLMs) demonstrate partial success in binary analysis yet remain underexplored for holistic comprehension. We present Assembly Augmented Tuning (ASMA-Tune), an end-to-end structural-semantic instruction tuning framework that synergizes encoder architecture with decoder-based LLMs through a projector module, where the assembly encoder extracts hardware-level structural features, the projector bridges representations with the semantic space, and the instruction-tuned LLM preserves natural language capabilities. Experimental results demonstrate three key advantages: (1) State-of-the-art performance in assembly comprehension with +39.7% Recall@1 and +17.8% MRR improvements over GPT-4-Turbo, (2) Consistent enhancements across base models (24.6–107.4% Recall@1 and 15.2–106.3% MRR on Qwen2.5-Coder, Deepseek-Coder and CodeLlama variants), and (3) Superior instruction-following capabilities (41.5%–118% improvements) with controlled code generation degradation (–8.9% to –35% across architectures).

IROS Conference 2025 Conference Paper

BookBot: A Robotic Manipulation Benchmark for Voice-Driven Book Recognition and Grasping in Cluttered Environments

  • Huaqiang Wang
  • Yuan Wang
  • Xiang Li
  • Yali Li
  • Shengjin Wang

Books, as enduring repositories of cultural heritage as well as knowledge, play a fundamental role in human development. Although advances in embodied AI and robotics revolutionize automation in domains, e.g., manufacturing and logistics, robotic book manipulation remains an underexplored frontier. Two primary bottlenecks impede progress: (1) scarcity of fine-grained annotated datasets for benchmarking robotic book manipulation, and (2) lack of unified perception-action frameworks capable of dynamically coupling multi-modal sensing and manipulation in real-world scenarios. To address these issues, we present THU-Book, the first open-access benchmark featuring 643 3D scene captures, encompassing 11,298 high-fidelity book instances with rich annotations to support tasks from book recognition and localization to grasping and repositioning. Building upon this foundation, we develop BookBot, a novel voice-interactive book manipulation pipeline to support cross-environmental, multilingual, and multi-categorical book manipulation. First, we utilize Large Language Models (LLMs) to parse and comprehend ambiguity in user instructions. We further propose an instance segmentation module combined with an OCR tool to link language to visual instances. Finally, we introduce a PCA-based manipulation policy to refine the robotic grasp pose, utilizing the principal components of the books’ geometry, improving the precision and efficiency of grasping. Experiments conducted on the THU-Book benchmark validate the effectiveness of our BookBot. The dataset is available at https://github.com/wanghq-public/BookBot.
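
A small numpy sketch of the PCA ingredient only: the principal components of a book's point cloud give candidate gripper axes (long edge, short edge, thickness). The synthetic box data and function name are assumptions; BookBot's full policy also involves segmentation, OCR grounding, and grasp refinement, which are not shown.

```python
import numpy as np

def grasp_axes_from_points(points):
    """Return the centroid and the three principal directions of a point cloud,
    sorted by decreasing variance (illustrative PCA step for gripper alignment)."""
    center = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - center, full_matrices=False)
    long_edge, short_edge, thickness = vt          # rows are principal directions
    return center, long_edge, short_edge, thickness

# Synthetic "book": points filling a 20cm x 13cm x 2cm box.
pts = np.random.uniform([-0.10, -0.065, -0.01], [0.10, 0.065, 0.01], size=(5000, 3))
center, e1, e2, e3 = grasp_axes_from_points(pts)
print(center.round(3), e1.round(2))                # e1 ~ +/-[1, 0, 0], the long axis
```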

JBHI Journal 2025 Journal Article

Characterization of Cortical Connectivity in the Deception State With a Data-Driven Network Model Based on EEG Signal

  • Qianruo Kang
  • Yaqian Li
  • Xiang Li
  • Min Tian
  • Yin Xiang
  • Feng Li
  • Siyu Peng
  • Yijun Xiong

This study investigates the pattern of information interaction at the cortical level during deception, aiming to reveal the cognitive processes involved in the deception task. Our study involves the 64-channel EEG signals of 28 subjects (14 for innocent and 14 for guilty groups) acquired under the guilty knowledge test (GKT) lie-detection protocol. Additionally, we establish the functional connectivity network at the cortical level considering volume conduction effects, use a data-driven approach to select the regions of interest (ROIs) on the subject's cortex based on scalp electrical activity, and perform cortical current density estimation on 15 ROIs. The nonlinear dependence between the cortical waveforms of the ROIs is quantified based on mutual information, and a network of cortical mutual information connections is constructed in four frequency bands: delta, theta, alpha, and beta. The feature extraction and classification process are performed in each frequency band, and the mutual information connections statistically different between the innocent and guilty groups are first selected as features using statistical tests. Moreover, the optimal feature subset (OFS) is found by combining the SVM classifier and the wrapper feature selection strategy. Furthermore, the most important mutual information connections (MIMICs) per frequency band are obtained by refining the OFS according to the classification performance curve. The average test accuracies of MIMICs in the delta, theta, alpha, and beta bands reached 99.76%, 96.42%, 84.04%, and 97.61%, respectively. Finally, the physiological significance of each frequency sub-band and the physiological function of MIMICs are combined to explore the cognitive mechanism of lies and provide new evidence for cognitive activity in lying states.

UAI Conference 2025 Conference Paper

Corruption-Robust Variance-aware Algorithms for Generalized Linear Bandits under Heavy-tailed Rewards

  • Qingyuan Yu
  • Euijin Baek
  • Xiang Li
  • Qiang Sun

Stochastic linear bandits have recently received significant attention in sequential decision-making. However, real-world challenges such as heavy-tailed noise, reward corruption, and nonlinear reward functions remain difficult to address. To tackle these difficulties, we propose GAdaOFUL, a novel algorithm that leverages adaptive Huber regression to achieve robustness in generalized linear models (GLMs), where rewards can be nonlinear functions of features. GAdaOFUL achieves a state-of-the-art variance-aware regret bound, scaling with the square root of the cumulative reward variance over time, plus an additional term proportional to the level of corruption. The algorithm adapts to problem complexity, yielding improved regret when the cumulative variance is small. Simulation results demonstrate the robustness and effectiveness of GAdaOFUL in practice. The code is available at https://github.com/NeXAIS/GAdaOFUL.
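
A toy illustration of the robust-regression ingredient only, comparing ordinary least squares with Huber regression (scikit-learn, fixed threshold) under heavy-tailed reward noise; GAdaOFUL's adaptive thresholding, GLM link, and bandit machinery are not shown, and the data below is synthetic.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
theta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(500, 3))
y = X @ theta_true + rng.standard_t(df=1.5, size=500)   # heavy-tailed reward noise

ols = LinearRegression().fit(X, y)
huber = HuberRegressor(epsilon=1.35, max_iter=1000).fit(X, y)   # fixed threshold here;
                                                                # adaptive Huber tunes it
print("OLS error  :", np.linalg.norm(ols.coef_ - theta_true).round(3))
print("Huber error:", np.linalg.norm(huber.coef_ - theta_true).round(3))
```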

AAAI Conference 2025 Conference Paper

Coupling-based Convergence Diagnostic and Stepsize Scheme for Stochastic Gradient Descent

  • Xiang Li
  • Qiaomin Xie

The convergence behavior of Stochastic Gradient Descent (SGD) crucially depends on the stepsize configuration. When using a constant stepsize, the SGD iterates form a Markov chain, enjoying fast convergence during the initial transient phase. However, when reaching stationarity, the iterates oscillate around the optimum without making further progress. In this paper, we study the convergence diagnostics for SGD with constant stepsize, aiming to develop an effective dynamic stepsize scheme. We propose a novel coupling-based convergence diagnostic procedure, which monitors the distance of two coupled SGD iterates for stationarity detection. Our diagnostic statistic is simple and is shown to track the transition from transience to stationarity theoretically. We conduct extensive numerical experiments and compare our method against various existing approaches. Our proposed coupling-based stepsize scheme is observed to achieve superior performance across a diverse set of convex and non-convex problems. Moreover, our results demonstrate the robustness of our approach to a wide range of hyperparameters.
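
A hedged sketch of the coupling idea on a least-squares toy problem: two SGD chains start from different points but share the same sample sequence; once they (nearly) meet, the transient phase is declared over and the stepsize is halved. The meeting threshold, restart rule, and halving factor are illustrative choices, not the paper's diagnostic statistic or schedule.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
b = A @ w_true + 0.5 * rng.normal(size=200)

def sgd_grad(w, idx):
    a, y = A[idx], b[idx]
    return (a @ w - y) * a                     # single-sample least-squares gradient

# Two coupled iterates: different starting points, identical sample sequence.
w1, w2 = np.zeros(5), rng.normal(size=5)
step = 0.1
for t in range(20000):
    idx = rng.integers(len(b))                 # shared sample index -> shared noise
    w1 = w1 - step * sgd_grad(w1, idx)
    w2 = w2 - step * sgd_grad(w2, idx)
    if np.linalg.norm(w1 - w2) < 1e-3:         # the coupled iterates have (nearly) met:
        step *= 0.5                            # transient over, so shrink the stepsize
        w2 = w1 + rng.normal(size=5)           # relaunch the companion iterate
print(round(step, 5), np.linalg.norm(w1 - w_true).round(3))
```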

NeurIPS Conference 2025 Conference Paper

Cue3D: Quantifying the Role of Image Cues in Single-Image 3D Generation

  • Xiang Li
  • Zirui Wang
  • Zixuan Huang
  • James Rehg

Humans and traditional computer vision methods rely on a diverse set of monocular cues to infer 3D structure from a single image, such as shading, texture, silhouette, etc. While recent deep generative models have dramatically advanced single-image 3D generation, it remains unclear which image cues these methods actually exploit. We introduce Cue3D, the first comprehensive, model-agnostic framework for quantifying the influence of individual image cues in single-image 3D generation. Our unified benchmark evaluates seven state-of-the-art methods, spanning regression-based, multi-view, and native 3D generative paradigms. By systematically perturbing cues such as shading, texture, silhouette, perspective, edges, and local continuity, we measure their impact on 3D output quality. Our analysis reveals that shape meaningfulness, not texture, dictates generalization. Geometric cues, particularly shading, are crucial for 3D generation. We further identify over-reliance on provided silhouettes and diverse sensitivities to cues such as perspective and local continuity across model families. By dissecting these dependencies, Cue3D advances our understanding of how modern 3D networks leverage classical vision cues, and offers directions for developing more transparent, robust, and controllable single-image 3D generation models.

AAAI Conference 2025 Conference Paper

Every Opinion Matters: Evaluating and Building Models with Pluralistic Views

  • Xiang Li

The development of large language models has demonstrated robust performance on English-centric benchmarks, which predominantly reflect majority opinions and dominant cultural norms. However, successful deployment in real-world applications requires the ability to handle context-specific and diverse knowledge, which is often underrepresented in training data. Addressing a plurality of perspectives is therefore essential. My research focuses on developing pluralistic evaluation methods to assess the diversity of LLM outputs, with a particular focus on culturally rich common-sense reasoning. Additionally, I work on advancing models that integrate diverse knowledge into LLMs, aiming to bridge the gap between human and AI understanding through the incorporation of varied perspectives using innovative probabilistic frameworks. In this talk, I will emphasize two key directions of my previous work: the probabilistic box model for representing diverse knowledge and probabilistic evaluation for assessing diversity in LLMs, with a focus on distributional aspects. Additionally, I will discuss my efforts to understand model behavior in long-tail scenarios.

NeurIPS Conference 2025 Conference Paper

Fading to Grow: Growing Preference Ratios via Preference Fading Discrete Diffusion for Recommendation

  • Guoqing Hu
  • An Zhang
  • Shuchang Liu
  • Wenyu Mao
  • Jiancan Wu
  • Xun Yang
  • Xiang Li
  • Lantao Hu

Recommenders aim to rank items from a discrete item corpus in line with user interests, yet suffer from extremely sparse user preference data. Recent advances in diffusion models have inspired diffusion-based recommenders, which alleviate sparsity by injecting noise during a forward process to prevent collapse of perturbed preference distributions. However, current diffusion-based recommenders predominantly rely on continuous Gaussian noise, which is intrinsically mismatched with the discrete nature of user preference data in recommendation. In this paper, building upon recent advances in discrete diffusion, we propose PreferGrow, a discrete diffusion-based recommender modeling preference ratios by fading and growing user preferences over the discrete item corpus. PreferGrow differs from existing diffusion-based recommenders in three core aspects: (1) Discrete modeling of preference ratios: PreferGrow models relative preference ratios between two items, where a positive value indicates a more preferred one over another less preferred. This formulation aligns naturally with the discrete and ranking-oriented nature of recommendation tasks. (2) Perturbing via preference fading: Instead of injecting continuous noise, PreferGrow fades user preferences by replacing the preferred item with alternatives, physically akin to negative sampling, thereby eliminating the need for any prior noise assumption. (3) Preference reconstruction via growing: PreferGrow reconstructs user preferences by iteratively growing the preference signal from the estimated ratios. We further provide theoretical analysis showing that PreferGrow preserves key properties of discrete diffusion processes. PreferGrow provides a well-defined matrix-based formulation for discrete diffusion-based recommendation and empirically outperforms existing diffusion-based recommenders across five benchmark datasets, underscoring its superior effectiveness. Our codes are available at https://anonymous.4open.science/r/PreferGrow_Commit-2259/.

AAAI Conference 2025 Conference Paper

From Words to Worth: Newborn Article Impact Prediction with LLM

  • Penghai Zhao
  • Qinghua Xing
  • Kairan Dou
  • Jinyu Tian
  • Ying Tai
  • Jian Yang
  • Ming-Ming Cheng
  • Xiang Li

Predicting the future impact of newly published articles is pivotal for advancing scientific discovery in an era of unprecedented scholarly expansion. This paper introduces a promising approach, leveraging the capabilities of LLMs to predict the future impact of newborn articles solely based on titles and abstracts. Breaking away from traditional methods heavily reliant on external data, we propose fine-tuning the LLM to uncover the intrinsic semantic patterns shared by highly impactful articles from a vast collection of text-score pairs. These semantic features are further utilized to predict the proposed indicator, TNCSIsp, which incorporates favorable normalization properties across value, field, and time. To facilitate parameter-efficient fine-tuning of the LLM, we have also meticulously curated a dataset containing over 12,000 entries, each annotated with titles, abstracts, and their corresponding TNCSIsp values. Experimental results reveal an MAE of 0.216 and an NDCG@20 of 0.901, setting new benchmarks in predicting the impact of newborn articles. Finally, we present a real-world application example for predicting the impact of newborn journal articles to demonstrate its noteworthy practical value. Overall, our findings challenge existing paradigms and propose a shift towards a more content-focused prediction of academic impact, offering new insights for article impact prediction.

AAAI Conference 2025 Conference Paper

Hierarchically Controlled Deformable 3D Gaussians for Talking Head Synthesis

  • Zhenhua Wu
  • Linxuan Jiang
  • Xiang Li
  • Chaowei Fang
  • Yipeng Qin
  • Guanbin Li

Audio-driven talking head synthesis is a critical task in digital human modeling. While recent advances using diffusion models and Neural Radiance Fields (NeRF) have improved visual quality, they often require substantial computational resources, limiting practical deployment. We present a novel framework for audio-driven talking head synthesis, namely Hierarchically Controlled Deformable 3D Gaussians (HiCoDe), which achieves state-of-the-art performance with significantly reduced computational costs. Our key contribution is a hierarchical control strategy that effectively bridges the gap between sparse audio features and dense 3D Gaussian point clouds. Specifically, this strategy comprises two control levels: i) coarse-level control based on a 3D Morphable Model (3DMM) and ii) fine-level control using facial landmarks. Extensive experiments on the HDTF dataset and additional test sets demonstrate that our method outperforms existing approaches in visual quality, facial landmark accuracy, and audio-visual synchronization while being more computationally efficient in both training and inference.

ICRA Conference 2025 Conference Paper

In-Pipe Navigation Development Environment and a Smooth Path Planning Method on Pipeline Surface

  • Hao Liu
  • Xiang Li
  • Xiang Zhang
  • Gang Liu
  • Mingquan Lu

Autonomous in-pipe inspection robots can automatically navigate through complex pipeline networks and detect potential risks from corrosion and defects, demonstrating great potential for replacing costly manual inspections. However, to the best of our knowledge, there is no publicly available simulation environment in which researchers can validate their in-pipe navigation algorithms, and navigation algorithms on the constrained 3D pipe surface, the critical software component, remain underexplored. Firstly, this paper proposes an open-source In-Pipe Navigation Development Environment. It contains various pipeline models, a magnetic wheel climbing robot model realized by the adhesion plugin, and baseline algorithms for navigation tasks. Secondly, a novel and effective path planning method is introduced. Instead of planning based on surface structures, the proposed method plans along the pipeline axis and maps the result into a local path using the Frenet-Serret formula, thereby generating smooth, feasible, and efficient paths. Finally, we conduct both qualitative and quantitative experiments in the proposed simulation and real-world environments. The results show the usability of the development environment, as well as the robustness and efficiency of the proposed planning method.

NeurIPS Conference 2025 Conference Paper

Learning to Plan Like the Human Brain via Visuospatial Perception and Semantic-Episodic Synergistic Decision-Making

  • Tianyuan Jia
  • Ziyu Li
  • Qing Li
  • Xiuxing Li
  • Xiang Li
  • Chen Wei
  • Li Yao
  • Xia Wu

Motion planning in high-dimensional continuous spaces remains challenging due to complex environments and computational constraints. Although learning-based planners, especially graph neural network (GNN)-based, have significantly improved planning performance, they still struggle with inaccurate graph construction and limited structural reasoning, constraining search efficiency and path quality. The human brain exhibits efficient planning through a two-stage Perception-Decision model. First, egocentric spatial representations from visual and proprioceptive input are constructed, and then semantic–episodic synergy is leveraged to support decision-making under uncertainty. Inspired by this process, we propose NeuroMP, a brain-inspired planning framework that learns to plan like the human brain. NeuroMP integrates a Perceptive Segment Selector inspired by visuospatial perception to construct safer graphs, and a Global Alignment Heuristic that guides search in weakly connected graphs by modeling semantic-episodic synergistic decision-making. Experimental results demonstrate that NeuroMP significantly outperforms existing planning methods in efficiency and quality while maintaining a high success rate.

ICLR Conference 2025 Conference Paper

Let Your Features Tell The Differences: Understanding Graph Convolution By Feature Splitting

  • Yilun Zheng
  • Xiang Li
  • Sitao Luan
  • Xiaojiang Peng
  • Lihui Chen

Graph Neural Networks (GNNs) have demonstrated strong capabilities in processing structured data. While traditional GNNs typically treat each feature dimension as equally important during graph convolution, we raise an important question: Is the graph convolution operation equally beneficial for each feature? If not, the convolution operation on certain feature dimensions can possibly lead to harmful effects, even worse than convolution-free models. Therefore, it is required to distinguish convolution-favored and convolution-disfavored features. Traditional feature selection methods mainly focus on identifying informative features or reducing redundancy, but they are not suitable for structured data as they overlook graph structures. In the graph community, some studies have investigated the performance of GNNs with respect to node features using feature homophily metrics, which assess feature consistency across graph topology. Unfortunately, these metrics do not effectively align with GNN performance and cannot be reliably used for feature selection in GNNs. To address these limitations, we introduce a novel metric, Topological Feature Informativeness (TFI), to distinguish GNN-favored and GNN-disfavored features, where its effectiveness is validated through both theoretical analysis and empirical observations. Based on TFI, we propose a simple yet effective Graph Feature Selection (GFS) method, which processes GNN-favored and GNN-disfavored features with GNNs and non-GNN models separately. Compared to original GNNs, GFS significantly improves the extraction of useful topological information from each feature with comparable computational costs. Extensive experiments show that after applying GFS to 8 baseline and state-of-the-art (SOTA) GNN architectures across 10 datasets, 90% of the GFS-augmented cases show significant performance boosts. Furthermore, our proposed TFI metric outperforms other feature selection methods for GFS. These results verify the effectiveness of both GFS and TFI. Additionally, we demonstrate that GFS's improvements are robust to hyperparameter tuning, highlighting its potential as a universally valid method for enhancing various GNN architectures. To facilitate reproducibility and further research, we have made our code publicly available at https://github.com/KTTRCDL/graph-feature-selection.
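
A rough sketch of the feature-splitting workflow, using a simple neighbor-correlation score as a crude stand-in for TFI (the paper's metric is defined differently and is label-aware): features scored as convolution-favored would be fed to a GNN, the remainder to a non-GNN model. All names and the toy ring graph are illustrative.

```python
import numpy as np

def neighbor_mean(X, edges, n):
    """Mean of each node's neighbors, per feature (dense toy version)."""
    agg, deg = np.zeros_like(X), np.zeros(n)
    for u, v in edges:
        agg[u] += X[v]; agg[v] += X[u]
        deg[u] += 1; deg[v] += 1
    return agg / np.maximum(deg, 1)[:, None]

def split_features(X, edges, top_frac=0.5):
    """Rank features by how well a node's value correlates with its neighbors'
    mean, then split them into convolution-favored / disfavored sets."""
    n, d = X.shape
    M = neighbor_mean(X, edges, n)
    scores = np.array([abs(np.corrcoef(X[:, j], M[:, j])[0, 1]) for j in range(d)])
    order = np.argsort(-scores)
    k = int(top_frac * d)
    return order[:k], order[k:]        # first set -> GNN branch, second -> MLP branch

X = np.random.randn(100, 8)
edges = [(i, (i + 1) % 100) for i in range(100)]      # toy ring graph
gnn_feats, mlp_feats = split_features(X, edges)
print(gnn_feats, mlp_feats)
```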

AAAI Conference 2025 Conference Paper

Leveraging Large Language Models for Node Generation in Few-Shot Learning on Text-Attributed Graphs

  • Jianxiang Yu
  • Yuxiang Ren
  • Chenghua Gong
  • Jiaqi Tan
  • Xiang Li
  • Xuecang Zhang

Text-attributed graphs have recently garnered significant attention due to their wide range of applications in web domains. Existing methodologies employ word embedding models for acquiring text representations as node features, which are subsequently fed into Graph Neural Networks (GNNs) for training. Recently, the advent of Large Language Models (LLMs) has introduced their powerful capabilities in information retrieval and text generation, which can greatly enhance the text attributes of graph data. Furthermore, the acquisition and labeling of extensive datasets are both costly and time-consuming endeavors. Consequently, few-shot learning has emerged as a crucial problem in the context of graph learning tasks. In order to tackle this challenge, we propose a lightweight paradigm called LLM4NG, which adopts a plug-and-play approach to establish supervision signals by leveraging LLMs for node generation. Specifically, we utilize LLMs to extract semantic information from the labels and generate samples that belong to these categories as exemplars. Subsequently, we employ an edge predictor to capture the structural information inherent in the raw dataset and integrate the newly generated samples into the original graph. This approach harnesses LLMs for enhancing class-level information and seamlessly introduces labeled nodes and edges without modifying the raw dataset, thereby facilitating the node classification task in few-shot scenarios. Extensive experiments demonstrate the outstanding performance of our proposed paradigm, particularly in low-shot scenarios. For instance, in the 1-shot setting of the ogbn-arxiv dataset, LLM4NG achieves a 76% improvement over the baseline model.

IJCAI Conference 2025 Conference Paper

MaskDGNN: Self-Supervised Dynamic Graph Neural Networks with Activeness-aware Temporal Masking

  • Yiming He
  • Xiang Li
  • Zhongying Zhao
  • Haobing Liu
  • Peilan He
  • Yanwei Yu

Integrating dynamics into graph neural networks (GNNs) provides deeper insights into the evolution of dynamic graphs, thereby enhancing the temporal representation in real-world dynamic network problems. Existing methods extracting critical information from dynamic graphs face two key challenges, either overlooking the negative impact of redundant information or struggling to address the distribution shifting issue in dynamic graphs. To address these challenges, we propose MaskDGNN, a novel dynamic GNN architecture that consists of two modules: First, a self-supervised activeness-aware temporal masking mechanism selectively retains edges between highly active nodes while masking those with low activeness, effectively reducing redundancy. Second, an adaptive frequency-enhancing graph representation learner amplifies the frequency-domain features of nodes to capture intrinsic features under distribution shifting. Experiments on five real-world dynamic graph datasets demonstrate that MaskDGNN outperforms state-of-the-art methods, achieving an average improvement of 7.07% in accuracy and 13.87% in MRR for link prediction tasks.

JBHI Journal 2025 Journal Article

MediViSTA: Medical Video Segmentation Via Temporal Fusion SAM Adaptation for Echocardiography

  • Sekeun Kim
  • Pengfei Jin
  • Cheng Chen
  • Kyungsang Kim
  • Zhiliang Lyu
  • Hui Ren
  • Sunghwan Kim
  • Zhengliang Liu

Despite achieving impressive results in general-purpose semantic segmentation with strong generalization on natural images, the Segment Anything Model (SAM) has shown less precision and stability in medical image segmentation. In particular, the original SAM architecture is designed for 2D natural images and therefore does not support three-dimensional information, which is particularly important for medical imaging modalities that are often volumetric or video data. In this paper, we introduce MediViSTA, a parameter-efficient fine-tuning method designed to adapt the vision foundation model for medical video, with a specific focus on echocardiography segmentation. To achieve spatial adaptation, we propose a frequency feature fusion technique that injects spatial frequency information from a CNN branch. For temporal adaptation, we integrate temporal adapters within the transformer blocks of the image encoder. Using a fine-tuning strategy, only a small subset of pre-trained parameters is updated, allowing efficient adaptation to echocardiography data. The effectiveness of our method has been comprehensively evaluated on three datasets, comprising two public datasets and one multi-center in-house dataset. Our method consistently outperforms various state-of-the-art approaches without using any prompts. Furthermore, our model exhibits strong generalization capabilities on unseen datasets, surpassing the second-best approach by 2.15% in Dice and 0.09 in temporal consistency. The results demonstrate the potential of MediViSTA to significantly advance echocardiography video segmentation, offering improved accuracy and robustness in cardiac assessment applications.

NeurIPS Conference 2025 Conference Paper

Mitigating the Privacy–Utility Trade-off in Decentralized Federated Learning via f-Differential Privacy

  • Xiang Li
  • Chendi Wang
  • Buxin Su
  • Qi Long
  • Weijie Su

Differentially private (DP) decentralized Federated Learning (FL) allows local users to collaborate without sharing their data with a central server. However, accurately quantifying the privacy budget of private FL algorithms is challenging due to the co-existence of complex algorithmic components such as decentralized communication and local updates. This paper addresses privacy accounting for two decentralized FL algorithms within the $f$-differential privacy ($f$-DP) framework. We develop two new $f$-DP–based accounting methods tailored to decentralized settings: Pairwise Network $f$-DP (PN-$f$-DP), which quantifies privacy leakage between user pairs under random-walk communication, and Secret-based $f$-Local DP (Sec-$f$-LDP), which supports structured noise injection via shared secrets. By combining tools from $f$-DP theory and Markov chain concentration, our accounting framework captures privacy amplification arising from sparse communication, local iterations, and correlated noise. Experiments on synthetic and real datasets demonstrate that our methods yield consistently tighter $(\epsilon, \delta)$ bounds and improved utility compared to Rényi DP–based approaches, illustrating the benefits of $f$-DP in decentralized privacy accounting.

AAAI Conference 2025 Conference Paper

Multi-clue Consistency Learning to Bridge Gaps Between General and Oriented Object in Semi-supervised Detection

  • Chenxu Wang
  • Chunyan Xu
  • Xiang Li
  • Yuxuan Li
  • Xu Guo
  • Ziqi Gu
  • Zhen Cui

While existing semi-supervised object detection (SSOD) methods perform well in general scenes, they encounter challenges in handling oriented objects in aerial images. We experimentally find three gaps between general and oriented object detection in semi-supervised learning: 1) Sampling inconsistency: the common center sampling is not suitable for oriented objects with larger aspect ratios when selecting positive labels from labeled data. 2) Assignment inconsistency: balancing the precision and localization quality of oriented pseudo-boxes poses greater challenges which introduces more noise when selecting positive labels from unlabeled data. 3) Confidence inconsistency: there exists more mismatch between the predicted classification and localization qualities when considering oriented objects, affecting the selection of pseudo-labels. Therefore, we propose a Multi-clue Consistency Learning (MCL) framework to bridge gaps between general and oriented objects in semi-supervised detection. Specifically, considering various shapes of rotated objects, the Gaussian Center Assignment is specially designed to select the pixel-level positive labels from labeled data. We then introduce the Scale-aware Label Assignment to select pixel-level pseudo-labels instead of unreliable pseudo-boxes, which is a divide-and-rule strategy suited for objects with various scales. The Consistent Confidence Soft Label is adopted to further boost the detector by maintaining the alignment of the predicted results. Comprehensive experiments on DOTA-v1.5 and DOTA-v1.0 benchmarks demonstrate that our proposed MCL can achieve state-of-the-art performance in the semi-supervised oriented object detection task.

JBHI Journal 2025 Journal Article

Multi-Scale Dynamic Sparse Attention UNet for Medical Image Segmentation

  • Xiang Li
  • Chong Fu
  • Qun Wang
  • Wenchao Zhang
  • Chen Ye
  • Junxin Chen
  • Chiu-Wing Sham

Transformers have recently gained significant attention in medical image segmentation due to their ability to capture long-range dependencies. However, the presence of excessive background noise in large regions of medical images introduces distractions and increases the computational burden on the fine-grained self-attention (SA) mechanism, which is a key component of the transformer model. Meanwhile, preserving fine-grained details is essential for accurately segmenting complex, blurred medical images with diverse shapes and sizes. Thus, we propose a novel Multi-scale Dynamic Sparse Attention (MDSA) module, which flexibly reduces computational costs while maintaining multi-scale fine-grained interactions with content awareness. Specifically, multi-scale aggregation is first applied to the feature maps to enrich the diversity of interaction information. Then, for each query, irrelevant key-value pairs are filtered out at a coarse-grained level. Finally, fine-grained SA is performed on the remaining key-value pairs. In addition, we design an enhanced downsampling merging (EDM) module and an enhanced upsampling fusion (EUF) module for building pyramid architectures. Using MDSA to construct the basic blocks, combined with EDMs and EUFs, we develop a UNet-like model named MDSA-UNet. Since MDSA-UNet dynamically processes only a small subset of relevant fine-grained features, it achieves strong segmentation performance with high computational efficiency. Extensive experiments on four datasets spanning three different types demonstrate that our MDSA-UNet, without using pre-training, significantly outperforms other non-pretrained methods and even competes with pre-trained models, achieving Dice scores of 82.10% on DDTI, 80.20% on TN3K, 90.75% on ISIC2018, and 91.05% on ACDC. Meanwhile, our model maintains lower complexity, with only 6.65 M parameters and 4.54 G FLOPs at a resolution of 224 × 224, ensuring both effectiveness and efficiency. Code is available at URL.
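
As a rough, hypothetical illustration of the coarse-to-fine filtering idea (score each query against region-level summaries of the keys, keep only the top-scoring regions, then run ordinary attention over the surviving keys), the PyTorch sketch below is one possible realization; the function name, region layout, and hyperparameters are assumptions, not the authors' code.

```python
# Sketch of coarse-to-fine sparse attention: filter key-value pairs per query at a
# coarse (region) level, then apply ordinary softmax attention to what remains.
import torch
import torch.nn.functional as F

def coarse_to_fine_attention(q, k, v, region_size=14, top_regions=4):
    """q, k, v: (batch, n_tokens, dim); n_tokens must be divisible by region_size."""
    b, n, d = k.shape
    r = n // region_size
    k_regions = k.reshape(b, r, region_size, d).mean(dim=2)       # (b, r, d) coarse keys
    coarse = q @ k_regions.transpose(1, 2) / d ** 0.5             # (b, n_q, r) region scores
    keep = coarse.topk(top_regions, dim=-1).indices               # top regions per query
    region_mask = torch.zeros_like(coarse).scatter_(-1, keep, 1.0).bool()
    token_mask = region_mask.repeat_interleave(region_size, dim=-1)  # (b, n_q, n)
    attn = q @ k.transpose(1, 2) / d ** 0.5
    attn = attn.masked_fill(~token_mask, float("-inf"))           # drop filtered key-value pairs
    return F.softmax(attn, dim=-1) @ v

q = k = v = torch.randn(1, 196, 64)   # e.g. a 14x14 feature map split into 14 regions
print(coarse_to_fine_attention(q, k, v).shape)  # torch.Size([1, 196, 64])
```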

IJCAI Conference 2025 Conference Paper

Not All Layers of LLMs Are Necessary During Inference

  • Siqi Fan
  • Xin Jiang
  • Xiang Li
  • Xuying Meng
  • Peng Han
  • Shuo Shang
  • Aixin Sun
  • Yequan Wang

Due to the large number of parameters, the inference phase of Large Language Models (LLMs) is resource-intensive. However, not all requests posed to LLMs are equally difficult to handle. Through analysis, we show that for some tasks, LLMs can achieve results comparable to the final output at some intermediate layers. That is, not all layers of LLMs are necessary during inference. If we can predict at which layer the inferred results match the final results (produced by evaluating all layers), we could significantly reduce the inference cost. To this end, we propose a simple yet effective algorithm named AdaInfer to adaptively terminate the inference process for an input instance. AdaInfer relies on easily obtainable statistical features and classic classifiers like SVM. Experiments on well-known LLMs like the Llama2 series and OPT show that AdaInfer can achieve an average pruning ratio of 17.8%, and up to 43% on sentiment tasks, with nearly no performance drop (<1%). Because AdaInfer does not alter LLM parameters, the LLMs incorporated with AdaInfer maintain generalizability across tasks.
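
The early-exit recipe sketched in the abstract (cheap per-layer statistics fed to a classic classifier that decides whether to stop) can be illustrated with the hypothetical snippet below; the feature choices, classifier, and training data here are placeholders, not the AdaInfer implementation.

```python
# Illustrative sketch: per-layer statistics over the vocabulary distribution decide,
# via an SVM, whether to terminate inference at that layer.
import numpy as np
from sklearn.svm import SVC

def layer_features(logits):
    """Cheap statistics of one layer's next-token distribution."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    top2 = np.sort(p)[-2:]
    return np.array([top2[-1],                          # top-1 probability
                     top2[-1] - top2[-2],               # gap to the runner-up
                     -(p * np.log(p + 1e-12)).sum()])   # entropy

# Train the stop/continue classifier on held-out (features, "matches final output") pairs.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 1] > 0).astype(int)               # placeholder labels for the sketch
stopper = SVC(kernel="rbf").fit(X_train, y_train)

def adaptive_inference(per_layer_logits):
    """Stop at the first layer the SVM classifies as safe to exit."""
    for i, logits in enumerate(per_layer_logits):
        if stopper.predict(layer_features(logits)[None])[0] == 1:
            return i, int(np.argmax(logits))
    return len(per_layer_logits) - 1, int(np.argmax(per_layer_logits[-1]))

layers = [rng.normal(size=32000) for _ in range(32)]    # fake per-layer logits
print(adaptive_inference(layers))                       # (exit layer, predicted token id)
```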

NeurIPS Conference 2025 Conference Paper

On the Empirical Power of Goodness-of-Fit Tests in Watermark Detection

  • Weiqing He
  • Xiang Li
  • Tianqi Shang
  • Li Shen
  • Weijie Su
  • Qi Long

Large language models (LLMs) raise concerns about content authenticity and integrity because they can generate human-like text at scale. Text watermarks, which embed detectable statistical signals into generated text, offer a provable way to verify content origin. Many detection methods rely on pivotal statistics that are i.i.d. under human-written text, making goodness-of-fit (GoF) tests a natural tool for watermark detection. However, GoF tests remain largely underexplored in this setting. In this paper, we systematically evaluate eight GoF tests across three popular watermarking schemes, using three open-source LLMs, two datasets, various generation temperatures, and multiple post-editing methods. We find that general GoF tests can improve both the detection power and robustness of watermark detectors. Notably, we observe that text repetition, common in low-temperature settings, gives GoF tests a unique advantage not exploited by existing methods. Our results highlight that classic GoF tests are a simple yet powerful and underused tool for watermark detection in LLMs.
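
For a concrete feel for the setup, the sketch below treats per-token pivotal statistics as values that are Uniform(0,1) under the human-written null and applies a Kolmogorov-Smirnov goodness-of-fit test; the particular statistics, schemes, and tests studied in the paper differ, so this is only an illustrative example with simulated data.

```python
# Illustrative sketch: a goodness-of-fit test on pivotal statistics that are
# Uniform(0,1) under the "human-written" null hypothesis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
human_stats = rng.uniform(size=200)                                    # null: uniform
watermarked = np.clip(rng.beta(2.0, 1.0, size=200), 1e-9, 1 - 1e-9)    # shifted toward 1

for name, x in [("human", human_stats), ("watermarked", watermarked)]:
    ks = stats.kstest(x, "uniform")                     # Kolmogorov-Smirnov GoF test
    print(f"{name:12s} KS statistic={ks.statistic:.3f}  p-value={ks.pvalue:.3g}")
# A small p-value rejects the uniform null, i.e. flags the text as likely watermarked.
```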

ICML Conference 2025 Conference Paper

Preference Adaptive and Sequential Text-to-Image Generation

  • Ofir Nabati
  • Guy Tennenholtz
  • Chih-Wei Hsu
  • Moonkyung Ryu
  • Deepak Ramachandran
  • Yinlam Chow
  • Xiang Li
  • Craig Boutilier

We address the problem of interactive text-to-image (T2I) generation, designing a reinforcement learning (RL) agent which iteratively improves a set of generated images for a user through a sequence of prompt expansions. Using human raters, we create a novel dataset of sequential preferences, which we leverage, together with large-scale open-source (non-sequential) datasets. We construct user-preference and user-choice models using an EM strategy and identify varying user preference types. We then leverage a large multimodal language model (LMM) and a value-based RL approach to suggest an adaptive and diverse slate of prompt expansions to the user. Our Preference Adaptive and Sequential Text-to-image Agent (PASTA) extends T2I models with adaptive multi-turn capabilities, fostering collaborative co-creation and addressing uncertainty or underspecification in a user’s intent. We evaluate PASTA using human raters, showing significant improvement compared to baseline methods. We also open-source our sequential rater dataset and simulated user-rater interactions to support future research in user-centric multi-turn T2I systems.

NeurIPS Conference 2025 Conference Paper

REOBench: Benchmarking Robustness of Earth Observation Foundation Models

  • Xiang Li
  • Yong Tao
  • Siyuan Zhang
  • Siwei Liu
  • Zhitong Xiong
  • Chunbo Luo
  • Lu Liu
  • Mykola Pechenizkiy

Earth observation foundation models have shown strong generalization across multiple Earth observation tasks, but their robustness under real-world perturbations remains underexplored. To bridge this gap, we introduce REOBench, the first comprehensive benchmark for evaluating the robustness of Earth observation foundation models across six tasks and twelve types of image corruptions, including both appearance-based and geometric perturbations. To ensure realistic and fine-grained evaluation, our benchmark focuses on high-resolution optical remote sensing images, which are widely used in critical applications such as urban planning and disaster response. We conduct a systematic evaluation of a broad range of models trained using masked image modeling, contrastive learning, and vision-language pre-training paradigms. Our results reveal that (1) existing Earth observation foundation models experience significant performance degradation when exposed to input corruptions. (2) The severity of degradation varies across tasks, model architectures, backbone sizes, and types of corruption, with performance drop varying from less than 1% to over 25%. (3) Vision-language models show enhanced robustness, particularly in multimodal tasks. REOBench underscores the vulnerability of current Earth observation foundation models to real-world corruptions and provides actionable insights for developing more robust and reliable models.

NeurIPS Conference 2025 Conference Paper

Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think

  • Ge Wu
  • Shen Zhang
  • Ruijing Shi
  • Shanghua Gao
  • Zhenyuan Chen
  • Lei Wang
  • Zhaowei Chen
  • Hongcheng Gao

REPA and its variants effectively mitigate training challenges in diffusion models by incorporating external visual representations from pretrained models, through alignment between the noisy hidden projections of denoising networks and foundational clean image representations. We argue that the external alignment, which is absent during the entire denoising inference process, falls short of fully harnessing the potential of discriminative representations. In this work, we propose a straightforward method called $\textit{$\textbf{R}$epresentation $\textbf{E}$ntanglement for $\textbf{G}$eneration}$ ($\textbf{REG}$), which entangles low-level image latents with a single high-level class token from pretrained foundation models for denoising. REG acquires the capability to produce coherent image-class pairs directly from pure noise, substantially improving both generation quality and training efficiency. This is accomplished with negligible additional inference overhead, requiring only one single additional token for denoising (<0.5\% increase in FLOPs and latency). The inference process concurrently reconstructs both image latents and their corresponding global semantics, where the acquired semantic knowledge actively guides and enhances the image generation process. On ImageNet 256$\times$256, SiT-XL/2 + REG demonstrates remarkable convergence acceleration, achieving $\textbf{63}\times$ and $\textbf{23}\times$ faster training than SiT-XL/2 and SiT-XL/2 + REPA, respectively. More impressively, SiT-L/2 + REG trained for merely 400K iterations outperforms SiT-XL/2 + REPA trained for 4M iterations ($\textbf{10}\times$ longer). Code is available at: https://github.com/Martinser/REG.

NeurIPS Conference 2025 Conference Paper

Revisiting End-to-End Learning with Slide-level Supervision in Computational Pathology

  • Wenhao Tang
  • Rong Qin
  • Heng Fang
  • Fengtao Zhou
  • Hao Chen
  • Xiang Li
  • Ming-Ming Cheng

Pre-trained encoders for offline feature extraction followed by multiple instance learning (MIL) aggregators have become the dominant paradigm in computational pathology (CPath), benefiting cancer diagnosis and prognosis. However, performance limitations arise from the absence of encoder fine-tuning for downstream tasks and disjoint optimization with MIL. While slide-level supervised end-to-end (E2E) learning is an intuitive solution to this issue, it faces challenges such as high computational demands and suboptimal results. These limitations motivate us to revisit E2E learning. We argue that prior work neglects inherent E2E optimization challenges, leading to performance disparities compared to traditional two-stage methods. In this paper, we pioneer the elucidation of the optimization challenge caused by sparse-attention MIL and propose a novel MIL method called ABMILX. ABMILX mitigates this problem through global correlation-based attention refinement and multi-head mechanisms. With the efficient multi-scale random patch sampling strategy, an E2E trained ResNet with ABMILX surpasses SOTA foundation models under the two-stage paradigm across multiple challenging benchmarks, while remaining computationally efficient ($<$ 10 RTX3090 GPU hours). We demonstrate the potential of E2E learning in CPath and call for greater research focus in this area. The code is available at https://github.com/DearCaat/E2E-WSI-ABMILX.

NeurIPS Conference 2025 Conference Paper

See through the Dark: Learning Illumination-affined Representations for Nighttime Occupancy Prediction

  • Yuan Wu
  • Zhiqiang Yan
  • Yigong Zhang
  • Xiang Li
  • Jian Yang

Occupancy prediction aims to estimate the 3D spatial distribution of occupied regions along with their corresponding semantic labels. Existing vision-based methods perform well on daytime benchmarks but struggle in nighttime scenarios due to limited visibility and challenging lighting conditions. To address these challenges, we propose LIAR, a novel framework that learns illumination-affined representations. LIAR first introduces Selective Low-light Image Enhancement (SLLIE), which leverages the illumination priors from daytime scenes to adaptively determine whether a nighttime image is genuinely dark or sufficiently well-lit, enabling more targeted global enhancement. Building on the illumination maps generated by SLLIE, LIAR further incorporates two illumination-aware components: 2D Illumination-guided Sampling (2D-IGS) and 3D Illumination-driven Projection (3D-IDP), to respectively tackle local underexposure and overexposure. Specifically, 2D-IGS modulates feature sampling positions according to illumination maps, assigning larger offsets to darker regions and smaller ones to brighter regions, thereby alleviating feature degradation in underexposed areas. Subsequently, 3D-IDP enhances semantic understanding in overexposed regions by constructing illumination intensity fields and supplying refined residual queries to the BEV context refinement process. Extensive experiments on both real and synthetic datasets demonstrate the superior performance of LIAR under challenging nighttime scenarios. The source code and pretrained models are available here.

NeurIPS Conference 2025 Conference Paper

Statistical Inference under Performativity

  • Xiang Li
  • Yunai Li
  • Huiying Zhong
  • Lihua Lei
  • Zhun Deng

Performativity of predictions refers to the phenomenon where prediction-informed decisions influence the very targets they aim to predict—a dynamic commonly observed in policy-making, social sciences, and economics. In this paper, we initiate an end-to-end framework of statistical inference under performativity. Our contributions are twofold. First, we establish a central limit theorem for estimation and inference in the performative setting, enabling standard inferential tasks such as constructing confidence intervals and conducting hypothesis tests in policy-making contexts. Second, we leverage this central limit theorem to study prediction-powered inference (PPI) under performativity. This approach yields more precise estimates and tighter confidence regions for the model parameters (i.e., policies) of interest in performative prediction. We validate the effectiveness of our framework through numerical experiments. To the best of our knowledge, this is the first work to establish a complete statistical inference under performativity, introducing new challenges and inference settings that we believe will provide substantial value to policy-making, statistics, and machine learning.

NeurIPS Conference 2025 Conference Paper

Towards Understanding the Mechanisms of Classifier-Free Guidance

  • Xiang Li
  • Rongrong Wang
  • Qing Qu

Classifier-free guidance (CFG) is a core technique powering state-of-the-art image generation systems, yet its underlying mechanisms remain poorly understood. In this work, we first analyze CFG in a simplified linear diffusion model, where we show its behavior closely resembles that observed in the nonlinear case. Our analysis reveals that linear CFG improves generation quality via three distinct components: (i) a mean-shift term that approximately steers samples in the direction of class means, (ii) a positive Contrastive Principal Components (CPC) term that amplifies class-specific features, and (iii) a negative CPC term that suppresses generic features prevalent in unconditional data. We then verify these insights in real-world, nonlinear diffusion models: over a broad range of noise levels, linear CFG resembles the behavior of its nonlinear counterpart. Although the two eventually diverge at low noise levels, we discuss how the insights from the linear analysis still shed light on the CFG's mechanism within the nonlinear regime.
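
For context, the classifier-free guidance update the analysis refers to combines conditional and unconditional score estimates; one common convention (some papers write $1+w$ in place of $w$) is:

```latex
% Classifier-free guidance with guidance weight w:
\[
\hat{\epsilon}_{w}(x_t, c)
  \;=\; \epsilon_{\theta}(x_t, \varnothing)
  \;+\; w\,\bigl[\epsilon_{\theta}(x_t, c) - \epsilon_{\theta}(x_t, \varnothing)\bigr].
\]
% The guided prediction adds w times the difference between the conditional and
% unconditional estimates; the paper decomposes this difference (in a linear model)
% into a mean-shift term plus positive and negative CPC terms.
```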

AAAI Conference 2025 Conference Paper

TreeEval: Benchmark-Free Evaluation of Large Language Models through Tree Planning

  • Xiang Li
  • Yunshi Lan
  • Chao Yang

Recently, numerous new benchmarks have been established to evaluate the performance of large language models (LLMs) via either computing a holistic score or employing another LLM as a judge. However, these approaches suffer from data leakage due to the open access of the benchmark and inflexible evaluation process. To address this issue, we introduce TreeEval, a benchmark-free evaluation method for LLMs that lets a high-performance LLM host an irreproducible evaluation session and essentially avoids data leakage. Moreover, this LLM acts as an examiner, raising a series of questions under a topic with a tree-planning strategy that considers the current evaluation status to decide the next question and ensures the completeness and efficiency of the evaluation process. We evaluate 6 models of different parameter sizes, including 7B, 13B, and 34B, and ultimately achieve the highest correlation coefficient with AlpacaEval2.0 using only around 45 questions. We also conduct more analysis to show the robustness and reliability of TreeEval.

NeurIPS Conference 2025 Conference Paper

Understanding Representation Dynamics of Diffusion Models via Low-Dimensional Modeling

  • Xiao Li
  • Zekai Zhang
  • Xiang Li
  • Siyi Chen
  • Zhihui Zhu
  • Peng Wang
  • Qing Qu

Diffusion models, though originally designed for generative tasks, have demonstrated impressive self-supervised representation learning capabilities. A particularly intriguing phenomenon in these models is the emergence of unimodal representation dynamics, where the quality of learned features peaks at an intermediate noise level. In this work, we conduct a comprehensive theoretical and empirical investigation of this phenomenon. Leveraging the inherent low-dimensionality structure of image data, we theoretically demonstrate that the unimodal dynamic emerges when the diffusion model successfully captures the underlying data distribution. The unimodality arises from an interplay between denoising strength and class confidence across noise scales. Empirically, we further show that, in classification tasks, the presence of unimodal dynamics reliably reflects the diffusion model’s generalization: it emerges when the model generates novel images and gradually transitions to a monotonically decreasing curve as the model begins to memorize the training data.

JBHI Journal 2025 Journal Article

Variability of Spatiotemporal-Rhythmic Network During Inhibitory Control in Repetitive Subconcussion

  • Xiang Li
  • Zhenghao Fu
  • Hui Zhou
  • Yin Xiang
  • Yaqian Li
  • Yida He
  • Jiaqi Zhang
  • Huanhuan Li

Inhibitory control dysfunction is a frequent cognitive symptom of repetitive subconcussion (SC). Implementing inhibitory control is temporally resolved and is likely related to the dynamic interactions in functional brain networks. However, investigations of the dynamic activity of these brain networks using electroencephalography (EEG) are often limited to specific frequency bands without fully utilizing the spatiotemporal rhythmic information. Therefore, we propose an innovative framework for constructing a large-scale spatiotemporal-rhythmic network (STRN) using dynamic cross-frequency phase synchronization to track cognitive deficits induced by repetitive subconcussion during inhibitory control. Seventeen parachutists with repeated subconcussive exposure and 17 healthy controls (HC) performed a Stroop task while continuous scalp EEG data were recorded. Our results indicated an STRN-specific activation pattern that achieved a high classification performance with an average accuracy of 90.98%, which may serve as a biomarker for identifying repetitive subconcussion inhibitory control dysfunction. In this STRN state, the SC exhibited mostly lower network rhythmic information interactions than the HC. These findings suggest that the STRN presented in this study could be an effective analytical method for understanding the cognitive dysfunction observed in repetitive subconcussion and other related conditions.

JBHI Journal 2025 Journal Article

Voxel-Level Brain States Prediction Using Swin Transformer

  • Yifei Sun
  • Daniel Chahine
  • Qinghao Wen
  • Tianming Liu
  • Xiang Li
  • Yixuan Yuan
  • Fernando Calamante
  • Jinglei Lv

Understanding brain dynamics is important for neuroscience and mental health. Functional magnetic resonance imaging (fMRI) enables the measurement of neural activities through blood-oxygen-level-dependent (BOLD) signals, which represent brain states. In this study, we aim to predict future human resting brain states with fMRI. Due to the 3D voxel-wise spatial organization and temporal dependencies of the fMRI data, we propose a novel architecture which employs a 4D Shifted Window (Swin) Transformer as encoder to efficiently learn spatio-temporal information and a convolutional decoder to enable brain state prediction at the same spatial and temporal resolution as the input fMRI data. We used 100 unrelated subjects from the Human Connectome Project (HCP) for model training and testing. Our model has shown high accuracy when predicting 7.2 s of resting-state brain activity based on the prior 23.04 s of fMRI time series. The predicted brain states highly resemble BOLD contrast and dynamics. This work shows promising evidence that the spatiotemporal organization of the human brain can be learned by a Swin Transformer model, at high resolution, which offers potential for reducing fMRI scan time and for developing brain-computer interfaces in the future.

NeurIPS Conference 2025 Conference Paper

Who You Are Matters: Bridging Interests and Social Roles via LLM-Enhanced Logic Recommendation

  • Qing Yu
  • Xiaobei Wang
  • Shuchang Liu
  • Xiaoyu Yang
  • Xueliang Wang
  • Chang Meng
  • Shanshan Wu
  • Bin Wen

Recommender systems filter contents/items valuable to users by inferring preferences from user features and historical behaviors. Mainstream approaches follow the learning-to-rank paradigm, which focuses on discovering and modeling item topics (e.g., categories), and capturing user preferences on these topics based on historical interactions. However, this paradigm often neglects the modeling of user characteristics and their social roles, which are logical confounders influencing the correlated interest and user preference transition. To bridge this gap, we introduce the user role identification task and the behavioral logic modeling task that aim to explicitly model user roles and learn the logical relations between item topics and user social roles. We show that it is possible to explicitly solve these tasks through an efficient integration framework of Large Language Model (LLM) and recommendation systems, for which we propose TagCF. On the one hand, TagCF exploits the (Multi-modal) LLM's world knowledge and logic inference ability to extract realistic tag-based virtual logic graphs that reveal dynamic and expressive knowledge of users, refining our understanding of user behaviors. On the other hand, TagCF presents empirically effective integration modules that take advantage of the extracted tag-logic information, augmenting the recommendation performance. We conduct both online experiments and offline experiments with industrial and public datasets as verification of TagCF's effectiveness, and we empirically show that the user role modeling strategy is potentially a better choice than the modeling of item topics. Additionally, we provide evidence that the extracted logic graphs are empirically a general and transferable knowledge that can benefit a wide range of recommendation tasks. Our code is available at https://github.com/Code2Q/TagCF.

NeurIPS Conference 2024 Conference Paper

3DCoMPaT200: Language Grounded Large-Scale 3D Vision Dataset for Compositional Recognition

  • Mahmoud Ahmed
  • Xiang Li
  • Arpit Prajapati
  • Mohamed Elhoseiny

Understanding objects in 3D at the part level is essential for humans and robots to navigate and interact with the environment. Current datasets for part-level 3D object understanding encompass a limited range of categories. For instance, the ShapeNet-Part and PartNet datasets only include 16 and 24 object categories, respectively. The 3DCoMPaT dataset, specifically designed for compositional understanding of parts and materials, contains only 42 object categories. To foster richer and fine-grained part-level 3D understanding, we introduce 3DCoMPaT200, a large-scale dataset tailored for compositional understanding of object parts and materials, with 200 object categories, an object vocabulary approximately 5 times larger than 3DCoMPaT and almost 4 times more part categories. Concretely, 3DCoMPaT200 significantly expands upon 3DCoMPaT, featuring 1,031 fine-grained part categories and 293 distinct material classes for compositional application to 3D object parts. Additionally, to address the complexities of compositional 3D modeling, we propose a novel task of Compositional Part Shape Retrieval using ULIP to provide a strong 3D foundational model for 3D Compositional Understanding. This method evaluates the model shape retrieval performance given one, three, or six parts described in text format. These results show that the model's performance improves with an increasing number of style compositions, highlighting the critical role of the compositional dataset. Such results underscore the dataset's effectiveness in enhancing models' capability to understand complex 3D shapes from a compositional perspective. Code and data can be found at: https://github.com/3DCoMPaT200/3DCoMPaT200/

JMLR Journal 2024 Journal Article

A Random Projection Approach to Personalized Federated Learning: Enhancing Communication Efficiency, Robustness, and Fairness

  • Yuze Han
  • Xiang Li
  • Shiyun Lin
  • Zhihua Zhang

Personalized Federated Learning (FL) faces many challenges such as expensive communication costs, training-time adversarial attacks, and performance unfairness across devices. Recent developments witness a trade-off between a reference model and local models to achieve personalization. Following the avenue, we propose a personalized FL method toward the three goals. When it is time to communicate, our method projects local models into a shared-and-fixed low-dimensional random subspace and uses infimal convolution to control the deviation between the reference model and projected local models. We theoretically show our method converges for both strongly convex and non-convex but smooth objectives with square regularizers and the convergence dependence on the projection dimension is mild. We also illustrate the benefits of robustness and fairness on a class of linear problems. Finally, we conduct a large number of experiments to show the empirical superiority of our method over several state-of-the-art methods on the three aspects.
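
A minimal NumPy sketch of the shared-and-fixed random subspace idea described above (every client derives the same projection from a shared seed and communicates only the low-dimensional projection of its local model); the dimensions, seed, and function names are illustrative placeholders, not the paper's algorithm.

```python
# Sketch: a shared random projection lets clients send PROJ_DIM numbers per round
# instead of the full DIM-dimensional model.
import numpy as np

DIM, PROJ_DIM, SHARED_SEED = 10_000, 256, 42

def shared_projection():
    rng = np.random.default_rng(SHARED_SEED)             # identical on every client
    return rng.normal(size=(PROJ_DIM, DIM)) / np.sqrt(PROJ_DIM)

P = shared_projection()

def communicate(local_model: np.ndarray) -> np.ndarray:
    return P @ local_model                                # low-dimensional message

clients = [np.random.default_rng(i).normal(size=DIM) for i in range(5)]
reference = np.mean([communicate(w) for w in clients], axis=0)  # server-side aggregate
print(reference.shape)   # (256,) -- roughly 39x less communication than full models
```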

NeurIPS Conference 2024 Conference Paper

Achieving Near-Optimal Convergence for Distributed Minimax Optimization with Adaptive Stepsizes

  • Yan Huang
  • Xiang Li
  • Yipeng Shen
  • Niao He
  • Jinming Xu

In this paper, we show that applying adaptive methods directly to distributed minimax problems can result in non-convergence due to inconsistency in locally computed adaptive stepsizes. To address this challenge, we propose D-AdaST, a Distributed Adaptive minimax method with Stepsize Tracking. The key strategy is to employ an adaptive stepsize tracking protocol involving the transmission of two extra (scalar) variables. This protocol ensures the consistency among stepsizes of nodes, eliminating the steady-state error due to the lack of coordination of stepsizes among nodes that commonly exists in vanilla distributed adaptive methods, and thus guarantees exact convergence. For nonconvex-strongly-concave distributed minimax problems, we characterize the specific transient times that ensure time-scale separation of stepsizes and quasi-independence of networks, leading to a near-optimal convergence rate of $\tilde{\mathcal{O}} \left( \epsilon ^{-\left( 4+\delta \right)} \right)$ for any small $\delta > 0$, matching that of the centralized counterpart. To our best knowledge, D-AdaST is the *first* distributed adaptive method achieving near-optimal convergence without knowing any problem-dependent parameters for nonconvex minimax problems. Extensive experiments are conducted to validate our theoretical results.

AAAI Conference 2024 Conference Paper

AltNeRF: Learning Robust Neural Radiance Field via Alternating Depth-Pose Optimization

  • Kun Wang
  • Zhiqiang Yan
  • Huang Tian
  • Zhenyu Zhang
  • Xiang Li
  • Jun Li
  • Jian Yang

Neural Radiance Fields (NeRF) have shown promise in generating realistic novel views from sparse scene images. However, existing NeRF approaches often encounter challenges due to the lack of explicit 3D supervision and imprecise camera poses, resulting in suboptimal outcomes. To tackle these issues, we propose AltNeRF---a novel framework designed to create resilient NeRF representations using self-supervised monocular depth estimation (SMDE) from monocular videos, without relying on known camera poses. SMDE in AltNeRF masterfully learns depth and pose priors to regulate NeRF training. The depth prior enriches NeRF's capacity for precise scene geometry depiction, while the pose prior provides a robust starting point for subsequent pose refinement. Moreover, we introduce an alternating algorithm that harmoniously melds NeRF outputs into SMDE through a consistency-driven mechanism, thus enhancing the integrity of depth priors. This alternation empowers AltNeRF to progressively refine NeRF representations, yielding the synthesis of realistic novel views. Extensive experiments showcase the compelling capabilities of AltNeRF in generating high-fidelity and robust novel views that closely resemble reality.

NeurIPS Conference 2024 Conference Paper

Biomedical Visual Instruction Tuning with Clinician Preference Alignment

  • Hejie Cui
  • Lingjun Mao
  • Xin Liang
  • Jieyu Zhang
  • Hui Ren
  • Quanzheng Li
  • Xiang Li
  • Carl Yang

Recent advancements in multimodal foundation models have showcased impressive capabilities in understanding and reasoning with visual and textual information. Adapting these foundation models trained for general usage to specialized domains like biomedicine requires large-scale domain-specific instruction datasets. While existing works have explored curating such datasets automatically, the resultant datasets are not explicitly aligned with domain expertise. In this work, we propose a data-centric framework, Biomedical Visual Instruction Tuning with Clinician Preference Alignment (BioMed-VITAL), that incorporates clinician preferences into both stages of generating and selecting instruction data for tuning biomedical multimodal foundation models. First, during the generation stage, we prompt the GPT-4V generator with a diverse set of clinician-selected demonstrations for preference-aligned data candidate generation. Then, during the selection phase, we train a separate selection model, which explicitly distills clinician and policy-guided model preferences into a rating function to select high-quality data for medical instruction tuning. Results show that the model tuned with the instruction-following data from our method demonstrates a significant improvement in open visual chat (18.5% relative improvement) and medical VQA (win rate up to 81.73%). Our instruction-following data and models are available at https://BioMed-VITAL.github.io.

NeurIPS Conference 2024 Conference Paper

Cross-model Control: Improving Multiple Large Language Models in One-time Training

  • Jiayi Wu
  • Hao Sun
  • Hengyi Cai
  • Lixin Su
  • Shuaiqiang Wang
  • Dawei Yin
  • Xiang Li
  • Ming Gao

The number of large language models (LLMs) with varying parameter scales and vocabularies is increasing. While they deliver powerful performance, they also face a set of common optimization needs to meet specific requirements or standards, such as instruction following or avoiding the output of sensitive information from the real world. However, how to reuse the fine-tuning outcomes of one model for other models to reduce training costs remains a challenge. To bridge this gap, we introduce Cross-model Control (CMC), a method that improves multiple LLMs in one-time training with a portable tiny language model. Specifically, we have observed that the logit shift before and after fine-tuning is remarkably similar across different models. Based on this insight, we incorporate a tiny language model with a minimal number of parameters. By training alongside a frozen template LLM, the tiny model gains the capability to alter the logits output by the LLMs. To make this tiny language model applicable to models with different vocabularies, we propose a novel token mapping strategy named PM-MinED. We have conducted extensive experiments on instruction tuning and unlearning tasks, demonstrating the effectiveness of CMC. Our code is available at https://github.com/wujwyi/CMC.
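
The observation that the logit shift induced by fine-tuning is similar across models suggests the rough interface below: a small model predicts a logit delta that is added to any frozen LLM's next-token logits. This is only a schematic of the idea; the model architecture is a placeholder, and the token-mapping step (PM-MinED) and training procedure are described in the paper itself.

```python
# Schematic of steering a frozen LLM with a portable delta model: the tiny model
# predicts a shift that is added to the frozen model's next-token logits.
import torch
import torch.nn as nn

class TinyDeltaModel(nn.Module):
    def __init__(self, vocab_size: int, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, token_ids):                        # (batch, seq)
        h, _ = self.rnn(self.embed(token_ids))
        return self.head(h[:, -1])                       # logit shift for the next token

vocab = 1000
tiny = TinyDeltaModel(vocab)
frozen_llm_logits = torch.randn(2, vocab)                # stand-in for any frozen LLM
tokens = torch.randint(0, vocab, (2, 16))
steered = frozen_llm_logits + tiny(tokens)               # combined logits used for decoding
print(steered.shape)                                     # torch.Size([2, 1000])
```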

NeurIPS Conference 2024 Conference Paper

DCDepth: Progressive Monocular Depth Estimation in Discrete Cosine Domain

  • Kun Wang
  • Zhiqiang Yan
  • Junkai Fan
  • Wanlu Zhu
  • Xiang Li
  • Jun Li
  • Jian Yang

In this paper, we introduce DCDepth, a novel framework for the long-standing monocular depth estimation task. Moving beyond conventional pixel-wise depth estimation in the spatial domain, our approach estimates the frequency coefficients of depth patches after transforming them into the discrete cosine domain. This unique formulation allows for the modeling of local depth correlations within each patch. Crucially, the frequency transformation segregates the depth information into various frequency components, with low-frequency components encapsulating the core scene structure and high-frequency components detailing the finer aspects. This decomposition forms the basis of our progressive strategy, which begins with the prediction of low-frequency components to establish a global scene context, followed by successive refinement of local details through the prediction of higher-frequency components. We conduct comprehensive experiments on NYU-Depth-V2, TOFDC, and KITTI datasets, and demonstrate the state-of-the-art performance of DCDepth. Code is available at https://github.com/w2kun/DCDepth.
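
A small SciPy illustration of the frequency decomposition the abstract describes: transform a depth patch with a 2-D discrete cosine transform, where the low-frequency coefficients carry the coarse structure and the higher ones the details. The progressive prediction network itself is not shown, and the patch here is synthetic.

```python
# Illustration of the discrete-cosine decomposition of a depth patch.
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(0)
patch = np.cumsum(np.cumsum(rng.normal(size=(8, 8)), axis=0), axis=1)  # smooth-ish "depth"

coeffs = dctn(patch, norm="ortho")         # 8x8 frequency coefficients
coarse = np.zeros_like(coeffs)
coarse[:2, :2] = coeffs[:2, :2]            # keep only the lowest frequencies
coarse_depth = idctn(coarse, norm="ortho") # coarse scene structure

err_coarse = np.abs(patch - coarse_depth).mean()
err_full = np.abs(patch - idctn(coeffs, norm="ortho")).mean()
print(f"low-frequency reconstruction error: {err_coarse:.3f}")
print(f"full reconstruction error:          {err_full:.3e}")  # ~0: the DCT is invertible
```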

ICLR Conference 2024 Conference Paper

Decoding Natural Images from EEG for Object Recognition

  • Yonghao Song
  • Bingchuan Liu
  • Xiang Li
  • Nanlin Shi
  • Yijun Wang 0001
  • Xiaorong Gao

Electroencephalography (EEG) signals, known for convenient non-invasive acquisition but low signal-to-noise ratio, have recently gained substantial attention due to the potential to decode natural images. This paper presents a self-supervised framework to demonstrate the feasibility of learning image representations from EEG signals, particularly for object recognition. The framework utilizes image and EEG encoders to extract features from paired image stimuli and EEG responses. Contrastive learning aligns these two modalities by constraining their similarity. Our approach achieves state-of-the-art results on a comprehensive EEG-image dataset, with a top-1 accuracy of 15.6% and a top-5 accuracy of 42.8% in 200-way zero-shot tasks. Moreover, we perform extensive experiments to explore the biological plausibility by resolving the temporal, spatial, spectral, and semantic aspects of EEG signals. Besides, we introduce attention modules to capture spatial correlations, providing implicit evidence of the brain activity perceived from EEG data. These findings yield valuable insights for neural decoding and brain-computer interfaces in real-world scenarios. Code available at https://github.com/eeyhsong/NICE-EEG.
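
A schematic of the contrastive alignment between paired EEG and image embeddings (a symmetric InfoNCE-style objective, as is common for such cross-modal alignment); the encoders, embedding dimension, and temperature below are placeholders, not the paper's configuration.

```python
# Schematic of contrastive alignment between paired EEG and image embeddings:
# a symmetric cross-entropy over the similarity matrix (InfoNCE-style).
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(eeg_emb, img_emb, temperature=0.07):
    eeg = F.normalize(eeg_emb, dim=-1)
    img = F.normalize(img_emb, dim=-1)
    logits = eeg @ img.t() / temperature            # (batch, batch) similarities
    targets = torch.arange(len(eeg))                # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

eeg_emb = torch.randn(32, 256)   # placeholder EEG-encoder outputs for 32 paired trials
img_emb = torch.randn(32, 256)   # placeholder image-encoder outputs for the same stimuli
print(contrastive_alignment_loss(eeg_emb, img_emb))
```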

AAAI Conference 2024 Conference Paper

DI-V2X: Learning Domain-Invariant Representation for Vehicle-Infrastructure Collaborative 3D Object Detection

  • Xiang Li
  • Junbo Yin
  • Wei Li
  • Chengzhong Xu
  • Ruigang Yang
  • Jianbing Shen

Vehicle-to-Everything (V2X) collaborative perception has recently gained significant attention due to its capability to enhance scene understanding by integrating information from various agents, e.g., vehicles, and infrastructure. However, current works often treat the information from each agent equally, ignoring the inherent domain gap caused by the utilization of different LiDAR sensors of each agent, thus leading to suboptimal performance. In this paper, we propose DI-V2X, that aims to learn Domain-Invariant representations through a new distillation framework to mitigate the domain discrepancy in the context of V2X 3D object detection. DI-V2X comprises three essential components: a domain-mixing instance augmentation (DMA) module, a progressive domain-invariant distillation (PDD) module, and a domain-adaptive fusion (DAF) module. Specifically, DMA builds a domain-mixing 3D instance bank for the teacher and student models during training, resulting in aligned data representation. Next, PDD encourages the student models from different domains to gradually learn a domain-invariant feature representation towards the teacher, where the overlapping regions between agents are employed as guidance to facilitate the distillation process. Furthermore, DAF closes the domain gap between the students by incorporating calibration-aware domain-adaptive attention. Extensive experiments on the challenging DAIR-V2X and V2XSet benchmark datasets demonstrate DI-V2X achieves remarkable performance, outperforming all the previous V2X models. Code is available at https://github.com/Serenos/DI-V2X.

JBHI Journal 2024 Journal Article

DiffMAR: A Generalized Diffusion Model for Metal Artifact Reduction in CT Images

  • Tianxiao Cai
  • Xiang Li
  • Chenglan Zhong
  • Wei Tang
  • Jixiang Guo

X-ray imaging frequently introduces varying degrees of metal artifacts to computed tomography (CT) images when metal implants are present. For the metal artifact reduction (MAR) task, existing end-to-end methods often exhibit limited generalization capabilities, while methods based on multiple iterations often suffer from accumulated error, resulting in lower-quality restoration outcomes. In this work, we innovatively present a generalized diffusion model for Metal Artifact Reduction (DiffMAR). The proposed method utilizes a linear degradation process to simulate the physical phenomenon of metal artifact formation in CT images and directly learns an iterative restoration process from paired CT images in the reverse process. During the reverse process of DiffMAR, a Time-Latent Adjustment (TLA) module is designed to adjust time embedding at the latent level, thereby minimizing the accumulated error during iterative restoration. We also designed a structure information extraction (SIE) module to utilize linear interpolation data in the image domain, guiding the generation of anatomical structures during iterative restoration. This leads to more accurate and robust shadow-free image generation. Comprehensive analysis, including both synthesized data and clinical evidence, confirms that our proposed method surpasses the current state-of-the-art (SOTA) MAR methods in terms of both image generation quality and generalization.

NeurIPS Conference 2024 Conference Paper

Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization

  • Kai Hu
  • Weichen Yu
  • Yining Li
  • Tianjun Yao
  • Xiang Li
  • Wenhe Liu
  • Lijun Yu
  • Zhiqiang Shen

Recent research indicates that large language models (LLMs) are susceptible to jailbreaking attacks that can generate harmful content. This paper introduces a novel token-level attack method, Adaptive Dense-to-Sparse Constrained Optimization (ADC), which has been shown to successfully jailbreak multiple open-source LLMs. Drawing inspiration from the difficulties of discrete token optimization, our method relaxes the discrete jailbreak optimization into a continuous optimization process while gradually increasing the sparsity of the optimizing vectors. This technique effectively bridges the gap between discrete and continuous space optimization. Experimental results demonstrate that our method is more effective and efficient than state-of-the-art token-level methods. On Harmbench, our approach achieves the highest attack success rate on seven out of eight LLMs compared to the latest jailbreak methods. Trigger Warning: This paper contains model behavior that can be offensive in nature.

NeurIPS Conference 2024 Conference Paper

Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning

  • Chong Ma
  • Hanqi Jiang
  • Wenting Chen
  • Yiwei Li
  • Zihao Wu
  • Xiaowei Yu
  • Zhengliang Liu
  • Lei Guo

In medical multi-modal frameworks, the alignment of cross-modality features presents a significant challenge. However, existing works have learned features that are implicitly aligned from the data, without considering the explicit relationships in the medical context. This reliance on data may lead to poor generalization of the learned alignment relationships. In this work, we propose the Eye-gaze Guided Multi-modal Alignment (EGMA) framework to harness eye-gaze data for better alignment of medical visual and textual features. We explore the natural auxiliary role of radiologists' eye-gaze data in aligning medical images and text, and introduce a novel approach by using eye-gaze data, collected synchronously by radiologists during diagnostic evaluations. We conduct downstream tasks of image classification and image-text retrieval on four medical datasets, where EGMA achieved state-of-the-art performance and stronger generalization across different datasets. Additionally, we explore the impact of varying amounts of eye-gaze data on model performance, highlighting the feasibility and utility of integrating this auxiliary data into the multi-modal alignment framework.

NeurIPS Conference 2024 Conference Paper

Imprecise Label Learning: A Unified Framework for Learning with Various Imprecise Label Configurations

  • Hao Chen
  • Ankit Shah
  • Jindong Wang
  • Ran Tao
  • Yidong Wang
  • Xiang Li
  • Xing Xie
  • Masashi Sugiyama

Learning with reduced labeling standards, such as noisy label, partial label, and supplementary unlabeled data, which we generically refer to as imprecise label, is a commonplace challenge in machine learning tasks. Previous methods tend to propose specific designs for every emerging imprecise label configuration, which is usually unsustainable when multiple configurations of imprecision coexist. In this paper, we introduce imprecise label learning (ILL), a framework for the unification of learning with various imprecise label configurations. ILL leverages expectation-maximization (EM) for modeling the imprecise label information, treating the precise labels as latent variables. Instead of approximating the correct labels for training, it considers the entire distribution of all possible labeling entailed by the imprecise information. We demonstrate that ILL can seamlessly adapt to partial label learning, semi-supervised learning, noisy label learning, and, more importantly, a mixture of these settings, with closed-form learning objectives derived from the unified EM modeling. Notably, ILL surpasses the existing specified techniques for handling imprecise labels, marking the first practical and unified framework with robust and effective performance across various challenging settings. We hope our work will inspire further research on this topic, unleashing the full potential of ILL in wider scenarios where precise labels are expensive and complicated to obtain.

AAAI Conference 2024 Conference Paper

In-Hand 3D Object Reconstruction from a Monocular RGB Video

  • Shijian Jiang
  • Qi Ye
  • Rengan Xie
  • Yuchi Huo
  • Xiang Li
  • Yang Zhou
  • Jiming Chen

Our work aims to reconstruct a 3D object that is held and rotated by a hand in front of a static RGB camera. Previous methods that use implicit neural representations to recover the geometry of a generic hand-held object from multi-view images achieved compelling results in the visible part of the object. However, these methods falter in accurately capturing the shape within the hand-object contact region due to occlusion. In this paper, we propose a novel method that deals with surface reconstruction under occlusion by incorporating priors of 2D occlusion elucidation and physical contact constraints. For the former, we introduce an object amodal completion network to infer the 2D complete mask of objects under occlusion. To ensure the accuracy and view consistency of the predicted 2D amodal masks, we devise a joint optimization method for both amodal mask refinement and 3D reconstruction. For the latter, we impose penetration and attraction constraints on the local geometry in contact regions. We evaluate our approach on HO3D and HOD datasets and demonstrate that it outperforms the state-of-the-art methods in terms of reconstruction surface quality, with an improvement of 52% on HO3D and 20% on HOD. Project webpage: https://east-j.github.io/ihor.

NeurIPS Conference 2024 Conference Paper

Memory-Efficient Gradient Unrolling for Large-Scale Bi-level Optimization

  • Qianli Shen
  • Yezhen Wang
  • Zhouhao Yang
  • Xiang Li
  • Haonan Wang
  • Yang Zhang
  • Jonathan Scarlett
  • Zhanxing Zhu

Bi-level optimization (BO) has become a fundamental mathematical framework for addressing hierarchical machine learning problems. As deep learning models continue to grow in size, the demand for scalable bi-level optimization has become increasingly critical. Traditional gradient-based bi-level optimization algorithms, due to their inherent characteristics, are ill-suited to meet the demands of large-scale applications. In this paper, we introduce **F**orward **G**radient **U**nrolling with **F**orward **G**radient, abbreviated as **$($FG$)^2$U**, which achieves an unbiased stochastic approximation of the meta gradient for bi-level optimization. $($FG$)^2$U circumvents the memory and approximation issues associated with classical bi-level optimization approaches, and delivers significantly more accurate gradient estimates than existing large-scale bi-level optimization approaches. Additionally, $($FG$)^2$U is inherently designed to support parallel computing, enabling it to effectively leverage large-scale distributed computing systems to achieve significant computational efficiency. In practice, $($FG$)^2$U and other methods can be strategically placed at different stages of the training process to achieve a more cost-effective two-phase paradigm. Further, $($FG$)^2$U is easy to implement within popular deep learning frameworks, and can be conveniently adapted to address more challenging zeroth-order bi-level optimization scenarios. We provide a thorough convergence analysis and a comprehensive practical discussion for $($FG$)^2$U, complemented by extensive empirical evaluations, showcasing its superior performance in diverse large-scale bi-level optimization tasks.
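
As a toy illustration of the forward-gradient building block the abstract relies on (an unbiased gradient estimate formed from a single directional derivative along a random direction), the NumPy sketch below uses a quadratic whose directional derivative is available in closed form; it is not the $($FG$)^2$U algorithm itself.

```python
# Toy illustration of a forward gradient: for v ~ N(0, I), (grad f . v) v is an
# unbiased estimate of grad f, using only a directional derivative (no backprop).
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 10))
b = rng.normal(size=30)
x = rng.normal(size=10)

grad = A.T @ (A @ x - b)                       # exact gradient of f(x) = 0.5*||Ax - b||^2

estimates = []
for _ in range(20_000):
    v = rng.normal(size=10)
    directional = grad @ v                     # in practice: one forward-mode JVP
    estimates.append(directional * v)          # forward-gradient estimate of grad f

approx = np.mean(estimates, axis=0)
print(np.linalg.norm(approx - grad) / np.linalg.norm(grad))  # small: the estimator is unbiased
```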

JBHI Journal 2024 Journal Article

MH2AFormer: An Efficient Multiscale Hierarchical Hybrid Attention With a Transformer for Bladder Wall and Tumor Segmentation

  • Xiang Li
  • Jian Wang
  • Haifeng Wei
  • Jinyu Cong
  • Hongfu Sun
  • Pingping Wang
  • Benzheng Wei

Achieving accurate bladder wall and tumor segmentation from MRI is critical for diagnosing and treating bladder cancer. However, automated segmentation remains challenging due to factors such as comparable density distributions, intricate tumor morphologies, and unclear boundaries. Considering the attributes of bladder MRI images, we propose an efficient multiscale hierarchical hybrid attention with a transformer (MH2AFormer) for bladder cancer and wall segmentation. Specifically, a multiscale hybrid attention and transformer (MHAT) module in the encoder is designed to adaptively extract and aggregate multiscale hybrid feature representations from the input image. In the decoder stage, we devise a multiscale hybrid attention (MHA) module to generate high-quality segmentation results from multiscale hybrid features. Combining these modules enhances the feature representation and guides the model to focus on tumor and wall regions, which helps to solve bladder image segmentation challenges. Moreover, MHAT utilizes the Fast Fourier Transformer with a large kernel (e.g., 224 × 224) to model global feature relationships while reducing computational complexity in the encoding stage. The model performance was evaluated on two datasets. As a result, the model achieves relatively best results regarding the intersection over union (IoU) and dice similarity coefficient (DSC) on both datasets (Dataset A: IoU = 80.26%, DSC = 88.20%; Dataset B: IoU = 89.74%, DSC = 94.48%). These advantageous outcomes substantiate the practical utility of our approach, highlighting its potential to alleviate the workload of radiologists when applied in clinical settings.

ICLR Conference 2024 Conference Paper

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

  • Deyao Zhu
  • Jun Chen 0021
  • Xiaoqian Shen
  • Xiang Li
  • Mohamed Elhoseiny

The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. However, the technical details behind GPT-4 continue to remain undisclosed. We believe that the enhanced multi-modal generation capabilities of GPT-4 stem from the utilization of sophisticated large language models (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen advanced LLM, Vicuna, using one projection layer. Our work, for the first time, uncovers that properly aligning the visual features with an advanced large language model can possess numerous advanced multi-modal abilities demonstrated by GPT-4, such as detailed image description generation and website creation from hand-drawn drafts. Furthermore, we also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images, teaching users how to cook based on food photos, and so on. In our experiment, we found that the model trained on short image caption pairs could produce unnatural language outputs (e.g., repetition and fragmentation). To address this problem, we curate a detailed image description dataset in the second stage to finetune the model, which consequently improves the model's generation reliability and overall usability.
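
The alignment step described above amounts to a single trainable linear projection from the frozen visual encoder's output space into the frozen LLM's token-embedding space. The schematic below illustrates that interface only; the dimensions and tensor shapes are illustrative placeholders, not MiniGPT-4's actual configuration.

```python
# Schematic of aligning a frozen visual encoder to a frozen LLM with one
# trainable projection layer.
import torch
import torch.nn as nn

visual_dim, llm_dim = 1408, 4096
project = nn.Linear(visual_dim, llm_dim)          # the only trainable parameters

visual_tokens = torch.randn(1, 32, visual_dim)    # output of the frozen visual encoder
text_embeds = torch.randn(1, 20, llm_dim)         # embeddings of the text prompt
llm_inputs = torch.cat([project(visual_tokens), text_embeds], dim=1)
print(llm_inputs.shape)                           # torch.Size([1, 52, 4096]), fed to the frozen LLM
```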

IJCAI Conference 2024 Conference Paper

No Regularization Is Needed: Efficient and Effective Incomplete Label Distribution Learning

  • Xiang Li
  • Songcan Chen

In reality, it is laborious to obtain complete label degrees, giving birth to Incomplete Label Distribution Learning (InLDL), where some degrees are missing. Existing InLDL methods often assume that degrees are missing uniformly at random. However, this is often not the case in practice, which gives rise to the first issue. Besides, they often adopt explicit regularization to compensate for the incompleteness, leading to burdensome parameter tuning and extra computation, causing the second issue. To address the first issue, we adopt a more practical setting, i.e., small degrees are more prone to be missing, since large degrees are likely to catch more attention. To tackle the second issue, we argue that label distribution itself already contains abundant knowledge, such as label correlation and ranking order, thus it may have provided enough prior for learning. It is precisely because existing methods overlook such a prior that they are forced to adopt explicit regularization. By directly utilizing the label degrees prior, we design a properly weighted objective function, exempting the need for explicit regularization. Moreover, we provide rigorous theoretical analysis, revealing in principle that the weighting plays an implicit regularization role. To sum up, our method has four advantages: it is 1) model selection free; 2) with a closed-form solution (sub-problem) and easy to implement (a few lines of code); 3) with linear computational complexity in the number of samples, thus scalable to large datasets; 4) competitive with state-of-the-arts in both random and non-random missing scenarios.

NeurIPS Conference 2024 Conference Paper

Novel Object Synthesis via Adaptive Text-Image Harmony

  • Zeren Xiong
  • Zedong Zhang
  • Zikun Chen
  • Shuo Chen
  • Xiang Li
  • Gan Sun
  • Jian Yang
  • Jun Li

In this paper, we study an object synthesis task that combines an object text with an object image to create a new object image. However, most diffusion models struggle with this task, i.e., often generating an object that predominantly reflects either the text or the image due to an imbalance between their inputs. To address this issue, we propose a simple yet effective method called Adaptive Text-Image Harmony (ATIH) to generate novel and surprising objects. First, we introduce a scale factor and an injection step to balance text and image features in cross-attention and to preserve image information in self-attention during the text-image inversion diffusion process, respectively. Second, to better integrate object text and image, we design a balanced loss function with a noise parameter, ensuring both optimal editability and fidelity of the object image. Third, to adaptively adjust these parameters, we present a novel similarity score function that not only maximizes the similarities between the generated object image and the input text/image but also balances these similarities to harmonize text and image integration. Extensive experiments demonstrate the effectiveness of our approach, showcasing remarkable object creations such as colobus-glass jar. https://xzr52.github.io/ATIH/

IJCAI Conference 2024 Conference Paper

RisQNet: Rescuing SMEs from Financial Shocks with a Novel Networked-Loan Risk Assessment

  • Zhaoyuan Lu
  • Taijun Li
  • Jingzhen Zhang
  • Moyang Liu
  • Xiang Li
  • Linyi Cui
  • Junqi Chen
  • Zhibin Niu

In the face of economic downturns, Small and Medium-sized Enterprises (SMEs) within interconnected networked-loans are vulnerable to cascading debt crises, exacerbated by factors like social media-induced financial shocks. Traditional risk assessment models, which mainly rely on financial data, inadequately predict such crises, as evidenced by the collapse of Silicon Valley Bank in 2023. To address this issue, we developed RisQNet, a model that uses temporal graph networks to incorporate diverse risks, including real-time media influences. This approach not only advances risk prediction through news feature extraction and large language models but also enhances risk management strategies with intuitive visualization tools. Validated on a dataset with a total loan volume of USD 3 trillion, RisQNet outperforms the state-of-the-art baseline and achieves an AUC of 87.1%. Our collaborative effort with financial regulators and the SME community underpins the model's development, aligning with the UN SDG 8. RisQNet represents a significant step forward in leveraging AI for financial stability, offering a promising approach to combat the propagation of debt crises in financial networks.

NeurIPS Conference 2024 Conference Paper

SARDet-100K: Towards Open-Source Benchmark and ToolKit for Large-Scale SAR Object Detection

  • Yuxuan Li
  • Xiang Li
  • Weijie Li
  • Qibin Hou
  • Li Liu
  • Ming-Ming Cheng
  • Jian Yang

Synthetic Aperture Radar (SAR) object detection has gained significant attention recently due to its irreplaceable all-weather imaging capabilities. However, this research field suffers from both limited public datasets (mostly comprising <2K images with only mono-category objects) and inaccessible source code. To tackle these challenges, we establish a new benchmark dataset and an open-source method for large-scale SAR object detection. Our dataset, SARDet-100K, is a result of intense surveying, collecting, and standardizing 10 existing SAR detection datasets, providing a large-scale and diverse dataset for research purposes. To the best of our knowledge, SARDet-100K is the first COCO-level large-scale multi-class SAR object detection dataset ever created. With this high-quality dataset, we conducted comprehensive experiments and uncovered a crucial challenge in SAR object detection: the substantial disparities between pretraining on RGB datasets and finetuning on SAR datasets in terms of both data domain and model structure. To bridge these gaps, we propose a novel Multi-Stage with Filter Augmentation (MSFA) pretraining framework that tackles the problems from the perspective of data input, domain transition, and model migration. The proposed MSFA method significantly enhances the performance of SAR object detection models while demonstrating exceptional generalizability and flexibility across diverse models. This work aims to pave the way for further advancements in SAR object detection. The dataset and code are available at \url{https://github.com/zcablii/SARDet_100K}.

NeurIPS Conference 2024 Conference Paper

Slight Corruption in Pre-training Data Makes Better Diffusion Models

  • Hao Chen
  • Yujin Han
  • Diganta Misra
  • Xiang Li
  • Kai Hu
  • Difan Zou
  • Masashi Sugiyama
  • Jindong Wang

Diffusion models (DMs) have shown remarkable capabilities in generating realistic high-quality images, audios, and videos. They benefit significantly from extensive pre-training on large-scale datasets, including web-crawled data with paired data and conditions, such as image-text and image-class pairs. Despite rigorous filtering, these pre-training datasets often inevitably contain corrupted pairs where conditions do not accurately describe the data. This paper presents the first comprehensive study on the impact of such corruption in pre-training data of DMs. We synthetically corrupt ImageNet-1K and CC3M to pre-train and evaluate over $50$ conditional DMs. Our empirical findings reveal that various types of slight corruption in pre-training can significantly enhance the quality, diversity, and fidelity of the generated images across different DMs, both during pre-training and downstream adaptation stages. Theoretically, we consider a Gaussian mixture model and prove that slight corruption in the condition leads to higher entropy and a reduced 2-Wasserstein distance to the ground truth of the data distribution generated by the corruptly trained DMs. Inspired by our analysis, we propose a simple method to improve the training of DMs on practical datasets by adding condition embedding perturbations (CEP). CEP significantly improves the performance of various DMs in both pre-training and downstream tasks. We hope that our study provides new insights into understanding the data and pre-training processes of DMs.
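
A minimal sketch of the condition embedding perturbation (CEP) idea, assuming the condition is available as a dense embedding that can simply be jittered with Gaussian noise during training; the noise scale sigma below is an illustrative choice, not the paper's setting.

import torch

def perturb_condition(cond_emb: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    # Add small Gaussian noise to the condition embedding so the diffusion
    # model is trained under slight, controlled condition corruption.
    return cond_emb + sigma * torch.randn_like(cond_emb)

cond = torch.randn(32, 512)           # batch of class/text condition embeddings
noisy_cond = perturb_condition(cond)  # used in place of cond in the training step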

NeurIPS Conference 2024 Conference Paper

Suitable is the Best: Task-Oriented Knowledge Fusion in Vulnerability Detection

  • Jingjing Wang
  • Minhuan Huang
  • Yuanpin Nie
  • Xiang Li
  • Qianjin Du
  • Wei Kong
  • Huan Deng
  • Xiaohui Kuang

Deep learning technologies have demonstrated remarkable performance in vulnerability detection. Existing works primarily adopt a uniform and consistent feature learning pattern across the entire target set. While designed for general-purpose detection tasks, they lack sensitivity towards target code comprising multiple functional modules or diverse vulnerability subtypes. In this paper, we present a knowledge fusion-based vulnerability detection method (KF-GVD) that integrates specific vulnerability knowledge into the Graph Neural Network feature learning process. KF-GVD achieves accurate vulnerability detection across different functional modules of the Linux kernel and vulnerability subtypes without compromising general task performance. Extensive experiments demonstrate that KF-GVD outperforms SOTAs on function-level and statement-level vulnerability detection across various target tasks, with an average increase of 40.9% in precision and 26.1% in recall. Notably, KF-GVD discovered 9 undisclosed vulnerabilities when employed on C/C++ open-source projects without ground truth.

ICRA Conference 2024 Conference Paper

TPGP: Temporal-Parametric Optimization with Deep Grasp Prior for Dexterous Motion Planning

  • Haoming Li 0004
  • Qi Ye 0001
  • Yuchi Huo
  • Qingtao Liu
  • Shijian Jiang
  • Tao Zhou
  • Xiang Li
  • Yang Zhou

Grasping motion planning aims to find a feasible grasping trajectory in the configuration space given an input target grasp. While optimizing grasp motion with two- or three-fingered grippers has been well studied, natural grasp motion planning with a dexterous hand remains a very challenging problem due to the high-dimensional working space. In this work, we propose a novel temporal-parametric grasp prior (TPGP) optimization method to simplify the difficulty of grasping trajectory optimization for the dexterous hand while maintaining smooth and natural properties of the grasping motion. Specifically, we formulate the discrete trajectory parameters into a temporal-based parameterization, where a prior constraint, provided by a hand poser network, is introduced to ensure that the hand pose is natural and reasonable throughout the trajectory. Finally, we present a joint target optimization strategy to enhance the target pose for more feasible trajectories. Extensive validations on two public datasets show that our method outperforms state-of-the-art methods regarding grasp motion on various metrics.

NeurIPS Conference 2024 Conference Paper

Understanding Generalizability of Diffusion Models Requires Rethinking the Hidden Gaussian Structure

  • Xiang Li
  • Yixiang Dai
  • Qing Qu

In this work, we study the generalizability of diffusion models by looking into the hidden properties of the learned score functions, which are essentially a series of deep denoisers trained on various noise levels. We observe that as diffusion models transition from memorization to generalization, their corresponding nonlinear diffusion denoisers exhibit increasing linearity. This discovery leads us to investigate the linear counterparts of the nonlinear diffusion models, which are a series of linear models trained to match the function mappings of the nonlinear diffusion denoisers. Surprisingly, these linear denoisers are approximately the optimal denoisers for a multivariate Gaussian distribution characterized by the empirical mean and covariance of the training dataset. This finding implies that diffusion models have the inductive bias towards capturing and utilizing the Gaussian structure (covariance information) of the training dataset for data generation. We empirically demonstrate that this inductive bias is a unique property of diffusion models in the generalization regime, which becomes increasingly evident when the model's capacity is relatively small compared to the training dataset size. In the case that the model is highly overparameterized, this inductive bias emerges during the initial training phases before the model fully memorizes its training data. Our study provides crucial insights into understanding the notable strong generalization phenomenon recently observed in real-world diffusion models.
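
For reference, the optimal (MMSE) denoiser for a multivariate Gaussian $\mathcal{N}(\mu, \Sigma)$ observed under additive noise of variance $\sigma^2$, which the linear counterparts above are said to approximate, takes the standard closed form

$$\hat{x}(y) = \mu + \Sigma\,(\Sigma + \sigma^2 I)^{-1}\,(y - \mu),$$

where $\mu$ and $\Sigma$ correspond to the empirical mean and covariance of the training dataset (a standard Gaussian identity, included here only as background).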

NeurIPS Conference 2024 Conference Paper

UniAudio 1.5: Large Language Model-Driven Audio Codec is A Few-Shot Audio Task Learner

  • Dongchao Yang
  • Haohan Guo
  • Yuanyuan Wang
  • Rongjie Huang
  • Xiang Li
  • Xu Tan
  • Xixin Wu
  • Helen Meng

Large Language Models (LLMs) have demonstrated supreme capabilities in textual understanding and generation, but cannot be directly applied to cross-modal tasks without fine-tuning. This paper proposes a cross-modal in-context learning approach, empowering the frozen LLMs to achieve multiple audio tasks in a few-shot style without any parameter update. Specifically, we propose a novel LLM-driven audio codec model, LLM-Codec, which transfers the audio modality into textual space by representing audio tokens with words or sub-words from the LLM vocabulary, while maintaining high audio reconstruction quality. The key idea is to reduce the modality heterogeneity between text and audio by compressing the audio modality into the well-trained textual space of LLMs. Thus, the audio representation can be viewed as a new foreign language, and LLMs can learn the new foreign language with several demonstrations. In experiments, we investigate the performance of the proposed approach across multiple audio understanding and generation tasks, e.g., speech emotion classification, audio classification, text-to-speech generation, speech enhancement, etc. Experimental results show that LLMs equipped with the LLM-Codec, named UniAudio 1.5, prompted by only a few examples, can perform effectively in simple scenarios, validating our cross-modal in-context learning approach. To facilitate research on few-shot audio task learning and multi-modal LLMs, we have open-sourced the LLM-Codec model.

TMLR Journal 2024 Journal Article

Variance-aware decision making with linear function approximation under heavy-tailed rewards

  • Xiang Li
  • Qiang Sun

This paper studies how to achieve variance-aware regrets for online decision-making in the presence of heavy-tailed rewards with only finite variances. For linear stochastic bandits, we address the issue of heavy-tailed rewards by modifying the adaptive Huber regression and proposing AdaOFUL. AdaOFUL achieves a state-of-the-art regret bound of $\widetilde{\mathcal{O}}\big(d\big(\sum_{t=1}^T \nu_{t}^2\big)^{1/2}+d\big)$ as if the rewards were uniformly bounded, where $\nu_{t}^2$ is the conditional variance of the reward at round $t$, $d$ is the feature dimension, and $T$ is the number of online rounds. Building upon AdaOFUL, we propose VARA for linear MDPs, which achieves a variance-aware regret bound of $\widetilde{\mathcal{O}}(d\sqrt{H\mathcal{G}^*K})$. Here, $H$ is the length of episodes, $K$ is the number of episodes, and $\mathcal{G}^*$ is a smaller instance-dependent quantity that can be bounded by other instance-dependent quantities when additional structural conditions on the MDP are satisfied. Overall, our modified adaptive Huber regression algorithm may serve as a useful building block in the design of algorithms for online problems with heavy-tailed rewards.
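
As background for the adaptive Huber regression mentioned above, the Huber loss with robustification parameter $\tau$ replaces the squared loss by

$$\ell_\tau(x) = \begin{cases} x^2/2, & |x| \le \tau, \\ \tau |x| - \tau^2/2, & |x| > \tau, \end{cases}$$

so that large residuals caused by heavy-tailed rewards contribute only linearly; the adaptive variant tunes $\tau$ across rounds, and this definition is standard background rather than the paper's exact algorithm.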

IROS Conference 2024 Conference Paper

Voltage Regulation in Polymer Electrolyte Fuel Cell Systems Using Gaussian Process Model Predictive Control

  • Xiufei Li
  • Miao Yang
  • Miao Zhang
  • Yuanxin Qi
  • Zhuowei Li 0008
  • Senbin Yu
  • Yuantao Wang
  • Linpeng Shen

This study presents a novel approach using Gaussian process model predictive control (MPC) to stabilize the output voltage of a polymer electrolyte fuel cell (PEFC) by regulating hydrogen and airflow rates. Two Gaussian process models capture PEFC dynamics, accounting for constraints like hydrogen pressure and input change rates to reduce predictive control errors. The performance of the physical model and Gaussian process MPC in handling constraints and system inputs is compared. Simulations show that the proposed Gaussian process MPC maintains the voltage at 48 V while adhering to safety constraints, even with workload disturbances from 110-120 A. Compared to traditional MPC with detailed system models, Gaussian process MPC has similar overshoot and slower response time but requires less system information and no underlying true system model.

NeurIPS Conference 2024 Conference Paper

VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding

  • Xiang Li
  • Jian Ding
  • Mohamed Elhoseiny

We introduce a new benchmark designed to advance the development of general-purpose, large-scale vision-language models for remote sensing images. Although several vision-language datasets in remote sensing have been proposed to pursue this goal, existing datasets are typically tailored to single tasks, lack detailed object information, or suffer from inadequate quality control. Exploring these improvement opportunities, we present a Versatile vision-language Benchmark for Remote Sensing image understanding, termed VRSBench. This benchmark comprises 29,614 images, with 29,614 human-verified detailed captions, 52,472 object references, and 123,221 question-answer pairs. It facilitates the training and evaluation of vision-language models across a broad spectrum of remote sensing image understanding tasks. We further evaluated state-of-the-art models on this benchmark for three vision-language tasks: image captioning, visual grounding, and visual question answering. Our work aims to significantly contribute to the development of advanced vision-language models in the field of remote sensing. The data and code can be accessed at https://vrsbench.github.io.

NeurIPS Conference 2023 Conference Paper

A Unified Solution for Privacy and Communication Efficiency in Vertical Federated Learning

  • Ganyu Wang
  • Bin Gu
  • Qingsong Zhang
  • Xiang Li
  • Boyu Wang
  • Charles X. Ling

Vertical Federated Learning (VFL) is a collaborative machine learning paradigm that enables multiple participants to jointly train a model on their private data without sharing it. To make VFL practical, privacy security and communication efficiency should both be satisfied. Recent research has shown that Zero-Order Optimization (ZOO) in VFL can effectively conceal the internal information of the model without adding costly privacy-protective add-ons, making it a promising approach for privacy and efficiency. However, there are still two key problems that have yet to be resolved. First, the convergence rate of ZOO-based VFL is significantly slower compared to gradient-based VFL, resulting in low efficiency in model training and more communication rounds, which hinders its application on large neural networks. Second, although ZOO-based VFL has demonstrated resistance to state-of-the-art (SOTA) attacks, its privacy guarantee lacks a theoretical explanation. To address these challenges, we propose a novel cascaded hybrid optimization approach that employs a zeroth-order (ZO) gradient on the most critical output layer of the clients, with other parts utilizing the first-order (FO) gradient. This approach preserves the privacy protection of ZOO while significantly enhancing convergence. Moreover, we theoretically prove that applying ZOO to the VFL is equivalent to adding Gaussian Mechanism to the gradient information, which offers an implicit differential privacy guarantee. Experimental results demonstrate that our proposed framework achieves similar utility as the Gaussian mechanism under the same privacy budget, while also having significantly lower communication costs compared with SOTA communication-efficient VFL frameworks.
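
A minimal sketch of the two-point zeroth-order gradient estimate that ZOO-based VFL builds on, written for a generic loss; the smoothing radius mu, the number of probe directions, and the toy objective are illustrative assumptions.

import numpy as np

def zo_gradient(loss_fn, w, mu=1e-3, num_dirs=10):
    # Two-point zeroth-order estimate: probe the loss along random directions
    # instead of backpropagating, so no explicit gradient leaves the client.
    g = np.zeros_like(w)
    for _ in range(num_dirs):
        u = np.random.randn(*w.shape)
        g += (loss_fn(w + mu * u) - loss_fn(w - mu * u)) / (2 * mu) * u
    return g / num_dirs

w = np.zeros(5)
grad = zo_gradient(lambda v: np.sum((v - 1.0) ** 2), w)
print(grad)  # approaches the true gradient 2*(w - 1) as num_dirs grows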

IJCAI Conference 2023 Conference Paper

Compositional Zero-Shot Artistic Font Synthesis

  • Xiang Li
  • Lei Wu
  • Changshuo Wang
  • Lei Meng
  • Xiangxu Meng

Recently, many researchers have made remarkable achievements in the field of artistic font synthesis, with impressive glyph style and effect style in the results. However, due to limited exploration of style disentanglement, existing methods struggle to envision unseen style (glyph-effect) compositions of artistic fonts, and thus can only learn the seen style compositions. To solve this problem, we propose a novel compositional zero-shot artistic font synthesis GAN (CAFS-GAN), which allows the synthesis of unseen style compositions by exploring the visual independence and joint compatibility of encoding semantics between glyph and effect. Specifically, we propose two contrast-based style encoders to achieve style disentanglement, since glyph and effect are intertwined in the image. Meanwhile, to preserve more glyph and effect detail, we propose a generator based on hierarchical dual-style AdaIN to reorganize content-style representations from structure to texture gradually. Extensive experiments demonstrate the superiority of our model in generating high-quality artistic font images with unseen style compositions against other state-of-the-art methods. The source code and data are available at moonlight03.github.io/CAFS-GAN/.

IJCAI Conference 2023 Conference Paper

Contact2Grasp: 3D Grasp Synthesis via Hand-Object Contact Constraint

  • Haoming Li
  • Xinzhuo Lin
  • Yang Zhou
  • Xiang Li
  • Yuchi Huo
  • Jiming Chen
  • Qi Ye

3D grasp synthesis generates grasping poses given an input object. Existing works tackle the problem by learning a direct mapping from objects to the distributions of grasping poses. However, because physical contact is sensitive to small changes in pose, the highly nonlinear mapping from 3D object representations to valid poses is considerably non-smooth, leading to poor generation efficiency and restricted generality. To tackle the challenge, we introduce an intermediate variable for grasp contact areas to constrain the grasp generation; in other words, we factorize the mapping into two sequential stages by assuming that grasping poses are fully constrained given contact maps: 1) we first learn contact map distributions to generate the potential contact maps for grasps; 2) then learn a mapping from the contact maps to the grasping poses. Further, we propose a penetration-aware optimization with the generated contacts as a consistency constraint for grasp refinement. Extensive validations on two public datasets show that our method outperforms state-of-the-art methods regarding grasp generation on various metrics.

AAAI Conference 2023 Conference Paper

Curriculum Temperature for Knowledge Distillation

  • Zheng Li
  • Xiang Li
  • Lingfeng Yang
  • Borui Zhao
  • Renjie Song
  • Lei Luo
  • Jun Li
  • Jian Yang

Most existing distillation methods ignore the flexible role of the temperature in the loss function and fix it as a hyper-parameter that can be decided by an inefficient grid search. In general, the temperature controls the discrepancy between two distributions and can faithfully determine the difficulty level of the distillation task. Keeping a constant temperature, i.e., a fixed level of task difficulty, is usually sub-optimal for a growing student during its progressive learning stages. In this paper, we propose a simple curriculum-based technique, termed Curriculum Temperature for Knowledge Distillation (CTKD), which controls the task difficulty level during the student's learning career through a dynamic and learnable temperature. Specifically, following an easy-to-hard curriculum, we gradually increase the distillation loss w.r.t. the temperature, leading to increased distillation difficulty in an adversarial manner. As an easy-to-use plug-in technique, CTKD can be seamlessly integrated into existing knowledge distillation frameworks and brings general improvements at a negligible additional computation cost. Extensive experiments on CIFAR-100, ImageNet-2012, and MS-COCO demonstrate the effectiveness of our method.
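
A minimal sketch of a temperature-scaled distillation loss; in CTKD the temperature would be a learnable parameter driven along an easy-to-hard curriculum, whereas the fixed value below is only an illustrative placeholder.

import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T):
    # Standard temperature-scaled KL distillation loss; in CTKD the temperature
    # T would be learnable and pushed toward harder values as training proceeds.
    p_t = F.softmax(teacher_logits / T, dim=-1)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)

s = torch.randn(8, 100)
t = torch.randn(8, 100)
print(kd_loss(s, t, T=4.0))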

AAAI Conference 2023 Conference Paper

Decision-Making Context Interaction Network for Click-Through Rate Prediction

  • Xiang Li
  • Shuwei Chen
  • Jian Dong
  • Jin Zhang
  • Yongkang Wang
  • Xingxing Wang
  • Dong Wang

Click-through rate (CTR) prediction is crucial in recommendation and online advertising systems. Existing methods usually model user behaviors, while ignoring the informative context which influences the user to make a click decision, e.g., click pages and pre-ranking candidates that inform inferences about user interests, leading to suboptimal performance. In this paper, we propose a Decision-Making Context Interaction Network (DCIN), which deploys a carefully designed Context Interaction Unit (CIU) to learn decision-making contexts and thus benefits CTR prediction. In addition, the relationship between different decision-making context sources is explored by the proposed Adaptive Interest Aggregation Unit (AIAU) to improve CTR prediction further. In experiments on public and industrial datasets, DCIN significantly outperforms the state-of-the-art methods. Notably, the model obtained improvements of CTR +2.9% / CPM +2.1% / GMV +1.5% in online A/B testing and has served the main traffic of the Meituan Waimai advertising system.

AAAI Conference 2023 Conference Paper

DesNet: Decomposed Scale-Consistent Network for Unsupervised Depth Completion

  • Zhiqiang Yan
  • Kun Wang
  • Xiang Li
  • Zhenyu Zhang
  • Jun Li
  • Jian Yang

Unsupervised depth completion aims to recover dense depth from the sparse one without using the ground-truth annotation. Although depth measurement obtained from LiDAR is usually sparse, it contains valid and real distance information, i.e., scale-consistent absolute depth values. Meanwhile, scale-agnostic counterparts seek to estimate relative depth and have achieved impressive performance. To leverage both inherent characteristics, we suggest modeling scale-consistent depth upon unsupervised scale-agnostic frameworks. Specifically, we propose the decomposed scale-consistent learning (DSCL) strategy, which disintegrates the absolute depth into relative depth prediction and global scale estimation, contributing to individual learning benefits. Unfortunately, most existing unsupervised scale-agnostic frameworks heavily suffer from depth holes due to the extremely sparse depth input and weak supervisory signal. To tackle this issue, we introduce the global depth guidance (GDG) module, which attentively propagates dense depth reference into the sparse target via novel dense-to-sparse attention. Extensive experiments show the superiority of our method on outdoor KITTI, ranking 1st and outperforming the best KBNet by more than 12% in RMSE. Additionally, our approach achieves state-of-the-art performance on the indoor NYUv2 benchmark as well.

NeurIPS Conference 2023 Conference Paper

DFRD: Data-Free Robustness Distillation for Heterogeneous Federated Learning

  • Kangyang Luo
  • Shuai Wang
  • Yexuan Fu
  • Xiang Li
  • Yunshi Lan
  • Ming Gao

Federated Learning (FL) is a privacy-constrained decentralized machine learning paradigm in which clients enable collaborative training without compromising private data. However, how to learn a robust global model in the data-heterogeneous and model-heterogeneous FL scenarios is challenging. To address it, we resort to data-free knowledge distillation to propose a new FL method (namely DFRD). DFRD equips a conditional generator on the server to approximate the training space of the local models uploaded by clients, and systematically investigates its training in terms of fidelity, transferability and diversity. To overcome the catastrophic forgetting of the global model caused by the distribution shifts of the generator across communication rounds, we maintain an exponential moving average copy of the generator on the server. Additionally, we propose dynamic weighting and label sampling to accurately extract knowledge from local models. Finally, our extensive experiments on various image classification tasks illustrate that DFRD achieves significant performance gains compared to SOTA baselines.

NeurIPS Conference 2023 Conference Paper

Fine-Grained Visual Prompting

  • Lingfeng Yang
  • Yueze Wang
  • Xiang Li
  • Xinlong Wang
  • Jian Yang

Vision-Language Models (VLMs), such as CLIP, have demonstrated impressive zero-shot transfer capabilities in image-level visual perception. However, these models have shown limited performance in instance-level tasks that demand precise localization and recognition. Previous works have suggested that incorporating visual prompts, such as colorful boxes or circles, can improve the ability of models to recognize objects of interest. Nonetheless, compared to language prompting, visual prompting designs are rarely explored. Existing approaches, which employ coarse visual cues such as colorful boxes or circles, often result in sub-optimal performance due to the inclusion of irrelevant and noisy pixels. In this paper, we carefully study the visual prompting designs by exploring more fine-grained markings, such as segmentation masks and their variations. In addition, we introduce a new zero-shot framework that leverages pixel-level annotations acquired from a generalist segmentation model for fine-grained visual prompting. Consequently, our investigation reveals that a straightforward application of blur outside the target mask, referred to as the Blur Reverse Mask, exhibits exceptional effectiveness. This proposed prompting strategy leverages the precise mask annotations to reduce focus on weakly related regions while retaining spatial coherence between the target and the surrounding background. Our Fine-Grained Visual Prompting (FGVP) demonstrates superior performance in zero-shot comprehension of referring expressions on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks. It outperforms prior methods by an average margin of 3.0% to 4.6%, with a maximum improvement of 12.5% on the RefCOCO+ testA subset. The part detection experiments conducted on the PACO dataset further validate the preponderance of FGVP over existing visual prompting techniques. Code is available at https://github.com/ylingfeng/FGVP.
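
A minimal sketch of the Blur Reverse Mask prompt described above: the background outside the target mask is blurred while the object itself stays sharp; the blur strength and array shapes are illustrative assumptions.

import numpy as np
from scipy.ndimage import gaussian_filter

def blur_reverse_mask(image, mask, sigma=10.0):
    # Blur the background (mask == 0) and keep the masked object crisp,
    # producing the visual prompt image fed to the VLM.
    blurred = gaussian_filter(image, sigma=(sigma, sigma, 0))
    return np.where(mask[..., None].astype(bool), image, blurred)

img = np.random.rand(224, 224, 3)
m = np.zeros((224, 224)); m[60:160, 60:160] = 1
prompt_img = blur_reverse_mask(img, m)
print(prompt_img.shape)  # (224, 224, 3)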

ICRA Conference 2023 Conference Paper

Hierarchical Intention Tracking for Robust Human-Robot Collaboration in Industrial Assembly Tasks

  • Zhe Huang 0010
  • Ye-Ji Mun
  • Xiang Li
  • Yiqing Xie
  • Ninghan Zhong
  • Weihang Liang
  • Junyi Geng
  • Tan Chen 0001

Collaborative robots require effective human intention estimation to safely and smoothly work with humans in less structured tasks such as industrial assembly, where human intention continuously changes. We propose the concept of intention tracking and introduce a collaborative robot system that concurrently tracks intentions at hierarchical levels. The high-level intention is tracked to estimate the human's interaction pattern and enable the robot to (1) avoid collisions with the human to minimize interruption and (2) assist the human in correcting failures. The low-level intention estimate provides the robot with task-related information. We implement the system on a UR5e robot and demonstrate robust, seamless and ergonomic human-robot collaboration in an ablative pilot study of an assembly use case.

NeurIPS Conference 2023 Conference Paper

LD2: Scalable Heterophilous Graph Neural Network with Decoupled Embeddings

  • Ningyi Liao
  • Siqiang Luo
  • Xiang Li
  • Jieming Shi

Heterophilous Graph Neural Network (GNN) is a family of GNNs that specializes in learning graphs under heterophily, where connected nodes tend to have different labels. Most existing heterophilous models incorporate iterative non-local computations to capture node relationships. However, these approaches have limited application to large-scale graphs due to their high computational costs and challenges in adopting minibatch schemes. In this work, we study the scalability issues of heterophilous GNN and propose a scalable model, LD2, which simplifies the learning process by decoupling graph propagation and generating expressive embeddings prior to training. Theoretical analysis demonstrates that LD2 achieves optimal time complexity in training, as well as a memory footprint that remains independent of the graph scale. We conduct extensive experiments to showcase that our model is capable of lightweight minibatch training on large-scale heterophilous graphs, with up to $15\times$ speed improvement and efficient memory utilization, while maintaining comparable or better performance than the baselines.

NeurIPS Conference 2023 Conference Paper

Learning to Compress Prompts with Gist Tokens

  • Jesse Mu
  • Xiang Li
  • Noah Goodman

Prompting is the primary way to utilize the multitask capabilities of language models (LMs), but prompts occupy valuable space in the input context window, and repeatedly encoding the same prompt is computationally inefficient. Finetuning and distillation methods allow for specialization of LMs without prompting, but require retraining the model for each task. To avoid this trade-off entirely, we present gisting, which trains an LM to compress prompts into smaller sets of "gist" tokens which can be cached and reused for compute efficiency. Gist models can be trained with no additional cost over standard instruction finetuning by simply modifying Transformer attention masks to encourage prompt compression. On decoder (LLaMA-7B) and encoder-decoder (FLAN-T5-XXL) LMs, gisting enables up to 26x compression of prompts, resulting in up to 40% FLOPs reductions, 4.2% wall-time speedups, and storage savings, all with minimal loss in output quality.
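
A minimal sketch of the attention-mask modification behind gisting, under the simplifying assumption of a decoder-only model in which tokens after the gist tokens may attend to the gist tokens but not to the raw prompt; the paper's exact masking rules differ in detail.

import torch

def gist_attention_mask(prompt_len, num_gist, rest_len):
    # Build a causal mask in which positions after the gist tokens cannot
    # attend back to the raw prompt, forcing its content through the gists.
    n = prompt_len + num_gist + rest_len
    mask = torch.tril(torch.ones(n, n)).bool()  # standard causal mask
    gist_end = prompt_len + num_gist
    mask[gist_end:, :prompt_len] = False        # block attention to the prompt
    return mask

print(gist_attention_mask(prompt_len=3, num_gist=1, rest_len=2).int())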

AAAI Conference 2023 Conference Paper

LWSIS: LiDAR-Guided Weakly Supervised Instance Segmentation for Autonomous Driving

  • Xiang Li
  • Junbo Yin
  • Botian Shi
  • Yikang Li
  • Ruigang Yang
  • Jianbing Shen

Image instance segmentation is a fundamental research topic in autonomous driving, which is crucial for scene understanding and road safety. Advanced learning-based approaches often rely on the costly 2D mask annotations for training. In this paper, we present a more artful framework, LiDAR-guided Weakly Supervised Instance Segmentation (LWSIS), which leverages the off-the-shelf 3D data, i.e., Point Cloud, together with the 3D boxes, as natural weak supervisions for training the 2D image instance segmentation models. Our LWSIS not only exploits the complementary information in multimodal data during training but also significantly reduces the annotation cost of the dense 2D masks. In detail, LWSIS consists of two crucial modules, Point Label Assignment (PLA) and Graph-based Consistency Regularization (GCR). The former module aims to automatically assign the 3D point cloud as 2D point-wise labels, while the latter further refines the predictions by enforcing geometry and appearance consistency of the multimodal data. Moreover, we conduct a secondary instance segmentation annotation on the nuScenes dataset, named nuInsSeg, to encourage further research on multimodal perception tasks. Extensive experiments on the nuInsSeg, as well as the large-scale Waymo, show that LWSIS can substantially improve existing weakly supervised segmentation models by only involving 3D data during training. Additionally, LWSIS can also be incorporated into 3D object detectors like PointPainting to boost the 3D detection performance for free. The code and dataset are available at https://github.com/Serenos/LWSIS.

ICLR Conference 2023 Conference Paper

Near-optimal Policy Identification in Active Reinforcement Learning

  • Xiang Li
  • Viraj Mehta
  • Johannes Kirschner
  • Ian Char
  • Willie Neiswanger
  • Jeff G. Schneider
  • Andreas Krause 0001
  • Ilija Bogunovic

Many real-world reinforcement learning tasks require control of complex dynamical systems that involve both costly data acquisition processes and large state spaces. In cases where the expensive transition dynamics can be readily evaluated at specified states (e.g., via a simulator), agents can operate in what is often referred to as planning with a \emph{generative model}. We propose the AE-LSVI algorithm for best policy identification, a novel variant of the kernelized least-squares value iteration (LSVI) algorithm that combines optimism with pessimism for active exploration (AE). AE-LSVI provably identifies a near-optimal policy \emph{uniformly} over an entire state space and achieves polynomial sample complexity guarantees that are independent of the number of states. When specialized to the recently introduced offline contextual Bayesian optimization setting, our algorithm achieves improved sample complexity bounds. Experimentally, we demonstrate that AE-LSVI outperforms other RL algorithms in a variety of environments when robustness to the initial state is required.

NeurIPS Conference 2023 Conference Paper

PaintSeg: Painting Pixels for Training-free Segmentation

  • Xiang Li
  • Chung-Ching Lin
  • Yinpeng Chen
  • Zicheng Liu
  • Jinglu Wang
  • Rita Singh
  • Bhiksha Raj

The paper introduces PaintSeg, a new unsupervised method for segmenting objects without any training. We propose an adversarial masked contrastive painting (AMCP) process, which creates a contrast between the original image and a painted image in which a masked area is painted using off-the-shelf generative models. During the painting process, inpainting and outpainting are alternated, with the former masking the foreground and filling in the background, and the latter masking the background while recovering the missing part of the foreground object. Inpainting and outpainting, also referred to as I-step and O-step, allow our method to gradually advance the target segmentation mask toward the ground truth without supervision or training. PaintSeg can be configured to work with a variety of prompts, e.g., coarse masks, boxes, scribbles, and points. Our experimental results demonstrate that PaintSeg outperforms existing approaches in coarse mask-prompt, box-prompt, and point-prompt segmentation tasks, providing a training-free solution suitable for unsupervised segmentation. Code: https://github.com/lxa9867/PaintSeg.

AAAI Conference 2023 Conference Paper

Panoramic Video Salient Object Detection with Ambisonic Audio Guidance

  • Xiang Li
  • Haoyuan Cao
  • Shijie Zhao
  • Junlin Li
  • Li Zhang
  • Bhiksha Raj

Video salient object detection (VSOD), as a fundamental computer vision problem, has been extensively discussed in the last decade. However, all existing works focus on addressing the VSOD problem in 2D scenarios. With the rapid development of VR devices, panoramic videos have been a promising alternative to 2D videos to provide immersive feelings of the real world. In this paper, we aim to tackle the video salient object detection problem for panoramic videos, with their corresponding ambisonic audios. A multimodal fusion module equipped with two pseudo-siamese audio-visual context fusion (ACF) blocks is proposed to effectively conduct audio-visual interaction. The ACF block equipped with spherical positional encoding enables the fusion in the 3D context to capture the spatial correspondence between pixels and sound sources from the equirectangular frames and ambisonic audios. Experimental results verify the effectiveness of our proposed components and demonstrate that our method achieves state-of-the-art performance on the ASOD60K dataset.

AAAI Conference 2023 Conference Paper

PGSS: Pitch-Guided Speech Separation

  • Xiang Li
  • Yiwen Wang
  • Yifan Sun
  • Xihong Wu
  • Jing Chen

Monaural speech separation aims to separate concurrent speakers from a single-microphone mixture recording. Inspired by the effect of pitch priming in auditory scene analysis (ASA) mechanisms, a novel pitch-guided speech separation framework is proposed in this work. The prominent advantage of this framework is that both the permutation problem and the unknown speaker number problem existing in general models can be avoided by using pitch contours as the primary means to guide the target speaker. In addition, adversarial training is applied, instead of a traditional time-frequency mask, to improve the perceptual quality of separated speech. Specifically, the proposed framework can be divided into two phases: pitch extraction and speech separation. The former aims to extract pitch contour candidates for each speaker from the mixture, modeling the bottom-up process in ASA mechanisms. Any pitch contour can be selected as the condition in the second phase to separate the corresponding speaker, where a conditional generative adversarial network (CGAN) is applied. The second phase models the effect of pitch priming in ASA. Experiments on the WSJ0-2mix corpus reveal that the proposed approaches can achieve higher pitch extraction accuracy and better separation performance, compared to the baseline models, and have the potential to be applied to SOTA architectures.

JBHI Journal 2023 Journal Article

Prediction of New-Onset Diabetes After Pancreatectomy With Subspace Clustering Based Multi-View Feature Selection

  • Peijun Hu
  • Xiang Li
  • Na Lu
  • Kaiqi Dong
  • Xueli Bai
  • Tingbo Liang
  • Jingsong Li

The pancreas plays an important role in glucose metabolism, and developing diabetes or long-term glucose metabolism disturbance may be a prevalent sequela after pancreatectomy. Nevertheless, the relevant factors of new-onset diabetes after pancreatectomy remain unclear. Radiomics analysis has the potential to identify image markers for disease prediction or prognosis. Meanwhile, the combination of imaging and electronic medical record (EMR) data showed superior performance to imaging or EMR alone in previous studies. One critical step is to identify predictors from high-dimensional features, and it is even more challenging to select and fuse imaging and EMR features. In this work, we develop a radiomics pipeline to assess the postoperative new-onset diabetes risk of patients undergoing distal pancreatectomy. Specifically, we extract multiscale image features with 3D wavelet transformation, and include patients’ characteristics, body composition and pancreas volume information as clinical features. Then, we propose a multi-view subspace clustering guided feature selection method (MSCUFS) for the selection and fusion of image and clinical features. Finally, a prediction model is constructed with a classical machine learning classifier. Experimental results on an established distal pancreatectomy cohort showed that the SVM model with combined imaging and EMR features demonstrated good discrimination, with an AUC value of 0.824, which improved upon the model with image features alone by 0.037 AUC. Compared with state-of-the-art feature selection methods, the proposed MSCUFS has superior performance in fusing image and clinical features.

AAAI Conference 2023 Conference Paper

Recurrent Structure Attention Guidance for Depth Super-resolution

  • Jiayi Yuan
  • Haobo Jiang
  • Xiang Li
  • Jianjun Qian
  • Jun Li
  • Jian Yang

Image guidance is an effective strategy for depth super-resolution. Generally, most existing methods employ hand-crafted operators to decompose the high-frequency (HF) and low-frequency (LF) ingredients from low-resolution depth maps and guide the HF ingredients by directly concatenating them with image features. However, the hand-designed operators usually cause inferior HF maps (e.g., distorted or structurally missing) due to the diverse appearance of complex depth maps. Moreover, the direct concatenation often results in weak guidance because not all image features have a positive effect on the HF maps. In this paper, we develop a recurrent structure attention guided (RSAG) framework, consisting of two important parts. First, we introduce a deep contrastive network with multi-scale filters for adaptive frequency-domain separation, which adopts contrastive networks from large filters to small ones to calculate the pixel contrasts for adaptive high-quality HF predictions. Second, instead of the coarse concatenation guidance, we propose a recurrent structure attention block, which iteratively utilizes the latest depth estimation and the image features to jointly select clear patterns and boundaries, aiming at providing refined guidance for accurate depth recovery. In addition, we fuse the features of HF maps to enhance the edge structures in the decomposed LF maps. Extensive experiments show that our approach obtains superior performance compared with state-of-the-art depth super-resolution methods. Our code is available at: https://github.com/Yuanjiayii/DSR-RSAG.

AAAI Conference 2023 Conference Paper

Structure Flow-Guided Network for Real Depth Super-resolution

  • Jiayi Yuan
  • Haobo Jiang
  • Xiang Li
  • Jianjun Qian
  • Jun Li
  • Jian Yang

Real depth super-resolution (DSR), unlike synthetic settings, is a challenging task due to the structural distortion and the edge noise caused by the natural degradation in real-world low-resolution (LR) depth maps. These defects result in significant structure inconsistency between the depth map and the RGB guidance, which potentially confuses the RGB-structure guidance and thereby degrades the DSR quality. In this paper, we propose a novel structure flow-guided DSR framework, where a cross-modality flow map is learned to guide the RGB-structure information transferring for precise depth upsampling. Specifically, our framework consists of a cross-modality flow-guided upsampling network (CFUNet) and a flow-enhanced pyramid edge attention network (PEANet). CFUNet contains a trilateral self-attention module combining both the geometric and semantic correlations for reliable cross-modality flow learning. Then, the learned flow maps are combined with the grid-sampling mechanism for coarse high-resolution (HR) depth prediction. PEANet aims to integrate the learned flow map as edge attention into a pyramid network to hierarchically learn the edge-focused guidance feature for depth edge refinement. Extensive experiments on real and synthetic DSR datasets verify that our approach achieves excellent performance compared to state-of-the-art methods. Our code is available at: https://github.com/Yuanjiayii/DSR-SFG.

JBHI Journal 2023 Journal Article

The Individualized Prediction of Neurocognitive Function in People Living with HIV Based on Clinical and Multimodal Connectome Data

  • Xiang Li
  • Sheri L. Towe
  • Ryan P. Bell
  • Rongtao Jiang
  • Shana A. Hall
  • Vince D. Calhoun
  • Christina S. Meade
  • Jing Sui

Neurocognitive impairment continues to be a common comorbidity for people living with HIV (PLWH). Given the chronic nature of HIV disease, identifying reliable biomarkers of these impairments is essential to advance our understanding of the underlying neural foundation and facilitate screening and diagnosis in clinical care. While neuroimaging provides immense potential for such biomarkers, to date, investigations in PLWH have been mostly limited to either univariate mass techniques or a single neuroimaging modality. In the present study, connectome-based predictive modeling (CPM) was proposed to predict individual differences of cognitive functioning in PLWH, using resting-state functional connectivity (FC), white matter structural connectivity (SC), and clinically relevant measures. We also adopted an efficient feature selection approach to identify the most predictive features, which achieved an optimal prediction accuracy of r = 0.61 in the discovery dataset (n = 102) and r = 0.45 in an independent validation HIV cohort (n = 88). Two brain templates and nine distinct prediction models were also tested for better modeling generalizability. Results show that combining multimodal FC and SC features enabled higher prediction accuracy of cognitive scores in PLWH, while adding clinical and demographic metrics may further improve the prediction by introducing complementary information, which may help better evaluate the individual-level cognitive performance in PLWH.

ICLR Conference 2023 Conference Paper

TiAda: A Time-scale Adaptive Algorithm for Nonconvex Minimax Optimization

  • Xiang Li
  • Junchi Yang
  • Niao He

Adaptive gradient methods have shown their ability to adjust the stepsizes on the fly in a parameter-agnostic manner, and empirically achieve faster convergence for solving minimization problems. When it comes to nonconvex minimax optimization, however, current convergence analyses of gradient descent ascent (GDA) combined with adaptive stepsizes require careful tuning of hyper-parameters and the knowledge of problem-dependent parameters. Such a discrepancy arises from the primal-dual nature of minimax problems and the necessity of delicate time-scale separation between the primal and dual updates in attaining convergence. In this work, we propose a single-loop adaptive GDA algorithm called TiAda for nonconvex minimax optimization that automatically adapts to the time-scale separation. Our algorithm is fully parameter-agnostic and can achieve near-optimal complexities simultaneously in deterministic and stochastic settings of nonconvex-strongly-concave minimax problems. The effectiveness of the proposed method is further justified numerically for a number of machine learning applications.

NeurIPS Conference 2023 Conference Paper

Two Sides of One Coin: the Limits of Untuned SGD and the Power of Adaptive Methods

  • Junchi Yang
  • Xiang Li
  • Ilyas Fatkhullin
  • Niao He

The classical analysis of Stochastic Gradient Descent (SGD) with polynomially decaying stepsize $\eta_t = \eta/\sqrt{t}$ relies on well-tuned $\eta$ depending on problem parameters such as Lipschitz smoothness constant, which is often unknown in practice. In this work, we prove that SGD with arbitrary $\eta > 0$, referred to as untuned SGD, still attains an order-optimal convergence rate $\widetilde{\mathcal{O}}(T^{-1/4})$ in terms of gradient norm for minimizing smooth objectives. Unfortunately, it comes at the expense of a catastrophic exponential dependence on the smoothness constant, which we show is unavoidable for this scheme even in the noiseless setting. We then examine three families of adaptive methods — Normalized SGD (NSGD), AMSGrad, and AdaGrad — unveiling their power in preventing such exponential dependency in the absence of information about the smoothness parameter and boundedness of stochastic gradients. Our results provide theoretical justification for the advantage of adaptive methods over untuned SGD in alleviating the issue with large gradients.
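
A minimal sketch contrasting the untuned decaying-stepsize SGD update with the Normalized SGD (NSGD) update discussed above, on a toy quadratic; the objective and constants are illustrative only and carry no claim about the paper's rates.

import numpy as np

def grad(w):                 # toy smooth objective f(w) = 0.5 * 5 * w^2
    return 5.0 * w

w_sgd, w_nsgd, eta = 1.0, 1.0, 1.0
for t in range(1, 101):
    w_sgd -= eta / np.sqrt(t) * grad(w_sgd)             # untuned SGD, stepsize eta/sqrt(t)
    g = grad(w_nsgd)
    w_nsgd -= eta / np.sqrt(t) * g / (abs(g) + 1e-12)   # NSGD: direction only, normalized step
print(w_sgd, w_nsgd)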

NeurIPS Conference 2023 Conference Paper

YouTubePD: A Multimodal Benchmark for Parkinson’s Disease Analysis

  • Andy Zhou
  • Samuel Li
  • Pranav Sriram
  • Xiang Li
  • Jiahua Dong
  • Ansh Sharma
  • Yuanyi Zhong
  • Shirui Luo

The healthcare and AI communities have witnessed a growing interest in the development of AI-assisted systems for automated diagnosis of Parkinson's Disease (PD), one of the most prevalent neurodegenerative disorders. However, the progress in this area has been significantly impeded by the absence of a unified, publicly available benchmark, which prevents comprehensive evaluation of existing PD analysis methods and the development of advanced models. This work overcomes these challenges by introducing YouTubePD -- the first publicly available multimodal benchmark designed for PD analysis. We crowd-source existing videos featured with PD from YouTube, exploit multimodal information including in-the-wild videos, audio data, and facial landmarks across 200+ subject videos, and provide dense and diverse annotations from clinical experts. Based on our benchmark, we propose three challenging and complementary tasks encompassing both discriminative and generative tasks, along with a comprehensive set of corresponding baselines. Experimental evaluation showcases the potential of modern deep learning and computer vision techniques, in particular the generalizability of the models developed on YouTubePD to real-world clinical settings, while revealing their limitations. We hope our work paves the way for future research in this direction.

NeurIPS Conference 2022 Conference Paper

Asymptotic Behaviors of Projected Stochastic Approximation: A Jump Diffusion Perspective

  • Jiadong Liang
  • Yuze Han
  • Xiang Li
  • Zhihua Zhang

In this paper, we consider linearly constrained stochastic approximation problems with federated learning (FL) as a special case. We propose a stochastic approximation algorithm named LPSA with probabilistic projections to ensure feasibility, so that projections are performed with probability $p_n$ at the $n$-th iteration. Considering a specific family of the probability $p_n$ and step size $\eta_n$, we analyze our algorithm from an asymptotic and continuous perspective. Using a novel jump diffusion approximation, we show that the trajectories consisting of properly rescaled last iterates weakly converge to the solution of specific SDEs. By analyzing the SDEs, we identify the asymptotic behaviors of LPSA for different choices of $(p_n, \eta_n)$. We find the algorithm presents an intriguing asymptotic bias-variance trade-off according to the relative magnitude of $p_n$ w.r.t. $\eta_n$. This provides insights into how to choose appropriate $\{(p_n, \eta_n)\}_{n \geq 1}$ to minimize the projection complexity.

IJCAI Conference 2022 Conference Paper

CGMN: A Contrastive Graph Matching Network for Self-Supervised Graph Similarity Learning

  • Di Jin
  • Luzhi Wang
  • Yizhen Zheng
  • Xiang Li
  • Fei Jiang
  • Wei Lin
  • Shirui Pan

Graph similarity learning refers to calculating the similarity score between two graphs, which is required in many realistic applications, such as visual tracking, graph classification, and collaborative filtering. As most of the existing graph neural networks yield effective graph representations of a single graph, little effort has been made for jointly learning two graph representations and calculating their similarity score. In addition, existing unsupervised graph similarity learning methods are mainly clustering-based, which ignores the valuable information embodied in graph pairs. To this end, we propose a contrastive graph matching network (CGMN) for self-supervised graph similarity learning in order to calculate the similarity between any two input graph objects. Specifically, we generate two augmented views for each graph in a pair respectively. Then, we employ two strategies, namely cross-view interaction and cross-graph interaction, for effective node representation learning. The former is used to strengthen the consistency of node representations in the two views. The latter is utilized to identify node differences between different graphs. Finally, we transform node representations into graph-level representations via pooling operations for graph similarity computation. We have evaluated CGMN on eight real-world datasets, and the experiment results show that the proposed new approach is superior to the state-of-the-art methods in graph similarity learning downstream tasks.

NeurIPS Conference 2022 Conference Paper

Diffusion-LM Improves Controllable Text Generation

  • Xiang Li
  • John Thickstun
  • Ishaan Gulrajani
  • Percy S. Liang
  • Tatsunori B. Hashimoto

Controlling the behavior of language models (LMs) without re-training is a major open problem in natural language generation. While recent works have demonstrated successes on controlling simple sentence attributes (e.g., sentiment), there has been little progress on complex, fine-grained controls (e.g., syntactic structure). To address this challenge, we develop a new non-autoregressive language model based on continuous diffusions that we call Diffusion-LM. Building upon the recent successes of diffusion models in continuous domains, Diffusion-LM iteratively denoises a sequence of Gaussian vectors into word vectors, yielding a sequence of intermediate latent variables. The continuous, hierarchical nature of these intermediate variables enables a simple gradient-based algorithm to perform complex, controllable generation tasks. We demonstrate successful control of Diffusion-LM for six challenging fine-grained control tasks, significantly outperforming prior work.

NeurIPS Conference 2022 Conference Paper

Does Self-supervised Learning Really Improve Reinforcement Learning from Pixels?

  • Xiang Li
  • Jinghuan Shang
  • Srijan Das
  • Michael Ryoo

We investigate whether self-supervised learning (SSL) can improve online reinforcement learning (RL) from pixels. We extend the contrastive reinforcement learning framework (e.g., CURL) that jointly optimizes SSL and RL losses and conduct extensive experiments with various self-supervised losses. Our observations suggest that the existing SSL framework for RL fails to bring meaningful improvement over the baselines only taking advantage of image augmentation when the same amount of data and augmentation is used. We further perform evolutionary searches to find the optimal combination of multiple self-supervised losses for RL, but find that even such a loss combination fails to meaningfully outperform the methods that only utilize carefully designed image augmentations. After evaluating these approaches together in multiple different environments including a real-world robot environment, we confirm that no single self-supervised loss or image augmentation method can dominate all environments and that the current framework for joint optimization of SSL and RL is limited. Finally, we conduct ablation studies on multiple factors and demonstrate the properties of representations learned with different approaches.

NeurIPS Conference 2022 Conference Paper

DTG-SSOD: Dense Teacher Guidance for Semi-Supervised Object Detection

  • Gang Li
  • Xiang Li
  • Yujie Wang
  • Yichao Wu
  • Ding Liang
  • Shanshan Zhang

The Mean-Teacher (MT) scheme is widely adopted in semi-supervised object detection (SSOD). In MT, sparse pseudo labels, offered by the final predictions of the teacher (e.g., after Non-Maximum Suppression (NMS) post-processing), are adopted for the dense supervision of the student via hand-crafted label assignment. However, the "sparse-to-dense" paradigm complicates the pipeline of SSOD, and simultaneously neglects the powerful direct, dense teacher supervision. In this paper, we attempt to directly leverage the dense guidance of the teacher to supervise student training, i.e., the "dense-to-dense" paradigm. Specifically, we propose the Inverse NMS Clustering (INC) and Rank Matching (RM) to instantiate the dense supervision, without the widely used, conventional sparse pseudo labels. INC leads the student to group candidate boxes into clusters in NMS as the teacher does, which is implemented by learning the grouping information revealed in the NMS procedure of the teacher. After obtaining the same grouping scheme as the teacher via INC, the student further imitates the rank distribution of the teacher over clustered candidates through Rank Matching. With the proposed INC and RM, we integrate Dense Teacher Guidance into Semi-Supervised Object Detection (termed "DTG-SSOD"), successfully abandoning sparse pseudo labels and enabling more informative learning on unlabeled data. On the COCO benchmark, our DTG-SSOD achieves state-of-the-art performance under various labelling ratios. For example, under the 10% labelling ratio, DTG-SSOD improves the supervised baseline from 26.9 to 35.9 mAP, outperforming the previous best method Soft Teacher by 1.9 points.

AAAI Conference 2022 Conference Paper

Hybrid Instance-Aware Temporal Fusion for Online Video Instance Segmentation

  • Xiang Li
  • Jinglu Wang
  • Xiao Li
  • Yan Lu

Recently, transformer-based image segmentation methods have achieved notable success against previous solutions. For video domains, however, how to effectively model temporal context with the attention of object instances across frames remains an open problem. In this paper, we propose an online video instance segmentation framework with a novel instance-aware temporal fusion method. We first leverage the representation, i.e., a latent code in the global context (instance code) and CNN feature maps, to represent instance- and pixel-level features. Based on this representation, we introduce a cropping-free temporal fusion approach to model the temporal consistency between video frames. Specifically, we encode global instance-specific information in the instance code and build up inter-frame contextual fusion with hybrid attentions between the instance codes and CNN feature maps. Inter-frame consistency between the instance codes is further enforced with order constraints. By leveraging the learned hybrid temporal consistency, we are able to directly retrieve and maintain instance identities across frames, eliminating the complicated frame-wise instance matching in prior methods. Extensive experiments have been conducted on popular VIS datasets, i.e., YouTube-VIS-19/21. Our model achieves the best performance among all online VIS methods. Notably, our model also eclipses all offline methods when using the ResNet-50 backbone.

AAAI Conference 2022 Conference Paper

Knowledge Distillation for Object Detection via Rank Mimicking and Prediction-Guided Feature Imitation

  • Gang Li
  • Xiang Li
  • Yujie Wang
  • Shanshan Zhang
  • Yichao Wu
  • Ding Liang

Knowledge Distillation (KD) is a widely-used technology to inherit information from cumbersome teacher models to compact student models, consequently realizing model compression and acceleration. Compared with image classification, object detection is a more complex task, and designing specific KD methods for object detection is non-trivial. In this work, we elaborately study the behaviour difference between the teacher and student detection models, and obtain two intriguing observations: First, the teacher and student rank their detected candidate boxes quite differently, which results in their precision discrepancy. Second, there is a considerable gap between the feature response differences and prediction differences between teacher and student, indicating that equally imitating all the feature maps of the teacher is the sub-optimal choice for improving the student’s accuracy. Based on the two observations, we propose Rank Mimicking (RM) and Prediction-guided Feature Imitation (PFI) for distilling one-stage detectors, respectively. RM takes the rank of candidate boxes from teachers as a new form of knowledge to distill, which consistently outperforms the traditional soft label distillation. PFI attempts to correlate feature differences with prediction differences, making feature imitation directly help to improve the student’s accuracy. On MS COCO and PASCAL VOC benchmarks, extensive experiments are conducted on various detectors with different backbones to validate the effectiveness of our method. Specifically, RetinaNet with ResNet50 achieves 40.4% mAP on MS COCO, which is 3.5% higher than its baseline, and also outperforms previous KD methods.

NeurIPS Conference 2022 Conference Paper

Nest Your Adaptive Algorithm for Parameter-Agnostic Nonconvex Minimax Optimization

  • Junchi YANG
  • Xiang Li
  • Niao He

Adaptive algorithms like AdaGrad and AMSGrad are successful in nonconvex optimization owing to their parameter-agnostic ability – requiring neither a priori knowledge about problem-specific parameters nor tuning of learning rates. However, when it comes to nonconvex minimax optimization, direct extensions of such adaptive optimizers without proper time-scale separation may fail to work in practice. We provide such an example proving that the simple combination of Gradient Descent Ascent (GDA) with adaptive stepsizes can diverge if the primal-dual stepsize ratio is not carefully chosen; hence, a fortiori, such adaptive extensions are not parameter-agnostic. To address the issue, we formally introduce a Nested Adaptive framework, NeAda for short, that carries an inner loop for adaptively maximizing the dual variable with controllable stopping criteria and an outer loop for adaptively minimizing the primal variable. Such a mechanism can be equipped with off-the-shelf adaptive optimizers and automatically balance the progress in the primal and dual variables. Theoretically, for nonconvex-strongly-concave minimax problems, we show that NeAda with AdaGrad stepsizes can achieve the near-optimal $\widetilde{O}(\epsilon^{-2})$ and $\widetilde{O}(\epsilon^{-4})$ gradient complexities respectively in the deterministic and stochastic settings, without prior information on the problem's smoothness and strong concavity parameters. To the best of our knowledge, this is the first algorithm that simultaneously achieves near-optimal convergence rates and parameter-agnostic adaptation in the nonconvex minimax setting. Numerically, we further illustrate the robustness of the NeAda family with experiments on simple test functions and a real-world application.
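
The nested structure is easy to prototype. The toy sketch below (test function, constants, and stopping rule are illustrative, not from the paper) runs an inner AdaGrad ascent on the dual variable with a stopping criterion, wrapped by an outer AdaGrad descent on the primal variable:

```python
# Toy sketch of a nested adaptive loop in the spirit described above, applied
# to f(x, y) = x*y - 0.5*y**2 (an illustrative strongly-concave-in-y example).
import numpy as np

def grad_x(x, y): return y          # df/dx
def grad_y(x, y): return x - y      # df/dy

def nested_adaptive(x0=2.0, y0=0.0, outer_iters=200, inner_max=50, tol=1e-3):
    x, y = x0, y0
    gx_acc, gy_acc = 1e-8, 1e-8     # AdaGrad accumulators
    for t in range(outer_iters):
        # inner loop: adaptively maximize over y until the dual gradient is small
        for _ in range(inner_max):
            gy = grad_y(x, y)
            if abs(gy) < tol / (t + 1):
                break
            gy_acc += gy * gy
            y += gy / np.sqrt(gy_acc)       # ascent step
        gx = grad_x(x, y)
        gx_acc += gx * gx
        x -= gx / np.sqrt(gx_acc)           # descent step
    return x, y
```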

NeurIPS Conference 2022 Conference Paper

Personalized Federated Learning towards Communication Efficiency, Robustness and Fairness

  • Shiyun Lin
  • Yuze Han
  • Xiang Li
  • Zhihua Zhang

Personalized Federated Learning faces many challenges such as expensive communication costs, training-time adversarial attacks, and performance unfairness across devices. Recent developments witness a trade-off between a reference model and local models to achieve personalization. We follow the avenue and propose a personalized FL method towards the three goals. When it is time to communicate, our method projects local models into a shared-and-fixed low-dimensional random subspace and uses infimal convolution to control the deviation between the reference model and projected local models. We theoretically show our method converges for smooth objectives with square regularizers and the convergence dependence on the projection dimension is mild. We also illustrate the benefits of robustness and fairness on a class of linear problems. Finally, we conduct a large number of experiments to show the empirical superiority of our method over several state-of-the-art methods on the three aspects.
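
A minimal sketch of the communication step described above, under the assumption that clients share a random seed so the fixed projection never has to be transmitted (names and dimensions are illustrative):

```python
# Sketch: every client projects its flattened local model into a shared,
# fixed low-dimensional random subspace before communicating it.
import numpy as np

def make_projection(model_dim, subspace_dim, seed=0):
    # seeding makes the projection identical across clients and the server
    rng = np.random.default_rng(seed)
    return rng.standard_normal((subspace_dim, model_dim)) / np.sqrt(subspace_dim)

def project_local_model(local_params, P):
    """local_params: (model_dim,) flattened model; returns the (subspace_dim,) message."""
    return P @ local_params
```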

IJCAI Conference 2022 Conference Paper

RAW-GNN: RAndom Walk Aggregation based Graph Neural Network

  • Di Jin
  • Rui Wang
  • Meng Ge
  • Dongxiao He
  • Xiang Li
  • Wei Lin
  • Weixiong Zhang

Graph-Convolution-based methods have been successfully applied to representation learning on homophily graphs where nodes with the same label or similar attributes tend to connect with one another. Due to the homophily assumption of Graph Convolutional Networks (GCNs) that these methods use, they are not suitable for heterophily graphs where nodes with different labels or dissimilar attributes tend to be adjacent. Several methods have attempted to address this heterophily problem, but they do not change the fundamental aggregation mechanism of GCNs because they rely on summation operators to aggregate information from neighboring nodes, which is implicitly subject to the homophily assumption. Here, we introduce a novel aggregation mechanism and develop a RAndom Walk Aggregation-based Graph Neural Network (called RAW-GNN) method. The proposed approach integrates the random walk strategy with graph neural networks. The new method utilizes breadth-first random walk search to capture homophily information and depth-first search to collect heterophily information. It replaces the conventional neighborhoods with path-based neighborhoods and introduces a new path-based aggregator based on Recurrent Neural Networks. These designs make RAW-GNN suitable for both homophily and heterophily graphs. Extensive experimental results showed that the new method achieved state-of-the-art performance on a variety of homophily and heterophily graphs.

NeurIPS Conference 2022 Conference Paper

RecursiveMix: Mixed Learning with History

  • Lingfeng Yang
  • Xiang Li
  • Borui Zhao
  • Renjie Song
  • Jian Yang

Mix-based augmentation has been proven fundamental to the generalization of deep vision models. However, current augmentations only mix samples from the current data batch during training, which ignores the possible knowledge accumulated in the learning history. In this paper, we propose a recursive mixed-sample learning paradigm, termed ``RecursiveMix'' (RM), by exploring a novel training strategy that leverages the historical input-prediction-label triplets. More specifically, we iteratively resize the input image batch from the previous iteration and paste it into the current batch while their labels are fused proportionally to the area of the operated patches. Furthermore, a consistency loss is introduced to align the identical image semantics across the iterations, which helps the learning of scale-invariant feature representations. Based on ResNet-50, RM largely improves classification accuracy by $\sim$3.2% on CIFAR-100 and $\sim$2.8% on ImageNet with negligible extra computation/storage costs. In the downstream object detection task, the RM-pretrained model outperforms the baseline by 2.1 AP points and surpasses CutMix by 1.4 AP points under the ATSS detector on COCO. In semantic segmentation, RM also surpasses the baseline and CutMix by 1.9 and 1.1 mIoU points under UperNet on ADE20K, respectively. Codes and pretrained models are available at https://github.com/implus/RecursiveMix.
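
A simplified sketch of the history-mixing step follows; the patch placement, mixing ratio, and label fusion below are illustrative assumptions, and the released code at the URL above is authoritative:

```python
# Illustrative sketch: the previous batch is resized and pasted into a corner
# of the current batch, and the labels are mixed in proportion to the pasted area.
import torch
import torch.nn.functional as F

def recursive_mix(cur_images, cur_labels, hist_images, hist_labels, num_classes, lam=0.3):
    """cur_images/hist_images: (B, C, H, W); cur_labels/hist_labels: (B,) class ids."""
    B, C, H, W = cur_images.shape
    h, w = int(H * lam ** 0.5), int(W * lam ** 0.5)
    pasted = F.interpolate(hist_images, size=(h, w), mode="bilinear", align_corners=False)
    mixed = cur_images.clone()
    mixed[:, :, :h, :w] = pasted                       # paste the shrunken history batch
    area = (h * w) / float(H * W)
    one_hot = lambda y: F.one_hot(y, num_classes).float()
    mixed_labels = (1 - area) * one_hot(cur_labels) + area * one_hot(hist_labels)
    return mixed, mixed_labels
```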

JBHI Journal 2021 Journal Article

Automatic Pancreas Segmentation in CT Images With Distance-Based Saliency-Aware DenseASPP Network

  • Peijun Hu
  • Xiang Li
  • Yu Tian
  • Tianyu Tang
  • Tianshu Zhou
  • Xueli Bai
  • Shiqiang Zhu
  • Tingbo Liang

Pancreas identification and segmentation is an essential task in the diagnosis and prognosis of pancreas disease. Although deep neural networks have been widely applied in abdominal organ segmentation, it is still challenging for small organs (e.g., pancreas) that present low contrast, highly flexible anatomical structure and relatively small region. In recent years, coarse-to-fine methods have improved pancreas segmentation accuracy by using coarse predictions in the fine stage, but only object location is utilized and rich image context is neglected. In this paper, we propose a novel distance-based saliency-aware model, namely DSD-ASPP-Net, to fully use coarse segmentation to highlight the pancreas feature and boost accuracy in the fine segmentation stage. Specifically, a DenseASPP (Dense Atrous Spatial Pyramid Pooling) model is trained to learn the pancreas location and probability map, which is then transformed into a saliency map through geodesic distance-based saliency transformation. In the fine stage, saliency-aware modules that combine the saliency map and image context are introduced into DenseASPP to develop the DSD-ASPP-Net. The architecture of DenseASPP brings multi-scale feature representation and achieves a larger receptive field in a denser way, which overcomes the difficulties brought by variable object sizes and locations. Our method was evaluated on both the public NIH pancreas dataset and a local hospital dataset, and achieved an average Dice-Sørensen Coefficient (DSC) value of 85.49±4.77% on the NIH dataset, outperforming former coarse-to-fine methods.

AAAI Conference 2021 Conference Paper

Capturing Delayed Feedback in Conversion Rate Prediction via Elapsed-Time Sampling

  • Jia-Qi Yang
  • Xiang Li
  • Shuguang Han
  • Tao Zhuang
  • De-Chuan Zhan
  • Xiaoyi Zeng
  • Bin Tong

Conversion rate (CVR) prediction is one of the most critical tasks for digital display advertising. Commercial systems often require updating models in an online learning manner to catch up with the evolving data distribution. However, conversions usually do not happen immediately after user clicks. This may result in inaccurate labeling, which is called the delayed feedback problem. In previous studies, the delayed feedback problem is handled either by waiting for the positive label for a long period of time, or by consuming the negative sample on its arrival and then inserting a positive duplicate when the conversion happens later. Indeed, there is a trade-off between waiting for more accurate labels and utilizing fresh data, which is not considered in existing works. To strike a balance in this trade-off, we propose the Elapsed-Time Sampling Delayed Feedback Model (ES-DFM), which models the relationship between the observed conversion distribution and the true conversion distribution. Then we optimize the expectation of the true conversion distribution via importance sampling under the elapsed-time sampling distribution. We further estimate the importance weight for each instance, which is used as the weight of the loss function in CVR prediction. To demonstrate the effectiveness of ES-DFM, we conduct extensive experiments on a public dataset and a private industrial dataset. Experimental results confirm that our method consistently outperforms the previous state-of-the-art results.
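
The final training objective reduces to an importance-weighted CVR loss. The sketch below shows only that weighting step; the per-example weights are assumed to come from a separately estimated elapsed-time model, and the exact weight formulas of ES-DFM are not reproduced here:

```python
# Sketch: importance-weighted binary cross-entropy on observed (possibly
# delayed) labels, with weights supplied by an external elapsed-time model.
import torch
import torch.nn.functional as F

def weighted_cvr_loss(logits, observed_labels, pos_weight, neg_weight):
    """logits, observed_labels, pos_weight, neg_weight: (B,) tensors."""
    bce = F.binary_cross_entropy_with_logits(logits, observed_labels, reduction="none")
    weights = torch.where(observed_labels > 0.5, pos_weight, neg_weight)
    return (weights * bce).mean()
```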

AAAI Conference 2021 Conference Paper

Improving Tree-Structured Decoder Training for Code Generation via Mutual Learning

  • Binbin Xie
  • Jinsong Su
  • Yubin Ge
  • Xiang Li
  • Jianwei Cui
  • Junfeng Yao
  • Bin Wang

Code generation aims to automatically generate a piece of code given an input natural language utterance. Currently, among dominant models, it is treated as a sequence-to-tree task, where a decoder outputs a sequence of actions corresponding to the pre-order traversal of an Abstract Syntax Tree. However, such a decoder only exploits the preorder-traversal-based preceding actions, which are insufficient to ensure correct action predictions. In this paper, we first thoroughly analyze the context modeling difference between neural code generation models with different traversal-based decodings (preorder traversal vs breadth-first traversal), and then propose to introduce a mutual learning framework to jointly train these models. Under this framework, we continuously enhance both models via mutual distillation, which involves synchronous executions of two one-to-one knowledge transfers at each training step. More specifically, we alternately choose one model as the student and the other as its teacher, and require the student to fit the training data and the action prediction distributions of its teacher. By doing so, both models can fully absorb the knowledge from each other and thus could be improved simultaneously. Experimental results and in-depth analysis on several benchmark datasets demonstrate the effectiveness of our approach. We release our code at https://github.com/DeepLearnXMU/CGML.

JBHI Journal 2021 Journal Article

Left Ventricle Quantification Challenge: A Comprehensive Comparison and Evaluation of Segmentation and Regression for Mid-Ventricular Short-Axis Cardiac MR Data

  • Wufeng Xue
  • Jiahui Li
  • Zhiqiang Hu
  • Eric Kerfoot
  • James Clough
  • Ilkay Oksuz
  • Hao Xu
  • Vicente Grau

Automatic quantification of the left ventricle (LV) from cardiac magnetic resonance (CMR) images plays an important role in making the diagnosis procedure efficient, reliable, and alleviating the laborious reading work for physicians. Considerable efforts have been devoted to LV quantification using different strategies that include segmentation-based (SG) methods and the recent direct regression (DR) methods. Although both SG and DR methods have obtained great success for the task, a systematic platform to benchmark them remains absent because of differences in label information during model learning. In this paper, we conducted an unbiased evaluation and comparison of cardiac LV quantification methods that were submitted to the Left Ventricle Quantification (LVQuan) challenge, which was held in conjunction with the Statistical Atlases and Computational Modeling of the Heart (STACOM) workshop at the MICCAI 2018. The challenge was targeted at the quantification of 1) areas of LV cavity and myocardium, 2) dimensions of the LV cavity, 3) regional wall thicknesses (RWT), and 4) the cardiac phase, from mid-ventricle short-axis CMR images. First, we constructed a public quantification dataset Cardiac-DIG with ground truth labels for both the myocardium mask and these quantification targets across the entire cardiac cycle. Then, the key techniques employed by each submission were described. Next, quantitative validation of these submissions was conducted with the constructed dataset. The evaluation results revealed that both SG and DR methods can offer good LV quantification performance, even though DR methods do not require densely labeled masks for supervision. Among the 12 submissions, the DR method LDAMT offered the best performance, with a mean estimation error of 301 mm$^2$ for the two areas, 2.15 mm for the cavity dimensions, 2.03 mm for RWTs, and a 9.5% error rate for the cardiac phase classification. Three of the SG methods also delivered comparable performances. Finally, we discussed the advantages and disadvantages of SG and DR methods, as well as the unsolved problems in automatic cardiac quantification for clinical practice applications.

NeurIPS Conference 2021 Conference Paper

Reinforcement Learning Enhanced Explainer for Graph Neural Networks

  • Caihua Shan
  • Yifei Shen
  • Yao Zhang
  • Xiang Li
  • Dongsheng Li

Graph neural networks (GNNs) have recently emerged as revolutionary technologies for machine learning tasks on graphs. In GNNs, the graph structure is generally incorporated with node representation via the message passing scheme, making the explanation much more challenging. Given a trained GNN model, a GNN explainer aims to identify the most influential subgraph to interpret the prediction of an instance (e.g., a node or a graph), which is essentially a combinatorial optimization problem over graphs. The existing works solve this problem by continuous relaxation or search-based heuristics. But they suffer from key issues such as violation of message passing and hand-crafted heuristics, leading to inferior interpretability. To address these issues, we propose an RL-enhanced GNN explainer, RG-Explainer, which consists of three main components: starting point selection, iterative graph generation and stopping criteria learning. RG-Explainer could construct a connected explanatory subgraph by sequentially adding nodes from the boundary of the current generated graph, which is consistent with the message passing scheme. Further, we design an effective seed locator to select the starting point, and learn stopping criteria to generate superior explanations. Extensive experiments on both synthetic and real datasets show that RG-Explainer outperforms state-of-the-art GNN explainers. Moreover, RG-Explainer can be applied in the inductive setting, demonstrating its better generalization ability.

NeurIPS Conference 2021 Conference Paper

The Image Local Autoregressive Transformer

  • Chenjie Cao
  • Yuxin Hong
  • Xiang Li
  • Chengrong Wang
  • Chengming Xu
  • Yanwei Fu
  • Xiangyang Xue

Recently, AutoRegressive (AR) models for the whole image generation empowered by transformers have achieved comparable or even better performance compared to Generative Adversarial Networks (GANs). Unfortunately, directly applying such AR models to edit/change local image regions, may suffer from the problems of missing global information, slow inference speed, and information leakage of local guidance. To address these limitations, we propose a novel model -- image Local Autoregressive Transformer (iLAT), to better facilitate the locally guided image synthesis. Our iLAT learns the novel local discrete representations, by the newly proposed local autoregressive (LA) transformer of the attention mask and convolution mechanism. Thus iLAT can efficiently synthesize the local image regions by key guidance information. Our iLAT is evaluated on various locally guided image syntheses, such as pose-guided person image synthesis and face editing. Both quantitative and qualitative results show the efficacy of our model.

IJCAI Conference 2020 Conference Paper

An Iterative Multi-Source Mutual Knowledge Transfer Framework for Machine Reading Comprehension

  • Xin Liu
  • Kai Liu
  • Xiang Li
  • Jinsong Su
  • Yubin Ge
  • Bin Wang
  • Jiebo Luo

The lack of sufficient training data in many domains poses a major challenge to the construction of domain-specific machine reading comprehension (MRC) models with satisfactory performance. In this paper, we propose a novel iterative multi-source mutual knowledge transfer framework for MRC. As an extension of the conventional knowledge transfer with one-to-one correspondence, our framework focuses on the many-to-many mutual transfer, which involves synchronous executions of multiple many-to-one transfers in an iterative manner. Specifically, to update a target-domain MRC model, we first consider other domain-specific MRC models as individual teachers, and employ knowledge distillation to train a multi-domain MRC model, which is differentially required to fit the training data and match the outputs of these individual models according to their domain-level similarities to the target domain. After being initialized by the multi-domain MRC model, the target-domain MRC model is fine-tuned to match both its training data and the output of its previous best model simultaneously via knowledge distillation. Compared with previous approaches, our framework can continuously enhance all domain-specific MRC models by enabling each model to iteratively and differentially absorb the domain-shared knowledge from others. Experimental results and in-depth analyses on several benchmark datasets demonstrate the effectiveness of our framework.

JBHI Journal 2020 Journal Article

Automated Semantic Segmentation of Red Blood Cells for Sickle Cell Disease

  • Mo Zhang
  • Xiang Li
  • Mengjia Xu
  • Quanzheng Li

Red blood cell (RBC) segmentation and classification from microscopic images is a crucial step for the diagnosis of sickle cell disease (SCD). In this work, we adopt a deep learning based semantic segmentation framework to solve the RBC classification task. A major challenge for robust segmentation and classification is the large variation in the size, shape and viewpoint of the cells, combined with the low image quality caused by noise and artifacts. To address these challenges, we apply deformable convolution layers to the classic U-Net structure and implement the deformable U-Net (dU-Net). The U-Net architecture has been shown to offer accurate localization for image semantic segmentation. Moreover, deformable convolution enables free-form deformation of the feature learning process, thus making the network more robust to various cell morphologies and image settings. dU-Net is tested on microscopic red blood cell images from patients with sickle cell disease. Results show that dU-Net achieves the highest accuracy for both binary segmentation and multi-class semantic segmentation tasks, compared with both unsupervised and state-of-the-art deep learning based supervised segmentation methods. Through detailed investigation of the segmentation results, we further conclude that the performance improvement is mainly caused by the deformable convolution layers, which have a better ability to separate the touching cells, discriminate the background noise and predict correct cell shapes without any shape priors.
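
The deformable-convolution building block can be sketched with torchvision's DeformConv2d as a stand-in for the paper's implementation (layer sizes are illustrative): a plain convolution predicts the sampling offsets that deform the main convolution's sampling grid.

```python
# Sketch of a deformable convolution block: an auxiliary conv predicts 2*k*k
# offsets per location, which the deformable conv uses to sample its input.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.offset = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x):
        return self.deform(x, self.offset(x))
```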

AAAI Conference 2020 Conference Paper

Do Subsampled Newton Methods Work for High-Dimensional Data?

  • Xiang Li
  • Shusen Wang
  • Zhihua Zhang

Subsampled Newton methods approximate Hessian matrices through subsampling techniques to alleviate the per-iteration cost. Previous results require Ω(d) samples to approximate Hessians, where d is the dimension of data points, making it less practical for high-dimensional data. The situation is deteriorated when d is comparably as large as the number of data points n, which requires taking the whole dataset into account, making subsampling not useful. This paper theoretically justifies the effectiveness of subsampled Newton methods on strongly convex empirical risk minimization with high-dimensional data. Specifically, we provably require only $\Theta(d^{\gamma}_{\mathrm{eff}})$ samples for approximating the Hessian matrices, where $d^{\gamma}_{\mathrm{eff}}$ is the $\gamma$-ridge leverage and can be much smaller than d as long as $n\gamma \gg 1$. Our theories work for three types of Newton methods: subsampled Newton, distributed Newton, and proximal Newton.
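
For intuition, here is a minimal subsampled Newton sketch on L2-regularized logistic regression: the gradient uses all n points while the Hessian is formed from a uniform subsample. The leverage-based sample sizes analyzed in the paper are not reproduced; the fixed sample_size is illustrative.

```python
# Sketch: subsampled Newton for L2-regularized logistic regression.
import numpy as np

def subsampled_newton(X, y, reg=1e-2, sample_size=256, iters=20):
    """X: (n, d), y: (n,) labels in {0, 1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (p - y) / n + reg * w                       # full gradient
        idx = np.random.choice(n, size=min(sample_size, n), replace=False)
        Xs, ps = X[idx], p[idx]
        Ds = ps * (1 - ps)
        H = (Xs * Ds[:, None]).T @ Xs / len(idx) + reg * np.eye(d)  # subsampled Hessian
        w -= np.linalg.solve(H, grad)                            # Newton step
    return w
```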

NeurIPS Conference 2020 Conference Paper

Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection

  • Xiang Li
  • Wenhai Wang
  • Lijun Wu
  • Shuo Chen
  • Xiaolin Hu
  • Jun Li
  • Jinhui Tang
  • Jian Yang

One-stage detectors basically formulate object detection as dense classification and localization (i.e., bounding box regression). The classification is usually optimized by Focal Loss and the box location is commonly learned under Dirac delta distribution. A recent trend for one-stage detectors is to introduce an \emph{individual} prediction branch to estimate the quality of localization, where the predicted quality facilitates the classification to improve detection performance. This paper delves into the \emph{representations} of the above three fundamental elements: quality estimation, classification and localization. Two problems are discovered in existing practices, including (1) the inconsistent usage of the quality estimation and classification between training and inference, and (2) the inflexible Dirac delta distribution for localization. To address the problems, we design new representations for these elements. Specifically, we merge the quality estimation into the class prediction vector to form a joint representation, and use a vector to represent arbitrary distribution of box locations. The improved representations eliminate the inconsistency risk and accurately depict the flexible distribution in real data, but contain \emph{continuous} labels, which is beyond the scope of Focal Loss. We then propose Generalized Focal Loss (GFL) that generalizes Focal Loss from its discrete form to the \emph{continuous} version for successful optimization. On COCO {\tt test-dev}, GFL achieves 45.0\% AP using the ResNet-101 backbone, surpassing state-of-the-art SAPD (43.5\%) and ATSS (43.6\%) with higher or comparable inference speed.
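
The quality-aware classification term can be sketched as follows, in a Quality-Focal-Loss-style form where the continuous quality score is the target and the modulator is |y − σ(logit)|^β (β is illustrative):

```python
# Sketch of a quality-aware focal term for continuous targets in [0, 1].
import torch
import torch.nn.functional as F

def quality_focal_loss(logits, quality_targets, beta=2.0):
    """logits, quality_targets: (N,) tensors, targets in [0, 1]."""
    sigma = logits.sigmoid()
    bce = F.binary_cross_entropy_with_logits(logits, quality_targets, reduction="none")
    modulator = (quality_targets - sigma).abs().pow(beta)   # down-weights easy locations
    return (modulator * bce).mean()
```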

NeurIPS Conference 2020 Conference Paper

Improving Local Identifiability in Probabilistic Box Embeddings

  • Shib Dasgupta
  • Michael Boratko
  • Dongxu Zhang
  • Luke Vilnis
  • Xiang Li
  • Andrew McCallum

Geometric embeddings have recently received attention for their natural ability to represent transitive asymmetric relations via containment. Box embeddings, where objects are represented by n-dimensional hyperrectangles, are a particularly promising example of such an embedding as they are closed under intersection and their volume can be calculated easily, allowing them to naturally represent calibrated probability distributions. The benefits of geometric embeddings also introduce a problem of local identifiability, however, where whole neighborhoods of parameters result in equivalent loss, which impedes learning. Prior work addressed some of these issues by using an approximation to Gaussian convolution over the box parameters; however, this intersection operation also increases the sparsity of the gradient. In this work we model the box parameters with min and max Gumbel distributions, which were chosen such that the space is still closed under the operation of intersection. The calculation of the expected intersection volume involves all parameters, and we demonstrate experimentally that this drastically improves the ability of such models to learn.

NeurIPS Conference 2020 Conference Paper

Neuron-level Structured Pruning using Polarization Regularizer

  • Tao Zhuang
  • Zhixuan Zhang
  • Yuheng Huang
  • Xiaoyi Zeng
  • Kai Shuang
  • Xiang Li

Neuron-level structured pruning is a very effective technique to reduce the computation of neural networks without compromising prediction accuracy. In previous works, structured pruning is usually achieved by imposing L1 regularization on the scaling factors of neurons, and pruning the neurons whose scaling factors are below a certain threshold. The reasoning is that neurons with smaller scaling factors have weaker influence on network output. A scaling factor close to 0 actually suppresses a neuron. However, L1 regularization lacks discrimination between neurons because it pushes all scaling factors towards 0. A more reasonable pruning method is to only suppress unimportant neurons (with 0 scaling factors) and simultaneously keep important neurons intact (with larger scaling factor). To achieve this goal, we propose a new regularizer on scaling factors, namely polarization regularizer. Theoretically, we prove that polarization regularizer pushes some scaling factors to 0 and others to a value $a > 0$. Experimentally, we show that structured pruning using polarization regularizer achieves much better results than using L1 regularizer. Experiments on CIFAR and ImageNet datasets show that polarization pruning achieves the state-of-the-art result to date.
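
A polarization-style regularizer can be written in a few lines; the specific form t·||a||₁ − ||a − mean(a)·1||₁ below is one published instance, and the constant t is illustrative rather than a recommended setting:

```python
# Sketch: polarization-style regularizer on neuron scaling factors. It pulls
# some factors toward 0 while pushing the rest away from the mean, unlike a
# plain L1 penalty that pushes all factors toward 0.
import torch

def polarization_regularizer(scaling_factors, t=1.2):
    a = scaling_factors
    return t * a.abs().sum() - (a - a.mean()).abs().sum()
```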

AAAI Conference 2020 Conference Paper

Quadruply Stochastic Gradient Method for Large Scale Nonlinear Semi-Supervised Ordinal Regression AUC Optimization

  • Wanli Shi
  • Bin Gu
  • Xiang Li
  • Heng Huang

Semi-supervised ordinal regression (S2OR) problems are ubiquitous in real-world applications, where only a few ordered instances are labeled and massive instances remain unlabeled. Recent research has shown that directly optimizing concordance index or AUC can impose a better ranking on the data than optimizing the traditional error rate in ordinal regression (OR) problems. In this paper, we propose an unbiased objective function for S2OR AUC optimization based on the ordinal binary decomposition approach. Besides, to handle large-scale kernelized learning problems, we propose a scalable algorithm called QS3ORAO using the doubly stochastic gradients (DSG) framework for functional optimization. Theoretically, we prove that our method can converge to the optimal solution at the rate of O(1/t), where t is the number of iterations for stochastic data sampling. Extensive experimental results on various benchmark and real-world datasets also demonstrate that our method is efficient and effective while retaining similar generalization performance.

AAAI Conference 2020 Conference Paper

Safe Sample Screening for Robust Support Vector Machine

  • Zhou Zhai
  • Bin Gu
  • Xiang Li
  • Heng Huang

Robust support vector machine (RSVM) has been shown to perform remarkably well in improving the generalization performance of support vector machines under noisy environments. Unfortunately, in order to handle the non-convexity induced by the ramp loss in RSVM, existing RSVM solvers often adopt the DC programming framework, which is computationally inefficient for running multiple outer loops. This hinders the application of RSVM to large-scale problems. Safe sample screening, which allows for the exclusion of training samples prior to or early in the training process, is an effective method to greatly reduce computational time. However, existing safe sample screening algorithms are limited to convex optimization problems while RSVM is a non-convex problem. To address this challenge, in this paper, we propose two safe sample screening rules for RSVM based on the framework of the concave-convex procedure (CCCP). Specifically, we provide a screening rule for the inner solver of CCCP and another rule for propagating screened samples between two successive solvers of CCCP. To the best of our knowledge, this is the first work on safe sample screening for a non-convex optimization problem. More importantly, we provide a security guarantee for our sample screening rules for RSVM. Experimental results on a variety of benchmark datasets verify that our safe sample screening rules can significantly reduce the computational time.

AAAI Conference 2020 Conference Paper

Understanding the Disharmony between Weight Normalization Family and Weight Decay

  • Xiang Li
  • Shuo Chen
  • Jian Yang

The merits of fast convergence and potentially better performance of the weight normalization family have drawn increasing attention in recent years. These methods use standardization or normalization that changes the weight $W$ to $\widetilde{W}$, which makes $\widetilde{W}$ independent of the magnitude of $W$. Surprisingly, $W$ must be decayed during gradient descent, otherwise we will observe a severe under-fitting problem, which is very counter-intuitive since weight decay is widely known to prevent deep networks from over-fitting. Moreover, if we substitute (e.g., weight normalization) $\widetilde{W} = \frac{W}{\|W\|}$ in the original loss function $\sum_i L(f(x_i; \widetilde{W}), y_i) + \frac{1}{2}\lambda\|\widetilde{W}\|^2$, it is observed that the regularization term $\frac{1}{2}\lambda\|\widetilde{W}\|^2$ will be canceled as a constant $\frac{1}{2}\lambda$ in the optimization objective. Therefore, to decay $W$, we need to explicitly append the term $\frac{1}{2}\lambda\|W\|^2$. In this paper, we theoretically prove that $\frac{1}{2}\lambda\|W\|^2$ improves optimization only by modulating the effective learning rate and fairly has no influence on generalization when the weight normalization family is compositely employed. Furthermore, we also expose several serious problems when introducing the weight decay term to the weight normalization family, including the missing of the global minimum, training instability and sensitivity of initialization. To address these problems, we propose an Adaptive Weight Shrink (AWS) scheme, which gradually shrinks the weights during optimization by a dynamic coefficient proportional to the magnitude of the parameter. This simple yet effective method appropriately controls the effective learning rate, which significantly improves the training stability and makes optimization more robust to initialization.
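
The Adaptive Weight Shrink idea can be sketched as a post-update hook; the proportionality constant and the choice of which tensors to shrink are assumptions for illustration, not the paper's prescribed settings:

```python
# Sketch: after each optimizer step, shrink every weight tensor by a factor
# proportional to its own norm (clamped so the factor stays below 1).
import torch

@torch.no_grad()
def adaptive_weight_shrink(parameters, coeff=1e-4):
    for p in parameters:
        if p.dim() > 1:  # only shrink weight matrices / conv kernels, not biases
            shrink = (coeff * p.norm()).clamp(max=0.5)
            p.mul_(1.0 - shrink)
```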

NeurIPS Conference 2019 Conference Paper

A Regularized Approach to Sparse Optimal Policy in Reinforcement Learning

  • Wenhao Yang
  • Xiang Li
  • Zhihua Zhang

We propose and study a general framework for regularized Markov decision processes (MDPs) where the goal is to find an optimal policy that maximizes the expected discounted total reward plus a policy regularization term. The extant entropy-regularized MDPs can be cast into our framework. Moreover, under our framework, many regularization terms can bring multi-modality and sparsity, which are potentially useful in reinforcement learning. In particular, we present sufficient and necessary conditions that induce a sparse optimal policy. We also conduct a full mathematical analysis of the proposed regularized MDPs, including the optimality condition, performance error, and sparseness control. We provide a generic method to devise regularization forms and propose off-policy actor critic algorithms in complex environment settings. We empirically analyze the numerical properties of optimal policies and compare the performance of different sparse regularization forms in discrete and continuous environments.

NeurIPS Conference 2019 Conference Paper

Arbicon-Net: Arbitrary Continuous Geometric Transformation Networks for Image Registration

  • Jianchun Chen
  • Lingjing Wang
  • Xiang Li
  • Yi Fang

This paper concerns the undetermined problem of estimating geometric transformation between image pairs. Recent methods introduce deep neural networks to predict the controlling parameters of hand-crafted geometric transformation models (e.g., thin-plate spline) for image registration and matching. However, the low-dimension parametric models are incapable of estimating a highly complex geometric transform with limited flexibility to model the actual geometric deformation from image pairs. To address this issue, we present an end-to-end trainable deep neural network, named Arbitrary Continuous Geometric Transformation Networks (Arbicon-Net), to directly predict the dense displacement field for pairwise image alignment. Arbicon-Net is generalized from training data to predict the desired arbitrary continuous geometric transformation in a data-driven manner for unseen new pairs of images. Particularly, without imposing penalization terms, the predicted displacement vector function is proven to be spatially continuous and smooth. To verify the performance of Arbicon-Net, we conducted semantic alignment tests over both synthetic and real image datasets with various experimental settings. The results demonstrate that Arbicon-Net outperforms the previous image alignment techniques in identifying the image correspondences.

IJCAI Conference 2019 Conference Paper

Dynamic Feature Fusion for Semantic Edge Detection

  • Yuan Hu
  • Yunpeng Chen
  • Xiang Li
  • Jiashi Feng

Features from multiple scales can greatly benefit the semantic edge detection task if they are well fused. However, the prevalent semantic edge detection methods apply a fixed weight fusion strategy where images with different semantics are forced to share the same weights, resulting in universal fusion weights for all images and locations regardless of their different semantics or local context. In this work, we propose a novel dynamic feature fusion strategy that assigns different fusion weights for different input images and locations adaptively. This is achieved by a proposed weight learner to infer proper fusion weights over multi-level features for each location of the feature map, conditioned on the specific input. In this way, the heterogeneity in contributions made by different locations of feature maps and input images can be better considered and thus help produce more accurate and sharper edge predictions. We show that our model with the novel dynamic feature fusion is superior to fixed weight fusion and also the na\"ive location-invariant weight fusion methods, via comprehensive experiments on benchmarks Cityscapes and SBD. In particular, our method outperforms all existing well established methods and achieves new state-of-the-art.
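
A minimal sketch of location-adaptive fusion follows, assuming the multi-level features have already been resized to a common resolution; the 1x1-conv weight learner below is a simplified stand-in for the paper's module:

```python
# Sketch: a weight learner predicts per-pixel softmax weights over the levels,
# replacing a single fixed fusion weight per level.
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    def __init__(self, channels, num_levels):
        super().__init__()
        self.weight_learner = nn.Conv2d(channels * num_levels, num_levels, kernel_size=1)

    def forward(self, feats):
        """feats: list of num_levels tensors, each (B, C, H, W) at a common resolution."""
        stacked = torch.stack(feats, dim=1)                                     # (B, L, C, H, W)
        weights = self.weight_learner(torch.cat(feats, dim=1)).softmax(dim=1)   # (B, L, H, W)
        return (stacked * weights.unsqueeze(2)).sum(dim=1)                      # (B, C, H, W)
```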

IROS Conference 2019 Conference Paper

Fast Motion Planning via Free C-space Estimation Based on Deep Neural Network

  • Xiang Li
  • Qixin Cao
  • Mingjing Sun
  • Ganggang Yang

This paper presents a novel learning-based method for fast motion planning in high-dimensional spaces. A deep neural network is designed to rapidly predict the free configuration space given the environment point cloud. With a generated roadmap as an approximate view of the free C-space, LazyPRM is applied to find and check the path with A* search. Due to the application of LazyPRM, the presented method can preserve probabilistic completeness and asymptotic optimality. The new algorithm is tested on a 3-DOF robot arm and a 6-DOF UR3 robot to plan in randomly generated obstacle environments. Results indicate that compared to planners including PRM, RRT*, RRT-connect and the original LazyPRM, our method has the lowest time consumption and relatively short path lengths, showing good performance in both planning speed and path quality.

AAAI Conference 2019 Conference Paper

Inter-Class Angular Loss for Convolutional Neural Networks

  • Le Hui
  • Xiang Li
  • Chen Gong
  • Meng Fang
  • Joey Tianyi Zhou
  • Jian Yang

Convolutional Neural Networks (CNNs) have shown great power in various classification tasks and have achieved remarkable results in practical applications. However, the distinct learning difficulties in discriminating different pairs of classes are largely ignored by the existing networks. For instance, in CIFAR-10 dataset, distinguishing cats from dogs is usually harder than distinguishing horses from ships. By carefully studying the behavior of CNN models in the training process, we observe that the confusion level of two classes is strongly correlated with their angular separability in the feature space. That is, the larger the inter-class angle is, the lower the confusion will be. Based on this observation, we propose a novel loss function dubbed “Inter-Class Angular Loss” (ICAL), which explicitly models the class correlation and can be directly applied to many existing deep networks. By minimizing the proposed ICAL, the networks can effectively discriminate the examples in similar classes by enlarging the angle between their corresponding class vectors. Thorough experimental results on a series of vision and nonvision datasets confirm that ICAL critically improves the discriminative ability of various representative deep neural networks and generates superior performance to the original networks with conventional softmax loss.
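
A simplified stand-in for the angular idea (not the exact ICAL formulation) is to penalize positive cosine similarities between the final-layer class weight vectors, so that class directions are pushed apart:

```python
# Sketch: penalize small inter-class angles via pairwise cosine similarities
# of the classifier's class weight vectors.
import torch
import torch.nn.functional as F

def inter_class_angular_penalty(class_weights):
    """class_weights: (num_classes, dim) final-layer class vectors."""
    w = F.normalize(class_weights, dim=1)
    cos = w @ w.t()                                          # pairwise cosine similarities
    off_diag = cos - torch.eye(len(w), device=w.device)      # zero out the diagonal
    return off_diag.clamp(min=0).sum() / (len(w) * (len(w) - 1))
```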

NeurIPS Conference 2019 Conference Paper

Joint Optimization of Tree-based Index and Deep Model for Recommender Systems

  • Han Zhu
  • Daqing Chang
  • Ziru Xu
  • Pengye Zhang
  • Xiang Li
  • Jie He
  • Han Li
  • Jian Xu

Large-scale industrial recommender systems are usually confronted with computational problems due to the enormous corpus size. To retrieve and recommend the most relevant items to users under response time limits, resorting to an efficient index structure is an effective and practical solution. The previous work Tree-based Deep Model (TDM) \cite{zhu2018learning} greatly improves recommendation accuracy using tree index. By indexing items in a tree hierarchy and training a user-node preference prediction model satisfying a max-heap like property in the tree, TDM provides logarithmic computational complexity w.r.t. the corpus size, enabling the use of arbitrary advanced models in candidate retrieval and recommendation. In tree-based recommendation methods, the quality of both the tree index and the user-node preference prediction model determines the recommendation accuracy for the most part. We argue that the learning of tree index and preference model has interdependence. Our purpose, in this paper, is to develop a method to jointly learn the index structure and user preference prediction model. In our proposed joint optimization framework, the learning of index and user preference prediction model are carried out under a unified performance measure. Besides, we come up with a novel hierarchical user preference representation utilizing the tree index hierarchy. Experimental evaluations with two large-scale real-world datasets show that the proposed method improves recommendation accuracy significantly. Online A/B test results at a display advertising platform also demonstrate the effectiveness of the proposed method in production environments.

IJCAI Conference 2019 Conference Paper

Quadruply Stochastic Gradients for Large Scale Nonlinear Semi-Supervised AUC Optimization

  • Wanli Shi
  • Bin Gu
  • Xiang Li
  • Xiang Geng
  • Heng Huang

Semi-supervised learning is pervasive in real-world applications, where only a few labeled data are available and large amounts of instances remain unlabeled. Since AUC is an important model evaluation metric in classification, directly optimizing AUC in semi-supervised learning scenario has drawn much attention in the machine learning community. Recently, it has been shown that one could find an unbiased solution for the semi-supervised AUC maximization problem without knowing the class prior distribution. However, this method is hardly scalable for nonlinear classification problems with kernels. To address this problem, in this paper, we propose a novel scalable quadruply stochastic gradient algorithm (QSG-S2AUC) for nonlinear semi-supervised AUC optimization. In each iteration of the stochastic optimization process, our method randomly samples a positive instance, a negative instance, an unlabeled instance and their random features to compute the gradient and then update the model by using this quadruply stochastic gradient to approach the optimal solution. More importantly, we prove that QSG-S2AUC can converge to the optimal solution in O(1/t), where t is the iteration number. Extensive experimental results on a variety of benchmark datasets show that QSG-S2AUC is far more efficient than the existing state-of-the-art algorithms for semi-supervised AUC maximization, while retaining the similar generalization performance.

IJCAI Conference 2019 Conference Paper

Scalable Semi-Supervised SVM via Triply Stochastic Gradients

  • Xiang Geng
  • Bin Gu
  • Xiang Li
  • Wanli Shi
  • Guansheng Zheng
  • Heng Huang

Semi-supervised learning (SSL) plays an increasingly important role in the big data era because a large number of unlabeled samples can be used effectively to improve the performance of the classifier. Semi-supervised support vector machine (S3VM) is one of the most appealing methods for SSL, but scaling up S3VM for kernel learning is still an open problem. Recently, a doubly stochastic gradient (DSG) algorithm has been proposed to achieve efficient and scalable training for kernel methods. However, the algorithm and theoretical analysis of DSG are developed based on the convexity assumption which makes them incompetent for non-convex problems such as S3VM. To address this problem, in this paper, we propose a triply stochastic gradient algorithm for S3VM, called TSGS3VM. Specifically, to handle the two types of data instances involved in S3VM, TSGS3VM samples a labeled instance, an unlabeled instance, and their random features in each iteration to compute a triply stochastic gradient. We use the approximated gradient to update the solution. More importantly, we establish a new theoretical analysis for TSGS3VM which guarantees that TSGS3VM can converge to a stationary point. Extensive experimental results on a variety of datasets demonstrate that TSGS3VM is much more efficient and scalable than existing S3VM algorithms.

AAAI Conference 2019 Conference Paper

Spectral Clustering in Heterogeneous Information Networks

  • Xiang Li
  • Ben Kao
  • Zhaochun Ren
  • Dawei Yin

A heterogeneous information network (HIN) is one whose objects are of different types and links between objects could model different object relations. We study how spectral clustering can be effectively applied to HINs. In particular, we focus on how meta-path relations are used to construct an effective similarity matrix based on which spectral clustering is done. We formulate the similarity matrix construction as an optimization problem and propose the SClump algorithm for solving the problem. We conduct extensive experiments comparing SClump with other state-of-the-art clustering algorithms on HINs. Our results show that SClump outperforms the competitors over a range of datasets w.r.t. different clustering quality measures.
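
The overall pipeline can be sketched as follows, with the meta-path similarity weights fixed rather than learned as in SClump (learning those weights is exactly the part this sketch omits):

```python
# Sketch: combine per-meta-path similarity matrices with fixed weights and run
# standard normalized-Laplacian spectral clustering on the result.
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster_hin(metapath_sims, weights, k):
    """metapath_sims: list of (n, n) similarity matrices; weights: same length; k: #clusters."""
    S = sum(w * s for w, s in zip(weights, metapath_sims))
    d = S.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(S)) - d_inv_sqrt @ S @ d_inv_sqrt      # normalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]                                    # k smallest eigenvectors
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```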

IJCAI Conference 2018 Conference Paper

Adversarial Metric Learning

  • Shuo Chen
  • Chen Gong
  • Jian Yang
  • Xiang Li
  • Yang Wei
  • Jun Li

In the past decades, intensive efforts have been put to design various loss functions and metric forms for the metric learning problem. These improvements have shown promising results when the test data is similar to the training data. However, the trained models often fail to produce reliable distances on the ambiguous test pairs due to the different samplings between the training set and test set. To address this problem, the Adversarial Metric Learning (AML) is proposed in this paper, which automatically generates adversarial pairs to remedy the sampling bias and facilitate robust metric learning. Specifically, AML consists of two adversarial stages, i.e., confusion and distinguishment. In the confusion stage, the ambiguous but critical adversarial data pairs are adaptively generated to mislead the learned metric. In the distinguishment stage, a metric is exhaustively learned to try its best to distinguish both the adversarial pairs and original training pairs. Thanks to the challenges posed by the confusion stage in such a competing process, the AML model is able to grasp plentiful difficult knowledge that has not been contained by the original training pairs, so the discriminability of AML can be significantly improved. The entire model is formulated into an optimization framework, of which the global convergence is theoretically proved. The experimental results on toy data and practical datasets clearly demonstrate the superiority of AML to representative state-of-the-art metric learning models.

IJCAI Conference 2018 Conference Paper

Faster Training Algorithms for Structured Sparsity-Inducing Norm

  • Bin Gu
  • Xingwang Ju
  • Xiang Li
  • Guansheng Zheng

Structured-sparsity regularization is popular for sparse learning because of its flexibility of encoding the feature structures. This paper considers a generalized version of structured-sparsity regularization (especially for $l_1/l_{\infty}$ norm) with arbitrary group overlap. Due to the group overlap, it is time-consuming to solve the associated proximal operator. Although Mairal~\shortcite{mairal2010network} have proposed a network-flow algorithm to solve the proximal operator, it is still time-consuming especially in the high-dimensional setting. To address this challenge, in this paper, we have developed a more efficient solution for $l_1/l_{\infty}$ group lasso with arbitrary group overlap using an Inexact Proximal-Gradient method. In each iteration, our algorithm only requires to calculate an inexact solution to the proximal sub-problem, which can be done efficiently. On the theoretic side, the proposed algorithm enjoys the same global convergence rate as the exact proximal methods. Experiments demonstrate that our algorithm is much more efficient than network-flow algorithm, while retaining the similar generalization performance.

JBHI Journal 2018 Journal Article

Frequency Network Analysis of Heart Rate Variability for Obstructive Apnea Patient Detection

  • Zhao Dong
  • Xiang Li
  • Wei Chen

Obstructive sleep apnea (OSA) is a common sleep disorder. Traditional OSA diagnosis methods are cumbersome and expensive, which brings inconvenience to patients and a heavy workload to physicians. Automatically identifying OSA patients from electrocardiogram (ECG) records is important for clinical diagnosis and treatment. In this paper, a new method based on the frequency and network domains is proposed to automatically recognize OSA patients from nocturnal ECG records. First, each RR-interval (beat to beat heart rate) series was divided into segments. By calculating the power spectral density (PSD) of each heart rate variability segment with the Lomb-Scargle method, the dynamic time warping (DTW) distance was used to evaluate the similarity (dissimilarity) of the lower frequency in the PSD series, then the DTW distance matrix was transformed to a binary matrix, and network metrics were calculated to discriminate OSA patients from healthy subjects. The new method was tested with data of 389 subjects collected from two public databases that consist of normal subjects without OSA (apnea-hypopnea index, AHI≤5) and OSA patients (AHI>5). Results show that a single network metric (local clustering coefficient) can recognize OSA patients with 90.1% accuracy, 88.29% sensitivity, and 90.5% specificity, and confirm the potential of using ECG records for OSA patient recognition.

IJCAI Conference 2018 Conference Paper

Mixed Link Networks

  • Wenhai Wang
  • Xiang Li
  • Tong Lu
  • Jian Yang

Based on an analysis revealing the equivalence of modern networks, we find that both ResNet and DenseNet are essentially derived from the same "dense topology", yet they only differ in the form of connection: addition (dubbed "inner link") vs. concatenation (dubbed "outer link"). However, both forms of connections have their superiority and insufficiency. To combine their advantages and avoid certain limitations on representation learning, we present a highly efficient and modularized Mixed Link Network (MixNet) which is equipped with flexible inner link and outer link modules. Consequently, ResNet, DenseNet and Dual Path Network (DPN) can each be regarded as a special case of MixNet. Furthermore, we demonstrate that MixNets can achieve superior efficiency in parameters over the state-of-the-art architectures on many competitive datasets like CIFAR-10/100, SVHN and ImageNet.
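
A toy block illustrating the two link types (channel splits and layer sizes are illustrative, not the published architecture): part of the block output is added onto existing channels (inner link) and part is concatenated as new channels (outer link).

```python
# Sketch of a mixed-link block: addition onto a slice of the existing
# channels plus concatenation of newly produced channels.
import torch
import torch.nn as nn

class MixedLinkBlock(nn.Module):
    def __init__(self, in_ch, inner_ch, outer_ch):
        super().__init__()
        assert inner_ch <= in_ch, "inner link adds onto existing channels"
        self.inner_ch = inner_ch
        self.inner = nn.Sequential(nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
                                   nn.Conv2d(in_ch, inner_ch, 3, padding=1))
        self.outer = nn.Sequential(nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
                                   nn.Conv2d(in_ch, outer_ch, 3, padding=1))

    def forward(self, x):
        y = x.clone()
        y[:, :self.inner_ch] = y[:, :self.inner_ch] + self.inner(x)   # inner link: addition
        return torch.cat([y, self.outer(x)], dim=1)                    # outer link: concatenation
```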

IJCAI Conference 2018 Conference Paper

Pairwise-Ranking based Collaborative Recurrent Neural Networks for Clinical Event Prediction

  • Zhi Qiao
  • Shiwan Zhao
  • Cao Xiao
  • Xiang Li
  • Yong Qin
  • Fei Wang

Patient Electronic Health Records (EHR) data consist of sequences of patient visits over time. Sequential prediction of patients' future clinical events (e.g., diagnoses) from their historical EHR data is a core research task and motivates a series of predictive models including deep learning. The existing research mainly adopts a classification framework, which treats the observed and unobserved events as positive and negative classes. However, this may not be true in real clinical settings considering the high rate of missed diagnoses and human errors. In this paper, we propose to formulate the clinical event prediction problem as an event recommendation problem. An end-to-end pairwise-ranking based collaborative recurrent neural network (PacRNN) is proposed to solve it, which first embeds patient clinical contexts with an attention RNN, then uses Bayesian Personalized Ranking (BPR) regularized by disease co-occurrence to rank probabilities of patient-specific diseases, and uses a point process to provide simultaneous prediction of the occurring time of these diagnoses. Experimental results on two real-world EHR datasets demonstrate the robust performance, interpretability, and efficacy of PacRNN.

NeurIPS Conference 2018 Conference Paper

Pelee: A Real-Time Object Detection System on Mobile Devices

  • Robert Wang
  • Xiang Li
  • Charles Ling

An increasing need of running Convolutional Neural Network (CNN) models on mobile devices with limited computing power and memory resource encourages studies on efficient model design. A number of efficient architectures have been proposed in recent years, for example, MobileNet, ShuffleNet, and MobileNetV2. However, all these models are heavily dependent on depthwise separable convolution which lacks efficient implementation in most deep learning frameworks. In this study, we propose an efficient architecture named PeleeNet, which is built with conventional convolution instead. On the ImageNet ILSVRC 2012 dataset, our proposed PeleeNet achieves a higher accuracy and 1.8 times faster speed than MobileNet and MobileNetV2 on NVIDIA TX2. Meanwhile, PeleeNet is only 66% of the model size of MobileNet. We then propose a real-time object detection system by combining PeleeNet with the Single Shot MultiBox Detector (SSD) method and optimizing the architecture for fast speed. Our proposed detection system, named Pelee, achieves 76.4% mAP (mean average precision) on PASCAL VOC2007 and 22.4 mAP on the MS COCO dataset at the speed of 23.6 FPS on iPhone 8 and 125 FPS on NVIDIA TX2. The result on COCO outperforms YOLOv2 in consideration of a higher precision, 13.6 times lower computational cost and 11.3 times smaller model size. The code and models are open sourced.

AAAI Conference 2017 Conference Paper

Fast Generalized Distillation for Semi-Supervised Domain Adaptation

  • Shuang Ao
  • Xiang Li
  • Charles Ling

Semi-supervised domain adaptation (SDA) is a typical setting when we face the problem of domain adaptation in real applications. How to effectively utilize the unlabeled data is an important issue in SDA. Previous work requires access to the source data to measure the data distribution mismatch, which is ineffective when the size of the source data is relatively large. In this paper, we propose a new paradigm, called Generalized Distillation Semi-supervised Domain Adaptation (GDSDA). We show that without accessing the source data, GDSDA can effectively utilize the unlabeled data to transfer the knowledge from the source models. Then we propose GDSDA-SVM which uses SVM as the base classifier and can efficiently solve the SDA problem. Experimental results show that GDSDA-SVM can effectively utilize the unlabeled data to transfer the knowledge between different domains under the SDA setting.

AAAI Conference 2017 Conference Paper

TaGiTeD: Predictive Task Guided Tensor Decomposition for Representation Learning from Electronic Health Records

  • Kai Yang
  • Xiang Li
  • Haifeng Liu
  • Jing Mei
  • Guotong Xie
  • Junfeng Zhao
  • Bing Xie
  • Fei Wang

With the better availability of healthcare data, such as Electronic Health Records (EHR), more and more data analytics methodologies are developed aiming at digging insights from them to improve the quality of care delivery. There are many challenges in analyzing EHR, such as high dimensionality and event sparsity. Moreover, different from other application domains, the EHR analysis algorithms need to be highly interpretable to make them clinically useful. This makes representation learning from EHRs of key importance. In this paper, we propose an algorithm called Predictive Task Guided Tensor Decomposition (TaGiTeD), to analyze EHRs. Specifically, TaGiTeD learns event interaction patterns that are highly predictive for certain tasks from EHRs with supervised tensor decomposition. Compared with unsupervised methods, TaGiTeD can learn effective EHR representations in a more focused way. This is crucial because most of the medical problems have very limited patient samples, which are not enough for unsupervised algorithms to learn meaningful representations from. We apply TaGiTeD on a real-world EHR data warehouse and demonstrate that TaGiTeD can learn representations that are both interpretable and predictive.

NeurIPS Conference 2016 Conference Paper

LightRNN: Memory and Computation-Efficient Recurrent Neural Networks

  • Xiang Li
  • Tao Qin
  • Jian Yang
  • Tie-Yan Liu

Recurrent neural networks (RNNs) have achieved state-of-the-art performances in many natural language processing tasks, such as language modeling and machine translation. However, when the vocabulary is large, the RNN model will become very big (e.g., possibly beyond the memory capacity of a GPU device) and its training will become very inefficient. In this work, we propose a novel technique to tackle this challenge. The key idea is to use 2-Component (2C) shared embedding for word representations. We allocate every word in the vocabulary into a table, each row of which is associated with a vector, and each column associated with another vector. Depending on its position in the table, a word is jointly represented by two components: a row vector and a column vector. Since the words in the same row share the row vector and the words in the same column share the column vector, we only need $2 \sqrt{|V|}$ vectors to represent a vocabulary of $|V|$ unique words, which are far less than the $|V|$ vectors required by existing approaches. Based on the 2-Component shared embedding, we design a new RNN algorithm and evaluate it using the language modeling task on several benchmark datasets. The results show that our algorithm significantly reduces the model size and speeds up the training process, without sacrifice of accuracy (it achieves similar, if not better, perplexity as compared to state-of-the-art language models). Remarkably, on the One-Billion-Word benchmark Dataset, our algorithm achieves comparable perplexity to previous language models, whilst reducing the model size by a factor of 40-100, and speeding up the training process by a factor of 2. We name our proposed algorithm \emph{LightRNN} to reflect its very small model size and very high training speed.
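
The 2-Component shared embedding is simple to sketch: each word id maps to a (row, column) cell of a roughly sqrt(|V|) × sqrt(|V|) table, and its vector is assembled from one shared row embedding and one shared column embedding. The concatenation below is one possible way to combine the two components, used here only for illustration:

```python
# Sketch: 2-Component shared embedding with a sqrt(|V|)-sided lookup table.
import math
import torch
import torch.nn as nn

class TwoComponentEmbedding(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.side = math.ceil(math.sqrt(vocab_size))
        self.row_emb = nn.Embedding(self.side, dim)   # shared by all words in a row
        self.col_emb = nn.Embedding(self.side, dim)   # shared by all words in a column

    def forward(self, word_ids):
        rows = torch.div(word_ids, self.side, rounding_mode="floor")
        cols = word_ids % self.side
        return torch.cat([self.row_emb(rows), self.col_emb(cols)], dim=-1)
```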

IJCAI Conference 2016 Conference Paper

StalemateBreaker: A Proactive Content-Introducing Approach to Automatic Human-Computer Conversation

  • Xiang Li
  • Lili Mou
  • Rui Yan
  • Ming Zhang

Existing open-domain human-computer conversation systems are typically passive: they either synthesize or retrieve a reply provided with a human-issued utterance. It is generally presumed that humans should take the role to lead the conversation and introduce new content when a stalemate occurs, and that computers only need to "respond." In this paper, we propose STALEMATEBREAKER, a conversation system that can proactively introduce new content when appropriate. We design a pipeline to determine when, what, and how to introduce new content during human-computer conversation. We further propose a novel reranking algorithm Bi-PageRank-HITS to enable rich interaction between conversation context and candidate replies. Experiments show that both the content-introducing approach and the reranking algorithm are effective. Our full STALEMATEBREAKER model outperforms a state-of-the-practice conversation system by +14.4% p@1 when a stalemate occurs.

IJCAI Conference 2015 Conference Paper

Data Sparseness in Linear SVM

  • Xiang Li
  • Huaimin Wang
  • Bin Gu
  • Charles X. Ling

Large sparse datasets are common in many real-world applications. Linear SVM has been shown to be very efficient for classifying such datasets. However, it is still unknown how data sparseness affects its convergence behavior. To study this problem in a systematic manner, we propose a novel approach to generate large and sparse data from real-world datasets, using statistical inference and the data sampling process in the PAC framework. We first study the convergence behavior of linear SVM experimentally and make several observations useful for real-world applications. We then offer theoretical proofs for our observations by studying the Bayes risk and the PAC bound. Our experimental and theoretical results are valuable for learning from large sparse datasets with linear SVM.
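
As a small, self-contained illustration of the setting (the random generator below is not the paper's PAC-based sampling procedure), one can vary the density of a synthetic sparse dataset and watch how a linear SVM behaves:

    import time
    import numpy as np
    from scipy.sparse import random as sparse_random
    from sklearn.svm import LinearSVC

    rng = np.random.RandomState(0)

    def make_sparse_data(n_samples=20000, n_features=5000, density=0.01):
        """Sparse design matrix with labels from a random linear rule (illustrative only)."""
        X = sparse_random(n_samples, n_features, density=density,
                          random_state=rng, format="csr")
        w = rng.randn(n_features)
        y = np.where(X @ w > 0, 1, -1)
        return X, y

    for density in (0.001, 0.01, 0.05):
        X, y = make_sparse_data(density=density)
        start = time.perf_counter()
        clf = LinearSVC(max_iter=5000).fit(X, y)
        print(f"density={density}: train accuracy={clf.score(X, y):.3f}, "
              f"fit time={time.perf_counter() - start:.2f}s")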

AAMAS Conference 2013 Conference Paper

Learning Visual Object Models on A Robot Using Context and Appearance Cues

  • Xiang Li
  • Mohan Sridharan
  • Catie Meador

Visual object recognition is a key challenge to the deployment of robots in domains characterized by partial observability and unforeseen changes. Sophisticated algorithms developed for modeling and recognizing objects using different visual cues [3, 4] are computationally expensive, sensitive to changes in object configurations and environmental factors, and require many training samples and accurate domain knowledge to learn object models, making it difficult for robots to reliably and efficiently model and recognize objects. These challenges are partially offset by the fact that many objects possess unique characteristics (e.g., color and shape) and motion patterns, although these characteristics and patterns are not known in advance and may change over time. Furthermore, only a subset of domain objects is relevant to any given task, and a variety of cues can be extracted from images to represent objects. This paper presents an algorithm that enables robots to identify a set of interesting objects, using appearance-based and contextual cues extracted from a small number of images to efficiently learn models of these objects. Robots learn the domain map and consider objects that move to be interesting, using motion cues to identify the corresponding image regions. Object models learned automatically from these regions consist of spatial arrangements of gradient features, graph-based models of neighborhoods of gradient features, parts-based models of image segments, color distributions, and mixture models of local context. The learned models are used for object recognition in novel scenes based on energy minimization and a generative model for information fusion. All algorithms are evaluated on wheeled robots in indoor and outdoor domains.

IS Journal 2008 Journal Article

IT Strategies for Increased Rail Employee Satisfaction

  • P. Jackson
  • Yanbin Chen
  • R. Farhangi
  • Xiang Li
  • D. Mansion
  • E. Markel
  • R. Morris
  • L. Podgurny

CN (Canadian National Railway) is the largest railway in Canada and a leader in the North American rail industry. In the bids and bulletins evaluation process, 7,500 CN employees submit their job preferences as bids and are assigned jobs in a manner that ensures all positions are filled. Each week, CN posts bulletins describing the available jobs and their requirements. On the basis of the bulletins, employees submit a bid card that identifies and ranks the jobs in which they're most interested. The bidding period runs for seven days. Forty-eight hours after the bidding period closes, job assignments are posted and become effective. A legacy software system, coded in Cobol, assigns the jobs, but regional schedulers must manage exceptions and infeasibilities. Typically, to achieve a publishable schedule, the software must be run several times, with manual overrides. Because the legacy system is slow, schedulers sometimes resort to personalized spreadsheets to assist decision making. CN is pursuing a phased implementation of SAP enterprise systems. As the company phases out legacy software, it's looking for new ways to leverage information technology to improve operations and increase employee satisfaction.

ICRA Conference 2008 Conference Paper

Nonlinear predictive control of an omnidirectional robot dribbling a rolling ball

  • Xiang Li
  • Andreas Zell

This paper focuses on the dribbling control problem of an omnidirectional mobile robot and a rolling ball in the RoboCup Middle Size domain. Because the ball easily slides away from the robot when it moves along a curve, dribbling control is more challenging than the normal mobile robot motion control problem. Based on an introduced reference point with respect to the robot body and a sophisticated planning method for the robot pose, nonlinear predictive control is used to steer the robot to follow the planned poses and thereby prevent the ball from leaving the robot. Real-world experiments showed that nonlinear predictive control is capable of solving the pose-following problem in a real-time application.
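
No equations are given in the abstract; as a generic sketch of the receding-horizon pose-following objective a nonlinear predictive controller minimizes (the cost structure, weights, and symbols are assumptions, not the paper's exact controller):

$$\min_{u_0,\dots,u_{N-1}}\ \sum_{k=0}^{N-1}\Big(\|p_k - p^{\mathrm{ref}}_k\|_Q^2 + \|u_k\|_R^2\Big) + \|p_N - p^{\mathrm{ref}}_N\|_P^2 \quad \text{s.t.}\ p_{k+1} = f(p_k, u_k),$$

where $p_k$ is the pose of the reference point fixed to the robot body, $p^{\mathrm{ref}}_k$ the planned pose, $u_k$ the velocity command, and $f$ the omnidirectional robot's kinematic model; at each control step only the first optimized input is applied and the horizon is shifted forward.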

ICRA Conference 2007 Conference Paper

Dribbling Control of Omnidirectional Soccer Robots

  • Xiang Li
  • Maosen Wang
  • Andreas Zell

This paper focuses on the dribbling control problem of an omnidirectional mobile robot. Because the movement of the dribbled object must be considered, dribbling control is more challenging than normal mobile robot motion control. A new feedback control algorithm is proposed, which steers a reference point to follow the desired movement while keeping the ball near that point. To dribble a rolling ball along a given path, the robot should provide the ball with appropriate force through consecutive pushing operations as they travel in an environment with obstacles. Based on an analysis of the forces acting on the ball with respect to the mobile robot coordinate system, a constraint on robot movement during the dribbling process is also introduced. Simulation and real-world experiments demonstrate the performance of this control algorithm.