Arrow Research search

Author name cluster

Xiang Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

179 papers
2 author rows

Possible papers (179)

AAAI Conference 2026 Conference Paper

Analyze–Compose–Execute: A Dynamic Dialogue Framework for Multi-Agent Debate

  • Wenyuan Gu
  • Haowen Wang
  • Jiale Han
  • Xiang Li
  • Zhixuan Wu
  • Hongru Xiao
  • Bo Cheng

Multi-Agent Debate (MAD) is an emerging paradigm that leverages the reasoning abilities of Large Language Models (LLMs) by encouraging them to collaboratively solve problems through human-like discussions. However, current MAD methods typically constrain agents to follow fixed discussion pipelines, repeatedly applying the same discussion act for a predetermined number of rounds, which limits their effectiveness and adaptability in complex and diverse tasks. To address this limitation, we propose Analyze–Compose–Execute (ACE), a novel debate framework in which agents dynamically execute the discussion actions according to the dialogue context. By analyzing the current responses of agents, ACE selects appropriate acts from a predefined Atomic Discussion Acts Library (ADAL), which are composed into a discussion action to be executed in the next round, to enable truly dynamic debate. We conduct extensive experiments on the challenging Big-Bench Hard (BBH) benchmark. ACE achieves state-of-the-art results on 17 out of 23 tasks, with an average performance gain of 8.5% across all tasks, demonstrating the effectiveness and robustness of our approach.

AAAI Conference 2026 Conference Paper

Beyond Adapter Retrieval: Latent Geometry-Preserving Composition via Sparse Task Projection

  • Pengfei Jin
  • Peng Shu
  • Sifan Song
  • Sekeun Kim
  • Qing Xiao
  • Cheng Chen
  • Tianming Liu
  • Xiang Li

Recent advances in parameter-efficient transfer learning have demonstrated the utility of composing LoRA adapters from libraries of pretrained modules. However, most existing approaches rely on simple retrieval heuristics or uniform averaging, which overlook the latent structure of task relationships in representation space. We propose a new framework for adapter reuse that moves beyond retrieval, formulating adapter composition as a geometry-aware sparse reconstruction problem. Specifically, we represent each task by a latent prototype vector derived from the base model’s encoder and aim to approximate the target task prototype as a sparse linear combination of retrieved reference prototypes, under an L1-regularized optimization objective. The resulting combination weights are then used to blend the corresponding LoRA adapters, yielding a composite adapter tailored to the target task. This formulation not only preserves the local geometric structure of the task representation manifold, but also promotes interpretability and efficient reuse by selecting a minimal set of relevant adapters. We demonstrate the effectiveness of our approach across multiple domains—including medical image segmentation, medical report generation and image synthesis. Our results highlight the benefit of coupling retrieval with latent geometry-aware optimization for improved zero-shot generalization.
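
A minimal sketch of the sparse-reconstruction step described above, assuming numpy prototypes and a toy dictionary of LoRA factors; the helper name compose_adapter, the "A"/"B" keys, and the choice of scikit-learn's Lasso as the L1 solver are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def compose_adapter(target_proto, ref_protos, ref_adapters, l1=0.05):
    """Approximate the target task prototype as a sparse linear combination of
    reference prototypes, then blend the matching LoRA adapters accordingly.

    target_proto : (d,) latent prototype of the target task
    ref_protos   : (k, d) prototypes of the k reference tasks
    ref_adapters : list of k dicts, layer name -> {"A": (r, d), "B": (d, r)}
    """
    # L1-regularized reconstruction: target ~= ref_protos.T @ w, with w sparse.
    lasso = Lasso(alpha=l1, fit_intercept=False, max_iter=10000)
    lasso.fit(ref_protos.T, target_proto)          # design matrix is (d, k)
    w = lasso.coef_                                 # (k,) sparse combination weights

    blended = {}
    for name in ref_adapters[0]:
        # Blend the full low-rank updates Delta W = B @ A (one of several
        # possible blending choices; shown here only for concreteness).
        delta = sum(w[i] * (ref_adapters[i][name]["B"] @ ref_adapters[i][name]["A"])
                    for i in range(len(w)))
        blended[name] = delta
    return blended, w

# Tiny illustration with random data: 4 reference tasks, 16-dim prototypes,
# one LoRA layer "q_proj" with rank-2 factors.
rng = np.random.default_rng(0)
protos = rng.normal(size=(4, 16))
target = 0.7 * protos[0] + 0.3 * protos[2]
adapters = [{"q_proj": {"A": rng.normal(size=(2, 16)), "B": rng.normal(size=(16, 2))}}
            for _ in range(4)]
blended, w = compose_adapter(target, protos, adapters)
print(w.round(2), blended["q_proj"].shape)
```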

AAAI Conference 2026 Conference Paper

DenoDet V2: Phase-Amplitude Cross Denoising for SAR Object Detection

  • Kang Ni
  • Minrui Zou
  • Yuxuan Li
  • Xiang Li
  • Kehua Guo
  • Ming-Ming Cheng
  • Yimian Dai

One of the primary challenges in Synthetic Aperture Radar (SAR) object detection lies in the pervasive influence of coherent noise. As a common practice, most existing methods, whether handcrafted approaches or deep learning-based methods, employ the analysis or enhancement of object spatial-domain characteristics to achieve implicit denoising. In this paper, we propose DenoDet V2, which explores a completely novel and different perspective to deconstruct and modulate the features in the transform domain via a carefully designed attention architecture. Compared to DenoDet V1, DenoDet V2 is a major advancement that exploits the complementary nature of amplitude and phase information through a band-wise mutual modulation mechanism, which enables a reciprocal enhancement between phase and amplitude spectra. Extensive experiments on various SAR datasets demonstrate the state-of-the-art performance of DenoDet V2. Notably, DenoDet V2 achieves a significant 0.8% improvement on SARDet-100K dataset compared to DenoDet V1, while reducing the model complexity by half.
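
The abstract works in the transform domain on amplitude and phase spectra; the sketch below only shows that decomposition and its inverse with numpy's FFT, as a hedged illustration of the representation DenoDet V2 modulates. The attention-based band-wise mutual modulation itself is omitted.

```python
import numpy as np

def amplitude_phase_split(feat):
    """Split a 2-D feature map into amplitude and phase spectra."""
    spec = np.fft.fft2(feat)
    return np.abs(spec), np.angle(spec)

def reconstruct(amplitude, phase):
    # Recombine the (possibly modulated) spectra and return to the spatial domain.
    return np.fft.ifft2(amplitude * np.exp(1j * phase)).real

feat = np.random.randn(64, 64)
amp, pha = amplitude_phase_split(feat)
assert np.allclose(reconstruct(amp, pha), feat)   # lossless round trip
```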

AAAI Conference 2026 Conference Paper

Efficient Transcoder Adaptation for Fine-Tuned Models: Revealing Medical Reasoning Mechanisms in Large Language Models

  • Zhouxing Tan
  • Hanlin Xue
  • Yulong Wan
  • Ruochong Xiong
  • Xu Chu
  • Xiang Li
  • Junfei Liu

Large language models (LLMs) suffer from a lack of decision-making transparency, limiting their deployment in high-stakes domains such as healthcare. We propose a mechanistic interpretability framework that introduces two novel paradigms: Medical Fine-Tuning with Frozen Attention Layers (FTFA) and Posterior Adaptation Transcoders (PAT). FTFA freezes attention layers while fine-tuning only feed-forward network (FFN) parameters, enabling PAT to efficiently adapt pre-trained transcoders on the same data. This approach achieves over 1000× efficiency improvement compared to training transcoders from scratch. We theoretically justify this methodology and demonstrate its cost-effectiveness for cross-domain transfer. Transcoders are sparse autoencoders that replace MLP layers to provide interpretable feature representations. By substituting MLP layers of both base Gemma2-2b and its medical fine-tuned variant with per-layer transcoders, we enable feature-level attribution analysis. Through systematic pruning and node merging of resulting attribution graphs, we construct human-interpretable decision pathways. Our analysis reveals that LLMs employ two parallel mechanisms for medical diagnosis: pattern matching and multi-hop reasoning, with fine-tuned models demonstrating enhanced correct reasoning patterns. This work provides a practical framework for training transcoders on fine-tuned models at minimal cost, enabling broader application of mechanistic interpretability across domains and potentially guiding model training through transcoder-based analysis.

AAAI Conference 2026 Conference Paper

Ego-PMOVE: Prompt-aware Mixture of View Experts Network for Egocentric Gaze Prediction

  • Heqian Qiu
  • Lanxiao Wang
  • Taijin Zhao
  • Zhaofeng Shi
  • Xiang Li
  • Linfeng Xu
  • Hongliang Li

Egocentric gaze prediction serves as a critical indicator for decoding human visual attention and cognitive processes, but its inherently limited field of view creates prediction challenges. Although exo-view data provides supplementary contextual information, it exhibits significant spatial and semantic gaps. Existing methods focus solely on isolated feature encoding in single-view paradigms, neglecting cross-view gaze correlations. To bridge this gap, we make the first exploration of the cross-view gaze relationship for egocentric gaze prediction, and propose Ego-PMOVE, a novel Prompt-aware Mixture of View Experts network. Unlike prior cross-view studies that forcibly align cross-view features thereby introducing inference noise, we leverage the popular Mixture-of-Experts (MoE) and a set of flexible prompts to disentangle features from different views into three parallel experts: a view-shared expert directly modeling common semantic relationships, a view-discrepancy expert adaptively adjusting the spatial position, scale and shifts based on different view-specific features, and an egocentric expert extracting independent features to compensate for the case of missing exocentric data. To balance these experts, we further design a soft router to dynamically weight them for mining useful information while suppressing noise. A view-query gaze decoder then generates view-specific gaze attention maps, jointly optimized by gaze-heatmap and cross-view contrastive losses that regularize both shared and divergent features for accurate gaze prediction. Extensive experiments across the multi-view EgoMe dataset and single-view Ego4D and EGTEA Gaze++ datasets demonstrate the effectiveness and generalizability of our approach.

AAAI Conference 2026 Conference Paper

GeoBayes: Probabilistic Image Geo-Localization Inference via Sequential Bayesian Updating

  • Weimin Shi
  • Xiang Li
  • Kaige Li
  • Junhao Fang
  • Qiang Zhou
  • Qichuan Geng
  • Zhong Zhou

Image geo-localization aims to determine the geographic location of a query image. While Multimodal Large Language Models (MLLMs) show potential for this task due to their rich world knowledge and explainable abilities, they often struggle with confirmation bias, i.e., committing to early, potentially incorrect guesses driven by visual clues with varied geographic likelihoods. In this paper, we propose GeoBayes, a novel training-free framework that formulates geo-localization as a Maximum a Posteriori (MAP) estimation task over multiple geographic hypotheses and performs probabilistic inference via sequential Bayesian reasoning. GeoBayes treats each visual object and its associated geographic clues as probabilistic evidence, integrating them iteratively through a Hypothesize–Verify–Update loop. At each step, it evaluates how new evidence supports existing hypotheses and updates their posterior probabilities, gradually converging on the most probable location. This allows GeoBayes to explicitly quantify and fuse the varied geographic probabilities implied by various visual elements, reducing the risk of overcommitting to misleading clues. Furthermore, considering the natural hierarchy of geographic labels (e.g., country, city), GeoBayes introduces a state memory mechanism that stores hypotheses, inference context, and evidence scores across levels. This design enables the framework to propagate prior knowledge across levels of the geographic hierarchy and incorporate geographic structural constraints into the Bayesian update process, achieving a coarse-to-fine geo-localization. Experiments on IM2GPS3k and YFCC4K show that GeoBayes improves MLLM-based geo-localization accuracy without extra training. This demonstrates the effectiveness of probabilistic reasoning for robust and interpretable geo-localization.
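
A toy illustration of the Hypothesize-Verify-Update idea as plain Bayesian updating over a fixed hypothesis set; the city names and likelihood numbers below are invented for illustration, whereas in GeoBayes the likelihoods would come from the MLLM's verification of each visual clue.

```python
import numpy as np

def bayes_update(prior, likelihoods):
    """One update step: multiply the prior over location hypotheses by the
    evidence likelihoods and renormalize."""
    posterior = prior * likelihoods
    return posterior / posterior.sum()

hypotheses = ["Lisbon", "Porto", "Seville"]
belief = np.full(3, 1 / 3)                            # uniform prior
for clue_likelihood in [np.array([0.6, 0.3, 0.1]),    # e.g. tram style
                        np.array([0.5, 0.4, 0.1])]:   # e.g. tiled facade
    belief = bayes_update(belief, clue_likelihood)

print(dict(zip(hypotheses, belief.round(3))))          # posterior over hypotheses
```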

AAAI Conference 2026 Conference Paper

GigaMoE: Sparsity-Guided Mixture of Experts for Efficient Gigapixel Object Detection

  • Xiang Li
  • Wenxi Li
  • Yuetong Wang
  • Chenyang Lyu
  • Haozhe Lin
  • Guiguang Ding
  • Yuchen Guo

Object detection in High-Resolution Wide (HRW) shots, or gigapixel images, presents unique challenges due to extreme object sparsity and vast scale variations. State-of-the-art methods like SparseFormer have pioneered sparse processing by selectively focusing on important regions, yet they apply a uniform computational model to all selected regions, overlooking their intrinsic complexity differences. This leads to a suboptimal trade-off between performance and efficiency. In this paper, we introduce GigaMoE, a novel backbone architecture that pioneers adaptive computation for this domain by replacing the standard Feed-Forward Networks (FFNs) with a Mixture-of-Experts (MoE) module. Our architecture first employs a shared expert to provide a robust feature baseline for all selected regions. Upon this foundation, our core innovation, a novel Sparsity-Guided Routing mechanism, insightfully repurposes importance scores from the sparse backbone to provide a "computational bonus," dynamically engaging a variable number of specialized experts based on content complexity. The entire system is trained efficiently via a loss-free load-balancing technique, eliminating the need for cumbersome auxiliary losses. Extensive experiments show that GigaMoE sets a new state-of-the-art on the PANDA benchmark, improving detection accuracy by 1.1% over SparseFormer while simultaneously reducing the computational cost (FLOPs) by a remarkable 32.3%.
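
A rough sketch of what a sparsity-guided router could look like: every region gets the shared expert, and the number of extra specialists grows with the backbone's importance score (the "computational bonus"). The routing rule, function names, and score range are assumptions for illustration, not GigaMoE's actual mechanism.

```python
import numpy as np

def route_regions(gate_logits, importance, max_extra=3):
    """Toy router: pick a top-k of specialist experts per region, where k grows
    with the region's importance score in [0, 1]."""
    probs = np.exp(gate_logits - gate_logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    routes = []
    for p, s in zip(probs, importance):
        k = 1 + int(round(s * max_extra))          # more experts for harder regions
        experts = np.argsort(p)[::-1][:k]          # top-k specialist experts
        weights = p[experts] / p[experts].sum()    # renormalized gate weights
        routes.append((experts, weights))
    return routes

gate = np.random.randn(4, 8)                       # 4 regions, 8 specialist experts
imp = np.array([0.1, 0.4, 0.7, 1.0])               # backbone importance scores
for i, (e, w) in enumerate(route_regions(gate, imp)):
    print(f"region {i}: experts {e.tolist()} weights {w.round(2).tolist()}")
```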

AAAI Conference 2026 Conference Paper

Human Cognition Inspired RAG with Knowledge Graph for Complex Problem Solving

  • Yao Cheng
  • Yibo Zhao
  • Jiapeng Zhu
  • Yao Liu
  • Xing Sun
  • Xiang Li

Large Language Models (LLMs) have demonstrated significant potential across various domains. However, they often struggle with integrating external knowledge and performing complex reasoning, leading to hallucinations and unreliable outputs. Retrieval Augmented Generation (RAG) has emerged as a promising paradigm to mitigate these issues by incorporating external knowledge. Yet, conventional RAG approaches, especially those based on vector similarity, fail to effectively capture relational dependencies and support multi-step reasoning. In this work, we propose CogGRAG, a human cognition-inspired, graph-based RAG framework designed for Knowledge Graph Question Answering (KGQA). CogGRAG models the reasoning process as a tree-structured mind map that decomposes the original problem into interrelated subproblems and explicitly encodes their semantic relationships. This structure not only provides a global view to guide subsequent retrieval and reasoning but also enables self-consistent verification across reasoning paths. The framework operates in three stages: (1) top-down problem decomposition via mind map construction, (2) structured retrieval of both local and global knowledge from external Knowledge Graphs (KGs), and (3) bottom-up reasoning with dual-process self-verification. Unlike previous tree-based decomposition methods such as MindMap or Graph-CoT, CogGRAG unifies problem decomposition, knowledge retrieval, and reasoning under a single graph-structured cognitive framework, allowing early integration of relational knowledge and adaptive verification. Extensive experiments demonstrate that CogGRAG achieves superior accuracy and reliability compared to existing methods.

AAAI Conference 2026 Conference Paper

LPPG-RL: Lexicographically Projected Policy Gradient Reinforcement Learning with Subproblem Exploration

  • Ruiyu Qiu
  • Rui Wang
  • Guanghui Yang
  • Xiang Li
  • Zhijiang Shao

Lexicographic multi-objective problems, which consist of multiple conflicting subtasks with explicit priorities, are common in real-world applications. Despite the advantages of Reinforcement Learning (RL) in single tasks, extending conventional RL methods to prioritized multiple objectives remains challenging. In particular, traditional Safe RL and Multi-Objective RL (MORL) methods have difficulty enforcing priority orderings efficiently. Therefore, Lexicographic Multi-Objective RL (LMORL) methods have been developed to address these challenges. However, existing LMORL methods either rely on heuristic threshold tuning with prior knowledge or are restricted to discrete domains. To overcome these limitations, we propose Lexicographically Projected Policy Gradient RL (LPPG-RL), a novel LMORL framework which leverages sequential gradient projections to identify feasible policy update directions, thereby making LPPG-RL broadly compatible with all policy gradient algorithms in continuous spaces. LPPG-RL reformulates the projection step as an optimization problem, and utilizes Dykstra's projection rather than generic solvers to deliver great speedups, especially for small- to medium-scale instances. In addition, LPPG-RL introduces Subproblem Exploration (SE) to prevent gradient vanishing, accelerate convergence and enhance stability. We provide theoretical guarantees for convergence and establish a lower bound on policy improvement. Finally, through extensive experiments in a 2D navigation environment, we demonstrate the effectiveness of LPPG-RL, showing that it outperforms existing state-of-the-art continuous LMORL methods.
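
The projection step can be pictured as finding the closest update direction that does not decrease any higher-priority objective. Below is a small Dykstra's-projection sketch onto an intersection of halfspaces, a simplified stand-in under that reading; the constraint form and example gradients are assumptions, not the paper's exact formulation.

```python
import numpy as np

def project_halfspace(x, a):
    """Euclidean projection of x onto the halfspace {d : a @ d >= 0}."""
    v = a @ x
    return x if v >= 0 else x - (v / (a @ a)) * a

def dykstra_project(grad, constraints, n_iter=200):
    """Dykstra's algorithm: project `grad` onto the intersection of the
    halfspaces {d : g_i @ d >= 0}, i.e. the closest update direction that keeps
    all higher-priority objectives non-decreasing (to first order)."""
    x = grad.copy()
    increments = [np.zeros_like(grad) for _ in constraints]
    for _ in range(n_iter):
        for i, g in enumerate(constraints):
            y = project_halfspace(x + increments[i], g)
            increments[i] = x + increments[i] - y
            x = y
    return x

grad = np.array([-0.5, -1.0])                            # low-priority update direction
higher = [np.array([1.0, 0.0]), np.array([1.0, 1.0])]    # higher-priority gradients
d = dykstra_project(grad, higher)
print(d.round(3), [float(g @ d) >= -1e-9 for g in higher])   # direction respects both constraints
```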

AAAI Conference 2026 Conference Paper

Multiplex Heterogeneous Graph Neural Networks with Euclidean-Riemannian Mutual Space Synergy

  • Xiang Li
  • Yuan Cao
  • Zhongying Zhao
  • Guoqing Chao
  • Yanwei Yu

Multiplex heterogeneous networks are common in real-world scenarios, where entities interact through diverse types of relations across multiple semantic layers. Recent advances in multiplex heterogeneous graph neural networks have achieved remarkable results by incorporating node and relation types into message passing and designing relation-aware architectures. However, most existing methods either decouple relations and risk losing complex semantics or require handcrafted relation patterns, which limit scalability. Moreover, prevailing models are typically restricted to Euclidean space, making it difficult to capture non-Euclidean topologies and to distinguish complex interactions among heterogeneous nodes and relations. Standard GNN message passing, grounded in the homophily assumption, also proves inadequate for the intricate, coupled structures in multiplex heterogeneous graphs. To address these challenges, we propose MRiemGNN, a novel multiplex heterogeneous graph neural network that synergizes Euclidean and Riemannian spaces through a geometry-aware, relation-specific message passing scheme and cross-space mutual learning. Experiments on multiple real-world datasets show that MRiemGNN achieves superior performance, efficiency, and scalability on both node classification and link prediction tasks.

AAAI Conference 2026 Conference Paper

SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection

  • Yuxuan Li
  • Xiang Li
  • Yunheng Li
  • Yicheng Zhang
  • Yimian Dai
  • Qibin Hou
  • Ming-Ming Cheng
  • Jian Yang

With the rapid advancement of remote sensing technology, high-resolution multi-modal imagery is now more widely accessible. Conventional object detection models are trained on a single dataset, often restricted to a specific imaging modality and annotation format. However, such an approach overlooks the valuable shared knowledge across multi-modalities and limits the model’s applicability in more versatile scenarios. This paper introduces a new task called Multi-Modal Datasets and Multi-Task Object Detection (M2Det) for remote sensing, designed to accurately detect horizontal or oriented objects from any sensor modality. This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization. To address these, we establish a benchmark dataset and propose a unified model, SM3Det (Single Model for Multi-Modal datasets and Multi-Task object Detection). SM3Det leverages a grid-level sparse MoE backbone to enable joint knowledge learning while preserving distinct feature representations for different modalities. Furthermore, we propose a novel consistency and synchronization optimization mechanism, allowing it to effectively handle varying levels of learning difficulty across modalities and tasks. Extensive experiments demonstrate SM3Det's effectiveness and generalizability, consistently outperforming the combination of specialized models on individual datasets.

AAAI Conference 2026 Conference Paper

SpatioTemporal Difference Network for Video Depth Super-Resolution

  • Zhengxue Wang
  • Yuan Wu
  • Xiang Li
  • Zhiqiang Yan
  • Jian Yang

Depth super-resolution has achieved impressive performance, and the incorporation of multi-frame information further enhances reconstruction quality. Nevertheless, statistical analyses reveal that video depth super-resolution remains affected by pronounced long-tailed distributions, with the long-tailed effects primarily manifesting in spatial non-smooth regions and temporal variation zones. To address these challenges, we propose a novel SpatioTemporal Difference Network (STDNet) comprising two core branches: a spatial difference branch and a temporal difference branch. In the spatial difference branch, we introduce a spatial difference mechanism to mitigate the long-tailed issues in spatial non-smooth regions. This mechanism dynamically aligns RGB features with learned spatial difference representations, enabling intra-frame RGB-D aggregation for depth calibration. In the temporal difference branch, we further design a temporal difference strategy that preferentially propagates temporal variation information from adjacent RGB and depth frames to the current depth frame, leveraging temporal difference representations to achieve precise motion compensation in temporal long-tailed areas. Extensive experimental results across multiple datasets demonstrate the effectiveness of our STDNet, outperforming existing approaches.

AAAI Conference 2026 Conference Paper

Strip R-CNN: Large Strip Convolution for Remote Sensing Object Detection

  • Xinbin Yuan
  • Zhaohui Zheng
  • Yuxuan Li
  • Xialei Liu
  • Li Liu
  • Xiang Li
  • Qibin Hou
  • Ming-Ming Cheng

In this paper, we show that current approaches using large square kernels or transformer-based global modeling aggregate contextual information uniformly across spatial dimensions, leading to feature dilution and localization errors for elongated targets. To mitigate this issue, we propose Strip R-CNN, the first work to systematically explore large strip convolutions for remote sensing object detection. Our key insight is that strip convolutions enable directional feature aggregation along the dominant spatial dimension of slender objects, reducing background interference while preserving essential geometric information. We design two core components: (i) StripNet, a backbone network employing sequential orthogonal large strip convolutions to capture anisotropic spatial patterns, and (ii) Strip Head, which enhances localization precision by incorporating strip convolutions into the detection head. Unlike previous large-kernel approaches that suffer from computational redundancy and isotropic limitations, our method achieves superior performance with remarkable efficiency. Extensive experiments on multiple benchmarks (DOTA, FAIR1M, HRSC2016, and DIOR) demonstrate significant improvements, with our 30M parameter model achieving 82.75% mAP on DOTA-v1.0, establishing a new state-of-the-art record while providing new insights into anisotropic feature learning for remote sensing applications.
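
A minimal PyTorch sketch of the strip-convolution idea: sequential 1xK and Kx1 depthwise convolutions plus a pointwise mix, so context is aggregated along one spatial axis at a time rather than over a KxK square. The block structure, kernel size, and residual placement are illustrative choices, not the exact StripNet design.

```python
import torch
import torch.nn as nn

class StripConvBlock(nn.Module):
    """Large strip convolution: horizontal then vertical depthwise kernels."""
    def __init__(self, channels, k=19):
        super().__init__()
        self.horizontal = nn.Conv2d(channels, channels, (1, k),
                                    padding=(0, k // 2), groups=channels)
        self.vertical = nn.Conv2d(channels, channels, (k, 1),
                                  padding=(k // 2, 0), groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)   # channel mixing

    def forward(self, x):
        return x + self.pointwise(self.vertical(self.horizontal(x)))

x = torch.randn(1, 32, 64, 64)
print(StripConvBlock(32)(x).shape)   # torch.Size([1, 32, 64, 64])
```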

AAAI Conference 2026 Conference Paper

TCoT: Trajectory Chain-of-Thoughts for Robotic Manipulation with Failure Recovery in Vision-Language-Action Model

  • Xiang Li
  • Ya-Li Li
  • Yuan Wang
  • Huaqiang Wang
  • Shengjin Wang

Recent advances in vision-language-action (VLA) models have demonstrated impressive generalization for robotic manipulation. However, these models often operate by directly mapping visual and linguistic inputs to subsequent actions, lacking intermediate task planning as well as failure detection and recovery abilities. These limitations prevent them from effectively decomposing complex tasks, recognizing problems, and correcting erroneous actions, ultimately resulting in complete task failure. This significantly hinders both their ability to perform long-horizon tasks and their ability to generalize. To this end, we introduce TCoT: Trajectory Chain-of-Thought, a unified VLA framework that enhances this direct mapping with trajectory planning as well as failure detection and recovery. TCoT leverages hierarchical trajectories as a precise and compact representation of CoT reasoning for manipulation: global planning provides a high-level, goal-oriented trajectory to guide the robot toward its task objective, while local planning focuses on real-time adjustments to address dynamic changes. Moreover, we design a Global-Local Switching Recovery algorithm that detects failures and effectively recovers from them. Experimental results reveal that TCoT surpasses the state-of-the-art methods across both real and simulated scenarios and exhibits superior generalization capabilities.

AAAI Conference 2026 Conference Paper

Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models

  • Meng Cao
  • Pengfei Hu
  • Yingyao Wang
  • Jihao Gu
  • Haoran Tang
  • Haoze Zhao
  • Chen Wang
  • Jiahua Dong

Recent advancements in Large Video Language Models (LVLMs) have highlighted their potential for multi-modal understanding, yet evaluating their factual grounding in videos remains a critical unsolved challenge. To address this gap, we introduce Video SimpleQA, the first comprehensive benchmark tailored for factuality evaluation in video contexts. Our work differs from existing video benchmarks through the following key features: 1) Knowledge required: demanding integration of external knowledge beyond the video’s explicit narrative; 2) Multi-hop fact-seeking questions: Each question involves multiple explicit facts and requires strict factual grounding without hypothetical or subjective inferences. We include per-hop single-fact-based sub-QAs alongside final QAs to enable fine-grained, step-by-step evaluation; 3) Short-form definitive answer: Answers are crafted as unambiguous and definitively correct in a short format with minimal scoring variance; 4) Temporal grounding required: Requiring answers to rely on one or more temporal segments in videos, rather than single frames. We extensively evaluate 33 state-of-the-art LVLMs and summarize key findings as follows: 1) Current LVLMs exhibit notable deficiencies in factual adherence, with the best-performing model o3 merely achieving an F-score of 66.3%; 2) Most LVLMs are overconfident in what they generate, with self-stated confidence exceeding actual accuracy; 3) Retrieval-Augmented Generation demonstrates consistent improvements at the cost of additional inference time overhead; 4) Multi-hop QA demonstrates substantially degraded performance compared to single-hop sub-QAs, with first-hop object/event recognition emerging as the primary bottleneck. We position Video SimpleQA as the cornerstone benchmark for video factuality assessment, aiming to steer LVLM development toward verifiable grounding in real-world contexts.

JBHI Journal 2026 Journal Article

WGB-GLFI: A Novel Graph-Based Global-Local Feature Interaction Framework for Automated Seizure Detection

  • Xiang Li
  • Mingxing Zhu
  • Chuqi Yang
  • Ke Zhang
  • Xin Wang
  • Sunday Timothy Aboyeji
  • Fei Chen
  • Chen Yao

Epilepsy detection faces significant challenges due to unpredictable seizures, ranging from brief awareness lapses to severe convulsions, posing risks to patients' safety and quality of life. In recent years, deep learning has become a mainstream approach in this field, leveraging advanced computational resources and EEG datasets. However, a key challenge remains: existing methods often lack unified spatial modeling and struggle to effectively handle local detailed features, thereby limiting their accuracy and robustness. To address these issues, we propose the Weighted Graph Building Global-Local Feature Interaction (WGB-GLFI) framework, which integrates spatial connectivity and dynamic patterns through a Weighted Graph Building (WGB) module and a Global-Local Feature Interaction (GLFI) module. This approach excels by comprehensively capturing the dynamic spatial relationships during epileptic seizures and achieving seamless global-local feature integration, significantly enhancing seizure detection performance. Its effectiveness has been validated across multiple datasets, including CHB-MIT, Siena Scalp, and private datasets, demonstrating robust and reliable results. Evaluated on these datasets, our model achieves accuracy rates of 99.28%, 99.21%, and 99.30%, respectively. The reliability and robustness of our framework provide epilepsy patients with faster and more reliable seizure detection, which helps to intervene in a timely manner and improve the quality of life of patients.

ECAI Conference 2025 Conference Paper

ASMA-Tune: Unlocking LLMs' Assembly Code Comprehension via Structural-Semantic Instruction Tuning

  • Xinyi Wang
  • Jiashui Wang
  • Jinbo Su
  • Ke Wang
  • Peng Chen
  • Yanming Liu
  • Long Liu
  • Xiang Li

Assembly code analysis and comprehension play critical roles in applications like reverse engineering, yet they face substantial challenges due to low information density and a lack of explicit syntactic structures. While traditional masked language modeling (MLM) approaches do not explicitly focus on natural language interaction, emerging decoder-focused large language models (LLMs) demonstrate partial success in binary analysis yet remain underexplored for holistic comprehension. We present Assembly Augmented Tuning (ASMA-Tune), an end-to-end structural-semantic instruction tuning framework that synergizes encoder architecture with decoder-based LLMs through a projector module, where the assembly encoder extracts hardware-level structural features, the projector bridges representations with the semantic space, and the instruction-tuned LLM preserves natural language capabilities. Experimental results demonstrate three key advantages: (1) State-of-the-art performance in assembly comprehension with +39.7% Recall@1 and +17.8% MRR improvements over GPT-4-Turbo, (2) Consistent enhancements across base models (24.6–107.4% Recall@1 and 15.2–106.3% MRR on Qwen2.5-Coder, Deepseek-Coder and CodeLlama variants), and (3) Superior instruction-following capabilities (41.5%–118% improvements) with controlled code generation degradation (–8.9% to –35% across architectures).

IROS Conference 2025 Conference Paper

BookBot: A Robotic Manipulation Benchmark for Voice-Driven Book Recognition and Grasping in Cluttered Environments

  • Huaqiang Wang
  • Yuan Wang
  • Xiang Li
  • Yali Li
  • Shengjin Wang

Books, as enduring repositories of cultural heritage as well as knowledge, play a fundamental role in human development. Although advances in embodied AI and robotics revolutionize automation in domains, e.g., manufacturing and logistics, robotic book manipulation remains an underexplored frontier. Two primary bottlenecks impede progress: (1) scarcity of fine-grained annotated datasets for benchmarking robotic book manipulation, and (2) lack of unified perception-action frameworks capable of dynamically coupling multi-modal sensing and manipulation in real-world scenarios. To address these issues, we present THU-Book, the first open-access benchmark featuring 643 3D scene captures, encompassing 11,298 high-fidelity book instances with rich annotations to support tasks from book recognition and localization to grasping and repositioning. Building upon this foundation, we develop BookBot, a novel voice-interactive book manipulation pipeline to support cross-environmental, multilingual, and multi-categorical book manipulation. First, we utilize Large Language Models (LLMs) to parse and comprehend ambiguity in user instructions. We further propose an instance segmentation module combined with an OCR tool to link language to visual instances. Finally, we introduce a PCA-based manipulation policy to refine the robotic grasp pose, utilizing the principal components of the books’ geometry, improving the precision and efficiency of grasping. Experiments conducted on the THU-Book benchmark validate the effectiveness of our BookBot. The dataset is available at https://github.com/wanghq-public/BookBot.
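
A small numpy sketch of the PCA ingredient only: the principal components of a book's point cloud give candidate gripper axes (long edge, short edge, thickness). The synthetic box data and function name are assumptions; BookBot's full policy also involves segmentation, OCR grounding, and grasp refinement, which are not shown.

```python
import numpy as np

def grasp_axes_from_points(points):
    """Return the centroid and the three principal directions of a point cloud,
    sorted by decreasing variance (illustrative PCA step for gripper alignment)."""
    center = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - center, full_matrices=False)
    long_edge, short_edge, thickness = vt          # rows are principal directions
    return center, long_edge, short_edge, thickness

# Synthetic "book": points filling a 20cm x 13cm x 2cm box.
pts = np.random.uniform([-0.10, -0.065, -0.01], [0.10, 0.065, 0.01], size=(5000, 3))
center, e1, e2, e3 = grasp_axes_from_points(pts)
print(center.round(3), e1.round(2))                # e1 ~ +/-[1, 0, 0], the long axis
```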

JBHI Journal 2025 Journal Article

Characterization of Cortical Connectivity in the Deception State With a Data-Driven Network Model Based on EEG Signal

  • Qianruo Kang
  • Yaqian Li
  • Xiang Li
  • Min Tian
  • Yin Xiang
  • Feng Li
  • Siyu Peng
  • Yijun Xiong

This study investigates the pattern of information interaction at the cortical level during deception, aiming to reveal the cognitive processes involved in the deception task. Our study involves the 64-channel EEG signals of 28 subjects (14 for innocent and 14 for guilty groups) acquired under the guilty knowledge test (GKT) lie-detection protocol. Additionally, we establish the functional connectivity network at the cortical level considering volume conduction effects, use a data-driven approach to select the regions of interest (ROIs) on the subject's cortex based on scalp electrical activity, and perform cortical current density estimation on 15 ROIs. The nonlinear dependence between the cortical waveforms of the ROIs is quantified based on mutual information, and a network of cortical mutual information connections is constructed in four frequency bands: delta, theta, alpha, and beta. The feature extraction and classification process are performed in each frequency band, and the mutual information connections statistically different between the innocent and guilty groups are first selected as features using statistical tests. Moreover, the optimal feature subset (OFS) is found by combining the SVM classifier and the wrapper feature selection strategy. Furthermore, the most important mutual information connections (MIMICs) per frequency band are obtained by refining the OFS according to the classification performance curve. The average test accuracies of MIMICs in the delta, theta, alpha, and beta bands reached 99.76%, 96.42%, 84.04%, and 97.61%, respectively. Finally, the physiological significance of each frequency sub-band and the physiological function of MIMICs are combined to explore the cognitive mechanism of lies and provide new evidence for cognitive activity in lying states.

UAI Conference 2025 Conference Paper

Corruption-Robust Variance-aware Algorithms for Generalized Linear Bandits under Heavy-tailed Rewards

  • Qingyuan Yu
  • Euijin Baek
  • Xiang Li
  • Qiang Sun

Stochastic linear bandits have recently received significant attention in sequential decision-making. However, real-world challenges such as heavy-tailed noise, reward corruption, and nonlinear reward functions remain difficult to address. To tackle these difficulties, we propose GAdaOFUL, a novel algorithm that leverages adaptive Huber regression to achieve robustness in generalized linear models (GLMs), where rewards can be nonlinear functions of features. GAdaOFUL achieves a state-of-the-art variance-aware regret bound, scaling with the square root of the cumulative reward variance over time, plus an additional term proportional to the level of corruption. The algorithm adapts to problem complexity, yielding improved regret when the cumulative variance is small. Simulation results demonstrate the robustness and effectiveness of GAdaOFUL in practice. The code is available at https://github.com/NeXAIS/GAdaOFUL.
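
A toy illustration of the robust-regression ingredient only, comparing ordinary least squares with Huber regression (scikit-learn, fixed threshold) under heavy-tailed reward noise; GAdaOFUL's adaptive thresholding, GLM link, and bandit machinery are not shown, and the data below is synthetic.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
theta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(500, 3))
y = X @ theta_true + rng.standard_t(df=1.5, size=500)   # heavy-tailed reward noise

ols = LinearRegression().fit(X, y)
huber = HuberRegressor(epsilon=1.35, max_iter=1000).fit(X, y)   # fixed threshold here;
                                                                # adaptive Huber tunes it
print("OLS error  :", np.linalg.norm(ols.coef_ - theta_true).round(3))
print("Huber error:", np.linalg.norm(huber.coef_ - theta_true).round(3))
```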

AAAI Conference 2025 Conference Paper

Coupling-based Convergence Diagnostic and Stepsize Scheme for Stochastic Gradient Descent

  • Xiang Li
  • Qiaomin Xie

The convergence behavior of Stochastic Gradient Descent (SGD) crucially depends on the stepsize configuration. When using a constant stepsize, the SGD iterates form a Markov chain, enjoying fast convergence during the initial transient phase. However, when reaching stationarity, the iterates oscillate around the optimum without making further progress. In this paper, we study the convergence diagnostics for SGD with constant stepsize, aiming to develop an effective dynamic stepsize scheme. We propose a novel coupling-based convergence diagnostic procedure, which monitors the distance of two coupled SGD iterates for stationarity detection. Our diagnostic statistic is simple and is shown to track the transition from transience to stationarity theoretically. We conduct extensive numerical experiments and compare our method against various existing approaches. Our proposed coupling-based stepsize scheme is observed to achieve superior performance across a diverse set of convex and non-convex problems. Moreover, our results demonstrate the robustness of our approach to a wide range of hyperparameters.
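
A hedged sketch of the coupling idea on a least-squares toy problem: two SGD chains start from different points but share the same sample sequence; once they (nearly) meet, the transient phase is declared over and the stepsize is halved. The meeting threshold, restart rule, and halving factor are illustrative choices, not the paper's diagnostic statistic or schedule.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
b = A @ w_true + 0.5 * rng.normal(size=200)

def sgd_grad(w, idx):
    a, y = A[idx], b[idx]
    return (a @ w - y) * a                     # single-sample least-squares gradient

# Two coupled iterates: different starting points, identical sample sequence.
w1, w2 = np.zeros(5), rng.normal(size=5)
step = 0.1
for t in range(20000):
    idx = rng.integers(len(b))                 # shared sample index -> shared noise
    w1 = w1 - step * sgd_grad(w1, idx)
    w2 = w2 - step * sgd_grad(w2, idx)
    if np.linalg.norm(w1 - w2) < 1e-3:         # the coupled iterates have (nearly) met:
        step *= 0.5                            # transient over, so shrink the stepsize
        w2 = w1 + rng.normal(size=5)           # relaunch the companion iterate
print(round(step, 5), np.linalg.norm(w1 - w_true).round(3))
```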

NeurIPS Conference 2025 Conference Paper

Cue3D: Quantifying the Role of Image Cues in Single-Image 3D Generation

  • Xiang Li
  • Zirui Wang
  • Zixuan Huang
  • James Rehg

Humans and traditional computer vision methods rely on a diverse set of monocular cues to infer 3D structure from a single image, such as shading, texture, silhouette, etc. While recent deep generative models have dramatically advanced single-image 3D generation, it remains unclear which image cues these methods actually exploit. We introduce Cue3D, the first comprehensive, model-agnostic framework for quantifying the influence of individual image cues in single-image 3D generation. Our unified benchmark evaluates seven state-of-the-art methods, spanning regression-based, multi-view, and native 3D generative paradigms. By systematically perturbing cues such as shading, texture, silhouette, perspective, edges, and local continuity, we measure their impact on 3D output quality. Our analysis reveals that shape meaningfulness, not texture, dictates generalization. Geometric cues, particularly shading, are crucial for 3D generation. We further identify over-reliance on provided silhouettes and diverse sensitivities to cues such as perspective and local continuity across model families. By dissecting these dependencies, Cue3D advances our understanding of how modern 3D networks leverage classical vision cues, and offers directions for developing more transparent, robust, and controllable single-image 3D generation models.

AAAI Conference 2025 Conference Paper

Every Opinion Matters: Evaluating and Building Models with Pluralistic Views

  • Xiang Li

The development of large language models has demonstrated robust performance on English-centric benchmarks, which predominantly reflect majority opinions and dominant cultural norms. However, successful deployment in real-world applications requires the ability to handle context-specific and diverse knowledge, which is often underrepresented in training data. Addressing a plurality of perspectives is therefore essential. My research focuses on developing pluralistic evaluation methods to assess the diversity of LLM outputs, with a particular focus on culturally rich common-sense reasoning. Additionally, I work on advancing models that integrate diverse knowledge into LLMs, aiming to bridge the gap between human and AI understanding through the incorporation of varied perspectives using innovative probabilistic frameworks. In this talk, I will emphasize two key directions of my previous work: the probabilistic box model for representing diverse knowledge and probabilistic evaluation for assessing diversity in LLMs, with a focus on distributional aspects. Additionally, I will discuss my efforts to understand model behavior in long-tail scenarios.

NeurIPS Conference 2025 Conference Paper

Fading to Grow: Growing Preference Ratios via Preference Fading Discrete Diffusion for Recommendation

  • Guoqing Hu
  • An Zhang
  • Shuchang Liu
  • Wenyu Mao
  • Jiancan Wu
  • Xun Yang
  • Xiang Li
  • Lantao Hu

Recommenders aim to rank items from a discrete item corpus in line with user interests, yet suffer from extremely sparse user preference data. Recent advances in diffusion models have inspired diffusion-based recommenders, which alleviate sparsity by injecting noise during a forward process to prevent collapse of perturbed preference distributions. However, current diffusion-based recommenders predominantly rely on continuous Gaussian noise, which is intrinsically mismatched with the discrete nature of user preference data in recommendation. In this paper, building upon recent advances in discrete diffusion, we propose PreferGrow, a discrete diffusion-based recommender modeling preference ratios by fading and growing user preferences over the discrete item corpus. PreferGrow differs from existing diffusion-based recommenders in three core aspects: (1) Discrete modeling of preference ratios: PreferGrow models relative preference ratios between two items, where a positive value indicates a more preferred one over another less preferred. This formulation aligns naturally with the discrete and ranking-oriented nature of recommendation tasks. (2) Perturbing via preference fading: Instead of injecting continuous noise, PreferGrow fades user preferences by replacing the preferred item with alternatives, physically akin to negative sampling, thereby eliminating the need for any prior noise assumption. (3) Preference reconstruction via growing: PreferGrow reconstructs user preferences by iteratively growing the preference signal from the estimated ratios. We further provide theoretical analysis showing that PreferGrow preserves key properties of discrete diffusion processes. PreferGrow provides a well-defined matrix-based formulation for discrete diffusion-based recommendation and empirically outperforms existing diffusion-based recommenders across five benchmark datasets, underscoring its superior effectiveness. Our codes are available at https://anonymous.4open.science/r/PreferGrow_Commit-2259/.

AAAI Conference 2025 Conference Paper

From Words to Worth: Newborn Article Impact Prediction with LLM

  • Penghai Zhao
  • Qinghua Xing
  • Kairan Dou
  • Jinyu Tian
  • Ying Tai
  • Jian Yang
  • Ming-Ming Cheng
  • Xiang Li

Predicting the future impact of newly published articles is pivotal for advancing scientific discovery in an era of unprecedented scholarly expansion. This paper introduces a promising approach, leveraging the capabilities of LLMs to predict the future impact of newborn articles solely based on titles and abstracts. Breaking away from traditional methods heavily reliant on external data, we propose fine-tuning the LLM to uncover the intrinsic semantic patterns shared by highly impactful articles from a vast collection of text-score pairs. These semantic features are further utilized to predict the proposed indicator, TNCSIsp, which incorporates favorable normalization properties across value, field, and time. To facilitate parameter-efficient fine-tuning of the LLM, we have also meticulously curated a dataset containing over 12,000 entries, each annotated with titles, abstracts, and their corresponding TNCSIsp values. Experimental results reveal an MAE of 0.216 and an NDCG@20 of 0.901, setting new benchmarks in predicting the impact of newborn articles. Finally, we present a real-world application example for predicting the impact of newborn journal articles to demonstrate its noteworthy practical value. Overall, our findings challenge existing paradigms and propose a shift towards a more content-focused prediction of academic impact, offering new insights for article impact prediction.

AAAI Conference 2025 Conference Paper

Hierarchically Controlled Deformable 3D Gaussians for Talking Head Synthesis

  • Zhenhua Wu
  • Linxuan Jiang
  • Xiang Li
  • Chaowei Fang
  • Yipeng Qin
  • Guanbin Li

Audio-driven talking head synthesis is a critical task in digital human modeling. While recent advances using diffusion models and Neural Radiance Fields (NeRF) have improved visual quality, they often require substantial computational resources, limiting practical deployment. We present a novel framework for audio-driven talking head synthesis, namely Hierarchically Controlled Deformable 3D Gaussians (HiCoDe), which achieves state-of-the-art performance with significantly reduced computational costs. Our key contribution is a hierarchical control strategy that effectively bridges the gap between sparse audio features and dense 3D Gaussian point clouds. Specifically, this strategy comprises two control levels: i) coarse-level control based on a 3D Morphable Model (3DMM) and ii) fine-level control using facial landmarks. Extensive experiments on the HDTF dataset and additional test sets demonstrate that our method outperforms existing approaches in visual quality, facial landmark accuracy, and audio-visual synchronization while being more computationally efficient in both training and inference.

ICRA Conference 2025 Conference Paper

In-Pipe Navigation Development Environment and a Smooth Path Planning Method on Pipeline Surface

  • Hao Liu
  • Xiang Li
  • Xiang Zhang
  • Gang Liu
  • Mingquan Lu

Autonomous in-pipe inspection robots can automatically navigate through complex pipeline networks and detect potential risks from corrosion and defects, demonstrating great potential for replacing costly manual inspections. However, to the best of our knowledge, there is no publicly available simulation environment in which researchers can validate their in-pipe navigation algorithms, and navigation algorithms on the constrained 3D pipe surface, the critical software component, remain underexplored. Firstly, this paper proposes an open-source In-Pipe Navigation Development Environment. It contains various pipeline models, a magnetic wheel climbing robot model realized by the adhesion plugin, and baseline algorithms for navigation tasks. Secondly, a novel and effective path planning method is introduced. Instead of planning based on surface structures, the proposed method plans along the pipeline axis and maps the result into a local path using the Frenet-Serret formula, thereby generating smooth, feasible, and efficient paths. Finally, we conduct both qualitative and quantitative experiments in the proposed simulation and real-world environments. The results show the usability of the development environment, as well as the robustness and efficiency of the proposed planning method.

NeurIPS Conference 2025 Conference Paper

Learning to Plan Like the Human Brain via Visuospatial Perception and Semantic-Episodic Synergistic Decision-Making

  • Tianyuan Jia
  • Ziyu Li
  • Qing Li
  • Xiuxing Li
  • Xiang Li
  • Chen Wei
  • Li Yao
  • Xia Wu

Motion planning in high-dimensional continuous spaces remains challenging due to complex environments and computational constraints. Although learning-based planners, especially graph neural network (GNN)-based, have significantly improved planning performance, they still struggle with inaccurate graph construction and limited structural reasoning, constraining search efficiency and path quality. The human brain exhibits efficient planning through a two-stage Perception-Decision model. First, egocentric spatial representations from visual and proprioceptive input are constructed, and then semantic–episodic synergy is leveraged to support decision-making under uncertainty. Inspired by this process, we propose NeuroMP, a brain-inspired planning framework that learns to plan like the human brain. NeuroMP integrates a Perceptive Segment Selector inspired by visuospatial perception to construct safer graphs, and a Global Alignment Heuristic that guides search in weakly connected graphs by modeling semantic-episodic synergistic decision-making. Experimental results demonstrate that NeuroMP significantly outperforms existing planning methods in efficiency and quality while maintaining a high success rate.

ICLR Conference 2025 Conference Paper

Let Your Features Tell The Differences: Understanding Graph Convolution By Feature Splitting

  • Yilun Zheng
  • Xiang Li
  • Sitao Luan
  • Xiaojiang Peng
  • Lihui Chen

Graph Neural Networks (GNNs) have demonstrated strong capabilities in processing structured data. While traditional GNNs typically treat each feature dimension as equally important during graph convolution, we raise an important question: Is the graph convolution operation equally beneficial for each feature? If not, the convolution operation on certain feature dimensions can possibly lead to harmful effects, even worse than convolution-free models. Therefore, it is required to distinguish convolution-favored and convolution-disfavored features. Traditional feature selection methods mainly focus on identifying informative features or reducing redundancy, but they are not suitable for structured data as they overlook graph structures. In the graph community, some studies have investigated the performance of GNNs with respect to node features using feature homophily metrics, which assess feature consistency across graph topology. Unfortunately, these metrics do not effectively align with GNN performance and cannot be reliably used for feature selection in GNNs. To address these limitations, we introduce a novel metric, Topological Feature Informativeness (TFI), to distinguish GNN-favored and GNN-disfavored features, where its effectiveness is validated through both theoretical analysis and empirical observations. Based on TFI, we propose a simple yet effective Graph Feature Selection (GFS) method, which processes GNN-favored and GNN-disfavored features with GNNs and non-GNN models separately. Compared to original GNNs, GFS significantly improves the extraction of useful topological information from each feature with comparable computational costs. Extensive experiments show that after applying GFS to 8 baseline and state-of-the-art (SOTA) GNN architectures across 10 datasets, 90% of the GFS-augmented cases show significant performance boosts. Furthermore, our proposed TFI metric outperforms other feature selection methods for GFS. These results verify the effectiveness of both GFS and TFI. Additionally, we demonstrate that GFS's improvements are robust to hyperparameter tuning, highlighting its potential as a universally valid method for enhancing various GNN architectures. To facilitate reproducibility and further research, we have made our code publicly available at https://github.com/KTTRCDL/graph-feature-selection.
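
A rough sketch of the feature-splitting workflow, using a simple neighbor-correlation score as a crude stand-in for TFI (the paper's metric is defined differently and is label-aware): features scored as convolution-favored would be fed to a GNN, the remainder to a non-GNN model. All names and the toy ring graph are illustrative.

```python
import numpy as np

def neighbor_mean(X, edges, n):
    """Mean of each node's neighbors, per feature (dense toy version)."""
    agg, deg = np.zeros_like(X), np.zeros(n)
    for u, v in edges:
        agg[u] += X[v]; agg[v] += X[u]
        deg[u] += 1; deg[v] += 1
    return agg / np.maximum(deg, 1)[:, None]

def split_features(X, edges, top_frac=0.5):
    """Rank features by how well a node's value correlates with its neighbors'
    mean, then split them into convolution-favored / disfavored sets."""
    n, d = X.shape
    M = neighbor_mean(X, edges, n)
    scores = np.array([abs(np.corrcoef(X[:, j], M[:, j])[0, 1]) for j in range(d)])
    order = np.argsort(-scores)
    k = int(top_frac * d)
    return order[:k], order[k:]        # first set -> GNN branch, second -> MLP branch

X = np.random.randn(100, 8)
edges = [(i, (i + 1) % 100) for i in range(100)]      # toy ring graph
gnn_feats, mlp_feats = split_features(X, edges)
print(gnn_feats, mlp_feats)
```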

AAAI Conference 2025 Conference Paper

Leveraging Large Language Models for Node Generation in Few-Shot Learning on Text-Attributed Graphs

  • Jianxiang Yu
  • Yuxiang Ren
  • Chenghua Gong
  • Jiaqi Tan
  • Xiang Li
  • Xuecang Zhang

Text-attributed graphs have recently garnered significant attention due to their wide range of applications in web domains. Existing methodologies employ word embedding models for acquiring text representations as node features, which are subsequently fed into Graph Neural Networks (GNNs) for training. Recently, the advent of Large Language Models (LLMs) has introduced their powerful capabilities in information retrieval and text generation, which can greatly enhance the text attributes of graph data. Furthermore, the acquisition and labeling of extensive datasets are both costly and time-consuming endeavors. Consequently, few-shot learning has emerged as a crucial problem in the context of graph learning tasks. In order to tackle this challenge, we propose a lightweight paradigm called LLM4NG, which adopts a plug-and-play approach to establish supervision signals by leveraging LLMs for node generation. Specifically, we utilize LLMs to extract semantic information from the labels and generate samples that belong to these categories as exemplars. Subsequently, we employ an edge predictor to capture the structural information inherent in the raw dataset and integrate the newly generated samples into the original graph. This approach harnesses LLMs for enhancing class-level information and seamlessly introduces labeled nodes and edges without modifying the raw dataset, thereby facilitating the node classification task in few-shot scenarios. Extensive experiments demonstrate the outstanding performance of our proposed paradigm, particularly in low-shot scenarios. For instance, in the 1-shot setting of the ogbn-arxiv dataset, LLM4NG achieves a 76% improvement over the baseline model.

IJCAI Conference 2025 Conference Paper

MaskDGNN: Self-Supervised Dynamic Graph Neural Networks with Activeness-aware Temporal Masking

  • Yiming He
  • Xiang Li
  • Zhongying Zhao
  • Haobing Liu
  • Peilan He
  • Yanwei Yu

Integrating dynamics into graph neural networks (GNNs) provides deeper insights into the evolution of dynamic graphs, thereby enhancing the temporal representation in real-world dynamic network problems. Existing methods extracting critical information from dynamic graphs face two key challenges, either overlooking the negative impact of redundant information or struggling to address the distribution shifting issue in dynamic graphs. To address these challenges, we propose MaskDGNN, a novel dynamic GNN architecture that consists of two modules: First, a self-supervised activeness-aware temporal masking mechanism selectively retains edges between highly active nodes while masking those with low activeness, effectively reducing redundancy. Second, an adaptive frequency-enhancing graph representation learner amplifies the frequency-domain features of nodes to capture intrinsic features under distribution shifting. Experiments on five real-world dynamic graph datasets demonstrate that MaskDGNN outperforms state-of-the-art methods, achieving an average improvement of 7.07% in accuracy and 13.87% in MRR for link prediction tasks.

JBHI Journal 2025 Journal Article

MediViSTA: Medical Video Segmentation Via Temporal Fusion SAM Adaptation for Echocardiography

  • Sekeun Kim
  • Pengfei Jin
  • Cheng Chen
  • Kyungsang Kim
  • Zhiliang Lyu
  • Hui Ren
  • Sunghwan Kim
  • Zhengliang Liu

Despite achieving impressive results in general-purpose semantic segmentation with strong generalization on natural images, the Segment Anything Model (SAM) has shown less precision and stability in medical image segmentation. In particular, the original SAM architecture is designed for 2D natural images and therefore does not support three-dimensional information, which is particularly important for medical imaging modalities that are often volumetric or video data. In this paper, we introduce MediViSTA, a parameter-efficient fine-tuning method designed to adapt the vision foundation model for medical video, with a specific focus on echocardiography segmentation. To achieve spatial adaptation, we propose a frequency feature fusion technique that injects spatial frequency information from a CNN branch. For temporal adaptation, we integrate temporal adapters within the transformer blocks of the image encoder. Using a fine-tuning strategy, only a small subset of pre-trained parameters is updated, allowing efficient adaptation to echocardiography data. The effectiveness of our method has been comprehensively evaluated on three datasets, comprising two public datasets and one multi-center in-house dataset. Our method consistently outperforms various state-of-the-art approaches without using any prompts. Furthermore, our model exhibits strong generalization capabilities on unseen datasets, surpassing the second-best approach by 2.15% in Dice and 0.09 in temporal consistency. The results demonstrate the potential of MediViSTA to significantly advance echocardiography video segmentation, offering improved accuracy and robustness in cardiac assessment applications.

NeurIPS Conference 2025 Conference Paper

Mitigating the Privacy–Utility Trade-off in Decentralized Federated Learning via f-Differential Privacy

  • Xiang Li
  • Chendi Wang
  • Buxin Su
  • Qi Long
  • Weijie Su

Differentially private (DP) decentralized Federated Learning (FL) allows local users to collaborate without sharing their data with a central server. However, accurately quantifying the privacy budget of private FL algorithms is challenging due to the co-existence of complex algorithmic components such as decentralized communication and local updates. This paper addresses privacy accounting for two decentralized FL algorithms within the $f$-differential privacy ($f$-DP) framework. We develop two new $f$-DP–based accounting methods tailored to decentralized settings: Pairwise Network $f$-DP (PN-$f$-DP), which quantifies privacy leakage between user pairs under random-walk communication, and Secret-based $f$-Local DP (Sec-$f$-LDP), which supports structured noise injection via shared secrets. By combining tools from $f$-DP theory and Markov chain concentration, our accounting framework captures privacy amplification arising from sparse communication, local iterations, and correlated noise. Experiments on synthetic and real datasets demonstrate that our methods yield consistently tighter $(\epsilon, \delta)$ bounds and improved utility compared to Rényi DP–based approaches, illustrating the benefits of $f$-DP in decentralized privacy accounting.

AAAI Conference 2025 Conference Paper

Multi-clue Consistency Learning to Bridge Gaps Between General and Oriented Object in Semi-supervised Detection

  • Chenxu Wang
  • Chunyan Xu
  • Xiang Li
  • Yuxuan Li
  • Xu Guo
  • Ziqi Gu
  • Zhen Cui

While existing semi-supervised object detection (SSOD) methods perform well in general scenes, they encounter challenges in handling oriented objects in aerial images. We experimentally find three gaps between general and oriented object detection in semi-supervised learning: 1) Sampling inconsistency: the common center sampling is not suitable for oriented objects with larger aspect ratios when selecting positive labels from labeled data. 2) Assignment inconsistency: balancing the precision and localization quality of oriented pseudo-boxes poses greater challenges which introduces more noise when selecting positive labels from unlabeled data. 3) Confidence inconsistency: there exists more mismatch between the predicted classification and localization qualities when considering oriented objects, affecting the selection of pseudo-labels. Therefore, we propose a Multi-clue Consistency Learning (MCL) framework to bridge gaps between general and oriented objects in semi-supervised detection. Specifically, considering various shapes of rotated objects, the Gaussian Center Assignment is specially designed to select the pixel-level positive labels from labeled data. We then introduce the Scale-aware Label Assignment to select pixel-level pseudo-labels instead of unreliable pseudo-boxes, which is a divide-and-rule strategy suited for objects with various scales. The Consistent Confidence Soft Label is adopted to further boost the detector by maintaining the alignment of the predicted results. Comprehensive experiments on DOTA-v1.5 and DOTA-v1.0 benchmarks demonstrate that our proposed MCL can achieve state-of-the-art performance in the semi-supervised oriented object detection task.

JBHI Journal 2025 Journal Article

Multi-Scale Dynamic Sparse Attention UNet for Medical Image Segmentation

  • Xiang Li
  • Chong Fu
  • Qun Wang
  • Wenchao Zhang
  • Chen Ye
  • Junxin Chen
  • Chiu-Wing Sham

Transformers have recently gained significant attention in medical image segmentation due to their ability to capture long-range dependencies. However, the presence of excessive background noise in large regions of medical images introduces distractions and increases the computational burden on the fine-grained self-attention (SA) mechanism, which is a key component of the transformer model. Meanwhile, preserving fine-grained details is essential for accurately segmenting complex, blurred medical images with diverse shapes and sizes. Thus, we propose a novel Multi-scale Dynamic Sparse Attention (MDSA) module, which flexibly reduces computational costs while maintaining multi-scale fine-grained interactions with content awareness. Specifically, multi-scale aggregation is first applied to the feature maps to enrich the diversity of interaction information. Then, for each query, irrelevant key-value pairs are filtered out at a coarse-grained level. Finally, fine-grained SA is performed on the remaining key-value pairs. In addition, we design an enhanced downsampling merging (EDM) module and an enhanced upsampling fusion (EUF) module for building pyramid architectures. Using MDSA to construct the basic blocks, combined with EDMs and EUFs, we develop a UNet-like model named MDSA-UNet. Since MDSA-UNet dynamically processes only a small subset of relevant fine-grained features, it achieves strong segmentation performance with high computational efficiency. Extensive experiments on four datasets spanning three different types demonstrate that our MDSA-UNet, without using pre-training, significantly outperforms other non-pretrained methods and even competes with pre-trained models, achieving Dice scores of 82.10% on DDTI, 80.20% on TN3K, 90.75% on ISIC2018, and 91.05% on ACDC. Meanwhile, our model maintains lower complexity, with only 6.65 M parameters and 4.54 G FLOPs at a resolution of 224 × 224, ensuring both effectiveness and efficiency. Code is available at URL.
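
As a rough, hypothetical illustration of the coarse-to-fine filtering idea (score each query against region-level summaries of the keys, keep only the top-scoring regions, then run ordinary attention over the surviving keys), the PyTorch sketch below is one possible realization; the function name, region layout, and hyperparameters are assumptions, not the authors' code.

```python
# Sketch of coarse-to-fine sparse attention: filter key-value pairs per query at a
# coarse (region) level, then apply ordinary softmax attention to what remains.
import torch
import torch.nn.functional as F

def coarse_to_fine_attention(q, k, v, region_size=14, top_regions=4):
    """q, k, v: (batch, n_tokens, dim); n_tokens must be divisible by region_size."""
    b, n, d = k.shape
    r = n // region_size
    k_regions = k.reshape(b, r, region_size, d).mean(dim=2)       # (b, r, d) coarse keys
    coarse = q @ k_regions.transpose(1, 2) / d ** 0.5             # (b, n_q, r) region scores
    keep = coarse.topk(top_regions, dim=-1).indices               # top regions per query
    region_mask = torch.zeros_like(coarse).scatter_(-1, keep, 1.0).bool()
    token_mask = region_mask.repeat_interleave(region_size, dim=-1)  # (b, n_q, n)
    attn = q @ k.transpose(1, 2) / d ** 0.5
    attn = attn.masked_fill(~token_mask, float("-inf"))           # drop filtered key-value pairs
    return F.softmax(attn, dim=-1) @ v

q = k = v = torch.randn(1, 196, 64)   # e.g. a 14x14 feature map split into 14 regions
print(coarse_to_fine_attention(q, k, v).shape)  # torch.Size([1, 196, 64])
```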

IJCAI Conference 2025 Conference Paper

Not All Layers of LLMs Are Necessary During Inference

  • Siqi Fan
  • Xin Jiang
  • Xiang Li
  • Xuying Meng
  • Peng Han
  • Shuo Shang
  • Aixin Sun
  • Yequan Wang

Due to the large number of parameters, the inference phase of Large Language Models (LLMs) is resource-intensive. However, not all requests posed to LLMs are equally difficult to handle. Through analysis, we show that for some tasks, LLMs can achieve results comparable to the final output at some intermediate layers. That is, not all layers of LLMs are necessary during inference. If we can predict at which layer the inferred results match the final results (produced by evaluating all layers), we could significantly reduce the inference cost. To this end, we propose a simple yet effective algorithm named AdaInfer to adaptively terminate the inference process for an input instance. AdaInfer relies on easily obtainable statistical features and classic classifiers like SVM. Experiments on well-known LLMs like the Llama2 series and OPT show that AdaInfer can achieve an average pruning ratio of 17.8%, and up to 43% on sentiment tasks, with nearly no performance drop (<1%). Because AdaInfer does not alter LLM parameters, the LLMs incorporated with AdaInfer maintain generalizability across tasks.
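
The early-exit recipe sketched in the abstract (cheap per-layer statistics fed to a classic classifier that decides whether to stop) can be illustrated with the hypothetical snippet below; the feature choices, classifier, and training data here are placeholders, not the AdaInfer implementation.

```python
# Illustrative sketch: per-layer statistics over the vocabulary distribution decide,
# via an SVM, whether to terminate inference at that layer.
import numpy as np
from sklearn.svm import SVC

def layer_features(logits):
    """Cheap statistics of one layer's next-token distribution."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    top2 = np.sort(p)[-2:]
    return np.array([top2[-1],                          # top-1 probability
                     top2[-1] - top2[-2],               # gap to the runner-up
                     -(p * np.log(p + 1e-12)).sum()])   # entropy

# Train the stop/continue classifier on held-out (features, "matches final output") pairs.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 1] > 0).astype(int)               # placeholder labels for the sketch
stopper = SVC(kernel="rbf").fit(X_train, y_train)

def adaptive_inference(per_layer_logits):
    """Stop at the first layer the SVM classifies as safe to exit."""
    for i, logits in enumerate(per_layer_logits):
        if stopper.predict(layer_features(logits)[None])[0] == 1:
            return i, int(np.argmax(logits))
    return len(per_layer_logits) - 1, int(np.argmax(per_layer_logits[-1]))

layers = [rng.normal(size=32000) for _ in range(32)]    # fake per-layer logits
print(adaptive_inference(layers))                       # (exit layer, predicted token id)
```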

NeurIPS Conference 2025 Conference Paper

On the Empirical Power of Goodness-of-Fit Tests in Watermark Detection

  • Weiqing He
  • Xiang Li
  • Tianqi Shang
  • Li Shen
  • Weijie Su
  • Qi Long

Large language models (LLMs) raise concerns about content authenticity and integrity because they can generate human-like text at scale. Text watermarks, which embed detectable statistical signals into generated text, offer a provable way to verify content origin. Many detection methods rely on pivotal statistics that are i.i.d. under human-written text, making goodness-of-fit (GoF) tests a natural tool for watermark detection. However, GoF tests remain largely underexplored in this setting. In this paper, we systematically evaluate eight GoF tests across three popular watermarking schemes, using three open-source LLMs, two datasets, various generation temperatures, and multiple post-editing methods. We find that general GoF tests can improve both the detection power and robustness of watermark detectors. Notably, we observe that text repetition, common in low-temperature settings, gives GoF tests a unique advantage not exploited by existing methods. Our results highlight that classic GoF tests are a simple yet powerful and underused tool for watermark detection in LLMs.
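
For a concrete feel for the setup, the sketch below treats per-token pivotal statistics as values that are Uniform(0,1) under the human-written null and applies a Kolmogorov-Smirnov goodness-of-fit test; the particular statistics, schemes, and tests studied in the paper differ, so this is only an illustrative example with simulated data.

```python
# Illustrative sketch: a goodness-of-fit test on pivotal statistics that are
# Uniform(0,1) under the "human-written" null hypothesis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
human_stats = rng.uniform(size=200)                                    # null: uniform
watermarked = np.clip(rng.beta(2.0, 1.0, size=200), 1e-9, 1 - 1e-9)    # shifted toward 1

for name, x in [("human", human_stats), ("watermarked", watermarked)]:
    ks = stats.kstest(x, "uniform")                     # Kolmogorov-Smirnov GoF test
    print(f"{name:12s} KS statistic={ks.statistic:.3f}  p-value={ks.pvalue:.3g}")
# A small p-value rejects the uniform null, i.e. flags the text as likely watermarked.
```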

ICML Conference 2025 Conference Paper

Preference Adaptive and Sequential Text-to-Image Generation

  • Ofir Nabati
  • Guy Tennenholtz
  • Chih-Wei Hsu
  • Moonkyung Ryu
  • Deepak Ramachandran
  • Yinlam Chow
  • Xiang Li
  • Craig Boutilier

We address the problem of interactive text-to-image (T2I) generation, designing a reinforcement learning (RL) agent which iteratively improves a set of generated images for a user through a sequence of prompt expansions. Using human raters, we create a novel dataset of sequential preferences, which we leverage, together with large-scale open-source (non-sequential) datasets. We construct user-preference and user-choice models using an EM strategy and identify varying user preference types. We then leverage a large multimodal language model (LMM) and a value-based RL approach to suggest an adaptive and diverse slate of prompt expansions to the user. Our Preference Adaptive and Sequential Text-to-image Agent (PASTA) extends T2I models with adaptive multi-turn capabilities, fostering collaborative co-creation and addressing uncertainty or underspecification in a user’s intent. We evaluate PASTA using human raters, showing significant improvement compared to baseline methods. We also open-source our sequential rater dataset and simulated user-rater interactions to support future research in user-centric multi-turn T2I systems.

NeurIPS Conference 2025 Conference Paper

REOBench: Benchmarking Robustness of Earth Observation Foundation Models

  • Xiang Li
  • Yong Tao
  • Siyuan Zhang
  • Siwei Liu
  • Zhitong Xiong
  • Chunbo Luo
  • Lu Liu
  • Mykola Pechenizkiy

Earth observation foundation models have shown strong generalization across multiple Earth observation tasks, but their robustness under real-world perturbations remains underexplored. To bridge this gap, we introduce REOBench, the first comprehensive benchmark for evaluating the robustness of Earth observation foundation models across six tasks and twelve types of image corruptions, including both appearance-based and geometric perturbations. To ensure realistic and fine-grained evaluation, our benchmark focuses on high-resolution optical remote sensing images, which are widely used in critical applications such as urban planning and disaster response. We conduct a systematic evaluation of a broad range of models trained using masked image modeling, contrastive learning, and vision-language pre-training paradigms. Our results reveal that (1) existing Earth observation foundation models experience significant performance degradation when exposed to input corruptions. (2) The severity of degradation varies across tasks, model architectures, backbone sizes, and types of corruption, with performance drop varying from less than 1% to over 25%. (3) Vision-language models show enhanced robustness, particularly in multimodal tasks. REOBench underscores the vulnerability of current Earth observation foundation models to real-world corruptions and provides actionable insights for developing more robust and reliable models.

NeurIPS Conference 2025 Conference Paper

Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think

  • Ge Wu
  • Shen Zhang
  • Ruijing Shi
  • Shanghua Gao
  • Zhenyuan Chen
  • Lei Wang
  • Zhaowei Chen
  • Hongcheng Gao

REPA and its variants effectively mitigate training challenges in diffusion models by incorporating external visual representations from pretrained models, through alignment between the noisy hidden projections of denoising networks and foundational clean image representations. We argue that the external alignment, which is absent during the entire denoising inference process, falls short of fully harnessing the potential of discriminative representations. In this work, we propose a straightforward method called $\textit{$\textbf{R}$epresentation $\textbf{E}$ntanglement for $\textbf{G}$eneration}$ ($\textbf{REG}$), which entangles low-level image latents with a single high-level class token from pretrained foundation models for denoising. REG acquires the capability to produce coherent image-class pairs directly from pure noise, substantially improving both generation quality and training efficiency. This is accomplished with negligible additional inference overhead, requiring only one single additional token for denoising (<0.5\% increase in FLOPs and latency). The inference process concurrently reconstructs both image latents and their corresponding global semantics, where the acquired semantic knowledge actively guides and enhances the image generation process. On ImageNet 256$\times$256, SiT-XL/2 + REG demonstrates remarkable convergence acceleration, achieving $\textbf{63}\times$ and $\textbf{23}\times$ faster training than SiT-XL/2 and SiT-XL/2 + REPA, respectively. More impressively, SiT-L/2 + REG trained for merely 400K iterations outperforms SiT-XL/2 + REPA trained for 4M iterations ($\textbf{10}\times$ longer). Code is available at: https://github.com/Martinser/REG.

NeurIPS Conference 2025 Conference Paper

Revisiting End-to-End Learning with Slide-level Supervision in Computational Pathology

  • Wenhao Tang
  • Rong Qin
  • Heng Fang
  • Fengtao Zhou
  • Hao Chen
  • Xiang Li
  • Ming-Ming Cheng

Pre-trained encoders for offline feature extraction followed by multiple instance learning (MIL) aggregators have become the dominant paradigm in computational pathology (CPath), benefiting cancer diagnosis and prognosis. However, performance limitations arise from the absence of encoder fine-tuning for downstream tasks and disjoint optimization with MIL. While slide-level supervised end-to-end (E2E) learning is an intuitive solution to this issue, it faces challenges such as high computational demands and suboptimal results. These limitations motivate us to revisit E2E learning. We argue that prior work neglects inherent E2E optimization challenges, leading to performance disparities compared to traditional two-stage methods. In this paper, we pioneer the elucidation of the optimization challenge caused by sparse-attention MIL and propose a novel MIL method called ABMILX. ABMILX mitigates this problem through global correlation-based attention refinement and multi-head mechanisms. With the efficient multi-scale random patch sampling strategy, an E2E trained ResNet with ABMILX surpasses SOTA foundation models under the two-stage paradigm across multiple challenging benchmarks, while remaining computationally efficient ($<$ 10 RTX3090 GPU hours). We demonstrate the potential of E2E learning in CPath and call for greater research focus in this area. The code is available at https://github.com/DearCaat/E2E-WSI-ABMILX.

NeurIPS Conference 2025 Conference Paper

See through the Dark: Learning Illumination-affined Representations for Nighttime Occupancy Prediction

  • Yuan Wu
  • Zhiqiang Yan
  • Yigong Zhang
  • Xiang Li
  • Jian Yang

Occupancy prediction aims to estimate the 3D spatial distribution of occupied regions along with their corresponding semantic labels. Existing vision-based methods perform well on daytime benchmarks but struggle in nighttime scenarios due to limited visibility and challenging lighting conditions. To address these challenges, we propose LIAR, a novel framework that learns illumination-affined representations. LIAR first introduces Selective Low-light Image Enhancement (SLLIE), which leverages the illumination priors from daytime scenes to adaptively determine whether a nighttime image is genuinely dark or sufficiently well-lit, enabling more targeted global enhancement. Building on the illumination maps generated by SLLIE, LIAR further incorporates two illumination-aware components: 2D Illumination-guided Sampling (2D-IGS) and 3D Illumination-driven Projection (3D-IDP), to respectively tackle local underexposure and overexposure. Specifically, 2D-IGS modulates feature sampling positions according to illumination maps, assigning larger offsets to darker regions and smaller ones to brighter regions, thereby alleviating feature degradation in underexposed areas. Subsequently, 3D-IDP enhances semantic understanding in overexposed regions by constructing illumination intensity fields and supplying refined residual queries to the BEV context refinement process. Extensive experiments on both real and synthetic datasets demonstrate the superior performance of LIAR under challenging nighttime scenarios. The source code and pretrained models are available here.

NeurIPS Conference 2025 Conference Paper

Statistical Inference under Performativity

  • Xiang Li
  • Yunai Li
  • Huiying Zhong
  • Lihua Lei
  • Zhun Deng

Performativity of predictions refers to the phenomenon where prediction-informed decisions influence the very targets they aim to predict—a dynamic commonly observed in policy-making, social sciences, and economics. In this paper, we initiate an end-to-end framework of statistical inference under performativity. Our contributions are twofold. First, we establish a central limit theorem for estimation and inference in the performative setting, enabling standard inferential tasks such as constructing confidence intervals and conducting hypothesis tests in policy-making contexts. Second, we leverage this central limit theorem to study prediction-powered inference (PPI) under performativity. This approach yields more precise estimates and tighter confidence regions for the model parameters (i.e., policies) of interest in performative prediction. We validate the effectiveness of our framework through numerical experiments. To the best of our knowledge, this is the first work to establish a complete statistical inference under performativity, introducing new challenges and inference settings that we believe will provide substantial value to policy-making, statistics, and machine learning.

NeurIPS Conference 2025 Conference Paper

Towards Understanding the Mechanisms of Classifier-Free Guidance

  • Xiang Li
  • Rongrong Wang
  • Qing Qu

Classifier-free guidance (CFG) is a core technique powering state-of-the-art image generation systems, yet its underlying mechanisms remain poorly understood. In this work, we first analyze CFG in a simplified linear diffusion model, where we show its behavior closely resembles that observed in the nonlinear case. Our analysis reveals that linear CFG improves generation quality via three distinct components: (i) a mean-shift term that approximately steers samples in the direction of class means, (ii) a positive Contrastive Principal Components (CPC) term that amplifies class-specific features, and (iii) a negative CPC term that suppresses generic features prevalent in unconditional data. We then verify these insights in real-world, nonlinear diffusion models: over a broad range of noise levels, linear CFG resembles the behavior of its nonlinear counterpart. Although the two eventually diverge at low noise levels, we discuss how the insights from the linear analysis still shed light on the CFG's mechanism within the nonlinear regime.
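
For context, the classifier-free guidance update the analysis refers to combines conditional and unconditional score estimates; one common convention (some papers write $1+w$ in place of $w$) is:

```latex
% Classifier-free guidance with guidance weight w:
\[
\hat{\epsilon}_{w}(x_t, c)
  \;=\; \epsilon_{\theta}(x_t, \varnothing)
  \;+\; w\,\bigl[\epsilon_{\theta}(x_t, c) - \epsilon_{\theta}(x_t, \varnothing)\bigr].
\]
% The guided prediction adds w times the difference between the conditional and
% unconditional estimates; the paper decomposes this difference (in a linear model)
% into a mean-shift term plus positive and negative CPC terms.
```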

AAAI Conference 2025 Conference Paper

TreeEval: Benchmark-Free Evaluation of Large Language Models through Tree Planning

  • Xiang Li
  • Yunshi Lan
  • Chao Yang

Recently, numerous new benchmarks have been established to evaluate the performance of large language models (LLMs) via either computing a holistic score or employing another LLM as a judge. However, these approaches suffer from data leakage due to the open access of the benchmark and inflexible evaluation process. To address this issue, we introduce TreeEval, a benchmark-free evaluation method for LLMs that lets a high-performance LLM host an irreproducible evaluation session and essentially avoids data leakage. Moreover, this LLM acts as an examiner, raising a series of questions under a topic with a tree-planning strategy that considers the current evaluation status to decide the next question and ensures the completeness and efficiency of the evaluation process. We evaluate 6 models of different parameter sizes, including 7B, 13B, and 34B, and ultimately achieve the highest correlation coefficient with AlpacaEval2.0 using only around 45 questions. We also conduct more analysis to show the robustness and reliability of TreeEval.

NeurIPS Conference 2025 Conference Paper

Understanding Representation Dynamics of Diffusion Models via Low-Dimensional Modeling

  • Xiao Li
  • Zekai Zhang
  • Xiang Li
  • Siyi Chen
  • Zhihui Zhu
  • Peng Wang
  • Qing Qu

Diffusion models, though originally designed for generative tasks, have demonstrated impressive self-supervised representation learning capabilities. A particularly intriguing phenomenon in these models is the emergence of unimodal representation dynamics, where the quality of learned features peaks at an intermediate noise level. In this work, we conduct a comprehensive theoretical and empirical investigation of this phenomenon. Leveraging the inherent low-dimensionality structure of image data, we theoretically demonstrate that the unimodal dynamic emerges when the diffusion model successfully captures the underlying data distribution. The unimodality arises from an interplay between denoising strength and class confidence across noise scales. Empirically, we further show that, in classification tasks, the presence of unimodal dynamics reliably reflects the diffusion model’s generalization: it emerges when the model generates novel images and gradually transitions to a monotonically decreasing curve as the model begins to memorize the training data.

JBHI Journal 2025 Journal Article

Variability of Spatiotemporal-Rhythmic Network During Inhibitory Control in Repetitive Subconcussion

  • Xiang Li
  • Zhenghao Fu
  • Hui Zhou
  • Yin Xiang
  • Yaqian Li
  • Yida He
  • Jiaqi Zhang
  • Huanhuan Li

Inhibitory control dysfunction is a frequent cognitive symptom of repetitive subconcussion (SC). Implementing inhibitory control is temporally resolved and is likely related to the dynamic interactions in functional brain networks. However, investigations of the dynamic activity of these brain networks using electroencephalography (EEG) are often limited to specific frequency bands without fully utilizing the spatiotemporal rhythmic information. Therefore, we propose an innovative framework for constructing a large-scale spatiotemporal-rhythmic network (STRN) using dynamic cross-frequency phase synchronization to track cognitive deficits induced by repetitive subconcussion during inhibitory control. Seventeen parachutists with repeated subconcussive exposure and 17 healthy controls (HC) performed a Stroop task while continuous scalp EEG data were recorded. Our results indicated an STRN-specific activation pattern that achieved a high classification performance with an average accuracy of 90.98%, which may serve as a biomarker for identifying repetitive subconcussion inhibitory control dysfunction. In this STRN state, the SC exhibited mostly lower network rhythmic information interactions than the HC. These findings suggest that the STRN presented in this study could be an effective analytical method for understanding the cognitive dysfunction observed in repetitive subconcussion and other related conditions.

JBHI Journal 2025 Journal Article

Voxel-Level Brain States Prediction Using Swin Transformer

  • Yifei Sun
  • Daniel Chahine
  • Qinghao Wen
  • Tianming Liu
  • Xiang Li
  • Yixuan Yuan
  • Fernando Calamante
  • Jinglei Lv

Understanding brain dynamics is important for neuroscience and mental health. Functional magnetic resonance imaging (fMRI) enables the measurement of neural activities through blood-oxygen-level-dependent (BOLD) signals, which represent brain states. In this study, we aim to predict future human resting brain states with fMRI. Due to the 3D voxel-wise spatial organization and temporal dependencies of the fMRI data, we propose a novel architecture which employs a 4D Shifted Window (Swin) Transformer as encoder to efficiently learn spatio-temporal information and a convolutional decoder to enable brain state prediction at the same spatial and temporal resolution as the input fMRI data. We used 100 unrelated subjects from the Human Connectome Project (HCP) for model training and testing. Our model has shown high accuracy when predicting 7.2 s of resting-state brain activity based on the prior 23.04 s of fMRI time series. The predicted brain states highly resemble BOLD contrast and dynamics. This work shows promising evidence that the spatiotemporal organization of the human brain can be learned by a Swin Transformer model, at high resolution, which offers potential for reducing fMRI scan time and for developing brain-computer interfaces in the future.

NeurIPS Conference 2025 Conference Paper

Who You Are Matters: Bridging Interests and Social Roles via LLM-Enhanced Logic Recommendation

  • Qing Yu
  • Xiaobei Wang
  • Shuchang Liu
  • Xiaoyu Yang
  • Xueliang Wang
  • Chang Meng
  • Shanshan Wu
  • Bin Wen

Recommender systems filter contents/items valuable to users by inferring preferences from user features and historical behaviors. Mainstream approaches follow the learning-to-rank paradigm, which focuses on discovering and modeling item topics (e.g., categories), and capturing user preferences on these topics based on historical interactions. However, this paradigm often neglects the modeling of user characteristics and their social roles, which are logical confounders influencing the correlated interest and user preference transition. To bridge this gap, we introduce the user role identification task and the behavioral logic modeling task that aim to explicitly model user roles and learn the logical relations between item topics and user social roles. We show that it is possible to explicitly solve these tasks through an efficient integration framework of Large Language Model (LLM) and recommendation systems, for which we propose TagCF. On the one hand, TagCF exploits the (Multi-modal) LLM's world knowledge and logic inference ability to extract realistic tag-based virtual logic graphs that reveal dynamic and expressive knowledge of users, refining our understanding of user behaviors. On the other hand, TagCF presents empirically effective integration modules that take advantage of the extracted tag-logic information, augmenting the recommendation performance. We conduct both online experiments and offline experiments with industrial and public datasets as verification of TagCF's effectiveness, and we empirically show that the user role modeling strategy is potentially a better choice than the modeling of item topics. Additionally, we provide evidence that the extracted logic graphs are empirically a general and transferable knowledge that can benefit a wide range of recommendation tasks. Our code is available at https://github.com/Code2Q/TagCF.

NeurIPS Conference 2024 Conference Paper

3DCoMPaT200: Language Grounded Large-Scale 3D Vision Dataset for Compositional Recognition

  • Mahmoud Ahmed
  • Xiang Li
  • Arpit Prajapati
  • Mohamed Elhoseiny

Understanding objects in 3D at the part level is essential for humans and robots to navigate and interact with the environment. Current datasets for part-level 3D object understanding encompass a limited range of categories. For instance, the ShapeNet-Part and PartNet datasets only include 16 and 24 object categories, respectively. The 3DCoMPaT dataset, specifically designed for compositional understanding of parts and materials, contains only 42 object categories. To foster richer and fine-grained part-level 3D understanding, we introduce 3DCoMPaT200, a large-scale dataset tailored for compositional understanding of object parts and materials, with 200 object categories, an object vocabulary approximately 5 times larger than 3DCoMPaT and almost 4 times more part categories. Concretely, 3DCoMPaT200 significantly expands upon 3DCoMPaT, featuring 1,031 fine-grained part categories and 293 distinct material classes for compositional application to 3D object parts. Additionally, to address the complexities of compositional 3D modeling, we propose a novel task of Compositional Part Shape Retrieval using ULIP to provide a strong 3D foundational model for 3D Compositional Understanding. This method evaluates the model shape retrieval performance given one, three, or six parts described in text format. These results show that the model's performance improves with an increasing number of style compositions, highlighting the critical role of the compositional dataset. Such results underscore the dataset's effectiveness in enhancing models' capability to understand complex 3D shapes from a compositional perspective. Code and data can be found at: https://github.com/3DCoMPaT200/3DCoMPaT200/

JMLR Journal 2024 Journal Article

A Random Projection Approach to Personalized Federated Learning: Enhancing Communication Efficiency, Robustness, and Fairness

  • Yuze Han
  • Xiang Li
  • Shiyun Lin
  • Zhihua Zhang

Personalized Federated Learning (FL) faces many challenges such as expensive communication costs, training-time adversarial attacks, and performance unfairness across devices. Recent developments witness a trade-off between a reference model and local models to achieve personalization. Following the avenue, we propose a personalized FL method toward the three goals. When it is time to communicate, our method projects local models into a shared-and-fixed low-dimensional random subspace and uses infimal convolution to control the deviation between the reference model and projected local models. We theoretically show our method converges for both strongly convex and non-convex but smooth objectives with square regularizers and the convergence dependence on the projection dimension is mild. We also illustrate the benefits of robustness and fairness on a class of linear problems. Finally, we conduct a large number of experiments to show the empirical superiority of our method over several state-of-the-art methods on the three aspects.
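
A minimal NumPy sketch of the shared-and-fixed random subspace idea described above (every client derives the same projection from a shared seed and communicates only the low-dimensional projection of its local model); the dimensions, seed, and function names are illustrative placeholders, not the paper's algorithm.

```python
# Sketch: a shared random projection lets clients send PROJ_DIM numbers per round
# instead of the full DIM-dimensional model.
import numpy as np

DIM, PROJ_DIM, SHARED_SEED = 10_000, 256, 42

def shared_projection():
    rng = np.random.default_rng(SHARED_SEED)             # identical on every client
    return rng.normal(size=(PROJ_DIM, DIM)) / np.sqrt(PROJ_DIM)

P = shared_projection()

def communicate(local_model: np.ndarray) -> np.ndarray:
    return P @ local_model                                # low-dimensional message

clients = [np.random.default_rng(i).normal(size=DIM) for i in range(5)]
reference = np.mean([communicate(w) for w in clients], axis=0)  # server-side aggregate
print(reference.shape)   # (256,) -- roughly 39x less communication than full models
```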

NeurIPS Conference 2024 Conference Paper

Achieving Near-Optimal Convergence for Distributed Minimax Optimization with Adaptive Stepsizes

  • Yan Huang
  • Xiang Li
  • Yipeng Shen
  • Niao He
  • Jinming Xu

In this paper, we show that applying adaptive methods directly to distributed minimax problems can result in non-convergence due to inconsistency in locally computed adaptive stepsizes. To address this challenge, we propose D-AdaST, a Distributed Adaptive minimax method with Stepsize Tracking. The key strategy is to employ an adaptive stepsize tracking protocol involving the transmission of two extra (scalar) variables. This protocol ensures the consistency among stepsizes of nodes, eliminating the steady-state error due to the lack of coordination of stepsizes among nodes that commonly exists in vanilla distributed adaptive methods, and thus guarantees exact convergence. For nonconvex-strongly-concave distributed minimax problems, we characterize the specific transient times that ensure time-scale separation of stepsizes and quasi-independence of networks, leading to a near-optimal convergence rate of $\tilde{\mathcal{O}} \left( \epsilon ^{-\left( 4+\delta \right)} \right)$ for any small $\delta > 0$, matching that of the centralized counterpart. To our best knowledge, D-AdaST is the *first* distributed adaptive method achieving near-optimal convergence without knowing any problem-dependent parameters for nonconvex minimax problems. Extensive experiments are conducted to validate our theoretical results.

AAAI Conference 2024 Conference Paper

AltNeRF: Learning Robust Neural Radiance Field via Alternating Depth-Pose Optimization

  • Kun Wang
  • Zhiqiang Yan
  • Huang Tian
  • Zhenyu Zhang
  • Xiang Li
  • Jun Li
  • Jian Yang

Neural Radiance Fields (NeRF) have shown promise in generating realistic novel views from sparse scene images. However, existing NeRF approaches often encounter challenges due to the lack of explicit 3D supervision and imprecise camera poses, resulting in suboptimal outcomes. To tackle these issues, we propose AltNeRF---a novel framework designed to create resilient NeRF representations using self-supervised monocular depth estimation (SMDE) from monocular videos, without relying on known camera poses. SMDE in AltNeRF masterfully learns depth and pose priors to regulate NeRF training. The depth prior enriches NeRF's capacity for precise scene geometry depiction, while the pose prior provides a robust starting point for subsequent pose refinement. Moreover, we introduce an alternating algorithm that harmoniously melds NeRF outputs into SMDE through a consistency-driven mechanism, thus enhancing the integrity of depth priors. This alternation empowers AltNeRF to progressively refine NeRF representations, yielding the synthesis of realistic novel views. Extensive experiments showcase the compelling capabilities of AltNeRF in generating high-fidelity and robust novel views that closely resemble reality.

NeurIPS Conference 2024 Conference Paper

Biomedical Visual Instruction Tuning with Clinician Preference Alignment

  • Hejie Cui
  • Lingjun Mao
  • Xin Liang
  • Jieyu Zhang
  • Hui Ren
  • Quanzheng Li
  • Xiang Li
  • Carl Yang

Recent advancements in multimodal foundation models have showcased impressive capabilities in understanding and reasoning with visual and textual information. Adapting these foundation models trained for general usage to specialized domains like biomedicine requires large-scale domain-specific instruction datasets. While existing works have explored curating such datasets automatically, the resultant datasets are not explicitly aligned with domain expertise. In this work, we propose a data-centric framework, Biomedical Visual Instruction Tuning with Clinician Preference Alignment (BioMed-VITAL), that incorporates clinician preferences into both stages of generating and selecting instruction data for tuning biomedical multimodal foundation models. First, during the generation stage, we prompt the GPT-4V generator with a diverse set of clinician-selected demonstrations for preference-aligned data candidate generation. Then, during the selection phase, we train a separate selection model, which explicitly distills clinician and policy-guided model preferences into a rating function to select high-quality data for medical instruction tuning. Results show that the model tuned with the instruction-following data from our method demonstrates a significant improvement in open visual chat (18.5% relative improvement) and medical VQA (win rate up to 81.73%). Our instruction-following data and models are available at https://BioMed-VITAL.github.io.

NeurIPS Conference 2024 Conference Paper

Cross-model Control: Improving Multiple Large Language Models in One-time Training

  • Jiayi Wu
  • Hao Sun
  • Hengyi Cai
  • Lixin Su
  • Shuaiqiang Wang
  • Dawei Yin
  • Xiang Li
  • Ming Gao

The number of large language models (LLMs) with varying parameter scales and vocabularies is increasing. While they deliver powerful performance, they also face a set of common optimization needs to meet specific requirements or standards, such as instruction following or avoiding the output of sensitive information from the real world. However, how to reuse the fine-tuning outcomes of one model for other models to reduce training costs remains a challenge. To bridge this gap, we introduce Cross-model Control (CMC), a method that improves multiple LLMs in one-time training with a portable tiny language model. Specifically, we have observed that the logit shift before and after fine-tuning is remarkably similar across different models. Based on this insight, we incorporate a tiny language model with a minimal number of parameters. By training alongside a frozen template LLM, the tiny model gains the capability to alter the logits output by the LLMs. To make this tiny language model applicable to models with different vocabularies, we propose a novel token mapping strategy named PM-MinED. We have conducted extensive experiments on instruction tuning and unlearning tasks, demonstrating the effectiveness of CMC. Our code is available at https://github.com/wujwyi/CMC.
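
The observation that the logit shift induced by fine-tuning is similar across models suggests the rough interface below: a small model predicts a logit delta that is added to any frozen LLM's next-token logits. This is only a schematic of the idea; the model architecture is a placeholder, and the token-mapping step (PM-MinED) and training procedure are described in the paper itself.

```python
# Schematic of steering a frozen LLM with a portable delta model: the tiny model
# predicts a shift that is added to the frozen model's next-token logits.
import torch
import torch.nn as nn

class TinyDeltaModel(nn.Module):
    def __init__(self, vocab_size: int, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, token_ids):                        # (batch, seq)
        h, _ = self.rnn(self.embed(token_ids))
        return self.head(h[:, -1])                       # logit shift for the next token

vocab = 1000
tiny = TinyDeltaModel(vocab)
frozen_llm_logits = torch.randn(2, vocab)                # stand-in for any frozen LLM
tokens = torch.randint(0, vocab, (2, 16))
steered = frozen_llm_logits + tiny(tokens)               # combined logits used for decoding
print(steered.shape)                                     # torch.Size([2, 1000])
```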

NeurIPS Conference 2024 Conference Paper

DCDepth: Progressive Monocular Depth Estimation in Discrete Cosine Domain

  • Kun Wang
  • Zhiqiang Yan
  • Junkai Fan
  • Wanlu Zhu
  • Xiang Li
  • Jun Li
  • Jian Yang

In this paper, we introduce DCDepth, a novel framework for the long-standing monocular depth estimation task. Moving beyond conventional pixel-wise depth estimation in the spatial domain, our approach estimates the frequency coefficients of depth patches after transforming them into the discrete cosine domain. This unique formulation allows for the modeling of local depth correlations within each patch. Crucially, the frequency transformation segregates the depth information into various frequency components, with low-frequency components encapsulating the core scene structure and high-frequency components detailing the finer aspects. This decomposition forms the basis of our progressive strategy, which begins with the prediction of low-frequency components to establish a global scene context, followed by successive refinement of local details through the prediction of higher-frequency components. We conduct comprehensive experiments on NYU-Depth-V2, TOFDC, and KITTI datasets, and demonstrate the state-of-the-art performance of DCDepth. Code is available at https://github.com/w2kun/DCDepth.
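
A small SciPy illustration of the frequency decomposition the abstract describes: transform a depth patch with a 2-D discrete cosine transform, where the low-frequency coefficients carry the coarse structure and the higher ones the details. The progressive prediction network itself is not shown, and the patch here is synthetic.

```python
# Illustration of the discrete-cosine decomposition of a depth patch.
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(0)
patch = np.cumsum(np.cumsum(rng.normal(size=(8, 8)), axis=0), axis=1)  # smooth-ish "depth"

coeffs = dctn(patch, norm="ortho")         # 8x8 frequency coefficients
coarse = np.zeros_like(coeffs)
coarse[:2, :2] = coeffs[:2, :2]            # keep only the lowest frequencies
coarse_depth = idctn(coarse, norm="ortho") # coarse scene structure

err_coarse = np.abs(patch - coarse_depth).mean()
err_full = np.abs(patch - idctn(coeffs, norm="ortho")).mean()
print(f"low-frequency reconstruction error: {err_coarse:.3f}")
print(f"full reconstruction error:          {err_full:.3e}")  # ~0: the DCT is invertible
```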

ICLR Conference 2024 Conference Paper

Decoding Natural Images from EEG for Object Recognition

  • Yonghao Song
  • Bingchuan Liu
  • Xiang Li
  • Nanlin Shi
  • Yijun Wang 0001
  • Xiaorong Gao

Electroencephalography (EEG) signals, known for convenient non-invasive acquisition but low signal-to-noise ratio, have recently gained substantial attention due to the potential to decode natural images. This paper presents a self-supervised framework to demonstrate the feasibility of learning image representations from EEG signals, particularly for object recognition. The framework utilizes image and EEG encoders to extract features from paired image stimuli and EEG responses. Contrastive learning aligns these two modalities by constraining their similarity. Our approach achieves state-of-the-art results on a comprehensive EEG-image dataset, with a top-1 accuracy of 15.6% and a top-5 accuracy of 42.8% in 200-way zero-shot tasks. Moreover, we perform extensive experiments to explore the biological plausibility by resolving the temporal, spatial, spectral, and semantic aspects of EEG signals. Besides, we introduce attention modules to capture spatial correlations, providing implicit evidence of the brain activity perceived from EEG data. These findings yield valuable insights for neural decoding and brain-computer interfaces in real-world scenarios. Code available at https://github.com/eeyhsong/NICE-EEG.
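
A schematic of the contrastive alignment between paired EEG and image embeddings (a symmetric InfoNCE-style objective, as is common for such cross-modal alignment); the encoders, embedding dimension, and temperature below are placeholders, not the paper's configuration.

```python
# Schematic of contrastive alignment between paired EEG and image embeddings:
# a symmetric cross-entropy over the similarity matrix (InfoNCE-style).
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(eeg_emb, img_emb, temperature=0.07):
    eeg = F.normalize(eeg_emb, dim=-1)
    img = F.normalize(img_emb, dim=-1)
    logits = eeg @ img.t() / temperature            # (batch, batch) similarities
    targets = torch.arange(len(eeg))                # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

eeg_emb = torch.randn(32, 256)   # placeholder EEG-encoder outputs for 32 paired trials
img_emb = torch.randn(32, 256)   # placeholder image-encoder outputs for the same stimuli
print(contrastive_alignment_loss(eeg_emb, img_emb))
```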

AAAI Conference 2024 Conference Paper

DI-V2X: Learning Domain-Invariant Representation for Vehicle-Infrastructure Collaborative 3D Object Detection

  • Xiang Li
  • Junbo Yin
  • Wei Li
  • Chengzhong Xu
  • Ruigang Yang
  • Jianbing Shen

Vehicle-to-Everything (V2X) collaborative perception has recently gained significant attention due to its capability to enhance scene understanding by integrating information from various agents, e.g., vehicles, and infrastructure. However, current works often treat the information from each agent equally, ignoring the inherent domain gap caused by the utilization of different LiDAR sensors of each agent, thus leading to suboptimal performance. In this paper, we propose DI-V2X, that aims to learn Domain-Invariant representations through a new distillation framework to mitigate the domain discrepancy in the context of V2X 3D object detection. DI-V2X comprises three essential components: a domain-mixing instance augmentation (DMA) module, a progressive domain-invariant distillation (PDD) module, and a domain-adaptive fusion (DAF) module. Specifically, DMA builds a domain-mixing 3D instance bank for the teacher and student models during training, resulting in aligned data representation. Next, PDD encourages the student models from different domains to gradually learn a domain-invariant feature representation towards the teacher, where the overlapping regions between agents are employed as guidance to facilitate the distillation process. Furthermore, DAF closes the domain gap between the students by incorporating calibration-aware domain-adaptive attention. Extensive experiments on the challenging DAIR-V2X and V2XSet benchmark datasets demonstrate DI-V2X achieves remarkable performance, outperforming all the previous V2X models. Code is available at https://github.com/Serenos/DI-V2X.

JBHI Journal 2024 Journal Article

DiffMAR: A Generalized Diffusion Model for Metal Artifact Reduction in CT Images

  • Tianxiao Cai
  • Xiang Li
  • Chenglan Zhong
  • Wei Tang
  • Jixiang Guo

X-ray imaging frequently introduces varying degrees of metal artifacts to computed tomography (CT) images when metal implants are present. For the metal artifact reduction (MAR) task, existing end-to-end methods often exhibit limited generalization capabilities, while methods based on multiple iterations often suffer from accumulated error, resulting in lower-quality restoration outcomes. In this work, we innovatively present a generalized diffusion model for Metal Artifact Reduction (DiffMAR). The proposed method utilizes a linear degradation process to simulate the physical phenomenon of metal artifact formation in CT images and directly learns an iterative restoration process from paired CT images in the reverse process. During the reverse process of DiffMAR, a Time-Latent Adjustment (TLA) module is designed to adjust time embedding at the latent level, thereby minimizing the accumulated error during iterative restoration. We also designed a structure information extraction (SIE) module to utilize linear interpolation data in the image domain, guiding the generation of anatomical structures during iterative restoration. This leads to more accurate and robust shadow-free image generation. Comprehensive analysis, including both synthesized data and clinical evidence, confirms that our proposed method surpasses the current state-of-the-art (SOTA) MAR methods in terms of both image generation quality and generalization.

NeurIPS Conference 2024 Conference Paper

Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization

  • Kai Hu
  • Weichen Yu
  • Yining Li
  • Tianjun Yao
  • Xiang Li
  • Wenhe Liu
  • Lijun Yu
  • Zhiqiang Shen

Recent research indicates that large language models (LLMs) are susceptible to jailbreaking attacks that can generate harmful content. This paper introduces a novel token-level attack method, Adaptive Dense-to-Sparse Constrained Optimization (ADC), which has been shown to successfully jailbreak multiple open-source LLMs. Drawing inspiration from the difficulties of discrete token optimization, our method relaxes the discrete jailbreak optimization into a continuous optimization process while gradually increasing the sparsity of the optimizing vectors. This technique effectively bridges the gap between discrete and continuous space optimization. Experimental results demonstrate that our method is more effective and efficient than state-of-the-art token-level methods. On Harmbench, our approach achieves the highest attack success rate on seven out of eight LLMs compared to the latest jailbreak methods. Trigger Warning: This paper contains model behavior that can be offensive in nature.

NeurIPS Conference 2024 Conference Paper

Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning

  • Chong Ma
  • Hanqi Jiang
  • Wenting Chen
  • Yiwei Li
  • Zihao Wu
  • Xiaowei Yu
  • Zhengliang Liu
  • Lei Guo

In medical multi-modal frameworks, the alignment of cross-modality features presents a significant challenge. However, existing works have learned features that are implicitly aligned from the data, without considering the explicit relationships in the medical context. This reliance on data may lead to poor generalization of the learned alignment relationships. In this work, we propose the Eye-gaze Guided Multi-modal Alignment (EGMA) framework to harness eye-gaze data for better alignment of medical visual and textual features. We explore the natural auxiliary role of radiologists' eye-gaze data in aligning medical images and text, and introduce a novel approach by using eye-gaze data, collected synchronously by radiologists during diagnostic evaluations. We conduct downstream tasks of image classification and image-text retrieval on four medical datasets, where EGMA achieved state-of-the-art performance and stronger generalization across different datasets. Additionally, we explore the impact of varying amounts of eye-gaze data on model performance, highlighting the feasibility and utility of integrating this auxiliary data into the multi-modal alignment framework.

NeurIPS Conference 2024 Conference Paper

Imprecise Label Learning: A Unified Framework for Learning with Various Imprecise Label Configurations

  • Hao Chen
  • Ankit Shah
  • Jindong Wang
  • Ran Tao
  • Yidong Wang
  • Xiang Li
  • Xing Xie
  • Masashi Sugiyama

Learning with reduced labeling standards, such as noisy label, partial label, and supplementary unlabeled data, which we generically refer to as imprecise label, is a commonplace challenge in machine learning tasks. Previous methods tend to propose specific designs for every emerging imprecise label configuration, which is usually unsustainable when multiple configurations of imprecision coexist. In this paper, we introduce imprecise label learning (ILL), a framework for the unification of learning with various imprecise label configurations. ILL leverages expectation-maximization (EM) for modeling the imprecise label information, treating the precise labels as latent variables. Instead of approximating the correct labels for training, it considers the entire distribution of all possible labeling entailed by the imprecise information. We demonstrate that ILL can seamlessly adapt to partial label learning, semi-supervised learning, noisy label learning, and, more importantly, a mixture of these settings, with closed-form learning objectives derived from the unified EM modeling. Notably, ILL surpasses the existing specified techniques for handling imprecise labels, marking the first practical and unified framework with robust and effective performance across various challenging settings. We hope our work will inspire further research on this topic, unleashing the full potential of ILL in wider scenarios where precise labels are expensive and complicated to obtain.

AAAI Conference 2024 Conference Paper

In-Hand 3D Object Reconstruction from a Monocular RGB Video

  • Shijian Jiang
  • Qi Ye
  • Rengan Xie
  • Yuchi Huo
  • Xiang Li
  • Yang Zhou
  • Jiming Chen

Our work aims to reconstruct a 3D object that is held and rotated by a hand in front of a static RGB camera. Previous methods that use implicit neural representations to recover the geometry of a generic hand-held object from multi-view images achieved compelling results in the visible part of the object. However, these methods falter in accurately capturing the shape within the hand-object contact region due to occlusion. In this paper, we propose a novel method that deals with surface reconstruction under occlusion by incorporating priors of 2D occlusion elucidation and physical contact constraints. For the former, we introduce an object amodal completion network to infer the 2D complete mask of objects under occlusion. To ensure the accuracy and view consistency of the predicted 2D amodal masks, we devise a joint optimization method for both amodal mask refinement and 3D reconstruction. For the latter, we impose penetration and attraction constraints on the local geometry in contact regions. We evaluate our approach on HO3D and HOD datasets and demonstrate that it outperforms the state-of-the-art methods in terms of reconstruction surface quality, with an improvement of 52% on HO3D and 20% on HOD. Project webpage: https://east-j.github.io/ihor.

NeurIPS Conference 2024 Conference Paper

Memory-Efficient Gradient Unrolling for Large-Scale Bi-level Optimization

  • Qianli Shen
  • Yezhen Wang
  • Zhouhao Yang
  • Xiang Li
  • Haonan Wang
  • Yang Zhang
  • Jonathan Scarlett
  • Zhanxing Zhu

Bi-level optimization (BO) has become a fundamental mathematical framework for addressing hierarchical machine learning problems. As deep learning models continue to grow in size, the demand for scalable bi-level optimization has become increasingly critical. Traditional gradient-based bi-level optimization algorithms, due to their inherent characteristics, are ill-suited to meet the demands of large-scale applications. In this paper, we introduce **F**orward **G**radient **U**nrolling with **F**orward **G**radient, abbreviated as **$($FG$)^2$U**, which achieves an unbiased stochastic approximation of the meta gradient for bi-level optimization. $($FG$)^2$U circumvents the memory and approximation issues associated with classical bi-level optimization approaches, and delivers significantly more accurate gradient estimates than existing large-scale bi-level optimization approaches. Additionally, $($FG$)^2$U is inherently designed to support parallel computing, enabling it to effectively leverage large-scale distributed computing systems to achieve significant computational efficiency. In practice, $($FG$)^2$U and other methods can be strategically placed at different stages of the training process to achieve a more cost-effective two-phase paradigm. Further, $($FG$)^2$U is easy to implement within popular deep learning frameworks, and can be conveniently adapted to address more challenging zeroth-order bi-level optimization scenarios. We provide a thorough convergence analysis and a comprehensive practical discussion for $($FG$)^2$U, complemented by extensive empirical evaluations, showcasing its superior performance in diverse large-scale bi-level optimization tasks.
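
As a toy illustration of the forward-gradient building block the abstract relies on (an unbiased gradient estimate formed from a single directional derivative along a random direction), the NumPy sketch below uses a quadratic whose directional derivative is available in closed form; it is not the $($FG$)^2$U algorithm itself.

```python
# Toy illustration of a forward gradient: for v ~ N(0, I), (grad f . v) v is an
# unbiased estimate of grad f, using only a directional derivative (no backprop).
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 10))
b = rng.normal(size=30)
x = rng.normal(size=10)

grad = A.T @ (A @ x - b)                       # exact gradient of f(x) = 0.5*||Ax - b||^2

estimates = []
for _ in range(20_000):
    v = rng.normal(size=10)
    directional = grad @ v                     # in practice: one forward-mode JVP
    estimates.append(directional * v)          # forward-gradient estimate of grad f

approx = np.mean(estimates, axis=0)
print(np.linalg.norm(approx - grad) / np.linalg.norm(grad))  # small: the estimator is unbiased
```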

JBHI Journal 2024 Journal Article

MH2AFormer: An Efficient Multiscale Hierarchical Hybrid Attention With a Transformer for Bladder Wall and Tumor Segmentation

  • Xiang Li
  • Jian Wang
  • Haifeng Wei
  • Jinyu Cong
  • Hongfu Sun
  • Pingping Wang
  • Benzheng Wei

Achieving accurate bladder wall and tumor segmentation from MRI is critical for diagnosing and treating bladder cancer. However, automated segmentation remains challenging due to factors such as comparable density distributions, intricate tumor morphologies, and unclear boundaries. Considering the attributes of bladder MRI images, we propose an efficient multiscale hierarchical hybrid attention with a transformer (MH2AFormer) for bladder cancer and wall segmentation. Specifically, a multiscale hybrid attention and transformer (MHAT) module in the encoder is designed to adaptively extract and aggregate multiscale hybrid feature representations from the input image. In the decoder stage, we devise a multiscale hybrid attention (MHA) module to generate high-quality segmentation results from multiscale hybrid features. Combining these modules enhances the feature representation and guides the model to focus on tumor and wall regions, which helps to solve bladder image segmentation challenges. Moreover, MHAT utilizes the Fast Fourier Transformer with a large kernel (e.g., 224 × 224) to model global feature relationships while reducing computational complexity in the encoding stage. The model performance was evaluated on two datasets. As a result, the model achieves relatively best results regarding the intersection over union (IoU) and dice similarity coefficient (DSC) on both datasets (Dataset A: IoU = 80.26%, DSC = 88.20%; Dataset B: IoU = 89.74%, DSC = 94.48%). These advantageous outcomes substantiate the practical utility of our approach, highlighting its potential to alleviate the workload of radiologists when applied in clinical settings.

ICLR Conference 2024 Conference Paper

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

  • Deyao Zhu
  • Jun Chen 0021
  • Xiaoqian Shen
  • Xiang Li
  • Mohamed Elhoseiny

The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. However, the technical details behind GPT-4 continue to remain undisclosed. We believe that the enhanced multi-modal generation capabilities of GPT-4 stem from the utilization of sophisticated large language models (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen advanced LLM, Vicuna, using one projection layer. Our work, for the first time, uncovers that properly aligning the visual features with an advanced large language model can possess numerous advanced multi-modal abilities demonstrated by GPT-4, such as detailed image description generation and website creation from hand-drawn drafts. Furthermore, we also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images, teaching users how to cook based on food photos, and so on. In our experiment, we found that the model trained on short image caption pairs could produce unnatural language outputs (e.g., repetition and fragmentation). To address this problem, we curate a detailed image description dataset in the second stage to finetune the model, which consequently improves the model's generation reliability and overall usability.
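
The alignment step described above amounts to a single trainable linear projection from the frozen visual encoder's output space into the frozen LLM's token-embedding space. The schematic below illustrates that interface only; the dimensions and tensor shapes are illustrative placeholders, not MiniGPT-4's actual configuration.

```python
# Schematic of aligning a frozen visual encoder to a frozen LLM with one
# trainable projection layer.
import torch
import torch.nn as nn

visual_dim, llm_dim = 1408, 4096
project = nn.Linear(visual_dim, llm_dim)          # the only trainable parameters

visual_tokens = torch.randn(1, 32, visual_dim)    # output of the frozen visual encoder
text_embeds = torch.randn(1, 20, llm_dim)         # embeddings of the text prompt
llm_inputs = torch.cat([project(visual_tokens), text_embeds], dim=1)
print(llm_inputs.shape)                           # torch.Size([1, 52, 4096]), fed to the frozen LLM
```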

IJCAI Conference 2024 Conference Paper

No Regularization Is Needed: Efficient and Effective Incomplete Label Distribution Learning

  • Xiang Li
  • Songcan Chen

In reality, it is laborious to obtain complete label degrees, giving birth to Incomplete Label Distribution Learning (InLDL), where some degrees are missing. Existing InLDL methods often assume that degrees are missing uniformly at random. However, this is often not the case in practice, which gives rise to the first issue. Besides, they often adopt explicit regularization to compensate for the incompleteness, leading to burdensome parameter tuning and extra computation, causing the second issue. To address the first issue, we adopt a more practical setting, i.e., small degrees are more prone to be missing, since large degrees are likely to catch more attention. To tackle the second issue, we argue that label distribution itself already contains abundant knowledge, such as label correlation and ranking order, thus it may have provided enough prior for learning. It is precisely because existing methods overlook such a prior that they are forced to adopt explicit regularization. By directly utilizing the label degrees prior, we design a properly weighted objective function, exempting the need for explicit regularization. Moreover, we provide rigorous theoretical analysis, revealing in principle that the weighting plays an implicit regularization role. To sum up, our method has four advantages: it is 1) model selection free; 2) with a closed-form solution (sub-problem) and easy to implement (a few lines of code); 3) with linear computational complexity in the number of samples, thus scalable to large datasets; 4) competitive with state-of-the-arts in both random and non-random missing scenarios.

NeurIPS Conference 2024 Conference Paper

Novel Object Synthesis via Adaptive Text-Image Harmony

  • Zeren Xiong
  • Zedong Zhang
  • Zikun Chen
  • Shuo Chen
  • Xiang Li
  • Gan Sun
  • Jian Yang
  • Jun Li

In this paper, we study an object synthesis task that combines an object text with an object image to create a new object image. However, most diffusion models struggle with this task, i.e., often generating an object that predominantly reflects either the text or the image due to an imbalance between their inputs. To address this issue, we propose a simple yet effective method called Adaptive Text-Image Harmony (ATIH) to generate novel and surprising objects. First, we introduce a scale factor and an injection step to balance text and image features in cross-attention and to preserve image information in self-attention during the text-image inversion diffusion process, respectively. Second, to better integrate object text and image, we design a balanced loss function with a noise parameter, ensuring both optimal editability and fidelity of the object image. Third, to adaptively adjust these parameters, we present a novel similarity score function that not only maximizes the similarities between the generated object image and the input text/image but also balances these similarities to harmonize text and image integration. Extensive experiments demonstrate the effectiveness of our approach, showcasing remarkable object creations such as colobus-glass jar. https://xzr52.github.io/ATIH/

IJCAI Conference 2024 Conference Paper

RisQNet: Rescuing SMEs from Financial Shocks with a Novel Networked-Loan Risk Assessment

  • Zhaoyuan Lu
  • Taijun Li
  • Jingzhen Zhang
  • Moyang Liu
  • Xiang Li
  • Linyi Cui
  • Junqi Chen
  • Zhibin Niu

In the face of economic downturns, Small and Medium-sized Enterprises (SMEs) within interconnected networked-loans are vulnerable to cascading debt crises, exacerbated by factors like social media-induced financial shocks. Traditional risk assessment models, which mainly rely on financial data, inadequately predict such crises, as evidenced by the collapse of Silicon Valley Bank in 2023. To address this issue, we developed RisQNet, a model that uses temporal graph networks to incorporate diverse risks, including real-time media influences. This approach not only advances risk prediction through news feature extraction and large language models but also enhances risk management strategies with intuitive visualization tools. Validated on a dataset with a total loan volume of USD 3 trillion, RisQNet outperforms the state-of-the-art baseline and achieves an AUC of 87.1%. Our collaborative effort with financial regulators and the SME community underpins the model's development, aligning with the UN SDG 8. RisQNet represents a significant step forward in leveraging AI for financial stability, offering a promising approach to combat the propagation of debt crises in financial networks.

NeurIPS Conference 2024 Conference Paper

SARDet-100K: Towards Open-Source Benchmark and ToolKit for Large-Scale SAR Object Detection

  • Yuxuan Li
  • Xiang Li
  • Weijie Li
  • Qibin Hou
  • Li Liu
  • Ming-Ming Cheng
  • Jian Yang

Synthetic Aperture Radar (SAR) object detection has gained significant attention recently due to its irreplaceable all-weather imaging capabilities. However, this research field suffers from both limited public datasets (mostly comprising <2K images with only mono-category objects) and inaccessible source code. To tackle these challenges, we establish a new benchmark dataset and an open-source method for large-scale SAR object detection. Our dataset, SARDet-100K, is a result of intense surveying, collecting, and standardizing 10 existing SAR detection datasets, providing a large-scale and diverse dataset for research purposes. To the best of our knowledge, SARDet-100K is the first COCO-level large-scale multi-class SAR object detection dataset ever created. With this high-quality dataset, we conducted comprehensive experiments and uncovered a crucial challenge in SAR object detection: the substantial disparities between pretraining on RGB datasets and finetuning on SAR datasets in terms of both data domain and model structure. To bridge these gaps, we propose a novel Multi-Stage with Filter Augmentation (MSFA) pretraining framework that tackles the problems from the perspective of data input, domain transition, and model migration. The proposed MSFA method significantly enhances the performance of SAR object detection models while demonstrating exceptional generalizability and flexibility across diverse models. This work aims to pave the way for further advancements in SAR object detection. The dataset and code are available at \url{https://github.com/zcablii/SARDet_100K}.

NeurIPS Conference 2024 Conference Paper

Slight Corruption in Pre-training Data Makes Better Diffusion Models

  • Hao Chen
  • Yujin Han
  • Diganta Misra
  • Xiang Li
  • Kai Hu
  • Difan Zou
  • Masashi Sugiyama
  • Jindong Wang

Diffusion models (DMs) have shown remarkable capabilities in generating realistic high-quality images, audios, and videos. They benefit significantly from extensive pre-training on large-scale datasets, including web-crawled data with paired data and conditions, such as image-text and image-class pairs. Despite rigorous filtering, these pre-training datasets often inevitably contain corrupted pairs where conditions do not accurately describe the data. This paper presents the first comprehensive study on the impact of such corruption in pre-training data of DMs. We synthetically corrupt ImageNet-1K and CC3M to pre-train and evaluate over $50$ conditional DMs. Our empirical findings reveal that various types of slight corruption in pre-training can significantly enhance the quality, diversity, and fidelity of the generated images across different DMs, both during pre-training and downstream adaptation stages. Theoretically, we consider a Gaussian mixture model and prove that slight corruption in the condition leads to higher entropy and a reduced 2-Wasserstein distance to the ground truth of the data distribution generated by the corruptly trained DMs. Inspired by our analysis, we propose a simple method to improve the training of DMs on practical datasets by adding condition embedding perturbations (CEP). CEP significantly improves the performance of various DMs in both pre-training and downstream tasks. We hope that our study provides new insights into understanding the data and pre-training processes of DMs.
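
A minimal sketch of the condition embedding perturbation (CEP) idea, assuming the condition is available as a dense embedding that can simply be jittered with Gaussian noise during training; the noise scale sigma below is an illustrative choice, not the paper's setting.

import torch

def perturb_condition(cond_emb: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    # Add small Gaussian noise to the condition embedding so the diffusion
    # model is trained under slight, controlled condition corruption.
    return cond_emb + sigma * torch.randn_like(cond_emb)

cond = torch.randn(32, 512)           # batch of class/text condition embeddings
noisy_cond = perturb_condition(cond)  # used in place of cond in the training step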

NeurIPS Conference 2024 Conference Paper

Suitable is the Best: Task-Oriented Knowledge Fusion in Vulnerability Detection

  • Jingjing Wang
  • Minhuan Huang
  • Yuanpin Nie
  • Xiang Li
  • Qianjin Du
  • Wei Kong
  • Huan Deng
  • Xiaohui Kuang

Deep learning technologies have demonstrated remarkable performance in vulnerability detection. Existing works primarily adopt a uniform and consistent feature learning pattern across the entire target set. While designed for general-purpose detection tasks, they lack sensitivity towards target code comprising multiple functional modules or diverse vulnerability subtypes. In this paper, we present a knowledge fusion-based vulnerability detection method (KF-GVD) that integrates specific vulnerability knowledge into the Graph Neural Network feature learning process. KF-GVD achieves accurate vulnerability detection across different functional modules of the Linux kernel and vulnerability subtypes without compromising general task performance. Extensive experiments demonstrate that KF-GVD outperforms SOTAs on function-level and statement-level vulnerability detection across various target tasks, with an average increase of 40.9% in precision and 26.1% in recall. Notably, KF-GVD discovered 9 undisclosed vulnerabilities when employed on C/C++ open-source projects without ground truth.

ICRA Conference 2024 Conference Paper

TPGP: Temporal-Parametric Optimization with Deep Grasp Prior for Dexterous Motion Planning

  • Haoming Li 0004
  • Qi Ye 0001
  • Yuchi Huo
  • Qingtao Liu
  • Shijian Jiang
  • Tao Zhou
  • Xiang Li
  • Yang Zhou

Grasping motion planning aims to find a feasible grasping trajectory in the configuration space given an input target grasp. While optimizing grasp motion with two- or three-fingered grippers has been well studied, natural grasp motion planning with a dexterous hand remains a very challenging problem due to the high-dimensional working space. In this work, we propose a novel temporal-parametric grasp prior (TPGP) optimization method to simplify the difficulty of grasping trajectory optimization for the dexterous hand while maintaining smooth and natural properties of the grasping motion. Specifically, we formulate the discrete trajectory parameters into a temporal-based parameterization, where a prior constraint, provided by a hand poser network, is introduced to ensure that the hand pose is natural and reasonable throughout the trajectory. Finally, we present a joint target optimization strategy to enhance the target pose for more feasible trajectories. Extensive validations on two public datasets show that our method outperforms state-of-the-art methods regarding grasp motion on various metrics.

NeurIPS Conference 2024 Conference Paper

Understanding Generalizability of Diffusion Models Requires Rethinking the Hidden Gaussian Structure

  • Xiang Li
  • Yixiang Dai
  • Qing Qu

In this work, we study the generalizability of diffusion models by looking into the hidden properties of the learned score functions, which are essentially a series of deep denoisers trained on various noise levels. We observe that as diffusion models transition from memorization to generalization, their corresponding nonlinear diffusion denoisers exhibit increasing linearity. This discovery leads us to investigate the linear counterparts of the nonlinear diffusion models, which are a series of linear models trained to match the function mappings of the nonlinear diffusion denoisers. Surprisingly, these linear denoisers are approximately the optimal denoisers for a multivariate Gaussian distribution characterized by the empirical mean and covariance of the training dataset. This finding implies that diffusion models have the inductive bias towards capturing and utilizing the Gaussian structure (covariance information) of the training dataset for data generation. We empirically demonstrate that this inductive bias is a unique property of diffusion models in the generalization regime, which becomes increasingly evident when the model's capacity is relatively small compared to the training dataset size. In the case that the model is highly overparameterized, this inductive bias emerges during the initial training phases before the model fully memorizes its training data. Our study provides crucial insights into understanding the notable strong generalization phenomenon recently observed in real-world diffusion models.
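
For reference, the optimal (MMSE) denoiser for a multivariate Gaussian $\mathcal{N}(\mu, \Sigma)$ observed under additive noise of variance $\sigma^2$, which the linear counterparts above are said to approximate, takes the standard closed form

$$\hat{x}(y) = \mu + \Sigma\,(\Sigma + \sigma^2 I)^{-1}\,(y - \mu),$$

where $\mu$ and $\Sigma$ correspond to the empirical mean and covariance of the training dataset (a standard Gaussian identity, included here only as background).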

NeurIPS Conference 2024 Conference Paper

UniAudio 1.5: Large Language Model-Driven Audio Codec is A Few-Shot Audio Task Learner

  • Dongchao Yang
  • Haohan Guo
  • Yuanyuan Wang
  • Rongjie Huang
  • Xiang Li
  • Xu Tan
  • Xixin Wu
  • Helen Meng

Large Language Models (LLMs) have demonstrated supreme capabilities in textual understanding and generation, but cannot be directly applied to cross-modal tasks without fine-tuning. This paper proposes a cross-modal in-context learning approach, empowering the frozen LLMs to achieve multiple audio tasks in a few-shot style without any parameter update. Specifically, we propose a novel LLM-driven audio codec model, LLM-Codec, which transfers the audio modality into textual space by representing audio tokens with words or sub-words from the LLM vocabulary, while maintaining high audio reconstruction quality. The key idea is to reduce the modality heterogeneity between text and audio by compressing the audio modality into the well-trained textual space of LLMs. Thus, the audio representation can be viewed as a new foreign language, and LLMs can learn the new foreign language with several demonstrations. In experiments, we investigate the performance of the proposed approach across multiple audio understanding and generation tasks, e.g., speech emotion classification, audio classification, text-to-speech generation, speech enhancement, etc. Experimental results show that LLMs equipped with the LLM-Codec, named UniAudio 1.5, prompted by only a few examples, can perform effectively in simple scenarios, validating our cross-modal in-context learning approach. To facilitate research on few-shot audio task learning and multi-modal LLMs, we have open-sourced the LLM-Codec model.

TMLR Journal 2024 Journal Article

Variance-aware decision making with linear function approximation under heavy-tailed rewards

  • Xiang Li
  • Qiang Sun

This paper studies how to achieve variance-aware regrets for online decision-making in the presence of heavy-tailed rewards with only finite variances. For linear stochastic bandits, we address the issue of heavy-tailed rewards by modifying the adaptive Huber regression and proposing AdaOFUL. AdaOFUL achieves a state-of-the-art regret bound of $\widetilde{\mathcal{O}}\big(d\big(\sum_{t=1}^T \nu_{t}^2\big)^{1/2}+d\big)$ as if the rewards were uniformly bounded, where $\nu_{t}^2$ is the conditional variance of the reward at round $t$, $d$ is the feature dimension, and $T$ is the number of online rounds. Building upon AdaOFUL, we propose VARA for linear MDPs, which achieves a variance-aware regret bound of $\widetilde{\mathcal{O}}(d\sqrt{H\mathcal{G}^*K})$. Here, $H$ is the length of episodes, $K$ is the number of episodes, and $\mathcal{G}^*$ is a smaller instance-dependent quantity that can be bounded by other instance-dependent quantities when additional structural conditions on the MDP are satisfied. Overall, our modified adaptive Huber regression algorithm may serve as a useful building block in the design of algorithms for online problems with heavy-tailed rewards.
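
As background for the adaptive Huber regression mentioned above, the Huber loss with robustification parameter $\tau$ replaces the squared loss by

$$\ell_\tau(x) = \begin{cases} x^2/2, & |x| \le \tau, \\ \tau |x| - \tau^2/2, & |x| > \tau, \end{cases}$$

so that large residuals caused by heavy-tailed rewards contribute only linearly; the adaptive variant tunes $\tau$ across rounds, and this definition is standard background rather than the paper's exact algorithm.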

IROS Conference 2024 Conference Paper

Voltage Regulation in Polymer Electrolyte Fuel Cell Systems Using Gaussian Process Model Predictive Control

  • Xiufei Li
  • Miao Yang
  • Miao Zhang
  • Yuanxin Qi
  • Zhuowei Li 0008
  • Senbin Yu
  • Yuantao Wang
  • Linpeng Shen

This study presents a novel approach using Gaussian process model predictive control (MPC) to stabilize the output voltage of a polymer electrolyte fuel cell (PEFC) by regulating hydrogen and airflow rates. Two Gaussian process models capture PEFC dynamics, accounting for constraints like hydrogen pressure and input change rates to reduce predictive control errors. The performance of the physical model and Gaussian process MPC in handling constraints and system inputs is compared. Simulations show that the proposed Gaussian process MPC maintains the voltage at 48 V while adhering to safety constraints, even with workload disturbances from 110-120 A. Compared to traditional MPC with detailed system models, Gaussian process MPC has similar overshoot and slower response time but requires less system information and no underlying true system model.

NeurIPS Conference 2024 Conference Paper

VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding

  • Xiang Li
  • Jian Ding
  • Mohamed Elhoseiny

We introduce a new benchmark designed to advance the development of general-purpose, large-scale vision-language models for remote sensing images. Although several vision-language datasets in remote sensing have been proposed to pursue this goal, existing datasets are typically tailored to single tasks, lack detailed object information, or suffer from inadequate quality control. Exploring these improvement opportunities, we present a Versatile vision-language Benchmark for Remote Sensing image understanding, termed VRSBench. This benchmark comprises 29,614 images, with 29,614 human-verified detailed captions, 52,472 object references, and 123,221 question-answer pairs. It facilitates the training and evaluation of vision-language models across a broad spectrum of remote sensing image understanding tasks. We further evaluated state-of-the-art models on this benchmark for three vision-language tasks: image captioning, visual grounding, and visual question answering. Our work aims to significantly contribute to the development of advanced vision-language models in the field of remote sensing. The data and code can be accessed at https://vrsbench.github.io.

NeurIPS Conference 2023 Conference Paper

A Unified Solution for Privacy and Communication Efficiency in Vertical Federated Learning

  • Ganyu Wang
  • Bin Gu
  • Qingsong Zhang
  • Xiang Li
  • Boyu Wang
  • Charles X. Ling

Vertical Federated Learning (VFL) is a collaborative machine learning paradigm that enables multiple participants to jointly train a model on their private data without sharing it. To make VFL practical, privacy security and communication efficiency should both be satisfied. Recent research has shown that Zero-Order Optimization (ZOO) in VFL can effectively conceal the internal information of the model without adding costly privacy-protective add-ons, making it a promising approach for privacy and efficiency. However, there are still two key problems that have yet to be resolved. First, the convergence rate of ZOO-based VFL is significantly slower compared to gradient-based VFL, resulting in low efficiency in model training and more communication rounds, which hinders its application on large neural networks. Second, although ZOO-based VFL has demonstrated resistance to state-of-the-art (SOTA) attacks, its privacy guarantee lacks a theoretical explanation. To address these challenges, we propose a novel cascaded hybrid optimization approach that employs a zeroth-order (ZO) gradient on the most critical output layer of the clients, with other parts utilizing the first-order (FO) gradient. This approach preserves the privacy protection of ZOO while significantly enhancing convergence. Moreover, we theoretically prove that applying ZOO to the VFL is equivalent to adding Gaussian Mechanism to the gradient information, which offers an implicit differential privacy guarantee. Experimental results demonstrate that our proposed framework achieves similar utility as the Gaussian mechanism under the same privacy budget, while also having significantly lower communication costs compared with SOTA communication-efficient VFL frameworks.
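
A minimal sketch of the two-point zeroth-order gradient estimate that ZOO-based VFL builds on, written for a generic loss; the smoothing radius mu, the number of probe directions, and the toy objective are illustrative assumptions.

import numpy as np

def zo_gradient(loss_fn, w, mu=1e-3, num_dirs=10):
    # Two-point zeroth-order estimate: probe the loss along random directions
    # instead of backpropagating, so no explicit gradient leaves the client.
    g = np.zeros_like(w)
    for _ in range(num_dirs):
        u = np.random.randn(*w.shape)
        g += (loss_fn(w + mu * u) - loss_fn(w - mu * u)) / (2 * mu) * u
    return g / num_dirs

w = np.zeros(5)
grad = zo_gradient(lambda v: np.sum((v - 1.0) ** 2), w)
print(grad)  # approaches the true gradient 2*(w - 1) as num_dirs grows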

IJCAI Conference 2023 Conference Paper

Compositional Zero-Shot Artistic Font Synthesis

  • Xiang Li
  • Lei Wu
  • Changshuo Wang
  • Lei Meng
  • Xiangxu Meng

Recently, many researchers have made remarkable achievements in the field of artistic font synthesis, with impressive glyph style and effect style in the results. However, due to limited exploration of style disentanglement, existing methods struggle to envision unseen style (glyph-effect) compositions of artistic fonts, and thus can only learn the seen style compositions. To solve this problem, we propose a novel compositional zero-shot artistic font synthesis GAN (CAFS-GAN), which allows the synthesis of unseen style compositions by exploring the visual independence and joint compatibility of encoding semantics between glyph and effect. Specifically, we propose two contrast-based style encoders to achieve style disentanglement, since glyph and effect are intertwined in the image. Meanwhile, to preserve more glyph and effect detail, we propose a generator based on hierarchical dual-style AdaIN to reorganize content-style representations from structure to texture gradually. Extensive experiments demonstrate the superiority of our model in generating high-quality artistic font images with unseen style compositions against other state-of-the-art methods. The source code and data are available at moonlight03.github.io/CAFS-GAN/.

IJCAI Conference 2023 Conference Paper

Contact2Grasp: 3D Grasp Synthesis via Hand-Object Contact Constraint

  • Haoming Li
  • Xinzhuo Lin
  • Yang Zhou
  • Xiang Li
  • Yuchi Huo
  • Jiming Chen
  • Qi Ye

3D grasp synthesis generates grasping poses given an input object. Existing works tackle the problem by learning a direct mapping from objects to the distributions of grasping poses. However, because physical contact is sensitive to small changes in pose, the highly nonlinear mapping from 3D object representations to valid poses is considerably non-smooth, leading to poor generation efficiency and restricted generality. To tackle the challenge, we introduce an intermediate variable for grasp contact areas to constrain the grasp generation; in other words, we factorize the mapping into two sequential stages by assuming that grasping poses are fully constrained given contact maps: 1) we first learn contact map distributions to generate the potential contact maps for grasps; 2) then learn a mapping from the contact maps to the grasping poses. Further, we propose a penetration-aware optimization with the generated contacts as a consistency constraint for grasp refinement. Extensive validations on two public datasets show that our method outperforms state-of-the-art methods regarding grasp generation on various metrics.

AAAI Conference 2023 Conference Paper

Curriculum Temperature for Knowledge Distillation

  • Zheng Li
  • Xiang Li
  • Lingfeng Yang
  • Borui Zhao
  • Renjie Song
  • Lei Luo
  • Jun Li
  • Jian Yang

Most existing distillation methods ignore the flexible role of the temperature in the loss function and fix it as a hyper-parameter that can be decided by an inefficient grid search. In general, the temperature controls the discrepancy between two distributions and can faithfully determine the difficulty level of the distillation task. Keeping a constant temperature, i.e., a fixed level of task difficulty, is usually sub-optimal for a growing student during its progressive learning stages. In this paper, we propose a simple curriculum-based technique, termed Curriculum Temperature for Knowledge Distillation (CTKD), which controls the task difficulty level during the student's learning career through a dynamic and learnable temperature. Specifically, following an easy-to-hard curriculum, we gradually increase the distillation loss w.r.t. the temperature, leading to increased distillation difficulty in an adversarial manner. As an easy-to-use plug-in technique, CTKD can be seamlessly integrated into existing knowledge distillation frameworks and brings general improvements at a negligible additional computation cost. Extensive experiments on CIFAR-100, ImageNet-2012, and MS-COCO demonstrate the effectiveness of our method.
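
A minimal sketch of a temperature-scaled distillation loss; in CTKD the temperature would be a learnable parameter driven along an easy-to-hard curriculum, whereas the fixed value below is only an illustrative placeholder.

import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T):
    # Standard temperature-scaled KL distillation loss; in CTKD the temperature
    # T would be learnable and pushed toward harder values as training proceeds.
    p_t = F.softmax(teacher_logits / T, dim=-1)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)

s = torch.randn(8, 100)
t = torch.randn(8, 100)
print(kd_loss(s, t, T=4.0))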

AAAI Conference 2023 Conference Paper

Decision-Making Context Interaction Network for Click-Through Rate Prediction

  • Xiang Li
  • Shuwei Chen
  • Jian Dong
  • Jin Zhang
  • Yongkang Wang
  • Xingxing Wang
  • Dong Wang

Click-through rate (CTR) prediction is crucial in recommendation and online advertising systems. Existing methods usually model user behaviors, while ignoring the informative context which influences the user to make a click decision, e.g., click pages and pre-ranking candidates that inform inferences about user interests, leading to suboptimal performance. In this paper, we propose a Decision-Making Context Interaction Network (DCIN), which deploys a carefully designed Context Interaction Unit (CIU) to learn decision-making contexts and thus benefits CTR prediction. In addition, the relationship between different decision-making context sources is explored by the proposed Adaptive Interest Aggregation Unit (AIAU) to improve CTR prediction further. In experiments on public and industrial datasets, DCIN significantly outperforms the state-of-the-art methods. Notably, the model obtained improvements of CTR +2.9% / CPM +2.1% / GMV +1.5% in online A/B testing and has served the main traffic of the Meituan Waimai advertising system.

AAAI Conference 2023 Conference Paper

DesNet: Decomposed Scale-Consistent Network for Unsupervised Depth Completion

  • Zhiqiang Yan
  • Kun Wang
  • Xiang Li
  • Zhenyu Zhang
  • Jun Li
  • Jian Yang

Unsupervised depth completion aims to recover dense depth from the sparse one without using the ground-truth annotation. Although depth measurement obtained from LiDAR is usually sparse, it contains valid and real distance information, i.e., scale-consistent absolute depth values. Meanwhile, scale-agnostic counterparts seek to estimate relative depth and have achieved impressive performance. To leverage both inherent characteristics, we suggest modeling scale-consistent depth upon unsupervised scale-agnostic frameworks. Specifically, we propose the decomposed scale-consistent learning (DSCL) strategy, which disintegrates the absolute depth into relative depth prediction and global scale estimation, contributing to individual learning benefits. Unfortunately, most existing unsupervised scale-agnostic frameworks heavily suffer from depth holes due to the extremely sparse depth input and weak supervisory signal. To tackle this issue, we introduce the global depth guidance (GDG) module, which attentively propagates dense depth reference into the sparse target via novel dense-to-sparse attention. Extensive experiments show the superiority of our method on outdoor KITTI, ranking 1st and outperforming the best KBNet by more than 12% in RMSE. Additionally, our approach achieves state-of-the-art performance on the indoor NYUv2 benchmark as well.

NeurIPS Conference 2023 Conference Paper

DFRD: Data-Free Robustness Distillation for Heterogeneous Federated Learning

  • Kangyang Luo
  • Shuai Wang
  • Yexuan Fu
  • Xiang Li
  • Yunshi Lan
  • Ming Gao

Federated Learning (FL) is a privacy-constrained decentralized machine learning paradigm in which clients enable collaborative training without compromising private data. However, how to learn a robust global model in the data-heterogeneous and model-heterogeneous FL scenarios is challenging. To address it, we resort to data-free knowledge distillation to propose a new FL method (namely DFRD). DFRD equips a conditional generator on the server to approximate the training space of the local models uploaded by clients, and systematically investigates its training in terms of fidelity, transferability and diversity. To overcome the catastrophic forgetting of the global model caused by the distribution shifts of the generator across communication rounds, we maintain an exponential moving average copy of the generator on the server. Additionally, we propose dynamic weighting and label sampling to accurately extract knowledge from local models. Finally, our extensive experiments on various image classification tasks illustrate that DFRD achieves significant performance gains compared to SOTA baselines.

NeurIPS Conference 2023 Conference Paper

Fine-Grained Visual Prompting

  • Lingfeng Yang
  • Yueze Wang
  • Xiang Li
  • Xinlong Wang
  • Jian Yang

Vision-Language Models (VLMs), such as CLIP, have demonstrated impressive zero-shot transfer capabilities in image-level visual perception. However, these models have shown limited performance in instance-level tasks that demand precise localization and recognition. Previous works have suggested that incorporating visual prompts, such as colorful boxes or circles, can improve the ability of models to recognize objects of interest. Nonetheless, compared to language prompting, visual prompting designs are rarely explored. Existing approaches, which employ coarse visual cues such as colorful boxes or circles, often result in sub-optimal performance due to the inclusion of irrelevant and noisy pixels. In this paper, we carefully study the visual prompting designs by exploring more fine-grained markings, such as segmentation masks and their variations. In addition, we introduce a new zero-shot framework that leverages pixel-level annotations acquired from a generalist segmentation model for fine-grained visual prompting. Consequently, our investigation reveals that a straightforward application of blur outside the target mask, referred to as the Blur Reverse Mask, exhibits exceptional effectiveness. This proposed prompting strategy leverages the precise mask annotations to reduce focus on weakly related regions while retaining spatial coherence between the target and the surrounding background. Our Fine-Grained Visual Prompting (FGVP) demonstrates superior performance in zero-shot comprehension of referring expressions on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks. It outperforms prior methods by an average margin of 3.0% to 4.6%, with a maximum improvement of 12.5% on the RefCOCO+ testA subset. The part detection experiments conducted on the PACO dataset further validate the preponderance of FGVP over existing visual prompting techniques. Code is available at https://github.com/ylingfeng/FGVP.
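
A minimal sketch of the Blur Reverse Mask prompt described above: the background outside the target mask is blurred while the object itself stays sharp; the blur strength and array shapes are illustrative assumptions.

import numpy as np
from scipy.ndimage import gaussian_filter

def blur_reverse_mask(image, mask, sigma=10.0):
    # Blur the background (mask == 0) and keep the masked object crisp,
    # producing the visual prompt image fed to the VLM.
    blurred = gaussian_filter(image, sigma=(sigma, sigma, 0))
    return np.where(mask[..., None].astype(bool), image, blurred)

img = np.random.rand(224, 224, 3)
m = np.zeros((224, 224)); m[60:160, 60:160] = 1
prompt_img = blur_reverse_mask(img, m)
print(prompt_img.shape)  # (224, 224, 3)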

ICRA Conference 2023 Conference Paper

Hierarchical Intention Tracking for Robust Human-Robot Collaboration in Industrial Assembly Tasks

  • Zhe Huang 0010
  • Ye-Ji Mun
  • Xiang Li
  • Yiqing Xie
  • Ninghan Zhong
  • Weihang Liang
  • Junyi Geng
  • Tan Chen 0001

Collaborative robots require effective human intention estimation to safely and smoothly work with humans in less structured tasks such as industrial assembly, where human intention continuously changes. We propose the concept of intention tracking and introduce a collaborative robot system that concurrently tracks intentions at hierarchical levels. The high-level intention is tracked to estimate the human's interaction pattern and enable the robot to (1) avoid collisions with the human to minimize interruption and (2) assist the human in correcting failures. The low-level intention estimate provides the robot with task-related information. We implement the system on a UR5e robot and demonstrate robust, seamless and ergonomic human-robot collaboration in an ablative pilot study of an assembly use case.

NeurIPS Conference 2023 Conference Paper

LD2: Scalable Heterophilous Graph Neural Network with Decoupled Embeddings

  • Ningyi Liao
  • Siqiang Luo
  • Xiang Li
  • Jieming Shi

Heterophilous Graph Neural Network (GNN) is a family of GNNs that specializes in learning graphs under heterophily, where connected nodes tend to have different labels. Most existing heterophilous models incorporate iterative non-local computations to capture node relationships. However, these approaches have limited application to large-scale graphs due to their high computational costs and challenges in adopting minibatch schemes. In this work, we study the scalability issues of heterophilous GNN and propose a scalable model, LD2, which simplifies the learning process by decoupling graph propagation and generating expressive embeddings prior to training. Theoretical analysis demonstrates that LD2 achieves optimal time complexity in training, as well as a memory footprint that remains independent of the graph scale. We conduct extensive experiments to showcase that our model is capable of lightweight minibatch training on large-scale heterophilous graphs, with up to $15\times$ speed improvement and efficient memory utilization, while maintaining comparable or better performance than the baselines.

NeurIPS Conference 2023 Conference Paper

Learning to Compress Prompts with Gist Tokens

  • Jesse Mu
  • Xiang Li
  • Noah Goodman

Prompting is the primary way to utilize the multitask capabilities of language models (LMs), but prompts occupy valuable space in the input context window, and repeatedly encoding the same prompt is computationally inefficient. Finetuning and distillation methods allow for specialization of LMs without prompting, but require retraining the model for each task. To avoid this trade-off entirely, we present gisting, which trains an LM to compress prompts into smaller sets of "gist" tokens which can be cached and reused for compute efficiency. Gist models can be trained with no additional cost over standard instruction finetuning by simply modifying Transformer attention masks to encourage prompt compression. On decoder (LLaMA-7B) and encoder-decoder (FLAN-T5-XXL) LMs, gisting enables up to 26x compression of prompts, resulting in up to 40% FLOPs reductions, 4.2% wall-time speedups, and storage savings, all with minimal loss in output quality.
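
A minimal sketch of the attention-mask modification behind gisting, under the simplifying assumption of a decoder-only model in which tokens after the gist tokens may attend to the gist tokens but not to the raw prompt; the paper's exact masking rules differ in detail.

import torch

def gist_attention_mask(prompt_len, num_gist, rest_len):
    # Build a causal mask in which positions after the gist tokens cannot
    # attend back to the raw prompt, forcing its content through the gists.
    n = prompt_len + num_gist + rest_len
    mask = torch.tril(torch.ones(n, n)).bool()  # standard causal mask
    gist_end = prompt_len + num_gist
    mask[gist_end:, :prompt_len] = False        # block attention to the prompt
    return mask

print(gist_attention_mask(prompt_len=3, num_gist=1, rest_len=2).int())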

AAAI Conference 2023 Conference Paper

LWSIS: LiDAR-Guided Weakly Supervised Instance Segmentation for Autonomous Driving

  • Xiang Li
  • Junbo Yin
  • Botian Shi
  • Yikang Li
  • Ruigang Yang
  • Jianbing Shen

Image instance segmentation is a fundamental research topic in autonomous driving, which is crucial for scene understanding and road safety. Advanced learning-based approaches often rely on the costly 2D mask annotations for training. In this paper, we present a more artful framework, LiDAR-guided Weakly Supervised Instance Segmentation (LWSIS), which leverages the off-the-shelf 3D data, i.e., Point Cloud, together with the 3D boxes, as natural weak supervisions for training the 2D image instance segmentation models. Our LWSIS not only exploits the complementary information in multimodal data during training but also significantly reduces the annotation cost of the dense 2D masks. In detail, LWSIS consists of two crucial modules, Point Label Assignment (PLA) and Graph-based Consistency Regularization (GCR). The former module aims to automatically assign the 3D point cloud as 2D point-wise labels, while the latter further refines the predictions by enforcing geometry and appearance consistency of the multimodal data. Moreover, we conduct a secondary instance segmentation annotation on the nuScenes dataset, named nuInsSeg, to encourage further research on multimodal perception tasks. Extensive experiments on the nuInsSeg, as well as the large-scale Waymo, show that LWSIS can substantially improve existing weakly supervised segmentation models by only involving 3D data during training. Additionally, LWSIS can also be incorporated into 3D object detectors like PointPainting to boost the 3D detection performance for free. The code and dataset are available at https://github.com/Serenos/LWSIS.

ICLR Conference 2023 Conference Paper

Near-optimal Policy Identification in Active Reinforcement Learning

  • Xiang Li
  • Viraj Mehta
  • Johannes Kirschner
  • Ian Char
  • Willie Neiswanger
  • Jeff G. Schneider
  • Andreas Krause 0001
  • Ilija Bogunovic

Many real-world reinforcement learning tasks require control of complex dynamical systems that involve both costly data acquisition processes and large state spaces. In cases where the expensive transition dynamics can be readily evaluated at specified states (e.g., via a simulator), agents can operate in what is often referred to as planning with a \emph{generative model}. We propose the AE-LSVI algorithm for best policy identification, a novel variant of the kernelized least-squares value iteration (LSVI) algorithm that combines optimism with pessimism for active exploration (AE). AE-LSVI provably identifies a near-optimal policy \emph{uniformly} over an entire state space and achieves polynomial sample complexity guarantees that are independent of the number of states. When specialized to the recently introduced offline contextual Bayesian optimization setting, our algorithm achieves improved sample complexity bounds. Experimentally, we demonstrate that AE-LSVI outperforms other RL algorithms in a variety of environments when robustness to the initial state is required.

NeurIPS Conference 2023 Conference Paper

PaintSeg: Painting Pixels for Training-free Segmentation

  • Xiang Li
  • Chung-Ching Lin
  • Yinpeng Chen
  • Zicheng Liu
  • Jinglu Wang
  • Rita Singh
  • Bhiksha Raj

The paper introduces PaintSeg, a new unsupervised method for segmenting objects without any training. We propose an adversarial masked contrastive painting (AMCP) process, which creates a contrast between the original image and a painted image in which a masked area is painted using off-the-shelf generative models. During the painting process, inpainting and outpainting are alternated, with the former masking the foreground and filling in the background, and the latter masking the background while recovering the missing part of the foreground object. Inpainting and outpainting, also referred to as I-step and O-step, allow our method to gradually advance the target segmentation mask toward the ground truth without supervision or training. PaintSeg can be configured to work with a variety of prompts, e.g., coarse masks, boxes, scribbles, and points. Our experimental results demonstrate that PaintSeg outperforms existing approaches in coarse mask-prompt, box-prompt, and point-prompt segmentation tasks, providing a training-free solution suitable for unsupervised segmentation. Code: https://github.com/lxa9867/PaintSeg.

AAAI Conference 2023 Conference Paper

Panoramic Video Salient Object Detection with Ambisonic Audio Guidance

  • Xiang Li
  • Haoyuan Cao
  • Shijie Zhao
  • Junlin Li
  • Li Zhang
  • Bhiksha Raj

Video salient object detection (VSOD), as a fundamental computer vision problem, has been extensively discussed in the last decade. However, all existing works focus on addressing the VSOD problem in 2D scenarios. With the rapid development of VR devices, panoramic videos have been a promising alternative to 2D videos to provide immersive feelings of the real world. In this paper, we aim to tackle the video salient object detection problem for panoramic videos, with their corresponding ambisonic audios. A multimodal fusion module equipped with two pseudo-siamese audio-visual context fusion (ACF) blocks is proposed to effectively conduct audio-visual interaction. The ACF block equipped with spherical positional encoding enables the fusion in the 3D context to capture the spatial correspondence between pixels and sound sources from the equirectangular frames and ambisonic audios. Experimental results verify the effectiveness of our proposed components and demonstrate that our method achieves state-of-the-art performance on the ASOD60K dataset.

AAAI Conference 2023 Conference Paper

PGSS: Pitch-Guided Speech Separation

  • Xiang Li
  • Yiwen Wang
  • Yifan Sun
  • Xihong Wu
  • Jing Chen

Monaural speech separation aims to separate concurrent speakers from a single-microphone mixture recording. Inspired by the effect of pitch priming in auditory scene analysis (ASA) mechanisms, a novel pitch-guided speech separation framework is proposed in this work. The prominent advantage of this framework is that both the permutation problem and the unknown speaker number problem existing in general models can be avoided by using pitch contours as the primary means to guide the target speaker. In addition, adversarial training is applied, instead of a traditional time-frequency mask, to improve the perceptual quality of separated speech. Specifically, the proposed framework can be divided into two phases: pitch extraction and speech separation. The former aims to extract pitch contour candidates for each speaker from the mixture, modeling the bottom-up process in ASA mechanisms. Any pitch contour can be selected as the condition in the second phase to separate the corresponding speaker, where a conditional generative adversarial network (CGAN) is applied. The second phase models the effect of pitch priming in ASA. Experiments on the WSJ0-2mix corpus reveal that the proposed approaches can achieve higher pitch extraction accuracy and better separation performance, compared to the baseline models, and have the potential to be applied to SOTA architectures.

JBHI Journal 2023 Journal Article

Prediction of New-Onset Diabetes After Pancreatectomy With Subspace Clustering Based Multi-View Feature Selection

  • Peijun Hu
  • Xiang Li
  • Na Lu
  • Kaiqi Dong
  • Xueli Bai
  • Tingbo Liang
  • Jingsong Li

The pancreas plays an important role in glucose metabolism, and developing diabetes or long-term glucose metabolism disturbance may be a prevalent sequela after pancreatectomy. Nevertheless, the relevant factors of new-onset diabetes after pancreatectomy remain unclear. Radiomics analysis has the potential to identify image markers for disease prediction or prognosis. Meanwhile, the combination of imaging and electronic medical record (EMR) data showed superior performance to imaging or EMR alone in previous studies. One critical step is to identify predictors from high-dimensional features, and it is even more challenging to select and fuse imaging and EMR features. In this work, we develop a radiomics pipeline to assess the postoperative new-onset diabetes risk of patients undergoing distal pancreatectomy. Specifically, we extract multiscale image features with 3D wavelet transformation, and include patients’ characteristics, body composition and pancreas volume information as clinical features. Then, we propose a multi-view subspace clustering guided feature selection method (MSCUFS) for the selection and fusion of image and clinical features. Finally, a prediction model is constructed with a classical machine learning classifier. Experimental results on an established distal pancreatectomy cohort showed that the SVM model with combined imaging and EMR features demonstrated good discrimination, with an AUC value of 0.824, which improved upon the model with image features alone by 0.037 AUC. Compared with state-of-the-art feature selection methods, the proposed MSCUFS has superior performance in fusing image and clinical features.

AAAI Conference 2023 Conference Paper

Recurrent Structure Attention Guidance for Depth Super-resolution

  • Jiayi Yuan
  • Haobo Jiang
  • Xiang Li
  • Jianjun Qian
  • Jun Li
  • Jian Yang

Image guidance is an effective strategy for depth super-resolution. Generally, most existing methods employ hand-crafted operators to decompose the high-frequency (HF) and low-frequency (LF) ingredients from low-resolution depth maps and guide the HF ingredients by directly concatenating them with image features. However, the hand-designed operators usually cause inferior HF maps (e.g., distorted or structurally missing) due to the diverse appearance of complex depth maps. Moreover, the direct concatenation often results in weak guidance because not all image features have a positive effect on the HF maps. In this paper, we develop a recurrent structure attention guided (RSAG) framework, consisting of two important parts. First, we introduce a deep contrastive network with multi-scale filters for adaptive frequency-domain separation, which adopts contrastive networks from large filters to small ones to calculate the pixel contrasts for adaptive high-quality HF predictions. Second, instead of the coarse concatenation guidance, we propose a recurrent structure attention block, which iteratively utilizes the latest depth estimation and the image features to jointly select clear patterns and boundaries, aiming at providing refined guidance for accurate depth recovery. In addition, we fuse the features of HF maps to enhance the edge structures in the decomposed LF maps. Extensive experiments show that our approach obtains superior performance compared with state-of-the-art depth super-resolution methods. Our code is available at: https://github.com/Yuanjiayii/DSR-RSAG.

AAAI Conference 2023 Conference Paper

Structure Flow-Guided Network for Real Depth Super-resolution

  • Jiayi Yuan
  • Haobo Jiang
  • Xiang Li
  • Jianjun Qian
  • Jun Li
  • Jian Yang

Real depth super-resolution (DSR), unlike synthetic settings, is a challenging task due to the structural distortion and the edge noise caused by the natural degradation in real-world low-resolution (LR) depth maps. These defects result in significant structure inconsistency between the depth map and the RGB guidance, which potentially confuses the RGB-structure guidance and thereby degrades the DSR quality. In this paper, we propose a novel structure flow-guided DSR framework, where a cross-modality flow map is learned to guide the RGB-structure information transferring for precise depth upsampling. Specifically, our framework consists of a cross-modality flow-guided upsampling network (CFUNet) and a flow-enhanced pyramid edge attention network (PEANet). CFUNet contains a trilateral self-attention module combining both the geometric and semantic correlations for reliable cross-modality flow learning. Then, the learned flow maps are combined with the grid-sampling mechanism for coarse high-resolution (HR) depth prediction. PEANet aims to integrate the learned flow map as edge attention into a pyramid network to hierarchically learn the edge-focused guidance feature for depth edge refinement. Extensive experiments on real and synthetic DSR datasets verify that our approach achieves excellent performance compared to state-of-the-art methods. Our code is available at: https://github.com/Yuanjiayii/DSR-SFG.

JBHI Journal 2023 Journal Article

The Individualized Prediction of Neurocognitive Function in People Living with HIV Based on Clinical and Multimodal Connectome Data

  • Xiang Li
  • Sheri L. Towe
  • Ryan P. Bell
  • Rongtao Jiang
  • Shana A. Hall
  • Vince D. Calhoun
  • Christina S. Meade
  • Jing Sui

Neurocognitive impairment continues to be a common comorbidity for people living with HIV (PLWH). Given the chronic nature of HIV disease, identifying reliable biomarkers of these impairments is essential to advance our understanding of the underlying neural foundation and facilitate screening and diagnosis in clinical care. While neuroimaging provides immense potential for such biomarkers, to date, investigations in PLWH have been mostly limited to either univariate mass techniques or a single neuroimaging modality. In the present study, connectome-based predictive modeling (CPM) was proposed to predict individual differences of cognitive functioning in PLWH, using resting-state functional connectivity (FC), white matter structural connectivity (SC), and clinically relevant measures. We also adopted an efficient feature selection approach to identify the most predictive features, which achieved an optimal prediction accuracy of r = 0.61 in the discovery dataset (n = 102) and r = 0.45 in an independent validation HIV cohort (n = 88). Two brain templates and nine distinct prediction models were also tested for better modeling generalizability. Results show that combining multimodal FC and SC features enabled higher prediction accuracy of cognitive scores in PLWH, while adding clinical and demographic metrics may further improve the prediction by introducing complementary information, which may help better evaluate the individual-level cognitive performance in PLWH.

ICLR Conference 2023 Conference Paper

TiAda: A Time-scale Adaptive Algorithm for Nonconvex Minimax Optimization

  • Xiang Li
  • Junchi Yang
  • Niao He

Adaptive gradient methods have shown their ability to adjust the stepsizes on the fly in a parameter-agnostic manner, and empirically achieve faster convergence for solving minimization problems. When it comes to nonconvex minimax optimization, however, current convergence analyses of gradient descent ascent (GDA) combined with adaptive stepsizes require careful tuning of hyper-parameters and the knowledge of problem-dependent parameters. Such a discrepancy arises from the primal-dual nature of minimax problems and the necessity of delicate time-scale separation between the primal and dual updates in attaining convergence. In this work, we propose a single-loop adaptive GDA algorithm called TiAda for nonconvex minimax optimization that automatically adapts to the time-scale separation. Our algorithm is fully parameter-agnostic and can achieve near-optimal complexities simultaneously in deterministic and stochastic settings of nonconvex-strongly-concave minimax problems. The effectiveness of the proposed method is further justified numerically for a number of machine learning applications.

NeurIPS Conference 2023 Conference Paper

Two Sides of One Coin: the Limits of Untuned SGD and the Power of Adaptive Methods

  • Junchi Yang
  • Xiang Li
  • Ilyas Fatkhullin
  • Niao He

The classical analysis of Stochastic Gradient Descent (SGD) with polynomially decaying stepsize $\eta_t = \eta/\sqrt{t}$ relies on well-tuned $\eta$ depending on problem parameters such as Lipschitz smoothness constant, which is often unknown in practice. In this work, we prove that SGD with arbitrary $\eta > 0$, referred to as untuned SGD, still attains an order-optimal convergence rate $\widetilde{\mathcal{O}}(T^{-1/4})$ in terms of gradient norm for minimizing smooth objectives. Unfortunately, it comes at the expense of a catastrophic exponential dependence on the smoothness constant, which we show is unavoidable for this scheme even in the noiseless setting. We then examine three families of adaptive methods — Normalized SGD (NSGD), AMSGrad, and AdaGrad — unveiling their power in preventing such exponential dependency in the absence of information about the smoothness parameter and boundedness of stochastic gradients. Our results provide theoretical justification for the advantage of adaptive methods over untuned SGD in alleviating the issue with large gradients.
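
A minimal sketch contrasting the untuned decaying-stepsize SGD update with the Normalized SGD (NSGD) update discussed above, on a toy quadratic; the objective and constants are illustrative only and carry no claim about the paper's rates.

import numpy as np

def grad(w):                 # toy smooth objective f(w) = 0.5 * 5 * w^2
    return 5.0 * w

w_sgd, w_nsgd, eta = 1.0, 1.0, 1.0
for t in range(1, 101):
    w_sgd -= eta / np.sqrt(t) * grad(w_sgd)             # untuned SGD, stepsize eta/sqrt(t)
    g = grad(w_nsgd)
    w_nsgd -= eta / np.sqrt(t) * g / (abs(g) + 1e-12)   # NSGD: direction only, normalized step
print(w_sgd, w_nsgd)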

NeurIPS Conference 2023 Conference Paper

YouTubePD: A Multimodal Benchmark for Parkinson’s Disease Analysis

  • Andy Zhou
  • Samuel Li
  • Pranav Sriram
  • Xiang Li
  • Jiahua Dong
  • Ansh Sharma
  • Yuanyi Zhong
  • Shirui Luo

The healthcare and AI communities have witnessed a growing interest in the development of AI-assisted systems for automated diagnosis of Parkinson's Disease (PD), one of the most prevalent neurodegenerative disorders. However, the progress in this area has been significantly impeded by the absence of a unified, publicly available benchmark, which prevents comprehensive evaluation of existing PD analysis methods and the development of advanced models. This work overcomes these challenges by introducing YouTubePD -- the first publicly available multimodal benchmark designed for PD analysis. We crowd-source existing videos featured with PD from YouTube, exploit multimodal information including in-the-wild videos, audio data, and facial landmarks across 200+ subject videos, and provide dense and diverse annotations from clinical experts. Based on our benchmark, we propose three challenging and complementary tasks encompassing both discriminative and generative tasks, along with a comprehensive set of corresponding baselines. Experimental evaluation showcases the potential of modern deep learning and computer vision techniques, in particular the generalizability of the models developed on YouTubePD to real-world clinical settings, while revealing their limitations. We hope our work paves the way for future research in this direction.

NeurIPS Conference 2022 Conference Paper

Asymptotic Behaviors of Projected Stochastic Approximation: A Jump Diffusion Perspective

  • Jiadong Liang
  • Yuze Han
  • Xiang Li
  • Zhihua Zhang

In this paper, we consider linearly constrained stochastic approximation problems with federated learning (FL) as a special case. We propose a stochastic approximation algorithm named LPSA with probabilistic projections to ensure feasibility, so that projections are performed with probability $p_n$ at the $n$-th iteration. Considering a specific family of the probability $p_n$ and step size $\eta_n$, we analyze our algorithm from an asymptotic and continuous perspective. Using a novel jump diffusion approximation, we show that the trajectories consisting of properly rescaled last iterates weakly converge to the solution of specific SDEs. By analyzing the SDEs, we identify the asymptotic behaviors of LPSA for different choices of $(p_n, \eta_n)$. We find the algorithm presents an intriguing asymptotic bias-variance trade-off according to the relative magnitude of $p_n$ w.r.t. $\eta_n$. This provides insights into how to choose appropriate $\{(p_n, \eta_n)\}_{n \geq 1}$ to minimize the projection complexity.

IJCAI Conference 2022 Conference Paper

CGMN: A Contrastive Graph Matching Network for Self-Supervised Graph Similarity Learning

  • Di Jin
  • Luzhi Wang
  • Yizhen Zheng
  • Xiang Li
  • Fei Jiang
  • Wei Lin
  • Shirui Pan

Graph similarity learning refers to calculating the similarity score between two graphs, which is required in many realistic applications, such as visual tracking, graph classification, and collaborative filtering. As most of the existing graph neural networks yield effective graph representations of a single graph, little effort has been made for jointly learning two graph representations and calculating their similarity score. In addition, existing unsupervised graph similarity learning methods are mainly clustering-based, which ignores the valuable information embodied in graph pairs. To this end, we propose a contrastive graph matching network (CGMN) for self-supervised graph similarity learning in order to calculate the similarity between any two input graph objects. Specifically, we generate two augmented views for each graph in a pair respectively. Then, we employ two strategies, namely cross-view interaction and cross-graph interaction, for effective node representation learning. The former is used to strengthen the consistency of node representations in the two views. The latter is utilized to identify node differences between different graphs. Finally, we transform node representations into graph-level representations via pooling operations for graph similarity computation. We have evaluated CGMN on eight real-world datasets, and the experiment results show that the proposed new approach is superior to the state-of-the-art methods in graph similarity learning downstream tasks.

NeurIPS Conference 2022 Conference Paper

Diffusion-LM Improves Controllable Text Generation

  • Xiang Li
  • John Thickstun
  • Ishaan Gulrajani
  • Percy S. Liang
  • Tatsunori B. Hashimoto

Controlling the behavior of language models (LMs) without re-training is a major open problem in natural language generation. While recent works have demonstrated successes on controlling simple sentence attributes (e.g., sentiment), there has been little progress on complex, fine-grained controls (e.g., syntactic structure). To address this challenge, we develop a new non-autoregressive language model based on continuous diffusions that we call Diffusion-LM. Building upon the recent successes of diffusion models in continuous domains, Diffusion-LM iteratively denoises a sequence of Gaussian vectors into word vectors, yielding a sequence of intermediate latent variables. The continuous, hierarchical nature of these intermediate variables enables a simple gradient-based algorithm to perform complex, controllable generation tasks. We demonstrate successful control of Diffusion-LM for six challenging fine-grained control tasks, significantly outperforming prior work.

NeurIPS Conference 2022 Conference Paper

Does Self-supervised Learning Really Improve Reinforcement Learning from Pixels?

  • Xiang Li
  • Jinghuan Shang
  • Srijan Das
  • Michael Ryoo

We investigate whether self-supervised learning (SSL) can improve online reinforcement learning (RL) from pixels. We extend the contrastive reinforcement learning framework (e.g., CURL) that jointly optimizes SSL and RL losses and conduct extensive experiments with various self-supervised losses. Our observations suggest that the existing SSL framework for RL fails to bring meaningful improvement over the baselines only taking advantage of image augmentation when the same amount of data and augmentation is used. We further perform evolutionary searches to find the optimal combination of multiple self-supervised losses for RL, but find that even such a loss combination fails to meaningfully outperform the methods that only utilize carefully designed image augmentations. After evaluating these approaches together in multiple different environments including a real-world robot environment, we confirm that no single self-supervised loss or image augmentation method can dominate all environments and that the current framework for joint optimization of SSL and RL is limited. Finally, we conduct ablation studies on multiple factors and demonstrate the properties of representations learned with different approaches.

NeurIPS Conference 2022 Conference Paper

DTG-SSOD: Dense Teacher Guidance for Semi-Supervised Object Detection

  • Gang Li
  • Xiang Li
  • Yujie Wang
  • Yichao Wu
  • Ding Liang
  • Shanshan Zhang

The Mean-Teacher (MT) scheme is widely adopted in semi-supervised object detection (SSOD). In MT, sparse pseudo labels, offered by the final predictions of the teacher (e.g., after Non-Maximum Suppression (NMS) post-processing), are adopted for the dense supervision of the student via hand-crafted label assignment. However, the "sparse-to-dense" paradigm complicates the pipeline of SSOD, and simultaneously neglects the powerful direct, dense teacher supervision. In this paper, we attempt to directly leverage the dense guidance of the teacher to supervise student training, i.e., the "dense-to-dense" paradigm. Specifically, we propose the Inverse NMS Clustering (INC) and Rank Matching (RM) to instantiate the dense supervision, without the widely used, conventional sparse pseudo labels. INC leads the student to group candidate boxes into clusters in NMS as the teacher does, which is implemented by learning the grouping information revealed in the NMS procedure of the teacher. After obtaining the same grouping scheme as the teacher via INC, the student further imitates the rank distribution of the teacher over clustered candidates through Rank Matching. With the proposed INC and RM, we integrate Dense Teacher Guidance into Semi-Supervised Object Detection (termed "DTG-SSOD"), successfully abandoning sparse pseudo labels and enabling more informative learning on unlabeled data. On the COCO benchmark, our DTG-SSOD achieves state-of-the-art performance under various labelling ratios. For example, under the 10% labelling ratio, DTG-SSOD improves the supervised baseline from 26.9 to 35.9 mAP, outperforming the previous best method Soft Teacher by 1.9 points.

AAAI Conference 2022 Conference Paper

Hybrid Instance-Aware Temporal Fusion for Online Video Instance Segmentation

  • Xiang Li
  • Jinglu Wang
  • Xiao Li
  • Yan Lu

Recently, transformer-based image segmentation methods have achieved notable success against previous solutions. For video domains, however, how to effectively model temporal context with the attention of object instances across frames remains an open problem. In this paper, we propose an online video instance segmentation framework with a novel instance-aware temporal fusion method. We first leverage the representation, i.e., a latent code in the global context (instance code) and CNN feature maps, to represent instance- and pixel-level features. Based on this representation, we introduce a cropping-free temporal fusion approach to model the temporal consistency between video frames. Specifically, we encode global instance-specific information in the instance code and build up inter-frame contextual fusion with hybrid attentions between the instance codes and CNN feature maps. Inter-frame consistency between the instance codes is further enforced with order constraints. By leveraging the learned hybrid temporal consistency, we are able to directly retrieve and maintain instance identities across frames, eliminating the complicated frame-wise instance matching in prior methods. Extensive experiments have been conducted on popular VIS datasets, i.e., YouTube-VIS-19/21. Our model achieves the best performance among all online VIS methods. Notably, our model also eclipses all offline methods when using the ResNet-50 backbone.

AAAI Conference 2022 Conference Paper

Knowledge Distillation for Object Detection via Rank Mimicking and Prediction-Guided Feature Imitation

  • Gang Li
  • Xiang Li
  • Yujie Wang
  • Shanshan Zhang
  • Yichao Wu
  • Ding Liang

Knowledge Distillation (KD) is a widely-used technology to inherit information from cumbersome teacher models to compact student models, consequently realizing model compression and acceleration. Compared with image classification, object detection is a more complex task, and designing specific KD methods for object detection is non-trivial. In this work, we elaborately study the behaviour difference between the teacher and student detection models, and obtain two intriguing observations: First, the teacher and student rank their detected candidate boxes quite differently, which results in their precision discrepancy. Second, there is a considerable gap between the feature response differences and prediction differences between teacher and student, indicating that equally imitating all the feature maps of the teacher is the sub-optimal choice for improving the student’s accuracy. Based on the two observations, we propose Rank Mimicking (RM) and Prediction-guided Feature Imitation (PFI) for distilling one-stage detectors, respectively. RM takes the rank of candidate boxes from teachers as a new form of knowledge to distill, which consistently outperforms the traditional soft label distillation. PFI attempts to correlate feature differences with prediction differences, making feature imitation directly help to improve the student’s accuracy. On MS COCO and PASCAL VOC benchmarks, extensive experiments are conducted on various detectors with different backbones to validate the effectiveness of our method. Specifically, RetinaNet with ResNet50 achieves 40.4% mAP on MS COCO, which is 3.5% higher than its baseline, and also outperforms previous KD methods.

NeurIPS Conference 2022 Conference Paper

Nest Your Adaptive Algorithm for Parameter-Agnostic Nonconvex Minimax Optimization

  • Junchi YANG
  • Xiang Li
  • Niao He

Adaptive algorithms like AdaGrad and AMSGrad are successful in nonconvex optimization owing to their parameter-agnostic ability – requiring neither a priori knowledge about problem-specific parameters nor tuning of learning rates. However, when it comes to nonconvex minimax optimization, direct extensions of such adaptive optimizers without proper time-scale separation may fail to work in practice. We provide such an example proving that the simple combination of Gradient Descent Ascent (GDA) with adaptive stepsizes can diverge if the primal-dual stepsize ratio is not carefully chosen; hence, a fortiori, such adaptive extensions are not parameter-agnostic. To address the issue, we formally introduce a Nested Adaptive framework, NeAda for short, that carries an inner loop for adaptively maximizing the dual variable with controllable stopping criteria and an outer loop for adaptively minimizing the primal variable. Such a mechanism can be equipped with off-the-shelf adaptive optimizers and automatically balance the progress in the primal and dual variables. Theoretically, for nonconvex-strongly-concave minimax problems, we show that NeAda with AdaGrad stepsizes can achieve the near-optimal $\widetilde{O}(\epsilon^{-2})$ and $\widetilde{O}(\epsilon^{-4})$ gradient complexities respectively in the deterministic and stochastic settings, without prior information on the problem's smoothness and strong concavity parameters. To the best of our knowledge, this is the first algorithm that simultaneously achieves near-optimal convergence rates and parameter-agnostic adaptation in the nonconvex minimax setting. Numerically, we further illustrate the robustness of the NeAda family with experiments on simple test functions and a real-world application.
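
The nested structure is easy to prototype. The toy sketch below (test function, constants, and stopping rule are illustrative, not from the paper) runs an inner AdaGrad ascent on the dual variable with a stopping criterion, wrapped by an outer AdaGrad descent on the primal variable:

```python
# Toy sketch of a nested adaptive loop in the spirit described above, applied
# to f(x, y) = x*y - 0.5*y**2 (an illustrative strongly-concave-in-y example).
import numpy as np

def grad_x(x, y): return y          # df/dx
def grad_y(x, y): return x - y      # df/dy

def nested_adaptive(x0=2.0, y0=0.0, outer_iters=200, inner_max=50, tol=1e-3):
    x, y = x0, y0
    gx_acc, gy_acc = 1e-8, 1e-8     # AdaGrad accumulators
    for t in range(outer_iters):
        # inner loop: adaptively maximize over y until the dual gradient is small
        for _ in range(inner_max):
            gy = grad_y(x, y)
            if abs(gy) < tol / (t + 1):
                break
            gy_acc += gy * gy
            y += gy / np.sqrt(gy_acc)       # ascent step
        gx = grad_x(x, y)
        gx_acc += gx * gx
        x -= gx / np.sqrt(gx_acc)           # descent step
    return x, y
```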

NeurIPS Conference 2022 Conference Paper

Personalized Federated Learning towards Communication Efficiency, Robustness and Fairness

  • Shiyun Lin
  • Yuze Han
  • Xiang Li
  • Zhihua Zhang

Personalized Federated Learning faces many challenges such as expensive communication costs, training-time adversarial attacks, and performance unfairness across devices. Recent developments witness a trade-off between a reference model and local models to achieve personalization. We follow the avenue and propose a personalized FL method towards the three goals. When it is time to communicate, our method projects local models into a shared-and-fixed low-dimensional random subspace and uses infimal convolution to control the deviation between the reference model and projected local models. We theoretically show our method converges for smooth objectives with square regularizers and the convergence dependence on the projection dimension is mild. We also illustrate the benefits of robustness and fairness on a class of linear problems. Finally, we conduct a large number of experiments to show the empirical superiority of our method over several state-of-the-art methods on the three aspects.
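
A minimal sketch of the communication step described above, under the assumption that clients share a random seed so the fixed projection never has to be transmitted (names and dimensions are illustrative):

```python
# Sketch: every client projects its flattened local model into a shared,
# fixed low-dimensional random subspace before communicating it.
import numpy as np

def make_projection(model_dim, subspace_dim, seed=0):
    # seeding makes the projection identical across clients and the server
    rng = np.random.default_rng(seed)
    return rng.standard_normal((subspace_dim, model_dim)) / np.sqrt(subspace_dim)

def project_local_model(local_params, P):
    """local_params: (model_dim,) flattened model; returns the (subspace_dim,) message."""
    return P @ local_params
```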

IJCAI Conference 2022 Conference Paper

RAW-GNN: RAndom Walk Aggregation based Graph Neural Network

  • Di Jin
  • Rui Wang
  • Meng Ge
  • Dongxiao He
  • Xiang Li
  • Wei Lin
  • Weixiong Zhang

Graph-Convolution-based methods have been successfully applied to representation learning on homophily graphs where nodes with the same label or similar attributes tend to connect with one another. Due to the homophily assumption of Graph Convolutional Networks (GCNs) that these methods use, they are not suitable for heterophily graphs where nodes with different labels or dissimilar attributes tend to be adjacent. Several methods have attempted to address this heterophily problem, but they do not change the fundamental aggregation mechanism of GCNs because they rely on summation operators to aggregate information from neighboring nodes, which is implicitly subject to the homophily assumption. Here, we introduce a novel aggregation mechanism and develop a RAndom Walk Aggregation-based Graph Neural Network (called RAW-GNN) method. The proposed approach integrates the random walk strategy with graph neural networks. The new method utilizes breadth-first random walk search to capture homophily information and depth-first search to collect heterophily information. It replaces the conventional neighborhoods with path-based neighborhoods and introduces a new path-based aggregator based on Recurrent Neural Networks. These designs make RAW-GNN suitable for both homophily and heterophily graphs. Extensive experimental results showed that the new method achieved state-of-the-art performance on a variety of homophily and heterophily graphs.

NeurIPS Conference 2022 Conference Paper

RecursiveMix: Mixed Learning with History

  • Lingfeng Yang
  • Xiang Li
  • Borui Zhao
  • Renjie Song
  • Jian Yang

Mix-based augmentation has been proven fundamental to the generalization of deep vision models. However, current augmentations only mix samples from the current data batch during training, which ignores the possible knowledge accumulated in the learning history. In this paper, we propose a recursive mixed-sample learning paradigm, termed ``RecursiveMix'' (RM), by exploring a novel training strategy that leverages the historical input-prediction-label triplets. More specifically, we iteratively resize the input image batch from the previous iteration and paste it into the current batch while their labels are fused proportionally to the area of the operated patches. Furthermore, a consistency loss is introduced to align the identical image semantics across the iterations, which helps the learning of scale-invariant feature representations. Based on ResNet-50, RM largely improves classification accuracy by $\sim$3.2% on CIFAR-100 and $\sim$2.8% on ImageNet with negligible extra computation/storage costs. In the downstream object detection task, the RM-pretrained model outperforms the baseline by 2.1 AP points and surpasses CutMix by 1.4 AP points under the ATSS detector on COCO. In semantic segmentation, RM also surpasses the baseline and CutMix by 1.9 and 1.1 mIoU points under UperNet on ADE20K, respectively. Codes and pretrained models are available at https://github.com/implus/RecursiveMix.
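
A simplified sketch of the history-mixing step follows; the patch placement, mixing ratio, and label fusion below are illustrative assumptions, and the released code at the URL above is authoritative:

```python
# Illustrative sketch: the previous batch is resized and pasted into a corner
# of the current batch, and the labels are mixed in proportion to the pasted area.
import torch
import torch.nn.functional as F

def recursive_mix(cur_images, cur_labels, hist_images, hist_labels, num_classes, lam=0.3):
    """cur_images/hist_images: (B, C, H, W); cur_labels/hist_labels: (B,) class ids."""
    B, C, H, W = cur_images.shape
    h, w = int(H * lam ** 0.5), int(W * lam ** 0.5)
    pasted = F.interpolate(hist_images, size=(h, w), mode="bilinear", align_corners=False)
    mixed = cur_images.clone()
    mixed[:, :, :h, :w] = pasted                       # paste the shrunken history batch
    area = (h * w) / float(H * W)
    one_hot = lambda y: F.one_hot(y, num_classes).float()
    mixed_labels = (1 - area) * one_hot(cur_labels) + area * one_hot(hist_labels)
    return mixed, mixed_labels
```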

JBHI Journal 2021 Journal Article

Automatic Pancreas Segmentation in CT Images With Distance-Based Saliency-Aware DenseASPP Network

  • Peijun Hu
  • Xiang Li
  • Yu Tian
  • Tianyu Tang
  • Tianshu Zhou
  • Xueli Bai
  • Shiqiang Zhu
  • Tingbo Liang

Pancreas identification and segmentation is an essential task in the diagnosis and prognosis of pancreas disease. Although deep neural networks have been widely applied in abdominal organ segmentation, it is still challenging for small organs (e.g., pancreas) that present low contrast, highly flexible anatomical structure and relatively small region. In recent years, coarse-to-fine methods have improved pancreas segmentation accuracy by using coarse predictions in the fine stage, but only object location is utilized and rich image context is neglected. In this paper, we propose a novel distance-based saliency-aware model, namely DSD-ASPP-Net, to fully use coarse segmentation to highlight the pancreas feature and boost accuracy in the fine segmentation stage. Specifically, a DenseASPP (Dense Atrous Spatial Pyramid Pooling) model is trained to learn the pancreas location and probability map, which is then transformed into a saliency map through geodesic distance-based saliency transformation. In the fine stage, saliency-aware modules that combine the saliency map and image context are introduced into DenseASPP to develop the DSD-ASPP-Net. The architecture of DenseASPP brings multi-scale feature representation and achieves a larger receptive field in a denser way, which overcomes the difficulties brought by variable object sizes and locations. Our method was evaluated on both the public NIH pancreas dataset and a local hospital dataset, and achieved an average Dice-Sørensen Coefficient (DSC) value of 85.49±4.77% on the NIH dataset, outperforming former coarse-to-fine methods.

AAAI Conference 2021 Conference Paper

Capturing Delayed Feedback in Conversion Rate Prediction via Elapsed-Time Sampling

  • Jia-Qi Yang
  • Xiang Li
  • Shuguang Han
  • Tao Zhuang
  • De-Chuan Zhan
  • Xiaoyi Zeng
  • Bin Tong

Conversion rate (CVR) prediction is one of the most critical tasks for digital display advertising. Commercial systems often require updating models in an online learning manner to catch up with the evolving data distribution. However, conversions usually do not happen immediately after user clicks. This may result in inaccurate labeling, which is called the delayed feedback problem. In previous studies, the delayed feedback problem is handled either by waiting for the positive label for a long period of time, or by consuming the negative sample on its arrival and then inserting a positive duplicate when the conversion happens later. Indeed, there is a trade-off between waiting for more accurate labels and utilizing fresh data, which is not considered in existing works. To strike a balance in this trade-off, we propose the Elapsed-Time Sampling Delayed Feedback Model (ES-DFM), which models the relationship between the observed conversion distribution and the true conversion distribution. Then we optimize the expectation of the true conversion distribution via importance sampling under the elapsed-time sampling distribution. We further estimate the importance weight for each instance, which is used as the weight of the loss function in CVR prediction. To demonstrate the effectiveness of ES-DFM, we conduct extensive experiments on a public dataset and a private industrial dataset. Experimental results confirm that our method consistently outperforms the previous state-of-the-art results.
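
The final training objective reduces to an importance-weighted CVR loss. The sketch below shows only that weighting step; the per-example weights are assumed to come from a separately estimated elapsed-time model, and the exact weight formulas of ES-DFM are not reproduced here:

```python
# Sketch: importance-weighted binary cross-entropy on observed (possibly
# delayed) labels, with weights supplied by an external elapsed-time model.
import torch
import torch.nn.functional as F

def weighted_cvr_loss(logits, observed_labels, pos_weight, neg_weight):
    """logits, observed_labels, pos_weight, neg_weight: (B,) tensors."""
    bce = F.binary_cross_entropy_with_logits(logits, observed_labels, reduction="none")
    weights = torch.where(observed_labels > 0.5, pos_weight, neg_weight)
    return (weights * bce).mean()
```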

AAAI Conference 2021 Conference Paper

Improving Tree-Structured Decoder Training for Code Generation via Mutual Learning

  • Binbin Xie
  • Jinsong Su
  • Yubin Ge
  • Xiang Li
  • Jianwei Cui
  • Junfeng Yao
  • Bin Wang

Code generation aims to automatically generate a piece of code given an input natural language utterance. Currently, among dominant models, it is treated as a sequence-to-tree task, where a decoder outputs a sequence of actions corresponding to the pre-order traversal of an Abstract Syntax Tree. However, such a decoder only exploits the preorder-traversal-based preceding actions, which are insufficient to ensure correct action predictions. In this paper, we first thoroughly analyze the context modeling difference between neural code generation models with different traversal-based decodings (preorder traversal vs breadth-first traversal), and then propose to introduce a mutual learning framework to jointly train these models. Under this framework, we continuously enhance both models via mutual distillation, which involves synchronous executions of two one-to-one knowledge transfers at each training step. More specifically, we alternately choose one model as the student and the other as its teacher, and require the student to fit the training data and the action prediction distributions of its teacher. By doing so, both models can fully absorb the knowledge from each other and thus could be improved simultaneously. Experimental results and in-depth analysis on several benchmark datasets demonstrate the effectiveness of our approach. We release our code at https://github.com/DeepLearnXMU/CGML.

JBHI Journal 2021 Journal Article

Left Ventricle Quantification Challenge: A Comprehensive Comparison and Evaluation of Segmentation and Regression for Mid-Ventricular Short-Axis Cardiac MR Data

  • Wufeng Xue
  • Jiahui Li
  • Zhiqiang Hu
  • Eric Kerfoot
  • James Clough
  • Ilkay Oksuz
  • Hao Xu
  • Vicente Grau

Automatic quantification of the left ventricle (LV) from cardiac magnetic resonance (CMR) images plays an important role in making the diagnosis procedure efficient, reliable, and alleviating the laborious reading work for physicians. Considerable efforts have been devoted to LV quantification using different strategies that include segmentation-based (SG) methods and the recent direct regression (DR) methods. Although both SG and DR methods have obtained great success for the task, a systematic platform to benchmark them remains absent because of differences in label information during model learning. In this paper, we conducted an unbiased evaluation and comparison of cardiac LV quantification methods that were submitted to the Left Ventricle Quantification (LVQuan) challenge, which was held in conjunction with the Statistical Atlases and Computational Modeling of the Heart (STACOM) workshop at the MICCAI 2018. The challenge was targeted at the quantification of 1) areas of LV cavity and myocardium, 2) dimensions of the LV cavity, 3) regional wall thicknesses (RWT), and 4) the cardiac phase, from mid-ventricle short-axis CMR images. First, we constructed a public quantification dataset Cardiac-DIG with ground truth labels for both the myocardium mask and these quantification targets across the entire cardiac cycle. Then, the key techniques employed by each submission were described. Next, quantitative validation of these submissions was conducted with the constructed dataset. The evaluation results revealed that both SG and DR methods can offer good LV quantification performance, even though DR methods do not require densely labeled masks for supervision. Among the 12 submissions, the DR method LDAMT offered the best performance, with a mean estimation error of 301 mm$^2$ for the two areas, 2.15 mm for the cavity dimensions, 2.03 mm for RWTs, and a 9.5% error rate for the cardiac phase classification. Three of the SG methods also delivered comparable performances. Finally, we discussed the advantages and disadvantages of SG and DR methods, as well as the unsolved problems in automatic cardiac quantification for clinical practice applications.

NeurIPS Conference 2021 Conference Paper

Reinforcement Learning Enhanced Explainer for Graph Neural Networks

  • Caihua Shan
  • Yifei Shen
  • Yao Zhang
  • Xiang Li
  • Dongsheng Li

Graph neural networks (GNNs) have recently emerged as revolutionary technologies for machine learning tasks on graphs. In GNNs, the graph structure is generally incorporated with node representation via the message passing scheme, making the explanation much more challenging. Given a trained GNN model, a GNN explainer aims to identify the most influential subgraph to interpret the prediction of an instance (e.g., a node or a graph), which is essentially a combinatorial optimization problem over graphs. The existing works solve this problem by continuous relaxation or search-based heuristics. But they suffer from key issues such as violation of message passing and hand-crafted heuristics, leading to inferior interpretability. To address these issues, we propose an RL-enhanced GNN explainer, RG-Explainer, which consists of three main components: starting point selection, iterative graph generation and stopping criteria learning. RG-Explainer could construct a connected explanatory subgraph by sequentially adding nodes from the boundary of the current generated graph, which is consistent with the message passing scheme. Further, we design an effective seed locator to select the starting point, and learn stopping criteria to generate superior explanations. Extensive experiments on both synthetic and real datasets show that RG-Explainer outperforms state-of-the-art GNN explainers. Moreover, RG-Explainer can be applied in the inductive setting, demonstrating its better generalization ability.

NeurIPS Conference 2021 Conference Paper

The Image Local Autoregressive Transformer

  • Chenjie Cao
  • Yuxin Hong
  • Xiang Li
  • Chengrong Wang
  • Chengming Xu
  • Yanwei Fu
  • Xiangyang Xue

Recently, AutoRegressive (AR) models for the whole image generation empowered by transformers have achieved comparable or even better performance compared to Generative Adversarial Networks (GANs). Unfortunately, directly applying such AR models to edit/change local image regions, may suffer from the problems of missing global information, slow inference speed, and information leakage of local guidance. To address these limitations, we propose a novel model -- image Local Autoregressive Transformer (iLAT), to better facilitate the locally guided image synthesis. Our iLAT learns the novel local discrete representations, by the newly proposed local autoregressive (LA) transformer of the attention mask and convolution mechanism. Thus iLAT can efficiently synthesize the local image regions by key guidance information. Our iLAT is evaluated on various locally guided image syntheses, such as pose-guided person image synthesis and face editing. Both quantitative and qualitative results show the efficacy of our model.

IJCAI Conference 2020 Conference Paper

An Iterative Multi-Source Mutual Knowledge Transfer Framework for Machine Reading Comprehension

  • Xin Liu
  • Kai Liu
  • Xiang Li
  • Jinsong Su
  • Yubin Ge
  • Bin Wang
  • Jiebo Luo

The lack of sufficient training data in many domains poses a major challenge to the construction of domain-specific machine reading comprehension (MRC) models with satisfactory performance. In this paper, we propose a novel iterative multi-source mutual knowledge transfer framework for MRC. As an extension of the conventional knowledge transfer with one-to-one correspondence, our framework focuses on the many-to-many mutual transfer, which involves synchronous executions of multiple many-to-one transfers in an iterative manner. Specifically, to update a target-domain MRC model, we first consider other domain-specific MRC models as individual teachers, and employ knowledge distillation to train a multi-domain MRC model, which is differentially required to fit the training data and match the outputs of these individual models according to their domain-level similarities to the target domain. After being initialized by the multi-domain MRC model, the target-domain MRC model is fine-tuned to match both its training data and the output of its previous best model simultaneously via knowledge distillation. Compared with previous approaches, our framework can continuously enhance all domain-specific MRC models by enabling each model to iteratively and differentially absorb the domain-shared knowledge from others. Experimental results and in-depth analyses on several benchmark datasets demonstrate the effectiveness of our framework.

JBHI Journal 2020 Journal Article

Automated Semantic Segmentation of Red Blood Cells for Sickle Cell Disease

  • Mo Zhang
  • Xiang Li
  • Mengjia Xu
  • Quanzheng Li

Red blood cell (RBC) segmentation and classification from microscopic images is a crucial step for the diagnosis of sickle cell disease (SCD). In this work, we adopt a deep learning based semantic segmentation framework to solve the RBC classification task. A major challenge for robust segmentation and classification is the large variation in the size, shape and viewpoint of the cells, combined with the low image quality caused by noise and artifacts. To address these challenges, we apply deformable convolution layers to the classic U-Net structure and implement the deformable U-Net (dU-Net). The U-Net architecture has been shown to offer accurate localization for image semantic segmentation. Moreover, deformable convolution enables free-form deformation of the feature learning process, thus making the network more robust to various cell morphologies and image settings. dU-Net is tested on microscopic red blood cell images from patients with sickle cell disease. Results show that dU-Net achieves the highest accuracy for both binary segmentation and multi-class semantic segmentation tasks, compared with both unsupervised and state-of-the-art deep learning based supervised segmentation methods. Through detailed investigation of the segmentation results, we further conclude that the performance improvement is mainly caused by the deformable convolution layers, which have a better ability to separate the touching cells, discriminate the background noise and predict correct cell shapes without any shape priors.
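
The deformable-convolution building block can be sketched with torchvision's DeformConv2d as a stand-in for the paper's implementation (layer sizes are illustrative): a plain convolution predicts the sampling offsets that deform the main convolution's sampling grid.

```python
# Sketch of a deformable convolution block: an auxiliary conv predicts 2*k*k
# offsets per location, which the deformable conv uses to sample its input.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.offset = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x):
        return self.deform(x, self.offset(x))
```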

AAAI Conference 2020 Conference Paper

Do Subsampled Newton Methods Work for High-Dimensional Data?

  • Xiang Li
  • Shusen Wang
  • Zhihua Zhang

Subsampled Newton methods approximate Hessian matrices through subsampling techniques to alleviate the per-iteration cost. Previous results require Ω(d) samples to approximate Hessians, where d is the dimension of data points, making it less practical for high-dimensional data. The situation is deteriorated when d is comparably as large as the number of data points n, which requires taking the whole dataset into account, making subsampling not useful. This paper theoretically justifies the effectiveness of subsampled Newton methods on strongly convex empirical risk minimization with high-dimensional data. Specifically, we provably require only $\Theta(d^{\gamma}_{\mathrm{eff}})$ samples for approximating the Hessian matrices, where $d^{\gamma}_{\mathrm{eff}}$ is the $\gamma$-ridge leverage and can be much smaller than d as long as $n\gamma \gg 1$. Our theories work for three types of Newton methods: subsampled Newton, distributed Newton, and proximal Newton.
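
For intuition, here is a minimal subsampled Newton sketch on L2-regularized logistic regression: the gradient uses all n points while the Hessian is formed from a uniform subsample. The leverage-based sample sizes analyzed in the paper are not reproduced; the fixed sample_size is illustrative.

```python
# Sketch: subsampled Newton for L2-regularized logistic regression.
import numpy as np

def subsampled_newton(X, y, reg=1e-2, sample_size=256, iters=20):
    """X: (n, d), y: (n,) labels in {0, 1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (p - y) / n + reg * w                       # full gradient
        idx = np.random.choice(n, size=min(sample_size, n), replace=False)
        Xs, ps = X[idx], p[idx]
        Ds = ps * (1 - ps)
        H = (Xs * Ds[:, None]).T @ Xs / len(idx) + reg * np.eye(d)  # subsampled Hessian
        w -= np.linalg.solve(H, grad)                            # Newton step
    return w
```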

NeurIPS Conference 2020 Conference Paper

Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection

  • Xiang Li
  • Wenhai Wang
  • Lijun Wu
  • Shuo Chen
  • Xiaolin Hu
  • Jun Li
  • Jinhui Tang
  • Jian Yang

One-stage detectors basically formulate object detection as dense classification and localization (i.e., bounding box regression). The classification is usually optimized by Focal Loss and the box location is commonly learned under Dirac delta distribution. A recent trend for one-stage detectors is to introduce an \emph{individual} prediction branch to estimate the quality of localization, where the predicted quality facilitates the classification to improve detection performance. This paper delves into the \emph{representations} of the above three fundamental elements: quality estimation, classification and localization. Two problems are discovered in existing practices, including (1) the inconsistent usage of the quality estimation and classification between training and inference, and (2) the inflexible Dirac delta distribution for localization. To address the problems, we design new representations for these elements. Specifically, we merge the quality estimation into the class prediction vector to form a joint representation, and use a vector to represent arbitrary distribution of box locations. The improved representations eliminate the inconsistency risk and accurately depict the flexible distribution in real data, but contain \emph{continuous} labels, which is beyond the scope of Focal Loss. We then propose Generalized Focal Loss (GFL) that generalizes Focal Loss from its discrete form to the \emph{continuous} version for successful optimization. On COCO {\tt test-dev}, GFL achieves 45.0\% AP using the ResNet-101 backbone, surpassing state-of-the-art SAPD (43.5\%) and ATSS (43.6\%) with higher or comparable inference speed.
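
The quality-aware classification term can be sketched as follows, in a Quality-Focal-Loss-style form where the continuous quality score is the target and the modulator is |y − σ(logit)|^β (β is illustrative):

```python
# Sketch of a quality-aware focal term for continuous targets in [0, 1].
import torch
import torch.nn.functional as F

def quality_focal_loss(logits, quality_targets, beta=2.0):
    """logits, quality_targets: (N,) tensors, targets in [0, 1]."""
    sigma = logits.sigmoid()
    bce = F.binary_cross_entropy_with_logits(logits, quality_targets, reduction="none")
    modulator = (quality_targets - sigma).abs().pow(beta)   # down-weights easy locations
    return (modulator * bce).mean()
```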

NeurIPS Conference 2020 Conference Paper

Improving Local Identifiability in Probabilistic Box Embeddings

  • Shib Dasgupta
  • Michael Boratko
  • Dongxu Zhang
  • Luke Vilnis
  • Xiang Li
  • Andrew McCallum

Geometric embeddings have recently received attention for their natural ability to represent transitive asymmetric relations via containment. Box embeddings, where objects are represented by n-dimensional hyperrectangles, are a particularly promising example of such an embedding as they are closed under intersection and their volume can be calculated easily, allowing them to naturally represent calibrated probability distributions. The benefits of geometric embeddings also introduce a problem of local identifiability, however, where whole neighborhoods of parameters result in equivalent loss, which impedes learning. Prior work addressed some of these issues by using an approximation to Gaussian convolution over the box parameters; however, this intersection operation also increases the sparsity of the gradient. In this work we model the box parameters with min and max Gumbel distributions, which were chosen such that the space is still closed under the operation of intersection. The calculation of the expected intersection volume involves all parameters, and we demonstrate experimentally that this drastically improves the ability of such models to learn.

NeurIPS Conference 2020 Conference Paper

Neuron-level Structured Pruning using Polarization Regularizer

  • Tao Zhuang
  • Zhixuan Zhang
  • Yuheng Huang
  • Xiaoyi Zeng
  • Kai Shuang
  • Xiang Li

Neuron-level structured pruning is a very effective technique to reduce the computation of neural networks without compromising prediction accuracy. In previous works, structured pruning is usually achieved by imposing L1 regularization on the scaling factors of neurons, and pruning the neurons whose scaling factors are below a certain threshold. The reasoning is that neurons with smaller scaling factors have weaker influence on network output. A scaling factor close to 0 actually suppresses a neuron. However, L1 regularization lacks discrimination between neurons because it pushes all scaling factors towards 0. A more reasonable pruning method is to only suppress unimportant neurons (with 0 scaling factors) and simultaneously keep important neurons intact (with larger scaling factor). To achieve this goal, we propose a new regularizer on scaling factors, namely polarization regularizer. Theoretically, we prove that polarization regularizer pushes some scaling factors to 0 and others to a value $a > 0$. Experimentally, we show that structured pruning using polarization regularizer achieves much better results than using L1 regularizer. Experiments on CIFAR and ImageNet datasets show that polarization pruning achieves the state-of-the-art result to date.
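
A polarization-style regularizer can be written in a few lines; the specific form t·||a||₁ − ||a − mean(a)·1||₁ below is one published instance, and the constant t is illustrative rather than a recommended setting:

```python
# Sketch: polarization-style regularizer on neuron scaling factors. It pulls
# some factors toward 0 while pushing the rest away from the mean, unlike a
# plain L1 penalty that pushes all factors toward 0.
import torch

def polarization_regularizer(scaling_factors, t=1.2):
    a = scaling_factors
    return t * a.abs().sum() - (a - a.mean()).abs().sum()
```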

AAAI Conference 2020 Conference Paper

Quadruply Stochastic Gradient Method for Large Scale Nonlinear Semi-Supervised Ordinal Regression AUC Optimization

  • Wanli Shi
  • Bin Gu
  • Xiang Li
  • Heng Huang

Semi-supervised ordinal regression (S2OR) problems are ubiquitous in real-world applications, where only a few ordered instances are labeled and massive instances remain unlabeled. Recent research has shown that directly optimizing concordance index or AUC can impose a better ranking on the data than optimizing the traditional error rate in ordinal regression (OR) problems. In this paper, we propose an unbiased objective function for S2OR AUC optimization based on the ordinal binary decomposition approach. Besides, to handle large-scale kernelized learning problems, we propose a scalable algorithm called QS3ORAO using the doubly stochastic gradients (DSG) framework for functional optimization. Theoretically, we prove that our method can converge to the optimal solution at the rate of O(1/t), where t is the number of iterations for stochastic data sampling. Extensive experimental results on various benchmark and real-world datasets also demonstrate that our method is efficient and effective while retaining similar generalization performance.

AAAI Conference 2020 Conference Paper

Safe Sample Screening for Robust Support Vector Machine

  • Zhou Zhai
  • Bin Gu
  • Xiang Li
  • Heng Huang

Robust support vector machine (RSVM) has been shown to perform remarkably well in improving the generalization performance of support vector machines under noisy environments. Unfortunately, in order to handle the non-convexity induced by the ramp loss in RSVM, existing RSVM solvers often adopt the DC programming framework, which is computationally inefficient for running multiple outer loops. This hinders the application of RSVM to large-scale problems. Safe sample screening, which allows for the exclusion of training samples prior to or early in the training process, is an effective method to greatly reduce computational time. However, existing safe sample screening algorithms are limited to convex optimization problems while RSVM is a non-convex problem. To address this challenge, in this paper, we propose two safe sample screening rules for RSVM based on the framework of the concave-convex procedure (CCCP). Specifically, we provide a screening rule for the inner solver of CCCP and another rule for propagating screened samples between two successive solvers of CCCP. To the best of our knowledge, this is the first work on safe sample screening for a non-convex optimization problem. More importantly, we provide a security guarantee for our sample screening rules for RSVM. Experimental results on a variety of benchmark datasets verify that our safe sample screening rules can significantly reduce the computational time.

AAAI Conference 2020 Conference Paper

Understanding the Disharmony between Weight Normalization Family and Weight Decay

  • Xiang Li
  • Shuo Chen
  • Jian Yang

The merits of fast convergence and potentially better performance of the weight normalization family have drawn increasing attention in recent years. These methods use standardization or normalization that changes the weight $W$ to $\widetilde{W}$, which makes $\widetilde{W}$ independent of the magnitude of $W$. Surprisingly, $W$ must be decayed during gradient descent, otherwise we will observe a severe under-fitting problem, which is very counter-intuitive since weight decay is widely known to prevent deep networks from over-fitting. Moreover, if we substitute (e.g., weight normalization) $\widetilde{W} = \frac{W}{\|W\|}$ in the original loss function $\sum_i L(f(x_i; \widetilde{W}), y_i) + \frac{1}{2}\lambda\|\widetilde{W}\|^2$, it is observed that the regularization term $\frac{1}{2}\lambda\|\widetilde{W}\|^2$ will be canceled as a constant $\frac{1}{2}\lambda$ in the optimization objective. Therefore, to decay $W$, we need to explicitly append the term $\frac{1}{2}\lambda\|W\|^2$. In this paper, we theoretically prove that $\frac{1}{2}\lambda\|W\|^2$ improves optimization only by modulating the effective learning rate and fairly has no influence on generalization when the weight normalization family is compositely employed. Furthermore, we also expose several serious problems when introducing the weight decay term to the weight normalization family, including the missing of the global minimum, training instability and sensitivity of initialization. To address these problems, we propose an Adaptive Weight Shrink (AWS) scheme, which gradually shrinks the weights during optimization by a dynamic coefficient proportional to the magnitude of the parameter. This simple yet effective method appropriately controls the effective learning rate, which significantly improves the training stability and makes optimization more robust to initialization.
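
The Adaptive Weight Shrink idea can be sketched as a post-update hook; the proportionality constant and the choice of which tensors to shrink are assumptions for illustration, not the paper's prescribed settings:

```python
# Sketch: after each optimizer step, shrink every weight tensor by a factor
# proportional to its own norm (clamped so the factor stays below 1).
import torch

@torch.no_grad()
def adaptive_weight_shrink(parameters, coeff=1e-4):
    for p in parameters:
        if p.dim() > 1:  # only shrink weight matrices / conv kernels, not biases
            shrink = (coeff * p.norm()).clamp(max=0.5)
            p.mul_(1.0 - shrink)
```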

NeurIPS Conference 2019 Conference Paper

A Regularized Approach to Sparse Optimal Policy in Reinforcement Learning

  • Wenhao Yang
  • Xiang Li
  • Zhihua Zhang

We propose and study a general framework for regularized Markov decision processes (MDPs) where the goal is to find an optimal policy that maximizes the expected discounted total reward plus a policy regularization term. The extant entropy-regularized MDPs can be cast into our framework. Moreover, under our framework, many regularization terms can bring multi-modality and sparsity, which are potentially useful in reinforcement learning. In particular, we present sufficient and necessary conditions that induce a sparse optimal policy. We also conduct a full mathematical analysis of the proposed regularized MDPs, including the optimality condition, performance error, and sparseness control. We provide a generic method to devise regularization forms and propose off-policy actor critic algorithms in complex environment settings. We empirically analyze the numerical properties of optimal policies and compare the performance of different sparse regularization forms in discrete and continuous environments.

NeurIPS Conference 2019 Conference Paper

Arbicon-Net: Arbitrary Continuous Geometric Transformation Networks for Image Registration

  • Jianchun Chen
  • Lingjing Wang
  • Xiang Li
  • Yi Fang

This paper concerns the undetermined problem of estimating geometric transformation between image pairs. Recent methods introduce deep neural networks to predict the controlling parameters of hand-crafted geometric transformation models (e.g., thin-plate spline) for image registration and matching. However, the low-dimension parametric models are incapable of estimating a highly complex geometric transform with limited flexibility to model the actual geometric deformation from image pairs. To address this issue, we present an end-to-end trainable deep neural network, named Arbitrary Continuous Geometric Transformation Networks (Arbicon-Net), to directly predict the dense displacement field for pairwise image alignment. Arbicon-Net is generalized from training data to predict the desired arbitrary continuous geometric transformation in a data-driven manner for unseen new pairs of images. Particularly, without imposing penalization terms, the predicted displacement vector function is proven to be spatially continuous and smooth. To verify the performance of Arbicon-Net, we conducted semantic alignment tests over both synthetic and real image datasets with various experimental settings. The results demonstrate that Arbicon-Net outperforms the previous image alignment techniques in identifying the image correspondences.

IJCAI Conference 2019 Conference Paper

Dynamic Feature Fusion for Semantic Edge Detection

  • Yuan Hu
  • Yunpeng Chen
  • Xiang Li
  • Jiashi Feng

Features from multiple scales can greatly benefit the semantic edge detection task if they are well fused. However, the prevalent semantic edge detection methods apply a fixed weight fusion strategy where images with different semantics are forced to share the same weights, resulting in universal fusion weights for all images and locations regardless of their different semantics or local context. In this work, we propose a novel dynamic feature fusion strategy that assigns different fusion weights for different input images and locations adaptively. This is achieved by a proposed weight learner to infer proper fusion weights over multi-level features for each location of the feature map, conditioned on the specific input. In this way, the heterogeneity in contributions made by different locations of feature maps and input images can be better considered and thus help produce more accurate and sharper edge predictions. We show that our model with the novel dynamic feature fusion is superior to fixed weight fusion and also the na\"ive location-invariant weight fusion methods, via comprehensive experiments on benchmarks Cityscapes and SBD. In particular, our method outperforms all existing well established methods and achieves new state-of-the-art.
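
A minimal sketch of location-adaptive fusion follows, assuming the multi-level features have already been resized to a common resolution; the 1x1-conv weight learner below is a simplified stand-in for the paper's module:

```python
# Sketch: a weight learner predicts per-pixel softmax weights over the levels,
# replacing a single fixed fusion weight per level.
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    def __init__(self, channels, num_levels):
        super().__init__()
        self.weight_learner = nn.Conv2d(channels * num_levels, num_levels, kernel_size=1)

    def forward(self, feats):
        """feats: list of num_levels tensors, each (B, C, H, W) at a common resolution."""
        stacked = torch.stack(feats, dim=1)                                     # (B, L, C, H, W)
        weights = self.weight_learner(torch.cat(feats, dim=1)).softmax(dim=1)   # (B, L, H, W)
        return (stacked * weights.unsqueeze(2)).sum(dim=1)                      # (B, C, H, W)
```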

IROS Conference 2019 Conference Paper

Fast Motion Planning via Free C-space Estimation Based on Deep Neural Network

  • Xiang Li
  • Qixin Cao
  • Mingjing Sun
  • Ganggang Yang

This paper presents a novel learning-based method for fast motion planning in high-dimensional spaces. A deep neural network is designed to rapidly predict the free configuration space given the environment point cloud. With a generated roadmap as an approximate view of the free C-space, LazyPRM is applied to find and check the path with A* search. Due to the application of LazyPRM, the presented method can preserve probabilistic completeness and asymptotic optimality. The new algorithm is tested on a 3-DOF robot arm and a 6-DOF UR3 robot to plan in randomly generated obstacle environments. Results indicate that compared to planners including PRM, RRT*, RRT-connect and the original LazyPRM, our method has the lowest time consumption and relatively short path lengths, showing good performance in both planning speed and path quality.

AAAI Conference 2019 Conference Paper

Inter-Class Angular Loss for Convolutional Neural Networks

  • Le Hui
  • Xiang Li
  • Chen Gong
  • Meng Fang
  • Joey Tianyi Zhou
  • Jian Yang

Convolutional Neural Networks (CNNs) have shown great power in various classification tasks and have achieved remarkable results in practical applications. However, the distinct learning difficulties in discriminating different pairs of classes are largely ignored by the existing networks. For instance, in CIFAR-10 dataset, distinguishing cats from dogs is usually harder than distinguishing horses from ships. By carefully studying the behavior of CNN models in the training process, we observe that the confusion level of two classes is strongly correlated with their angular separability in the feature space. That is, the larger the inter-class angle is, the lower the confusion will be. Based on this observation, we propose a novel loss function dubbed “Inter-Class Angular Loss” (ICAL), which explicitly models the class correlation and can be directly applied to many existing deep networks. By minimizing the proposed ICAL, the networks can effectively discriminate the examples in similar classes by enlarging the angle between their corresponding class vectors. Thorough experimental results on a series of vision and nonvision datasets confirm that ICAL critically improves the discriminative ability of various representative deep neural networks and generates superior performance to the original networks with conventional softmax loss.
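
A simplified stand-in for the angular idea (not the exact ICAL formulation) is to penalize positive cosine similarities between the final-layer class weight vectors, so that class directions are pushed apart:

```python
# Sketch: penalize small inter-class angles via pairwise cosine similarities
# of the classifier's class weight vectors.
import torch
import torch.nn.functional as F

def inter_class_angular_penalty(class_weights):
    """class_weights: (num_classes, dim) final-layer class vectors."""
    w = F.normalize(class_weights, dim=1)
    cos = w @ w.t()                                          # pairwise cosine similarities
    off_diag = cos - torch.eye(len(w), device=w.device)      # zero out the diagonal
    return off_diag.clamp(min=0).sum() / (len(w) * (len(w) - 1))
```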

NeurIPS Conference 2019 Conference Paper

Joint Optimization of Tree-based Index and Deep Model for Recommender Systems

  • Han Zhu
  • Daqing Chang
  • Ziru Xu
  • Pengye Zhang
  • Xiang Li
  • Jie He
  • Han Li
  • Jian Xu

Large-scale industrial recommender systems are usually confronted with computational problems due to the enormous corpus size. To retrieve and recommend the most relevant items to users under response time limits, resorting to an efficient index structure is an effective and practical solution. The previous work Tree-based Deep Model (TDM) \cite{zhu2018learning} greatly improves recommendation accuracy using tree index. By indexing items in a tree hierarchy and training a user-node preference prediction model satisfying a max-heap like property in the tree, TDM provides logarithmic computational complexity w.r.t. the corpus size, enabling the use of arbitrary advanced models in candidate retrieval and recommendation. In tree-based recommendation methods, the quality of both the tree index and the user-node preference prediction model determines the recommendation accuracy for the most part. We argue that the learning of tree index and preference model has interdependence. Our purpose, in this paper, is to develop a method to jointly learn the index structure and user preference prediction model. In our proposed joint optimization framework, the learning of index and user preference prediction model are carried out under a unified performance measure. Besides, we come up with a novel hierarchical user preference representation utilizing the tree index hierarchy. Experimental evaluations with two large-scale real-world datasets show that the proposed method improves recommendation accuracy significantly. Online A/B test results at a display advertising platform also demonstrate the effectiveness of the proposed method in production environments.

IJCAI Conference 2019 Conference Paper

Quadruply Stochastic Gradients for Large Scale Nonlinear Semi-Supervised AUC Optimization

  • Wanli Shi
  • Bin Gu
  • Xiang Li
  • Xiang Geng
  • Heng Huang

Semi-supervised learning is pervasive in real-world applications, where only a few labeled data are available and large amounts of instances remain unlabeled. Since AUC is an important model evaluation metric in classification, directly optimizing AUC in semi-supervised learning scenario has drawn much attention in the machine learning community. Recently, it has been shown that one could find an unbiased solution for the semi-supervised AUC maximization problem without knowing the class prior distribution. However, this method is hardly scalable for nonlinear classification problems with kernels. To address this problem, in this paper, we propose a novel scalable quadruply stochastic gradient algorithm (QSG-S2AUC) for nonlinear semi-supervised AUC optimization. In each iteration of the stochastic optimization process, our method randomly samples a positive instance, a negative instance, an unlabeled instance and their random features to compute the gradient and then update the model by using this quadruply stochastic gradient to approach the optimal solution. More importantly, we prove that QSG-S2AUC can converge to the optimal solution in O(1/t), where t is the iteration number. Extensive experimental results on a variety of benchmark datasets show that QSG-S2AUC is far more efficient than the existing state-of-the-art algorithms for semi-supervised AUC maximization, while retaining the similar generalization performance.

IJCAI Conference 2019 Conference Paper

Scalable Semi-Supervised SVM via Triply Stochastic Gradients

  • Xiang Geng
  • Bin Gu
  • Xiang Li
  • Wanli Shi
  • Guansheng Zheng
  • Heng Huang

Semi-supervised learning (SSL) plays an increasingly important role in the big data era because a large number of unlabeled samples can be used effectively to improve the performance of the classifier. Semi-supervised support vector machine (S3VM) is one of the most appealing methods for SSL, but scaling up S3VM for kernel learning is still an open problem. Recently, a doubly stochastic gradient (DSG) algorithm has been proposed to achieve efficient and scalable training for kernel methods. However, the algorithm and theoretical analysis of DSG are developed based on the convexity assumption which makes them incompetent for non-convex problems such as S3VM. To address this problem, in this paper, we propose a triply stochastic gradient algorithm for S3VM, called TSGS3VM. Specifically, to handle the two types of data instances involved in S3VM, TSGS3VM samples a labeled instance, an unlabeled instance, and their random features in each iteration to compute a triply stochastic gradient. We use the approximated gradient to update the solution. More importantly, we establish a new theoretical analysis for TSGS3VM which guarantees that TSGS3VM can converge to a stationary point. Extensive experimental results on a variety of datasets demonstrate that TSGS3VM is much more efficient and scalable than existing S3VM algorithms.

AAAI Conference 2019 Conference Paper

Spectral Clustering in Heterogeneous Information Networks

  • Xiang Li
  • Ben Kao
  • Zhaochun Ren
  • Dawei Yin

A heterogeneous information network (HIN) is one whose objects are of different types and links between objects could model different object relations. We study how spectral clustering can be effectively applied to HINs. In particular, we focus on how meta-path relations are used to construct an effective similarity matrix based on which spectral clustering is done. We formulate the similarity matrix construction as an optimization problem and propose the SClump algorithm for solving the problem. We conduct extensive experiments comparing SClump with other state-of-the-art clustering algorithms on HINs. Our results show that SClump outperforms the competitors over a range of datasets w.r.t. different clustering quality measures.
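
The overall pipeline can be sketched as follows, with the meta-path similarity weights fixed rather than learned as in SClump (learning those weights is exactly the part this sketch omits):

```python
# Sketch: combine per-meta-path similarity matrices with fixed weights and run
# standard normalized-Laplacian spectral clustering on the result.
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster_hin(metapath_sims, weights, k):
    """metapath_sims: list of (n, n) similarity matrices; weights: same length; k: #clusters."""
    S = sum(w * s for w, s in zip(weights, metapath_sims))
    d = S.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(S)) - d_inv_sqrt @ S @ d_inv_sqrt      # normalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]                                    # k smallest eigenvectors
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```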

IJCAI Conference 2018 Conference Paper

Adversarial Metric Learning

  • Shuo Chen
  • Chen Gong
  • Jian Yang
  • Xiang Li
  • Yang Wei
  • Jun Li

In the past decades, intensive efforts have been put to design various loss functions and metric forms for the metric learning problem. These improvements have shown promising results when the test data is similar to the training data. However, the trained models often fail to produce reliable distances on the ambiguous test pairs due to the different samplings between the training set and test set. To address this problem, the Adversarial Metric Learning (AML) is proposed in this paper, which automatically generates adversarial pairs to remedy the sampling bias and facilitate robust metric learning. Specifically, AML consists of two adversarial stages, i.e., confusion and distinguishment. In the confusion stage, the ambiguous but critical adversarial data pairs are adaptively generated to mislead the learned metric. In the distinguishment stage, a metric is exhaustively learned to try its best to distinguish both the adversarial pairs and original training pairs. Thanks to the challenges posed by the confusion stage in such a competing process, the AML model is able to grasp plentiful difficult knowledge that has not been contained by the original training pairs, so the discriminability of AML can be significantly improved. The entire model is formulated into an optimization framework, of which the global convergence is theoretically proved. The experimental results on toy data and practical datasets clearly demonstrate the superiority of AML to representative state-of-the-art metric learning models.

IJCAI Conference 2018 Conference Paper

Faster Training Algorithms for Structured Sparsity-Inducing Norm

  • Bin Gu
  • Xingwang Ju
  • Xiang Li
  • Guansheng Zheng

Structured-sparsity regularization is popular for sparse learning because of its flexibility of encoding the feature structures. This paper considers a generalized version of structured-sparsity regularization (especially for $l_1/l_{\infty}$ norm) with arbitrary group overlap. Due to the group overlap, it is time-consuming to solve the associated proximal operator. Although Mairal~\shortcite{mairal2010network} have proposed a network-flow algorithm to solve the proximal operator, it is still time-consuming especially in the high-dimensional setting. To address this challenge, in this paper, we have developed a more efficient solution for $l_1/l_{\infty}$ group lasso with arbitrary group overlap using an Inexact Proximal-Gradient method. In each iteration, our algorithm only requires to calculate an inexact solution to the proximal sub-problem, which can be done efficiently. On the theoretic side, the proposed algorithm enjoys the same global convergence rate as the exact proximal methods. Experiments demonstrate that our algorithm is much more efficient than network-flow algorithm, while retaining the similar generalization performance.

JBHI Journal 2018 Journal Article

Frequency Network Analysis of Heart Rate Variability for Obstructive Apnea Patient Detection

  • Zhao Dong
  • Xiang Li
  • Wei Chen

Obstructive sleep apnea (OSA) is a common sleep disorder. Traditional OSA diagnosis methods are cumbersome and expensive, which brings inconvenience to patients and a heavy workload to physicians. Automatically identifying OSA patients from electrocardiogram (ECG) records is important for clinical diagnosis and treatment. In this paper, a new method based on the frequency and network domains is proposed to automatically recognize OSA patients from nocturnal ECG records. First, each RR-interval (beat to beat heart rate) series was divided into segments. By calculating the power spectral density (PSD) of each heart rate variability segment with the Lomb-Scargle method, the dynamic time warping (DTW) distance was used to evaluate the similarity (dissimilarity) of the lower frequency in the PSD series, then the DTW distance matrix was transformed to a binary matrix, and network metrics were calculated to discriminate OSA patients from healthy subjects. The new method was tested with data of 389 subjects collected from two public databases that consist of normal subjects without OSA (apnea-hypopnea index, AHI≤5) and OSA patients (AHI>5). Results show that a single network metric (local clustering coefficient) can recognize OSA patients with 90.1% accuracy, 88.29% sensitivity, and 90.5% specificity, and confirm the potential of using ECG records for OSA patient recognition.

IJCAI Conference 2018 Conference Paper

Mixed Link Networks

  • Wenhai Wang
  • Xiang Li
  • Tong Lu
  • Jian Yang

Based on an analysis revealing the equivalence of modern networks, we find that both ResNet and DenseNet are essentially derived from the same "dense topology", yet they only differ in the form of connection: addition (dubbed "inner link") vs. concatenation (dubbed "outer link"). However, both forms of connections have their superiority and insufficiency. To combine their advantages and avoid certain limitations on representation learning, we present a highly efficient and modularized Mixed Link Network (MixNet) which is equipped with flexible inner link and outer link modules. Consequently, ResNet, DenseNet and Dual Path Network (DPN) can each be regarded as a special case of MixNet. Furthermore, we demonstrate that MixNets can achieve superior efficiency in parameters over the state-of-the-art architectures on many competitive datasets like CIFAR-10/100, SVHN and ImageNet.
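
A toy block illustrating the two link types (channel splits and layer sizes are illustrative, not the published architecture): part of the block output is added onto existing channels (inner link) and part is concatenated as new channels (outer link).

```python
# Sketch of a mixed-link block: addition onto a slice of the existing
# channels plus concatenation of newly produced channels.
import torch
import torch.nn as nn

class MixedLinkBlock(nn.Module):
    def __init__(self, in_ch, inner_ch, outer_ch):
        super().__init__()
        assert inner_ch <= in_ch, "inner link adds onto existing channels"
        self.inner_ch = inner_ch
        self.inner = nn.Sequential(nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
                                   nn.Conv2d(in_ch, inner_ch, 3, padding=1))
        self.outer = nn.Sequential(nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
                                   nn.Conv2d(in_ch, outer_ch, 3, padding=1))

    def forward(self, x):
        y = x.clone()
        y[:, :self.inner_ch] = y[:, :self.inner_ch] + self.inner(x)   # inner link: addition
        return torch.cat([y, self.outer(x)], dim=1)                    # outer link: concatenation
```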

IJCAI Conference 2018 Conference Paper

Pairwise-Ranking based Collaborative Recurrent Neural Networks for Clinical Event Prediction

  • Zhi Qiao
  • Shiwan Zhao
  • Cao Xiao
  • Xiang Li
  • Yong Qin
  • Fei Wang

Patient Electronic Health Records (EHR) data consist of sequences of patient visits over time. Sequential prediction of patients' future clinical events (e.g., diagnoses) from their historical EHR data is a core research task and motivates a series of predictive models including deep learning. The existing research mainly adopts a classification framework, which treats the observed and unobserved events as positive and negative classes. However, this may not be true in real clinical settings considering the high rate of missed diagnoses and human errors. In this paper, we propose to formulate the clinical event prediction problem as an event recommendation problem. An end-to-end pairwise-ranking based collaborative recurrent neural network (PacRNN) is proposed to solve it, which first embeds patient clinical contexts with an attention RNN, then uses Bayesian Personalized Ranking (BPR) regularized by disease co-occurrence to rank probabilities of patient-specific diseases, and uses a point process to provide simultaneous prediction of the occurring time of these diagnoses. Experimental results on two real-world EHR datasets demonstrate the robust performance, interpretability, and efficacy of PacRNN.

NeurIPS Conference 2018 Conference Paper

Pelee: A Real-Time Object Detection System on Mobile Devices

  • Robert Wang
  • Xiang Li
  • Charles Ling

An increasing need of running Convolutional Neural Network (CNN) models on mobile devices with limited computing power and memory resource encourages studies on efficient model design. A number of efficient architectures have been proposed in recent years, for example, MobileNet, ShuffleNet, and MobileNetV2. However, all these models are heavily dependent on depthwise separable convolution which lacks efficient implementation in most deep learning frameworks. In this study, we propose an efficient architecture named PeleeNet, which is built with conventional convolution instead. On the ImageNet ILSVRC 2012 dataset, our proposed PeleeNet achieves a higher accuracy and 1.8 times faster speed than MobileNet and MobileNetV2 on NVIDIA TX2. Meanwhile, PeleeNet is only 66% of the model size of MobileNet. We then propose a real-time object detection system by combining PeleeNet with the Single Shot MultiBox Detector (SSD) method and optimizing the architecture for fast speed. Our proposed detection system, named Pelee, achieves 76.4% mAP (mean average precision) on PASCAL VOC2007 and 22.4 mAP on the MS COCO dataset at the speed of 23.6 FPS on iPhone 8 and 125 FPS on NVIDIA TX2. The result on COCO outperforms YOLOv2 in consideration of a higher precision, 13.6 times lower computational cost and 11.3 times smaller model size. The code and models are open sourced.

AAAI Conference 2017 Conference Paper

Fast Generalized Distillation for Semi-Supervised Domain Adaptation

  • Shuang Ao
  • Xiang Li
  • Charles Ling

Semi-supervised domain adaptation (SDA) is a typical setting when we face the problem of domain adaptation in real applications. How to effectively utilize the unlabeled data is an important issue in SDA. Previous work requires access to the source data to measure the data distribution mismatch, which is ineffective when the size of the source data is relatively large. In this paper, we propose a new paradigm, called Generalized Distillation Semi-supervised Domain Adaptation (GDSDA). We show that without accessing the source data, GDSDA can effectively utilize the unlabeled data to transfer the knowledge from the source models. Then we propose GDSDA-SVM which uses SVM as the base classifier and can efficiently solve the SDA problem. Experimental results show that GDSDA-SVM can effectively utilize the unlabeled data to transfer the knowledge between different domains under the SDA setting.

AAAI Conference 2017 Conference Paper

TaGiTeD: Predictive Task Guided Tensor Decomposition for Representation Learning from Electronic Health Records

  • Kai Yang
  • Xiang Li
  • Haifeng Liu
  • Jing Mei
  • Guotong Xie
  • Junfeng Zhao
  • Bing Xie
  • Fei Wang

With the better availability of healthcare data, such as Electronic Health Records (EHR), more and more data analytics methodologies are developed aiming at digging insights from them to improve the quality of care delivery. There are many challenges in analyzing EHR, such as high dimensionality and event sparsity. Moreover, different from other application domains, the EHR analysis algorithms need to be highly interpretable to make them clinically useful. This makes representation learning from EHRs of key importance. In this paper, we propose an algorithm called Predictive Task Guided Tensor Decomposition (TaGiTeD), to analyze EHRs. Specifically, TaGiTeD learns event interaction patterns that are highly predictive for certain tasks from EHRs with supervised tensor decomposition. Compared with unsupervised methods, TaGiTeD can learn effective EHR representations in a more focused way. This is crucial because most of the medical problems have very limited patient samples, which are not enough for unsupervised algorithms to learn meaningful representations from. We apply TaGiTeD on a real-world EHR data warehouse and demonstrate that TaGiTeD can learn representations that are both interpretable and predictive.

NeurIPS Conference 2016 Conference Paper

LightRNN: Memory and Computation-Efficient Recurrent Neural Networks

  • Xiang Li
  • Tao Qin
  • Jian Yang
  • Tie-Yan Liu

Recurrent neural networks (RNNs) have achieved state-of-the-art performances in many natural language processing tasks, such as language modeling and machine translation. However, when the vocabulary is large, the RNN model will become very big (e.g., possibly beyond the memory capacity of a GPU device) and its training will become very inefficient. In this work, we propose a novel technique to tackle this challenge. The key idea is to use 2-Component (2C) shared embedding for word representations. We allocate every word in the vocabulary into a table, each row of which is associated with a vector, and each column associated with another vector. Depending on its position in the table, a word is jointly represented by two components: a row vector and a column vector. Since the words in the same row share the row vector and the words in the same column share the column vector, we only need $2 \sqrt{|V|}$ vectors to represent a vocabulary of $|V|$ unique words, which are far less than the $|V|$ vectors required by existing approaches. Based on the 2-Component shared embedding, we design a new RNN algorithm and evaluate it using the language modeling task on several benchmark datasets. The results show that our algorithm significantly reduces the model size and speeds up the training process, without sacrifice of accuracy (it achieves similar, if not better, perplexity as compared to state-of-the-art language models). Remarkably, on the One-Billion-Word benchmark Dataset, our algorithm achieves comparable perplexity to previous language models, whilst reducing the model size by a factor of 40-100, and speeding up the training process by a factor of 2. We name our proposed algorithm \emph{LightRNN} to reflect its very small model size and very high training speed.
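
The 2-Component shared embedding is simple to sketch: each word id maps to a (row, column) cell of a roughly sqrt(|V|) × sqrt(|V|) table, and its vector is assembled from one shared row embedding and one shared column embedding. The concatenation below is one possible way to combine the two components, used here only for illustration:

```python
# Sketch: 2-Component shared embedding with a sqrt(|V|)-sided lookup table.
import math
import torch
import torch.nn as nn

class TwoComponentEmbedding(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.side = math.ceil(math.sqrt(vocab_size))
        self.row_emb = nn.Embedding(self.side, dim)   # shared by all words in a row
        self.col_emb = nn.Embedding(self.side, dim)   # shared by all words in a column

    def forward(self, word_ids):
        rows = torch.div(word_ids, self.side, rounding_mode="floor")
        cols = word_ids % self.side
        return torch.cat([self.row_emb(rows), self.col_emb(cols)], dim=-1)
```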

IJCAI Conference 2016 Conference Paper

StalemateBreaker: A Proactive Content-Introducing Approach to Automatic Human-Computer Conversation

  • Xiang Li
  • Lili Mou
  • Rui Yan
  • Ming Zhang

Existing open-domain human-computer conversation systems are typically passive: they either synthesize or retrieve a reply provided with a human-issued utterance. It is generally presumed that humans should take the role to lead the conversation and introduce new content when a stalemate occurs, and that computers only need to "respond." In this paper, we propose STALEMATEBREAKER, a conversation system that can proactively introduce new content when appropriate. We design a pipeline to determine when, what, and how to introduce new content during human-computer conversation. We further propose a novel reranking algorithm Bi-PageRank-HITS to enable rich interaction between conversation context and candidate replies. Experiments show that both the content-introducing approach and the reranking algorithm are effective. Our full STALEMATEBREAKER model outperforms a state-of-the-practice conversation system by +14.4% p@1 when a stalemate occurs.

IJCAI Conference 2015 Conference Paper

Data Sparseness in Linear SVM

  • Xiang Li
  • Huaimin Wang
  • Bin Gu
  • Charles X. Ling

Large sparse datasets are common in many real-world applications. Linear SVM has been shown to be very efficient for classifying such datasets. However, it is still unknown how data sparseness affects its convergence behavior. To study this problem in a systematic manner, we propose a novel approach to generate large and sparse data from real-world datasets, using statistical inference and the data sampling process in the PAC framework. We first study the convergence behavior of linear SVM experimentally and make several observations useful for real-world applications. We then offer theoretical proofs for our observations by studying the Bayes risk and the PAC bound. Our experimental and theoretical results are valuable for learning from large sparse datasets with linear SVM.
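
As a small, self-contained illustration of the setting (the random generator below is not the paper's PAC-based sampling procedure), one can vary the density of a synthetic sparse dataset and watch how a linear SVM behaves:

    import time
    import numpy as np
    from scipy.sparse import random as sparse_random
    from sklearn.svm import LinearSVC

    rng = np.random.RandomState(0)

    def make_sparse_data(n_samples=20000, n_features=5000, density=0.01):
        """Sparse design matrix with labels from a random linear rule (illustrative only)."""
        X = sparse_random(n_samples, n_features, density=density,
                          random_state=rng, format="csr")
        w = rng.randn(n_features)
        y = np.where(X @ w > 0, 1, -1)
        return X, y

    for density in (0.001, 0.01, 0.05):
        X, y = make_sparse_data(density=density)
        start = time.perf_counter()
        clf = LinearSVC(max_iter=5000).fit(X, y)
        print(f"density={density}: train accuracy={clf.score(X, y):.3f}, "
              f"fit time={time.perf_counter() - start:.2f}s")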

AAMAS Conference 2013 Conference Paper

Learning Visual Object Models on A Robot Using Context and Appearance Cues

  • Xiang Li
  • Mohan Sridharan
  • Catie Meador

Visual object recognition is a key challenge to the deployment of robots in domains characterized by partial observability and unforeseen changes. Sophisticated algorithms developed for modeling and recognizing objects using different visual cues [3, 4] are computationally expensive, sensitive to changes in object configurations and environmental factors, and require many training samples and accurate domain knowledge to learn object models, making it difficult for robots to reliably and efficiently model and recognize objects. These challenges are partially offset by the fact that many objects possess unique characteristics (e.g., color and shape) and motion patterns, although these characteristics and patterns are not known in advance and may change over time. Furthermore, only a subset of domain objects is relevant to any given task, and a variety of cues can be extracted from images to represent objects. This paper presents an algorithm that enables robots to identify a set of interesting objects, using appearance-based and contextual cues extracted from a small number of images to efficiently learn models of these objects. Robots learn the domain map and consider objects that move to be interesting, using motion cues to identify the corresponding image regions. Object models learned automatically from these regions consist of spatial arrangements of gradient features, graph-based models of neighborhoods of gradient features, parts-based models of image segments, color distributions, and mixture models of local context. The learned models are used for object recognition in novel scenes based on energy minimization and a generative model for information fusion. All algorithms are evaluated on wheeled robots in indoor and outdoor domains.

IS Journal 2008 Journal Article

IT Strategies for Increased Rail Employee Satisfaction

  • P. Jackson
  • Yanbin Chen
  • R. Farhangi
  • Xiang Li
  • D. Mansion
  • E. Markel
  • R. Morris
  • L. Podgurny

CN (Canadian National Railway) is the largest railway in Canada and a leader in the North American rail industry. In the bids and bulletins evaluation process, 7,500 CN employees submit their job preferences as bids and are assigned jobs in a manner that ensures all positions are filled. Each week, CN posts bulletins describing the available jobs and their requirements. On the basis of the bulletins, employees submit a bid card that identifies and ranks the jobs in which they're most interested. The bidding period runs for seven days. Forty-eight hours after the bidding period closes, job assignments are posted and become effective. A legacy software system, coded in Cobol, assigns the jobs, but regional schedulers must manage exceptions and infeasibilities. Typically, to achieve a publishable schedule, the software must be run several times, with manual overrides. Because the legacy system is slow, schedulers sometimes resort to personalized spreadsheets to assist decision making. CN is pursuing a phased implementation of SAP enterprise systems. As the company phases out legacy software, it's looking for new ways to leverage information technology to improve operations and increase employee satisfaction.

ICRA Conference 2008 Conference Paper

Nonlinear predictive control of an omnidirectional robot dribbling a rolling ball

  • Xiang Li
  • Andreas Zell

This paper focuses on the dribbling control problem of an omnidirectional mobile robot and a rolling ball in the RoboCup Middle Size domain. Because the ball easily slides away from the robot when it moves along a curve, dribbling control is more challenging than the normal mobile robot motion control problem. Based on an introduced reference point with respect to the robot body and a sophisticated planning method for the robot pose, nonlinear predictive control is used to steer the robot to follow the planned poses and thereby prevent the ball from leaving the robot. Real-world experiments showed that nonlinear predictive control is capable of solving the pose-following problem in a real-time application.
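
No equations are given in the abstract; as a generic sketch of the receding-horizon pose-following objective a nonlinear predictive controller minimizes (the cost structure, weights, and symbols are assumptions, not the paper's exact controller):

$$\min_{u_0,\dots,u_{N-1}}\ \sum_{k=0}^{N-1}\Big(\|p_k - p^{\mathrm{ref}}_k\|_Q^2 + \|u_k\|_R^2\Big) + \|p_N - p^{\mathrm{ref}}_N\|_P^2 \quad \text{s.t.}\ p_{k+1} = f(p_k, u_k),$$

where $p_k$ is the pose of the reference point fixed to the robot body, $p^{\mathrm{ref}}_k$ the planned pose, $u_k$ the velocity command, and $f$ the omnidirectional robot's kinematic model; at each control step only the first optimized input is applied and the horizon is shifted forward.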

ICRA Conference 2007 Conference Paper

Dribbling Control of Omnidirectional Soccer Robots

  • Xiang Li
  • Maosen Wang
  • Andreas Zell

This paper focuses on the dribbling control problem of an omnidirectional mobile robot. Because the movement of the dribbled object must be considered, dribbling control is more challenging than normal mobile robot motion control. A new feedback control algorithm is proposed, which steers a reference point to follow the desired movement while keeping the ball near that point. To dribble a rolling ball along a given path, the robot should provide the ball with appropriate force through consecutive pushing operations as they travel in an environment with obstacles. Based on an analysis of the forces acting on the ball with respect to the mobile robot coordinate system, a constraint on robot movement during the dribbling process is also introduced. Simulation and real-world experiments demonstrate the performance of this control algorithm.