Arrow Research search

Author name cluster

Yang Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

169 papers
2 author rows

Possible papers

169

AAAI Conference 2026 Conference Paper

Deeply Seeking Boundary for Lunar Regolith Segmentation

  • Yifeng Wang
  • Lingxin Wang
  • Lu Zhang
  • Yang Li
  • Chao Xu
  • Weiwei Zhang
  • Junyue Tang
  • Yanhong Zheng

The sharp, intricate contours of lunar regolith particles hold critical clues to the Moon's geological evolution and inform engineering applications from habitat construction to spacecraft design, making their precise segmentation a task of significant scientific and engineering value. However, this task exposes a weakness in deep learning models known as spectral bias, an inherent tendency to learn smooth, low-frequency functions which causes them to systematically erase the very high-frequency boundary details that are of primary interest. To resolve this conflict, we propose a framework to deeply seek object boundaries. First, we propose High-Frequency Initialized LoRA (HiFi-LoRA) to counteract spectral bias. By initializing the LoRA adaptation matrices as the optimal low-rank approximation of a high-pass filter, it fundamentally enhances the model's high-frequency perception and injects a strong preference for edges. Second, we propose the Wavelet Energy Modulation (WEM) regularizer. It guides the model to learn the intrinsic correlation between contour complexity and mask area, forcing the model to build a geometric understanding of contour morphology upon its high-frequency perception, thereby enabling the generation of boundary details commensurate with the object's scale. Experimentally, we constructed the Lunar Regolith Segmentation Dataset (LRSD), the first large-scale benchmark with expert-annotated contours. Extensive experiments demonstrate that our method sets a new state of the art on this challenging benchmark, not only achieving top performance on regional metrics like mIoU and DSC but, more critically, drastically outperforming existing models on boundary accuracy. This work not only provides a powerful computational tool for lunar science but also offers a robust and synergistic design pattern for other fine-grained segmentation challenges.
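The abstract describes initializing the LoRA adaptation matrices as the optimal low-rank approximation of a high-pass filter. A minimal sketch of that idea, assuming a 3x3 Laplacian kernel as the high-pass filter and a truncated SVD for the low-rank factorization (the paper's actual filter and rank are not given in the abstract):

```python
import numpy as np

def hifi_lora_init(rank=2):
    """Sketch: factor a high-pass filter into LoRA-style matrices A, B so that
    B @ A is the best rank-`rank` approximation of the filter (Eckart-Young).
    The Laplacian kernel below is an assumed stand-in for the paper's filter."""
    hp = np.array([[-1., -1., -1.],
                   [-1.,  8., -1.],
                   [-1., -1., -1.]])  # 3x3 Laplacian high-pass kernel
    U, S, Vt = np.linalg.svd(hp)
    B = U[:, :rank] * np.sqrt(S[:rank])           # left factor, shape (3, rank)
    A = np.sqrt(S[:rank])[:, None] * Vt[:rank]    # right factor, shape (rank, 3)
    return A, B

A, B = hifi_lora_init(rank=2)
approx = B @ A  # closest rank-2 matrix to the kernel in Frobenius norm
```

Initializing A and B this way biases the adapted weights toward edge-sensitive (high-frequency) responses from the first training step, which is the stated intent of HiFi-LoRA.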

AAAI Conference 2026 Conference Paper

DHCM-CACL: Dynamic Hierarchical Cross-modal Mamba with Confidence-Adaptive Contrastive Learning for Multimodal Emotion Recognition

  • Baiqiang Wu
  • Yang Li

Multimodal emotion recognition plays a crucial role in enhancing the intelligence of human-computer interaction and emotional understanding. However, conventional approaches face challenges such as scarcity of annotated data, significant modality heterogeneity, and temporal misalignment. To address these issues, we propose DHCM-CACL, a novel self-supervised emotion recognition framework integrating EEG and facial expressions. During the pre-training phase, we propose a Dynamic Hierarchical Cross-modal Mamba module (DHCM), which models long-term dependencies through dynamic state matrices, incorporates forgetting gates for noise suppression, and constructs a hierarchical cross-modal interaction structure, effectively achieving cross-modal temporal alignment and mitigating modality heterogeneity. Subsequently, we propose a Confidence-Adaptive Contrastive Learning module (CACL) that dynamically adjusts sample weights using gated confidence signals derived from DHCM to compute loss, prioritizing reliable samples while suppressing noisy instances through adaptive weighting, thereby enhancing representation reliability and generalization in data-scarce scenarios. During the fine-tuning phase, we integrate a cross-modal attention gating mechanism to reinforce temporal associations and adopt an evidence-aware joint optimization objective, providing probabilistic credibility outputs for emotion prediction. Experimental results on the DEAP and MAHNOB-HCI datasets demonstrate that our approach achieves state-of-the-art performance in emotion classification under both subject-dependent and subject-independent settings.
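The confidence-adaptive weighting described above can be sketched as scaling each sample's contrastive loss by a confidence score before averaging. This is an illustrative reduction, assuming confidences in [0, 1] produced by the gating signal; the paper's actual loss and gate are not reproduced:

```python
import numpy as np

def confidence_weighted_loss(per_sample_losses, confidences):
    """Sketch: scale each sample's loss by its confidence so reliable samples
    dominate the update and noisy ones are down-weighted; normalizing by the
    weight sum keeps the overall loss scale stable."""
    losses = np.asarray(per_sample_losses, dtype=float)
    w = np.asarray(confidences, dtype=float)
    return float((w * losses).sum() / max(w.sum(), 1e-8))
```

A sample with confidence 0 contributes nothing, so a noisy instance cannot drag the representation, which is the behavior the abstract attributes to CACL.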

JBHI Journal 2026 Journal Article

DTQFL: A Digital Twin-Assisted Quantum Federated Learning Algorithm for Intelligent Diagnosis in 5G Mobile Network

  • Zhiguo Qu
  • Yang Li
  • Bo Liu
  • Deepak Gupta
  • Prayag Tiwari

Smart healthcare aims to revolutionize medical services by integrating artificial intelligence (AI). The limitations of classical machine learning include privacy concerns that prevent direct data sharing among medical institutions, untimely updates, and long training times. To address these issues, this study proposes a digital twin-assisted quantum federated learning algorithm (DTQFL). By leveraging the 5G mobile network, digital twins (DT) of patients can be created instantly using data from various Internet of Medical Things (IoMT) devices, while simultaneously reducing communication time in federated learning (FL). DTQFL generates DTs for patients with specific diseases, allowing for synchronous training and updating of the variational quantum neural network (VQNN) without disrupting the VQNN in the real world. This study used DTQFL to train a personalized VQNN for each hospital, taking both privacy and training speed into account. The personalized VQNN of each hospital was then obtained through further local iterations of the final global parameters. The results indicate that DTQFL can train a good VQNN without collecting local data while achieving accuracy comparable to that of data-centralized algorithms. In addition, after personalized training, the VQNN can achieve higher accuracy than without personalized training.

AAAI Conference 2026 Conference Paper

From Imitation to Discrimination: Toward a Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks

  • Changpeng Yang
  • Jinyang Wu
  • Yuchen Liu
  • Shuai Zhang
  • Yang Li
  • Qiliang Liang
  • Hongzhen Wang
  • Shuai Nie

Reinforcement learning has emerged as a paradigm for post-training large language models, boosting their reasoning capabilities. Such approaches compute an advantage value for each sample, reflecting better or worse performance than expected, thereby yielding both positive and negative signals for training. However, existing approaches often mix them indiscriminately, especially in the early stages, leading to ambiguous guidance and limited gains. To address this issue, we propose CAPO (Curriculum Advantage Policy Optimization), an adaptive curriculum mechanism based on advantage signals. The proposed mechanism bootstraps imitation learning with positive-only advantage samples to establish robust foundations, and subsequently introduces negative signals to cultivate discriminative capabilities, thereby improving generalization across complex scenarios. Compatible with diverse optimization methods including GRPO, PPO, RLOO, and Reinforce++, our method consistently achieves stable and significant improvements in mathematical reasoning tasks, and further generalizes effectively to multimodal Graphical User Interface (GUI) reasoning scenarios, establishing itself as a versatile and robust optimization paradigm.
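The curriculum over advantage signals can be sketched as a simple gate: early in training only positive advantages pass through (imitation-style learning), and after a switch point negative advantages are admitted as well. The step-threshold schedule below is an assumption for illustration, not the paper's actual curriculum rule:

```python
import numpy as np

def curriculum_advantages(advantages, step, switch_step=1000):
    """Sketch: zero out negative advantages before `switch_step` so only
    positive signals drive early updates; afterwards pass both positive and
    negative advantages through to cultivate discrimination."""
    adv = np.asarray(advantages, dtype=float)
    if step < switch_step:
        return np.where(adv > 0, adv, 0.0)
    return adv
```

Because the gate operates only on the advantage values, it composes with any of the optimizers named in the abstract (GRPO, PPO, RLOO, Reinforce++).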

AAAI Conference 2026 Conference Paper

FVNet: Harnessing Liquid Neural Dynamics for Lightweight Visual Representation

  • Zhenzhe Hou
  • Xiaohui Chu
  • Runze Hu
  • Yang Li
  • Yutao Liu

Efficient visual backbone design remains crucial for resource-constrained computer vision applications. Inspired by the adaptive continuous-time dynamics observed in biological neurons, we propose FVNet, a novel lightweight architecture that integrates liquid neural dynamics for efficient and dynamic visual feature extraction. Central to FVNet is the Fluid Temporal Flow Unit (FTFU), which employs continuous-time equations with learnable time constants to capture spatio-temporal dependencies adaptively. By further stacking these units in a Multi-Phase Fluid Block (MPFB), our model processes features across parallel temporal scales, enabling context-aware feature encoding without incurring excessive computational overhead. Through a discrete closed-form solution, FVNet achieves the representational power of continuous-time models while avoiding the instability and overhead of iterative numerical solvers. Extensive experiments on various vision tasks demonstrate that FVNet achieves superior performance and efficiency over existing state-of-the-art lightweight networks.

AAAI Conference 2026 Conference Paper

GT2-GS: Geometry-aware Texture Transfer for Gaussian Splatting

  • Wenjie Liu
  • Zhongliang Liu
  • Junwei Shu
  • Changbo Wang
  • Yang Li

Transferring 2D textures onto complex 3D scenes plays a vital role in enhancing the efficiency and controllability of 3D multimedia content creation. However, existing 3D style transfer methods primarily focus on transferring abstract artistic styles to 3D scenes. These methods often overlook the geometric information of the scene, which makes it challenging to achieve high-quality 3D texture transfer results. In this paper, we present GT2-GS, a geometry-aware texture transfer framework for Gaussian splatting. First, we propose a geometry-aware texture transfer loss that enables view-consistent texture transfer by leveraging prior view-dependent feature information and texture features augmented with additional geometric parameters. Moreover, an adaptive fine-grained control module is proposed to address the degradation of scene information caused by low-granularity texture features. Finally, a geometry preservation branch is introduced. This branch refines the geometric parameters using additionally bound Gaussian color priors, thereby decoupling the optimization objectives of appearance and geometry. Extensive experiments demonstrate the effectiveness and controllability of our method. Through geometric awareness, our approach achieves texture transfer results that better align with human visual perception.

JBHI Journal 2026 Journal Article

HSGO: Harmonized Swarm Learning With Guided Optimization for Multi-Center sMRI Classification of Alzheimer's Disease

  • Fangtao Song
  • Yang Li
  • Mingfeng Jiang
  • Kaicheng Li
  • Jucheng Zhang
  • Yinlong Zhang
  • Zhibo Pang

Developing robust Alzheimer's Disease (AD) classification models necessitates extensive training data, but aggregating multi-center medical data poses privacy risks. Although Federated Learning (FL) and Swarm Learning (SL) allow training generic models without data sharing, their performance is limited by variations in AD pathology features and sample class imbalances across centers. To address this issue, we propose a novel Harmonized Swarm Learning framework with Guided Optimization (HSGO) to enhance multi-center collaboration while preserving data privacy. Our framework employs a class-balanced loss function to train a robust generic model and guides the optimization of personalized models towards the generic model, eliminating extra AD pathology feature extraction steps. Furthermore, we design a dynamic feature similarity storage mechanism to facilitate personalized training. Experiments performed under two different multi-center data partitioning scenarios demonstrate that HSGO achieves competitive performance when compared with five baseline methods. Additionally, Layer-wise Relevance Propagation (LRP) analysis indicates that HSGO may help identify potential key brain regions in AD by integrating local and global features compared to traditional SL.
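The abstract names a class-balanced loss for handling sample class imbalance across centers. One common instantiation, shown here as an assumed example (the paper's exact formulation is not given in the abstract), is the effective-number-of-samples weighting of Cui et al. (2019):

```python
import numpy as np

def class_balanced_weights(counts, beta=0.999):
    """Sketch: per-class loss weights proportional to
    (1 - beta) / (1 - beta**n_c), i.e. inverse effective sample number,
    normalized so the weights sum to the number of classes."""
    counts = np.asarray(counts, dtype=float)
    effective_num = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    w = 1.0 / effective_num
    return w * len(counts) / w.sum()
```

Rare classes (e.g. underrepresented diagnostic labels at a small center) receive larger weights, counteracting the imbalance the abstract identifies as a limitation of plain FL/SL training.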

AAAI Conference 2026 Conference Paper

HyperDiag: Temporal–Regional Hypergraph Learning via Topology-Enhanced State Propagation for Brain Disease Diagnosis

  • Yulan Ma
  • Fangkun Li
  • Wenchao Yang
  • Qian Si
  • Chenglong Yu
  • Yang Li

Dynamic brain networks provide a powerful representation for capturing temporal variations in functional brain connectivity and have gained increasing attention in brain disease diagnosis. However, most existing methods extract features from isolated time windows, making it difficult to capture the high-order dynamic evolution of brain activity. Moreover, these methods often neglect the functional heterogeneity among brain regions, thereby limiting diagnostic performance. To address these limitations, we propose HyperDiag, a novel temporal-regional Hypergraph learning via topology-enhanced state propagation for brain disease Diagnosis. Specifically, we first design a dual-level hypergraph learning strategy: a temporally-evolving hypergraph message passing strategy to capture dynamic high-order dependencies within and across time windows, and meanwhile, a region-wise functional hypergraph learning strategy to capture regional dependencies. Subsequently, we construct a topology-enhanced selective state-space propagation network to integrate complementary information from both the temporally-evolving and region-wise features. Extensive experiments on four brain disorder datasets (ABIDE-I, ADNI, REST-meta-MDD, and Epilepsy) demonstrate that HyperDiag not only outperforms state-of-the-art methods but also identifies biologically meaningful abnormal connections, offering potential biomarkers for clinical interpretation.

AAAI Conference 2026 Conference Paper

JudgeBoard: Benchmarking and Enhancing Small Language Models for Reasoning Evaluation

  • Zhenyu Bi
  • Gaurav Srivastava
  • Yang Li
  • Swastik Roy
  • Meng Lu
  • Morteza Ziyadi
  • Xuan Wang

While small language models (SLMs) have shown promise on various reasoning tasks, their ability to judge the correctness of answers remains unclear compared to large language models (LLMs). Prior work on LLM-as-a-judge frameworks typically relies on comparing candidate answers against ground-truth labels or other candidate answers using predefined metrics like entailment. However, this approach is inherently indirect and difficult to fully automate, offering limited support for fine-grained and scalable evaluation of reasoning outputs. In this work, we propose JudgeBoard, a novel evaluation pipeline that directly queries models to assess the correctness of candidate answers without requiring extra answer comparisons. We focus on two core reasoning domains: mathematical reasoning and science/commonsense reasoning, and construct task-specific evaluation leaderboards using both accuracy-based ranking and an Elo-based rating system across five benchmark datasets, enabling consistent model comparison as judges rather than comparators. To improve judgment performance in lightweight models, we propose MAJ (Multi-Agent Judging), a novel multi-agent evaluation framework that leverages multiple interacting SLMs with distinct reasoning profiles to approximate LLM-level judgment accuracy through collaborative deliberation. Experimental results reveal a significant performance gap between SLMs and LLMs in isolated judging tasks. However, our MAJ framework substantially improves the reliability and consistency of SLMs. On the MATH dataset, MAJ using smaller-sized models as backbones performs comparatively well or even better than their larger-sized counterparts. Our findings highlight that multi-agent SLM systems can potentially match or exceed LLM performance in judgment tasks, with implications for scalable and efficient assessment.
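The aggregation step of multi-agent judging can be sketched as a majority vote over the individual judges' verdicts. This omits the collaborative deliberation the abstract describes and shows only the final vote, with boolean verdicts as an assumed interface:

```python
from collections import Counter

def multi_agent_judge(verdicts):
    """Sketch: each small judge model returns True (answer correct) or
    False (incorrect); the ensemble verdict is the majority vote."""
    votes = Counter(verdicts)
    return votes[True] > votes[False]
```

With judges that err independently, the majority is correct more often than any single judge, which is one intuition for why MAJ narrows the gap to larger models.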

AAAI Conference 2026 Conference Paper

Learning Optimal Prompt Ensemble for Multi-source Visual Prompt Transfer

  • Enming Zhang
  • Liwen Cao
  • Yanru Wu
  • Zhao Zijie
  • Yang Li

Prompt tuning has emerged as a lightweight strategy for adapting foundation models to downstream tasks, particularly for resource-constrained systems. As pre-trained prompts become valuable assets, combining multiple source prompts offers a promising approach to enhance generalization for new tasks by leveraging complementary knowledge. However, naive aggregation often overlooks that different source prompts contribute differently to the target task. To address this, we propose HGPrompt, a dynamic framework that learns optimal ensemble weights. These weights are optimized by jointly maximizing an information-theoretic metric for transferability and minimizing gradient conflicts via a novel regularization strategy. Specifically, we propose a differentiable prompt transferability metric to capture the discriminability of prompt-induced features on the target task. Meanwhile, HGPrompt matches the gradient variances with respect to different source prompts based on Hessian and Fisher Information, ensuring stable and coherent knowledge transfer while suppressing gradient conflicts among them. Extensive experiments on the large-scale VTAB benchmark demonstrate the state-of-the-art performance of HGPrompt, validating its effectiveness in learning an optimal ensemble for effective multi-source prompt transfer.
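The core combination step, learning ensemble weights over source prompts, can be sketched as a softmax-weighted sum of prompt tensors. The transferability metric and Hessian/Fisher regularization that actually drive the weights in HGPrompt are not reproduced here; the raw logits are an assumed learnable parameter:

```python
import numpy as np

def ensemble_prompts(source_prompts, logits):
    """Sketch: combine several source prompts into one target prompt using
    softmax ensemble weights over raw logits (assumed learnable)."""
    w = np.exp(logits - np.max(logits))
    w = w / w.sum()                        # softmax ensemble weights
    prompts = np.stack(source_prompts)     # (num_sources, prompt_len, dim)
    return np.tensordot(w, prompts, axes=1)  # weighted sum over sources
```

In the full method, gradients through the weighted sum would be shaped by the transferability objective and the gradient-conflict regularizer described in the abstract.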

JBHI Journal 2026 Journal Article

PAM-CDR: Property-Aware Multi-Modal Drug Representation Learning for Accurate Cancer Drug Response Prediction

  • Yang Li
  • Chang Liu
  • Haijie Cui
  • Jianli Ma

Accurate prediction of cancer drug response is essential for advancing precision oncology, enabling tailored therapies that account for the molecular heterogeneity of tumors. While deep learning has shown promise in this domain, many existing approaches fail to incorporate physicochemical properties of drug compounds, limiting the biological interpretability and generalizability of learned representations. To address this gap, we present PAM-CDR, a property-aware multi-modal representation learning framework that integrates molecular graphs, fingerprints, and physicochemical descriptors with transcriptomic and genomic profiles of cancer cell lines. PAM-CDR employs a three-stage hierarchical fusion strategy to enable fine-grained representation learning across drug and cell modalities. In the first stage, property-guided attention injects biologically meaningful context to enrich molecular graph and fingerprint features. In the second stage, bidirectional cross-modality interactions capture complementary patterns and enhance multi-omic cellular representations. In the final stage, unified drug and cell line embeddings are integrated to accurately predict drug responses. Benefiting from these designs, PAM-CDR consistently outperforms competitive baselines, achieving an AUC of 0.9161 and an AUPR of 0.9313. Ablation studies confirm the critical contribution of physicochemical priors, while embedding visualizations reveal improved biological coherence in the learned molecular representations. The code is publicly available at https://github.com/catly/PAM-CDR.

AAAI Conference 2026 Conference Paper

PrAda-GAN: A Private Adaptive Generative Adversarial Network with Bayes Network Structure

  • Ke Jia
  • Yuheng Ma
  • Yang Li
  • Feifei Wang

We revisit the problem of generating synthetic data under differential privacy. To address the core limitations of marginal-based methods, we propose the Private Adaptive Generative Adversarial Network with Bayes Network Structure (PrAda-GAN), which integrates the strengths of both GAN-based and marginal-based approaches. Our method adopts a sequential generator architecture to capture complex dependencies among variables, while adaptively regularizing the learned structure to promote sparsity in the underlying Bayes network. Theoretically, we establish diminishing bounds on the parameter distance, variable selection error, and Wasserstein distance. Our analysis shows that leveraging dependency sparsity leads to significant improvements in convergence rates. Empirically, experiments on both synthetic and real-world datasets demonstrate that PrAda-GAN outperforms existing tabular data synthesis methods in terms of the privacy–utility trade-off.

AAAI Conference 2026 Conference Paper

Simulating Distribution Dynamics: Liquid Temporal Feature Evolution for Single-Domain Generalized Object Detection

  • Zihao Zhang
  • Yang Li
  • Aming WU
  • Yahong Han

In this paper, we focus on Single-Domain Generalized Object Detection (Single-DGOD), aiming to transfer a detector trained on one source domain to multiple unknown domains. Existing methods for Single-DGOD typically rely on discrete data augmentation or static perturbation methods to expand data diversity, thereby mitigating the lack of access to target domain data. However, in real-world scenarios such as changes in weather or lighting conditions, domain shifts often occur continuously and gradually. Discrete augmentations and static perturbations fail to effectively capture the dynamic variation of feature distributions, thereby limiting the model's ability to perceive fine-grained cross-domain differences. To this end, we propose a new method, i.e., Liquid Temporal Feature Evolution, which simulates the progressive evolution of features from the source domain to simulated latent distributions by incorporating temporal modeling and liquid neural network–driven parameter adjustment. Specifically, we introduce controllable Gaussian noise injection and multi-scale Gaussian blurring to simulate initial feature perturbations, followed by temporal modeling and a liquid parameter adjustment mechanism to generate adaptive modulation parameters, enabling a smooth and continuous adaptation across domains. By capturing progressive cross-domain feature evolution and dynamically regulating adaptation paths, our method bridges the source-unknown domain distribution gap, significantly boosting generalization and robustness to unseen shifts. Significant performance improvements on the Diverse Weather dataset and Real-to-Art benchmark demonstrate the superiority of our method.

AAAI Conference 2026 Conference Paper

SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability

  • Jiankang Wang
  • Zhihan Zhang
  • Zhihang Liu
  • Yang Li
  • Jiannan Ge
  • Hongtao Xie
  • Yongdong Zhang

Multimodal Large Language Models (MLLMs) have shown remarkable progress in temporal or spatial localization tasks, but struggle with joint spatio-temporal video grounding (STVG). We identify two key bottlenecks hindering this capability: (1) the sheer number of visual tokens makes long-range and fine-grained visual modeling challenging; (2) generating a long sequence of bounding boxes in text makes it hard to accurately align each box with its specific video frame. Distinct from prior efforts that rely on attaching complex modules, we argue for a more elegant paradigm that unlocks the inherent potential of MLLMs and leverages their strengths. To this end, we propose SpaceVLLM, an MLLM equipped with spatio-temporal video grounding capabilities. Specifically, we propose Spatio-Temporal Aware Queries, interleaved with video frames, to guide the MLLM in capturing both static appearance and dynamic motion features. We further present a lightweight Query-Guided Space Head that maps queries to precise spatial coordinates, bypassing the need for direct textual coordinate generation and enabling the MLLM to focus on video understanding. To further facilitate research in this area, we propose an automated data synthesis pipeline to construct the V-STG dataset, comprising 110K STVG instances. Extensive experiments show that SpaceVLLM achieves state-of-the-art performance on STVG benchmarks and maintains strong performance on various video understanding tasks, validating our approach's effectiveness.

AAAI Conference 2026 Conference Paper

Splats in Splats: Robust and Effective 3D Steganography Towards Gaussian Splatting

  • Yijia Guo
  • Wenkai Huang
  • Yang Li
  • Gaolei Li
  • Hang Zhang
  • Liwen Hu
  • Jianhua Li
  • Tiejun Huang

3D Gaussian splatting (3DGS) has demonstrated impressive 3D reconstruction performance with explicit scene representations. Given the widespread application of 3DGS in 3D reconstruction and generation tasks, there is an urgent need to protect the copyright of 3DGS assets. However, existing copyright protection techniques for 3DGS overlook the usability of 3D assets, posing challenges for practical deployment. Here we describe splats in splats, the first 3DGS steganography framework that embeds 3D content in 3DGS itself without modifying any attributes. To achieve this, we take a deep insight into spherical harmonics (SH) and devise an importance-graded SH coefficient encryption strategy to embed the hidden SH coefficients. Furthermore, we employ a convolutional autoencoder to establish a mapping between the original Gaussian primitives' opacity and the hidden Gaussian primitives' opacity. Extensive experiments indicate that our method significantly outperforms existing 3D steganography techniques, with 5.31% higher scene fidelity and 3x faster rendering speed, while ensuring security, robustness, and user experience.

AAAI Conference 2026 Conference Paper

Target-Balanced Score Distillation

  • Zhou Xu
  • Qi Wang
  • Yuxiao Yang
  • Luyuan Zhang
  • Zhang Liang
  • Yang Li

Score Distillation Sampling (SDS) enables 3D asset generation by distilling priors from pretrained 2D text-to-image diffusion models, but vanilla SDS suffers from over-saturation and over-smoothing. To mitigate this issue, recent variants have incorporated negative prompts. However, these methods face a critical trade-off: limited texture optimization, or significant texture gains with shape distortion. In this work, we first conduct a systematic analysis and reveal that this trade-off is fundamentally governed by the utilization of the negative prompts: Target Negative Prompts (TNP), which embed target information in the negative prompts, dramatically enhance texture realism and fidelity but induce shape distortions. Informed by this key insight, we introduce Target-Balanced Score Distillation (TBSD). It formulates generation as a multi-objective optimization problem and introduces an adaptive strategy that effectively resolves the aforementioned trade-off. Extensive experiments demonstrate that TBSD significantly outperforms existing state-of-the-art methods, yielding 3D assets with high-fidelity textures and geometrically accurate shapes.

AAAI Conference 2026 Conference Paper

Thinker: Training LLMs in Hierarchical Thinking for Deep Search via Multi-Turn Interaction

  • Jun Xu
  • Xinkai Du
  • Yu Ao
  • Peilong Zhao
  • Yang Li
  • Ling Zhong
  • Lin Yuan
  • Zhongpu Bo

Efficient retrieval of external knowledge bases and web pages is crucial for enhancing the reasoning abilities of LLMs. Previous works on training LLMs to leverage external retrievers for solving complex problems have predominantly employed end-to-end reinforcement learning. However, these approaches neglect supervision over the reasoning process, making it difficult to guarantee logical coherence and rigor. To address these limitations, we propose Thinker, a hierarchical thinking model for deep search through multi-turn interaction, making the reasoning process supervisable and verifiable. It decomposes complex problems into independently solvable sub-problems, each dually represented in both natural language and an equivalent logical function to support knowledge base and web searches. Concurrently, dependencies between sub-problems are passed as parameters via these logical functions, enhancing the logical coherence of the problem-solving process. To avoid unnecessary external searches, we perform knowledge boundary determination to check if a sub-problem is within the LLM's intrinsic knowledge, allowing it to answer directly. Experimental results indicate that with as few as several hundred training samples, the performance of Thinker is competitive with established baselines. Furthermore, when scaled to the full training set, Thinker significantly outperforms these methods across various datasets and model sizes.

AAAI Conference 2026 Conference Paper

Uncovering and Aligning Anomalous Attention Heads to Defend Against NLP Backdoor Attacks

  • Haotian Jin
  • Yang Li
  • Haihui Fan
  • Lin Shen
  • Xiangfang Li
  • Bo Li

Backdoor attacks pose a serious threat to the security of large language models (LLMs), causing them to exhibit anomalous behavior under specific trigger conditions. The design of backdoor triggers has evolved from fixed triggers to dynamic or implicit triggers. This increased flexibility in trigger design makes it challenging for defenders to accurately identify their specific forms. Most existing backdoor defense methods are limited to specific types of triggers or rely on an additional clean model for support. To address this issue, we propose a backdoor detection method based on attention similarity, enabling backdoor detection without prior knowledge of the trigger. Our study reveals that models subjected to backdoor attacks exhibit unusually high similarity among attention heads when exposed to triggers. Based on this observation, we propose an attention safety alignment approach combined with head-wise fine-tuning to rectify potentially contaminated attention heads, thereby effectively mitigating the impact of backdoor attacks. Extensive experimental results demonstrate that our method significantly reduces the success rate of backdoor attacks while preserving the model’s performance on downstream tasks.
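The detection signal described above, unusually high similarity among attention heads when a trigger is present, can be sketched as the mean pairwise cosine similarity of head output vectors for one input. The boolean decision would be made against a threshold calibrated on clean inputs, which is assumed here and not specified in the abstract:

```python
import numpy as np

def mean_head_similarity(head_outputs):
    """Sketch: average pairwise cosine similarity among attention-head output
    vectors; values near 1 indicate the anomalously uniform heads the paper
    associates with a triggered backdoor."""
    H = np.asarray(head_outputs, dtype=float)
    H = H / np.linalg.norm(H, axis=1, keepdims=True)  # unit-normalize each head
    sim = H @ H.T                                     # pairwise cosine matrix
    n = len(H)
    return float((sim.sum() - n) / (n * (n - 1)))     # mean off-diagonal entry
```

Heads of a clean model responding to diverse features score low on this statistic, while collapsed, trigger-dominated heads score near 1.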

NeurIPS Conference 2025 Conference Paper

A High-Dimensional Statistical Method for Optimizing Transfer Quantities in Multi-Source Transfer Learning

  • Qingyue Zhang
  • Haohao Fu
  • Guanbo Huang
  • Yaoyuan Liang
  • Chang Chu
  • Tianren Peng
  • Yanru Wu
  • Qi Li

Multi-source transfer learning provides an effective solution to data scarcity in real-world supervised learning scenarios by leveraging multiple source tasks. In this field, existing works typically use all available samples from sources in training, which constrains their training efficiency and may lead to suboptimal results. To address this, we propose a theoretical framework that answers the question: what is the optimal quantity of source samples needed from each source task to jointly train the target model? Specifically, we introduce a generalization error measure based on K-L divergence, and minimize it based on high-dimensional statistical analysis to determine the optimal transfer quantity for each source task. Additionally, we develop an architecture-agnostic and data-efficient algorithm OTQMS to implement our theoretical results for target model training in multi-source transfer learning. Experimental studies on diverse architectures and two real-world benchmark datasets show that our proposed algorithm significantly outperforms state-of-the-art approaches in both accuracy and data efficiency. The code is available at https://github.com/zqy0126/OTQMS.

ICML Conference 2025 Conference Paper

Active Evaluation Acquisition for Efficient LLM Benchmarking

  • Yang Li
  • Jie Ma 0005
  • Miguel Ballesteros
  • Yassine Benajiba
  • Graham Horwood

As large language models (LLMs) become increasingly versatile, numerous large-scale benchmarks have been developed to thoroughly assess their capabilities. These benchmarks typically consist of diverse datasets and prompts to evaluate different aspects of LLM performance. However, comprehensive evaluations on hundreds or thousands of prompts incur tremendous costs in terms of computation, money, and time. In this work, we investigate strategies to improve evaluation efficiency by selecting a subset of examples from each benchmark using a learned policy. Our approach models the dependencies across test examples, allowing accurate prediction of the evaluation outcomes for the remaining examples based on the outcomes of the selected ones. Consequently, we only need to acquire the actual evaluation outcomes for the selected subset. We rigorously explore various subset selection policies and introduce a novel RL-based policy that leverages the captured dependencies. Empirical results demonstrate that our approach significantly reduces the number of evaluation prompts required while maintaining accurate performance estimates compared to previous methods.

AAMAS Conference 2025 Conference Paper

Bottom-Up Reputation Promotes Cooperation with Multi-Agent Reinforcement Learning

  • Tianyu Ren
  • Xuan Yao
  • Yang Li
  • Xiao-Jun Zeng

Reputation serves as a powerful mechanism for promoting cooperation in multi-agent systems, as agents are more inclined to cooperate with those of good social standing. While existing multi-agent reinforcement learning methods typically rely on predefined social norms to assign reputations, the question of how a population reaches a consensus on judgement when agents hold private, independent views remains unresolved. In this paper, we propose a novel bottom-up reputation learning method, Learning with Reputation Reward (LR2), designed to promote cooperative behaviour through reward shaping based on assigned reputation. Our agent architecture includes a dilemma policy that determines cooperation by considering the impact on neighbours, and an evaluation policy that assigns reputations to affect the actions of neighbours while optimizing self-objectives. It operates using local observations and interaction-based rewards, without relying on centralized modules or predefined norms. Our findings demonstrate the effectiveness and adaptability of LR2 across various spatial social dilemma scenarios. Interestingly, we find that LR2 stabilizes and enhances cooperation not only with reward reshaping from bottom-up reputation but also by fostering strategy clustering in structured populations, thereby creating environments conducive to sustained cooperation.

NeurIPS Conference 2025 Conference Paper

Bridging Crypto with ML-based Solvers: the SAT Formulation and Benchmarks

  • Xinhao Zheng
  • Xinhao Song
  • Bolin Qiu
  • Yang Li
  • Zhongteng Gui
  • Junchi Yan

The Boolean Satisfiability Problem (SAT) plays a crucial role in cryptanalysis, enabling tasks like key recovery and distinguisher construction. Conflict-Driven Clause Learning (CDCL) has emerged as the dominant paradigm in modern SAT solving, and machine learning has been increasingly integrated with CDCL-based SAT solvers to tackle complex cryptographic problems. However, the lack of a unified evaluation framework, inconsistent input formats, and varying modeling approaches hinder fair comparison. Moreover, cryptographic SAT instances also differ structurally from standard SAT problems, and the absence of standardized datasets further complicates evaluation. To address these issues, we introduce SAT4CryptoBench, the first comprehensive benchmark for assessing machine learning–based solvers in cryptanalysis. SAT4CryptoBench provides diverse SAT datasets in both Arithmetic Normal Form (ANF) and Conjunctive Normal Form (CNF), spanning various algorithms, rounds, and key sizes. Our framework evaluates three levels of machine learning integration: standalone distinguishers for instance classification, heuristic enhancement for guiding solving strategies, and hyperparameter optimization for adapting to specific problem distributions. Experiments demonstrate that ANF-based networks consistently achieve superior performance over CNF-based networks in learning cryptographic features. Nonetheless, current ML techniques struggle to generalize across algorithms and instance sizes, with computational overhead potentially offsetting benefits on simpler cases. Despite this, ML-driven optimization strategies notably improve solver efficiency on cryptographic SAT instances. Additionally, we propose BASIN, a bitwise solver taking plaintext-ciphertext bitstrings as inputs. Crucially, its superior performance on high-round problems highlights the importance of input modeling and the advantage of direct input representations for complex cryptographic structures.

NeurIPS Conference 2025 Conference Paper

Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation

  • Wenbo Zhang
  • Tianrun Hu
  • Hanbo Zhang
  • Yanyuan Qiao
  • Yuchu Qin
  • Yang Li
  • Jiajun Liu
  • Tao Kong

We present Chain-of-Action (CoA), a novel visuomotor policy paradigm built upon Trajectory Autoregressive Modeling. Unlike conventional approaches that predict next step action(s) forward, CoA generates an entire trajectory by explicit backward reasoning with task-specific goals through an action-level Chain-of-Thought (CoT) process. This process is unified within a single autoregressive structure: (1) the first token corresponds to a stable keyframe action that encodes the task-specific goals; and (2) subsequent action tokens are generated autoregressively, conditioned on the initial keyframe and previously predicted actions. This backward action reasoning enforces a global-to-local structure, allowing each local action to be tightly constrained by the final goal. To further realize the action reasoning structure, CoA incorporates four complementary designs: continuous action token representation; dynamic stopping for variable-length trajectory generation; reverse temporal ensemble; and multi-token prediction to balance action chunk modeling with global structure. As a result, CoA gives strong spatial generalization capabilities while preserving the flexibility and simplicity of a visuomotor policy. Empirically, we observe that CoA outperforms representative imitation learning algorithms such as ACT and Diffusion Policy across 60 RLBench tasks and 8 real-world tasks.

ICML Conference 2025 Conference Paper

COExpander: Adaptive Solution Expansion for Combinatorial Optimization

  • Jiale Ma
  • Wenzheng Pan
  • Yang Li
  • Junchi Yan

Despite rapid progress in neural combinatorial optimization (NCO) for solving CO problems (COPs), as the problem scale grows, several bottlenecks persist: 1) solvers in the Global Prediction (GP) paradigm struggle in long-range decisions where the overly smooth intermediate heatmaps impede effective decoding, and 2) solvers in the Local Construction (LC) paradigm are time-consuming and incapable of tackling large instances due to the onerous auto-regressive process. Observing these challenges, we propose a new paradigm named Adaptive Expansion (AE) with its instantiation COExpander, positioned to leverage the advantages of both GP and LC. COExpander utilizes informative heatmaps generated by a global predictor, which is learned under the guidance of locally determined partial solutions, to in turn direct the expansion of determined decision variables with adaptive step-sizes. To ensure transparent evaluation, we further take the lead in canonicalizing 29 benchmarks spanning 6 popular COPs (MIS, MCl, MVC, MCut, TSP, ATSP) and various scales (50-10K nodes), upon which experiments demonstrate concrete SOTA performance of COExpander over these tasks. Source code and our standardized datasets will be made public.

NeurIPS Conference 2025 Conference Paper

Collaborative Reasoner: Self-Improving Social Agents with Synthetic Conversations

  • Ansong Ni
  • Ruta Desai
  • Yang Li
  • Xinjie Lei
  • Dong Wang
  • Jiemin Zhang
  • Jane Yu
  • Ramya Raghavendra

With increasingly powerful large language models (LLMs) and LLM-based agents tackling an ever-growing list of tasks, we envision a future where numerous LLM agents work seamlessly with other AI agents and humans to solve complex problems and enhance daily life. To achieve these goals, LLM agents must develop collaborative skills such as effective persuasion, assertion and disagreement, which are often overlooked in the prevalent single-turn training and evaluation of LLMs. In this work, we present Collaborative Reasoner (Coral), a framework to evaluate and improve the collaborative reasoning abilities of language models. In particular, tasks and metrics in Coral necessitate agents to disagree with incorrect solutions, convince their partners of a correct solution, and ultimately agree as a team to commit to a final solution, all through a natural multi-turn conversation. Through comprehensive evaluation on six collaborative reasoning tasks covering domains of coding, math, scientific QA and social reasoning, we show that current models cannot effectively collaborate due to undesirable social behaviors, collapsing even on problems that they can solve singlehandedly. To improve the collaborative reasoning capabilities of LLMs, we propose a self-play method to generate synthetic multi-turn preference data and further train the language models to be better collaborators. Experiments with Llama-3.1, Ministral and Qwen-2.5 models show that our proposed self-improvement approach consistently outperforms finetuned chain-of-thought performance of the same base model, yielding gains up to 16.7% absolute. Human evaluations show that the models exhibit more effective disagreement and produce more natural conversations after training on our synthetic interaction data.

TIST Journal 2025 Journal Article

Cross-User Federated Recommendation Unlearning

  • Yang Li
  • Enyue Yang
  • Weike Pan
  • Qiang Yang
  • Zhong Ming

Cross-user federated recommendation (CUFR) is a promising solution for providing personalized services without collecting users’ raw data. However, most previous CUFR works mainly focus on providing accurate and privacy-preserving personalized recommendations, but overlook the fact that users can opt out at any time during the training process. In response, we study an emerging and new problem of efficiently training an unlearned model to forget the data of the clients who leave a federated system. It is challenging to simply apply or slightly modify existing machine unlearning or federated unlearning methods to CUFR because of the unique collaboration effect in recommender systems. Although a recent gradient calibration-based method (i.e., FRU) shows promise in training an unlearned model, there are still some limitations: (i) some clients may run out of storage space, (ii) all the remaining clients need to participate in computing the new gradients, (iii) it masks the uniqueness of the local gradients, and (iv) the errors of the calibrated gradients increase gradually with more iterations. In this article, we propose a novel CUFR unlearning (CUFRU) method. Specifically, we design a gradient transfer station (GTS) module for storing the historical gradients while enabling clients to dynamically participate in the computation of the calibrated gradients with the new gradients based on their online status. Moreover, we design a novel iteration-aware gradient calibration mechanism to strike a balance between the weights of the historical and new gradients at the different stages of the unlearning process, alleviating the calibration errors. Finally, we conduct extensive experiments on three real-world datasets to show that our CUFRU can more efficiently train an unlearned model with competitive recommendation performance.

ICRA Conference 2025 Conference Paper

Effective Tuning Strategies for Generalist Robot Manipulation Policies

  • Wenbo Zhang 0009
  • Yang Li
  • Yanyuan Qiao
  • Siyuan Huang 0004
  • Jiajun Liu
  • Feras Dayoub
  • Xiao Ma
  • Lingqiao Liu

Generalist robot manipulation policies (GMPs) have the potential to generalize across a wide range of tasks, devices, and environments. However, existing policies continue to struggle with out-of-distribution scenarios due to the inherent difficulty of collecting sufficient action data to cover extensively diverse domains. While fine-tuning offers a practical way to quickly adapt a GMP to novel domains and tasks with limited samples, we observe that the performance of the resulting GMP differs significantly with respect to the design choices of fine-tuning strategies. In this work, we first conduct an in-depth empirical study to investigate the effect of key factors in GMP fine-tuning strategies, covering the action space, policy head, supervision signal and the choice of tunable parameters, where 2,500 rollouts are evaluated for a single configuration. We systematically discuss and summarize our findings and identify the key design choices, which we believe give a practical guideline for GMP fine-tuning. We observe that in a low-data regime, with carefully chosen fine-tuning strategies, a GMP significantly outperforms the state-of-the-art imitation learning algorithms. The results presented in this work establish a new baseline for future studies on fine-tuned GMPs.

AAAI Conference 2025 Conference Paper

Enhancing Sequential Recommendation with Global Diffusion

  • Mingxuan Luo
  • Yang Li
  • Chen Lin

Existing sequential recommendation models are mostly based on sequential models, which can be misled by inconsistent items in the local sequence. This study proposes GlobalDiff, a plug-and-play framework to enhance the performance of sequential models by utilizing a diffusion model to restore the global non-sequential data structure of the item universe and compensate for the local sequential context. Several novel techniques are proposed, including training construction, guided reverse approximator, and inference ensemble, to seamlessly integrate the diffusion model with the sequential model. Extensive experiments on various datasets demonstrate that GlobalDiff can enhance advanced sequential models by an average improvement of 9.67%.

AAAI Conference 2025 Conference Paper

EventZoom: A Progressive Approach to Event-Based Data Augmentation for Enhanced Neuromorphic Vision

  • Yiting Dong
  • Xiang He
  • Guobin Shen
  • Dongcheng Zhao
  • Yang Li
  • Yi Zeng

Dynamic Vision Sensors (DVS) capture event data with high temporal resolution and low power consumption, presenting a more efficient solution for visual processing in dynamic and real-time scenarios compared to conventional video capture methods. Event data augmentation serves as an essential method for overcoming the limitation of scale and diversity in event datasets. Our comparative experiments demonstrate that two factors, spatial integrity and temporal continuity, significantly affect the effectiveness of event data augmentation, as they preserve the sparsity and high dynamic range characteristics unique to event data. However, existing augmentation methods often neglect the preservation of spatial integrity and temporal continuity. To address this, we developed a novel event data augmentation strategy, EventZoom, which employs a temporal progressive strategy, embedding transformed samples into the original samples through progressive scaling and shifting. The scaling process avoids the spatial information loss associated with cropping, while the progressive strategy prevents interruptions or abrupt changes in temporal information. We validated EventZoom across various supervised learning frameworks. The experimental results show that EventZoom consistently outperforms existing event data augmentation methods, achieving SOTA performance. For the first time, we concurrently employ semi-supervised and unsupervised learning to verify the feasibility of event augmentation algorithms, demonstrating the applicability and effectiveness of EventZoom as a powerful event-based data augmentation tool for handling real-world scenes in highly dynamic and variable environments.

IS Journal 2025 Journal Article

Explicable Artificial Intelligence for Affective Computing

  • Rui Mao
  • Erik Cambria
  • Yang Li
  • Newton Howard

Artificial intelligence (AI) is increasingly tasked with recognizing and responding to human emotions, making affective computing one of its most consequential frontiers. As AI spreads into finance, policymaking, and mental health, the opacity of deep learning models raises urgent challenges for trust, accountability, and ethics. This special issue addresses explicability not just as algorithmic transparency, but as a paradigm integrating cognitive science, the humanities, and ethical foresight with technical innovation. Guided by the “Seven Pillars for the Future of AI”— multidisciplinarity, task decomposition, parallel analogy, symbol grounding, similarity measure, intention awareness, and trustworthiness—it envisions affective AI as a partner in meaning-making rather than a mere inference engine. The six featured articles span topics from depression detection and sentiment analysis to hate speech moderation and interpretable driving behaviors, advancing affective AI that is accurate, interpretable, and aligned with human dignity.

NeurIPS Conference 2025 Conference Paper

Exploiting Task Relationships in Continual Learning via Transferability-Aware Task Embeddings

  • Yanru Wu
  • Jianning Wang
  • Xiangyu Chen
  • Yang Tan
  • Hanbing Liu
  • Yang Li

Continual learning (CL) has been a critical topic in contemporary deep neural network applications, where higher levels of both forward and backward transfer are desirable for an effective CL performance. Existing CL strategies primarily focus on task models — either by regularizing model updates or by separating task-specific and shared components — while often overlooking the potential of leveraging inter-task relationships to enhance transfer. To address this gap, we propose a transferability-aware task embedding, termed H-embedding, and construct a hypernet framework under its guidance to learn task-conditioned model weights for CL tasks. Specifically, H-embedding is derived from an information theoretic measure of transferability and is designed to be online and easy to compute. Our method is also characterized by notable practicality, requiring only the storage of a low-dimensional task embedding per task and supporting efficient end-to-end training. Extensive evaluations on benchmarks including CIFAR-100, ImageNet-R, and DomainNet show that our framework performs strongly compared to various baseline and SOTA approaches, demonstrating strong potential in capturing and utilizing intrinsic task relationships. Our code is publicly available at https://github.com/viki760/Hembedding_Guided_Hypernet.

IJCAI Conference 2025 Conference Paper

FGeo-HyperGNet: Geometric Problem Solving Integrating FormalGeo Symbolic System and Hypergraph Neural Network

  • Xiaokai Zhang
  • Yang Li
  • Na Zhu
  • Cheng Qin
  • Zhenbing Zeng
  • Tuo Leng

Geometric problem solving has always been a long-standing challenge in the fields of mathematical reasoning and artificial intelligence. We built a neural-symbolic system, called FGeo-HyperGNet, to automatically perform human-like geometric problem solving. The symbolic component is a formal system built on FormalGeo, which can automatically perform geometric relational reasoning and algebraic calculations and organize the solution into a hypergraph with conditions as hypernodes and theorems as hyperedges. The neural component, called HyperGNet, is a hypergraph neural network based on the attention mechanism, including an encoder to effectively encode the structural and semantic information of the hypergraph and a theorem predictor to provide guidance in solving problems. The neural component predicts theorems according to the hypergraph, and the symbolic component applies theorems and updates the hypergraph, thus forming a predict-apply cycle to ultimately achieve readable and traceable automatic solving of geometric problems. Experiments demonstrate the correctness and effectiveness of this neural-symbolic architecture. We achieved state-of-the-art results with a TPA of 93.50% and a PSSR of 88.36% on the FormalGeo7K dataset.

NeurIPS Conference 2025 Conference Paper

From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring

  • Yang Li
  • Qiang Sheng
  • Yehan Yang
  • Xueyao Zhang
  • Juan Cao

Though safety alignment has been applied to most large language models (LLMs), LLM service providers generally deploy a subsequent moderation as the external safety guardrail in real-world products. Existing moderators mainly practice a conventional full detection, which determines the harmfulness based on the complete LLM output, causing high service latency. Recent works pay more attention to partial detection where moderators oversee the generation midway and early stop the output if harmfulness is detected, but they directly apply moderators trained with the full detection paradigm to incomplete outputs, introducing a training-inference gap that lowers the performance. In this paper, we explore how to form a data-and-model solution that natively supports partial detection. For the data, we construct FineHarm, a dataset consisting of 29K prompt-response pairs with fine-grained token-level annotations to provide reasonable supervision for token-level training. Then, we propose the Streaming Content Monitor (SCM), which is trained with dual supervision of response- and token-level labels and can follow the output stream of LLM to make a timely judgment of harmfulness. Experiments show that SCM achieves a macro F1 score above 0.95, comparable to full detection, while seeing only the first 18% of tokens in responses on average. Moreover, the SCM can serve as a pseudo-harmfulness annotator for improving safety alignment and leads to a higher harmlessness score than DPO.

NeurIPS Conference 2025 Conference Paper

Generation as Search Operator for Test-Time Scaling of Diffusion-based Combinatorial Optimization

  • Yang Li
  • Lvda Chen
  • Haonan Wang
  • Runzhong Wang
  • Junchi Yan

While diffusion models have shown promise for combinatorial optimization (CO), their inference-time scaling cost-efficiency remains relatively underexplored. Existing methods improve solution quality by increasing denoising steps, but the performance often becomes saturated quickly. This paper proposes GenSCO to systematically scale diffusion solvers by an orthogonal dimension of inference-time computation beyond denoising step expansion, i.e., search-driven generation. GenSCO takes generation as a search operator rather than a complete solving process, where each operator cycle combines solution disruption (via local search operators) and diffusion sampling, enabling iterative exploration of the learned solution space. Rather than over-refining current solutions, this paradigm encourages the model to leave local optima and explore a broader area of the solution space, ensuring a more consistent scaling effect. The search loop is supported by a search-friendly solution-enhancement training procedure that incorporates a rectified flow model learning to establish diffusion trajectories between suboptimal solutions and the optimal ones. The flow model is empowered by a lightweight transformer architecture to learn neural ODEs that linearize solution trajectories, accelerating convergence of the scaling effect with efficiency. The resulting enhanced scaling efficiency and practical scalability lead to synergistic performance improvements. Extensive experiments show that GenSCO delivers performance improvements by orders of magnitude over previous state-of-the-art neural methods. Notably, GenSCO even achieves significant speedups compared to the state-of-the-art classic mathematical solver LKH3, delivering a 141x speedup to reach a 0.000% optimality gap on TSP-100, and approximately a 10x speedup to reach 0.02% on TSP-500.

ICLR Conference 2025 Conference Paper

GLOMA: Global Video Text Spotting with Morphological Association

  • Han Wang
  • Yanjie Wang
  • Yang Li
  • Can Huang

Video Text Spotting (VTS) is a fundamental visual task that aims to predict the trajectories and content of texts in a video. Previous works usually conduct local associations and apply IoU-based distances and complex post-processing procedures to boost performance, ignoring the abundant temporal information and the morphological characteristics in VTS. In this paper, we propose GLOMA to model the tracking problem as global associations and utilize the Gaussian Wasserstein distance to guide the morphological correlation between frames. Our main contributions are threefold: (1) we propose GLOMA, a Transformer-based global tracking method for VTS that associates multiple frames simultaneously; (2) we introduce a Wasserstein distance-based method to conduct positional associations between frames; (3) we conduct extensive experiments on public datasets. On the ICDAR2015 video dataset, GLOMA achieves 56.0 MOTA with a 4.6 absolute improvement over the previous SOTA method and outperforms the previous Transformer-based method by a significant 8.3 MOTA.
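The Gaussian Wasserstein distance mentioned in this abstract models each text box as a 2D Gaussian and compares boxes via the 2-Wasserstein metric. A minimal sketch for the axis-aligned case follows; the box-to-Gaussian convention (center as mean, half-extents as standard deviations) is an assumption, and with diagonal covariances the general formula's matrix square-root term reduces to per-axis differences of standard deviations (rotated text boxes, as in real video, would need the full covariance form):

```python
def box_to_gaussian(box):
    """Axis-aligned box (cx, cy, w, h) -> mean and per-axis std (assumed convention)."""
    cx, cy, w, h = box
    return (cx, cy), (w / 2.0, h / 2.0)

def gaussian_wasserstein2(box_a, box_b):
    """Squared 2-Wasserstein distance between the two box Gaussians.
    With diagonal covariances: W2^2 = ||m1 - m2||^2 + sum_i (s1_i - s2_i)^2."""
    (m1, s1), (m2, s2) = box_to_gaussian(box_a), box_to_gaussian(box_b)
    location = sum((a - b) ** 2 for a, b in zip(m1, m2))  # center displacement
    shape = sum((a - b) ** 2 for a, b in zip(s1, s2))     # width/height mismatch
    return location + shape

# Identical boxes score 0; a copy shifted by 3 px is penalized only by location.
print(gaussian_wasserstein2((0, 0, 4, 2), (0, 0, 4, 2)))  # -> 0.0
print(gaussian_wasserstein2((0, 0, 4, 2), (3, 0, 4, 2)))  # -> 9.0
```

Unlike IoU, which drops to zero as soon as boxes stop overlapping, this score degrades smoothly with displacement and shape mismatch, which is what makes it usable as an association cost across frames.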

IROS Conference 2025 Conference Paper

GTAD: Global Temporal Aggregation Denoising Learning for 3D Semantic Occupancy Prediction

  • Tianhao Li
  • Yang Li
  • Mengtian Li
  • Yisheng Deng
  • Weifeng Ge

Accurately perceiving dynamic environments is a fundamental task for autonomous driving and robotic systems. Existing methods inadequately utilize temporal information, relying mainly on local temporal interactions between adjacent frames and failing to leverage global sequence information effectively. To address this limitation, we investigate how to effectively aggregate global temporal features from temporal sequences, aiming to achieve occupancy representations that efficiently utilize global temporal information from historical observations. For this purpose, we propose a global temporal aggregation denoising network named GTAD, introducing a global temporal information aggregation framework as a new paradigm for holistic 3D scene understanding. Our method employs an in-model latent denoising network to aggregate local temporal features from the current moment and global temporal features from historical sequences. This approach enables the effective perception of both fine-grained temporal information from adjacent frames and global temporal patterns from historical observations. As a result, it provides a more coherent and comprehensive understanding of the environment. Extensive experiments on the nuScenes and Occ3D-nuScenes benchmarks and ablation studies demonstrate the superiority of our method.

NeurIPS Conference 2025 Conference Paper

Learning the Plasticity: Plasticity-Driven Learning Framework in Spiking Neural Networks

  • Guobin Shen
  • Dongcheng Zhao
  • Yiting Dong
  • Yang Li
  • Feifei Zhao
  • Yi Zeng

The evolution of the human brain has led to the development of complex synaptic plasticity, enabling dynamic adaptation to a constantly evolving world. This progress inspires our exploration into a new paradigm for Spiking Neural Networks (SNNs): a Plasticity-Driven Learning Framework (PDLF). This paradigm diverges from traditional neural network models that primarily focus on direct training of synaptic weights, leading to static connections that limit adaptability in dynamic environments. Instead, our approach delves into the heart of synaptic behavior, prioritizing the learning of plasticity rules themselves. This shift in focus from weight adjustment to mastering the intricacies of synaptic change offers a more flexible and dynamic pathway for neural networks to evolve and adapt. Our PDLF does not merely adapt existing concepts of functional and Presynaptic-Dependent Plasticity but redefines them, aligning closely with the dynamic and adaptive nature of biological learning. This reorientation enhances key cognitive abilities in artificial intelligence systems, such as working memory and multitasking capabilities, and demonstrates superior adaptability in complex, real-world scenarios. Moreover, our framework sheds light on the intricate relationships between various forms of plasticity and cognitive functions, thereby contributing to a deeper understanding of the brain's learning mechanisms. Integrating this groundbreaking plasticity-centric approach in SNNs marks a significant advancement in the fusion of neuroscience and artificial intelligence. It paves the way for developing AI systems that not only learn but also adapt in an ever-changing world, much like the human brain.

AAAI Conference 2025 Conference Paper

LiON: Learning Point-Wise Abstaining Penalty for LiDAR Outlier DetectioN Using Diverse Synthetic Data

  • Shaocong Xu
  • Pengfei Li
  • Qianpu Sun
  • Xinyu Liu
  • Yang Li
  • Shihui Guo
  • Zhen Wang
  • Bo Jiang

LiDAR-based semantic scene understanding is an important module in the modern autonomous driving perception stack. However, identifying outlier points in a LiDAR point cloud is challenging as LiDAR point clouds lack semantically-rich information. While former SOTA methods adopt heuristic architectures, we revisit this problem from the perspective of Selective Classification, which introduces a selective function into the standard closed-set classification setup. Our solution is built upon the basic idea of abstaining from choosing any inlier categories but learns a point-wise abstaining penalty with a margin-based loss. Apart from learning paradigms, synthesizing outliers to approximate unlimited real outliers is also critical, so we propose a strong synthesis pipeline that generates outliers originating from various factors: object categories, sampling patterns and sizes. We demonstrate that learning different abstaining penalties, apart from point-wise penalty, for different types of (synthesized) outliers can further improve the performance. We benchmark our method on SemanticKITTI and nuScenes and achieve SOTA results.

NeurIPS Conference 2025 Conference Paper

Meta CLIP 2: A Worldwide Scaling Recipe

  • Yung-Sung Chuang
  • Yang Li
  • Dong Wang
  • Ching-Feng Yeh
  • Kehan Lyu
  • Ramya Raghavendra
  • Jim Glass
  • Lifei Huang

Contrastive Language-Image Pretraining (CLIP) is a popular foundation model, supporting tasks from zero-shot classification and retrieval to serving as the encoder for multimodal large language models (MLLMs). Although CLIP is successfully trained on billion-scale image-text pairs from the English world, scaling CLIP's training further to learn from worldwide web data remains challenging: (1) no curation method is available to handle data points from the non-English world; (2) the English performance of existing multilingual CLIP is worse than its English-only counterpart, i.e., the "curse of multilinguality" that is common in LLMs. Here, we present Meta CLIP 2, the first recipe for training CLIP from scratch on worldwide web-scale image-text pairs. To generalize our findings, we conduct rigorous ablations with the minimal changes necessary to address the above challenges and present a recipe enabling mutual benefits from English and non-English world data. In zero-shot ImageNet classification, Meta CLIP 2 ViT-H/14 surpasses its English-only counterpart by 0.8% and mSigLIP by 0.7%, and surprisingly sets a new state of the art without system-level confounding factors (e.g., translation, bespoke architecture changes) on multilingual benchmarks, such as CVQA with 57.4%, Babel-ImageNet with 50.2%, and XM3600 with 64.3% on image-to-text retrieval. Code and model are available at https://github.com/facebookresearch/MetaCLIP.

NeurIPS Conference 2025 Conference Paper

ML4CO-Bench-101: Benchmark Machine Learning for Classic Combinatorial Problems on Graphs

  • Jiale Ma
  • Wenzheng Pan
  • Yang Li
  • Junchi Yan

Combinatorial problems on graphs have attracted extensive efforts from the machine learning community over the past decade. Despite notable progress in this area under the umbrella of ML4CO, a comprehensive categorization, unified reproducibility, and transparent evaluation protocols are still lacking for the emerging and immense pool of neural CO solvers. In this paper, we establish a modular and streamlined framework benchmarking prevalent neural CO methods, dissecting their design choices via a tri-level "paradigm-model-learning" taxonomy to better characterize different approaches. Further, we integrate their shared features and respective strengths to form 3 unified solvers representing global prediction (GP), local construction (LC), and adaptive expansion (AE) mannered neural solvers. We also collate a total of 65 datasets for 7 mainstream CO problems (including both edge-oriented tasks: TSP, ATSP, CVRP, as well as node-oriented tasks: MIS, MCl, MVC, MCut) across scales to facilitate more comparable results among the literature. Extensive experiments upon our benchmark reveal a fair and exact picture of the raw contribution of the learning components in each method, reinforcing the position that pre- and post-inference heuristic tricks are not supposed to compensate for the sub-par capability of the data-driven counterparts. Under this unified benchmark, an up-to-date replication of typical ML4CO methods is maintained, hoping to provide convenient reference and insightful guidelines for both engineering development and academic exploration of the ML4CO community in the future. Code is available at https://github.com/Thinklab-SJTU/ML4CO-Bench-101, and the dataset is at https://huggingface.co/datasets/ML4CO/ML4CO-Bench-101-SL.

AAAI Conference 2025 Conference Paper

Motion-Zero: A Zero-Shot Trajectory Control Framework of Moving Object for Diffusion-Based Video Generation

  • Changgu Chen
  • Junwei Shu
  • Gaoqi He
  • Changbo Wang
  • Yang Li

Recent large-scale pre-trained diffusion models have demonstrated a powerful generative ability to produce high-quality videos from detailed text descriptions. However, exerting control over the motion of objects in videos generated by any video diffusion model remains a challenging problem. In this paper, we propose a novel zero-shot moving object trajectory control framework, Motion-Zero, to enable arbitrary single-object-trajectory control for the text-to-video diffusion model. To this end, an initial noise prior module is designed to provide a position-based prior that improves the stability of the moving object's appearance and the accuracy of its position. In addition, based on the attention map of the U-Net, spatial constraints are directly applied to the denoising process of diffusion models, which further ensures the positional consistency of moving objects during inference. Furthermore, temporal consistency is guaranteed with a proposed shift temporal attention mechanism. Our method can be flexibly applied to various state-of-the-art video diffusion models without any training process. Extensive experiments demonstrate that our proposed method can control the motion trajectories of arbitrary objects while preserving the original ability to generate high-quality videos.

NeurIPS Conference 2025 Conference Paper

MPS-Prover: Advancing Stepwise Theorem Proving by Multi-Perspective Search and Data Curation

  • Zhenwen Liang
  • Linfeng Song
  • Yang Li
  • Tao Yang
  • Haitao Mi
  • Dong Yu

Automated Theorem Proving (ATP) in formal languages remains a formidable challenge in AI, demanding rigorous logical deduction and navigating vast search spaces. While large language models (LLMs) have shown promising performance, existing stepwise provers often suffer from biased search guidance, leading to inefficiencies and suboptimal proof strategies. This paper introduces the Multi-Perspective Search Prover (MPS-Prover), a novel stepwise ATP system designed to overcome these limitations. MPS-Prover incorporates two key innovations: a highly effective post-training data curation strategy that prunes approximately 40% of redundant training data without sacrificing performance, and a multi-perspective tree search mechanism. This search integrates a learned critic model with strategically designed heuristic rules to diversify tactic selection, prevent getting trapped in unproductive states, and enhance search robustness. Extensive evaluations demonstrate that MPS-Prover achieves state-of-the-art performance on multiple challenging benchmarks, including miniF2F and ProofNet, outperforming prior 7B parameter models. Furthermore, our analyses reveal that MPS-Prover generates significantly shorter and more diverse proofs compared to existing stepwise and whole-proof methods, highlighting its efficiency and efficacy. Our work advances the capabilities of LLM-based formal reasoning and offers a robust framework and a comprehensive analysis for developing more powerful theorem provers.

NeurIPS Conference 2025 Conference Paper

NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions

  • Weizhe Yuan
  • Jane Yu
  • Song Jiang
  • Karthik Padthe
  • Yang Li
  • Dong Wang
  • Ilia Kulikov
  • Kyunghyun Cho

Scaling reasoning capabilities beyond traditional domains such as math and coding is hindered by the lack of diverse and high-quality questions. To overcome this limitation, we introduce a scalable approach for generating diverse and challenging reasoning questions, accompanied by reference answers. We present NaturalReasoning, a comprehensive dataset comprising 2.8 million questions that span multiple domains, including STEM fields (e.g., Physics, Computer Science), Economics, Social Sciences, and more. We demonstrate the utility of the questions in NaturalReasoning through knowledge distillation experiments which show that NaturalReasoning can effectively elicit and transfer reasoning capabilities from a strong teacher model. Furthermore, we demonstrate that NaturalReasoning is also effective for unsupervised self-training using external reward models or self-rewarding.

NeurIPS Conference 2025 Conference Paper

Novel Class Discovery for Point Cloud Segmentation via Joint Learning of Causal Representation and Reasoning

  • Yang Li
  • Aming WU
  • Zihao Zhang
  • Yahong Han

In this paper, we focus on Novel Class Discovery for Point Cloud Segmentation (3D-NCD), aiming to learn a model that can segment unlabeled (novel) 3D classes using only the supervision from labeled (base) 3D classes. The key to this task is to set up the exact correlations between the point representations and their base class labels, as well as the representation correlations between the points from base and novel classes. Coarse or purely statistical correlation learning may lead to confusion in novel class inference. If we impose a causal relationship as a strongly correlated constraint upon the learning process, the essential point cloud representations that accurately correspond to the classes should be uncovered. To this end, we introduce a structural causal model (SCM) to re-formalize the 3D-NCD problem and propose a new method, i.e., Joint Learning of Causal Representation and Reasoning. Specifically, we first analyze hidden confounders in the base class representations and the causal relationships between the base and novel classes through the SCM. We devise a causal representation prototype that eliminates confounders to capture the causal representations of base classes. A graph structure is then used to model the causal relationships between the base classes' causal representation prototypes and the novel class prototypes, enabling causal reasoning from base to novel classes. Extensive experiments and visualization results on 3D and 2D NCD semantic segmentation demonstrate the superiority of our method.

ICML Conference 2025 Conference Paper

Optimal Auction Design in the Joint Advertising

  • Yang Li
  • Yuchao Ma 0002
  • Qi Qi 0003

Online advertising is a vital revenue source for major internet platforms. Recently, joint advertising, which assigns a bundle of two advertisers to an ad slot instead of allocating a single advertiser, has emerged as an effective method for enhancing allocation efficiency and revenue. However, existing mechanisms for joint advertising fail to achieve optimality, as they tend to focus on individual advertisers and overlook bundle structures. This paper identifies an optimal mechanism for joint advertising in a single-slot setting. For multi-slot joint advertising, we propose BundleNet, a novel bundle-based neural network approach specifically designed for joint advertising. Our extensive experiments demonstrate that the mechanisms generated by BundleNet approximate the theoretical analysis results in the single-slot setting and achieve state-of-the-art performance in the multi-slot setting. This significantly increases platform revenue while ensuring approximate dominant strategy incentive compatibility and individual rationality.

ICML Conference 2025 Conference Paper

Parrot: Multilingual Visual Instruction Tuning

  • Hai-Long Sun
  • Da-Wei Zhou 0001
  • Yang Li
  • Shiyin Lu
  • Chao Yi
  • Qing-Guo Chen
  • Zhao Xu
  • Weihua Luo

The rapid development of Multimodal Large Language Models (MLLMs), such as GPT-4, marks a significant step toward artificial general intelligence. Existing methods typically align vision encoders with LLMs via supervised fine-tuning (SFT), but this often deteriorates their ability to handle multiple languages as training progresses. We empirically observe that imbalanced SFT datasets, largely English-centric, degrade performance on non-English languages due to the failure in multilingual token alignment. To address this, we propose Parrot, a novel approach that leverages textual guidance for visual token alignment at the language level. Parrot conditions visual tokens on diverse language inputs and uses Mixture-of-Experts (MoE) to align multilingual tokens. By computing cross-attention between initial visual features and textual embeddings, we select the most relevant experts, converting visual tokens into language-specific representations. Additionally, we introduce the Massive Multilingual Multimodal Benchmark (MMMB), a new benchmark comprising 6 languages, 15 categories, and 12,000 questions, to assess multilingual capabilities. Parrot achieves state-of-the-art performance on both the multilingual benchmarks and a wide range of multimodal tasks. Code and dataset are available at: https://github.com/AIDC-AI/Parrot.

AAAI Conference 2025 Conference Paper

pFedGPA: Diffusion-based Generative Parameter Aggregation for Personalized Federated Learning

  • Jiahao Lai
  • Jiaqi Li
  • Jian Xu
  • Yanru Wu
  • Boshi Tang
  • Siqi Chen
  • Yongfeng Huang
  • Wenbo Ding

Federated Learning (FL) offers a decentralized approach to model training, where data remains local and only model parameters are shared between the clients and the central server. Traditional methods, such as Federated Averaging (FedAvg), linearly aggregate these parameters which are usually trained on heterogeneous data distributions, potentially overlooking the complex, high-dimensional nature of the parameter space. This can result in degraded performance of the aggregated model. While personalized FL approaches can mitigate the heterogeneous data issue to some extent, the limitation of linear aggregation remains unresolved. To alleviate this issue, we investigate the generative approach of diffusion model and propose a novel generative parameter aggregation framework for personalized FL, pFedGPA. In this framework, we deploy a diffusion model on the server to integrate the diverse parameter distributions and propose a parameter inversion method to efficiently generate a set of personalized parameters for each client. This inversion method transforms the uploaded parameters into a latent code, which is then aggregated through denoising sampling to produce the final personalized parameters. By encoding the dependence of a client's model parameters on the specific data distribution using the high-capacity diffusion model, pFedGPA can effectively decouple the complexity of the overall distribution of all clients' model parameters from the complexity of each individual client's parameter distribution. Our experimental results consistently demonstrate the superior performance of the proposed method across multiple datasets, surpassing baseline approaches.
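The linear aggregation that pFedGPA moves beyond can be illustrated with a minimal FedAvg sketch. This is a generic illustration of the classical weighted-averaging rule the abstract contrasts with, not the paper's code; the toy client parameters and dataset sizes are hypothetical.

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Linearly aggregate client parameter vectors, weighted by local dataset size.

    Implements the classical FedAvg rule: theta = sum_k (n_k / n) * theta_k.
    Heterogeneous client data makes this linear average a potentially poor
    summary of the parameter distribution, which motivates generative
    aggregation approaches.
    """
    weights = np.asarray(client_sizes, dtype=float)
    weights /= weights.sum()                      # n_k / n
    stacked = np.stack([np.asarray(p, dtype=float) for p in client_params])
    return np.tensordot(weights, stacked, axes=1)  # weighted sum over clients

# Three toy clients with 2-parameter models and unequal data sizes.
params = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
sizes = [10, 10, 20]
print(fedavg(params, sizes))  # weighted mean: [3.5, 2.5]
```

Note how the third client, holding half the data, pulls the average toward its parameters; a diffusion-based aggregator instead models the joint parameter distribution rather than collapsing it to this single weighted mean.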

ICML Conference 2025 Conference Paper

Policy Guided Tree Search for Enhanced LLM Reasoning

  • Yang Li

Despite their remarkable capabilities, large language models often struggle with tasks requiring complex reasoning and planning. While existing approaches like Chain-of-Thought prompting and tree search techniques show promise, they are limited by their reliance on predefined heuristics and computationally expensive exploration strategies. We propose Policy-Guided Tree Search (PGTS), a framework that combines reinforcement learning with structured tree exploration to efficiently navigate reasoning paths. Our key innovation is a learned policy that dynamically decides between expanding, branching, backtracking, or terminating exploration, eliminating the need for manual heuristics or exhaustive search. Experiments across mathematical reasoning, logical deduction, and planning benchmarks demonstrate that PGTS achieves superior reasoning performance while significantly reducing computational costs compared to existing methods. These results establish PGTS as a scalable and effective solution for tackling complex reasoning tasks with LLMs.

AAAI Conference 2025 Conference Paper

Population Aware Diffusion for Time Series Generation

  • Yang Li
  • Han Meng
  • Zhenyu Bi
  • Ingolv T. Urnes
  • Haipeng Chen

Diffusion models have shown promising ability in generating high-quality time series (TS) data. Despite the initial success, existing works mostly focus on the authenticity of data at the individual level, but pay less attention to preserving the population-level properties on the entire dataset. Such population-level properties include value distributions for each dimension and distributions of certain functional dependencies (e.g., cross-correlation, CC) between different dimensions. For instance, when generating house energy consumption TS data, the value distributions of the outside temperature and the kitchen temperature should be preserved, as well as the distribution of CC between them. Preserving such TS population-level properties is critical in maintaining the statistical insights of the datasets, mitigating model bias, and augmenting downstream tasks like TS prediction. Yet, it is often overlooked by existing models. Hence, data generated by existing models often bear distribution shifts from the original data. We propose Population-aware Diffusion for Time Series (PaD-TS), a new TS generation model that better preserves the population-level properties. The key novelties of PaD-TS include 1) a new training method explicitly incorporating TS population-level property preservation, and 2) a new dual-channel encoder model architecture that better captures the TS data structure. Empirical results in major benchmark datasets show that PaD-TS can improve the average CC distribution shift score between real and synthetic data by 5.9x while maintaining a performance comparable to state-of-the-art models on individual-level authenticity.
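The population-level cross-correlation (CC) property described above can be made concrete: compute one CC value per generated series and compare the distribution of those values against the real data. A minimal NumPy sketch, purely illustrative and not the PaD-TS implementation (the toy data and dimension indices are assumptions):

```python
import numpy as np

def cc_per_sample(batch, i, j):
    """Pearson cross-correlation between dimensions i and j of each series.

    batch: array of shape (num_series, length, num_dims).
    Returns one CC value per series; comparing the distribution of these
    values between real and synthetic datasets is a population-level check,
    as opposed to judging each series individually.
    """
    x = batch[:, :, i] - batch[:, :, i].mean(axis=1, keepdims=True)
    y = batch[:, :, j] - batch[:, :, j].mean(axis=1, keepdims=True)
    num = (x * y).sum(axis=1)
    den = np.sqrt((x ** 2).sum(axis=1) * (y ** 2).sum(axis=1))
    return num / den

rng = np.random.default_rng(0)
real = rng.normal(size=(100, 50, 2))
real[:, :, 1] += real[:, :, 0]   # induce positive correlation between dims 0 and 1
ccs = cc_per_sample(real, 0, 1)
print(ccs.mean())                # population of CC values skews clearly positive
```

A generator that preserves population-level properties should reproduce not just this mean but the whole shape of the CC distribution.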

NeurIPS Conference 2025 Conference Paper

ProteinConformers: Benchmark Dataset for Simulating Protein Conformational Landscape Diversity and Plausibility

  • Yihang Zhou
  • Chen Wei
  • Minghao Sun
  • Jin Song
  • Yang Li
  • Lin Wang
  • Yang Zhang

Understanding the conformational landscape of proteins is essential for elucidating protein function and facilitating drug design. However, existing protein conformation benchmarks fail to capture the full energy landscape, limiting their ability to evaluate the diversity and physical plausibility of AI-generated structures. We introduce ProteinConformers, a large-scale benchmark dataset comprising over 381,000 physically realistic conformations for 87 CASP targets. These were derived from more than 40,000 structural decoys via extensive all-atom molecular dynamics simulations totaling over 6 million CPU hours. Using this dataset, we propose novel metrics to evaluate conformational diversity and plausibility, and systematically benchmark six protein conformation generative models. Our results highlight that leveraging large-scale protein sequence data can enhance a model's ability to explore conformational space, potentially reducing reliance on MD-derived data. Additionally, we find that PDB and MD datasets influence model performance differently: current models perform well on inter-atomic distance prediction but struggle with inter-residue orientation generation. Overall, our dataset, evaluation metrics, and benchmarking results provide the first comprehensive foundation for assessing generative models in protein conformational modeling. Dataset and instructions are available at https://huggingface.co/datasets/Jim990908/ProteinConformers/tree/main. Codes are stored at https://github.com/auroua/ProteinConformers. An interactive website is available at https://zhanggroup.org/ProteinConformers.

NeurIPS Conference 2025 Conference Paper

REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing

  • Weihan Xu
  • Yimeng Ma
  • Jingyue Huang
  • Yang Li
  • Wenye Ma
  • Taylor Berg-Kirkpatrick
  • Julian McAuley
  • Paul Liang

Short videos are an effective tool for promoting contents and improving knowledge accessibility. While existing extractive video summarization methods struggle to produce a coherent narrative, existing abstractive methods cannot "quote" from the input videos, i.e., insert short video clips in their outputs. In this work, we explore novel video editing models for generating shorts that feature a coherent narrative with embedded video insertions extracted from a long input video. We propose a novel retrieval-embedded generation framework that allows a large language model to quote multimodal resources while maintaining a coherent narrative. Our proposed REGen system first generates the output story script with quote placeholders using a finetuned large language model, and then uses a novel retrieval model to replace the quote placeholders by selecting a video clip that best supports the narrative from a pool of candidate quotable video clips. We examine the proposed method on the task of documentary teaser generation, where short interview insertions are commonly used to support the narrative of a documentary. Our objective evaluations show that the proposed method can effectively insert short video clips while maintaining a coherent narrative. In a subjective survey, we show that our proposed method outperforms existing abstractive and extractive approaches in terms of coherence, alignment, and realism in teaser generation.

JBHI Journal 2025 Journal Article

scSTD: A Swin Transformer-Based Diffusion Model for Recovering scRNA-Seq Data

  • Yang Li
  • Furui Liu
  • Junlei Zhou
  • Fangyuan Shi
  • Zhenhua Yu

Dropout events and technical noise are pervasive challenges in single-cell RNA sequencing (scRNA-seq) data, often obscuring true gene expression profiles and undermining the reliability of downstream analyses. Existing imputation and denoising methods offer partial relief but frequently struggle with over-smoothing and fail to fully capture the complex heterogeneity of cellular states. To address these limitations, we introduce scSTD, a novel imputation and denoising framework that uniquely combines the Swin Transformer (SwinT) architecture with a latent diffusion model. In scSTD, a deep autoencoder first encodes each cell into a compact latent embedding, which is then modeled via a SwinT-based latent diffusion process designed to learn the rich, multimodal distribution of scRNA-seq data. This integration enables scSTD to accurately recover gene expression profiles while preserving subtle biological variation. By synthesizing realistic latent neighbors for each cell and aggregating their decoded outputs, scSTD achieves high-fidelity imputation and denoising. Comprehensive evaluations on both synthetic and real scRNA-seq datasets demonstrate that scSTD significantly outperforms existing methods in recovering true gene expression profiles and maintaining the topological integrity of cellular landscapes.

IROS Conference 2025 Conference Paper

SF-TIM: A Simple Framework for Enhancing Quadrupedal Robot Jumping Agility by Combining Terrain Imagination and Measurement

  • Ze Wang 0009
  • Yang Li
  • Long Xu 0002
  • Hao Shi 0004
  • Zunwang Ma
  • Zhen Chu
  • Chao Li
  • Fei Gao 0011

Dynamic jumping on high platforms and over gaps differentiates legged robots from wheeled counterparts. Compared to walking on rough terrains, dynamic locomotion on abrupt surfaces requires fusing proprioceptive and exteroceptive perception for explosive movements. In this paper, we propose SF-TIM (Simple Framework combining Terrain Imagination and Measurement), a single-policy method that enhances quadrupedal robot jumping agility, while preserving their fundamental blind walking capabilities. In addition, we introduce a terrain-guided reward design specifically to assist quadrupedal robots in high jumping, improving their performance in this task. To narrow the simulation-to-reality gap in quadrupedal robot learning, we introduce a stable and high-speed elevation map generation framework, enabling zero-shot simulation-to-reality transfer of locomotion ability. Our algorithm has been deployed and validated on both the small-/large-size quadrupedal robots, demonstrating its effectiveness in real-world applications: the robot has successfully traversed various high platforms and gaps, showing the robustness of our proposed approach. A demo video has been made available at https://flysoaryun.github.io/SF-TIM.

TMLR Journal 2025 Journal Article

Streamlining Language Models via Semantic Basis Analysis

  • Yang Li
  • Daniel Agyei Asante
  • Changsheng Zhao
  • Ernie Chang
  • Yangyang Shi
  • Vikas Chandra

As the size of language models increases, they deliver substantial performance improvements across a variety of applications. However, this growth also leads to greater computational demands, making deployment on resource-constrained devices—such as personal computers and mobile or wearable devices—more challenging, and significantly raising inference costs on cloud servers. To address these challenges, we introduce Basel, a method to streamline language models by leveraging the semantic structure of their weight matrices. Specifically, Basel treats each weight matrix as a linear combination of bases, selectively retaining those that are associated with essential semantics for the target application, pruning redundant ones, and introducing new bases that enhance task performance. Experimental results demonstrate that Basel achieves significant model size reduction compared to baseline techniques, while maintaining comparable or even superior accuracy across diverse applications.
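The idea of viewing a weight matrix as a linear combination of bases, some of which can be pruned, can be sketched with a truncated SVD. This is a generic low-rank illustration under that reading of the abstract, not Basel's actual semantics-aware selection criterion; the matrix sizes and rank are arbitrary.

```python
import numpy as np

def keep_top_bases(W, k):
    """Approximate W by its k strongest rank-1 bases via SVD.

    W = sum_i s_i * u_i v_i^T; retaining only the top-k terms prunes the
    weakest bases. Basel selects bases by their relevance to the target
    application rather than by singular value alone, but the decomposition
    view is the same.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
# A matrix that is exactly rank 2 plus small noise.
W = rng.normal(size=(64, 2)) @ rng.normal(size=(2, 64)) + 0.01 * rng.normal(size=(64, 64))
err = np.linalg.norm(W - keep_top_bases(W, 2)) / np.linalg.norm(W)
print(err)  # two bases already capture almost all of W
```

Keeping 2 of 64 possible bases here shrinks the stored parameters dramatically while the relative reconstruction error stays tiny, which is the size/accuracy trade-off the abstract describes.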

NeurIPS Conference 2025 Conference Paper

StruDiCO: Structured Denoising Diffusion with Gradient-free Inference-stage Boosting for Memory and Time Efficient Combinatorial Optimization

  • Yu Wang
  • Yang Li
  • Junchi Yan
  • Yi Chang

Diffusion models have recently emerged as powerful neural solvers for combinatorial optimization (CO). However, existing approaches fail to reveal how variables are progressively determined during inference, making the final solution opaque until the last step. To address this limitation, we propose a structured denoising diffusion model, StruDiCO, which incrementally constructs solutions through step-wise variable selection. This is achieved via a variable-absorption noising model, wherein the forward process simulates gradual variable deactivation, converging to an empty solution, while the reverse process incrementally selects variables to reconstruct the final solution. This design induces structural continuity across intermediate states, enabling interpretable and trajectory-consistent partial solutions throughout inference. To further improve the reliability of reverse inference, we introduce a constrained consistency sampling strategy, which suppresses low-confidence variable selection at each step to stabilize the reverse process. Leveraging the structure-preserving reverse process, we further propose a lightweight, gradient-free, objective-aware refinement framework, which iteratively improves solution quality by applying structure-aware perturbations to the current solution, performing reverse inference through the constrained consistency model, and decoding with an objective-guided scoring scheme. Extensive experiments on two canonical CO tasks, the Traveling Salesman Problem (TSP) and Maximal Independent Set (MIS), show that StruDiCO outperforms state-of-the-art diffusion-based solvers, achieving up to 3.5× faster inference, 70% lower GPU memory usage, and significantly improved solution quality, with up to 37.7% drop reduction on TSP and an average 38.1% improvement on MIS. The codes are publicly available at https://github.com/yuuuuwang/StruDiCO.

JBHI Journal 2025 Journal Article

TCGAN: Temporal Convolutional Generative Adversarial Network for Fetal ECG Extraction Using Single-Channel Abdominal ECG

  • Zhen-Zhen Huang
  • Wei-Tao Zhang
  • Yang Li
  • Jian Cui
  • Ya-Ru Zhang

Noninvasive fetal ECG (FECG) monitoring holds significant importance in ensuring the normal development of the fetus. Since FECG is usually submerged by maternal ECG (MECG) and background noise in abdominal ECG (AECG), it is challenging to exactly restore the waveform details of FECG from AECG. To address this issue, a temporal convolutional generative adversarial network (TCGAN) is proposed for FECG extraction using single-channel AECG. In order to utilize both the global and local ECG features in the time domain, we built an encoder-decoder architecture for the generator. The model architecture consists of temporal convolution blocks, transpose convolutions, and skip connections. The skip connections amalgamate information from feature maps extracted by convolutional layers using transpose convolution operations, which helps the decoder extract more detailed information. TCGAN is rigorously evaluated using both the synthetic dataset FECGSYDB and the real-world dataset ADFECGDB. The experimental results on these datasets demonstrate the outstanding performance of TCGAN in terms of fetal QRS complex detection, achieving PPVs of 99.54% and 99.02%, respectively. Compared with state-of-the-art methods, TCGAN can extract FECG with well-preserved waveform details. This helps doctors achieve more accurate assessment of fetal development.

AAAI Conference 2025 Conference Paper

TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings

  • Dawei Yan
  • Pengcheng Li
  • Yang Li
  • Hao Chen
  • Qingguo Chen
  • Weihua Luo
  • Wei Dong
  • Qingsen Yan

Currently, inspired by the success of vision-language models (VLMs), an increasing number of researchers are focusing on improving VLMs and have achieved promising results. However, most existing methods concentrate on optimizing the connector and enhancing the language model component, while neglecting improvements to the vision encoder itself. In contrast, we propose Text Guided LLaVA (TG-LLaVA) in this paper, which optimizes VLMs by guiding the vision encoder with text, offering a new and orthogonal optimization direction. Specifically, inspired by the purpose-driven logic inherent in human behavior, we use learnable latent embeddings as a bridge to analyze textual instruction and add the analysis results to the vision encoder as guidance, refining it. Subsequently, another set of latent embeddings extracts additional detailed text-guided information from high-resolution local patches as auxiliary information. Finally, with the guidance of text, the vision encoder can extract text-related features, similar to how humans focus on the most relevant parts of an image when considering a question. This results in generating better answers. Experiments on various datasets validate the effectiveness of the proposed method. Remarkably, without the need for additional training data, our proposed method can bring more benefits to the baseline (LLaVA-1.5) compared with other concurrent methods. Furthermore, the proposed method consistently brings improvement in different settings.

NeurIPS Conference 2025 Conference Paper

THD-BAR: Topology Hierarchical Derived Brain Autoregressive Modeling for EEG Generic Representations

  • Wenchao Yang
  • Weidong Yan
  • Wenkang Liu
  • Yulan Ma
  • Yang Li

Large-scale pre-trained models hold significant potential for learning universal EEG representations. However, most existing methods, particularly autoregressive (AR) frameworks, primarily rely on straightforward temporal sequencing of multi-channel EEG data, which fails to capture the rich physiological characteristics inherent to EEG signals. Moreover, their time-centered modeling approach also limits the effective representation of the dynamic spatial topology of brain activity. To address these challenges and fully exploit the potential of large-scale EEG models, we propose a novel Topology Hierarchical Derived Brain Autoregressive Modeling (THD-BAR) for EEG generic representations. The core innovation of THD-BAR lies in the introduction of the Brain Topology Hierarchy (BTH), which establishes a multi-scale spatial order for EEG channels. This hierarchical structure enables a redefinition of autoregressive learning as a "next-scale-time prediction" problem, effectively capturing both spatial and temporal dynamics. Based on BTH, we design a Topology-Hierarchical Vector Quantized-Variational Autoencoder (THVQ-VAE) for multi-scale tokenization and develop an enhanced Brain Autoregressive (BAR) module with specialized masking strategies for prediction. Through extensive large-scale pre-training on 17 datasets, followed by rigorous validation on 10 downstream datasets spanning 5 distinct tasks, THD-BAR consistently outperforms existing methods. These results highlight the superior generalization and modeling capabilities of our proposed approach.

NeurIPS Conference 2025 Conference Paper

Theory-Driven Label-Specific Representation for Incomplete Multi-View Multi-Label Learning

  • Quanjiang Li
  • Tianxiang Xu
  • Tingjin Luo
  • Yan Zhong
  • Yang Li
  • Yiyun Zhou
  • Chenping Hou

Multi-view multi-label learning typically suffers from dual data incompleteness due to limitations in feature storage and annotation costs. The interplay of heterogeneous features, numerous labels, and missing information significantly degrades model performance. To tackle these complex yet highly practical challenges, we propose a Theory-Driven Label-Specific Representation (TDLSR) framework. Through constructing the view-specific sample topology and prototype association graph, we develop a proximity-aware imputation mechanism, while deriving class representatives that capture the label correlation semantics. To obtain semantically distinct view representations, we introduce principles of information shift, interaction, and orthogonality, which promote the disentanglement of representation information and mitigate message distortion and redundancy. Besides, label semantic-guided feature learning is employed to identify the discriminative shared and specific representations and refine the label preference across views. Moreover, we theoretically investigate the characteristics of representation learning and the generalization performance. Finally, extensive experiments on public datasets and real-world applications validate the effectiveness of TDLSR.

TAAS Journal 2025 Journal Article

VoI-based Situation-Aware Routing Protocol for Non-linear Underwater Communication Networks

  • Kiran Saleem
  • Lei Wang
  • Rana Zeeshan Ahmed
  • Thippa Reddy Gadekallu
  • Ahmad Almadhor
  • Yang Li

One of the main challenges for underwater applications, such as environmental monitoring and disaster management, is achieving efficient data transmission in environments where conditions change rapidly and the resources needed for data transport are scarce. Evaluating the Value of Information (VoI) enables us to address these problems, and we propose a Value-of-Information-based Situation-Aware Non-Linear Routing (VoI-SANLR) method. It handles critical event scenarios using Belief-Desire-Intention (BDI) logic criteria and prioritizes the timely delivery of data-driven information to the destination. VoI-SANLR is designed to reduce energy consumption, end-to-end latency, and jitter, and to improve the Packet Delivery Ratio (PDR) in underwater communication networks. It introduces priority-based principles and addresses challenges of the underwater environment such as varying channel conditions, scarce energy resources, and real-time decision requirements. Energy optimization analysis reveals consistent outperformance, achieving a remarkable 95% reduction in energy consumption compared to other techniques. Low latency is maintained, ranging from 2.5 to 0.5 seconds, showcasing enhanced efficiency and scalability. VoI-SANLR demonstrates exceptional performance in both throughput and jitter: it achieves the highest data transfer rates, ranging from 100 kbps to 110 kbps, while jitter remains consistently low, between 1.8 ms and 2 ms, ensuring minimal delay variability and improved communication stability. PDR consistently surpasses other techniques, reaching a maximum of 99%. Additionally, network lifetime analysis demonstrates VoI-SANLR's superiority, exhibiting the highest network lifetime at each node and a significant 31.25% improvement at node 100 compared to other methods.

AAAI Conference 2025 Conference Paper

Zero-shot Video Moment Retrieval via Off-the-shelf Multimodal Large Language Models

  • Yifang Xu
  • Yunzhuo Sun
  • Benxiang Zhai
  • Ming Li
  • Wenxin Liang
  • Yang Li
  • Sidan Du

The target of video moment retrieval (VMR) is to predict temporal spans within a video that semantically match a given linguistic query. Existing VMR methods based on multimodal large language models (MLLMs) rely heavily on expensive high-quality datasets and time-consuming fine-tuning. Although some recent studies introduce a zero-shot setting to avoid fine-tuning, they overlook the inherent language bias in the query, leading to erroneous localization. To tackle these challenges, this paper proposes Moment-GPT, a tuning-free pipeline for zero-shot VMR utilizing frozen MLLMs. Specifically, we first employ LLaMA-3 to correct and rephrase the query to mitigate language bias. Subsequently, we design a span generator combined with MiniGPT-v2 to produce candidate spans adaptively. Finally, to leverage the video comprehension capabilities of MLLMs, we apply Video-ChatGPT and a span scorer to select the most appropriate spans. Our proposed method substantially outperforms state-of-the-art MLLM-based and zero-shot models on several public datasets, including QVHighlights, ActivityNet-Captions, and Charades-STA.
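The three-stage pipeline described in the abstract (query rephrasing, span generation, span scoring) can be sketched with the frozen models abstracted away as callables. This is an illustrative structural sketch only; all function names are hypothetical stand-ins, not the paper's actual interfaces.

```python
def zero_shot_vmr(query, video_frames, rephrase_fn, span_generator, span_scorer, top_k=1):
    """Rephrase the query, generate candidate spans, score them, return the best."""
    # Stage 1: mitigate language bias by rewriting the query (LLaMA-3 in the paper).
    clean_query = rephrase_fn(query)
    # Stage 2: produce candidate (start, end) spans (span generator + MiniGPT-v2).
    candidates = span_generator(clean_query, video_frames)
    # Stage 3: score each span with a video-language model (Video-ChatGPT) and rank.
    scored = sorted(candidates,
                    key=lambda s: span_scorer(clean_query, video_frames, s),
                    reverse=True)
    return scored[:top_k]

# Toy demo with trivial stand-ins for the frozen models.
frames = list(range(10))
best = zero_shot_vmr(
    "person opens the door",
    frames,
    rephrase_fn=lambda q: q.lower(),
    span_generator=lambda q, v: [(0, 3), (2, 7), (5, 9)],
    span_scorer=lambda q, v, s: s[1] - s[0],   # toy scorer: prefer longer spans
)
print(best)  # [(2, 7)]
```

The point of the sketch is that every stage is a frozen, swappable component, which is what makes the pipeline tuning-free.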

AAAI Conference 2024 Conference Paper

A Joint Framework with Heterogeneous-Relation-Aware Graph and Multi-Channel Label Enhancing Strategy for Event Causality Extraction

  • Ruili Pu
  • Yang Li
  • Jun Zhao
  • Suge Wang
  • Deyu Li
  • Jian Liao
  • Jianxing Zheng

Event Causality Extraction (ECE) aims to extract the cause-effect event pairs with their structured event information from plain texts. As far as we know, the existing ECE methods mainly focus on the correlation between arguments, without explicitly modeling the causal relationship between events, and usually design two independent frameworks to extract cause events and effect events, respectively, which cannot effectively capture the dependency between the subtasks. Therefore, we propose a joint multi-label extraction framework for ECE to alleviate the above limitations. In particular, 1) we design a heterogeneous-relation-aware graph module to learn the potential relationships between events and arguments, in which we construct the heterogeneous graph by taking the predefined event types and all the words in the sentence as nodes, and modeling three relationships of "event-event", "event-argument" and "argument-argument" as edges. 2) We also design a multi-channel label enhancing module to better learn the distributed representation of each label in the multi-label extraction framework, and further enhance the interaction between the subtasks by considering the preliminary results of cause-effect type identification and event argument extraction. The experimental results on the benchmark dataset ECE-CCKS show that our approach outperforms previous state-of-the-art methods, and that our model also performs well on the complex samples with multiple cause-effect event pairs.

JBHI Journal 2024 Journal Article

A Transferability-Based Method for Evaluating the Protein Representation Learning

  • Fan Hu
  • Weihong Zhang
  • Huazhen Huang
  • Wang Li
  • Yang Li
  • Peng Yin

Self-supervised pre-trained language models have recently risen as a powerful approach in learning protein representations, showing exceptional effectiveness in various biological tasks, such as drug discovery. Amidst the evolving trend in protein language model development, there is an observable shift towards employing large-scale multimodal and multitask models. However, the predominant reliance on empirical assessments using specific benchmark datasets for evaluating these models raises concerns about the comprehensiveness and efficiency of current evaluation methods. Addressing this gap, our study introduces a novel quantitative approach for estimating the performance of transferring multi-task pre-trained protein representations to downstream tasks. This transferability-based method is designed to quantify the similarities in latent space distributions between pre-trained features and those fine-tuned for downstream tasks. It encompasses a broad spectrum, covering multiple domains and a variety of heterogeneous tasks. To validate this method, we constructed a diverse set of protein-specific pre-training tasks. The resulting protein representations were then evaluated across several downstream biological tasks. Our experimental results demonstrate a robust correlation between the transferability scores obtained using our method and the actual transfer performance observed. This significant correlation highlights the potential of our method as a more comprehensive and efficient tool for evaluating protein representation learning.

NeurIPS Conference 2024 Conference Paper

ActSort: An active-learning accelerated cell sorting algorithm for large-scale calcium imaging datasets

  • Yiqi Jiang
  • Hakki O. Akengin
  • Ji Zhou
  • Mehmet A. Aslihak
  • Yang Li
  • Radosław Chrapkiewicz
  • Oscar Hernandez
  • Sadegh Ebrahimi

Recent advances in calcium imaging enable simultaneous recordings of up to a million neurons in behaving animals, producing datasets of unprecedented scales. Although individual neurons and their activity traces can be extracted from these videos with automated algorithms, the results often require human curation to remove false positives, a laborious process called "cell sorting". To address this challenge, we introduce ActSort, an active-learning algorithm for sorting large-scale datasets that integrates features engineered by domain experts together with data formats with minimal memory requirements. By strategically bringing outlier cell candidates near the decision boundary up for annotation, ActSort reduces human labor to about 1–3% of cell candidates and improves curation accuracy by mitigating annotator bias. To facilitate the algorithm's widespread adoption among experimental neuroscientists, we created user-friendly software and conducted a first-of-its-kind benchmarking study involving about 160,000 annotations. Our tests validated ActSort's performance across different experimental conditions and datasets from multiple animals. Overall, ActSort addresses a crucial bottleneck in processing large-scale calcium videos of neural activity and thereby facilitates systems neuroscience experiments at previously inaccessible scales. (https://github.com/schnitzer-lab/ActSort-public)

NeurIPS Conference 2024 Conference Paper

Aligning Individual and Collective Objectives in Multi-Agent Cooperation

  • Yang Li
  • Wenhao Zhang
  • Jianhong Wang
  • Shao Zhang
  • Yali Du
  • Ying Wen
  • Wei Pan

Among the research topics in multi-agent learning, mixed-motive cooperation is one of the most prominent challenges, primarily due to the mismatch between individual and collective goals. Cutting-edge research focuses on incorporating domain knowledge into rewards and introducing additional mechanisms to incentivize cooperation. However, these approaches often face shortcomings such as reliance on manual design and the absence of theoretical grounding. To close this gap, we model the mixed-motive game as a differentiable game to illuminate the learning dynamics towards cooperation. In more detail, we introduce a novel optimization method named Altruistic Gradient Adjustment (AgA) that employs gradient adjustments to progressively align individual and collective objectives. Furthermore, we theoretically prove that AgA effectively attracts gradients to stable fixed points of the collective objective while considering individual interests, and we validate these claims with empirical evidence. We evaluate the effectiveness of AgA through benchmark environments for testing mixed-motive collaboration with small-scale agents, such as the two-player public goods game and the sequential social dilemma games Cleanup and Harvest, as well as our self-developed large-scale environment in the game StarCraft II.
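As rough intuition for aligning individual and collective objectives in a differentiable game, one can blend each agent's individual gradient with the collective gradient and descend the blended direction. This is a simplified stand-in for the idea, not the paper's exact AgA update rule:

```python
def blended_step(x, grad_ind, grad_col, lam=0.5, lr=0.1):
    """One gradient step on a mix of individual and collective objectives.
    lam = 0 is pure self-interest; lam = 1 is pure collective optimization."""
    g = (1 - lam) * grad_ind(x) + lam * grad_col(x)
    return x - lr * g

# Toy differentiable game: the agent's own loss is x^2 (pulls x toward 0),
# while the collective loss is (x - 1)^2 (pulls x toward 1).
grad_ind = lambda x: 2 * x
grad_col = lambda x: 2 * (x - 1)

x = 2.0
for _ in range(200):
    x = blended_step(x, grad_ind, grad_col, lam=0.5)
print(round(x, 3))  # 0.5 -- a compromise between the two objectives
```

The toy run converges to a fixed point between the two objectives' optima; AgA's contribution is choosing the adjustment so that such fixed points are stable for the collective objective while still respecting individual interests.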

AAAI Conference 2024 Conference Paper

An Efficient Knowledge Transfer Strategy for Spiking Neural Networks from Static to Event Domain

  • Xiang He
  • Dongcheng Zhao
  • Yang Li
  • Guobin Shen
  • Qingqun Kong
  • Yi Zeng

Spiking neural networks (SNNs) are rich in spatio-temporal dynamics and are suitable for processing event-based neuromorphic data. However, event-based datasets are usually less annotated than static datasets. This small data scale makes SNNs prone to overfitting and limits their performance. In order to improve the generalization ability of SNNs on event-based datasets, we use static images to assist SNN training on event data. In this paper, we first discuss the domain mismatch problem encountered when directly transferring networks trained on static datasets to event data. We argue that the inconsistency of feature distributions becomes a major factor hindering the effective transfer of knowledge from static images to event data. To address this problem, we propose solutions in terms of two aspects: feature distribution and training strategy. Firstly, we propose a knowledge transfer loss, which consists of domain alignment loss and spatio-temporal regularization. The domain alignment loss learns domain-invariant spatial features by reducing the marginal distribution distance between the static image and the event data. Spatio-temporal regularization provides dynamically learnable coefficients for domain alignment loss by using the output features of the event data at each time step as a regularization term. In addition, we propose a sliding training strategy, which gradually replaces static image inputs probabilistically with event data, resulting in a smoother and more stable training for the network. We validate our method on neuromorphic datasets, including N-Caltech101, CEP-DVS, and N-Omniglot. The experimental results show that our proposed method achieves better performance on all datasets compared to the current state-of-the-art methods. Code is available at https://github.com/Brain-Cog-Lab/Transfer-for-DVS.

AAAI Conference 2024 Conference Paper

Bad Actor, Good Advisor: Exploring the Role of Large Language Models in Fake News Detection

  • Beizhe Hu
  • Qiang Sheng
  • Juan Cao
  • Yuhui Shi
  • Yang Li
  • Danding Wang
  • Peng Qi

Detecting fake news requires both a delicate sense of diverse clues and a profound understanding of the real-world background, which remains challenging for detectors based on small language models (SLMs) due to their knowledge and capability limitations. Recent advances in large language models (LLMs) have shown remarkable performance in various tasks, but whether and how LLMs could help with fake news detection remains underexplored. In this paper, we investigate the potential of LLMs in fake news detection. First, we conduct an empirical study and find that a sophisticated LLM such as GPT-3.5 can generally expose fake news and provide desirable multi-perspective rationales but still underperforms the basic SLM, a fine-tuned BERT. Our subsequent analysis attributes this gap to the LLM's inability to select and integrate rationales properly to reach a conclusion. Based on these findings, we propose that current LLMs may not substitute fine-tuned SLMs in fake news detection but can be a good advisor for SLMs by providing multi-perspective instructive rationales. To instantiate this proposal, we design an adaptive rationale guidance network for fake news detection (ARG), in which SLMs selectively acquire insights on news analysis from the LLMs' rationales. We further derive a rationale-free version of ARG by distillation, namely ARG-D, which serves cost-sensitive scenarios without querying LLMs. Experiments on two real-world datasets demonstrate that ARG and ARG-D outperform three types of baseline methods, including SLM-based, LLM-based, and combinations of small and large language models.

NeurIPS Conference 2024 Conference Paper

Benchmarking PtO and PnO Methods in the Predictive Combinatorial Optimization Regime

  • Haoyu Geng
  • Hang Ruan
  • Runzhong Wang
  • Yang Li
  • Yang Wang
  • Lei Chen
  • Junchi Yan

Predictive combinatorial optimization, where the parameters of the combinatorial optimization (CO) problem are unknown at decision-making time, precisely models many real-world applications, including energy-cost-aware scheduling and budget allocation in advertising. Tackling such a problem usually involves a prediction model and a CO solver. These two modules are integrated into the predictive CO pipeline following one of two design principles: "Predict-then-Optimize (PtO)", which learns predictions by supervised training and subsequently solves the CO problem using the predicted coefficients, and "Predict-and-Optimize (PnO)", which directly optimizes towards the ultimate decision quality and claims to yield better decisions than traditional PtO approaches. However, there is no systematic benchmark of both approaches covering the specific design choices at the module level, nor an evaluation dataset that covers representative real-world scenarios. To this end, we develop a modular framework to benchmark 11 existing PtO/PnO methods on 8 problems, including a new industrial dataset for combinatorial advertising that will be released. Our study shows that PnO approaches are better than PtO on 7 of 8 benchmarks, but there is no silver bullet for the specific design choices of PnO. A comprehensive categorization of current approaches and integration of typical scenarios are provided under a unified benchmark. This paper can therefore serve as a comprehensive benchmark for future PnO approach development and also offer fast prototyping for application-focused development. The code is available at https://github.com/Thinklab-SJTU/PredictiveCO-Benchmark.

NeurIPS Conference 2024 Conference Paper

ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model

  • Yiming Sun
  • Fan Yu
  • Shaoxiang Chen
  • Yu Zhang
  • Junwei Huang
  • Yang Li
  • Chenhui Li
  • Changbo Wang

Visual object tracking aims to locate a target object in a video sequence based on an initial bounding box. Recently, Vision-Language (VL) trackers have been proposed that utilize additional natural language descriptions to enhance versatility in various applications. However, VL trackers remain inferior to state-of-the-art (SoTA) visual trackers in tracking performance. We found that this inferiority primarily results from their heavy reliance on manual textual annotations, which frequently include ambiguous language descriptions. In this paper, we propose ChatTracker, which leverages the wealth of world knowledge in a Multimodal Large Language Model (MLLM) to generate high-quality language descriptions and enhance tracking performance. To this end, we propose a novel reflection-based prompt optimization module that iteratively refines ambiguous and inaccurate descriptions of the target using tracking feedback. To further utilize the semantic information produced by the MLLM, we propose a simple yet effective VL tracking framework that can be easily integrated as a plug-and-play module to boost the performance of both VL and visual trackers. Experimental results show that our proposed ChatTracker achieves performance comparable to existing methods.

IJCAI Conference 2024 Conference Paper

ClothPPO: A Proximal Policy Optimization Enhancing Framework for Robotic Cloth Manipulation with Observation-Aligned Action Spaces

  • Libing Yang
  • Yang Li
  • Long Chen

Vision-based robotic cloth unfolding has made great progress recently. However, prior works predominantly rely on value learning and have not fully explored policy-based techniques. Recently, the success of reinforcement learning on large language models has shown that policy gradient algorithms can enhance a policy over a huge action space. In this paper, we introduce ClothPPO, a framework that employs a policy gradient algorithm based on an actor-critic architecture to enhance a pre-trained model with a huge, observation-aligned action space of 10^6 actions in the task of unfolding clothes. To this end, we redefine the cloth manipulation problem as a partially observable Markov decision process. A supervised pre-training stage is employed to train a baseline version of our policy. In the second stage, Proximal Policy Optimization (PPO) is utilized to guide the supervised model within the observation-aligned action space. By optimizing and updating the strategy, our proposed method increases the garment's surface area during cloth unfolding under the soft-body manipulation task. Experimental results show that our proposed framework can further improve the unfolding performance of other state-of-the-art methods. Our project is available at https://vpx-ecnu.github.io/ClothPPO-website/.

AAAI Conference 2024 Conference Paper

Contrastive Continual Learning with Importance Sampling and Prototype-Instance Relation Distillation

  • Jiyong Li
  • Dilshod Azizov
  • Yang Li
  • Shangsong Liang

Recently, owing to the high-quality representations produced by contrastive learning methods, rehearsal-based contrastive continual learning has been proposed to explore how to continually learn transferable representation embeddings and avoid the catastrophic forgetting issue of traditional continual settings. Based on this framework, we propose Contrastive Continual Learning via Importance Sampling (CCLIS) to preserve knowledge by recovering previous data distributions with a new strategy for Replay Buffer Selection (RBS), which minimizes the estimated variance to save high-quality hard negative samples for representation learning. Furthermore, we present the Prototype-instance Relation Distillation (PRD) loss, a technique designed to maintain the relationship between prototypes and sample representations using a self-distillation process. Experiments on standard continual learning benchmarks reveal that our method notably outperforms existing baselines in knowledge preservation and thereby effectively counteracts catastrophic forgetting in online contexts. The code is available at https://github.com/lijy373/CCLIS.

JBHI Journal 2024 Journal Article

De-Biased Disentanglement Learning for Pulmonary Embolism Survival Prediction on Multimodal Data

  • Zhusi Zhong
  • Jie Li
  • Shreyas Kulkarni
  • Helen Zhang
  • Fayez H. Fayad
  • Yang Li
  • Scott Collins
  • Harrison Bai

Health disparities among marginalized populations with lower socioeconomic status significantly impact the fairness and effectiveness of healthcare delivery. The increasing integration of artificial intelligence (AI) into healthcare presents an opportunity to address these inequalities, provided that AI models are free from bias. This paper addresses the bias challenges posed by population disparities within healthcare systems, which are present in both the data and the development of algorithms and lead to inequitable medical care for conditions such as pulmonary embolism (PE) prognosis. In this study, we explore the diverse biases in healthcare systems, which highlight the demand for a holistic framework that reduces bias through complementary aggregation. Leveraging de-biasing deep survival prediction models, we propose a framework that disentangles identifiable information from images, text reports, and clinical variables to mitigate potential biases within multimodal datasets. Our study offers several advantages over traditional clinical survival prediction methods, including richer survival-related characteristics and bias-complementary predicted results. By improving the robustness of survival analysis through this framework, we aim to benefit patients, clinicians, and researchers by enhancing fairness and accuracy in healthcare AI systems.

JBHI Journal 2024 Journal Article

Developing Deep LSTMs With Later Temporal Attention for Predicting COVID-19 Severity, Clinical Outcome, and Antibody Level by Screening Serological Indicators Over Time

  • Jiaxin Cai
  • Yang Li
  • Baichen Liu
  • Zhixi Wu
  • Shengjun Zhu
  • Qiliang Chen
  • Qing Lei
  • Hongyan Hou

Objective: The clinical course of COVID-19, as well as the immunological reaction, is notable for its extreme variability. Identifying the main associated factors might help in understanding the disease progression and physiological status of COVID-19 patients. The dynamic changes of the antibody against the Spike protein are crucial for understanding the immune response. This work explores a temporal attention (TA) mechanism in deep learning to predict COVID-19 disease severity, clinical outcomes, and Spike antibody levels by screening serological indicators over time. Methods: We use feature selection techniques to filter feature subsets that are highly correlated with the target. Specific deep Long Short-Term Memory (LSTM) models are employed to capture the dynamic changes of disease severity, clinical outcome, and Spike antibody level. We also propose deep LSTMs with a TA mechanism to emphasize the later blood test records, because later records often attract more attention from doctors. Results: Risk factors highly correlated with COVID-19 are revealed. LSTM achieves the highest classification accuracy for disease severity prediction. Temporal Attention Long Short-Term Memory (TA-LSTM) achieves the best performance for clinical outcome prediction. For Spike antibody level prediction, LSTM achieves the best performance. Conclusion: The experimental results demonstrate the effectiveness of the proposed models. The proposed models can provide a computer-aided medical diagnostics system by simply using time series of serological indicators.
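The later-record emphasis can be illustrated with simple attention pooling over per-visit hidden states, where a recency bonus is added to the content scores before the softmax. This is an illustrative sketch under stated assumptions, not the paper's exact TA-LSTM formulation, and all names are hypothetical:

```python
import numpy as np

def temporal_attention_pool(hidden_states, w, recency_bias=0.5):
    """Pool LSTM hidden states of shape (T, d) into one vector,
    biasing the attention weights toward later time steps."""
    T = hidden_states.shape[0]
    scores = hidden_states @ w                      # content-based score per record
    scores = scores + recency_bias * np.arange(T)   # later blood tests get a bonus
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()               # softmax over time steps
    return weights @ hidden_states, weights

# With identical content at every step, the recency bonus alone drives attention.
h = np.ones((4, 3))
context, weights = temporal_attention_pool(h, w=np.zeros(3))
print(weights)  # strictly increasing over the four time steps
```

Because the bonus grows linearly with the step index, the pooled vector is dominated by the most recent records, matching the stated clinical intuition.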

NeurIPS Conference 2024 Conference Paper

Fast T2T: Optimization Consistency Speeds Up Diffusion-Based Training-to-Testing Solving for Combinatorial Optimization

  • Yang Li
  • Jinpei Guo
  • Runzhong Wang
  • Hongyuan Zha
  • Junchi Yan

Diffusion models have recently advanced Combinatorial Optimization (CO) as a powerful backbone for neural solvers. However, their iterative sampling process, which requires denoising across multiple noise levels, incurs substantial overhead. We propose to learn direct mappings from different noise levels to the optimal solution for a given instance, facilitating high-quality generation with minimal shots. This is achieved through an optimization consistency training protocol, which, for a given instance, minimizes the difference among samples originating from varying generative trajectories and time steps relative to the optimal solution. The proposed model enables fast single-step solution generation while retaining the option of multi-step sampling to trade computation for sampling quality, which offers a more effective and efficient alternative backbone for neural solvers. In addition, within the training-to-testing (T2T) framework, to bridge the gap between training on historical instances and solving new instances, we introduce a novel consistency-based gradient search scheme during the test stage, enabling more effective exploration of the solution space learned during training. It is achieved by updating the latent solution probabilities under objective gradient guidance during the alternation of noise injection and denoising steps. We refer to this model as Fast T2T. Extensive experiments on two popular tasks, the Traveling Salesman Problem (TSP) and Maximal Independent Set (MIS), demonstrate the superiority of Fast T2T regarding both solution quality and efficiency, even outperforming LKH given limited time budgets. Notably, Fast T2T with merely one-step generation and one-step gradient search can mostly outperform the SOTA diffusion-based counterparts that require hundreds of steps, while achieving tens of times speedup.

IS Journal 2024 Journal Article

Few-Shot Object Detection Based on Self-Knowledge Distillation

  • Yang Li
  • Yicheng Gong
  • Zhuo Zhang

In many fields, the lack of large-scale training data prevents traditional object detection methods from performing well in practice, mainly because of overfitting and a lack of generalization ability. In this work, we propose a general method to alleviate the overfitting problem in few-shot object detection. Our work extends Faster R-CNN with a self-knowledge distillation algorithm and designs a loss function with an attention mechanism, which improves true detections in the foreground. In this way, the object detector can learn an approximate mapping relationship from few samples, giving the network a stronger generalization ability when tackling few images. Through numerous comparative experiments, we demonstrate that our method is general and feasible on the VOC and COCO benchmark datasets with different settings. We provide a new idea for solving the problem of few-shot object detection and achieve excellent recall on few-shot object detection.
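As background, the core term in self-knowledge distillation is typically a temperature-softened KL divergence between the current model's logits and those of an earlier (or EMA) copy of itself. The sketch below shows only that generic term, not this paper's full attention-weighted loss:

```python
import numpy as np

def softmax(x, t=1.0):
    e = np.exp((x - x.max()) / t)
    return e / e.sum()

def self_distill_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) at a softened temperature, scaled by T^2 as usual.
    In self-distillation the teacher is a frozen earlier copy of the same model."""
    p = softmax(teacher_logits, temperature)   # teacher's softened distribution
    q = softmax(student_logits, temperature)   # student's softened distribution
    return float(np.sum(p * (np.log(p) - np.log(q))) * temperature ** 2)

l1 = np.array([2.0, 0.5, -1.0])
l2 = np.array([0.5, 2.0, -1.0])
print(self_distill_loss(l1, l1))  # 0.0 when student and teacher agree
print(self_distill_loss(l2, l1) > 0)
```

The temperature softens both distributions so the student also learns the teacher's relative ranking of non-maximal classes, which is what regularizes training when only a few samples are available.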

AAAI Conference 2024 Conference Paper

H-ensemble: An Information Theoretic Approach to Reliable Few-Shot Multi-Source-Free Transfer

  • Yanru Wu
  • Jianning Wang
  • Weida Wang
  • Yang Li

Multi-source transfer learning is an effective solution to data scarcity, utilizing multiple source tasks for the learning of the target task. However, access to source data and model details is limited in the era of commercial models, giving rise to the setting of multi-source-free (MSF) transfer learning, which aims to leverage source domain knowledge without such access. As a newly defined problem paradigm, MSF transfer learning remains largely underexplored and not clearly formulated. In this work, we adopt an information-theoretic perspective on it and propose a framework named H-ensemble, which dynamically learns the optimal linear combination, or ensemble, of source models for the target task, using a generalization of maximal correlation regression. The ensemble weights are optimized by maximizing an information-theoretic metric for transferability. Compared to previous works, H-ensemble is characterized by: 1) its adaptability to a novel and realistic MSF setting for few-shot target tasks, 2) theoretical reliability, and 3) a lightweight structure that is easy to interpret and adapt. Our method is empirically validated by ablation studies, along with extensive comparative analysis with other task ensemble and transfer learning methods. We show that H-ensemble can successfully learn the optimal task ensemble, as well as outperform prior art.
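The idea of learning a linear ensemble of frozen source models can be illustrated with an ordinary least-squares stand-in objective. The paper optimizes an information-theoretic transferability metric instead, so this is only a structural sketch:

```python
import numpy as np

def fit_ensemble_weights(source_preds, y):
    """Fit weights for a linear combination of frozen source-model predictions."""
    A = np.stack(source_preds, axis=1)   # shape (n_samples, n_sources)
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

# Two frozen "source models" whose outputs mix linearly into the target signal.
rng = np.random.default_rng(0)
s1, s2 = rng.normal(size=100), rng.normal(size=100)
y = 0.3 * s1 + 0.7 * s2
print(np.round(fit_ensemble_weights([s1, s2], y), 3))  # [0.3 0.7]
```

Only the combination weights are learned; the source models themselves stay frozen, which is exactly what makes the approach applicable when source data and model internals are inaccessible.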

NeurIPS Conference 2024 Conference Paper

LAM3D: Large Image-Point Clouds Alignment Model for 3D Reconstruction from Single Image

  • Ruikai Cui
  • Xibin Song
  • Weixuan Sun
  • Senbo Wang
  • Weizhe Liu
  • Shenzhou Chen
  • Taizhang Shang
  • Yang Li

Large Reconstruction Models have made significant strides in the realm of automated 3D content generation from single or multiple input images. Despite their success, these models often produce 3D meshes with geometric inaccuracies, stemming from the inherent challenges of deducing 3D shapes solely from image data. In this work, we introduce a novel framework, the Large Image and Point Cloud Alignment Model (LAM3D), which utilizes 3D point cloud data to enhance the fidelity of generated 3D meshes. Our methodology begins with the development of a point-cloud-based network that effectively generates precise and meaningful latent tri-planes, laying the groundwork for accurate 3D mesh reconstruction. Building upon this, our Image-Point-Cloud Feature Alignment technique processes a single input image, aligning to the latent tri-planes to imbue image features with robust 3D information. This process not only enriches the image features but also facilitates the production of high-fidelity 3D meshes without the need for multi-view input, significantly reducing geometric distortions. Our approach achieves state-of-the-art high-fidelity 3D mesh reconstruction from a single image in just 6 seconds, and experiments on various datasets demonstrate its effectiveness.

ICML Conference 2024 Conference Paper

Language Models as Semantic Indexers

  • Bowen Jin
  • Hansi Zeng
  • Guoyin Wang 0001
  • Xiusi Chen
  • Tianxin Wei
  • Ruirui Li 0002
  • Zhengyang Wang
  • Zheng Li 0018

Semantic identifier (ID) is an important concept in information retrieval that aims to preserve the semantics of objects such as documents and items inside their IDs. Previous studies typically adopt a two-stage pipeline to learn semantic IDs by first procuring embeddings using off-the-shelf text encoders and then deriving IDs based on the embeddings. However, each step introduces potential information loss, and there is usually an inherent mismatch between the distribution of embeddings within the latent space produced by text encoders and the anticipated distribution required for semantic indexing. It is non-trivial to design a method that can learn the document’s semantic representations and its hierarchical structure simultaneously, given that semantic IDs are discrete and sequentially structured, and the semantic supervision is deficient. In this paper, we introduce LMIndexer, a self-supervised framework to learn semantic IDs with a generative language model. We tackle the challenge of sequential discrete ID by introducing a semantic indexer capable of generating neural sequential discrete representations with progressive training and contrastive learning. In response to the semantic supervision deficiency, we propose to train the model with a self-supervised document reconstruction objective. We show the high quality of the learned IDs and demonstrate their effectiveness on three tasks including recommendation, product search, and document retrieval on five datasets from various domains. Code is available at https://github.com/PeterGriffinJin/LMIndexer.

AAAI Conference 2024 Conference Paper

Learning Persistent Community Structures in Dynamic Networks via Topological Data Analysis

  • Dexu Kong
  • Anping Zhang
  • Yang Li

Dynamic community detection methods often lack effective mechanisms to ensure temporal consistency, hindering the analysis of network evolution. In this paper, we propose a novel deep graph clustering framework with temporal consistency regularization on inter-community structures, inspired by the concept of minimal network topological changes within short intervals. Specifically, to address the representation collapse problem, we first introduce MFC, a matrix factorization-based deep graph clustering algorithm that preserves node embedding. Based on static clustering results, we construct probabilistic community networks and compute their persistent homology, a robust topological measure, to assess structural similarity between them. Moreover, a novel neural network regularization, TopoReg, is introduced to ensure the preservation of topological similarity between inter-community structures over time intervals. Our approach enhances temporal consistency and clustering accuracy on real-world datasets with both fixed and varying numbers of communities. It is also a pioneering application of TDA in temporally persistent community detection, offering an insightful contribution to the field of network analysis. Code and data are available at the public git repository: https://github.com/kundtx/MFC-TopoReg.

NeurIPS Conference 2024 Conference Paper

Learning Plaintext-Ciphertext Cryptographic Problems via ANF-based SAT Instance Representation

  • Xinhao Zheng
  • Yang Li
  • Cunxin Fan
  • Huaijin Wu
  • Xinhao Song
  • Junchi Yan

Cryptographic problems, operating within binary variable spaces, can be routinely transformed into Boolean Satisfiability (SAT) problems regarding specific cryptographic conditions like plaintext-ciphertext matching. With the fast development of learning for discrete data, this SAT representation also facilitates the utilization of machine-learning approaches, with the hope of automatically capturing patterns and strategies inherent in cryptographic structures in a data-driven manner. Existing neural SAT solvers consistently adopt conjunctive normal form (CNF) for instance representation, which in the cryptographic context can lead to scale explosion and a loss of high-level semantics. In particular, the XOR operations used extensively in cryptographic problems can incur an exponential number of clauses. In this paper, we propose a graph structure based on Algebraic Normal Form (ANF) to efficiently handle the XOR operation bottleneck. Additionally, we design an encoding method for AND operations in these ANF-based graphs, demonstrating improved efficiency over alternative general graph forms for SAT. We then propose CryptoANFNet, a graph learning approach that trains a classifier based on a message-passing scheme to predict plaintext-ciphertext satisfiability. Using ANF-based SAT instances, CryptoANFNet demonstrates superior scalability and can naturally capture higher-order operational information. Empirically, CryptoANFNet achieves a 50x speedup over heuristic solvers and outperforms the SOTA learning-based SAT solver NeuroSAT, with 96% vs. 91% accuracy on small-scale and 72% vs. 55% on large-scale datasets from real encryption algorithms. We also introduce a key-solving algorithm that simplifies ANF-based SAT instances from plaintext and ciphertext, enhancing key decryption accuracy from 76.5% to 82% and from 72% to 75% for datasets generated from two real encryption algorithms.
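The XOR clause explosion mentioned in the abstract is easy to see concretely: a direct CNF encoding of an n-variable XOR constraint (without auxiliary variables) needs one clause per even-parity assignment, i.e. 2^(n-1) clauses. A minimal demonstration:

```python
from itertools import product

def xor_to_cnf(vars_):
    """Direct CNF encoding of v1 XOR ... XOR vn = 1, with no auxiliary variables.
    One blocking clause is added for every assignment that falsifies the XOR."""
    n = len(vars_)
    clauses = []
    for bits in product([0, 1], repeat=n):
        if sum(bits) % 2 == 0:  # even-parity assignments violate XOR = 1
            # Forbid this assignment: at least one literal must differ from it.
            clauses.append([v if b == 0 else -v for v, b in zip(vars_, bits)])
    return clauses

for n in (3, 6, 9):
    print(n, len(xor_to_cnf(list(range(1, n + 1)))))  # 4, 32, 256 == 2^(n-1)
```

An ANF representation instead keeps each XOR as a single polynomial term, which is the bottleneck the proposed graph structure avoids.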

NeurIPS Conference 2024 Conference Paper

Mixtures of Experts for Audio-Visual Learning

  • Ying Cheng
  • Yang Li
  • Junjie He
  • Rui Feng

With the rapid development of multimedia technology, audio-visual learning has emerged as a promising research topic within the field of multimodal analysis. In this paper, we explore parameter-efficient transfer learning for audio-visual learning and propose the Audio-Visual Mixture of Experts (AVMoE) to inject adapters into pre-trained models flexibly. Specifically, we introduce unimodal and cross-modal adapters as multiple experts to specialize in intra-modal and inter-modal information, respectively, and employ a lightweight router to dynamically allocate the weights of each expert according to the specific demands of each task. Extensive experiments demonstrate that our proposed approach AVMoE achieves superior performance across multiple audio-visual tasks, including AVE, AVVP, AVS, and AVQA. Furthermore, visual-only experimental results also indicate that our approach can tackle challenging scenes where modality information is missing. The source code is available at https://github.com/yingchengy/AVMOE.
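
The routing step described above, a lightweight router dynamically weighting adapter experts, can be sketched as a softmax-weighted combination of expert outputs. All sizes and names here are illustrative, not the paper's implementation:

```python
import numpy as np

def route(expert_outputs, router_logits):
    """Combine adapter ("expert") outputs with softmax router weights.

    expert_outputs: (n_experts, dim) array of adapter outputs
    router_logits:  (n_experts,) scores from a lightweight router
    """
    w = np.exp(router_logits - router_logits.max())
    w /= w.sum()                 # softmax over experts
    return w @ expert_outputs    # convex combination per feature dim

# Two hypothetical experts (e.g. one unimodal, one cross-modal adapter):
out = route(np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([0.0, 0.0]))
print(out)  # equal logits -> plain average of the two experts
```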

AAAI Conference 2024 Conference Paper

Multi-Prototype Space Learning for Commonsense-Based Scene Graph Generation

  • Lianggangxu Chen
  • Youqi Song
  • Yiqing Cai
  • Jiale Lu
  • Yang Li
  • Yuan Xie
  • Changbo Wang
  • Gaoqi He

In the domain of scene graph generation, modeling commonsense as a single-prototype representation has been typically employed to facilitate the recognition of infrequent predicates. However, a fundamental challenge lies in the large intra-class variations of the visual appearance of predicates, resulting in subclasses within a predicate class. Such a challenge typically leads to the problem of misclassifying diverse predicates due to the rough predicate space clustering. In this paper, inspired by cognitive science, we maintain multi-prototype representations for each predicate class, which can accurately find the multiple class centers of the predicate space. Technically, we propose a novel multi-prototype learning framework consisting of three main steps: prototype-predicate matching, prototype updating, and prototype space optimization. We first design a triple-level optimal transport to match each predicate feature within the same class to a specific prototype. In addition, the prototypes are updated using momentum updating to find the class centers according to the matching results. Finally, we enhance the inter-class separability of the prototype space through iterations of the inter-class separability loss and intra-class compactness loss. Extensive evaluations demonstrate that our approach significantly outperforms state-of-the-art methods on the Visual Genome dataset.

JMLR Journal 2024 Journal Article

OpenBox: A Python Toolkit for Generalized Black-box Optimization

  • Huaijun Jiang
  • Yu Shen
  • Yang Li
  • Beicheng Xu
  • Sixian Du
  • Wentao Zhang
  • Ce Zhang
  • Bin Cui

Black-box optimization (BBO) has a broad range of applications, including automatic machine learning, experimental design, and database knob tuning. However, users still face challenges when applying BBO methods to their problems at hand with existing software packages in terms of applicability, performance, and efficiency. This paper presents OpenBox, an open-source BBO toolkit with improved usability. It implements user-friendly interfaces and visualization for users to define and manage their tasks. The modular design behind OpenBox facilitates its flexible deployment in existing systems. Experimental results demonstrate the effectiveness and efficiency of OpenBox over existing systems. The source code of OpenBox is available at https://github.com/PKU-DAIR/open-box.

IJCAI Conference 2024 Conference Paper

OUCopula: Bi-Channel Multi-Label Copula-Enhanced Adapter-Based CNN for Myopia Screening Based on OU-UWF Images

  • Yang Li
  • Qiuyi Huang
  • Chong Zhong
  • Danjuan Yang
  • Meiyan Li
  • A. H. Welsh
  • Aiyi Liu
  • Bo Fu

Myopia screening using cutting-edge ultra-widefield (UWF) fundus imaging is potentially significant for ophthalmic outcomes. Current multidisciplinary research between ophthalmology and deep learning (DL) concentrates primarily on disease classification and diagnosis using single-eye images, largely ignoring joint modeling and prediction for Oculus Uterque (OU, both eyes). Inspired by the complex relationships between OU and the high correlation between the (continuous) outcome labels (Spherical Equivalent and Axial Length), we propose a framework of copula-enhanced adapter convolutional neural network (CNN) learning with OU UWF fundus images (OUCopula) for joint prediction of multiple clinical scores. We design a novel bi-channel multi-label CNN which can (1) take bi-channel image inputs subject to both high correlation and heterogeneity (by sharing the same backbone network and employing adapters to parameterize the channel-wise discrepancy), and (2) incorporate correlation information between continuous output labels (using a copula). Solid experiments show that OUCopula achieves satisfactory performance in myopia score prediction compared to backbone models. Moreover, OUCopula can far exceed the performance of models constructed for single-eye inputs. Importantly, our study also hints at the potential extension of the bi-channel model to a multi-channel paradigm and the generalizability of OUCopula across various backbone CNNs. The code and the supplementary materials are available at: github.com/Charley-HUANG/OUCopula.

ICLR Conference 2024 Conference Paper

Perceptual Group Tokenizer: Building Perception with Iterative Grouping

  • Zhiwei Deng
  • Ting Chen
  • Yang Li

The human visual recognition system shows an astonishing capability of compressing visual information into a set of tokens containing rich representations without label supervision. One critical driving principle behind it is perceptual grouping. Despite being widely used in computer vision in the early 2010s, it remains a mystery whether perceptual grouping can be leveraged to derive a neural visual recognition backbone that generates equally powerful representations. In this paper, we propose the Perceptual Group Tokenizer, a model that entirely relies on grouping operations to extract visual features and perform self-supervised representation learning, where a series of grouping operations are used to iteratively hypothesize the context for pixels or superpixels to refine feature representations. We show that the proposed model can achieve competitive performance compared to state-of-the-art vision architectures, and inherits desirable properties including adaptive computation without re-training, and interpretability. Specifically, Perceptual Group Tokenizer achieves 79.7% on the ImageNet-1K self-supervised learning benchmark with linear probe evaluation, marking new progress under this paradigm.

AAAI Conference 2024 Conference Paper

Phoneme Hallucinator: One-Shot Voice Conversion via Set Expansion

  • Siyuan Shan
  • Yang Li
  • Amartya Banerjee
  • Junier B. Oliva

Voice conversion (VC) aims at altering a person's voice to make it sound similar to the voice of another person while preserving linguistic content. Existing methods suffer from a dilemma between content intelligibility and speaker similarity; i.e., methods with higher intelligibility usually have a lower speaker similarity, while methods with higher speaker similarity usually require plenty of target speaker voice data to achieve high intelligibility. In this work, we propose a novel method Phoneme Hallucinator that achieves the best of both worlds. Phoneme Hallucinator is a one-shot VC model; it adopts a novel model to hallucinate diversified and high-fidelity target speaker phonemes based just on a short target speaker voice (e.g. 3 seconds). The hallucinated phonemes are then exploited to perform neighbor-based voice conversion. Our model is a text-free, any-to-any VC model that requires no text annotations and supports conversion to any unseen speaker. Quantitative and qualitative evaluations show that Phoneme Hallucinator outperforms existing VC methods for both intelligibility and speaker similarity.
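
The neighbor-based conversion step mentioned above can be illustrated with a plain k-nearest-neighbour feature replacement: each source frame feature is swapped for the average of its closest target-speaker features. The hallucination model that expands the target set is omitted, and all names and sizes here are illustrative:

```python
import numpy as np

def knn_convert(src, tgt, k=2):
    """Replace each source frame feature with the mean of its k
    nearest neighbours among target-speaker features.

    src: (n_frames, dim) source features
    tgt: (n_tgt, dim) target-speaker features (possibly hallucinated)
    """
    # Pairwise squared distances between source and target frames.
    d = ((src[:, None, :] - tgt[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d, axis=1)[:, :k]   # k nearest targets per frame
    return tgt[idx].mean(axis=1)         # average the selected features

src = np.array([[0.0, 0.0], [1.0, 1.0]])
tgt = np.array([[0.0, 0.1], [0.0, -0.1], [1.0, 0.9], [1.0, 1.1]])
print(knn_convert(src, tgt))
```

With more hallucinated target features to draw from, the nearest-neighbour pool covers the phoneme space better, which is the motivation for the set-expansion step.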

JAIR Journal 2024 Journal Article

Tackling Cooperative Incompatibility for Zero-Shot Human-AI Coordination

  • Yang Li
  • Shao Zhang
  • Jichen Sun
  • Wenhao Zhang
  • Yali Du
  • Ying Wen
  • Xinbing Wang
  • Wei Pan

Securing coordination between an AI agent and teammates (human players or AI agents) in contexts involving unfamiliar humans continues to pose a significant challenge in Zero-Shot Coordination (ZSC). The issue of cooperative incompatibility becomes particularly prominent when an AI agent is unsuccessful in synchronizing with certain previously unknown partners. Traditional algorithms have aimed to collaborate with partners by optimizing fixed objectives within a population, fostering diversity in strategies and behaviors. However, these techniques may lead to learning loss and an inability to cooperate with specific strategies within the population, a phenomenon named cooperative incompatibility in learning. In order to solve cooperative incompatibility in learning and effectively address the problem in the context of ZSC, we introduce the Cooperative Open-ended LEarning (COLE) framework, which formulates open-ended objectives in cooperative games with two players using perspectives of graph theory to evaluate and pinpoint the cooperative capacity of each strategy. We present two practical algorithms, specifically COLESV and COLER, which incorporate insights from game theory and graph theory. We also show that COLE could effectively overcome the cooperative incompatibility from theoretical and empirical analysis. Subsequently, we created an online Overcooked human-AI experiment platform, the COLE platform, which enables easy customization of questionnaires, model weights, and other aspects. Utilizing the COLE platform, we enlist 130 participants for human experiments. Our findings reveal a preference for our approach over state-of-the-art methods using a variety of subjective metrics. Moreover, objective experimental outcomes in the Overcooked game environment indicate that our method surpasses existing ones when coordinating with previously unencountered AI agents and the human proxy model. Our code and demo are publicly available at https://sites.google.com/view/cole-2023.

NeurIPS Conference 2024 Conference Paper

UniAR: A Unified model for predicting human Attention and Responses on visual content

  • Peizhao Li
  • Junfeng He
  • Gang Li
  • Rachit Bhargava
  • Shaolei Shen
  • Nachiappan Valliappan
  • Youwei Liang
  • Hongxiang Gu

Progress in human behavior modeling involves understanding both implicit, early-stage perceptual behavior, such as human attention, and explicit, later-stage behavior, such as subjective preferences or likes. Yet most prior research has focused on modeling implicit and explicit human behavior in isolation; and often limited to a specific type of visual content. We propose UniAR -- a unified model of human attention and preference behavior across diverse visual content. UniAR leverages a multimodal transformer to predict subjective feedback, such as satisfaction or aesthetic quality, along with the underlying human attention or interaction heatmaps and viewing order. We train UniAR on diverse public datasets spanning natural images, webpages, and graphic designs, and achieve SOTA performance on multiple benchmarks across various image domains and behavior modeling tasks. Potential applications include providing instant feedback on the effectiveness of UIs/visual content, and enabling designers and content-creation models to optimize their creation for human-centric improvements.

NeurIPS Conference 2024 Conference Paper

VB-LoRA: Extreme Parameter Efficient Fine-Tuning with Vector Banks

  • Yang Li
  • Shaobo Han
  • Shihao Ji

As the adoption of large language models increases and the need for per-user or per-task model customization grows, the parameter-efficient fine-tuning (PEFT) methods, such as low-rank adaptation (LoRA) and its variants, incur substantial storage and transmission costs. To further reduce stored parameters, we introduce a "divide-and-share" paradigm that breaks the barriers of low-rank decomposition across matrix dimensions, modules, and layers by sharing parameters globally via a vector bank. As an instantiation of the paradigm to LoRA, our proposed VB-LoRA composites all the low-rank matrices of LoRA from a shared vector bank with a differentiable top-$k$ admixture module. VB-LoRA achieves extreme parameter efficiency while maintaining comparable or better performance compared to state-of-the-art PEFT methods. Extensive experiments demonstrate the effectiveness of VB-LoRA on natural language understanding, natural language generation, instruction tuning, and mathematical reasoning tasks. When fine-tuning the Llama2-13B model, VB-LoRA only uses 0.4% of LoRA's stored parameters, yet achieves superior results. Our source code is available at https://github.com/leo-yangli/VB-LoRA. This method has been merged into the Hugging Face PEFT package.
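
The vector-bank composition can be sketched in a few lines: each low-rank sub-vector is a softmax mixture of its top-k bank entries, so only the bank and the selection logits need storing. All sizes below are made up for illustration; this is a sketch of the idea, not the PEFT-package implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a bank of 32 shared vectors of length 16, and six
# sub-vector "slots" that would be reassembled into LoRA's A/B factors.
bank = rng.normal(size=(32, 16))
logits = rng.normal(size=(6, 32))

def topk_admixture(logits, bank, k=2):
    """For each slot, keep the k largest logits, softmax over them,
    and mix the corresponding bank vectors."""
    out = np.zeros((logits.shape[0], bank.shape[1]))
    for i, row in enumerate(logits):
        idx = np.argsort(row)[-k:]             # indices of the top-k entries
        w = np.exp(row[idx] - row[idx].max())
        w /= w.sum()                           # softmax over the top-k only
        out[i] = w @ bank[idx]                 # mixture of bank vectors
    return out

sub_vectors = topk_admixture(logits, bank)
# Only `bank` and `logits` are stored; the low-rank matrices are
# reassembled from `sub_vectors` when the adapter is loaded.
print(sub_vectors.shape)
```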

NeurIPS Conference 2024 Conference Paper

Wings: Learning Multimodal LLMs without Text-only Forgetting

  • Yi-Kai Zhang
  • Shiyin Lu
  • Yang Li
  • Yanqing Ma
  • Qing-Guo Chen
  • Zhao Xu
  • Weihua Luo
  • Kaifu Zhang

Multimodal large language models (MLLMs), initiated with a trained LLM, first align images with text and then fine-tune on multimodal mixed inputs. However, during the continued training, the MLLM catastrophically forgets the text-only instructions that the initial LLM masters. In this paper, we present Wings, a novel MLLM that excels in both text-only and multimodal instructions. By examining attention across layers of MLLM, we find that text-only forgetting is related to the attention shifts from pre-image to post-image text. From that, we construct an additional Low-Rank Residual Attention (LoRRA) block that acts as the "modality learner" to expand the learnable space and compensate for the attention shift. The complementary learners, like "wings" on either side, are connected in parallel to each layer's attention block. The LoRRA mirrors the structure of attention but utilizes low-rank connections to ensure efficiency. Initially, image and text inputs are aligned with visual learners operating alongside the main attention, balancing focus on visual elements. Later, textual learners are integrated with token-wise routing, blending the outputs of both modality learners collaboratively. Our experimental results demonstrate that Wings outperforms equally-scaled MLLMs in both text-only and visual question-answering tasks. Wings with compensation of learners addresses text-only forgetting during visual modality expansion in general MLLMs.

JBHI Journal 2023 Journal Article

AC-E Network: Attentive Context-Enhanced Network for Liver Segmentation

  • Yang Li
  • Beiji Zou
  • Peishan Dai
  • Miao Liao
  • Harrison X. Bai
  • Zhicheng Jiao

Segmentation of the liver from CT scans is essential in computer-aided liver disease diagnosis and treatment. However, 2D CNNs ignore the 3D context, while 3D CNNs suffer from numerous learnable parameters and high computational cost. To overcome this limitation, we propose an Attentive Context-Enhanced Network (AC-E Network) consisting of 1) an attentive context encoding module (ACEM) that can be integrated into a 2D backbone to extract 3D context without a sharp increase in the number of learnable parameters; 2) a dual segmentation branch with a complemental loss that makes the network attend to both the liver region and its boundary, so that the segmented liver surface is obtained with high accuracy. Extensive experiments on the LiTS and 3D-IRCADb datasets demonstrate that our method outperforms existing approaches and is competitive with the state-of-the-art 2D-3D hybrid method in balancing segmentation precision and the number of model parameters.

NeurIPS Conference 2023 Conference Paper

Bullying10K: A Large-Scale Neuromorphic Dataset towards Privacy-Preserving Bullying Recognition

  • Yiting Dong
  • Yang Li
  • Dongcheng Zhao
  • Guobin Shen
  • Yi Zeng

The prevalence of violence in daily life poses significant threats to individuals' physical and mental well-being. Using surveillance cameras in public spaces has proven effective in proactively deterring and preventing such incidents. However, concerns regarding privacy invasion have emerged due to their widespread deployment. To address the problem, we leverage Dynamic Vision Sensor (DVS) cameras to detect violent incidents while preserving privacy, since they capture pixel brightness variations instead of static imagery. We introduce the Bullying10K dataset, encompassing various actions, complex movements, and occlusions from real-life scenarios. It provides three benchmarks for evaluating different tasks: action recognition, temporal action localization, and pose estimation. With 10,000 event segments, totaling 12 billion events and 255 GB of data, Bullying10K contributes significantly by balancing violence detection and personal privacy preservation. It also poses a new challenge for neuromorphic datasets. It will serve as a valuable resource for training and developing privacy-protecting video systems. The Bullying10K opens new possibilities for innovative approaches in these domains.

JBHI Journal 2023 Journal Article

HDL: Hybrid Deep Learning for the Synthesis of Myocardial Velocity Maps in Digital Twins for Cardiac Analysis

  • Xiaodan Xing
  • Javier Del Ser
  • Yinzhe Wu
  • Yang Li
  • Jun Xia
  • Lei Xu
  • David Firmin
  • Peter Gatehouse

Synthetic digital twins based on medical data accelerate the acquisition, labelling and decision making procedure in digital healthcare. A core part of digital healthcare twins is model-based data synthesis, which permits the generation of realistic medical signals without the need to cope with the modelling complexity of the anatomical and biochemical phenomena producing them in reality. Unfortunately, algorithms for cardiac data synthesis have been so far scarcely studied in the literature. An important imaging modality in the cardiac examination is three-directional CINE multi-slice myocardial velocity mapping (3Dir MVM), which provides a quantitative assessment of cardiac motion in three orthogonal directions of the left ventricle. The long acquisition time and complex acquisition procedure make it more urgent to produce synthetic digital twins of this imaging modality. In this study, we propose a hybrid deep learning (HDL) network, especially for synthetic 3Dir MVM data. Our algorithm is featured by a hybrid UNet and a Generative Adversarial Network with a foreground-background generation scheme. The experimental results show that from temporally down-sampled magnitude CINE images (six times), our proposed algorithm can still successfully synthesise high temporal resolution 3Dir MVM CMR data (PSNR=42.32) with precise left ventricle segmentation (DICE=0.92). These performance scores indicate that our proposed HDL algorithm can be implemented in real-world digital twins for myocardial velocity mapping data simulation. To the best of our knowledge, this work is the first one investigating digital twins of the 3Dir MVM CMR, which has shown great potential for improving the efficiency of clinical studies via synthesised cardiac data.

IJCAI Conference 2023 Conference Paper

IID-GAN: an IID Sampling Perspective for Regularizing Mode Collapse

  • Yang Li
  • Liangliang Shi
  • Junchi Yan

Despite its success, generative adversarial networks (GANs) still suffer from mode collapse, i.e., the generator can only map latent variables to a partial set of modes in the target distribution. In this paper, we analyze and seek to regularize this issue with an independent and identically distributed (IID) sampling perspective and emphasize that holding the IID property referring to the target distribution for generation can naturally avoid mode collapse. This is based on the basic IID assumption for real data in machine learning. However, though the source samples {z} obey IID, the generations {G(z)} may not necessarily be IID sampling from the target distribution. Based on this observation, considering a necessary condition of IID generation, we propose a new loss to encourage the closeness between the inverse samples of real data and the Gaussian source in the latent space to regularize the generation to be IID from the target distribution. The logic is that the inverse samples from target data should also be IID in the source distribution. Experiments on both synthetic and real-world data show the effectiveness of our model.

TMLR Journal 2023 Journal Article

JiangJun: Mastering Xiangqi by Tackling Non-Transitivity in Two-Player Zero-Sum Games

  • Yang Li
  • Kun Xiong
  • Yingping Zhang
  • Jiangcheng Zhu
  • Stephen Marcus McAleer
  • Wei Pan
  • Jun Wang
  • Zonghong Dai

This paper presents an empirical exploration of non-transitivity in perfect-information games, specifically focusing on Xiangqi, a traditional Chinese board game comparable in game-tree complexity to chess and shogi. By analyzing over 10,000 records of human Xiangqi play, we highlight the existence of both transitive and non-transitive elements within the game’s strategic structure. To address non-transitivity, we introduce the JiangJun algorithm, an innovative combination of Monte-Carlo Tree Search (MCTS) and Policy Space Response Oracles (PSRO) designed to approximate a Nash equilibrium. We evaluate the algorithm empirically using a WeChat mini program and achieve a Master level with a 99.41% win rate against human players. The algorithm’s effectiveness in overcoming non-transitivity is confirmed by a plethora of metrics, such as relative population performance and visualization results. Our project site is available at https://sites.google.com/view/jiangjun-site/.

AAAI Conference 2023 Conference Paper

Learn from Yesterday: A Semi-supervised Continual Learning Method for Supervision-Limited Text-to-SQL Task Streams

  • Yongrui Chen
  • Xinnan Guo
  • Tongtong Wu
  • Guilin Qi
  • Yang Li
  • Yang Dong

Conventional text-to-SQL studies are limited to a single task with a fixed-size training and test set. When confronted with a stream of tasks common in real-world applications, existing methods struggle with the problems of insufficient supervised data and high retraining costs. The former tends to cause overfitting on unseen databases for the new task, while the latter makes a full review of instances from past tasks impractical for the model, resulting in forgetting of learned SQL structures and database schemas. To address the problems, this paper proposes integrating semi-supervised learning (SSL) and continual learning (CL) in a stream of text-to-SQL tasks and offers two promising solutions in turn. The first solution, Vanilla, is to perform self-training, augmenting the supervised training data with predicted pseudo-labeled instances of the current task, while replacing full-volume retraining with episodic memory replay to balance training efficiency against the performance on previous tasks. The improved solution, SFNet, takes advantage of the intrinsic connection between CL and SSL. It uses in-memory past information to help current SSL, while adding high-quality pseudo instances to memory to improve future replay. Experiments on two datasets show that SFNet outperforms the widely-used SSL-only and CL-only baselines on multiple metrics.

UAI Conference 2023 Conference Paper

Modified Retrace for Off-Policy Temporal Difference Learning

  • Xingguo Chen
  • Xingzhou Ma
  • Yang Li
  • Guang Yang 0066
  • Shangdong Yang
  • Yang Gao 0001

Off-policy learning is key to extending reinforcement learning, as it allows learning a target policy from data generated by a different behavior policy. However, when combined with bootstrapping and function approximation, it forms the well-known "deadly triad". Retrace is an efficient and convergent off-policy algorithm with tabular value functions which employs truncated importance sampling ratios. Unfortunately, Retrace is known to be unstable with linear function approximation. In this paper, we propose modified Retrace to correct the off-policy return, derive a new off-policy temporal difference learning algorithm (TD-MRetrace) with linear function approximation, and obtain a convergence guarantee under standard assumptions. Experimental results on counterexamples and control tasks validate the effectiveness of the proposed algorithm compared with traditional algorithms.
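
For context, standard tabular Retrace truncates the importance sampling ratio at 1 (the trace coefficient is lambda * min(1, pi/mu)) when forming the off-policy return correction. A minimal sketch of that tabular baseline (not the proposed TD-MRetrace, which operates with linear function approximation):

```python
def retrace_correction(q, traj, pi, mu, actions, gamma=0.99, lam=1.0):
    """Tabular Retrace correction for the first (state, action) of an
    off-policy trajectory.

    q:      dict (s, a) -> value estimate (missing entries treated as 0)
    traj:   list of (s, a, r, s_next) generated by behavior policy mu
    pi, mu: dicts (s, a) -> action probability under target/behavior
    """
    total, discount, trace = 0.0, 1.0, 1.0
    for t, (s, a, r, s_next) in enumerate(traj):
        if t > 0:
            trace *= lam * min(1.0, pi[(s, a)] / mu[(s, a)])  # truncated ratio
        ev = sum(pi[(s_next, b)] * q.get((s_next, b), 0.0) for b in actions)
        delta = r + gamma * ev - q.get((s, a), 0.0)  # TD error under pi
        total += discount * trace * delta
        discount *= gamma
    return total

# Toy two-action example with all value estimates at zero: the
# correction for a single transition reduces to the reward itself.
actions = [0, 1]
pi = {("s0", 0): 0.9, ("s0", 1): 0.1, ("s1", 0): 0.5, ("s1", 1): 0.5}
mu = {("s0", 0): 0.5, ("s0", 1): 0.5, ("s1", 0): 0.5, ("s1", 1): 0.5}
corr = retrace_correction({}, [("s0", 0, 1.0, "s1")], pi, mu, actions)
print(corr)  # 1.0
```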

NeurIPS Conference 2023 Conference Paper

OpenLane-V2: A Topology Reasoning Benchmark for Unified 3D HD Mapping

  • Huijie Wang
  • Tianyu Li
  • Yang Li
  • Li Chen
  • Chonghao Sima
  • Zhenbo Liu
  • Bangjun Wang
  • Peijin Jia

Accurately depicting the complex traffic scene is a vital component for autonomous vehicles to execute correct judgments. However, existing benchmarks tend to oversimplify the scene by solely focusing on lane perception tasks. Observing that human drivers rely on both lanes and traffic signals to operate their vehicles safely, we present OpenLane-V2, the first dataset on topology reasoning for traffic scene structure. The objective of the presented dataset is to advance research in understanding the structure of road scenes by examining the relationship between perceived entities, such as traffic elements and lanes. Leveraging existing datasets, OpenLane-V2 consists of 2,000 annotated road scenes that describe traffic elements and their correlation to the lanes. It comprises three primary sub-tasks, including the 3D lane detection inherited from OpenLane, accompanied by corresponding metrics to evaluate the model’s performance. We evaluate various state-of-the-art methods, and present their quantitative and qualitative results on OpenLane-V2 to indicate future avenues for investigating topology reasoning in traffic scenes.

AAAI Conference 2023 Conference Paper

ProxyBO: Accelerating Neural Architecture Search via Bayesian Optimization with Zero-Cost Proxies

  • Yu Shen
  • Yang Li
  • Jian Zheng
  • Wentao Zhang
  • Peng Yao
  • Jixiang Li
  • Sen Yang
  • Ji Liu

Designing neural architectures requires immense manual effort. This has promoted the development of neural architecture search (NAS) to automate the design. While previous NAS methods achieve promising results, they run slowly; zero-cost proxies, in contrast, run extremely fast but are less promising. Therefore, there is great potential to accelerate NAS via those zero-cost proxies. The existing method has two limitations: unforeseeable reliability and one-shot usage. To address these limitations, we present ProxyBO, an efficient Bayesian optimization (BO) framework that utilizes the zero-cost proxies to accelerate neural architecture search. We apply the generalization ability measurement to estimate the fitness of proxies on the task during each iteration and design a novel acquisition function to combine BO with zero-cost proxies based on their dynamic influence. Extensive empirical studies show that ProxyBO consistently outperforms competitive baselines on five tasks from three public benchmarks. Concretely, ProxyBO achieves up to 5.41× and 3.86× speedups over the state-of-the-art approaches REA and BRP-NAS.

NeurIPS Conference 2023 Conference Paper

T2T: From Distribution Learning in Training to Gradient Search in Testing for Combinatorial Optimization

  • Yang Li
  • Jinpei Guo
  • Runzhong Wang
  • Junchi Yan

Extensive experiments have gradually revealed the potential performance bottleneck of modeling Combinatorial Optimization (CO) solving as neural solution prediction tasks. The neural networks, in their pursuit of minimizing the average objective score across the distribution of historical problem instances, diverge from the core target of CO of seeking optimal solutions for every test instance. This calls for an effective search on each problem instance, while the model should serve to provide supporting knowledge that benefits the search. To this end, we propose the T2T (Training to Testing) framework that first leverages generative modeling to estimate the high-quality solution distribution for each instance during training, and then conducts a gradient-based search within the solution space during testing. The proposed neural search paradigm consistently leverages generative modeling, specifically diffusion, for graduated solution improvement. It disrupts the local structure of the given solution by introducing noise and reconstructs a lower-cost solution guided by the optimization objective. Experimental results on the Traveling Salesman Problem (TSP) and Maximal Independent Set (MIS) show the significant superiority of T2T, demonstrating an average performance gain of 49.15% for TSP solving and 17.27% for MIS solving compared to the previous state-of-the-art.

IJCAI Conference 2022 Conference Paper

A Universal PINNs Method for Solving Partial Differential Equations with a Point Source

  • Xiang Huang
  • Hongsheng Liu
  • Beiji Shi
  • Zidong Wang
  • Kang Yang
  • Yang Li
  • Min Wang
  • Haotian Chu

In recent years, deep learning technology has been used to solve partial differential equations (PDEs), among which the physics-informed neural networks (PINNs) method has emerged as a promising approach for solving both forward and inverse PDE problems. PDEs with a point source, expressed as a Dirac delta function in the governing equations, are mathematical models of many physical processes. However, they cannot be solved directly by the conventional PINNs method due to the singularity brought by the Dirac delta function. In this paper, we propose a universal solution to this problem based on three novel techniques. Firstly, the Dirac delta function is modeled as a continuous probability density function to eliminate the singularity at the point source; secondly, a lower bound constrained uncertainty weighting algorithm is proposed to balance the physics-informed loss terms between the point source area and the remaining areas; and thirdly, a multi-scale deep neural network with a periodic activation function is used to improve the accuracy and convergence speed. We evaluate the proposed method on three representative PDEs, and the experimental results show that our method outperforms existing deep learning based methods with respect to accuracy, efficiency and versatility.
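
The first technique, replacing the Dirac delta with a continuous density, can be illustrated in a few lines: a narrow Gaussian integrates to one and concentrates at the source location, so the PDE residual at the source stays finite. The width `eps` below is an arbitrary illustrative choice:

```python
import numpy as np

def smoothed_delta(x, x0, eps):
    """Dirac delta at x0 approximated by a narrow Gaussian density,
    which removes the point-source singularity from the PDE residual."""
    return np.exp(-((x - x0) ** 2) / (2 * eps ** 2)) / (eps * np.sqrt(2 * np.pi))

# The approximation integrates to ~1 and concentrates at x0 as eps shrinks.
x = np.linspace(-1.0, 1.0, 20001)
dx = x[1] - x[0]
total = smoothed_delta(x, 0.0, 0.05).sum() * dx  # Riemann-sum integral
print(total)
```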

NeurIPS Conference 2022 Conference Paper

DivBO: Diversity-aware CASH for Ensemble Learning

  • Yu Shen
  • Yupeng Lu
  • Yang Li
  • Yaofeng Tu
  • Wentao Zhang
  • Bin Cui

The Combined Algorithm Selection and Hyperparameters optimization (CASH) problem is one of the fundamental problems in Automated Machine Learning (AutoML). Motivated by the success of ensemble learning, recent AutoML systems build post-hoc ensembles to output the final predictions instead of using the best single learner. However, while most CASH methods focus on searching for a single learner with the best performance, they neglect the diversity among base learners (i.e., they may suggest similar configurations to previously evaluated ones), which is also a crucial consideration when building an ensemble. To tackle this issue and further enhance the ensemble performance, we propose DivBO, a diversity-aware framework to inject explicit search of diversity into the CASH problems. In the framework, we propose to use a diversity surrogate to predict the pair-wise diversity of two unseen configurations. Furthermore, we introduce a temporary pool and a weighted acquisition function to guide the search of both performance and diversity based on Bayesian optimization. Empirical results on 15 public datasets show that DivBO achieves the best average ranks (1.82 and 1.73) on both validation and test errors among 10 compared methods, including post-hoc designs in recent AutoML systems and state-of-the-art baselines for ensemble learning on CASH problems.

IJCAI Conference 2022 Conference Paper

Efficient and Accurate Conversion of Spiking Neural Network with Burst Spikes

  • Yang Li
  • Yi Zeng

Spiking neural network (SNN), as a brain-inspired energy-efficient neural network, has attracted the interest of researchers, but the training of spiking neural networks remains an open problem. One effective way is to map the weights of a trained ANN to an SNN to achieve high reasoning ability. However, the converted spiking neural network often suffers from performance degradation and a considerable time delay. To speed up the inference process and obtain higher accuracy, we theoretically analyze the errors in the conversion process from three perspectives: the differences between IF and ReLU, the time dimension, and the pooling operation. We propose a neuron model for releasing burst spikes, a cheap but highly efficient method to handle residual information. In addition, Lateral Inhibition Pooling (LIPooling) is proposed to solve the inaccuracy problem caused by MaxPooling in the conversion process. Experimental results on CIFAR and ImageNet demonstrate that our algorithm is efficient and accurate. For example, our method can ensure nearly lossless conversion of SNN using only about 1/10 (less than 100) of the simulation time under 0.693× the energy consumption of the typical method. Our code is available at https://github.com/Brain-Inspired-Cognitive-Engine/Conversion_Burst.

AAAI Conference 2022 Conference Paper

Homography Decomposition Networks for Planar Object Tracking

  • Xinrui Zhan
  • Yueran Liu
  • Jianke Zhu
  • Yang Li

Planar object tracking plays an important role in AI applications, such as robotics, visual servoing, and visual SLAM. Although previous planar trackers work well in most scenarios, it is still a challenging task due to the rapid motion and large transformation between two consecutive frames. The essential reason behind this problem is that the condition number of such a non-linear system changes unstably when the searching range of the homography parameter space becomes larger. To this end, we propose a novel Homography Decomposition Networks (HDN) approach that drastically reduces and stabilizes the condition number by decomposing the homography transformation into two groups. Specifically, a similarity transformation estimator is designed to predict the first group robustly by a deep convolution equivariant network. By taking advantage of the scale and rotation estimation with high confidence, a residual transformation is estimated by a simple regression model. Furthermore, the proposed end-to-end network is trained in a semi-supervised fashion. Extensive experiments show that our proposed approach outperforms the state-of-the-art planar tracking methods by a large margin on the challenging POT, UCSB and POIC datasets. Codes and models are available at https://github.com/zhanxinrui/HDN.

NeurIPS Conference 2022 Conference Paper

Improving Generative Adversarial Networks via Adversarial Learning in Latent Space

  • Yang Li
  • Yichuan Mo
  • Liangliang Shi
  • Junchi Yan

For Generative Adversarial Networks which map a latent distribution to the target distribution, in this paper, we study how the sampling in latent space can affect the generation performance, especially for images. We observe that, as the neural generator is a continuous function, two close samples in latent space would be mapped into two nearby images, while their quality can differ greatly, as quality generally does not exhibit a continuous nature in pixel space. From such a continuous mapping function perspective, it is also possible that two distant latent samples can be mapped into two close images (if not exactly the same). In particular, if the latent samples are mapped in aggregation into a single mode, mode collapse occurs. Accordingly, we propose adding an implicit latent transform before the mapping function to improve latent $z$ from its initial distribution, e.g., Gaussian. This is achieved using well-developed adversarial sample mining techniques, e.g., the iterative fast gradient sign method (I-FGSM). We further propose new GAN training pipelines to obtain better generative mappings w.r.t. quality and diversity by introducing targeted latent transforms into the bi-level optimization of GAN. Experimental results on visual data show that our method can effectively achieve improvement in both quality and diversity.
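In spirit, the latent transform applies I-FGSM-style signed-gradient steps to $z$ before it enters the generator. Below is a toy NumPy sketch with a hand-written quadratic score standing in for a discriminator; the step size and iteration count are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def ifgsm_latent(z, grad_fn, step=0.05, iters=5):
    """I-FGSM-style transform of a latent code before generation.

    Takes a few signed-gradient ascent steps on a score (in the paper,
    e.g. a discriminator's output; here supplied as grad_fn) so the
    improved latent lands in a higher-quality region.
    """
    z = z.copy()
    for _ in range(iters):
        z = z + step * np.sign(grad_fn(z))
    return z

# Toy score s(z) = -||z - mu||^2 with gradient 2 * (mu - z): the
# transform should move the latent toward the high-score point mu.
mu = np.array([1.0, -1.0])
z0 = np.zeros(2)
z1 = ifgsm_latent(z0, lambda z: 2.0 * (mu - z), step=0.05, iters=5)
```

With a real discriminator the gradient would come from backpropagation through the generator-discriminator pair, but the signed-step update itself has this shape.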

NeurIPS Conference 2022 Conference Paper

Meta-Auto-Decoder for Solving Parametric Partial Differential Equations

  • Xiang Huang
  • Zhanhong Ye
  • Hongsheng Liu
  • Shi Ji
  • Zidong Wang
  • Kang Yang
  • Yang Li
  • Min Wang

Many important problems in science and engineering require solving the so-called parametric partial differential equations (PDEs), i.e., PDEs with different physical parameters, boundary conditions, shapes of computation domains, etc. Recently, building learning-based numerical solvers for parametric PDEs has become an emerging new field. One category of methods, such as the Deep Galerkin Method (DGM) and Physics-Informed Neural Networks (PINNs), aims to approximate the solution of the PDEs. They are typically unsupervised and mesh-free, but require going through the time-consuming network training process from scratch for each set of PDE parameters. Another category of methods, such as the Fourier Neural Operator (FNO) and Deep Operator Network (DeepONet), tries to approximate the solution mapping directly. While fast, requiring only one forward inference for each PDE parameter without retraining, they often require a large corpus of paired input-output observations drawn from numerical simulations, and most of them need a predefined mesh as well. In this paper, we propose Meta-Auto-Decoder (MAD), a mesh-free and unsupervised deep learning method that enables the pre-trained model to be quickly adapted to equation instances by implicitly encoding (possibly heterogeneous) PDE parameters as latent vectors. The proposed MAD method can be interpreted through manifold learning in infinite-dimensional spaces, granting it a geometric insight. Extensive numerical experiments show that the MAD method converges faster than other deep learning-based methods without losing accuracy.

NeurIPS Conference 2022 Conference Paper

Non-rigid Point Cloud Registration with Neural Deformation Pyramid

  • Yang Li
  • Tatsuya Harada

Non-rigid point cloud registration is a key component in many computer vision and computer graphics applications. The high complexity of the unknown non-rigid motion makes this task a challenging problem. In this paper, we break down this problem via hierarchical motion decomposition. Our method, called Neural Deformation Pyramid (NDP), represents non-rigid motion using a pyramid architecture. Each pyramid level, denoted by a Multi-Layer Perceptron (MLP), takes as input a sinusoidally encoded 3D point and outputs its motion increments from the previous level. The sinusoidal function starts with a low input frequency and gradually increases as the pyramid level goes down. This allows a multi-level rigid-to-nonrigid motion decomposition and also speeds up the solving by 50× compared to the existing MLP-based approach. Our method achieves advanced partial-to-partial non-rigid point cloud registration results on the 4DMatch/4DLoMatch benchmarks under both non-learned and supervised settings.
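The level-dependent sinusoidal encoding can be sketched as follows. The exact frequency schedule and encoding width used in NDP may differ; treat `base_freq` and `growth` here as illustrative assumptions.

```python
import numpy as np

def sinusoidal_encoding(p, level, base_freq=1.0, growth=2.0):
    """Encode a 3D point with a frequency tied to the pyramid level.

    Shallow levels see low input frequencies and so capture coarse,
    near-rigid motion; deeper levels see higher frequencies and can fit
    finer deformation. base_freq and growth are illustrative choices.
    """
    freq = base_freq * growth ** level
    return np.concatenate([np.sin(freq * p), np.cos(freq * p)])

p = np.array([0.1, 0.2, 0.3])
coarse = sinusoidal_encoding(p, level=0)   # fed to the top pyramid level
fine = sinusoidal_encoding(p, level=4)     # 16x the input frequency
```

Each level's MLP would consume such an encoding and emit a motion increment on top of the previous level's estimate.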

NeurIPS Conference 2022 Conference Paper

The Policy-gradient Placement and Generative Routing Neural Networks for Chip Design

  • Ruoyu Cheng
  • Xianglong Lyu
  • Yang Li
  • Junjie Ye
  • Jianye Hao
  • Junchi Yan

Placement and routing are two critical yet time-consuming steps of chip design in modern VLSI systems. Distinct from traditional heuristic solvers, this paper on the one hand proposes an RL-based model for mixed-size macro placement, which differs from existing learning-based placers that often consider the macros via coarse grid-based masks, while the standard cells are placed via gradient-based GPU acceleration. On the other hand, a one-shot conditional generative routing model, composed of a specially designed input-size-adapting generator and a bi-discriminator, is devised to perform one-shot routing to the pins within each net, and the order of nets to route is adaptively learned. Combining these techniques, we develop a flexible and efficient neural pipeline, which, to the best of our knowledge, is the first joint placement and routing network that does not involve any traditional heuristic solver. Experimental results on chip design benchmarks showcase the effectiveness of our approach; the code will be made publicly available.

JBHI Journal 2021 Journal Article

Deep Learning-Based End-to-End Diagnosis System for Avascular Necrosis of Femoral Head

  • Yang Li
  • Yan Li
  • Hua Tian

As the first diagnostic imaging modality of avascular necrosis of the femoral head (AVNFH), accurately staging AVNFH from a plain radiograph is critical yet challenging for orthopedists. Thus, we propose a deep learning-based AVNFH diagnosis system (AVN-net). The proposed AVN-net reads plain radiographs of the pelvis, conducts diagnosis, and visualizes results automatically. Deep convolutional neural networks are trained to provide an end-to-end diagnosis solution, covering tasks of femoral head detection, exam-view identification, side classification, AVNFH diagnosis, and key clinical notes generation. AVN-net is able to obtain a state-of-the-art testing AUC of 0.97 (95% CI: 0.97-0.98) in AVNFH detection and significantly greater F1 scores than less-to-moderately experienced orthopedists in all diagnostic tests (p < 0.01). Furthermore, two real-world pilot studies were conducted for diagnosis support and education assistance, respectively, to assess the utility of AVN-net. The experimental results are promising. With the AVN-net diagnosis as a reference, the diagnostic accuracy and consistency of all orthopedists considerably improved while requiring only 1/4 of the time. Students self-studying the AVNFH diagnosis using AVN-net can learn better and faster than the control group. To the best of our knowledge, this study is the first research on the prospective use of a deep learning-based diagnosis system for AVNFH by conducting two pilot studies representing real-world application scenarios. We have demonstrated that the proposed AVN-net achieves expert-level AVNFH diagnosis performance, provides efficient support in clinical decision-making, and effectively passes clinical experience to students.

IJCAI Conference 2021 Conference Paper

Discovering Collaborative Signals for Next POI Recommendation with Iterative Seq2Graph Augmentation

  • Yang Li
  • Tong Chen
  • Yadan Luo
  • Hongzhi Yin
  • Zi Huang

Being an indispensable component in location-based social networks, next point-of-interest (POI) recommendation suggests unexplored POIs to users based on their recent visiting histories. However, existing work mainly models check-in data as isolated POI sequences, neglecting the crucial collaborative signals from cross-sequence check-in information. Furthermore, the sparse POI-POI transitions restrict the ability of a model to learn effective sequential patterns for recommendation. In this paper, we propose Sequence-to-Graph (Seq2Graph) augmentation for each POI sequence, allowing collaborative signals to be propagated from correlated POIs belonging to other sequences. We then devise a novel Sequence-to-Graph POI Recommender (SGRec), which jointly learns POI embeddings and infers a user's temporal preferences from the graph-augmented POI sequence. To overcome the sparsity of POI-level interactions, we further infuse category-awareness into SGRec with a multi-task learning scheme that captures the denser category-wise transitions. As such, SGRec makes full use of the collaborative signals for learning expressive POI representations, and also comprehensively uncovers multi-level sequential patterns for user preference modelling. Extensive experiments on two real-world datasets demonstrate the superiority of SGRec against state-of-the-art methods in next POI recommendation.

JAIR Journal 2021 Journal Article

Hybrid-order Network Consensus for Distributed Multi-agent Systems

  • Guangqiang Xie
  • Junyu Chen
  • Yang Li

As an important field of distributed artificial intelligence (DAI), multi-agent systems (MASs) have attracted the attention of many researchers. Although much progress has been made in studying consensus control, the most important issue in MAS, some problems remain largely unaddressed, causing the MAS to lose useful network structure information. First, multi-agent consensus protocols usually operate on the low-order structure by only considering the direct edges between agents, ignoring the higher-order structure of the whole topology network. Second, existing work assumes all the edges in a topology network have the same weight without exploring the potential diversity of the connections. In this way, multi-agent systems fail to reach consensus, resulting in fragmentation into multiple clusters. To address these issues, this paper proposes a Motif-aware Weighted Multi-agent System (MWMS) method for consensus control. We focus on the triangle motif in the network, but the method can be extended to other kinds of motifs as well. First, a novel weighted network is used that combines the edge-based lower-order structure and the motif-based higher-order structure, i.e., a hybrid-order structure. Subsequently, by simultaneously considering the quantity and the quality of the connections in the network, a novel consensus framework for MAS is designed to update agents. Then, two baseline consensus algorithms are used in MWMS. In our experiments, we use ten topologies of different shapes, densities and ranges to comprehensively analyze the performance of our proposed algorithms. The simulation results show that the hybrid higher-order network can effectively enhance the consensus of the multi-agent system in different network topologies.

NeurIPS Conference 2021 Conference Paper

Iterative Connecting Probability Estimation for Networks

  • Yichen Qin
  • Linhan Yu
  • Yang Li

Estimating the probabilities of connections between vertices in a random network using an observed adjacency matrix is an important task for network data analysis. Many existing estimation methods are based on certain assumptions on network structure, which limit their applicability in practice. Without making strong assumptions, we develop an iterative connecting probability estimation method based on neighborhood averaging. Starting at a random initial point or an existing estimate, our method iteratively updates the pairwise vertex distances, the sets of similar vertices, and connecting probabilities to improve the precision of the estimate. We propose a two-stage neighborhood selection procedure to achieve the trade-off between smoothness of the estimate and the ability to discover local structure. The tuning parameters can be selected by cross-validation. We establish desirable theoretical properties for our method, and further justify its superior performance by comparing with existing methods in simulation and real data analysis.

NeurIPS Conference 2021 Conference Paper

Learnable Fourier Features for Multi-dimensional Spatial Positional Encoding

  • Yang Li
  • Si Si
  • Gang Li
  • Cho-Jui Hsieh
  • Samy Bengio

Attentional mechanisms are order-invariant. Positional encoding is a crucial component to allow attention-based deep model architectures such as Transformer to address sequences or images where the position of information matters. In this paper, we propose a novel positional encoding method based on learnable Fourier features. Instead of hard-coding each position as a token or a vector, we represent each position, which can be multi-dimensional, as a trainable encoding based on learnable Fourier feature mapping, modulated with a multi-layer perceptron. The representation is particularly advantageous for a spatial multi-dimensional position, e.g., pixel positions on an image, where $L_2$ distances or more complex positional relationships need to be captured. Our experiments based on several public benchmark tasks show that our learnable Fourier feature representation for multi-dimensional positional encoding outperforms existing methods by both improving the accuracy and allowing faster convergence.
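The core idea, a trainable linear projection of the position followed by sine and cosine, can be sketched as below. Here the projection matrix is randomly initialized as a stand-in for the learned parameters, and the MLP modulation described in the abstract is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def fourier_features(pos, B):
    """Map M-dimensional positions to Fourier features via projection B.

    B plays the role of the trainable (M x D/2) projection; here it is
    randomly initialized as a stand-in for the learned parameters. In
    the paper the result is further modulated by an MLP (omitted here).
    """
    proj = pos @ B                                     # (..., D/2)
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

# Encode 2-D pixel positions into 16-dimensional vectors.
B = rng.normal(scale=1.0, size=(2, 8))
positions = np.array([[3.0, 7.0], [3.0, 8.0], [40.0, 7.0]])
enc = fourier_features(positions, B)
```

Because each (cos, sin) pair has unit norm, inner products between encodings depend only on position differences, which is what lets the representation capture $L_2$-style spatial relationships.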

NeurIPS Conference 2021 Conference Paper

Learning to Adapt via Latent Domains for Adaptive Semantic Segmentation

  • Yunan Liu
  • Shanshan Zhang
  • Yang Li
  • Jian Yang

Domain adaptive semantic segmentation aims to transfer knowledge learned from labeled source domain to unlabeled target domain. To narrow down the domain gap and ease adaptation difficulty, some recent methods translate source images to target-like images (latent domains), which are used as supplement or substitute to the original source data. Nevertheless, these methods neglect to explicitly model the relationship of knowledge transferring across different domains. Alternatively, in this work we break through the standard “source-target” one-pair adaptation framework and construct multiple adaptation pairs (e.g., “source-latent” and “latent-target”). The purpose is to use the meta-knowledge (how to adapt) learned from one pair as guidance to assist the adaptation of another pair under a meta-learning framework. Furthermore, we extend our method to a more practical setting of open compound domain adaptation (a.k.a. multiple-target domain adaptation), where the target is a compound of multiple domains without domain labels. In this setting, we embed an additional pair of “latent-latent” to reduce the domain gap between the source and different latent domains, allowing the model to adapt well on multiple target domains simultaneously. When evaluated on standard benchmarks, our method is superior to the state-of-the-art methods in both the single target and multiple-target domain adaptation settings.

AAAI Conference 2021 Conference Paper

MFES-HB: Efficient Hyperband with Multi-Fidelity Quality Measurements

  • Yang Li
  • Yu Shen
  • Jiawei Jiang
  • Jinyang Gao
  • Ce Zhang
  • Bin Cui

Hyperparameter optimization (HPO) is a fundamental problem in automatic machine learning (AutoML). However, due to the expensive evaluation cost of models (e.g., training deep learning models or training models on large datasets), vanilla Bayesian optimization (BO) is typically computationally infeasible. To alleviate this issue, Hyperband (HB) utilizes the early stopping mechanism to speed up configuration evaluations by terminating badly-performing configurations in advance. This leads to two kinds of quality measurements: (1) many low-fidelity measurements for configurations that get early-stopped, and (2) few high-fidelity measurements for configurations that are evaluated without being early-stopped. The state-of-the-art HB-style method, BOHB, aims to combine the benefits of both BO and HB. Instead of sampling configurations randomly as in HB, BOHB samples configurations based on a BO surrogate model, which is constructed with the high-fidelity measurements only. However, the scarcity of high-fidelity measurements greatly hampers the efficiency of BO in guiding the configuration search. In this paper, we present MFES-HB, an efficient Hyperband method that is capable of utilizing both the high-fidelity and low-fidelity measurements to accelerate the convergence of HPO tasks. Designing MFES-HB is not trivial, as the low-fidelity measurements can be biased yet informative for guiding the configuration search. Thus we propose to build a Multi-Fidelity Ensemble Surrogate (MFES) based on the generalized Product of Experts framework, which can integrate useful information from multi-fidelity measurements effectively. The empirical studies on real-world AutoML tasks demonstrate that MFES-HB can achieve 3.3-8.9× speedups over the state-of-the-art approach, BOHB.
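The generalized Product of Experts framework mentioned in the abstract fuses the Gaussian predictions of per-fidelity surrogates by precision weighting. A minimal sketch of that fusion rule follows; the uniform weights here are a simple illustrative choice, not the paper's weighting scheme.

```python
import numpy as np

def gpoe_combine(mus, sigmas2, weights):
    """Fuse Gaussian predictions with a generalized Product of Experts.

    Each fidelity's surrogate predicts a mean mu_i and variance
    sigma_i^2 at a configuration; weights w_i scale how much each
    expert's precision counts. Returns the fused mean and variance.
    """
    mus, sigmas2, weights = map(np.asarray, (mus, sigmas2, weights))
    precision = np.sum(weights / sigmas2)
    mean = np.sum(weights * mus / sigmas2) / precision
    return mean, 1.0 / precision

# The confident (low-variance) high-fidelity expert dominates the fusion.
mean, var = gpoe_combine(mus=[0.2, 0.8], sigmas2=[1.0, 0.1], weights=[0.5, 0.5])
```

Precision weighting is what lets biased but plentiful low-fidelity measurements contribute without drowning out the scarce high-fidelity ones.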

NeurIPS Conference 2021 Conference Paper

Node Dependent Local Smoothing for Scalable Graph Learning

  • Wentao Zhang
  • Mingyu Yang
  • Zeang Sheng
  • Yang Li
  • Wen Ouyang
  • Yangyu Tao
  • Zhi Yang
  • Bin Cui

Recent works reveal that feature or label smoothing lies at the core of Graph Neural Networks (GNNs). Concretely, they show feature smoothing combined with simple linear regression achieves comparable performance with the carefully designed GNNs, and a simple MLP model with label smoothing of its prediction can outperform the vanilla GCN. Though an interesting finding, smoothing has not been well understood, especially regarding how to control the extent of smoothness. Intuitively, too few or too many smoothing iterations may cause under-smoothing or over-smoothing and can lead to sub-optimal performance. Moreover, the extent of smoothness is node-specific, depending on a node's degree and local structure. To this end, we propose a novel algorithm called node-dependent local smoothing (NDLS), which aims to control the smoothness of every node by setting a node-specific smoothing iteration. Specifically, NDLS computes influence scores based on the adjacency matrix and selects the iteration number by setting a threshold on the scores. Once selected, the iteration number can be applied to both feature smoothing and label smoothing. Experimental results demonstrate that NDLS enjoys high accuracy (state-of-the-art performance on node classification tasks), flexibility (it can be incorporated with any model), and scalability and efficiency (it can support large-scale graphs with fast training).
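A simplified reading of the selection rule: propagate with a normalized adjacency and stop each node once its influence row is close to the propagation's stationary limit. The sketch below uses a random-walk normalization and an L2 threshold as illustrative assumptions; the paper's exact influence score and threshold rule may differ.

```python
import numpy as np

def ndls_iterations(adj, eps=0.05, max_k=50):
    """Pick a node-specific smoothing depth (a simplified sketch).

    Propagates with the row-normalized adjacency (plus self-loops) and
    stops each node once its influence row is within eps (in L2) of the
    random walk's stationary distribution, so well-connected nodes get
    fewer smoothing iterations than peripheral ones.
    """
    n = adj.shape[0]
    a_hat = adj + np.eye(n)               # add self-loops
    deg = a_hat.sum(axis=1)
    a_norm = a_hat / deg[:, None]         # random-walk normalization
    stationary = deg / deg.sum()          # limit of the propagation
    iters = np.full(n, max_k)
    power = np.eye(n)
    for k in range(1, max_k + 1):
        power = power @ a_norm
        dist = np.linalg.norm(power - stationary[None, :], axis=1)
        done = (dist <= eps) & (iters == max_k)
        iters[done] = k
        if (iters < max_k).all():
            break
    return iters

# Toy input: a path graph on 4 nodes.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
k_per_node = ndls_iterations(adj)
```

The per-node depth returned here would then cap how many rounds of feature (or label) smoothing each node receives.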

AAAI Conference 2021 Conference Paper

Savable but Lost Lives when ICU Is Overloaded: a Model from 733 Patients in Epicenter Wuhan, China

  • Tingting Dan
  • Yang Li
  • Ziwei Zhu
  • Xijie Chen
  • Wuxiu Quan
  • Yu Hu
  • Guihua Tao
  • Lei Zhu

Coronavirus Disease 2019 (COVID-19) can deteriorate suddenly at certain checkpoints and thus requires intervention from the intensive care unit (ICU). The resulting urgent, large-scale demand for ICUs posed great risks to the medical system. Estimating the mortality of critical in-patients who were not admitted into the ICU is valuable for optimizing the management and assignment of ICU resources. We retrospectively studied 733 in-patients diagnosed with COVID-19 at a local hospital (Wuhan, China) as of March 18, 2020. Demographic, clinical and laboratory results were collected and analyzed using machine learning to build a predictive model. Considering the shortage of ICU beds at the beginning of the disease's emergence, we defined the mortality of those patients who were predicted to need ICU care yet did not receive it as the Missing-ICU (MI) mortality. To estimate MI-mortality, a prognostic classification model was built to identify the in-patients who may need ICU care. Its predictive accuracy was 0.8288, with an AUC of 0.9119. In our cohort of 733 patients, 25 in-patients were predicted by our model to need ICU care yet did not enter the ICU due to the shortage of ICU wards. Our analysis showed that the MI-mortality is 41%, while the mortality of ICU patients is 32%, implying that sufficient ICU beds could save more patients in critical condition.

AAAI Conference 2020 Conference Paper

A Forest from the Trees: Generation through Neighborhoods

  • Yang Li
  • Tianxiang Gao
  • Junier Oliva

In this work, we propose to learn a generative model using both learned features (through a latent space) and memories (through neighbors). Although human learning makes seamless use of both learned perceptual features and instance recall, current generative learning paradigms only make use of one of these two components. Take, for instance, flow models, which learn a latent space that follows a simple distribution. Conversely, kernel density techniques use instances to shift a simple distribution into an aggregate mixture model. Here we propose multiple methods to enhance the latent space of a flow model with neighborhood information. Not only does our proposed framework represent a more human-like approach by leveraging both learned features and memories, but it may also be viewed as a step forward in non-parametric methods. In addition, our proposed framework allows the user to easily control the properties of generated samples by targeting samples based on neighbors. The efficacy of our model is shown empirically with standard image datasets. We observe compelling results and a significant improvement over baselines. Combined further with a contrastive training mechanism, our proposed methods can effectively perform non-parametric novelty detection.

AAAI Conference 2020 Conference Paper

Efficient Automatic CASH via Rising Bandits

  • Yang Li
  • Jiawei Jiang
  • Jinyang Gao
  • Yingxia Shao
  • Ce Zhang
  • Bin Cui

The Combined Algorithm Selection and Hyperparameter optimization (CASH) is one of the most fundamental problems in Automatic Machine Learning (AutoML). The existing Bayesian optimization (BO) based solutions turn the CASH problem into a Hyperparameter Optimization (HPO) problem by combining the hyperparameters of all machine learning (ML) algorithms, and use BO methods to solve it. As a result, these methods suffer from the low-efficiency problem due to the huge hyperparameter space in CASH. To alleviate this issue, we propose the alternating optimization framework, where the HPO problem for each ML algorithm and the algorithm selection problem are optimized alternately. In this framework, the BO methods are used to solve the HPO problem for each ML algorithm separately, incorporating a much smaller hyperparameter space for BO methods. Furthermore, we introduce Rising Bandits, a CASH-oriented Multi-Armed Bandits (MAB) variant, to model the algorithm selection in CASH. This framework can take advantage of both BO in solving the HPO problem with a relatively small hyperparameter space and the MABs in accelerating the algorithm selection. Moreover, we further develop an efficient online algorithm to solve the Rising Bandits with provable theoretical guarantees. The extensive experiments on 30 OpenML datasets demonstrate the superiority of the proposed approach over the competitive baselines.

AAAI Conference 2020 Conference Paper

Exchangeable Generative Models with Flow Scans

  • Christopher Bender
  • Kevin O'Connor
  • Yang Li
  • Juan Garcia
  • Junier Oliva
  • Manzil Zaheer

In this work, we develop a new approach to generative density estimation for exchangeable, non-i.i.d. data. The proposed framework, FlowScan, combines invertible flow transformations with a sorted scan to flexibly model the data while preserving exchangeability. Unlike most existing methods, FlowScan exploits the intradependencies within sets to learn both global and local structure. FlowScan represents the first approach that is able to apply sequential methods to exchangeable density estimation without resorting to averaging over all possible permutations. We achieve new state-of-the-art performance on point cloud and image set modeling.

NeurIPS Conference 2020 Conference Paper

Exchangeable Neural ODE for Set Modeling

  • Yang Li
  • Haidong Yi
  • Christopher Bender
  • Siyuan Shan
  • Junier B. Oliva

Reasoning over an instance composed of a set of vectors, like a point cloud, requires that one accounts for intra-set dependent features among elements. However, since such instances are unordered, the elements' features should remain unchanged when the input's order is permuted. This property, permutation equivariance, is a challenging constraint for most neural architectures. While recent work has proposed global pooling and attention-based solutions, these may be limited in the way that intradependencies are captured in practice. In this work we propose a more general formulation to achieve permutation equivariance through ordinary differential equations (ODE). Our proposed module, Exchangeable Neural ODE (ExNODE), can be seamlessly applied for both discriminative and generative tasks. We also extend set modeling in the temporal dimension and propose a VAE based model for temporal set modeling. Extensive experiments demonstrate the efficacy of our method over strong baselines.

AAAI Conference 2020 Conference Paper

Geometry-Driven Self-Supervised Method for 3D Human Pose Estimation

  • Yang Li
  • Kan Li
  • Shuai Jiang
  • Ziyue Zhang
  • Congzhentao Huang
  • Richard Yi Da Xu

The neural network based approach for 3D human pose estimation from monocular images has attracted growing interest. However, annotating 3D poses is a labor-intensive and expensive process. In this paper, we propose a novel self-supervised approach to avoid the need of manual annotations. Different from existing weakly/self-supervised methods that require extra unpaired 3D ground-truth data to alleviate the depth ambiguity problem, our method trains the network only relying on geometric knowledge without any additional 3D pose annotations. The proposed method follows the two-stage pipeline: 2D pose estimation and 2D-to-3D pose lifting. We design the transform re-projection loss that is an effective way to explore multi-view consistency for training the 2D-to-3D lifting network. Besides, we adopt the confidences of 2D joints to integrate losses from different views to alleviate the influence of noises caused by the self-occlusion problem. Finally, we design a two-branch training architecture, which helps to preserve the scale information of re-projected 2D poses during training, resulting in accurate 3D pose predictions. We demonstrate the effectiveness of our method on two popular 3D human pose datasets, Human3.6M and MPI-INF-3DHP. The results show that our method significantly outperforms recent weakly/self-supervised approaches.

NeurIPS Conference 2020 Conference Paper

Meta-Neighborhoods

  • Siyuan Shan
  • Yang Li
  • Junier B. Oliva

Making an adaptive prediction based on input is an important ability for general artificial intelligence. In this work, we step forward in this direction and propose a semi-parametric method, Meta-Neighborhoods, where predictions are made adaptively to the neighborhood of the input. We show that Meta-Neighborhoods is a generalization of k-nearest-neighbors. Due to the simpler manifold structure around a local neighborhood, Meta-Neighborhoods represent the predictive distribution p(y | x) more accurately. To reduce memory and computation overheads, we propose induced neighborhoods that summarize the training data into a much smaller dictionary. A meta-learning based training mechanism is then exploited to jointly learn the induced neighborhoods and the model. Extensive studies demonstrate the superiority of our method.

AAAI Conference 2020 Conference Paper

Multi-Point Semantic Representation for Intent Classification

  • Jinghan Zhang
  • Yuxiao Ye
  • Yue Zhang
  • Likun Qiu
  • Bin Fu
  • Yang Li
  • Zhenglu Yang
  • Jian Sun

Detecting user intents from utterances is the basis of the natural language understanding (NLU) task. To understand the meaning of utterances, some work focuses on fully representing utterances via semantic parsing, in which annotation is labor-intensive. While some researchers simply view this as intent classification or frequently asked questions (FAQs) retrieval, they do not leverage the shared utterances among different intents. We propose a simple and novel multi-point semantic representation framework with relatively low annotation cost to leverage the fine-grained factor information, decomposing queries into four factors, i.e., topic, predicate, object/condition, and query type. Besides, we propose a compositional intent bi-attention model under multi-task learning with three kinds of attention mechanisms among queries, labels and factors, which jointly combines coarse-grained intent and fine-grained factor information. Extensive experiments show that our framework and model significantly outperform several state-of-the-art approaches with an improvement of 1.35%-2.47% in terms of accuracy.

NeurIPS Conference 2020 Conference Paper

Multi-Stage Influence Function

  • Hongge Chen
  • Si Si
  • Yang Li
  • Ciprian Chelba
  • Sanjiv Kumar
  • Duane Boning
  • Cho-Jui Hsieh

Multi-stage training and knowledge transfer, from a large-scale pretraining task to various finetuning tasks, have revolutionized natural language processing and computer vision resulting in state-of-the-art performance improvements. In this paper, we develop a multi-stage influence function score to track predictions from a finetuned model all the way back to the pretraining data. With this score, we can identify the pretraining examples in the pretraining task that contribute most to a prediction in the finetuning task. The proposed multi-stage influence function generalizes the original influence function for a single model in (Koh & Liang, 2017), thereby enabling influence computation through both pretrained and finetuned models. We study two different scenarios with the pretrained embedding fixed or updated in the finetuning tasks. We test our proposed method in various experiments to show its effectiveness and potential applications.

AAAI Conference 2020 Conference Paper

Self-Attention Enhanced Selective Gate with Entity-Aware Embedding for Distantly Supervised Relation Extraction

  • Yang Li
  • Guodong Long
  • Tao Shen
  • Tianyi Zhou
  • Lina Yao
  • Huan Huo
  • Jing Jiang

Distantly supervised relation extraction intrinsically suffers from noisy labels due to the strong assumption of distant supervision. Most prior works adopt a selective attention mechanism over sentences in a bag to denoise from wrongly labeled data, which however could be incompetent when there is only one sentence in a bag. In this paper, we propose a brand-new light-weight neural framework to address the distantly supervised relation extraction problem and alleviate the defects in previous selective attention framework. Specifically, in the proposed framework, 1) we use an entity-aware word embedding method to integrate both relative position information and head/tail entity embeddings, aiming to highlight the essence of entities for this task; 2) we develop a self-attention mechanism to capture the rich contextual dependencies as a complement for local dependencies captured by piecewise CNN; and 3) instead of using selective attention, we design a pooling-equipped gate, which is based on rich contextual representations, as an aggregator to generate bag-level representation for final relation classification. Compared to selective attention, one major advantage of the proposed gating mechanism is that it performs stably and promisingly even if only one sentence appears in a bag and thus keeps the consistency across all training examples. The experiments on the NYT dataset demonstrate that our approach achieves a new state-of-the-art performance in terms of both AUC and top-n precision metrics.

NeurIPS Conference 2019 Conference Paper

A Unified Framework for Data Poisoning Attack to Graph-based Semi-supervised Learning

  • Xuanqing Liu
  • Si Si
  • Jerry Zhu
  • Yang Li
  • Cho-Jui Hsieh

In this paper, we propose a general framework for data poisoning attacks to graph-based semi-supervised learning (G-SSL). In this framework, we first unify different tasks, goals and constraints into a single formula for data poisoning attack in G-SSL, then we propose two specialized algorithms to efficiently solve two important cases --- poisoning regression tasks under $\ell_2$-norm constraint and classification tasks under $\ell_0$-norm constraint. In the former case, we transform it into a non-convex trust region problem and show that our gradient-based algorithm with delicate initialization and update scheme finds the (globally) optimal perturbation. For the latter case, although it is an NP-hard integer programming problem, we propose a probabilistic solver that works much better than the classical greedy method. Lastly, we test our framework on real datasets and evaluate the robustness of G-SSL algorithms. For instance, on the MNIST binary classification problem (50000 training data with 50 labeled), flipping two labeled examples is enough to make the model perform like random guessing (around 50% error).

AAAI Conference 2019 Conference Paper

Robust Estimation of Similarity Transformation for Visual Object Tracking

  • Yang Li
  • Jianke Zhu
  • Steven C.H. Hoi
  • Wenjie Song
  • Zhefeng Wang
  • Hantang Liu

Most existing correlation filter-based tracking approaches only estimate simple axis-aligned bounding boxes, and very few of them are capable of recovering the underlying similarity transformation. To tackle this challenging problem, in this paper, we propose a new correlation filter-based tracker with a novel robust estimation of similarity transformation on the large displacements. In order to efficiently search in such a large 4-DoF space in real-time, we formulate the problem into two 2-DoF sub-problems and apply an efficient Block Coordinate Descent solver to optimize the estimation result. Specifically, we employ an efficient phase correlation scheme to deal with both scale and rotation changes simultaneously in log-polar coordinates. Moreover, a variant of correlation filter is used to predict the translational motion individually. Our experimental results demonstrate that the proposed tracker achieves very promising prediction performance compared with the state-of-the-art visual object tracking methods while still retaining the advantages of high efficiency and simplicity in conventional correlation filter-based tracking methods.
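The phase-correlation primitive at the heart of this scheme can be sketched in a few lines: the normalized cross-power spectrum of two images has an inverse FFT that peaks at their relative shift. In log-polar coordinates (not implemented here), such shifts correspond to scale and rotation; the plain translational version below is an illustrative sketch, not the paper's tracker.

```python
import numpy as np

def phase_correlation(a, b):
    """Estimate the circular shift that maps image b onto image a via the
    normalized cross-power spectrum. Applied in log-polar coordinates,
    the same primitive recovers scale and rotation."""
    Fa, Fb = np.fft.fft2(a), np.fft.fft2(b)
    cross = Fa * np.conj(Fb)
    cross /= np.abs(cross) + 1e-12      # keep only the phase difference
    response = np.fft.ifft2(cross).real
    dy, dx = np.unravel_index(np.argmax(response), response.shape)
    return dy, dx

rng = np.random.default_rng(1)
img = rng.random((64, 64))
shifted = np.roll(img, shift=(5, 9), axis=(0, 1))
print(phase_correlation(shifted, img))  # → (5, 9)
```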

NeurIPS Conference 2019 Conference Paper

Robustness Verification of Tree-based Models

  • Hongge Chen
  • Huan Zhang
  • Si Si
  • Yang Li
  • Duane Boning
  • Cho-Jui Hsieh

We study the robustness verification problem of tree-based models, including random forest (RF) and gradient boosted decision tree (GBDT). Formal robustness verification of decision tree ensembles involves finding the exact minimal adversarial perturbation or a guaranteed lower bound of it. Existing approaches cast this verification problem into a mixed integer linear programming (MILP) problem, which finds the minimal adversarial distortion in exponential time and is thus impractical for large ensembles. Although this verification problem is NP-complete in general, we give a more precise complexity characterization. We show that there is a simple linear time algorithm for verifying a single tree, and for tree ensembles the verification problem can be cast as a max-clique problem on a multi-partite boxicity graph. For low dimensional problems when boxicity can be viewed as constant, this reformulation leads to a polynomial time algorithm. For general problems, by exploiting the boxicity of the graph, we devise an efficient verification algorithm that can give tight lower bounds on robustness of decision tree ensembles, and allows iterative improvement and any-time termination. On RF/GBDT models trained on a variety of datasets, we significantly outperform the lower bounds obtained by relaxing the MILP formulation into a linear program (LP), and are hundreds of times faster than solving MILPs to get the exact minimal adversarial distortion. Our proposed method is capable of giving tight robustness verification bounds on large GBDTs with hundreds of deep trees.
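The single-tree linear-time case admits a compact sketch: each leaf of a decision tree is an axis-aligned box, so the exact minimal l-inf perturbation is the minimum, over leaves with a different label, of the distance from the input to the leaf's box. The leaf representation below (closed boxes as interval lists) is an illustrative assumption; the paper's algorithm and graph construction are more general.

```python
import math

def linf_distance_to_box(x, box):
    """l_inf distance from point x to an axis-aligned box given as
    [(lo_0, hi_0), ...]; 0 if x lies inside the box."""
    return max(max(lo - xi, xi - hi, 0.0) for xi, (lo, hi) in zip(x, box))

def verify_single_tree(x, leaves, y):
    """Exact minimal l_inf adversarial perturbation for one decision tree.
    `leaves` is a list of (box, label); since each leaf is a box, the
    minimum over differently-labeled leaves is found in one linear pass."""
    return min(
        (linf_distance_to_box(x, box) for box, label in leaves if label != y),
        default=math.inf,
    )

# Tiny stump on one feature: x0 <= 2 -> class 0, x0 > 2 -> class 1.
leaves = [([(-math.inf, 2.0)], 0), ([(2.0, math.inf)], 1)]
print(verify_single_tree([0.5], leaves, y=0))  # distance to the x0 > 2 leaf: 1.5
```

For an ensemble, the prediction depends on a sum over trees, which is why the problem becomes a max-clique search rather than a single pass.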

AAAI Conference 2019 Conference Paper

SADIH: Semantic-Aware DIscrete Hashing

  • Zheng Zhang
  • Guo-Sen Xie
  • Yang Li
  • Sheng Li
  • Zi Huang

Due to its low storage cost and fast query speed, hashing has been recognized to accomplish similarity search in large-scale multimedia retrieval applications. Particularly, supervised hashing has recently received considerable research attention by leveraging the label information to preserve the pairwise similarities of data points in the Hamming space. However, there still remain two crucial bottlenecks: 1) the learning process of the full pairwise similarity preservation is computationally unaffordable and unscalable to deal with big data; 2) the available category information of the data is not well explored for learning discriminative hash functions. To overcome these challenges, we propose a unified Semantic-Aware DIscrete Hashing (SADIH) framework, which aims to directly embed the transformed semantic information into the asymmetric similarity approximation and discriminative hashing function learning. Specifically, a semantic-aware latent embedding is introduced to asymmetrically preserve the full pairwise similarities while skillfully handling the cumbersome n × n pairwise similarity matrix. Meanwhile, a semantic-aware autoencoder is developed to jointly preserve the data structures in the discriminative latent semantic space and perform data reconstruction. Moreover, an efficient alternating optimization algorithm is proposed to solve the resulting discrete optimization problem. Extensive experimental results on multiple large-scale datasets demonstrate that our SADIH can clearly outperform the state-of-the-art baselines with the additional benefit of lower computational costs.

IJCAI Conference 2018 Conference Paper

A Novel Neural Network Model based on Cerebral Hemispheric Asymmetry for EEG Emotion Recognition

  • Yang Li
  • Wenming Zheng
  • Zhen Cui
  • Tong Zhang
  • Yuan Zong

In this paper, we propose a novel neural network model, called bi-hemispheres domain adversarial neural network (BiDANN), for EEG emotion recognition. BiDANN is motivated by the neuroscience findings, i.e., the emotional brain's asymmetries between left and right hemispheres. The basic idea of BiDANN is to map the EEG feature data of both left and right hemispheres into discriminative feature spaces separately, in which the data representations can be classified easily. To predict the class labels of testing data more precisely, we narrow the distribution shift between training and testing data by using a global and two local domain discriminators, which work adversarially to the classifier to encourage domain-invariant data representations to emerge. After that, the learned classifier from labeled training data can be applied to unlabeled testing data naturally. We conduct two experiments to verify the performance of our BiDANN model on the SEED database. The experimental results show that the proposed model achieves the state-of-the-art performance.

IJCAI Conference 2018 Conference Paper

Deeply-Supervised CNN Model for Action Recognition with Trainable Feature Aggregation

  • Yang Li
  • Kan Li
  • Xinxin Wang

In this paper, we propose a deeply-supervised CNN model for action recognition that fully exploits powerful hierarchical features of CNNs. In this model, we build multi-level video representations by applying our proposed aggregation module at different convolutional layers. Moreover, we train this model in a deep supervision manner, which brings improvement in both performance and efficiency. Meanwhile, in order to capture the temporal structure as well as preserve more details about actions, we propose a trainable aggregation module. It models the temporal evolution of each spatial location and projects them into a semantic space using the Vector of Locally Aggregated Descriptors (VLAD) technique. This deeply-supervised CNN model integrating the powerful aggregation module provides a promising solution to recognize actions in videos. We conduct experiments on two action recognition datasets: HMDB51 and UCF101. Results show that our model outperforms the state-of-the-art methods.

JBHI Journal 2018 Journal Article

Epileptic Seizure Classification of EEGs Using Time–Frequency Analysis Based Multiscale Radial Basis Functions

  • Yang Li
  • Xu-Dong Wang
  • Mei-Lin Luo
  • Ke Li
  • Xiao-Feng Yang
  • Qi Guo

The automatic detection of epileptic seizures from electroencephalography (EEG) signals is crucial for the localization and classification of epileptic seizure activity. However, seizure processes are typically dynamic and nonstationary, and thus, distinguishing rhythmic discharges from nonstationary processes is one of the challenging problems. In this paper, an adaptive and localized time–frequency representation in EEG signals is proposed by means of multiscale radial basis functions (MRBF) and a modified particle swarm optimization (MPSO) to improve both time and frequency resolution simultaneously, which is a novel MRBF-MPSO framework of the time–frequency feature extraction for epileptic EEG signals. The dimensionality of extracted features can be greatly reduced by the principal component analysis algorithm before the selected most discriminative features are fed into a support vector machine (SVM) classifier with the radial basis function (RBF) in order to separate epileptic seizure from seizure-free EEG signals. The classification performance of the proposed method has been evaluated against several state-of-the-art feature extraction algorithms and five other classifiers, such as linear discriminant analysis and logistic regression. The experimental results indicate that the proposed MRBF-MPSO-SVM classification method outperforms competing techniques in terms of classification accuracy, and shows the effectiveness of the proposed method for classification of seizure epochs and seizure-free epochs.

NeurIPS Conference 2018 Conference Paper

GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking

  • Patrick Chen
  • Si Si
  • Yang Li
  • Ciprian Chelba
  • Cho-Jui Hsieh

Model compression is essential for serving large deep neural nets on devices with limited resources or applications that require real-time responses. For advanced NLP problems, a neural language model usually consists of recurrent layers (e.g., using LSTM cells), an embedding matrix for representing input tokens, and a softmax layer for generating output tokens. For problems with a very large vocabulary size, the embedding and the softmax matrices can account for more than half of the model size. For instance, the bigLSTM model achieves state-of-the-art performance on the One-Billion-Word (OBW) dataset with around 800k vocabulary, and its word embedding and softmax matrices use more than 6 GB of space, and are responsible for over 90% of the model parameters. In this paper, we propose GroupReduce, a novel compression method for neural language models, based on vocabulary-partition (block) based low-rank matrix approximation and the inherent frequency distribution of tokens (the power-law distribution of words). We start by grouping words into $c$ blocks based on their frequency, and then refine the clustering iteratively by constructing weighted low-rank approximation for each block, where the weights are based on the frequencies of the words in the block. The experimental results show our method can significantly outperform traditional compression methods such as low-rank approximation and pruning. On the OBW dataset, our method achieved a 6.6x compression rate for the embedding and softmax matrices, and when combined with quantization, our method can achieve a 26x compression rate without losing prediction accuracy.
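The starting point of the method can be sketched as follows: sort embedding rows by token frequency, split them into blocks, and compress each block with its own truncated SVD. This is a hedged sketch of the initial grouping step only; the paper's iterative, frequency-weighted refinement of the blocks is omitted, and all names and the synthetic data are assumptions.

```python
import numpy as np

def group_reduce(E, freqs, n_blocks=2, rank=4):
    """Block-wise low-rank approximation of an embedding matrix E:
    sort rows by token frequency, split into blocks, and compress each
    block with a truncated SVD (sketch of GroupReduce's first step)."""
    order = np.argsort(-np.asarray(freqs))          # frequent tokens first
    blocks = np.array_split(order, n_blocks)
    E_hat = np.empty_like(E)
    for idx in blocks:
        U, s, Vt = np.linalg.svd(E[idx], full_matrices=False)
        E_hat[idx] = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return E_hat

rng = np.random.default_rng(0)
E = rng.normal(size=(100, 16))
freqs = rng.zipf(2.0, size=100)        # power-law token frequencies
E_hat = group_reduce(E, freqs, n_blocks=4, rank=8)
```

Because each block gets its own rank-8 factors, the block-wise reconstruction error is never worse than a single global rank-8 SVD of the whole matrix, at the cost of storing one factor pair per block.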

IJCAI Conference 2017 Conference Paper

CFNN: Correlation Filter Neural Network for Visual Object Tracking

  • Yang Li
  • Zhan Xu
  • Jianke Zhu

Although convolutional neural networks (CNNs) have shown promising capacity in many computer vision tasks, applying them to visual tracking remains far from solved. Existing methods either employ a large external dataset to undertake exhaustive pre-training or suffer from less satisfactory results in terms of accuracy and robustness. To track a single target in a wide range of videos, we present a novel Correlation Filter Neural Network architecture, as well as a complete visual tracking pipeline. The proposed approach is a special case of CNN, whose initialization does not need any pre-training on the external dataset. The initialization of the network enjoys the merits of cyclic sampling to achieve the appealing discriminative capability, while the network updating scheme adopts advantages from back-propagation in order to capture new appearance variations. The tracking pipeline integrates both aspects well by making them complementary to each other. We validate our tracker on the OTB-2013 benchmark. The proposed tracker obtains promising results compared to most existing representative trackers.
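The classical closed-form correlation filter that such trackers build on can be sketched in a few lines: train a filter in the Fourier domain (MOSSE-style, H* = G ⊙ conj(F) / (F ⊙ conj(F) + λ)) so that correlation with the template produces a peak at the target location. This is a sketch of the underlying primitive on assumed toy data, not the CFNN architecture itself.

```python
import numpy as np

def train_filter(f, g, lam=0.01):
    """Closed-form correlation filter in the Fourier domain (MOSSE-style):
    H* = (G * conj(F)) / (F * conj(F) + lambda)."""
    F, G = np.fft.fft2(f), np.fft.fft2(g)
    return (G * np.conj(F)) / (F * np.conj(F) + lam)

def respond(Hconj, patch):
    """Correlation response of the filter on a new patch."""
    return np.fft.ifft2(Hconj * np.fft.fft2(patch)).real

rng = np.random.default_rng(2)
f = rng.random((32, 32))
# Desired response: a peak at the origin (a soft Gaussian in practice).
g = np.zeros((32, 32))
g[0, 0] = 1.0
Hc = train_filter(f, g)
r = respond(Hc, np.roll(f, (3, 4), axis=(0, 1)))
print(np.unravel_index(np.argmax(r), r.shape))  # peak near (3, 4)
```

Cyclic sampling makes this training exact for all circular shifts of the template, which is the discriminative initialization the abstract refers to.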

TIST Journal 2017 Journal Article

Personalized Microtopic Recommendation on Microblogs

  • Yang Li
  • Jing Jiang
  • Ting Liu
  • Minghui Qiu
  • Xiaofei Sun

Microblogging services such as Sina Weibo and Twitter allow users to create tags explicitly indicated by the # symbol. In Sina Weibo, these tags are called microtopics, and in Twitter, they are called hashtags. In Sina Weibo, each microtopic has a designated page and can be directly visited or commented on. Recommending these microtopics to users based on their interests can help users efficiently acquire information. However, it is non-trivial to recommend microtopics to users to satisfy their information needs. In this article, we investigate the task of personalized microtopic recommendation, which exhibits two challenges. First, users usually do not give explicit ratings to microtopics. Second, there exists rich information about users and microtopics, for example, users' published content and biographical information, but it is not clear how to best utilize such information. To address the above two challenges, we propose a joint probabilistic latent factor model to integrate rich information into a matrix factorization-based solution to microtopic recommendation. Our model builds on top of collaborative filtering, content analysis, and feature regression. Using two real-world datasets, we evaluate our model with different kinds of content and contextual information. Experimental results show that our model significantly outperforms a few competitive baseline methods, especially in the circumstance where users have few adoption behaviors.
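The collaborative-filtering core that the joint model builds on can be sketched as plain matrix factorization trained by SGD on observed user–microtopic adoptions. The content-analysis and feature-regression components of the paper are omitted; the toy adoption data and all names below are illustrative assumptions.

```python
import random

def train_mf(ratings, n_users, n_items, k=2, lr=0.05, reg=0.01, epochs=200):
    """Plain matrix factorization by SGD: learn user factors P and item
    factors Q so that P[u] . Q[i] approximates the observed rating r."""
    random.seed(0)
    P = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            err = r - pred
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

# Implicit adoptions: 1.0 = user adopted the microtopic.
ratings = [(0, 0, 1.0), (0, 1, 1.0), (1, 1, 1.0), (1, 2, 1.0)]
P, Q = train_mf(ratings, n_users=2, n_items=3)
score = sum(p * q for p, q in zip(P[0], Q[0]))
```

The paper's model extends this by regressing the latent factors on user and microtopic content features, which is what helps users with few adoption behaviors.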

IS Journal 2015 Journal Article

Knowledge Engineering with Big Data

  • Xindong Wu
  • Huanhuan Chen
  • Gongqing Wu
  • Jun Liu
  • Qinghua Zheng
  • Xiaofeng He
  • Aoying Zhou
  • Zhong-Qiu Zhao

In the era of big data, knowledge engineering faces fundamental challenges induced by fragmented knowledge from heterogeneous, autonomous sources with complex and evolving relationships. The knowledge representation, acquisition, and inference techniques developed in the 1970s and 1980s, driven by research and development of expert systems, must be updated to cope with both fragmented knowledge from multiple sources in the big data revolution and in-depth knowledge from domain experts. This article presents BigKE, a knowledge engineering framework that handles fragmented knowledge modeling and online learning from multiple information sources, nonlinear fusion on fragmented knowledge, and automated demand-driven knowledge navigation.

IJCAI Conference 2013 Conference Paper

Automatic Name-Face Alignment to Enable Cross-Media News Retrieval

  • Yuejie Zhang
  • Wei Wu
  • Yang Li
  • Cheng Jin
  • Xiangyang Xue
  • Jianping Fan

A new algorithm is developed in this paper to support automatic name-face alignment for achieving more accurate cross-media news retrieval. We focus on extracting valuable information from large amounts of news images and their captions, where multi-level image-caption pairs are constructed for characterizing both significant names with higher salience and their cohesion with human faces extracted from news images. To remedy the issue of lacking enough related information for rare names, Web mining is introduced to acquire the extra multimodal information. We also emphasize an optimization mechanism by our Improved Self-Adaptive Simulated Annealing Genetic Algorithm to verify the feasibility of alignment combinations. Our experiments have obtained very positive results.

IROS Conference 2009 Conference Paper

SUEFUL-7: A 7DOF upper-limb exoskeleton robot with muscle-model-oriented EMG-based control

  • Ranathunga Arachchilage Ruwan Chandra Gopura
  • Kazuo Kiguchi
  • Yang Li

This paper proposes an electromyography (EMG) signal based control method for a seven degrees of freedom (7DOF) upper-limb motion assist exoskeleton robot (SUEFUL-7). The SUEFUL-7 is able to assist the motions of shoulder vertical and horizontal flexion/extension, shoulder internal/external rotation, elbow flexion/extension, forearm supination/pronation, wrist flexion/extension, and wrist radial/ulnar deviation of physically weak individuals. In the proposed control method, an impedance controller is applied to the muscle-model-oriented control method by considering the end effector force vector. Impedance parameters are adjusted in real time by considering the upper-limb posture and EMG activity levels. Experiments have been performed to evaluate the effectiveness of the proposed robotic system.