Arrow Research search

Author name cluster

Ke Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

77 papers
1 author row

Possible papers

77

AAAI Conference 2026 Conference Paper

Assessing Automated Fact-Checking for Medical LLM Responses with Knowledge Graphs

  • Shasha Zhou
  • Mingyu Huang
  • Jack Cole
  • Charles Britton
  • Ming Yin
  • Jan Wolber
  • Ke Li

The recent proliferation of large language models (LLMs) holds the potential to revolutionize healthcare, with strong capabilities in diverse medical tasks. Yet, deploying LLMs in high-stakes healthcare settings requires rigorous verification and validation to understand any potential harm. This paper investigates the reliability and viability of using medical knowledge graphs (KGs) for the automated factuality evaluation of LLM-generated responses. To ground this investigation, we introduce FAITH, a framework designed to systematically probe the strengths and limitations of this KG-based approach. FAITH operates without reference answers by decomposing responses into atomic claims, linking them to a medical KG, and scoring them based on evidence paths. Experiments on diverse medical tasks with human subjective evaluations demonstrate that KG-grounded evaluation achieves considerably higher correlations with clinician judgments and can effectively distinguish LLMs with varying capabilities. It is also robust to textual variances. The inherent explainability of its scoring can further help users understand and mitigate the limitations of current LLMs. We conclude that while limitations exist, leveraging KGs is a prominent direction for automated factuality assessment in healthcare.
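
To make the scoring idea concrete, here is a minimal, hypothetical sketch of KG-grounded claim scoring in the spirit of FAITH: a response is assumed to be already decomposed into atomic (subject, object) claims, each claim is linked to a toy medical knowledge graph, and its support is scored by the length of the shortest evidence path. The graph, the claims, and the scoring rule are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of KG-grounded claim scoring (not the authors' code).
import networkx as nx

# Toy medical KG: nodes are concepts, edges are asserted relations.
kg = nx.Graph()
kg.add_edges_from([
    ("metformin", "type 2 diabetes"),       # treats
    ("type 2 diabetes", "hyperglycemia"),   # causes
    ("metformin", "lactic acidosis"),       # rare adverse effect
])

def score_claim(subject: str, obj: str, max_hops: int = 2) -> float:
    """Score an atomic claim by the shortest evidence path between its entities."""
    if subject not in kg or obj not in kg:
        return 0.0                                   # unlinkable claim -> no support
    try:
        hops = nx.shortest_path_length(kg, subject, obj)
    except nx.NetworkXNoPath:
        return 0.0
    return max(0.0, 1.0 - (hops - 1) / max_hops)     # shorter paths -> stronger support

# Atomic claims extracted from an LLM response (the extraction step is assumed here).
claims = [("metformin", "type 2 diabetes"), ("metformin", "hypertension")]
scores = {c: score_claim(*c) for c in claims}
print(scores)                                        # supported claim scores 1.0, unsupported one 0.0
print(sum(scores.values()) / len(scores))            # response-level factuality score
```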

AAAI Conference 2026 Conference Paper

Beyond Monotonicity: Revisiting Factorization Principles in Multi-Agent Q-Learning

  • Tianmeng Hu
  • Yongzheng Cui
  • Rui Tang
  • Biao Luo
  • Ke Li

Value decomposition is a central approach in multi-agent reinforcement learning (MARL), enabling centralized training with decentralized execution by factorizing the global value function into local values. To ensure individual-global-max (IGM) consistency, existing methods either enforce monotonicity constraints, which limit expressive power, or adopt softer surrogates at the cost of algorithmic complexity. In this work, we present a dynamical systems analysis of non-monotonic value decomposition, modeling learning dynamics as continuous-time gradient flow. We prove that, under approximately greedy exploration, all zero-loss equilibria violating IGM consistency are unstable saddle points, while only IGM-consistent solutions are stable attractors of the learning dynamics. Extensive experiments on both synthetic matrix games and challenging MARL benchmarks demonstrate that unconstrained, non-monotonic factorization reliably recovers IGM-optimal solutions and consistently outperforms monotonic baselines. Additionally, we investigate the influence of temporal-difference targets and exploration strategies, providing actionable insights for the design of future value-based MARL algorithms.
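
The IGM condition the abstract refers to can be checked directly on a small matrix game. The sketch below uses the classic non-monotonic payoff matrix and an additive (VDN-style) mixer; the local utility vectors are made-up examples, so this only illustrates the consistency check, not the paper's gradient-flow analysis.

```python
# Minimal sketch of the IGM (individual-global-max) check on a 2-agent matrix game.
import numpy as np

payoff = np.array([[  8.0, -12.0, -12.0],
                   [-12.0,   0.0,   0.0],
                   [-12.0,   0.0,   0.0]])

def igm_report(q1, q2):
    q_tot = q1[:, None] + q2[None, :]                # additive mixer: Q_tot(a1, a2) = q1(a1) + q2(a2)
    joint = tuple(int(v) for v in np.unravel_index(np.argmax(q_tot), q_tot.shape))
    decentralized = (int(np.argmax(q1)), int(np.argmax(q2)))
    optimum = tuple(int(v) for v in np.unravel_index(np.argmax(payoff), payoff.shape))
    print("IGM consistent:", joint == decentralized,
          "| recovers the true optimum:", decentralized == optimum)

igm_report(np.array([1.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]))  # consistent and optimal
igm_report(np.array([0.0, 1.0, 0.0]), np.array([0.0, 1.0, 0.0]))  # consistent, yet stuck on a suboptimal joint action
```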

AAAI Conference 2026 Conference Paper

LAMDA: Two-Phase HPO via Learning Prior from Low-Fidelity Data

  • Fan Li
  • Shengbo Wang
  • Ke Li

Hyperparameter Optimization (HPO) is crucial in machine learning, aiming to optimize hyperparameters to enhance model performance. Although existing methods that leverage prior knowledge—drawn from either previous experiments or expert insights—can accelerate optimization, acquiring a correct prior for a specific HPO task is non-trivial. In this work, we propose to relieve the reliance on external knowledge by learning a reliable prior directly from low-fidelity (LF) problems. We introduce LAMDA, an algorithm-agnostic framework designed to boost any baseline HPO algorithm. Specifically, LAMDA operates in two phases: (1) it learns a reliable prior by exploring the LF landscape under limited computational budgets, and (2) it leverages this learned prior to guide the HPO process. We showcase how the LAMDA framework can be integrated with various HPO algorithms to boost their performance, and further provide theoretical analysis of its integration with Bayesian optimization and the bandit-based Hyperband. We conduct experiments on 56 HPO problems spanning diverse domains and model scales. Results show that LAMDA consistently enhances its baseline algorithms. Compared to nine state-of-the-art HPO algorithms, our LAMDA variant achieves the best performance in 51 out of 56 HPO tasks and is the second-best algorithm in the other 5 cases.
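
A minimal sketch of the two-phase idea under toy assumptions (a 1-D objective, a biased and noisy low-fidelity proxy, and a Gaussian sampling prior); the actual framework wraps full HPO algorithms such as Bayesian optimization and Hyperband rather than random search.

```python
# Minimal two-phase sketch: learn a sampling prior from cheap low-fidelity (LF) evaluations,
# then use it to bias the expensive high-fidelity (HF) search. Everything here is illustrative.
import numpy as np

rng = np.random.default_rng(0)

def hf_loss(x):          # expensive "high-fidelity" objective (e.g., a full training run)
    return (x - 0.7) ** 2

def lf_loss(x):          # cheap LF proxy: correlated with HF but biased and noisy
    return (x - 0.65) ** 2 + 0.01 * rng.standard_normal()

# Phase 1: explore the LF landscape under a generous cheap budget and fit a simple prior.
lf_x = rng.uniform(0.0, 1.0, size=200)
lf_y = np.array([lf_loss(x) for x in lf_x])
top = lf_x[np.argsort(lf_y)[:20]]                     # promising LF region
prior_mu, prior_sigma = top.mean(), top.std() + 1e-3

# Phase 2: spend the small HF budget on configurations drawn from the learned prior.
hf_budget = 10
guided = np.clip(rng.normal(prior_mu, prior_sigma, size=hf_budget), 0.0, 1.0)
uniform = rng.uniform(0.0, 1.0, size=hf_budget)       # baseline without the prior
print("best HF loss with LF prior :", min(hf_loss(x) for x in guided))
print("best HF loss random search :", min(hf_loss(x) for x in uniform))
```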

AAAI Conference 2026 Conference Paper

Preference Is More than Comparisons: Rethinking Dueling Bandits with Augmented Human Feedback

  • Shengbo Wang
  • Hong Sun
  • Ke Li

Interactive preference elicitation (IPE) aims to substantially reduce human effort while acquiring human preferences in a wide range of personalization systems. Dueling bandit (DB) algorithms enable optimal decision-making in IPE by building on pairwise comparisons. However, they remain inefficient when human feedback is sparse. Existing methods address sparsity by heavily relying on parametric reward models, whose rigid assumptions are vulnerable to misspecification. In contrast, we explore an alternative perspective based on feedback augmentation, and introduce critical improvements to the model-free DB framework. Specifically, we introduce augmented confidence bounds to integrate augmented human feedback under generalized concentration properties, and analyze the multi-factored performance trade-off via regret analysis. Our prototype algorithm achieves competitive performance across several IPE benchmarks, including recommendation, multi-objective optimization, and response optimization for large language models, demonstrating the potential of our approach for provably efficient IPE in broader applications.

AAAI Conference 2026 Conference Paper

RSVG-ZeroOV: Exploring a Training-Free Framework for Zero-Shot Open-Vocabulary Visual Grounding in Remote Sensing Images

  • Ke Li
  • Di Wang
  • Ting Wang
  • Fuyu Dong
  • Yiming Zhang
  • Luyao Zhang
  • Xiangyu Wang
  • Shaofeng Li

Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing images based on free-form natural language expressions. Existing approaches are typically constrained to closed-set vocabularies, limiting their applicability in open-world scenarios. While recent attempts leverage generic foundation models for open-vocabulary RSVG, they rely heavily on expensive high-quality datasets and time-consuming fine-tuning. To address these limitations, we propose RSVG-ZeroOV, a training-free framework that aims to explore the potential of frozen generic foundation models for zero-shot open-vocabulary RSVG. Specifically, RSVG-ZeroOV comprises three key stages: (i) Overview: We utilize a vision-language model (VLM) to obtain cross-attention maps that capture semantic correlations between text queries and visual regions. (ii) Focus: By leveraging the fine-grained modeling priors of a diffusion model (DM), we fill in gaps in structural and shape information of objects, which are often overlooked by VLM. (iii) Evolve: A simple yet effective attention evolution module is introduced to suppress irrelevant activations, yielding purified segmentation masks over the referred objects. Without cumbersome task-specific training, RSVG-ZeroOV offers an efficient and scalable solution. Extensive experiments demonstrate that the proposed framework consistently outperforms existing weakly-supervised and zero-shot methods.

IJCAI Conference 2025 Conference Paper

App2Exa: Accelerating Exact kNN Search via Dynamic Cache-Guided Approximation

  • Ke Li
  • Leong Hou U
  • Shuo Shang

The k-nearest neighbor (kNN) query is a cornerstone of similarity-based applications across various domains. While prior work has enhanced kNN search efficiency, it typically focuses on approximate methods for high-dimensional data or exact methods for low-dimensional data, often assuming static query and data distributions. This creates a significant gap in accelerating exact kNN search for low-to-medium dimensional data with dynamic query distributions. To fill this gap, we propose App2Exa, a cache-guided framework that integrates approximate and exact kNN search. App2Exa utilizes a dynamically maintained cache graph index to retrieve approximate results, which subsequently guide exact search using a VP-Tree with a best-first strategy. A benefit-driven caching mechanism further optimizes performance by prioritizing vectors based on frequency, recency, and computational cost. Experimental results demonstrate that App2Exa significantly boosts efficiency, providing a robust and scalable solution for evolving query patterns and enabling exact kNN search to support higher dimensionality more effectively.
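
The approximate-then-exact interplay can be illustrated with a toy vantage-point pruning loop: an approximate neighbour (standing in for a cache-graph hit) supplies an initial radius, and the triangle inequality then skips exact distance computations while still returning the exact nearest neighbour. The data, the single pivot, and the sampled "approximate search" are assumptions; the actual system maintains a cache graph index and a VP-Tree with best-first search.

```python
# Minimal sketch of seeding an exact search with an approximate result (not the paper's index).
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((5000, 8))
query = rng.random(8)

# Precomputed structure: distance from every vector to one vantage point (pivot).
pivot = data[0]
d_pivot = np.linalg.norm(data - pivot, axis=1)
dq_pivot = np.linalg.norm(query - pivot)

# "Approximate" step (stand-in for a cache-graph hit): best of a small random sample.
sample = rng.choice(len(data), size=50, replace=False)
approx_idx = sample[np.argmin(np.linalg.norm(data[sample] - query, axis=1))]
best_idx, best_dist = int(approx_idx), float(np.linalg.norm(data[approx_idx] - query))

# Exact refinement: the triangle inequality |d(x,p) - d(q,p)| <= d(x,q) prunes candidates.
checked = 0
for i in range(len(data)):
    if abs(d_pivot[i] - dq_pivot) >= best_dist:
        continue                                     # provably cannot beat the current best
    checked += 1
    d = np.linalg.norm(data[i] - query)
    if d < best_dist:
        best_idx, best_dist = i, d
print(f"exact 1-NN: index {best_idx}, distance {best_dist:.3f}, full distances computed: {checked}/5000")
```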

NeurIPS Conference 2025 Conference Paper

Augmenting Biological Fitness Prediction Benchmarks with Landscapes Features from GraphFLA

  • Mingyu Huang
  • Shasha Zhou
  • Ke Li

Machine learning models increasingly map biological sequence-fitness landscapes to predict mutational effects. Effective evaluation of these models requires benchmarks curated from empirical data. Despite their impressive scales, existing benchmarks lack topographical information regarding the underlying fitness landscapes, which hampers interpretation and comparison of model performance beyond averaged scores. Here, we introduce GraphFLA, a Python framework that constructs and analyzes fitness landscapes from diverse modalities (DNA, RNA, protein, and beyond), accommodating datasets up to millions of mutants. GraphFLA calculates 20 biologically relevant features that characterize 4 fundamental aspects of landscape topography. By applying GraphFLA to over 5,300 landscapes from ProteinGym, RNAGym, and CIS-BP, we demonstrate its utility in interpreting and comparing the performance of dozens of fitness prediction models, highlighting factors influencing model accuracy and respective advantages of different models. Additionally, we release 155 combinatorially complete empirical fitness landscapes, encompassing over 2.2 million sequences across various modalities. All the codes and datasets are available at https://github.com/COLA-Laboratory/GraphFLA.
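
A minimal sketch of the kind of object GraphFLA builds and the kind of topographical feature it reports, under toy assumptions (4-bit genotypes, a made-up fitness function, Hamming-1 edges); it is not the library's API.

```python
# Toy fitness landscape as a graph: nodes are sequences, edges connect single-mutation
# neighbours, and "number of local optima" stands in for one landscape feature.
import itertools
import networkx as nx

def fitness(seq):                        # purely illustrative fitness values
    return sum((i + 1) * b for i, b in enumerate(seq)) % 5

L = 4
genotypes = list(itertools.product([0, 1], repeat=L))
G = nx.Graph()
for g in genotypes:
    G.add_node(g, fitness=fitness(g))
for g in genotypes:
    for i in range(L):                   # flip one position -> Hamming-1 neighbour
        h = list(g)
        h[i] ^= 1
        G.add_edge(g, tuple(h))

local_optima = [g for g in G
                if all(G.nodes[g]["fitness"] >= G.nodes[n]["fitness"] for n in G[g])]
print("genotypes:", len(G), "local optima:", len(local_optima))
```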

AAAI Conference 2025 Conference Paper

Bridging Sequence-Structure Alignment in RNA Foundation Models

  • Heng Yang
  • Renzhi Chen
  • Ke Li

The alignment between RNA sequences and structures in foundation models (FMs) has yet to be thoroughly investigated. Existing FMs have struggled to establish sequence-structure alignment, hindering the seamless flow of genomic information between RNA sequences and structures. In this study, we introduce OmniGenome, an RNA FM trained to align RNA sequences with respect to secondary structures through structure-contextualized modelling. This alignment enables free and bidirectional mappings between sequences and structures by utilizing a flexible RNA modelling paradigm that supports versatile input and output modalities, i.e., sequence and/or structure as input/output. We implement RNA design and zero-shot secondary structure prediction as case studies to evaluate the Seq2Str and Str2Seq mapping capabilities of OmniGenome. Results on the EternaV2 benchmark show that OmniGenome solved 74% of puzzles, whereas existing FMs solved only up to 3% of the puzzles due to the lack of sequence-structure alignment. We leverage four comprehensive in-silico genome modelling benchmarks to evaluate performance across a diverse set of downstream genome tasks, where the results show that OmniGenome achieves state-of-the-art performance on RNA and DNA benchmarks, even without any training on DNA genomes.

IJCAI Conference 2025 Conference Paper

Conversational Exploration of Literature Landscape with LitChat

  • Mingyu Huang
  • Shasha Zhou
  • Yuxuan Chen
  • Ke Li

We are living in an era of "big literature", where the volume of digital scientific publications is growing exponentially. While offering new opportunities, this also poses challenges for understanding literature landscapes, as traditional manual reviewing is no longer feasible. Recent large language models (LLMs) have shown strong capabilities for literature comprehension, yet they are incapable of offering "comprehensive, objective, open and transparent" views desired by systematic reviews due to their limited context windows and trust issues like hallucinations. Here we present LitChat, an end-to-end, interactive and conversational literature agent that augments LLM agents with data-driven discovery tools to facilitate literature exploration. LitChat automatically interprets user queries, retrieves relevant sources, constructs knowledge graphs, and employs diverse data-mining techniques to generate evidence-based insights addressing user needs. We illustrate the effectiveness of LitChat via a case study on AI4Health, highlighting its capacity to quickly navigate users through the large-scale literature landscape with data-based evidence that is otherwise infeasible with traditional means.

AAAI Conference 2025 Conference Paper

Destroy and Repair Using Hyper-Graphs for Routing

  • Ke Li
  • Fei Liu
  • Zhenkun Wang
  • Qingfu Zhang

Recent advancements in Neural Combinatorial Optimization (NCO) have shown promise in solving routing problems like the Traveling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) without handcrafted designs. Research in this domain has explored two primary categories of methods: iterative and non-iterative. While non-iterative methods struggle to generate near-optimal solutions directly, iterative methods simplify the task by learning local search steps. However, existing iterative methods are often limited by restricted neighborhood searches, leading to suboptimal results. To address this limitation, we propose a novel approach that extends the search to larger neighborhoods by learning a destroy-and-repair strategy. Specifically, we introduce a Destroy-and-Repair framework based on Hyper-Graphs (DRHG). This framework reduces consecutive intact edges to hyper-edges, allowing the model to pay more attention to the destroyed part and decrease the complexity of encoding all nodes. Experiments demonstrate that DRHG achieves state-of-the-art performance on TSP with up to 10,000 nodes and shows strong generalization to real-world TSPLib and CVRPLib problems.

NeurIPS Conference 2025 Conference Paper

EDBench: Large-Scale Electron Density Data for Molecular Modeling

  • Hongxin Xiang
  • Ke Li
  • Mingquan Liu
  • Zhixiang Cheng
  • Bin Yao
  • Wenjie Du
  • Jun Xia
  • Li Zeng

Existing molecular machine learning force fields (MLFFs) generally focus on the learning of atoms, molecules, and simple quantum chemical properties (such as energy and force), but ignore the importance of electron density (ED) $\rho(r)$ in accurately understanding molecular force fields (MFFs). ED describes the probability of finding electrons at specific locations around atoms or molecules, which uniquely determines all ground state properties (such as energy, molecular structure, etc.) of interactive multi-particle systems according to the Hohenberg-Kohn theorem. However, the calculation of ED relies on the time-consuming first-principles density functional theory (DFT), which leads to the lack of large-scale ED data and limits its application in MLFFs. In this paper, we introduce EDBench, a large-scale, high-quality dataset of ED designed to advance learning-based research at the electronic scale. Built upon the PCQM4Mv2, EDBench provides accurate ED data, covering 3.3 million molecules. To comprehensively evaluate the ability of models to understand and utilize electronic information, we design a suite of ED-centric benchmark tasks spanning prediction, retrieval, and generation. Our evaluation of several state-of-the-art methods demonstrates that learning from EDBench is not only feasible but also achieves high accuracy. Moreover, we show that learning-based methods can efficiently calculate ED with comparable precision while significantly reducing the computational cost relative to traditional DFT calculations. All data and benchmarks from EDBench will be freely available, laying a robust foundation for ED-driven drug discovery and materials science.

AAAI Conference 2025 Conference Paper

ESEG: Event-Based Segmentation Boosted by Explicit Edge-Semantic Guidance

  • Yucheng Zhao
  • Gengyu Lyu
  • Ke Li
  • Zihao Wang
  • Hao Chen
  • Zhen Yang
  • Yongjian Deng

Event-based semantic segmentation (ESS) has attracted researchers' attention recently, as event cameras can solve problems such as under/over-exposure or motion blur that are difficult for RGB cameras to handle. However, event data are noisy and sparse, resulting in difficulties for the model to locate and extract reliable cues from their sparse representations, especially when performing pixel-level tasks. In this paper, we propose a novel framework ESEG to alleviate the dilemma. Given that event signals relate closely to moving edges, instead of proposing complex structures to expect them to recognize those reliable edge regions behind event signals on their own, we introduce the explicit edge-semantic supervision as a reference to let the ESS model globally optimize semantics, considering the high confidence of event data in edge regions. In addition, we propose a fusion module named Density-Aware Dynamic-Window Cross Attention Fusion (D²CAF), in which the density perception, cross-attention, and dynamic window masking mechanisms are jointly imposed to optimize edge-dense feature fusion, leveraging the characteristics of event cameras. Experimental results on DSEC and DDD17 datasets demonstrate the efficacy of the ESEG framework and its core designs.

AAAI Conference 2025 Conference Paper

FD2-Net: Frequency-Driven Feature Decomposition Network for Infrared-Visible Object Detection

  • Ke Li
  • Di Wang
  • Zhangyuan Hu
  • Shaofeng Li
  • Weiping Ni
  • Lin Zhao
  • Quan Wang

Infrared-visible object detection (IVOD) seeks to harness the complementary information in infrared and visible images, thereby enhancing the performance of detectors in complex environments. However, existing methods often neglect the frequency characteristics of complementary information, such as the abundant high-frequency details in visible images and the valuable low-frequency thermal information in infrared images, thus constraining detection performance. To solve this problem, we introduce a novel Frequency-Driven Feature Decomposition Network for IVOD, called FD2-Net, which effectively captures the unique frequency representations of complementary information across multimodal visual spaces. Specifically, we propose a feature decomposition encoder, wherein the high-frequency unit (HFU) utilizes discrete cosine transform to capture representative high-frequency features, while the low-frequency unit (LFU) employs dynamic receptive fields to model the multi-scale context of diverse objects. Next, we adopt a parameter-free complementary strengths strategy to enhance multimodal features through seamless inter-frequency recoupling. Furthermore, we innovatively design a multimodal reconstruction mechanism that recovers image details lost during feature extraction, further leveraging the complementary information from infrared and visible images to enhance overall representational capacity. Extensive experiments demonstrate that FD2-Net outperforms state-of-the-art (SoTA) models across various IVOD benchmarks, i.e. LLVIP (96.2% mAP), FLIR (82.9% mAP), and M3FD (83.5% mAP).
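
The frequency split that motivates the encoder can be illustrated with a plain 2-D DCT: low-frequency coefficients approximate smooth (thermal-like) content and the residual carries fine details. The cutoff and the random patch are assumptions; the paper's HFU/LFU are learned modules rather than this fixed filter.

```python
# Minimal sketch of a DCT-based low/high-frequency decomposition (illustrative only).
import numpy as np
from scipy.fft import dctn, idctn

patch = np.random.default_rng(0).random((32, 32))    # stand-in for an image patch / feature map
coeffs = dctn(patch, norm="ortho")

cutoff = 8
low_mask = np.zeros_like(coeffs)
low_mask[:cutoff, :cutoff] = 1.0                      # keep only low-frequency DCT coefficients

low_freq = idctn(coeffs * low_mask, norm="ortho")     # smooth, thermal-like content
high_freq = patch - low_freq                          # fine details and edges
print(np.allclose(low_freq + high_freq, patch))       # the decomposition is exact by construction
```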

AAAI Conference 2025 Conference Paper

Feature Denoising Diffusion Model for Blind Image Quality Assessment

  • Xudong Li
  • Yan Zhang
  • Yunhang Shen
  • Ke Li
  • Runze Hu
  • Xiawu Zheng
  • Sicheng Zhao

Blind Image Quality Assessment (BIQA) aims to evaluate image quality in line with human perception, without reference benchmarks. Currently, deep learning BIQA methods typically depend on using features from high-level tasks for transfer learning. However, the inherent differences between BIQA and these high-level tasks inevitably introduce noise into the quality-aware features. In this paper, we take an initial step toward exploring the diffusion model for feature denoising in BIQA, namely Perceptual Feature Diffusion for IQA (PFD-IQA), which aims to remove noise from quality-aware features. Specifically, 1) we propose a Perceptual Prior Discovery and Aggregation module to establish two auxiliary tasks to discover potential low-level features in images that are used to aggregate perceptual textual prompt conditions for the diffusion model. 2) we propose a Perceptual Conditional Feature Refinement strategy, which matches noisy features to predefined denoising trajectories and then performs exact feature denoising based on textual prompt conditions. By incorporating a lightweight denoiser and requiring only a few feature denoising steps (e.g., just five iterations), our PFD-IQA framework achieves superior performance across eight standard BIQA datasets, validating its effectiveness.

AAAI Conference 2025 Conference Paper

Improving Factuality in Large Language Models via Decoding-Time Hallucinatory and Truthful Comparators

  • Dingkang Yang
  • Dongling Xiao
  • Jinjie Wei
  • Mingcheng Li
  • Zhaoyu Chen
  • Ke Li
  • Lihua Zhang

Despite their remarkable capabilities, Large Language Models (LLMs) are prone to generate responses that contradict verifiable facts, i.e., unfaithful hallucination content. Existing efforts generally focus on optimizing model parameters or editing semantic representations, which compromise the internal factual knowledge of target LLMs. In addition, hallucinations typically exhibit multifaceted patterns in downstream tasks, limiting the model's holistic performance across tasks. In this paper, we propose a Comparator-driven Decoding-Time (CDT) framework to alleviate the response hallucination. Firstly, we construct hallucinatory and truthful comparators with multi-task fine-tuning samples. In this case, we present an instruction prototype-guided mixture of experts strategy to enhance the ability of the corresponding comparators to capture different hallucination or truthfulness patterns in distinct task instructions. CDT constrains next-token predictions to factuality-robust distributions by contrasting the logit differences between the target LLMs and these comparators. Systematic experiments on multiple downstream tasks show that our framework can significantly improve the model performance and response factuality.
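
A minimal sketch of the decoding-time contrast, with made-up logits and weight: the target model's next-token logits are shifted by the difference between a truthful and a hallucinatory comparator, steering the prediction toward the factual token. This shows only the combination rule; the paper's comparators are fine-tuned mixture-of-experts models.

```python
# Minimal sketch of comparator-guided next-token selection (illustrative values).
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

vocab = ["paris", "london", "berlin"]
logits_target       = np.array([1.8, 2.0, 0.5])   # target LLM slightly favours the wrong token
logits_truthful     = np.array([2.5, 0.5, 0.4])   # truthful comparator favours the fact
logits_hallucinated = np.array([0.6, 2.4, 0.5])   # hallucinatory comparator favours the error

alpha = 1.0
adjusted = logits_target + alpha * (logits_truthful - logits_hallucinated)

print("plain decoding    :", vocab[int(np.argmax(softmax(logits_target)))])   # "london"
print("comparator-guided :", vocab[int(np.argmax(softmax(adjusted)))])        # "paris"
```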

NeurIPS Conference 2025 Conference Paper

Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models

  • Yulei Qin
  • Gang Li
  • Zongyi Li
  • Zihan Xu
  • Yuchen Shi
  • Zhekai Lin
  • Xiao Cui
  • Ke Li

Existing large language models (LLMs) face challenges of following complex instructions, especially when multiple constraints are present and organized in paralleling, chaining, and branching structures. One intuitive solution, namely chain-of-thought (CoT), is expected to universally improve capabilities of LLMs. However, we find that the vanilla CoT exerts a negative impact on performance due to its superficial reasoning pattern of simply paraphrasing the instructions. It fails to peel back the compositions of constraints for identifying their relationship across hierarchies of types and dimensions. To this end, we propose RAIF, a systematic method to boost LLMs in dealing with complex instructions via incentivizing reasoning for test-time compute scaling. First, we stem from the decomposition of complex instructions under existing taxonomies and propose a reproducible data acquisition method. Second, we exploit reinforcement learning (RL) with verifiable rule-centric reward signals to cultivate reasoning specifically for instruction following. We address the shallow, non-essential nature of reasoning under complex instructions via sample-wise contrast for superior CoT enforcement. We also exploit behavior cloning of experts to facilitate steady distribution shift from fast-thinking LLMs to skillful reasoners. Extensive evaluations on seven comprehensive benchmarks confirm the validity of the proposed method, where a 1.5B LLM achieves 11.74% gains with performance comparable to an 8B LLM. Evaluation on OOD constraints also confirms the generalizability of our RAIF.
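
The verifiable, rule-centric reward signal can be pictured as a set of programmatic constraint checks averaged into a scalar, as in the hypothetical sketch below; the specific constraints and equal weighting are illustrative, not the paper's taxonomy.

```python
# Minimal sketch of a rule-centric verifiable reward for complex instruction following.
import re

def instruction_reward(response: str) -> float:
    """Check each verifiable constraint of a composite instruction and average the results."""
    checks = [
        len(response.split()) <= 40,                      # length constraint
        response.strip().endswith("."),                   # format constraint
        "safety" in response.lower(),                     # keyword constraint
        len(re.findall(r"^\d+\.", response, re.M)) >= 2,  # structure: at least two numbered items
    ]
    return sum(checks) / len(checks)

resp = "1. Wear goggles for safety.\n2. Unplug the device first."
print(instruction_reward(resp))   # fraction of satisfied constraints, 1.0 for this example
```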

AAAI Conference 2025 Conference Paper

Know Where You Are From: Event-Based Segmentation via Spatio-Temporal Propagation

  • Ke Li
  • Gengyu Lyu
  • Hao Chen
  • Bochen Xie
  • Zhen Yang
  • Youfu Li
  • Yongjian Deng

Event cameras have gained attention in segmentation due to their higher temporal resolution and dynamic range compared to traditional cameras. However, they struggle with issues like lack of color perception and triggering only at motion edges, making it hard to distinguish objects with similar contours or segment spatially continuous objects. Our work aims to address these often overlooked issues. Based on the assumption that various objects exhibit different motion patterns, we believe that embedding the historical motion states of objects into segmented scenes can effectively address these challenges. Inspired by this, we propose the ESS framework "Know Where You Are From" (KWYAF), which incorporates past motion cues through spatio-temporal propagation embedding. This framework features two core components: the Sequential Motion Encoding Module (SME) and the Event-Based Reliable Region Selection Mechanism (ER²SM). The SME constructs prior motion features through spatio-temporal correlation modeling to boost the final segmentation, while ER²SM adaptively identifies high-confidence regions, embedding motion cues more precisely through local window masks and reliable region selection. Extensive experiments demonstrate the effectiveness of the proposed framework both quantitatively and qualitatively.

NeurIPS Conference 2025 Conference Paper

LTD-Bench: Evaluating Large Language Models by Letting Them Draw

  • Liuhao Lin
  • Ke Li
  • Zihan Xu
  • Yuchen Shi
  • Yulei Qin
  • Yan Zhang
  • Xing Sun
  • Rongrong Ji

Current evaluation paradigms for large language models (LLMs) represent a critical blind spot in AI research—relying on opaque numerical metrics that conceal fundamental limitations in spatial reasoning while providing no intuitive understanding of model capabilities. This deficiency creates a dangerous disconnect between reported performance and practical abilities, particularly for applications requiring physical world understanding. We introduce LTD-Bench, a breakthrough benchmark that transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. This approach makes spatial reasoning limitations immediately apparent even to non-experts, bridging the fundamental gap between statistical performance and intuitive assessment. LTD-Bench implements a comprehensive methodology with complementary generation tasks (testing spatial imagination) and recognition tasks (assessing spatial perception) across three progressively challenging difficulty levels, methodically evaluating both directions of the critical language-spatial mapping. Our extensive experiments with state-of-the-art models expose an alarming capability gap: even LLMs achieving impressive results on traditional benchmarks demonstrate profound deficiencies in establishing bidirectional mappings between language and spatial concepts—a fundamental limitation that undermines their potential as genuine world models. Furthermore, LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigate model similarity. Our dataset and codes are available at https://github.com/walktaster/LTD-Bench.
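
A minimal sketch of how a dot-matrix drawing task can be scored: the model's character grid is parsed into a binary matrix and compared with a reference shape via IoU. The grids and the metric are illustrative assumptions, not the benchmark's exact protocol.

```python
# Toy scoring of a dot-matrix drawing against a reference shape.
import numpy as np

def parse_grid(text: str) -> np.ndarray:
    """Turn a model-emitted grid of '.' and '#' characters into a binary matrix."""
    return np.array([[1 if ch == "#" else 0 for ch in row] for row in text.strip().splitlines()])

reference = parse_grid("""
.#.
###
.#.
""")
model_output = parse_grid("""
.#.
##.
.#.
""")

inter = np.logical_and(reference, model_output).sum()
union = np.logical_or(reference, model_output).sum()
print("IoU:", inter / union)   # 0.8: four of the five reference cells were drawn
```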

NeurIPS Conference 2025 Conference Paper

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

  • Chaoyou Fu
  • Peixian Chen
  • Yunhang Shen
  • Yulei Qin
  • Mengdan Zhang
  • Xu Lin
  • Jinrui Yang
  • Xiawu Zheng

Multimodal Large Language Model (MLLM) relies on the powerful LLM to perform multimodal tasks, showing amazing emergent abilities in recent studies, such as writing poems based on an image. However, it is difficult for these case studies to fully reflect the performance of MLLM, lacking a comprehensive evaluation. In this paper, we fill in this blank, presenting the first comprehensive MLLM Evaluation benchmark MME. It measures both perception and cognition abilities on a total of 14 subtasks. In order to avoid data leakage that may arise from direct use of public datasets for evaluation, the annotations of instruction-answer pairs are all manually designed. The concise instruction design allows us to fairly compare MLLMs, instead of struggling in prompt engineering. Besides, with such an instruction, we can also easily carry out quantitative statistics. A total of 30 advanced MLLMs are comprehensively evaluated on our MME, which not only suggests that existing MLLMs still have a large room for improvement, but also reveals the potential directions for the subsequent model optimization. The data are released at the project page: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation.

AAAI Conference 2025 Conference Paper

Probability-Density-aware Semi-supervised Learning

  • Shuyang Liu
  • Ruiqiu Zheng
  • Yunhang Shen
  • Zhou Yu
  • Ke Li
  • Xing Sun
  • Shaohui Lin

In semi-supervised learning (SSL), the cluster assumption is widely accepted: features in different high-density regions belong to different categories. However, it is often ignored by existing algorithms and lacks a mathematical explanation. This paper first proposes a theorem that statistically explains the cluster assumption and proves that the probability density can significantly help to exploit this prior fully. A Probability-Density-Aware Measure (PM) is proposed based on the theorem to discern the similarity between neighboring points. The PM is deployed to improve Label Propagation, and a new pseudo-labeling algorithm, Probability-Density-Aware Label Propagation (PMLP), is proposed. We also prove that traditional first-order similarity pseudo-labeling can be viewed as a particular case of PMLP, which provides a comprehensive theoretical understanding of PMLP's superior performance. Extensive experiments demonstrate that PMLP achieves outstanding performance compared with other recent methods.
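
A minimal sketch of density-aware propagation under toy assumptions (two Gaussian clusters, a KDE-based density weight on each pairwise affinity, and plain iterative propagation); the paper's PM and PMLP are defined differently, but the sketch shows how density information can reshape the affinities used for pseudo-labeling.

```python
# Density-weighted label propagation: affinities are re-weighted by the estimated density
# at the midpoint of each pair, so propagation prefers paths through high-density regions.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.4, (30, 2)), rng.normal(2, 0.4, (30, 2))])
y = -np.ones(60, dtype=int)
y[0], y[30] = 0, 1                                          # one labelled point per cluster

kde = gaussian_kde(X.T)
mid_density = np.array([[kde((X[i] + X[j]) / 2)[0] for j in range(60)] for i in range(60)])
W = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)) * mid_density
np.fill_diagonal(W, 0.0)
S = W / W.sum(axis=1, keepdims=True)                        # row-stochastic propagation matrix

Y = np.eye(2)[np.maximum(y, 0)] * (y >= 0)[:, None]         # one-hot labels; zero rows if unlabelled
F = Y.copy()
for _ in range(50):                                         # F <- alpha * S F + (1 - alpha) * Y
    F = 0.9 * S @ F + 0.1 * Y
pred = F.argmax(axis=1)
print("cluster 1 votes:", np.bincount(pred[:30], minlength=2),
      "cluster 2 votes:", np.bincount(pred[30:], minlength=2))
```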

NeurIPS Conference 2025 Conference Paper

Transforming Gaps into Gains: Bridging Model and Data Heterogeneity in Federated Learning via Knowledge Weak-Aware Zones

  • Ke Li
  • Yan Ding
  • Zhiqin Zhu
  • Shenhai Zheng

Heterogeneous federated learning enables collaborative training across clients under dual heterogeneity of models and data, posing challenges for effective knowledge transfer. Federated mutual learning employs proxy models to bridge cross-model knowledge exchange; however, existing methods remain limited to direct alignment between the outputs of private and proxy models, ignoring the deep discrepancies in representation and decision spaces between them. Such cognitive biases cause knowledge to be transferred only at shallow levels and trigger performance bottlenecks. To address this, this paper proposes FedKWAZ to identify and exploit Knowledge Weak-Aware Zones (KWAZ)—spatial zones of deep knowledge misalignment between private and proxy models, further refined into Semantic Weak-Aware Zones and Decision Weak-Aware Zones, which characterize cognitive misalignments in representation and decision spaces as focal targets for enhanced bidirectional distillation. FedKWAZ designs a Hierarchical Adaptive Patch Mixing (HAPM) mechanism to generate multiple mixed samples and employs a Knowledge Discrepancy Perceptron (KDP) to select the samples exhibiting the largest representation and decision discrepancies, thereby mining critical KWAZ. These modules are integrated into a two-stage mutual learning framework, achieving global class-level representation-decision consistency alignment and local KWAZ-guided refinement, structurally bridging cognitive biases across heterogeneous mutual learning models. Experimental results on multiple datasets and model configurations demonstrate the superior performance of FedKWAZ.

TMLR Journal 2025 Journal Article

Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models

  • Yulei Qin
  • Yuncheng Yang
  • Pengcheng Guo
  • Gang Li
  • Hang Shao
  • Yuchen Shi
  • Zihan Xu
  • Yun Gu

Instruction tuning plays a critical role in aligning large language models (LLMs) with human preference. Despite the vast amount of open instruction datasets, naively training an LLM on all existing instructions may not be optimal and practical. To pinpoint the most beneficial datapoints, data assessment and selection methods have been proposed in the fields of natural language processing (NLP) and deep learning. However, under the context of instruction tuning, there still exists a gap in knowledge on what kind of data evaluation metrics can be employed and how they can be integrated into the selection mechanism. To bridge this gap, we present a comprehensive review on existing literature of data assessment and selection especially for instruction tuning of LLMs. We systematically categorize all applicable methods into quality-based, diversity-based, and importance-based ones where a unified, fine-grained taxonomy is structured. For each category, representative methods are elaborated to describe the landscape of relevant research. In addition, comparison between the latest methods is conducted on their officially reported results to provide in-depth discussions on their limitations. Finally, we summarize the open challenges and propose promising avenues for future studies. All related contents are available at https://github.com/yuleiqin/fantastic-data-engineering.

AAAI Conference 2025 Conference Paper

VA-AR: Learning Velocity-Aware Action Representations with Mixture of Window Attention

  • Jiangning Wei
  • Lixiong Qin
  • Bo Yu
  • Tianjian Zou
  • Chuhan Yan
  • Dandan Xiao
  • Yang Yu
  • Lan Yang

Action recognition is a crucial task in artificial intelligence, with significant implications across various domains. We initially perform a comprehensive analysis of seven prominent action recognition methods across five widely-used datasets. This analysis reveals a critical, yet previously overlooked, observation: as the velocity of actions increases, the performance of these methods variably declines, undermining their robustness. This decline in performance poses significant challenges for their application in real-world scenarios. Building on these findings, we introduce the Velocity-Aware Action Recognition (VA-AR) framework to obtain robust action representations across different velocities. Our principal insight is that rapid actions (e.g., the giant circle backward in uneven bars or a smash in badminton) occur within short time intervals, necessitating smaller temporal attention windows to accurately capture intricate changes. Conversely, slower actions (e.g., drinking water or wiping face) require larger windows to effectively encompass the broader context. VA-AR employs a Mixture of Window Attention (MoWA) strategy, dynamically adjusting its attention window size based on the action's velocity. This adjustment enables VA-AR to obtain a velocity-aware representation, thereby enhancing the accuracy of action recognition. Extensive experiments confirm that VA-AR achieves state-of-the-art performance on the same five datasets, demonstrating VA-AR's effectiveness across a broad spectrum of action recognition scenarios.
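
The velocity-to-window mapping at the heart of the framework can be caricatured in a few lines: a frame-difference velocity proxy selects among candidate temporal window sizes. The thresholds, window sizes, and synthetic motion sequences below are assumptions for illustration; the actual model learns this routing inside a mixture-of-attention architecture.

```python
# Toy velocity-aware window selection: fast actions -> small window, slow actions -> large window.
import numpy as np

def pick_window(frames: np.ndarray, windows=(4, 8, 16)) -> int:
    """Select a temporal attention window from a frame-to-frame change proxy for velocity."""
    velocity = np.linalg.norm(np.diff(frames, axis=0), axis=-1).mean()
    if velocity > 0.5:
        return windows[0]
    if velocity > 0.1:
        return windows[1]
    return windows[2]

rng = np.random.default_rng(0)
smash = np.cumsum(rng.normal(0, 1.0, (32, 17, 2)), axis=0)    # fast, jerky motion over 17 joints
drink = np.cumsum(rng.normal(0, 0.02, (32, 17, 2)), axis=0)   # slow, smooth motion
print("window for fast action:", pick_window(smash))
print("window for slow action:", pick_window(drink))
```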

AAAI Conference 2025 Conference Paper

VersaGen: Unleashing Versatile Visual Control for Text-to-Image Synthesis

  • Zhipeng Chen
  • Lan Yang
  • Yonggang Qi
  • Honggang Zhang
  • Kaiyue Pang
  • Ke Li
  • Yi-Zhe Song

Despite the rapid advancements in text-to-image (T2I) synthesis, enabling precise visual control remains a significant challenge. Existing works attempted to incorporate multi-facet controls (text and sketch), aiming to enhance the creative control over generated images. However, our pilot study reveals that the expressive power of humans far surpasses the capabilities of current methods. Users desire a more versatile approach that can accommodate their diverse creative intents, ranging from controlling individual subjects to manipulating the entire scene composition. We present VersaGen, a generative AI agent that enables versatile visual control in T2I synthesis. VersaGen admits four types of visual controls: i) single visual subject; ii) multiple visual subjects; iii) scene background; iv) any combination of the three above or merely no control at all. We train an adaptor upon a frozen T2I model to accommodate the visual information into the text-dominated diffusion process. We introduce three optimization strategies during the inference phase of VersaGen to improve generation results and enhance user experience. Comprehensive experiments on COCO and Sketchy validate the effectiveness and flexibility of VersaGen, as evidenced by both qualitative and quantitative results.

NeurIPS Conference 2025 Conference Paper

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

  • Chaoyou Fu
  • Haojia Lin
  • Xiong Wang
  • Yifan Zhang
  • Yunhang Shen
  • Xiaoyu Liu
  • Haoyu Cao
  • Zuwei Long

Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a challenge due to the fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing against state-of-the-art counterparts across benchmarks for image, video, and speech, we demonstrate that our omni model is equipped with both strong visual and speech capabilities, enabling omni understanding and interaction.

NeurIPS Conference 2025 Conference Paper

VITA-Audio: Fast Interleaved Audio-Text Token Generation for Efficient Large Speech-Language Model

  • Zuwei Long
  • Yunhang Shen
  • Chaoyou Fu
  • Heting Gao
  • Lijiang Li
  • Peixian Chen
  • Mengdan Zhang
  • Hang Shao

With the growing requirement for natural human-computer interaction, speech-based systems receive increasing attention as speech is one of the most common forms of daily communication. However, the existing speech models still experience high latency when generating the first audio token during streaming, which poses a significant bottleneck for deployment. To address this issue, we propose VITA-Audio, an end-to-end large speech model with fast audio-text token generation. Specifically, we introduce a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens within a single model forward pass, which not only accelerates the inference but also significantly reduces the latency for generating the first audio in streaming scenarios. In addition, a four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality. To our knowledge, VITA-Audio is the first multi-modal large language model capable of generating audio output during the first forward pass, enabling real-time conversational capabilities with minimal latency. VITA-Audio is fully reproducible and is trained on open-source data only. Experimental results demonstrate that our model not only achieves an inference speedup of 3-5x at the 7B parameter scale, but also significantly outperforms open-source models of similar model size on multiple benchmarks for automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering (SQA) tasks.

NeurIPS Conference 2025 Conference Paper

Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs

  • Xudong Li
  • Mengdan Zhang
  • Peixian Chen
  • Xiawu Zheng
  • Yan Zhang
  • Jingyuan Zheng
  • Yunhang Shen
  • Ke Li

Multi-modal Large Language Models (MLLMs) excel at single-image tasks but struggle with multi-image understanding due to cross-modal misalignment, leading to hallucinations (context omission, conflation, and misinterpretation). Existing methods using Direct Preference Optimization (DPO) constrain optimization to a solitary image reference within the input sequence, neglecting holistic context modeling. To address this, we propose Context-to-Cue Direct Preference Optimization (CcDPO), a multi-level preference optimization framework that enhances per-image perception in multi-image settings by zooming into visual clues—from sequential context to local details. Our approach features two sequentially dependent components: (i) Context-Level Optimization: By introducing low-cost sequence preference pairs, we optimize the model to distinguish between complete and disrupted multi-image contexts, thereby correcting cognitive biases in MLLMs’ multi-image understanding. (ii) Needle-Level Optimization: By integrating region-specific visual prompts with multimodal preference supervision, we direct the model’s attention to critical visual details, effectively suppressing perceptual biases toward fine-grained visual information. To support scalable optimization, we also construct MultiScope-42k, an automatically generated multi-image dataset with hierarchical preference pairs. Experiments show that CcDPO significantly reduces hallucinations and yields consistent performance gains across general single- and multi-image tasks. Codes are available at https://github.com/LXDxmu/CcDPO.

JBHI Journal 2024 Journal Article

A Semi-Supervised Multi-Scale Arbitrary Dilated Convolution Neural Network for Pediatric Sleep Staging

  • Zhiqiang Chen
  • Xue Pan
  • Zhifei Xu
  • Ke Li
  • Yudan Lv
  • Yuan Zhang
  • Hongqiang Sun

Sleep staging is essential for assessing sleep quality and diagnosing sleep disorders. However, sleep staging is a labor-intensive process, making it arduous to obtain large quantities of high-quality labeled data for automatic sleep staging. Meanwhile, most of the research on automatic sleep staging pays little attention to pediatric sleep staging. To address these challenges, we propose a semi-supervised multi-scale arbitrary dilated convolution neural network (SMADNet) for pediatric sleep staging using the scalogram with a high height-to-width ratio generated by the continuous wavelet transform (CWT) as input. To extract more extended time dimensional feature representations and adapt to scalograms with a high height-to-width ratio in SMADNet, we introduce a multi-scale arbitrary dilation convolution block (MADBlock) based on our proposed arbitrary dilated convolution (ADConv). Finally, we also utilize semi-supervised learning as the training scheme for our network in order to alleviate the reliance on labeled data. Tested on a private pediatric dataset, our model achieves 79% accuracy, 72% kappa, and 75% MF1 with only 30% of the labels, demonstrating a powerful feature extraction capability and performance comparable to state-of-the-art supervised learning methods.

AAAI Conference 2024 Conference Paper

Constrained Bayesian Optimization under Partial Observations: Balanced Improvements and Provable Convergence

  • Shengbo Wang
  • Ke Li

The partially observable constrained optimization problems (POCOPs) impede data-driven optimization techniques since an infeasible solution of POCOPs can provide little information about the objective as well as the constraints. We endeavor to design an efficient and provable method for expensive POCOPs under the framework of constrained Bayesian optimization. Our method consists of two key components. Firstly, we present an improved design of the acquisition functions that introduce balanced exploration during optimization. We rigorously study the convergence properties of this design to demonstrate its effectiveness. Secondly, we propose Gaussian processes embedding different likelihoods as the surrogate model for partially observable constraints. This model leads to a more accurate representation of the feasible regions compared to traditional classification-based models. Our proposed method is empirically studied on both synthetic and real-world problems. The results demonstrate the competitiveness of our method for solving POCOPs.
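
For orientation, the baseline acquisition that this line of work improves upon multiplies expected improvement by a surrogate's probability of feasibility; the sketch below shows that baseline with a GP regressor for the objective and a GP classifier for pass/fail constraint observations. The toy objective, constraint, and candidate grid are assumptions, and the paper's balanced acquisition and likelihood choices go beyond this.

```python
# Baseline "EI x probability of feasibility" acquisition for constrained BO with
# partially observed (pass/fail only) constraints. Illustrative, not the proposed method.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessClassifier, GaussianProcessRegressor

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x) + 0.5 * x              # objective (to be minimised)
feasible = lambda x: x < 1.5                       # hidden constraint; only pass/fail is observed

X = rng.uniform(0, 3, (15, 1))
y = f(X).ravel()
c = feasible(X).ravel()                            # partial observation: boolean feasibility only

gp_f = GaussianProcessRegressor().fit(X[c], y[c])           # objective observed on feasible points
gp_c = GaussianProcessClassifier().fit(X, c.astype(int))    # classification surrogate for feasibility

grid = np.linspace(0, 3, 200).reshape(-1, 1)
mu, sd = gp_f.predict(grid, return_std=True)
best = y[c].min()
z = (best - mu) / np.maximum(sd, 1e-9)
ei = (best - mu) * norm.cdf(z) + sd * norm.pdf(z)           # expected improvement
p_feas = gp_c.predict_proba(grid)[:, 1]                     # probability of feasibility
print("next evaluation point:", grid[np.argmax(ei * p_feas)])
```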

NeurIPS Conference 2024 Conference Paper

Direct Preference-Based Evolutionary Multi-Objective Optimization with Dueling Bandits

  • Tian Huang
  • Shengbo Wang
  • Ke Li

The ultimate goal of multi-objective optimization (MO) is to assist human decision-makers (DMs) in identifying solutions of interest (SOI) that optimally reconcile multiple objectives according to their preferences. Preference-based evolutionary MO (PBEMO) has emerged as a promising framework that progressively approximates SOI by involving human in the optimization-cum-decision-making process. Yet, current PBEMO approaches are prone to be inefficient and misaligned with the DM’s true aspirations, especially when inadvertently exploiting mis-calibrated reward models. This is further exacerbated when considering the stochastic nature of human feedback. This paper proposes a novel framework that navigates MO to SOI by directly leveraging human feedback without being restricted by a predefined reward model nor cumbersome model selection. Specifically, we developed a clustering-based stochastic dueling bandits algorithm that strategically scales well to high-dimensional dueling bandits, and achieves a regret of $\mathcal{O}(K^2\log T)$, where $K$ is the number of clusters and $T$ is the number of rounds. The learned preferences are then transformed into a unified probabilistic format that can be readily adapted to prevalent EMO algorithms. This also leads to a principled termination criterion that strategically manages human cognitive loads and computational budget. Experiments on $48$ benchmark test problems, including synthetic problems, RNA inverse design and protein structure prediction, fully demonstrate the effectiveness of our proposed approach.
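
The pairwise-feedback loop underlying this setting can be sketched with a Bradley-Terry preference model and a simple optimistic duel-selection rule; the hidden utilities, the selection heuristic, and the Borda-style final estimate below are illustrative assumptions rather than the proposed clustering-based algorithm.

```python
# Toy dueling-bandit loop over candidate solutions with stochastic pairwise human feedback.
import numpy as np

rng = np.random.default_rng(0)
utility = np.array([0.1, 0.4, 0.9, 0.3])            # hidden DM utilities over 4 candidate solutions
K = len(utility)
wins = np.ones((K, K))
games = 2 * np.ones((K, K))                          # optimistic 1-win / 2-game prior per pair

for t in range(1, 2001):
    p_hat = wins / games
    ucb = p_hat + np.sqrt(np.log(t + 1) / games)     # optimistic pairwise win-rate estimates
    np.fill_diagonal(ucb, np.inf)
    np.fill_diagonal(p_hat, np.inf)
    i = int(np.argmax(np.min(ucb, axis=1)))          # candidate that looks good even in its worst duel
    j = int(np.argmin(p_hat[i]))                     # its most challenging opponent so far
    # Stochastic "human" feedback: Bradley-Terry probability that i is preferred over j.
    i_preferred = rng.random() < 1.0 / (1.0 + np.exp(utility[j] - utility[i]))
    wins[i, j] += i_preferred
    games[i, j] += 1
    wins[j, i] += 1 - i_preferred
    games[j, i] += 1

borda = (wins / games).mean(axis=1)                  # Borda-style score per candidate
print("estimated solution of interest:", int(np.argmax(borda)))  # the highest-utility candidate should win
```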

NeurIPS Conference 2024 Conference Paper

PediatricsGPT: Large Language Models as Chinese Medical Assistants for Pediatric Applications

  • Dingkang Yang
  • Jinjie Wei
  • Dongling Xiao
  • Shunli Wang
  • Tong Wu
  • Gang Li
  • Mingcheng Li
  • Shuaibing Wang

Developing intelligent pediatric consultation systems offers promising prospects for improving diagnostic efficiency, especially in China, where healthcare resources are scarce. Despite recent advances in Large Language Models (LLMs) for Chinese medicine, their performance is sub-optimal in pediatric applications due to inadequate instruction data and vulnerable training procedures. To address the above issues, this paper builds PedCorpus, a high-quality dataset of over 300,000 multi-task instructions from pediatric textbooks, guidelines, and knowledge graph resources to fulfil diverse diagnostic demands. Upon well-designed PedCorpus, we propose PediatricsGPT, the first Chinese pediatric LLM assistant built on a systematic and robust training pipeline. In the continuous pre-training phase, we introduce a hybrid instruction pre-training mechanism to mitigate the internal-injected knowledge inconsistency of LLMs for medical domain adaptation. Subsequently, full-parameter Supervised Fine-Tuning (SFT) is utilized to incorporate the general medical knowledge schema into the models. After that, we devise a direct following preference optimization to enhance the generation of pediatrician-like humanistic responses. In the parameter-efficient secondary SFT phase, a mixture of universal-specific experts strategy is presented to resolve the competency conflict between medical generalist and pediatric expertise mastery. Extensive results based on the metrics, GPT-4, and doctor evaluations on distinct downstream tasks show that PediatricsGPT consistently outperforms previous Chinese medical LLMs. The project and data will be released at https://github.com/ydk122024/PediatricsGPT.

NeurIPS Conference 2024 Conference Paper

ProvNeRF: Modeling per Point Provenance in NeRFs as a Stochastic Field

  • Kiyohiro Nakayama
  • Mikaela A. Uy
  • Yang You
  • Ke Li
  • Leonidas J. Guibas

Neural radiance fields (NeRFs) have gained popularity with multiple works showing promising results across various applications. However, to the best of our knowledge, existing works do not explicitly model the distribution of training camera poses, or consequently the triangulation quality, a key factor affecting reconstruction quality dating back to classical vision literature. We close this gap with ProvNeRF, an approach that models the provenance for each point -- i.e., the locations where it is likely visible -- of NeRFs as a stochastic field. We achieve this by extending implicit maximum likelihood estimation (IMLE) to functional space with an optimizable objective. We show that modeling per-point provenance during the NeRF optimization enriches the model with information on triangulation leading to improvements in novel view synthesis and uncertainty estimation under the challenging sparse, unconstrained view setting against competitive baselines. The code will be available at https://github.com/georgeNakayama/ProvNeRF.

AAAI Conference 2024 Conference Paper

Semi-Supervised Blind Image Quality Assessment through Knowledge Distillation and Incremental Learning

  • Wensheng Pan
  • Timin Gao
  • Yan Zhang
  • Xiawu Zheng
  • Yunhang Shen
  • Ke Li
  • Runze Hu
  • Yutao Liu

Blind Image Quality Assessment (BIQA) aims to simulate human assessment of image quality. It has a great demand for labeled data, which is often insufficient in practice. Some researchers employ unsupervised methods to address this issue, but such methods struggle to emulate the human subjective system. To this end, we introduce a unified framework that combines semi-supervised and incremental learning to address the mentioned issue. Specifically, when training data is limited, semi-supervised learning is necessary to exploit extensive unlabeled data. To facilitate semi-supervised learning, we use knowledge distillation to assign pseudo-labels to unlabeled data, preserving analytical capability. To gradually improve the quality of pseudo labels, we introduce incremental learning. However, incremental learning can lead to catastrophic forgetting. We employ Experience Replay by selecting representative samples during multiple rounds of semi-supervised learning, to alleviate forgetting and ensure model stability. Experimental results show that the proposed approach achieves state-of-the-art performance across various benchmark datasets. After being trained on the LIVE dataset, our method can be directly transferred to the CSIQ dataset. Compared with other methods, it significantly outperforms unsupervised methods on the CSIQ dataset with a marginal performance drop (-0.002) on the LIVE dataset. In conclusion, our proposed method demonstrates its potential to tackle the challenges in real-world production processes.

JBHI Journal 2024 Journal Article

Signed Curvature Graph Representation Learning of Brain Networks for Brain Age Estimation

  • Jingming Li
  • Zhengyuan Lyu
  • Hu Yu
  • Si Fu
  • Ke Li
  • Li Yao
  • Xiaojuan Guo

Graph Neural Networks (GNNs) play a pivotal role in learning representations of brain networks for estimating brain age. However, the over-squashing impedes interactions between long-range nodes, hindering the ability of message-passing mechanism-based GNNs to learn the topological structure of brain networks. Graph rewiring methods and curvature GNNs have been proposed to alleviate over-squashing. However, most graph rewiring methods overlook node features and curvature GNNs neglect the geometric properties of signed curvature. In this study, a Signed Curvature GNN (SCGNN) was proposed to rewire the graph based on node features and curvature, and learn the representation of signed curvature. First, a Mutual Information Ollivier-Ricci Flow (MORF) was proposed to add connections in the neighborhood of edge with the minimal negative curvature based on the maximum mutual information between node features, improving the efficiency of information interaction between nodes. Then, a Signed Curvature Convolution (SCC) was proposed to aggregate node features based on positive and negative curvature, facilitating the model's ability to capture the complex topological structures of brain networks. Additionally, an Ollivier-Ricci Gradient Pooling (ORG-Pooling) was proposed to select the key nodes and topology structures by curvature gradient and attention mechanism, accurately obtaining the global representation for brain age estimation. Experiments conducted on six public datasets with structural magnetic resonance imaging (sMRI), spanning ages from 18 to 91 years, validate that our method achieves promising performance compared with existing methods. Furthermore, we employed the gaps between brain age and chronological age for identifying Alzheimer's Disease (AD), yielding the best classification performance.

AAAI Conference 2024 Conference Paper

SoftCLIP: Softer Cross-Modal Alignment Makes CLIP Stronger

  • Yuting Gao
  • Jinfeng Liu
  • Zihan Xu
  • Tong Wu
  • Enwei Zhang
  • Ke Li
  • Jie Yang
  • Wei Liu

During the preceding biennium, vision-language pre-training has achieved noteworthy success on several downstream tasks. Nevertheless, acquiring high-quality image-text pairs, where the pairs are entirely exclusive of each other, remains a challenging task, and noise exists in the commonly used datasets. To address this issue, we propose SoftCLIP, a novel approach that relaxes the strict one-to-one constraint and achieves a soft cross-modal alignment by introducing a softened target, which is generated from the fine-grained intra-modal self-similarity. The intra-modal guidance is indicative, enabling two pairs to have some local similarities and modeling many-to-many relationships between the two modalities. Besides, since the positive still dominates in the softened target distribution, we disentangle the negatives in the distribution to further boost the relation alignment with the negatives in the cross-modal learning. Extensive experiments demonstrate the effectiveness of SoftCLIP. In particular, on ImageNet zero-shot classification task, using CC3M/CC12M as pre-training dataset, SoftCLIP brings a top-1 accuracy improvement of 6.8%/7.2% over the CLIP baseline.
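
A minimal sketch of the softened-target idea with random features: the image-to-image self-similarity distribution replaces the one-hot pairing target, and the cross-modal predictions are pulled towards it with a KL term. The temperatures and features are made up, and the released method additionally disentangles the negatives.

```python
# Hard one-hot contrastive target vs. a softened target from intra-modal self-similarity.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
B, D, tau = 8, 16, 0.1
img = rng.normal(size=(B, D)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = img + 0.3 * rng.normal(size=(B, D)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)

logits = img @ txt.T / tau                         # cross-modal similarities
hard_target = np.eye(B)                            # CLIP's strict one-to-one target
soft_target = softmax(img @ img.T / tau)           # intra-modal self-similarity as a soft target

pred = softmax(logits)
ce_hard = -(hard_target * np.log(pred + 1e-9)).sum(axis=1).mean()
kl_soft = (soft_target * (np.log(soft_target + 1e-9) - np.log(pred + 1e-9))).sum(axis=1).mean()
print("hard one-to-one loss:", round(float(ce_hard), 3), "softened-target loss:", round(float(kl_soft), 3))
```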

AAAI Conference 2024 Conference Paper

SPD-DDPM: Denoising Diffusion Probabilistic Models in the Symmetric Positive Definite Space

  • Yunchen Li
  • Zhou Yu
  • Gaoqi He
  • Yunhang Shen
  • Ke Li
  • Xing Sun
  • Shaohui Lin

Symmetric positive definite (SPD) matrices have shown important value and applications in statistics and machine learning, such as fMRI analysis and traffic prediction. Previous works on SPD matrices mostly focus on discriminative models, where predictions are made directly on E(X|y), where y is a vector and X is an SPD matrix. However, these methods are challenging to handle for large-scale data. In this paper, inspired by denoising diffusion probabilistic model (DDPM), we propose a novel generative model, termed SPD-DDPM, by introducing Gaussian distribution in the SPD space to estimate E(X|y). Moreover, our model can estimate p(X) unconditionally and flexibly without giving y. On the one hand, the model conditionally learns p(X|y) and utilizes the mean of samples to obtain E(X|y) as a prediction. On the other hand, the model unconditionally learns the probability distribution of the data p(X) and generates samples that conform to this distribution. Furthermore, we propose a new SPD net which is much deeper than the previous networks and allows for the inclusion of conditional factors. Experiment results on toy data and real taxi data demonstrate that our models effectively fit the data distribution both unconditionally and conditionally.

AAAI Conference 2024 Conference Paper

Weakly Supervised Open-Vocabulary Object Detection

  • Jianghang Lin
  • Yunhang Shen
  • Bingquan Wang
  • Shaohui Lin
  • Ke Li
  • Liujuan Cao

Despite weakly supervised object detection (WSOD) being a promising step toward evading strong instance-level annotations, its capability is confined to closed-set categories within a single training dataset. In this paper, we propose a novel weakly supervised open-vocabulary object detection framework, namely WSOVOD, to extend traditional WSOD to detect novel concepts and utilize diverse datasets with only image-level annotations. To achieve this, we explore three vital strategies, including dataset-level feature adaptation, image-level salient object localization, and region-level vision-language alignment. First, we perform data-aware feature extraction to produce an input-conditional coefficient, which is leveraged into dataset attribute prototypes to identify dataset bias and help achieve cross-dataset generalization. Second, a customized location-oriented weakly supervised region proposal network is proposed to utilize high-level semantic layouts from the category-agnostic segment anything model to distinguish object boundaries. Lastly, we introduce a proposal-concept synchronized multiple-instance network, i.e., object mining and refinement with visual-semantic alignment, to discover objects matched to the text embeddings of concepts. Extensive experiments on Pascal VOC and MS COCO demonstrate that the proposed WSOVOD achieves new state-of-the-art results compared with previous WSOD methods in both closed-set object localization and detection tasks. Meanwhile, WSOVOD enables cross-dataset and open-vocabulary learning to achieve on-par or even better performance than well-established fully-supervised open-vocabulary object detection (FSOVOD).

AAAI Conference 2023 Conference Paper

Adaptive Hierarchy-Branch Fusion for Online Knowledge Distillation

  • Linrui Gong
  • Shaohui Lin
  • Baochang Zhang
  • Yunhang Shen
  • Ke Li
  • Ruizhi Qiao
  • Bo Ren
  • Muqing Li

Online Knowledge Distillation (OKD) is designed to alleviate the dilemma that a high-capacity pre-trained teacher model may not be available. However, existing methods mostly focus on improving the ensemble prediction accuracy from multiple students (a.k.a. branches) and often overlook the homogenization problem that makes student models saturate quickly and hurts performance. We assume that the intrinsic bottleneck of the homogenization problem comes from the identical branch architecture and the coarse ensemble strategy. We propose a novel Adaptive Hierarchy-Branch Fusion framework for Online Knowledge Distillation, termed AHBF-OKD, which designs hierarchical branches and an adaptive hierarchy-branch fusion module to boost model diversity and aggregate complementary knowledge. Specifically, we first introduce hierarchical branch architectures to construct diverse peers by monotonically increasing the depth of branches on the basis of the target branch. To effectively transfer knowledge from the most complex branch to the simplest target branch, we propose an adaptive hierarchy-branch fusion module that creates hierarchical teacher assistants recursively, regarding the target branch as the smallest teacher assistant. During training, the teacher assistant from the previous hierarchy is explicitly distilled by the teacher assistant and the branch from the current hierarchy. Thus, importance scores are effectively and adaptively allocated to different branches to reduce branch homogenization. Extensive experiments demonstrate the effectiveness of AHBF-OKD on different datasets, including CIFAR-10/100 and ImageNet 2012. For example, on ImageNet 2012, the distilled ResNet-18 achieves a Top-1 error of 29.28%, which significantly outperforms the state-of-the-art methods. The source code is available at https://github.com/linruigong965/AHBF.

NeurIPS Conference 2023 Conference Paper

CamoPatch: An Evolutionary Strategy for Generating Camoflauged Adversarial Patches

  • Phoenix Williams
  • Ke Li

Deep neural networks (DNNs) have demonstrated vulnerabilities to adversarial examples, which raises concerns about their reliability in safety-critical applications. While the majority of existing methods generate adversarial examples by making small modifications to the entire image, recent research has proposed a practical alternative known as adversarial patches. Adversarial patches have been shown to be highly effective in causing DNNs to misclassify by distorting a localized area (patch) of the image. However, existing methods often produce clearly visible distortions since they do not consider the visibility of the patch. To address this, we propose a novel method for constructing adversarial patches that approximates the appearance of the area it covers. We achieve this by using a set of semi-transparent, RGB-valued circles, drawing inspiration from the computational art community. We utilize an evolutionary strategy to optimize the properties of each shape, and employ a simulated annealing approach to optimize the patch's location. Our approach achieves better or comparable performance to state-of-the-art methods on ImageNet DNN classifiers while achieving a lower $l_2$ distance from the original image. By minimizing the visibility of the patch, this work further highlights the vulnerabilities of DNNs to adversarial patches.

NeurIPS Conference 2023 Conference Paper

CAPro: Webly Supervised Learning with Cross-modality Aligned Prototypes

  • Yulei Qin
  • Xingyu Chen
  • Yunhang Shen
  • Chaoyou Fu
  • Yun Gu
  • Ke Li
  • Xing Sun
  • Rongrong Ji

Webly supervised learning has attracted increasing attention for its effectiveness in exploring publicly accessible data at scale without manual annotation. However, most existing methods of learning with web datasets face challenges from label noise, and they rely on limited assumptions about clean samples under various types of noise. For instance, web images retrieved with the queries "tiger cat" (a cat species) and "drumstick" (a musical instrument) are almost dominated by images of tigers and chickens, which exacerbates the challenge of fine-grained visual concept learning. In this case, exploiting both web images and their associated texts is a requisite solution to combat real-world noise. In this paper, we propose Cross-modality Aligned Prototypes (CAPro), a unified prototypical contrastive learning framework to learn visual representations with correct semantics. For one thing, we leverage textual prototypes, which stem from the distinct concept definitions of classes, to select clean images by text matching and thus disambiguate the formation of visual prototypes. For another, to handle missing and mismatched noisy texts, we resort to the visual feature space to complete and enhance individual texts and thereafter improve text matching. Such semantically aligned visual prototypes are further polished with high-quality samples and engaged in both cluster regularization and noise removal. Besides, we propose collective bootstrapping to encourage smoother and more reliable label reference from appearance-similar instances in a dictionary look-up manner. Extensive experiments on WebVision1k and NUS-WIDE (Web) demonstrate that CAPro handles realistic noise well under both single-label and multi-label scenarios. CAPro achieves new state-of-the-art performance and exhibits robustness to open-set recognition. Code is available at https://github.com/yuleiqin/capro.

AAAI Conference 2023 Conference Paper

CF-ViT: A General Coarse-to-Fine Method for Vision Transformer

  • Mengzhao Chen
  • Mingbao Lin
  • Ke Li
  • Yunhang Shen
  • Yongjian Wu
  • Fei Chao
  • Rongrong Ji

Vision Transformers (ViT) have made many breakthroughs in computer vision tasks. However, considerable redundancy arises in the spatial dimension of an input image, leading to massive computational costs. In this paper, we therefore propose a coarse-to-fine vision transformer (CF-ViT) to relieve the computational burden while retaining performance. Our proposed CF-ViT is motivated by two important observations in modern ViT models: (1) the coarse-grained patch splitting can locate informative regions of an input image; (2) most images can be well recognized by a ViT model with a short token sequence. Therefore, our CF-ViT implements network inference in a two-stage manner. At the coarse inference stage, an input image is split into a short patch sequence for a computationally economical classification. If not well recognized, the informative patches are identified and further re-split at a finer granularity. Extensive experiments demonstrate the efficacy of our CF-ViT. For example, without any compromise on performance, CF-ViT reduces the FLOPs of LV-ViT by 53% and also achieves a 2.01x throughput speedup. Code of this project is at https://github.com/ChenMnZ/CF-V
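
A minimal sketch of the two-stage inference logic described above, assuming a hypothetical `vit(image, patch_size=...)` interface that returns class logits for a single image; the confidence threshold and patch sizes are illustrative, and the paper's model re-splits only the informative patches, which this sketch omits.

```python
# Hypothetical sketch: accept a cheap coarse pass when confident, otherwise
# fall back to a finer (more expensive) patch splitting.
import torch

@torch.no_grad()
def coarse_to_fine_predict(vit, image, coarse_patch=32, fine_patch=16, threshold=0.9):
    coarse_logits = vit(image, patch_size=coarse_patch)   # cheap: fewer tokens
    conf, pred = coarse_logits.softmax(dim=-1).max(dim=-1)
    if conf.item() >= threshold:
        return pred                                       # accept the coarse stage
    fine_logits = vit(image, patch_size=fine_patch)       # expensive: more tokens
    return fine_logits.argmax(dim=-1)
```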

IJCAI Conference 2023 Conference Paper

Exploring Structural Similarity in Fitness Landscapes via Graph Data Mining: A Case Study on Number Partitioning Problems

  • Mingyu Huang
  • Ke Li

One of the most common problem-solving heuristics is reasoning by analogy. For a given problem, a solver can be viewed as a strategic walk on its fitness landscape. Thus, if a solver works for one problem instance, we expect it to also be effective for other instances whose fitness landscapes essentially share structural similarities. However, due to the black-box nature of combinatorial optimization, it is far from trivial to infer such similarity in real-world scenarios. To bridge this gap, using local optima networks as a proxy for fitness landscapes, this paper proposes to leverage graph data mining techniques to conduct qualitative and quantitative analyses that explore the latent topological structural information embedded in those landscapes. In our experiments, we use the number partitioning problem as a case study, and our empirical results support the overall assumption that structural similarity exists between landscapes of neighboring dimensions. In addition, experiments with simulated annealing demonstrate that the performance of a meta-heuristic solver is similar on structurally similar landscapes.

NeurIPS Conference 2023 Conference Paper

Learning from Visual Observation via Offline Pretrained State-to-Go Transformer

  • Bohan Zhou
  • Ke Li
  • Jiechuan Jiang
  • Zongqing Lu

Learning from visual observation (LfVO), which aims at recovering policies from visual observation data only, is a promising yet challenging problem. Existing LfVO approaches either adopt inefficient online learning schemes or require additional task-specific information like goal states, making them unsuited for open-ended tasks. To address these issues, we propose a two-stage framework for learning from visual observation. In the first stage, we introduce and pretrain a State-to-Go (STG) Transformer offline to predict and differentiate latent transitions of demonstrations. Subsequently, in the second stage, the STG Transformer provides intrinsic rewards for downstream reinforcement learning tasks where an agent learns merely from intrinsic rewards. Empirical results on Atari and Minecraft show that our proposed method outperforms baselines and in some tasks even achieves performance comparable to the policy learned from environmental rewards. These results shed light on the potential of utilizing video-only data to solve difficult visual reinforcement learning tasks rather than relying on complete offline datasets containing states, actions, and rewards. The project's website and code can be found at https://sites.google.com/view/stgtransformer.

NeurIPS Conference 2023 Conference Paper

Multi-modal Queried Object Detection in the Wild

  • Yifan Xu
  • Mengdan Zhang
  • Chaoyou Fu
  • Peixian Chen
  • Xiaoshan Yang
  • Ke Li
  • Changsheng Xu

We introduce MQ-Det, an efficient architecture and pre-training strategy design that utilizes both textual descriptions with open-set generalization and visual exemplars with rich description granularity as category queries, namely, Multi-modal Queried object Detection, for real-world detection with both open-vocabulary categories and various granularities. MQ-Det incorporates vision queries into existing well-established language-queried-only detectors. A plug-and-play gated class-scalable perceiver module upon the frozen detector is proposed to augment category text with class-wise visual information. To address the learning inertia problem brought by the frozen detector, a vision-conditioned masked language prediction strategy is proposed. MQ-Det's simple yet effective architecture and training strategy design is compatible with most language-queried object detectors, thus yielding versatile applications. Experimental results demonstrate that multi-modal queries largely boost open-world detection. For instance, MQ-Det significantly improves the state-of-the-art open-set detector GLIP by +7.8% AP on the LVIS benchmark via multi-modal queries without any downstream finetuning, and by an average of +6.3% AP on 13 few-shot downstream tasks, with merely an additional 3% of the modulating time required by GLIP. Code is available at https://github.com/YifanXu74/MQ-Det.

NeurIPS Conference 2023 Conference Paper

NeRF Revisited: Fixing Quadrature Instability in Volume Rendering

  • Mikaela Angelina Uy
  • Kiyohiro Nakayama
  • Guandao Yang
  • Rahul Thomas
  • Leonidas J. Guibas
  • Ke Li

Neural radiance fields (NeRF) rely on volume rendering to synthesize novel views. Volume rendering requires evaluating an integral along each ray, which is numerically approximated with a finite sum that corresponds to the exact integral along the ray under piecewise constant volume density. As a consequence, the rendered result is unstable w.r.t. the choice of samples along the ray, a phenomenon that we dub quadrature instability. We propose a mathematically principled solution by reformulating the sample-based rendering equation so that it corresponds to the exact integral under piecewise linear volume density. This simultaneously resolves multiple issues: conflicts between samples along different rays, imprecise hierarchical sampling, and non-differentiability of quantiles of ray termination distances w.r.t. model parameters. We demonstrate several benefits over the classical sample-based rendering equation, such as sharper textures, better geometric reconstruction, and stronger depth supervision. Our proposed formulation can also be used as a drop-in replacement for the volume rendering equation of existing NeRF-based methods. Our project page can be found at pl-nerf.github.io.
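
To make the contrast concrete, the sketch below computes per-sample rendering weights along one ray, once with the usual piecewise-constant density quadrature and once with a piecewise-linear (trapezoid-rule) optical depth per segment. The toy densities, function names, and the trapezoid simplification are assumptions for illustration only, not the paper's exact reformulation.

```python
# Hypothetical sketch: weight_i = T_i * (1 - exp(-tau_i)), where tau_i is the
# optical depth of segment i and T_i the transmittance accumulated before it.
import numpy as np

def weights_piecewise_constant(sigma, t):
    delta = np.diff(t)                                   # segment lengths
    tau = sigma[:-1] * delta                             # constant density per segment
    T = np.exp(-np.concatenate([[0.0], np.cumsum(tau)[:-1]]))
    return T * (1.0 - np.exp(-tau))

def weights_piecewise_linear(sigma, t):
    delta = np.diff(t)
    tau = 0.5 * (sigma[:-1] + sigma[1:]) * delta         # trapezoid rule per segment
    T = np.exp(-np.concatenate([[0.0], np.cumsum(tau)[:-1]]))
    return T * (1.0 - np.exp(-tau))

t = np.linspace(0.0, 1.0, 9)                             # sample locations along the ray
sigma = np.abs(np.sin(6 * t)) * 5.0                      # toy density values at the samples
print(weights_piecewise_constant(sigma, t))
print(weights_piecewise_linear(sigma, t))
```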

NeurIPS Conference 2023 Conference Paper

PAPR: Proximity Attention Point Rendering

  • Yanshu Zhang
  • Shichong Peng
  • Alireza Moazeni
  • Ke Li

Learning accurate and parsimonious point cloud representations of scene surfaces from scratch remains a challenge in 3D representation learning. Existing point-based methods often suffer from the vanishing gradient problem or require a large number of points to accurately model scene geometry and texture. To address these limitations, we propose Proximity Attention Point Rendering (PAPR), a novel method that consists of a point-based scene representation and a differentiable renderer. Our scene representation uses a point cloud where each point is characterized by its spatial position, influence score, and view-independent feature vector. The renderer selects the relevant points for each ray and produces accurate colours using their associated features. PAPR effectively learns point cloud positions to represent the correct scene geometry, even when the initialization drastically differs from the target geometry. Notably, our method captures fine texture details while using only a parsimonious set of points. We also demonstrate four practical applications of our method: zero-shot geometry editing, object manipulation, texture transfer, and exposure control. More results and code are available on our project website at https://zvict.github.io/papr/.

AAAI Conference 2023 Conference Paper

Practical Cross-System Shilling Attacks with Limited Access to Data

  • Meifang Zeng
  • Ke Li
  • Bingchuan Jiang
  • Liujuan Cao
  • Hui Li

In shilling attacks, an adversarial party injects a few fake user profiles into a Recommender System (RS) so that the target item can be promoted or demoted. Although much effort has been devoted to developing shilling attack methods, we find that existing approaches are still far from practical. In this paper, we analyze the properties a practical shilling attack method should have and propose a new concept of Cross-system Attack. With the idea of Cross-system Attack, we design a Practical Cross-system Shilling Attack (PC-Attack) framework that requires little information about the victim RS model and the target RS data for conducting attacks. PC-Attack is trained to capture graph topology knowledge from public RS data in a self-supervised manner. Then, it is fine-tuned on a small portion of target data that is easy to access to construct fake profiles. Extensive experiments have demonstrated the superiority of PC-Attack over state-of-the-art baselines. Our implementation of PC-Attack is available at https://github.com/KDEGroup/PC-Attack.

NeurIPS Conference 2023 Conference Paper

“Why Not Looking backward?” A Robust Two-Step Method to Automatically Terminate Bayesian Optimization

  • Shuang Li
  • Ke Li
  • Wei Li

Bayesian Optimization (BO) is a powerful method for tackling expensive black-box optimization problems. As a sequential model-based optimization strategy, BO iteratively explores promising solutions until a predetermined budget, either iterations or time, is exhausted. The decision on when to terminate BO significantly influences both the quality of solutions and its computational efficiency. In this paper, we propose a simple, yet theoretically grounded, two-step method for automatically terminating BO. Our core concept is to proactively identify whether the search is within a convex region by examining previously observed samples. BO is halted once the local regret within this convex region falls below a predetermined threshold. To enhance numerical stability, we propose an approximation method for calculating the termination indicator by solving a bilevel optimization problem. We conduct extensive empirical studies on diverse benchmark problems, including synthetic functions, reinforcement learning, and hyperparameter optimization. Experimental results demonstrate that our proposed method saves up to $\approx 80\%$ of the computational budget while incurring an order of magnitude smaller performance degradation compared with the other peer methods. In addition, our proposed termination method is robust to the setting of its termination criterion.

NeurIPS Conference 2022 Conference Paper

CHIMLE: Conditional Hierarchical IMLE for Multimodal Conditional Image Synthesis

  • Shichong Peng
  • Seyed Alireza Moazenipourasil
  • Ke Li

A persistent challenge in conditional image synthesis has been to generate diverse output images from the same input image despite only one output image being observed per input image. GAN-based methods are prone to mode collapse, which leads to low diversity. To get around this, we leverage Implicit Maximum Likelihood Estimation (IMLE), which can overcome mode collapse fundamentally. IMLE uses the same generator as GANs but trains it with a different, non-adversarial objective which ensures each observed image has a generated sample nearby. Unfortunately, to generate high-fidelity images, prior IMLE-based methods require a large number of samples, which is expensive. In this paper, we propose a new method to get around this limitation, which we dub Conditional Hierarchical IMLE (CHIMLE), which can generate high-fidelity images without requiring many samples. We show CHIMLE significantly outperforms the prior best IMLE-, GAN- and diffusion-based methods in terms of image fidelity and mode coverage across four tasks, namely night-to-day, 16x single image super-resolution, image colourization and image decompression. Quantitatively, our method improves Fréchet Inception Distance (FID) by 36.9% on average compared to the prior best IMLE-based method, and by 27.5% on average compared to the best non-IMLE-based general-purpose methods. More results and code are available on the project website at https://niopeng.github.io/CHIMLE/.
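
To make the "each observed image has a generated sample nearby" objective concrete, here is a minimal unconditional IMLE-style loss, assuming a generic generator and flattened images; the sample count, distance metric, and architecture are illustrative assumptions, and CHIMLE's conditional hierarchical sampling is not shown.

```python
# Hypothetical sketch: for every data point, sample several latents and penalise
# only the distance to the nearest generated sample, discouraging mode collapse.
import torch

def imle_loss(generator, data_batch, num_samples=8, latent_dim=64):
    B = data_batch.size(0)
    z = torch.randn(B, num_samples, latent_dim)
    fake = generator(z.view(B * num_samples, latent_dim)).view(B, num_samples, -1)
    dists = torch.cdist(data_batch.view(B, 1, -1), fake).squeeze(1)   # (B, num_samples)
    nearest = dists.min(dim=1).values                                 # closest sample per datum
    return nearest.pow(2).mean()

generator = torch.nn.Sequential(torch.nn.Linear(64, 128), torch.nn.ReLU(),
                                torch.nn.Linear(128, 784))
loss = imle_loss(generator, torch.randn(4, 784))
loss.backward()
```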

TIST Journal 2022 Journal Article

DeepExpress: Heterogeneous and Coupled Sequence Modeling for Express Delivery Prediction

  • Siyuan Ren
  • Bin Guo
  • Longbing Cao
  • Ke Li
  • Jiaqi Liu
  • Zhiwen Yu

The prediction of express delivery sequences, i.e., modeling and estimating the volumes of daily incoming and outgoing parcels for delivery, is critical for online business, logistics, and a positive customer experience, and specifically for resource allocation optimization and promotional activity arrangement. A precise estimate of consumer delivery requests has to involve sequential factors such as shopping behaviors, weather conditions, events, business campaigns, and their couplings. Although various methods have integrated external features to enhance performance, existing works fail to address complex feature-sequence couplings in the following respects: they weaken the inter-dependencies when processing heterogeneous data and ignore the cumulative and evolving nature of coupling relationships. To address these issues, we propose DeepExpress, a deep-learning-based express delivery sequence prediction model, which extends the classic seq2seq framework to learn feature-sequence couplings. DeepExpress leverages an express delivery seq2seq learning scheme, a carefully designed heterogeneous feature representation, and a novel joint training attention mechanism to adaptively handle heterogeneity issues and capture feature-sequence couplings for accurate prediction. Experimental results on real-world data demonstrate that the proposed method outperforms both shallow and deep baseline models.

TIST Journal 2022 Journal Article

Dynamic Probabilistic Graphical Model for Progressive Fake News Detection on Social Media Platform

  • Ke Li
  • Bin Guo
  • Jiaqi Liu
  • Jiangtao Wang
  • Haoyang Ren
  • Fei Yi
  • Zhiwen Yu

Recently, fake news has been readily spread by massive numbers of users on social media, and automatic fake news detection has become necessary. Existing works need the overall data to be prepared before performing detection, losing important information about the dynamic evolution of crowd opinions, and they usually neglect the uneven arrival of data in the real world. To address these issues, in this article we focus on one kind of approach to fake news detection, namely progressive detection, which can be achieved with a dynamic Probabilistic Graphical Model. Based on observations of real-world datasets, we adaptively improve the Kalman Filter into the Labeled Variable Dimension Kalman Filter (LVDKF), which learns two universal patterns from true and fake news, respectively, and can capture the temporal information of time-series data that arrive unevenly. It can take sequential data as input, distill the dynamic evolution knowledge regarding a post, and utilize crowd wisdom from users' responses to achieve progressive detection. We then derive the formulas using the forward, backward, and EM algorithms, and we design a dynamic detection algorithm using Bayes' theorem. Finally, we design experimental scenarios simulating progressive detection and evaluate LVDKF on two public datasets. It outperforms the baseline methods in these experimental scenarios, which indicates that it is adequate for progressive detection.

AAAI Conference 2022 Conference Paper

Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer

  • Yifan Xu
  • Zhijie Zhang
  • Mengdan Zhang
  • Kekai Sheng
  • Ke Li
  • Weiming Dong
  • Liqing Zhang
  • Changsheng Xu

Vision transformers (ViTs) have attracted considerable research attention recently, but their huge computational cost is still a severe issue. A mainstream paradigm for computation reduction aims to reduce the number of tokens, given that the computational complexity of ViTs is quadratic with respect to the input sequence length. Existing designs include structured spatial compression, which uses a progressive shrinking pyramid to reduce the computations of large feature maps, and unstructured token pruning, which dynamically drops redundant tokens. However, the limitations of existing token pruning lie in the following aspects: 1) the incomplete spatial structure caused by pruning is incompatible with the structured spatial compression commonly used in modern deep-narrow transformers; 2) it usually requires a time-consuming pretraining procedure. To address these limitations and expand the applicable scenarios of token pruning, we present Evo-ViT, a self-motivated slow-fast token evolution approach for vision transformers. Specifically, we conduct unstructured instance-wise token selection by taking advantage of the simple and effective global class attention that is native to vision transformers. Then, we propose to update the selected informative tokens and the uninformative tokens with different computation paths, namely, slow-fast updating. Since the slow-fast updating mechanism maintains the spatial structure and information flow, Evo-ViT can accelerate vanilla transformers of both flat and deep-narrow structures from the very beginning of the training process. Experimental results demonstrate that our method significantly reduces the computational cost of vision transformers while maintaining comparable performance on image classification. For example, our method accelerates DeiT-S by over 60% in throughput while sacrificing only 0.4% top-1 accuracy on ImageNet-1K, outperforming current token pruning methods in both accuracy and efficiency.
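
A small sketch of the token-selection idea described above, assuming tensors of pre-computed patch tokens and class-token attention; the function name, keep ratio, and the mean-pooled summary token are illustrative assumptions, not the Evo-ViT training pipeline.

```python
# Hypothetical sketch: keep the most CLS-attended tokens on a "slow" full-computation
# path and summarise the rest into one token for a cheap "fast" path.
import torch

def split_tokens_by_cls_attention(tokens, cls_attn, keep_ratio=0.5):
    """tokens: (B, N, D) patch tokens; cls_attn: (B, N) attention from the CLS token."""
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    idx = cls_attn.topk(k, dim=1).indices                         # most-attended tokens
    keep_mask = torch.zeros(B, N, dtype=torch.bool).scatter_(1, idx, True)

    informative = tokens[keep_mask].view(B, k, D)                 # slow path: full updates
    uninformative = tokens[~keep_mask].view(B, N - k, D)
    summary = uninformative.mean(dim=1, keepdim=True)             # fast path: one summary token
    return informative, summary

tokens, cls_attn = torch.randn(2, 196, 384), torch.rand(2, 196)
slow, fast = split_tokens_by_cls_attention(tokens, cls_attn)
```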

NeurIPS Conference 2022 Conference Paper

Learning Best Combination for Efficient N:M Sparsity

  • Yuxin Zhang
  • Mingbao Lin
  • Zhihang Lin
  • Yiting Luo
  • Ke Li
  • Fei Chao
  • Yongjian Wu
  • Rongrong Ji

By forcing N out of M consecutive weights to be non-zero, the recent N:M fine-grained network sparsity has received increasing attention with its two attractive advantages over traditional irregular network sparsity methods: 1) promising performance at a high sparsity; 2) significant speedups when performed on NVIDIA A100 GPUs. Current implementations of N:M sparsity require a tedious pre-training phase or computationally heavy from-scratch training. To circumvent these problems, this paper presents an efficient solution for achieving N:M fine-grained sparsity from scratch. Specifically, we first make a re-formulation to convert the N:M fine-grained sparsity into a combinatorial problem, in which the objective is to choose the best weight combination among $C_M^N$ candidates. Then, we equip each combination with a learnable importance score, which can be jointly optimized along with its associated weights. Through rigorous proof, we demonstrate that the magnitude of the optimized score well reflects the importance of its corresponding weight combination to the training loss. Therefore, by gradually removing combinations with smaller scores until the best one is left, N:M fine-grained sparsity can be efficiently optimized during the normal training phase without any extra expenditure. Comprehensive experimental results demonstrate that our proposed method for learning the best combination, dubbed LBC, consistently increases the efficacy of off-the-shelf N:M methods across varying networks and datasets. Our project is released at https://github.com/zyxxmu/LBC.
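
A toy illustration of the combination space for the N=2, M=4 case: every 2-out-of-4 pattern in a weight group gets a score, and the current best-scoring pattern determines the mask. The hard argmax used here does not pass gradients to the scores, whereas the paper optimizes scores jointly with the weights and gradually removes low-scoring combinations; all names below are illustrative assumptions.

```python
# Hypothetical sketch: enumerate the C(M, N) candidate masks and apply the one
# with the highest learnable score to a group of M consecutive weights.
from itertools import combinations
import torch

N, M = 2, 4
combos = list(combinations(range(M), N))                 # the C(M, N) candidate patterns

def nm_mask_from_scores(weight_group, scores):
    """weight_group: (M,) weights of one group; scores: (len(combos),) learnable scores."""
    best = int(scores.argmax())
    mask = torch.zeros(M)
    mask[list(combos[best])] = 1.0                       # keep N weights, zero the rest
    return weight_group * mask

weights = torch.randn(M, requires_grad=True)
scores = torch.randn(len(combos), requires_grad=True)
pruned = nm_mask_from_scores(weights, scores)
```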

NeurIPS Conference 2022 Conference Paper

Micro and Macro Level Graph Modeling for Graph Variational Auto-Encoders

  • Kiarash Zahirnia
  • Oliver Schulte
  • Parmis Naddaf
  • Ke Li

Generative models for graph data are an important research topic in machine learning. Graph data comprise two levels that are typically analyzed separately: node-level properties such as the existence of a link between a pair of nodes, and global aggregate graph-level statistics, such as motif counts. This paper proposes a new multi-level framework that jointly models node-level properties and graph-level statistics, as mutually reinforcing sources of information. We introduce a new micro-macro training objective for graph generation that combines node-level and graph-level losses. We utilize the micro-macro objective to improve graph generation with a GraphVAE, a well-established model based on graph-level latent variables, that provides fast training and generation time for medium-sized graphs. Our experiments show that adding micro-macro modeling to the GraphVAE model improves graph quality scores up to 2 orders of magnitude on five benchmark datasets, while maintaining the GraphVAE generation speed advantage.

NeurIPS Conference 2022 Conference Paper

PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining

  • Yuting Gao
  • Jinfeng Liu
  • Zihan Xu
  • Jun Zhang
  • Ke Li
  • Rongrong Ji
  • Chunhua Shen

Large-scale vision-language pre-training has achieved promising results on downstream tasks. Existing methods highly rely on the assumption that the image-text pairs crawled from the Internet are in perfect one-to-one correspondence. However, in real scenarios, this assumption is difficult to satisfy: the text description, obtained by crawling the affiliated metadata of the image, often suffers from semantic mismatch and mutual compatibility issues. To address these issues, we introduce PyramidCLIP, which constructs an input pyramid with different semantic levels for each modality, and aligns visual elements and linguistic elements in the form of a hierarchy via peer-level semantics alignment and cross-level relation alignment. Furthermore, we soften the loss of negative samples (unpaired samples) so as to weaken the strict constraint during the pre-training stage, thus mitigating the risk of forcing the model to distinguish compatible negative pairs. Experiments on five downstream tasks demonstrate the effectiveness of the proposed PyramidCLIP. In particular, with the same 15 million pre-training image-text pairs, PyramidCLIP exceeds CLIP in ImageNet zero-shot classification top-1 accuracy by 10.6%/13.2%/10.0% with ResNet50/ViT-B32/ViT-B16 image encoders, respectively. When scaling to larger datasets, PyramidCLIP achieves state-of-the-art results on several downstream tasks. In particular, the results of PyramidCLIP-ResNet50 trained on 143M image-text pairs surpass those of CLIP trained on 400M data on the ImageNet zero-shot classification task, significantly improving the data efficiency of CLIP.

IJCAI Conference 2022 Conference Paper

Towards Controlling the Transmission of Diseases: Continuous Exposure Discovery over Massive-Scale Moving Objects

  • Ke Li
  • Lisi Chen
  • Shuo Shang
  • Haiyan Wang
  • Yang Liu
  • Panos Kalnis
  • Bin Yao

Infectious diseases have been recognized as major public health concerns for decades. Close contact discovery is playing an indispensable role in preventing epidemic transmission. In this light, we study the continuous exposure search problem: Given a collection of moving objects and a collection of moving queries, we continuously discover all objects that have been directly and indirectly exposed to at least one query over a period of time. Our problem targets a variety of applications, including but not limited to disease control, epidemic pre-warning, information spreading, and co-movement mining. To answer this problem, we develop an exact group processing algorithm with optimization strategies. Further, we propose an approximate algorithm that substantially improves the efficiency without false dismissal. Extensive experiments offer insight into effectiveness and efficiency of our proposed algorithms.
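
As a simplified batch version of the continuous problem (with an assumed toy data model of timestamped 2D positions, not the paper's algorithms), the sketch below treats objects within a contact radius at the same timestamp as directly exposed and propagates exposure transitively over time to cover indirect exposure.

```python
# Hypothetical sketch: propagate an "exposed" set through per-timestamp contact checks,
# iterating to a fixpoint within each timestamp so exposure chains are captured.
def exposed_objects(positions, query_ids, radius):
    """positions: dict t -> {object_id: (x, y)}; query_ids: initially exposed objects."""
    exposed = set(query_ids)
    for t in sorted(positions):
        snapshot = positions[t]
        changed = True
        while changed:
            changed = False
            exposed_pts = [snapshot[e] for e in exposed if e in snapshot]
            for oid, (x, y) in snapshot.items():
                if oid in exposed:
                    continue
                if any((x - ex) ** 2 + (y - ey) ** 2 <= radius ** 2 for ex, ey in exposed_pts):
                    exposed.add(oid)
                    changed = True
    return exposed - set(query_ids)

positions = {0: {"q": (0, 0), "a": (1, 0), "b": (5, 5)}, 1: {"a": (4, 5), "b": (5, 5)}}
print(exposed_objects(positions, ["q"], radius=1.5))   # 'a' directly at t=0, then 'b' via 'a' at t=1
```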

AAAI Conference 2021 Conference Paper

Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion

  • Jinpeng Wang
  • Yuting Gao
  • Ke Li
  • Jianguo Hu
  • Xinyang Jiang
  • Xiaowei Guo
  • Rongrong Ji
  • Xing Sun

One significant factor we expect video representation learning to capture, especially in contrast with image representation learning, is object motion. However, we found that in current mainstream video datasets, some action categories are highly related to the scene where the action happens, making the model tend to degrade to a solution where only the scene information is encoded. For example, a trained model may predict a video as playing football simply because it sees the field, neglecting that the subject is dancing as a cheerleader on the field. This contradicts the original goal of video representation learning and may introduce a non-negligible scene bias on a different dataset. In order to tackle this problem, we propose to decouple the scene and the motion (DSM) with two simple operations, so that more model attention is paid to the motion information. Specifically, we construct a positive clip and a negative clip for each video. Compared to the original video, the positive/negative is motion-untouched/broken but scene-broken/untouched by Spatial Local Disturbance and Temporal Local Disturbance. Our objective is to pull the positive closer to the original clip in the latent space while pushing the negative farther away. In this way, the impact of the scene is weakened while the temporal sensitivity of the network is further enhanced. We conduct experiments on two tasks with various backbones and different pre-training datasets, and find that our method surpasses the SOTA methods with remarkable 8.1% and 8.8% improvements on the action recognition task on the UCF101 and HMDB51 datasets, respectively, using the same backbone.

AAAI Conference 2021 Conference Paper

One for More: Selecting Generalizable Samples for Generalizable ReID Model

  • Enwei Zhang
  • Xinyang Jiang
  • Hao Cheng
  • Ancong Wu
  • Fufu Yu
  • Ke Li
  • Xiaowei Guo
  • Feng Zheng

Current training objectives of existing person Re-IDentification (ReID) models only ensure that the loss of the model decreases on the selected training batch, with no regard to the performance on samples outside the batch. This will inevitably cause the model to over-fit the data in the dominant position (e.g., head data in imbalanced classes, easy samples, or noisy samples). The latest resampling methods address the issue by designing specific criteria to select samples that train the model to generalize better on certain types of data (e.g., hard samples, tail data), which is not adaptive to the inconsistent real-world ReID data distributions. Therefore, instead of simply presuming what samples are generalizable, this paper proposes a one-for-more training objective that directly takes the generalization ability of selected samples as a loss function and learns a sampler to automatically select generalizable samples. More importantly, our proposed one-for-more-based sampler can be seamlessly integrated into the ReID training framework, making it possible to train ReID models and the sampler simultaneously in an end-to-end fashion. The experimental results show that our method can effectively improve ReID model training and boost the performance of ReID models.

IJCAI Conference 2021 Conference Paper

Traffic Congestion Alleviation over Dynamic Road Networks: Continuous Optimal Route Combination for Trip Query Streams

  • Ke Li
  • Lisi Chen
  • Shuo Shang
  • Panos Kalnis
  • Bin Yao

Route planning and recommendation have attracted much attention for decades. In this paper, we study a continuous optimal route combination problem: Given a dynamic road network and a stream of trip queries, we continuously find an optimal route combination for each new query batch over the query stream such that the total travel time for all routes is minimized. Each route corresponds to a planning result for a particular trip query in the current query batch. Our problem targets a variety of applications, including traffic-flow management, real-time route planning and continuous congestion prevention. The exact algorithm bears exponential time complexity and is computationally prohibitive for application scenarios in dynamic traffic networks. To address this problem, a self-aware batch processing algorithm is developed in this paper. Extensive experiments offer insight into the accuracy and efficiency of our proposed algorithms.

NeurIPS Conference 2021 Conference Paper

Variational Model Inversion Attacks

  • Kuan-Chieh Wang
  • Yan Fu
  • Ke Li
  • Ashish Khisti
  • Richard Zemel
  • Alireza Makhzani

Given the ubiquity of deep neural networks, it is important that these models do not reveal information about sensitive data that they have been trained on. In model inversion attacks, a malicious user attempts to recover the private dataset used to train a supervised neural network. A successful model inversion attack should generate realistic and diverse samples that accurately describe each of the classes in the private dataset. In this work, we provide a probabilistic interpretation of model inversion attacks, and formulate a variational objective that accounts for both diversity and accuracy. In order to optimize this variational objective, we choose a variational family defined in the code space of a deep generative model, trained on a public auxiliary dataset that shares some structural similarity with the target dataset. Empirically, our method substantially improves performance in terms of target attack accuracy, sample realism, and diversity on datasets of faces and chest X-ray images.

AAAI Conference 2020 Conference Paper

Asymmetric Co-Teaching for Unsupervised Cross-Domain Person Re-Identification

  • Fengxiang Yang
  • Ke Li
  • Zhun Zhong
  • Zhiming Luo
  • Xing Sun
  • Hao Cheng
  • Xiaowei Guo
  • Feiyue Huang

Person re-identification (re-ID) is a challenging task due to the high variance within identity samples and imaging conditions. Although recent advances in deep learning have achieved remarkable accuracy in settled scenes, i.e., the source domain, few works can generalize well to an unseen target domain. One popular solution is assigning unlabeled target images pseudo labels by clustering and then retraining the model. However, clustering methods tend to introduce noisy labels and discard low-confidence samples as outliers, which may hinder the retraining process and thus limit the generalization ability. In this study, we argue that by explicitly adding a sample filtering procedure after the clustering, the mined examples can be used much more efficiently. To this end, we design an asymmetric co-teaching framework, which resists noisy labels by having two models cooperate to select data with possibly clean labels for each other. Meanwhile, one of the models receives samples that are as pure as possible, while the other takes in samples that are as diverse as possible. This procedure encourages the selected training samples to be both clean and miscellaneous, and the two models to promote each other iteratively. Extensive experiments show that the proposed framework can consistently benefit most clustering-based methods and boost the state-of-the-art adaptation accuracy. Our code is available at https://github.com/FlyingRoastDuck/ACT_AAAI20.

NeurIPS Conference 2020 Conference Paper

Pruning Filter in Filter

  • Fanxu Meng
  • Hao Cheng
  • Ke Li
  • Huixiang Luo
  • Xiaowei Guo
  • Guangming Lu
  • Xing Sun

Pruning has become a very powerful and effective technique to compress and accelerate modern neural networks. Existing pruning methods can be grouped into two categories: filter pruning (FP) and weight pruning (WP). FP wins at hardware compatibility but loses at the compression ratio compared with WP. To combine the strengths of both methods, we propose to prune the filter in the filter. Specifically, we treat a filter F, whose size is C × K × K, as K × K stripes, i.e., 1 × 1 filters; then, by pruning the stripes instead of the whole filter, we can achieve finer granularity than traditional FP while remaining hardware friendly. We term our method SWP (Stripe-Wise Pruning). SWP is implemented by introducing a novel learnable matrix called the Filter Skeleton, whose values reflect the optimal shape of each filter. As some recent work has shown that the pruned architecture is more crucial than the inherited important weights, we argue that the architecture of a single filter, i.e., the Filter Skeleton, also matters. Through extensive experiments, we demonstrate that SWP is more effective than previous FP-based methods and achieves state-of-the-art pruning ratios on the CIFAR-10 and ImageNet datasets without an obvious accuracy drop.
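
A minimal sketch of the stripe-wise idea, assuming a simple magnitude threshold on the skeleton values; the class name, threshold rule, and scaling of surviving stripes are illustrative assumptions rather than the released SWP implementation.

```python
# Hypothetical sketch: a "Filter Skeleton" holds one learnable value per K x K stripe of
# each filter; stripes with small skeleton values are zeroed, pruning inside filters
# instead of removing whole filters.
import torch
import torch.nn as nn

class StripePrunedConv(nn.Module):
    def __init__(self, c_in, c_out, k):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.skeleton = nn.Parameter(torch.ones(c_out, 1, k, k))   # one entry per stripe

    def forward(self, x, threshold=0.05):
        mask = (self.skeleton.abs() > threshold).float()           # drop weak stripes
        weight = self.conv.weight * self.skeleton * mask           # (c_out, c_in, k, k)
        return nn.functional.conv2d(x, weight, padding=self.conv.padding[0])

layer = StripePrunedConv(3, 16, 3)
out = layer(torch.randn(1, 3, 32, 32))
```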

IJCAI Conference 2020 Conference Paper

Towards Alleviating Traffic Congestion: Optimal Route Planning for Massive-Scale Trips

  • Ke Li
  • Lisi Chen
  • Shuo Shang

We investigate the problem of optimal route planning for massive-scale trips: Given a traffic-aware road network and a set of trip queries Q, we aim to find a route for each trip such that the global travel time cost for all queries in Q is minimized. Our problem is designed for a range of applications such as traffic-flow management, route planning and congestion prevention in rush hours. The exact algorithm bears exponential time complexity and is computationally prohibitive for application scenarios in dynamic traffic networks. To address the challenge, we propose a greedy algorithm and an epsilon-refining algorithm. Extensive experiments offer insight into the accuracy and efficiency of our proposed algorithms.

NeurIPS Conference 2019 Conference Paper

Approximate Feature Collisions in Neural Nets

  • Ke Li
  • Tianhao Zhang
  • Jitendra Malik

Work on adversarial examples has shown that neural nets are surprisingly sensitive to adversarially chosen changes of small magnitude. In this paper, we show the opposite: neural nets could be surprisingly insensitive to adversarially chosen changes of large magnitude. We observe that this phenomenon can arise from the intrinsic properties of the ReLU activation function. As a result, two very different examples could share the same feature activation and therefore the same classification decision. We refer to this phenomenon as feature collision and the corresponding examples as colliding examples. We find that colliding examples are quite abundant: we empirically demonstrate the existence of polytopes of approximately colliding examples in the neighbourhood of practically any example.
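
A contrived toy example of the phenomenon (not the paper's search procedure): for a single ReLU layer, a large change along a direction that only pushes an already-negative pre-activation further below zero leaves the post-ReLU features identical; the matrix and inputs below are illustrative assumptions.

```python
# Hypothetical sketch: two far-apart inputs with identical post-ReLU features.
import numpy as np

W = np.array([[2.0, 0.0],
              [0.0, 3.0]])
relu = lambda z: np.maximum(z, 0.0)

x1 = np.array([1.0, -1.0])
x2 = x1 + np.array([0.0, -100.0])      # large change along a direction the ReLU discards

f1, f2 = relu(W @ x1), relu(W @ x2)
print(np.linalg.norm(x2 - x1))         # 100.0: the inputs are far apart
print(np.allclose(f1, f2))             # True: identical features, hence identical prediction
```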

JBHI Journal 2018 Journal Article

Epileptic Seizure Classification of EEGs Using Time–Frequency Analysis Based Multiscale Radial Basis Functions

  • Yang Li
  • Xu-Dong Wang
  • Mei-Lin Luo
  • Ke Li
  • Xiao-Feng Yang
  • Qi Guo

The automatic detection of epileptic seizures from electroencephalography (EEG) signals is crucial for the localization and classification of epileptic seizure activity. However, seizure processes are typically dynamic and nonstationary, and thus distinguishing rhythmic discharges from nonstationary processes is one of the challenging problems. In this paper, an adaptive and localized time–frequency representation of EEG signals is proposed by means of multiscale radial basis functions (MRBF) and a modified particle swarm optimization (MPSO) to improve both time and frequency resolution simultaneously, which constitutes a novel MRBF-MPSO framework for time–frequency feature extraction from epileptic EEG signals. The dimensionality of the extracted features can be greatly reduced by the principal component analysis algorithm before the most discriminative features selected are fed into a support vector machine (SVM) classifier with the radial basis function (RBF) in order to separate epileptic seizure from seizure-free EEG signals. The classification performance of the proposed method has been evaluated against several state-of-the-art feature extraction algorithms and five other classifiers, such as linear discriminant analysis and logistic regression. The experimental results indicate that the proposed MRBF-MPSO-SVM classification method outperforms competing techniques in terms of classification accuracy, and they show the effectiveness of the proposed method for classification of seizure epochs and seizure-free epochs.

TAAS Journal 2008 Journal Article

Ant-based distributed constrained steiner tree algorithm for jointly conserving energy and bounding delay in ad hoc multicast routing

  • Chien-Chung Shen
  • Ke Li
  • Chaiporn Jaikaeo
  • Vinay Sridhara

The minimum-energy multicast tree problem aims to construct a multicast tree rooted at the source node and spanning all the destination nodes such that the sum of transmission power at non-leaf nodes is minimized. However, aggressive power assignment at non-leaf nodes, although conserving more energy, results in multicast trees that suffer from higher hop count and jeopardizes delay-sensitive applications, signifying a clear tradeoff between energy efficiency and delay. This article formulates these issues as a constrained Steiner tree problem, and describes a distributed constrained Steiner tree algorithm, which jointly conserves energy and bounds delay for multicast routing in ad hoc networks. In particular, the proposed algorithm concurrently constructs a constrained Steiner tree, performs transmission power assignment at non-leaf nodes, and strives to minimize the sum of transmission power of non-leaf nodes, subject to the given maximum hop count constraint. Simulation results validate the effectiveness and reveal the characteristics of the proposed algorithm.