Arrow Research

Author name cluster

Kai Liu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

43 papers
2 author rows

Possible papers (43)

AAAI Conference 2026 Conference Paper

A Principle-Driven Adaptive Policy for Group Cognitive Stimulation Dialogue for Elderly with Cognitive Impairment

  • Jiyue Jiang
  • Yanyu Chen
  • Pengan Chen
  • Kai Liu
  • Jingqi Zhou
  • Zheyong Zhu
  • He Hu
  • Fei Ma

Cognitive impairment is becoming a major public health challenge. Cognitive Stimulation Therapy (CST) is an effective intervention for cognitive impairment, but traditional methods are difficult to scale, and existing digital systems struggle with group dialogues and cognitive stimulation principles. While Large Language Models (LLMs) are powerful, their application in this context faces key challenges: the absence of cognitive stimulation dialogue paradigms, a lack of therapeutic reasoning, and static-only user modeling. To address these issues, we propose a principle-driven adaptive policy actualized through a Group Cognitive Stimulation Dialogue (GCSD) system. We first construct a dataset with over 500 hours of real-world CST conversations and 10,000+ simulated dialogues generated via our Principle-Guided Scenario Simulation strategy. Our GCSD system then integrates four core modules to overcome LLM limitations: (i) a multi-speaker context controller to resolve role confusion; (ii) dynamic participant cognitive state modeling for personalized interaction; (iii) a cognitive stimulation-focused attention loss to instill cognitive stimulation reasoning; and (iv) a multi-dimensional reward strategy to enhance response value. Experimental results demonstrate that GCSD significantly outperforms baseline models across various evaluation metrics. Future work will focus on long-term clinical validation to bridge the gap between computational performance and clinical efficacy.

IJCAI Conference 2025 Conference Paper

Beyond Fixed Length: Bucket Pre-training is All You Need

  • Qing Yang
  • Qiyao Peng
  • Hongtao Liu
  • Kai Liu
  • Bing Qin
  • Ting Liu

Large Language Models (LLMs) have demonstrated exceptional performance across various tasks, with the pre-training stage serving as the cornerstone of their capabilities. However, the conventional fixed-length data composition strategy for pre-training presents several practical challenges. When using shorter sequences, documents are often truncated, potentially leading to information loss and affecting the model's ability to capture long-range dependencies. Conversely, longer sequences require concatenation of multiple documents, which can introduce noise, disrupt natural document boundaries and semantic coherence, and incur substantial computational overhead. To address these challenges, we first establish three quantitative metrics for evaluating data composition quality: padding ratio, truncation ratio, and concatenation ratio. Building upon these metrics, we propose a novel multi-bucket data composition method that transcends the fixed-length paradigm. Our approach adaptively organizes training data to achieve optimal composition quality as measured by the proposed metrics, offering a more flexible and efficient approach for pre-training. We conduct extensive experiments and the results demonstrate that our proposed method significantly enhances both the efficiency and effectiveness of LLM pre-training. Our proposed method has been adopted in the Du Xiaoman XuanYuan series of financial large language models at https://github.com/Duxiaoman-DI/XuanYuan.
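
The abstract names the three composition-quality metrics but gives no formulas. The following is a minimal sketch of one plausible reading, assuming documents are greedily packed into fixed-length sequences; the function name and the packing rule are our illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def composition_metrics(doc_lens, seq_len):
    """Hypothetical formulas for the three composition-quality metrics under
    greedy packing of documents into fixed-length sequences:
      padding ratio       : fraction of slots filled with pad tokens
      truncation ratio    : fraction of documents that had to be cut
      concatenation ratio : average number of document pieces per sequence
    """
    sequences, current, space = [], [], seq_len
    truncated = 0
    for n in doc_lens:
        was_cut = False
        while n > 0:
            take = min(n, space)
            was_cut = was_cut or (take < n)   # document did not fit whole
            current.append(take)
            n -= take
            space -= take
            if space == 0:
                sequences.append(current)
                current, space = [], seq_len
        truncated += was_cut
    if current:
        sequences.append(current)             # last, partially filled sequence
    used = sum(sum(s) for s in sequences)
    total = len(sequences) * seq_len
    return {
        "padding_ratio": 1 - used / total,
        "truncation_ratio": truncated / len(doc_lens),
        "concatenation_ratio": float(np.mean([len(s) for s in sequences])),
    }

print(composition_metrics([900, 300, 2500, 120], seq_len=1024))
```

The multi-bucket idea would then choose among several sequence lengths so that, by these metrics, documents mostly fit whole with little padding.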

JBHI Journal 2025 Journal Article

Channel-Gated Transformers With Affinity CAM for Weakly Supervised Multi-Class Brain Tumor Segmentation

  • Yan Han
  • Kai Liu
  • Lingling Yuan
  • Md Rahaman
  • Marcin Grzegorzek
  • Hongzan Sun
  • Chen Li
  • Huiling Chen

Precise tumor localization and sub-region identification are critical for disease diagnosis. However, current Weakly Supervised Semantic Segmentation (WSSS) methods for brain tumor segmentation are primarily single-class, neglecting differences between tumor sub-regions. We observed that when mainstream transformer-based WSSS methods are applied to multi-class brain tumor segmentation, they encounter two major challenges: sub-region discrimination errors and over-segmentation of small lesions. To address these challenges and advance multi-class WSSS methods for brain tumor analysis, this paper proposes Channel-gated Transformers with Affinity CAM (CTAC). CTAC first employs channel-gated multi-head self-attention to overcome the over-smoothing tendency of the transformer, thereby enhancing inter-class discriminability and improving the model's subclass differentiation capability. Then, CTAC uses multi-scale smoothed affinity to adaptively suppress low-confidence responses in the Class Activation Map (CAM), mitigating over-activation in the CAM, and alleviating the over-segmentation phenomena of small lesions. The proposed CTAC significantly outperformed the baseline method on the BraTS2021 glioma and BraTS2023-MEN meningioma datasets. On BraTS2021, it achieved a multi-class mean IoU (mIoU) of 61.718%, an increase of 4.964 percentage points (pp), with the whole-tumor mIoU reaching 79.798% (+6.882 pp). On BraTS2023-MEN, CTAC attained 72.887% mIoU (+4.676 pp) for multi-class segmentation and 75.394% (+7.839 pp) for whole-tumor. Furthermore, CTAC surpasses recent state-of-the-art methods. Code is available at https://github.com/yhan94-lab/CTAC.

NeurIPS Conference 2025 Conference Paper

JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation

  • Kai Liu
  • Jungang Li
  • Yuchong Sun
  • Shengqiong Wu
  • Jianzhang Gao
  • Daoan Zhang
  • Wei Zhang
  • Sheng Jin

This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for joint audio-video (JAV) comprehension and generation. JavisGPT has a concise encoder-LLM-decoder architecture, which has a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries to bridge a pretrained JAV-DiT generator. This design enables temporally coherent video-audio understanding and generation from multimodal instructions. We design an effective three-stage training pipeline consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning, to progressively build multimodal comprehension and generation from existing vision-language models. For instruction tuning, we construct JavisInst-Omni, a high-quality instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues that cover diverse and multi-level comprehension and generation scenarios. On JAV comprehension and generation benchmarks, our experiments show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.

ICML Conference 2025 Conference Paper

La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation

  • Kai Liu
  • Bowen Xu
  • Shaoyu Wu
  • Xin Chen
  • Hao Zhou
  • Yongliang Tao
  • Lulu Hu

Activation sparsity can reduce the computational overhead and memory transfers during the forward pass of Large Language Model (LLM) inference. Existing methods face limitations, either demanding time-consuming recovery training that hinders real-world adoption, or relying on empirical magnitude-based pruning, which causes fluctuating sparsity and unstable inference speed-up. This paper introduces LaRoSA (Layerwise Rotated Sparse Activation), a novel method for activation sparsification designed to improve LLM efficiency without requiring additional training or magnitude-based pruning. We leverage layerwise orthogonal rotations to transform input activations into rotated forms that are more suitable for sparsification. By employing a Top-K selection approach within the rotated activations, we achieve consistent model-level sparsity and reliable wall-clock time speed-up. LaRoSA is effective across various sizes and types of LLMs, demonstrating minimal performance degradation and robust inference acceleration. Specifically, for LLaMA2-7B at 40% sparsity, LaRoSA achieves a mere 0.17 perplexity gap with a consistent 1.30× wall-clock time speed-up, and reduces the accuracy gap in zero-shot tasks compared to the dense model to just 0.54%, while surpassing TEAL by 1.77% and CATS by 17.14%.
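
The core mechanism, rotating activations with a layerwise orthogonal matrix, keeping the Top-K entries, and folding the rotation into the weights, can be sketched in a few lines. The sketch below uses a random orthogonal matrix as a stand-in for the paper's learned per-layer rotation, which it does not reproduce.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 64
W = rng.standard_normal((d, d)) / np.sqrt(d)   # stand-in layer weight
X = rng.standard_normal((n, d))                # input activations

# Layerwise orthogonal rotation (random here; the paper derives it per layer).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

def topk_sparsify(a, k):
    """Keep the k largest-magnitude entries per row, zero the rest."""
    out = np.zeros_like(a)
    idx = np.argpartition(np.abs(a), -k, axis=1)[:, -k:]
    np.put_along_axis(out, idx, np.take_along_axis(a, idx, axis=1), axis=1)
    return out

k = int(0.6 * d)                  # 40% sparsity = keep 60% of channels
X_rot = X @ Q                     # rotate activations
X_sparse = topk_sparsify(X_rot, k)

# Because Q is orthogonal, (X Q)(Q^T W) equals X W exactly; Q^T W can be
# folded into the weights offline, so zeros in X_rot save real compute.
W_rot = Q.T @ W
dense = X @ W
approx = X_sparse @ W_rot
print("relative error:", np.linalg.norm(dense - approx) / np.linalg.norm(dense))
```

In exact arithmetic the folding changes nothing; the only approximation error comes from zeroing the small rotated channels, which is what a well-chosen rotation minimizes.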

IJCAI Conference 2025 Conference Paper

Mask Does Not Matter: A Unified Latent Diffusion-Enhanced Framework for Mask-Free Virtual Try-On

  • Chenghu Du
  • Junyin Wang
  • Kai Liu
  • Shengwu Xiong
  • Yi Rong

A good virtual try-on model should introduce minimal redundant conditional information to avoid instability and increase inference efficiency. Existing methods rely on inpainting masks to guide the generation of the object, but the masks, generated by unstable human parsers, often produce unreliable results with fabric residues due to wrong segmentation. Moreover, large mask regions can lose spatial structure and identity information, requiring extra conditional inputs to compensate, which increases model instability and reduces efficiency. To tackle the problem, we present a novel Mask-Free virtual Try-ON (MFTON) framework. Specifically, we propose a mask-free strategy to eliminate all denoising conditions except for clothing and person images, thereby directly extracting spatial structure and identity information from the person image to improve efficiency and reduce instability. Additionally, to optimize the generated clothing regions, we propose a clothing texture-aware attention mechanism to enable the model to focus on texture generation with significant visual differences. We then introduce a geometric detail capture loss to further enable the model to capture more high-frequency information. Finally, we propose an appearance consistency inference method to reduce the initial randomness of the sampling process significantly. Extensive experiments on popular datasets demonstrate that our method outperforms state-of-the-art virtual try-on methods.

NeurIPS Conference 2025 Conference Paper

OSCAR: One-Step Diffusion Codec Across Multiple Bit-rates

  • Jinpei Guo
  • Yifei Ji
  • Zheng Chen
  • Kai Liu
  • Min Liu
  • Wang Rao
  • Wenbo Li
  • Yong Guo

Pretrained latent diffusion models have shown strong potential for lossy image compression, owing to their powerful generative priors. Most existing diffusion-based methods reconstruct images by iteratively denoising from random noise, guided by compressed latent representations. While these approaches have achieved high reconstruction quality, their multi-step sampling process incurs substantial computational overhead. Moreover, they typically require training separate models for different compression bit-rates, leading to significant training and storage costs. To address these challenges, we propose a one-step diffusion codec across multiple bit-rates, termed OSCAR. Specifically, our method views compressed latents as noisy variants of the original latents, where the level of distortion depends on the bit-rate. This perspective allows them to be modeled as intermediate states along a diffusion trajectory. By establishing a mapping from the compression bit-rate to a pseudo diffusion timestep, we condition a single generative model to support reconstructions at multiple bit-rates. Meanwhile, we argue that the compressed latents retain rich structural information, thereby making one-step denoising feasible. Thus, OSCAR replaces iterative sampling with a single denoising pass, significantly improving inference efficiency. Extensive experiments demonstrate that OSCAR achieves superior performance in both quantitative and visual quality metrics. The code and models are available at https://github.com/jp-guo/OSCAR/.
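
A minimal sketch of the bit-rate-to-timestep idea, assuming compression distortion behaves like additive noise; the toy schedule and the SNR-matching rule below are our assumptions for illustration, not the paper's calibration procedure.

```python
import numpy as np

# Toy DDPM-style schedule: cumulative signal fraction per timestep.
alphas_cumprod = np.linspace(0.9999, 0.05, 1000)

def pseudo_timestep(z_orig, z_compressed):
    """Pick the timestep whose schedule SNR best matches the empirical SNR
    of the compressed latents (a hypothetical matching rule)."""
    noise_var = np.var(z_compressed - z_orig)
    signal_var = np.var(z_orig)
    snr = signal_var / max(noise_var, 1e-12)
    schedule_snr = alphas_cumprod / (1.0 - alphas_cumprod)
    return int(np.argmin(np.abs(schedule_snr - snr)))

rng = np.random.default_rng(1)
z = rng.standard_normal(10_000)
for label, sigma in [("high bit-rate", 0.05), ("low bit-rate", 0.5)]:
    t = pseudo_timestep(z, z + sigma * rng.standard_normal(z.shape))
    print(label, "-> pseudo timestep", t)
```

Lower bit-rates distort the latents more, so they map to a later (noisier) pseudo timestep, and one conditioned model can denoise all of them in a single pass.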

NeurIPS Conference 2025 Conference Paper

RoFt-Mol: Benchmarking Robust Fine-tuning with Molecular Graph Foundation Models

  • Shikun Liu
  • Deyu Zou
  • Nima Shoghi
  • Victor Fung
  • Kai Liu
  • Pan Li

In the era of foundation models, fine-tuning pre-trained models for specific downstream tasks has become crucial. This drives the need for robust fine-tuning methods to address challenges such as model overfitting and sparse labeling. Molecular graph foundation models (MGFMs) face unique difficulties that complicate fine-tuning. These models are limited by smaller pre-training datasets and more severe data scarcity for downstream tasks, both of which require enhanced model generalization. Moreover, MGFMs must accommodate diverse objectives, including both regression and classification tasks. To better understand and improve fine-tuning techniques under these conditions, we classify eight fine-tuning methods into three mechanisms: weight-based, representation-based, and partial fine-tuning. We benchmark these methods on downstream regression and classification tasks across supervised and self-supervised pre-trained models in diverse labeling settings. This extensive evaluation provides valuable insights and informs the design of a refined robust fine-tuning method, RoFt-Mol. This approach combines the strengths of simple post-hoc weight interpolation with more complex weight ensemble fine-tuning methods, delivering improved performance across both task types while maintaining the ease of use inherent in post-hoc weight interpolation.

IROS Conference 2025 Conference Paper

SEI3D: CPU-only 3D Object Tracking Fusing Sparse-flow-filtered Edge and Interior Alignment

  • Jixiang Chen
  • Jing Chen
  • Kai Liu
  • Ting Lei
  • Leshan Wang

Monocular 3D object tracking methods are widely employed in robotic applications; however, they often struggle with low-contrast image sequences. In this paper, we introduce a novel approach to filtering redundant edges in images by leveraging sparse interior correspondences. Our method features a sparse-flow-based probability segmentation model that comprises both coarse and fine components. The coarse model evaluates the ratio of interior correspondences within a circular region centered on each pixel, while the fine model employs a binary Gaussian kernel based on the nearest interior correspondences. This probability framework facilitates the identification of control points for object edges. Additionally, we implement a robust gradient consistency-based edge connection algorithm to generate refined object edges. Utilizing these filtered edges, we formulate an edge-based energy function that accounts for object contour shape and noise uncertainty, seamlessly integrating into a multi-feature pose optimization framework. Our multi-feature fusion strategy achieves state-of-the-art performance in both public datasets and real-world applications, operating at 60 Hz using only CPU.

IROS Conference 2025 Conference Paper

UAV-DETR: Efficient End-to-End Object Detection for Unmanned Aerial Vehicle Imagery

  • Huaxiang Zhang 0002
  • Hao Zhang
  • Kai Liu
  • Zhongxue Gan
  • Guo-Niu Zhu

Unmanned aerial vehicle object detection (UAV-OD) has been widely used in various scenarios. However, most existing UAV-OD algorithms rely on manually designed components, which require extensive tuning. End-to-end models that do not depend on such manually designed components are mainly designed for natural images, which are less effective for UAV imagery. To address such challenges, this paper proposes an efficient detection transformer (DETR) framework tailored for UAV imagery, i.e., UAV-DETR. The framework includes a multi-scale feature fusion with frequency enhancement module, which captures both spatial and frequency information at different scales. In addition, a frequency-focused downsampling module is presented to retain critical spatial details during downsampling. A semantic alignment and calibration module is developed to align and fuse features from different fusion paths. Experimental results demonstrate the effectiveness and generalization of our approach across various UAV imagery datasets. On the VisDrone dataset, our method improves AP by 3.1% and AP50 by 4.2% over the baseline. Similar enhancements are observed on the UAVVaste dataset. The project page is available at https://github.com/ValiantDiligent/UAV-DETR.

ICML Conference 2025 Conference Paper

UniMoMo: Unified Generative Modeling of 3D Molecules for De Novo Binder Design

  • Xiangzhe Kong
  • Zishen Zhang
  • Ziting Zhang
  • Rui Jiao
  • Jianzhu Ma
  • Wenbing Huang 0001
  • Kai Liu
  • Yang Liu 0005

The design of target-specific molecules such as small molecules, peptides, and antibodies is vital for biological research and drug discovery. Existing generative methods are restricted to single-domain molecules, failing to address versatile therapeutic needs or utilize cross-domain transferability to enhance model performance. In this paper, we introduce Unified generative Modeling of 3D Molecules (UniMoMo), the first framework capable of designing binders of multiple molecular domains using a single model. In particular, UniMoMo unifies the representations of different molecules as graphs of blocks, where each block corresponds to either a standard amino acid or a molecular fragment. Based on these unified representations, UniMoMo utilizes a geometric latent diffusion model for 3D molecular generation, featuring an iterative full-atom autoencoder to compress blocks into latent space points, followed by an E(3)-equivariant diffusion process. Extensive benchmarks across peptides, antibodies, and small molecules demonstrate the superiority of our unified framework over existing domain-specific models, highlighting the benefits of multi-domain training.

NeurIPS Conference 2024 Conference Paper

2DQuant: Low-bit Post-Training Quantization for Image Super-Resolution

  • Kai Liu
  • Haotong Qin
  • Yong Guo
  • Xin Yuan
  • Linghe Kong
  • Guihai Chen
  • Yulun Zhang

Low-bit quantization has become widespread for compressing image super-resolution (SR) models for edge deployment, which allows advanced SR models to enjoy compact low-bit parameters and efficient integer/bitwise constructions for storage compression and inference acceleration, respectively. However, it is notorious that low-bit quantization degrades the accuracy of SR models compared to their full-precision (FP) counterparts. Despite several efforts to alleviate the degradation, the transformer-based SR model still suffers severe degradation due to its distinctive activation distribution. In this work, we present a dual-stage low-bit post-training quantization (PTQ) method for image super-resolution, namely 2DQuant, which achieves efficient and accurate SR under low-bit quantization. The proposed method first investigates the weights and activations and finds that their distribution is characterized by coexisting symmetry and asymmetry, and long tails. Specifically, we propose Distribution-Oriented Bound Initialization (DOBI), using different searching strategies to search a coarse bound for quantizers. To obtain refined quantizer parameters, we further propose Distillation Quantization Calibration (DQC), which employs a distillation approach to make the quantized model learn from its FP counterpart. Through extensive experiments on different bits and scaling factors, the performance of DOBI reaches the state of the art (SOTA), while after stage two our method surpasses existing PTQ methods in both metrics and visual effects. 2DQuant gains an increase in PSNR as high as 4.52 dB on Set5 (×2) compared with SOTA when quantized to 2-bit, and enjoys a 3.60× compression ratio and 5.08× speedup ratio. The code and models are available at https://github.com/Kai-Liu001/2DQuant.
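
As a hedged illustration of the bound-initialization stage, the toy sketch below shrinks the clipping range inward from the tensor's min/max and keeps the bounds with the lowest quantization MSE; the actual DOBI strategies differ per distribution, and the search grid here is our placeholder.

```python
import numpy as np

def quantize(x, lo, hi, bits=2):
    """Uniform quantization of x into 2**bits levels on [lo, hi]."""
    levels = 2 ** bits - 1
    q = np.round((np.clip(x, lo, hi) - lo) / (hi - lo) * levels)
    return q / levels * (hi - lo) + lo

def coarse_bound_search(x, bits=2, steps=100):
    """Hypothetical coarse bound search: try progressively tighter clipping
    ranges and keep the pair with the lowest reconstruction MSE."""
    lo0, hi0 = x.min(), x.max()
    best, best_err = (lo0, hi0), np.inf
    for s in range(steps):
        shrink = 0.5 * s / steps            # shrink up to ~50% from each side
        lo = lo0 + shrink * (hi0 - lo0)
        hi = hi0 - shrink * (hi0 - lo0)
        err = np.mean((x - quantize(x, lo, hi, bits)) ** 2)
        if err < best_err:
            best, best_err = (lo, hi), err
    return best, best_err

rng = np.random.default_rng(0)
# Long-tailed synthetic activations, mimicking the distribution the paper describes.
x = rng.standard_normal(100_000) * np.exp(0.3 * rng.standard_normal(100_000))
print(coarse_bound_search(x, bits=2))
```

The second stage (DQC) would then refine these coarse bounds by distilling the full-precision model's outputs, which a static MSE search alone cannot capture.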

AAAI Conference 2024 Conference Paper

CycleVTON: A Cycle Mapping Framework for Parser-Free Virtual Try-On

  • Chenghu Du
  • Junyin Wang
  • Yi Rong
  • Shuqing Liu
  • Kai Liu
  • Shengwu Xiong

Image-based virtual try-on aims to transfer a target clothing onto a specific person. A significant challenge is that arbitrarily matched clothing and person images lack corresponding ground truth for supervised learning. A recent pioneering work leveraged an improved cycleGAN to enable one network to generate the desired image for another network during training. However, there is no difference in the result distribution before and after the clothing changes. Therefore, using two different networks is unnecessary and may even increase the difficulty of convergence. Furthermore, the human parsing introduced to provide body structure information in the input also has a negative impact on the try-on result. How can we employ a single network for supervised learning while eliminating human parsing? To tackle these issues, we present a Cycle mapping Virtual Try-On Network (CycleVTON), which can produce photo-realistic try-on results by using a cycle mapping framework without the parser. In particular, we introduce a flow constraint loss to achieve supervised learning of arbitrarily matched clothing and person as inputs to the deformer, thus naturally mimicking the interaction between clothing and the human body. Additionally, we design a skin generation strategy that can adapt to the shape of the target clothing by dynamically adjusting the skin region, i.e., by first removing and then filling skin areas. Extensive experiments conducted on challenging benchmarks demonstrate that our proposed method exhibits superior performance compared to state-of-the-art methods.

NeurIPS Conference 2024 Conference Paper

Delving into the Reversal Curse: How Far Can Large Language Models Generalize?

  • Zhengkai Lin
  • Zhihang Fu
  • Kai Liu
  • Liang Xie
  • Binbin Lin
  • Wenxiao Wang
  • Deng Cai
  • Yue Wu

While large language models (LLMs) showcase unprecedented capabilities, they also exhibit certain inherent limitations when facing seemingly trivial tasks. A prime example is the recently debated "reversal curse", which surfaces when models, having been trained on the fact "A is B", struggle to generalize this knowledge to infer that "B is A". In this paper, we examine the manifestation of the reversal curse across various tasks and delve into both the generalization abilities and the problem-solving mechanisms of LLMs. This investigation leads to a series of significant insights: (1) LLMs are able to generalize to "B is A" when both A and B are presented in the context as in the case of a multiple-choice question. (2) This generalization ability is highly correlated to the structure of the fact "A is B" in the training documents. For example, this generalization only applies to biographies structured in "[Name] is [Description]" but not to "[Description] is [Name]". (3) We propose and verify the hypothesis that LLMs possess an inherent bias in fact recalling during knowledge application, which explains and underscores the importance of the document structure to successful learning. (4) The negative impact of this bias on the downstream performance of LLMs can hardly be mitigated through training alone. Based on these intriguing findings, our work not only presents a novel perspective for interpreting LLMs' generalization abilities from their intrinsic working mechanism but also provides new insights for the development of more effective learning methods for LLMs.

ICML Conference 2024 Conference Paper

Differentiable Model Scaling using Differentiable Topk

  • Kai Liu
  • Ruohui Wang
  • Jianfei Gao 0003
  • Kai Chen 0026

Over the past few years, as large language models have ushered in an era of intelligence emergence, there has been an intensified focus on scaling networks. Although Neural Architecture Search (NAS) methods have been proposed to automate this process, they suffer from low search efficiency. This study introduces Differentiable Model Scaling (DMS), increasing the efficiency for searching optimal width and depth in networks. DMS can model both width and depth in a direct and fully differentiable way, making it easy to optimize. We have evaluated our DMS across diverse tasks, ranging from vision tasks to NLP tasks and various network architectures, including CNNs and Transformers. Results consistently indicate that our DMS can find improved structures and outperforms state-of-the-art NAS methods. Specifically, for image classification on ImageNet, our DMS improves the top-1 accuracy of EfficientNet-B0 and Deit-Tiny by 1.4% and 0.6%, respectively, and outperforms the state-of-the-art zero-shot NAS method, ZiCo, by 1.3% while requiring only 0.4 GPU days for searching. For object detection on COCO, DMS improves the mAP of Yolo-v8-n by 2.0%. For language modeling, our pruned Llama-7B outperforms the prior method with lower perplexity and higher zero-shot classification accuracy. Our code is available at https://github.com/LKJacky/Differentiable-Model-Scaling.
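
A common way to relax top-k selection into something differentiable, and a plausible reading of the idea here, is to replace the hard mask with a sigmoid gate around the k-th largest score. The paper's exact formulation differs, so treat this forward-only numpy sketch as illustrative.

```python
import numpy as np

def soft_topk_gate(scores, k, temperature=0.05):
    """Forward pass of a smooth top-k gate: compare each score against the
    k-th largest score through a sigmoid. The threshold itself comes from a
    sort (not differentiable), but gradients can still flow through `scores`,
    a straight-through-style relaxation. As temperature -> 0 this approaches
    a hard top-k mask."""
    threshold = np.sort(scores)[-k]
    return 1.0 / (1.0 + np.exp(-(scores - threshold) / temperature))

scores = np.array([0.9, 0.1, 0.5, 0.7, 0.2, 0.05])
# ~1 above the threshold, 0.5 at the threshold, ~0 below it.
print(soft_topk_gate(scores, k=2).round(3))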

NeurIPS Conference 2024 Conference Paper

Enhancing LLM’s Cognition via Structurization

  • Kai Liu
  • Zhihang Fu
  • Chao Chen
  • Wei Zhang
  • Rongxin Jiang
  • Fan Zhou
  • Yaowu Chen
  • Yue Wu

When reading long-form text, human cognition is complex and structurized. While large language models (LLMs) process input contexts through a causal and sequential perspective, this approach can potentially limit their ability to handle intricate and complex inputs effectively. To enhance LLM’s cognition capability, this paper presents a novel concept of context structurization. Specifically, we transform the plain, unordered contextual sentences into well-ordered and hierarchically structurized elements. By doing so, LLMs can better grasp intricate and extended contexts through precise attention and information-seeking along the organized structures. Extensive evaluations are conducted across various model architectures and sizes (including a series of auto-regressive LLMs as well as BERT-like masking models) on a diverse set of NLP tasks (e.g., context-based question-answering, exhaustive hallucination evaluation, and passage-level dense retrieval). Empirical results show consistent and significant performance gains afforded by a single-round structurization. In particular, we boost the open-sourced LLaMA2-70B model to achieve comparable performance against GPT-3.5-Turbo as the hallucination evaluator. Besides, we show the feasibility of distilling advanced LLMs’ language processing abilities to a smaller yet effective StruXGPT-7B to execute structurization, addressing the practicality of our approach. Code is available at https://github.com/alibaba/struxgpt.

NeurIPS Conference 2024 Conference Paper

Learning Identifiable Factorized Causal Representations of Cellular Responses

  • Haiyi Mao
  • Romain Lopez
  • Kai Liu
  • Jan-Christian Huetter
  • David Richmond
  • Panayiotis V. Benos
  • Lin Qiu

The study of cells and their responses to genetic or chemical perturbations promises to accelerate the discovery of therapeutic targets. However, designing adequate and insightful models for such data is difficult because the response of a cell to perturbations essentially depends on contextual covariates (e.g., genetic background or type of the cell). There is therefore a need for models that can identify interactions between drugs and contextual covariates. This is crucial for discovering therapeutic targets, as such interactions may reveal drugs that affect certain cell types but not others. We tackle this problem with a novel Factorized Causal Representation (FCR) learning method, an identifiable deep generative model that reveals causal structure in single-cell perturbation data from several cell lines. FCR learns multiple cellular representations that are disentangled, comprising covariate-specific (Z_x), treatment-specific (Z_t), and interaction-specific (Z_tx) representations. Based on recent advances in non-linear ICA theory, we prove the component-wise identifiability of Z_tx and the block-wise identifiability of Z_t and Z_x. We then present our implementation of FCR and empirically demonstrate that FCR outperforms state-of-the-art baselines in various tasks across four single-cell datasets.

AAAI Conference 2024 Conference Paper

LimeAttack: Local Explainable Method for Textual Hard-Label Adversarial Attack

  • Hai Zhu
  • Qingyang Zhao
  • Weiwei Shang
  • Yuren Wu
  • Kai Liu

Natural language processing models are vulnerable to adversarial examples. Previous textual adversarial attacks adopt model internal information (gradients or confidence scores) to generate adversarial examples. However, this information is unavailable in the real world. Therefore, we focus on a more realistic and challenging setting, named hard-label attack, in which the attacker can only query the model and obtain a discrete prediction label. Existing hard-label attack algorithms tend to initialize adversarial examples by random substitution and then utilize complex heuristic algorithms to optimize the adversarial perturbation. These methods require a lot of model queries and the attack success rate is restricted by adversary initialization. In this paper, we propose a novel hard-label attack algorithm named LimeAttack, which leverages a local explainable method to approximate word importance ranking, and then adopts beam search to find the optimal solution. Extensive experiments show that LimeAttack achieves better attack performance than existing hard-label attacks under the same query budget. In addition, we evaluate the effectiveness of LimeAttack on large language models and some defense methods, and results indicate that adversarial examples remain a significant threat to large language models. The adversarial examples crafted by LimeAttack are highly transferable and effectively improve model robustness in adversarial training.

JBHI Journal 2024 Journal Article

pathCLIP: Detection of Genes and Gene Relations From Biological Pathway Figures Through Image-Text Contrastive Learning

  • Fei He
  • Kai Liu
  • Zhiyuan Yang
  • Yibo Chen
  • Richard D. Hammer
  • Dong Xu
  • Mihail Popescu

In biomedical literature, biological pathways are commonly described through a combination of images and text. These pathways contain valuable information, including genes and their relationships, which provide insight into biological mechanisms and precision medicine. Curating pathway information across the literature enables the integration of this information to build a comprehensive knowledge base. While some studies have extracted pathway information from images and text independently, they often overlook the correspondence between the two modalities. In this paper, we present a pathway figure curation system named pathCLIP for identifying genes and gene relations from pathway figures. Our key innovation is the use of an image-text contrastive learning model to learn coordinated embeddings of image snippets and text descriptions of genes and gene relations, thereby improving curation. Our validation results, using pathway figures from PubMed, showed that our multimodal model outperforms models using only a single modality. Additionally, our system effectively curates genes and gene relations from multiple literature sources. Two case studies on extracting pathway information from literature of non-small cell lung cancer and Alzheimer's disease further demonstrate the usefulness of our curated pathway information in enhancing related pathways in the KEGG database.
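
The image-text contrastive objective described here is the standard CLIP-style symmetric loss over matched image-snippet/text pairs; a minimal numpy version, with embedding shapes and temperature as placeholders, looks like this.

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (CLIP-style) loss on L2-normalized embeddings:
    matched image/text pairs sit on the diagonal of the similarity matrix
    and are pulled together; off-diagonal pairs are pushed apart."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (N, N) similarity matrix

    def ce(l):                               # cross-entropy, diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (ce(logits) + ce(logits.T))

rng = np.random.default_rng(0)
img, txt = rng.standard_normal((8, 64)), rng.standard_normal((8, 64))
print("loss on random pairs:", clip_style_loss(img, txt))
```

Training on such a loss coordinates the two embedding spaces so that a gene-relation image snippet and its textual description score highly together, which is what improves the curation step.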

IJCAI Conference 2024 Conference Paper

Pointsoup: High-Performance and Extremely Low-Decoding-Latency Learned Geometry Codec for Large-Scale Point Cloud Scenes

  • Kang You
  • Kai Liu
  • Li Yu
  • Pan Gao
  • Dandan Ding

Despite considerable progress being achieved in point cloud geometry compression, there still remains a challenge in effectively compressing large-scale scenes with sparse surfaces. Another key challenge lies in reducing decoding latency, a crucial requirement in real-world applications. In this paper, we propose Pointsoup, an efficient learning-based geometry codec that attains high performance and extremely low decoding latency simultaneously. Inspired by the conventional Trisoup codec, a point model-based strategy is devised to characterize local surfaces. Specifically, skin features are embedded from local windows via an attention-based encoder, and dilated windows are introduced as cross-scale priors to infer the distribution of quantized features in parallel. During decoding, features undergo fast refinement, followed by a folding-based point generator that reconstructs point coordinates with fairly fast speed. Experiments show that Pointsoup achieves state-of-the-art performance on multiple benchmarks with significantly lower decoding complexity, i.e., up to 90–160× faster than the G-PCCv23 Trisoup decoder on a comparatively low-end platform (e.g., one RTX 2080Ti). Furthermore, it offers variable-rate control with a single neural model (2.9 MB), which is attractive for industrial practitioners.

NeurIPS Conference 2024 Conference Paper

Rethinking Out-of-Distribution Detection on Imbalanced Data Distribution

  • Kai Liu
  • Zhihang Fu
  • Sheng Jin
  • Chao Chen
  • Ze Chen
  • Rongxin Jiang
  • Fan Zhou
  • Yaowu Chen

Detecting and rejecting unknown out-of-distribution (OOD) samples is critical for deployed neural networks to avoid unreliable predictions. In real-world scenarios, however, the efficacy of existing OOD detection methods is often impeded by the inherent imbalance of in-distribution (ID) data, which causes significant performance decline. Through statistical observations, we have identified two common challenges faced by different OOD detectors: misidentifying tail class ID samples as OOD, while erroneously predicting OOD samples as head class from ID. To explain this phenomenon, we introduce a generalized statistical framework, termed ImOOD, to formulate the OOD detection problem on imbalanced data distribution. Consequently, the theoretical analysis reveals that there exists a class-aware bias item between balanced and imbalanced OOD detection, which contributes to the performance gap. Building upon this finding, we present a unified training-time regularization technique to mitigate the bias and boost imbalanced OOD detectors across architecture designs. Our theoretically grounded method translates into consistent improvements on the representative CIFAR10-LT, CIFAR100-LT, and ImageNet-LT benchmarks against several state-of-the-art OOD detection approaches. Code is available at https://github.com/alibaba/imood.
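
To make the class-aware bias term concrete, here is a toy, forward-only illustration of where such a correction could enter an energy-style OOD score. The paper's actual method is a training-time regularizer, not this post-hoc rule; the prior-based bias below is our assumption.

```python
import numpy as np

def debiased_ood_score(logits, class_counts):
    """Toy correction: an energy-style ID score minus a class-aware bias
    estimated from the (imbalanced) training class frequencies. Tail-class
    predictions get a larger boost, counteracting the tendency to flag
    tail ID samples as OOD; head-class predictions get a smaller one."""
    energy = np.log(np.exp(logits).sum(axis=1))    # higher = more ID-like
    prior = class_counts / class_counts.sum()
    bias = np.log(prior)[logits.argmax(axis=1)]    # hypothetical bias term
    return energy - bias

rng = np.random.default_rng(0)
logits = rng.standard_normal((5, 10)) * 3
counts = np.array([5000, 3000, 1000, 500, 200, 100, 60, 40, 20, 10])  # long tail
print(debiased_ood_score(logits, counts).round(2))
```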

AAAI Conference 2024 Conference Paper

SwiftPillars: High-Efficiency Pillar Encoder for Lidar-Based 3D Detection

  • Xin Jin
  • Kai Liu
  • Cong Ma
  • Ruining Yang
  • Fei Hui
  • Wei Wu

Lidar-based 3D Detection is one of the significant components of Autonomous Driving. However, current methods over-focus on improving the performance of 3D Lidar perception, which causes network architectures to become complicated and hard to deploy. Thus, these methods are difficult to apply to real-time processing in Autonomous Driving. In this paper, we propose a high-efficiency network, SwiftPillars, which includes a Swift Pillar Encoder (SPE) and a Multi-scale Aggregation Decoder (MAD). The SPE is constructed from a concise Dual-attention Module with lightweight operators. The Dual-attention Module utilizes feature pooling, matrix multiplication, etc. to speed up point-wise and channel-wise attention extraction and fusion. The MAD interconnects multiple scale features extracted by the SPE with minimal computational cost to boost performance. In our experiments, our proposal achieves 61.3% NDS and 53.2% mAP on the nuScenes dataset. In addition, we evaluate inference time on several platforms (P4, T4, A2, MLU370, RTX3080), where SwiftPillars achieves up to 13.3 ms (75 FPS) on an NVIDIA Tesla T4. Compared with PointPillars, SwiftPillars is on average 26.58% faster in inference speed on equivalent GPUs, with an mAP approximately 3.2% higher on the nuScenes dataset.

NeurIPS Conference 2023 Conference Paper

Category-Extensible Out-of-Distribution Detection via Hierarchical Context Descriptions

  • Kai Liu
  • Zhihang Fu
  • Chao Chen
  • Sheng Jin
  • Ze Chen
  • Mingyuan Tao
  • Rongxin Jiang
  • Jieping Ye

The key to OOD detection has two aspects: generalized feature representation and precise category description. Recently, vision-language models such as CLIP have provided significant advances in both issues, but constructing precise category descriptions is still in its infancy due to the absence of unseen categories. This work introduces two hierarchical contexts, namely perceptual context and spurious context, to carefully describe the precise category boundary through automatic prompt tuning. Specifically, perceptual contexts perceive the inter-category difference (e.g., cats vs. apples) for current classification tasks, while spurious contexts further identify spurious (similar but not truly in-category) OOD samples for every single category (e.g., cats vs. panthers, apples vs. peaches). The two contexts hierarchically construct the precise description for a certain category: first roughly classifying a sample to the predicted category, and then delicately identifying whether it is truly an ID sample or actually OOD. Moreover, the precise descriptions for those categories within the vision-language framework present a novel application: CATegory-EXtensible OOD detection (CATEX). One can efficiently extend the set of recognizable categories by simply merging the hierarchical contexts learned under different sub-task settings. Extensive experiments are conducted to demonstrate CATEX's effectiveness, robustness, and category-extensibility. For instance, CATEX consistently surpasses the rivals by a large margin with several protocols on the challenging ImageNet-1K dataset. In addition, we offer new insights on how to efficiently scale up prompt engineering in vision-language models to recognize thousands of object categories, as well as how to incorporate large language models (like GPT-3) to boost zero-shot applications.

NeurIPS Conference 2023 Conference Paper

Optimal Parameter and Neuron Pruning for Out-of-Distribution Detection

  • Chao Chen
  • Zhihang Fu
  • Kai Liu
  • Ze Chen
  • Mingyuan Tao
  • Jieping Ye

For a machine learning model deployed in real-world scenarios, the ability of detecting out-of-distribution (OOD) samples is indispensable and challenging. Most existing OOD detection methods focus on exploring advanced training skills or training-free tricks to prevent the model from yielding an overconfident confidence score for unknown samples. The training-based methods require expensive training cost and rely on OOD samples which are not always available, while most training-free methods cannot efficiently utilize the prior information from the training data. In this work, we propose an Optimal Parameter and Neuron Pruning (OPNP) approach, which aims to identify and remove those parameters and neurons that lead to over-fitting. The main method is divided into two steps. In the first step, we evaluate the sensitivity of the model parameters and neurons by averaging gradients over all training samples. In the second step, the parameters and neurons with exceptionally large or close-to-zero sensitivities are removed for prediction. Our proposal is training-free, compatible with other post-hoc methods, and explores the information from all training data. Extensive experiments are performed on multiple OOD detection tasks and model architectures, showing that our proposed OPNP consistently outperforms the existing methods by a large margin.
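
The two-step recipe translates almost directly into code. Below is a minimal sketch of the pruning rule given precomputed sensitivities (gradients averaged over the training set, computed elsewhere); the quantile cutoffs are our placeholders, not the paper's values.

```python
import numpy as np

def opnp_mask(sensitivity, low_q=0.05, high_q=0.99):
    """Keep only parameters whose sensitivity is neither close to zero nor
    exceptionally large; everything outside the band is pruned (zeroed)."""
    s = np.abs(sensitivity)
    lo, hi = np.quantile(s, low_q), np.quantile(s, high_q)
    return (s > lo) & (s < hi)          # True = keep this parameter

rng = np.random.default_rng(0)
# Stand-in for per-parameter gradients averaged over all training samples.
avg_grad = rng.standard_normal(10_000) * np.exp(rng.standard_normal(10_000))
mask = opnp_mask(avg_grad)
print(f"kept {mask.mean():.1%} of parameters")
```

Being a pure post-hoc mask over existing weights, this composes naturally with other post-hoc OOD scores, which matches the compatibility claim in the abstract.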

IJCAI Conference 2022 Conference Paper

Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention

  • Kai Liu
  • Tianyi Wu
  • Cong Liu
  • Guodong Guo

Recently, Transformers have shown promising performance in various vision tasks. To reduce the quadratic computation complexity caused by each query attending to all keys/values, various methods have constrained the range of attention within local regions, where each query only attends to keys/values within a hand-crafted window. However, these hand-crafted window partition mechanisms are data-agnostic and ignore their input content, so it is likely that one query may attend to irrelevant keys/values. To address this issue, we propose a Dynamic Group Attention (DG-Attention), which dynamically divides all queries into multiple groups and selects the most relevant keys/values for each group. Our DG-Attention can flexibly model more relevant dependencies without any spatial constraint that is used in hand-crafted window based attention. Built on the DG-Attention, we develop a general vision transformer backbone named Dynamic Group Transformer (DGT). Extensive experiments show that our models can outperform the state-of-the-art methods on multiple common vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation.

JBHI Journal 2022 Journal Article

Interpretability Analysis of One-Year Mortality Prediction for Stroke Patients Based on Deep Neural Network

  • Shuo Zhang
  • Jing Wang
  • Lulu Pei
  • Kai Liu
  • Yuan Gao
  • Hui Fang
  • Rui Zhang
  • Lu Zhao

Clinically, physicians collect benchmark medical data to establish archives for a stroke patient and then add follow-up data regularly. This has great significance for prognosis prediction in stroke patients. In this paper, we present an interpretable deep learning model to predict the one-year mortality risk of stroke. We design sub-modules to reconstruct features from original clinical data that highlight the dissimilarity and temporality of different variables. The model consists of a Bidirectional Long Short-Term Memory (Bi-LSTM) network, in which a novel correlation attention module is proposed that takes the correlation of variables into consideration. In experiments, the dataset was collected clinically from the department of neurology of a local AAA hospital and consists of 2,275 stroke patients hospitalized from 2014 to 2016. Our model achieves a precision of 0.9414, a recall of 0.9502, and an F1-score of 0.9415. In addition, we provide an analysis of interpretability through visualizations with reference to clinical professional guidelines.

AAAI Conference 2021 Conference Paper

Spatio-Temporal Difference Descriptor for Skeleton-Based Action Recognition

  • Chongyang Ding
  • Kai Liu
  • Jari Korhonen
  • Evgeny Belyaev

In skeletal representation, intra-frame differences between body joints, as well as inter-frame dynamics between body skeletons contain discriminative information for action recognition. Conventional methods for modeling human skeleton sequences generally depend on motion trajectory and body joint dependency information, thus lacking the ability to identify the inherent differences of human skeletons. In this paper, we propose a spatio-temporal difference descriptor based on a directional convolution architecture that enables us to learn the spatio-temporal differences and contextual dependencies between different body joints simultaneously. The overall model is built on a deep symmetric positive definite (SPD) metric learning architecture designed to learn discriminative manifold features with the well-designed non-linear mapping operation. Experiments on several action datasets show that our proposed method achieves up to 3% accuracy improvement over state-of-the-art methods.

AAAI Conference 2021 Conference Paper

Unsupervised Domain Adaptation for Person Re-identification via Heterogeneous Graph Alignment

  • Minying Zhang
  • Kai Liu
  • Yidong Li
  • Shihui Guo
  • Hongtao Duan
  • Yimin Long
  • Yi Jin

Unsupervised person re-identification (re-ID) is becoming increasingly popular due to its power in real-world systems such as public security and intelligent transportation systems. However, the person re-ID task is challenged by the problems of data distribution discrepancy across cameras and lack of label information. In this paper, we propose a coarse-to-fine heterogeneous graph alignment (HGA) method to find cross-camera person matches by characterizing the unlabeled data as a heterogeneous graph for each camera. In the coarse-alignment stage, we assign a projection for each camera and utilize an adversarial learning based method to align coarse-grained node groups from different cameras into a shared space, which consequently alleviates the distribution discrepancy between cameras. In the fine-alignment stage, we exploit potential fine-grained node groups in the shared space and introduce conservative alignment loss functions to constrain the graph aligning process, resulting in reliable pseudo labels as learning guidance. The proposed domain adaptation framework not only improves model generalization on the target domain, but also facilitates mining and integrating the potential discriminative information across different cameras. Extensive experiments on benchmark datasets demonstrate that the proposed approach outperforms the state-of-the-art methods.

AAAI Conference 2020 Conference Paper

A Robust Adversarial Training Approach to Machine Reading Comprehension

  • Kai Liu
  • Xin Liu
  • An Yang
  • Jing Liu
  • Jinsong Su
  • Sujian Li
  • Qiaoqiao She

Lacking robustness is a serious problem for Machine Reading Comprehension (MRC) models. To alleviate this problem, one of the most promising ways is to augment the training dataset with carefully designed adversarial examples. Generally, those examples are created by rules according to the observed patterns of successful adversarial attacks. Since the types of adversarial examples are innumerable, it is not adequate to manually design and enrich training data to defend against all types of adversarial attacks. In this paper, we propose a novel robust adversarial training approach to improve the robustness of MRC models in a more generic way. Given an MRC model well-trained on the original dataset, our approach dynamically generates adversarial examples based on the parameters of the current model and further trains the model by using the generated examples in an iterative schedule. When applied to state-of-the-art MRC models, including QANet, BERT and ERNIE 2.0, our approach obtains significant and comprehensive improvements on 5 adversarial datasets constructed in different ways, without sacrificing the performance on the original SQuAD development set. Moreover, when coupled with other data augmentation strategies, our approach further boosts the overall performance on adversarial datasets and outperforms the state-of-the-art methods.

IJCAI Conference 2020 Conference Paper

An Iterative Multi-Source Mutual Knowledge Transfer Framework for Machine Reading Comprehension

  • Xin Liu
  • Kai Liu
  • Xiang Li
  • Jinsong Su
  • Yubin Ge
  • Bin Wang
  • Jiebo Luo

The lack of sufficient training data in many domains poses a major challenge to the construction of domain-specific machine reading comprehension (MRC) models with satisfying performance. In this paper, we propose a novel iterative multi-source mutual knowledge transfer framework for MRC. As an extension of the conventional knowledge transfer with one-to-one correspondence, our framework focuses on the many-to-many mutual transfer, which involves synchronous executions of multiple many-to-one transfers in an iterative manner. Specifically, to update a target-domain MRC model, we first consider other domain-specific MRC models as individual teachers, and employ knowledge distillation to train a multi-domain MRC model, which is differentially required to fit the training data and match the outputs of these individual models according to their domain-level similarities to the target domain. After being initialized by the multi-domain MRC model, the target-domain MRC model is fine-tuned to match both its training data and the output of its previous best model simultaneously via knowledge distillation. Compared with previous approaches, our framework can continuously enhance all domain-specific MRC models by enabling each model to iteratively and differentially absorb the domain-shared knowledge from others. Experimental results and in-depth analyses on several benchmark datasets demonstrate the effectiveness of our framework.
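
A single many-to-one transfer step, distilling several domain teachers into one student with domain-similarity weights, can be sketched as follows. The weighting scheme here is an assumption for illustration; the paper defines its own domain-level similarities.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_teacher_kd_loss(student_logits, teacher_logits, domain_sims, labels):
    """One distillation step: fit the training labels (cross-entropy) and
    match each domain teacher's soft outputs, weighted by that domain's
    similarity to the target domain."""
    p = softmax(student_logits)
    n = np.arange(len(labels))
    ce = -np.mean(np.log(p[n, labels] + 1e-12))          # fit the training data
    w = np.asarray(domain_sims) / np.sum(domain_sims)
    kd = sum(wk * -np.mean(np.sum(softmax(t) * np.log(p + 1e-12), axis=1))
             for wk, t in zip(w, teacher_logits))         # match the teachers
    return ce + kd

rng = np.random.default_rng(0)
student = rng.standard_normal((4, 3))
teachers = [rng.standard_normal((4, 3)) for _ in range(2)]
print(multi_teacher_kd_loss(student, teachers, domain_sims=[0.8, 0.2],
                            labels=np.array([0, 2, 1, 0])))
```

Iterating this step across all domains simultaneously is what makes the transfer "mutual": every model alternates between acting as a teacher and being refreshed as a student.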

AAAI Conference 2020 Conference Paper

Incentivized Exploration for Multi-Armed Bandits under Reward Drift

  • Zhiyuan Liu
  • Huazheng Wang
  • Fan Shen
  • Kai Liu
  • Lijun Chen

We study incentivized exploration for the multi-armed bandit (MAB) problem where the players receive compensation for exploring arms other than the greedy choice and may provide biased feedback on reward. We seek to understand the impact of this drifted reward feedback by analyzing the performance of three instantiations of the incentivized MAB algorithm: UCB, ε-Greedy, and Thompson Sampling. Our results show that they all achieve O(log T) regret and compensation under the drifted reward, and are therefore effective in incentivizing exploration. Numerical examples are provided to complement the theoretical analysis.
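
A toy run of one instantiation, UCB with compensation paid whenever the algorithm deviates from the greedy arm, might look like the sketch below. The drift model (a small additive bias on compensated pulls) is our assumption for illustration, not the paper's exact feedback model.

```python
import numpy as np

def incentivized_ucb(true_means, horizon=5000, seed=0):
    """UCB with compensation: when the UCB choice differs from the greedy
    (highest empirical mean) arm, the player is paid the empirical gap and
    may report a drifted (biased) reward."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts, sums, compensation = np.zeros(k), np.zeros(k), 0.0
    for t in range(horizon):
        if t < k:
            arm = t                                   # play each arm once
        else:
            means = sums / counts
            ucb = means + np.sqrt(2 * np.log(t) / counts)
            arm, greedy = int(np.argmax(ucb)), int(np.argmax(means))
            if arm != greedy:                         # incentivized exploration
                compensation += means[greedy] - means[arm]
        reward = rng.normal(true_means[arm], 0.1)
        if t >= k and arm != greedy:
            reward += 0.05                            # drifted reward feedback
        counts[arm] += 1
        sums[arm] += reward
    regret = horizon * max(true_means) - np.dot(counts, true_means)
    return regret, compensation

print(incentivized_ucb(np.array([0.5, 0.6, 0.7])))
```

Consistent with the paper's O(log T) result, both the regret and the total compensation in such runs grow slowly relative to the horizon, since exploration (and hence payment) becomes rare once the empirical means separate.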

IJCAI Conference 2019 Conference Paper

Learning Robust Distance Metric with Side Information via Ratio Minimization of Orthogonally Constrained L21-Norm Distances

  • Kai Liu
  • Lodewijk Brand
  • Hua Wang
  • Feiping Nie

Metric Learning, which aims at learning a distance metric for a given data set, plays an important role in measuring the distance or similarity between data objects. Due to its broad usefulness, it has attracted a lot of interest in machine learning and related areas in the past few decades. This paper proposes to learn the distance metric from the side information in the forms of must-links and cannot-links. Given the pairwise constraints, our goal is to learn a Mahalanobis distance that minimizes the ratio of the distances of the data pairs in the must-links to those in the cannot-links. Different from many existing papers that use the traditional squared L2-norm distance, we develop a robust model that is less sensitive to data noise or outliers by using the not-squared L2-norm distance. In our objective, the orthonormal constraint is enforced to avoid degenerate solutions. To solve our objective, we have derived an efficient iterative solution algorithm. We have conducted extensive experiments, which demonstrate the superiority of our method over state-of-the-art methods.
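
In symbols, our reconstruction of the objective from the abstract, with $W$ the learned orthonormal projection, $\mathcal{M}$ the set of must-links, and $\mathcal{C}$ the set of cannot-links (the notation is ours):

```latex
\min_{W^\top W = I}\;
\frac{\sum_{(\mathbf{x}_i,\mathbf{x}_j)\in\mathcal{M}} \left\lVert W^\top(\mathbf{x}_i-\mathbf{x}_j)\right\rVert_2}
     {\sum_{(\mathbf{x}_i,\mathbf{x}_j)\in\mathcal{C}} \left\lVert W^\top(\mathbf{x}_i-\mathbf{x}_j)\right\rVert_2}
```

The not-squared norms keep any single noisy pair from dominating either sum, and the constraint $W^\top W = I$ rules out the degenerate solution $W = 0$.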

IJCAI Conference 2019 Conference Paper

Learning Strictly Orthogonal p-Order Nonnegative Laplacian Embedding via Smoothed Iterative Reweighted Method

  • Haoxuan Yang
  • Kai Liu
  • Hua Wang
  • Feiping Nie

Laplacian Embedding (LE) is a powerful method to reveal the intrinsic geometry of high-dimensional data by using graphs. Imposing the orthogonal and nonnegative constraints onto the LE objective has proved to be effective to avoid degenerate and negative solutions, which, though, are challenging to achieve simultaneously because they are nonlinear and nonconvex. In addition, recent studies have shown that using the p-th order of the L2-norm distances in LE can find the best solution for clustering and promote the robustness of the embedding model against outliers, although this makes the optimization objective nonsmooth and difficult to efficiently solve in general. In this work, we study LE that uses the p-th order of the L2-norm distances and satisfies both orthogonal and nonnegative constraints. We introduce a novel smoothed iterative reweighted method to tackle this challenging optimization problem and rigorously analyze its convergence. We demonstrate the effectiveness and potential of our proposed method by extensive empirical studies on both synthetic and real data sets.

AAAI Conference 2019 Conference Paper

Visual Place Recognition via Robust ℓ2-Norm Distance Based Holism and Landmark Integration

  • Kai Liu
  • Hua Wang
  • Fei Han
  • Hao Zhang

Visual place recognition is essential for large-scale simultaneous localization and mapping (SLAM). Long-term robot operations across different times of day, months, and seasons introduce new challenges from significant environment appearance variations. In this paper, we propose a novel method to learn a location representation that can integrate the semantic landmarks of a place with its holistic representation. To promote the robustness of our new model against the drastic appearance variations due to long-term visual changes, we formulate our objective to use non-squared ℓ2-norm distances, which leads to a difficult optimization problem that minimizes the ratio of the ℓ2,1-norms of matrices. To solve our objective, we derive a new efficient iterative algorithm, whose convergence is rigorously guaranteed by theory. In addition, because our solution is strictly orthogonal, the learned location representations can have better place recognition capabilities. We evaluate the proposed method using two large-scale benchmark data sets, the CMU-VL and Nordland data sets. Experimental results have validated the effectiveness of our new method in long-term visual place recognition applications.

NeurIPS Conference 2018 Conference Paper

Dropping Symmetry for Fast Symmetric Nonnegative Matrix Factorization

  • Zhihui Zhu
  • Xiao Li
  • Kai Liu
  • Qiuwei Li

Symmetric nonnegative matrix factorization (NMF)---a special but important class of the general NMF---is demonstrated to be useful for data analysis and in particular for various clustering tasks. Unfortunately, designing fast algorithms for Symmetric NMF is not as easy as for the nonsymmetric counterpart, the latter admitting the splitting property that allows efficient alternating-type algorithms. To overcome this issue, we transfer the symmetric NMF to a nonsymmetric one, then we can adopt the idea from the state-of-the-art algorithms for nonsymmetric NMF to design fast algorithms solving symmetric NMF. We rigorously establish that solving nonsymmetric reformulation returns a solution for symmetric NMF and then apply fast alternating based algorithms for the corresponding reformulated problem. Furthermore, we show these fast algorithms admit strong convergence guarantee in the sense that the generated sequence is convergent at least at a sublinear rate and it converges globally to a critical point of the symmetric NMF. We conduct experiments on both synthetic data and image clustering to support our result.
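
The splitting idea admits a compact sketch: solve a penalized nonsymmetric surrogate with alternating updates and read off the symmetric factor. The penalty form follows the high-level description in the abstract; the projected-gradient updates and step sizes below are our illustrative choices, not the paper's algorithms.

```python
import numpy as np

def symnmf_via_splitting(X, r, lam=1.0, iters=500, seed=0):
    """Minimize ||X - U V^T||_F^2 + lam ||U - V||_F^2 over U, V >= 0 with
    alternating projected-gradient steps, then return U (~ V) as the
    symmetric factor. A sketch, not the paper's exact solver."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U, V = rng.random((n, r)), rng.random((n, r))
    for _ in range(iters):
        step = 1.0 / (np.linalg.norm(V.T @ V, 2) + lam)
        U = np.maximum(0, U - step * ((U @ V.T - X) @ V + lam * (U - V)))
        step = 1.0 / (np.linalg.norm(U.T @ U, 2) + lam)
        V = np.maximum(0, V - step * ((V @ U.T - X.T) @ U + lam * (V - U)))
    return U

# Symmetric nonnegative matrix with a planted low-rank structure.
rng = np.random.default_rng(1)
H = rng.random((30, 3))
X = H @ H.T
U = symnmf_via_splitting(X, r=3)
print("relative fit:", np.linalg.norm(X - U @ U.T) / np.linalg.norm(X))
```

The penalty term is what "drops symmetry": U and V are optimized separately, recovering the splitting property of nonsymmetric NMF, while the penalty pulls them back together so the limit point solves the symmetric problem.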

IJCAI Conference 2018 Conference Paper

High-Order Co-Clustering via Strictly Orthogonal and Symmetric L1-Norm Nonnegative Matrix Tri-Factorization

  • Kai Liu
  • Hua Wang

Different from traditional clustering methods that deal with one single type of data, High-Order Co-Clustering (HOCC) aims to cluster multiple types of data simultaneously by utilizing the inter- and/or intra-type relationships across different data types. In existing HOCC methods, data points routinely enter the objective functions with squared residual errors. As a result, outlying data samples can dominate the objective functions, which may lead to incorrect clustering results. Moreover, existing methods usually suffer from soft clustering, where the probabilities to different groups can be very close. In this paper, we propose an L1-norm symmetric nonnegative matrix tri-factorization method to solve the HOCC problem. Due to the orthogonal constraints and the symmetric L1-norm formulation in our new objective, the conventional auxiliary function approach no longer works. Thus we derive the solution algorithm using the alternating direction method of multipliers. Extensive experiments have been conducted on a real world data set, in which promising empirical results, including less time consumption, a strictly orthogonal membership matrix, lower local minima, etc., have demonstrated the effectiveness of our proposed method.