Arrow Research search

Author name cluster

Kai Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

77 papers
2 author rows

Possible papers (77)

JBHI Journal 2026 Journal Article

Airs-Net: Adversarial-Improved Reversible Steganography Network for CT Images in the Internet of Medical Things and Telemedicine

  • Kai Chen
  • Mu Nie
  • Jean-Louis Coatrieux
  • Yang Chen
  • Shipeng Xie

Medical imaging has developed from an auxiliary means of clinical examination into a significant method and intuitive basis for the clinical diagnosis of diseases, providing comprehensive, full-cycle health protection. The Internet of Medical Things (IoMT) allows medical equipment, intelligent terminals, medical infrastructure, and other elements of medical production to be interconnected, eliminating information silos and data fragmentation. Medical images disseminated in the IoMT contain a wide diversity of sensitive patient information, which makes protecting patients' personal information vital. In this work, an Adversarial-Improved Reversible Steganography network (Airs-Net) for computed tomography (CT) images in the IoMT is presented. Specifically, Airs-Net adopts a prediction-embedding strategy and mainly consists of an image restoration network, an embedded pixel location network, and a discriminator. The image restoration network effectively restores the pixel prediction error of the restoration set in integer- and non-integer-scaled images of arbitrary size when information is concealed. The embedded pixel location network automatically selects pixel locations for information embedding based on the interpolated image features of the degraded image. The restored image, embedding location map, and embedding information are fed into the embedder, and the discriminator then continuously optimizes the quality of the resulting secret-carrying image. Quantitative results show that Airs-Net outperforms state-of-the-art methods in both PSNR and SSIM. Further, qualitative and quantitative results and analyses under specific clinical application scenarios, and across multiple types of medical image information hiding, demonstrate the excellent generalization performance and practical applicability of Airs-Net.

AAAI Conference 2026 Conference Paper

Enhancing Logical Expressiveness in Graph Neural Networks via Path-Neighbor Aggregation

  • Han Yu
  • Xiaojuan Zhao
  • Aiping Li
  • Kai Chen
  • Ziniu Liu
  • Zhichao Peng

Graph neural networks (GNNs) can effectively model the structural information of graphs, making them widely used in knowledge graph (KG) reasoning. However, existing studies on the expressive power of GNNs mainly focus on simple single-relation graphs, and the ability of GNNs to express logical rules in KGs remains insufficiently explored. How to enhance the logical expressive power of GNNs is still a key open issue. Motivated by this, we propose Path-Neighbor enhanced GNN (PN-GNN), a method that enhances the logical expressive power of GNNs by aggregating node-neighbor embeddings along the reasoning path. First, we analyze the logical expressive power of existing GNN-based methods and point out their shortcomings. Then, we theoretically investigate the logical expressive power of PN-GNN, showing that it not only has strictly stronger expressive power than C-GNN but also that its (k+1)-hop logical expressiveness is strictly superior to its k-hop expressiveness. Finally, we evaluate the logical expressive power of PN-GNN on six synthetic datasets and two real-world datasets. Both theoretical analysis and extensive experiments confirm that PN-GNN enhances the expressive power for logical rules without compromising generalization, as evidenced by its competitive performance on KG reasoning tasks.

AAAI Conference 2026 Conference Paper

Rethinking Flow and Diffusion Bridge Models for Speech Enhancement

  • Dahan Wang
  • Jun Gao
  • Tong Lei
  • Yuxiang Hu
  • Changbao Zhu
  • Kai Chen
  • Jing Lu

Flow matching and diffusion bridge models have emerged as leading paradigms in generative speech enhancement, modeling stochastic processes between paired noisy and clean speech signals based on principles such as flow matching, score matching, and Schrödinger bridge. In this paper, we present a framework that unifies existing flow and diffusion bridge models by interpreting them as constructions of Gaussian probability paths with varying means and variances between paired data. Furthermore, we investigate the underlying consistency between the training/inference procedures of these generative models and conventional predictive models. Our analysis reveals that each sampling step of a well-trained flow or diffusion bridge model optimized with a data prediction loss is theoretically analogous to executing predictive speech enhancement. Motivated by this insight, we introduce an enhanced bridge model that integrates an effective probability path design with key elements from predictive paradigms, including improved network architecture, tailored loss functions, and optimized training strategies. Experiments on denoising and dereverberation tasks demonstrate that the proposed method outperforms existing flow and diffusion baselines with fewer parameters and reduced computational complexity. The results also highlight that the inherently predictive nature of this generative framework imposes limitations on its achievable upper-bound performance.
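
The unifying Gaussian-path view described in this abstract lends itself to a compact sketch. Below is a minimal, hypothetical Python illustration of sampling an intermediate state between paired clean and noisy speech; the linear mean and bridge-style variance are one example (mean, variance) schedule, not the paper's exact design.

```python
import math
import torch

def sample_bridge_state(x_clean, x_noisy, t, sigma_max=0.5):
    """Sample a state on a Gaussian probability path between paired data:
    clean speech at t=0, noisy speech at t=1. The linear mean and the
    bridge-style variance below are illustrative choices; the paper frames
    existing flow/bridge models as different (mean, variance) schedules."""
    mean = (1.0 - t) * x_clean + t * x_noisy       # interpolated mean
    std = sigma_max * math.sqrt(t * (1.0 - t))     # vanishes at both endpoints
    return mean + std * torch.randn_like(x_clean)

# With a data-prediction loss, the network regresses x_clean from (x_t, t),
# which is why each sampling step resembles a predictive enhancement pass.
x_clean, x_noisy = torch.randn(1, 16000), torch.randn(1, 16000)
x_t = sample_bridge_state(x_clean, x_noisy, t=0.3)
```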

JBHI Journal 2026 Journal Article

SkeDiff: Skeleton 3D CT Diffusion Reconstruction using 2D X-ray

  • Yuan Gao
  • Rongjun Ge
  • Yunbo Gu
  • Zhan Wu
  • Yuanhang Li
  • Mingle Zhou
  • Kai Chen
  • Jean-Louis Coatrieux

For orthopedic diagnostics, both 2D X-ray and 3D CT imaging play essential roles. X-ray imaging is widely accessible, clinically effective, easy to operate, and has lower radiation exposure than CT. However, its inherent 2D nature limits comprehensive visualization of skeletal structures, which 3D CT provides. To bridge this gap, we propose SkeDiff, an algorithm for reconstructing 3D CT images of the skeleton from orthogonal 2D X-ray projections. To fully leverage the information in X-ray images for guiding the diffusion process, we design a cross-dimensional conditional encoder, $E_{Cond}$, to extract 2D priors for the 3D diffusion model, $DM_{3DL}$. This encoder integrates a CNN-Mamba hybrid architecture to enhance feature extraction and nonlinear mapping. Additionally, we introduce a 3D UKAN diffusion backbone, which employs a Kolmogorov-Arnold network (KAN) to improve feature representation through learnable nonlinear activations. Furthermore, we propose a diffusion-based scoliosis classifier, $D_{SC}$, enabling scoliosis classification during the 3D CT reconstruction process. Experiments show that SkeDiff outperforms recent algorithms on spine, hip, and knee datasets.

NeurIPS Conference 2025 Conference Paper

Contact Map Transfer with Conditional Diffusion Model for Generalizable Dexterous Grasp Generation

  • Yiyao Ma
  • Kai Chen
  • Kexin Zheng
  • Qi Dou

Dexterous grasp generation is a fundamental challenge in robotics, requiring both grasp stability and adaptability across diverse objects and tasks. Analytical methods ensure stable grasps but are inefficient and lack task adaptability, while generative approaches improve efficiency and task integration but generalize poorly to unseen objects and tasks due to data limitations. In this paper, we propose a transfer-based framework for dexterous grasp generation, leveraging a conditional diffusion model to transfer high-quality grasps from shape templates to novel objects within the same category. Specifically, we reformulate the grasp transfer problem as the generation of an object contact map, incorporating object shape similarity and task specifications into the diffusion process. To handle complex shape variations, we introduce a dual mapping mechanism that captures the intricate geometric relationships between shape templates and novel objects. Beyond the contact map, we derive two additional object-centric maps, the part map and direction map, to encode finer contact details for more stable grasps. We then develop a cascaded conditional diffusion model framework to jointly transfer these three maps, ensuring their intra-consistency. Finally, we introduce a robust grasp recovery mechanism that identifies reliable contact points and optimizes grasp configurations efficiently. Extensive experiments demonstrate the superiority of our proposed method, which effectively balances grasp quality, generation efficiency, and generalization performance across various tasks. Project homepage: https://cmtdiffusion.github.io/

ICLR Conference 2025 Conference Paper

CryoGEN: Generative Energy-based Models for Cryogenic Electron Tomography Reconstruction

  • Yunfei Teng
  • Yuxuan Ren
  • Kai Chen
  • Xi Chen
  • Zhaoming Chen
  • Qiwei Ye

Cryogenic electron tomography (Cryo-ET) is a powerful technique for visualizing subcellular structures in their native states. Nonetheless, its effectiveness is compromised by anisotropic resolution artifacts caused by the missing-wedge effect. To address this, IsoNet, a deep learning-based method, proposes iteratively reconstructing the missing-wedge information. While successful, IsoNet's dependence on recursive prediction updates often leads to training instability and model divergence. In this study, we introduce CryoGEN, an energy-based probabilistic model that not only mitigates resolution anisotropy but also removes the need for recursive subtomogram averaging, delivering an approximately $10\times$ speedup in training. Evaluations across various biological datasets, including immature HIV-1 virions and ribosomes, demonstrate that CryoGEN significantly enhances the structural completeness and interpretability of the reconstructed samples.

AAAI Conference 2025 Conference Paper

DuMo: Dual Encoder Modulation Network for Precise Concept Erasure

  • Feng Han
  • Kai Chen
  • Chao Gong
  • Zhipeng Wei
  • Jingjing Chen
  • Yu-Gang Jiang

The exceptional generative capability of text-to-image models has raised substantial safety concerns regarding the generation of Not-Safe-For-Work (NSFW) content and potential copyright infringement. To address these concerns, previous methods safeguard the models by eliminating inappropriate concepts. Nonetheless, these methods alter the parameters of the backbone network and exert considerable influence on the structural (low-frequency) components of the image, which undermines the model's ability to retain irrelevant concepts. In this work, we propose the Dual encoder Modulation network (DuMo), which achieves precise erasure of inappropriate target concepts with minimal impairment to non-target concepts. In contrast to previous methods, DuMo employs an Eraser with PRior Knowledge (EPR) module that modifies the skip-connection features of the U-NET and primarily achieves concept erasure on the detail (high-frequency) components of the image. To minimize the damage to non-target concepts during erasure, the parameters of the backbone U-NET are frozen and prior knowledge from the original skip-connection features is introduced into the erasure process. Meanwhile, we observe that the EPR module exhibits distinct erasing preferences for image structure and details at different timesteps and layers. Therefore, we adopt a novel Time-Layer MOdulation process (TLMO) that adjusts the erasure scale of the EPR module's outputs across layers and timesteps, automatically balancing the erasure effect against the model's generative ability. Our method achieves state-of-the-art performance on Explicit Content Erasure (detecting only 34 nude parts), Cartoon Concept Removal (average LPIPS_da of 0.428, 0.113 higher than the SOTA at 0.315), and Artistic Style Erasure (average LPIPS_da of 0.387, 0.088 higher than the SOTA at 0.299), clearly outperforming alternative methods.

JBHI Journal 2025 Journal Article

EDG-Net: Encryption and Decryption based Gan-attention Network for CT images in the Internet of Medical Things and Telemedicine

  • Kai Chen
  • Yuchen Li
  • Shipeng Xie
  • Zhan Wu
  • Yikun Zhang
  • Jean-Louis Coatrieux
  • Wei Yan
  • Yang Chen

CT images provide medical practitioners with a scientific and intuitive rationale for the diagnosis of clinical diseases. The Internet of Medical Things (IoMT) and telemedicine facilitate the preservation, transmission, and application of medical data and drive the sharing of medical data, especially medical images. Encryption and decryption of CT images distributed in the IoMT and telemedicine are becoming critical because these images contain a large amount of private, patient-sensitive information and are vulnerable to third-party attacks, resulting in information exposure and privacy leakage. In this paper, we propose an Encryption and Decryption based Gan-attention network (EDG-Net) for CT images in the IoMT and telemedicine. EDG-Net consists of a generator, two discriminators, a domain-transfer attention module, and adaptive normalization. In addition, EDG-Net introduces a double encryption and decryption strategy to effectively improve the security of the ciphertext image and the fidelity of the decrypted plaintext image. Specifically, during the encryption or decryption phase, the generator transforms CT images between the plaintext and ciphertext domains. The two discriminators identify and reduce the differences between these two domain transformations, in particular improving the accuracy of reconstruction during decryption. The parameters of the trained encryption and decryption network serve as the secret keys for encryption and decryption. Qualitative and quantitative analysis on public and private datasets demonstrates the superior performance of EDG-Net in encryption security and robustness as well as decryption accuracy.

AAAI Conference 2025 Conference Paper

LLM-DR: A Novel LLM-Aided Diffusion Model for Rule Generation on Temporal Knowledge Graphs

  • Kai Chen
  • Xin Song
  • Ye Wang
  • Liqun Gao
  • Aiping Li
  • Xiaojuan Zhao
  • Bin Zhou
  • Yalong Xie

Among various temporal knowledge graph (TKG) extrapolation methods, rule-based approaches stand out for their explicit rules and transparent reasoning paths. However, the vast search space for rule extraction poses a challenge in identifying high-quality logic rules. To navigate this challenge, we explore the use of generative models to generate new rules, thereby enriching our rule base and enhancing our reasoning capabilities. In this paper, we introduce LLM-DR, an innovative rule-based method for TKG extrapolation that harnesses diffusion models to generate rules consistent with the distribution of the source data, while also incorporating the rich semantic insights of Large Language Models (LLMs). Specifically, LLM-DR generates semantically relevant, high-quality rules by employing conditional diffusion models in a classifier-free guidance fashion and refining them with LLM-based constraints. To assess rule efficacy, we design a coarse-to-fine evaluation strategy that begins with coarse-grained filtering to eliminate less plausible rules and proceeds with fine-grained scoring to quantify the reliability of the retained rules. Extensive experiments demonstrate the promising capability of our LLM-DR.

NeurIPS Conference 2025 Conference Paper

Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models

  • Jiaqi Cao
  • Jiarui Wang
  • Rubin Wei
  • Qipeng Guo
  • Kai Chen
  • Bowen Zhou
  • Zhouhan Lin

Large Language Models (LLMs) have shown strong abilities in general language tasks, yet adapting them to specific domains remains a challenge. Current methods like Domain Adaptive Pretraining (DAPT) require costly full-parameter training and suffer from catastrophic forgetting, while Retrieval-Augmented Generation (RAG) introduces substantial inference latency due to expensive nearest-neighbor searches and longer contexts. This paper introduces Memory Decoder, a plug-and-play pretrained memory that enables efficient domain adaptation without changing the original model's parameters. Memory Decoder employs a small transformer decoder that learns to imitate the behavior of an external non-parametric retriever. Once trained, Memory Decoder can be seamlessly integrated with any pretrained language model that shares the same tokenizer, requiring no model-specific modifications. Experimental results demonstrate that Memory Decoder enables effective adaptation of various Qwen and Llama models to three distinct specialized domains (biomedicine, finance, and law), reducing perplexity by an average of 6.17 points. Overall, Memory Decoder introduces a novel paradigm centered on a specially pretrained memory component designed for domain-specific adaptation. This memory architecture can be integrated in a plug-and-play manner, consistently enhancing performance across multiple models within the target domain.
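
The plug-and-play integration described here can be sketched as distribution interpolation. The kNN-LM-style probability mixing and the weight `lam` below are assumptions for illustration, not the paper's confirmed mechanism; the key property preserved is that the base model's parameters are never touched.

```python
import torch
import torch.nn.functional as F

def memory_augmented_logprobs(base_logits, memory_logits, lam=0.3):
    """Blend next-token distributions from a frozen base LM and a small
    pretrained memory decoder that shares its tokenizer. Probability
    interpolation and the weight `lam` are hypothetical choices here;
    the base model itself stays unmodified."""
    p_base = F.softmax(base_logits, dim=-1)
    p_mem = F.softmax(memory_logits, dim=-1)
    return torch.log((1.0 - lam) * p_base + lam * p_mem)

# Any Qwen/Llama checkpoint sharing the tokenizer could supply base_logits.
vocab = 32000
mixed = memory_augmented_logprobs(torch.randn(1, vocab), torch.randn(1, vocab))
```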

NeurIPS Conference 2025 Conference Paper

Mixing Expert Knowledge: Bring Human Thoughts Back To the Game of Go

  • Yichuan Ma
  • Linyang Li
  • Yongkang Chen
  • Peiji Li
  • Jiasheng Ye
  • Qipeng Guo
  • Dahua Lin
  • Kai Chen

Large language models (LLMs) have demonstrated exceptional performance in reasoning tasks such as mathematics and coding, matching or surpassing human capabilities. However, these impressive reasoning abilities face significant challenges in specialized domains. Taking Go as an example, although AlphaGo has established the high performance ceiling of AI systems in Go, mainstream LLMs still struggle to reach even beginner-level proficiency, let alone perform natural language reasoning. This performance gap between general-purpose LLMs and domain experts is significantly limiting the application of LLMs on a wider range of domain-specific tasks. In this work, we aim to bridge the divide between LLMs' general reasoning capabilities and expert knowledge in domain-specific tasks. We perform mixed fine-tuning with structured Go expertise and general long Chain-of-Thought (CoT) reasoning data as a cold start, followed by reinforcement learning to integrate expert knowledge in Go with general reasoning capabilities. Through this methodology, we present LoGos, a powerful LLM that not only maintains outstanding general reasoning abilities, but also conducts Go gameplay in natural language, demonstrating effective strategic reasoning and accurate next-move prediction. LoGos achieves performance comparable to human professional players, substantially surpassing all existing LLMs. Through this work, we aim to contribute insights on applying general LLM reasoning capabilities to specialized domains. We will release the first large-scale Go dataset for LLM training, the first LLM Go evaluation benchmark, and the first general LLM that reaches human expert-level performance in Go.

TMLR Journal 2025 Journal Article

NeedleBench: Evaluating LLM Retrieval and Reasoning Across Varying Information Densities

  • Mo Li
  • Songyang Zhang
  • Taolin Zhang
  • Haodong Duan
  • Yunxin Liu
  • Kai Chen

The capability of large language models to handle long-context information plays a crucial role across various real-world applications. Existing methods for evaluating long-context abilities often rely either on real-world long texts, making it difficult to exclude the influence of models' inherent knowledge, or introduce large amounts of irrelevant filler content to artificially reach target lengths, reducing the relevance and effectiveness of assessments. To address these limitations, we introduce NeedleBench, a comprehensive synthetic framework designed to assess retrieval and reasoning performance in bilingual long-context tasks with adaptive context lengths (e.g., 32k, 128k, and beyond). NeedleBench systematically embeds key data points at varying depths to rigorously test models' capabilities in diverse settings. Tasks within NeedleBench are categorized into two distinct scenarios: information-sparse, characterized by minimal relevant details embedded within extensive irrelevant text to simulate simpler real-world retrieval tasks; and information-dense, implemented as the Ancestral Trace Challenge, where relevant information is continuously distributed throughout the context to simulate more complex real-world reasoning tasks. Our experiments show that, while recent reasoning models such as Deepseek-R1 and OpenAI's o3 have demonstrated strong performance on mathematical reasoning benchmarks, they still struggle to generalize their reasoning abilities and perform poorly on our information-dense tasks, frequently encountering difficulties with continuous retrieval and reasoning even at relatively short context lengths. Furthermore, we identify and characterize a phenomenon termed 'under-thinking', wherein models prematurely conclude their reasoning processes despite the availability of relevant information. NeedleBench thus provides critical insights and targeted evaluation tools essential for understanding and improving the long-context capabilities of LLMs. All code and resources are publicly available at https://github.com/open-compass/opencompass.
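
The depth-controlled embedding at the heart of such synthetic benchmarks is simple to illustrate. The function below is a hypothetical minimal version; NeedleBench's actual pipeline additionally handles bilingual text, multiple needles, and length calibration.

```python
def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Embed a key fact (the 'needle') at a relative depth within filler
    context, the basic construction behind depth-controlled retrieval
    tests. A real harness would split on sentence boundaries."""
    assert 0.0 <= depth <= 1.0
    pos = int(len(haystack) * depth)
    return haystack[:pos] + " " + needle + " " + haystack[pos:]

filler = "Lorem ipsum. " * 1000
prompt = insert_needle(filler, "The secret code is 7243.", depth=0.5)
```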

NeurIPS Conference 2025 Conference Paper

OOD-Barrier: Build a Middle-Barrier for Open-Set Single-Image Test Time Adaptation via Vision Language Models

  • Boyang Peng
  • Sanqing Qu
  • Tianpei Zou
  • Fan Lu
  • Ya Wu
  • Kai Chen
  • Siheng Chen
  • Yong Wu

In real-world environments, a well-designed model must be capable of handling dynamically evolving distributions, where both in-distribution (ID) and out-of-distribution (OOD) samples appear unpredictably and individually, making real-time adaptation particularly challenging. While open-set test-time adaptation has demonstrated effectiveness in adjusting to distribution shifts, existing methods often rely on batch processing and struggle to manage single-sample data streams in open-set environments. To address this limitation, we propose Open-IRT, a novel open-set Intermediate-Representation-based Test-time adaptation framework tailored for single-image test-time adaptation with vision-language models. Open-IRT comprises two key modules designed for dynamic, single-sample adaptation in open-set scenarios. The first is the Polarity-aware Prompt-based OOD Filter module, which fully constructs the ID-OOD distribution, considering both absolute semantic alignment and relative semantic polarity. The second, the Intermediate Domain-based Test-time Adaptation module, constructs an intermediate domain and indirectly decomposes the ID-OOD distributional discrepancy to refine the separation boundary at test time. Extensive experiments on a range of domain adaptation benchmarks demonstrate the superiority of Open-IRT. Compared to previous state-of-the-art methods, it achieves significant improvements on representative benchmarks such as CIFAR-100C and SVHN, with gains of +8.45% in accuracy, -10.80% in FPR95, and +11.04% in AUROC.

NeurIPS Conference 2025 Conference Paper

Pre-Trained Policy Discriminators are General Reward Models

  • Shihan Dou
  • Shichun Liu
  • Yuming Yang
  • Yicheng Zou
  • Yunhua Zhou
  • Shuhao Xing
  • Chenhao Huang
  • Qiming Ge

We offer a novel perspective on reward modeling by formulating it as a policy discriminator, which quantifies the difference between two policies to generate a reward signal, guiding the training policy towards a target policy with desired behaviors. Based on this conceptual insight, we propose a scalable pre-training method named POLicy DiscriminAtive LeaRning (POLAR), which trains a reward model (RM) to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between one policy and an arbitrary target policy, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships. Leveraging the POLAR pre-training paradigm, we present a series of RMs with parameter scales from 1.8B to 7B. Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods, significantly enhancing RM performance. For instance, POLAR-7B improves preference accuracy from 54.8% to 81.0% on STEM tasks and from 57.9% to 85.5% on creative writing tasks compared to SOTA baselines. POLAR also shows robust generalization capabilities in RLHF using Reinforcement Fine-tuning (RFT), providing reliable reward signals and markedly enhancing policy performance, improving LLaMa3.1-8B from an average of 47.36% to 56.33% and Qwen2.5-32B from 64.49% to 70.47% on 20 benchmarks. Moreover, scaling experiments reveal a clear power-law relationship between computation and performance, supported by linear correlation coefficients approaching 0.99. The impressive performance, strong generalization, and scaling properties suggest that POLAR is a promising direction for developing general and strong reward models.
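
The policy-discriminative pretraining idea can be sketched as a contrastive objective: score response pairs drawn from the same policy above pairs drawn from different policies. The hinge/margin form below is an assumption for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def policy_discrimination_loss(score_same, score_diff, margin=1.0):
    """Contrastive sketch of POLAR-style pretraining. `score_same` holds RM
    scores for (reference, candidate) pairs sampled from the same policy,
    `score_diff` for pairs from different policies; the RM learns to rank
    the former higher. The margin formulation is a hypothetical stand-in."""
    return F.relu(margin - score_same + score_diff).mean()

loss = policy_discrimination_loss(torch.randn(8), torch.randn(8))
```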

AAAI Conference 2025 Conference Paper

Promptable Anomaly Segmentation with SAM Through Self-Perception Tuning

  • Hui-Yue Yang
  • Hui Chen
  • Ao Wang
  • Kai Chen
  • Zijia Lin
  • Yongliang Tang
  • Pengcheng Gao
  • Yuming Quan

Segment Anything Model (SAM) has made great progress in anomaly segmentation tasks due to its impressive generalization ability. However, existing methods that directly apply SAM through prompting often overlook the domain shift issue, where SAM performs well on natural images but struggles in industrial scenarios. Parameter-Efficient Fine-Tuning (PEFT) offers a promising solution, but it may yield suboptimal performance by not adequately addressing the perception challenges during adaptation to anomaly images. In this paper, we propose a novel Self-Perception Tuning (SPT) method, aiming to enhance SAM's perception capability for anomaly segmentation. The SPT method incorporates a self-drafting tuning strategy, which generates an initial coarse draft of the anomaly mask, followed by a refinement process. Additionally, a visual-relation-aware adapter is introduced to improve the perception of discriminative relational information for mask generation. Extensive experimental results on several benchmark datasets demonstrate that our SPT method can significantly outperform baseline methods, validating its effectiveness.

AAAI Conference 2025 Conference Paper

RepeatLeakage: Leak Prompts from Repeating as Large Language Model Is a Good Repeater

  • Yu Peng
  • Lijie Zhang
  • Peizhuo Lv
  • Kai Chen

With the development of large language models (LLMs), numerous online applications based on these models have emerged. As system prompts significantly influence the performance of LLMs, many such applications conceal their system prompts and regard them as intellectual property. Consequently, numerous efforts have been made to steal these system prompts. However, for applications that do not publicly disclose their system prompts, prompts stolen by previous methods come with low confidence, because those methods rely on confirmation from application developers, which is unrealistic since developers may be unwilling to acknowledge that their system prompts have been leaked. We observed a phenomenon: when an LLM performs repetitive tasks, it accurately repeats based on the context rather than relying on its internal model parameters. We validated this phenomenon by comparing the results of two input types (repetitive tasks and knowledge-based tasks) under conditions of normal execution, contaminated execution, and partially restored execution. By contaminating the input nouns and then partially restoring them using data from the normal execution's intermediate layers, we measured the accuracy of both task types across these three execution processes. Based on this phenomenon, we propose a high-confidence leakage method called RepeatLeakage. By specifying the range that the model needs to repeat and encouraging the model not to change the format, we manage to extract its system prompt and conversation contexts. We validated the repetition phenomenon on multiple open-source models and successfully designed prompts using RepeatLeakage to leak contents from the actual system prompts of GPT-Store and publicly available ChatGPT conversation contexts. Finally, we tested RepeatLeakage in real environments such as the ChatGPT web interface, successfully leaking system prompts and conversation contexts.

NeurIPS Conference 2025 Conference Paper

Rethinking Verification for LLM Code Generation: From Generation to Testing

  • Zihan Ma
  • Taolin Zhang
  • Junnan Liu
  • Wenwei Zhang
  • Minnan Luo
  • Songyang Zhang
  • Kai Chen

Large language models (LLMs) have recently achieved notable success in code-generation benchmarks such as HumanEval and LiveCodeBench. However, a detailed examination reveals that these evaluation suites often comprise only a limited number of homogeneous test cases, resulting in subtle faults going undetected. This not only artificially inflates measured performance but also compromises accurate reward estimation in reinforcement learning frameworks utilizing verifiable rewards (RLVR). To address these critical shortcomings, we systematically investigate the test-case generation (TCG) task by proposing multi-dimensional metrics designed to rigorously quantify test-suite thoroughness. Furthermore, we introduce a human-LLM collaborative method (SAGA), leveraging human programming expertise with LLM reasoning capability, aimed at significantly enhancing both the coverage and the quality of generated test cases. In addition, we develop TCGBench to facilitate the study of the TCG task. Experiments show that SAGA achieves a detection rate of 90.62% and a verifier accuracy of 32.58% on TCGBench. The Verifier Accuracy (Verifier Acc) of the code generation evaluation benchmark synthesized by SAGA is 10.78% higher than that of LiveCodeBench-v6. These results demonstrate the effectiveness of our proposed method. We hope this work contributes to building a scalable foundation for reliable LLM code evaluation, further advancing RLVR in code generation, and paving the way for automated adversarial test synthesis and adaptive benchmark integration.

AAAI Conference 2025 Conference Paper

Semantic-guided Masked Mutual Learning for Multi-modal Brain Tumor Segmentation with Arbitrary Missing Modalities

  • Guoyan Liang
  • Qin Zhou
  • Zhe Wang
  • Jingyuan Chen
  • Lin Gu
  • Chang Yao
  • Sai Wu
  • Bingcang Huang

Malignant brain tumors are an aggressive and deadly disease worldwide. Multi-modal MRI data is crucial for accurate brain tumor segmentation, but missing modalities, common in clinical practice, can severely degrade segmentation performance. While incomplete multi-modal learning methods attempt to address this, learning robust and discriminative features from arbitrary missing modalities remains challenging. To address this challenge, we propose a novel Semantic-guided Masked Mutual Learning (SMML) approach to distill robust and discriminative knowledge across diverse missing-modality scenarios. Specifically, we propose a novel dual-branch masked mutual learning scheme guided by Hierarchical Consistency Constraints (HCC) to ensure multi-level consistency, thereby enhancing mutual learning in incomplete multi-modal scenarios. The HCC framework comprises a pixel-level constraint that selects and exchanges reliable knowledge to guide the mutual learning process. Additionally, it includes a feature-level constraint that uncovers robust inter-sample and inter-class relational knowledge within the latent feature space. To further enhance multi-modal learning from missing-modality data, we integrate a refinement network into each student branch. This network leverages semantic priors from the Segment Anything Model (SAM) to provide supplementary information, effectively complementing the masked mutual learning strategy in capturing auxiliary discriminative knowledge. Extensive experiments on three challenging brain tumor segmentation datasets demonstrate that our method significantly improves performance over state-of-the-art methods in diverse missing-modality settings.

NeurIPS Conference 2025 Conference Paper

Semi-off-Policy Reinforcement Learning for Vision-Language Slow-Thinking Reasoning

  • Junhao Shen
  • Haiteng Zhao
  • Yuzhe Gu
  • Songyang Gao
  • Kuikun Liu
  • Haian Huang
  • Jianfei Gao
  • Dahua Lin

Enhancing large vision-language models (LVLMs) with visual slow-thinking reasoning is crucial for solving complex multimodal tasks. However, since LVLMs are mainly trained with vision-language alignment, it is difficult to adopt on-policy reinforcement learning (RL) to develop the slow-thinking ability because the rollout space is restricted by their initial abilities. Off-policy RL offers a way to go beyond the current policy, but directly distilling trajectories from external models may cause visual hallucinations due to mismatched visual perception abilities across models. To address these issues, this paper proposes SOPHIA, a simple and scalable semi-off-policy RL method for vision-language slow-thinking reasoning. SOPHIA builds a semi-off-policy behavior model by combining on-policy visual understanding from a trainable LVLM with off-policy slow-thinking reasoning from a language model, assigns outcome-based rewards to reasoning, and propagates visual rewards backward. The LVLM then learns the slow-thinking reasoning ability from the obtained reasoning trajectories using the propagated rewards via off-policy RL algorithms. Extensive experiments with InternVL2.5 and InternVL3.0 at 8B and 38B sizes show the effectiveness of SOPHIA. Notably, SOPHIA improves InternVL3.0-38B by 8.50% on average, reaching state-of-the-art performance among open-source LVLMs on multiple multimodal reasoning benchmarks, and even outperforms some closed-source models (e.g., GPT-4.1) on the challenging MathVision and OlympiadBench, achieving 49.08% and 49.95% pass@1 accuracy, respectively. Analysis shows SOPHIA outperforms supervised fine-tuning and direct on-policy RL methods, offering a better policy initialization for further on-policy training.

AAAI Conference 2025 Conference Paper

Social Recommendation via Graph-Level Counterfactual Augmentation

  • Yinxuan Huang
  • Ke Liang
  • Yanyi Huang
  • Xiang Zeng
  • Kai Chen
  • Bin Zhou

Traditional recommendation systems focus more on the correlations between users and items (user-item relationships), while research on user-user relationships, also known as social recommendation, has received significant attention in recent years. Graph-based models have achieved great success in this task by utilizing the complex topological information of social networks. However, these models still face insufficient expressiveness and overfitting problems. Counterfactual approaches have proven effective as information augmentation strategies for such issues in various scenarios, but are not fully utilized in social recommendation. To this end, we propose a novel social recommendation method, termed SR-GCA, built on a plug-and-play Graph-Level Counterfactual Augmentation mechanism. Specifically, we first generate counterfactual social and item links by constructing a counterfactual matrix for data augmentation. Then, we employ a supervised learning strategy to refine both factual and counterfactual links. Thirdly, we enhance user representation learning via alignment and self-supervised optimization techniques. Extensive experiments demonstrate the promising capacity of our model from five aspects: superiority, effectiveness, transferability, complexity, and sensitivity. In particular, the transferability is well proven by extending our GCA module to three typical social recommendation models.

NeurIPS Conference 2025 Conference Paper

Sparse Meets Dense: Unified Generative Recommendations with Cascaded Sparse-Dense Representations

  • Yuhao Yang
  • Zhi Ji
  • Zhaopeng Li
  • Yi Li
  • Zhonglin Mo
  • Yue Ding
  • Kai Chen
  • Zijian Zhang

Generative models have recently gained attention in recommendation systems by directly predicting item identifiers from user interaction sequences. However, existing methods suffer from significant information loss due to the separation of stages such as quantization and sequence modeling, hindering their ability to achieve the modeling precision and accuracy of sequential dense retrieval techniques. Integrating generative and dense retrieval methods remains a critical challenge. To address this, we introduce the Cascaded Organized Bi-Represented generAtive retrieval (COBRA) framework, which innovatively integrates sparse semantic IDs and dense vectors through a cascading process: sparse IDs are generated first and serve as conditions that aid the generation of dense vectors. End-to-end training enables dynamic refinement of the dense representations, capturing both semantic insights and collaborative signals from user-item interactions. During inference, COBRA employs a coarse-to-fine strategy, starting with sparse ID generation and refining the IDs into dense vectors via the generative model. We further propose BeamFusion, an innovative approach combining beam search with nearest-neighbor scores to enhance inference flexibility and recommendation diversity. Extensive experiments on public datasets and offline tests validate our method's robustness. Online A/B tests on a real-world advertising platform with over 200 million daily users demonstrate substantial improvements in key metrics, highlighting COBRA's practical advantages.
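
BeamFusion's ranking idea, combining a generative beam score with a nearest-neighbor similarity, admits a one-line sketch. The linear mixing below and the weight `alpha` are hypothetical choices, not the paper's confirmed formula.

```python
def beam_fusion_score(beam_logprob: float, nn_similarity: float,
                      alpha: float = 0.5) -> float:
    """Fuse a beam-search log-probability (from sparse-ID generation) with a
    nearest-neighbor score over dense vectors. `alpha` is an assumed knob
    trading generative confidence against retrieval similarity."""
    return alpha * beam_logprob + (1.0 - alpha) * nn_similarity

# Rank two hypothetical candidates by their fused scores.
candidates = {"item_a": beam_fusion_score(-1.2, 0.91),
              "item_b": beam_fusion_score(-0.8, 0.42)}
best = max(candidates, key=candidates.get)
```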

TMLR Journal 2025 Journal Article

Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models

  • Junjie Wu
  • Tsz Ting Chung
  • Kai Chen
  • Dit-Yan Yeung

Despite the outstanding performance in vision-language reasoning, Large Vision-Language Models (LVLMs) might generate hallucinated contents that do not exist in the given image. Most existing LVLM hallucination benchmarks are constrained to evaluate the object-related hallucinations. However, the potential hallucination on the relations between two objects, i.e., relation hallucination, still lacks investigation. To remedy that, we design a unified framework to measure object and relation hallucination in LVLMs simultaneously. The core idea of our framework is to evaluate hallucinations in (object, relation, object) triplets extracted from LVLMs’ responses, making it easily generalizable to various vision-language tasks. Based on our framework, we further introduce Tri-HE, a novel Triplet-level Hallucination Evaluation benchmark which can be used to study both object and relation hallucination at the same time. With comprehensive evaluations on Tri-HE, we observe that the relation hallucination issue is even more serious than object hallucination among existing LVLMs, highlighting a previously neglected problem towards reliable LVLMs. Moreover, based on our findings, we design a simple training-free approach that effectively mitigates hallucinations for LVLMs.
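
Triplet-level evaluation reduces to checking extracted (object, relation, object) triplets against ground truth. A minimal sketch, assuming exact matching (the benchmark's actual judging protocol may be softer):

```python
def triplet_hallucination_rate(pred_triplets, gt_triplets):
    """Fraction of triplets extracted from an LVLM response that are not
    supported by the ground-truth triplet set; unsupported triplets count
    as hallucinated. Exact tuple matching is an assumption."""
    gt = set(gt_triplets)
    unsupported = [t for t in pred_triplets if t not in gt]
    return len(unsupported) / max(len(pred_triplets), 1)

rate = triplet_hallucination_rate(
    [("dog", "on", "sofa"), ("cat", "under", "table")],
    [("dog", "on", "sofa")],
)  # -> 0.5: the triplet about the cat is a relation hallucination
```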

JBHI Journal 2025 Journal Article

Unsupervised Brain Anomaly Detection Using Structure-Preserving Noise Generation and Multi-Scale Dual-Expert Ensembles

  • Qianyi Yang
  • Bingcang Huang
  • Qin Zhou
  • Zhe Wang
  • Kai Chen
  • Xiu Tang
  • Chang Yao
  • Sai Wu

Detecting early brain anomalies is crucial for patient prognosis and recovery, but obtaining expert-annotated data is challenging, especially for clinically silent early brain anomalies. Unsupervised brain anomaly detection, which identifies anomalous regions by modeling normal brain patterns, has gained interest for its label efficiency. However, the inherent variability in normal brains and subtle anomalies that closely resemble normal tissue pose challenges for traditional autoencoders in distinguishing anomalies. Denoising AutoEncoder (DAE) methods have been explored to enhance the model's ability, but their success hinges on effective noise generation strategies. In this paper, we introduce a novel, structure-preserving noise generation scheme based on cross-modal CutMix, aiming to enhance the diversity of noise patterns while preserving the anatomical structure of the brain. To enhance the robustness of DAE learning, we propose an ensemble approach featuring dual experts, each incorporating a distinct scale of noise. This dual-expert scheme effectively amplifies reconstruction errors in anomalous regions and suppresses false alarms in healthy areas. Additionally, we propose an anatomically-aware bidirectional consistency loss to ensure high-fidelity reconstruction at the regional level, using superpixels for anatomy perception and bidirectional distillation for reliable knowledge transfer. Extensive experiments across two different settings demonstrate the effectiveness and generalization ability of our proposed method.
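
The CutMix-style noise injection can be sketched in a few lines. This is a hypothetical minimal version assuming co-registered images of two modalities; the paper's scheme additionally constrains placement to preserve anatomy, and the patch-size fraction here is an assumed parameter.

```python
import numpy as np

def structure_preserving_cutmix(target, source, box_frac=0.25, rng=None):
    """Paste a rectangular patch from a co-registered scan of another
    modality into the target image, creating synthetic 'anomalies' for a
    denoising autoencoder to remove while surrounding anatomy stays
    intact. Patch size and placement are illustrative choices."""
    rng = rng or np.random.default_rng()
    h, w = target.shape[:2]
    bh, bw = int(h * box_frac), int(w * box_frac)
    y = int(rng.integers(0, h - bh + 1))
    x = int(rng.integers(0, w - bw + 1))
    noisy = target.copy()
    noisy[y:y + bh, x:x + bw] = source[y:y + bh, x:x + bw]
    return noisy

t1, t2 = np.random.rand(128, 128), np.random.rand(128, 128)
corrupted = structure_preserving_cutmix(t1, t2)
```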

TIST Journal 2024 Journal Article

A Game-theoretic Framework for Privacy-preserving Federated Learning

  • Xiaojin Zhang
  • Lixin Fan
  • Siwei Wang
  • Wenjie Li
  • Kai Chen
  • Qiang Yang

In federated learning, benign participants aim to optimize a global model collaboratively. However, the risk of privacy leakage cannot be ignored in the presence of semi-honest adversaries. Existing research has focused either on designing protection mechanisms or on inventing attacking mechanisms. While the battle between defenders and attackers seems never-ending, we are concerned with one critical question: Is it possible to prevent potential attacks in advance? To address this, we propose the first game-theoretic framework that considers both FL defenders and attackers in terms of their respective payoffs, which include computational costs, FL model utilities, and privacy leakage risks. We name this game the federated learning privacy game (FLPG), in which neither defenders nor attackers are aware of all participants’ payoffs. To handle the incomplete information inherent in this situation, we propose associating the FLPG with an oracle that has two primary responsibilities. First, the oracle provides lower and upper bounds of the payoffs for the players. Second, the oracle acts as a correlation device, privately providing suggested actions to each player. With this novel framework, we analyze the optimal strategies of defenders and attackers. Furthermore, we derive and demonstrate conditions under which the attacker, as a rational decision-maker, should always follow the oracle’s suggestion not to attack.

TIST Journal 2024 Journal Article

A Meta-Learning Framework for Tuning Parameters of Protection Mechanisms in Trustworthy Federated Learning

  • Xiaojin Zhang
  • Yan Kang
  • Lixin Fan
  • Kai Chen
  • Qiang Yang

Trustworthy federated learning typically leverages protection mechanisms to guarantee privacy. However, protection mechanisms inevitably introduce utility loss or efficiency reduction while protecting data privacy. Therefore, protection mechanisms and their parameters should be carefully chosen to strike an optimal tradeoff among privacy leakage, utility loss, and efficiency reduction. To this end, federated learning practitioners need tools to measure the three factors and optimize the tradeoff between them to choose the protection mechanism that is most appropriate to the application at hand. Motivated by this requirement, we propose a framework that (1) formulates trustworthy federated learning as a problem of finding a protection mechanism to optimize the tradeoff among privacy leakage, utility loss, and efficiency reduction and (2) formally defines bounded measurements of the three factors. We then propose a meta-learning algorithm to approximate this optimization problem and find optimal protection parameters for representative protection mechanisms, including randomization, homomorphic encryption, secret sharing, and compression. We further design estimation algorithms to quantify these found optimal protection parameters in a practical horizontal federated learning setting and provide a theoretical analysis of the estimation error.
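
The tradeoff being optimized admits a compact formal sketch. A minimal weighted-sum formalization, assuming scalarization weights (the paper's bounded measurements and exact constrained formulation may differ):

```latex
% Sketch: choose a protection mechanism M with parameters \theta that
% minimizes privacy leakage \epsilon_p plus weighted utility loss
% \epsilon_u and efficiency reduction \epsilon_e. The weights
% \lambda_u, \lambda_e are illustrative assumptions.
\min_{M,\;\theta}\quad
  \epsilon_p(M_\theta)
  + \lambda_u\,\epsilon_u(M_\theta)
  + \lambda_e\,\epsilon_e(M_\theta)
```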

NeurIPS Conference 2024 Conference Paper

AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data

  • Zifan Song
  • Yudong Wang
  • Wenwei Zhang
  • Kuikun Liu
  • Chengqi Lyu
  • Demin Song
  • Qipeng Guo
  • Hang Yan

Open-source Large Language Models (LLMs) and their specialized variants, particularly Code LLMs, have recently delivered impressive performance. However, previous Code LLMs are typically fine-tuned on single-source data with limited quality and diversity, which may insufficiently elicit the potential of pre-trained Code LLMs. In this paper, we present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities fine-tuned on multi-source data. To achieve this, we are the first to unveil the inherent conflicts among the various styles and qualities in multi-source code corpora, and we introduce data-specific prompts with hindsight relabeling, termed AlchemistPrompts, to harmonize different data sources and instruction-response pairs. Additionally, we propose incorporating the data construction process into the fine-tuning data as code comprehension tasks, including instruction evolution, data filtering, and code review. Extensive experiments demonstrate that AlchemistCoder holds a clear lead among all models of the same size (6.7B/7B) and rivals or even surpasses larger models (15B/33B/70B), showcasing the efficacy of our method in refining instruction-following capabilities and advancing the boundaries of code intelligence. Source code and models are available at https://github.com/InternLM/AlchemistCoder.

NeurIPS Conference 2024 Conference Paper

ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models

  • Yuzhe Gu
  • Ziwei Ji
  • Wenwei Zhang
  • Chengqi Lyu
  • Dahua Lin
  • Kai Chen

Large language models (LLMs) exhibit hallucinations in long-form question-answering tasks across various domains and wide applications. Current hallucination detection and mitigation datasets are limited in domain and size, and they struggle to scale due to prohibitive labor costs and the insufficient reliability of existing hallucination annotators. To facilitate the scalable oversight of LLM hallucinations, this paper introduces an iterative self-training framework that simultaneously and progressively scales up the annotation dataset and improves the accuracy of the annotator. Based on the Expectation-Maximization algorithm, in each iteration the framework first applies an automatic hallucination annotation pipeline to a scaled dataset and then trains a more accurate annotator on that dataset. The new annotator is adopted in the annotation pipeline for the next iteration. Extensive experimental results demonstrate that the final hallucination annotator, with only 7B parameters, surpasses GPT-4 and obtains new state-of-the-art hallucination detection results on HaluEval and HalluQA by zero-shot inference. Such an annotator can not only evaluate the hallucination levels of various LLMs on large-scale datasets but also help to mitigate hallucinations in LLM generations, with the Natural Language Inference metric increasing from 25% to 37% on HaluEval.

ECAI Conference 2024 Conference Paper

Context Enhancement with Reconstruction as Sequence for Unified Unsupervised Anomaly Detection

  • Hui-Yue Yang
  • Hui Chen 0013
  • Lihao Liu
  • Zijia Lin
  • Kai Chen
  • Liejun Wang
  • Jungong Han
  • Guiguang Ding

Unsupervised anomaly detection (AD) aims to train robust detection models using only normal samples, while generalizing well to unseen anomalies. Recent research focuses on a unified unsupervised AD setting in which only one model is trained for all classes, i.e., the n-class-one-model paradigm. Feature-reconstruction-based methods achieve state-of-the-art performance in this scenario. However, existing methods often suffer from a lack of sufficient contextual awareness, thereby compromising the quality of the reconstruction. To address this issue, we introduce a novel Reconstruction as Sequence (RAS) method, which enhances the contextual correspondence during feature reconstruction from a sequence modeling perspective. In particular, based on the transformer technique, we integrate a specialized RASFormer block into RAS. This block enables the capture of spatial relationships among different image regions and enhances sequential dependencies throughout the reconstruction process. By incorporating the RASFormer block, our RAS method achieves superior contextual awareness, leading to remarkable performance. Experimental results show that RAS significantly outperforms competing methods, demonstrating the effectiveness and superiority of our method. Our code is available at https://github.com/Nothingtolose9979/RAS.

NeurIPS Conference 2024 Conference Paper

CriticEval: Evaluating Large-scale Language Model as Critic

  • Tian Lan
  • Wenwei Zhang
  • Chen Xu
  • Heyan Huang
  • Dahua Lin
  • Kai Chen
  • Xian-Ling Mao

Critique ability, i.e., the capability of Large Language Models (LLMs) to identify and rectify flaws in responses, is crucial for their applications in self-improvement and scalable oversight. While numerous studies have been proposed to evaluate the critique ability of LLMs, their comprehensiveness and reliability are still limited. To overcome this problem, we introduce CriticEval, a novel benchmark designed to comprehensively and reliably evaluate the critique ability of LLMs. Specifically, to ensure comprehensiveness, CriticEval evaluates critique ability from four dimensions across nine diverse task scenarios. It evaluates both scalar-valued and textual critiques, targeting responses of varying quality. To ensure reliability, a large number of critiques are annotated to serve as references, enabling GPT-4 to evaluate textual critiques reliably. Extensive evaluations of open-source and closed-source LLMs first validate the reliability of evaluation in CriticEval. Then, experimental results demonstrate the promising potential of open-source LLMs, the effectiveness of critique datasets, and several intriguing relationships between critique ability and critical factors including task types, response qualities, and critique dimensions.

AAAI Conference 2024 Conference Paper

DataElixir: Purifying Poisoned Dataset to Mitigate Backdoor Attacks via Diffusion Models

  • Jiachen Zhou
  • Peizhuo Lv
  • Yibing Lan
  • Guozhu Meng
  • Kai Chen
  • Hualong Ma

Dataset sanitization is a widely adopted proactive defense against poisoning-based backdoor attacks, aimed at filtering out and removing poisoned samples from training datasets. However, existing methods have shown limited efficacy in countering ever-evolving trigger functions, often leading to considerable degradation of benign accuracy. In this paper, we propose DataElixir, a novel sanitization approach tailored to purify poisoned datasets. We leverage diffusion models to eliminate trigger features and restore benign features, thereby turning poisoned samples into benign ones. Specifically, with multiple iterations of the forward and reverse process, we extract intermediary images and their predicted labels for each sample in the original dataset. Then, we identify anomalous samples in terms of the presence of label transitions in the intermediary images, detect the target label by quantifying distribution discrepancy, select purified images considering pixel and feature distance, and determine their ground-truth labels by training a benign model. Experiments conducted on 9 popular attacks demonstrate that DataElixir effectively mitigates various complex attacks while exerting minimal impact on benign accuracy, surpassing the performance of baseline defense methods.
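
The label-transition screening step can be illustrated in a few lines. A hypothetical sketch, assuming model predictions for each sample are already available before and after diffusion purification; majority voting for the target label is a simplification, since DataElixir additionally quantifies distribution discrepancy and feature distances.

```python
from collections import Counter

def screen_by_label_transition(labels_before, labels_after):
    """Flag samples whose predicted label flips once trigger features are
    removed by purification; the label most often flipped away from is a
    plausible backdoor target. Majority voting here is an assumption."""
    suspects = [i for i, (b, a) in enumerate(zip(labels_before, labels_after))
                if b != a]
    flipped_from = Counter(labels_before[i] for i in suspects)
    target = flipped_from.most_common(1)[0][0] if suspects else None
    return suspects, target

suspects, target = screen_by_label_transition([3, 3, 7, 3], [3, 1, 7, 0])
# suspects == [1, 3]; target == 3 (both flips left class 3)
```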

NeurIPS Conference 2024 Conference Paper

Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization

  • Kai Hu
  • Weichen Yu
  • Yining Li
  • Tianjun Yao
  • Xiang Li
  • Wenhe Liu
  • Lijun Yu
  • Zhiqiang Shen

Recent research indicates that large language models (LLMs) are susceptible to jailbreaking attacks that can generate harmful content. This paper introduces a novel token-level attack method, Adaptive Dense-to-Sparse Constrained Optimization (ADC), which has been shown to successfully jailbreak multiple open-source LLMs. Drawing inspiration from the difficulties of discrete token optimization, our method relaxes the discrete jailbreak optimization into a continuous optimization process while gradually increasing the sparsity of the optimizing vectors. This technique effectively bridges the gap between discrete- and continuous-space optimization. Experimental results demonstrate that our method is more effective and efficient than state-of-the-art token-level methods. On Harmbench, our approach achieves the highest attack success rate on seven out of eight LLMs compared to the latest jailbreak methods. Trigger Warning: This paper contains model behavior that can be offensive in nature.

AAAI Conference 2024 Conference Paper

Everything2Motion: Synchronizing Diverse Inputs via a Unified Framework for Human Motion Synthesis

  • Zhaoxin Fan
  • Longbin Ji
  • Pengxin Xu
  • Fan Shen
  • Kai Chen

In the dynamic field of film and game development, the emergence of human motion synthesis methods has revolutionized avatar animation. Traditional methodologies, typically reliant on single modality inputs like text or audio, employ modality-specific model frameworks, posing challenges for unified model deployment and application. To address this, we propose Everything2Motion, a unified model framework. Everything2Motion consists of three key modules. The Input-Output Modality Modulation module tailors structures for specific multimodal inputs, eliminating the need for modality-specific frameworks. The Query-aware Autoencoder, based on the transformer encoder-decoder architecture, enables efficient latent motion generation. Lastly, the Prior Motion Distillation Decoder, a pretrained module, enhances the final skeleton sequence's naturalness and fluidity. Comprehensive experiments on several public datasets demonstrate the effectiveness of Everything2Motion, highlighting its potential for practical applications and setting a new benchmark in human motion synthesis.

NeurIPS Conference 2024 Conference Paper

GTA: A Benchmark for General Tool Agents

  • Jize Wang
  • Zerun Ma
  • Yining Li
  • Songyang Zhang
  • Cailian Chen
  • Kai Chen
  • Xinyi Le

In developing general-purpose agents, significant focus has been placed on integrating large language models (LLMs) with various tools, which poses a challenge to the tool-use capabilities of LLMs. However, there are evident gaps between existing tool evaluations and real-world scenarios. Current evaluations often use AI-generated queries, single-step tasks, dummy tools, and text-only inputs, which fail to reveal agents' real-world problem-solving abilities effectively. To address this, we propose GTA, a benchmark for General Tool Agents, featuring three main aspects: (i) Real user queries: human-written queries with simple real-world objectives but implicit tool use, requiring the LLM to reason about the suitable tools and plan the solution steps. (ii) Real deployed tools: an evaluation platform equipped with tools across perception, operation, logic, and creativity categories to evaluate the agents' actual task execution performance. (iii) Real multimodal inputs: authentic image files, such as spatial scenes, web page screenshots, tables, code snippets, and printed/handwritten materials, used as query contexts to align closely with real-world scenarios. We designed 229 real-world tasks and executable tool chains to evaluate mainstream LLMs. Our findings show that real-world user queries are challenging for existing LLMs, with GPT-4 completing less than 50% of the tasks and most LLMs achieving below 25%. This evaluation reveals the bottlenecks in the tool-use capabilities of current LLMs in real-world scenarios, which benefits the advancement of general-purpose tool agents. Dataset and code are available at https://github.com/open-compass/GTA.

NeurIPS Conference 2024 Conference Paper

HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation

  • Zhenzhi Wang
  • Yixuan Li
  • Yanhong Zeng
  • Youqing Fang
  • Yuwei Guo
  • Wenran Liu
  • Jing Tan
  • Kai Chen

Human image animation involves generating videos from a character photo, allowing user control and unlocking the potential for video and movie production. While recent approaches yield impressive results using high-quality training data, the inaccessibility of these datasets hampers fair and transparent benchmarking. Moreover, these approaches prioritize 2D human motion and overlook the significance of camera motions in videos, leading to limited control and unstable video generation. To demystify the training data, we present HumanVid, the first large-scale high-quality dataset tailored for human image animation, which combines crafted real-world and synthetic data. For the real-world data, we compile a vast collection of real-world videos from the internet. We developed and applied careful filtering rules to ensure video quality, resulting in a curated collection of 20K high-resolution (1080P) human-centric videos. Human and camera motion annotation is accomplished using a 2D pose estimator and a SLAM-based method. To expand our synthetic dataset, we collected 10K 3D avatar assets and leveraged existing assets of body shapes, skin textures, and clothing. Notably, we introduce a rule-based camera trajectory generation method, enabling the synthetic pipeline to incorporate diverse and precise camera motion annotation, which can rarely be found in real-world data. To verify the effectiveness of HumanVid, we establish a baseline model named CamAnimate, short for Camera-controllable Human Animation, that considers both human and camera motions as conditions. Through extensive experimentation, we demonstrate that such simple baseline training on our HumanVid achieves state-of-the-art performance in controlling both human pose and camera motions, setting a new benchmark. Demo, data, and code can be found on the project website: https://humanvid.github.io/.

NeurIPS Conference 2024 Conference Paper

InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

  • Xiaoyi Dong
  • Pan Zhang
  • Yuhang Zang
  • Yuhang Cao
  • Bin Wang
  • Linke Ouyang
  • Songyang Zhang
  • Haodong Duan

The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its progression has been hindered by challenges in comprehending fine-grained visual content due to limited resolution. Recent efforts have aimed to enhance the high-resolution understanding capabilities of LVLMs, yet they remain capped at approximately 1500 × 1500 pixels and constrained to a relatively narrow resolution range. This paper presents InternLM-XComposer2-4KHD, a groundbreaking exploration into elevating LVLM resolution capabilities up to 4K HD (3840 × 1600) and beyond. Concurrently, considering that ultra-high resolution may not be necessary in all scenarios, it supports a wide range of diverse resolutions from 336 pixels to 4K standard, significantly broadening its scope of applicability. Specifically, this research advances the patch division paradigm by introducing a novel extension: dynamic resolution with automatic patch configuration. It maintains the training image aspect ratios while automatically varying patch counts and configuring layouts based on a pre-trained Vision Transformer (ViT) (336 × 336), leading to dynamic training resolution from 336 pixels to 4K standard. Our research demonstrates that scaling training resolution up to 4K HD leads to consistent performance enhancements without hitting the ceiling of potential improvements. InternLM-XComposer2-4KHD shows superb capability that matches or even surpasses GPT-4V and Gemini Pro in 10 of the 16 benchmarks.
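
One plausible reading of "automatic patch configuration" is a search for the tile grid that best preserves the image's aspect ratio under a patch budget. The sketch below illustrates only that idea; the budget of 55 patches, the exhaustive search, and the function name are assumptions, not the paper's exact rule.

```python
import math

def pick_grid(img_w: int, img_h: int, max_patches: int = 55):
    """Choose a (cols, rows) grid of 336x336 tiles within the patch budget
    that best matches the image's aspect ratio."""
    target = img_w / img_h
    best, best_err = (1, 1), math.inf
    for rows in range(1, max_patches + 1):
        for cols in range(1, max_patches // rows + 1):
            err = abs(cols / rows - target)
            if err < best_err:
                best, best_err = (cols, rows), err
    return best

cols, rows = pick_grid(3840, 1600)   # a 4K HD input
print(f"{cols}x{rows} tiles -> padded/resized to {cols * 336}x{rows * 336} px")
```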

NeurIPS Conference 2024 Conference Paper

Lean Workbook: A large-scale Lean problem set formalized from natural language math problems

  • Huaiyuan Ying
  • Zijian Wu
  • Yihan Geng
  • Jiayu Wang
  • Dahua Lin
  • Kai Chen

Large language models have demonstrated impressive capabilities across various natural language processing tasks, especially in solving mathematical problems. However, large language models are not good at math theorem proving using formal languages like Lean. A significant challenge in this area is the scarcity of training data available in these formal languages. To address this issue, we propose a novel pipeline that iteratively generates and filters synthetic data to translate natural language mathematical problems into Lean 4 statements, and vice versa. Our results indicate that the synthetic data pipeline can provide useful training data and improve the performance of LLMs in translating and understanding complex mathematical problems and proofs. Our final dataset contains about 57K formal-informal question pairs along with searched proofs from the math contest forum and 21 new IMO questions. We open-source our code at https://github.com/InternLM/InternLM-Math and our data at https://huggingface.co/datasets/InternLM/Lean-Workbook.
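
As a purely hypothetical illustration of the informal-to-formal direction of this translation task (the pair below is invented for exposition and is not drawn from the dataset):

```lean
-- Informal: "Show that for every natural number n, n + 0 = n."
-- A Lean 4 statement (with a trivial proof) of the kind the pipeline targets.
theorem n_add_zero (n : Nat) : n + 0 = n := rfl
```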

NeurIPS Conference 2024 Conference Paper

MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding

  • Xinyu Fang
  • Kangrui Mao
  • Haodong Duan
  • Xiangyu Zhao
  • Yining Li
  • Dahua Lin
  • Kai Chen

The advent of large vision-language models (LVLMs) has spurred research into their applications in multi-modal contexts, particularly in video understanding. Traditional VideoQA benchmarks, despite providing quantitative metrics, often fail to encompass the full spectrum of video content and inadequately assess models' temporal comprehension. To address these limitations, we introduce MMBench-Video, a quantitative benchmark designed to rigorously evaluate LVLMs' proficiency in video understanding. MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases. The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy. We employ GPT-4 for automated assessment, demonstrating superior accuracy and robustness over earlier LLM-based evaluations. Utilizing MMBench-Video, we have conducted comprehensive evaluations that include both proprietary and open-source LVLMs for images and videos. MMBench-Video stands as a valuable resource for the research community, facilitating improved evaluation of LVLMs and catalyzing progress in the field of video understanding.

NeurIPS Conference 2024 Conference Paper

MotionBooth: Motion-Aware Customized Text-to-Video Generation

  • Jianzong Wu
  • Xiangtai Li
  • Yanhong Zeng
  • Jiangning Zhang
  • Qianyu Zhou
  • Yining Li
  • Kai Chen
  • Yunhai Tong

In this work, we present MotionBooth, an innovative framework designed for animating customized subjects with precise control over both object and camera movements. By leveraging a few images of a specific object, we efficiently fine-tune a text-to-video model to capture the object's shape and attributes accurately. Our approach introduces subject region loss and video preservation loss to enhance the subject's learning performance, along with a subject token cross-attention loss to integrate the customized subject with motion control signals. Additionally, we propose training-free techniques for managing subject and camera motions during inference. In particular, we utilize cross-attention map manipulation to govern subject motion and introduce a novel latent shift module for camera movement control. MotionBooth excels in preserving the appearance of subjects while simultaneously controlling the motions in generated videos. Extensive quantitative and qualitative evaluations demonstrate the superiority and effectiveness of our method. Models and code will be made publicly available.

TIST Journal 2024 Journal Article

Optimizing Privacy, Utility, and Efficiency in a Constrained Multi-Objective Federated Learning Framework

  • Yan Kang
  • Hanlin Gu
  • Xingxing Tang
  • Yuanqin He
  • Yuzhu Zhang
  • Jinnan He
  • Yuxing Han
  • Lixin Fan

Conventionally, federated learning aims to optimize a single objective, typically the utility. However, for a federated learning system to be trustworthy, it needs to simultaneously satisfy multiple objectives, such as maximizing model performance, minimizing privacy leakage and training costs, and being robust to malicious attacks. Multi-Objective Optimization (MOO), which aims to optimize multiple conflicting objectives simultaneously, is well suited to solving the optimization problem of Trustworthy Federated Learning (TFL). In this article, we unify MOO and TFL by formulating the problem of constrained multi-objective federated learning (CMOFL). Under this formulation, existing MOO algorithms can be adapted to TFL straightforwardly. Different from existing CMOFL algorithms focusing on utility, efficiency, fairness, and robustness, we consider optimizing privacy leakage along with utility loss and training cost, the three primary objectives of a TFL system. We develop two improved CMOFL algorithms, based on NSGA-II and PSL respectively, to effectively and efficiently find Pareto optimal solutions, and we provide theoretical analysis of their convergence. We design quantitative measurements of privacy leakage, utility loss, and training cost for three privacy protection mechanisms: Randomization, BatchCrypt (an efficient homomorphic encryption), and Sparsification. Empirical experiments conducted under the three protection mechanisms demonstrate the effectiveness of our proposed algorithms.
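
Finding Pareto optimal solutions amounts to keeping the candidates that no other candidate dominates. A minimal sketch of that filter over the article's three objectives (all minimized) might look as follows; the sample points are invented for illustration:

```python
def pareto_front(points):
    """Keep non-dominated points: q dominates p if q is no worse in every
    objective and strictly better in at least one (all objectives minimized)."""
    def dominates(q, p):
        return all(a <= b for a, b in zip(q, p)) and any(a < b for a, b in zip(q, p))
    return [p for p in points if not any(dominates(q, p) for q in points)]

# (privacy leakage, utility loss, training cost) -- smaller is better
candidates = [(0.2, 0.10, 30), (0.1, 0.30, 40), (0.3, 0.05, 25), (0.4, 0.40, 50)]
print(pareto_front(candidates))   # the last point is dominated and dropped
```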

TMLR Journal 2024 Journal Article

PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling

  • Yuan Liu
  • Songyang Zhang
  • Jiacheng Chen
  • Kai Chen
  • Dahua Lin

Masked Image Modeling (MIM) has achieved promising progress with the advent of Masked Autoencoders (MAE) and BEiT. However, subsequent works have complicated the framework with new auxiliary tasks or extra pre-trained models, inevitably increasing computational overhead. This paper undertakes a fundamental analysis of MIM from the perspective of pixel reconstruction, which examines the input image patches and reconstruction target, and highlights two critical but previously overlooked bottlenecks. Based on this analysis, we propose a remarkably simple and effective method, PixMIM, that entails two strategies: 1) filtering the high-frequency components from the reconstruction target to de-emphasize the network's focus on texture-rich details and 2) adopting a conservative data transform strategy to alleviate the problem of missing foreground in MIM training. PixMIM can be easily integrated into most existing pixel-based MIM approaches (i.e., using raw images as reconstruction target) with negligible additional computation. Without bells and whistles, our method consistently improves four MIM approaches, MAE, MFF, ConvMAE, and LSMAE, across various downstream tasks. We believe this effective plug-and-play method will serve as a strong baseline for self-supervised learning and provide insights for future improvements of the MIM framework. Code and models will be available.
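
A minimal sketch of the first strategy, assuming an FFT-based ideal low-pass filter over the reconstruction target (the paper's concrete filter and cutoff may differ):

```python
import torch

def low_pass(img: torch.Tensor, radius: float = 0.25) -> torch.Tensor:
    """Zero out spatial frequencies beyond `radius` (in normalized units),
    yielding a texture-suppressed reconstruction target."""
    f = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    h, w = img.shape[-2:]
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    mask = (yy ** 2 + xx ** 2).sqrt() <= radius   # keep only low frequencies
    f = f * mask
    return torch.fft.ifft2(torch.fft.ifftshift(f, dim=(-2, -1))).real

target = low_pass(torch.rand(3, 224, 224))        # de-emphasized high-frequency detail
```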

NeurIPS Conference 2024 Conference Paper

Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs

  • Yuxuan Qiao
  • Haodong Duan
  • Xinyu Fang
  • Junming Yang
  • Lin Chen
  • Songyang Zhang
  • Jiaqi Wang
  • Dahua Lin

Vision Language Models (VLMs) demonstrate remarkable proficiency in addressing a wide array of visual questions, which requires strong perception and reasoning faculties. Assessing these two competencies independently is crucial for model refinement, despite the inherent difficulty due to the intertwined nature of seeing and reasoning in existing VLMs. To tackle this issue, we present Prism, an innovative framework designed to disentangle the perception and reasoning processes involved in visual question solving. Prism comprises two distinct stages: a perception stage that utilizes a VLM to extract and articulate visual information in textual form, and a reasoning stage that formulates responses based on the extracted visual information using a Large Language Model (LLM). This modular design enables the systematic comparison and assessment of both proprietary and open-source VLMs for their perception and reasoning strengths. Our analytical framework provides several valuable insights, underscoring Prism's potential as a cost-effective solution for vision-language tasks. By combining a streamlined VLM focused on perception with a powerful LLM tailored for reasoning, Prism achieves superior results in general vision-language tasks while substantially cutting down on training and operational expenses. Quantitative evaluations show that Prism, when configured with a vanilla 2B LLaVA and the freely accessible GPT-3.5, delivers performance on par with VLMs 10× larger on the rigorous multimodal benchmark MMStar.
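
The decoupling itself is simple to express. A hedged sketch, where vlm_describe and llm_answer are placeholder callables rather than the framework's API:

```python
def prism_answer(image, question, vlm_describe, llm_answer):
    # Stage 1 (perception): a VLM turns the image into a textual description.
    description = vlm_describe(image, instruction="Describe the image in detail.")
    # Stage 2 (reasoning): a text-only LLM answers from the description alone,
    # so perception and reasoning can be scored independently.
    prompt = f"Context: {description}\nQuestion: {question}\nAnswer:"
    return llm_answer(prompt)
```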

JBHI Journal 2024 Journal Article

RED-Net: Residual and Enhanced Discriminative Network for Image Steganalysis in the Internet of Medical Things and Telemedicine

  • Kai Chen
  • Zhengyuan Zhou
  • Yuchen Li
  • Xu Ji
  • Jiasong Wu
  • Jean-Louis Coatrieux
  • Yang Chen
  • Gouenou Coatrieux

Internet of Medical Things (IoMT) and telemedicine technologies utilize computers, communications, and medical devices to facilitate off-site exchanges between specialists and patients, and among specialists and medical staff. If the information communicated in IoMT is subjected to illegal steganography, tampering, or leakage during transmission and storage, it will directly impact patient privacy or the consultation results, with possibly serious medical incidents. Steganalysis is therefore of great significance for identifying medical images transmitted illegally in IoMT and telemedicine. In this article, we propose a Residual and Enhanced Discriminative Network (RED-Net) for image steganalysis in the Internet of Medical Things and telemedicine. RED-Net consists of a steganographic information enhancement module, a deep residual network, and a steganographic information discriminative mechanism. Specifically, the steganographic information enhancement module boosts the illegal steganographic signal in texturally complex high-dimensional medical image features. The deep residual network is utilized for steganographic feature extraction and compression. The steganographic information discriminative mechanism enables the deep residual network to recalibrate the steganographic features and drop high-frequency features that are mistaken for steganographic information. Experiments conducted on public and private datasets with data-hiding payloads ranging from 0.1–0.5 bpp/bpnzAC in the spatial and JPEG domains led to RED-Net's steganalysis error P_E in the ranges 0.0732–0.0010 and 0.231–0.026, respectively. In general, qualitative and quantitative results on public and private datasets demonstrate that RED-Net outperforms eight state-of-the-art steganography detectors.

NeurIPS Conference 2024 Conference Paper

Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models

  • Yilun Jin
  • Zheng Li
  • Chenwei Zhang
  • Tianyu Cao
  • Yifan Gao
  • Pratik Jayarao
  • Mao Li
  • Xin Liu

Online shopping is a complex multi-task, few-shot learning problem with a wide and evolving range of entities, relations, and tasks. However, existing models and benchmarks are commonly tailored to specific tasks, falling short of capturing the full complexity of online shopping. Large Language Models (LLMs), with their multi-task and few-shot learning abilities, have the potential to profoundly transform online shopping by alleviating task-specific engineering efforts and by providing users with interactive conversations. Despite the potential, LLMs face unique challenges in online shopping, such as domain-specific concepts, implicit knowledge, and heterogeneous user behaviors. Motivated by the potential and challenges, we propose Shopping MMLU, a diverse multi-task online shopping benchmark derived from real-world Amazon data. Shopping MMLU consists of 57 tasks covering 4 major shopping skills: concept understanding, knowledge reasoning, user behavior alignment, and multi-linguality, and can thus comprehensively evaluate the abilities of LLMs as general shop assistants. With Shopping MMLU, we benchmark over 20 existing LLMs and uncover valuable insights about practices and prospects of building versatile LLM-based shop assistants. Shopping MMLU can be publicly accessed at https://github.com/KL4805/ShoppingMMLU. In addition, with Shopping MMLU, we are hosting a competition in KDD Cup 2024 with over 500 participating teams. The winning solutions and the associated workshop can be accessed at our website https://amazon-kddcup24.github.io/.

AAAI Conference 2024 Conference Paper

STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering

  • Yueqian Wang
  • Yuxuan Wang
  • Kai Chen
  • Dongyan Zhao

Recently we have witnessed the rapid development of video question answering models. However, most models can only handle simple videos in terms of temporal reasoning, and their performance tends to drop when answering temporal-reasoning questions on long and informative videos. To tackle this problem we propose STAIR, a Spatial-Temporal Reasoning model with Auditable Intermediate Results for video question answering. STAIR is a neural module network, which contains a program generator to decompose a given question into a hierarchical combination of several sub-tasks, and a set of lightweight neural modules to complete each of these sub-tasks. Though neural module networks are already widely studied on image-text tasks, applying them to videos is a non-trivial task, as reasoning on videos requires different abilities. In this paper, we define a set of basic video-text sub-tasks for video question answering and design a set of lightweight modules to complete them. Different from most prior works, modules of STAIR return intermediate outputs specific to their intentions instead of always returning attention maps, which makes it easier to interpret and collaborate with pre-trained models. We also introduce intermediate supervision to make these intermediate outputs more accurate. We conduct extensive experiments on several video question answering datasets under various settings to show STAIR's performance, explainability, compatibility with pre-trained models, and applicability when program annotations are not available. Code: https://github.com/yellow-binary-tree/STAIR

IJCAI Conference 2024 Conference Paper

Temporal Knowledge Graph Extrapolation via Causal Subhistory Identification

  • Kai Chen
  • Ye Wang
  • Xin Song
  • Siwei Chen
  • Han Yu
  • Aiping Li

Temporal knowledge graph extrapolation has become a prominent area of research interest in recent years. Numerous methods for extrapolation have been put forth, mining query-relevant information from history to generate forecasts. However, existing approaches normally do not discriminate between causal and non-causal effects in reasoning; instead, they focus on analyzing the statistical correlation between the future events to be predicted and the historical data given, which may be deceptive and hinder the model's capacity to learn the real causal information that actually affects the reasoning conclusions. To tackle this, we propose a novel approach called Causal Subhistory Identification (CSI), which focuses on extracting the causal subhistory for reasoning purposes from a large amount of historical data. CSI can improve the clarity and transparency of the reasoning process and more effectively convey the logic behind conclusions by giving priority to the causal subhistory and eliminating non-causal correlations. Extensive experiments demonstrate the remarkable potential of our CSI in the following aspects: superiority, improvement, explainability, and robustness.

AAAI Conference 2024 Conference Paper

UMA: Facilitating Backdoor Scanning via Unlearning-Based Model Ablation

  • Yue Zhao
  • Congyi Li
  • Kai Chen

Recent advances in backdoor attacks, like leveraging complex triggers or stealthy implanting techniques, have introduced new challenges in backdoor scanning, limiting the usability of Deep Neural Networks (DNNs) in various scenarios. In this paper, we propose Unlearning-based Model Ablation (UMA), a novel approach to facilitate backdoor scanning and defend against advanced backdoor attacks. UMA filters out backdoor-irrelevant features by ablating the inherent features of the target class within the model and subsequently reveals the backdoor through dynamic trigger optimization. We evaluate our method on 1700 models (700 benign and 1000 trojaned) with 6 model structures, 7 different backdoor attacks, and 4 datasets. Our results demonstrate that the proposed methodology effectively detects these advanced backdoors. Specifically, our method achieves 91% AUC-ROC and 86.6% detection accuracy on average, outperforming baselines including Neural Cleanse, ABS, K-Arm, and MNTD.

NeurIPS Conference 2024 Conference Paper

Vision Foundation Model Enables Generalizable Object Pose Estimation

  • Kai Chen
  • Yiyao Ma
  • Xingyu Lin
  • Stephen James
  • Jianshu Zhou
  • Yun-Hui Liu
  • Pieter Abbeel
  • Qi Dou

Object pose estimation plays a crucial role in robotic manipulation; however, its practical applicability still suffers from limited generalizability. This paper addresses the challenge of generalizable object pose estimation, particularly focusing on category-level object pose estimation for unseen object categories. Current methods either require impractical instance-level training or are confined to predefined categories, limiting their applicability. We propose VFM-6D, a novel framework that explores harnessing existing vision and language models to decompose object pose estimation into two stages: category-level object viewpoint estimation and object coordinate map estimation. Based on the two-stage framework, we introduce a 2D-to-3D feature lifting module and a shape-matching module, both of which leverage pre-trained vision foundation models to improve object representation and matching accuracy. VFM-6D is trained on cost-effective synthetic data and exhibits superior generalization capabilities. It can be applied to both instance-level unseen object pose estimation and category-level object pose estimation for novel categories. Evaluations on benchmark datasets demonstrate the effectiveness and versatility of VFM-6D in various real-world scenarios.

NeurIPS Conference 2024 Conference Paper

YOLOv10: Real-Time End-to-End Object Detection

  • Ao Wang
  • Hui Chen
  • Lihao Liu
  • Kai Chen
  • Zijia Lin
  • Jungong Han
  • Guiguang Ding

Over the past years, YOLOs have emerged as the predominant paradigm in the field of real-time object detection owing to their effective balance between computational cost and detection performance. Researchers have explored the architectural designs, optimization objectives, data augmentation strategies, and more for YOLOs, achieving notable progress. However, the reliance on non-maximum suppression (NMS) for post-processing hampers the end-to-end deployment of YOLOs and adversely impacts the inference latency. Besides, the design of various components in YOLOs lacks comprehensive and thorough inspection, resulting in noticeable computational redundancy and limiting the model's capability. This results in suboptimal efficiency, along with considerable potential for performance improvements. In this work, we aim to further advance the performance-efficiency boundary of YOLOs from both the post-processing and the model architecture. To this end, we first present consistent dual assignments for NMS-free training of YOLOs, which brings competitive performance and low inference latency simultaneously. Moreover, we introduce a holistic efficiency-accuracy driven model design strategy for YOLOs. We comprehensively optimize various components of YOLOs from both the efficiency and accuracy perspectives, which greatly reduces the computational overhead and enhances the capability. The outcome of our effort is a new generation of the YOLO series for real-time end-to-end object detection, dubbed YOLOv10. Extensive experiments show that YOLOv10 achieves state-of-the-art performance and efficiency across various model scales. For example, our YOLOv10-S is 1.8× faster than RT-DETR-R18 under a similar AP on COCO, while enjoying 2.8× fewer parameters and FLOPs. Compared with YOLOv9-C, YOLOv10-B has 46% less latency and 25% fewer parameters for the same performance. Code and models are available at https://github.com/THU-MIG/yolov10.

AAAI Conference 2023 Conference Paper

Boosting Point Clouds Rendering via Radiance Mapping

  • Xiaoyang Huang
  • Yi Zhang
  • Bingbing Ni
  • Teng Li
  • Kai Chen
  • Wenjun Zhang

Recent years have witnessed rapid development in NeRF-based image rendering due to its high quality. However, point cloud rendering is somewhat less explored. Compared to NeRF-based rendering, which suffers from dense spatial sampling, point cloud rendering is naturally less computation-intensive, enabling its deployment on mobile computing devices. In this work, we focus on boosting the image quality of point cloud rendering with a compact model design. We first analyze the adaptation of the volume rendering formulation to point clouds. Based on the analysis, we simplify the NeRF representation to a spatial mapping function which only requires a single evaluation per pixel. Further, motivated by ray marching, we rectify the noisy raw point clouds to the estimated intersections between rays and surfaces as queried coordinates, which avoids spatial frequency collapse and neighbor point disturbance. Composed of rasterization, spatial mapping, and refinement stages, our method achieves state-of-the-art performance on point cloud rendering, outperforming prior works by notable margins with a smaller model size. We obtain a PSNR of 31.74 on NeRF-Synthetic, 25.88 on ScanNet and 30.81 on DTU. Code and data are publicly available at https://github.com/seanywang0408/RadianceMapping.

TIST Journal 2023 Journal Article

Federated Clique Percolation for Privacy-preserving Overlapping Community Detection

  • Kun Guo
  • Wenzhong Guo
  • Enjie Ye
  • Yutong Fang
  • Jiachen Zheng
  • Ximeng Liu
  • Kai Chen

Community structure is a typical characteristic of complex networks. Finding communities in complex networks has many important applications, such as advertisement and recommendation based on social networks and the discovery of new protein molecules in biological networks, making it a hot topic in the field of complex network analysis. With increasing concerns about the leakage of personal privacy, discovering communities spread across the local networks owned by multiple participants accurately while preserving each participant's privacy has become an emerging challenge in distributed community detection. In this article, we propose a general federated graph learning model for privacy-preserving distributed graph learning and develop two federated clique percolation algorithms (CPAs) based on it to discover overlapping communities distributed across multiple participants' local networks without disclosing any participant's network privacy. Homomorphic encryption and hash operations are used in combination to protect the privacy of the vertices and edges of each local network. Furthermore, vertex attributes are involved in the calculation of clique similarity and clique percolation when dealing with attributed networks. The experimental results on real-world and artificial datasets demonstrate that the proposed algorithms achieve identical results to those of their stand-alone counterparts and more than 200% higher accuracy than simple distributed CPAs without federated learning.

IJCAI Conference 2023 Conference Paper

Globally Consistent Federated Graph Autoencoder for Non-IID Graphs

  • Kun Guo
  • Yutong Fang
  • Qingqing Huang
  • Yuting Liang
  • Ziyao Zhang
  • Wenyu He
  • Liu Yang
  • Kai Chen

Graph neural networks (GNNs) have been applied successfully in many machine learning tasks due to their advantages in utilizing neighboring information. Recently, with the global enactment of privacy protection regulations, federated GNNs have gained increasing attention in academia and industry. However, the graphs owned by different participants could be non-independently-and-identically distributed (non-IID), leading to the deterioration of federated GNNs' accuracy. In this paper, we propose a globally consistent federated graph autoencoder (GCFGAE) to overcome the non-IID problem in unsupervised federated graph learning via three innovations. First, by integrating federated learning with split learning, we train a unique global model instead of FedAvg-styled global and local models, yielding results consistent with that of the centralized GAE. Second, we design a collaborative computation mechanism considering overlapping vertices to reduce communication overhead during forward propagation. Third, we develop a layer-wise and block-wise gradient computation strategy to reduce the space and communication complexity during backward propagation. Experiments on real-world datasets demonstrate that GCFGAE achieves not only higher accuracy but also around 500 times lower communication overhead and 1000 times smaller space overhead than existing federated GNN models.

NeurIPS Conference 2023 Conference Paper

GlyphControl: Glyph Conditional Control for Visual Text Generation

  • Yukang Yang
  • Dongnan Gui
  • Yuhui Yuan
  • Weicong Liang
  • Haisong Ding
  • Han Hu
  • Kai Chen

Recently, there has been an increasing interest in developing diffusion-based text-to-image generative models capable of generating coherent and well-formed visual text. In this paper, we propose a novel and efficient approach called GlyphControl to address this task. Unlike existing methods that rely on character-aware text encoders like ByT5 and require retraining of text-to-image models, our approach leverages additional glyph conditional information to enhance the performance of the off-the-shelf Stable-Diffusion model in generating accurate visual text. By incorporating glyph instructions, users can customize the content, location, and size of the generated text according to their specific requirements. To facilitate further research in visual text generation, we construct a training benchmark dataset called LAION-Glyph. We evaluate the effectiveness of our approach by measuring OCR-based metrics, CLIP score, and FID of the generated visual text. Our empirical evaluations demonstrate that GlyphControl outperforms the recent DeepFloyd IF approach in terms of OCR accuracy, CLIP score, and FID, highlighting the efficacy of our method.

NeurIPS Conference 2023 Conference Paper

Segment Any Point Cloud Sequences by Distilling Vision Foundation Models

  • Youquan Liu
  • Lingdong Kong
  • Jun CEN
  • Runnan Chen
  • Wenwei Zhang
  • Liang Pan
  • Kai Chen
  • Ziwei Liu

Recent advancements in vision foundation models (VFMs) have opened up new possibilities for versatile and efficient visual perception. In this work, we introduce Seal, a novel framework that harnesses VFMs for segmenting diverse automotive point cloud sequences. Seal exhibits three appealing properties: i) Scalability: VFMs are directly distilled into point clouds, obviating the need for annotations in either 2D or 3D during pretraining. ii) Consistency: Spatial and temporal relationships are enforced at both the camera-to-LiDAR and point-to-segment regularization stages, facilitating cross-modal representation learning. iii) Generalizability: Seal enables knowledge transfer in an off-the-shelf manner to downstream tasks involving diverse point clouds, including those from real/synthetic, low/high-resolution, large/small-scale, and clean/corrupted datasets. Extensive experiments conducted on eleven different point cloud datasets showcase the effectiveness and superiority of Seal. Notably, Seal achieves a remarkable 45.0% mIoU on nuScenes after linear probing, surpassing random initialization by 36.9% mIoU and outperforming prior art by 6.1% mIoU. Moreover, Seal demonstrates significant performance gains over existing methods across 20 different few-shot fine-tuning tasks on all eleven tested point cloud datasets. The code is available at this link.

AAAI Conference 2023 Conference Paper

Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation

  • Zhao Yang
  • Jiaqi Wang
  • Yansong Tang
  • Kai Chen
  • Hengshuang Zhao
  • Philip H.S. Torr

Referring image segmentation segments an image from a language expression. With the aim of producing high-quality masks, existing methods often adopt iterative learning approaches that rely on RNNs or stacked attention layers to refine vision-language features. Despite their complexity, RNN-based methods are subject to specific encoder choices, while attention-based methods offer limited gains. In this work, we introduce a simple yet effective alternative for progressively learning discriminative multi-modal features. The core idea of our approach is to leverage a continuously updated query as the representation of the target object and at each iteration, strengthen multi-modal features strongly correlated to the query while weakening less related ones. As the query is initialized by language features and successively updated by object features, our algorithm gradually shifts from being localization-centric to segmentation-centric. This strategy enables the incremental recovery of missing object parts and/or removal of extraneous parts through iteration. Compared to its counterparts, our method is more versatile—it can be plugged into prior arts straightforwardly and consistently bring improvements. Experimental results on the challenging datasets of RefCOCO, RefCOCO+, and G-Ref demonstrate its advantage with respect to the state-of-the-art methods.

IJCAI Conference 2023 Conference Paper

TG-VQA: Ternary Game of Video Question Answering

  • Hao Li
  • Peng Jin
  • Zesen Cheng
  • Songyang Zhang
  • Kai Chen
  • Zhennan Wang
  • Chang Liu
  • Jie Chen

Video question answering aims at answering a question about the video content by reasoning about the alignment semantics within them. However, since they rely heavily on human instructions, i.e., annotations or priors, it remains challenging for current contrastive-learning-based VideoQA methods to perform fine-grained visual-linguistic alignment. In this work, we innovatively resort to game theory, which can simulate complicated relationships among multiple players with specific interaction strategies, e.g., video, question, and answer as ternary players, to achieve fine-grained alignment for the VideoQA task. Specifically, we carefully design a VideoQA-specific interaction strategy tailored to the characteristics of VideoQA, which can mathematically generate fine-grained visual-linguistic alignment labels without label-intensive efforts. Our TG-VQA outperforms the existing state-of-the-art by a large margin (more than 5%) on long-term and short-term VideoQA datasets, verifying its effectiveness and generalization ability. Thanks to the guidance of game-theoretic interaction, our model converges impressively well on limited data (10^4 videos), surpassing most models pre-trained on large-scale data (10^7 videos).

TIST Journal 2023 Journal Article

Trading Off Privacy, Utility, and Efficiency in Federated Learning

  • Xiaojin Zhang
  • Yan Kang
  • Kai Chen
  • Lixin Fan
  • Qiang Yang

Federated learning (FL) enables participating parties to collaboratively build a global model with boosted utility without disclosing private data information. Appropriate protection mechanisms have to be adopted to fulfill the opposing requirements in preserving privacy and maintaining high model utility. In addition, it is a mandate for a federated learning system to achieve high efficiency in order to enable large-scale model training and deployment. We propose a unified federated learning framework that reconciles horizontal and vertical federated learning. Based on this framework, we formulate and quantify the trade-offs between privacy leakage, utility loss, and efficiency reduction, which leads us to the No-Free-Lunch (NFL) theorem for the federated learning system. NFL indicates that it is unrealistic to expect an FL algorithm to simultaneously provide excellent privacy, utility, and efficiency in certain scenarios. We then analyze the lower bounds for the privacy leakage, utility loss, and efficiency reduction for several widely-adopted protection mechanisms, including Randomization, Homomorphic Encryption, Secret Sharing, and Compression. Our analysis could serve as a guide for selecting protection parameters to meet particular requirements.

AAAI Conference 2022 Conference Paper

Attacking Video Recognition Models with Bullet-Screen Comments

  • Kai Chen
  • Zhipeng Wei
  • Jingjing Chen
  • Zuxuan Wu
  • Yu-Gang Jiang

Recent research has demonstrated that Deep Neural Networks (DNNs) are vulnerable to adversarial patches which introduce perceptible but localized changes to the input. Nevertheless, while existing approaches have focused on generating adversarial patches on images, their counterparts in videos have been less explored. Compared with images, attacking videos is much more challenging as it needs to consider not only spatial cues but also temporal cues. To close this gap, we introduce a novel adversarial attack in this paper, the bullet-screen comment (BSC) attack, which attacks video recognition models with BSCs. Specifically, adversarial BSCs are generated with a Reinforcement Learning (RL) framework, where the environment is set as the target model and the agent plays the role of selecting the position and transparency of each BSC. By continuously querying the target models and receiving feedback, the agent gradually adjusts its selection strategies in order to achieve a high fooling rate with non-overlapping BSCs. As BSCs can be regarded as a kind of meaningful patch, adding them to a clean video will not affect people's understanding of the video content, nor will it arouse suspicion. We conduct extensive experiments to verify the effectiveness of the proposed method. On both the UCF-101 and HMDB-51 datasets, our BSC attack method achieves about a 90% fooling rate when attacking three mainstream video recognition models, while occluding less than 8% of the area in the video. Our code is available at https://github.com/kay-ck/BSC-attack.

NeurIPS Conference 2022 Conference Paper

Deliberated Domain Bridging for Domain Adaptive Semantic Segmentation

  • Lin Chen
  • Zhixiang Wei
  • Xin Jin
  • Huaian Chen
  • Miao Zheng
  • Kai Chen
  • Yi Jin

In unsupervised domain adaptation (UDA), directly adapting from the source to the target domain usually suffers significant discrepancies and leads to insufficient alignment. Thus, many UDA works attempt to close the domain gap gradually and softly via various intermediate spaces, dubbed domain bridging (DB). However, for dense prediction tasks such as domain adaptive semantic segmentation (DASS), existing solutions have mostly relied on rough style transfer, and how to elegantly bridge domains is still under-explored. In this work, we resort to data mixing to establish a deliberated domain bridging (DDB) for DASS, through which the joint distributions of source and target domains are aligned and allowed to interact with each other in the intermediate space. At the heart of DDB lies a dual-path domain bridging step for generating two intermediate domains using coarse-wise and fine-wise data mixing techniques, alongside a cross-path knowledge distillation step that takes two complementary models trained on the generated intermediate samples as ‘teachers’ to develop a superior ‘student’ in a multi-teacher distillation manner. These two optimization steps work in an alternating way and reinforce each other to give rise to DDB with strong adaptation power. Extensive experiments on adaptive segmentation tasks with different settings demonstrate that our DDB significantly outperforms state-of-the-art methods.

TIST Journal 2022 Journal Article

Efficient Federated Matrix Factorization Against Inference Attacks

  • Di Chai
  • Leye Wang
  • Kai Chen
  • Qiang Yang

Recommender systems typically require the revelation of users’ ratings to the recommender server, which will subsequently use these ratings to provide personalized services. However, such revelations make users vulnerable to a broader set of inference attacks, allowing the recommender server to learn users’ private attributes, e.g., age and gender. Therefore, in this paper, we propose an efficient federated matrix factorization method that protects users against inference attacks. The key idea is that we obfuscate one user’s rating to another such that the private attribute leakage is minimized under a given distortion budget, which bounds the recommendation loss and the overhead of system efficiency. During the obfuscation, we apply differential privacy to control the information leakage between users. We also adopt homomorphic encryption to protect the intermediate results during training. Our framework is implemented and tested on real-world datasets. The results show that our method can reduce inference attack accuracy by up to 16.7% compared to using no privacy protection.

TIST Journal 2022 Journal Article

Improving Availability of Vertical Federated Learning: Relaxing Inference on Non-overlapping Data

  • Zhenghang Ren
  • Liu Yang
  • Kai Chen

Vertical Federated Learning (VFL) enables multiple parties to collaboratively train a machine learning model over vertically distributed datasets without data privacy leakage. However, current VFL solutions have a key limitation: they fail to conduct inference on non-overlapping samples. This limitation seriously damages a VFL model’s availability because, in practice, overlapping samples may take up only a small portion of the whole data at each party, which means a large part of inference tasks will fail. In this article, we propose a novel VFL framework which enables federated inference on non-overlapping data. Our framework regards the distributed features as privileged information which is available in the training period but disappears during inference. We distill the knowledge of such privileged features and transfer it to the parties’ local models, which only process local features. Furthermore, we adopt Oblivious Transfer (OT) to preserve data ID privacy during training and inference. Empirically, we evaluate the model on real-world datasets collected from Criteo and Taobao. We also provide a security analysis of the proposed framework.

TIST Journal 2022 Journal Article

No Free Lunch Theorem for Security and Utility in Federated Learning

  • Xiaojin Zhang
  • Hanlin Gu
  • Lixin Fan
  • Kai Chen
  • Qiang Yang

In a federated learning scenario where multiple parties jointly learn a model from their respective data, there exist two conflicting goals for the choice of appropriate algorithms. On one hand, private and sensitive training data must be kept secure as much as possible in the presence of semi-honest partners; on the other hand, a certain amount of information has to be exchanged among different parties for the sake of learning utility. Such a challenge calls for the privacy-preserving federated learning solution, which maximizes the utility of the learned model and maintains a provable privacy guarantee of participating parties’ private data. This article illustrates a general framework that (1) formulates the trade-off between privacy loss and utility loss from a unified information-theoretic point of view, and (2) delineates quantitative bounds of the privacy-utility trade-off when different protection mechanisms including randomization, sparsity, and homomorphic encryption are used. It was shown that in general there is no free lunch for the privacy-utility trade-off, and one has to trade the preserving of privacy with a certain degree of degraded utility. The quantitative analysis illustrated in this article may serve as the guidance for the design of practical federated learning algorithms.

AAAI Conference 2022 Conference Paper

Task-Customized Self-Supervised Pre-training with Scalable Dynamic Routing

  • Zhili LIU
  • Jianhua Han
  • Lanqing Hong
  • Hang Xu
  • Kai Chen
  • Chunjing Xu
  • Zhenguo Li

Self-supervised learning (SSL), especially contrastive methods, has attracted increasing attention recently as it learns effective transferable representations without semantic annotations. A common practice for self-supervised pre-training is to use as much data as possible. For a specific downstream task, however, involving irrelevant data in pre-training may degenerate the downstream performance, as observed in our extensive experiments. On the other hand, for existing SSL methods, it is burdensome and infeasible to use different downstream-task-customized datasets in pre-training for different tasks. To address this issue, we propose a novel SSL paradigm called Scalable Dynamic Routing (SDR), which can be trained once and deployed efficiently to different downstream tasks with task-customized pre-trained models. Specifically, we construct the SDRnet with various sub-nets and train each sub-net with only one subset of the data by data-aware progressive training. When a downstream task arrives, we route among all the pre-trained sub-nets to select the best one along with its corresponding weights. Experiment results show that our SDR can train 256 sub-nets on ImageNet simultaneously, which provides better transfer performance than a unified model trained on the full ImageNet, achieving state-of-the-art (SOTA) averaged accuracy over 11 downstream classification tasks and AP on the PASCAL VOC detection task.

NeurIPS Conference 2021 Conference Paper

Few-Shot Object Detection via Association and DIscrimination

  • Yuhang Cao
  • Jiaqi Wang
  • Ying Jin
  • Tong Wu
  • Kai Chen
  • Ziwei Liu
  • Dahua Lin

Object detection has achieved substantial progress in the last decade. However, detecting novel classes with only few samples remains challenging, since deep learning under a low data regime usually leads to a degraded feature space. Existing works employ a holistic fine-tuning paradigm to tackle this problem, where the model is first pre-trained on all base classes with abundant samples, and then it is used to carve the novel class feature space. Nonetheless, this paradigm is still imperfect. During fine-tuning, a novel class may implicitly leverage the knowledge of multiple base classes to construct its feature space, which induces a scattered feature space, hence violating the inter-class separability. To overcome these obstacles, we propose a two-step fine-tuning framework, Few-shot object detection via Association and DIscrimination (FADI), which builds up a discriminative feature space for each novel class with two integral steps. 1) In the association step, in contrast to implicitly leveraging multiple base classes, we construct a compact novel class feature space via explicitly imitating a specific base class feature space. Specifically, we associate each novel class with a base class according to their semantic similarity. After that, the feature space of a novel class can readily imitate the well-trained feature space of the associated base class. 2) In the discrimination step, to ensure the separability between the novel classes and associated base classes, we disentangle the classification branches for base and novel classes. To further enlarge the inter-class separability between all classes, a set-specialized margin loss is imposed. Extensive experiments on standard Pascal VOC and MS-COCO datasets demonstrate that FADI achieves new state-of-the-art performance, significantly improving the baseline in any shot/split by +18.7. Notably, the advantage of FADI is most pronounced in extremely few-shot scenarios (e.g., 1- and 3-shot).

NeurIPS Conference 2021 Conference Paper

K-Net: Towards Unified Image Segmentation

  • Wenwei Zhang
  • Jiangmiao Pang
  • Kai Chen
  • Chen Change Loy

Semantic, instance, and panoptic segmentation have been addressed using different and specialized frameworks despite their underlying connections. This paper presents a unified, simple, and effective framework for these essentially similar tasks. The framework, named K-Net, segments both instances and semantic categories consistently by a group of learnable kernels, where each kernel is responsible for generating a mask for either a potential instance or a stuff class. To remedy the difficulty of distinguishing various instances, we propose a kernel update strategy that makes each kernel dynamic and conditional on its meaningful group in the input image. K-Net can be trained in an end-to-end manner with bipartite matching, and its training and inference are naturally NMS-free and box-free. Without bells and whistles, K-Net surpasses all previously published state-of-the-art single-model results of panoptic segmentation on the MS COCO test-dev split and semantic segmentation on the ADE20K val split, with 55.2% PQ and 54.3% mIoU, respectively. Its instance segmentation performance is also on par with Cascade Mask R-CNN on MS COCO with 60%-90% faster inference speeds. Code and models will be released at https://github.com/ZwwWayne/K-Net/.
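
A minimal sketch of the kernel idea: each kernel dot-products with the feature map to emit a mask, and a simplified update conditions each kernel on the features gathered under its own mask. Shapes and the plain additive update are assumptions; K-Net's actual update is gated and learned.

```python
import torch

B, C, H, W, N = 2, 64, 32, 32, 10      # batch, channels, height, width, kernels
feats = torch.randn(B, C, H, W)        # backbone feature map
kernels = torch.randn(N, C)            # one learnable kernel per potential mask

# Each kernel produces one mask by dot-producting with every spatial location.
masks = torch.einsum("nc,bchw->bnhw", kernels, feats).sigmoid()

# Simplified kernel update: refine each kernel with the features its mask selects.
group = torch.einsum("bnhw,bchw->bnc", masks, feats) / (H * W)
kernels = kernels + group.mean(dim=0)  # K-Net uses a learned, gated update instead
```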

IS Journal 2021 Journal Article

Secure Federated Matrix Factorization

  • Di Chai
  • Leye Wang
  • Kai Chen
  • Qiang Yang

To protect user privacy and meet legal regulations, federated (machine) learning has attracted vast interest in recent years. The key principle of federated learning is training a machine learning model without needing to know each user’s personal raw private data. In this article, we propose a secure matrix factorization framework under the federated learning setting, called FedMF. First, we design a user-level distributed matrix factorization framework where the model can be learned when each user only uploads the gradient information (instead of the raw preference data) to the server. While gradient information seems secure, we prove that it could still leak users’ raw data. To this end, we enhance the distributed matrix factorization framework with homomorphic encryption. We implement a prototype of FedMF and test it with a real movie rating dataset. Results verify the feasibility of FedMF. We also discuss the challenges of applying FedMF in practice for future research.
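
A toy sketch of the user-level update, with all names assumed for illustration. It also shows the kind of leakage the article proves: only rated items yield nonzero gradient rows, which FedMF hides by uploading homomorphically encrypted gradients instead.

```python
import numpy as np

dim, n_items = 8, 100
V = np.random.randn(n_items, dim) * 0.1   # item matrix held by the server
u = np.random.randn(dim) * 0.1            # user vector, kept locally
ratings = {3: 5.0, 17: 2.0}               # this user's private ratings

grad_V = np.zeros_like(V)
for item, r in ratings.items():
    err = r - u @ V[item]
    grad_V[item] = -err * u               # item-matrix gradient, uploaded to the server
    u += 0.01 * err * V[item]             # user vector updated locally, never shared

# The nonzero rows of grad_V reveal *which* items the user rated -- part of
# the leakage FedMF closes by encrypting the uploaded gradients.
```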

NeurIPS Conference 2021 Conference Paper

SODA10M: A Large-Scale 2D Self/Semi-Supervised Object Detection Dataset for Autonomous Driving

  • Jianhua Han
  • Xiwen Liang
  • Hang Xu
  • Kai Chen
  • Lanqing Hong
  • Jiageng Mao
  • Chaoqiang Ye
  • Wei Zhang

Aiming at facilitating a real-world, ever-evolving and scalable autonomous driving system, we present a large-scale dataset for standardizing the evaluation of different self-supervised and semi-supervised approaches by learning from raw data, the first and largest such dataset to date. Existing autonomous driving systems heavily rely on 'perfect' visual perception models (i.e., detection) trained using extensive annotated data to ensure safety. However, it is unrealistic to elaborately label instances of all scenarios and circumstances (i.e., night, extreme weather, cities) when deploying a robust autonomous driving system. Motivated by recent advances in self-supervised and semi-supervised learning, a promising direction is to learn a robust detection model by collaboratively exploiting large-scale unlabeled data and few labeled data. Existing datasets (i.e., BDD100K, Waymo) either provide only a small amount of data or cover limited domains with full annotation, hindering the exploration of large-scale pre-trained models. Here, we release a Large-Scale 2D Self/semi-supervised Object Detection dataset for Autonomous driving, named SODA10M, containing 10 million unlabeled images and 20K images labeled with 6 representative object categories. To improve diversity, the images are collected within 27833 driving hours under different weather conditions, periods, and location scenes of 32 different cities. We provide extensive experiments and deep analyses of existing popular self-supervised and semi-supervised approaches, and some interesting findings in the autonomous driving scope. Experiments show that SODA10M can serve as a promising pre-training dataset for different self-supervised learning methods, which give superior performance when fine-tuning with different downstream tasks (i.e., detection, semantic/instance segmentation) in the autonomous driving domain. This dataset has been used to hold the ICCV2021 SSLAD challenge. More information is available at https://soda-2d.github.io.

AAAI Conference 2021 Conference Paper

Temporal ROI Align for Video Object Recognition

  • Tao Gong
  • Kai Chen
  • Xinjiang Wang
  • Qi Chu
  • Feng Zhu
  • Dahua Lin
  • Nenghai Yu
  • Huamin Feng

Video object detection is challenging in the presence of appearance deterioration in certain video frames. Therefore, it is a natural choice to aggregate temporal information from other frames of the same video into the current frame. However, ROI Align, as one of the most core procedures of video detectors, still extracts features from a single-frame feature map for proposals, making the extracted ROI features lack temporal information from videos. In this work, considering that the features of the same object instance are highly similar among frames in a video, a novel Temporal ROI Align operator is proposed to extract features from other frames' feature maps for current-frame proposals by utilizing feature similarity. The proposed Temporal ROI Align operator can extract temporal information from the entire video for proposals. We integrate it into single-frame video detectors and other state-of-the-art video detectors, and conduct quantitative experiments to demonstrate that the proposed Temporal ROI Align operator can consistently and significantly boost the performance. Besides, the proposed Temporal ROI Align can also be applied to video instance segmentation.
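
A hedged sketch of similarity-guided temporal aggregation (the tensor shapes, scaled dot-product attention, and additive fusion here are simplifications of the actual operator):

```python
import torch

roi = torch.randn(49, 256)            # current-frame ROI features (7x7 grid, flattened)
support = torch.randn(5, 1024, 256)   # candidate features from 5 other frames

# Each ROI position attends to the most similar features in the support frames.
sim = torch.einsum("qc,tkc->tqk", roi, support) / 256 ** 0.5
attn = sim.softmax(dim=-1)
temporal = torch.einsum("tqk,tkc->tqc", attn, support).mean(dim=0)
fused = roi + temporal                # simplified fusion of current and temporal features
```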

AAAI Conference 2020 Conference Paper

Real-Time Scene Text Detection with Differentiable Binarization

  • Minghui Liao
  • Zhaoyi Wan
  • Cong Yao
  • Kai Chen
  • Xiang Bai

Recently, segmentation-based methods have become quite popular in scene text detection, as the segmentation results can more accurately describe scene text of various shapes, such as curved text. However, the post-processing of binarization is essential for segmentation-based detection, which converts probability maps produced by a segmentation method into bounding boxes/regions of text. In this paper, we propose a module named Differentiable Binarization (DB), which can perform the binarization process in a segmentation network. Optimized along with a DB module, a segmentation network can adaptively set the thresholds for binarization, which not only simplifies the post-processing but also enhances the performance of text detection. Based on a simple segmentation network, we validate the performance improvements of DB on five benchmark datasets, on which it consistently achieves state-of-the-art results in terms of both detection accuracy and speed. In particular, with a lightweight backbone, the performance improvements by DB are significant, so that we can look for an ideal tradeoff between detection accuracy and efficiency. Specifically, with a backbone of ResNet-18, our detector achieves an F-measure of 82.8, running at 62 FPS, on the MSRA-TD500 dataset. Code is available at: https://github.com/MhLiao/DB.
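
For reference, the differentiable surrogate that DB substitutes for the hard binarization step is a scaled sigmoid comparing the probability map P with the learned threshold map T, with the amplification factor k reported as 50 in the paper:

```latex
\hat{B}_{i,j} = \frac{1}{1 + e^{-k\,(P_{i,j} - T_{i,j})}}, \qquad k = 50
```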

IROS Conference 2019 Conference Paper

Automatic Annotation for Semantic Segmentation in Indoor Scenes

  • Md. Alimoor Reza
  • Akshay U. Naik
  • Kai Chen
  • David J. Crandall

Domestic robots could eventually transform our lives, but safely operating in home environments requires a rich understanding of indoor scenes. Learning-based techniques for scene segmentation require large-scale, pixel-level annotations, which are laborious and expensive to collect. We propose an automatic method for pixel-wise semantic annotation of video sequences, that gathers cues from object detectors and indoor 3D room-layout estimation and then annotates all the image pixels in an energy minimization framework. Extensive experiments on a publicly available video dataset (SUN3D) evaluate the approach and demonstrate its effectiveness.

IJCAI Conference 2016 Conference Paper

Planning with Task-Oriented Knowledge Acquisition for a Service Robot

  • Kai Chen
  • Fangkai Yang
  • Xiaoping Chen

We propose a framework for a service robot to behave intelligently in domains with incomplete information, underspecified goals, and dynamic change. Human-robot interaction (HRI), sensing actions, and physical actions are uniformly formalized in the action language BC. An answer set solver is called to generate plans that guide the robot to acquire task-oriented knowledge and execute actions to achieve its goal, including interacting with humans to gather information and sensing the environment to aid motion planning. By continuously interpreting and grounding useful sensing information, the robot can use contingent knowledge to adapt to unexpected changes and faults. We evaluate the approach on the service robot KeJia serving drinks to guests, a testing benchmark for general-purpose service robots proposed by the RoboCup@Home competition.
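The overall control flow amounts to a plan-execute-sense-replan loop. The sketch below is a minimal, hypothetical rendering of that loop; plan_fn stands in for calling an answer set solver over a BC action theory, and all names are illustrative rather than the paper's API.

```python
def run_task(goal, kb, plan_fn, execute_fn, sense_fn, valid_fn):
    """Hypothetical plan-execute-sense loop for a task-oriented robot.

    kb: mutable knowledge base (e.g. a set of ground facts)
    plan_fn(kb, goal) -> list of actions (stands in for the ASP solver)
    execute_fn(action) -> bool, covering HRI, sensing, and physical actions
    sense_fn() -> newly grounded facts from the robot's sensors
    valid_fn(kb, goal, plan) -> whether the remaining plan still applies
    """
    plan = list(plan_fn(kb, goal))
    while plan:
        action = plan.pop(0)
        ok = execute_fn(action)       # act, ask a human, or sense
        kb.update(sense_fn())         # ground new task-oriented knowledge
        if not ok or not valid_fn(kb, goal, plan):
            plan = list(plan_fn(kb, goal))   # replan on faults or change
    return kb
```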

NeurIPS Conference 2013 Conference Paper

Distributed Representations of Words and Phrases and their Compositionality

  • Tomas Mikolov
  • Ilya Sutskever
  • Kai Chen
  • Greg Corrado
  • Jeff Dean

The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several improvements that make the Skip-gram model more expressive and enable it to learn higher quality vectors more rapidly. We show that by subsampling frequent words we obtain significant speedup, and also learn higher quality representations as measured by our tasks. We also introduce Negative Sampling, a simplified variant of Noise Contrastive Estimation (NCE) that learns more accurate vectors for frequent words compared to the hierarchical softmax. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple and efficient method for finding phrases, and show that their vector representations can be accurately learned by the Skip-gram model.
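A single training step of skip-gram with negative sampling treats the observed (center, context) pair as a positive example and the sampled noise words as negatives, applying a logistic loss to each. A minimal sketch, assuming the negative indices have already been drawn from the noise distribution (indexing and sampling details vary across implementations):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(W_in, W_out, center, context, negatives, lr=0.025):
    """One skip-gram-with-negative-sampling SGD step (illustrative only).

    W_in, W_out: (V, D) input/output embedding tables
    center, context: word indices of the training pair
    negatives: indices sampled from the noise distribution (label 0)
    """
    v = W_in[center].copy()
    grad_v = np.zeros_like(v)
    for idx, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        score = sigmoid(v @ W_out[idx])
        g = score - label                # gradient of the logistic loss
        grad_v += g * W_out[idx]         # accumulate before updating W_out
        W_out[idx] -= lr * g * v
    W_in[center] -= lr * grad_v
```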

NeurIPS Conference 2012 Conference Paper

Large Scale Distributed Deep Networks

  • Jeffrey Dean
  • Greg Corrado
  • Rajat Monga
  • Kai Chen
  • Matthieu Devin
  • Mark Mao
  • Marc'Aurelio Ranzato
  • Andrew Senior

Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 100x larger than previously reported in the literature, achieving state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm.
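To make the asynchrony concrete, the toy sketch below models the Downpour-style interaction between a central parameter store and one model replica: the replica refreshes its parameters only every few steps and pushes gradients that may be stale by the time they are applied. All names are hypothetical, and real deployments shard parameters across many server machines rather than using a single in-process object.

```python
import numpy as np

class ParameterServer:
    """Toy central store: replicas fetch parameters and push gradients
    asynchronously. No sharding or locking is modeled here."""
    def __init__(self, dim, lr=0.01):
        self.params = np.zeros(dim)
        self.lr = lr

    def fetch(self):
        return self.params.copy()

    def push(self, grad):
        self.params -= self.lr * grad    # apply a possibly stale gradient

def worker_loop(server, shard, grad_fn, fetch_every=5, steps=100):
    """One model replica training on its own data shard.

    grad_fn(params, example) -> gradient array for that example.
    Parameters are refreshed from the server only every `fetch_every`
    steps; between fetches the replica also updates its local copy.
    """
    local = server.fetch()
    for t in range(steps):
        if t % fetch_every == 0:
            local = server.fetch()       # refresh from the central store
        g = grad_fn(local, shard[t % len(shard)])
        local -= server.lr * g           # local update between fetches
        server.push(g)                   # asynchronous push to the server
```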