Arrow Research search

Author name cluster

Hao Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

146 papers
2 author rows

Possible papers (146)

AAAI Conference 2026 Conference Paper

Active Multi-source Domain Adaptation for Multimodal Fake News Detection

  • Yanping Chen
  • Weijie Shi
  • Mengze Li
  • Yue Cui
  • Jiaming Li
  • Ruiyuan Zhang
  • Hao Chen
  • Hanghui Guo

Multimodal fake news detection plays a crucial role in combating online misinformation. The inherent domain diversity of news in the real world has driven the development of cross-domain detection methods. However, these detection methods either suffer from significant performance degradation due to semantic and deception pattern shifts between the training (source) and test (target) domains or heavily rely on annotated labels. To address these problems, we propose ADOSE, an active multi-source domain adaptation framework for multimodal fake news detection which actively annotates a small subset of target samples to improve detection performance. Specifically, for domain shifts, we design a multi-expert classifier network based on refined features to comprehensively capture and adapt to the semantic space and deception patterns of news across different domains. To maximize adaptation performance with limited annotation cost, we propose a least-disagree uncertainty selector equipped with a diversity calculator for selecting the most informative samples. The selector leverages the uncertainty of inconsistent predictions before and after perturbations by multiple classifiers as an indicator of unfamiliar samples. It further incorporates diversity scores derived from multi-view features to ensure the chosen samples achieve maximal coverage of target domain features. Extensive experiments on multiple datasets show that ADOSE outperforms existing domain adaptation methods by 2.45%–9.1%, indicating the superiority of our model.
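
A minimal sketch of the selection heuristic the abstract describes: score unlabeled target samples by how often an ensemble of classifiers flips its prediction under a small input perturbation, then mix in a greedy diversity term. The scoring rule, helper names, and toy classifiers below are illustrative assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_ensemble(X, classifiers):
    """Hard predictions from each classifier, stacked: (n_clf, n_samples)."""
    return np.stack([clf(X) for clf in classifiers])

def select_informative(X, classifiers, budget, sigma=0.05, lam=0.5):
    preds = predict_ensemble(X, classifiers)
    preds_pert = predict_ensemble(X + sigma * rng.standard_normal(X.shape), classifiers)
    # Uncertainty: fraction of classifiers whose prediction flips under perturbation.
    uncertainty = (preds != preds_pert).mean(axis=0)
    chosen = []
    for _ in range(budget):
        if chosen:  # diversity: distance to the nearest already-chosen sample
            d = np.min(np.linalg.norm(X[:, None] - X[chosen][None], axis=-1), axis=1)
        else:
            d = np.ones(len(X))
        score = lam * uncertainty + (1 - lam) * d / (d.max() + 1e-9)
        score[chosen] = -np.inf              # never re-select a sample
        chosen.append(int(score.argmax()))
    return chosen

# Toy usage: four "classifiers" thresholding random linear scores.
X = rng.standard_normal((200, 8))
clfs = [lambda Z, w=rng.standard_normal(8): (Z @ w > 0).astype(int) for _ in range(4)]
print(select_informative(X, clfs, budget=5))
```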

AAAI Conference 2026 Conference Paper

AIR-DR: Adaptive Image Retargeting with Instance Relocation and Dual-guidance Repainting

  • Zhitong Dong
  • Chao Li
  • Yongjian Deng
  • Hao Chen

Image retargeting aims to adjust the aspect ratio of images to accommodate various display devices. While existing methods consider both foreground semantics and background inpainting, their seam-carving-based framework is inherently destructive, often compromising the structural integrity of foreground instances. Furthermore, conventional inpainting models struggle to achieve pixel-level accuracy with global-only guidance, leading to local inconsistencies and background distortions. To address these challenges, we reformulate image retargeting as an instance-level re-layout task. Through Adaptive Instance Relocation and Dual-guidance Repainting (AIR-DR), our method preserves the structural integrity of the foreground and recovers the background with consistent details. Additionally, we introduce an adaptive retargeting decision that maintains robustness across challenging retargeting scenarios and arbitrary aspect ratios. Extensive experiments on multiple public datasets across various aspect ratios demonstrate that our approach consistently outperforms existing methods in both objective metrics and subjective evaluations. Comprehensive ablation studies further validate the effectiveness of each component.

AAAI Conference 2026 Conference Paper

An Invariant Latent Space Perspective on Language Model Inversion

  • Wentao Ye
  • Jiaqi Hu
  • Haobo Wang
  • Xinpeng Ti
  • Zhiqing Xiao
  • Hao Chen
  • Liyao Li
  • Lei Feng

Language model inversion (LMI), i.e., recovering hidden prompts from outputs, has emerged as a concrete threat to user privacy and system security. We recast LMI as reusing the LLM's own latent space and propose the Invariant Latent Space Hypothesis (ILSH): (1) diverse outputs from the same source prompt should preserve consistent semantics (source invariance), and (2) input–output cyclic mappings should be self-consistent within a shared latent space (cyclic invariance). Accordingly, we present Inv2A, which treats the LLM as an invariant decoder and learns only a lightweight inverse encoder that maps outputs to a denoised pseudo-representation. When multiple outputs are available, they are sparsely concatenated at the representation layer to increase information density. Training proceeds in two stages: contrastive alignment (source invariance) and supervised reinforcement (cyclic invariance). An optional training-free neighborhood search can refine local performance. Across 9 datasets covering user and system prompt scenarios, Inv2A outperforms baselines by an average of 4.77% in BLEU score while reducing dependence on large inverse corpora. Our analysis further shows that prevalent defenses provide limited protection, underscoring the need for stronger strategies.
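
A generic sketch of the first training stage described above (source invariance): pull representations of outputs generated from the same prompt together with an InfoNCE contrastive loss. This is a standard formulation assumed for illustration; the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.07):
    """z_a[i] and z_b[i] embed two different outputs of the same source prompt."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.T / temperature   # (batch, batch) cosine similarities
    targets = torch.arange(len(z_a))     # matching pairs lie on the diagonal
    return F.cross_entropy(logits, targets)

print(float(info_nce(torch.randn(16, 256), torch.randn(16, 256))))
```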

AAAI Conference 2026 Conference Paper

ConSurv: Multimodal Continual Learning for Survival Analysis

  • Dianzhi Yu
  • Conghao Xiong
  • Yankai Chen
  • Wenqian Cui
  • Xinni Zhang
  • Yifei Zhang
  • Hao Chen
  • Joseph J. Y. Sung

Survival prediction of cancers is crucial for clinical practice, as it informs mortality risks and influences treatment plans. However, a static model trained on a single dataset fails to adapt to the dynamically evolving clinical environment and continuous data streams, limiting its practical utility. While continual learning (CL) offers a solution to learn dynamically from new datasets, existing CL methods primarily focus on unimodal inputs and suffer from severe catastrophic forgetting in survival prediction. In real-world scenarios, multimodal inputs often provide comprehensive and complementary information, such as whole slide images and genomics; and neglecting inter-modal correlations negatively impacts the performance. To address the two challenges of catastrophic forgetting and complex inter-modal interactions between gigapixel whole slide images and genomics, we propose ConSurv, the first multimodal continual learning (MMCL) method for survival analysis. ConSurv incorporates two key components: Multi-staged Mixture of Experts (MS-MoE) and Feature Constrained Replay (FCR). MS-MoE captures both task-shared and task-specific knowledge at different learning stages of the network, including two modality encoders and the modality fusion component, learning inter-modal relationships. FCR further enhances learned knowledge and mitigates forgetting by restricting feature deviation of previous data at different levels, including encoder-level features of two modalities and the fusion-level representations. Additionally, we introduce a new benchmark integrating four datasets, Multimodal Survival Analysis Incremental Learning (MSAIL), for comprehensive evaluation in the CL setting. Extensive experiments demonstrate that ConSurv outperforms competing methods across multiple metrics.

AAAI Conference 2026 Conference Paper

Is Your (Reasoning) Multimodal Language Model Vulnerable Toward Distractions?

  • Ming Liu
  • Hao Chen
  • Jindong Wang
  • Liwen Wang
  • Jingchen Sun
  • Wensheng Zhang

Vision-Language Models (VLMs) have achieved success in tasks such as visual question answering, yet their resilience to distractions remains underexplored. Understanding how distractions affect VLMs' performance is crucial for real-world applications, as input data often contains noisy or irrelevant content. This paper assesses the robustness of VLMs—including general-purpose models and those specialized for reasoning—against distractions in the context of science question answering. We introduce I-ScienceQA, a new benchmark based on the ScienceQA dataset, which systematically injects distractions into both visual and textual contexts. We evaluate how distractions perturb the underlying reasoning processes of these models by analyzing changes in textual explanations leading to answers. Our findings show that most VLMs are vulnerable to distractions, with a noticeable degradation in reasoning when extraneous content is present. However, some models (including GPT-o4 mini) exhibit a higher degree of robustness. We also observe that textual distractions generally cause greater performance declines than visual distractions. Finally, we explore mitigation strategies such as prompt engineering. Although these strategies improve resilience modestly, our analysis highlights considerable room for further improvement in the robustness of VLMs.

AAAI Conference 2026 Conference Paper

Knowledge-Enhanced Explainable Prompting for Vision-Language Models

  • Yequan Bie
  • Andong Tan
  • Zhixuan Chen
  • Zhiyuan Cai
  • Luyang Luo
  • Hao Chen

Large-scale vision-language models (VLMs) embedded with expansive representations and visual concepts have showcased significant potential in image and text understanding. Efficiently adapting VLMs such as CLIP to downstream tasks like few-shot image classification has garnered growing attention, with prompt learning emerging as a representative approach. However, most existing prompt-based adaptation methods, which rely solely on coarse-grained textual prompts, suffer from limited performance and interpretability when handling domain tasks that require specific knowledge. This results in a failure to satisfy the stringent trustworthiness requirements of Explainable Artificial Intelligence (XAI) in high-risk scenarios like healthcare. To address this issue, we propose a Knowledge-Enhanced Explainable Prompting (KEEP) framework that leverages fine-grained domain-specific knowledge to enhance the adaptation process of VLMs across various domains and image modalities. By incorporating retrieval augmented generation and domain foundation models, our framework can provide more reliable image-wise knowledge for prompt learning in various domains, alleviating the lack of fine-grained annotations, while offering both visual and textual explanations. Extensive experiments and explainability analyses conducted on eight datasets of different domains and image modalities demonstrate that our method simultaneously achieves superior performance and interpretability, highlighting the effectiveness of the collaboration between foundation models and XAI.

TMLR Journal 2026 Journal Article

Learning from Online Videos at Inference Time for Computer-Use Agents

  • Yujian Liu
  • Ze Wang
  • Hao Chen
  • Ximeng Sun
  • Xiaodong Yu
  • Jialian Wu
  • Jiang Liu
  • Emad Barsoum

Computer-use agents can operate computers and automate laborious tasks, but despite recent rapid progress, they still lag behind human users, especially when tasks require domain-specific procedural knowledge about particular applications, platforms, and multi-step workflows. Humans can bridge this gap by watching video tutorials: we search, skim, and selectively imitate short segments that match our current subgoal. In this paper, we study how to enable computer-use agents to learn from online videos at inference time effectively. We propose a framework that retrieves and filters tutorial videos, converts them into structured demonstration trajectories, and dynamically selects trajectories as in-context guidance during execution. Particularly, using a VLM, we infer UI actions, segment videos into short subsequences of actions, and assign each subsequence a textual objective. At inference time, a two-stage selection mechanism dynamically chooses a single trajectory to add in context at each step, focusing the agent on the most helpful local guidance for its next decision. Experiments on two widely used benchmarks show that our framework consistently outperforms strong base agents and variants that use only textual tutorials or transcripts. Analyses highlight the importance of trajectory segmentation and selection, action filtering, and visual information, suggesting that abundant online videos can be systematically distilled into actionable guidance that improves computer-use agents at inference time.

AAAI Conference 2026 Conference Paper

LLM Collaborative Filtering: User-Item Graph as New Language

  • Huachi Zhou
  • Yujing Zhang
  • Hao Chen
  • Qinggang Zhang
  • Qijie Shen
  • Feiran Huang
  • Xiao Huang

In collaborative filtering, learning effective embeddings for users and items from interaction data remains a central challenge. While recent efforts leverage large language models (LLMs) to enhance collaborative filtering, two critical limitations persist: (1) Efficiency: LLM-based inference is significantly slower than traditional embedding-based search; and (2) Topological Modeling: LLMs struggle to capture graph structures, which are essential for modeling multi-order user-item interactions. To address these limitations, we propose New Language Collaborative Filtering (NLCF), a framework that aligns LLMs with collaborative filtering by conceptualizing user-item graphs as new languages. This approach is based on two key insights: (1) LLMs excel at mastering new languages when trained on suitable corpora, and (2) the empirical conditional probability between tokens in corpora converges to the transition probabilities between nodes in graphs. NLCF translates user-item graphs into corpora, where users and items are treated as tokens. These corpora are used to fine-tune LLMs, and the learned representations are aggregated to construct user and item embeddings that encode multi-order interactions. Unlike methods that deploy LLMs for inference, NLCF distills LLM knowledge learned from corpora into compact embeddings, enabling both efficient training and real-time inference. The framework has been deployed on a billion-scale e-commerce platform for several months. Extensive experiments demonstrate that NLCF outperforms traditional graph CF models and LLM-based baselines while achieving significant training and inference efficiency improvement over LLM-based baselines.
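
A minimal sketch of the "graph as language" idea behind NLCF: turn a user-item interaction graph into token sequences via random walks, so that token co-occurrence statistics mirror node transition probabilities, as the abstract's second insight states. The walk parameters and helper names are assumptions for illustration, not the deployed pipeline.

```python
import random

interactions = {                     # user -> clicked items (toy bipartite graph)
    "u1": ["i1", "i2"],
    "u2": ["i2", "i3"],
    "u3": ["i1", "i3"],
}
adj = {}                             # undirected adjacency over users and items
for u, items in interactions.items():
    for i in items:
        adj.setdefault(u, []).append(i)
        adj.setdefault(i, []).append(u)

def random_walk_corpus(adj, walks_per_node=10, walk_len=8, seed=0):
    rng = random.Random(seed)
    corpus = []
    for start in adj:
        for _ in range(walks_per_node):
            node, sentence = start, [start]
            for _ in range(walk_len - 1):
                node = rng.choice(adj[node])    # uniform transition probability
                sentence.append(node)
            corpus.append(" ".join(sentence))   # one "sentence" of graph tokens
    return corpus

print(random_walk_corpus(adj)[0])   # e.g. "u1 i2 u2 i3 ..." — text for LLM tuning
```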

AAAI Conference 2026 Conference Paper

ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks

  • Kaijun Wang
  • Liqin Lu
  • Mingyu Liu
  • Jianuo Jiang
  • Zeju Li
  • Bolin Zhang
  • Wancai Zheng
  • Xinyi Yu

Language-guided long-horizon mobile manipulation has long been a grand challenge in embodied semantic reasoning, generalizable manipulation, and adaptive locomotion. Three fundamental limitations hinder progress: First, although large language models have shown promise in enhancing spatial reasoning and task planning through learned semantic priors, existing implementations remain confined to tabletop scenarios, failing to address the constrained perception and limited actuation ranges characteristic of mobile platforms. Second, current manipulation strategies exhibit insufficient generalization when confronted with the diverse object configurations encountered in open-world environments. Third, while crucial for practical deployment, the dual requirement of maintaining high platform maneuverability alongside precise end-effector control in unstructured settings remains understudied in the literature. In this work, we present ODYSSEY, a unified mobile manipulation framework for agile quadruped robots equipped with manipulators, which seamlessly integrates high-level task planning with low-level whole-body control. To address the challenge of egocentric perception in language-conditioned tasks, we introduce a hierarchical planner powered by a vision-language model, enabling long-horizon instruction decomposition and precise action execution. At the control level, our novel whole-body policy achieves robust coordination of locomotion and manipulation across challenging terrains. We further present the first comprehensive benchmark for long-horizon mobile manipulation, evaluating diverse indoor and outdoor scenarios. Through successful sim-to-real transfer, we demonstrate the system’s generalization and robustness in real-world deployments, underscoring the practicality of legged manipulators in unstructured environments. Our work advances the feasibility of generalized robotic assistants capable of complex, dynamic tasks.

AAAI Conference 2026 Conference Paper

Reinforced Rate Control for Neural Video Compression via Inter-Frame Rate–Distortion Awareness

  • Wuyang Cong
  • Junqi Shi
  • Lizhong Wang
  • Weijing Shi
  • Ming Lu
  • Hao Chen
  • Zhan Ma

Neural video compression (NVC) has demonstrated superior compression efficiency, yet effective rate control remains a significant challenge due to complex temporal dependencies. Existing rate control schemes typically leverage frame content to capture distortion interactions, overlooking inter-frame rate dependencies arising from shifts in per-frame coding parameters. This often leads to suboptimal bitrate allocation and cascading parameter decisions. To address this, we propose a reinforcement-learning (RL)-based rate control framework that formulates the task as a frame-by-frame sequential decision process. At each frame, an RL agent observes a spatiotemporal state and selects coding parameters to optimize a long-term reward that reflects rate-distortion (R-D) performance and bitrate adherence. Unlike prior methods, our approach jointly determines bitrate allocation and coding configuration in a single step, independent of group-of-pictures (GOP) structure. Extensive experiments across diverse NVC architectures show that our method reduces the average relative bitrate error to 1.20% and achieves up to 13.45% bitrate savings at typical GOP sizes, outperforming existing approaches. In addition, our framework demonstrates improved robustness to content variation and bandwidth fluctuations with lower encoding/decoding overhead, making it highly suitable for practical deployment.
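
An illustrative reward for the per-frame decision process described above, trading off rate-distortion cost against adherence to a target bitrate. The functional form and weights are assumptions chosen for intuition, not the paper's reward.

```python
def rd_reward(distortion, bits, target_bits, lam=0.01, mu=0.5):
    """Higher is better: low R-D cost plus small deviation from the bit budget."""
    rd_cost = distortion + lam * bits
    rate_penalty = mu * abs(bits - target_bits) / max(target_bits, 1)
    return -(rd_cost + rate_penalty)

# A frame that overshoots its budget scores worse than one on target.
print(rd_reward(distortion=30.0, bits=10_000, target_bits=10_000))  # -130.0
print(rd_reward(distortion=30.0, bits=14_000, target_bits=10_000))  # -170.2
```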

JBHI Journal 2026 Journal Article

SegTom: A 3D Volumetric Medical Image Segmentation Framework for Thoracoabdominal Multi-Organ Anatomical Structures

  • Yan Pang
  • Yunhao Li
  • Jiaming Liang
  • Hao Chen
  • Ying Hu
  • Qiong Wang

Accurate segmentation of thoracoabdominal anatomical structures in three-dimensional medical imaging modalities is fundamental for informed clinical decision-making across a wide array of medical disciplines. Current approaches often struggle to efficiently and comprehensively process this region’s intricate and heterogeneous anatomical information, leading to suboptimal outcomes in diagnosis, treatment planning, and disease management. To address this challenge, we introduce SegTom, a novel volumetric segmentation framework equipped with a cutting-edge SegTom Block specifically engineered to effectively capture the complex anatomical representations inherent to the thoracoabdominal region. This SegTom Block incorporates a hierarchical anatomical-representation decomposition to facilitate efficient information exchange by decomposing the computationally intensive self-attention mechanism and cost-effectively aggregating the extracted representations. Rigorous validation of SegTom across nine diverse datasets, encompassing both computed tomography (CT) and magnetic resonance imaging (MRI) modalities, consistently demonstrates high performance across a broad spectrum of anatomical structures. Specifically, SegTom achieves a mean Dice similarity coefficient (DSC) of 87.29% for cardiac segmentation on the MM-WHS MRI dataset, 83.48% for multi-organ segmentation on the BTCV abdominal CT dataset, and 92.01% for airway segmentation on a dedicated CT dataset.

JBHI Journal 2026 Journal Article

SPSID: A single-parameter shrinkage inverse-diffusion for denoising gene-regulatory networks

  • Hao Chen
  • Ge Han
  • Wenze Ding
  • Clara Grazian

Inferring gene regulatory networks (GRNs) from expression data is a fundamental problem in systems biology, but its accuracy is often undermined by structural noise arising from transitive correlations. These indirect interactions can obscure the true regulatory architecture, leading to a high rate of false positives. To address this, we introduce SPSID (Single-Parameter Shrinkage Inverse-Diffusion), a novel and robust network denoising framework. SPSID is a deterministic post-processing operator applied to an inferred GRN score matrix, rather than a generative diffusion model for gene expression. SPSID employs a principled spectral filter, built upon a shrinkage-regularized inverse-diffusion operator, to mathematically distinguish direct, one-step interactions from multi-step, indirect paths. This approach guarantees numerical stability and, through a fixed default parameter, effectively eliminates the need for data-dependent tuning. We conducted a comprehensive evaluation of SPSID on both extensive simulations and the gold-standard DREAM5 benchmark. The results demonstrate that SPSID outperforms state-of-the-art baseline methods in both AUROC and AUPR, exhibiting good stability across diverse network conditions. Furthermore, it functions as a post-processing tool, elevating the performance of multiple upstream GRN inference methods. By providing a computationally efficient and parameter-free solution to filter structural noise, SPSID offers a readily applicable tool for uncovering the underlying topology of complex biological networks with greater fidelity.
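
A minimal NumPy sketch of a shrinkage-regularized inverse-diffusion filter in the spirit of SPSID: if the observed score matrix S aggregates direct effects plus multi-step (transitive) paths, a deconvolution-style inverse D = S(I + βS)⁻¹ damps indirect contributions. The closed form and the default β here are illustrative assumptions, not the paper's exact operator.

```python
import numpy as np

def inverse_diffusion(S, beta=0.5):
    S = (S + S.T) / 2                                         # symmetrize scores
    S = S / max(np.abs(np.linalg.eigvalsh(S)).max(), 1e-12)   # spectral scaling
    n = S.shape[0]
    return S @ np.linalg.inv(np.eye(n) + beta * S)   # shrinkage keeps this stable

# Toy check: a chain a->b->c induces a spurious transitive a-c score (0.5).
S = np.array([[0.0, 0.9, 0.5],
              [0.9, 0.0, 0.9],
              [0.5, 0.9, 0.0]])
D = inverse_diffusion(S)
print(D[0, 1], D[0, 2])    # the a-c score drops well below the direct a-b score
print(D[0, 2] < D[0, 1])   # True
```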

AAAI Conference 2026 Conference Paper

You Don’t Need Pre-Built Graphs for RAG: Retrieval Augmented Generation with Adaptive Reasoning Structures

  • Shengyuan Chen
  • Chuang Zhou
  • Zheng Yuan
  • Qinggang Zhang
  • Zeyang Cui
  • Hao Chen
  • Yilin Xiao
  • Jiannong Cao

Large language models (LLMs) often suffer from hallucination, generating factually incorrect statements when handling questions beyond their knowledge and perception. Retrieval-augmented generation (RAG) addresses this by retrieving query-relevant contexts from knowledge bases to support LLM reasoning. Recent advances leverage pre-constructed graphs to capture the relational connections among distributed documents, showing remarkable performance in complex tasks. However, existing Graph-based RAG (GraphRAG) methods rely on a costly process to transform the corpus into a graph, introducing overwhelming token cost and update latency. Moreover, real-world queries vary in type and complexity, requiring different logic structures for accurate reasoning. The pre-built graph may not align with these required structures, resulting in ineffective knowledge retrieval. To this end, we propose a Logic-aware Retrieval Augmented Generation framework (LogicRAG) that dynamically extracts reasoning structures at inference time to guide adaptive retrieval without any pre-built graph. LogicRAG begins by decomposing the input query into a set of subproblems and constructing a directed acyclic graph (DAG) to model the logical dependencies among them. To support coherent multi-step reasoning, LogicRAG then linearizes the graph using topological sort, so that subproblems can be addressed in a logically consistent order. Besides, LogicRAG applies graph pruning to reduce redundant retrieval and uses context pruning to filter irrelevant context, significantly reducing the overall token cost. Extensive experiments demonstrate that LogicRAG achieves both superior performance and efficiency compared to state-of-the-art baselines.
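
A sketch of LogicRAG's control flow as described above: decompose a query into subproblems with logical dependencies (a DAG), then answer them in topological order so each step can condition on its prerequisites. The decompose/retrieve/answer helpers are hypothetical stand-ins for LLM calls, not the authors' implementation.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def solve(query, decompose, retrieve, answer):
    deps = decompose(query)   # {subproblem: set of prerequisite subproblems}
    answers = {}
    for sub in TopologicalSorter(deps).static_order():
        context = retrieve(sub)                        # adaptive retrieval per step
        prereqs = {p: answers[p] for p in deps[sub]}   # earlier answers feed later ones
        answers[sub] = answer(sub, context, prereqs)
    return answers

# Toy run with stubs standing in for the LLM decomposer, retriever, and answerer.
deps = {
    "Who directed film X?": set(),
    "What else did that director make?": {"Who directed film X?"},
}
out = solve(
    "What other films were made by the director of film X?",
    decompose=lambda q: deps,
    retrieve=lambda s: ["<retrieved passage>"],
    answer=lambda s, ctx, pre: f"answer({s})",
)
print(out)
```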

AAAI Conference 2025 Conference Paper

A Denoising Pre-training Framework for Accelerating Novel Material Discovery

  • Shuaike Shen
  • Ke Liu
  • Muzhi Zhu
  • Hao Chen

Crystal materials play an important role in the development of society. The discovery of new materials is critical to achieving sustainable development goals (SDGs), such as climate change mitigation, affordable and clean energy, and fostering innovation in industry and infrastructure. Recent advances in deep learning for crystal property prediction have accelerated material discovery, but these methods typically rely on labeled data, which is often limited and varies across different properties. This limitation hinders the full utilization of the vast amount of unlabeled data in materials science. To overcome this challenge, we introduce an unsupervised Denoising Pre-training Framework (DPF) tailored for crystal structures. DPF trains a model to reconstruct the original crystal structure by recovering the masked atom types, perturbed atom positions, and perturbed crystal lattices. Through pre-training, models learn the intrinsic features of crystal structures and capture the key features influencing crystal properties. We pre-train models on a dataset of 380,743 unlabeled crystal structures and fine-tune them on downstream property prediction tasks. Extensive experiments demonstrate the effectiveness of our framework, showing its potential to significantly advance material science and contribute to the development of society by accelerating the discovery of materials crucial for sustainable technologies.
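
A hedged sketch of the corruption step implied by DPF: mask a fraction of atom types, jitter fractional coordinates, and perturb the lattice, keeping the originals as reconstruction targets. Ratios and noise scales are assumed values, not the paper's settings.

```python
import numpy as np

MASK_TOKEN = -1
rng = np.random.default_rng(0)

def corrupt_crystal(atom_types, frac_coords, lattice,
                    mask_ratio=0.15, pos_sigma=0.05, lat_sigma=0.02):
    """Return a corrupted copy plus the mask; the originals serve as targets."""
    atom_types = atom_types.copy()
    mask = rng.random(len(atom_types)) < mask_ratio
    atom_types[mask] = MASK_TOKEN                           # mask atom identities
    noisy_coords = (frac_coords + rng.normal(0, pos_sigma, frac_coords.shape)) % 1.0
    noisy_lattice = lattice + rng.normal(0, lat_sigma, lattice.shape)  # 3x3 cell
    return atom_types, noisy_coords, noisy_lattice, mask

types = np.array([6, 8, 8])            # toy CO2-like motif
coords = rng.random((3, 3))            # fractional coordinates in [0, 1)
cell = 4.0 * np.eye(3)                 # cubic cell, 4 A edges
print(corrupt_crystal(types, coords, cell)[0])
```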

IJCAI Conference 2025 Conference Paper

A Survey of Pathology Foundation Model: Progress and Future Directions

  • Conghao Xiong
  • Hao Chen
  • Joseph J. Y. Sung

Computational pathology, which involves analyzing whole slide images for automated cancer diagnosis, relies on multiple instance learning, where performance depends heavily on the feature extractor and aggregator. Recent Pathology Foundation Models (PFMs), pretrained on large-scale histopathology data, have significantly enhanced both the extractor and aggregator, but they lack a systematic analysis framework. In this survey, we present a hierarchical taxonomy organizing PFMs through a top-down philosophy applicable to foundation model analysis in any domain: model scope, model pretraining, and model design. Additionally, we systematically categorize PFM evaluation tasks into slide-level, patch-level, multimodal, and biological tasks, providing comprehensive benchmarking criteria. Our analysis identifies critical challenges in both PFM development (pathology-specific methodology, end-to-end pretraining, data-model scalability) and utilization (effective adaptation, model maintenance), paving the way for future directions in this promising field. Resources referenced in this survey are available at https://github.com/BearCleverProud/AwesomeWSI.

ICRA Conference 2025 Conference Paper

Accelerated Quasi-Static FEM for Real-Time Modeling of Continuum Robots with Multiple Contacts and Large Deformation

  • Hao Chen
  • Jian Chen 0036
  • Xinran Liu
  • Zihui Zhang
  • Yuanrui Huang
  • Zhongkai Zhang 0001
  • Hongbin Liu 0001

Continuum robots offer high flexibility and multiple degrees of freedom, making them ideal for navigating narrow lumens. However, accurately modeling their behavior under large deformations and frequent environmental contacts remains challenging. Current methods for solving the deformation of these robots, such as the Model Order Reduction and Gauss-Seidel (GS) methods, suffer from significant drawbacks. They experience reduced computational speed as the number of contact points increases and struggle to balance speed with model accuracy. To overcome these limitations, we introduce a novel finite element method (FEM) named Acc-FEM. Acc-FEM employs a large deformation quasi-static finite element model and integrates an accelerated solver scheme to handle multi-contact simulations efficiently. Additionally, it utilizes parallel computing with Graphics Processing Units (GPUs) for real-time updates of the finite element models and collision detection. Extensive numerical experiments demonstrate that Acc-FEM significantly improves computational efficiency in modeling continuum robots with multiple contacts while achieving satisfactory accuracy, addressing the deficiencies of existing methods.

TIST Journal 2025 Journal Article

Adaptive Intention Learning for Session-Based Recommendation

  • Qingbo Zhang
  • Xiaochun Yang
  • Hao Chen
  • Bin Wang
  • Zhu Sun
  • Xiangmin Zhou

In recent years, session-based recommender systems (SRSs) have emerged as a significant research focus within the recommendation field. Capturing user intentions to infer user interest accordingly has proven to be effective in enhancing the accuracy of SRSs. However, existing techniques assume that all sessions have the same number of intentions or that the items in one category belonging to the same session reflect the same intention. In real applications, such as e-commerce, sessions may have different numbers of intentions, and the same type of items in a session may correspond to different intentions. As a result, existing techniques cannot guarantee high-quality user interest prediction. In this article, we propose a novel Adaptive Intention Learning Network (AILN) to capture an adaptive number of intentions for each session, thereby enhancing the accuracy of user interest inference. Specifically, we design an intention evaluation network (IEN) to evaluate whether a subsequence of a session corresponds to a valid intention, and an intention generation network (IGN) to learn the representation of a valid intention. By checking each subsequence of a session, IEN and IGN enable the incremental learning of a session-specific intention hierarchy (IH) to store valid intentions of the session. To reduce the cost of building the IH, we propose a pruning strategy that exploits the intention validity to avoid unnecessary evaluation. The representative intentions are selected from IH and input into a designed interest predictor to infer the user interest. Experimental results on two real-world datasets demonstrate the superiority of our proposed AILN.

ICML Conference 2025 Conference Paper

Beyond Zero Initialization: Investigating the Impact of Non-Zero Initialization on LoRA Fine-Tuning Dynamics

  • Shiwei Li 0002
  • Xiandi Luo
  • Xing Tang 0007
  • Haozhao Wang
  • Hao Chen
  • Weihong Luo
  • Yuhua Li 0003
  • Xiuqiang He 0001

Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method. In standard LoRA layers, one of the matrices, $A$ or $B$, is initialized to zero, ensuring that fine-tuning starts from the pretrained model. However, there is no theoretical support for this practice. In this paper, we investigate the impact of non-zero initialization on LoRA’s fine-tuning dynamics from an infinite-width perspective. Our analysis reveals that, compared to zero initialization, simultaneously initializing $A$ and $B$ to non-zero values improves LoRA’s robustness to suboptimal learning rates, particularly smaller ones. Further analysis indicates that although the non-zero initialization of $AB$ introduces random noise into the pretrained weight, it generally does not affect fine-tuning performance. In other words, fine-tuning does not need to strictly start from the pretrained model. The validity of our findings is confirmed through extensive experiments across various models and datasets. The code is available at https://github.com/Leopold1423/non_zero_lora-icml25.
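
A sketch of the setup under discussion: a standard LoRA layer where both A and B can start non-zero, so W0 + (alpha/r)·BA begins slightly away from the pretrained weights. The initialization scale below is an assumption chosen to keep the initial perturbation small; it is not the paper's prescribed value.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16, zero_init=False):
        super().__init__()
        self.base = base.requires_grad_(False)   # frozen pretrained layer
        d_out, d_in = base.weight.shape
        # Conventional LoRA sets B = 0 so fine-tuning starts exactly at W0;
        # the non-zero variant starts slightly away from the pretrained model.
        self.A = nn.Parameter(torch.randn(r, d_in) / d_in**0.5)
        B0 = torch.zeros(d_out, r) if zero_init else torch.randn(d_out, r) / d_out**0.5
        self.B = nn.Parameter(B0)
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(64, 64), zero_init=False)
print(layer(torch.randn(2, 64)).shape)   # torch.Size([2, 64])
```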

IROS Conference 2025 Conference Paper

Controllable Traffic Simulation through LLM-Guided Hierarchical Reasoning and Refinement

  • Zhiyuan Liu
  • Leheng Li
  • Yuning Wang
  • Haotian Lin 0006
  • Hao Chen
  • Zhizhe Liu
  • Lei He
  • Jianqiang Wang 0003

Evaluating autonomous driving systems in complex and diverse traffic scenarios through controllable simulation is essential to ensure their safety and reliability. However, existing traffic simulation methods face challenges in their controllability. To address this, we propose a novel diffusion-based and LLM-enhanced traffic simulation framework. Our approach incorporates a high-level understanding module and a low-level refinement module, which systematically examines the hierarchical structure of traffic elements, guides LLMs to thoroughly analyze traffic scenario descriptions step by step, and refines the generation by self-reflection, enhancing their understanding of complex situations. Furthermore, we propose a Frenet-frame-based cost function framework that provides LLMs with geometrically meaningful quantities, improving their grasp of spatial relationships in a scenario and enabling more accurate cost function generation. Experiments on the Waymo Open Motion Dataset (WOMD) demonstrate that our method can handle more intricate descriptions and generate a broader range of scenarios in a controllable manner.

IJCAI Conference 2025 Conference Paper

DGCPL: Dual Graph Distillation for Concept Prerequisite Relation Learning

  • Miao Zhang
  • Jiawei Wang
  • Jinying Han
  • Kui Xiao
  • Zhifei Li
  • Yan Zhang
  • Hao Chen
  • Shihui Wang

Concept prerequisite relations determine the learning order of knowledge concepts in one domain, which has an important impact on teachers' course design and students' personalized learning. Current research usually predicts concept prerequisite relations from the perspective of knowledge, and rarely pays attention to the role of learners' learning behavior. We propose a Dual Graph Distillation Method for Concept Prerequisite Relation Learning (DGCPL). Specifically, DGCPL constructs a dual graph structure from both the knowledge and learning behavior perspectives, and captures the high-order knowledge features and learning behavior features through the concept-resource hypergraph and the learning behavior graph respectively. In addition, we introduce a gated knowledge distillation to fuse the structural information of concept nodes in the two graphs, so as to obtain a more comprehensive concept embedding representation and achieve accurate prediction of prerequisite relations. On three public benchmark datasets, we compare DGCPL with eight graph-based baseline methods and five traditional classification baseline methods. The experimental results show that DGCPL achieves state-of-the-art performance in learning concept prerequisite relations. Our code is available at https://github.com/wisejw/DGCPL.

ECAI Conference 2025 Conference Paper

DialogGraph-LLM: Graph-Informed LLMs for End-to-End Audio Dialogue Intent Recognition

  • HongYu Liu
  • Junxin Li
  • Changxi Guo
  • Hao Chen
  • Yaqian Huang
  • Yifu Guo
  • Huan Yang
  • Lihua Cai

Recognizing speaker intent in long audio dialogues has a wide range of applications, but is a non-trivial AI task due to complex inter-dependencies in speaker utterances and scarce annotated data. To address these challenges, an end-to-end framework, namely DialogGraph-LLM, is proposed in the current work. DialogGraph-LLM combines a novel Multi-Relational Dialogue Attention Network (MR-DAN) architecture with multimodal foundation models (e.g., Qwen2.5-Omni-7B) for direct acoustic-to-intent inference. An adaptive semi-supervised learning strategy is designed using LLM with a confidence-aware pseudo-label generation mechanism based on dual-threshold filtering using both global and class confidences, and an entropy-based sample selection process that prioritizes high-information unlabeled instances. Extensive evaluations on the proprietary MarketCalls corpus and the publicly available MIntRec 2.0 benchmark demonstrate DialogGraph-LLM's superiority over strong audio and text-driven baselines. The framework demonstrates strong performance and efficiency in intent recognition in real-world audio dialogues, proving its practical value for audio-rich domains with limited supervision. Our code is available at https://github.com/david188888/DialogGraph-LLM.

NeurIPS Conference 2025 Conference Paper

DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks

  • Canyu Zhao
  • Yanlong Sun
  • Mingyu Liu
  • Huanyi Zheng
  • Muzhi Zhu
  • Zhiyue Zhao
  • Hao Chen
  • Tong He

This paper's primary objective is to develop a robust generalist perception model capable of addressing multiple tasks under constraints of computational resources and limited training data. We leverage text-to-image diffusion models pre-trained on billions of images and successfully introduce our DICEPTION, a visual generalist model. Exhaustive evaluations demonstrate that DICEPTION effectively tackles diverse perception tasks, even achieving performance comparable to SOTA single-task specialist models. Specifically, we achieve results on par with SAM-vit-h using only 0.06% of their data (e.g., 600K vs. 1B pixel-level annotated images). We designed comprehensive experiments on architectures and input paradigms, demonstrating that the key to successfully re-purposing a single diffusion model for multiple perception tasks lies in maximizing the preservation of the pre-trained model's prior knowledge. Consequently, DICEPTION can be trained with substantially lower computational costs than conventional models requiring training from scratch. Furthermore, adapting DICEPTION to novel tasks is highly efficient, necessitating fine-tuning on as few as 50 images and approximately 1% of its parameters. Finally, we demonstrate that a subtle application of classifier-free guidance can improve the model's performance on depth and normal estimation. We also show that pixel-aligned training, as is characteristic of perception tasks, significantly enhances the model's ability to preserve fine details. DICEPTION offers valuable insights and presents a promising direction for the development of advanced diffusion-based visual generalist models.
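
For reference, the abstract's mention of classifier-free guidance refers to the standard combination rule below; this is the generic formulation, not DICEPTION's specific guidance schedule.

```python
import torch

def cfg_noise(eps_uncond, eps_cond, guidance_scale=1.5):
    """Standard CFG: extrapolate from the unconditional prediction toward the
    conditional one; scale 1.0 recovers the plain conditional prediction."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_u, eps_c = torch.randn(1, 4, 32, 32), torch.randn(1, 4, 32, 32)
print(cfg_noise(eps_u, eps_c).shape)   # torch.Size([1, 4, 32, 32])
```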

AAAI Conference 2025 Conference Paper

DiffCalib: Reformulating Monocular Camera Calibration as Diffusion-Based Dense Incident Map Generation

  • Xiankang He
  • Guangkai Xu
  • Bo Zhang
  • Hao Chen
  • Ying Cui
  • Dongyan Guo

Monocular camera calibration is a key precondition for numerous 3D vision applications. Despite considerable advancements, existing methods often hinge on specific assumptions and struggle to generalize across varied real-world scenarios, and the performance is limited by insufficient training data. Recently, diffusion models trained on expansive datasets have been confirmed to maintain the capability to generate diverse, high-quality images. This success suggests a strong potential of the models to effectively understand varied visual information. In this work, we leverage the comprehensive visual knowledge embedded in pre-trained diffusion models to enable more robust and accurate monocular camera intrinsic estimation. Specifically, we reformulate the problem of estimating the four degrees of freedom (4-DoF) of camera intrinsic parameters as a dense incident map generation task. The map details the angle of incidence for each pixel in the RGB image, and its format aligns well with the paradigm of diffusion models. The camera intrinsics can then be derived from the incident map with a simple non-learning RANSAC algorithm during inference. Moreover, to further enhance the performance, we jointly estimate a depth map to provide extra geometric information for the incident map estimation. Extensive experiments on multiple testing datasets demonstrate that our model achieves state-of-the-art performance, gaining up to a 40% reduction in prediction errors. Besides, the experiments also show that the precise camera intrinsics and depth maps estimated by our pipeline can greatly benefit practical applications such as 3D reconstruction from a single in-the-wild image.

JBHI Journal 2025 Journal Article

Efficient Breast Lesion Segmentation From Ultrasound Videos Across Multiple Source-Limited Platforms

  • Yan Pang
  • Yunhao Li
  • Teng Huang
  • Jiaming Liang
  • Ziyu Ding
  • Hao Chen
  • Baoliang Zhao
  • Ying Hu

Medical video segmentation is fundamentally important in clinical diagnosis and treatment procedures, offering dynamic tracking of breast lesions across frames in ultrasound videos for improved segmentation performance. However, existing approaches face challenges in striking a balance between segmentation performance and inference speed, hindering real-time application in resource-constrained medical environments. To address these limitations, we present BaS, a blazing-fast on-device breast lesion segmentation model. BaS integrates the Stem module and BaSBlock to refine representations through inter- and intra-frame analysis on ultrasound videos. In addition, we release two versions of BaS: BaS-S for superior segmentation performance and BaS-L for accelerated inference. Experimental results indicate that BaS surpasses top-performing models in both segmentation efficiency and prediction accuracy on devices with limited resources. This work advances the development of efficient medical video segmentation frameworks applicable to multiple medical platforms.

NeurIPS Conference 2025 Conference Paper

Enforcing Hard Linear Constraints in Deep Learning Models with Decision Rules

  • Gonzalo E. Constante
  • Hao Chen
  • Can Li

Deep learning models are increasingly deployed in safety-critical tasks where predictions must satisfy hard constraints, such as physical laws, fairness requirements, or safety limits. However, standard architectures lack built-in mechanisms to enforce such constraints, and existing approaches based on regularization or projection are often limited to simple constraints, computationally expensive, or lack feasibility guarantees. This paper proposes a model-agnostic framework for enforcing input-dependent linear equality and inequality constraints on neural network outputs. The architecture combines a task network trained for prediction accuracy with a safe network trained using decision rules from the stochastic and robust optimization literature to ensure feasibility across the entire input space. The final prediction is a convex combination of the two subnetworks, guaranteeing constraint satisfaction during both training and inference without iterative procedures or runtime optimization. We prove that the architecture is a universal approximator of constrained functions and derive computationally tractable formulations based on linear decision rules. Empirical results on benchmark regression tasks show that our method consistently satisfies constraints while maintaining competitive accuracy and low inference latency.
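
One simple way to realize the convex-combination idea described above, sketched under assumptions: given a safe-network output that strictly satisfies Ay ≤ b, blend it with the task output using the smallest weight that restores feasibility. Convexity of {y : Ay ≤ b} guarantees the blended point is feasible. This illustrates the guarantee, not the paper's exact combination rule.

```python
import numpy as np

def feasible_blend(y_task, y_safe, A, b, eps=1e-9):
    v = A @ y_task - b                 # violation of each constraint row
    m = A @ (y_task - y_safe)          # how much blending toward y_safe helps
    # Need v <= t * m rowwise; y_safe strictly feasible implies m > v when v > 0.
    t_rows = np.where(v > 0, v / np.maximum(m, eps), 0.0)
    t = float(np.clip(t_rows.max(initial=0.0), 0.0, 1.0))
    return (1 - t) * y_task + t * y_safe

A = np.array([[1.0, 1.0]])
b = np.array([1.0])                              # constraint: y1 + y2 <= 1
y = feasible_blend(np.array([0.9, 0.8]), np.array([0.2, 0.2]), A, b)
print(y, A @ y <= b + 1e-9)                      # blended prediction is feasible
```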

NeurIPS Conference 2025 Conference Paper

EPA: Boosting Event-based Video Frame Interpolation with Perceptually Aligned Learning

  • Yuhan Liu
  • LingHui Fu
  • Zhen Yang
  • Hao Chen
  • Youfu Li
  • Yongjian Deng

Event cameras, with their capacity to provide high temporal resolution information between frames, are increasingly utilized for video frame interpolation (VFI) in challenging scenarios characterized by high-speed motion and significant occlusion. However, prevalent issues of blur and distortion within the keyframes and ground truth data used for training and inference in these demanding conditions are frequently overlooked. This oversight impedes the perceptual realism and multi-scene generalization capabilities of existing event-based VFI (E-VFI) methods when generating interpolated frames. Motivated by the observation that semantic-perceptual discrepancies between degraded and pristine images are considerably smaller than their image-level differences, we introduce EPA. This novel E-VFI framework diverges from approaches reliant on direct image-level supervision by constructing multilevel, degradation-insensitive semantic perceptual supervisory signals to enhance the perceptual realism and multi-scene generalization of the model's predictions. Specifically, EPA operates in two phases: it first employs a DINO-based perceptual extractor, a customized style adapter, and a reconstruction generator to derive multi-layered, degradation-insensitive semantic-perceptual features ($\mathcal{S}$). Second, a novel Bidirectional Event-Guided Alignment (BEGA) module utilizes deformable convolutions to align perceptual features from keyframes to ground truth with inter-frame temporal guidance extracted from event signals. By decoupling the learning process from direct image-level supervision, EPA enhances model robustness against degraded keyframes and unreliable ground truth information. Extensive experiments demonstrate that this approach yields interpolated frames more consistent with human perceptual preferences. The code will be released upon acceptance.

AAAI Conference 2025 Conference Paper

ESEG: Event-Based Segmentation Boosted by Explicit Edge-Semantic Guidance

  • Yucheng Zhao
  • Gengyu Lyu
  • Ke Li
  • Zihao Wang
  • Hao Chen
  • Zhen Yang
  • Yongjian Deng

Event-based semantic segmentation (ESS) has attracted researchers' attention recently, as event cameras can solve problems such as under/over-exposure or motion blur that are difficult for RGB cameras to handle. However, event data are noisy and sparse, resulting in difficulties for the model to locate and extract reliable cues from their sparse representations, especially when performing pixel-level tasks. In this paper, we propose a novel framework ESEG to alleviate the dilemma. Given that event signals relate closely to moving edges, instead of proposing complex structures to expect them to recognize those reliable edge regions behind event signals on their own, we introduce the explicit edge-semantic supervision as a reference to let the ESS model globally optimize semantics, considering the high confidence of event data in edge regions. In addition, we propose a fusion module named Density-Aware Dynamic-Window Cross Attention Fusion (D²CAF), in which the density perception, cross-attention, and dynamic window masking mechanisms are jointly imposed to optimize edge-dense feature fusion, leveraging the characteristics of event cameras. Experimental results on DSEC and DDD17 datasets demonstrate the efficacy of the ESEG framework and its core designs.

NeurIPS Conference 2025 Conference Paper

Evaluating Program Semantics Reasoning with Type Inference in System $F$

  • Yifeng He
  • Luning Yang
  • Christopher Gonzalo
  • Hao Chen

Large Language Models (LLMs) are increasingly integrated into the software engineering ecosystem. Their test-time compute reasoning capabilities promise significant potential in understanding program logic and semantics beyond mere token recognition. However, current benchmarks evaluating reasoning LLMs for code lack a formal, program-centric deductive framework for sound evaluation, and cannot assess whether models genuinely reason about program semantics or merely associate superficial connections between natural language and code tokens. To bridge this gap, we introduce TF-Bench, a benchmark designed to evaluate LLM reasoning based on type inference in System F, a task we refer to as *program semantics reasoning*. By employing verified transformations to remove semantically irrelevant natural language, we construct TF-Bench_pure, a purely semantics-driven variant of TF-Bench. Our analysis reveals substantial limitations in state-of-the-art LLMs, with the best-performing LLM (Claude-3.7-sonnet) achieving only 55.85% accuracy on TF-Bench_pure. Additionally, we propose two novel metrics to assess the robustness and effectiveness of extended reasoning, underscoring critical limitations in current LLM capabilities and highlighting essential directions for future research.
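
For intuition, here is a textbook System F judgment of the kind the benchmark targets (illustrative only; not claimed to be an actual TF-Bench item): given the type-erased term `twice f x = f (f x)`, the model must infer its most general polymorphic type.

```latex
\[
\mathsf{twice} \;:\; \forall \alpha.\ (\alpha \to \alpha) \to \alpha \to \alpha
\]
```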

NeurIPS Conference 2025 Conference Paper

Fast-in-Slow: A Dual-System VLA Model Unifying Fast Manipulation within Slow Reasoning

  • Hao Chen
  • Jiaming Liu
  • Chenyang Gu
  • Zhuoyang Liu
  • Renrui Zhang
  • Xiaoqi Li
  • Xiao He
  • Yandong Guo

Generalized policy and execution efficiency constitute the two critical challenges in robotic manipulation. While recent foundation policies benefit from the common-sense reasoning capabilities of internet-scale pretrained vision-language models (VLMs), they often suffer from low execution frequency. To mitigate this dilemma, dual-system approaches have been proposed to leverage a VLM-based System 2 module for handling high-level decision-making, and a separate System 1 action module for ensuring real-time control. However, existing designs maintain both systems as separate models, limiting System 1 from fully leveraging the rich pretrained knowledge from the VLM-based System 2. In this work, we propose Fast-in-Slow (FiS), a unified dual-system vision-language-action (VLA) model that embeds the System 1 execution module within the VLM-based System 2 by partially sharing parameters. This innovative paradigm not only enables high-frequency execution in System 1, but also facilitates coordination between multimodal reasoning and execution components within a single foundation model of System 2. Given their fundamentally distinct roles within FiS-VLA, we design the two systems to incorporate heterogeneous modality inputs alongside asynchronous operating frequencies, enabling both fast and precise manipulation. To enable coordination between the two systems, a dual-aware co-training strategy is proposed that equips System 1 with action generation capabilities while preserving System 2’s contextual understanding to provide stable latent conditions for System 1. For evaluation, FiS-VLA outperforms previous state-of-the-art methods by 8% in simulation and 11% in real-world tasks in terms of average success rate, while achieving a 117.7 Hz control frequency with the action chunk size set to eight. Project web page: https://fast-in-slow.github.io.

NeurIPS Conference 2025 Conference Paper

From Pretraining to Pathology: How Noise Leads to Catastrophic Inheritance in Medical Models

  • Hao Sun
  • Zhongyi Han
  • Hao Chen
  • Jindong Wang
  • Xin Gao
  • Yilong Yin

Foundation models pretrained on web-scale data drive contemporary transfer learning in vision, language, and multimodal tasks. Recent work shows that mild label noise in these corpora may lift in-distribution accuracy yet sharply reduce out-of-distribution generalization, an effect known as catastrophic inheritance. Medical data is especially sensitive because annotations are scarce, domain shifts are large, and pretraining sources are noisy. We present the first systematic analysis of catastrophic inheritance in medical models. Controlled label-corruption experiments expose a clear structural collapse: as noise rises, the skewness and kurtosis of feature and logit distributions decline, signaling a flattened representation space and diminished discriminative detail. These higher-order statistics form a compact, interpretable marker of degradation in fine-grained tasks such as histopathology. Guided by this finding, we introduce a fine-tuning objective that restores skewness and kurtosis through two scalar regularizers added to the task loss. The method leaves the backbone unchanged and incurs negligible overhead. Tests on PLIP models trained with Twitter pathology images, as well as other large-scale vision and language backbones, show consistent gains in robustness and cross-domain accuracy under varied noise levels.

NeurIPS Conference 2025 Conference Paper

GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling

  • Tianhao Chen
  • Xin Xu
  • Zijing Liu
  • Pengxiang Li
  • Xinyuan Song
  • Ajay Jaiswal
  • Fan Zhang
  • Jishan Hu

Modern Large Language Models, such as the LLaMA, Qwen and DeepSeek series, predominantly adopt the Pre-LayerNorm (Pre-LN) Transformer architecture. While being stable during pretraining and scalable to large model sizes, Pre-LN suffers from an exponential growth in activation variance across layers, causing the shortcut to dominate over sub-layer outputs in the residual connection and limiting the learning capacity of deeper layers. To mitigate this issue, we propose Gradient-Preserving Activation Scaling (GPAS), a simple technique that can be used in combination with existing approaches. GPAS works by scaling down the intermediate activations while keeping their gradients unchanged. This leaves information in the activations intact, and avoids the gradient vanishing problem associated with gradient downscaling. Extensive experiments across various model sizes from 71M to 1B show that GPAS achieves consistent performance gains. Beyond enhancing Pre-LN Transformers, GPAS also shows promise in improving alternative architectures such as Sandwich-LN and DeepNorm, demonstrating its versatility and potential for improving training dynamics in a wide range of settings. Our code is available at https://github.com/dandingsky/GPAS.
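
The abstract's core mechanism — scale the forward activation down while leaving the backward gradient untouched — can be expressed with the standard stop-gradient identity. The placement and scale value below are illustrative assumptions, not the paper's learned configuration.

```python
import torch

def gpas_scale(x: torch.Tensor, s: float) -> torch.Tensor:
    """Forward pass computes s * x; the backward gradient stays the identity,
    because the detached term contributes no gradient and d(x)/dx = 1."""
    return x + (s * x - x).detach()

x = torch.ones(3, requires_grad=True)
y = gpas_scale(x, 0.5).sum()
y.backward()
print(float(y), x.grad)  # 1.5 (downscaled forward), tensor([1., 1., 1.]) (unscaled grad)
```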

AAAI Conference 2025 Conference Paper

Know Where You Are From: Event-Based Segmentation via Spatio-Temporal Propagation

  • Ke Li
  • Gengyu Lyu
  • Hao Chen
  • Bochen Xie
  • Zhen Yang
  • Youfu Li
  • Yongjian Deng

Event cameras have gained attention in segmentation due to their higher temporal resolution and dynamic range compared to traditional cameras. However, they struggle with issues like lack of color perception and triggering only at motion edges, making it hard to distinguish objects with similar contours or segment spatially continuous objects. Our work aims to address these often overlooked issues. Based on the assumption that various objects exhibit different motion patterns, we believe that embedding the historical motion states of objects into segmented scenes can effectively address these challenges. Inspired by this, we propose the ESS framework "Know Where You Are From" (KWYAF), which incorporates past motion cues through spatio-temporal propagation embedding. This framework features two core components: the Sequential Motion Encoding Module (SME) and the Event-Based Reliable Region Selection Mechanism (ER²SM). The SME constructs prior motion features through spatio-temporal correlation modeling to boost the final segmentation, while ER²SM adaptively identifies high-confidence regions, embedding motion more precisely through local window masks and reliable region selection. Extensive experiments demonstrate the effectiveness of our proposed framework, both quantitatively and qualitatively.

ICLR Conference 2025 Conference Paper

LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior

  • Hanyu Wang
  • Saksham Suri
  • Yixuan Ren
  • Hao Chen
  • Abhinav Shrivastava

We present LARP, a novel video tokenizer designed to overcome limitations in current video tokenization methods for autoregressive (AR) generative models. Unlike traditional patchwise tokenizers that directly encode local visual patches into discrete tokens, LARP introduces a holistic tokenization scheme that gathers information from the visual content using a set of learned holistic queries. This design allows LARP to capture more global and semantic representations, rather than being limited to local patch-level information. Furthermore, it offers flexibility by supporting an arbitrary number of discrete tokens, enabling adaptive and efficient tokenization based on the specific requirements of the task. To align the discrete token space with downstream AR generation tasks, LARP integrates a lightweight AR transformer as a training-time prior model that predicts the next token on its discrete latent space. By incorporating the prior model during training, LARP learns a latent space that is not only optimized for video reconstruction but is also structured in a way that is more conducive to autoregressive generation. Moreover, this process defines a sequential order for the discrete tokens, progressively pushing them toward an optimal configuration during training, ensuring smoother and more accurate AR generation at inference time. Comprehensive experiments demonstrate LARP's strong performance, achieving state-of-the-art FVD on the UCF101 class-conditional video generation benchmark. LARP enhances the compatibility of AR models with videos and opens up the potential to build unified high-fidelity multimodal large language models (MLLMs). Project page: https://hywang66.github.io/larp/

ICRA Conference 2025 Conference Paper

Learn to Swim: Data-Driven LSTM Hydrodynamic Model for Quadruped Robot Gait Optimization

  • Fei Han
  • Pengming Guo
  • Hao Chen
  • Weikun Li
  • Jingbo Ren
  • Naijun Liu
  • Ning Yang
  • Dixia Fan

This paper presents a Long Short-Term Memory network-based Fluid Experiment Data-Driven model (FED-LSTM) for predicting unsteady, nonlinear hydrodynamic forces on the underwater quadruped robot we constructed. Trained on experimental data from leg force and body drag tests conducted in both a recirculating water tank and a towing tank, FED-LSTM outperforms traditional Empirical Formulas (EF) commonly used for flow prediction over flat surfaces. The model demonstrates superior accuracy and adaptability in capturing complex fluid dynamics, particularly in straight-line and turning-gait optimization via the NSGA-II algorithm. FED-LSTM reduces deflection errors during straight-line swimming and improves turn times without increasing the turning radius. Hardware experiments further validate the model's precision and stability over EF. This approach provides a robust framework for enhancing the swimming performance of legged robots, laying the groundwork for future advances in underwater robotic locomotion.

AAAI Conference 2025 Conference Paper

Learning Concept Prerequisite Relation via Global Knowledge Relation Optimization

  • Miao Zhang
  • Jiawei Wang
  • Kui Xiao
  • Shihui Wang
  • Yan Zhang
  • Hao Chen
  • Zhifei Li

Learning concept prerequisite relations helps learners master concepts and build a logically coherent knowledge structure. Many studies use graph neural networks to create heterogeneous knowledge networks that enhance concept representations. However, different types of relations in these networks can influence each other, and existing research often focuses solely on concept relations, neglecting other types of knowledge connections. To address this issue, this paper proposes a novel concept prerequisite relation learning model, named the Global Knowledge Relation Optimization Model (GKROM). Specifically, we capture the impact of different knowledge relation types on document and concept semantic representations separately, and then integrate the document and concept semantic representations. We further introduce multi-objective learning to optimize the knowledge relation network from a global perspective. Through this optimization, GKROM learns richer semantic representations for concepts and documents, improving the accuracy of concept prerequisite relation learning. Extensive experiments on public datasets demonstrate the effectiveness of GKROM, which achieves state-of-the-art performance in concept prerequisite relation learning.

IJCAI Conference 2025 Conference Paper

MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models

  • Donghao Zhou
  • Jiancheng Huang
  • Jinbin Bai
  • Jiaze Wang
  • Hao Chen
  • Guangyong Chen
  • Xiaowei Hu
  • Pheng-Ann Heng

Text-to-image diffusion models can generate high-quality images but lack fine-grained control of visual concepts, limiting their creativity. Thus, we introduce component-controllable personalization, a new task that enables users to customize and reconfigure individual components within concepts. This task faces two challenges: semantic pollution, where undesired elements disrupt the target concept, and semantic imbalance, which causes disproportionate learning of the target concept and component. To address these, we design MagicTailor, a framework that uses Dynamic Masked Degradation to adaptively perturb unwanted visual semantics and Dual-Stream Balancing for more balanced learning of desired visual semantics. The experimental results show that MagicTailor achieves superior performance in this task and enables more personalized and creative image generation.

AAAI Conference 2025 Conference Paper

MM-Tracker: Motion Mamba for UAV-platform Multiple Object Tracking

  • Mufeng Yao
  • Jinlong Peng
  • Qingdong He
  • Bo Peng
  • Hao Chen
  • Mingmin Chi
  • Chao Liu
  • Jon Atli Benediktsson

Multiple object tracking (MOT) from unmanned aerial vehicle (UAV) platforms requires efficient motion modeling, because UAV-MOT faces both local object motion and global camera motion. Motion blur also increases the difficulty of detecting large moving objects. Previous UAV motion modeling approaches either focus only on local motion or ignore motion blurring effects, thus limiting their tracking performance and speed. To address these issues, we propose the Motion Mamba Module, which explores both local and global motion features through cross-correlation and bi-directional Mamba modules for better motion modeling. To address the detection difficulties caused by motion blur, we also design a motion margin loss to effectively improve the detection accuracy of motion-blurred objects. Based on the Motion Mamba Module and motion margin loss, our proposed MM-Tracker surpasses the state-of-the-art on two widely used open-source UAV-MOT datasets.

NeurIPS Conference 2025 Conference Paper

Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration

  • Hao Zhong
  • Muzhi Zhu
  • Zongze Du
  • Zheng Huang
  • Canyu Zhao
  • Mingyu Liu
  • Wen Wang
  • Hao Chen

Long-horizon video-audio reasoning and fine-grained pixel understanding impose conflicting requirements on omnimodal models: dense temporal coverage demands many low-resolution frames, whereas precise grounding calls for high-resolution inputs. We tackle this trade-off with a two-system architecture: a Global Reasoning System selects informative keyframes and rewrites the task at low spatial cost, while a Detail Understanding System performs pixel-level grounding on the selected high-resolution snippets. Because "optimal" keyframe selection and reformulation are ambiguous and hard to supervise, we formulate them as a reinforcement-learning (RL) problem and present Omni-R1, an end-to-end RL framework built on Group Relative Policy Optimization. Omni-R1 trains the Global Reasoning System through hierarchical rewards obtained via online collaboration with the Detail Understanding System, requiring only one epoch of RL on small task splits. Experiments on two challenging benchmarks, Referring Audio-Visual Segmentation (RefAVS) and Reasoning Video Object Segmentation (REVOS), show that Omni-R1 not only surpasses strong supervised baselines but also outperforms specialized state-of-the-art models, while substantially improving out-of-domain generalization and mitigating multimodal hallucination. Our results demonstrate the first successful application of RL to large-scale omnimodal reasoning and highlight a scalable path toward universal foundation models.

NeurIPS Conference 2025 Conference Paper

On Fairness of Unified Multimodal Large Language Model for Image Generation

  • Ming Liu
  • Hao Chen
  • Jindong Wang
  • Liwen Wang
  • Bhiksha Raj
  • Wensheng Zhang

Unified multimodal large language models (U-MLLMs) have demonstrated impressive performance in end-to-end visual understanding and generation tasks. However, compared to generation-only systems (e.g., Stable Diffusion), the unified architecture of U-MLLMs introduces new risks of propagating demographic stereotypes. In this paper, we benchmark several state-of-the-art U-MLLMs and show that they exhibit significant gender and race biases in the generated outputs. To diagnose the source of these biases, we propose a locate-then-fix framework: we first audit the vision and language components, using techniques such as linear probing and controlled generation, and find that the language model appears to be a primary origin of the observed generative bias. Moreover, we observe a "partial alignment" phenomenon, where the U-MLLMs exhibit less bias in understanding tasks yet produce substantially biased images. To address this, we introduce a novel \emph{balanced preference loss} that enforces uniform generation probabilities across demographics by leveraging a synthetically balanced dataset. Extensive experiments show that our approach significantly reduces demographic bias while preserving semantic fidelity and image quality. Our findings underscore the need for targeted debiasing strategies in unified multimodal systems and introduce a practical approach to mitigate biases.
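
The abstract does not spell out the loss; one plausible reading of "enforces uniform generation probabilities across demographics" is a divergence penalty between the model's average demographic distribution and the uniform distribution. The sketch below reflects only that reading, with a hypothetical attribute-classifier input, and is not the paper's exact formulation.

```python
# Hedged sketch of a balance penalty: push the average distribution over
# demographic groups (e.g., from an attribute classifier applied to generated
# images) toward uniform. An illustrative reading, not the paper's loss.
import torch
import torch.nn.functional as F

def balance_penalty(group_logits):
    """group_logits: (batch, n_groups) classifier scores per generated image."""
    probs = F.softmax(group_logits, dim=-1).mean(dim=0)   # batch-average groups
    uniform = torch.full_like(probs, 1.0 / probs.numel())
    # KL(avg distribution || uniform) is zero exactly when outputs are balanced
    return F.kl_div(uniform.log(), probs, reduction="sum")
```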

JBHI Journal 2025 Journal Article

Online Self-Distillation and Self-Modeling for 3D Brain Tumor Segmentation

  • Yan Pang
  • Yunhao Li
  • Teng Huang
  • Jiaming Liang
  • Zhen Wang
  • Changyu Dong
  • Dongyang Kuang
  • Ying Hu

In the specialized domain of brain tumor segmentation, supervised segmentation approaches are hindered by the limited availability of high-quality labeled data, a condition arising from data privacy concerns, significant costs, and ethical issues. In response to this challenge, this paper presents a training framework that adeptly integrates a plug-and-play component, MOD, into current supervised learning models, boosting their efficacy in scenarios with limited data. The MOD consists of an Online Tokenizer and a Dense Predictor, which employ self-distillation and self-modeling on masked patches, promoting swift convergence and efficient representation learning. During the inference phase, the plug-and-play MOD component is excluded, preserving the computational efficiency of the original model without incurring extra processing costs. We substantiated the value of our approach through experiments on leading 3D brain tumor segmentation baselines. Remarkably, models augmented with the MOD consistently showcased superior results, achieving improved Dice coefficients and HD95 scores on two datasets: BraTS 2021 and MSD 2019 Task-01 Brain Tumor.

NeurIPS Conference 2025 Conference Paper

ParamMute: Suppressing Knowledge-Critical FFNs for Faithful Retrieval-Augmented Generation

  • Pengcheng Huang
  • Zhenghao Liu
  • Yukun Yan
  • Haiyan Zhao
  • Xiaoyuan Yi
  • Hao Chen
  • Zhiyuan Liu
  • Maosong Sun

Large language models (LLMs) integrated with retrieval-augmented generation (RAG) have improved factuality by grounding outputs in external evidence. However, they remain susceptible to unfaithful generation, where outputs contradict retrieved context despite its relevance and accuracy. Existing approaches aiming to improve faithfulness primarily focus on enhancing the utilization of external context, but often overlook the persistent influence of internal parametric knowledge during generation. In this work, we investigate the internal mechanisms behind unfaithful generation and identify a subset of mid-to-deep feed-forward networks (FFNs) that are disproportionately activated in such cases. Building on this insight, we propose Parametric Knowledge Muting through FFN Suppression (ParamMute), a framework that improves contextual faithfulness by suppressing the activation of unfaithfulness-associated FFNs and calibrating the model toward retrieved knowledge. To evaluate our approach, we introduce CoFaithfulQA, a benchmark specifically designed to evaluate faithfulness in scenarios where internal knowledge conflicts with accurate external evidence. Experimental results show that ParamMute significantly enhances faithfulness across both CoFaithfulQA and the established ConFiQA benchmark, achieving substantial reductions in reliance on parametric memory. These findings underscore the importance of mitigating internal knowledge dominance and provide a new direction for improving LLM trustworthiness in RAG. All code is available at https://github.com/OpenBMB/ParamMute.
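
The suppression mechanism itself is detailed in the paper and repository; as a generic illustration of down-weighting selected FFN blocks at inference time, a PyTorch forward-hook sketch could look like the following. The layer indices, module path, and scale factor are placeholders, not the authors' settings.

```python
# Generic sketch of FFN output suppression via PyTorch forward hooks. The
# module path assumes a LLaMA-style decoder; indices and scale are hypothetical.
def suppress_ffns(model, layer_indices, scale=0.2):
    """Attenuate the output of selected FFN (MLP) blocks during generation."""
    def make_hook(s):
        def hook(module, inputs, output):
            return output * s           # returning a value replaces the output
        return hook
    handles = [model.model.layers[i].mlp.register_forward_hook(make_hook(scale))
               for i in layer_indices]
    return handles                      # call h.remove() on each to restore
```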

ICLR Conference 2025 Conference Paper

RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards

  • Xinze Li
  • Sen Mei
  • Zhenghao Liu 0001
  • Yukun Yan
  • Shuo Wang 0013
  • Shi Yu 0001
  • Zheni Zeng
  • Hao Chen

Retrieval-Augmented Generation (RAG) has proven its effectiveness in mitigating hallucinations in Large Language Models (LLMs) by retrieving knowledge from external resources. To adapt LLMs for the RAG systems, current approaches use instruction tuning to optimize LLMs, improving their ability to utilize retrieved knowledge. This supervised fine-tuning (SFT) approach focuses on equipping LLMs to handle diverse RAG tasks using different instructions. However, it trains RAG modules to overfit training signals and overlooks the varying data preferences among agents within the RAG system. In this paper, we propose a Differentiable Data Rewards (DDR) method, which end-to-end trains RAG systems by aligning data preferences between different RAG modules. DDR works by collecting the rewards to optimize each agent in the RAG system with the rollout method, which prompts agents to sample some potential responses as perturbations, evaluates the impact of these perturbations on the whole RAG system, and subsequently optimizes the agent to produce outputs that improve the performance of the RAG system. Our experiments on various knowledge-intensive tasks demonstrate that DDR significantly outperforms the SFT method, particularly for LLMs with smaller-scale parameters that depend more on the retrieved knowledge. Additionally, DDR exhibits a stronger capability to align the data preference between RAG modules. The DDR method makes the generation module more effective in extracting key information from documents and mitigating conflicts between parametric memory and external knowledge. All codes are available at https://github.com/OpenMatch/RAG-DDR.
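
As a toy illustration of the rollout idea described above (sample candidate outputs from a module, score them by downstream system performance, and keep a preference pair for optimization), consider the sketch below; `sample_fn` and `reward_fn` are stand-ins, not the paper's components.

```python
# Toy rollout step in the spirit of DDR: sample candidates, rank them by an
# end-to-end reward, and return a (chosen, rejected) pair for preference
# optimization. Both callables are placeholders.
import random

def rollout_pair(sample_fn, reward_fn, query, n_samples=4):
    candidates = [sample_fn(query) for _ in range(n_samples)]
    ranked = sorted(candidates, key=reward_fn, reverse=True)
    return ranked[0], ranked[-1]        # best and worst under the system reward

chosen, rejected = rollout_pair(
    sample_fn=lambda q: q + random.choice([" A", " B", " C"]),
    reward_fn=len,                      # stand-in for downstream RAG accuracy
    query="draft:")
```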

ICML Conference 2025 Conference Paper

Reinforced Lifelong Editing for Language Models

  • Zherui Li 0001
  • Houcheng Jiang
  • Hao Chen
  • Baolong Bi
  • Zhenhong Zhou
  • Fei Sun 0001
  • Junfeng Fang
  • Xiang Wang 0010

Large language models (LLMs) acquire information from pre-training corpora, but their stored knowledge can become inaccurate or outdated over time. Model editing addresses this challenge by modifying model parameters without retraining, and prevalent approaches leverage hypernetworks to generate these parameter updates. However, they face significant challenges in lifelong editing due to their incompatibility with LLM parameters that dynamically change during the editing process. To address this, we observe that hypernetwork-based lifelong editing aligns with reinforcement learning modeling and propose RLEdit, an RL-based editing method. By treating editing losses as rewards and optimizing hypernetwork parameters at the full knowledge sequence level, we enable it to precisely capture LLM changes and generate appropriate parameter updates. Our extensive empirical evaluation across several LLMs demonstrates that RLEdit outperforms existing methods in lifelong editing with superior effectiveness and efficiency, achieving a 59.24% improvement while requiring only 2.11% of the time compared to most approaches.

NeurIPS Conference 2025 Conference Paper

Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling

  • Hao Chen
  • Guanxi Lu
  • Yasuyuki Okoshi
  • Zhiwen Mo
  • Masato Motomura
  • Hongxiang Fan

Test-time scaling (TTS) has proven effective in enhancing the reasoning capabilities of large language models (LLMs). Verification plays a key role in TTS, simultaneously influencing (1) reasoning performance and (2) compute efficiency, due to the quality and computational cost of verification. In this work, we challenge the conventional paradigms of verification and make the first attempt toward systematically investigating the impact of verification granularity, that is, how frequently the verifier is invoked during generation, beyond verifying only the final output or individual generation steps. To this end, we introduce Variable Granularity Search (VG-Search), a unified algorithm that generalizes beam search and Best-of-N sampling via a tunable granularity parameter $g$. Extensive experiments with VG-Search under varying compute budgets, generator-verifier configurations, and task attributes reveal that dynamically selecting $g$ can improve compute efficiency and scaling behavior. Building on these findings, we propose adaptive VG-Search strategies that achieve accuracy gains of up to 3.1% over Beam Search and 3.6% over Best-of-N, while reducing FLOPs by over 52%. We will open-source the code to support future research.
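
To make the granularity parameter concrete: verifying every $g$ steps interpolates between step-level beam search ($g = 1$) and Best-of-N ($g \ge$ the number of generation steps). The sketch below illustrates that interpolation with toy generator and verifier callables; it is our reading of the abstract, not the released VG-Search implementation.

```python
# Toy sketch of variable-granularity search: extend n drafts step by step and
# invoke the verifier only every g steps to prune to the top k and re-fan.
# g=1 resembles step-level beam search; g >= max_steps degenerates to Best-of-N.
import random

def vg_search(prompt, step_fn, verify_fn, n=8, k=4, g=2, max_steps=6):
    drafts = [prompt] * n
    for step in range(1, max_steps + 1):
        drafts = [step_fn(d) for d in drafts]          # extend every draft
        if step % g == 0 and step < max_steps:
            best = sorted(drafts, key=verify_fn, reverse=True)[:k]
            drafts = best * (n // k)                   # prune, then re-fan out
    return max(drafts, key=verify_fn)                  # final verification

answer = vg_search("Q:", lambda d: d + random.choice(" abc"),
                   lambda d: len(set(d)))              # dummy generator/verifier
```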

JBHI Journal 2025 Journal Article

Revisiting Drug Recommendation From a Causal Perspective

  • Junjie Zhang
  • Xuan Zang
  • Hao Chen
  • Xiaowei Yan
  • Buzhou Tang

Drug recommendation, which aims to provide a prescription for a patient, is an essential task in healthcare. Drug molecular graphs provide valuable support for drug recommendation. Existing methods tend to overlook drugs' molecular graphs or use the core substructures of molecular graphs with a rule-based segmentation strategy. However, such methods have several limitations: (1) The rule-based segmentation strategy is inflexible and sub-optimal for extremely complex scenarios. (2) The derived core substructures consider only the drug's chemical characteristics and ignore the patient's health condition. (3) The spurious correlation brought by trivial substructures is disregarded. To address these limitations, we design a novel drug recommendation method from a causal perspective, in which a conditional causal representation learner for drug recommendation is proposed. Specifically, we first separate the drug molecular representation into causal and spurious parts depending on various patients' health conditions. Then, we eliminate the spurious correlation caused by the spurious part with causal intervention. Extensive experiments on the MIMIC-III and MIMIC-IV datasets demonstrate that our approach achieves new state-of-the-art performance (e.g., a 6.68% Jaccard improvement on MIMIC-III with p-value $\ll$ 0.05).

NeurIPS Conference 2025 Conference Paper

Revisiting End-to-End Learning with Slide-level Supervision in Computational Pathology

  • Wenhao Tang
  • Rong Qin
  • Heng Fang
  • Fengtao Zhou
  • Hao Chen
  • Xiang Li
  • Ming-Ming Cheng

Pre-trained encoders for offline feature extraction followed by multiple instance learning (MIL) aggregators have become the dominant paradigm in computational pathology (CPath), benefiting cancer diagnosis and prognosis. However, performance limitations arise from the absence of encoder fine-tuning for downstream tasks and disjoint optimization with MIL. While slide-level supervised end-to-end (E2E) learning is an intuitive solution to this issue, it faces challenges such as high computational demands and suboptimal results. These limitations motivate us to revisit E2E learning. We argue that prior work neglects inherent E2E optimization challenges, leading to performance disparities compared to traditional two-stage methods. In this paper, we pioneer the elucidation of the optimization challenge caused by sparse-attention MIL and propose a novel MIL method called ABMILX. ABMILX mitigates this problem through global correlation-based attention refinement and multi-head mechanisms. With an efficient multi-scale random patch sampling strategy, an E2E-trained ResNet with ABMILX surpasses SOTA foundation models under the two-stage paradigm across multiple challenging benchmarks, while remaining computationally efficient (<10 RTX 3090 GPU hours). We demonstrate the potential of E2E learning in CPath and call for greater research focus in this area. The code is available at https://github.com/DearCaat/E2E-WSI-ABMILX.

IROS Conference 2025 Conference Paper

Robotic In Situ Measurement of Multiple Intracellular Physical Parameters Based on Three-micropipettes System

  • Mengya Liu
  • Jinyu Qiu
  • Shaojie Fu
  • Ruimin Li
  • Yuzhu Liu
  • Hao Chen
  • Xin Zhao 0010
  • Qili Zhao

Physical parameters of the intracellular environment, such as mass density, intracellular pressure, and elasticity, have significant effects on the physiological activities of the cell and on intracellular operation results. However, the significantly different measurement principles of these parameters make their in situ measurement for the same cell a challenging task, which greatly limits the study of their comprehensive regulation mechanisms. For the first time, this paper proposes a robotic in situ measurement system for multiple intracellular physical parameters, based on a self-developed three-micropipette system. Using this system, the mass density, elasticity, and intracellular pressure of the same cell are measured automatically in sequence, according to a robotic in situ measurement process. Experimental results on sheep oocytes demonstrate an 83.3% measurement success rate at an average speed of 97.75 s/cell. The measured values of the three parameters are close to the individually reported results in the literature, while requiring significantly less operation time than those individual measurements combined. Our system lays a solid foundation for future research on the comprehensive regulation mechanisms of these parameters in cell physiological activities and intracellular operation results.

NeurIPS Conference 2025 Conference Paper

Role-aware Multi-agent Reinforcement Learning for Coordinated Emergency Traffic Control

  • Ming Cheng
  • Hao Chen
  • Zhiqing Li
  • Jia Wang
  • Senzhang Wang

Emergency traffic control presents an increasingly critical challenge, requiring seamless coordination among emergency vehicles, regular vehicles, and traffic lights to ensure efficient passage for all vehicles. Existing models focus primarily on traffic light control, leaving emergency and regular vehicles prone to delay due to the lack of navigation strategies. To address this issue, we propose the Role-aware Multi-agent Traffic Control (RMTC) framework, which dynamically assigns appropriate roles to traffic components for better cooperation by considering their relations with emergency vehicles and adaptively adjusting their policies. Specifically, RMTC introduces a Heterogeneous Temporal Traffic Graph (HTTG) to model the spatial and temporal relationships among all traffic components (traffic lights, regular and emergency vehicles) at each time step. Furthermore, we develop a Dynamic Role Learning model to infer the evolving roles of traffic lights and regular vehicles based on the HTTG. Finally, we present a Role-aware Multi-agent Reinforcement Learning approach that learns traffic policies conditioned on the dynamically inferred roles. Extensive experiments across four public traffic scenarios show that RMTC outperforms existing traffic light control methods by significantly reducing emergency vehicle travel time, while effectively preserving traffic efficiency for regular vehicles. The code is released at https://github.com/mingchenghexi/RMTC.

ICML Conference 2025 Conference Paper

SDP-CROWN: Efficient Bound Propagation for Neural Network Verification with Tightness of Semidefinite Programming

  • Hong-Ming Chiu
  • Hao Chen
  • Huan Zhang
  • Richard Y. Zhang 0001

Neural network verifiers based on linear bound propagation scale impressively to massive models but can be surprisingly loose when neuron coupling is crucial. Conversely, semidefinite programming (SDP) verifiers capture inter-neuron coupling naturally, but their cubic complexity restricts them to only small models. In this paper, we propose SDP-CROWN, a novel hybrid verification framework that combines the tightness of SDP relaxations with the scalability of bound-propagation verifiers. At the core of SDP-CROWN is a new linear bound, derived via SDP principles, that explicitly captures $\ell_{2}$-norm-based inter-neuron coupling while adding only one extra parameter per layer. This bound can be integrated seamlessly into any linear bound-propagation pipeline, preserving the inherent scalability of such methods yet significantly improving tightness. In theory, we prove that our inter-neuron bound can be up to a factor of $\sqrt{n}$ tighter than traditional per-neuron bounds. In practice, when incorporated into the state-of-the-art $\alpha$-CROWN verifier, we observe markedly improved verification performance on large models with up to 65 thousand neurons and 2.47 million parameters, achieving tightness that approaches that of costly SDP-based methods.
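
A worked illustration of where a $\sqrt{n}$ factor can come from (our notation and a standard norm inequality, not the paper's derivation): over an $\ell_2$ ball, coupling the coordinates tightens a linear bound from an $\ell_1$-norm term to an $\ell_2$-norm term.

```latex
% For a linear function a^T x over the l2 ball of radius rho around x_0:
\[
  \max_{\|x - x_0\|_2 \le \rho} a^\top x \;=\; a^\top x_0 + \rho\,\|a\|_2 ,
\]
% whereas treating each coordinate independently (per-neuron intervals
% |x_i - x_{0,i}| <= rho) only yields the looser box bound
\[
  a^\top x_0 + \rho\,\|a\|_1 ,
  \qquad \|a\|_1 \le \sqrt{n}\,\|a\|_2 ,
\]
% so the coupled bound can be tighter by up to a factor of sqrt(n).
```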

IJCAI Conference 2025 Conference Paper

Seeing the Unseen: Composing Outliers for Compositional Zero-Shot Learning

  • Chenchen Jing
  • Mingyu Liu
  • Hao Chen
  • Yuling Xi
  • Xingyuan Bu
  • Dong Gong
  • Chunhua Shen

Compositional zero-shot learning (CZSL) aims to recognize unseen attribute-object compositions by learning from seen compositions. The distribution shift between unseen and seen compositions poses challenges to CZSL models, especially when test images mix both seen and unseen compositions. The challenge is addressed more easily if a model can distinguish unseen from seen compositions and treat them with specific recognition strategies. However, identifying images with unseen compositions is non-trivial, considering that unseen compositions are absent in training and usually differ only subtly from seen compositions. In this paper, we propose a novel compositional zero-shot learning method called COMO, which composes outliers during training to distinguish seen from unseen compositions and then applies specific strategies to each. Specifically, we compose attribute-object representations for unseen compositions from the primitive representations of training images, using them as outliers that enable the model to identify unseen compositions at inference. At test time, the method distinguishes images containing seen/unseen compositions and uses different weights for composition classification and primitive classification to recognize them. Experimental results on three datasets show the effectiveness of our method in both the closed-world and open-world settings.

ICML Conference 2025 Conference Paper

Self-cross Feature based Spiking Neural Networks for Efficient Few-shot Learning

  • Qi Xu 0008
  • Junyang Zhu
  • Dongdong Zhou
  • Hao Chen
  • Yang Liu
  • Jiangrong Shen
  • Qiang Zhang 0008

Deep neural networks (DNNs) excel in computer vision tasks, especially few-shot learning (FSL), which is increasingly important for generalizing from limited examples. However, DNNs are computationally expensive and face scalability issues in real-world settings. Spiking Neural Networks (SNNs), with their event-driven nature and low energy consumption, are particularly efficient at processing sparse and dynamic data, though they still encounter difficulties in capturing complex spatiotemporal features and performing accurate cross-class comparisons. To further enhance the performance and efficiency of SNNs in few-shot learning, we propose a few-shot learning framework based on SNNs, which combines a self-feature extractor module and a cross-feature contrastive module to refine feature representations and reduce power consumption. We apply a combination of temporal efficient training loss and InfoNCE loss to optimize the temporal dynamics of spike trains and enhance discriminative power. Experimental results show that the proposed FSL-SNN significantly improves classification performance on the neuromorphic dataset N-Omniglot, and achieves performance competitive with ANNs on static datasets such as CUB and miniImageNet with low power consumption.

NeurIPS Conference 2025 Conference Paper

SwitchLingua: The First Large-Scale Multilingual and Multi-Ethnic Code-Switching Dataset

  • Peng Xie
  • Xingyuan Liu
  • Yequan Bie
  • Tsz Wai Chan
  • Yangqiu Song
  • Yang Wang
  • Hao Chen
  • Kani Chen

Code-switching (CS) is the alternating use of two or more languages within a conversation or utterance, often influenced by social context and speaker identity. This linguistic phenomenon poses challenges for Automatic Speech Recognition (ASR) systems, which are typically designed for a single language and struggle to handle multilingual inputs. The growing global demand for multilingual applications, including Code-Switching ASR (CSASR), Text-to-Speech (TTS), and Cross-Lingual Information Retrieval (CLIR), highlights the inadequacy of existing monolingual datasets. Although some code-switching datasets exist, most are limited to bilingual mixing within homogeneous ethnic groups, leaving a critical need for a large-scale, diverse benchmark akin to ImageNet in computer vision. To bridge this gap, we introduce \textbf{LinguaMaster}, a multi-agent collaboration framework specifically designed for efficient and scalable multilingual data synthesis. Leveraging this framework, we curate \textbf{SwitchLingua}, the first large-scale multilingual and multi-ethnic code-switching dataset, including: (1) 420K CS textual samples across 12 languages, and (2) over 80 hours of audio recordings from 174 speakers representing 18 countries/regions and 63 racial/ethnic backgrounds, based on the textual data. This dataset captures rich linguistic and cultural diversity, offering a foundational resource for advancing multilingual and multicultural research. Furthermore, to address the issue that existing ASR evaluation metrics lack sensitivity to code-switching scenarios, we propose the \textbf{Semantic-Aware Error Rate (SAER)}, a novel evaluation metric that incorporates semantic information, providing a more accurate and context-aware assessment of system performance. Benchmark experiments on SwitchLingua with state-of-the-art ASR models reveal substantial performance gaps, underscoring the dataset’s utility as a rigorous benchmark for CS capability evaluation. In addition, SwitchLingua aims to encourage further research to promote cultural inclusivity and linguistic diversity in speech technology, fostering equitable progress in the ASR field. LinguaMaster (Code): github.com/Shelton1013/SwitchLingua, SwitchLingua (Data): https://huggingface.co/datasets/Shelton1013/SwitchLingua text, https://huggingface.co/datasets/Shelton1013/SwitchLingua audio

AAAI Conference 2025 Conference Paper

TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings

  • Dawei Yan
  • Pengcheng Li
  • Yang Li
  • Hao Chen
  • Qingguo Chen
  • Weihua Luo
  • Wei Dong
  • Qingsen Yan

Currently, inspired by the success of vision-language models (VLMs), an increasing number of researchers are focusing on improving VLMs and have achieved promising results. However, most existing methods concentrate on optimizing the connector and enhancing the language model component, while neglecting improvements to the vision encoder itself. In contrast, we propose Text Guided LLaVA (TG-LLaVA) in this paper, which optimizes VLMs by guiding the vision encoder with text, offering a new and orthogonal optimization direction. Specifically, inspired by the purpose-driven logic inherent in human behavior, we use learnable latent embeddings as a bridge to analyze the textual instruction and add the analysis results to the vision encoder as guidance, refining it. Subsequently, another set of latent embeddings extracts additional detailed text-guided information from high-resolution local patches as auxiliary information. Finally, with the guidance of text, the vision encoder can extract text-related features, similar to how humans focus on the most relevant parts of an image when considering a question. This results in better answers. Experiments on various datasets validate the effectiveness of the proposed method. Remarkably, without the need for additional training data, our proposed method brings more benefits to the baseline (LLaVA-1.5) than other concurrent methods. Furthermore, the proposed method consistently brings improvement in different settings.

AAAI Conference 2025 Conference Paper

Time Series Supplier Allocation via Deep Black-Litterman Model

  • Xinke Jiang
  • Wentao Zhang
  • Yuchen Fang
  • Xiaowei Gao
  • Hao Chen
  • Haoyu Zhang
  • Dingyi Zhuang
  • Jiayuan Luo

As a typical problem of Spatiotemporal Resource Management, Time Series Supplier Allocation (TSSA) poses a complex NP-hard challenge, aimed at refining future order dispatching strategies to satisfy the trade-off between demands and maximum supply. The Black-Litterman (BL) model, which originates from financial portfolio management, offers a new perspective on TSSA by balancing expected returns against insufficient supply risks. However, the BL model is not only constrained by manually constructed perspective matrices and spatio-temporal market dynamics, but also restricted by the absence of supervisory signals and unreliable supplier data. To address these limitations, we introduce the pioneering Deep Black-Litterman Model (DBLM) for TSSA, which innovatively adapts the BL model from the financial domain to the supply chain context. Specifically, DBLM leverages Spatio-Temporal Graph Neural Networks (STGNNs) to capture spatio-temporal dependencies and automatically generate future perspective matrices. Moreover, a novel Spearman rank correlation is designed as the DBLM supervision signal to navigate the complex risks and interactions among suppliers. Finally, DBLM uses a masking mechanism to counteract the bias of unreliable data, thus improving precision and reliability. Extensive experiments on two datasets demonstrate significant improvements of DBLM on TSSA.

NeurIPS Conference 2025 Conference Paper

Toward Engineering AGI: Benchmarking the Engineering Design Capabilities of LLMs

  • Xingang Guo
  • Yaxin Li
  • XiangYi Kong
  • YILAN JIANG
  • Xiayu Zhao
  • Zhihua Gong
  • Yufan Zhang
  • Daixuan Li

Modern engineering, spanning electrical, mechanical, aerospace, civil, and computer disciplines, stands as a cornerstone of human civilization and the foundation of our society. However, engineering design poses a fundamentally different challenge for large language models (LLMs) compared with traditional textbook-style problem solving or factual question answering. Although existing benchmarks have driven progress in areas such as language understanding, code synthesis, and scientific problem solving, real-world engineering design demands the synthesis of domain knowledge, navigation of complex trade-offs, and management of the tedious processes that consume much of practicing engineers' time. Despite these shared challenges across engineering disciplines, no benchmark currently captures the unique demands of engineering design work. In this work, we introduce EngDesign, an Engineering Design benchmark that evaluates LLMs' abilities to perform practical design tasks across nine engineering domains. Unlike existing benchmarks that focus on factual recall or question answering, EngDesign uniquely emphasizes LLMs' ability to synthesize domain knowledge, reason under constraints, and generate functional, objective-oriented engineering designs. Each task in EngDesign represents a real-world engineering design problem, accompanied by a detailed task description specifying design goals, constraints, and performance requirements. EngDesign pioneers a simulation-based evaluation paradigm that moves beyond textbook knowledge to assess genuine engineering design capabilities and shifts evaluation from static answer checking to dynamic, simulation-driven functional verification, marking a crucial step toward realizing the vision of engineering Artificial General Intelligence (AGI).

AAAI Conference 2025 Conference Paper

Towards Loss-Resilient Image Coding for Unstable Satellite Networks

  • Hongwei Sha
  • Muchen Dong
  • Quanyou Luo
  • Ming Lu
  • Hao Chen
  • Zhan Ma

Geostationary Earth Orbit (GEO) satellite communication demonstrates significant advantages in emergency short burst data services. However, unstable satellite networks, particularly those with frequent packet loss, present a severe challenge to accurate image transmission. To address this, we propose a loss-resilient image coding approach that leverages end-to-end optimization in learned image compression (LIC). Our method builds on the channel-wise progressive coding framework, incorporating Spatial-Channel Rearrangement (SCR) on the encoder side and Mask Conditional Aggregation (MCA) on the decoder side to improve reconstruction quality under unpredictable errors. By integrating the Gilbert-Elliott model into the training process, we enhance the model's ability to generalize to real-world network conditions. Extensive evaluations show that our approach outperforms traditional and deep learning-based methods in compression performance and stability under diverse packet loss, offering robust and efficient progressive transmission even in challenging environments.
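
For context, the Gilbert-Elliott channel is a standard two-state Markov model that alternates between a low-loss Good state and a high-loss Bad state, producing the bursty packet loss typical of satellite links. A minimal simulator is sketched below; the probabilities are illustrative placeholders, not the paper's training settings.

```python
# Two-state Gilbert-Elliott packet-loss simulator. Transition and per-state
# loss probabilities are illustrative placeholders.
import random

def gilbert_elliott(n_packets, p_g2b=0.05, p_b2g=0.3,
                    loss_good=0.01, loss_bad=0.5):
    """Return a list of booleans: True where the packet is lost."""
    bad, lost = False, []
    for _ in range(n_packets):
        # Markov transition: leave Bad with prob p_b2g, enter it with p_g2b
        bad = (random.random() >= p_b2g) if bad else (random.random() < p_g2b)
        lost.append(random.random() < (loss_bad if bad else loss_good))
    return lost

mask = gilbert_elliott(1000)   # e.g., drop coded chunks where mask[i] is True
```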

NeurIPS Conference 2025 Conference Paper

Unleashing Hour-Scale Video Training for Long Video-Language Understanding

  • Jingyang Lin
  • Jialian Wu
  • Ximeng Sun
  • Ze Wang
  • Jiang Liu
  • Yusheng Su
  • Xiaodong Yu
  • Hao Chen

Recent long-form video-language understanding benchmarks have driven progress in video large multimodal models (Video-LMMs). However, the scarcity of well-annotated long videos has left the training of hour-long Video-LMMs underexplored. To close this gap, we present VideoMarathon, a large-scale hour-long video instruction-following dataset. This dataset includes around 9,700 hours of long videos sourced from diverse domains, ranging from 3 to 60 minutes per video. Specifically, it contains 3.3M high-quality QA pairs, spanning six fundamental topics: temporality, spatiality, object, action, scene, and event. Compared to existing video instruction datasets, VideoMarathon significantly extends training video durations up to 1 hour, and supports 22 diverse tasks requiring both short- and long-term video comprehension. Building on VideoMarathon, we propose Hour-LLaVA, a powerful and efficient Video-LMM for hour-scale video-language modeling. It enables hour-long video training and inference at 1-FPS sampling by leveraging a memory augmentation module, which adaptively integrates question-relevant and spatiotemporally informative semantics from the cached full video context. In our experiments, Hour-LLaVA achieves the best performance on multiple representative long video-language benchmarks, demonstrating the high quality of the VideoMarathon dataset and the superiority of the Hour-LLaVA model.

JBHI Journal 2025 Journal Article

Unpaired Optical Coherence Tomography Angiography Image Super-Resolution via Frequency-Aware Inverse-Consistency GAN

  • Weiwen Zhang
  • Dawei Yang
  • Haoxuan Che
  • An Ran Ran
  • Carol Y. Cheung
  • Hao Chen

For optical coherence tomography angiography (OCTA) images, the limited scanning rate leads to a trade-off between field-of-view (FOV) and imaging resolution. Although larger FOV images may reveal more parafoveal vascular lesions, their application is hampered by lower resolution. To increase the resolution, previous works achieved satisfactory performance only by using paired data for training, but real-world applications are limited by the challenge of collecting large-scale paired images. Thus, an unpaired approach is highly demanded. Generative Adversarial Networks (GANs) have been commonly used in the unpaired setting, but they may struggle to accurately preserve fine-grained capillary details, which are critical biomarkers for OCTA. In this paper, our approach aspires to preserve these details by leveraging frequency information, which represents details as high frequencies ($hf$) and coarse-grained features as low frequencies ($lf$). We propose a GAN-based unpaired super-resolution method for OCTA images that places particular emphasis on the $hf$ fine capillaries through a dual-path generator. To facilitate a precise spectrum of the reconstructed image, we also propose a frequency-aware adversarial loss for the discriminator and introduce a frequency-aware focal consistency loss for end-to-end optimization. We collected a paired dataset for evaluation and show that our method outperforms other state-of-the-art unpaired methods both quantitatively and visually.

NeurIPS Conference 2025 Conference Paper

Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations

  • Jiaming Han
  • Hao Chen
  • Yang Zhao
  • Hanyu Wang
  • Qi Zhao
  • Ziyan Yang
  • Hao He
  • Xiangyu Yue

This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary. By integrating vision and text into a unified space with an expanded vocabulary, our multimodal LLM, Tar, enables cross-modal input and output through a shared interface, without the need for modality-specific designs. Additionally, we propose scale-adaptive encoding and decoding to balance efficiency and visual detail, along with a generative de-tokenizer to produce high-fidelity visual outputs. To address diverse decoding needs, we utilize two complementary de-tokenizers: a fast autoregressive model and a diffusion-based model. To enhance modality fusion, we investigate advanced pre-training tasks, demonstrating improvements in both visual understanding and generation. Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency. All code, models, and data will be made publicly available.

AAAI Conference 2024 Conference Paper

A Dynamic GCN with Cross-Representation Distillation for Event-Based Learning

  • Yongjian Deng
  • Hao Chen
  • Youfu Li

Recent advances in event-based research prioritize sparsity and temporal precision, and approaches that learn sparse point-based representations through graph CNNs (GCNs) have become more popular. Yet these graph techniques yield lower performance than their frame-based counterparts due to two issues: (i) biased graph structures that do not properly incorporate the varied attributes (such as semantics, and spatial and temporal signals) of each vertex, resulting in inaccurate graph representations; (ii) a shortage of robust pretrained models. Here we solve the first problem by proposing a new event-based GCN (EDGCN), with a dynamic aggregation module to integrate all attributes of vertices adaptively. To address the second problem, we introduce a novel learning framework called cross-representation distillation (CRD), which leverages the dense representation of events as a cross-representation auxiliary to provide additional supervision and prior knowledge for the event graph. This frame-to-graph distillation allows us to benefit from the large-scale priors provided by CNNs while still retaining the advantages of graph-based models. Extensive experiments show that our model and learning framework are effective and generalize well across multiple vision tasks.

NeurIPS Conference 2024 Conference Paper

A Motion-aware Spatio-temporal Graph for Video Salient Object Ranking

  • Hao Chen
  • Yufei Zhu
  • Yongjian Deng

Video salient object ranking aims to simulate the human attention mechanism by dynamically prioritizing the visual attraction of objects in a scene over time. Despite its numerous practical applications, this area remains underexplored. In this work, we propose a graph model for video salient object ranking. This graph simultaneously explores multi-scale spatial contrasts and intra-/inter-instance temporal correlations across frames to extract diverse spatio-temporal saliency cues. It has two advantages: 1. Unlike previous methods that only perform global inter-frame contrast or compare all proposals across frames globally, we explicitly model the motion of each instance by comparing its features with those in the same spatial region in adjacent frames, thus obtaining more accurate motion saliency cues. 2. We synchronize the spatio-temporal saliency cues in a single graph for joint optimization, which exhibits better dynamics compared to the previous stage-wise methods that prioritize spatial cues followed by temporal cues. Additionally, we propose a simple yet effective video retargeting method based on video saliency ranking. Extensive experiments demonstrate the superiority of our model in video salient object ranking and the effectiveness of the video retargeting method. Our codes/models are released at https://github.com/zyf-815/VSOR/tree/main.

NeurIPS Conference 2024 Conference Paper

A Simple Image Segmentation Framework via In-Context Examples

  • Yang Liu
  • Chenchen Jing
  • Hengtao Li
  • Muzhi Zhu
  • Hao Chen
  • Xinlong Wang
  • Chunhua Shen

Recently, there have been explorations of generalist segmentation models that can effectively tackle a variety of image segmentation tasks within a unified in-context learning framework. However, these methods still struggle with task ambiguity in in-context segmentation, as not all in-context examples can accurately convey the task information. To address this issue, we present SINE, a simple image $\textbf{S}$egmentation framework utilizing $\textbf{in}$-context $\textbf{e}$xamples. Our approach leverages a Transformer encoder-decoder structure, where the encoder provides high-quality image representations and the decoder is designed to yield multiple task-specific output masks to effectively eliminate task ambiguity. Specifically, we introduce an In-context Interaction module to complement in-context information and produce correlations between the target image and the in-context example, as well as a Matching Transformer that uses fixed matching and the Hungarian algorithm to eliminate differences between tasks. In addition, we further refine the current evaluation system for in-context image segmentation, aiming to facilitate a holistic appraisal of these models. Experiments on various segmentation tasks show the effectiveness of the proposed method.
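
Where the abstract mentions Hungarian matching, the underlying operation is a minimum-cost bipartite assignment between predictions and targets; a minimal SciPy example with a toy cost matrix is shown below (the actual costs in SINE would come from mask/feature similarities).

```python
# Minimal bipartite matching with the Hungarian algorithm, as used to pair
# predicted and target masks. The cost matrix here is a toy example.
import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([[0.2, 0.9, 0.5],     # cost[i, j]: dissimilarity between
                 [0.8, 0.1, 0.6],     # prediction i and target j
                 [0.4, 0.7, 0.3]])
rows, cols = linear_sum_assignment(cost)  # minimum-cost one-to-one matching
# Here rows=[0,1,2] is matched to cols=[0,1,2], with total cost 0.6.
```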

TIST Journal 2024 Journal Article

A Survey on Evaluation of Large Language Models

  • Yupeng Chang
  • Xu Wang
  • Jindong Wang
  • Yuan Wu
  • Linyi Yang
  • Kaijie Zhu
  • Hao Chen
  • Xiaoyuan Yi

Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level, but also at the society level for better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. Firstly, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Secondly, we answer the ‘where’ and ‘how’ questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLMs evaluation. Our aim is to offer invaluable insights to researchers in the realm of LLMs evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We consistently maintain the related open-source materials at: https://github.com/MLGroupJLU/LLM-eval-survey

JBHI Journal 2024 Journal Article

Adaptive Fusion of Deep Learning With Statistical Anatomical Knowledge for Robust Patella Segmentation From CT Images

  • Jiachen Zhao
  • Tianshu Jiang
  • Yi Lin
  • Lok-Chun Chan
  • Ping-Keung Chan
  • Chunyi Wen
  • Hao Chen

Knee osteoarthritis (KOA), a leading joint disease, can be assessed by examining the shape of the patella to spot potentially abnormal variations. To assist doctors in the diagnosis of KOA, a robust automatic patella segmentation method is highly demanded in clinical practice. Deep learning methods, especially convolutional neural networks (CNNs), have been widely applied to medical image segmentation in recent years. Nevertheless, poor image quality and limited data still impose challenges on segmentation via CNNs. On the other hand, statistical shape models (SSMs) can generate shape priors that give anatomically reliable segmentation for varying instances. Thus, in this work, we propose an adaptive fusion framework that explicitly combines deep neural networks and anatomical knowledge from SSMs for robust patella segmentation. Our adaptive fusion framework adjusts the weight of each segmentation candidate in the fusion according to its segmentation performance. We also propose a voxel-wise refinement strategy to make the CNN segmentation more anatomically correct. Extensive experiments and thorough assessment have been conducted on various mainstream CNN backbones for patella segmentation in low-data regimes, demonstrating that our framework can be flexibly attached to a CNN model, significantly improving its performance when labeled training data are limited and input images are of poor quality.

IROS Conference 2024 Conference Paper

BronchoCopilot: Towards Autonomous Robotic Bronchoscopy via Multimodal Reinforcement Learning

  • Jianbo Zhao
  • Hao Chen
  • Qingyao Tian
  • Jian Chen 0036
  • Bingyu Yang
  • Zihui Zhang
  • Hongbin Liu 0001

Bronchoscopy plays a significant role in the early diagnosis and treatment of lung diseases. This process demands physicians to maneuver the flexible endoscope to reach distal lesions, particularly requiring substantial expertise when examining the airways of the upper lung lobe. With the development of artificial intelligence and robotics, reinforcement learning (RL) methods have been applied to the manipulation of interventional surgical robots. However, unlike human physicians who utilize multimodal information, most current RL methods rely on a single modality, limiting their performance. In this paper, we propose BronchoCopilot, a multimodal RL agent designed to acquire manipulation skills for autonomous bronchoscopy. BronchoCopilot specifically integrates images from the bronchoscope camera and estimated robot poses, aiming for a higher success rate within challenging airway environments. We employ auxiliary reconstruction tasks to compress multimodal data and utilize attention mechanisms to achieve an efficient latent representation of these data, serving as input for the RL module. This framework adopts a stepwise training and fine-tuning approach to mitigate training difficulty. Our evaluation in a realistic simulation environment reveals that BronchoCopilot, by effectively harnessing multimodal information, attains a success rate of approximately 90% in fifth-generation airways with consistent movements. Additionally, it demonstrates a robust capacity to adapt to diverse cases.

NeurIPS Conference 2024 Conference Paper

Cost-efficient Knowledge-based Question Answering with Large Language Models

  • Junnan Dong
  • Qinggang Zhang
  • Chuang Zhou
  • Hao Chen
  • Daochen Zha
  • Xiao Huang

Knowledge-based question answering (KBQA) is widely used in many scenarios that necessitate domain knowledge. Large language models (LLMs) bring opportunities to KBQA, but their costs are significantly higher and they lack domain-specific knowledge from pre-training. We are motivated to combine LLMs and prior small models on knowledge graphs (KGMs) for both inferential accuracy and cost saving. However, this remains challenging since accuracy and cost are not readily combined as two distinct metrics in the optimization. Model selection is also laborious since different models excel at diverse knowledge. To this end, we propose Coke, a novel cost-efficient strategy for KBQA with LLMs, modeled as a tailored multi-armed bandit problem to minimize calls to LLMs within limited budgets. We first formulate the accuracy expectation with cluster-level Thompson Sampling for either KGMs or LLMs. A context-aware policy is optimized to further distinguish the expert model suited to the question semantics. The overall decision is bounded by the cost regret according to historical expenditure on failures. Extensive experiments showcase the superior performance of Coke, which moves the Pareto frontier with up to 20.89% savings in GPT-4 fees while achieving a 2.74% higher accuracy on the benchmark datasets.
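
As a toy illustration of the bandit view (not Coke's actual cluster-level, context-aware policy or its cost-regret bound), Thompson sampling with per-arm Beta posteriors and a simple cost penalty looks like this; all numbers are placeholders.

```python
# Toy Thompson-sampling routing between a cheap KG model and a costly LLM:
# sample an accuracy estimate from each arm's Beta posterior, penalize it by
# per-call cost, and pick the best arm. Numbers are illustrative only.
import random

arms = {"KGM": {"a": 1, "b": 1, "cost": 0.001},
        "LLM": {"a": 1, "b": 1, "cost": 0.020}}

def select_arm(budget_left):
    def score(name):
        acc = random.betavariate(arms[name]["a"], arms[name]["b"])
        return acc - arms[name]["cost"] / max(budget_left, 1e-9)
    return max(arms, key=score)

def update(name, success):
    arms[name]["a" if success else "b"] += 1   # Beta posterior update
```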

NeurIPS Conference 2024 Conference Paper

Fine Tuning Out-of-Vocabulary Item Recommendation with User Sequence Imagination

  • Ruochen Liu
  • Hao Chen
  • Yuanchen Bei
  • Qijie Shen
  • Fangwei Zhong
  • Senzhang Wang
  • Jianxin Wang

Recommending out-of-vocabulary (OOV) items is a challenging problem since the in-vocabulary (IV) items have well-trained behavioral embeddings but the OOV items only have content features. Current OOV recommendation models often generate "makeshift" embeddings for OOV items from content features and then recommend jointly with the "makeshift" OOV item embeddings and the behavioral IV item embeddings. However, merely using the "makeshift" embeddings results in suboptimal recommendation performance due to the substantial gap between the content features and the behavioral embeddings. To bridge the gap, we propose a novel User Sequence IMagination (USIM) fine-tuning framework, which first imagines the user sequences and then refines the generated OOV embeddings with the user behavioral embeddings. Specifically, we frame user sequence imagination as a reinforcement learning problem and develop a recommendation-focused reward function to evaluate to what extent a user can help recommend the OOV items. Besides, we propose an embedding-driven transition function to model the embedding transition after imagining a user. USIM has been deployed on a prominent e-commerce platform for months, offering recommendations for millions of OOV items and billions of users. Extensive experiments demonstrate that USIM outperforms traditional generative models in OOV item recommendation performance across traditional collaborative filtering and GNN-based collaborative filtering models.

NeurIPS Conference 2024 Conference Paper

FNP: Fourier Neural Processes for Arbitrary-Resolution Data Assimilation

  • Kun Chen
  • Peng Ye
  • Hao Chen
  • Kang Chen
  • Tao Han
  • Wanli Ouyang
  • Tao Chen
  • Lei Bai

Data assimilation is a vital component in modern global medium-range weather forecasting systems to obtain the best estimation of the atmospheric state by combining the short-term forecast and observations. Recently, AI-based data assimilation approaches have attracted increasing attention for their significant advantages over traditional techniques in terms of computational consumption. However, existing AI-based data assimilation methods can only handle observations with a specific resolution, lacking the compatibility and generalization ability to assimilate observations with other resolutions. Considering that complex real-world observations often have different resolutions, we propose the Fourier Neural Processes (FNP) for arbitrary-resolution data assimilation in this paper. Leveraging the efficiency of the designed modules and flexible structure of neural processes, FNP achieves state-of-the-art results in assimilating observations with varying resolutions, and also exhibits increasing advantages over the counterparts as the resolution and the amount of observations increase. Moreover, our FNP trained on a fixed resolution can directly handle the assimilation of observations with out-of-distribution resolutions and the observational information reconstruction task without additional fine-tuning, demonstrating its excellent generalization ability across data resolutions as well as across tasks. Code is available at https://github.com/OpenEarthLab/FNP.

NeurIPS Conference 2024 Conference Paper

Generalizing Weather Forecast to Fine-grained Temporal Scales via Physics-AI Hybrid Modeling

  • Wanghan Xu
  • Fenghua Ling
  • Wenlong Zhang
  • Tao Han
  • Hao Chen
  • Wanli Ouyang
  • Lei Bai

Data-driven artificial intelligence (AI) models have made significant advancements in weather forecasting, particularly in medium-range forecasting and nowcasting. However, most data-driven weather forecasting models are black-box systems that focus on learning data mappings rather than fine-grained physical evolution in the time dimension. Consequently, the limited temporal resolution of datasets prevents these models from forecasting at finer time scales. This paper proposes a physics-AI hybrid model (i.e., WeatherGFT) that generalizes weather forecasts to finer-grained temporal scales beyond the training dataset. Specifically, we employ a carefully designed PDE kernel to simulate physical evolution on a small time scale (e.g., 300 seconds) and use parallel neural networks with a learnable router for bias correction. Furthermore, we introduce a lead-time-aware training framework to promote the generalization of the model at different lead times. The weight analysis of the physics-AI modules indicates that physics conducts the major evolution while AI performs corrections adaptively. Extensive experiments show that WeatherGFT, trained on an hourly dataset, effectively generalizes forecasts across multiple time scales, including 30 minutes, which is finer than the dataset's temporal resolution.

JBHI Journal 2024 Journal Article

Guest Editorial: Trustworthy Machine Learning for Health Informatics

  • Luyang Luo
  • Daguang Xu
  • Jing Qin
  • Yueming Jin
  • Hao Chen

Machine learning (ML), the stem of today's artificial intelligence, has shown significant growth in the field of biomedical and health informatics. On the one hand, ML techniques are becoming more complex in order to deal with real-world data. On the other hand, ML is also becoming increasingly accessible to broader users. For example, automated machine learning products are enabling users to build their own ML models without writing code [1].

NeurIPS Conference 2024 Conference Paper

Imprecise Label Learning: A Unified Framework for Learning with Various Imprecise Label Configurations

  • Hao Chen
  • Ankit Shah
  • Jindong Wang
  • Ran Tao
  • Yidong Wang
  • Xiang Li
  • Xing Xie
  • Masashi Sugiyama

Learning with reduced labeling standards, such as noisy labels, partial labels, and supplementary unlabeled data, which we generically refer to as imprecise labels, is a commonplace challenge in machine learning tasks. Previous methods tend to propose specific designs for every emerging imprecise label configuration, which is usually unsustainable when multiple configurations of imprecision coexist. In this paper, we introduce imprecise label learning (ILL), a framework for the unification of learning with various imprecise label configurations. ILL leverages expectation-maximization (EM) for modeling the imprecise label information, treating the precise labels as latent variables. Instead of approximating the correct labels for training, it considers the entire distribution of all possible labelings entailed by the imprecise information. We demonstrate that ILL can seamlessly adapt to partial label learning, semi-supervised learning, noisy label learning, and, more importantly, a mixture of these settings, with closed-form learning objectives derived from the unified EM modeling. Notably, ILL surpasses the existing specified techniques for handling imprecise labels, marking the first practical and unified framework with robust and effective performance across various challenging settings. We hope our work will inspire further research on this topic, unleashing the full potential of ILL in wider scenarios where precise labels are expensive and complicated to obtain.
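
As a concrete instance of the unified EM modeling, the following sketch shows what a closed-form objective can look like for partial labels, where each sample carries a candidate label set. This is a minimal illustration under those assumptions, not the authors' released code.

    # Minimal sketch of the EM view for partial-label learning (illustrative).
    import torch
    import torch.nn.functional as F

    def ill_partial_label_loss(logits, candidate_mask):
        # E-step: posterior over the latent true label, restricted to the
        # candidate set (candidate_mask is a 0/1 tensor of shape (B, C)).
        probs = F.softmax(logits, dim=-1) * candidate_mask
        posterior = (probs / probs.sum(-1, keepdim=True)).detach()
        # M-step: expected negative log-likelihood under that posterior.
        return -(posterior * F.log_softmax(logits, dim=-1)).sum(-1).mean()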

NeurIPS Conference 2024 Conference Paper

Improved Generation of Adversarial Examples Against Safety-aligned LLMs

  • Qizhang Li
  • Yiwen Guo
  • Wangmeng Zuo
  • Hao Chen

Adversarial prompts (also called adversarial examples) generated using gradient-based methods exhibit outstanding performance in performing automatic jailbreak attacks against safety-aligned LLMs. Nevertheless, due to the discrete nature of texts, the input gradient of LLMs struggles to precisely reflect the magnitude of loss change that results from token replacements in the prompt, leading to limited attack success rates against safety-aligned LLMs, even in the white-box setting. In this paper, we explore a new perspective on this problem, suggesting that it can be alleviated by leveraging innovations inspired by transfer-based attacks that were originally proposed for attacking black-box image classification models. For the first time, we adapt the ideas behind effective transfer-based attacks, i.e., the Skip Gradient Method and Intermediate Level Attack, to gradient-based adversarial prompt generation and achieve significant performance gains without introducing obvious computational cost. Meanwhile, by discussing the mechanisms behind the gains, new insights are drawn, and proper combinations of these methods are also developed. Our empirical results show that 87% of the query-specific adversarial suffixes generated by the developed combination can induce Llama-2-7B-Chat to produce output that exactly matches the target string on AdvBench. This match rate is 33% higher than that of a very strong baseline known as GCG, demonstrating advanced discrete optimization for adversarial prompt generation against LLMs. In addition, without introducing obvious cost, the combination achieves a >30% absolute increase in attack success rates compared with GCG when generating both query-specific (38% -> 68%) and universal adversarial prompts (26.68% -> 60.32%) for attacking the Llama-2-7B-Chat model on AdvBench. Code at: https://github.com/qizhangli/Gradient-based-Jailbreak-Attacks.

NeurIPS Conference 2024 Conference Paper

KnowGPT: Knowledge Graph based Prompting for Large Language Models

  • Qinggang Zhang
  • Junnan Dong
  • Hao Chen
  • Daochen Zha
  • Zailiang Yu
  • Xiao Huang

Large Language Models (LLMs) have demonstrated remarkable capabilities in many real-world applications. Nonetheless, LLMs are often criticized for their tendency to produce hallucinations, wherein the models fabricate incorrect statements on tasks beyond their knowledge and perception. To alleviate this issue, graph retrieval-augmented generation (GraphRAG) has been extensively explored, which leverages the factual knowledge in knowledge graphs (KGs) to ground the LLM's responses in established facts and principles. However, most state-of-the-art LLMs are closed-source, making it challenging to develop a prompting framework that can efficiently and effectively integrate KGs into LLMs with hard prompts only. Generally, existing KG-enhanced LLMs usually suffer from three critical issues, including huge search space, high API costs, and laborious prompt engineering, that impede their widespread application in practice. To this end, we introduce a novel Knowledge Graph based PrompTing framework, namely KnowGPT, to enhance LLMs with domain knowledge. KnowGPT contains a knowledge extraction module to extract the most informative knowledge from KGs, and a context-aware prompt construction module to automatically convert extracted knowledge into effective prompts. Experiments on three benchmarks demonstrate that KnowGPT significantly outperforms all competitors. Notably, KnowGPT achieves 92.6% accuracy on the OpenbookQA leaderboard, comparable to human-level performance.

AAAI Conference 2024 Conference Paper

MDGNN: Multi-Relational Dynamic Graph Neural Network for Comprehensive and Dynamic Stock Investment Prediction

  • Hao Qian
  • Hongting Zhou
  • Qian Zhao
  • Hao Chen
  • Hongxiang Yao
  • Jingwei Wang
  • Ziqi Liu
  • Fei Yu

The stock market is a crucial component of the financial system, but predicting the movement of stock prices is challenging due to the dynamic and intricate relations arising from various aspects such as economic indicators, financial reports, global news, and investor sentiment. Traditional sequential methods and graph-based models have been applied in stock movement prediction, but they have limitations in capturing the multifaceted and temporal influences in stock price movements. To address these challenges, the Multi-relational Dynamic Graph Neural Network (MDGNN) framework is proposed, which utilizes a discrete dynamic graph to comprehensively capture multifaceted relations among stocks and their evolution over time. The representation generated from the graph offers a complete perspective on the interrelationships among stocks and associated entities. Additionally, the power of the Transformer structure is leveraged to encode the temporal evolution of multiplex relations, providing a dynamic and effective approach to predicting stock investment. Furthermore, our proposed MDGNN framework achieves the best performance on public datasets compared with state-of-the-art stock investment methods.

NeurIPS Conference 2024 Conference Paper

Metric from Human: Zero-shot Monocular Metric Depth Estimation via Test-time Adaptation

  • Yizhou Zhao
  • Hengwei Bian
  • Kaihua Chen
  • Pengliang Ji
  • Liao Qu
  • Shao-yu Lin
  • Weichen Yu
  • Haoran Li

Monocular depth estimation (MDE) is fundamental for deriving 3D scene structures from 2D images. While state-of-the-art monocular relative depth estimation (MRDE) excels in estimating relative depths for in-the-wild images, current monocular metric depth estimation (MMDE) approaches still face challenges in handling unseen scenes. Since MMDE can be viewed as the composition of MRDE and metric scale recovery, we attribute this difficulty to scene dependency, where MMDE models rely on scenes observed during supervised training for predicting scene scales during inference. To address this issue, we propose to use humans as landmarks for distilling scene-independent metric scale priors from generative painting models. Our approach, Metric from Human (MfH), bridges from generalizable MRDE to zero-shot MMDE in a generate-and-estimate manner. Specifically, MfH generates humans on the input image with generative painting and estimates human dimensions with an off-the-shelf human mesh recovery (HMR) model. Based on MRDE predictions, it propagates the metric information from painted humans to the contexts, resulting in metric depth estimations for the original input. Through this annotation-free test-time adaptation, MfH achieves superior zero-shot performance in MMDE, demonstrating its strong generalization ability.
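
The scale-recovery step can be pictured as simple arithmetic once a painted human's metric size is known; the sketch below is a deliberately simplified illustration of that anchoring, with all names assumed rather than taken from the paper.

    # Illustrative sketch: a human of known metric height anchors the scene scale.
    def metric_scale(human_height_m, human_height_rel):
        # Ratio of the HMR-estimated metric height to the height implied by the
        # relative-depth prediction gives a scene scale factor.
        return human_height_m / human_height_rel

    # metric_depth = metric_scale(1.75, rel_height) * relative_depth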

AAAI Conference 2024 Conference Paper

MICA: Towards Explainable Skin Lesion Diagnosis via Multi-Level Image-Concept Alignment

  • Yequan Bie
  • Luyang Luo
  • Hao Chen

Black-box deep learning approaches have showcased significant potential in the realm of medical image analysis. However, the stringent trustworthiness requirements intrinsic to the medical field have catalyzed research into the utilization of Explainable Artificial Intelligence (XAI), with a particular focus on concept-based methods. Existing concept-based methods predominantly apply concept annotations from a single perspective (e.g., global level), neglecting the nuanced semantic relationships between sub-regions and concepts embedded within medical images. This leads to underutilization of the valuable medical information and may cause models to fall short in harmoniously balancing interpretability and performance when employing inherently interpretable architectures such as Concept Bottlenecks. To mitigate these shortcomings, we propose a multi-modal explainable disease diagnosis framework that meticulously aligns medical images and clinical-related concepts semantically at multiple strata, encompassing the image level, token level, and concept level. Moreover, our method allows for model intervention and offers both textual and visual explanations in terms of human-interpretable concepts. Experimental results on three skin image datasets demonstrate that our method, while preserving model interpretability, attains high performance and label efficiency for concept detection and disease diagnosis. The code is available at https://github.com/Tommy-Bie/MICA.

ECAI Conference 2024 Conference Paper

Mixup Your Own Latent: Efficient and Robust Self-Supervised Learning on Small Images

  • Eugene Yang
  • Hao Chen
  • Seokho Kang 0001

Self-supervised learning has emerged as a powerful technique in computer vision, demonstrating remarkable performance in various downstream tasks by leveraging unlabeled data. Among these methods, contrastive learning has proven particularly promising by effectively learning image representations. However, its high reliance on large computational resources poses significant practical challenges. To address this issue, there is a pressing need to improve efficiency without compromising generalization performance and robustness. In this paper, we propose Mixup Your Own Latent (MYOL), a regularization method to improve the generalization performance and robustness of Bootstrap Your Own Latent (BYOL), particularly for small images under limited computational resources. MYOL achieves this by using the mixup of the representations of two input images as the target representation for the mixup of those images. Through experiments conducted in a single-GPU environment, we demonstrate that MYOL outperforms BYOL and other regularization methods across various downstream tasks on small-image datasets. The high resilience of MYOL to small batch sizes and its robustness to adversarial attacks further highlight its effectiveness in mitigating the limitations of BYOL. The source code is available at https://github.com/cneyang/MYOL-MixupYourOwnLatent.
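
A minimal sketch of the target construction described above, assuming BYOL-style online/target networks already exist (function and variable names are illustrative):

    import torch

    def myol_pair(target_net, x1, x2, lam):
        # The target for the mixed image lam*x1 + (1-lam)*x2 is the same mixup
        # of the two images' target-network representations.
        with torch.no_grad():
            z1, z2 = target_net(x1), target_net(x2)
        x_mix = lam * x1 + (1 - lam) * x2
        z_target = lam * z1 + (1 - lam) * z2
        return x_mix, z_target

    # The online network then predicts z_target from x_mix, exactly as in BYOL.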

AAMAS Conference 2024 Conference Paper

Mutual Information as Intrinsic Reward of Reinforcement Learning Agents for On-demand Ride Pooling

  • Xianjie Zhang
  • Jiahao Sun
  • Chen Gong
  • Kai Wang
  • Yifei Cao
  • Hao Chen
  • Yu Liu

The emergence of on-demand ride pooling services allows each vehicle to serve multiple passengers at a time, thus increasing drivers' income and enabling passengers to travel at lower prices than taxi/car on-demand services. Although on-demand ride pooling services can bring many benefits, they need a well-defined matching strategy to maximize the benefits for all parties (passengers, drivers, aggregation companies, and the environment); in particular, the regional dispatching of vehicles has a significant impact on matching and revenue. Existing algorithms often only consider revenue maximization, which makes it difficult for requests with unusual distributions to get rides. How to increase revenue while ensuring a reasonable assignment of requests brings a challenge to ride pooling service companies (aggregation companies). In this paper, we propose a framework for vehicle dispatching for ride pooling tasks, which splits the city into discrete dispatching regions and uses a reinforcement learning (RL) algorithm to dispatch vehicles in these regions. We also use the mutual information (MI) between the vehicle and request distributions as the intrinsic reward of the RL algorithm to improve the correlation between these distributions, thus ensuring the possibility of getting a ride for unusually distributed requests. In experiments on a real-world taxi dataset, we demonstrate that our framework can increase revenue by an average of up to 3% over the best existing on-demand ride pooling method.

ICRA Conference 2024 Conference Paper

Optimization of Flexible Bronchoscopy Shape Sensing Using Fiber Optic Sensors

  • Xinran Liu
  • Hao Chen
  • Hongbin Liu

This work presents a novel shape evaluation and optimization approach for shape sensing, specifically targeting the constrained, irregular, and intricate spatial shapes of flexible bronchoscopes (FB) in the human bronchial tree. The proposed evaluation criteria and optimization methods combine clinical significance related to bronchial anatomical structures and address issues related to singular points and discontinuities in traditional shape reconstruction models. Three-dimensional experiments were conducted within eight spatial complex configurations printed from a proportional bronchial model. The 3D experiment results demonstrate an average reduction of approximately 34.1% in shape reconstruction errors across all eight airway models compared to the traditional model, validating the effectiveness and feasibility of the proposed approach.

AAMAS Conference 2024 Conference Paper

PDiT: Interleaving Perception and Decision-making Transformers for Deep Reinforcement Learning

  • Hangyu Mao
  • Rui Zhao
  • Ziyue Li
  • Zhiwei Xu
  • Hao Chen
  • Yiqun Chen
  • Bin Zhang
  • Zhen Xiao

Designing better deep networks and better reinforcement learning (RL) algorithms are both important for deep RL. This work studies the former. Specifically, the Perception and Decision-making Interleaving Transformer (PDiT) network is proposed, which cascades two Transformers in a very natural way: the perceiving one focuses on environmental perception by processing the observation at the patch level, whereas the deciding one pays attention to decision-making by conditioning on the history of the desired returns, the perceiver's outputs, and the actions. Such a network design is generally applicable to a lot of deep RL settings, e.g., both online and offline RL algorithms under environments with either image observations, proprioception observations, or hybrid image-language observations. Extensive experiments show that PDiT can not only achieve performance superior to strong baselines in different settings but also extract explainable feature representations. Our code is available at https://github.com/maohangyu/PDiT.

JMLR Journal 2024 Journal Article

PromptBench: A Unified Library for Evaluation of Large Language Models

  • Kaijie Zhu
  • Qinlin Zhao
  • Hao Chen
  • Jindong Wang
  • Xing Xie

The evaluation of large language models (LLMs) is crucial to assess their performance and mitigate potential security risks. In this paper, we introduce PromptBench, a unified library to evaluate LLMs. It consists of several key components that can be easily used and extended by researchers: prompt construction, prompt engineering, dataset and model loading, adversarial prompt attack, dynamic evaluation protocols, and analysis tools. PromptBench is designed as an open, general, and flexible codebase for research purposes. It aims to facilitate original study in creating new benchmarks, deploying downstream applications, and designing new evaluation protocols. The code is available at https://github.com/microsoft/promptbench and will be continuously supported.

AAAI Conference 2024 Conference Paper

PromptMRG: Diagnosis-Driven Prompts for Medical Report Generation

  • Haibo Jin
  • Haoxuan Che
  • Yi Lin
  • Hao Chen

Automatic medical report generation (MRG) is of great research value as it has the potential to relieve radiologists from the heavy burden of report writing. Despite recent advancements, accurate MRG remains challenging due to the need for precise clinical understanding and disease identification. Moreover, the imbalanced distribution of diseases makes the challenge even more pronounced, as rare diseases are underrepresented in training data, making their diagnosis unreliable. To address these challenges, we propose diagnosis-driven prompts for medical report generation (PromptMRG), a novel framework that aims to improve the diagnostic accuracy of MRG with the guidance of diagnosis-aware prompts. Specifically, PromptMRG is based on an encoder-decoder architecture with an extra disease classification branch. When generating reports, the diagnostic results from the classification branch are converted into token prompts to explicitly guide the generation process. To further improve the diagnostic accuracy, we design cross-modal feature enhancement, which retrieves similar reports from the database to assist the diagnosis of a query image by leveraging the knowledge from a pre-trained CLIP. Moreover, the disease imbalance issue is addressed by applying an adaptive logit-adjusted loss to the classification branch based on the individual learning status of each disease, which overcomes the barrier of the text decoder's inability to manipulate disease distributions. Experiments on two MRG benchmarks show the effectiveness of the proposed method, where it obtains state-of-the-art clinical efficacy performance on both datasets.
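
The conversion of diagnostic results into token prompts can be pictured as follows; the token scheme here is an assumed illustration, not the paper's exact vocabulary:

    def diagnosis_prompt(disease_probs, disease_names, threshold=0.5):
        # Turn classification-branch outputs into prompt tokens that condition
        # the report decoder (illustrative token format).
        tokens = [f"[{name}:{'POS' if p > threshold else 'NEG'}]"
                  for name, p in zip(disease_names, disease_probs)]
        return " ".join(tokens)

    # diagnosis_prompt([0.9, 0.2], ["edema", "effusion"])
    # -> "[edema:POS] [effusion:NEG]"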

ICLR Conference 2024 Conference Paper

Prototypical Information Bottlenecking and Disentangling for Multimodal Cancer Survival Prediction

  • Yilan Zhang
  • Yingxue Xu
  • Jianqi Chen
  • Fengying Xie
  • Hao Chen

Multimodal learning significantly benefits cancer survival prediction, especially the integration of pathological images and genomic data. Despite advantages of multimodal learning for cancer survival prediction, massive redundancy in multimodal data prevents it from extracting discriminative and compact information: (1) An extensive amount of intra-modal task-unrelated information blurs discriminability, especially for gigapixel whole slide images (WSIs) with many patches in pathology and thousands of pathways in genomic data, leading to an "intra-modal redundancy" issue. (2) Duplicated information among modalities dominates the representation of multimodal data, which makes modality-specific information prone to being ignored, resulting in an "inter-modal redundancy" issue. To address these, we propose a new framework, Prototypical Information Bottlenecking and Disentangling (PIBD), consisting of Prototypical Information Bottleneck (PIB) module for intra-modal redundancy and Prototypical Information Disentanglement (PID) module for inter-modal redundancy. Specifically, a variant of information bottleneck, PIB, is proposed to model prototypes approximating a bunch of instances for different risk levels, which can be used for selection of discriminative instances within modality. PID module decouples entangled multimodal data into compact distinct components: modality-common and modality-specific knowledge, under the guidance of the joint prototypical distribution. Extensive experiments on five cancer benchmark datasets demonstrated our superiority over other methods. The code is released.

NeurIPS Conference 2024 Conference Paper

Prune and Repaint: Content-Aware Image Retargeting for any Ratio

  • Feihong Shen
  • Chao Li
  • Yifeng Geng
  • Yongjian Deng
  • Hao Chen

Image retargeting is the task of adjusting the aspect ratio of images to suit different display devices or presentation environments. However, existing retargeting methods often struggle to balance the preservation of key semantics and image quality, resulting in either deformation or loss of important objects, or the introduction of local artifacts such as discontinuous pixels and inconsistent regenerated content. To address these issues, we propose a content-aware retargeting method called PruneRepaint. It incorporates semantic importance for each pixel to guide the identification of regions that need to be pruned or preserved in order to maintain key semantics. Additionally, we introduce an adaptive repainting module that selects image regions for repainting based on the distribution of pruned pixels and the proportion between foreground size and target aspect ratio, thus achieving local smoothness after pruning. By focusing on the content and structure of the foreground, our PruneRepaint approach adaptively avoids key content loss and deformation, while effectively mitigating artifacts with local repainting. We conduct experiments on the public RetargetMe benchmark and demonstrate through objective experimental results and subjective user studies that our method outperforms previous approaches in terms of preserving semantics and aesthetics, as well as better generalization across diverse aspect ratios. Codes will be available at https://github.com/fhshen2022/PruneRepaint.

AAAI Conference 2024 Conference Paper

Retrieval-Augmented Primitive Representations for Compositional Zero-Shot Learning

  • Chenchen Jing
  • Yukun Li
  • Hao Chen
  • Chunhua Shen

Compositional zero-shot learning (CZSL) aims to recognize unseen attribute-object compositions by learning from seen compositions. Composing the learned knowledge of seen primitives, i.e., attributes or objects, into novel compositions is critical for CZSL. In this work, we propose to explicitly retrieve knowledge of seen primitives for compositional zero-shot learning. We present a retrieval-augmented method, which augments standard multi-path classification methods with two retrieval modules. Specifically, we construct two databases storing the attribute and object representations of training images, respectively. For an input training/testing image, we use two retrieval modules to retrieve representations of training images with the same attribute and object, respectively. The primitive representations of the input image are augmented by using the retrieved representations, for composition recognition. By referencing semantically similar images, the proposed method is capable of recalling knowledge of seen primitives for compositional generalization. Experiments on three widely-used datasets show the effectiveness of the proposed method.
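
A minimal sketch of one retrieval module, under the assumption that training-image attribute (or object) representations are stored in a flat tensor database; all names are illustrative:

    import torch
    import torch.nn.functional as F

    def retrieve_and_augment(query, database, k=5, alpha=0.5):
        # database: (N, D) stored primitive representations; query: (D,).
        sims = F.cosine_similarity(database, query[None, :], dim=-1)
        retrieved = database[sims.topk(k).indices].mean(dim=0)
        # Blend retrieved knowledge of seen primitives into the query's
        # primitive representation before composition recognition.
        return alpha * query + (1 - alpha) * retrieved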

AAAI Conference 2024 Conference Paper

Revisiting Open-Set Panoptic Segmentation

  • Yufei Yin
  • Hao Chen
  • Wengang Zhou
  • Jiajun Deng
  • Haiming Xu
  • Houqiang Li

In this paper, we focus on the open-set panoptic segmentation (OPS) task to circumvent the data explosion problem. Different from the close-set setting, OPS aims to detect both known and unknown categories, where the latter are not annotated during training. Different from existing work that only selects a few common categories as unknown ones, we move forward to the real-world scenario by considering the various tail categories (~1k). To this end, we first build a new dataset with long-tail distribution for the OPS task. Based on this dataset, we additionally add a new class type for unknown classes and re-define the training annotations to make the OPS definition more complete and reasonable. Moreover, we analyze the influence of several significant factors in the OPS task and explore the upper bound of performance on unknown classes with different settings. Furthermore, based on the analyses, we design an effective two-phase framework for the OPS task, including thing-agnostic map generation and unknown segment mining. We further adopt semi-supervised learning to improve the OPS performance. Experimental results on different datasets validate the effectiveness of our method.

IJCAI Conference 2024 Conference Paper

Score-CDM: Score-Weighted Convolutional Diffusion Model for Multivariate Time Series Imputation

  • Shunyang Zhang
  • Senzhang Wang
  • Hao Miao
  • Hao Chen
  • Changjun Fan
  • Jian Zhang

Multivariate time series (MTS) data are usually incomplete in real scenarios, and imputing the incomplete MTS is practically important to facilitate various time series mining tasks. Recently, diffusion model-based MTS imputation methods have achieved promising results by utilizing CNN or attention mechanisms for temporal feature learning. However, it is hard to adaptively trade off the diverse effects of local and global temporal features by simply combining CNN and attention. To address this issue, we propose a Score-weighted Convolutional Diffusion Model (Score-CDM for short), whose backbone consists of a Score-weighted Convolution Module (SCM) and an Adaptive Reception Module (ARM). SCM adopts a score map to capture the global temporal features in the time domain, while ARM uses a Spectral2Time Window Block (S2TWB) to convolve the local time series data in the spectral domain. Benefiting from the time convolution properties of Fast Fourier Transformation, ARM can adaptively change the receptive field of the score map, and thus effectively balance the local and global temporal features. We conduct extensive evaluations on three real MTS datasets of different domains, and the results verify the effectiveness of the proposed Score-CDM.

NeurIPS Conference 2024 Conference Paper

Slight Corruption in Pre-training Data Makes Better Diffusion Models

  • Hao Chen
  • Yujin Han
  • Diganta Misra
  • Xiang Li
  • Kai Hu
  • Difan Zou
  • Masashi Sugiyama
  • Jindong Wang

Diffusion models (DMs) have shown remarkable capabilities in generating realistic high-quality images, audios, and videos. They benefit significantly from extensive pre-training on large-scale datasets, including web-crawled data with paired data and conditions, such as image-text and image-class pairs. Despite rigorous filtering, these pre-training datasets often inevitably contain corrupted pairs where conditions do not accurately describe the data. This paper presents the first comprehensive study on the impact of such corruption in pre-training data of DMs. We synthetically corrupt ImageNet-1K and CC3M to pre-train and evaluate over $50$ conditional DMs. Our empirical findings reveal that various types of slight corruption in pre-training can significantly enhance the quality, diversity, and fidelity of the generated images across different DMs, both during pre-training and downstream adaptation stages. Theoretically, we consider a Gaussian mixture model and prove that slight corruption in the condition leads to higher entropy and a reduced 2-Wasserstein distance between the distribution generated by the corruptly trained DMs and the ground-truth data distribution. Inspired by our analysis, we propose a simple method to improve the training of DMs on practical datasets by adding condition embedding perturbations (CEP). CEP significantly improves the performance of various DMs in both pre-training and downstream tasks. We hope that our study provides new insights into understanding the data and pre-training processes of DMs.
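
The proposed CEP method amounts to a one-line perturbation during training; below is a minimal sketch with an assumed noise scale gamma:

    import torch

    def perturb_condition(cond_emb, gamma=0.1):
        # Condition embedding perturbation: add small Gaussian noise to the
        # condition embedding before it conditions the diffusion model.
        return cond_emb + gamma * torch.randn_like(cond_emb)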

NeurIPS Conference 2024 Conference Paper

Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?

  • Pedro R. Bassi
  • Wenxuan Li
  • Yucheng Tang
  • Fabian Isensee
  • Zifu Wang
  • Jieneng Chen
  • Yu-Cheng Chou
  • Saikat Roy

How can we test AI performance? This question seems trivial, but it isn't. Standard benchmarks often have problems such as in-distribution and small-size test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, good performance on standard benchmarks does not guarantee success in real-world scenarios. To address these problems, we present Touchstone, a large-scale collaborative segmentation benchmark of 9 types of abdominal organs. This benchmark is based on 5,195 training CT scans from 76 hospitals around the world and 5,903 testing CT scans from 11 additional hospitals. This diverse test set enhances the statistical significance of benchmark results and rigorously evaluates AI algorithms across various out-of-distribution scenarios. We invited 14 inventors of 19 AI algorithms to train their algorithms, while our team, as a third party, independently evaluated these algorithms on three test sets. In addition, we also evaluated pre-existing AI frameworks (which, unlike algorithms, are more flexible and can support different algorithms), including MONAI from NVIDIA, nnU-Net from DKFZ, and numerous other open-source frameworks. We are committed to expanding this benchmark to encourage more innovation of AI algorithms for the medical domain.

NeurIPS Conference 2024 Conference Paper

Transformer Doctor: Diagnosing and Treating Vision Transformers

  • Jiacong Hu
  • Hao Chen
  • Kejia Chen
  • Yang Gao
  • Jingwen Ye
  • Xingen Wang
  • Mingli Song
  • Zunlei Feng

Due to their powerful representational capabilities, Transformers have gradually become the mainstream model in the field of machine vision. However, the vast and complex parameters of Transformers impede researchers from gaining a deep understanding of their internal mechanisms, especially error mechanisms. Existing methods for interpreting Transformers mainly focus on understanding them from the perspectives of the importance of input tokens or internal modules, as well as the formation and meaning of features. In contrast, inspired by research on information integration mechanisms and conjunctive errors in the biological visual system, this paper conducts an in-depth exploration of the internal error mechanisms of Transformers. We first propose an information integration hypothesis for Transformers in the machine vision domain and provide substantial experimental evidence to support this hypothesis. This includes the dynamic integration of information among tokens and the static integration of information within tokens in Transformers, as well as the presence of conjunctive errors therein. Addressing these errors, we further propose heuristic dynamic integration constraint methods and rule-based static integration constraint methods to rectify errors and ultimately improve model performance. The entire methodology framework is termed Transformer Doctor, designed for diagnosing and treating internal errors within Transformers. Through a plethora of quantitative and qualitative experiments, we demonstrate that Transformer Doctor can effectively address internal errors in Transformers, thereby enhancing model performance.

NeurIPS Conference 2024 Conference Paper

Unleashing the Potential of the Diffusion Model in Few-shot Semantic Segmentation

  • Muzhi Zhu
  • Yang Liu
  • Zekai Luo
  • Chenchen Jing
  • Hao Chen
  • Guangkai Xu
  • Xinlong Wang
  • Chunhua Shen

The Diffusion Model has not only garnered noteworthy achievements in the realm of image generation but has also demonstrated its potential as an effective pretraining method utilizing unlabeled data. Drawing from the extensive potential unveiled by the Diffusion Model in both semantic correspondence and open vocabulary segmentation, our work initiates an investigation into employing the Latent Diffusion Model for Few-shot Semantic Segmentation. Recently, inspired by the in-context learning ability of large language models, Few-shot Semantic Segmentation has evolved into In-context Segmentation tasks, morphing into a crucial element in assessing generalist segmentation models. In this context, we concentrate on Few-shot Semantic Segmentation, establishing a solid foundation for the future development of a Diffusion-based generalist model for segmentation. Our initial focus lies in understanding how to facilitate interaction between the query image and the support image, resulting in the proposal of a KV fusion method within the self-attention framework. Subsequently, we delve deeper into optimizing the infusion of information from the support mask and simultaneously re-evaluating how to provide reasonable supervision from the query mask. Based on our analysis, we establish a simple and effective framework named DiffewS, maximally retaining the original Latent Diffusion Model's generative framework and effectively utilizing the pre-training prior. Experimental results demonstrate that our method significantly outperforms the previous SOTA models in multiple settings.

AAAI Conference 2023 Conference Paper

Consensus Learning for Cooperative Multi-Agent Reinforcement Learning

  • Zhiwei Xu
  • Bin Zhang
  • Dapeng Li
  • Zeren Zhang
  • Guangchong Zhou
  • Hao Chen
  • Guoliang Fan

Almost all multi-agent reinforcement learning algorithms without communication follow the principle of centralized training with decentralized execution. During the centralized training, agents can be guided by the same signals, such as the global state. However, agents lack the shared signal and choose actions given local observations during execution. Inspired by viewpoint invariance and contrastive learning, we propose consensus learning for cooperative multi-agent reinforcement learning in this study. Although based on local observations, different agents can infer the same consensus in discrete spaces without communication. We feed the inferred one-hot consensus to the network of agents as an explicit input in a decentralized way, thereby fostering their cooperative spirit. With minor model modifications, our suggested framework can be extended to a variety of multi-agent reinforcement learning algorithms. Moreover, we carry out these variants on some fully cooperative tasks and get convincing results.

NeurIPS Conference 2023 Conference Paper

DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models

  • Weijia Wu
  • Yuzhong Zhao
  • Hao Chen
  • Yuchao Gu
  • Rui Zhao
  • Yefei He
  • Hong Zhou
  • Mike Zheng Shou

Current deep networks are very data-hungry and benefit from training on large-scale datasets, which are often time-consuming to collect and annotate. By contrast, synthetic data can be generated infinitely using generative models such as DALL-E and diffusion models, with minimal effort and cost. In this paper, we present DatasetDM, a generic dataset generation model that can produce diverse synthetic images and the corresponding high-quality perception annotations (e.g., segmentation masks and depth). Our method builds upon the pre-trained diffusion model and extends text-guided image synthesis to perception data generation. We show that the rich latent code of the diffusion model can be effectively decoded as accurate perception annotations using a decoder module. Training the decoder only requires less than 1% (around 100 images) of manually labeled images, enabling the generation of an infinitely large annotated dataset. Then these synthetic data can be used for training various perception models on downstream tasks. To showcase the power of the proposed approach, we generate datasets with rich dense pixel-wise labels for a wide range of downstream tasks, including semantic segmentation, instance segmentation, and depth estimation. Notably, it achieves 1) state-of-the-art results on semantic segmentation and instance segmentation; 2) significantly better efficiency and robustness in domain generalization than real data; 3) state-of-the-art results in the zero-shot segmentation setting; and 4) flexibility for efficient application and novel task composition (e.g., image editing).

IJCAI Conference 2023 Conference Paper

Diagnose Like a Pathologist: Transformer-Enabled Hierarchical Attention-Guided Multiple Instance Learning for Whole Slide Image Classification

  • Conghao Xiong
  • Hao Chen
  • Joseph J. Y. Sung
  • Irwin King

Multiple Instance Learning (MIL) and transformers are increasingly popular in histopathology Whole Slide Image (WSI) classification. However, unlike human pathologists who selectively observe specific regions of histopathology tissues under different magnifications, most methods do not incorporate multiple resolutions of the WSIs, hierarchically and attentively, thereby leading to a loss of focus on the WSIs and information from other resolutions. To resolve this issue, we propose a Hierarchical Attention-Guided Multiple Instance Learning framework to fully exploit the WSIs. This framework can dynamically and attentively discover the discriminative regions across multiple resolutions of the WSIs. Within this framework, an Integrated Attention Transformer is proposed to further enhance the performance of the transformer and obtain a more holistic WSI (bag) representation. This transformer consists of multiple Integrated Attention Modules, each of which combines a transformer layer and an aggregation module that produces a bag representation based on every instance representation in that bag. The experimental results show that our method achieved state-of-the-art performances on multiple datasets, including Camelyon16, TCGA-RCC, TCGA-NSCLC, and an in-house IMGC dataset. The code is available at https://github.com/BearCleverProud/HAG-MIL.

NeurIPS Conference 2023 Conference Paper

Improving Adversarial Transferability via Intermediate-level Perturbation Decay

  • Qizhang Li
  • Yiwen Guo
  • Wangmeng Zuo
  • Hao Chen

Intermediate-level attacks that attempt to perturb feature representations following an adversarial direction drastically have shown favorable performance in crafting transferable adversarial examples. Existing methods in this category are normally formulated with two separate stages, where a directional guide is required to be determined at first and the scalar projection of the intermediate-level perturbation onto the directional guide is enlarged thereafter. The obtained perturbation deviates from the guide inevitably in the feature space, and it is revealed in this paper that such a deviation may lead to sub-optimal attack. To address this issue, we develop a novel intermediate-level method that crafts adversarial examples within a single stage of optimization. In particular, the proposed method, named intermediate-level perturbation decay (ILPD), encourages the intermediate-level perturbation to be in an effective adversarial direction and to possess a great magnitude simultaneously. In-depth discussion verifies the effectiveness of our method. Experimental results show that it outperforms state-of-the-art methods by large margins in attacking various victim models on ImageNet (+10.07% on average) and CIFAR-10 (+3.88% on average). Our code is at https://github.com/qizhangli/ILPD-attack.

NeurIPS Conference 2023 Conference Paper

Towards Evaluating Transfer-based Attacks Systematically, Practically, and Fairly

  • Qizhang Li
  • Yiwen Guo
  • Wangmeng Zuo
  • Hao Chen

The adversarial vulnerability of deep neural networks (DNNs) has drawn great attention due to the security risk of applying these models in real-world applications. Based on the transferability of adversarial examples, an increasing number of transfer-based methods have been developed to fool black-box DNN models whose architecture and parameters are inaccessible. Although tremendous effort has been exerted, there is still no standardized benchmark that can be used to compare these methods systematically, fairly, and practically. Our investigation shows that the evaluation of some methods needs to be more reasonable and more thorough to verify their effectiveness, to avoid, for example, unfair comparison and insufficient consideration of possible substitute/victim models. Therefore, we establish a transfer-based attack benchmark (TA-Bench) which implements 30+ methods. In this paper, we evaluate and compare them comprehensively on 10 popular substitute/victim models on ImageNet. New insights about the effectiveness of these methods are gained and guidelines for future evaluations are provided.

NeurIPS Conference 2023 Conference Paper

Train Once, Get a Family: State-Adaptive Balances for Offline-to-Online Reinforcement Learning

  • Shenzhi Wang
  • Qisen Yang
  • Jiawei Gao
  • Matthieu Lin
  • Hao Chen
  • Liwei Wu
  • Ning Jia
  • Shiji Song

Offline-to-online reinforcement learning (RL) is a training paradigm that combines pre-training on a pre-collected dataset with fine-tuning in an online environment. However, the incorporation of online fine-tuning can intensify the well-known distributional shift problem. Existing solutions tackle this problem by imposing a policy constraint on the policy improvement objective in both offline and online learning. They typically advocate a single balance between policy improvement and constraints across diverse data collections. This one-size-fits-all manner may not optimally leverage each collected sample due to the significant variation in data quality across different states. To this end, we introduce Family Offline-to-Online RL (FamO2O), a simple yet effective framework that empowers existing algorithms to determine state-adaptive improvement-constraint balances. FamO2O utilizes a universal model to train a family of policies with different improvement/constraint intensities, and a balance model to select a suitable policy for each state. Theoretically, we prove that state-adaptive balances are necessary for achieving a higher policy performance upper bound. Empirically, extensive experiments show that FamO2O offers a statistically significant improvement over various existing methods, achieving state-of-the-art performance on the D4RL benchmark. Codes are available at https://github.com/LeapLabTHU/FamO2O.

JBHI Journal 2022 Journal Article

A Cascaded Multi-Task Generative Framework for Detecting Aortic Dissection on 3-D Non-Contrast-Enhanced Computed Tomography

  • Xiangyu Xiong
  • Yan Ding
  • Chuanqi Sun
  • Zhuoneng Zhang
  • Xiuhong Guan
  • Tianjing Zhang
  • Hao Chen
  • Hongyan Liu

Contrast-enhanced computed tomography (CE-CT) is the gold standard for diagnosing aortic dissection (AD). However, contrast agents can cause allergic reactions or renal failure in some patients. Moreover, AD diagnosis by radiologists using non-contrast-enhanced CT (NCE-CT) images has poor sensitivity. To address this issue, we propose a novel cascaded multi-task generative framework for AD detection using NCE-CT volumes. The framework includes a 3D nnU-Net and a 3D multi-task generative architecture (3D MTGA). Specifically, the 3D nnU-Net was employed to segment aortas from NCE-CT volumes. The 3D MTGA was then employed to simultaneously synthesize CE-CT volumes, segment the true and false lumen, and classify the patient as AD or non-AD. A theoretical formulation demonstrated that the 3D MTGA could increase the Jensen–Shannon Divergence (JSD) between AD and non-AD for each NCE-CT volume, thus indirectly improving the AD detection performance. Experiments also showed that the proposed framework achieved an average accuracy of 0.831, a sensitivity of 0.938, and an F1-score of 0.847, compared with seven state-of-the-art classification models and with three radiologists of junior, intermediate, and senior experience, respectively. The experimental results indicate that the proposed framework obtains superior performance to state-of-the-art models in AD detection. Thus, it has great potential to reduce the misdiagnosis of AD using NCE-CT in clinical practice. The source codes and supplementary materials for our framework are available at https://github.com/yXiangXiong/CMTGF.

NeurIPS Conference 2022 Conference Paper

An In-depth Study of Stochastic Backpropagation

  • Jun Fang
  • Mingze Xu
  • Hao Chen
  • Bing Shuai
  • Zhuowen Tu
  • Joseph Tighe

In this paper, we provide an in-depth study of Stochastic Backpropagation (SBP) when training deep neural networks for standard image classification and object detection tasks. During backward propagation, SBP calculates gradients by using only a subset of feature maps to save GPU memory and computational cost. We interpret SBP as an efficient way to implement stochastic gradient descent by performing backpropagation dropout, which leads to significant memory saving and training run-time reduction, with a minimal impact on the overall model accuracy. We offer best practices to apply SBP for training image recognition models, which can be adopted in learning a wide range of deep neural networks. Experiments on image classification and object detection show that SBP can save up to 40% of GPU memory with less than 1% accuracy degradation. Code is available at: https://github.com/amazon-research/stochastic-backpropagation
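
Functionally, the gradient-subsetting idea can be sketched as below. Note this toy version only restricts gradient flow; the real memory saving described in the paper comes from not caching activations for the detached subset (all names are illustrative):

    import torch

    def stochastic_bp(feat, keep_ratio=0.5):
        # feat: (B, C, H, W). Gradients flow through roughly keep_ratio of the
        # channels; the rest are detached from the computation graph.
        mask = (torch.rand(1, feat.size(1), 1, 1, device=feat.device)
                < keep_ratio).float()
        return mask * feat + (1.0 - mask) * feat.detach()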

JMLR Journal 2022 Journal Article

Gaussian Process Parameter Estimation Using Mini-batch Stochastic Gradient Descent: Convergence Guarantees and Empirical Benefits

  • Hao Chen
  • Lili Zheng
  • Raed Al Kontar
  • Garvesh Raskutti

Stochastic gradient descent (SGD) and its variants have established themselves as the go-to algorithms for large-scale machine learning problems with independent samples due to their generalization performance and intrinsic computational advantage. However, the fact that the stochastic gradient is a biased estimator of the full gradient with correlated samples has led to the lack of theoretical understanding of how SGD behaves under correlated settings and hindered its use in such cases. In this paper, we focus on hyperparameter estimation for the Gaussian process (GP) and take a step forward towards breaking the barrier by proving that minibatch SGD converges to a critical point of the full log-likelihood loss function, and recovers model hyperparameters with rate $O(\frac{1}{K})$ for $K$ iterations, up to a statistical error term depending on the minibatch size. Our theoretical guarantees hold provided that the kernel functions exhibit exponential or polynomial eigendecay, which is satisfied by a wide range of kernels commonly used in GPs. Numerical studies on both simulated and real datasets demonstrate that minibatch SGD has better generalization over state-of-the-art GP methods while reducing the computational burden and opening a new, previously unexplored, data size regime for GPs.
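
The training loop analyzed here is ordinary SGD applied to the minibatch GP negative log marginal likelihood; a minimal sketch with an RBF kernel follows (constant terms omitted; all names are illustrative assumptions):

    import torch

    def gp_minibatch_nll(x, y, log_ls, log_var, log_noise):
        # Negative log marginal likelihood of a GP on one minibatch
        # (x: (m, d) inputs, y: (m,) outputs; constants dropped).
        d2 = (x[:, None, :] - x[None, :, :]).pow(2).sum(-1)
        K = log_var.exp() * torch.exp(-0.5 * d2 / log_ls.exp() ** 2)
        K = K + log_noise.exp() * torch.eye(len(x))
        L = torch.linalg.cholesky(K)
        alpha = torch.cholesky_solve(y[:, None], L)
        return 0.5 * (y[:, None].T @ alpha).squeeze() + L.diagonal().log().sum()

    params = [torch.zeros((), requires_grad=True) for _ in range(3)]
    opt = torch.optim.SGD(params, lr=1e-2)
    # for xb, yb in minibatches:
    #     opt.zero_grad(); gp_minibatch_nll(xb, yb, *params).backward(); opt.step()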

NeurIPS Conference 2022 Conference Paper

USB: A Unified Semi-supervised Learning Benchmark for Classification

  • Yidong Wang
  • Hao Chen
  • Yue Fan
  • Wang Sun
  • Ran Tao
  • Wenxin Hou
  • Renjie Wang
  • Linyi Yang

Semi-supervised learning (SSL) improves model generalization by leveraging massive unlabeled data to augment limited labeled samples. However, currently, popular SSL evaluation protocols are often constrained to computer vision (CV) tasks. In addition, previous work typically trains deep neural networks from scratch, which is time-consuming and environmentally unfriendly. To address the above issues, we construct a Unified SSL Benchmark (USB) for classification by selecting 15 diverse, challenging, and comprehensive tasks from CV, natural language processing (NLP), and audio processing (Audio), on which we systematically evaluate the dominant SSL methods, and also open-source a modular and extensible codebase for fair evaluation of these SSL methods. We further provide the pre-trained versions of the state-of-the-art neural models for CV tasks to make the cost affordable for further tuning. USB enables the evaluation of a single SSL algorithm on more tasks from multiple domains but with less cost. Specifically, on a single NVIDIA V100, only 39 GPU days are required to evaluate FixMatch on 15 tasks in USB while 335 GPU days (279 GPU days on 4 CV datasets except for ImageNet) are needed on 5 CV tasks with TorchSSL.

IJCAI Conference 2021 Conference Paper

AMA-GCN: Adaptive Multi-layer Aggregation Graph Convolutional Network for Disease Prediction

  • Hao Chen
  • Fuzhen Zhuang
  • Li Xiao
  • Ling Ma
  • Haiyan Liu
  • Ruifang Zhang
  • Huiqin Jiang
  • Qing He

Recently, Graph Convolutional Networks (GCNs) have proven to be a powerful means for Computer Aided Diagnosis (CADx). This approach requires building a population graph to aggregate structural information, where the graph adjacency matrix represents the relationship between nodes. Until now, this adjacency matrix has usually been defined manually based on phenotypic information. In this paper, we propose an encoder that automatically selects the appropriate phenotypic measures according to their spatial distribution, and uses a text similarity awareness mechanism to calculate the edge weights between nodes. The encoder can automatically construct the population graph using phenotypic measures which have a positive impact on the final results, and further realizes the fusion of multimodal information. In addition, a novel graph convolution network architecture using a multi-layer aggregation mechanism is proposed. The structure can obtain deep structure information while suppressing over-smoothing, and increases the similarity between nodes of the same type. Experimental results on two databases show that our method can significantly improve the diagnostic accuracy for Autism spectrum disorder and breast cancer, indicating its universality in leveraging multimodal data for disease prediction.

AAAI Conference 2021 System Paper

Dialog Router: Automated Dialog Transition via Multi-Task Learning

  • Ziming Huang
  • Zhuoxuan Jiang
  • Hao Chen
  • Xue Han
  • Yabin Dang

Dialog Router is a general paradigm for human-bot symbiosis dialog systems to provide friendly customer care service. It is equipped with a multi-task learning model to automatically capture the underlying correlation between multiple related tasks, i.e., dialog classification and regression, and greatly reduces human labor for system customization, which improves the accuracy of dialog transition. In addition, for learning the multi-task model, the training data and labels are easy to collect from human-to-human historical dialog logs, and the Dialog Router can be easily integrated into the majority of existing dialog systems by calling general APIs. We conduct experiments on real-world datasets for dialog classification and regression. The results show that our model achieves improvements on both tasks, which benefits the dialog transition application. The demo illustrates our method's effectiveness in a real customer care service.

NeurIPS Conference 2021 Conference Paper

Long Short-Term Transformer for Online Action Detection

  • Mingze Xu
  • Yuanjun Xiong
  • Hao Chen
  • Xinyu Li
  • Wei Xia
  • Zhuowen Tu
  • Stefano Soatto

We present the Long Short-term TRansformer (LSTR), a temporal modeling algorithm for online action detection, which employs a long- and short-term memory mechanism to model prolonged sequence data. It consists of an LSTR encoder that dynamically leverages coarse-scale historical information from an extended temporal window (e.g., 2048 frames spanning up to 8 minutes), together with an LSTR decoder that focuses on a short time window (e.g., 32 frames spanning 8 seconds) to model the fine-scale characteristics of the data. Compared to prior work, LSTR provides an effective and efficient method to model long videos with fewer heuristics, which is validated by extensive empirical analysis. LSTR achieves state-of-the-art performance on three standard online action detection benchmarks: THUMOS'14, TVSeries, and HACS Segment. Code has been made available at https://xumingze0308.github.io/projects/lstr.

NeurIPS Conference 2021 Conference Paper

NeRV: Neural Representations for Videos

  • Hao Chen
  • Bo He
  • Hanyu Wang
  • Yixuan Ren
  • Ser Nam Lim
  • Abhinav Shrivastava

We propose a novel neural representation for videos (NeRV) which encodes videos in neural networks. Unlike conventional representations that treat videos as frame sequences, we represent videos as neural networks taking the frame index as input. Given a frame index, NeRV outputs the corresponding RGB image. Video encoding in NeRV is simply fitting a neural network to video frames, and the decoding process is a simple feedforward operation. As an image-wise implicit representation, NeRV outputs the whole image and shows great efficiency compared to pixel-wise implicit representations, improving the encoding speed by $\textbf{25}\times$ to $\textbf{70}\times$ and the decoding speed by $\textbf{38}\times$ to $\textbf{132}\times$, while achieving better video quality. With such a representation, we can treat videos as neural networks, simplifying several video-related tasks. For example, conventional video compression methods are restricted by a long and complex pipeline, specifically designed for the task. In contrast, with NeRV, we can use any neural network compression method as a proxy for video compression, and achieve comparable performance to traditional frame-based video compression approaches (H.264, HEVC, etc.). Besides compression, we demonstrate the generalization of NeRV for video denoising. The source code and pre-trained model can be found at https://github.com/haochen-rye/NeRV.git.
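
A minimal sketch of the image-wise implicit representation idea (layer sizes and the positional encoding are illustrative assumptions, not the released architecture):

    import torch
    import torch.nn as nn

    class TinyNeRV(nn.Module):
        # Maps a normalized frame index t in [0, 1] to an RGB frame.
        def __init__(self, num_freqs=8, h=32, w=64):
            super().__init__()
            self.num_freqs, self.h, self.w = num_freqs, h, w
            self.mlp = nn.Sequential(
                nn.Linear(2 * num_freqs, 256), nn.GELU(),
                nn.Linear(256, 16 * (h // 4) * (w // 4)), nn.GELU())
            # Image-wise decoding: upsample a small feature map to a full frame.
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(16, 16, 4, stride=2, padding=1), nn.GELU(),
                nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid())

        def forward(self, t):  # t: (B,) frame indices in [0, 1]
            freqs = 2.0 ** torch.arange(self.num_freqs, dtype=torch.float32,
                                        device=t.device)
            enc = torch.cat([torch.sin(torch.pi * t[:, None] * freqs),
                             torch.cos(torch.pi * t[:, None] * freqs)], dim=-1)
            feat = self.mlp(enc).view(-1, 16, self.h // 4, self.w // 4)
            return self.decoder(feat)  # (B, 3, h, w)

    # "Encoding" a video = fitting the network to its frames with an MSE loss;
    # "decoding" frame t = a single forward pass.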

NeurIPS Conference 2020 Conference Paper

Backpropagating Linearly Improves Transferability of Adversarial Examples

  • Yiwen Guo
  • Qizhang Li
  • Hao Chen

The vulnerability of deep neural networks (DNNs) to adversarial examples has drawn great attention from the community. In this paper, we study the transferability of such examples, which lays the foundation of many black-box attacks on DNNs. We revisit a not so new but definitely noteworthy hypothesis of Goodfellow et al.'s and disclose that the transferability can be enhanced by improving the linearity of DNNs in an appropriate manner. We introduce linear backpropagation (LinBP), a method that performs backpropagation in a more linear fashion using off-the-shelf attacks that exploit gradients. More specifically, it calculates forward as normal but backpropagates the loss as if some nonlinear activations were not encountered in the forward pass. Experimental results demonstrate that this simple yet effective method clearly outperforms the current state of the art in crafting transferable adversarial examples on CIFAR-10 and ImageNet, leading to more effective attacks on a variety of DNNs. Code at: https://github.com/qizhangli/linbp-attack.
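
The core mechanism, computing the ReLU forward pass as usual while backpropagating through it as if it were the identity, can be sketched in a few lines (an illustrative toy, not the released code, which applies this selectively to chosen layers):

    import torch

    class LinBPReLU(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            return x.clamp(min=0)  # ordinary ReLU in the forward pass

        @staticmethod
        def backward(ctx, grad_out):
            return grad_out  # backward as if the nonlinearity were absent

    # Swapping selected ReLUs for LinBPReLU.apply and then running a standard
    # gradient-based attack yields the more linear backpropagation described above.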

NeurIPS Conference 2020 Conference Paper

Practical No-box Adversarial Attacks against DNNs

  • Qizhang Li
  • Yiwen Guo
  • Hao Chen

The study of adversarial vulnerabilities of deep neural networks (DNNs) has progressed rapidly. Existing attacks require either internal access (to the architecture, parameters, or training set of the victim model) or external access (to query the model). However, both kinds of access may be infeasible or expensive in many scenarios. We investigate no-box adversarial examples, where the attacker can neither access the model information nor the training set, nor query the model. Instead, the attacker can only gather a small number of examples from the same problem domain as that of the victim model. Such a stronger threat model greatly expands the applicability of adversarial attacks. We propose three mechanisms for training with a very small dataset (on the order of tens of examples) and find that prototypical reconstruction is the most effective. Our experiments show that adversarial examples crafted on prototypical auto-encoding models transfer well to a variety of image classification and face verification models. On a commercial celebrity recognition system held by clarifai.com, our approach significantly diminishes the average prediction accuracy of the system to only 15.40%, which is on par with the attack that transfers adversarial examples from a pre-trained Arcface model. Our code is publicly available at: https://github.com/qizhangli/nobox-attacks.

NeurIPS Conference 2020 Conference Paper

Stochastic Gradient Descent in Correlated Settings: A Study on Gaussian Processes

  • Hao Chen
  • Lili Zheng
  • Raed Al Kontar
  • Garvesh Raskutti

Stochastic gradient descent (SGD) and its variants have established themselves as the go-to algorithms for large-scale machine learning problems with independent samples, owing to their generalization performance and intrinsic computational advantage. However, with correlated samples the stochastic gradient is a biased estimator of the full gradient, which has left SGD's behavior in correlated settings poorly understood theoretically and has hindered its use in such cases. In this paper, we focus on the Gaussian process (GP) and take a step towards breaking this barrier by proving that minibatch SGD converges to a critical point of the full loss function and recovers the model hyperparameters at a rate of O(1/K), up to a statistical error term depending on the minibatch size. Numerical studies on both simulated and real datasets demonstrate that minibatch SGD generalizes better than state-of-the-art GP methods while reducing the computational burden, opening a new, previously unexplored data-size regime for GPs.
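
Concretely, each update touches only a subset of the data, so the per-step cost is cubic in the minibatch size rather than the dataset size. A minimal sketch of this loop for an RBF kernel, with the toy data and the log-parameterization of the hyperparameters as assumptions:

```python
# Minibatch SGD on GP hyperparameters: each step descends the negative log
# marginal likelihood of a random 128-point subset of 2000 points.
import torch

X = torch.linspace(0, 10, 2000)[:, None]
y = torch.sin(X).squeeze() + 0.1 * torch.randn(2000)

log_len, log_var, log_noise = (torch.zeros(()).requires_grad_() for _ in range(3))
opt = torch.optim.SGD([log_len, log_var, log_noise], lr=0.01)

def rbf(a, b):
    d2 = (a - b.T) ** 2
    return log_var.exp() * torch.exp(-0.5 * d2 / log_len.exp() ** 2)

for step in range(500):
    idx = torch.randint(0, 2000, (128,))           # the minibatch
    Xb, yb = X[idx], y[idx]
    K = rbf(Xb, Xb) + log_noise.exp() * torch.eye(128)
    L = torch.linalg.cholesky(K)
    alpha = torch.cholesky_solve(yb[:, None], L)
    # NLL up to a constant: 0.5 y^T K^{-1} y + 0.5 log det K
    nll = 0.5 * (yb[:, None] * alpha).sum() + L.diagonal().log().sum()
    opt.zero_grad(); nll.backward(); opt.step()
```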

JBHI Journal 2020 Journal Article

UD-MIL: Uncertainty-Driven Deep Multiple Instance Learning for OCT Image Classification

  • Xi Wang
  • Fangyao Tang
  • Hao Chen
  • Luyang Luo
  • Ziqi Tang
  • An-Ran Ran
  • Carol Y. Cheung
  • Pheng-Ann Heng

Deep learning has achieved remarkable success in the optical coherence tomography (OCT) image classification task when substantial labelled B-scan images are available. However, obtaining such fine-grained expert annotations is usually difficult and expensive, so leveraging volume-level labels to develop a robust classifier is very appealing. In this paper, we propose a weakly supervised deep learning framework with uncertainty estimation to address macula-related disease classification from OCT images with only volume-level labels available. First, a convolutional neural network (CNN) based instance-level classifier is iteratively refined using the proposed uncertainty-driven deep multiple instance learning scheme. To the best of our knowledge, we are the first to incorporate an uncertainty evaluation mechanism into multiple instance learning (MIL) for training a robust instance classifier. The classifier is able to detect suspicious abnormal instances and simultaneously extract the corresponding deep embeddings with high representational capability. Second, a recurrent neural network (RNN) takes instance features from the same bag as input and generates the final bag-level prediction by considering both local instance information and the globally aggregated bag-level representation. For more comprehensive validation, we built two large diabetic macular edema (DME) OCT datasets from different devices and imaging protocols to evaluate the efficacy of our method, composed of 30,151 B-scans in 1,396 volumes from 274 patients (Heidelberg-DME dataset) and 38,976 B-scans in 3,248 volumes from 490 patients (Triton-DME dataset), respectively. We compare the proposed method with state-of-the-art approaches and experimentally demonstrate that our method is superior, achieving volume-level accuracy, F1-score and area under the receiver operating characteristic curve (AUC) of 95.1%, 0.939 and 0.990 on Heidelberg-DME and 95.1%, 0.935 and 0.986 on Triton-DME, respectively. Furthermore, the proposed method also yields competitive results on another public age-related macular degeneration OCT dataset, indicating its high potential as an effective screening tool in clinical practice.
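
One common way to obtain the per-instance uncertainty such a scheme needs is Monte-Carlo dropout: run the instance classifier several times with dropout active and read the spread of its predictions. The abstract does not pin down the exact mechanism, so the sketch below is an assumption about how it could work, with toy thresholds and backbone:

```python
# Uncertainty-driven instance selection sketch (MC-dropout variant): keep
# only confident abnormal B-scans for the next refinement round.
import torch
import torch.nn as nn

classifier = nn.Sequential(nn.Linear(128, 64), nn.ReLU(),
                           nn.Dropout(0.5), nn.Linear(64, 1), nn.Sigmoid())

def mc_dropout_scores(feats, T=20):
    classifier.train()                        # keep dropout stochastic
    with torch.no_grad():
        preds = torch.stack([classifier(feats).squeeze(-1) for _ in range(T)])
    return preds.mean(0), preds.std(0)        # probability, uncertainty

bag = torch.randn(32, 128)                    # instance features of one volume
prob, unc = mc_dropout_scores(bag)
selected = (prob > 0.9) & (unc < 0.05)        # confident abnormal instances
```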

IJCAI Conference 2019 Conference Paper

DeltaDou: Expert-level Doudizhu AI through Self-play

  • Qiqi Jiang
  • Kuangzheng Li
  • Boyao Du
  • Hao Chen
  • Hai Fang

Artificial intelligence has seen several breakthroughs in two-player perfect-information games. Nevertheless, Doudizhu, a three-player imperfect-information game, is still quite challenging. In this paper, we present a Doudizhu AI trained by deep reinforcement learning from self-play games. The algorithm combines an asymmetric MCTS over each player's information-set nodes, a policy-value network that approximates the policy and value at each decision node, and inference of the other players' unobserved hands given their policies. Our results show that self-play can significantly improve the performance of our agent in this multi-agent imperfect-information game. Even starting from a weak AI, our agent can reach human expert level after days of self-play and training.

IJCAI Conference 2019 Conference Paper

Light-Weight Hybrid Convolutional Network for Liver Tumor Segmentation

  • Jianpeng Zhang
  • Yutong Xie
  • Pingping Zhang
  • Hao Chen
  • Yong Xia
  • Chunhua Shen

Automated segmentation of liver tumors in contrast-enhanced abdominal computed tomography (CT) scans is essential in assisting medical professionals to evaluate tumor development and plan treatment quickly. Although deep convolutional neural networks (DCNNs) have contributed many breakthroughs in image segmentation, this task remains challenging, since 2D DCNNs are incapable of exploiting inter-slice information and 3D DCNNs are too complex to be trained with the small datasets available. In this paper, we propose the light-weight hybrid convolutional network (LW-HCN) to segment the liver and its tumors in CT volumes. Instead of combining a 2D and a 3D network for coarse-to-fine segmentation, LW-HCN has an encoder-decoder structure in which 2D convolutions used at the bottom of the encoder decrease the complexity and 3D convolutions used in the other layers exploit both spatial and temporal information. To further reduce the complexity, we design the depthwise and spatiotemporal separate (DSTS) factorization for 3D convolutions, which not only reduces parameters dramatically but also improves performance. We evaluated the proposed LW-HCN model against several recent methods on the LiTS and 3D-IRCADb datasets and achieved Dice per case of 73.0% and 94.1%, respectively, for tumor segmentation, setting a new state of the art.
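
The parameter savings from factorizing a dense 3D convolution are easy to see in code. Below is a sketch of one plausible depthwise-spatial + depthwise-temporal + pointwise arrangement; the exact ordering inside LW-HCN may differ:

```python
# Depthwise spatiotemporal separable 3D convolution vs. a dense one:
# the factorized version has roughly 20x fewer parameters here.
import torch.nn as nn

def dsts_conv(ch, k=3):
    return nn.Sequential(
        nn.Conv3d(ch, ch, (1, k, k), padding=(0, k // 2, k // 2), groups=ch),
        nn.Conv3d(ch, ch, (k, 1, 1), padding=(k // 2, 0, 0), groups=ch),
        nn.Conv3d(ch, ch, 1),                 # pointwise channel mixing
    )

dense = nn.Conv3d(64, 64, 3, padding=1)
sep = dsts_conv(64)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dense), count(sep))               # ~110k vs ~5k parameters
```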

AAAI Conference 2019 Conference Paper

Synergistic Image and Feature Adaptation: Towards Cross-Modality Domain Adaptation for Medical Image Segmentation

  • Cheng Chen
  • Qi Dou
  • Hao Chen
  • Jing Qin
  • Pheng-Ann Heng

This paper presents a novel unsupervised domain adaptation framework, called Synergistic Image and Feature Adaptation (SIFA), to effectively tackle the problem of domain shift. Domain adaptation has become an important topic in recent studies on deep learning, aiming to recover the performance degradation that occurs when neural networks are applied to new testing domains. Our proposed SIFA is an elegant learning paradigm that presents a synergistic fusion of adaptations from both the image and feature perspectives. In particular, we simultaneously transform the appearance of images across domains and enhance the domain-invariance of the extracted features towards the segmentation task. The feature encoder layers are shared by both perspectives to grasp their mutual benefits during the end-to-end learning procedure. Without using any annotation from the target domain, the learning of our unified model is guided by adversarial losses, with multiple discriminators employed from various aspects. We have extensively validated our method on a challenging application: cross-modality medical image segmentation of cardiac structures. Experimental results demonstrate that our SIFA model recovers the degraded performance from 17.2% to 73.0% and outperforms the state-of-the-art methods by a significant margin.
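
The feature-perspective half of such a framework reduces to a familiar adversarial game: a discriminator separates source from target features while the shared encoder learns to fool it. A bare-bones sketch of that game alone, leaving out SIFA's image-appearance translation, segmentation loss, and additional discriminators:

```python
# Feature-level adversarial adaptation sketch: the shared encoder is
# pushed to make target features indistinguishable from source features.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
disc = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
bce = nn.BCEWithLogitsLoss()
opt_e = torch.optim.Adam(encoder.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)

src, tgt = torch.randn(8, 1, 64, 64), torch.randn(8, 1, 64, 64)
for _ in range(100):
    fs, ft = encoder(src), encoder(tgt)
    # Discriminator: source -> 1, target -> 0.
    d_loss = bce(disc(fs.detach()), torch.ones(8, 1)) + \
             bce(disc(ft.detach()), torch.zeros(8, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Encoder: make target features look like source features.
    g_loss = bce(disc(ft), torch.ones(8, 1))
    opt_e.zero_grad(); g_loss.backward(); opt_e.step()
```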

IJCAI Conference 2019 Conference Paper

Theoretical Investigation of Generalization Bound for Residual Networks

  • Hao Chen
  • Zhanfeng Mo
  • Zhouwang Yang
  • Xiao Wang

This paper presents a framework for norm-based capacity control with respect to an lp,q-norm in weight-normalized Residual Neural Networks (ResNets). We first formulate the representation of each residual block. For the regression problem, we analyze the Rademacher complexity of the ResNets family and establish a tighter generalization upper bound for weight-normalized ResNets in a more general setting. Using the lp,q-norm weight normalization with 1/p + 1/q >= 1, we discuss the properties of a width-independent capacity control, which depends only on the depth through a square-root term. Several comparisons suggest that our result is tighter than previous work. Parallel results for Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs) are included by introducing the lp,q-norm weight normalization for DNNs and the lp,q-norm kernel normalization for CNNs. Numerical experiments also verify that ResNet structures contribute to better generalization properties.
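
For readers unfamiliar with the notation, the entrywise lp,q-norm of a weight matrix W is commonly defined as below; the row/column convention here is the usual one and may be transposed relative to the paper:

\[
\|W\|_{p,q} \;=\; \Bigl( \sum_{j} \Bigl( \sum_{i} |W_{ij}|^{p} \Bigr)^{q/p} \Bigr)^{1/q},
\qquad \frac{1}{p} + \frac{1}{q} \ge 1 .
\]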

IROS Conference 2019 Conference Paper

Towards More Realistic Human-Robot Conversation: A Seq2Seq-based Body Gesture Interaction System

  • Minjie Hua
  • Fuyuan Shi
  • Yibing Nan
  • Kai Wang 0012
  • Hao Chen
  • Shiguo Lian

This paper presents a novel system that enables intelligent robots to exhibit realistic body gestures while communicating with humans. The proposed system consists of a listening model and a speaking model used in the corresponding conversational phases. Both models are adapted from the sequence-to-sequence (seq2seq) architecture and synthesize body gestures represented by the movements of twelve upper-body keypoints. All extracted 2D keypoints are first 3D-transformed, then rotated and normalized to discard irrelevant information. A substantial number of human-conversation videos from YouTube are collected and preprocessed to train the listening and speaking models separately, after which the two models are evaluated on the test dataset using mean squared error (MSE) and cosine similarity. The tuned system is implemented to drive a virtual avatar as well as Pepper, a physical humanoid robot, demonstrating in practice how our method improves conversational interaction abilities.
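
At its core, each model maps one keypoint sequence to another. A compact sketch with a GRU encoder and decoder, assuming the twelve 3D keypoints are flattened to 36-dimensional frames and teacher forcing is used during training; sequence lengths and hidden sizes are illustrative:

```python
# Seq2seq gesture model sketch: observed motion in, responding motion out.
import torch
import torch.nn as nn

class GestureSeq2Seq(nn.Module):
    def __init__(self, dim=36, hidden=128):
        super().__init__()
        self.enc = nn.GRU(dim, hidden, batch_first=True)
        self.dec = nn.GRU(dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, dim)

    def forward(self, src, tgt):
        _, h = self.enc(src)                  # summarize the observed motion
        dec_out, _ = self.dec(tgt, h)         # teacher-forced decoding
        return self.out(dec_out)

model = GestureSeq2Seq()
src = torch.randn(4, 50, 36)                  # observed keypoint frames
tgt = torch.randn(4, 50, 36)                  # gesture frames to generate
pred = model(src, tgt)
loss = nn.functional.mse_loss(pred, tgt)      # MSE, as in the evaluation
```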

AAAI Conference 2018 Conference Paper

LSTD: A Low-Shot Transfer Detector for Object Detection

  • Hao Chen
  • Yali Wang
  • Guoyou Wang
  • Yu Qiao

Recent advances in object detection are mainly driven by deep learning with large-scale detection benchmarks. However, the fully-annotated training set is often limited for a target detection task, which may deteriorate the performance of deep detectors. To address this challenge, we propose a novel low-shot transfer detector (LSTD), which leverages rich source-domain knowledge to construct an effective target-domain detector with very few training examples. The main contributions are as follows. First, we design a flexible deep architecture for LSTD to alleviate transfer difficulties in low-shot detection; this architecture integrates the advantages of both SSD and Faster RCNN in a unified deep framework. Second, we introduce a novel regularized transfer learning framework for low-shot detection, in which transfer knowledge (TK) and background depression (BD) regularizations leverage object knowledge from the source and target domains, respectively, to further enhance fine-tuning with a few target images. Finally, we examine LSTD on a number of challenging low-shot detection experiments, where it outperforms other state-of-the-art approaches. The results demonstrate that LSTD is a preferable deep detector for low-shot scenarios.
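
The abstract does not spell out the BD regularizer's exact form; one plausible reading, sketched below purely as an illustration, is an L2 penalty on feature activations that fall outside the ground-truth object regions of the few target images:

```python
# Hypothetical background-depression regularizer: suppress activations
# outside the annotated object boxes. Mask construction is an assumption.
import torch

def bd_regularizer(feat, object_mask):
    # feat: (B, C, H, W); object_mask: (B, 1, H, W) with 1 inside boxes.
    background = feat * (1.0 - object_mask)
    return background.pow(2).mean()

feat = torch.randn(2, 256, 32, 32, requires_grad=True)
mask = torch.zeros(2, 1, 32, 32)
mask[:, :, 8:24, 8:24] = 1.0                  # toy ground-truth box region
loss = bd_regularizer(feat, mask)
loss.backward()
```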

AAAI Conference 2018 Conference Paper

SFCN-OPI: Detection and Fine-Grained Classification of Nuclei Using Sibling FCN With Objectness Prior Interaction

  • Yanning Zhou
  • Qi Dou
  • Hao Chen
  • Jing Qin
  • Pheng-Ann Heng

Cell nuclei detection and fine-grained classification have been fundamental yet challenging problems in histopathology image analysis. Due to the tiny size of nuclei, significant inter-/intra-class variances, and inferior image quality, previous automated methods easily suffer from limited accuracy and robustness. Meanwhile, existing approaches usually tackle these two tasks independently, neglecting their close relatedness. In this paper, we present a novel sibling fully convolutional network with objectness prior interaction (SFCN-OPI) to tackle the two tasks simultaneously and interactively in a unified end-to-end framework. Specifically, the sibling FCN branches share features in the earlier layers while holding respective higher layers for their specific tasks. More importantly, the detection branch outputs an objectness prior which dynamically interacts with the fine-grained classification branch during training and testing. With this mechanism, the fine-grained classification focuses on regions with high confidence of nucleus existence and outputs conditional probabilities, which in turn benefit the detection through back-propagation. Extensive experiments on colon cancer histology images have validated the effectiveness of the proposed SFCN-OPI, and our method outperforms the state-of-the-art methods by a large margin.
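
Read as conditional probabilities, the interaction suggests a simple fusion: the class posterior at each pixel is the class probability given a nucleus, weighted by the objectness prior. The product gating below is an assumed concrete form of that interaction, not necessarily the paper's exact mechanism:

```python
# Objectness-prior gating sketch: trust class probabilities only where a
# nucleus is likely present. The product fusion is an illustrative choice.
import torch

objectness = torch.sigmoid(torch.randn(1, 1, 64, 64))   # P(nucleus) per pixel
class_logits = torch.randn(1, 4, 64, 64)                 # 4 nucleus types
class_probs = torch.softmax(class_logits, dim=1)         # P(class | nucleus)
joint = objectness * class_probs                         # gated posterior map
```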

IJCAI Conference 2018 Conference Paper

Unsupervised Cross-Modality Domain Adaptation of ConvNets for Biomedical Image Segmentations with Adversarial Loss

  • Qi Dou
  • Cheng Ouyang
  • Cheng Chen
  • Hao Chen
  • Pheng-Ann Heng

Convolutional networks (ConvNets) have achieved great success in various challenging vision tasks. However, the performance of ConvNets degrades when encountering domain shift. Domain adaptation is especially significant, yet challenging, in the field of biomedical image analysis, where cross-modality data have largely different distributions. Given that annotating medical data is especially expensive, supervised transfer learning approaches are not quite optimal. In this paper, we propose an unsupervised domain adaptation framework with adversarial learning for cross-modality biomedical image segmentation. Specifically, our model is based on a dilated fully convolutional network for pixel-wise prediction. Moreover, we build a plug-and-play domain adaptation module (DAM) to map the target input to features aligned with the source-domain feature space. A domain critic module (DCM) is set up to discriminate between the feature spaces of the two domains. We optimize the DAM and DCM via an adversarial loss without using any target-domain labels. Our proposed method is validated by adapting a ConvNet trained on MRI images to unpaired CT data for cardiac structure segmentation, and achieves very promising results.

JBHI Journal 2017 Journal Article

Integrating Online and Offline Three-Dimensional Deep Learning for Automated Polyp Detection in Colonoscopy Videos

  • Lequan Yu
  • Hao Chen
  • Qi Dou
  • Jing Qin
  • Pheng Ann Heng

Automated polyp detection in colonoscopy videos has been demonstrated to be a promising way to aid colorectal cancer prevention and diagnosis. Traditional manual screening is time consuming, operator dependent, and error prone; hence, automated detection approaches are in high demand in clinical practice. However, automated polyp detection is very challenging due to high intraclass variations in polyp size, color, shape, and texture, and low interclass variations between polyps and hard mimics. In this paper, we propose a novel offline and online three-dimensional (3-D) deep learning integration framework that leverages the 3-D fully convolutional network (3D-FCN) to tackle this challenging problem. Compared with previous methods employing hand-crafted features or 2-D convolutional neural networks, the 3D-FCN is capable of learning more representative spatio-temporal features from colonoscopy videos and hence has more powerful discrimination capability. More importantly, we propose a novel online learning scheme to deal with the problem of limited training data by harnessing the specific information of an input video during the learning process. We integrate offline and online learning to effectively reduce the number of false positives generated by the offline network and further improve detection performance. Extensive experiments on the dataset of the MICCAI 2015 Challenge on Polyp Detection demonstrated the better performance of our method compared with other competitors.

AAAI Conference 2017 Conference Paper

Volumetric ConvNets with Mixed Residual Connections for Automated Prostate Segmentation from 3D MR Images

  • Lequan Yu
  • Xin Yang
  • Hao Chen
  • Jing Qin
  • Pheng Ann Heng

Automated prostate segmentation from 3D MR images is very challenging due to large variations in prostate shape and indistinct prostate boundaries. We propose a novel volumetric convolutional neural network (ConvNet) with mixed residual connections to cope with this challenging problem. Compared with previous methods, our volumetric ConvNet has two compelling advantages. First, it is implemented in a 3D manner and can fully exploit the 3D spatial contextual information of the input data to perform efficient, precise, volume-to-volume prediction. Second, and more importantly, the novel combination of residual connections (i.e., long and short) greatly improves the training efficiency and discriminative capability of our network by enhancing information propagation within the ConvNet both locally and globally: the forward propagation of location information improves segmentation accuracy, while the smooth backward propagation of gradient flow accelerates convergence and enhances discrimination capability. Extensive experiments on the open MICCAI PROMISE12 challenge dataset corroborated the effectiveness of the proposed volumetric ConvNet with mixed residual connections. Our method ranked first in the challenge, outperforming the other competitors by a large margin with respect to most evaluation metrics. The proposed volumetric ConvNet is general and can easily be extended to other medical image analysis tasks, especially ones with limited training data.
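
The long/short mix is easy to picture in miniature: short residuals wrap each block, while a long skip carries encoder features to the decoder. The toy network below illustrates the wiring only; depths and channel counts are assumptions:

```python
# Mixed residual connections sketch: short (local) skips inside blocks,
# a long (global) skip from encoder to decoder.
import torch
import torch.nn as nn

class ShortResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv3d(ch, ch, 3, padding=1), nn.ReLU(),
                                  nn.Conv3d(ch, ch, 3, padding=1))
    def forward(self, x):
        return torch.relu(x + self.conv(x))   # short residual

class TinyMixedResNet(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.stem = nn.Conv3d(1, ch, 3, padding=1)
        self.enc, self.dec = ShortResBlock(ch), ShortResBlock(ch)
        self.head = nn.Conv3d(ch, 2, 1)
    def forward(self, x):
        e = self.enc(self.stem(x))
        d = self.dec(e)
        return self.head(d + e)               # long residual

logits = TinyMixedResNet()(torch.randn(1, 1, 16, 32, 32))
```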

AAAI Conference 2016 Conference Paper

Deep Contextual Networks for Neuronal Structure Segmentation

  • Hao Chen
  • Xiao Qi
  • Jie Cheng
  • Pheng Heng

The goal of connectomics is to map the interconnections of the nervous system from electron microscopy (EM) images. However, the formidable size of EM image data renders human annotation impractical, as it could take decades to complete. An alternative way to reconstruct the connectome is a computerized scheme that automatically segments the neuronal structures. The segmentation of EM images is very challenging, as the depicted structures can be very diverse. To address this difficult problem, we propose a deep contextual network that leverages multi-level contextual information from the deep hierarchical structure to achieve better segmentation performance. To further improve robustness against vanishing gradients and strengthen the back-propagation of gradient flow, auxiliary classifiers are incorporated into the architecture of our deep neural network. We show that our method can effectively parse the semantic meaning of the images with the underlying neural network and accurately delineate structural boundaries with reference to low-level contextual cues. Experimental results on the benchmark dataset of the 2012 ISBI segmentation challenge of neuronal structures show that the proposed method outperforms state-of-the-art methods by a large margin with respect to different evaluation measurements. Our method can potentially facilitate automatic connectome analysis from EM images with less human intervention.
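
Auxiliary classifiers of this kind are a standard deep-supervision device: intermediate layers get side outputs with their own losses, so gradients reach early layers directly. A minimal sketch, with layer depths and loss weights as assumptions:

```python
# Deep supervision sketch: auxiliary heads on intermediate features add
# extra loss terms that shorten the gradient path to early layers.
import torch
import torch.nn as nn

backbone = nn.ModuleList([nn.Conv2d(c_in, c_out, 3, padding=1)
                          for c_in, c_out in [(1, 16), (16, 32), (32, 64)]])
aux_heads = nn.ModuleList([nn.Conv2d(c, 2, 1) for c in (16, 32)])
main_head = nn.Conv2d(64, 2, 1)

x = torch.randn(1, 1, 64, 64)
target = torch.randint(0, 2, (1, 64, 64))
loss, feats = 0.0, x
for i, layer in enumerate(backbone):
    feats = torch.relu(layer(feats))
    if i < 2:                                  # auxiliary supervision
        loss = loss + 0.3 * nn.functional.cross_entropy(aux_heads[i](feats), target)
loss = loss + nn.functional.cross_entropy(main_head(feats), target)
loss.backward()
```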

AAAI Conference 2016 Conference Paper

Mitosis Detection in Breast Cancer Histology Images via Deep Cascaded Networks

  • Hao Chen
  • Qi Dou
  • Xi Wang
  • Jing Qin
  • Pheng Heng

The number of mitoses per tissue area gives an important indication of the aggressiveness of invasive breast carcinoma. However, automatic mitosis detection in histology images remains a challenging problem. Traditional methods either employ hand-crafted features to discriminate mitoses from other cells or construct a pixel-wise classifier that labels every pixel in a sliding-window fashion. While the former suffers from the large shape variation of mitoses and the existence of many mimics with similar appearance, the slow speed of the latter prohibits its use in clinical practice. To overcome these shortcomings, we propose a fast and accurate mitosis detection method built on a novel deep cascaded convolutional neural network composed of two components. First, leveraging a fully convolutional neural network, we propose a coarse retrieval model that identifies and locates mitosis candidates while preserving high sensitivity. Based on these candidates, a fine discrimination model utilizing knowledge transferred across domains is developed to further single out mitoses from hard mimics. Our approach outperformed the other methods by a large margin in the 2014 ICPR MITOS-ATYPIA challenge in terms of detection accuracy. When compared with the state-of-the-art methods on the 2012 ICPR MITOSIS data (a smaller and less challenging dataset), our method achieved comparable or better results at roughly 60× faster speed.
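
The cascade's speed comes from scoring every location with one cheap fully convolutional pass and reserving the heavier classifier for the few surviving candidates. A toy sketch of that control flow, with thresholds, patch size, and networks as placeholder assumptions:

```python
# Coarse-to-fine cascade sketch: dense cheap scores -> candidate crops ->
# finer rescreening of only those patches.
import torch
import torch.nn as nn

coarse = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                       nn.Conv2d(8, 1, 1))           # dense mitosis scores
fine = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2))

image = torch.rand(1, 3, 256, 256)
scores = torch.sigmoid(coarse(image))[0, 0]
ys, xs = torch.nonzero(scores > 0.6, as_tuple=True)   # candidate centers
for y, x in list(zip(ys.tolist(), xs.tolist()))[:5]:
    y0, x0 = max(0, y - 16), max(0, x - 16)
    patch = image[:, :, y0:y0 + 32, x0:x0 + 32]
    if patch.shape[-2:] == (32, 32):                  # skip border crops
        logits = fine(patch)                          # hard-mimic screening
```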

JBHI Journal 2015 Journal Article

Standard Plane Localization in Fetal Ultrasound via Domain Transferred Deep Neural Networks

  • Hao Chen
  • Dong Ni
  • Jing Qin
  • Shengli Li
  • Xin Yang
  • Tianfu Wang
  • Pheng Ann Heng

Automatic localization of a standard plane containing complicated anatomical structures in ultrasound (US) videos remains a challenging problem. In this paper, we present a learning-based approach to locate the fetal abdominal standard plane (FASP) in US videos by constructing a domain-transferred deep convolutional neural network (CNN). Compared with previous works based on low-level features, our approach is able to represent the complicated appearance of the FASP and hence achieves better classification performance. More importantly, to reduce the overfitting caused by the small number of training samples, we propose a transfer learning strategy that transfers the knowledge in the low layers of a base CNN, trained on a large database of natural images, to our task-specific CNN. Extensive experiments demonstrate that our approach outperforms the state-of-the-art method for FASP localization as well as a CNN trained only on the limited US training samples. The proposed approach can easily be extended to other similar medical image computing problems, which often suffer from insufficient training samples when exploiting deep CNNs to represent high-level features.
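
The transfer recipe itself is a few lines in a modern framework: copy a pretrained network, freeze its low layers, and retrain the top on the small dataset. The sketch below uses torchvision's resnet18 as a stand-in base network (the paper predates it) and assumes a binary FASP / non-FASP head:

```python
# Low-layer transfer sketch: keep pretrained early layers, retrain the top.
import torch.nn as nn
from torchvision import models

net = models.resnet18(weights="IMAGENET1K_V1")
for name, p in net.named_parameters():
    if not name.startswith(("layer4", "fc")):
        p.requires_grad = False                # freeze transferred low layers
net.fc = nn.Linear(net.fc.in_features, 2)      # FASP vs. non-FASP head
trainable = [p for p in net.parameters() if p.requires_grad]
```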

ICRA Conference 2010 Conference Paper

Design and analysis of a soft mobile robot composed of multiple thermally activated joints driven by a single actuator

  • Nadia Cheng
  • Genya Ishigami
  • Stephan Hawthorne
  • Hao Chen
  • Malik Hansen
  • Maria J. Telleria
  • Robert Playter
  • Karl Iagnemma

Soft robotic systems are of interest for industrial, medical, and security applications, many of which require the robots to be small and lightweight. One challenge in developing a soft robotic system is to drive multiple degrees of freedom (DOF) with few actuators, thereby reducing system size and weight. This paper presents the analysis and design of an inchworm-like mobile robot that consists of multiple, independent thermally activated joints yet is driven by a single actuator. To realize control of this under-actuated system, a solder-based locking mechanism has been developed to selectively activate individual joints without requiring additional actuators. The design and performance analysis of a prototype mobile robot capable of inchworm-like translational and steering motion is described, along with the design of novel “feet” with anisotropic friction properties.

ICRA Conference 2001 Conference Paper

The Switched Reluctance Motor Drive for the Direct-Drive Joint of the Robot

  • Hao Chen
  • Dong Zhang

The paper presents the principle of decoupling control of the phase voltage in a switched reluctance motor drive for the direct-drive joint of a robot. The elements of the motor drive system are described: the structure and rotor position of the three-phase 6/10 switched reluctance motor, the main circuit topology of the three-phase bifilar-winding power converter, and the pulse-width modulation control strategy. The mathematical models of the main circuit of the power converter are also presented. The optimum range of the turn-on and turn-off angles of the main switches in the power converter is given by the criterion of reducing the pulsation of the output torque, using a 2D finite element electromagnetic field calculation of the motor and a nonlinear simulation of the main circuit of the power converter with the control strategy.