Arrow Research search

Author name cluster

Hao Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

146 papers
2 author rows

Possible papers (146)

AAAI Conference 2026 Conference Paper

Active Multi-source Domain Adaptation for Multimodal Fake News Detection

  • Yanping Chen
  • Weijie Shi
  • Mengze Li
  • Yue Cui
  • Jiaming Li
  • Ruiyuan Zhang
  • Hao Chen
  • Hanghui Guo

Multimodal fake news detection plays a crucial role in combating online misinformation. The inherent domain diversity of news in the real world has driven the development of cross-domain detection methods. However, these detection methods either suffer from significant performance degradation due to semantic and deception pattern shifts between the training (source) and test (target) domains or heavily rely on annotated labels. To address these problems, we propose ADOSE, an active multi-source domain adaptation framework for multimodal fake news detection which actively annotates a small subset of target samples to improve detection performance. Specifically, for domain shifts, we design a multi-expert classifier network based on refined features to comprehensively capture and adapt to the semantic space and deception patterns of news across different domains. To maximize adaptation performance with limited annotation cost, we propose a least-disagree uncertainty selector equipped with a diversity calculator for selecting the most informative samples. The selector leverages the uncertainty of inconsistent predictions before and after perturbations by multiple classifiers as an indicator of unfamiliar samples. It further incorporates diversity scores derived from multi-view features to ensure the chosen samples achieve maximal coverage of target domain features. Extensive experiments on multiple datasets show that ADOSE outperforms existing domain adaptation methods by 2.45%–9.1%, indicating the superiority of our model.
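
A minimal sketch of the selection heuristic the abstract describes: score unlabeled target samples by how often an ensemble of classifiers flips its prediction under a small input perturbation, then mix in a greedy diversity term. The scoring rule, helper names, and toy classifiers below are illustrative assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_ensemble(X, classifiers):
    """Hard predictions from each classifier, stacked: (n_clf, n_samples)."""
    return np.stack([clf(X) for clf in classifiers])

def select_informative(X, classifiers, budget, sigma=0.05, lam=0.5):
    preds = predict_ensemble(X, classifiers)
    preds_pert = predict_ensemble(X + sigma * rng.standard_normal(X.shape), classifiers)
    # Uncertainty: fraction of classifiers whose prediction flips under perturbation.
    uncertainty = (preds != preds_pert).mean(axis=0)
    chosen = []
    for _ in range(budget):
        if chosen:  # diversity: distance to the nearest already-chosen sample
            d = np.min(np.linalg.norm(X[:, None] - X[chosen][None], axis=-1), axis=1)
        else:
            d = np.ones(len(X))
        score = lam * uncertainty + (1 - lam) * d / (d.max() + 1e-9)
        score[chosen] = -np.inf              # never re-select a sample
        chosen.append(int(score.argmax()))
    return chosen

# Toy usage: four "classifiers" thresholding random linear scores.
X = rng.standard_normal((200, 8))
clfs = [lambda Z, w=rng.standard_normal(8): (Z @ w > 0).astype(int) for _ in range(4)]
print(select_informative(X, clfs, budget=5))
```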

AAAI Conference 2026 Conference Paper

AIR-DR: Adaptive Image Retargeting with Instance Relocation and Dual-guidance Repainting

  • Zhitong Dong
  • Chao Li
  • Yongjian Deng
  • Hao Chen

Image retargeting aims to adjust the aspect ratio of images to accommodate various display devices. While existing methods consider both foreground semantics and background inpainting, their seam-carving-based framework is inherently destructive, often compromising the structural integrity of foreground instances. Furthermore, conventional inpainting models struggle to achieve pixel-level accuracy with global-only guidance, leading to local inconsistencies and background distortions. To address these challenges, we reformulate image retargeting as an instance-level re-layout task. Through Adaptive Instance Relocation and Dual-guidance Repainting (AIR-DR), our method preserves the structural integrity of the foreground and recovers the background with consistent details. Additionally, we introduce an adaptive retargeting decision that maintains robustness across challenging retargeting scenarios and arbitrary aspect ratios. Extensive experiments on multiple public datasets across various aspect ratios demonstrate that our approach consistently outperforms existing methods in both objective metrics and subjective evaluations. Comprehensive ablation studies further validate the effectiveness of each component.

AAAI Conference 2026 Conference Paper

An Invariant Latent Space Perspective on Language Model Inversion

  • Wentao Ye
  • Jiaqi Hu
  • Haobo Wang
  • Xinpeng Ti
  • Zhiqing Xiao
  • Hao Chen
  • Liyao Li
  • Lei Feng

Language model inversion (LMI), i.e., recovering hidden prompts from outputs, has emerged as a concrete threat to user privacy and system security. We recast LMI as reusing the LLM's own latent space and propose the Invariant Latent Space Hypothesis (ILSH): (1) diverse outputs from the same source prompt should preserve consistent semantics (source invariance), and (2) input–output cyclic mappings should be self-consistent within a shared latent space (cyclic invariance). Accordingly, we present Inv2A, which treats the LLM as an invariant decoder and learns only a lightweight inverse encoder that maps outputs to a denoised pseudo-representation. When multiple outputs are available, they are sparsely concatenated at the representation layer to increase information density. Training proceeds in two stages: contrastive alignment (source invariance) and supervised reinforcement (cyclic invariance). An optional training-free neighborhood search can refine local performance. Across 9 datasets covering user and system prompt scenarios, Inv2A outperforms baselines by an average of 4.77% in BLEU score while reducing dependence on large inverse corpora. Our analysis further shows that prevalent defenses provide limited protection, underscoring the need for stronger strategies.
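
A generic sketch of the first training stage described above (source invariance): pull representations of outputs generated from the same prompt together with an InfoNCE contrastive loss. This is a standard formulation assumed for illustration; the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.07):
    """z_a[i] and z_b[i] embed two different outputs of the same source prompt."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.T / temperature   # (batch, batch) cosine similarities
    targets = torch.arange(len(z_a))     # matching pairs lie on the diagonal
    return F.cross_entropy(logits, targets)

print(float(info_nce(torch.randn(16, 256), torch.randn(16, 256))))
```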

AAAI Conference 2026 Conference Paper

ConSurv: Multimodal Continual Learning for Survival Analysis

  • Dianzhi Yu
  • Conghao Xiong
  • Yankai Chen
  • Wenqian Cui
  • Xinni Zhang
  • Yifei Zhang
  • Hao Chen
  • Joseph J. Y. Sung

Survival prediction of cancers is crucial for clinical practice, as it informs mortality risks and influences treatment plans. However, a static model trained on a single dataset fails to adapt to the dynamically evolving clinical environment and continuous data streams, limiting its practical utility. While continual learning (CL) offers a solution to learn dynamically from new datasets, existing CL methods primarily focus on unimodal inputs and suffer from severe catastrophic forgetting in survival prediction. In real-world scenarios, multimodal inputs often provide comprehensive and complementary information, such as whole slide images and genomics; and neglecting inter-modal correlations negatively impacts the performance. To address the two challenges of catastrophic forgetting and complex inter-modal interactions between gigapixel whole slide images and genomics, we propose ConSurv, the first multimodal continual learning (MMCL) method for survival analysis. ConSurv incorporates two key components: Multi-staged Mixture of Experts (MS-MoE) and Feature Constrained Replay (FCR). MS-MoE captures both task-shared and task-specific knowledge at different learning stages of the network, including two modality encoders and the modality fusion component, learning inter-modal relationships. FCR further enhances learned knowledge and mitigates forgetting by restricting feature deviation of previous data at different levels, including encoder-level features of two modalities and the fusion-level representations. Additionally, we introduce a new benchmark integrating four datasets, Multimodal Survival Analysis Incremental Learning (MSAIL), for comprehensive evaluation in the CL setting. Extensive experiments demonstrate that ConSurv outperforms competing methods across multiple metrics.

AAAI Conference 2026 Conference Paper

Is Your (Reasoning) Multimodal Language Model Vulnerable Toward Distractions?

  • Ming Liu
  • Hao Chen
  • Jindong Wang
  • Liwen Wang
  • Jingchen Sun
  • Wensheng Zhang

Vision-Language Models (VLMs) have achieved success in tasks such as visual question answering, yet their resilience to distractions remains underexplored. Understanding how distractions affect VLMs' performance is crucial for real-world applications, as input data often contains noisy or irrelevant content. This paper assesses the robustness of VLMs—including general-purpose models and those specialized for reasoning—against distractions in the context of science question answering. We introduce I-ScienceQA, a new benchmark based on the ScienceQA dataset, which systematically injects distractions into both visual and textual contexts. We evaluate how distractions perturb the underlying reasoning processes of these models by analyzing changes in textual explanations leading to answers. Our findings show that most VLMs are vulnerable to distractions, with a noticeable degradation in reasoning when extraneous content is present. However, some models (including GPT-o4 mini) exhibit a higher degree of robustness. We also observe that textual distractions generally cause greater performance declines than visual distractions. Finally, we explore mitigation strategies such as prompt engineering. Although these strategies improve resilience modestly, our analysis highlights considerable room for further improvement in the robustness of VLMs.

AAAI Conference 2026 Conference Paper

Knowledge-Enhanced Explainable Prompting for Vision-Language Models

  • Yequan Bie
  • Andong Tan
  • Zhixuan Chen
  • Zhiyuan Cai
  • Luyang Luo
  • Hao Chen

Large-scale vision-language models (VLMs) embedded with expansive representations and visual concepts have showcased significant potential in image and text understanding. Efficiently adapting VLMs such as CLIP to downstream tasks like few-shot image classification has garnered growing attention, with prompt learning emerging as a representative approach. However, most existing prompt-based adaptation methods, which rely solely on coarse-grained textual prompts, suffer from limited performance and interpretability when handling domain tasks that require specific knowledge. This results in a failure to satisfy the stringent trustworthiness requirements of Explainable Artificial Intelligence (XAI) in high-risk scenarios like healthcare. To address this issue, we propose a Knowledge-Enhanced Explainable Prompting (KEEP) framework that leverages fine-grained domain-specific knowledge to enhance the adaptation process of VLMs across various domains and image modalities. By incorporating retrieval augmented generation and domain foundation models, our framework can provide more reliable image-wise knowledge for prompt learning in various domains, alleviating the lack of fine-grained annotations, while offering both visual and textual explanations. Extensive experiments and explainability analyses conducted on eight datasets of different domains and image modalities demonstrate that our method simultaneously achieves superior performance and interpretability, highlighting the effectiveness of the collaboration between foundation models and XAI.

TMLR Journal 2026 Journal Article

Learning from Online Videos at Inference Time for Computer-Use Agents

  • Yujian Liu
  • Ze Wang
  • Hao Chen
  • Ximeng Sun
  • Xiaodong Yu
  • Jialian Wu
  • Jiang Liu
  • Emad Barsoum

Computer-use agents can operate computers and automate laborious tasks, but despite recent rapid progress, they still lag behind human users, especially when tasks require domain-specific procedural knowledge about particular applications, platforms, and multi-step workflows. Humans can bridge this gap by watching video tutorials: we search, skim, and selectively imitate short segments that match our current subgoal. In this paper, we study how to enable computer-use agents to learn from online videos at inference time effectively. We propose a framework that retrieves and filters tutorial videos, converts them into structured demonstration trajectories, and dynamically selects trajectories as in-context guidance during execution. Particularly, using a VLM, we infer UI actions, segment videos into short subsequences of actions, and assign each subsequence a textual objective. At inference time, a two-stage selection mechanism dynamically chooses a single trajectory to add in context at each step, focusing the agent on the most helpful local guidance for its next decision. Experiments on two widely used benchmarks show that our framework consistently outperforms strong base agents and variants that use only textual tutorials or transcripts. Analyses highlight the importance of trajectory segmentation and selection, action filtering, and visual information, suggesting that abundant online videos can be systematically distilled into actionable guidance that improves computer-use agents at inference time.

AAAI Conference 2026 Conference Paper

LLM Collaborative Filtering: User-Item Graph as New Language

  • Huachi Zhou
  • Yujing Zhang
  • Hao Chen
  • Qinggang Zhang
  • Qijie Shen
  • Feiran Huang
  • Xiao Huang

In collaborative filtering, learning effective embeddings for users and items from interaction data remains a central challenge. While recent efforts leverage large language models (LLMs) to enhance collaborative filtering, two critical limitations persist: (1) Efficiency: LLM-based inference is significantly slower than traditional embedding-based search; and (2) Topological Modeling: LLMs struggle to capture graph structures, which are essential for modeling multi-order user-item interactions. To address these limitations, we propose New Language Collaborative Filtering (NLCF), a framework that aligns LLMs with collaborative filtering by conceptualizing user-item graphs as new languages. This approach is based on two key insights: (1) LLMs excel at mastering new languages when trained on suitable corpora, and (2) the empirical conditional probability between tokens in corpora converges to the transition probabilities between nodes in graphs. NLCF translates user-item graphs into corpora, where users and items are treated as tokens. These corpora are used to fine-tune LLMs, and the learned representations are aggregated to construct user and item embeddings that encode multi-order interactions. Unlike methods that deploy LLMs for inference, NLCF distills LLM knowledge learned from corpora into compact embeddings, enabling both efficient training and real-time inference. The framework has been deployed on a billion-scale e-commerce platform for several months. Extensive experiments demonstrate that NLCF outperforms traditional graph CF models and LLM-based baselines while achieving significant training and inference efficiency improvement over LLM-based baselines.
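
A minimal sketch of the "graph as language" idea behind NLCF: turn a user-item interaction graph into token sequences via random walks, so that token co-occurrence statistics mirror node transition probabilities, as the abstract's second insight states. The walk parameters and helper names are assumptions for illustration, not the deployed pipeline.

```python
import random

interactions = {                     # user -> clicked items (toy bipartite graph)
    "u1": ["i1", "i2"],
    "u2": ["i2", "i3"],
    "u3": ["i1", "i3"],
}
adj = {}                             # undirected adjacency over users and items
for u, items in interactions.items():
    for i in items:
        adj.setdefault(u, []).append(i)
        adj.setdefault(i, []).append(u)

def random_walk_corpus(adj, walks_per_node=10, walk_len=8, seed=0):
    rng = random.Random(seed)
    corpus = []
    for start in adj:
        for _ in range(walks_per_node):
            node, sentence = start, [start]
            for _ in range(walk_len - 1):
                node = rng.choice(adj[node])    # uniform transition probability
                sentence.append(node)
            corpus.append(" ".join(sentence))   # one "sentence" of graph tokens
    return corpus

print(random_walk_corpus(adj)[0])   # e.g. "u1 i2 u2 i3 ..." — text for LLM tuning
```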

AAAI Conference 2026 Conference Paper

ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks

  • Kaijun Wang
  • Liqin Lu
  • Mingyu Liu
  • Jianuo Jiang
  • Zeju Li
  • Bolin Zhang
  • Wancai Zheng
  • Xinyi Yu

Language-guided long-horizon mobile manipulation has long been a grand challenge in embodied semantic reasoning, generalizable manipulation, and adaptive locomotion. Three fundamental limitations hinder progress: First, although large language models have shown promise in enhancing spatial reasoning and task planning through learned semantic priors, existing implementations remain confined to tabletop scenarios, failing to address the constrained perception and limited actuation ranges characteristic of mobile platforms. Second, current manipulation strategies exhibit insufficient generalization when confronted with the diverse object configurations encountered in open-world environments. Third, while crucial for practical deployment, the dual requirement of maintaining high platform maneuverability alongside precise end-effector control in unstructured settings remains understudied in the literature. In this work, we present ODYSSEY, a unified mobile manipulation framework for agile quadruped robots equipped with manipulators, which seamlessly integrates high-level task planning with low-level whole-body control. To address the challenge of egocentric perception in language-conditioned tasks, we introduce a hierarchical planner powered by a vision-language model, enabling long-horizon instruction decomposition and precise action execution. At the control level, our novel whole-body policy achieves robust coordination of locomotion and manipulation across challenging terrains. We further present the first comprehensive benchmark for long-horizon mobile manipulation, evaluating diverse indoor and outdoor scenarios. Through successful sim-to-real transfer, we demonstrate the system’s generalization and robustness in real-world deployments, underscoring the practicality of legged manipulators in unstructured environments. Our work advances the feasibility of generalized robotic assistants capable of complex, dynamic tasks.

AAAI Conference 2026 Conference Paper

Reinforced Rate Control for Neural Video Compression via Inter-Frame Rate–Distortion Awareness

  • Wuyang Cong
  • Junqi Shi
  • Lizhong Wang
  • Weijing Shi
  • Ming Lu
  • Hao Chen
  • Zhan Ma

Neural video compression (NVC) has demonstrated superior compression efficiency, yet effective rate control remains a significant challenge due to complex temporal dependencies. Existing rate control schemes typically leverage frame content to capture distortion interactions, overlooking inter-frame rate dependencies arising from shifts in per-frame coding parameters. This often leads to suboptimal bitrate allocation and cascading parameter decisions. To address this, we propose a reinforcement-learning (RL)-based rate control framework that formulates the task as a frame-by-frame sequential decision process. At each frame, an RL agent observes a spatiotemporal state and selects coding parameters to optimize a long-term reward that reflects rate-distortion (R-D) performance and bitrate adherence. Unlike prior methods, our approach jointly determines bitrate allocation and coding configuration in a single step, independent of group-of-pictures (GOP) structure. Extensive experiments across diverse NVC architectures show that our method reduces the average relative bitrate error to 1.20% and achieves up to 13.45% bitrate savings at typical GOP sizes, outperforming existing approaches. In addition, our framework demonstrates improved robustness to content variation and bandwidth fluctuations with lower encoding/decoding overhead, making it highly suitable for practical deployment.
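
An illustrative reward for the per-frame decision process described above, trading off rate-distortion cost against adherence to a target bitrate. The functional form and weights are assumptions chosen for intuition, not the paper's reward.

```python
def rd_reward(distortion, bits, target_bits, lam=0.01, mu=0.5):
    """Higher is better: low R-D cost plus small deviation from the bit budget."""
    rd_cost = distortion + lam * bits
    rate_penalty = mu * abs(bits - target_bits) / max(target_bits, 1)
    return -(rd_cost + rate_penalty)

# A frame that overshoots its budget scores worse than one on target.
print(rd_reward(distortion=30.0, bits=10_000, target_bits=10_000))  # -130.0
print(rd_reward(distortion=30.0, bits=14_000, target_bits=10_000))  # -170.2
```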

JBHI Journal 2026 Journal Article

SegTom: A 3D Volumetric Medical Image Segmentation Framework for Thoracoabdominal Multi-Organ Anatomical Structures

  • Yan Pang
  • Yunhao Li
  • Jiaming Liang
  • Hao Chen
  • Ying Hu
  • Qiong Wang

Accurate segmentation of thoracoabdominal anatomical structures in three-dimensional medical imaging modalities is fundamental for informed clinical decision-making across a wide array of medical disciplines. Current approaches often struggle to efficiently and comprehensively process this region’s intricate and heterogeneous anatomical information, leading to suboptimal outcomes in diagnosis, treatment planning, and disease management. To address this challenge, we introduce SegTom, a novel volumetric segmentation framework equipped with a cutting-edge SegTom Block specifically engineered to effectively capture the complex anatomical representations inherent to the thoracoabdominal region. This SegTom Block incorporates a hierarchical anatomical-representation decomposition to facilitate efficient information exchange by decomposing the computationally intensive self-attention mechanism and cost-effectively aggregating the extracted representations. Rigorous validation of SegTom across nine diverse datasets, encompassing both computed tomography (CT) and magnetic resonance imaging (MRI) modalities, consistently demonstrates high performance across a broad spectrum of anatomical structures. Specifically, SegTom achieves a mean Dice similarity coefficient (DSC) of 87.29% for cardiac segmentation on the MM-WHS MRI dataset, 83.48% for multi-organ segmentation on the BTCV abdominal CT dataset, and 92.01% for airway segmentation on a dedicated CT dataset.

JBHI Journal 2026 Journal Article

SPSID: A single-parameter shrinkage inverse-diffusion for denoising gene-regulatory networks

  • Hao Chen
  • Ge Han
  • Wenze Ding
  • Clara Grazian

Inferring gene regulatory networks (GRNs) from expression data is a fundamental problem in systems biology, but its accuracy is often undermined by structural noise arising from transitive correlations. These indirect interactions can obscure the true regulatory architecture, leading to a high rate of false positives. To address this, we introduce SPSID (Single-Parameter Shrinkage Inverse-Diffusion), a novel and robust network denoising framework. SPSID is a deterministic post-processing operator applied to an inferred GRN score matrix, rather than a generative diffusion model for gene expression. SPSID employs a principled spectral filter, built upon a shrinkage-regularized inverse-diffusion operator, to mathematically distinguish direct, one-step interactions from multi-step, indirect paths. This approach guarantees numerical stability and, through a fixed default parameter, effectively eliminates the need for data-dependent tuning. We conducted a comprehensive evaluation of SPSID on both extensive simulations and the gold-standard DREAM5 benchmark. The results demonstrate that SPSID outperforms state-of-the-art baseline methods in both AUROC and AUPR, exhibiting good stability across diverse network conditions. Furthermore, it functions as a post-processing tool, elevating the performance of multiple upstream GRN inference methods. By providing a computationally efficient and parameter-free solution to filter structural noise, SPSID offers a readily applicable tool for uncovering the underlying topology of complex biological networks with greater fidelity.
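
A minimal NumPy sketch of a shrinkage-regularized inverse-diffusion filter in the spirit of SPSID: if the observed score matrix S aggregates direct effects plus multi-step (transitive) paths, a deconvolution-style inverse D = S(I + βS)⁻¹ damps indirect contributions. The closed form and the default β here are illustrative assumptions, not the paper's exact operator.

```python
import numpy as np

def inverse_diffusion(S, beta=0.5):
    S = (S + S.T) / 2                                         # symmetrize scores
    S = S / max(np.abs(np.linalg.eigvalsh(S)).max(), 1e-12)   # spectral scaling
    n = S.shape[0]
    return S @ np.linalg.inv(np.eye(n) + beta * S)   # shrinkage keeps this stable

# Toy check: a chain a->b->c induces a spurious transitive a-c score (0.5).
S = np.array([[0.0, 0.9, 0.5],
              [0.9, 0.0, 0.9],
              [0.5, 0.9, 0.0]])
D = inverse_diffusion(S)
print(D[0, 1], D[0, 2])    # the a-c score drops well below the direct a-b score
print(D[0, 2] < D[0, 1])   # True
```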

AAAI Conference 2026 Conference Paper

You Don’t Need Pre-Built Graphs for RAG: Retrieval Augmented Generation with Adaptive Reasoning Structures

  • Shengyuan Chen
  • Chuang Zhou
  • Zheng Yuan
  • Qinggang Zhang
  • Zeyang Cui
  • Hao Chen
  • Yilin Xiao
  • Jiannong Cao

Large language models (LLMs) often suffer from hallucination, generating factually incorrect statements when handling questions beyond their knowledge and perception. Retrieval-augmented generation (RAG) addresses this by retrieving query-relevant contexts from knowledge bases to support LLM reasoning. Recent advances leverage pre-constructed graphs to capture the relational connections among distributed documents, showing remarkable performance in complex tasks. However, existing Graph-based RAG (GraphRAG) methods rely on a costly process to transform the corpus into a graph, introducing overwhelming token cost and update latency. Moreover, real-world queries vary in type and complexity, requiring different logic structures for accurate reasoning. The pre-built graph may not align with these required structures, resulting in ineffective knowledge retrieval. To this end, we propose a Logic-aware Retrieval Augmented Generation framework (LogicRAG) that dynamically extracts reasoning structures at inference time to guide adaptive retrieval without any pre-built graph. LogicRAG begins by decomposing the input query into a set of subproblems and constructing a directed acyclic graph (DAG) to model the logical dependencies among them. To support coherent multi-step reasoning, LogicRAG then linearizes the graph using topological sort, so that subproblems can be addressed in a logically consistent order. Besides, LogicRAG applies graph pruning to reduce redundant retrieval and uses context pruning to filter irrelevant context, significantly reducing the overall token cost. Extensive experiments demonstrate that LogicRAG achieves both superior performance and efficiency compared to state-of-the-art baselines.
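
A sketch of LogicRAG's control flow as described above: decompose a query into subproblems with logical dependencies (a DAG), then answer them in topological order so each step can condition on its prerequisites. The decompose/retrieve/answer helpers are hypothetical stand-ins for LLM calls, not the authors' implementation.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def solve(query, decompose, retrieve, answer):
    deps = decompose(query)   # {subproblem: set of prerequisite subproblems}
    answers = {}
    for sub in TopologicalSorter(deps).static_order():
        context = retrieve(sub)                        # adaptive retrieval per step
        prereqs = {p: answers[p] for p in deps[sub]}   # earlier answers feed later ones
        answers[sub] = answer(sub, context, prereqs)
    return answers

# Toy run with stubs standing in for the LLM decomposer, retriever, and answerer.
deps = {
    "Who directed film X?": set(),
    "What else did that director make?": {"Who directed film X?"},
}
out = solve(
    "What other films were made by the director of film X?",
    decompose=lambda q: deps,
    retrieve=lambda s: ["<retrieved passage>"],
    answer=lambda s, ctx, pre: f"answer({s})",
)
print(out)
```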

AAAI Conference 2025 Conference Paper

A Denoising Pre-training Framework for Accelerating Novel Material Discovery

  • Shuaike Shen
  • Ke Liu
  • Muzhi Zhu
  • Hao Chen

Crystal materials play an important role in the development of society. The discovery of new materials is critical to achieving sustainable development goals (SDGs), such as climate change mitigation, affordable and clean energy, and fostering innovation in industry and infrastructure. Recent advances in deep learning for crystal property prediction have accelerated material discovery, but these methods typically rely on labeled data, which is often limited and varies across different properties. This limitation hinders the full utilization of the vast amount of unlabeled data in materials science. To overcome this challenge, we introduce an unsupervised Denoising Pre-training Framework (DPF) tailored for crystal structures. DPF trains a model to reconstruct the original crystal structure by recovering the masked atom types, perturbed atom positions, and perturbed crystal lattices. Through pre-training, models learn the intrinsic features of crystal structures and capture the key features influencing crystal properties. We pre-train models on a dataset of 380,743 unlabeled crystal structures and fine-tune them on downstream property prediction tasks. Extensive experiments demonstrate the effectiveness of our framework, showing its potential to significantly advance material science and contribute to the development of society by accelerating the discovery of materials crucial for sustainable technologies.
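
A hedged sketch of the corruption step implied by DPF: mask a fraction of atom types, jitter fractional coordinates, and perturb the lattice, keeping the originals as reconstruction targets. Ratios and noise scales are assumed values, not the paper's settings.

```python
import numpy as np

MASK_TOKEN = -1
rng = np.random.default_rng(0)

def corrupt_crystal(atom_types, frac_coords, lattice,
                    mask_ratio=0.15, pos_sigma=0.05, lat_sigma=0.02):
    """Return a corrupted copy plus the mask; the originals serve as targets."""
    atom_types = atom_types.copy()
    mask = rng.random(len(atom_types)) < mask_ratio
    atom_types[mask] = MASK_TOKEN                           # mask atom identities
    noisy_coords = (frac_coords + rng.normal(0, pos_sigma, frac_coords.shape)) % 1.0
    noisy_lattice = lattice + rng.normal(0, lat_sigma, lattice.shape)  # 3x3 cell
    return atom_types, noisy_coords, noisy_lattice, mask

types = np.array([6, 8, 8])            # toy CO2-like motif
coords = rng.random((3, 3))            # fractional coordinates in [0, 1)
cell = 4.0 * np.eye(3)                 # cubic cell, 4 A edges
print(corrupt_crystal(types, coords, cell)[0])
```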

IJCAI Conference 2025 Conference Paper

A Survey of Pathology Foundation Model: Progress and Future Directions

  • Conghao Xiong
  • Hao Chen
  • Joseph J. Y. Sung

Computational pathology, which involves analyzing whole slide images for automated cancer diagnosis, relies on multiple instance learning, where performance depends heavily on the feature extractor and aggregator. Recent Pathology Foundation Models (PFMs), pretrained on large-scale histopathology data, have significantly enhanced both the extractor and aggregator, but they lack a systematic analysis framework. In this survey, we present a hierarchical taxonomy organizing PFMs through a top-down philosophy applicable to foundation model analysis in any domain: model scope, model pretraining, and model design. Additionally, we systematically categorize PFM evaluation tasks into slide-level, patch-level, multimodal, and biological tasks, providing comprehensive benchmarking criteria. Our analysis identifies critical challenges in both PFM development (pathology-specific methodology, end-to-end pretraining, data-model scalability) and utilization (effective adaptation, model maintenance), paving the way for future directions in this promising field. Resources referenced in this survey are available at https://github.com/BearCleverProud/AwesomeWSI.

ICRA Conference 2025 Conference Paper

Accelerated Quasi-Static FEM for Real-Time Modeling of Continuum Robots with Multiple Contacts and Large Deformation

  • Hao Chen
  • Jian Chen 0036
  • Xinran Liu
  • Zihui Zhang
  • Yuanrui Huang
  • Zhongkai Zhang 0001
  • Hongbin Liu 0001

Continuum robots offer high flexibility and multiple degrees of freedom, making them ideal for navigating narrow lumens. However, accurately modeling their behavior under large deformations and frequent environmental contacts remains challenging. Current methods for solving the deformation of these robots, such as the Model Order Reduction and Gauss-Seidel (GS) methods, suffer from significant drawbacks. They experience reduced computational speed as the number of contact points increases and struggle to balance speed with model accuracy. To overcome these limitations, we introduce a novel finite element method (FEM) named Acc-FEM. Acc-FEM employs a large deformation quasi-static finite element model and integrates an accelerated solver scheme to handle multi-contact simulations efficiently. Additionally, it utilizes parallel computing with Graphics Processing Units (GPUs) for real-time updates of the finite element models and collision detection. Extensive numerical experiments demonstrate that Acc-FEM significantly improves computational efficiency in modeling continuum robots with multiple contacts while achieving satisfactory accuracy, addressing the deficiencies of existing methods.

TIST Journal 2025 Journal Article

Adaptive Intention Learning for Session-Based Recommendation

  • Qingbo Zhang
  • Xiaochun Yang
  • Hao Chen
  • Bin Wang
  • Zhu Sun
  • Xiangmin Zhou

In recent years, session-based recommender systems (SRSs) have emerged as a significant research focus within the recommendation field. Capturing user intentions to infer user interest accordingly has proven to be effective in enhancing the accuracy of SRSs. However, existing techniques assume that all sessions have the same number of intentions or that the items in one category belonging to the same session reflect the same intention. In real applications, such as e-commerce, sessions may have different numbers of intentions, and the same type of items in a session may correspond to different intentions. As a result, existing techniques cannot guarantee high-quality user interest prediction. In this article, we propose a novel Adaptive Intention Learning Network (AILN) to capture an adaptive number of intentions for each session, thereby enhancing the accuracy of user interest inference. Specifically, we design an intention evaluation network (IEN) to evaluate whether a subsequence of a session corresponds to a valid intention, and an intention generation network (IGN) to learn the representation of a valid intention. By checking each subsequence of a session, IEN and IGN enable the incremental learning of a session-specific intention hierarchy (IH) to store valid intentions of the session. To reduce the cost of building the IH, we propose a pruning strategy that exploits the intention validity to avoid unnecessary evaluation. The representative intentions are selected from IH and input into a designed interest predictor to infer the user interest. Experimental results on two real-world datasets demonstrate the superiority of our proposed AILN.

ICML Conference 2025 Conference Paper

Beyond Zero Initialization: Investigating the Impact of Non-Zero Initialization on LoRA Fine-Tuning Dynamics

  • Shiwei Li 0002
  • Xiandi Luo
  • Xing Tang 0007
  • Haozhao Wang
  • Hao Chen
  • Weihong Luo
  • Yuhua Li 0003
  • Xiuqiang He 0001

Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method. In standard LoRA layers, one of the matrices, $A$ or $B$, is initialized to zero, ensuring that fine-tuning starts from the pretrained model. However, there is no theoretical support for this practice. In this paper, we investigate the impact of non-zero initialization on LoRA’s fine-tuning dynamics from an infinite-width perspective. Our analysis reveals that, compared to zero initialization, simultaneously initializing $A$ and $B$ to non-zero values improves LoRA’s robustness to suboptimal learning rates, particularly smaller ones. Further analysis indicates that although the non-zero initialization of $AB$ introduces random noise into the pretrained weight, it generally does not affect fine-tuning performance. In other words, fine-tuning does not need to strictly start from the pretrained model. The validity of our findings is confirmed through extensive experiments across various models and datasets. The code is available at https://github.com/Leopold1423/non_zero_lora-icml25.
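
A sketch of the setup under discussion: a standard LoRA layer where both A and B can start non-zero, so W0 + (alpha/r)·BA begins slightly away from the pretrained weights. The initialization scale below is an assumption chosen to keep the initial perturbation small; it is not the paper's prescribed value.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16, zero_init=False):
        super().__init__()
        self.base = base.requires_grad_(False)   # frozen pretrained layer
        d_out, d_in = base.weight.shape
        # Conventional LoRA sets B = 0 so fine-tuning starts exactly at W0;
        # the non-zero variant starts slightly away from the pretrained model.
        self.A = nn.Parameter(torch.randn(r, d_in) / d_in**0.5)
        B0 = torch.zeros(d_out, r) if zero_init else torch.randn(d_out, r) / d_out**0.5
        self.B = nn.Parameter(B0)
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(64, 64), zero_init=False)
print(layer(torch.randn(2, 64)).shape)   # torch.Size([2, 64])
```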

IROS Conference 2025 Conference Paper

Controllable Traffic Simulation through LLM-Guided Hierarchical Reasoning and Refinement

  • Zhiyuan Liu
  • Leheng Li
  • Yuning Wang
  • Haotian Lin 0006
  • Hao Chen
  • Zhizhe Liu
  • Lei He
  • Jianqiang Wang 0003

Evaluating autonomous driving systems in complex and diverse traffic scenarios through controllable simulation is essential to ensure their safety and reliability. However, existing traffic simulation methods face challenges in their controllability. To address this, we propose a novel diffusion-based and LLM-enhanced traffic simulation framework. Our approach incorporates a high-level understanding module and a low-level refinement module, which systematically examines the hierarchical structure of traffic elements, guides LLMs to thoroughly analyze traffic scenario descriptions step by step, and refines the generation by self-reflection, enhancing their understanding of complex situations. Furthermore, we propose a Frenet-frame-based cost function framework that provides LLMs with geometrically meaningful quantities, improving their grasp of spatial relationships in a scenario and enabling more accurate cost function generation. Experiments on the Waymo Open Motion Dataset (WOMD) demonstrate that our method can handle more intricate descriptions and generate a broader range of scenarios in a controllable manner.

IJCAI Conference 2025 Conference Paper

DGCPL: Dual Graph Distillation for Concept Prerequisite Relation Learning

  • Miao Zhang
  • Jiawei Wang
  • Jinying Han
  • Kui Xiao
  • Zhifei Li
  • Yan Zhang
  • Hao Chen
  • Shihui Wang

Concept prerequisite relations determine the learning order of knowledge concepts in one domain, which has an important impact on teachers' course design and students' personalized learning. Current research usually predicts concept prerequisite relations from the perspective of knowledge, and rarely pays attention to the role of learners' learning behavior. We propose a Dual Graph Distillation Method for Concept Prerequisite Relation Learning (DGCPL). Specifically, DGCPL constructs a dual graph structure from both the knowledge and learning behavior perspectives, and captures the high-order knowledge features and learning behavior features through the concept-resource hypergraph and the learning behavior graph respectively. In addition, we introduce a gated knowledge distillation to fuse the structural information of concept nodes in the two graphs, so as to obtain a more comprehensive concept embedding representation and achieve accurate prediction of prerequisite relations. On three public benchmark datasets, we compare DGCPL with eight graph-based baseline methods and five traditional classification baseline methods. The experimental results show that DGCPL achieves state-of-the-art performance in learning concept prerequisite relations. Our code is available at https://github.com/wisejw/DGCPL.

ECAI Conference 2025 Conference Paper

DialogGraph-LLM: Graph-Informed LLMs for End-to-End Audio Dialogue Intent Recognition

  • HongYu Liu
  • Junxin Li
  • Changxi Guo
  • Hao Chen
  • Yaqian Huang
  • Yifu Guo
  • Huan Yang
  • Lihua Cai

Recognizing speaker intent in long audio dialogues has a wide range of applications, but is a non-trivial AI task due to complex inter-dependencies in speaker utterances and scarce annotated data. To address these challenges, an end-to-end framework, namely DialogGraph-LLM, is proposed in the current work. DialogGraph-LLM combines a novel Multi-Relational Dialogue Attention Network (MR-DAN) architecture with multimodal foundation models (e.g., Qwen2.5-Omni-7B) for direct acoustic-to-intent inference. An adaptive semi-supervised learning strategy is designed using LLM with a confidence-aware pseudo-label generation mechanism based on dual-threshold filtering using both global and class confidences, and an entropy-based sample selection process that prioritizes high-information unlabeled instances. Extensive evaluations on the proprietary MarketCalls corpus and the publicly available MIntRec 2.0 benchmark demonstrate DialogGraph-LLM's superiority over strong audio and text-driven baselines. The framework demonstrates strong performance and efficiency in intent recognition in real-world audio dialogues, proving its practical value for audio-rich domains with limited supervision. Our code is available at https://github.com/david188888/DialogGraph-LLM.

NeurIPS Conference 2025 Conference Paper

DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks

  • Canyu Zhao
  • Yanlong Sun
  • Mingyu Liu
  • Huanyi Zheng
  • Muzhi Zhu
  • Zhiyue Zhao
  • Hao Chen
  • Tong He

This paper's primary objective is to develop a robust generalist perception model capable of addressing multiple tasks under constraints of computational resources and limited training data. We leverage text-to-image diffusion models pre-trained on billions of images and successfully introduce our DICEPTION, a visual generalist model. Exhaustive evaluations demonstrate that DICEPTION effectively tackles diverse perception tasks, even achieving performance comparable to SOTA single-task specialist models. Specifically, we achieve results on par with SAM-vit-h using only 0.06% of their data (e.g., 600K vs. 1B pixel-level annotated images). We designed comprehensive experiments on architectures and input paradigms, demonstrating that the key to successfully re-purposing a single diffusion model for multiple perception tasks lies in maximizing the preservation of the pre-trained model's prior knowledge. Consequently, DICEPTION can be trained with substantially lower computational costs than conventional models requiring training from scratch. Furthermore, adapting DICEPTION to novel tasks is highly efficient, necessitating fine-tuning on as few as 50 images and approximately 1% of its parameters. Finally, we demonstrate that a subtle application of classifier-free guidance can improve the model's performance on depth and normal estimation. We also show that pixel-aligned training, as is characteristic of perception tasks, significantly enhances the model's ability to preserve fine details. DICEPTION offers valuable insights and presents a promising direction for the development of advanced diffusion-based visual generalist models.
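
For reference, the abstract's mention of classifier-free guidance refers to the standard combination rule below; this is the generic formulation, not DICEPTION's specific guidance schedule.

```python
import torch

def cfg_noise(eps_uncond, eps_cond, guidance_scale=1.5):
    """Standard CFG: extrapolate from the unconditional prediction toward the
    conditional one; scale 1.0 recovers the plain conditional prediction."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_u, eps_c = torch.randn(1, 4, 32, 32), torch.randn(1, 4, 32, 32)
print(cfg_noise(eps_u, eps_c).shape)   # torch.Size([1, 4, 32, 32])
```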

AAAI Conference 2025 Conference Paper

DiffCalib: Reformulating Monocular Camera Calibration as Diffusion-Based Dense Incident Map Generation

  • Xiankang He
  • Guangkai Xu
  • Bo Zhang
  • Hao Chen
  • Ying Cui
  • Dongyan Guo

Monocular camera calibration is a key precondition for numerous 3D vision applications. Despite considerable advancements, existing methods often hinge on specific assumptions and struggle to generalize across varied real-world scenarios, and the performance is limited by insufficient training data. Recently, diffusion models trained on expansive datasets have been confirmed to maintain the capability to generate diverse, high-quality images. This success suggests a strong potential of the models to effectively understand varied visual information. In this work, we leverage the comprehensive visual knowledge embedded in pre-trained diffusion models to enable more robust and accurate monocular camera intrinsic estimation. Specifically, we reformulate the problem of estimating the four degrees of freedom (4-DoF) of camera intrinsic parameters as a dense incident map generation task. The map details the angle of incidence for each pixel in the RGB image, and its format aligns well with the paradigm of diffusion models. The camera intrinsics can then be derived from the incident map with a simple non-learning RANSAC algorithm during inference. Moreover, to further enhance the performance, we jointly estimate a depth map to provide extra geometric information for the incident map estimation. Extensive experiments on multiple testing datasets demonstrate that our model achieves state-of-the-art performance, gaining up to a 40% reduction in prediction errors. Besides, the experiments also show that the precise camera intrinsics and depth maps estimated by our pipeline can greatly benefit practical applications such as 3D reconstruction from a single in-the-wild image.

JBHI Journal 2025 Journal Article

Efficient Breast Lesion Segmentation From Ultrasound Videos Across Multiple Source-Limited Platforms

  • Yan Pang
  • Yunhao Li
  • Teng Huang
  • Jiaming Liang
  • Ziyu Ding
  • Hao Chen
  • Baoliang Zhao
  • Ying Hu

Medical video segmentation is fundamentally important in clinical diagnosis and treatment procedures, offering dynamic tracking of breast lesions across frames in ultrasound videos for improved segmentation performance. However, existing approaches face challenges in striking a balance between segmentation performance and inference speed, hindering real-time application in resource-constrained medical environments. To address these limitations, we present BaS, a blazing-fast on-device breast lesion segmentation model. BaS integrates the Stem module and BaSBlock to refine representations through inter- and intra-frame analysis on ultrasound videos. In addition, we release two versions of BaS: BaS-S for superior segmentation performance and BaS-L for accelerated inference. Experimental results indicate that BaS surpasses top-performing models in both segmentation efficiency and prediction accuracy on devices with limited resources. This work advances the development of efficient medical video segmentation frameworks applicable to multiple medical platforms.

NeurIPS Conference 2025 Conference Paper

Enforcing Hard Linear Constraints in Deep Learning Models with Decision Rules

  • Gonzalo E. Constante
  • Hao Chen
  • Can Li

Deep learning models are increasingly deployed in safety-critical tasks where predictions must satisfy hard constraints, such as physical laws, fairness requirements, or safety limits. However, standard architectures lack built-in mechanisms to enforce such constraints, and existing approaches based on regularization or projection are often limited to simple constraints, computationally expensive, or lack feasibility guarantees. This paper proposes a model-agnostic framework for enforcing input-dependent linear equality and inequality constraints on neural network outputs. The architecture combines a task network trained for prediction accuracy with a safe network trained using decision rules from the stochastic and robust optimization literature to ensure feasibility across the entire input space. The final prediction is a convex combination of the two subnetworks, guaranteeing constraint satisfaction during both training and inference without iterative procedures or runtime optimization. We prove that the architecture is a universal approximator of constrained functions and derive computationally tractable formulations based on linear decision rules. Empirical results on benchmark regression tasks show that our method consistently satisfies constraints while maintaining competitive accuracy and low inference latency.
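
One simple way to realize the convex-combination idea described above, sketched under assumptions: given a safe-network output that strictly satisfies Ay ≤ b, blend it with the task output using the smallest weight that restores feasibility. Convexity of {y : Ay ≤ b} guarantees the blended point is feasible. This illustrates the guarantee, not the paper's exact combination rule.

```python
import numpy as np

def feasible_blend(y_task, y_safe, A, b, eps=1e-9):
    v = A @ y_task - b                 # violation of each constraint row
    m = A @ (y_task - y_safe)          # how much blending toward y_safe helps
    # Need v <= t * m rowwise; y_safe strictly feasible implies m > v when v > 0.
    t_rows = np.where(v > 0, v / np.maximum(m, eps), 0.0)
    t = float(np.clip(t_rows.max(initial=0.0), 0.0, 1.0))
    return (1 - t) * y_task + t * y_safe

A = np.array([[1.0, 1.0]])
b = np.array([1.0])                              # constraint: y1 + y2 <= 1
y = feasible_blend(np.array([0.9, 0.8]), np.array([0.2, 0.2]), A, b)
print(y, A @ y <= b + 1e-9)                      # blended prediction is feasible
```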

NeurIPS Conference 2025 Conference Paper

EPA: Boosting Event-based Video Frame Interpolation with Perceptually Aligned Learning

  • Yuhan Liu
  • LingHui Fu
  • Zhen Yang
  • Hao Chen
  • Youfu Li
  • Yongjian Deng

Event cameras, with their capacity to provide high temporal resolution information between frames, are increasingly utilized for video frame interpolation (VFI) in challenging scenarios characterized by high-speed motion and significant occlusion. However, prevalent issues of blur and distortion within the keyframes and ground truth data used for training and inference in these demanding conditions are frequently overlooked. This oversight impedes the perceptual realism and multi-scene generalization capabilities of existing event-based VFI (E-VFI) methods when generating interpolated frames. Motivated by the observation that semantic-perceptual discrepancies between degraded and pristine images are considerably smaller than their image-level differences, we introduce EPA. This novel E-VFI framework diverges from approaches reliant on direct image-level supervision by constructing multilevel, degradation-insensitive semantic perceptual supervisory signals to enhance the perceptual realism and multi-scene generalization of the model's predictions. Specifically, EPA operates in two phases: it first employs a DINO-based perceptual extractor, a customized style adapter, and a reconstruction generator to derive multi-layered, degradation-insensitive semantic-perceptual features ($\mathcal{S}$). Second, a novel Bidirectional Event-Guided Alignment (BEGA) module utilizes deformable convolutions to align perceptual features from keyframes to ground truth with inter-frame temporal guidance extracted from event signals. By decoupling the learning process from direct image-level supervision, EPA enhances model robustness against degraded keyframes and unreliable ground truth information. Extensive experiments demonstrate that this approach yields interpolated frames more consistent with human perceptual preferences. The code will be released upon acceptance.

AAAI Conference 2025 Conference Paper

ESEG: Event-Based Segmentation Boosted by Explicit Edge-Semantic Guidance

  • Yucheng Zhao
  • Gengyu Lyu
  • Ke Li
  • Zihao Wang
  • Hao Chen
  • Zhen Yang
  • Yongjian Deng

Event-based semantic segmentation (ESS) has attracted researchers' attention recently, as event cameras can solve problems such as under/over-exposure or motion blur that are difficult for RGB cameras to handle. However, event data are noisy and sparse, resulting in difficulties for the model to locate and extract reliable cues from their sparse representations, especially when performing pixel-level tasks. In this paper, we propose a novel framework ESEG to alleviate the dilemma. Given that event signals relate closely to moving edges, instead of proposing complex structures to expect them to recognize those reliable edge regions behind event signals on their own, we introduce the explicit edge-semantic supervision as a reference to let the ESS model globally optimize semantics, considering the high confidence of event data in edge regions. In addition, we propose a fusion module named Density-Aware Dynamic-Window Cross Attention Fusion (D²CAF), in which the density perception, cross-attention, and dynamic window masking mechanisms are jointly imposed to optimize edge-dense feature fusion, leveraging the characteristics of event cameras. Experimental results on DSEC and DDD17 datasets demonstrate the efficacy of the ESEG framework and its core designs.

NeurIPS Conference 2025 Conference Paper

Evaluating Program Semantics Reasoning with Type Inference in System $F$

  • Yifeng He
  • Luning Yang
  • Christopher Gonzalo
  • Hao Chen

Large Language Models (LLMs) are increasingly integrated into the software engineering ecosystem. Their test-time compute reasoning capabilities promise significant potential in understanding program logic and semantics beyond mere token recognition. However, current benchmarks evaluating reasoning LLMs for code lack a formal, program-centric deductive framework for sound evaluation, and cannot assess whether models genuinely reason about program semantics or merely associate superficial connections between natural language and code tokens. To bridge this gap, we introduce TF-Bench, a benchmark designed to evaluate LLM reasoning based on type inference in System F, a task we refer to as *program semantics reasoning*. By employing verified transformations to remove semantically irrelevant natural language, we construct TF-Bench_pure, a purely semantics-driven variant of TF-Bench. Our analysis reveals substantial limitations in state-of-the-art LLMs, with the best-performing LLM (Claude-3.7-sonnet) achieving only 55.85% accuracy on TF-Bench_pure. Additionally, we propose two novel metrics to assess the robustness and effectiveness of extended reasoning, underscoring critical limitations in current LLM capabilities and highlighting essential directions for future research.
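
For intuition, here is a textbook System F judgment of the kind the benchmark targets (illustrative only; not claimed to be an actual TF-Bench item): given the type-erased term `twice f x = f (f x)`, the model must infer its most general polymorphic type.

```latex
\[
\mathsf{twice} \;:\; \forall \alpha.\ (\alpha \to \alpha) \to \alpha \to \alpha
\]
```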

NeurIPS Conference 2025 Conference Paper

Fast-in-Slow: A Dual-System VLA Model Unifying Fast Manipulation within Slow Reasoning

  • Hao Chen
  • Jiaming Liu
  • Chenyang Gu
  • Zhuoyang Liu
  • Renrui Zhang
  • Xiaoqi Li
  • Xiao He
  • Yandong Guo

Generalized policy and execution efficiency constitute the two critical challenges in robotic manipulation. While recent foundation policies benefit from the common-sense reasoning capabilities of internet-scale pretrained vision-language models (VLMs), they often suffer from low execution frequency. To mitigate this dilemma, dual-system approaches have been proposed to leverage a VLM-based System 2 module for handling high-level decision-making, and a separate System 1 action module for ensuring real-time control. However, existing designs maintain both systems as separate models, limiting System 1 from fully leveraging the rich pretrained knowledge from the VLM-based System 2. In this work, we propose Fast-in-Slow (FiS), a unified dual-system vision-language-action (VLA) model that embeds the System 1 execution module within the VLM-based System 2 by partially sharing parameters. This innovative paradigm not only enables high-frequency execution in System 1, but also facilitates coordination between multimodal reasoning and execution components within a single foundation model of System 2. Given their fundamentally distinct roles within FiS-VLA, we design the two systems to incorporate heterogeneous modality inputs alongside asynchronous operating frequencies, enabling both fast and precise manipulation. To enable coordination between the two systems, a dual-aware co-training strategy is proposed that equips System 1 with action generation capabilities while preserving System 2’s contextual understanding to provide stable latent conditions for System 1. For evaluation, FiS-VLA outperforms previous state-of-the-art methods by 8% in simulation and 11% in real-world tasks in terms of average success rate, while achieving a 117.7 Hz control frequency with the action chunk size set to eight. Project web page: https://fast-in-slow.github.io.

NeurIPS Conference 2025 Conference Paper

From Pretraining to Pathology: How Noise Leads to Catastrophic Inheritance in Medical Models

  • Hao Sun
  • Zhongyi Han
  • Hao Chen
  • Jindong Wang
  • Xin Gao
  • Yilong Yin

Foundation models pretrained on web-scale data drive contemporary transfer learning in vision, language, and multimodal tasks. Recent work shows that mild label noise in these corpora may lift in-distribution accuracy yet sharply reduce out-of-distribution generalization, an effect known as catastrophic inheritance. Medical data is especially sensitive because annotations are scarce, domain shifts are large, and pretraining sources are noisy. We present the first systematic analysis of catastrophic inheritance in medical models. Controlled label-corruption experiments expose a clear structural collapse: as noise rises, the skewness and kurtosis of feature and logit distributions decline, signaling a flattened representation space and diminished discriminative detail. These higher-order statistics form a compact, interpretable marker of degradation in fine-grained tasks such as histopathology. Guided by this finding, we introduce a fine-tuning objective that restores skewness and kurtosis through two scalar regularizers added to the task loss. The method leaves the backbone unchanged and incurs negligible overhead. Tests on PLIP models trained with Twitter pathology images, as well as other large-scale vision and language backbones, show consistent gains in robustness and cross-domain accuracy under varied noise levels.

NeurIPS Conference 2025 Conference Paper

GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling

  • Tianhao Chen
  • Xin Xu
  • Zijing Liu
  • Pengxiang Li
  • Xinyuan Song
  • Ajay Jaiswal
  • Fan Zhang
  • Jishan Hu

Modern Large Language Models, such as the LLaMA, Qwen and DeepSeek series, predominantly adopt the Pre-LayerNorm (Pre-LN) Transformer architecture. While being stable during pretraining and scalable to large model sizes, Pre-LN suffers from an exponential growth in activation variance across layers, causing the shortcut to dominate over sub-layer outputs in the residual connection and limiting the learning capacity of deeper layers. To mitigate this issue, we propose Gradient-Preserving Activation Scaling (GPAS), a simple technique that can be used in combination with existing approaches. GPAS works by scaling down the intermediate activations while keeping their gradients unchanged. This leaves information in the activations intact, and avoids the gradient vanishing problem associated with gradient downscaling. Extensive experiments across various model sizes from 71M to 1B show that GPAS achieves consistent performance gains. Beyond enhancing Pre-LN Transformers, GPAS also shows promise in improving alternative architectures such as Sandwich-LN and DeepNorm, demonstrating its versatility and potential for improving training dynamics in a wide range of settings. Our code is available at https://github.com/dandingsky/GPAS.
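
The abstract's core mechanism — scale the forward activation down while leaving the backward gradient untouched — can be expressed with the standard stop-gradient identity. The placement and scale value below are illustrative assumptions, not the paper's learned configuration.

```python
import torch

def gpas_scale(x: torch.Tensor, s: float) -> torch.Tensor:
    """Forward pass computes s * x; the backward gradient stays the identity,
    because the detached term contributes no gradient and d(x)/dx = 1."""
    return x + (s * x - x).detach()

x = torch.ones(3, requires_grad=True)
y = gpas_scale(x, 0.5).sum()
y.backward()
print(float(y), x.grad)  # 1.5 (downscaled forward), tensor([1., 1., 1.]) (unscaled grad)
```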

AAAI Conference 2025 Conference Paper

Know Where You Are From: Event-Based Segmentation via Spatio-Temporal Propagation

  • Ke Li
  • Gengyu Lyu
  • Hao Chen
  • Bochen Xie
  • Zhen Yang
  • Youfu Li
  • Yongjian Deng

Event cameras have gained attention in segmentation due to their higher temporal resolution and dynamic range compared to traditional cameras. However, they struggle with issues like lack of color perception and triggering only at motion edges, making it hard to distinguish objects with similar contours or segment spatially continuous objects. Our work aims to address these often overlooked issues. Based on the assumption that various objects exhibit different motion patterns, we believe that embedding the historical motion states of objects into segmented scenes can effectively address these challenges. Inspired by this, we propose the ESS framework "Know Where You Are From" (KWYAF), which incorporates past motion cues through spatio-temporal propagation embedding. This framework features two core components: the Sequential Motion Encoding Module (SME) and the Event-Based Reliable Region Selection Mechanism (ER²SM). The SME constructs prior motion features through spatio-temporal correlation modeling to boost the final segmentation, while ER²SM adaptively identifies high-confidence regions, embedding motion more precisely through local window masks and reliable region selection. Extensive experiments demonstrate the effectiveness of our proposed framework, both quantitatively and qualitatively.

ICLR Conference 2025 Conference Paper

LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior

  • Hanyu Wang
  • Saksham Suri
  • Yixuan Ren
  • Hao Chen
  • Abhinav Shrivastava

We present LARP, a novel video tokenizer designed to overcome limitations in current video tokenization methods for autoregressive (AR) generative models. Unlike traditional patchwise tokenizers that directly encode local visual patches into discrete tokens, LARP introduces a holistic tokenization scheme that gathers information from the visual content using a set of learned holistic queries. This design allows LARP to capture more global and semantic representations, rather than being limited to local patch-level information. Furthermore, it offers flexibility by supporting an arbitrary number of discrete tokens, enabling adaptive and efficient tokenization based on the specific requirements of the task. To align the discrete token space with downstream AR generation tasks, LARP integrates a lightweight AR transformer as a training-time prior model that predicts the next token on its discrete latent space. By incorporating the prior model during training, LARP learns a latent space that is not only optimized for video reconstruction but is also structured in a way that is more conducive to autoregressive generation. Moreover, this process defines a sequential order for the discrete tokens, progressively pushing them toward an optimal configuration during training, ensuring smoother and more accurate AR generation at inference time. Comprehensive experiments demonstrate LARP's strong performance, achieving state-of-the-art FVD on the UCF101 class-conditional video generation benchmark. LARP enhances the compatibility of AR models with videos and opens up the potential to build unified high-fidelity multimodal large language models (MLLMs). Project page: https://hywang66.github.io/larp/

ICRA Conference 2025 Conference Paper

Learn to Swim: Data-Driven LSTM Hydrodynamic Model for Quadruped Robot Gait Optimization

  • Fei Han
  • Pengming Guo
  • Hao Chen
  • Weikun Li
  • Jingbo Ren
  • Naijun Liu
  • Ning Yang
  • Dixia Fan

This paper presents a Long Short-Term Memory network-based Fluid Experiment Data-Driven model (FED-LSTM) for predicting unsteady, nonlinear hydrodynamic forces on the underwater quadruped robot we constructed. Trained on experimental data from leg force and body drag tests conducted in both a recirculating water tank and a towing tank, FED-LSTM outperforms traditional Empirical Formulas (EF) commonly used for flow prediction over flat surfaces. The model demonstrates superior accuracy and adaptability in capturing complex fluid dynamics, particularly in straight-line and turning-gait optimization via the NSGA-II algorithm. FED-LSTM reduces deflection errors during straight-line swimming and improves turn times without increasing the turning radius. Hardware experiments further validate the model's precision and stability over EF. This approach provides a robust framework for enhancing the swimming performance of legged robots, laying the groundwork for future advances in underwater robotic locomotion.

AAAI Conference 2025 Conference Paper

Learning Concept Prerequisite Relation via Global Knowledge Relation Optimization

  • Miao Zhang
  • Jiawei Wang
  • Kui Xiao
  • Shihui Wang
  • Yan Zhang
  • Hao Chen
  • Zhifei Li

Learning concept prerequisite relations helps learners master concepts and build a logically coherent knowledge structure. Many studies use graph neural networks to create heterogeneous knowledge networks that enhance concept representations. However, different types of relations in these networks can influence each other, and existing research often focuses solely on concept relations, neglecting other types of knowledge connections. To address this issue, this paper proposes a novel concept prerequisite relation learning model, named the Global Knowledge Relation Optimization Model (GKROM). Specifically, we capture the impact of different knowledge relation types on document and concept semantic representations separately, and then integrate the document and concept semantic representations. We further introduce multi-objective learning to optimize the knowledge relation network from a global perspective. Through this optimization, GKROM learns richer semantic representations for concepts and documents, improving the accuracy of concept prerequisite relation learning. Extensive experiments on public datasets demonstrate the effectiveness of GKROM, which achieves state-of-the-art performance in concept prerequisite relation learning.

IJCAI Conference 2025 Conference Paper

MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models

  • Donghao Zhou
  • Jiancheng Huang
  • Jinbin Bai
  • Jiaze Wang
  • Hao Chen
  • Guangyong Chen
  • Xiaowei Hu
  • Pheng-Ann Heng

Text-to-image diffusion models can generate high-quality images but lack fine-grained control of visual concepts, limiting their creativity. Thus, we introduce component-controllable personalization, a new task that enables users to customize and reconfigure individual components within concepts. This task faces two challenges: semantic pollution, where undesired elements disrupt the target concept, and semantic imbalance, which causes disproportionate learning of the target concept and component. To address these, we design MagicTailor, a framework that uses Dynamic Masked Degradation to adaptively perturb unwanted visual semantics and Dual-Stream Balancing for more balanced learning of desired visual semantics. The experimental results show that MagicTailor achieves superior performance in this task and enables more personalized and creative image generation.

AAAI Conference 2025 Conference Paper

MM-Tracker: Motion Mamba for UAV-platform Multiple Object Tracking

  • Mufeng Yao
  • Jinlong Peng
  • Qingdong He
  • Bo Peng
  • Hao Chen
  • Mingmin Chi
  • Chao Liu
  • Jon Atli Benediktsson

Multiple object tracking (MOT) from unmanned aerial vehicle (UAV) platforms requires efficient motion modeling, because UAV-MOT faces both local object motion and global camera motion. Motion blur also increases the difficulty of detecting large moving objects. Previous UAV motion modeling approaches either focus only on local motion or ignore motion blurring effects, thus limiting their tracking performance and speed. To address these issues, we propose the Motion Mamba Module, which explores both local and global motion features through cross-correlation and bi-directional Mamba modules for better motion modeling. To address the detection difficulties caused by motion blur, we also design a motion margin loss to effectively improve the detection accuracy of motion-blurred objects. Based on the Motion Mamba Module and motion margin loss, our proposed MM-Tracker surpasses the state-of-the-art on two widely used open-source UAV-MOT datasets.

NeurIPS Conference 2025 Conference Paper

Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration

  • Hao Zhong
  • Muzhi Zhu
  • Zongze Du
  • Zheng Huang
  • Canyu Zhao
  • Mingyu Liu
  • Wen Wang
  • Hao Chen

Long-horizon video-audio reasoning and fine-grained pixel understanding impose conflicting requirements on omnimodal models: dense temporal coverage demands many low-resolution frames, whereas precise grounding calls for high-resolution inputs. We tackle this trade-off with a two-system architecture: a Global Reasoning System selects informative keyframes and rewrites the task at low spatial cost, while a Detail Understanding System performs pixel-level grounding on the selected high-resolution snippets. Because "optimal" keyframe selection and reformulation are ambiguous and hard to supervise, we formulate them as a reinforcement-learning (RL) problem and present Omni-R1, an end-to-end RL framework built on Group Relative Policy Optimization. Omni-R1 trains the Global Reasoning System through hierarchical rewards obtained via online collaboration with the Detail Understanding System, requiring only one epoch of RL on small task splits. Experiments on two challenging benchmarks, Referring Audio-Visual Segmentation (RefAVS) and Reasoning Video Object Segmentation (REVOS), show that Omni-R1 not only surpasses strong supervised baselines but also outperforms specialized state-of-the-art models, while substantially improving out-of-domain generalization and mitigating multimodal hallucination. Our results demonstrate the first successful application of RL to large-scale omnimodal reasoning and highlight a scalable path toward universal foundation models.

NeurIPS Conference 2025 Conference Paper

On Fairness of Unified Multimodal Large Language Model for Image Generation

  • Ming Liu
  • Hao Chen
  • Jindong Wang
  • Liwen Wang
  • Bhiksha Raj
  • Wensheng Zhang

Unified multimodal large language models (U-MLLMs) have demonstrated impressive performance in end-to-end visual understanding and generation tasks. However, compared to generation-only systems (e.g., Stable Diffusion), the unified architecture of U-MLLMs introduces new risks of propagating demographic stereotypes. In this paper, we benchmark several state-of-the-art U-MLLMs and show that they exhibit significant gender and race biases in the generated outputs. To diagnose the source of these biases, we propose a locate-then-fix framework: we first audit the vision and language components, using techniques such as linear probing and controlled generation, and find that the language model appears to be a primary origin of the observed generative bias. Moreover, we observe a "partial alignment" phenomenon, where the U-MLLMs exhibit less bias in understanding tasks yet produce substantially biased images. To address this, we introduce a novel \emph{balanced preference loss} that enforces uniform generation probabilities across demographics by leveraging a synthetically balanced dataset. Extensive experiments show that our approach significantly reduces demographic bias while preserving semantic fidelity and image quality. Our findings underscore the need for targeted debiasing strategies in unified multimodal systems and introduce a practical approach to mitigate biases.
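
The abstract does not spell out the loss; one plausible reading of "enforces uniform generation probabilities across demographics" is a divergence penalty between the model's average demographic distribution and the uniform distribution. The sketch below reflects only that reading, with a hypothetical attribute-classifier input, and is not the paper's exact formulation.

```python
# Hedged sketch of a balance penalty: push the average distribution over
# demographic groups (e.g., from an attribute classifier applied to generated
# images) toward uniform. An illustrative reading, not the paper's loss.
import torch
import torch.nn.functional as F

def balance_penalty(group_logits):
    """group_logits: (batch, n_groups) classifier scores per generated image."""
    probs = F.softmax(group_logits, dim=-1).mean(dim=0)   # batch-average groups
    uniform = torch.full_like(probs, 1.0 / probs.numel())
    # KL(avg distribution || uniform) is zero exactly when outputs are balanced
    return F.kl_div(uniform.log(), probs, reduction="sum")
```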

JBHI Journal 2025 Journal Article

Online Self-Distillation and Self-Modeling for 3D Brain Tumor Segmentation

  • Yan Pang
  • Yunhao Li
  • Teng Huang
  • Jiaming Liang
  • Zhen Wang
  • Changyu Dong
  • Dongyang Kuang
  • Ying Hu

In the specialized domain of brain tumor segmentation, supervised segmentation approaches are hindered by the limited availability of high-quality labeled data, a condition arising from data privacy concerns, significant costs, and ethical issues. In response to this challenge, this paper presents a training framework that adeptly integrates a plug-and-play component, MOD, into current supervised learning models, boosting their efficacy in scenarios with limited data. The MOD consists of an Online Tokenizer and a Dense Predictor, which employ self-distillation and self-modeling on masked patches, promoting swift convergence and efficient representation learning. During the inference phase, the plug-and-play MOD component is excluded, preserving the computational efficiency of the original model without incurring extra processing costs. We substantiated the value of our approach through experiments on leading 3D brain tumor segmentation baselines. Remarkably, models augmented with the MOD consistently showcased superior results, achieving improved Dice coefficients and HD95 scores on two datasets: BraTS 2021 and MSD 2019 Task-01 Brain Tumor.

NeurIPS Conference 2025 Conference Paper

ParamMute: Suppressing Knowledge-Critical FFNs for Faithful Retrieval-Augmented Generation

  • Pengcheng Huang
  • Zhenghao Liu
  • Yukun Yan
  • Haiyan Zhao
  • Xiaoyuan Yi
  • Hao Chen
  • Zhiyuan Liu
  • Maosong Sun

Large language models (LLMs) integrated with retrieval-augmented generation (RAG) have improved factuality by grounding outputs in external evidence. However, they remain susceptible to unfaithful generation, where outputs contradict retrieved context despite its relevance and accuracy. Existing approaches aiming to improve faithfulness primarily focus on enhancing the utilization of external context, but often overlook the persistent influence of internal parametric knowledge during generation. In this work, we investigate the internal mechanisms behind unfaithful generation and identify a subset of mid-to-deep feed-forward networks (FFNs) that are disproportionately activated in such cases. Building on this insight, we propose Parametric Knowledge Muting through FFN Suppression (ParamMute), a framework that improves contextual faithfulness by suppressing the activation of unfaithfulness-associated FFNs and calibrating the model toward retrieved knowledge. To evaluate our approach, we introduce CoFaithfulQA, a benchmark specifically designed to evaluate faithfulness in scenarios where internal knowledge conflicts with accurate external evidence. Experimental results show that ParamMute significantly enhances faithfulness across both CoFaithfulQA and the established ConFiQA benchmark, achieving substantial reductions in reliance on parametric memory. These findings underscore the importance of mitigating internal knowledge dominance and provide a new direction for improving LLM trustworthiness in RAG. All code is available at https://github.com/OpenBMB/ParamMute.
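
The suppression mechanism itself is detailed in the paper and repository; as a generic illustration of down-weighting selected FFN blocks at inference time, a PyTorch forward-hook sketch could look like the following. The layer indices, module path, and scale factor are placeholders, not the authors' settings.

```python
# Generic sketch of FFN output suppression via PyTorch forward hooks. The
# module path assumes a LLaMA-style decoder; indices and scale are hypothetical.
def suppress_ffns(model, layer_indices, scale=0.2):
    """Attenuate the output of selected FFN (MLP) blocks during generation."""
    def make_hook(s):
        def hook(module, inputs, output):
            return output * s           # returning a value replaces the output
        return hook
    handles = [model.model.layers[i].mlp.register_forward_hook(make_hook(scale))
               for i in layer_indices]
    return handles                      # call h.remove() on each to restore
```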

ICLR Conference 2025 Conference Paper

RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards

  • Xinze Li
  • Sen Mei
  • Zhenghao Liu 0001
  • Yukun Yan
  • Shuo Wang 0013
  • Shi Yu 0001
  • Zheni Zeng
  • Hao Chen

Retrieval-Augmented Generation (RAG) has proven its effectiveness in mitigating hallucinations in Large Language Models (LLMs) by retrieving knowledge from external resources. To adapt LLMs for the RAG systems, current approaches use instruction tuning to optimize LLMs, improving their ability to utilize retrieved knowledge. This supervised fine-tuning (SFT) approach focuses on equipping LLMs to handle diverse RAG tasks using different instructions. However, it trains RAG modules to overfit training signals and overlooks the varying data preferences among agents within the RAG system. In this paper, we propose a Differentiable Data Rewards (DDR) method, which end-to-end trains RAG systems by aligning data preferences between different RAG modules. DDR works by collecting the rewards to optimize each agent in the RAG system with the rollout method, which prompts agents to sample some potential responses as perturbations, evaluates the impact of these perturbations on the whole RAG system, and subsequently optimizes the agent to produce outputs that improve the performance of the RAG system. Our experiments on various knowledge-intensive tasks demonstrate that DDR significantly outperforms the SFT method, particularly for LLMs with smaller-scale parameters that depend more on the retrieved knowledge. Additionally, DDR exhibits a stronger capability to align the data preference between RAG modules. The DDR method makes the generation module more effective in extracting key information from documents and mitigating conflicts between parametric memory and external knowledge. All codes are available at https://github.com/OpenMatch/RAG-DDR.
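
As a toy illustration of the rollout idea described above (sample candidate outputs from a module, score them by downstream system performance, and keep a preference pair for optimization), consider the sketch below; `sample_fn` and `reward_fn` are stand-ins, not the paper's components.

```python
# Toy rollout step in the spirit of DDR: sample candidates, rank them by an
# end-to-end reward, and return a (chosen, rejected) pair for preference
# optimization. Both callables are placeholders.
import random

def rollout_pair(sample_fn, reward_fn, query, n_samples=4):
    candidates = [sample_fn(query) for _ in range(n_samples)]
    ranked = sorted(candidates, key=reward_fn, reverse=True)
    return ranked[0], ranked[-1]        # best and worst under the system reward

chosen, rejected = rollout_pair(
    sample_fn=lambda q: q + random.choice([" A", " B", " C"]),
    reward_fn=len,                      # stand-in for downstream RAG accuracy
    query="draft:")
```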

ICML Conference 2025 Conference Paper

Reinforced Lifelong Editing for Language Models

  • Zherui Li 0001
  • Houcheng Jiang
  • Hao Chen
  • Baolong Bi
  • Zhenhong Zhou
  • Fei Sun 0001
  • Junfeng Fang
  • Xiang Wang 0010

Large language models (LLMs) acquire information from pre-training corpora, but their stored knowledge can become inaccurate or outdated over time. Model editing addresses this challenge by modifying model parameters without retraining, and prevalent approaches leverage hypernetworks to generate these parameter updates. However, they face significant challenges in lifelong editing due to their incompatibility with LLM parameters that dynamically change during the editing process. To address this, we observe that hypernetwork-based lifelong editing aligns with reinforcement learning modeling and propose RLEdit, an RL-based editing method. By treating editing losses as rewards and optimizing hypernetwork parameters at the full knowledge sequence level, we enable it to precisely capture LLM changes and generate appropriate parameter updates. Our extensive empirical evaluation across several LLMs demonstrates that RLEdit outperforms existing methods in lifelong editing with superior effectiveness and efficiency, achieving a 59.24% improvement while requiring only 2.11% of the time compared to most approaches.

NeurIPS Conference 2025 Conference Paper

Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling

  • Hao Chen
  • Guanxi Lu
  • Yasuyuki Okoshi
  • Zhiwen Mo
  • Masato Motomura
  • Hongxiang Fan

Test-time scaling (TTS) has proven effective in enhancing the reasoning capabilities of large language models (LLMs). Verification plays a key role in TTS, simultaneously influencing (1) reasoning performance and (2) compute efficiency, due to the quality and computational cost of verification. In this work, we challenge the conventional paradigms of verification and make the first attempt toward systematically investigating the impact of verification granularity, that is, how frequently the verifier is invoked during generation, beyond verifying only the final output or individual generation steps. To this end, we introduce Variable Granularity Search (VG-Search), a unified algorithm that generalizes beam search and Best-of-N sampling via a tunable granularity parameter $g$. Extensive experiments with VG-Search under varying compute budgets, generator-verifier configurations, and task attributes reveal that dynamically selecting $g$ can improve compute efficiency and scaling behavior. Building on these findings, we propose adaptive VG-Search strategies that achieve accuracy gains of up to 3.1% over Beam Search and 3.6% over Best-of-N, while reducing FLOPs by over 52%. We will open-source the code to support future research.
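
To make the granularity parameter concrete: verifying every $g$ steps interpolates between step-level beam search ($g = 1$) and Best-of-N ($g \ge$ the number of generation steps). The sketch below illustrates that interpolation with toy generator and verifier callables; it is our reading of the abstract, not the released VG-Search implementation.

```python
# Toy sketch of variable-granularity search: extend n drafts step by step and
# invoke the verifier only every g steps to prune to the top k and re-fan.
# g=1 resembles step-level beam search; g >= max_steps degenerates to Best-of-N.
import random

def vg_search(prompt, step_fn, verify_fn, n=8, k=4, g=2, max_steps=6):
    drafts = [prompt] * n
    for step in range(1, max_steps + 1):
        drafts = [step_fn(d) for d in drafts]          # extend every draft
        if step % g == 0 and step < max_steps:
            best = sorted(drafts, key=verify_fn, reverse=True)[:k]
            drafts = best * (n // k)                   # prune, then re-fan out
    return max(drafts, key=verify_fn)                  # final verification

answer = vg_search("Q:", lambda d: d + random.choice(" abc"),
                   lambda d: len(set(d)))              # dummy generator/verifier
```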

JBHI Journal 2025 Journal Article

Revisiting Drug Recommendation From a Causal Perspective

  • Junjie Zhang
  • Xuan Zang
  • Hao Chen
  • Xiaowei Yan
  • Buzhou Tang

Drug recommendation, which aims to provide a prescription for a patient, is an essential task in healthcare. Drug molecular graphs provide valuable support for drug recommendation. Existing methods tend to overlook drugs' molecular graphs or use the core substructures of molecular graphs with a rule-based segmentation strategy. However, such methods have several limitations: (1) The rule-based segmentation strategy is inflexible and sub-optimal for extremely complex scenarios. (2) The derived core substructures consider only the drug's chemical characteristics and ignore the patient's health condition. (3) The spurious correlation brought by trivial substructures is disregarded. To address these limitations, we design a novel drug recommendation method from a causal perspective, in which a conditional causal representation learner for drug recommendation is proposed. Specifically, we first separate the drug molecular representation into causal and spurious parts depending on various patients' health conditions. Then, we eliminate the spurious correlation caused by the spurious part with causal intervention. Extensive experiments on the MIMIC-III and MIMIC-IV datasets demonstrate that our approach achieves new state-of-the-art performance (e.g., a 6.68% Jaccard improvement on MIMIC-III with p-value $\ll$ 0.05).

NeurIPS Conference 2025 Conference Paper

Revisiting End-to-End Learning with Slide-level Supervision in Computational Pathology

  • Wenhao Tang
  • Rong Qin
  • Heng Fang
  • Fengtao Zhou
  • Hao Chen
  • Xiang Li
  • Ming-Ming Cheng

Pre-trained encoders for offline feature extraction followed by multiple instance learning (MIL) aggregators have become the dominant paradigm in computational pathology (CPath), benefiting cancer diagnosis and prognosis. However, performance limitations arise from the absence of encoder fine-tuning for downstream tasks and disjoint optimization with MIL. While slide-level supervised end-to-end (E2E) learning is an intuitive solution to this issue, it faces challenges such as high computational demands and suboptimal results. These limitations motivate us to revisit E2E learning. We argue that prior work neglects inherent E2E optimization challenges, leading to performance disparities compared to traditional two-stage methods. In this paper, we pioneer the elucidation of the optimization challenge caused by sparse-attention MIL and propose a novel MIL method called ABMILX. ABMILX mitigates this problem through global correlation-based attention refinement and multi-head mechanisms. With an efficient multi-scale random patch sampling strategy, an E2E-trained ResNet with ABMILX surpasses SOTA foundation models under the two-stage paradigm across multiple challenging benchmarks, while remaining computationally efficient (<10 RTX 3090 GPU hours). We demonstrate the potential of E2E learning in CPath and call for greater research focus in this area. The code is available at https://github.com/DearCaat/E2E-WSI-ABMILX.

IROS Conference 2025 Conference Paper

Robotic In Situ Measurement of Multiple Intracellular Physical Parameters Based on Three-micropipettes System

  • Mengya Liu
  • Jinyu Qiu
  • Shaojie Fu
  • Ruimin Li
  • Yuzhu Liu
  • Hao Chen
  • Xin Zhao 0010
  • Qili Zhao

Physical parameters of the intracellular environment, such as mass density, intracellular pressure, and elasticity, have significant effects on the physiological activities of the cell and on intracellular operation results. However, the significantly different measurement principles of these parameters make their in situ measurement for the same cell a challenging task, which greatly limits the study of their comprehensive regulation mechanisms. For the first time, this paper proposes a robotic in situ measurement system for multiple intracellular physical parameters, based on a self-developed three-micropipette system. Using this system, the mass density, elasticity, and intracellular pressure of the same cell are measured automatically in sequence, according to a robotic in situ measurement process. Experimental results on sheep oocytes demonstrate an 83.3% measurement success rate at an average speed of 97.75 s/cell. The measured values of the three parameters are close to the individually reported results in the literature, while requiring significantly less operation time than those individual measurements combined. Our system lays a solid foundation for future research on the comprehensive regulation mechanisms of these parameters in cell physiological activities and intracellular operation results.

NeurIPS Conference 2025 Conference Paper

Role-aware Multi-agent Reinforcement Learning for Coordinated Emergency Traffic Control

  • Ming Cheng
  • Hao Chen
  • Zhiqing Li
  • Jia Wang
  • Senzhang Wang

Emergency traffic control presents an increasingly critical challenge, requiring seamless coordination among emergency vehicles, regular vehicles, and traffic lights to ensure efficient passage for all vehicles. Existing models focus primarily on traffic light control, leaving emergency and regular vehicles prone to delay due to the lack of navigation strategies. To address this issue, we propose the Role-aware Multi-agent Traffic Control (RMTC) framework, which dynamically assigns appropriate roles to traffic components for better cooperation by considering their relations with emergency vehicles and adaptively adjusting their policies. Specifically, RMTC introduces a Heterogeneous Temporal Traffic Graph (HTTG) to model the spatial and temporal relationships among all traffic components (traffic lights, regular and emergency vehicles) at each time step. Furthermore, we develop a Dynamic Role Learning model to infer the evolving roles of traffic lights and regular vehicles based on the HTTG. Finally, we present a Role-aware Multi-agent Reinforcement Learning approach that learns traffic policies conditioned on the dynamically inferred roles. Extensive experiments across four public traffic scenarios show that RMTC outperforms existing traffic light control methods by significantly reducing emergency vehicle travel time, while effectively preserving traffic efficiency for regular vehicles. The code is released at https://github.com/mingchenghexi/RMTC.

ICML Conference 2025 Conference Paper

SDP-CROWN: Efficient Bound Propagation for Neural Network Verification with Tightness of Semidefinite Programming

  • Hong-Ming Chiu
  • Hao Chen
  • Huan Zhang
  • Richard Y. Zhang 0001

Neural network verifiers based on linear bound propagation scale impressively to massive models but can be surprisingly loose when neuron coupling is crucial. Conversely, semidefinite programming (SDP) verifiers capture inter-neuron coupling naturally, but their cubic complexity restricts them to only small models. In this paper, we propose SDP-CROWN, a novel hybrid verification framework that combines the tightness of SDP relaxations with the scalability of bound-propagation verifiers. At the core of SDP-CROWN is a new linear bound, derived via SDP principles, that explicitly captures $\ell_{2}$-norm-based inter-neuron coupling while adding only one extra parameter per layer. This bound can be integrated seamlessly into any linear bound-propagation pipeline, preserving the inherent scalability of such methods yet significantly improving tightness. In theory, we prove that our inter-neuron bound can be up to a factor of $\sqrt{n}$ tighter than traditional per-neuron bounds. In practice, when incorporated into the state-of-the-art $\alpha$-CROWN verifier, we observe markedly improved verification performance on large models with up to 65 thousand neurons and 2.47 million parameters, achieving tightness that approaches that of costly SDP-based methods.
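
A worked illustration of where a $\sqrt{n}$ factor can come from (our notation and a standard norm inequality, not the paper's derivation): over an $\ell_2$ ball, coupling the coordinates tightens a linear bound from an $\ell_1$-norm term to an $\ell_2$-norm term.

```latex
% For a linear function a^T x over the l2 ball of radius rho around x_0:
\[
  \max_{\|x - x_0\|_2 \le \rho} a^\top x \;=\; a^\top x_0 + \rho\,\|a\|_2 ,
\]
% whereas treating each coordinate independently (per-neuron intervals
% |x_i - x_{0,i}| <= rho) only yields the looser box bound
\[
  a^\top x_0 + \rho\,\|a\|_1 ,
  \qquad \|a\|_1 \le \sqrt{n}\,\|a\|_2 ,
\]
% so the coupled bound can be tighter by up to a factor of sqrt(n).
```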

IJCAI Conference 2025 Conference Paper

Seeing the Unseen: Composing Outliers for Compositional Zero-Shot Learning

  • Chenchen Jing
  • Mingyu Liu
  • Hao Chen
  • Yuling Xi
  • Xingyuan Bu
  • Dong Gong
  • Chunhua Shen

Compositional zero-shot learning (CZSL) aims to recognize unseen attribute-object compositions by learning from seen compositions. The distribution shift between unseen and seen compositions poses challenges to CZSL models, especially when test images mix both seen and unseen compositions. The challenge is addressed more easily if a model can distinguish unseen from seen compositions and treat them with specific recognition strategies. However, identifying images with unseen compositions is non-trivial, considering that unseen compositions are absent in training and usually differ only subtly from seen compositions. In this paper, we propose a novel compositional zero-shot learning method called COMO, which composes outliers during training to distinguish seen from unseen compositions and then applies specific strategies to each. Specifically, we compose attribute-object representations for unseen compositions from the primitive representations of training images, using them as outliers that enable the model to identify unseen compositions at inference. At test time, the method distinguishes images containing seen/unseen compositions and uses different weights for composition classification and primitive classification to recognize them. Experimental results on three datasets show the effectiveness of our method in both the closed-world and open-world settings.

ICML Conference 2025 Conference Paper

Self-cross Feature based Spiking Neural Networks for Efficient Few-shot Learning

  • Qi Xu 0008
  • Junyang Zhu
  • Dongdong Zhou
  • Hao Chen
  • Yang Liu
  • Jiangrong Shen
  • Qiang Zhang 0008

Deep neural networks (DNNs) excel in computer vision tasks, especially few-shot learning (FSL), which is increasingly important for generalizing from limited examples. However, DNNs are computationally expensive and face scalability issues in real-world settings. Spiking Neural Networks (SNNs), with their event-driven nature and low energy consumption, are particularly efficient at processing sparse and dynamic data, though they still encounter difficulties in capturing complex spatiotemporal features and performing accurate cross-class comparisons. To further enhance the performance and efficiency of SNNs in few-shot learning, we propose a few-shot learning framework based on SNNs, which combines a self-feature extractor module and a cross-feature contrastive module to refine feature representations and reduce power consumption. We apply a combination of temporal efficient training loss and InfoNCE loss to optimize the temporal dynamics of spike trains and enhance discriminative power. Experimental results show that the proposed FSL-SNN significantly improves classification performance on the neuromorphic dataset N-Omniglot, and achieves performance competitive with ANNs on static datasets such as CUB and miniImageNet with low power consumption.

NeurIPS Conference 2025 Conference Paper

SwitchLingua: The First Large-Scale Multilingual and Multi-Ethnic Code-Switching Dataset

  • Peng Xie
  • Xingyuan Liu
  • Yequan Bie
  • Tsz Wai Chan
  • Yangqiu Song
  • Yang Wang
  • Hao Chen
  • Kani Chen

Code-switching (CS) is the alternating use of two or more languages within a conversation or utterance, often influenced by social context and speaker identity. This linguistic phenomenon poses challenges for Automatic Speech Recognition (ASR) systems, which are typically designed for a single language and struggle to handle multilingual inputs. The growing global demand for multilingual applications, including Code-Switching ASR (CSASR), Text-to-Speech (TTS), and Cross-Lingual Information Retrieval (CLIR), highlights the inadequacy of existing monolingual datasets. Although some code-switching datasets exist, most are limited to bilingual mixing within homogeneous ethnic groups, leaving a critical need for a large-scale, diverse benchmark akin to ImageNet in computer vision. To bridge this gap, we introduce \textbf{LinguaMaster}, a multi-agent collaboration framework specifically designed for efficient and scalable multilingual data synthesis. Leveraging this framework, we curate \textbf{SwitchLingua}, the first large-scale multilingual and multi-ethnic code-switching dataset, including: (1) 420K CS textual samples across 12 languages, and (2) over 80 hours of audio recordings from 174 speakers representing 18 countries/regions and 63 racial/ethnic backgrounds, based on the textual data. This dataset captures rich linguistic and cultural diversity, offering a foundational resource for advancing multilingual and multicultural research. Furthermore, to address the issue that existing ASR evaluation metrics lack sensitivity to code-switching scenarios, we propose the \textbf{Semantic-Aware Error Rate (SAER)}, a novel evaluation metric that incorporates semantic information, providing a more accurate and context-aware assessment of system performance. Benchmark experiments on SwitchLingua with state-of-the-art ASR models reveal substantial performance gaps, underscoring the dataset’s utility as a rigorous benchmark for CS capability evaluation. In addition, SwitchLingua aims to encourage further research to promote cultural inclusivity and linguistic diversity in speech technology, fostering equitable progress in the ASR field. LinguaMaster (Code): github.com/Shelton1013/SwitchLingua, SwitchLingua (Data): https://huggingface.co/datasets/Shelton1013/SwitchLingua text, https://huggingface.co/datasets/Shelton1013/SwitchLingua audio

AAAI Conference 2025 Conference Paper

TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings

  • Dawei Yan
  • Pengcheng Li
  • Yang Li
  • Hao Chen
  • Qingguo Chen
  • Weihua Luo
  • Wei Dong
  • Qingsen Yan

Currently, inspired by the success of vision-language models (VLMs), an increasing number of researchers are focusing on improving VLMs and have achieved promising results. However, most existing methods concentrate on optimizing the connector and enhancing the language model component, while neglecting improvements to the vision encoder itself. In contrast, we propose Text Guided LLaVA (TG-LLaVA) in this paper, which optimizes VLMs by guiding the vision encoder with text, offering a new and orthogonal optimization direction. Specifically, inspired by the purpose-driven logic inherent in human behavior, we use learnable latent embeddings as a bridge to analyze the textual instruction and add the analysis results to the vision encoder as guidance, refining it. Subsequently, another set of latent embeddings extracts additional detailed text-guided information from high-resolution local patches as auxiliary information. Finally, with the guidance of text, the vision encoder can extract text-related features, similar to how humans focus on the most relevant parts of an image when considering a question. This results in better answers. Experiments on various datasets validate the effectiveness of the proposed method. Remarkably, without the need for additional training data, our proposed method brings more benefits to the baseline (LLaVA-1.5) than other concurrent methods. Furthermore, the proposed method consistently brings improvement in different settings.

AAAI Conference 2025 Conference Paper

Time Series Supplier Allocation via Deep Black-Litterman Model

  • Xinke Jiang
  • Wentao Zhang
  • Yuchen Fang
  • Xiaowei Gao
  • Hao Chen
  • Haoyu Zhang
  • Dingyi Zhuang
  • Jiayuan Luo

As a typical problem of Spatiotemporal Resource Management, Time Series Supplier Allocation (TSSA) poses a complex NP-hard challenge, aimed at refining future order dispatching strategies to satisfy the trade-off between demands and maximum supply. The Black-Litterman (BL) model, which originates from financial portfolio management, offers a new perspective on TSSA by balancing expected returns against insufficient supply risks. However, the BL model is not only constrained by manually constructed perspective matrices and spatio-temporal market dynamics, but also restricted by the absence of supervisory signals and unreliable supplier data. To address these limitations, we introduce the pioneering Deep Black-Litterman Model (DBLM) for TSSA, which innovatively adapts the BL model from the financial domain to the supply chain context. Specifically, DBLM leverages Spatio-Temporal Graph Neural Networks (STGNNs) to capture spatio-temporal dependencies and automatically generate future perspective matrices. Moreover, a novel Spearman rank correlation is designed as the DBLM supervision signal to navigate the complex risks and interactions among suppliers. Finally, DBLM uses a masking mechanism to counteract the bias of unreliable data, thus improving precision and reliability. Extensive experiments on two datasets demonstrate significant improvements of DBLM on TSSA.

NeurIPS Conference 2025 Conference Paper

Toward Engineering AGI: Benchmarking the Engineering Design Capabilities of LLMs

  • Xingang Guo
  • Yaxin Li
  • XiangYi Kong
  • YILAN JIANG
  • Xiayu Zhao
  • Zhihua Gong
  • Yufan Zhang
  • Daixuan Li

Modern engineering, spanning electrical, mechanical, aerospace, civil, and computer disciplines, stands as a cornerstone of human civilization and the foundation of our society. However, engineering design poses a fundamentally different challenge for large language models (LLMs) compared with traditional textbook-style problem solving or factual question answering. Although existing benchmarks have driven progress in areas such as language understanding, code synthesis, and scientific problem solving, real-world engineering design demands the synthesis of domain knowledge, navigation of complex trade-offs, and management of the tedious processes that consume much of practicing engineers' time. Despite these shared challenges across engineering disciplines, no benchmark currently captures the unique demands of engineering design work. In this work, we introduce EngDesign, an Engineering Design benchmark that evaluates LLMs' abilities to perform practical design tasks across nine engineering domains. Unlike existing benchmarks that focus on factual recall or question answering, EngDesign uniquely emphasizes LLMs' ability to synthesize domain knowledge, reason under constraints, and generate functional, objective-oriented engineering designs. Each task in EngDesign represents a real-world engineering design problem, accompanied by a detailed task description specifying design goals, constraints, and performance requirements. EngDesign pioneers a simulation-based evaluation paradigm that moves beyond textbook knowledge to assess genuine engineering design capabilities and shifts evaluation from static answer checking to dynamic, simulation-driven functional verification, marking a crucial step toward realizing the vision of engineering Artificial General Intelligence (AGI).

AAAI Conference 2025 Conference Paper

Towards Loss-Resilient Image Coding for Unstable Satellite Networks

  • Hongwei Sha
  • Muchen Dong
  • Quanyou Luo
  • Ming Lu
  • Hao Chen
  • Zhan Ma

Geostationary Earth Orbit (GEO) satellite communication demonstrates significant advantages in emergency short burst data services. However, unstable satellite networks, particularly those with frequent packet loss, present a severe challenge to accurate image transmission. To address this, we propose a loss-resilient image coding approach that leverages end-to-end optimization in learned image compression (LIC). Our method builds on the channel-wise progressive coding framework, incorporating Spatial-Channel Rearrangement (SCR) on the encoder side and Mask Conditional Aggregation (MCA) on the decoder side to improve reconstruction quality under unpredictable errors. By integrating the Gilbert-Elliott model into the training process, we enhance the model's ability to generalize to real-world network conditions. Extensive evaluations show that our approach outperforms traditional and deep learning-based methods in compression performance and stability under diverse packet loss, offering robust and efficient progressive transmission even in challenging environments.
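
For context, the Gilbert-Elliott channel is a standard two-state Markov model that alternates between a low-loss Good state and a high-loss Bad state, producing the bursty packet loss typical of satellite links. A minimal simulator is sketched below; the probabilities are illustrative placeholders, not the paper's training settings.

```python
# Two-state Gilbert-Elliott packet-loss simulator. Transition and per-state
# loss probabilities are illustrative placeholders.
import random

def gilbert_elliott(n_packets, p_g2b=0.05, p_b2g=0.3,
                    loss_good=0.01, loss_bad=0.5):
    """Return a list of booleans: True where the packet is lost."""
    bad, lost = False, []
    for _ in range(n_packets):
        # Markov transition: leave Bad with prob p_b2g, enter it with p_g2b
        bad = (random.random() >= p_b2g) if bad else (random.random() < p_g2b)
        lost.append(random.random() < (loss_bad if bad else loss_good))
    return lost

mask = gilbert_elliott(1000)   # e.g., drop coded chunks where mask[i] is True
```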

NeurIPS Conference 2025 Conference Paper

Unleashing Hour-Scale Video Training for Long Video-Language Understanding

  • Jingyang Lin
  • Jialian Wu
  • Ximeng Sun
  • Ze Wang
  • Jiang Liu
  • Yusheng Su
  • Xiaodong Yu
  • Hao Chen

Recent long-form video-language understanding benchmarks have driven progress in video large multimodal models (Video-LMMs). However, the scarcity of well-annotated long videos has left the training of hour-long Video-LMMs underexplored. To close this gap, we present VideoMarathon, a large-scale hour-long video instruction-following dataset. This dataset includes around 9,700 hours of long videos sourced from diverse domains, ranging from 3 to 60 minutes per video. Specifically, it contains 3.3M high-quality QA pairs, spanning six fundamental topics: temporality, spatiality, object, action, scene, and event. Compared to existing video instruction datasets, VideoMarathon significantly extends training video durations up to 1 hour, and supports 22 diverse tasks requiring both short- and long-term video comprehension. Building on VideoMarathon, we propose Hour-LLaVA, a powerful and efficient Video-LMM for hour-scale video-language modeling. It enables hour-long video training and inference at 1-FPS sampling by leveraging a memory augmentation module, which adaptively integrates question-relevant and spatiotemporally informative semantics from the cached full video context. In our experiments, Hour-LLaVA achieves the best performance on multiple representative long video-language benchmarks, demonstrating the high quality of the VideoMarathon dataset and the superiority of the Hour-LLaVA model.

JBHI Journal 2025 Journal Article

Unpaired Optical Coherence Tomography Angiography Image Super-Resolution via Frequency-Aware Inverse-Consistency GAN

  • Weiwen Zhang
  • Dawei Yang
  • Haoxuan Che
  • An Ran Ran
  • Carol Y. Cheung
  • Hao Chen

For optical coherence tomography angiography (OCTA) images, the limited scanning rate leads to a trade-off between field-of-view (FOV) and imaging resolution. Although larger FOV images may reveal more parafoveal vascular lesions, their application is hampered by lower resolution. To increase the resolution, previous works achieved satisfactory performance only by using paired data for training, but real-world applications are limited by the challenge of collecting large-scale paired images. Thus, an unpaired approach is highly demanded. Generative Adversarial Networks (GANs) have been commonly used in the unpaired setting, but they may struggle to accurately preserve fine-grained capillary details, which are critical biomarkers for OCTA. In this paper, our approach aspires to preserve these details by leveraging frequency information, which represents details as high frequencies ($hf$) and coarse-grained features as low frequencies ($lf$). We propose a GAN-based unpaired super-resolution method for OCTA images that places particular emphasis on the $hf$ fine capillaries through a dual-path generator. To facilitate a precise spectrum of the reconstructed image, we also propose a frequency-aware adversarial loss for the discriminator and introduce a frequency-aware focal consistency loss for end-to-end optimization. We collected a paired dataset for evaluation and show that our method outperforms other state-of-the-art unpaired methods both quantitatively and visually.

NeurIPS Conference 2025 Conference Paper

Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations

  • Jiaming Han
  • Hao Chen
  • Yang Zhao
  • Hanyu Wang
  • Qi Zhao
  • Ziyan Yang
  • Hao He
  • Xiangyu Yue

This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary. By integrating vision and text into a unified space with an expanded vocabulary, our multimodal LLM, Tar, enables cross-modal input and output through a shared interface, without the need for modality-specific designs. Additionally, we propose scale-adaptive encoding and decoding to balance efficiency and visual detail, along with a generative de-tokenizer to produce high-fidelity visual outputs. To address diverse decoding needs, we utilize two complementary de-tokenizers: a fast autoregressive model and a diffusion-based model. To enhance modality fusion, we investigate advanced pre-training tasks, demonstrating improvements in both visual understanding and generation. Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency. All code, models, and data will be made publicly available.

AAAI Conference 2024 Conference Paper

A Dynamic GCN with Cross-Representation Distillation for Event-Based Learning

  • Yongjian Deng
  • Hao Chen
  • Youfu Li

Recent advances in event-based research prioritize sparsity and temporal precision, and approaches that learn sparse point-based representations through graph CNNs (GCNs) have become more popular. Yet these graph techniques yield lower performance than their frame-based counterparts due to two issues: (i) biased graph structures that do not properly incorporate the varied attributes (such as semantics, and spatial and temporal signals) of each vertex, resulting in inaccurate graph representations; (ii) a shortage of robust pretrained models. Here we solve the first problem by proposing a new event-based GCN (EDGCN), with a dynamic aggregation module to integrate all attributes of vertices adaptively. To address the second problem, we introduce a novel learning framework called cross-representation distillation (CRD), which leverages the dense representation of events as a cross-representation auxiliary to provide additional supervision and prior knowledge for the event graph. This frame-to-graph distillation allows us to benefit from the large-scale priors provided by CNNs while still retaining the advantages of graph-based models. Extensive experiments show that our model and learning framework are effective and generalize well across multiple vision tasks.

NeurIPS Conference 2024 Conference Paper

A Motion-aware Spatio-temporal Graph for Video Salient Object Ranking

  • Hao Chen
  • Yufei Zhu
  • Yongjian Deng

Video salient object ranking aims to simulate the human attention mechanism by dynamically prioritizing the visual attraction of objects in a scene over time. Despite its numerous practical applications, this area remains underexplored. In this work, we propose a graph model for video salient object ranking. This graph simultaneously explores multi-scale spatial contrasts and intra-/inter-instance temporal correlations across frames to extract diverse spatio-temporal saliency cues. It has two advantages: 1. Unlike previous methods that only perform global inter-frame contrast or compare all proposals across frames globally, we explicitly model the motion of each instance by comparing its features with those in the same spatial region in adjacent frames, thus obtaining more accurate motion saliency cues. 2. We synchronize the spatio-temporal saliency cues in a single graph for joint optimization, which exhibits better dynamics compared to the previous stage-wise methods that prioritize spatial cues followed by temporal cues. Additionally, we propose a simple yet effective video retargeting method based on video saliency ranking. Extensive experiments demonstrate the superiority of our model in video salient object ranking and the effectiveness of the video retargeting method. Our codes/models are released at https://github.com/zyf-815/VSOR/tree/main.

NeurIPS Conference 2024 Conference Paper

A Simple Image Segmentation Framework via In-Context Examples

  • Yang Liu
  • Chenchen Jing
  • Hengtao Li
  • Muzhi Zhu
  • Hao Chen
  • Xinlong Wang
  • Chunhua Shen

Recently, there have been explorations of generalist segmentation models that can effectively tackle a variety of image segmentation tasks within a unified in-context learning framework. However, these methods still struggle with task ambiguity in in-context segmentation, as not all in-context examples can accurately convey the task information. To address this issue, we present SINE, a simple image $\textbf{S}$egmentation framework utilizing $\textbf{in}$-context $\textbf{e}$xamples. Our approach leverages a Transformer encoder-decoder structure, where the encoder provides high-quality image representations and the decoder is designed to yield multiple task-specific output masks to effectively eliminate task ambiguity. Specifically, we introduce an In-context Interaction module to complement in-context information and produce correlations between the target image and the in-context example, as well as a Matching Transformer that uses fixed matching and the Hungarian algorithm to eliminate differences between tasks. In addition, we further refine the current evaluation system for in-context image segmentation, aiming to facilitate a holistic appraisal of these models. Experiments on various segmentation tasks show the effectiveness of the proposed method.
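
Where the abstract mentions Hungarian matching, the underlying operation is a minimum-cost bipartite assignment between predictions and targets; a minimal SciPy example with a toy cost matrix is shown below (the actual costs in SINE would come from mask/feature similarities).

```python
# Minimal bipartite matching with the Hungarian algorithm, as used to pair
# predicted and target masks. The cost matrix here is a toy example.
import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([[0.2, 0.9, 0.5],     # cost[i, j]: dissimilarity between
                 [0.8, 0.1, 0.6],     # prediction i and target j
                 [0.4, 0.7, 0.3]])
rows, cols = linear_sum_assignment(cost)  # minimum-cost one-to-one matching
# Here rows=[0,1,2] is matched to cols=[0,1,2], with total cost 0.6.
```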

TIST Journal 2024 Journal Article

A Survey on Evaluation of Large Language Models

  • Yupeng Chang
  • Xu Wang
  • Jindong Wang
  • Yuan Wu
  • Linyi Yang
  • Kaijie Zhu
  • Hao Chen
  • Xiaoyuan Yi

Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level, but also at the society level for better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. Firstly, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Secondly, we answer the ‘where’ and ‘how’ questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLMs evaluation. Our aim is to offer invaluable insights to researchers in the realm of LLMs evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We consistently maintain the related open-source materials at: https://github.com/MLGroupJLU/LLM-eval-survey

JBHI Journal 2024 Journal Article

Adaptive Fusion of Deep Learning With Statistical Anatomical Knowledge for Robust Patella Segmentation From CT Images

  • Jiachen Zhao
  • Tianshu Jiang
  • Yi Lin
  • Lok-Chun Chan
  • Ping-Keung Chan
  • Chunyi Wen
  • Hao Chen

Knee osteoarthritis (KOA), a leading joint disease, can be assessed by examining the shape of the patella to spot potentially abnormal variations. To assist doctors in the diagnosis of KOA, a robust automatic patella segmentation method is highly demanded in clinical practice. Deep learning methods, especially convolutional neural networks (CNNs), have been widely applied to medical image segmentation in recent years. Nevertheless, poor image quality and limited data still impose challenges on segmentation via CNNs. On the other hand, statistical shape models (SSMs) can generate shape priors that give anatomically reliable segmentation for varying instances. Thus, in this work, we propose an adaptive fusion framework that explicitly combines deep neural networks and anatomical knowledge from SSMs for robust patella segmentation. Our adaptive fusion framework adjusts the weight of each segmentation candidate in the fusion according to its segmentation performance. We also propose a voxel-wise refinement strategy to make the CNN segmentation more anatomically correct. Extensive experiments and thorough assessment have been conducted on various mainstream CNN backbones for patella segmentation in low-data regimes, demonstrating that our framework can be flexibly attached to a CNN model, significantly improving its performance when labeled training data are limited and input images are of poor quality.

IROS Conference 2024 Conference Paper

BronchoCopilot: Towards Autonomous Robotic Bronchoscopy via Multimodal Reinforcement Learning

  • Jianbo Zhao
  • Hao Chen
  • Qingyao Tian
  • Jian Chen 0036
  • Bingyu Yang
  • Zihui Zhang
  • Hongbin Liu 0001

Bronchoscopy plays a significant role in the early diagnosis and treatment of lung diseases. This process demands physicians to maneuver the flexible endoscope to reach distal lesions, particularly requiring substantial expertise when examining the airways of the upper lung lobe. With the development of artificial intelligence and robotics, reinforcement learning (RL) methods have been applied to the manipulation of interventional surgical robots. However, unlike human physicians who utilize multimodal information, most current RL methods rely on a single modality, limiting their performance. In this paper, we propose BronchoCopilot, a multimodal RL agent designed to acquire manipulation skills for autonomous bronchoscopy. BronchoCopilot specifically integrates images from the bronchoscope camera and estimated robot poses, aiming for a higher success rate within challenging airway environments. We employ auxiliary reconstruction tasks to compress multimodal data and utilize attention mechanisms to achieve an efficient latent representation of these data, serving as input for the RL module. This framework adopts a stepwise training and fine-tuning approach to mitigate training difficulty. Our evaluation in a realistic simulation environment reveals that BronchoCopilot, by effectively harnessing multimodal information, attains a success rate of approximately 90% in fifth-generation airways with consistent movements. Additionally, it demonstrates a robust capacity to adapt to diverse cases.

NeurIPS Conference 2024 Conference Paper

Cost-efficient Knowledge-based Question Answering with Large Language Models

  • Junnan Dong
  • Qinggang Zhang
  • Chuang Zhou
  • Hao Chen
  • Daochen Zha
  • Xiao Huang

Knowledge-based question answering (KBQA) is widely used in many scenarios that necessitate domain knowledge. Large language models (LLMs) bring opportunities to KBQA, but their costs are significantly higher and they lack domain-specific knowledge from pre-training. We are motivated to combine LLMs and prior small models on knowledge graphs (KGMs) for both inferential accuracy and cost saving. However, this remains challenging since accuracy and cost are not readily combined as two distinct metrics in the optimization. Model selection is also laborious since different models excel at diverse knowledge. To this end, we propose Coke, a novel cost-efficient strategy for KBQA with LLMs, modeled as a tailored multi-armed bandit problem to minimize calls to LLMs within limited budgets. We first formulate the accuracy expectation with cluster-level Thompson Sampling for either KGMs or LLMs. A context-aware policy is optimized to further distinguish the expert model suited to the question semantics. The overall decision is bounded by the cost regret according to historical expenditure on failures. Extensive experiments showcase the superior performance of Coke, which moves the Pareto frontier with up to 20.89% savings in GPT-4 fees while achieving a 2.74% higher accuracy on the benchmark datasets.
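
As a toy illustration of the bandit view (not Coke's actual cluster-level, context-aware policy or its cost-regret bound), Thompson sampling with per-arm Beta posteriors and a simple cost penalty looks like this; all numbers are placeholders.

```python
# Toy Thompson-sampling routing between a cheap KG model and a costly LLM:
# sample an accuracy estimate from each arm's Beta posterior, penalize it by
# per-call cost, and pick the best arm. Numbers are illustrative only.
import random

arms = {"KGM": {"a": 1, "b": 1, "cost": 0.001},
        "LLM": {"a": 1, "b": 1, "cost": 0.020}}

def select_arm(budget_left):
    def score(name):
        acc = random.betavariate(arms[name]["a"], arms[name]["b"])
        return acc - arms[name]["cost"] / max(budget_left, 1e-9)
    return max(arms, key=score)

def update(name, success):
    arms[name]["a" if success else "b"] += 1   # Beta posterior update
```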

NeurIPS Conference 2024 Conference Paper

Fine Tuning Out-of-Vocabulary Item Recommendation with User Sequence Imagination

  • Ruochen Liu
  • Hao Chen
  • Yuanchen Bei
  • Qijie Shen
  • Fangwei Zhong
  • Senzhang Wang
  • Jianxin Wang

Recommending out-of-vocabulary (OOV) items is a challenging problem since the in-vocabulary (IV) items have well-trained behavioral embeddings but the OOV items only have content features. Current OOV recommendation models often generate "makeshift" embeddings for OOV items from content features and then recommend jointly with the "makeshift" OOV item embeddings and the behavioral IV item embeddings. However, merely using the "makeshift" embeddings results in suboptimal recommendation performance due to the substantial gap between the content features and the behavioral embeddings. To bridge the gap, we propose a novel User Sequence IMagination (USIM) fine-tuning framework, which first imagines the user sequences and then refines the generated OOV embeddings with the user behavioral embeddings. Specifically, we frame user sequence imagination as a reinforcement learning problem and develop a recommendation-focused reward function to evaluate to what extent a user can help recommend the OOV items. Besides, we propose an embedding-driven transition function to model the embedding transition after imagining a user. USIM has been deployed on a prominent e-commerce platform for months, offering recommendations for millions of OOV items and billions of users. Extensive experiments demonstrate that USIM outperforms traditional generative models in OOV item recommendation performance across traditional collaborative filtering and GNN-based collaborative filtering models.

NeurIPS Conference 2024 Conference Paper

FNP: Fourier Neural Processes for Arbitrary-Resolution Data Assimilation

  • Kun Chen
  • Peng Ye
  • Hao Chen
  • Kang Chen
  • Tao Han
  • Wanli Ouyang
  • Tao Chen
  • Lei Bai

Data assimilation is a vital component in modern global medium-range weather forecasting systems to obtain the best estimation of the atmospheric state by combining the short-term forecast and observations. Recently, AI-based data assimilation approaches have attracted increasing attention for their significant advantages over traditional techniques in terms of computational consumption. However, existing AI-based data assimilation methods can only handle observations with a specific resolution, lacking the compatibility and generalization ability to assimilate observations with other resolutions. Considering that complex real-world observations often have different resolutions, we propose the Fourier Neural Processes (FNP) for arbitrary-resolution data assimilation in this paper. Leveraging the efficiency of the designed modules and flexible structure of neural processes, FNP achieves state-of-the-art results in assimilating observations with varying resolutions, and also exhibits increasing advantages over the counterparts as the resolution and the amount of observations increase. Moreover, our FNP trained on a fixed resolution can directly handle the assimilation of observations with out-of-distribution resolutions and the observational information reconstruction task without additional fine-tuning, demonstrating its excellent generalization ability across data resolutions as well as across tasks. Code is available at https://github.com/OpenEarthLab/FNP.

NeurIPS Conference 2024 Conference Paper

Generalizing Weather Forecast to Fine-grained Temporal Scales via Physics-AI Hybrid Modeling

  • Wanghan Xu
  • Fenghua Ling
  • Wenlong Zhang
  • Tao Han
  • Hao Chen
  • Wanli Ouyang
  • Lei Bai

Data-driven artificial intelligence (AI) models have made significant advancements in weather forecasting, particularly in medium-range forecasting and nowcasting. However, most data-driven weather forecasting models are black-box systems that focus on learning data mappings rather than fine-grained physical evolution in the time dimension. Consequently, the limited temporal resolution of datasets prevents these models from forecasting at finer time scales. This paper proposes a physics-AI hybrid model (i.e., WeatherGFT) that generalizes weather forecasts to finer-grained temporal scales beyond the training dataset. Specifically, we employ a carefully designed PDE kernel to simulate physical evolution on a small time scale (e.g., 300 seconds) and use parallel neural networks with a learnable router for bias correction. Furthermore, we introduce a lead-time-aware training framework to promote the generalization of the model at different lead times. The weight analysis of the physics-AI modules indicates that physics conducts the major evolution while AI performs corrections adaptively. Extensive experiments show that WeatherGFT, trained on an hourly dataset, effectively generalizes forecasts across multiple time scales, including 30 minutes, which is finer than the dataset's temporal resolution.

JBHI Journal 2024 Journal Article

Guest Editorial: Trustworthy Machine Learning for Health Informatics

  • Luyang Luo
  • Daguang Xu
  • Jing Qin
  • Yueming Jin
  • Hao Chen

Machine learning (ML), the stem of today's artificial intelligence, has shown significant growth in the field of biomedical and health informatics. On the one hand, ML techniques are becoming more complex in order to deal with real-world data. On the other hand, ML is also becoming increasingly accessible to broader users. For example, automated machine learning products are enabling users to build their own ML models without writing code [1].

NeurIPS Conference 2024 Conference Paper

Imprecise Label Learning: A Unified Framework for Learning with Various Imprecise Label Configurations

  • Hao Chen
  • Ankit Shah
  • Jindong Wang
  • Ran Tao
  • Yidong Wang
  • Xiang Li
  • Xing Xie
  • Masashi Sugiyama

Learning with reduced labeling standards, such as noisy labels, partial labels, and supplementary unlabeled data, which we generically refer to as imprecise labels, is a commonplace challenge in machine learning tasks. Previous methods tend to propose specific designs for every emerging imprecise label configuration, which is usually unsustainable when multiple configurations of imprecision coexist. In this paper, we introduce imprecise label learning (ILL), a framework for the unification of learning with various imprecise label configurations. ILL leverages expectation-maximization (EM) for modeling the imprecise label information, treating the precise labels as latent variables. Instead of approximating the correct labels for training, it considers the entire distribution of all possible labelings entailed by the imprecise information. We demonstrate that ILL can seamlessly adapt to partial label learning, semi-supervised learning, noisy label learning, and, more importantly, a mixture of these settings, with closed-form learning objectives derived from the unified EM modeling. Notably, ILL surpasses the existing specified techniques for handling imprecise labels, marking the first practical and unified framework with robust and effective performance across various challenging settings. We hope our work will inspire further research on this topic, unleashing the full potential of ILL in wider scenarios where precise labels are expensive and complicated to obtain.
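
As a concrete instance of the unified EM modeling, the following sketch shows what a closed-form objective can look like for partial labels, where each sample carries a candidate label set. This is a minimal illustration under those assumptions, not the authors' released code.

    # Minimal sketch of the EM view for partial-label learning (illustrative).
    import torch
    import torch.nn.functional as F

    def ill_partial_label_loss(logits, candidate_mask):
        # E-step: posterior over the latent true label, restricted to the
        # candidate set (candidate_mask is a 0/1 tensor of shape (B, C)).
        probs = F.softmax(logits, dim=-1) * candidate_mask
        posterior = (probs / probs.sum(-1, keepdim=True)).detach()
        # M-step: expected negative log-likelihood under that posterior.
        return -(posterior * F.log_softmax(logits, dim=-1)).sum(-1).mean()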

NeurIPS Conference 2024 Conference Paper

Improved Generation of Adversarial Examples Against Safety-aligned LLMs

  • Qizhang Li
  • Yiwen Guo
  • Wangmeng Zuo
  • Hao Chen

Adversarial prompts (also called adversarial examples) generated using gradient-based methods exhibit outstanding performance in performing automatic jailbreak attacks against safety-aligned LLMs. Nevertheless, due to the discrete nature of texts, the input gradient of LLMs struggles to precisely reflect the magnitude of loss change that results from token replacements in the prompt, leading to limited attack success rates against safety-aligned LLMs, even in the white-box setting. In this paper, we explore a new perspective on this problem, suggesting that it can be alleviated by leveraging innovations inspired by transfer-based attacks that were originally proposed for attacking black-box image classification models. For the first time, we adapt the ideas behind effective transfer-based attacks, i.e., the Skip Gradient Method and Intermediate Level Attack, to gradient-based adversarial prompt generation and achieve significant performance gains without introducing obvious computational cost. Meanwhile, by discussing the mechanisms behind the gains, new insights are drawn, and proper combinations of these methods are also developed. Our empirical results show that 87% of the query-specific adversarial suffixes generated by the developed combination can induce Llama-2-7B-Chat to produce output that exactly matches the target string on AdvBench. This match rate is 33% higher than that of a very strong baseline known as GCG, demonstrating advanced discrete optimization for adversarial prompt generation against LLMs. In addition, without introducing obvious cost, the combination achieves a >30% absolute increase in attack success rates compared with GCG when generating both query-specific (38% -> 68%) and universal adversarial prompts (26.68% -> 60.32%) for attacking the Llama-2-7B-Chat model on AdvBench. Code at: https://github.com/qizhangli/Gradient-based-Jailbreak-Attacks.

NeurIPS Conference 2024 Conference Paper

KnowGPT: Knowledge Graph based Prompting for Large Language Models

  • Qinggang Zhang
  • Junnan Dong
  • Hao Chen
  • Daochen Zha
  • Zailiang Yu
  • Xiao Huang

Large Language Models (LLMs) have demonstrated remarkable capabilities in many real-world applications. Nonetheless, LLMs are often criticized for their tendency to produce hallucinations, wherein the models fabricate incorrect statements on tasks beyond their knowledge and perception. To alleviate this issue, graph retrieval-augmented generation (GraphRAG) has been extensively explored, which leverages the factual knowledge in knowledge graphs (KGs) to ground the LLM's responses in established facts and principles. However, most state-of-the-art LLMs are closed-source, making it challenging to develop a prompting framework that can efficiently and effectively integrate KGs into LLMs with hard prompts only. Generally, existing KG-enhanced LLMs usually suffer from three critical issues, including huge search space, high API costs, and laborious prompt engineering, that impede their widespread application in practice. To this end, we introduce a novel Knowledge Graph based PrompTing framework, namely KnowGPT, to enhance LLMs with domain knowledge. KnowGPT contains a knowledge extraction module to extract the most informative knowledge from KGs, and a context-aware prompt construction module to automatically convert extracted knowledge into effective prompts. Experiments on three benchmarks demonstrate that KnowGPT significantly outperforms all competitors. Notably, KnowGPT achieves 92.6% accuracy on the OpenbookQA leaderboard, comparable to human-level performance.

AAAI Conference 2024 Conference Paper

MDGNN: Multi-Relational Dynamic Graph Neural Network for Comprehensive and Dynamic Stock Investment Prediction

  • Hao Qian
  • Hongting Zhou
  • Qian Zhao
  • Hao Chen
  • Hongxiang Yao
  • Jingwei Wang
  • Ziqi Liu
  • Fei Yu

The stock market is a crucial component of the financial system, but predicting the movement of stock prices is challenging due to the dynamic and intricate relations arising from various aspects such as economic indicators, financial reports, global news, and investor sentiment. Traditional sequential methods and graph-based models have been applied in stock movement prediction, but they have limitations in capturing the multifaceted and temporal influences in stock price movements. To address these challenges, the Multi-relational Dynamic Graph Neural Network (MDGNN) framework is proposed, which utilizes a discrete dynamic graph to comprehensively capture multifaceted relations among stocks and their evolution over time. The representation generated from the graph offers a complete perspective on the interrelationships among stocks and associated entities. Additionally, the power of the Transformer structure is leveraged to encode the temporal evolution of multiplex relations, providing a dynamic and effective approach to predicting stock investment. Furthermore, our proposed MDGNN framework achieves the best performance on public datasets compared with state-of-the-art stock investment methods.

NeurIPS Conference 2024 Conference Paper

Metric from Human: Zero-shot Monocular Metric Depth Estimation via Test-time Adaptation

  • Yizhou Zhao
  • Hengwei Bian
  • Kaihua Chen
  • Pengliang Ji
  • Liao Qu
  • Shao-yu Lin
  • Weichen Yu
  • Haoran Li

Monocular depth estimation (MDE) is fundamental for deriving 3D scene structures from 2D images. While state-of-the-art monocular relative depth estimation (MRDE) excels in estimating relative depths for in-the-wild images, current monocular metric depth estimation (MMDE) approaches still face challenges in handling unseen scenes. Since MMDE can be viewed as the composition of MRDE and metric scale recovery, we attribute this difficulty to scene dependency, where MMDE models rely on scenes observed during supervised training for predicting scene scales during inference. To address this issue, we propose to use humans as landmarks for distilling scene-independent metric scale priors from generative painting models. Our approach, Metric from Human (MfH), bridges from generalizable MRDE to zero-shot MMDE in a generate-and-estimate manner. Specifically, MfH generates humans on the input image with generative painting and estimates human dimensions with an off-the-shelf human mesh recovery (HMR) model. Based on MRDE predictions, it propagates the metric information from painted humans to the contexts, resulting in metric depth estimations for the original input. Through this annotation-free test-time adaptation, MfH achieves superior zero-shot performance in MMDE, demonstrating its strong generalization ability.
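
The scale-recovery step can be pictured as simple arithmetic once a painted human's metric size is known; the sketch below is a deliberately simplified illustration of that anchoring, with all names assumed rather than taken from the paper.

    # Illustrative sketch: a human of known metric height anchors the scene scale.
    def metric_scale(human_height_m, human_height_rel):
        # Ratio of the HMR-estimated metric height to the height implied by the
        # relative-depth prediction gives a scene scale factor.
        return human_height_m / human_height_rel

    # metric_depth = metric_scale(1.75, rel_height) * relative_depth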

AAAI Conference 2024 Conference Paper

MICA: Towards Explainable Skin Lesion Diagnosis via Multi-Level Image-Concept Alignment

  • Yequan Bie
  • Luyang Luo
  • Hao Chen

Black-box deep learning approaches have showcased significant potential in the realm of medical image analysis. However, the stringent trustworthiness requirements intrinsic to the medical field have catalyzed research into the utilization of Explainable Artificial Intelligence (XAI), with a particular focus on concept-based methods. Existing concept-based methods predominantly apply concept annotations from a single perspective (e.g., global level), neglecting the nuanced semantic relationships between sub-regions and concepts embedded within medical images. This leads to underutilization of the valuable medical information and may cause models to fall short in harmoniously balancing interpretability and performance when employing inherently interpretable architectures such as Concept Bottlenecks. To mitigate these shortcomings, we propose a multi-modal explainable disease diagnosis framework that meticulously aligns medical images and clinical-related concepts semantically at multiple strata, encompassing the image level, token level, and concept level. Moreover, our method allows for model intervention and offers both textual and visual explanations in terms of human-interpretable concepts. Experimental results on three skin image datasets demonstrate that our method, while preserving model interpretability, attains high performance and label efficiency for concept detection and disease diagnosis. The code is available at https://github.com/Tommy-Bie/MICA.

ECAI Conference 2024 Conference Paper

Mixup Your Own Latent: Efficient and Robust Self-Supervised Learning on Small Images

  • Eugene Yang
  • Hao Chen
  • Seokho Kang 0001

Self-supervised learning has emerged as a powerful technique in computer vision, demonstrating remarkable performance in various downstream tasks by leveraging unlabeled data. Among these methods, contrastive learning has proven particularly promising by effectively learning image representations. However, its high reliance on large computational resources poses significant practical challenges. To address this issue, there is a pressing need to improve efficiency without compromising generalization performance and robustness. In this paper, we propose Mixup Your Own Latent (MYOL), a regularization method to improve the generalization performance and robustness of Bootstrap Your Own Latent (BYOL), particularly for small images under limited computational resources. MYOL achieves this by using the mixup of the representations of two input images as the target representation for the mixup of those images. Through experiments conducted in a single-GPU environment, we demonstrate that MYOL outperforms BYOL and other regularization methods across various downstream tasks on small-image datasets. The high resilience of MYOL to small batch sizes and its robustness to adversarial attacks further highlight its effectiveness in mitigating the limitations of BYOL. The source code is available at https://github.com/cneyang/MYOL-MixupYourOwnLatent.
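
A minimal sketch of the target construction described above, assuming BYOL-style online/target networks already exist (function and variable names are illustrative):

    import torch

    def myol_pair(target_net, x1, x2, lam):
        # The target for the mixed image lam*x1 + (1-lam)*x2 is the same mixup
        # of the two images' target-network representations.
        with torch.no_grad():
            z1, z2 = target_net(x1), target_net(x2)
        x_mix = lam * x1 + (1 - lam) * x2
        z_target = lam * z1 + (1 - lam) * z2
        return x_mix, z_target

    # The online network then predicts z_target from x_mix, exactly as in BYOL.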

AAMAS Conference 2024 Conference Paper

Mutual Information as Intrinsic Reward of Reinforcement Learning Agents for On-demand Ride Pooling

  • Xianjie Zhang
  • Jiahao Sun
  • Chen Gong
  • Kai Wang
  • Yifei Cao
  • Hao Chen
  • Yu Liu

The emergence of on-demand ride pooling services allows each vehicle to serve multiple passengers at a time, thus increasing drivers' income and enabling passengers to travel at lower prices than taxi/car on-demand services. Although on-demand ride pooling services can bring many benefits, they need a well-defined matching strategy to maximize the benefits for all parties (passengers, drivers, aggregation companies, and the environment); in particular, the regional dispatching of vehicles has a significant impact on matching and revenue. Existing algorithms often only consider revenue maximization, which makes it difficult for requests with unusual distributions to get rides. How to increase revenue while ensuring a reasonable assignment of requests brings a challenge to ride pooling service companies (aggregation companies). In this paper, we propose a framework for vehicle dispatching for ride pooling tasks, which splits the city into discrete dispatching regions and uses a reinforcement learning (RL) algorithm to dispatch vehicles in these regions. We also use the mutual information (MI) between the vehicle and request distributions as the intrinsic reward of the RL algorithm to improve the correlation between these distributions, thus ensuring the possibility of getting a ride for unusually distributed requests. In experiments on a real-world taxi dataset, we demonstrate that our framework can increase revenue by an average of up to 3% over the best existing on-demand ride pooling method.

ICRA Conference 2024 Conference Paper

Optimization of Flexible Bronchoscopy Shape Sensing Using Fiber Optic Sensors

  • Xinran Liu
  • Hao Chen
  • Hongbin Liu

This work presents a novel shape evaluation and optimization approach for shape sensing, specifically targeting the constrained, irregular, and intricate spatial shapes of flexible bronchoscopes (FB) in the human bronchial tree. The proposed evaluation criteria and optimization methods combine clinical significance related to bronchial anatomical structures and address issues related to singular points and discontinuities in traditional shape reconstruction models. Three-dimensional experiments were conducted within eight spatial complex configurations printed from a proportional bronchial model. The 3D experiment results demonstrate an average reduction of approximately 34.1% in shape reconstruction errors across all eight airway models compared to the traditional model, validating the effectiveness and feasibility of the proposed approach.

AAMAS Conference 2024 Conference Paper

PDiT: Interleaving Perception and Decision-making Transformers for Deep Reinforcement Learning

  • Hangyu Mao
  • Rui Zhao
  • Ziyue Li
  • Zhiwei Xu
  • Hao Chen
  • Yiqun Chen
  • Bin Zhang
  • Zhen Xiao

Designing better deep networks and better reinforcement learning (RL) algorithms are both important for deep RL. This work studies the former. Specifically, the Perception and Decision-making Interleaving Transformer (PDiT) network is proposed, which cascades two Transformers in a very natural way: the perceiving one focuses on environmental perception by processing the observation at the patch level, whereas the deciding one pays attention to decision-making by conditioning on the history of the desired returns, the perceiver's outputs, and the actions. Such a network design is generally applicable to a lot of deep RL settings, e.g., both online and offline RL algorithms under environments with either image observations, proprioception observations, or hybrid image-language observations. Extensive experiments show that PDiT can not only achieve performance superior to strong baselines in different settings but also extract explainable feature representations. Our code is available at https://github.com/maohangyu/PDiT.

JMLR Journal 2024 Journal Article

PromptBench: A Unified Library for Evaluation of Large Language Models

  • Kaijie Zhu
  • Qinlin Zhao
  • Hao Chen
  • Jindong Wang
  • Xing Xie

The evaluation of large language models (LLMs) is crucial to assess their performance and mitigate potential security risks. In this paper, we introduce PromptBench, a unified library to evaluate LLMs. It consists of several key components that can be easily used and extended by researchers: prompt construction, prompt engineering, dataset and model loading, adversarial prompt attack, dynamic evaluation protocols, and analysis tools. PromptBench is designed as an open, general, and flexible codebase for research purposes. It aims to facilitate original study in creating new benchmarks, deploying downstream applications, and designing new evaluation protocols. The code is available at https://github.com/microsoft/promptbench and will be continuously supported.

AAAI Conference 2024 Conference Paper

PromptMRG: Diagnosis-Driven Prompts for Medical Report Generation

  • Haibo Jin
  • Haoxuan Che
  • Yi Lin
  • Hao Chen

Automatic medical report generation (MRG) is of great research value as it has the potential to relieve radiologists from the heavy burden of report writing. Despite recent advancements, accurate MRG remains challenging due to the need for precise clinical understanding and disease identification. Moreover, the imbalanced distribution of diseases makes the challenge even more pronounced, as rare diseases are underrepresented in training data, making their diagnosis unreliable. To address these challenges, we propose diagnosis-driven prompts for medical report generation (PromptMRG), a novel framework that aims to improve the diagnostic accuracy of MRG with the guidance of diagnosis-aware prompts. Specifically, PromptMRG is based on an encoder-decoder architecture with an extra disease classification branch. When generating reports, the diagnostic results from the classification branch are converted into token prompts to explicitly guide the generation process. To further improve the diagnostic accuracy, we design cross-modal feature enhancement, which retrieves similar reports from the database to assist the diagnosis of a query image by leveraging the knowledge from a pre-trained CLIP. Moreover, the disease imbalance issue is addressed by applying an adaptive logit-adjusted loss to the classification branch based on the individual learning status of each disease, which overcomes the barrier of the text decoder's inability to manipulate disease distributions. Experiments on two MRG benchmarks show the effectiveness of the proposed method, where it obtains state-of-the-art clinical efficacy performance on both datasets.
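
The conversion of diagnostic results into token prompts can be pictured as follows; the token scheme here is an assumed illustration, not the paper's exact vocabulary:

    def diagnosis_prompt(disease_probs, disease_names, threshold=0.5):
        # Turn classification-branch outputs into prompt tokens that condition
        # the report decoder (illustrative token format).
        tokens = [f"[{name}:{'POS' if p > threshold else 'NEG'}]"
                  for name, p in zip(disease_names, disease_probs)]
        return " ".join(tokens)

    # diagnosis_prompt([0.9, 0.2], ["edema", "effusion"])
    # -> "[edema:POS] [effusion:NEG]"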

ICLR Conference 2024 Conference Paper

Prototypical Information Bottlenecking and Disentangling for Multimodal Cancer Survival Prediction

  • Yilan Zhang
  • Yingxue Xu
  • Jianqi Chen
  • Fengying Xie
  • Hao Chen

Multimodal learning significantly benefits cancer survival prediction, especially the integration of pathological images and genomic data. Despite advantages of multimodal learning for cancer survival prediction, massive redundancy in multimodal data prevents it from extracting discriminative and compact information: (1) An extensive amount of intra-modal task-unrelated information blurs discriminability, especially for gigapixel whole slide images (WSIs) with many patches in pathology and thousands of pathways in genomic data, leading to an "intra-modal redundancy" issue. (2) Duplicated information among modalities dominates the representation of multimodal data, which makes modality-specific information prone to being ignored, resulting in an "inter-modal redundancy" issue. To address these, we propose a new framework, Prototypical Information Bottlenecking and Disentangling (PIBD), consisting of Prototypical Information Bottleneck (PIB) module for intra-modal redundancy and Prototypical Information Disentanglement (PID) module for inter-modal redundancy. Specifically, a variant of information bottleneck, PIB, is proposed to model prototypes approximating a bunch of instances for different risk levels, which can be used for selection of discriminative instances within modality. PID module decouples entangled multimodal data into compact distinct components: modality-common and modality-specific knowledge, under the guidance of the joint prototypical distribution. Extensive experiments on five cancer benchmark datasets demonstrated our superiority over other methods. The code is released.

NeurIPS Conference 2024 Conference Paper

Prune and Repaint: Content-Aware Image Retargeting for any Ratio

  • Feihong Shen
  • Chao Li
  • Yifeng Geng
  • Yongjian Deng
  • Hao Chen

Image retargeting is the task of adjusting the aspect ratio of images to suit different display devices or presentation environments. However, existing retargeting methods often struggle to balance the preservation of key semantics and image quality, resulting in either deformation or loss of important objects, or the introduction of local artifacts such as discontinuous pixels and inconsistent regenerated content. To address these issues, we propose a content-aware retargeting method called PruneRepaint. It incorporates semantic importance for each pixel to guide the identification of regions that need to be pruned or preserved in order to maintain key semantics. Additionally, we introduce an adaptive repainting module that selects image regions for repainting based on the distribution of pruned pixels and the proportion between foreground size and target aspect ratio, thus achieving local smoothness after pruning. By focusing on the content and structure of the foreground, our PruneRepaint approach adaptively avoids key content loss and deformation, while effectively mitigating artifacts with local repainting. We conduct experiments on the public RetargetMe benchmark and demonstrate through objective experimental results and subjective user studies that our method outperforms previous approaches in terms of preserving semantics and aesthetics, as well as better generalization across diverse aspect ratios. Codes will be available at https://github.com/fhshen2022/PruneRepaint.

AAAI Conference 2024 Conference Paper

Retrieval-Augmented Primitive Representations for Compositional Zero-Shot Learning

  • Chenchen Jing
  • Yukun Li
  • Hao Chen
  • Chunhua Shen

Compositional zero-shot learning (CZSL) aims to recognize unseen attribute-object compositions by learning from seen compositions. Composing the learned knowledge of seen primitives, i.e., attributes or objects, into novel compositions is critical for CZSL. In this work, we propose to explicitly retrieve knowledge of seen primitives for compositional zero-shot learning. We present a retrieval-augmented method, which augments standard multi-path classification methods with two retrieval modules. Specifically, we construct two databases storing the attribute and object representations of training images, respectively. For an input training/testing image, we use two retrieval modules to retrieve representations of training images with the same attribute and object, respectively. The primitive representations of the input image are augmented by using the retrieved representations, for composition recognition. By referencing semantically similar images, the proposed method is capable of recalling knowledge of seen primitives for compositional generalization. Experiments on three widely-used datasets show the effectiveness of the proposed method.
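
A minimal sketch of one retrieval module, under the assumption that training-image attribute (or object) representations are stored in a flat tensor database; all names are illustrative:

    import torch
    import torch.nn.functional as F

    def retrieve_and_augment(query, database, k=5, alpha=0.5):
        # database: (N, D) stored primitive representations; query: (D,).
        sims = F.cosine_similarity(database, query[None, :], dim=-1)
        retrieved = database[sims.topk(k).indices].mean(dim=0)
        # Blend retrieved knowledge of seen primitives into the query's
        # primitive representation before composition recognition.
        return alpha * query + (1 - alpha) * retrieved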

AAAI Conference 2024 Conference Paper

Revisiting Open-Set Panoptic Segmentation

  • Yufei Yin
  • Hao Chen
  • Wengang Zhou
  • Jiajun Deng
  • Haiming Xu
  • Houqiang Li

In this paper, we focus on the open-set panoptic segmentation (OPS) task to circumvent the data explosion problem. Different from the close-set setting, OPS aims to detect both known and unknown categories, where the latter are not annotated during training. Different from existing work that only selects a few common categories as unknown ones, we move forward to the real-world scenario by considering the various tail categories (~1k). To this end, we first build a new dataset with long-tail distribution for the OPS task. Based on this dataset, we additionally add a new class type for unknown classes and re-define the training annotations to make the OPS definition more complete and reasonable. Moreover, we analyze the influence of several significant factors in the OPS task and explore the upper bound of performance on unknown classes with different settings. Furthermore, based on the analyses, we design an effective two-phase framework for the OPS task, including thing-agnostic map generation and unknown segment mining. We further adopt semi-supervised learning to improve the OPS performance. Experimental results on different datasets validate the effectiveness of our method.

IJCAI Conference 2024 Conference Paper

Score-CDM: Score-Weighted Convolutional Diffusion Model for Multivariate Time Series Imputation

  • Shunyang Zhang
  • Senzhang Wang
  • Hao Miao
  • Hao Chen
  • Changjun Fan
  • Jian Zhang

Multivariate time series (MTS) data are usually incomplete in real scenarios, and imputing the incomplete MTS is practically important to facilitate various time series mining tasks. Recently, diffusion model-based MTS imputation methods have achieved promising results by utilizing CNN or attention mechanisms for temporal feature learning. However, it is hard to adaptively trade off the diverse effects of local and global temporal features by simply combining CNN and attention. To address this issue, we propose a Score-weighted Convolutional Diffusion Model (Score-CDM for short), whose backbone consists of a Score-weighted Convolution Module (SCM) and an Adaptive Reception Module (ARM). SCM adopts a score map to capture the global temporal features in the time domain, while ARM uses a Spectral2Time Window Block (S2TWB) to convolve the local time series data in the spectral domain. Benefiting from the time convolution properties of Fast Fourier Transformation, ARM can adaptively change the receptive field of the score map, and thus effectively balance the local and global temporal features. We conduct extensive evaluations on three real MTS datasets of different domains, and the results verify the effectiveness of the proposed Score-CDM.

NeurIPS Conference 2024 Conference Paper

Slight Corruption in Pre-training Data Makes Better Diffusion Models

  • Hao Chen
  • Yujin Han
  • Diganta Misra
  • Xiang Li
  • Kai Hu
  • Difan Zou
  • Masashi Sugiyama
  • Jindong Wang

Diffusion models (DMs) have shown remarkable capabilities in generating realistic high-quality images, audios, and videos. They benefit significantly from extensive pre-training on large-scale datasets, including web-crawled data with paired data and conditions, such as image-text and image-class pairs. Despite rigorous filtering, these pre-training datasets often inevitably contain corrupted pairs where conditions do not accurately describe the data. This paper presents the first comprehensive study on the impact of such corruption in pre-training data of DMs. We synthetically corrupt ImageNet-1K and CC3M to pre-train and evaluate over $50$ conditional DMs. Our empirical findings reveal that various types of slight corruption in pre-training can significantly enhance the quality, diversity, and fidelity of the generated images across different DMs, both during pre-training and downstream adaptation stages. Theoretically, we consider a Gaussian mixture model and prove that slight corruption in the condition leads to higher entropy and a reduced 2-Wasserstein distance between the distribution generated by the corruptly trained DMs and the ground-truth data distribution. Inspired by our analysis, we propose a simple method to improve the training of DMs on practical datasets by adding condition embedding perturbations (CEP). CEP significantly improves the performance of various DMs in both pre-training and downstream tasks. We hope that our study provides new insights into understanding the data and pre-training processes of DMs.
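
The proposed CEP method amounts to a one-line perturbation during training; below is a minimal sketch with an assumed noise scale gamma:

    import torch

    def perturb_condition(cond_emb, gamma=0.1):
        # Condition embedding perturbation: add small Gaussian noise to the
        # condition embedding before it conditions the diffusion model.
        return cond_emb + gamma * torch.randn_like(cond_emb)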

NeurIPS Conference 2024 Conference Paper

Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?

  • Pedro R. Bassi
  • Wenxuan Li
  • Yucheng Tang
  • Fabian Isensee
  • Zifu Wang
  • Jieneng Chen
  • Yu-Cheng Chou
  • Saikat Roy

How can we test AI performance? This question seems trivial, but it isn't. Standard benchmarks often have problems such as in-distribution and small-size test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, good performance on standard benchmarks does not guarantee success in real-world scenarios. To address these problems, we present Touchstone, a large-scale collaborative segmentation benchmark of 9 types of abdominal organs. This benchmark is based on 5,195 training CT scans from 76 hospitals around the world and 5,903 testing CT scans from 11 additional hospitals. This diverse test set enhances the statistical significance of benchmark results and rigorously evaluates AI algorithms across various out-of-distribution scenarios. We invited 14 inventors of 19 AI algorithms to train their algorithms, while our team, as a third party, independently evaluated these algorithms on three test sets. In addition, we also evaluated pre-existing AI frameworks (which, unlike algorithms, are more flexible and can support different algorithms), including MONAI from NVIDIA, nnU-Net from DKFZ, and numerous other open-source frameworks. We are committed to expanding this benchmark to encourage more innovation of AI algorithms for the medical domain.

NeurIPS Conference 2024 Conference Paper

Transformer Doctor: Diagnosing and Treating Vision Transformers

  • Jiacong Hu
  • Hao Chen
  • Kejia Chen
  • Yang Gao
  • Jingwen Ye
  • Xingen Wang
  • Mingli Song
  • Zunlei Feng

Due to their powerful representational capabilities, Transformers have gradually become the mainstream model in the field of machine vision. However, the vast and complex parameters of Transformers impede researchers from gaining a deep understanding of their internal mechanisms, especially error mechanisms. Existing methods for interpreting Transformers mainly focus on understanding them from the perspectives of the importance of input tokens or internal modules, as well as the formation and meaning of features. In contrast, inspired by research on information integration mechanisms and conjunctive errors in the biological visual system, this paper conducts an in-depth exploration of the internal error mechanisms of Transformers. We first propose an information integration hypothesis for Transformers in the machine vision domain and provide substantial experimental evidence to support this hypothesis. This includes the dynamic integration of information among tokens and the static integration of information within tokens in Transformers, as well as the presence of conjunctive errors therein. Addressing these errors, we further propose heuristic dynamic integration constraint methods and rule-based static integration constraint methods to rectify errors and ultimately improve model performance. The entire methodology framework is termed Transformer Doctor, designed for diagnosing and treating internal errors within Transformers. Through a plethora of quantitative and qualitative experiments, we demonstrate that Transformer Doctor can effectively address internal errors in Transformers, thereby enhancing model performance.

NeurIPS Conference 2024 Conference Paper

Unleashing the Potential of the Diffusion Model in Few-shot Semantic Segmentation

  • Muzhi Zhu
  • Yang Liu
  • Zekai Luo
  • Chenchen Jing
  • Hao Chen
  • Guangkai Xu
  • Xinlong Wang
  • Chunhua Shen

The Diffusion Model has not only garnered noteworthy achievements in the realm of image generation but has also demonstrated its potential as an effective pretraining method utilizing unlabeled data. Drawing from the extensive potential unveiled by the Diffusion Model in both semantic correspondence and open vocabulary segmentation, our work initiates an investigation into employing the Latent Diffusion Model for Few-shot Semantic Segmentation. Recently, inspired by the in-context learning ability of large language models, Few-shot Semantic Segmentation has evolved into In-context Segmentation tasks, morphing into a crucial element in assessing generalist segmentation models. In this context, we concentrate on Few-shot Semantic Segmentation, establishing a solid foundation for the future development of a Diffusion-based generalist model for segmentation. Our initial focus lies in understanding how to facilitate interaction between the query image and the support image, resulting in the proposal of a KV fusion method within the self-attention framework. Subsequently, we delve deeper into optimizing the infusion of information from the support mask and simultaneously re-evaluating how to provide reasonable supervision from the query mask. Based on our analysis, we establish a simple and effective framework named DiffewS, maximally retaining the original Latent Diffusion Model's generative framework and effectively utilizing the pre-training prior. Experimental results demonstrate that our method significantly outperforms the previous SOTA models in multiple settings.

AAAI Conference 2023 Conference Paper

Consensus Learning for Cooperative Multi-Agent Reinforcement Learning

  • Zhiwei Xu
  • Bin Zhang
  • Dapeng Li
  • Zeren Zhang
  • Guangchong Zhou
  • Hao Chen
  • Guoliang Fan

Almost all multi-agent reinforcement learning algorithms without communication follow the principle of centralized training with decentralized execution. During the centralized training, agents can be guided by the same signals, such as the global state. However, agents lack the shared signal and choose actions given local observations during execution. Inspired by viewpoint invariance and contrastive learning, we propose consensus learning for cooperative multi-agent reinforcement learning in this study. Although based on local observations, different agents can infer the same consensus in discrete spaces without communication. We feed the inferred one-hot consensus to the network of agents as an explicit input in a decentralized way, thereby fostering their cooperative spirit. With minor model modifications, our suggested framework can be extended to a variety of multi-agent reinforcement learning algorithms. Moreover, we carry out these variants on some fully cooperative tasks and get convincing results.

NeurIPS Conference 2023 Conference Paper

DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models

  • Weijia Wu
  • Yuzhong Zhao
  • Hao Chen
  • Yuchao Gu
  • Rui Zhao
  • Yefei He
  • Hong Zhou
  • Mike Zheng Shou

Current deep networks are very data-hungry and benefit from training on large-scale datasets, which are often time-consuming to collect and annotate. By contrast, synthetic data can be generated infinitely using generative models such as DALL-E and diffusion models, with minimal effort and cost. In this paper, we present DatasetDM, a generic dataset generation model that can produce diverse synthetic images and the corresponding high-quality perception annotations (e.g., segmentation masks and depth). Our method builds upon the pre-trained diffusion model and extends text-guided image synthesis to perception data generation. We show that the rich latent code of the diffusion model can be effectively decoded as accurate perception annotations using a decoder module. Training the decoder only requires less than 1% (around 100 images) of manually labeled images, enabling the generation of an infinitely large annotated dataset. Then these synthetic data can be used for training various perception models on downstream tasks. To showcase the power of the proposed approach, we generate datasets with rich dense pixel-wise labels for a wide range of downstream tasks, including semantic segmentation, instance segmentation, and depth estimation. Notably, it achieves 1) state-of-the-art results on semantic segmentation and instance segmentation; 2) significantly better efficiency and robustness in domain generalization than real data; 3) state-of-the-art results in the zero-shot segmentation setting; and 4) flexibility for efficient application and novel task composition (e.g., image editing).

IJCAI Conference 2023 Conference Paper

Diagnose Like a Pathologist: Transformer-Enabled Hierarchical Attention-Guided Multiple Instance Learning for Whole Slide Image Classification

  • Conghao Xiong
  • Hao Chen
  • Joseph J. Y. Sung
  • Irwin King

Multiple Instance Learning (MIL) and transformers are increasingly popular in histopathology Whole Slide Image (WSI) classification. However, unlike human pathologists who selectively observe specific regions of histopathology tissues under different magnifications, most methods do not incorporate multiple resolutions of the WSIs, hierarchically and attentively, thereby leading to a loss of focus on the WSIs and information from other resolutions. To resolve this issue, we propose a Hierarchical Attention-Guided Multiple Instance Learning framework to fully exploit the WSIs. This framework can dynamically and attentively discover the discriminative regions across multiple resolutions of the WSIs. Within this framework, an Integrated Attention Transformer is proposed to further enhance the performance of the transformer and obtain a more holistic WSI (bag) representation. This transformer consists of multiple Integrated Attention Modules, each of which combines a transformer layer and an aggregation module that produces a bag representation based on every instance representation in that bag. The experimental results show that our method achieved state-of-the-art performances on multiple datasets, including Camelyon16, TCGA-RCC, TCGA-NSCLC, and an in-house IMGC dataset. The code is available at https://github.com/BearCleverProud/HAG-MIL.

NeurIPS Conference 2023 Conference Paper

Improving Adversarial Transferability via Intermediate-level Perturbation Decay

  • Qizhang Li
  • Yiwen Guo
  • Wangmeng Zuo
  • Hao Chen

Intermediate-level attacks that attempt to perturb feature representations following an adversarial direction drastically have shown favorable performance in crafting transferable adversarial examples. Existing methods in this category are normally formulated with two separate stages, where a directional guide is required to be determined at first and the scalar projection of the intermediate-level perturbation onto the directional guide is enlarged thereafter. The obtained perturbation deviates from the guide inevitably in the feature space, and it is revealed in this paper that such a deviation may lead to sub-optimal attack. To address this issue, we develop a novel intermediate-level method that crafts adversarial examples within a single stage of optimization. In particular, the proposed method, named intermediate-level perturbation decay (ILPD), encourages the intermediate-level perturbation to be in an effective adversarial direction and to possess a great magnitude simultaneously. In-depth discussion verifies the effectiveness of our method. Experimental results show that it outperforms state-of-the-art methods by large margins in attacking various victim models on ImageNet (+10.07% on average) and CIFAR-10 (+3.88% on average). Our code is at https://github.com/qizhangli/ILPD-attack.

NeurIPS Conference 2023 Conference Paper

Towards Evaluating Transfer-based Attacks Systematically, Practically, and Fairly

  • Qizhang Li
  • Yiwen Guo
  • Wangmeng Zuo
  • Hao Chen

The adversarial vulnerability of deep neural networks (DNNs) has drawn great attention due to the security risk of applying these models in real-world applications. Based on the transferability of adversarial examples, an increasing number of transfer-based methods have been developed to fool black-box DNN models whose architecture and parameters are inaccessible. Although tremendous effort has been exerted, there is still no standardized benchmark that can be used to compare these methods systematically, fairly, and practically. Our investigation shows that the evaluation of some methods needs to be more reasonable and more thorough to verify their effectiveness, to avoid, for example, unfair comparison and insufficient consideration of possible substitute/victim models. Therefore, we establish a transfer-based attack benchmark (TA-Bench) which implements 30+ methods. In this paper, we evaluate and compare them comprehensively on 10 popular substitute/victim models on ImageNet. New insights about the effectiveness of these methods are gained and guidelines for future evaluations are provided.

NeurIPS Conference 2023 Conference Paper

Train Once, Get a Family: State-Adaptive Balances for Offline-to-Online Reinforcement Learning

  • Shenzhi Wang
  • Qisen Yang
  • Jiawei Gao
  • Matthieu Lin
  • Hao Chen
  • Liwei Wu
  • Ning Jia
  • Shiji Song

Offline-to-online reinforcement learning (RL) is a training paradigm that combines pre-training on a pre-collected dataset with fine-tuning in an online environment. However, the incorporation of online fine-tuning can intensify the well-known distributional shift problem. Existing solutions tackle this problem by imposing a policy constraint on the policy improvement objective in both offline and online learning. They typically advocate a single balance between policy improvement and constraints across diverse data collections. This one-size-fits-all manner may not optimally leverage each collected sample due to the significant variation in data quality across different states. To this end, we introduce Family Offline-to-Online RL (FamO2O), a simple yet effective framework that empowers existing algorithms to determine state-adaptive improvement-constraint balances. FamO2O utilizes a universal model to train a family of policies with different improvement/constraint intensities, and a balance model to select a suitable policy for each state. Theoretically, we prove that state-adaptive balances are necessary for achieving a higher policy performance upper bound. Empirically, extensive experiments show that FamO2O offers a statistically significant improvement over various existing methods, achieving state-of-the-art performance on the D4RL benchmark. Codes are available at https://github.com/LeapLabTHU/FamO2O.

JBHI Journal 2022 Journal Article

A Cascaded Multi-Task Generative Framework for Detecting Aortic Dissection on 3-D Non-Contrast-Enhanced Computed Tomography

  • Xiangyu Xiong
  • Yan Ding
  • Chuanqi Sun
  • Zhuoneng Zhang
  • Xiuhong Guan
  • Tianjing Zhang
  • Hao Chen
  • Hongyan Liu

Contrast-enhanced computed tomography (CE-CT) is the gold standard for diagnosing aortic dissection (AD). However, contrast agents can cause allergic reactions or renal failure in some patients. Moreover, AD diagnosis by radiologists using non-contrast-enhanced CT (NCE-CT) images has poor sensitivity. To address this issue, we propose a novel cascaded multi-task generative framework for AD detection using NCE-CT volumes. The framework includes a 3D nnU-Net and a 3D multi-task generative architecture (3D MTGA). Specifically, the 3D nnU-Net was employed to segment aortas from NCE-CT volumes. The 3D MTGA was then employed to simultaneously synthesize CE-CT volumes, segment the true and false lumen, and classify the patient as AD or non-AD. A theoretical formulation demonstrated that the 3D MTGA could increase the Jensen–Shannon Divergence (JSD) between AD and non-AD for each NCE-CT volume, thus indirectly improving the AD detection performance. Experiments also showed that the proposed framework achieved an average accuracy of 0.831, a sensitivity of 0.938, and an F1-score of 0.847, compared with seven state-of-the-art classification models and with three radiologists of junior, intermediate, and senior experience, respectively. The experimental results indicate that the proposed framework obtains superior performance to state-of-the-art models in AD detection. Thus, it has great potential to reduce the misdiagnosis of AD using NCE-CT in clinical practice. The source codes and supplementary materials for our framework are available at https://github.com/yXiangXiong/CMTGF.

NeurIPS Conference 2022 Conference Paper

An In-depth Study of Stochastic Backpropagation

  • Jun Fang
  • Mingze Xu
  • Hao Chen
  • Bing Shuai
  • Zhuowen Tu
  • Joseph Tighe

In this paper, we provide an in-depth study of Stochastic Backpropagation (SBP) when training deep neural networks for standard image classification and object detection tasks. During backward propagation, SBP calculates gradients by using only a subset of feature maps to save GPU memory and computational cost. We interpret SBP as an efficient way to implement stochastic gradient descent by performing backpropagation dropout, which leads to significant memory saving and training run-time reduction, with a minimal impact on the overall model accuracy. We offer best practices to apply SBP for training image recognition models, which can be adopted in learning a wide range of deep neural networks. Experiments on image classification and object detection show that SBP can save up to 40% of GPU memory with less than 1% accuracy degradation. Code is available at: https://github.com/amazon-research/stochastic-backpropagation
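
Functionally, the gradient-subsetting idea can be sketched as below. Note this toy version only restricts gradient flow; the real memory saving described in the paper comes from not caching activations for the detached subset (all names are illustrative):

    import torch

    def stochastic_bp(feat, keep_ratio=0.5):
        # feat: (B, C, H, W). Gradients flow through roughly keep_ratio of the
        # channels; the rest are detached from the computation graph.
        mask = (torch.rand(1, feat.size(1), 1, 1, device=feat.device)
                < keep_ratio).float()
        return mask * feat + (1.0 - mask) * feat.detach()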

JMLR Journal 2022 Journal Article

Gaussian Process Parameter Estimation Using Mini-batch Stochastic Gradient Descent: Convergence Guarantees and Empirical Benefits

  • Hao Chen
  • Lili Zheng
  • Raed Al Kontar
  • Garvesh Raskutti

Stochastic gradient descent (SGD) and its variants have established themselves as the go-to algorithms for large-scale machine learning problems with independent samples due to their generalization performance and intrinsic computational advantage. However, the fact that the stochastic gradient is a biased estimator of the full gradient with correlated samples has led to the lack of theoretical understanding of how SGD behaves under correlated settings and hindered its use in such cases. In this paper, we focus on hyperparameter estimation for the Gaussian process (GP) and take a step forward towards breaking the barrier by proving that minibatch SGD converges to a critical point of the full log-likelihood loss function, and recovers model hyperparameters with rate $O(\frac{1}{K})$ for $K$ iterations, up to a statistical error term depending on the minibatch size. Our theoretical guarantees hold provided that the kernel functions exhibit exponential or polynomial eigendecay, which is satisfied by a wide range of kernels commonly used in GPs. Numerical studies on both simulated and real datasets demonstrate that minibatch SGD has better generalization over state-of-the-art GP methods while reducing the computational burden and opening a new, previously unexplored, data size regime for GPs.
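
The training loop analyzed here is ordinary SGD applied to the minibatch GP negative log marginal likelihood; a minimal sketch with an RBF kernel follows (constant terms omitted; all names are illustrative assumptions):

    import torch

    def gp_minibatch_nll(x, y, log_ls, log_var, log_noise):
        # Negative log marginal likelihood of a GP on one minibatch
        # (x: (m, d) inputs, y: (m,) outputs; constants dropped).
        d2 = (x[:, None, :] - x[None, :, :]).pow(2).sum(-1)
        K = log_var.exp() * torch.exp(-0.5 * d2 / log_ls.exp() ** 2)
        K = K + log_noise.exp() * torch.eye(len(x))
        L = torch.linalg.cholesky(K)
        alpha = torch.cholesky_solve(y[:, None], L)
        return 0.5 * (y[:, None].T @ alpha).squeeze() + L.diagonal().log().sum()

    params = [torch.zeros((), requires_grad=True) for _ in range(3)]
    opt = torch.optim.SGD(params, lr=1e-2)
    # for xb, yb in minibatches:
    #     opt.zero_grad(); gp_minibatch_nll(xb, yb, *params).backward(); opt.step()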

NeurIPS Conference 2022 Conference Paper

USB: A Unified Semi-supervised Learning Benchmark for Classification

  • Yidong Wang
  • Hao Chen
  • Yue Fan
  • Wang Sun
  • Ran Tao
  • Wenxin Hou
  • Renjie Wang
  • Linyi Yang

Semi-supervised learning (SSL) improves model generalization by leveraging massive unlabeled data to augment limited labeled samples. However, currently, popular SSL evaluation protocols are often constrained to computer vision (CV) tasks. In addition, previous work typically trains deep neural networks from scratch, which is time-consuming and environmentally unfriendly. To address the above issues, we construct a Unified SSL Benchmark (USB) for classification by selecting 15 diverse, challenging, and comprehensive tasks from CV, natural language processing (NLP), and audio processing (Audio), on which we systematically evaluate the dominant SSL methods, and also open-source a modular and extensible codebase for fair evaluation of these SSL methods. We further provide the pre-trained versions of the state-of-the-art neural models for CV tasks to make the cost affordable for further tuning. USB enables the evaluation of a single SSL algorithm on more tasks from multiple domains but with less cost. Specifically, on a single NVIDIA V100, only 39 GPU days are required to evaluate FixMatch on 15 tasks in USB while 335 GPU days (279 GPU days on 4 CV datasets except for ImageNet) are needed on 5 CV tasks with TorchSSL.

IJCAI Conference 2021 Conference Paper

AMA-GCN: Adaptive Multi-layer Aggregation Graph Convolutional Network for Disease Prediction

  • Hao Chen
  • Fuzhen Zhuang
  • Li Xiao
  • Ling Ma
  • Haiyan Liu
  • Ruifang Zhang
  • Huiqin Jiang
  • Qing He

Recently, Graph Convolutional Networks (GCNs) have proven to be a powerful means for Computer Aided Diagnosis (CADx). This approach requires building a population graph to aggregate structural information, where the graph adjacency matrix represents the relationship between nodes. Until now, this adjacency matrix has usually been defined manually based on phenotypic information. In this paper, we propose an encoder that automatically selects the appropriate phenotypic measures according to their spatial distribution, and uses a text similarity awareness mechanism to calculate the edge weights between nodes. The encoder can automatically construct the population graph using phenotypic measures which have a positive impact on the final results, and further realizes the fusion of multimodal information. In addition, a novel graph convolution network architecture using a multi-layer aggregation mechanism is proposed. The structure can obtain deep structure information while suppressing over-smoothing, and increases the similarity between nodes of the same type. Experimental results on two databases show that our method can significantly improve the diagnostic accuracy for Autism spectrum disorder and breast cancer, indicating its universality in leveraging multimodal data for disease prediction.

AAAI Conference 2021 System Paper

Dialog Router: Automated Dialog Transition via Multi-Task Learning

  • Ziming Huang
  • Zhuoxuan Jiang
  • Hao Chen
  • Xue Han
  • Yabin Dang

Dialog Router is a general paradigm for human-bot symbiosis dialog systems to provide friendly customer care service. It is equipped with a multi-task learning model to automatically capture the underlying correlation between multiple related tasks, i.e., dialog classification and regression, and greatly reduces human labor for system customization, which improves the accuracy of dialog transition. In addition, for learning the multi-task model, the training data and labels are easy to collect from human-to-human historical dialog logs, and the Dialog Router can be easily integrated into the majority of existing dialog systems by calling general APIs. We conduct experiments on real-world datasets for dialog classification and regression. The results show that our model achieves improvements on both tasks, which benefits the dialog transition application. The demo illustrates our method's effectiveness in a real customer care service.

NeurIPS Conference 2021 Conference Paper

Long Short-Term Transformer for Online Action Detection

  • Mingze Xu
  • Yuanjun Xiong
  • Hao Chen
  • Xinyu Li
  • Wei Xia
  • Zhuowen Tu
  • Stefano Soatto

We present the Long Short-term TRansformer (LSTR), a temporal modeling algorithm for online action detection, which employs a long- and short-term memory mechanism to model prolonged sequence data. It consists of an LSTR encoder that dynamically leverages coarse-scale historical information from an extended temporal window (e.g., 2048 frames spanning up to 8 minutes), together with an LSTR decoder that focuses on a short time window (e.g., 32 frames spanning 8 seconds) to model the fine-scale characteristics of the data. Compared to prior work, LSTR provides an effective and efficient method to model long videos with fewer heuristics, which is validated by extensive empirical analysis. LSTR achieves state-of-the-art performance on three standard online action detection benchmarks: THUMOS'14, TVSeries, and HACS Segment. Code has been made available at https://xumingze0308.github.io/projects/lstr.

NeurIPS Conference 2021 Conference Paper

NeRV: Neural Representations for Videos

  • Hao Chen
  • Bo He
  • Hanyu Wang
  • Yixuan Ren
  • Ser Nam Lim
  • Abhinav Shrivastava

We propose a novel neural representation for videos (NeRV) which encodes videos in neural networks. Unlike conventional representations that treat videos as frame sequences, we represent videos as neural networks taking the frame index as input. Given a frame index, NeRV outputs the corresponding RGB image. Video encoding in NeRV is simply fitting a neural network to video frames, and the decoding process is a simple feedforward operation. As an image-wise implicit representation, NeRV outputs the whole image and shows great efficiency compared to pixel-wise implicit representations, improving the encoding speed by $\textbf{25}\times$ to $\textbf{70}\times$ and the decoding speed by $\textbf{38}\times$ to $\textbf{132}\times$, while achieving better video quality. With such a representation, we can treat videos as neural networks, simplifying several video-related tasks. For example, conventional video compression methods are restricted by a long and complex pipeline, specifically designed for the task. In contrast, with NeRV, we can use any neural network compression method as a proxy for video compression, and achieve comparable performance to traditional frame-based video compression approaches (H.264, HEVC, etc.). Besides compression, we demonstrate the generalization of NeRV for video denoising. The source code and pre-trained model can be found at https://github.com/haochen-rye/NeRV.git.
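
A minimal sketch of the image-wise implicit representation idea (layer sizes and the positional encoding are illustrative assumptions, not the released architecture):

    import torch
    import torch.nn as nn

    class TinyNeRV(nn.Module):
        # Maps a normalized frame index t in [0, 1] to an RGB frame.
        def __init__(self, num_freqs=8, h=32, w=64):
            super().__init__()
            self.num_freqs, self.h, self.w = num_freqs, h, w
            self.mlp = nn.Sequential(
                nn.Linear(2 * num_freqs, 256), nn.GELU(),
                nn.Linear(256, 16 * (h // 4) * (w // 4)), nn.GELU())
            # Image-wise decoding: upsample a small feature map to a full frame.
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(16, 16, 4, stride=2, padding=1), nn.GELU(),
                nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid())

        def forward(self, t):  # t: (B,) frame indices in [0, 1]
            freqs = 2.0 ** torch.arange(self.num_freqs, dtype=torch.float32,
                                        device=t.device)
            enc = torch.cat([torch.sin(torch.pi * t[:, None] * freqs),
                             torch.cos(torch.pi * t[:, None] * freqs)], dim=-1)
            feat = self.mlp(enc).view(-1, 16, self.h // 4, self.w // 4)
            return self.decoder(feat)  # (B, 3, h, w)

    # "Encoding" a video = fitting the network to its frames with an MSE loss;
    # "decoding" frame t = a single forward pass.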

NeurIPS Conference 2020 Conference Paper

Backpropagating Linearly Improves Transferability of Adversarial Examples

  • Yiwen Guo
  • Qizhang Li
  • Hao Chen

The vulnerability of deep neural networks (DNNs) to adversarial examples has drawn great attention from the community. In this paper, we study the transferability of such examples, which lays the foundation of many black-box attacks on DNNs. We revisit a not so new but definitely noteworthy hypothesis of Goodfellow et al.'s and disclose that the transferability can be enhanced by improving the linearity of DNNs in an appropriate manner. We introduce linear backpropagation (LinBP), a method that performs backpropagation in a more linear fashion using off-the-shelf attacks that exploit gradients. More specifically, it calculates forward as normal but backpropagates the loss as if some nonlinear activations were not encountered in the forward pass. Experimental results demonstrate that this simple yet effective method clearly outperforms the current state of the art in crafting transferable adversarial examples on CIFAR-10 and ImageNet, leading to more effective attacks on a variety of DNNs. Code at: https://github.com/qizhangli/linbp-attack.
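
The core mechanism, computing the ReLU forward pass as usual while backpropagating through it as if it were the identity, can be sketched in a few lines (an illustrative toy, not the released code, which applies this selectively to chosen layers):

    import torch

    class LinBPReLU(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            return x.clamp(min=0)  # ordinary ReLU in the forward pass

        @staticmethod
        def backward(ctx, grad_out):
            return grad_out  # backward as if the nonlinearity were absent

    # Swapping selected ReLUs for LinBPReLU.apply and then running a standard
    # gradient-based attack yields the more linear backpropagation described above.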

NeurIPS Conference 2020 Conference Paper

Practical No-box Adversarial Attacks against DNNs

  • Qizhang Li
  • Yiwen Guo
  • Hao Chen

The study of adversarial vulnerabilities of deep neural networks (DNNs) has progressed rapidly. Existing attacks require either internal access (to the architecture, parameters, or training set of the victim model) or external access (to query the model). However, both kinds of access may be infeasible or expensive in many scenarios. We investigate no-box adversarial examples, where the attacker can neither access the model information nor the training set, nor query the model. Instead, the attacker can only gather a small number of examples from the same problem domain as that of the victim model. Such a stronger threat model greatly expands the applicability of adversarial attacks. We propose three mechanisms for training with a very small dataset (on the order of tens of examples) and find that prototypical reconstruction is the most effective. Our experiments show that adversarial examples crafted on prototypical auto-encoding models transfer well to a variety of image classification and face verification models. On a commercial celebrity recognition system held by clarifai.com, our approach significantly diminishes the average prediction accuracy of the system to only 15.40%, which is on par with the attack that transfers adversarial examples from a pre-trained Arcface model. Our code is publicly available at: https://github.com/qizhangli/nobox-attacks.

NeurIPS Conference 2020 Conference Paper

Stochastic Gradient Descent in Correlated Settings: A Study on Gaussian Processes

  • Hao Chen
  • Lili Zheng
  • Raed Al Kontar
  • Garvesh Raskutti

Stochastic gradient descent (SGD) and its variants have established themselves as the go-to algorithms for large-scale machine learning problems with independent samples, owing to their generalization performance and intrinsic computational advantage. However, with correlated samples the stochastic gradient is a biased estimator of the full gradient, which has left SGD's behavior in correlated settings poorly understood theoretically and has hindered its use in such cases. In this paper, we focus on the Gaussian process (GP) and take a step towards breaking this barrier by proving that minibatch SGD converges to a critical point of the full loss function and recovers the model hyperparameters at a rate of O(1/K), up to a statistical error term depending on the minibatch size. Numerical studies on both simulated and real datasets demonstrate that minibatch SGD generalizes better than state-of-the-art GP methods while reducing the computational burden, opening a new, previously unexplored data-size regime for GPs.
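
Concretely, each update touches only a subset of the data, so the per-step cost is cubic in the minibatch size rather than the dataset size. A minimal sketch of this loop for an RBF kernel, with the toy data and the log-parameterization of the hyperparameters as assumptions:

```python
# Minibatch SGD on GP hyperparameters: each step descends the negative log
# marginal likelihood of a random 128-point subset of 2000 points.
import torch

X = torch.linspace(0, 10, 2000)[:, None]
y = torch.sin(X).squeeze() + 0.1 * torch.randn(2000)

log_len, log_var, log_noise = (torch.zeros(()).requires_grad_() for _ in range(3))
opt = torch.optim.SGD([log_len, log_var, log_noise], lr=0.01)

def rbf(a, b):
    d2 = (a - b.T) ** 2
    return log_var.exp() * torch.exp(-0.5 * d2 / log_len.exp() ** 2)

for step in range(500):
    idx = torch.randint(0, 2000, (128,))           # the minibatch
    Xb, yb = X[idx], y[idx]
    K = rbf(Xb, Xb) + log_noise.exp() * torch.eye(128)
    L = torch.linalg.cholesky(K)
    alpha = torch.cholesky_solve(yb[:, None], L)
    # NLL up to a constant: 0.5 y^T K^{-1} y + 0.5 log det K
    nll = 0.5 * (yb[:, None] * alpha).sum() + L.diagonal().log().sum()
    opt.zero_grad(); nll.backward(); opt.step()
```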

JBHI Journal 2020 Journal Article

UD-MIL: Uncertainty-Driven Deep Multiple Instance Learning for OCT Image Classification

  • Xi Wang
  • Fangyao Tang
  • Hao Chen
  • Luyang Luo
  • Ziqi Tang
  • An-Ran Ran
  • Carol Y. Cheung
  • Pheng-Ann Heng

Deep learning has achieved remarkable success in the optical coherence tomography (OCT) image classification task when substantial labelled B-scan images are available. However, obtaining such fine-grained expert annotations is usually difficult and expensive, so leveraging volume-level labels to develop a robust classifier is very appealing. In this paper, we propose a weakly supervised deep learning framework with uncertainty estimation to address macula-related disease classification from OCT images with only volume-level labels available. First, a convolutional neural network (CNN) based instance-level classifier is iteratively refined using the proposed uncertainty-driven deep multiple instance learning scheme. To the best of our knowledge, we are the first to incorporate an uncertainty evaluation mechanism into multiple instance learning (MIL) for training a robust instance classifier. The classifier is able to detect suspicious abnormal instances and simultaneously extract the corresponding deep embeddings with high representational capability. Second, a recurrent neural network (RNN) takes instance features from the same bag as input and generates the final bag-level prediction by considering both local instance information and the globally aggregated bag-level representation. For more comprehensive validation, we built two large diabetic macular edema (DME) OCT datasets from different devices and imaging protocols to evaluate the efficacy of our method, composed of 30,151 B-scans in 1,396 volumes from 274 patients (Heidelberg-DME dataset) and 38,976 B-scans in 3,248 volumes from 490 patients (Triton-DME dataset), respectively. We compare the proposed method with state-of-the-art approaches and experimentally demonstrate that our method is superior, achieving volume-level accuracy, F1-score and area under the receiver operating characteristic curve (AUC) of 95.1%, 0.939 and 0.990 on Heidelberg-DME and 95.1%, 0.935 and 0.986 on Triton-DME, respectively. Furthermore, the proposed method also yields competitive results on another public age-related macular degeneration OCT dataset, indicating its high potential as an effective screening tool in clinical practice.
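
One common way to obtain the per-instance uncertainty such a scheme needs is Monte-Carlo dropout: run the instance classifier several times with dropout active and read the spread of its predictions. The abstract does not pin down the exact mechanism, so the sketch below is an assumption about how it could work, with toy thresholds and backbone:

```python
# Uncertainty-driven instance selection sketch (MC-dropout variant): keep
# only confident abnormal B-scans for the next refinement round.
import torch
import torch.nn as nn

classifier = nn.Sequential(nn.Linear(128, 64), nn.ReLU(),
                           nn.Dropout(0.5), nn.Linear(64, 1), nn.Sigmoid())

def mc_dropout_scores(feats, T=20):
    classifier.train()                        # keep dropout stochastic
    with torch.no_grad():
        preds = torch.stack([classifier(feats).squeeze(-1) for _ in range(T)])
    return preds.mean(0), preds.std(0)        # probability, uncertainty

bag = torch.randn(32, 128)                    # instance features of one volume
prob, unc = mc_dropout_scores(bag)
selected = (prob > 0.9) & (unc < 0.05)        # confident abnormal instances
```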

IJCAI Conference 2019 Conference Paper

DeltaDou: Expert-level Doudizhu AI through Self-play

  • Qiqi Jiang
  • Kuangzheng Li
  • Boyao Du
  • Hao Chen
  • Hai Fang

Artificial intelligence has seen several breakthroughs in two-player perfect-information games. Nevertheless, Doudizhu, a three-player imperfect-information game, is still quite challenging. In this paper, we present a Doudizhu AI trained by deep reinforcement learning from self-play games. The algorithm combines an asymmetric MCTS over each player's information-set nodes, a policy-value network that approximates the policy and value at each decision node, and inference of the other players' unobserved hands given their policies. Our results show that self-play can significantly improve the performance of our agent in this multi-agent imperfect-information game. Even starting from a weak AI, our agent can reach human expert level after days of self-play and training.

IJCAI Conference 2019 Conference Paper

Light-Weight Hybrid Convolutional Network for Liver Tumor Segmentation

  • Jianpeng Zhang
  • Yutong Xie
  • Pingping Zhang
  • Hao Chen
  • Yong Xia
  • Chunhua Shen

Automated segmentation of liver tumors in contrast-enhanced abdominal computed tomography (CT) scans is essential in assisting medical professionals to evaluate tumor development and plan treatment quickly. Although deep convolutional neural networks (DCNNs) have contributed many breakthroughs in image segmentation, this task remains challenging, since 2D DCNNs are incapable of exploiting inter-slice information and 3D DCNNs are too complex to be trained with the small datasets available. In this paper, we propose the light-weight hybrid convolutional network (LW-HCN) to segment the liver and its tumors in CT volumes. Instead of combining a 2D and a 3D network for coarse-to-fine segmentation, LW-HCN has an encoder-decoder structure in which 2D convolutions used at the bottom of the encoder decrease the complexity and 3D convolutions used in the other layers exploit both spatial and temporal information. To further reduce the complexity, we design the depthwise and spatiotemporal separate (DSTS) factorization for 3D convolutions, which not only reduces parameters dramatically but also improves performance. We evaluated the proposed LW-HCN model against several recent methods on the LiTS and 3D-IRCADb datasets and achieved Dice per case of 73.0% and 94.1%, respectively, for tumor segmentation, setting a new state of the art.
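
The parameter savings from factorizing a dense 3D convolution are easy to see in code. Below is a sketch of one plausible depthwise-spatial + depthwise-temporal + pointwise arrangement; the exact ordering inside LW-HCN may differ:

```python
# Depthwise spatiotemporal separable 3D convolution vs. a dense one:
# the factorized version has roughly 20x fewer parameters here.
import torch.nn as nn

def dsts_conv(ch, k=3):
    return nn.Sequential(
        nn.Conv3d(ch, ch, (1, k, k), padding=(0, k // 2, k // 2), groups=ch),
        nn.Conv3d(ch, ch, (k, 1, 1), padding=(k // 2, 0, 0), groups=ch),
        nn.Conv3d(ch, ch, 1),                 # pointwise channel mixing
    )

dense = nn.Conv3d(64, 64, 3, padding=1)
sep = dsts_conv(64)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dense), count(sep))               # ~110k vs ~5k parameters
```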

AAAI Conference 2019 Conference Paper

Synergistic Image and Feature Adaptation: Towards Cross-Modality Domain Adaptation for Medical Image Segmentation

  • Cheng Chen
  • Qi Dou
  • Hao Chen
  • Jing Qin
  • Pheng-Ann Heng

This paper presents a novel unsupervised domain adaptation framework, called Synergistic Image and Feature Adaptation (SIFA), to effectively tackle the problem of domain shift. Domain adaptation has become an important topic in recent studies on deep learning, aiming to recover the performance degradation that occurs when neural networks are applied to new testing domains. Our proposed SIFA is an elegant learning paradigm that presents a synergistic fusion of adaptations from both the image and feature perspectives. In particular, we simultaneously transform the appearance of images across domains and enhance the domain-invariance of the extracted features towards the segmentation task. The feature encoder layers are shared by both perspectives to grasp their mutual benefits during the end-to-end learning procedure. Without using any annotation from the target domain, the learning of our unified model is guided by adversarial losses, with multiple discriminators employed from various aspects. We have extensively validated our method on a challenging application: cross-modality medical image segmentation of cardiac structures. Experimental results demonstrate that our SIFA model recovers the degraded performance from 17.2% to 73.0% and outperforms the state-of-the-art methods by a significant margin.
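
The feature-perspective half of such a framework reduces to a familiar adversarial game: a discriminator separates source from target features while the shared encoder learns to fool it. A bare-bones sketch of that game alone, leaving out SIFA's image-appearance translation, segmentation loss, and additional discriminators:

```python
# Feature-level adversarial adaptation sketch: the shared encoder is
# pushed to make target features indistinguishable from source features.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
disc = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
bce = nn.BCEWithLogitsLoss()
opt_e = torch.optim.Adam(encoder.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)

src, tgt = torch.randn(8, 1, 64, 64), torch.randn(8, 1, 64, 64)
for _ in range(100):
    fs, ft = encoder(src), encoder(tgt)
    # Discriminator: source -> 1, target -> 0.
    d_loss = bce(disc(fs.detach()), torch.ones(8, 1)) + \
             bce(disc(ft.detach()), torch.zeros(8, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Encoder: make target features look like source features.
    g_loss = bce(disc(ft), torch.ones(8, 1))
    opt_e.zero_grad(); g_loss.backward(); opt_e.step()
```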

IJCAI Conference 2019 Conference Paper

Theoretical Investigation of Generalization Bound for Residual Networks

  • Hao Chen
  • Zhanfeng Mo
  • Zhouwang Yang
  • Xiao Wang

This paper presents a framework for norm-based capacity control with respect to an lp,q-norm in weight-normalized Residual Neural Networks (ResNets). We first formulate the representation of each residual block. For the regression problem, we analyze the Rademacher complexity of the ResNets family and establish a tighter generalization upper bound for weight-normalized ResNets in a more general setting. Using the lp,q-norm weight normalization with 1/p + 1/q >= 1, we discuss the properties of a width-independent capacity control, which depends only on the depth through a square-root term. Several comparisons suggest that our result is tighter than previous work. Parallel results for Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs) are included by introducing the lp,q-norm weight normalization for DNNs and the lp,q-norm kernel normalization for CNNs. Numerical experiments also verify that ResNet structures contribute to better generalization properties.
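
For readers unfamiliar with the notation, the entrywise lp,q-norm of a weight matrix W is commonly defined as below; the row/column convention here is the usual one and may be transposed relative to the paper:

\[
\|W\|_{p,q} \;=\; \Bigl( \sum_{j} \Bigl( \sum_{i} |W_{ij}|^{p} \Bigr)^{q/p} \Bigr)^{1/q},
\qquad \frac{1}{p} + \frac{1}{q} \ge 1 .
\]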

IROS Conference 2019 Conference Paper

Towards More Realistic Human-Robot Conversation: A Seq2Seq-based Body Gesture Interaction System

  • Minjie Hua
  • Fuyuan Shi
  • Yibing Nan
  • Kai Wang 0012
  • Hao Chen
  • Shiguo Lian

This paper presents a novel system that enables intelligent robots to exhibit realistic body gestures while communicating with humans. The proposed system consists of a listening model and a speaking model used in the corresponding conversational phases. Both models are adapted from the sequence-to-sequence (seq2seq) architecture and synthesize body gestures represented by the movements of twelve upper-body keypoints. All extracted 2D keypoints are first 3D-transformed, then rotated and normalized to discard irrelevant information. A substantial number of human-conversation videos from YouTube are collected and preprocessed to train the listening and speaking models separately, after which the two models are evaluated on the test dataset using mean squared error (MSE) and cosine similarity. The tuned system is implemented to drive a virtual avatar as well as Pepper, a physical humanoid robot, demonstrating in practice how our method improves conversational interaction abilities.
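
At its core, each model maps one keypoint sequence to another. A compact sketch with a GRU encoder and decoder, assuming the twelve 3D keypoints are flattened to 36-dimensional frames and teacher forcing is used during training; sequence lengths and hidden sizes are illustrative:

```python
# Seq2seq gesture model sketch: observed motion in, responding motion out.
import torch
import torch.nn as nn

class GestureSeq2Seq(nn.Module):
    def __init__(self, dim=36, hidden=128):
        super().__init__()
        self.enc = nn.GRU(dim, hidden, batch_first=True)
        self.dec = nn.GRU(dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, dim)

    def forward(self, src, tgt):
        _, h = self.enc(src)                  # summarize the observed motion
        dec_out, _ = self.dec(tgt, h)         # teacher-forced decoding
        return self.out(dec_out)

model = GestureSeq2Seq()
src = torch.randn(4, 50, 36)                  # observed keypoint frames
tgt = torch.randn(4, 50, 36)                  # gesture frames to generate
pred = model(src, tgt)
loss = nn.functional.mse_loss(pred, tgt)      # MSE, as in the evaluation
```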

AAAI Conference 2018 Conference Paper

LSTD: A Low-Shot Transfer Detector for Object Detection

  • Hao Chen
  • Yali Wang
  • Guoyou Wang
  • Yu Qiao

Recent advances in object detection are mainly driven by deep learning with large-scale detection benchmarks. However, the fully-annotated training set is often limited for a target detection task, which may deteriorate the performance of deep detectors. To address this challenge, we propose a novel low-shot transfer detector (LSTD), which leverages rich source-domain knowledge to construct an effective target-domain detector with very few training examples. The main contributions are as follows. First, we design a flexible deep architecture for LSTD to alleviate transfer difficulties in low-shot detection; this architecture integrates the advantages of both SSD and Faster RCNN in a unified deep framework. Second, we introduce a novel regularized transfer learning framework for low-shot detection, in which transfer knowledge (TK) and background depression (BD) regularizations leverage object knowledge from the source and target domains, respectively, to further enhance fine-tuning with a few target images. Finally, we examine LSTD on a number of challenging low-shot detection experiments, where it outperforms other state-of-the-art approaches. The results demonstrate that LSTD is a preferable deep detector for low-shot scenarios.
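
The abstract does not spell out the BD regularizer's exact form; one plausible reading, sketched below purely as an illustration, is an L2 penalty on feature activations that fall outside the ground-truth object regions of the few target images:

```python
# Hypothetical background-depression regularizer: suppress activations
# outside the annotated object boxes. Mask construction is an assumption.
import torch

def bd_regularizer(feat, object_mask):
    # feat: (B, C, H, W); object_mask: (B, 1, H, W) with 1 inside boxes.
    background = feat * (1.0 - object_mask)
    return background.pow(2).mean()

feat = torch.randn(2, 256, 32, 32, requires_grad=True)
mask = torch.zeros(2, 1, 32, 32)
mask[:, :, 8:24, 8:24] = 1.0                  # toy ground-truth box region
loss = bd_regularizer(feat, mask)
loss.backward()
```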

AAAI Conference 2018 Conference Paper

SFCN-OPI: Detection and Fine-Grained Classification of Nuclei Using Sibling FCN With Objectness Prior Interaction

  • Yanning Zhou
  • Qi Dou
  • Hao Chen
  • Jing Qin
  • Pheng-Ann Heng

Cell nuclei detection and fine-grained classification have been fundamental yet challenging problems in histopathology image analysis. Due to the tiny size of nuclei, significant inter-/intra-class variances, and inferior image quality, previous automated methods easily suffer from limited accuracy and robustness. Meanwhile, existing approaches usually tackle these two tasks independently, neglecting their close relatedness. In this paper, we present a novel sibling fully convolutional network with objectness prior interaction (SFCN-OPI) to tackle the two tasks simultaneously and interactively in a unified end-to-end framework. Specifically, the sibling FCN branches share features in the earlier layers while holding respective higher layers for their specific tasks. More importantly, the detection branch outputs an objectness prior which dynamically interacts with the fine-grained classification branch during training and testing. With this mechanism, the fine-grained classification focuses on regions with high confidence of nucleus existence and outputs conditional probabilities, which in turn benefit the detection through back-propagation. Extensive experiments on colon cancer histology images have validated the effectiveness of the proposed SFCN-OPI, and our method outperforms the state-of-the-art methods by a large margin.
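
Read as conditional probabilities, the interaction suggests a simple fusion: the class posterior at each pixel is the class probability given a nucleus, weighted by the objectness prior. The product gating below is an assumed concrete form of that interaction, not necessarily the paper's exact mechanism:

```python
# Objectness-prior gating sketch: trust class probabilities only where a
# nucleus is likely present. The product fusion is an illustrative choice.
import torch

objectness = torch.sigmoid(torch.randn(1, 1, 64, 64))   # P(nucleus) per pixel
class_logits = torch.randn(1, 4, 64, 64)                 # 4 nucleus types
class_probs = torch.softmax(class_logits, dim=1)         # P(class | nucleus)
joint = objectness * class_probs                         # gated posterior map
```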

IJCAI Conference 2018 Conference Paper

Unsupervised Cross-Modality Domain Adaptation of ConvNets for Biomedical Image Segmentations with Adversarial Loss

  • Qi Dou
  • Cheng Ouyang
  • Cheng Chen
  • Hao Chen
  • Pheng-Ann Heng

Convolutional networks (ConvNets) have achieved great success in various challenging vision tasks. However, the performance of ConvNets degrades when encountering domain shift. Domain adaptation is especially significant, yet challenging, in the field of biomedical image analysis, where cross-modality data have largely different distributions. Given that annotating medical data is especially expensive, supervised transfer learning approaches are not quite optimal. In this paper, we propose an unsupervised domain adaptation framework with adversarial learning for cross-modality biomedical image segmentation. Specifically, our model is based on a dilated fully convolutional network for pixel-wise prediction. Moreover, we build a plug-and-play domain adaptation module (DAM) to map the target input to features aligned with the source-domain feature space. A domain critic module (DCM) is set up to discriminate between the feature spaces of the two domains. We optimize the DAM and DCM via an adversarial loss without using any target-domain labels. Our proposed method is validated by adapting a ConvNet trained on MRI images to unpaired CT data for cardiac structure segmentation, and achieves very promising results.

JBHI Journal 2017 Journal Article

Integrating Online and Offline Three-Dimensional Deep Learning for Automated Polyp Detection in Colonoscopy Videos

  • Lequan Yu
  • Hao Chen
  • Qi Dou
  • Jing Qin
  • Pheng Ann Heng

Automated polyp detection in colonoscopy videos has been demonstrated to be a promising way to aid colorectal cancer prevention and diagnosis. Traditional manual screening is time consuming, operator dependent, and error prone; hence, automated detection approaches are in high demand in clinical practice. However, automated polyp detection is very challenging due to high intraclass variations in polyp size, color, shape, and texture, and low interclass variations between polyps and hard mimics. In this paper, we propose a novel offline and online three-dimensional (3-D) deep learning integration framework that leverages the 3-D fully convolutional network (3D-FCN) to tackle this challenging problem. Compared with previous methods employing hand-crafted features or 2-D convolutional neural networks, the 3D-FCN is capable of learning more representative spatio-temporal features from colonoscopy videos and hence has more powerful discrimination capability. More importantly, we propose a novel online learning scheme to deal with the problem of limited training data by harnessing the specific information of an input video during the learning process. We integrate offline and online learning to effectively reduce the number of false positives generated by the offline network and further improve detection performance. Extensive experiments on the dataset of the MICCAI 2015 Challenge on Polyp Detection demonstrated the better performance of our method compared with other competitors.

AAAI Conference 2017 Conference Paper

Volumetric ConvNets with Mixed Residual Connections for Automated Prostate Segmentation from 3D MR Images

  • Lequan Yu
  • Xin Yang
  • Hao Chen
  • Jing Qin
  • Pheng Ann Heng

Automated prostate segmentation from 3D MR images is very challenging due to large variations in prostate shape and indistinct prostate boundaries. We propose a novel volumetric convolutional neural network (ConvNet) with mixed residual connections to cope with this challenging problem. Compared with previous methods, our volumetric ConvNet has two compelling advantages. First, it is implemented in a 3D manner and can fully exploit the 3D spatial contextual information of the input data to perform efficient, precise, volume-to-volume prediction. Second, and more importantly, the novel combination of residual connections (i.e., long and short) greatly improves the training efficiency and discriminative capability of our network by enhancing information propagation within the ConvNet both locally and globally: the forward propagation of location information improves segmentation accuracy, while the smooth backward propagation of gradient flow accelerates convergence and enhances discrimination capability. Extensive experiments on the open MICCAI PROMISE12 challenge dataset corroborated the effectiveness of the proposed volumetric ConvNet with mixed residual connections. Our method ranked first in the challenge, outperforming the other competitors by a large margin with respect to most evaluation metrics. The proposed volumetric ConvNet is general and can easily be extended to other medical image analysis tasks, especially ones with limited training data.
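
The long/short mix is easy to picture in miniature: short residuals wrap each block, while a long skip carries encoder features to the decoder. The toy network below illustrates the wiring only; depths and channel counts are assumptions:

```python
# Mixed residual connections sketch: short (local) skips inside blocks,
# a long (global) skip from encoder to decoder.
import torch
import torch.nn as nn

class ShortResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv3d(ch, ch, 3, padding=1), nn.ReLU(),
                                  nn.Conv3d(ch, ch, 3, padding=1))
    def forward(self, x):
        return torch.relu(x + self.conv(x))   # short residual

class TinyMixedResNet(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.stem = nn.Conv3d(1, ch, 3, padding=1)
        self.enc, self.dec = ShortResBlock(ch), ShortResBlock(ch)
        self.head = nn.Conv3d(ch, 2, 1)
    def forward(self, x):
        e = self.enc(self.stem(x))
        d = self.dec(e)
        return self.head(d + e)               # long residual

logits = TinyMixedResNet()(torch.randn(1, 1, 16, 32, 32))
```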

AAAI Conference 2016 Conference Paper

Deep Contextual Networks for Neuronal Structure Segmentation

  • Hao Chen
  • Xiao Qi
  • Jie Cheng
  • Pheng Heng

The goal of connectomics is to map the interconnections of the nervous system from electron microscopy (EM) images. However, the formidable size of EM image data renders human annotation impractical, as it could take decades to complete. An alternative way to reconstruct the connectome is a computerized scheme that automatically segments the neuronal structures. The segmentation of EM images is very challenging, as the depicted structures can be very diverse. To address this difficult problem, we propose a deep contextual network that leverages multi-level contextual information from the deep hierarchical structure to achieve better segmentation performance. To further improve robustness against vanishing gradients and strengthen the back-propagation of gradient flow, auxiliary classifiers are incorporated into the architecture of our deep neural network. We show that our method can effectively parse the semantic meaning of the images with the underlying neural network and accurately delineate structural boundaries with reference to low-level contextual cues. Experimental results on the benchmark dataset of the 2012 ISBI segmentation challenge of neuronal structures show that the proposed method outperforms state-of-the-art methods by a large margin with respect to different evaluation measurements. Our method can potentially facilitate automatic connectome analysis from EM images with less human intervention.
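
Auxiliary classifiers of this kind are a standard deep-supervision device: intermediate layers get side outputs with their own losses, so gradients reach early layers directly. A minimal sketch, with layer depths and loss weights as assumptions:

```python
# Deep supervision sketch: auxiliary heads on intermediate features add
# extra loss terms that shorten the gradient path to early layers.
import torch
import torch.nn as nn

backbone = nn.ModuleList([nn.Conv2d(c_in, c_out, 3, padding=1)
                          for c_in, c_out in [(1, 16), (16, 32), (32, 64)]])
aux_heads = nn.ModuleList([nn.Conv2d(c, 2, 1) for c in (16, 32)])
main_head = nn.Conv2d(64, 2, 1)

x = torch.randn(1, 1, 64, 64)
target = torch.randint(0, 2, (1, 64, 64))
loss, feats = 0.0, x
for i, layer in enumerate(backbone):
    feats = torch.relu(layer(feats))
    if i < 2:                                  # auxiliary supervision
        loss = loss + 0.3 * nn.functional.cross_entropy(aux_heads[i](feats), target)
loss = loss + nn.functional.cross_entropy(main_head(feats), target)
loss.backward()
```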

AAAI Conference 2016 Conference Paper

Mitosis Detection in Breast Cancer Histology Images via Deep Cascaded Networks

  • Hao Chen
  • Qi Dou
  • Xi Wang
  • Jing Qin
  • Pheng Heng

The number of mitoses per tissue area gives an important indication of the aggressiveness of invasive breast carcinoma. However, automatic mitosis detection in histology images remains a challenging problem. Traditional methods either employ hand-crafted features to discriminate mitoses from other cells or construct a pixel-wise classifier that labels every pixel in a sliding-window fashion. While the former suffers from the large shape variation of mitoses and the existence of many mimics with similar appearance, the slow speed of the latter prohibits its use in clinical practice. To overcome these shortcomings, we propose a fast and accurate mitosis detection method built on a novel deep cascaded convolutional neural network composed of two components. First, leveraging a fully convolutional neural network, we propose a coarse retrieval model that identifies and locates mitosis candidates while preserving high sensitivity. Based on these candidates, a fine discrimination model utilizing knowledge transferred across domains is developed to further single out mitoses from hard mimics. Our approach outperformed the other methods by a large margin in the 2014 ICPR MITOS-ATYPIA challenge in terms of detection accuracy. When compared with the state-of-the-art methods on the 2012 ICPR MITOSIS data (a smaller and less challenging dataset), our method achieved comparable or better results at roughly 60× faster speed.
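
The cascade's speed comes from scoring every location with one cheap fully convolutional pass and reserving the heavier classifier for the few surviving candidates. A toy sketch of that control flow, with thresholds, patch size, and networks as placeholder assumptions:

```python
# Coarse-to-fine cascade sketch: dense cheap scores -> candidate crops ->
# finer rescreening of only those patches.
import torch
import torch.nn as nn

coarse = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                       nn.Conv2d(8, 1, 1))           # dense mitosis scores
fine = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2))

image = torch.rand(1, 3, 256, 256)
scores = torch.sigmoid(coarse(image))[0, 0]
ys, xs = torch.nonzero(scores > 0.6, as_tuple=True)   # candidate centers
for y, x in list(zip(ys.tolist(), xs.tolist()))[:5]:
    y0, x0 = max(0, y - 16), max(0, x - 16)
    patch = image[:, :, y0:y0 + 32, x0:x0 + 32]
    if patch.shape[-2:] == (32, 32):                  # skip border crops
        logits = fine(patch)                          # hard-mimic screening
```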

JBHI Journal 2015 Journal Article

Standard Plane Localization in Fetal Ultrasound via Domain Transferred Deep Neural Networks

  • Hao Chen
  • Dong Ni
  • Jing Qin
  • Shengli Li
  • Xin Yang
  • Tianfu Wang
  • Pheng Ann Heng

Automatic localization of a standard plane containing complicated anatomical structures in ultrasound (US) videos remains a challenging problem. In this paper, we present a learning-based approach to locate the fetal abdominal standard plane (FASP) in US videos by constructing a domain-transferred deep convolutional neural network (CNN). Compared with previous works based on low-level features, our approach is able to represent the complicated appearance of the FASP and hence achieves better classification performance. More importantly, to reduce the overfitting caused by the small number of training samples, we propose a transfer learning strategy that transfers the knowledge in the low layers of a base CNN, trained on a large database of natural images, to our task-specific CNN. Extensive experiments demonstrate that our approach outperforms the state-of-the-art method for FASP localization as well as a CNN trained only on the limited US training samples. The proposed approach can easily be extended to other similar medical image computing problems, which often suffer from insufficient training samples when exploiting deep CNNs to represent high-level features.
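
The transfer recipe itself is a few lines in a modern framework: copy a pretrained network, freeze its low layers, and retrain the top on the small dataset. The sketch below uses torchvision's resnet18 as a stand-in base network (the paper predates it) and assumes a binary FASP / non-FASP head:

```python
# Low-layer transfer sketch: keep pretrained early layers, retrain the top.
import torch.nn as nn
from torchvision import models

net = models.resnet18(weights="IMAGENET1K_V1")
for name, p in net.named_parameters():
    if not name.startswith(("layer4", "fc")):
        p.requires_grad = False                # freeze transferred low layers
net.fc = nn.Linear(net.fc.in_features, 2)      # FASP vs. non-FASP head
trainable = [p for p in net.parameters() if p.requires_grad]
```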

ICRA Conference 2010 Conference Paper

Design and analysis of a soft mobile robot composed of multiple thermally activated joints driven by a single actuator

  • Nadia Cheng
  • Genya Ishigami
  • Stephan Hawthorne
  • Hao Chen
  • Malik Hansen
  • Maria J. Telleria
  • Robert Playter
  • Karl Iagnemma

Soft robotic systems are of interest for industrial, medical, and security applications, many of which require the robots to be small and lightweight. One challenge in developing a soft robotic system is to drive multiple degrees of freedom (DOF) with few actuators, thereby reducing system size and weight. This paper presents the analysis and design of an inchworm-like mobile robot that consists of multiple, independent thermally activated joints yet is driven by a single actuator. To realize control of this under-actuated system, a solder-based locking mechanism has been developed to selectively activate individual joints without requiring additional actuators. The design and performance analysis of a prototype mobile robot capable of inchworm-like translational and steering motion is described, along with the design of novel “feet” with anisotropic friction properties.

ICRA Conference 2001 Conference Paper

The Switched Reluctance Motor Drive for the Direct-Drive Joint of the Robot

  • Hao Chen
  • Dong Zhang

The paper presents the principle of decoupling control of the phase voltage in a switched reluctance motor drive for the direct-drive joint of a robot. The elements of the motor drive system are described: the structure and rotor position of the three-phase 6/10 switched reluctance motor, the main circuit topology of the three-phase bifilar-winding power converter, and the pulse-width modulation control strategy. The mathematical models of the main circuit of the power converter are also presented. The optimum range of the turn-on and turn-off angles of the main switches in the power converter is given by the criterion of reducing the pulsation of the output torque, using a 2D finite element electromagnetic field calculation of the motor and a nonlinear simulation of the main circuit of the power converter with the control strategy.