Arrow Research search

Author name cluster

Zhe Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

40 papers
2 author rows

Possible papers

40

AAAI Conference 2026 Conference Paper

Gentle Manipulation Policy Learning via Demonstrations from VLM Planned Atomic Skills

  • Jiayu Zhou
  • Qiwei Wu
  • Jian Li
  • Zhe Chen
  • Xiaogang Xiong
  • Renjing Xu

Autonomous execution of long-horizon, contact-rich manipulation tasks traditionally requires extensive real-world data and expert engineering, posing significant cost and scalability challenges. This paper proposes a novel framework integrating hierarchical semantic decomposition, reinforcement learning (RL), visual language models (VLMs), and knowledge distillation to overcome these limitations. Complex tasks are decomposed into atomic skills, with a policy for each primitive trained via RL exclusively in simulation. Crucially, our RL formulation incorporates explicit force constraints to prevent object damage during delicate interactions. VLMs perform high-level task decomposition and skill planning, generating diverse expert demonstrations. These are distilled into a unified policy via a Visual-Tactile Diffusion Policy for end-to-end execution. We conduct comprehensive ablation studies exploring different VLM-based task planners to identify optimal demonstration generation pipelines, and systematically compare imitation learning algorithms for skill distillation. Extensive simulation experiments and physical deployment validate that our approach achieves policy learning for long-horizon manipulation without costly human demonstrations, while the VLM-guided atomic skill framework enables scalable generalization to diverse tasks.

AAAI Conference 2026 Conference Paper

GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and a Comprehensive Multimodal Dataset Towards General Medical AI

  • Tianbin Li
  • Yanzhou Su
  • Wei Li
  • Bin Fu
  • Zhe Chen
  • Ziyan Huang
  • Guoan Wang
  • Chenglong Ma

Despite significant advancements in general AI, its effectiveness in the medical domain is limited by the lack of specialized medical knowledge. To address this, we construct GMAI-VL-5.5M, a multimodal medical dataset, by converting hundreds of specialized medical datasets with various annotations into high-quality image-text pairs. This dataset offers comprehensive task coverage, diverse modalities, and rich image-text data. Building upon this dataset, we develop GMAI-VL, a 7B-parameter general medical vision-language model, with a three-stage training strategy that enhances the integration of visual and textual information. This approach significantly improves the model's ability to process multimodal data, supporting accurate diagnoses and clinical decision-making. Experiments show that GMAI-VL achieves state-of-the-art performance across various multimodal medical tasks, including visual question answering and medical image diagnosis.

AAAI Conference 2026 Conference Paper

MedS³: Towards Medical Slow Thinking with Self-Evolved Soft Dual-sided Process Supervision

  • Shuyang Jiang
  • Yusheng Liao
  • Zhe Chen
  • Ya Zhang
  • Yanfeng Wang
  • Yu Wang

Medical language models face critical barriers to real-world clinical reasoning applications. However, mainstream efforts fall short in task coverage, lack fine-grained supervision for intermediate reasoning steps, and rely on proprietary systems, leaving them far from a versatile, credible, and efficient language model for clinical reasoning. To this end, we propose MedS3, a self-evolving framework that imparts robust reasoning capabilities to small, deployable models. Starting with 8,000 curated instances sampled via a curriculum strategy across five medical domains and 16 datasets, we use a small base policy model to conduct Monte Carlo Tree Search (MCTS) for constructing rule-verifiable reasoning trajectories. Self-explored reasoning trajectories ranked by node values are used to bootstrap the policy model via reinforcement fine-tuning and preference learning. Moreover, we introduce a soft dual process reward model that incorporates value dynamics: steps that degrade node value are penalized, enabling fine-grained identification of reasoning errors even when the final answer is correct. Experiments on eleven benchmarks show that MedS3 outperforms the previous state-of-the-art medical model by +6.45 accuracy points and surpasses 32B-scale general-purpose reasoning models by +8.57 points. Additional empirical analysis further demonstrates that MedS3 achieves robust and faithful reasoning behavior.

AAAI Conference 2026 Conference Paper

Symbolic Planning and Multi-Agent Path Finding in Extremely Dense Environments with Unassigned Agents

  • Bo Fu
  • Zhe Chen
  • Rahul Chandan
  • Alexandre Ormiga Galvao Barbosa
  • Michael Caldara
  • Joey W. Durham
  • Federico Pecora

We introduce the Block Rearrangement Problem (BRaP), a challenging component of large warehouse management which involves rearranging storage blocks within dense grids to achieve a goal state. We formally define the BRaP as a graph search problem. Building on intuitions from sliding puzzle problems, we propose five search-based solution algorithms, leveraging joint configuration space search, classical planning, multi-agent pathfinding, and expert heuristics. We evaluate the five approaches empirically for plan quality and scalability. Despite the exponential relation between search space size and block number, our methods demonstrate efficiency in creating rearrangement plans for deeply buried blocks in up to 80x80 grids.
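
The graph-search framing above can be made concrete with a tiny hedged sketch: the example below is not one of the paper's five algorithms, it simply treats each grid configuration as a node, treats sliding one block into an adjacent empty cell as an edge, and runs plain breadth-first search; the function names and the 2x2 toy instance are illustrative only.

```python
# Hedged illustration: block rearrangement framed as graph search. States are
# grid configurations (0 = empty cell); each edge slides one block into an
# adjacent empty cell; plain BFS returns a shortest move sequence on tiny grids.
from collections import deque

def neighbors(state, rows, cols):
    """Yield successor configurations reachable by one block slide."""
    grid = [list(state[r * cols:(r + 1) * cols]) for r in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 0:                  # only expand around empty cells
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                br, bc = r + dr, c + dc
                if 0 <= br < rows and 0 <= bc < cols and grid[br][bc] != 0:
                    nxt = [row[:] for row in grid]
                    nxt[r][c], nxt[br][bc] = nxt[br][bc], 0
                    yield tuple(x for row in nxt for x in row)

def bfs_rearrange(start, goal, rows, cols):
    """Shortest sequence of configurations from start to goal, or None."""
    frontier, parent = deque([start]), {start: None}
    while frontier:
        state = frontier.popleft()
        if state == goal:
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in neighbors(state, rows, cols):
            if nxt not in parent:
                parent[nxt] = state
                frontier.append(nxt)
    return None

# Toy 2x2 instance: shift blocks 1 and 2 one cell clockwise around the grid.
print(bfs_rearrange((1, 2, 3, 0), (0, 1, 3, 2), 2, 2))
```

Real BRaP instances are far too large for uninformed search of this kind, which is why the paper turns to classical planning, multi-agent pathfinding, and expert heuristics.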

AAAI Conference 2025 Conference Paper

ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area

  • Junxian Li
  • Di Zhang
  • Xunzhi Wang
  • Zeying Hao
  • Jingdi Lei
  • Qian Tan
  • Cai Zhou
  • Wei Liu

Large Language Models (LLMs) have achieved remarkable success and have been applied across various scientific fields, including chemistry. However, many chemical tasks require the processing of visual information, which cannot be successfully handled by existing chemical LLMs. This brings a growing need for models capable of integrating multimodal information in the chemical domain. In this paper, we introduce ChemVLM, an open-source chemical multimodal large language model specifically designed for chemical applications. ChemVLM is trained on a carefully curated bilingual multimodal dataset that enhances its ability to understand both textual and visual chemical information, including molecular structures, reactions, and chemistry examination questions. We develop three datasets for comprehensive evaluation, tailored to Chemical Optical Character Recognition (OCR), Multimodal Chemical Reasoning (MMCR), and Multimodal Molecule Understanding tasks. We benchmark ChemVLM against a range of open-source and proprietary multimodal large language models on various tasks. Experimental results demonstrate that ChemVLM achieves competitive performance across all evaluated tasks.

AAAI Conference 2025 Conference Paper

Concurrent Planning and Execution in Lifelong Multi-Agent Path Finding with Delay Probabilities

  • Yue Zhang
  • Zhe Chen
  • Daniel Harabor
  • Pierre Le Bodic
  • Peter J. Stuckey

In multi-agent systems, when we account for the possibility of delays during execution, online planning becomes more complicated, as both execution and planning should be able to handle delays when agents are moving. Lifelong Multi-Agent Path Finding (LMAPF) is the problem of (re)planning the collision-free moves of agents to their goals in a shared space, while agents continuously receive new goals. PIE (Planning and Improving while Executing) is a recent approach to LMAPF which concurrently replans later parts of agents' trajectories while execution occurs. However, the execution is assumed to be perfect. Existing approaches either use policy-based methods to quickly coordinate agents every timestep with instant delay feedback, or deploy an execution policy to adjust a solution for delays on the fly. These approaches may introduce large amounts of unnecessary delays to agents due to their planner guarantees or simple delay-handling policies. In this paper, we extend PIE to define a framework for solving the lifelong MAPF problem with execution delays. We instantiate our framework with different execution and replanning strategies, and experimentally evaluate them. Overall, we find that this framework can substantially improve the throughput by up to a factor of 3 for lifelong MAPF, compared to approaches that handle delays with simple execution policies.

AAAI Conference 2025 Conference Paper

Incomplete Modality Disentangled Representation for Ophthalmic Disease Grading and Diagnosis

  • Chengzhi Liu
  • Zile Huang
  • Zhe Chen
  • Feilong Tang
  • Yu Tian
  • Zhongxing Xu
  • Zihong Luo
  • Yalin Zheng

Ophthalmologists typically require multimodal data sources to improve diagnostic accuracy in clinical decisions. However, due to medical device shortages, low-quality data and data privacy concerns, missing data modalities are common in real-world scenarios. Existing deep learning methods tend to address this by learning an implicit latent subspace representation for different modality combinations. We identify two significant limitations of these methods: (1) implicit representation constraints that hinder the model's ability to capture modality-specific information and (2) modality heterogeneity, causing distribution gaps and redundancy in feature representations. To address these, we propose an Incomplete Modality Disentangled Representation (IMDR) strategy, which disentangles features into explicit independent modal-common and modal-specific features under the guidance of mutual information, distilling informative knowledge and enabling it to reconstruct valuable missing semantics and produce robust multimodal representations. Furthermore, we introduce a joint proxy learning module that assists IMDR in eliminating intra-modality redundancy by exploiting the extracted proxies from each class. Experiments on four ophthalmology multimodal datasets demonstrate that the proposed IMDR outperforms the state-of-the-art methods significantly.

NeurIPS Conference 2025 Conference Paper

On Geometry-Enhanced Parameter-Efficient Fine-Tuning for 3D Scene Segmentation

  • Liyao Tang
  • Zhe Chen
  • Dacheng Tao

The emergence of large-scale pre-trained point cloud models has significantly advanced 3D scene understanding, but adapting these models to specific downstream tasks typically demands full fine-tuning, incurring high computational and storage costs. Parameter-efficient fine-tuning (PEFT) techniques, successful in natural language processing and 2D vision tasks, often underperform when naively applied to 3D point cloud models due to significant geometric and spatial distribution shifts. Existing PEFT methods commonly treat points as orderless tokens, neglecting important local spatial structures and global geometric contexts in 3D modeling. To bridge this gap, we introduce the Geometric Encoding Mixer (GEM), a novel geometry-aware PEFT module specifically designed for 3D point cloud transformers. GEM explicitly integrates fine-grained local positional encodings with a lightweight latent attention mechanism to capture comprehensive global context, thereby effectively addressing the spatial and geometric distribution mismatch. Extensive experiments demonstrate that GEM achieves performance comparable to or sometimes even exceeding full fine-tuning, while only updating 1.6% of the model's parameters, fewer than other PEFT methods. With significantly reduced training time and memory requirements, our approach thus sets a new benchmark for efficient, scalable, and geometry-aware fine-tuning of large-scale 3D point cloud models. Code is available at https://github.com/LiyaoTang/GEM.

AAAI Conference 2025 Conference Paper

Online Guidance Graph Optimization for Lifelong Multi-Agent Path Finding

  • Hongzhi Zang
  • Yulun Zhang
  • He Jiang
  • Zhe Chen
  • Daniel Harabor
  • Peter J. Stuckey
  • Jiaoyang Li

We study the problem of optimizing a guidance policy capable of dynamically guiding the agents for lifelong Multi-Agent Path Finding based on real-time traffic patterns. Multi-Agent Path Finding (MAPF) focuses on moving multiple agents from their starts to goals without collisions. Its lifelong variant, LMAPF, continuously assigns new goals to agents. In this work, we focus on improving the solution quality of PIBT, a state-of-the-art rule-based LMAPF algorithm, by optimizing a policy to generate adaptive guidance. We design two pipelines to incorporate guidance in PIBT in two different ways. We demonstrate the superiority of the optimized policy over both static guidance and human-designed policies. Additionally, we explore scenarios where task distribution changes over time, a challenging yet common situation in real-world applications that is rarely explored in the literature.

JAIR Journal 2025 Journal Article

Prioritised Planning: Completeness, Optimality, and Complexity

  • Jonathan Morag
  • Yue Zhang
  • Daniel Koyfman
  • Zhe Chen
  • Ariel Felner
  • Daniel Harabor
  • Roni Stern

Prioritised Planning (PP) is a popular approach for multi-agent and multi-robot navigation. In PP, collision-free paths are computed for one agent at a time, following a total order over the agents, called a priority ordering. Many MAPF algorithms follow this approach or use it in some way, including several state-of-the-art MAPF algorithms, although it is known that PP is neither complete nor optimal. In this work, we characterise the space of problems a PP algorithm can solve, and define the search problem of identifying whether a given MAPF problem is in that space. We call this search problem Prioritised MAPF (P-MAPF) and investigate its computational complexity, showing that it is generally NP-hard. Then, we develop a novel efficient search algorithm called Path and Priority Search (PaPS), which solves P-MAPF, providing guarantees of completeness and optimality. We next observe that PP algorithms operate with two primary degrees of freedom – the choice of priority ordering, and the choice of individual paths for agents. Accordingly, we further divide P-MAPF into two planning problems corresponding to the two degrees of freedom. We call them Priority-Function Constrained MAPF (PFC-MAPF), where the path choice is fixed while the priority ordering is not, and Priority Constrained MAPF (PC-MAPF), where the priority ordering is fixed while the path choice is not. We analyse these problems as well, and show how PaPS can be easily adapted to create algorithms that solve these problems optimally. We experiment with our algorithms in a range of settings, including comparisons with existing PP baselines. Our results show how the different degrees of freedom of PP-based algorithms affect their behaviour, and provide the first-known results for solution-quality optimality for PP-based algorithms on a popular MAPF benchmark set. The latter can be used as a lower bound for any PP algorithm.
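
For readers unfamiliar with the prioritised planning scheme analysed above, the hedged sketch below shows only the textbook PP loop (plan one agent at a time in a fixed priority order, treating higher-priority agents' space-time cells as reservations); it is not PaPS, it handles vertex conflicts only, and the function names and grid instance are hypothetical.

```python
# Minimal sketch of basic prioritised planning (not the paper's PaPS algorithm):
# agents are planned one at a time on a 4-connected grid, and each later agent
# must avoid the space-time cells already reserved by higher-priority agents.
from collections import deque

def plan_single(grid, start, goal, reserved, horizon=64):
    """Space-time BFS: shortest path avoiding reserved (cell, time) pairs."""
    rows, cols = len(grid), len(grid[0])
    frontier, parent = deque([(start, 0)]), {(start, 0): None}
    while frontier:
        cell, t = frontier.popleft()
        if cell == goal:
            path, node = [], (cell, t)
            while node is not None:
                path.append(node[0])
                node = parent[node]
            return path[::-1]
        if t >= horizon:
            continue
        r, c = cell
        for dr, dc in ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)):  # wait or move
            nr, nc = r + dr, c + dc
            nxt = ((nr, nc), t + 1)
            if (0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0
                    and nxt not in reserved and nxt not in parent):
                parent[nxt] = (cell, t)
                frontier.append(nxt)
    return None  # this priority ordering fails for this agent

def prioritised_plan(grid, agents):
    """Plan agents in list order; earlier agents' paths become reservations."""
    reserved, plans = set(), []
    for start, goal in agents:
        path = plan_single(grid, start, goal, reserved)
        if path is None:
            return None
        plans.append(path)
        reserved.update((cell, t) for t, cell in enumerate(path))
    return plans

grid = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]               # 1 = blocked cell
print(prioritised_plan(grid, [((0, 0), (2, 2)), ((2, 0), (0, 2))]))
```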

NeurIPS Conference 2025 Conference Paper

RAD: Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis

  • Haolin Li
  • Tianjie Dai
  • Zhe Chen
  • Siyuan Du
  • Jiangchao Yao
  • Ya Zhang
  • Yanfeng Wang

Clinical diagnosis is a highly specialized discipline requiring both domain expertise and strict adherence to rigorous guidelines. While current AI-driven medical research predominantly focuses on knowledge graphs or natural text pretraining paradigms to incorporate medical knowledge, these approaches primarily rely on implicitly encoded knowledge within model parameters, neglecting task-specific knowledge required by diverse downstream tasks. To address this limitation, we propose Retrieval-Augmented Diagnosis (RAD), a novel framework that explicitly injects external knowledge into multimodal models directly on downstream tasks. Specifically, RAD operates through three key mechanisms: retrieval and refinement of disease-centered knowledge from multiple medical sources, a guideline-enhanced contrastive loss that constrains the latent distance between multi-modal features and guideline knowledge, and the dual transformer decoder that employs guidelines as queries to steer cross-modal fusion, aligning the models with clinical diagnostic workflows from guideline acquisition to feature extraction and decision-making. Moreover, recognizing the lack of quantitative evaluation of interpretability for multimodal diagnostic models, we introduce a set of criteria to assess the interpretability from both image and text perspectives. Extensive evaluations across four datasets with different anatomies demonstrate RAD's generalizability, achieving state-of-the-art performance. Furthermore, RAD enables the model to concentrate more precisely on abnormal regions and critical indicators, ensuring evidence-based, trustworthy diagnosis. Our code is available at https://github.com/tdlhl/RAD.

AAAI Conference 2025 Conference Paper

ReactGPT: Understanding of Chemical Reactions via In-Context Tuning

  • Zhe Chen
  • Zhe Fang
  • Wenhao Tian
  • Zhaoguang Long
  • Changzhi Sun
  • Yuefeng Chen
  • Hao Yuan
  • Honglin Li

The interdisciplinary field of chemistry and artificial intelligence (AI) is an active area of research aimed at accelerating scientific discovery. Large Language Models (LLMs) have shown significant promise in biochemical tasks, especially molecule-caption translation, which aims to align molecules with natural language texts. However, existing works mainly focus on single molecules, while alignment between chemical reactions and natural language text remains largely unexplored. Additionally, the description of reactions is an essential part of biochemical patents and literature, and research on this aspect not only can help better understand chemical reactions but also promote research on automating chemical synthesis and retrosynthesis. In this work, we propose ReactGPT, a framework aiming to bridge the gap between chemical reactions and text. ReactGPT allows a new task: reaction captioning, by adapting LLMs to learn reaction-text alignment from context examples via In-Context Tuning. Specifically, ReactGPT jointly leverages a Fingerprints-based Reaction Retrieval module, a Domain-Specific Prompt Design module, and a two-stage In-Context Tuning module. We evaluate the effectiveness of ReactGPT on reaction captioning and experimental procedure prediction, both of which can reflect the understanding of chemical reactions. Experimental results show that compared to previous models, ReactGPT exhibits competitive capabilities in resolving chemical reactions and generating high-quality text with correct structure.

ICRA Conference 2025 Conference Paper

RL-OGM-Parking: Lidar OGM-Based Hybrid Reinforcement Learning Planner for Autonomous Parking

  • Zhitao Wang
  • Zhe Chen
  • Mingyang Jiang
  • Tong Qin 0001
  • Ming Yang 0002

Autonomous parking has become a critical application in automatic driving research and development. Parking operations often suffer from limited space and complex environments, requiring accurate perception and precise maneuvering. Traditional rule-based parking algorithms struggle to adapt to diverse and unpredictable conditions, while learning-based algorithms lack consistent and stable performance in various scenarios. Therefore, a hybrid approach is necessary that combines the stability of rule-based methods and the generalizability of learning-based methods. Recently, reinforcement learning (RL) based policies have shown robust capability in planning tasks. However, the simulation-to-reality (sim-to-real) transfer gap seriously blocks real-world deployment. To address these problems, we employ a hybrid policy, consisting of a rule-based Reeds-Shepp (RS) planner and a learning-based reinforcement learning (RL) planner. A real-time LiDAR-based Occupancy Grid Map (OGM) representation is adopted to bridge the sim-to-real gap, allowing the hybrid policy to be applied to real-world systems seamlessly. We conducted extensive experiments both in the simulation environment and real-world scenarios, and the results demonstrate that the proposed method outperforms pure rule-based and learning-based methods. The real-world experiment further validates the feasibility and efficiency of the proposed method.

IJCAI Conference 2025 Conference Paper

RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations

  • Zunhai Su
  • Hanyu Wei
  • Zhe Chen
  • Wang Shen
  • Linge Li
  • Huangqi Yu
  • Kehong Yuan

Key-Value (KV) cache facilitates efficient large language models (LLMs) inference by avoiding recomputation of past KVs. As the batch size and context length increase, the oversized KV caches become a significant memory bottleneck, highlighting the need for efficient compression. Existing KV quantization methods rely on fine-grained quantization or on retaining a significant portion of the cache at high bit-widths, both of which compromise the compression ratio and often fail to maintain robustness at extremely low average bit-widths. In this work, we explore the potential of rotation techniques for 2-bit KV quantization and propose RotateKV, which achieves accurate and robust performance through the following innovations: (i) Outlier-Aware Rotation, which utilizes channel-reordering to adapt the rotations to varying channel-wise outlier distributions without sacrificing the computational efficiency of the fast Walsh-Hadamard transform (FWHT); (ii) Pre-RoPE Grouped-Head Rotation, which mitigates the impact of rotary position embedding (RoPE) on the proposed outlier-aware rotation and further smooths outliers across heads; (iii) Attention-Sink-Aware Quantization, which leverages the massive activations to precisely identify and protect attention sinks. RotateKV achieves less than 0.3 perplexity (PPL) degradation with 2-bit quantization on WikiText-2 using LLaMA-2-13B, maintains strong CoT reasoning and long-context capabilities, with less than 1.7% degradation on GSM8K, outperforming existing methods even at lower average bit-widths. RotateKV also showcases a 3.97× reduction in peak memory usage, supports 5.75× larger batch sizes, and achieves a 2.32× speedup in the decoding stage.
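
As rough intuition for why rotation helps low-bit KV quantization, the hedged NumPy sketch below (not the RotateKV pipeline; no channel reordering, RoPE handling, or attention-sink protection) shows how an orthonormal Hadamard rotation evens out per-channel dynamic ranges when one channel carries strong outliers; the array shapes and outlier magnitude are made up for illustration.

```python
# Illustrative sketch: multiplying key/value activations by an orthonormal
# Hadamard matrix spreads the energy of an outlier channel across all
# dimensions, evening out per-channel ranges so that a coarse (e.g. 2-bit)
# quantizer no longer has to cover one extreme channel.
import numpy as np

def hadamard(n):
    """Sylvester-construction Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)                # orthonormal, so H @ H.T is the identity

rng = np.random.default_rng(0)
kv = rng.normal(size=(128, 64))          # toy cache: 128 tokens x 64 channels
kv[:, 7] *= 30.0                         # inject one strong outlier channel

H = hadamard(kv.shape[1])
rotated = kv @ H                         # rotate channels; invert later with @ H.T

for name, x in (("original", kv), ("rotated", rotated)):
    ranges = np.abs(x).max(axis=0)       # per-channel dynamic range
    print(f"{name:>8}: max range {ranges.max():6.1f}, "
          f"median range {np.median(ranges):5.1f}")
```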

AAAI Conference 2025 Conference Paper

Toward Modality Gap: Vision Prototype Learning for Weakly-supervised Semantic Segmentation with CLIP

  • Zhongxing Xu
  • Feilong Tang
  • Zhe Chen
  • Yingxue Su
  • Zhiyi Zhao
  • Ge Zhang
  • Jionglong Su
  • Zongyuan Ge

The application of Contrastive Language-Image Pre-training (CLIP) in Weakly Supervised Semantic Segmentation (WSSS) research demonstrates powerful cross-modal semantic understanding capabilities. Existing methods attempt to optimize input text prompts for improved alignment of images and text, specifically by finely adjusting text prototypes to facilitate semantic matching. Nevertheless, given the modality gap between text and vision spaces, the text prototypes employed by these methods have not effectively established a close correspondence with pixel-level vision features. In this work, our theoretical analysis indicates that the inherent modality gap results in misalignment of text and region features, and that this gap cannot be sufficiently reduced by minimizing the contrastive loss in CLIP. To mitigate the impact of the modality gap, we propose a Vision Prototype Learning (VPL) framework, by introducing more representative vision prototypes. The core of this framework is to learn class-specific vision prototypes in vision space with the help of text prototypes, to capture high-quality localization maps. Moreover, we propose a regional semantic contrast module that contrasts region embeddings with their corresponding prototypes, leading to more comprehensive and robust feature learning. Experimental results show that our proposed framework achieves state-of-the-art performance on two benchmark datasets.

AAMAS Conference 2024 Conference Paper

Anytime Multi-Agent Path Finding using Operation Parallelism in Large Neighborhood Search

  • Shao-Hung Chan
  • Zhe Chen
  • Dian-Lun Lin
  • Yue Zhang
  • Daniel Harabor
  • Sven Koenig
  • Tsung-Wei Huang
  • Thomy Phan

Multi-Agent Path Finding (MAPF) is the problem of finding a set of collision-free paths for multiple agents in a shared environment while improving the solution quality. The state-of-the-art anytime MAPF algorithm is based on Large Neighborhood Search (MAPF-LNS), which is a combinatorial search algorithm that iteratively destroys and repairs a subset of collision-free paths. In this paper, we propose Destroy-Repair Operation Parallelism for MAPF-LNS (DROP-LNS), a parallel framework that performs multiple destroy and repair operations concurrently to explore more regions of the search space and improve the solution quality. Unlike MAPF-LNS, DROP-LNS is able to exploit multiple threads during the search. The results show that DROP-LNS outperforms the state-of-the-art anytime MAPF algorithms, namely MAPF-LNS and LaCAM*, with respect to solution quality when terminated at the same runtime.
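
For context, the sketch below shows a generic serial anytime large-neighborhood-search loop of the kind MAPF-LNS implements; DROP-LNS's actual contribution, running several destroy-and-repair operations concurrently across threads, is deliberately omitted, and the callback names (replan_subset, cost) are hypothetical.

```python
# Generic anytime destroy-and-repair loop (a serial sketch, not DROP-LNS):
# repeatedly pick a small subset of agents, discard their paths, replan them
# against everyone else, and keep the result only if total cost improves.
import random, time

def lns(initial_paths, replan_subset, cost, neighborhood=4, time_limit=1.0):
    """initial_paths: dict agent -> path. replan_subset(chosen, fixed) replans
    the chosen agents against the fixed paths and returns new paths, or None."""
    paths, best = dict(initial_paths), cost(initial_paths)
    deadline = time.time() + time_limit
    while time.time() < deadline:
        chosen = random.sample(list(paths), min(neighborhood, len(paths)))
        fixed = {a: p for a, p in paths.items() if a not in chosen}   # destroy
        new = replan_subset(chosen, fixed)                            # repair
        if new is None:
            continue
        candidate = {**fixed, **new}
        new_cost = cost(candidate)
        if new_cost < best:                  # keep only improving solutions
            paths, best = candidate, new_cost
    return paths, best
```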

AAAI Conference 2024 Conference Paper

AVSegFormer: Audio-Visual Segmentation with Transformer

  • Shengyi Gao
  • Zhe Chen
  • Guo Chen
  • Wenhai Wang
  • Tong Lu

Audio-visual segmentation (AVS) aims to locate and segment the sounding objects in a given video, which demands audio-driven pixel-level scene understanding. The existing methods cannot fully process the fine-grained correlations between audio and visual cues across various situations dynamically. They also face challenges in adapting to complex scenarios, such as evolving audio, the coexistence of multiple objects, and more. In this paper, we propose AVSegFormer, a novel framework for AVS that leverages the transformer architecture. Specifically, it comprises a dense audio-visual mixer, which can dynamically adjust the visual features of interest, and a sparse audio-visual decoder, which implicitly separates audio sources and automatically matches optimal visual features. Combining both components provides a more robust bidirectional conditional multi-modal representation, improving the segmentation performance in different scenarios. Extensive experiments demonstrate that AVSegFormer achieves state-of-the-art results on the AVS benchmark. The code is available at https://github.com/vvvb-github/AVSegFormer.

ICLR Conference 2024 Conference Paper

GeoDiffusion: Text-Prompted Geometric Control for Object Detection Data Generation

  • Kai Chen 0023
  • Enze Xie
  • Zhe Chen
  • Yibo Wang 0039
  • Lanqing Hong
  • Zhenguo Li
  • Dit-Yan Yeung

Diffusion models have attracted significant attention due to their remarkable ability to create content and generate data for tasks like image classification. However, the usage of diffusion models to generate high-quality object detection data remains an underexplored area, where not only image-level perceptual quality but also geometric conditions such as bounding boxes and camera views are essential. Previous studies have utilized either copy-paste synthesis or layout-to-image (L2I) generation with specifically designed modules to encode the semantic layouts. In this paper, we propose GeoDiffusion, a simple framework that can flexibly translate various geometric conditions into text prompts and empower pre-trained text-to-image (T2I) diffusion models for high-quality detection data generation. Unlike previous L2I methods, our GeoDiffusion is able to encode not only the bounding boxes but also extra geometric conditions such as camera views in self-driving scenes. Extensive experiments demonstrate GeoDiffusion outperforms previous L2I methods while training 4x faster. To the best of our knowledge, this is the first work to adopt diffusion models for layout-to-image generation with geometric conditions and demonstrate that L2I-generated images can be beneficial for improving the performance of object detectors.

NeurIPS Conference 2024 Conference Paper

InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

  • Xiaoyi Dong
  • Pan Zhang
  • Yuhang Zang
  • Yuhang Cao
  • Bin Wang
  • Linke Ouyang
  • Songyang Zhang
  • Haodong Duan

The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its progression has been hindered by challenges in comprehending fine-grained visual content due to limited resolution. Recent efforts have aimed to enhance the high-resolution understanding capabilities of LVLMs, yet they remain capped at approximately 1500 × 1500 pixels and constrained to a relatively narrow resolution range. This paper presents InternLM-XComposer2-4KHD, a groundbreaking exploration into elevating LVLM resolution capabilities up to 4K HD (3840 × 1600) and beyond. Concurrently, considering the ultra-high resolution may not be necessary in all scenarios, it supports a wide range of diverse resolutions from 336 pixels to 4K standard, significantly broadening its scope of applicability. Specifically, this research advances the patch division paradigm by introducing a novel extension: dynamic resolution with automatic patch configuration. It maintains the training image aspect ratios while automatically varying patch counts and configuring layouts based on a pre-trained Vision Transformer (ViT) (336 × 336), leading to dynamic training resolution from 336 pixels to 4K standard. Our research demonstrates that scaling training resolution up to 4K HD leads to consistent performance enhancements without hitting the ceiling of potential improvements. InternLM-XComposer2-4KHD shows superb capability that matches or even surpasses GPT-4V and Gemini Pro in 10 of the 16 benchmarks.

NeurIPS Conference 2024 Conference Paper

Needle In A Multimodal Haystack

  • Weiyun Wang
  • Shuibo Zhang
  • Yiming Ren
  • Yuchen Duan
  • Tiantong Li
  • Shuo Liu
  • Mengkang Hu
  • Zhe Chen

With the rapid advancement of multimodal large language models (MLLMs), their evaluation has become increasingly comprehensive. However, understanding long multimodal content, as a foundational ability for real-world applications, remains underexplored. In this work, we present Needle In A Multimodal Haystack (MM-NIAH), the first benchmark specifically designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents. Our benchmark includes three types of evaluation tasks: multimodal retrieval, counting, and reasoning. In each task, the model is required to answer the questions according to different key information scattered throughout the given multimodal document. Evaluating the leading MLLMs on MM-NIAH, we observe that existing models still have significant room for improvement on these tasks, especially on vision-centric evaluation. We hope this work can provide a platform for further research on long multimodal document comprehension and contribute to the advancement of MLLMs. Code and benchmark are released at https://github.com/OpenGVLab/MM-NIAH.

IROS Conference 2024 Conference Paper

ParkingE2E: Camera-based End-to-end Parking Network, from Images to Planning

  • Changze Li
  • Ziheng Ji
  • Zhe Chen
  • Tong Qin 0001
  • Ming Yang 0002

Autonomous parking is a crucial task in the intelligent driving field. Traditional parking algorithms are usually implemented using rule-based schemes. However, these methods are less effective in complex parking scenarios due to the intricate design of the algorithms. In contrast, neural-network-based methods tend to be more intuitive and versatile than the rule-based methods. By collecting a large amount of expert parking trajectory data and emulating human strategy via learning-based methods, the parking task can be effectively addressed. In this paper, we employ imitation learning to perform end-to-end planning from RGB images to path planning by imitating human driving trajectories. The proposed end-to-end approach utilizes a target query encoder to fuse images and target features, and a transformer-based decoder to autoregressively predict future waypoints. We conduct extensive experiments in real-world scenarios, and the results demonstrate that the proposed method achieves an average parking success rate of 87.8% across four different real-world garages. Real-vehicle experiments further validate the feasibility and effectiveness of the method proposed in this paper. The code can be found at: https://github.com/qintonguav/ParkingE2E.

AAAI Conference 2024 Conference Paper

SimDistill: Simulated Multi-Modal Distillation for BEV 3D Object Detection

  • Haimei Zhao
  • Qiming Zhang
  • Shanshan Zhao
  • Zhe Chen
  • Jing Zhang
  • Dacheng Tao

Multi-view camera-based 3D object detection has become popular due to its low cost, but accurately inferring 3D geometry solely from camera data remains challenging and may lead to inferior performance. Although distilling precise 3D geometry knowledge from LiDAR data could help tackle this challenge, the benefits of LiDAR information could be greatly hindered by the significant modality gap between different sensory modalities. To address this issue, we propose a Simulated multi-modal Distillation (SimDistill) method by carefully crafting the model architecture and distillation strategy. Specifically, we devise multi-modal architectures for both teacher and student models, including a LiDAR-camera fusion-based teacher and a simulated fusion-based student. Owing to the "identical" architecture design, the student can mimic the teacher to generate multi-modal features with merely multi-view images as input, where a geometry compensation module is introduced to bridge the modality gap. Furthermore, we propose a comprehensive multi-modal distillation scheme that supports intra-modal, cross-modal, and multi-modal fusion distillation simultaneously in the Bird's-eye-view space. Incorporating them together, our SimDistill can learn better feature representations for 3D object detection while maintaining a cost-effective camera-only deployment. Extensive experiments validate the effectiveness and superiority of SimDistill over state-of-the-art methods, achieving an improvement of 4.8% mAP and 4.1% NDS over the baseline detector. The source code will be released at https://github.com/ViTAE-Transformer/SimDistill.

AAAI Conference 2024 Conference Paper

Structural Information Guided Multimodal Pre-training for Vehicle-Centric Perception

  • Xiao Wang
  • Wentao Wu
  • Chenglong Li
  • Zhicheng Zhao
  • Zhe Chen
  • Yukai Shi
  • Jin Tang

Understanding vehicles in images is important for various applications such as intelligent transportation and self-driving systems. Existing vehicle-centric works typically pre-train models on large-scale classification datasets and then fine-tune them for specific downstream tasks. However, they neglect the specific characteristics of vehicle perception in different tasks and might thus lead to sub-optimal performance. To address this issue, we propose a novel vehicle-centric pre-training framework called VehicleMAE, which incorporates the structural information including the spatial structure from vehicle profile information and the semantic structure from informative high-level natural language descriptions for effective masked vehicle appearance reconstruction. To be specific, we explicitly extract the sketch lines of vehicles as a form of the spatial structure to guide vehicle reconstruction. The more comprehensive knowledge distilled from the CLIP big model based on the similarity between the paired/unpaired vehicle image-text samples is further taken into consideration to help achieve a better understanding of vehicles. A large-scale dataset is built to pre-train our model, termed Autobot1M, which contains about 1M vehicle images and 12,693 text descriptions. Extensive experiments on four vehicle-based downstream tasks fully validate the effectiveness of our VehicleMAE. The source code and pre-trained models will be released at https://github.com/Event-AHU/VehicleMAE.

AAAI Conference 2024 Conference Paper

Traffic Flow Optimisation for Lifelong Multi-Agent Path Finding

  • Zhe Chen
  • Daniel Harabor
  • Jiaoyang Li
  • Peter J. Stuckey

Multi-Agent Path Finding (MAPF) is a fundamental problem in robotics that asks us to compute collision-free paths for a team of agents, all moving across a shared map. Although many works appear on this topic, all current algorithms struggle as the number of agents grows. The principal reason is that existing approaches typically plan free-flow optimal paths, which creates congestion. To tackle this issue, we propose a new approach for MAPF where agents are guided to their destination by following congestion-avoiding paths. We evaluate the idea in two large-scale settings: one-shot MAPF, where each agent has a single destination, and lifelong MAPF, where agents are continuously assigned new destinations. Empirically, we report large improvements in solution quality for one-shot MAPF and in overall throughput for lifelong MAPF.
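
The congestion-avoiding idea can be illustrated with a toy sketch (not the paper's traffic-flow optimisation): each agent plans on edge costs inflated by how many previously planned agents already use the same edge, so later agents are steered around busy corridors; the adjacency-dict graph representation and penalty value are assumptions for illustration.

```python
# Toy illustration of congestion-avoiding path planning: Dijkstra on edge costs
# of 1 + penalty * (number of earlier agents already using that edge).
import heapq
from collections import defaultdict

def congestion_aware_path(graph, start, goal, usage, penalty=2.0):
    """Dijkstra with congestion-penalised edge costs; assumes goal is reachable."""
    dist, prev, pq = {start: 0.0}, {}, [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist[u]:
            continue
        for v in graph[u]:
            nd = d + 1.0 + penalty * usage[(u, v)]
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]

def plan_all(graph, requests):
    """Plan agents one after another, recording edge usage for later agents."""
    usage, paths = defaultdict(int), []
    for start, goal in requests:
        path = congestion_aware_path(graph, start, goal, usage)
        paths.append(path)
        for u, v in zip(path, path[1:]):
            usage[(u, v)] += 1
    return paths

# Tiny 2x3 grid expressed as an adjacency dict of undirected neighbours.
grid_graph = {
    (0, 0): [(0, 1), (1, 0)], (0, 1): [(0, 0), (0, 2), (1, 1)],
    (0, 2): [(0, 1), (1, 2)], (1, 0): [(0, 0), (1, 1)],
    (1, 1): [(1, 0), (1, 2), (0, 1)], (1, 2): [(1, 1), (0, 2)],
}
print(plan_all(grid_graph, [((0, 0), (0, 2)), ((1, 0), (1, 2))]))
```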

NeurIPS Conference 2024 Conference Paper

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

  • Jiannan Wu
  • Muyan Zhong
  • Sen Xing
  • Zeqiang Lai
  • Zhaoyang Liu
  • Zhe Chen
  • Wenhai Wang
  • Xizhou Zhu

We present VisionLLM v2, an end-to-end generalist multimodal large model (MLLM) that unifies visual perception, understanding, and generation within a single framework. Unlike traditional MLLMs limited to text output, VisionLLM v2 significantly broadens its application scope. It excels not only in conventional visual question answering (VQA) but also in open-ended, cross-domain vision tasks such as object localization, pose estimation, and image generation and editing. To this end, we propose a new information transmission mechanism termed "super link", as a medium to connect MLLM with task-specific decoders. It not only allows flexible transmission of task information and gradient feedback between the MLLM and multiple downstream decoders but also effectively resolves training conflicts in multi-tasking scenarios. In addition, to support the diverse range of tasks, we carefully collected and combed training data from hundreds of public vision and vision-language tasks. In this way, our model can be joint-trained end-to-end on hundreds of vision language tasks and generalize to these tasks using a set of shared parameters through different user prompts, achieving performance comparable to task-specific models. We believe VisionLLM v2 will offer a new perspective on the generalization of MLLMs.

NeurIPS Conference 2023 Conference Paper

All Points Matter: Entropy-Regularized Distribution Alignment for Weakly-supervised 3D Segmentation

  • Liyao Tang
  • Zhe Chen
  • Shanshan Zhao
  • Chaoyue Wang
  • Dacheng Tao

Pseudo-labels are widely employed in weakly supervised 3D segmentation tasks where only sparse ground-truth labels are available for learning. Existing methods often rely on empirical label selection strategies, such as confidence thresholding, to generate beneficial pseudo-labels for model training. This approach may, however, hinder the comprehensive exploitation of unlabeled data points. We hypothesize that this selective usage arises from the noise in pseudo-labels generated on unlabeled data. The noise in pseudo-labels may result in significant discrepancies between pseudo-labels and model predictions, thus confusing and affecting the model training greatly. To address this issue, we propose a novel learning strategy to regularize the generated pseudo-labels and effectively narrow the gaps between pseudo-labels and model predictions. More specifically, our method introduces an Entropy Regularization loss and a Distribution Alignment loss for weakly supervised learning in 3D segmentation tasks, resulting in an ERDA learning strategy. Interestingly, by using KL distance to formulate the distribution alignment loss, it reduces to a deceptively simple cross-entropy-based loss which optimizes both the pseudo-label generation network and the 3D segmentation network simultaneously. Despite the simplicity, our method promisingly improves the performance. We validate the effectiveness through extensive experiments on various baselines and large-scale datasets. Results show that ERDA enables the effective usage of all unlabeled data points for learning and achieves state-of-the-art performance under different settings. Remarkably, our method can outperform fully-supervised baselines using only 1% of true annotations. Code and model will be made publicly available at https://github.com/LiyaoTang/ERDA.
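
A hedged sketch of the loss structure described above (not the authors' released ERDA code): an entropy term that sharpens the pseudo-label distribution plus an alignment term that, under a KL formulation, reduces to a cross-entropy between pseudo-labels and predictions; the function names and the equal weighting of the two terms are assumptions.

```python
# Illustrative NumPy sketch of an entropy-regularized distribution-alignment loss.
import numpy as np

def softmax(z):
    """Row-wise softmax over class logits."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def erda_style_loss(pseudo_logits, pred_logits, eps=1e-8):
    """Entropy term on pseudo-labels + cross-entropy alignment with predictions."""
    p = softmax(pseudo_logits)                              # pseudo-label distribution
    q = softmax(pred_logits)                                # segmentation prediction
    entropy = -(p * np.log(p + eps)).sum(axis=1).mean()     # entropy regularization
    alignment = -(p * np.log(q + eps)).sum(axis=1).mean()   # cross-entropy(p, q)
    return entropy + alignment

rng = np.random.default_rng(0)
print(erda_style_loss(rng.normal(size=(5, 3)), rng.normal(size=(5, 3))))
```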

IJCAI Conference 2023 Conference Paper

Graph Propagation Transformer for Graph Representation Learning

  • Zhe Chen
  • Hao Tan
  • Tao Wang
  • Tianrun Shen
  • Tong Lu
  • Qiuying Peng
  • Cheng Cheng
  • Yue Qi

This paper presents a novel transformer architecture for graph representation learning. The core insight of our method is to fully consider the information propagation among nodes and edges in a graph when building the attention module in the transformer blocks. Specifically, we propose a new attention mechanism called Graph Propagation Attention (GPA). It explicitly passes the information among nodes and edges in three ways, i.e., node-to-node, node-to-edge, and edge-to-node, which is essential for learning graph-structured data. On this basis, we design an effective transformer architecture named Graph Propagation Transformer (GPTrans) to further help learn graph data. We verify the performance of GPTrans in a wide range of graph learning experiments on several benchmark datasets. These results show that our method outperforms many state-of-the-art transformer-based graph models. The code will be released at https://github.com/czczup/GPTrans.

NeurIPS Conference 2023 Conference Paper

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

  • Wenhai Wang
  • Zhe Chen
  • Xiaokang Chen
  • Jiannan Wu
  • Xizhou Zhu
  • Gang Zeng
  • Ping Luo
  • Tong Lu

Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks, endowing them with immense potential across a range of applications. However, in the field of computer vision, despite the availability of numerous powerful vision foundation models (VFMs), they are still restricted to tasks in a pre-defined form, struggling to match the open-ended task capabilities of LLMs. In this work, we present an LLM-based framework for vision-centric tasks, termed VisionLLM. This framework provides a unified perspective for vision and language tasks by treating images as a foreign language and aligning vision-centric tasks with language tasks that can be flexibly defined and managed using language instructions. An LLM-based decoder can then make appropriate predictions based on these instructions for open-ended tasks. Extensive experiments show that the proposed VisionLLM can achieve different levels of task customization through language instructions, from fine-grained object-level to coarse-grained task-level customization, all with good results. It's noteworthy that, with a generalist LLM-based framework, our model can achieve over 60% mAP on COCO, on par with detection-specific models. We hope this model can set a new baseline for generalist vision and language models. The code shall be released.

AAAI Conference 2022 Conference Paper

MAPF-LNS2: Fast Repairing for Multi-Agent Path Finding via Large Neighborhood Search

  • Jiaoyang Li
  • Zhe Chen
  • Daniel Harabor
  • Peter J. Stuckey
  • Sven Koenig

Multi-Agent Path Finding (MAPF) is the problem of planning collision-free paths for multiple agents in a shared environment. In this paper, we propose a novel algorithm MAPF-LNS2 based on large neighborhood search for solving MAPF efficiently. Starting from a set of paths that contain collisions, MAPF-LNS2 repeatedly selects a subset of colliding agents and replans their paths to reduce the number of collisions until the paths become collision-free. We compare MAPF-LNS2 against a variety of state-of-the-art MAPF algorithms, including Prioritized Planning with random restarts, EECBS, and PPS, and show that MAPF-LNS2 runs significantly faster than them while still providing near-optimal solutions in most cases. MAPF-LNS2 solves 80% of the random-scenario instances with the largest number of agents from the MAPF benchmark suite with a runtime limit of just 5 minutes, which, to our knowledge, has not been achieved by any existing algorithms.

AAAI Conference 2022 Conference Paper

SASA: Semantics-Augmented Set Abstraction for Point-Based 3D Object Detection

  • Chen Chen
  • Zhe Chen
  • Jing Zhang
  • Dacheng Tao

Although point-based networks are demonstrated to be accurate for 3D point cloud modeling, they still fall behind their voxel-based competitors in 3D detection. We observe that the prevailing set abstraction design for downsampling points may maintain too much unimportant background information that can affect feature learning for detecting objects. To tackle this issue, we propose a novel set abstraction method named Semantics-Augmented Set Abstraction (SASA). Technically, we first add a binary segmentation module as the side output to help identify foreground points. Based on the estimated point-wise foreground scores, we then propose a semantics-guided point sampling algorithm to help retain more important foreground points during downsampling. In practice, SASA proves effective in identifying valuable points related to foreground objects and improving feature learning for point-based 3D detection. Additionally, it is an easy-to-plug-in module and able to boost various point-based detectors, including single-stage and two-stage ones. Extensive experiments on the popular KITTI and nuScenes datasets validate the superiority of SASA, lifting point-based detection models to reach comparable performance to state-of-the-art voxel-based methods. Code is available at https://github.com/blakechen97/SASA.

AAAI Conference 2022 Conference Paper

Towards Ultra-Resolution Neural Style Transfer via Thumbnail Instance Normalization

  • Zhe Chen
  • Wenhai Wang
  • Enze Xie
  • Tong Lu
  • Ping Luo

We present an extremely simple Ultra-Resolution Style Transfer framework, termed URST, to flexibly perform style transfer on arbitrarily high-resolution images (e.g., 10,000×10,000 pixels) for the first time. Most of the existing state-of-the-art methods would fall short due to massive memory cost and small stroke size when processing ultra-high resolution images. URST completely avoids the memory problem caused by ultra-high resolution images by (1) dividing the image into small patches and (2) performing patch-wise style transfer with a novel Thumbnail Instance Normalization (TIN). Specifically, TIN can extract thumbnail features' normalization statistics and apply them to small patches, ensuring the style consistency among different patches. Overall, the URST framework has three merits compared to prior arts. (1) We divide the input image into small patches and adopt TIN, successfully transferring image style at arbitrarily high resolution. (2) Experiments show that our URST surpasses existing SOTA methods on ultra-high resolution images, benefiting from the effectiveness of the proposed stroke perceptual loss in enlarging the stroke size. (3) Our URST can be easily plugged into most existing style transfer methods and directly improve their performance even without training. Code is available at https://git.io/URST.
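
The Thumbnail Instance Normalization idea is simple enough to sketch (this is illustrative NumPy, not the released URST code): per-channel statistics are computed once from the thumbnail's feature map and reused to normalize every high-resolution patch, so all patches share consistent style statistics; the shapes and values below are arbitrary.

```python
# Minimal sketch of thumbnail-statistics normalization applied to patches.
import numpy as np

def instance_stats(feat, eps=1e-5):
    """Per-channel mean/std over spatial dims; feat has shape (C, H, W)."""
    mean = feat.mean(axis=(1, 2), keepdims=True)
    std = feat.std(axis=(1, 2), keepdims=True) + eps
    return mean, std

def thumbnail_instance_norm(patch_feat, thumb_feat):
    """Normalize a patch with statistics taken from the thumbnail features."""
    t_mean, t_std = instance_stats(thumb_feat)
    return (patch_feat - t_mean) / t_std

rng = np.random.default_rng(0)
thumb = rng.normal(2.0, 3.0, size=(64, 32, 32))                      # thumbnail features
patches = [rng.normal(2.0, 3.0, size=(64, 256, 256)) for _ in range(4)]
normed = [thumbnail_instance_norm(p, thumb) for p in patches]
print(normed[0].mean(), normed[0].std())                             # roughly 0 and 1
```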

IJCAI Conference 2021 Conference Paper

Anytime Multi-Agent Path Finding via Large Neighborhood Search

  • Jiaoyang Li
  • Zhe Chen
  • Daniel Harabor
  • Peter J. Stuckey
  • Sven Koenig

Multi-Agent Path Finding (MAPF) is the challenging problem of computing collision-free paths for multiple agents. Algorithms for solving MAPF can be categorized on a spectrum. At one end are (bounded-sub)optimal algorithms that can find high-quality solutions for small problems. At the other end are unbounded-suboptimal algorithms that can solve large problems but usually find low-quality solutions. In this paper, we consider a third approach that combines the best of both worlds: anytime algorithms that quickly find an initial solution using efficient MAPF algorithms from the literature, even for large problems, and that subsequently improve the solution quality to near-optimal as time progresses by replanning subgroups of agents using Large Neighborhood Search. We compare our algorithm MAPF-LNS against a range of existing work and report significant gains in scalability, runtime to the initial solution, and speed of improving the solution.

AAMAS Conference 2021 Conference Paper

Anytime Multi-Agent Path Finding via Large Neighborhood Search

  • Jiaoyang Li
  • Zhe Chen
  • Daniel Harabor
  • Peter J. Stuckey
  • Sven Koenig

Multi-Agent Path Finding (MAPF) is the challenging problem of computing collision-free paths for a cooperative team of moving agents. Algorithms for solving MAPF can be categorized on a spectrum. At one end are (bounded-sub)optimal algorithms that can find high-quality solutions for small problems. At the other end are unbounded-suboptimal algorithms (including prioritized and rule-based algorithms) that can solve very large practical problems but usually find low-quality solutions. In this paper, we consider a third approach that combines both advantages: anytime algorithms that quickly find an initial solution, including for large problems, and that subsequently improve the solution to near-optimal as time progresses. To improve the solution, we replan subsets of agents using Large Neighborhood Search, a popular meta-heuristic often applied in combinatorial optimization. Empirically, we compare our algorithm MAPF-LNS to the state-of-the-art anytime MAPF algorithm anytime BCBS and report significant gains in scalability, runtime to the first solution, and speed of improving solutions.

AAAI Conference 2021 Conference Paper

Symmetry Breaking for k-Robust Multi-Agent Path Finding

  • Zhe Chen
  • Daniel D. Harabor
  • Jiaoyang Li
  • Peter J. Stuckey

In Multi-Agent Path Finding (MAPF), agents can be delayed by unexpected events during plan execution. To address such situations recent work describes k-Robust Conflict-Based Search (k-CBS): an algorithm that produces a coordinated and collision-free plan that is robust for up to k delays for any agent. In this work we introduce a variety of pairwise symmetry breaking constraints, specific to k-robust planning, that can efficiently find compatible and optimal paths for pairs of colliding agents. We give a thorough description of the new constraints and report large improvements to success rate in a range of domains including: (i) classic MAPF benchmarks, (ii) automated warehouse domains, and (iii) on maps from the 2019 Flatland Challenge, a recently introduced railway domain where k-robust planning can be fruitfully applied to schedule trains.

JBHI Journal 2021 Journal Article

Weakly Supervised Histopathology Image Segmentation With Sparse Point Annotations

  • Zhe Chen
  • Zhao Chen
  • Jingxin Liu
  • Qiang Zheng
  • Yuang Zhu
  • Yanfei Zuo
  • Zhaoyu Wang
  • Xiaosong Guan

Digital histopathology image segmentation can facilitate computer-assisted cancer diagnostics. Given the difficulty of obtaining manual annotations, weak supervision is more suitable for the task than full supervision is. However, most weakly supervised models are not ideal for handling severe intra-class heterogeneity and inter-class homogeneity in histopathology images. Therefore, we propose a novel end-to-end weakly supervised learning framework named WESUP. With only sparse point annotations, it performs accurate segmentation and exhibits good generalizability. The training phase comprises two major parts, hierarchical feature representation and deep dynamic label propagation. The former uses superpixels to capture local details and global context from the convolutional feature maps obtained via transfer learning. The latter recognizes the manifold structure of the hierarchical features and identifies potential targets with the sparse annotations. Moreover, these two parts are trained jointly to improve the performance of the whole framework. To further boost test performance, pixel-wise inference is adopted for finer prediction. As demonstrated by experimental results, WESUP is able to largely resolve the confusion between histological foreground and background. It outperforms several state-of-the-art weakly supervised methods on a variety of histopathology datasets with minimal annotation efforts. Trained by very sparse point annotations, WESUP can even beat an advanced fully supervised segmentation network.

IJCAI Conference 2020 Conference Paper

TextFuseNet: Scene Text Detection with Richer Fused Features

  • Jian Ye
  • Zhe Chen
  • Juhua Liu
  • Bo Du

Arbitrary shape text detection in natural scenes is an extremely challenging task. Unlike existing text detection approaches that only perceive texts based on limited feature representations, we propose a novel framework, namely TextFuseNet, to exploit the use of richer features fused for text detection. More specifically, we propose to perceive texts from three levels of feature representations, i.e., character-, word- and global-level, and then introduce a novel text representation fusion technique to help achieve robust arbitrary text detection. The multi-level feature representation can adequately describe texts by dissecting them into individual characters while still maintaining their general semantics. TextFuseNet then collects and merges the texts' features from different levels using a multi-path fusion architecture which can effectively align and fuse different representations. In practice, our proposed TextFuseNet can learn a more adequate description of arbitrary-shape texts, suppressing false positives and producing more accurate detection results. Our proposed framework can also be trained with weak supervision for those datasets that lack character-level annotations. Experiments on several datasets show that the proposed TextFuseNet achieves state-of-the-art performance. Specifically, we achieve an F-measure of 94.3% on ICDAR2013, 92.1% on ICDAR2015, 87.1% on Total-Text and 86.6% on CTW-1500, respectively.