Arrow Research search

Author name cluster

Mingjie Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

17 papers
2 author rows

Possible papers

17

TIST Journal 2026 Journal Article

Learning Causality-Aware Exploration with Transformers for Goal-Oriented Navigation

  • Ruoyu Wang
  • Tong Yu
  • Mingjie Li
  • Yuanjiang Cao
  • Yao Liu
  • Lina Yao

Navigation is a fundamental task in Embodied AI research, and recent advances in machine learning algorithms have garnered growing interest in developing versatile Embodied AI systems. However, current research in this domain reveals opportunities for improvement. First, the direct application of RNNs and Transformers often overlooks the distinct characteristics of navigation tasks compared to traditional sequential data modeling. These methods are inherently designed to capture long-term dependencies, which are relatively weak in navigation scenarios, potentially limiting their performance in such tasks. Second, the reliance on task-specific configurations, such as pre-trained modules and dataset-specific logic, compromises the generalizability of these methods. We address these constraints by first examining the unique differences between navigation tasks and other sequential data tasks through the lens of causality, presenting a causal framework to elucidate the inadequacies of conventional sequential methods for navigation. Leveraging this causal perspective, we propose Causality-Aware Transformer (CAT) Networks for Navigation, featuring a Causal Understanding Module that enhances the model's Environmental Understanding capability. Meanwhile, our method is free of task-specific inductive biases and can be trained in an end-to-end manner, which enhances its generalizability across various contexts. Empirical evaluations demonstrate that our methodology consistently surpasses benchmark performances across a spectrum of settings, tasks, and simulation environments, specifically in Object Navigation within RoboTHOR, Objective Navigation, Point Navigation in Habitat, and R2R Navigation. Extensive ablation studies reveal that the performance gains can be attributed to the Causal Understanding Module, which demonstrates effectiveness and efficiency in both reinforcement learning and supervised learning settings. Additionally, further analysis highlights the robustness of our method, demonstrating its capacity to perform consistently well across diverse experimental settings and varying conditions. This robustness underscores the adaptability and generalizability of our approach, reinforcing its potential for application across a wide range of tasks.

TIST Journal 2026 Journal Article

Mitigating Data Redundancy to Revitalize Transformer-Based Long-Term Time Series Forecasting System

  • Mingjie Li
  • Rui Liu
  • Guangsi Shi
  • Mingfei Han
  • Changlin Li
  • Lina Yao
  • Xiaojun Chang
  • Ling Chen

Long-term time series forecasting (LTSF) is fundamental to various real-world applications, where Transformer-based models have become the dominant framework due to their ability to capture long-range dependencies. However, these models often experience overfitting due to data redundancy in rolling forecasting settings, limiting their generalization ability, an effect that is particularly evident in longer sequences with highly similar adjacent data. In this work, we introduce CLMFormer, a novel framework that mitigates redundancy through curriculum learning and a memory-driven decoder. Specifically, we progressively introduce Bernoulli noise to the training samples, which effectively breaks the high similarity between adjacent data points. This curriculum-driven noise introduction aids the memory-driven decoder by supplying more diverse and representative training data, enhancing the decoder's ability to model seasonal tendencies and dependencies in the time series data. The memory-driven decoder, in turn, further improves forecasting accuracy by capturing seasonal tendencies and dependencies in the time series data and leveraging temporal relationships to facilitate the forecasting process. Extensive experiments on six real-world LTSF benchmarks show that CLMFormer consistently improves Transformer-based models by up to 30%, demonstrating its effectiveness in long-horizon forecasting.
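The progressive Bernoulli-noise curriculum described in the abstract can be sketched roughly as follows. The linear schedule and multiplicative masking here are illustrative assumptions, not the paper's exact recipe, and `curriculum_bernoulli_noise` is a hypothetical helper name:

```python
import numpy as np

def curriculum_bernoulli_noise(x, epoch, total_epochs, max_drop=0.3, seed=None):
    """Progressively mask training samples with Bernoulli noise.

    Early epochs leave the sequence almost untouched; the drop probability
    grows linearly toward `max_drop`, breaking the high similarity between
    adjacent time steps in rolling-forecast training windows.
    """
    rng = np.random.default_rng(seed)
    p = max_drop * (epoch / max(total_epochs - 1, 1))  # linear curriculum schedule
    mask = rng.binomial(1, 1.0 - p, size=x.shape)      # 1 = keep, 0 = drop
    return x * mask

# Toy usage: the same window is perturbed more aggressively late in training.
series = np.ones(8)
early = curriculum_bernoulli_noise(series, epoch=0, total_epochs=10, seed=0)
late = curriculum_bernoulli_noise(series, epoch=9, total_epochs=10, seed=0)
```

At epoch 0 the drop probability is zero, so the sample passes through unchanged; by the final epoch up to `max_drop` of the points are zeroed out.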

NeurIPS Conference 2025 Conference Paper

Adjacent Words, Divergent Intents: Jailbreaking Large Language Models via Task Concurrency

  • Yukun Jiang
  • Mingjie Li
  • Michael Backes
  • Yang Zhang

Despite their superior performance on a wide range of domains, large language models (LLMs) remain vulnerable to misuse for generating harmful content, a risk that has been further amplified by various jailbreak attacks. Existing jailbreak attacks mainly follow sequential logic, where LLMs understand and answer each given task one by one. However, concurrency, a natural extension of the sequential scenario, has been largely overlooked. In this work, we first propose a word-level method to enable task concurrency in LLMs, where adjacent words encode divergent intents. Although LLMs maintain strong utility in answering concurrent tasks, which is demonstrated by our evaluations on mathematical and general question-answering benchmarks, we notably observe that combining a harmful task with a benign one significantly reduces the probability of it being filtered by the guardrail, showing the potential risks associated with concurrency in LLMs. Based on these findings, we introduce $\texttt{JAIL-CON}$, an iterative attack framework that $\underline{\text{JAIL}}$breaks LLMs via task $\underline{\text{CON}}$currency. Experiments on widely-used LLMs demonstrate the strong jailbreak capabilities of $\texttt{JAIL-CON}$ compared to existing attacks. Furthermore, when the guardrail is applied as a defense, compared to the sequential answers generated by previous attacks, the concurrent answers in our $\texttt{JAIL-CON}$ exhibit greater stealthiness and are less detectable by the guardrail, highlighting the unique feature of task concurrency in jailbreaking LLMs.
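The word-level concurrency idea, adjacent words encoding divergent intents, can be illustrated with a simple interleaving sketch. The paper's actual encoding may differ; `interleave_tasks` and `deinterleave` are hypothetical helpers for illustration only:

```python
def interleave_tasks(task_a: str, task_b: str) -> str:
    """Weave two prompts word by word so adjacent words carry divergent intents."""
    a, b = task_a.split(), task_b.split()
    out = []
    for i in range(max(len(a), len(b))):
        if i < len(a):
            out.append(a[i])
        if i < len(b):
            out.append(b[i])
    return " ".join(out)

def deinterleave(mixed: str, len_a: int, len_b: int) -> tuple:
    """Recover the two word streams from an interleaved prompt."""
    words = mixed.split()
    a, b = [], []
    i = 0
    for k in range(max(len_a, len_b)):
        if k < len_a:
            a.append(words[i])
            i += 1
        if k < len_b:
            b.append(words[i])
            i += 1
    return " ".join(a), " ".join(b)

# Two benign toy tasks woven into one prompt.
mixed = interleave_tasks("what is two plus two", "name a color")
```

A model answering such a prompt must track both word streams concurrently, which is the behavior the attack exploits when one of the streams carries a harmful intent.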

NeurIPS Conference 2025 Conference Paper

CoralVQA: A Large-Scale Visual Question Answering Dataset for Coral Reef Image Understanding

  • Hongyong Han
  • Wei Wang
  • Gaowei Zhang
  • Mingjie Li
  • Yi Wang

Coral reefs are vital yet vulnerable ecosystems that require continuous monitoring to support conservation. While coral reef images provide essential information in coral monitoring, interpreting such images remains challenging due to the need for domain expertise. Visual Question Answering (VQA), powered by Large Vision-Language Models (LVLMs), has great potential in user-friendly interaction with coral reef images. However, applying VQA to coral imagery demands a dedicated dataset that addresses two key challenges: domain-specific annotations and multidimensional questions. In this work, we introduce CoralVQA, the first large-scale VQA dataset for coral reef analysis. It contains 12,805 real-world coral images from 67 coral genera collected from 3 oceans, along with 277,653 question-answer pairs that comprehensively assess ecological and health-related conditions. To construct this dataset, we develop a semi-automatic data construction pipeline in collaboration with marine biologists to ensure both scalability and professional-grade data quality. CoralVQA presents novel challenges and provides a comprehensive benchmark for studying vision-language reasoning in the context of coral reef images. By evaluating several state-of-the-art LVLMs, we reveal key limitations and opportunities. These insights form a foundation for future LVLM development, with a particular emphasis on supporting coral conservation efforts.

AAAI Conference 2025 Conference Paper

Enhancing Vision-Language Models with Morphological and Taxonomic Knowledge: Towards Coral Recognition for Ocean Health

  • Hongyong Han
  • Wei Wang
  • Gaowei Zhang
  • Mingjie Li
  • Yi Wang

Coral reefs play a crucial role in marine ecosystems, offering a nutrient-rich environment and safe shelter for numerous marine species. Automated coral image recognition aids in monitoring ocean health at scale without experts' manual effort. Recently, large vision-language models like CLIP have greatly enhanced zero-shot and low-shot classification capabilities for various visual tasks. However, these models struggle with fine-grained coral-related tasks due to a lack of specific knowledge. To bridge this gap, we compile a fine-grained coral image dataset consisting of 16,659 images with taxonomy labels (from Kingdom to Species), accompanied by morphology-specific text descriptions for each species. Based on the dataset, we propose CORAL-Adapter, integrating two complementary kinds of coral-specific knowledge (biological taxonomy and coral morphology) with general knowledge learned by CLIP. CORAL-Adapter is a simple yet powerful extension of CLIP with only a few parameter updates and can be used as a plug-and-play module with various CLIP-based methods. We show improvements in accuracy across diverse coral recognition tasks, e.g., recognizing corals unseen during training that are prone to bleaching or originate from different oceans.

NeurIPS Conference 2025 Conference Paper

Finding and Reactivating Post-Trained LLMs' Hidden Safety Mechanisms

  • Mingjie Li
  • Wai Man Si
  • Michael Backes
  • Yang Zhang
  • Yisen Wang

Despite the impressive performance of general-purpose large language models (LLMs), they often require fine-tuning or post-training to excel at specific tasks. For instance, large reasoning models (LRMs), such as the DeepSeek-R1 series, demonstrate strong reasoning capabilities after post-training different general large language models on diverse chain-of-thought (CoT) datasets. However, this additional training frequently comes at the cost of reduced safety, as fine-tuned or post-trained models tend to exhibit more harmful behaviors than their base LLMs, potentially leading to harmful outcomes amplified by their enhanced capabilities. Taking LRMs as an example, we first investigate the underlying cause of this safety degradation in this paper. Our analysis reveals that post-training can mask the original safety mechanisms of the base LLM, while over-amplifying representations related to the post-training ability. Fortunately, we also find that LRMs' safety mechanisms are suppressed rather than removed during post-training. Based on these findings, we propose a lightweight and cost-effective solution called SafeReAct that restores the suppressed safety behaviors via alignment with LoRA adapters on a few layers. Experiments on four state-of-the-art LRMs show that our method significantly improves safety on harmful prompts without compromising reasoning performance. Beyond LRMs, additional results on other domain-specific LLMs, such as medical models, further confirm the generality and effectiveness of our approach.

AAAI Conference 2025 Conference Paper

HC-LLM: Historical-Constrained Large Language Models for Radiology Report Generation

  • Tengfei Liu
  • Jiapu Wang
  • Yongli Hu
  • Mingjie Li
  • Junfei Yi
  • Xiaojun Chang
  • Junbin Gao
  • Baocai Yin

Radiology report generation (RRG) models typically focus on individual exams, often overlooking the integration of historical visual or textual data, which is crucial for patient follow-ups. Traditional methods usually struggle with long sequence dependencies when incorporating historical information, but large language models (LLMs) excel at in-context learning, making them well-suited for analyzing longitudinal medical data. In light of this, we propose a novel Historical-Constrained Large Language Models (HC-LLM) framework for RRG, empowering LLMs with longitudinal report generation capabilities by constraining the consistency and differences between longitudinal images and their corresponding reports. Specifically, our approach extracts both time-shared and time-specific features from longitudinal chest X-rays and diagnostic reports to capture disease progression. Then, we ensure consistent representation by applying intra-modality similarity constraints and aligning various features across modalities with multimodal contrastive and structural constraints. These combined constraints effectively guide the LLMs in generating diagnostic reports that accurately reflect the progression of the disease, achieving state-of-the-art results on the Longitudinal-MIMIC dataset. Notably, our approach performs well even without historical data during testing and can be easily adapted to other multimodal large models, enhancing its versatility.

ICML Conference 2023 Conference Paper

Does a Neural Network Really Encode Symbolic Concepts?

  • Mingjie Li
  • Quanshi Zhang

Recently, a series of studies have tried to extract interactions between input variables modeled by a DNN and define such interactions as concepts encoded by the DNN. However, strictly speaking, there is still no solid guarantee that such interactions indeed represent meaningful concepts. Therefore, in this paper, we examine the trustworthiness of interaction concepts from four perspectives. Extensive empirical studies have verified that a well-trained DNN usually encodes sparse, transferable, and discriminative concepts, which is partially aligned with human intuition. The code is released at https://github.com/sjtu-xai-lab/interaction-concept.

NeurIPS Conference 2023 Conference Paper

GEQ: Gaussian Kernel Inspired Equilibrium Models

  • Mingjie Li
  • Yisen Wang
  • Zhouchen Lin

Despite the connection established by optimization-induced deep equilibrium models (OptEqs) between their output and the underlying hidden optimization problems, the performance of OptEqs and related works still lags behind that of deep networks. One key factor responsible for this performance limitation is the use of linear kernels to extract features in these models. To address this issue, we propose a novel approach that replaces the linear kernel with a new function that can readily capture nonlinear feature dependencies in the input data. Drawing inspiration from classical machine learning algorithms, we introduce Gaussian kernels as the alternative function and propose our new equilibrium model, which we refer to as GEQ. By leveraging Gaussian kernels, GEQ can effectively extract the nonlinear information embedded within the input features, surpassing the performance of the original OptEqs. Moreover, GEQ can be perceived as a weight-tied neural network with infinite width and depth. GEQ also enjoys better theoretical properties and improved overall performance. Additionally, GEQ exhibits enhanced stability when confronted with various samples. We further substantiate the effectiveness and stability of GEQ through a series of comprehensive experiments.
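To make the equilibrium-model setting concrete, here is a generic weight-tied forward pass in which Gaussian-kernel features replace a linear feature map, solved by fixed-point iteration. This is an illustrative simplification under assumed names (`gaussian_features`, `equilibrium_forward`, learned `landmarks`), not the OptEqs/GEQ formulation itself:

```python
import numpy as np

def gaussian_features(x, landmarks, sigma=1.0):
    """Gaussian-kernel feature map: k(x, w_i) = exp(-||x - w_i||^2 / (2 sigma^2))."""
    d2 = ((x[None, :] - landmarks) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * sigma ** 2))

def equilibrium_forward(x, landmarks, U, max_iter=100, tol=1e-8):
    """Solve z = tanh(U z + phi(x)) by fixed-point iteration.

    tanh is 1-Lipschitz, so when ||U|| < 1 the map is a contraction and
    the iteration converges to a unique equilibrium z*, which plays the
    role of the output of an infinitely deep weight-tied network.
    """
    phi = gaussian_features(x, landmarks)
    z = np.zeros_like(phi)
    for _ in range(max_iter):
        z_next = np.tanh(U @ z + phi)
        if np.linalg.norm(z_next - z) < tol:
            break
        z = z_next
    return z_next

# Toy usage: 4 landmarks in 3-D input space, a small contraction matrix U.
rng = np.random.default_rng(0)
landmarks = rng.normal(size=(4, 3))
U = 0.2 * np.eye(4)
x = rng.normal(size=3)
z_star = equilibrium_forward(x, landmarks, U)
```

The returned `z_star` satisfies the equilibrium equation up to the tolerance, illustrating how the nonlinear kernel enters before the fixed-point solve.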

NeurIPS Conference 2023 Conference Paper

Mask Propagation for Efficient Video Semantic Segmentation

  • Yuetian Weng
  • Mingfei Han
  • Haoyu He
  • Mingjie Li
  • Lina Yao
  • Xiaojun Chang
  • Bohan Zhuang

Video Semantic Segmentation (VSS) involves assigning a semantic label to each pixel in a video sequence. Prior work in this field has demonstrated promising results by extending image semantic segmentation models to exploit temporal relationships across video frames; however, these approaches often incur significant computational costs. In this paper, we propose an efficient mask propagation framework for VSS, called MPVSS. Our approach first employs a strong query-based image segmentor on sparse key frames to generate accurate binary masks and class predictions. We then design a flow estimation module utilizing the learned queries to generate a set of segment-aware flow maps, each associated with a mask prediction from the key frame. Finally, the mask-flow pairs are warped to serve as the mask predictions for the non-key frames. By reusing predictions from key frames, we circumvent the need to process a large volume of video frames individually with resource-intensive segmentors, alleviating temporal redundancy and significantly reducing computational costs. Extensive experiments on VSPW and Cityscapes demonstrate that our mask propagation framework achieves state-of-the-art accuracy-efficiency trade-offs. For instance, our best model with a Swin-L backbone outperforms the state-of-the-art MRCFA using MiT-B5 by 4.0% mIoU while requiring only 26% of the FLOPs on the VSPW dataset. Moreover, our framework reduces FLOPs by up to 4x compared to the per-frame Mask2Former baseline with only up to 2% mIoU degradation on the Cityscapes validation set. Code is available at https://github.com/ziplab/MPVSS.
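The warping step can be sketched as a minimal backward warp of a key-frame mask with nearest-neighbour sampling. The actual MPVSS flow maps are learned and segment-aware, so this `warp_mask` helper is only an illustrative stand-in:

```python
import numpy as np

def warp_mask(mask, flow):
    """Warp a key-frame mask to a non-key frame with a per-pixel flow field.

    `flow[y, x]` gives the (dy, dx) displacement pointing from the target
    frame back to the key frame (backward warping), rounded to the nearest
    source pixel and clipped at the image border.
    """
    h, w = mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip((ys + flow[..., 0]).round().astype(int), 0, h - 1)
    src_x = np.clip((xs + flow[..., 1]).round().astype(int), 0, w - 1)
    return mask[src_y, src_x]

# Toy usage: a single foreground pixel shifted by a constant (+1, +1) motion.
mask = np.zeros((4, 4), dtype=int)
mask[1, 1] = 1
flow = np.full((4, 4, 2), -1.0)  # target (2, 2) looks back to source (1, 1)
warped = warp_mask(mask, flow)
```

Because only the flow field is computed per frame, the expensive segmentor runs only on sparse key frames, which is where the framework's savings come from.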

ICML Conference 2022 Conference Paper

Towards Theoretical Analysis of Transformation Complexity of ReLU DNNs

  • Jie Ren 0018
  • Mingjie Li
  • Meng Zhou
  • Shih-Han Chan
  • Quanshi Zhang

This paper aims to theoretically analyze the complexity of feature transformations encoded in piecewise linear DNNs with ReLU layers. We propose metrics to measure three types of complexities of transformations based on information theory. We further discover and prove the strong correlation between the complexity and the disentanglement of transformations. Based on the proposed metrics, we analyze two typical phenomena of the change of the transformation complexity during the training process, and explore the ceiling of a DNN's complexity. The proposed metrics can also be used as a loss to learn a DNN with minimum complexity, which also controls the over-fitting level of the DNN and influences adversarial robustness, adversarial transferability, and knowledge consistency. Comprehensive comparative studies have provided new perspectives for understanding the DNN. The code is released at https://github.com/sjtu-XAI-lab/transformation-complexity.

NeurIPS Conference 2021 Conference Paper

FFA-IR: Towards an Explainable and Reliable Medical Report Generation Benchmark

  • Mingjie Li
  • Wenjia Cai
  • Rui Liu
  • Yuetian Weng
  • Xiaoyun Zhao
  • Cong Wang
  • Xin Chen
  • Zhong Liu

The automatic generation of long and coherent medical reports given medical images (e.g., chest X-ray and Fundus Fluorescein Angiography (FFA)) has great potential to support clinical practice. Researchers have explored advanced methods from computer vision and natural language processing to incorporate medical domain knowledge for the generation of readable medical reports. However, existing medical report generation (MRG) benchmarks lack both explainable annotations and reliable evaluation tools, hindering the current research advances from two aspects: firstly, existing methods can only predict reports without accurate explanation, undermining the trustworthiness of the diagnostic methods; secondly, the comparison among the predicted reports from different MRG methods is unreliable using the evaluation metrics of natural-language generation (NLG). To address these issues, in this paper, we propose an explainable and reliable MRG benchmark based on FFA Images and Reports (FFA-IR). Specifically, FFA-IR is large, with 10,790 reports along with 1,048,584 FFA images from clinical practice; it includes explainable annotations, based on a schema of 46 categories of lesions; and it is bilingual, providing both English and Chinese reports for each case. Besides using the widely used NLG metrics, we propose a set of nine human evaluation criteria to evaluate the generated reports. We envision FFA-IR as a testbed for explainable and reliable medical report generation. We also hope that it can broadly accelerate medical imaging research and facilitate interaction between the fields of medical imaging, computer vision, and natural language processing.

ICLR Conference 2021 Conference Paper

Interpreting and Boosting Dropout from a Game-Theoretic View

  • Hao Zhang 0063
  • Sen Li
  • Yinchao Ma
  • Mingjie Li
  • Yichen Xie 0002
  • Quanshi Zhang

This paper aims to understand and improve the utility of the dropout operation from the perspective of game-theoretical interactions. We prove that dropout can suppress the strength of interactions between input variables of deep neural networks (DNNs). The theoretical proof is also verified by various experiments. Furthermore, we find that such interactions are strongly related to the over-fitting problem in deep learning. Thus, the utility of dropout can be regarded as decreasing interactions to alleviate over-fitting. Based on this understanding, we propose the interaction loss to further improve the utility of dropout. Experimental results on various DNNs and datasets have shown that the interaction loss can effectively improve the utility of dropout and boost the performance of DNNs.

ICML Conference 2021 Conference Paper

Interpreting and Disentangling Feature Components of Various Complexity from DNNs

  • Jie Ren 0018
  • Mingjie Li
  • Zexu Liu
  • Quanshi Zhang

This paper aims to define, visualize, and analyze the feature complexity that is learned by a DNN. We propose a generic definition for the feature complexity. Given the feature of a certain layer in the DNN, our method decomposes and visualizes feature components of different complexity orders from the feature. The feature decomposition enables us to evaluate the reliability, the effectiveness, and the significance of over-fitting of these feature components. Furthermore, such analysis helps to improve the performance of DNNs. As a generic method, the feature complexity also provides new insights into existing deep-learning techniques, such as network compression and knowledge distillation.

NeurIPS Conference 2021 Conference Paper

Visualizing the Emergence of Intermediate Visual Patterns in DNNs

  • Mingjie Li
  • Shaobo Wang
  • Quanshi Zhang

This paper proposes a method to visualize the discrimination power of intermediate-layer visual patterns encoded by a DNN. Specifically, we visualize (1) how the DNN gradually learns regional visual patterns in each intermediate layer during the training process, and (2) the effects of the DNN using non-discriminative patterns in low layers to construct discriminative patterns in middle/high layers through the forward propagation. Based on our visualization method, we can quantify knowledge points (i.e., the number of discriminative visual patterns) learned by the DNN to evaluate the representation capacity of the DNN. Furthermore, this method also provides new insights into signal-processing behaviors of existing deep-learning techniques, such as adversarial attacks and knowledge distillation.

IJCAI Conference 2018 Conference Paper

Deep Convolutional Neural Networks with Merge-and-Run Mappings

  • Liming Zhao
  • Mingjie Li
  • Depu Meng
  • Xi Li
  • Zhaoxiang Zhang
  • Yueting Zhuang
  • Zhuowen Tu
  • Jingdong Wang

A deep residual network, built by stacking a sequence of residual blocks, is easy to train, because identity mappings skip residual branches and thus improve information flow. To further reduce the training difficulty, we present a simple network architecture, deep merge-and-run neural networks. The novelty lies in a modularized building block, the merge-and-run block, which assembles residual branches in parallel through a merge-and-run mapping: average the inputs of these residual branches (Merge), and add the average to the output of each residual branch as the input of the subsequent residual branch (Run). We show that the merge-and-run mapping is a linear idempotent function in which the transformation matrix is idempotent, and thus improves information flow, making training easy. In comparison with residual networks, our networks enjoy compelling advantages: they contain much shorter paths, the width (i.e., the number of channels) is increased, and the time complexity remains unchanged. We evaluate the performance on standard recognition tasks. Our approach demonstrates consistent improvements over ResNets with a comparable setup, and achieves competitive results (e.g., 3.06% test error on CIFAR-10, 17.55% on CIFAR-100, and 1.51% on SVHN).
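The merge-and-run mapping described above is concrete enough to write down and sanity-check. For two parallel branches, the skip part of the mapping is the block matrix M = 0.5 * [[I, I], [I, I]], whose idempotency (M @ M == M) is the property the abstract credits for easy training; the identity residual branches below are toy stand-ins:

```python
import numpy as np

# Linear (skip) part of the merge-and-run mapping for two parallel branches:
# each branch's next input receives the average of both current inputs.
d = 4  # channel dimension of each branch
I = np.eye(d)
M = 0.5 * np.block([[I, I], [I, I]])

def merge_and_run_block(x1, x2, f1, f2):
    """One merge-and-run block: average the branch inputs (Merge) and add
    that average to each branch's residual output (Run)."""
    avg = 0.5 * (x1 + x2)                  # Merge
    return f1(x1) + avg, f2(x2) + avg      # Run

# Toy usage with identity residual branches.
x1, x2 = np.ones(d), 2 * np.ones(d)
y1, y2 = merge_and_run_block(x1, x2, lambda v: v, lambda v: v)
```

Since M is idempotent, stacking blocks keeps the skip path a fixed linear map rather than compounding it, which mirrors the information-flow argument made for identity mappings in plain residual networks.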