Arrow Research search

Author name cluster

Mingyu Liu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

15 papers
2 author rows

Possible papers

15

AAAI Conference 2026 Conference Paper

Affordance-R1: Reinforcement Learning for Generalizable Affordance Reasoning in Multimodal Large Language Models

  • Hanqing Wang
  • Shaoyang Wang
  • Yiming Zhong
  • Zemin Yang
  • Jiamin Wang
  • Zhiqing Cui
  • Jiahao Yuan
  • Yifan Han

Affordance grounding focuses on predicting the specific regions of objects that are associated with the actions to be performed by robots. It plays a vital role in the fields of human-robot interaction, human-object interaction, embodied manipulation, and embodied perception. Existing models often neglect the affordance shared among different objects because they lack Chain-of-Thought (CoT) reasoning abilities, limiting their out-of-domain generalization and explicit reasoning capabilities. To address these challenges, we propose Affordance-R1, the first unified affordance grounding framework that integrates cognitive CoT-guided Group Relative Policy Optimization (GRPO) within a reinforcement learning paradigm. Specifically, we designed a sophisticated affordance function, which contains format, perception, and cognition rewards to effectively guide optimization directions. Furthermore, we constructed a high-quality affordance-centric reasoning dataset, ReasonAff, to support training. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Affordance-R1 achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Comprehensive experiments demonstrate that our model outperforms well-established methods and exhibits open-world generalization.
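
The reward design described above lends itself to a short sketch. The snippet below is a minimal, assumed illustration of a composite GRPO-style reward with format, perception, and cognition terms; the tag format, IoU-based perception score, and keyword-matching cognition check are placeholders, not the paper's implementation.

```python
import re

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def affordance_reward(response: str, pred_box, gt_box, gt_affordance: str) -> float:
    """Composite reward: format + perception + cognition (illustrative weights)."""
    # Format reward: reasoning and answer wrapped in the expected tags.
    fmt = 1.0 if re.search(r"<think>.*</think>\s*<answer>.*</answer>", response, re.S) else 0.0
    # Perception reward: localization quality of the predicted affordance region.
    perc = iou(pred_box, gt_box)
    # Cognition reward: does the stated affordance match the ground-truth label?
    cog = 1.0 if gt_affordance.lower() in response.lower() else 0.0
    return fmt + perc + cog
```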

AAAI Conference 2026 Conference Paper

ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks

  • Kaijun Wang
  • Liqin Lu
  • Mingyu Liu
  • Jianuo Jiang
  • Zeju Li
  • Bolin Zhang
  • Wancai Zheng
  • Xinyi Yu

Language-guided long-horizon mobile manipulation has long been a grand challenge in embodied semantic reasoning, generalizable manipulation, and adaptive locomotion. Three fundamental limitations hinder progress: First, although large language models have shown promise in enhancing spatial reasoning and task planning through learned semantic priors, existing implementations remain confined to tabletop scenarios, failing to address the constrained perception and limited actuation ranges characteristic of mobile platforms. Second, current manipulation strategies exhibit insufficient generalization when confronted with the diverse object configurations encountered in open-world environments. Third, while crucial for practical deployment, the dual requirement of maintaining high platform maneuverability alongside precise end-effector control in unstructured settings remains understudied in the literature. In this work, we present ODYSSEY, a unified mobile manipulation framework for agile quadruped robots equipped with manipulators, which seamlessly integrates high-level task planning with low-level whole-body control. To address the challenge of egocentric perception in language-conditioned tasks, we introduce a hierarchical planner powered by a vision-language model, enabling long-horizon instruction decomposition and precise action execution. At the control level, our novel whole-body policy achieves robust coordination of locomotion and manipulation across challenging terrains. We further present the first comprehensive benchmark for long-horizon mobile manipulation, evaluating diverse indoor and outdoor scenarios. Through successful sim-to-real transfer, we demonstrate the system’s generalization and robustness in real-world deployments, underscoring the practicality of legged manipulators in unstructured environments. Our work advances the feasibility of generalized robotic assistants capable of complex, dynamic tasks.
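
As a rough illustration of the hierarchical planning loop described above, the sketch below decomposes a language instruction with a VLM planner and hands each subtask to a low-level skill; all interfaces (the planner callable, skill names, and Subtask fields) are assumptions, not ODYSSEY's actual API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Subtask:
    skill: str    # e.g. "navigate", "pick", "place"
    target: str   # object or location named in the instruction

def run_episode(instruction: str,
                capture_image: Callable[[], object],
                vlm_planner: Callable[[str, object], list[Subtask]],
                skills: dict[str, Callable[[str], None]]) -> None:
    """Decompose a long-horizon instruction into subtasks from egocentric
    observations, then execute each subtask with a whole-body skill."""
    for task in vlm_planner(instruction, capture_image()):
        skills[task.skill](task.target)
```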

AAAI Conference 2026 Conference Paper

SDEval: Safety Dynamic Evaluation for Multimodal Large Language Models

  • Hanqing Wang
  • Yuan Tian
  • Mingyu Liu
  • Zhenhao Zhang
  • Xiangyang Zhu

In the rapidly evolving landscape of Multimodal Large Language Models (MLLMs), the safety concerns of their outputs have earned significant attention. Although numerous datasets have been proposed, they may become outdated with MLLM advancements and are susceptible to data contamination issues. To address these problems, we propose SDEval, the first safety dynamic evaluation framework to controllably adjust the distribution and complexity of safety benchmarks. Specifically, SDEval mainly adopts three dynamic strategies: text, image, and text-image dynamics to generate new samples from original benchmarks. We first explore the individual effects of text and image dynamics on model safety. Then, we find that injecting text dynamics into images can further impact safety, and conversely, injecting image dynamics into text also leads to safety risks. SDEval is general enough to be applied to various existing safety and even capability benchmarks. Experiments across safety benchmarks, MLLMGuard and VLSBench, and capability benchmarks, MMBench and MMVet, show that SDEval significantly influences evaluation results, mitigates data contamination, and exposes safety limitations of MLLMs.
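
The three dynamic strategies can be pictured as simple sample mutators. The sketch below is a minimal, assumed illustration of text, image, and text-image dynamics applied to a benchmark sample; the specific transformations (distractor prefixes, blur, rendered text overlay) are placeholders rather than SDEval's actual operators.

```python
import random
from PIL import Image, ImageDraw, ImageFilter

def text_dynamic(question: str) -> str:
    # Text dynamic: vary the wording while keeping the underlying intent fixed.
    distractors = ["Answer briefly.", "Consider the image carefully.", "Respond in one sentence."]
    return f"{random.choice(distractors)} {question}"

def image_dynamic(image: Image.Image) -> Image.Image:
    # Image dynamic: shift the visual distribution without changing semantics.
    return image.filter(ImageFilter.GaussianBlur(radius=1.5))

def text_image_dynamic(image: Image.Image, injected_text: str) -> Image.Image:
    # Text-image dynamic: render part of the prompt into the image itself,
    # which the paper reports can introduce additional safety risks.
    out = image.copy()
    ImageDraw.Draw(out).text((10, 10), injected_text, fill="red")
    return out
```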

NeurIPS Conference 2025 Conference Paper

DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks

  • Canyu Zhao
  • Yanlong Sun
  • Mingyu Liu
  • Huanyi Zheng
  • Muzhi Zhu
  • Zhiyue Zhao
  • Hao Chen
  • Tong He

This paper's primary objective is to develop a robust generalist perception model capable of addressing multiple tasks under constraints of computational resources and limited training data. We leverage text-to-image diffusion models pre-trained on billions of images and successfully introduce our DICEPTION, a visual generalist model. Exhaustive evaluations demonstrate that DICEPTION effectively tackles diverse perception tasks, even achieving performance comparable to SOTA single-task specialist models. Specifically, we achieve results on par with SAM-vit-h using only 0.06% of their data (e.g., 600K vs. 1B pixel-level annotated images). We designed comprehensive experiments on architectures and input paradigms, demonstrating that the key to successfully re-purposing a single diffusion model for multiple perception tasks lies in maximizing the preservation of the pre-trained model's prior knowledge. Consequently, DICEPTION can be trained with substantially lower computational costs than conventional models requiring training from scratch. Furthermore, adapting DICEPTION to novel tasks is highly efficient, necessitating fine-tuning on as few as 50 images and approximately 1% of its parameters. Finally, we demonstrate that a subtle application of classifier-free guidance can improve the model's performance on depth and normal estimation. We also show that pixel-aligned training, as is characteristic of perception tasks, significantly enhances the model's ability to preserve fine details. DICEPTION offers valuable insights and presents a promising direction for the development of advanced diffusion-based visual generalist models.
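
The classifier-free guidance mentioned at the end follows the standard formulation: blend conditional and unconditional predictions. The sketch below shows that blend in isolation; the model and conditioning interfaces are assumed, not DICEPTION's released code.

```python
import torch

def cfg_prediction(model, latents, cond, uncond, guidance_scale: float = 2.0):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the conditional one by the guidance scale."""
    eps_cond = model(latents, cond)       # prediction with the task/text condition
    eps_uncond = model(latents, uncond)   # prediction with a null condition
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```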

IROS Conference 2025 Conference Paper

MambaSFLNet: A Mamba-based Model for Low-Light Image Enhancement with Spatial and Frequency Features

  • Mingyu Liu
  • Yuning Cui 0001
  • Leah Strand
  • Xingcheng Zhou
  • Jiajie Zhang
  • Alois C. Knoll

Low-light image enhancement (LLIE) aims to enhance the illumination of images that are captured under dark conditions, which is critical for various applications in dim environments, such as robotics and autonomous driving. Existing convolutional neural network (CNN)-based methods usually struggle to capture long-range dependencies, while transformer-based methods, despite their effectiveness, are resource-consuming. Besides, the frequency domain includes important lightness degradation information. To this end, we propose a Mamba-based framework called MambaSFLNet to effectively address LLIE by integrating spatial and frequency features. Our approach utilizes the Visual State Space Module to establish relationships across different regions of the input image while maintaining low model complexity. Furthermore, the spatial module not only balances illumination distribution but also suppresses noise and artifacts during enhancement. In addition, the frequency module enhances image contrast and sharpness by leveraging frequency-domain information. Extensive experiments on nine widely used benchmarks demonstrate that our approach achieves superior performance and exhibits strong generalization capabilities compared to existing methods. The codes are available at https://github.com/MingyuLiu1/MambaSFLNet.git
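
To make the frequency-branch idea concrete, here is a minimal sketch of a module that reweights the 2D spectrum of each feature map with learnable parameters; it only illustrates the general pattern of spatial-to-frequency modulation and is not MambaSFLNet's actual frequency module.

```python
import torch
import torch.nn as nn

class FrequencyModulation(nn.Module):
    """Modulate the spectrum of each feature map with learnable per-channel
    weights, then transform back to the spatial domain."""
    def __init__(self, channels: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(channels, 1, 1))

    def forward(self, x):                                  # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")            # complex spectrum
        spec = spec * self.scale                           # reweight frequencies
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")
```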

ICLR Conference 2025 Conference Paper

MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequences

  • Canyu Zhao
  • Mingyu Liu
  • Wen Wang 0015
  • Weihua Chen
  • Fan Wang 0019
  • Hao Chen 0041
  • Bo Zhang 0025
  • Chunhua Shen

Recent advancements in video generation have primarily leveraged diffusion models for short-duration content. However, these approaches often fall short in modeling complex narratives and maintaining character consistency over extended periods, which is essential for long-form video production like movies. We propose MovieDreamer, a novel hierarchical framework that integrates the strengths of autoregressive models with diffusion-based rendering to pioneer long-duration video generation with intricate plot progressions and high visual fidelity. Our approach utilizes autoregressive models for global narrative coherence, predicting sequences of visual tokens that are subsequently transformed into high-quality video frames through diffusion rendering. This method is akin to traditional movie production processes, where complex stories are factorized into manageable units of scene capture. Further, we employ a multimodal script that enriches scene descriptions with detailed character information and visual style, enhancing continuity and character identity across scenes. We present extensive experiments across various movie genres, demonstrating that our approach not only achieves superior visual and narrative quality but also effectively extends the duration of generated content significantly beyond current capabilities.
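
The two-level generation loop can be summarized in a few lines. The sketch below is a schematic pipeline only: `predict_tokens` and `render` are hypothetical interfaces standing in for the autoregressive token model and the diffusion renderer.

```python
def generate_long_video(script_scenes, ar_model, diffusion_renderer):
    """Hierarchical sketch: an autoregressive model carries global narrative
    context across scenes, while a diffusion model renders each scene locally."""
    history = []     # visual tokens accumulated across scenes for coherence
    clips = []
    for scene in script_scenes:                              # multimodal script entries
        tokens = ar_model.predict_tokens(history, scene)     # hypothetical API
        history.extend(tokens)
        clips.append(diffusion_renderer.render(tokens, scene))  # hypothetical API
    return clips
```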

NeurIPS Conference 2025 Conference Paper

Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration

  • Hao Zhong
  • Muzhi Zhu
  • Zongze Du
  • Zheng Huang
  • Canyu Zhao
  • Mingyu Liu
  • Wen Wang
  • Hao Chen

Long-horizon video-audio reasoning and fine-grained pixel understanding impose conflicting requirements on omnimodal models: dense temporal coverage demands many low-resolution frames, whereas precise grounding calls for high-resolution inputs. We tackle this trade-off with a two-system architecture: a Global Reasoning System selects informative keyframes and rewrites the task at low spatial cost, while a Detail Understanding System performs pixel-level grounding on the selected high-resolution snippets. Because "optimal" keyframe selection and reformulation are ambiguous and hard to supervise, we formulate them as a reinforcement-learning (RL) problem and present Omni-R1, an end-to-end RL framework built on Group Relative Policy Optimization. Omni-R1 trains the Global Reasoning System through hierarchical rewards obtained via online collaboration with the Detail Understanding System, requiring only one epoch of RL on small task splits. Experiments on two challenging benchmarks, Referring Audio-Visual Segmentation (RefAVS) and Reasoning Video Object Segmentation (REVOS), show that Omni-R1 not only surpasses strong supervised baselines but also outperforms specialized state-of-the-art models, while substantially improving out-of-domain generalization and mitigating multimodal hallucination. Our results demonstrate the first successful application of RL to large-scale omnimodal reasoning and highlight a scalable path toward universal foundation models.
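
One way to picture the online collaboration is a reward that scores the Global Reasoning System's keyframe choice by how well the Detail Understanding System grounds the target on those frames. The sketch below is illustrative only: the selection and grounding terms, their weights, and the `detail_system.segment` interface are assumptions, not the paper's reward.

```python
import numpy as np

def mask_iou(pred, gt):
    """IoU of two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / (float(union) + 1e-8)

def hierarchical_reward(keyframes, rewritten_query, detail_system, clip, gt_masks,
                        w_select=0.3, w_ground=0.7):
    """Combine a keyframe-selection term with a grounding-quality term."""
    # Selection term: discourage duplicate or degenerate keyframe sets.
    select = len(set(keyframes)) / max(len(keyframes), 1)
    # Grounding term: mean mask IoU achieved on the chosen high-resolution frames.
    ious = [mask_iou(detail_system.segment(clip[f], rewritten_query), gt_masks[f])
            for f in keyframes]                     # detail_system API is assumed
    ground = sum(ious) / len(ious) if ious else 0.0
    return w_select * select + w_ground * ground
```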

ICLR Conference 2025 Conference Paper

PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training

  • Cong Chen
  • Mingyu Liu
  • Chenchen Jing
  • Yizhou Zhou
  • Fengyun Rao
  • Hao Chen 0041
  • Bo Zhang 0046
  • Chunhua Shen

This paper aims to address the challenge of hallucinations in Multimodal Large Language Models (MLLMs), particularly for dense image captioning tasks. To tackle the challenge, we identify the current lack of a metric that finely measures caption quality at the concept level. We hereby introduce HalFscore, a novel metric built upon the language graph and designed to evaluate both the accuracy and completeness of dense captions at a granular level. Additionally, we identify the root cause of hallucination as the model's over-reliance on its language prior. To address this, we propose PerturboLLaVA, which reduces the model's reliance on the language prior by incorporating adversarially perturbed text during training. This method enhances the model's focus on visual inputs, effectively reducing hallucinations and producing accurate, image-grounded descriptions without incurring additional computational overhead. PerturboLLaVA significantly improves the fidelity of generated captions, outperforming existing approaches in handling multimodal hallucinations and achieving improved performance across general multimodal benchmarks.
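
The training idea of injecting perturbed text can be sketched as a data-construction step: the textual context is contaminated with plausible-but-irrelevant content so that a model leaning on its language prior is pushed toward the image. The distractor construction below is a placeholder, not the paper's pipeline.

```python
import random

def perturbed_training_sample(question: str, caption: str, distractor_pool: list[str]) -> dict:
    """Build one training sample with adversarially perturbed textual context."""
    distractor = random.choice(distractor_pool)   # irrelevant but fluent text
    return {
        "prompt": f"{distractor}\n{question}",    # perturbed language context
        "target": caption,                        # supervision stays image-grounded
    }
```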

IJCAI Conference 2025 Conference Paper

Seeing the Unseen: Composing Outliers for Compositional Zero-Shot Learning

  • Chenchen Jing
  • Mingyu Liu
  • Hao Chen
  • Yuling Xi
  • Xingyuan Bu
  • Dong Gong
  • Chunhua Shen

Compositional zero-shot learning (CZSL) aims to recognize unseen attribute-object compositions by learning from seen compositions. The distribution shift between unseen compositions and seen compositions poses challenges to CZSL models, especially when test images are mixed with both seen and unseen compositions. The challenge will be addressed more easily if a model can distinguish unseen/seen compositions and treat them with specific recognition strategies. However, identifying images with unseen compositions is non-trivial, considering that unseen compositions are absent in training and usually contain only subtle differences from seen compositions. In this paper, we propose a novel compositional zero-shot learning method called COMO, which composes outliers during training to distinguish seen and unseen compositions and then applies specific strategies to each. Specifically, we compose attribute-object representations for unseen compositions based on primitive representations of training images as outliers to enable the model to identify unseen compositions at inference. At test time, the method distinguishes images containing seen/unseen compositions and uses different weights for composition classification and primitive classification to recognize seen/unseen compositions. Experimental results on three datasets show the effectiveness of our method in both the closed-world setting and the open-world setting.
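
A simple way to picture the outlier composition step is to recombine primitive features from different training images into unseen attribute-object pairs. The sketch below assumes concatenation-based composition and per-image attribute/object features; these are illustrative choices, not COMO's exact construction.

```python
import random
import torch

def compose_outliers(attr_feats, obj_feats, labels, unseen_pairs, num_outliers):
    """Recombine primitive features from training images into representations
    of unseen attribute-object compositions, to serve as training outliers."""
    by_attr, by_obj = {}, {}
    for i, (a, o) in enumerate(labels):
        by_attr.setdefault(a, []).append(i)   # images showing attribute a
        by_obj.setdefault(o, []).append(i)    # images showing object o
    outliers, outlier_labels = [], []
    for _ in range(num_outliers):
        a, o = random.choice(unseen_pairs)    # an unseen composition
        if a not in by_attr or o not in by_obj:
            continue
        ai = random.choice(by_attr[a])        # attribute donor image
        oi = random.choice(by_obj[o])         # object donor image
        outliers.append(torch.cat([attr_feats[ai], obj_feats[oi]], dim=-1))
        outlier_labels.append((a, o))
    if not outliers:
        return torch.empty(0), []
    return torch.stack(outliers), outlier_labels
```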

ICML Conference 2025 Conference Paper

Transformer-Based Spatial-Temporal Counterfactual Outcomes Estimation

  • He Li
  • Haoang Chi
  • Mingyu Liu
  • Wanrong Huang
  • Liyang Xu
  • Wenjing Yang 0002

The real world naturally has dimensions of time and space. Therefore, estimating the counterfactual outcomes with spatial-temporal attributes is a crucial problem. However, previous methods are based on classical statistical models, which still have limitations in performance and generalization. This paper proposes a novel framework for estimating counterfactual outcomes with spatial-temporal attributes using the Transformer, exhibiting stronger estimation ability. Under mild assumptions, the proposed estimator within this framework is consistent and asymptotically normal. To validate the effectiveness of our approach, we conduct simulation experiments and real data experiments. Simulation experiments show that our estimator has a stronger estimation capability than baseline methods. Real data experiments provide a valuable conclusion to the causal effect of conflicts on forest loss in Colombia. The source code is available at this URL.
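
As a minimal sketch of what a Transformer-based spatio-temporal outcome model can look like, the module below treats each (location, time) cell as a token of covariates plus a treatment embedding and maps the encoded tokens to outcomes. It is illustrative only and not the authors' estimator.

```python
import torch
import torch.nn as nn

class SpatioTemporalEstimator(nn.Module):
    """Transformer encoder over spatio-temporal covariate tokens with a
    treatment embedding, predicting one outcome per token."""
    def __init__(self, cov_dim: int, d_model: int = 64):
        super().__init__()
        self.embed = nn.Linear(cov_dim, d_model)
        self.treat = nn.Embedding(2, d_model)      # 0 = control, 1 = treated
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)

    def forward(self, covariates, treatment):      # (B, T*S, cov_dim), (B, T*S)
        tokens = self.embed(covariates) + self.treat(treatment)
        return self.head(self.encoder(tokens)).squeeze(-1)
```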

ICML Conference 2025 Conference Paper

TUMTraf VideoQA: Dataset and Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes

  • Xingcheng Zhou
  • Konstantinos Larintzakis
  • Hao Guo
  • Walter Zimmer
  • Mingyu Liu
  • Hu Cao
  • Jiajie Zhang
  • Venkatnarayanan Lakshminarasimhan

We present TUMTraf VideoQA, a novel dataset and benchmark designed for spatio-temporal video understanding in complex roadside traffic scenarios. The dataset comprises 1,000 videos, featuring 85,000 multiple-choice QA pairs, 2,300 object captioning, and 5,700 object grounding annotations, encompassing diverse real-world conditions such as adverse weather and traffic anomalies. By incorporating tuple-based spatio-temporal object expressions, TUMTraf VideoQA unifies three essential tasks—multiple-choice video question answering, referred object captioning, and spatio-temporal object grounding—within a cohesive evaluation framework. We further introduce the TraffiX-Qwen baseline model, enhanced with visual token sampling strategies, providing valuable insights into the challenges of fine-grained spatio-temporal reasoning. Extensive experiments demonstrate the dataset’s complexity, highlight the limitations of existing models, and position TUMTraf VideoQA as a robust foundation for advancing research in intelligent transportation systems. The dataset and benchmark are publicly available to facilitate further exploration.
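
A tuple-based spatio-temporal object expression can be thought of as a shared reference that QA pairs, captions, and grounding annotations all point to. The dataclass below is a guess at what such a record might contain; the field names are assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ObjectExpression:
    """Illustrative tuple-based spatio-temporal object reference."""
    category: str                 # e.g. "car", "pedestrian"
    track_id: int                 # identity of the object across frames
    frame_range: tuple[int, int]  # first and last frame the object is referenced in
    boxes: dict[int, tuple[float, float, float, float]]  # frame -> (x1, y1, x2, y2)
```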

ICLR Conference 2025 Conference Paper

What Matters When Repurposing Diffusion Models for General Dense Perception Tasks?

  • Guangkai Xu
  • Yongtao Ge
  • Mingyu Liu
  • Chengxiang Fan
  • Kangyang Xie
  • Zhiyue Zhao
  • Hao Chen 0041
  • Chunhua Shen

Extensive pre-training with large data is indispensable for downstream geometry and semantic visual perception tasks. Thanks to large-scale text-to-image (T2I) pretraining, recent works show promising results by simply fine-tuning T2I diffusion models for a few dense perception tasks. However, several crucial design decisions in this process still lack comprehensive justification, encompassing the necessity of the multi-step diffusion mechanism, training strategy, inference ensemble strategy, and fine-tuning data quality. In this work, we conduct a thorough investigation into critical factors that affect transfer efficiency and performance when using diffusion priors. Our key findings are: 1) High-quality fine-tuning data is paramount for both semantic and geometry perception tasks. 2) As a special case of the diffusion scheduler by setting its hyper-parameters, the multi-step generation can be simplified to a one-step fine-tuning paradigm without any loss of performance, while significantly speeding up inference. 3) Apart from fine-tuning the diffusion model with only latent space supervision, task-specific supervision can be beneficial to enhance fine-grained details. These observations culminate in the development of GenPercept, an effective deterministic one-step fine-tuning paradigm tailored for dense visual perception tasks exploiting diffusion priors. Different from the previous multi-step methods, our paradigm offers a much faster inference speed, and can be seamlessly integrated with customized perception decoders and loss functions for task-specific supervision, which can be critical for improving the fine-grained details of predictions. Comprehensive experiments on a diverse set of dense visual perceptual tasks, including monocular depth estimation, surface normal estimation, image segmentation, and matting, are performed to demonstrate the remarkable adaptability and effectiveness of our proposed method. Code: https://github.com/aim-uofa/GenPercept
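
Finding 2, collapsing the multi-step scheduler into one deterministic step, amounts to a single forward pass at a fixed timestep at inference time. The sketch below shows that shape of computation with assumed interfaces for the UNet and VAE; it is not the GenPercept code itself.

```python
import torch

@torch.no_grad()
def one_step_perception(unet, vae, image_latents, text_embedding, timestep: int = 999):
    """One-step deterministic inference: a single denoising call maps the image
    latents directly to the perception target (e.g. a depth map)."""
    t = torch.full((image_latents.shape[0],), timestep, device=image_latents.device)
    pred_latents = unet(image_latents, t, text_embedding)   # one forward pass, no sampling loop
    return vae.decode(pred_latents)                          # decode latents to the prediction
```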

IJCAI Conference 2024 Conference Paper

Hybrid Frequency Modulation Network for Image Restoration

  • Yuning Cui
  • Mingyu Liu
  • Wenqi Ren
  • Alois Knoll

Image restoration involves recovering a high-quality image from its corrupted counterpart. This paper presents an effective and efficient framework for image restoration, termed CSNet, based on "channel + spatial" hybrid frequency modulation. Different feature channels include different degradation patterns and degrees; however, most current networks ignore the importance of channel interactions. To alleviate this issue, we propose a frequency-based channel feature modulation module to facilitate channel interactions through the channel-dimension Fourier transform. Furthermore, based on our observations, we develop a multi-scale frequency-based spatial feature modulation module to refine the direct-current component of features using extremely lightweight learnable parameters. This module contains a densely connected coarse-to-fine learning paradigm for enhancing multi-scale representation learning. In addition, we introduce a frequency-inspired loss function to achieve omni-frequency learning. Extensive experiments on nine datasets demonstrate that the proposed network achieves state-of-the-art performance for three image restoration tasks, including image dehazing, image defocus deblurring, and image desnowing. The code and models are available at https://github.com/c-yn/CSNet.
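
To make the channel-dimension Fourier transform concrete, the sketch below applies an FFT along the channel axis so every spectral component mixes information from all channels, reweights it with learnable parameters, and transforms back. It illustrates the general pattern only and is not CSNet's actual module.

```python
import torch
import torch.nn as nn

class ChannelFrequencyModulation(nn.Module):
    """FFT along the channel dimension, learnable reweighting, inverse FFT."""
    def __init__(self, channels: int):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(channels // 2 + 1, 1, 1))

    def forward(self, x):                                   # x: (B, C, H, W)
        spec = torch.fft.rfft(x, dim=1, norm="ortho")       # channel-dimension spectrum
        spec = spec * self.weight                           # learnable reweighting
        return torch.fft.irfft(spec, n=x.shape[1], dim=1, norm="ortho")
```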

AAAI Conference 2023 Conference Paper

FanoutNet: A Neuralized PCB Fanout Automation Method Using Deep Reinforcement Learning

  • Haiyun Li
  • Jixin Zhang
  • Ning Xu
  • Mingyu Liu

In modern electronic manufacturing processes, multi-layer Printed Circuit Board (PCB) routing requires connecting hundreds of nets with perplexing topology under complex routing constraints and highly limited resources, which takes intense effort and time from human engineers. PCB fanout, as a pre-design step of PCB routing, has been proven to be an ideal technique to reduce the complexity of PCB routing by pre-allocating resources and pre-routing. However, current PCB fanout design heavily relies on the experience of human engineers, and there is no existing solution for PCB fanout automation in industry, which limits the quality of PCB routing automation. To address the problem, we propose a neuralized PCB fanout method by deep reinforcement learning. To the best of our knowledge, we are the first in the literature to propose an automation method for PCB fanout. We combine a Convolutional Neural Network (CNN) with an attention-based network to train our fanout policy model and value model. The models learn representations of PCB layout and netlist to make decisions and evaluations in place of human engineers. We employ Proximal Policy Optimization (PPO) to update the parameters of the models. In addition, we apply our PCB fanout method to a PCB router to improve the quality of PCB routing. Extensive experimental results on real-world industrial PCB benchmarks demonstrate that our approach achieves 100% routability in all industrial cases and improves wire length by an average of 6.8%, which makes a significant improvement compared with the state-of-the-art methods.
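
The PPO update mentioned above uses the standard clipped surrogate objective; the snippet below shows that loss on its own, with per-action tensors assumed to come from routing rollouts. It illustrates the generic PPO objective, not the paper's full training loop.

```python
import torch

def ppo_policy_loss(log_probs, old_log_probs, advantages, clip_eps: float = 0.2):
    """Clipped PPO surrogate loss over a batch of fanout actions."""
    ratio = torch.exp(log_probs - old_log_probs)            # new / old policy probability
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()            # maximize the surrogate
```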

ECAI Conference 2023 Conference Paper

Instance-Wise Adaptive Tuning and Caching for Vision-Language Models

  • Chunjin Yang
  • Fanman Meng
  • Shuai Chen
  • Mingyu Liu
  • Runtong Zhang

Large-scale vision-language models (LVLMs) pre-trained on massive image-text pairs have achieved remarkable success in visual representations. However, existing paradigms to transfer LVLMs to downstream tasks encounter two primary challenges. Firstly, the text features remain fixed after being calculated and cannot be adjusted according to image features, which decreases the model’s adaptability. Secondly, the model’s output solely depends on the similarity between the text and image features, leading to excessive reliance on LVLMs. To address these two challenges, we introduce a novel two-branch model named the Instance-Wise Adaptive Tuning and Caching (ATC). Specifically, one branch implements our proposed ConditionNet, which guides image features to form an adaptive textual cache that adjusts based on image features, achieving instance-wise inference and improving the model’s adaptability. The other branch introduces the similarities between images and incorporates a learnable visual cache, designed to decouple new and previous knowledge, allowing the model to acquire new knowledge while preserving prior knowledge. The model’s output is jointly determined by the two branches, thus overcoming the limitations of existing methods that rely solely on LVLMs. Additionally, our method requires limited computing resources to tune parameters, yet outperforms existing methods on 11 benchmark datasets.