Author name cluster

Ming Liu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

46 papers

2 author rows

AAAI Conference 2026 Conference Paper

CCFQA: A Benchmark for Cross-Lingual and Cross-Modal Speech and Text Factuality Evaluation

Yexing Du
Kaiyuan Liu
Youcheng Pan
Zheng Chu
Bo Yang
Xiaocheng Feng
Ming Liu
Yang Xiang

As Large Language Models (LLMs) are increasingly popularized in the multilingual world, ensuring hallucination-free factuality becomes markedly crucial. However, existing benchmarks for evaluating the reliability of Multimodal Large Language Models (MLLMs) predominantly focus on textual or visual modalities with a primary emphasis on English, which creates a gap in evaluation when processing multilingual input, especially in speech. To bridge this gap, we propose a novel Cross-lingual and Cross-modal Factuality benchmark (CCFQA). Specifically, the CCFQA benchmark contains parallel speech-text factual questions across 8 languages, designed to systematically evaluate MLLMs' cross-lingual and cross-modal factuality capabilities. Our experimental results demonstrate that current MLLMs still face substantial challenges on the CCFQA benchmark. Furthermore, we propose a few-shot transfer learning strategy that effectively transfers the Question Answering (QA) capabilities of LLMs in English to multilingual Spoken Question Answering (SQA) tasks, achieving competitive performance with GPT-4o-mini-Audio using just 5-shot training. We release CCFQA as a foundational research resource to promote the development of MLLMs with more robust and reliable speech understanding capabilities.

AAAI Conference 2026 Conference Paper

From Sampling to Cognition: Modeling Internal Cognitive Confidence in Language Models for Robust Uncertainty Calibration

Hao Li
Tao He
Jiafeng Liang
Zheng Chu
Ming Liu

Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of tasks, yet they generally lack self-awareness, often displaying overconfidence when confronted with questions beyond their knowledge boundaries. This limitation severely hinders their trustworthiness in high-stakes scenarios. Existing calibration methods typically rely on sampling accuracy, derived from multiple outputs, as a proxy for model confidence. However, this coarse-grained metric fails to capture the model’s internal cognitive states, such as confusion, hallucination, or persistent belief in false knowledge. To address this, we propose CogConf (Cognitive Confidence), a cognitively grounded uncertainty signal that extends sampling accuracy by incorporating the semantic diversity of incorrect answers and the model’s abstention behaviors. By shifting the focus from sampling-based to cognition-oriented uncertainty modeling, CogConf offers a more faithful reflection of the model's internal beliefs. Building on this signal, we introduce CogAlign, a simple yet effective alignment framework that explicitly aligns the model’s verbalized confidence with CogConf, thereby producing uncertainty estimates that better reflect the model’s internal cognition. Experimental results on six knowledge-intensive in-domain and out-of-domain QA datasets demonstrate that CogConf robustly characterizes the model's internal uncertainty. Building on this foundation, CogAlign guides the model's expression to significantly enhance the trustworthiness and utility of its uncertainty calibration without compromising its underlying QA capabilities, while also demonstrating strong cross-task generalization and output stability. Together, these results offer a new pathway toward building more trustworthy LLMs.
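A rough illustration of why sampling accuracy alone can be a coarse signal: the sketch below computes three separate quantities over a set of sampled answers (sampling accuracy, abstention rate, and the number of distinct wrong answers). It is not the CogConf formula, which the abstract does not give. The function name, the abstention string, the sample answers, and the use of exact string matching as a stand-in for semantic diversity are all assumptions made for this example.

```python
def confidence_ingredients(samples, reference, abstain_token="I don't know"):
    """Summarize sampled answers (hypothetical helper, not the paper's method)."""
    n = len(samples)
    correct = sum(s == reference for s in samples)
    abstained = sum(s == abstain_token for s in samples)
    # Wrong answers are everything that is neither correct nor an abstention.
    wrong = [s for s in samples if s != reference and s != abstain_token]
    return {
        "sampling_accuracy": correct / n,
        "abstention_rate": abstained / n,
        # Exact-match distinctness is an assumed proxy for semantic diversity.
        "distinct_wrong_answers": len(set(wrong)),
    }


# Two made-up sample sets with identical sampling accuracy.
persistent_belief = ["Lyon", "Lyon", "Lyon", "Paris"]
confused = ["Lyon", "Nice", "I don't know", "Paris"]

stats_persistent = confidence_ingredients(persistent_belief, reference="Paris")
stats_confused = confidence_ingredients(confused, reference="Paris")
```

Both sample sets score the same sampling accuracy, but the first repeats a single wrong answer while the second spreads across different wrong answers and an abstention. Sampling accuracy alone does not separate these two patterns.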

AAAI Conference 2026 Conference Paper

Is Your (Reasoning) Multimodal Language Model Vulnerable Toward Distractions?

Ming Liu
Hao Chen
Jindong Wang
Liwen Wang
Jingchen Sun
Wensheng Zhang

Vision-Language Models (VLMs) have achieved success in tasks such as visual question answering, yet their resilience to distractions remains underexplored. Understanding how distractions affect VLMs' performance is crucial for real-world applications, as input data often contains noisy or irrelevant content. This paper assesses the robustness of VLMs—including general-purpose models and those specialized for reasoning—against distractions in the context of science question answering. We introduce I-ScienceQA, a new benchmark based on the ScienceQA dataset, which systematically injects distractions into both visual and textual contexts. We evaluate how distractions perturb the underlying reasoning processes of these models by analyzing changes in textual explanations leading to answers. Our findings show that most VLMs are vulnerable to distractions, with a noticeable degradation in reasoning when extraneous content is present. In particular, some models (including GPT-o4 mini) exhibit a higher degree of robustness. We also observe that textual distractions generally cause greater performance declines than visual distractions. Finally, we explore mitigation strategies such as prompt engineering. Although these strategies improve resilience modestly, our analysis highlights considerable room for further improvement in the robustness of VLMs.

AAAI Conference 2026 Conference Paper

RefSTAR: Blind Face Image Restoration with Reference Selection, Transfer, and Reconstruction

Zhicun Yin
Junjie Chen
Ming Liu
Zhixin Wang
Fan Li
Renjing Pei
Xiaoming Li
Rynson W. H. Lau

Introducing high-quality references can largely alleviate the uncertainty in blind face image restoration tasks, yet the equivocal use of reference priors still makes it difficult to preserve human identity. We attribute the identity inconsistency to two deficiencies of existing reference-based face restoration methods, namely the inability to effectively determine which features need to be transferred, and the failure to preserve the structure and details of the selected features. This work mainly focuses on these two issues, and we present a novel blind face image restoration method that considers reference selection, transfer, and reconstruction (RefSTAR) to introduce proper features from reference images. Specifically, we construct a reference selection (RefSel) module, which can generate accurate masks to select reference features. For training the RefSel module, we construct a RefSel-HQ dataset through a mask generation pipeline, which contains annotated masks for 10,000 ground truth-reference pairs. To guarantee the exact introduction of selected reference features, a feature fusion paradigm is designed for reference feature transferring, and a Mask-Compatible Cycle-Consistency Loss is redesigned based on reference reconstruction to further ensure the presence of selected reference image features in the output image. Experiments on various backbone models demonstrate superior performance, showing better identity preservation ability and reference feature transfer quality.

ICLR Conference 2025 Conference Paper

Effective Interplay between Sparsity and Quantization: From Theory to Practice

Simla Burcu Harma
Ayan Chakraborty 0005
Elizaveta Kostenok
Danila Mishin
Dongho Ha
Babak Falsafi
Martin Jaggi
Ming Liu

The increasing size of deep neural networks (DNNs) necessitates effective model compression to reduce their computational and memory footprints. Sparsity and quantization are two prominent compression methods that have been shown to reduce DNNs' computational and memory footprints significantly while preserving model accuracy. However, how these two methods interact when combined together remains a key question for developers, as many tacitly assume that they are orthogonal, meaning that their combined use does not introduce additional errors beyond those introduced by each method independently. In this paper, we provide the first mathematical proof that sparsity and quantization are non-orthogonal. We corroborate these results with experiments spanning a range of large language models, including the OPT and LLaMA model families (with 125M to 8B parameters), and vision models like ViT and ResNet. We show that the order in which we apply these methods matters because applying quantization before sparsity may disrupt the relative importance of tensor elements, which may inadvertently remove significant elements from a tensor. More importantly, we show that even if applied in the correct order, the compounded errors from sparsity and quantization can significantly harm accuracy. Our findings extend to the efficient deployment of large models in resource-constrained compute platforms to reduce serving cost, offering insights into best practices for applying these compression methods to maximize hardware resource efficiency without compromising accuracy.
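The ordering effect described in the abstract can be seen on a toy tensor. The sketch below is a minimal illustration, not the paper's experimental setup. It assumes round-to-nearest uniform quantization with a fixed step and top-k magnitude pruning; the helper names and the four-element tensor are invented for this example.

```python
def quantize(xs, step):
    """Round-to-nearest uniform quantization onto a grid with spacing `step`."""
    return [round(x / step) * step for x in xs]


def prune_top_k(xs, k):
    """Magnitude pruning: keep the k largest-magnitude entries, zero the rest."""
    # sorted() is stable, so ties are broken in favour of the lower index.
    order = sorted(range(len(xs)), key=lambda i: -abs(xs[i]))
    keep = set(order[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(xs)]


def squared_error(a, b):
    """Sum of squared element-wise differences."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


x = [0.6, 0.9, 0.1, -0.2]  # toy tensor; 0.9 is the most important element

# Sparsity first: pruning sees the true magnitudes and keeps 0.9.
sparsity_first = quantize(prune_top_k(x, k=1), step=1.0)

# Quantization first: 0.6 and 0.9 both collapse to 1.0, so pruning can no
# longer tell them apart and keeps the lower-index entry instead.
quantization_first = prune_top_k(quantize(x, step=1.0), k=1)

err_sparsity_first = squared_error(x, sparsity_first)
err_quantization_first = squared_error(x, quantization_first)
```

With this tensor, quantizing first erases the difference between 0.6 and 0.9, so the subsequent pruning step drops the larger original element. The squared error against the original tensor is larger than when pruning is applied first.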

ECAI Conference 2025 Conference Paper

Endexformer: Hierarchical Endogenous-Exogenous Synergy for Multivariate Time Series Forecasting

Zhiquan Huang
Ruijuan Zheng
Junlong Zhu
Luxin Liu
Meiwen Li
Ming Liu

Exogenous variables provide complementary information that enhances endogenous representations, thereby facilitating more accurate multivariate time series forecasting (MTSF). However, existing methods typically overlook the synergistic interplay between exogenous and endogenous variables by adopting shallow fusion strategies such as simple concatenation or separate encoding, which fail to capture the dynamic dependencies essential for modeling complex temporal patterns. To address this issue, we propose Endexformer, a novel hierarchical Endogenous-Exogenous modeling framework built upon the Transformer architecture. Specifically, Endexformer adopts a hierarchical architecture to jointly model temporal embeddings of endogenous variables and structural embeddings of exogenous variables, enabling a unified representation of cross-variable dependencies. To capture the fine-grained temporal patterns of endogenous variables, we present a multilevel temporal attention mechanism that leverages variable-level embeddings to adaptively incorporate exogenous information. Furthermore, we design a dynamic interactive attention mechanism that selectively emphasizes informative endogenous and exogenous patterns, mitigating redundancy and preserving semantic integrity in variable representations. Extensive experiments on eight real-world datasets show that Endexformer achieves outstanding performance against competing benchmark approaches in MTSF tasks across various temporal scenarios.

ICLR Conference 2025 Conference Paper

Is Your Video Language Model a Reliable Judge?

Ming Liu
Wensheng Zhang

As video language models (VLMs) gain more applications in various scenarios, the need for robust and scalable evaluation of their performance becomes increasingly critical. The traditional human expert-based evaluation of VLMs has limitations in consistency and scalability, which sparked interest in automatic methods such as employing VLMs to evaluate VLMs. However, the reliability of VLMs as judges remains underexplored. Existing methods often rely on a single VLM as the evaluator. This approach can be unreliable or biased because such a model may lack the ability to fully understand the content and may have inherent biases, ultimately compromising evaluation reliability. A remedy is to apply the principle of collective thoughts, aggregating evaluations from multiple VLMs to enhance reliability. This study investigates the efficacy of such approaches, particularly when the pool of judges includes both reliable and unreliable models. Our findings reveal that incorporating collective judgments from such a mixed pool does not necessarily improve the accuracy of the final evaluation. The inclusion of less reliable judges can introduce noise, undermining the overall reliability of the outcomes. To explore the factors that impact evaluation reliability, we fine-tune an underperforming VLM judge, Video-LLaVA, and observe that improved understanding ability alone is insufficient to make VLM judges more reliable. These findings stress the limitations of collective thought approaches and highlight the need for more advanced methods that can account for the reliability of individual models. Our study promotes the development of more reliable evaluation methods for VLMs.
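The observation that a mixed pool of judges need not beat a single reliable judge can be reproduced with a small worked example. The sketch below assumes simple majority voting as the aggregation rule, which the abstract does not specify. The binary verdicts are made up for illustration and are not data from the paper.

```python
from collections import Counter


def accuracy(verdicts, truth):
    """Fraction of verdicts that match the ground-truth labels."""
    return sum(v == t for v, t in zip(verdicts, truth)) / len(truth)


def majority_vote(pool):
    """Aggregate per-item verdicts from several judges by majority."""
    # zip(*pool) regroups the verdicts so each tuple holds one item's votes.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*pool)]


# Hypothetical ground truth and judge verdicts (1 = acceptable, 0 = not).
truth = [1, 1, 0, 0, 1, 0]
reliable_judge = [1, 1, 0, 0, 1, 1]
noisy_judge_a = [0, 1, 1, 0, 0, 1]
noisy_judge_b = [0, 0, 1, 1, 1, 1]

aggregated = majority_vote([reliable_judge, noisy_judge_a, noisy_judge_b])

acc_reliable = accuracy(reliable_judge, truth)
acc_aggregated = accuracy(aggregated, truth)
```

In this constructed case the two noisy judges outvote the reliable one on several items, so the aggregated verdicts are less accurate than the reliable judge alone. An aggregation rule that weights judges by their estimated reliability would behave differently.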

NeurIPS Conference 2025 Conference Paper

On Fairness of Unified Multimodal Large Language Model for Image Generation

Ming Liu
Hao Chen
Jindong Wang
Liwen Wang
Bhiksha Raj
Wensheng Zhang

Unified multimodal large language models (U-MLLMs) have demonstrated impressive performance in end-to-end visual understanding and generation tasks. However, compared to generation-only systems (e.g., Stable Diffusion), the unified architecture of U-MLLMs introduces new risks of propagating demographic stereotypes. In this paper, we benchmark several state-of-the-art U-MLLMs and show that they exhibit significant gender and race biases in the generated outputs. To diagnose the source of these biases, we propose a locate-then-fix framework: we first audit the vision and language components (using techniques such as linear probing and controlled generation) and find that the language model appears to be a primary origin of the observed generative bias. Moreover, we observe a "partial alignment" phenomenon, where the U-MLLMs exhibit less bias in understanding tasks yet produce substantially biased images. To address this, we introduce a novel balanced preference loss that enforces uniform generation probabilities across demographics by leveraging a synthetically balanced dataset. Extensive experiments show that our approach significantly reduces demographic bias while preserving semantic fidelity and image quality. Our findings underscore the need for targeted debiasing strategies in unified multimodal systems and introduce a practical approach to mitigate biases.

AAAI Conference 2025 Conference Paper

Simulation-Free Hierarchical Latent Policy Planning for Proactive Dialogues

Tao He
Lizi Liao
Yixin Cao
Yuanxing Liu
Yiheng Sun
Zerui Chen
Ming Liu
Bing Qin

Recent advancements in proactive dialogues have garnered significant attention, particularly for more complex objectives (e.g., emotion support and persuasion). Unlike traditional task-oriented dialogues, proactive dialogues demand advanced policy planning and adaptability, requiring rich scenarios and comprehensive policy repositories to develop such systems. However, existing approaches tend to rely on Large Language Models (LLMs) for user simulation and online learning, leading to biases that diverge from realistic scenarios and result in suboptimal efficiency. Moreover, these methods depend on manually defined, context-independent, coarse-grained policies, which not only incur high expert costs but also raise concerns regarding their completeness. In our work, we highlight the potential for automatically discovering policies directly from raw, real-world dialogue records. To this end, we introduce a novel dialogue policy planning framework, LDPP. It fully automates the process from mining policies in dialogue records to learning policy planning. Specifically, we employ a variant of the Variational Autoencoder to discover fine-grained policies represented as latent vectors. After automatically annotating the data with these latent policy labels, we propose an Offline Hierarchical Reinforcement Learning (RL) algorithm in the latent space to develop effective policy planning capabilities. Our experiments demonstrate that LDPP outperforms existing methods on two proactive scenarios, even surpassing ChatGPT with only a 1.8-billion-parameter LLM.

EAAI Journal 2024 Journal Article

Coal allocation optimization based on a hybrid residual prediction model with an improved genetic algorithm

Ming Liu
Ziqi Yu
Boran Li
Qingjie Wang
Huawei Ren
Dong Xu

The objective of the coal blending optimization problem is to find an optimal coal blend in the feasible domain such that the blended coal meets the quality requirements at the end of the coking process and the cost of coal blending is minimized. This paper proposes a hybrid residual prediction model and an improved genetic algorithm to solve this problem and predict coke quality. The hybrid residual prediction model is used to predict coke quality. It first uses a random forest feature extraction method to reduce the dimensionality of the data, and then trains several prediction models, such as eXtreme Gradient Boosting (XGBoost), AdaBoost, and Light Gradient-Boosting Machine (LightGBM), for different coke indicators. An improved genetic algorithm based on the adaptive weighted genetic algorithm (awGA) and another improved genetic algorithm based on a priori knowledge and an adaptive random initialization method (P-awGA) were designed and implemented to solve the optimization problem under strict constraints. The experimental results show that the hybrid residual prediction model and the improved genetic algorithm can accurately predict coke quality and obtain a lower-cost coal blending solution in less time.