Author name cluster

Liangming Pan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

8 papers

2 author rows

AAAI Conference 2025 Conference Paper

Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning

Shengqiong Wu
Hao Fei
Liangming Pan
William Yang Wang
Shuicheng Yan
Tat-Seng Chua

Recent advancements in multimodal large language models (MLLMs) have shown unprecedented capabilities in advancing various vision-language tasks. However, MLLMs face significant challenges with hallucinations, and misleading outputs that do not align with the input data. While existing efforts are paid to combat MLLM hallucinations, several pivotal challenges are still unsolved. First, while current approaches aggressively focus on addressing errors at the perception level, another important type at the cognition level requiring factual commonsense can be overlooked. In addition, existing methods might fall short in finding a more effective way to represent visual input, which is yet a key bottleneck that triggers visual hallucinations. Moreover, MLLMs can frequently be misled by faulty textual inputs and cause hallucinations, while unfortunately, this type of issue has long been overlooked by existing studies. Inspired by human intuition in handling hallucinations, this paper introduces a novel bottom-up reasoning framework. Our framework systematically addresses potential issues in both visual and textual inputs by verifying and integrating perception-level information with cognition-level commonsense knowledge, ensuring more reliable outputs. Extensive experiments demonstrate significant improvements in multiple hallucination benchmarks after integrating MLLMs with the proposed framework. In-depth analyses reveal the great potential of our methods in addressing perception- and cognition-level hallucinations.

PDF Details DOI

NeurIPS Conference 2025 Conference Paper

How do Transformers Learn Implicit Reasoning?

Jiaran Ye
Zijun Yao
Zhidian Huang
Liangming Pan
Jinxin Liu
Yushi Bai
Amy Xin
Liu Weichuan

Recent work suggests that large language models (LLMs) can perform multi-hop reasoning implicitly---producing correct answers without explicitly verbalizing intermediate steps---but the underlying mechanisms remain poorly understood. In this paper, we study how such implicit reasoning emerges by training transformers from scratch in a controlled symbolic environment. Our analysis reveals a three-stage developmental trajectory: early memorization, followed by in-distribution generalization, and eventually cross-distribution generalization. We find that training with atomic triples is not necessary but accelerates learning, and that second-hop generalization relies on query-level exposure to specific compositional structures. To interpret these behaviors, we introduce two diagnostic tools: cross-query semantic patching, which identifies semantically reusable intermediate representations, and a cosine-based representational lens, which reveals that successful reasoning correlates with the cosine-base clustering in hidden space. This clustering phenomenon in turn provides a coherent explanation for the behavioral dynamics observed across training, linking representational structure to reasoning capability. These findings provide new insights into the interpretability of implicit multi-hop reasoning in LLMs, helping to clarify how complex reasoning processes unfold internally and offering pathways to enhance the transparency of such models.

PDF Details

NeurIPS Conference 2025 Conference Paper

MuSLR: Multimodal Symbolic Logical Reasoning

Jundong Xu
Hao Fei
Yuhui Zhang
Liangming Pan
Qijun Huang
Qian Liu
Preslav Nakov
Min-Yen Kan

Multimodal symbolic logical reasoning, which aims to deduce new facts from multimodal input via formal logic, is critical in high-stakes applications such as autonomous driving and medical diagnosis, as its rigorous, deterministic reasoning helps prevent serious consequences. To evaluate such capabilities of current state-of-the-art vision language models (VLMs), we introduce the first benchmark MuSLR for multimodal symbolic logical reasoning grounded in formal logical rules. MuSLR comprises 1, 093 instances across 7 domains, including 35 atomic symbolic logic and 976 logical combinations, with reasoning depths ranging from 2 to 9. We evaluate 7 state-of-the-art VLMs on MuSLR and find that they all struggle with multimodal symbolic reasoning, with the best model, GPT-4. 1, achieving only 46. 8%. Thus, we propose LogiCAM, a modular framework that applies formal logical rules to multimodal inputs, boosting GPT-4. 1’s Chain-of-Thought performance by 14. 13%, and delivering even larger gains on complex logics such as first-order logic. We also conduct a comprehensive error analysis, showing that around 70% of failures stem from logical misalignment between modalities, offering key insights to guide future improvements.

PDF Details

TMLR Journal 2024 Journal Article

A Survey on Data Selection for Language Models

Alon Albalak
Yanai Elazar
Sang Michael Xie
Shayne Longpre
Nathan Lambert
Xinyi Wang
Niklas Muennighoff
Bairu Hou

A major factor in the recent success of large language models is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as the quality of available text data can vary. Filtering out data can also decrease the carbon footprint and financial costs of training models by reducing the amount of training required. Data selection methods aim to determine which candidate data points to include in the training dataset and how to appropriately sample from the selected data points. The promise of improved data selection methods has caused the volume of research in the area to rapidly expand. However, because deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive, few organizations have the resources for extensive data selection research. Consequently, knowledge of effective data selection practices has become concentrated within a few organizations, many of which do not openly share their findings and methodologies. To narrow this gap in knowledge, we present a comprehensive review of existing literature on data selection methods and related research areas, providing a taxonomy of existing approaches. By describing the current landscape of research, this work aims to accelerate progress in data selection by establishing an entry point for new and established researchers. Additionally, throughout this review we draw attention to noticeable holes in the literature and conclude the paper by proposing promising avenues for future research.

PDF Details

NeurIPS Conference 2024 Conference Paper

MMLONGBENCH-DOC: Benchmarking Long-context Document Understanding with Visualizations

Yubo Ma
Yuhang Zang
Liangyu Chen
Meiqi Chen
Yizhu Jiao
Xinze Li
Xinyuan Lu
Ziyu Liu

Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document understanding (DU). However, their abilities on long-context DU remain an open problem. This work presents MMLONGBENCH-DOC, a long-context, multi- modal benchmark comprising 1, 082 expert-annotated questions. Distinct from previous datasets, it is constructed upon 135 lengthy PDF-formatted documents with an average of 47. 5 pages and 21, 214 textual tokens. Towards comprehensive evaluation, answers to these questions rely on pieces of evidence from (1) different sources (text, image, chart, table, and layout structure) and (2) various locations (i. e. , page number). Moreover, 33. 7\% of the questions are cross-page questions requiring evidence across multiple pages. 20. 6\% of the questions are designed to be unanswerable for detecting potential hallucinations. Experiments on 14 LVLMs demonstrate that long-context DU greatly challenges current models. Notably, the best-performing model, GPT-4o, achieves an F1 score of only 44. 9\%, while the second-best, GPT-4V, scores 30. 5\%. Furthermore, 12 LVLMs (all except GPT-4o and GPT-4V) even present worse performance than their LLM counterparts which are fed with lossy-parsed OCR documents. These results validate the necessity of future research toward more capable long-context LVLMs.

PDF Details DOI

ICML Conference 2024 Conference Paper

Position: AI/ML Influencers Have a Place in the Academic Process

Iain Weissburg
Mehir Arora
Xinyi Wang 0003
Liangming Pan
William Yang Wang

As the number of accepted papers at AI and ML conferences reaches into the thousands, it has become unclear how researchers access and read research publications. In this paper, we investigate the role of social media influencers in enhancing the visibility of machine learning research, particularly the citation counts of papers they share. We have compiled a comprehensive dataset of over 8, 000 papers, spanning tweets from December 2018 to October 2023, alongside controls precisely matched by 9 key covariates. Our statistical and causal inference analysis reveals a significant increase in citations for papers endorsed by these influencers, with median citation counts 2-3 times higher than those of the control group. Additionally, the study delves into the geographic, gender, and institutional diversity of highlighted authors. Given these findings, we advocate for a responsible approach to curation, encouraging influencers to uphold the journalistic standard that includes showcasing diverse research topics, authors, and institutions.

Details

ICML Conference 2024 Conference Paper

Understanding Reasoning Ability of Language Models From the Perspective of Reasoning Paths Aggregation

Xinyi Wang 0003
Alfonso Amayuelas
Kexun Zhang
Liangming Pan
Wenhu Chen
William Yang Wang

Pre-trained language models (LMs) are able to perform complex reasoning without explicit fine-tuning. To understand how pre-training with a next-token prediction objective contributes to the emergence of such reasoning capability, we propose that we can view an LM as deriving new conclusions by aggregating indirect reasoning paths seen at pre-training time. We found this perspective effective in two important cases of reasoning: logic reasoning with knowledge graphs (KGs) and chain-of-thought (CoT) reasoning. More specifically, we formalize the reasoning paths as random walk paths on the knowledge/reasoning graphs. Analyses of learned LM distributions suggest that a weighted sum of relevant random walk path probabilities is a reasonable way to explain how LMs reason. Experiments and analysis on multiple KG and CoT datasets reveal the effect of training on random walk paths and suggest that augmenting unlabeled random walk reasoning paths can improve real-world multi-step reasoning performance.

Details

AAAI Conference 2020 Conference Paper

Zero-Shot Ingredient Recognition by Multi-Relational Graph Convolutional Network

Jingjing Chen
Liangming Pan
Zhipeng Wei
Xiang Wang
Chong-Wah Ngo
Tat-Seng Chua

Recognizing ingredients for a given dish image is at the core of automatic dietary assessment, attracting increasing attention from both industry and academia. Nevertheless, the task is challenging due to the difﬁculty of collecting and labeling sufﬁcient training data. On one hand, there are hundred thousands of food ingredients in the world, ranging from the common to rare. Collecting training samples for all of the ingredient categories is difﬁcult. On the other hand, as the ingredient appearances exhibit huge visual variance during the food preparation, it requires to collect the training samples under different cooking and cutting methods for robust recognition. Since obtaining sufﬁcient fully annotated training data is not easy, a more practical way of scaling up the recognition is to develop models that are capable of recognizing unseen ingredients. Therefore, in this paper, we target the problem of ingredient recognition with zero training samples. More speciﬁcally, we introduce multi-relational GCN (graph convolutional network) that integrates ingredient hierarchy, attribute as well as co-occurrence for zero-shot ingredient recognition. Extensive experiments on both Chinese and Japanese food datasets are performed to demonstrate the superior performance of multi-relational GCN and shed light on zero-shot ingredients recognition.

PDF Details