Arrow Research search

Author name cluster

Ming Zhong

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers
1 author row

Possible papers


AAAI Conference 2026 Conference Paper

Format as a Prior: Quantifying and Analyzing Bias in LLMs for Heterogeneous Data

  • Jiacheng Liu
  • Mayi Xu
  • Qiankun Pi
  • Wenli Li
  • Ming Zhong
  • Yuanyuan Zhu
  • Mengchi Liu
  • Tieyun Qian

Large Language Models (LLMs) are increasingly employed in applications that require processing information from heterogeneous formats, including texts, tables, infoboxes, and knowledge graphs. However, systematic biases toward particular formats may undermine LLMs' ability to integrate heterogeneous data impartially, potentially resulting in reasoning errors and increased risks in downstream tasks. Yet it remains unclear whether such biases are systematic, which data-level factors drive them, and what internal mechanisms underlie their emergence. In this paper, we present the first comprehensive study of format bias in LLMs through a three-stage empirical analysis. The first stage explores the presence and direction of bias across a diverse range of LLMs. The second stage examines how key data-level factors influence these biases. The third stage analyzes how format bias emerges within LLMs' attention patterns and evaluates a lightweight intervention to test its effectiveness. Our results show that format bias is consistent across model families, driven by information richness, structure quality, and representation type, and is closely associated with attention imbalance within the LLMs. Based on these investigations, we identify three future research directions to reduce format bias: enhancing data pre-processing through format repair and normalization, introducing inference-time interventions such as attention re-weighting, and developing format-balanced training corpora. These directions will support the design of more robust and fair heterogeneous data processing systems.

AAAI Conference 2026 Conference Paper

Privacy-protected Retrieval-Augmented Generation for Knowledge Graph Question Answering

  • Yunfeng Ning
  • Mayi Xu
  • Jintao Wen
  • Qiankun Pi
  • Yuanyuan Zhu
  • Ming Zhong
  • Jiawei Jiang
  • Tieyun Qian

Large Language Models (LLMs) often suffer from hallucinations and outdated or incomplete knowledge. Retrieval-Augmented Generation (RAG) is proposed to address these issues by integrating external knowledge, such as that in knowledge graphs (KGs), into LLMs. However, leveraging private KGs in RAG systems poses significant privacy risks due to the black-box nature of LLMs and potentially insecure data transmission. In this paper, we investigate the privacy-protected RAG scenario for the first time, where entities in KGs are anonymous to LLMs, thus preventing them from accessing entity semantics. Due to this loss of entity semantics, previous RAG systems cannot retrieve question-relevant knowledge from KGs by matching questions with the meaningless identifiers of anonymous entities. To realize an effective RAG system in this scenario, two key challenges must be addressed: (1) How can anonymous entities be converted into retrievable information? (2) How can question-relevant anonymous entities be retrieved? To address these challenges, we propose a novel Abstraction Reasoning on Graph (ARoG) framework including relation-centric abstraction and structure-oriented abstraction strategies. For challenge (1), the first strategy abstracts entities into high-level concepts by dynamically capturing the semantics of their adjacent relations. Hence, it supplements meaningful semantics which can further support the retrieval process. For challenge (2), the second strategy transforms unstructured natural language questions into structured abstract concept paths. These paths can be more effectively aligned with the abstracted concepts in KGs, thereby improving retrieval performance. In addition to guiding LLMs to effectively retrieve knowledge from KGs, these abstraction strategies also strictly protect privacy from being exposed to LLMs. Experiments on three datasets demonstrate that ARoG achieves strong performance and privacy robustness, establishing a new practical direction for privacy-protected RAG systems.
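The abstract does not include code, but the relation-centric abstraction idea can be sketched in a few lines: an anonymous entity is labeled by the semantics of its adjacent relations, so retrieval can match on relation names rather than on a meaningless identifier. The function name, identifier format, and labeling scheme below are illustrative assumptions, not ARoG's actual implementation.

```python
def relation_centric_abstraction(entity_id, adjacent_relations):
    """Label an anonymous KG entity by the sorted, de-duplicated set of
    its adjacent relation names, so a retriever can match questions
    against relation semantics instead of the opaque identifier."""
    concept = "/".join(sorted(set(adjacent_relations)))
    return f"{entity_id} -> [{concept}]"

# A node known to the LLM only as "E_042", observed with two relations:
label = relation_centric_abstraction("E_042", ["directed_by", "starring", "directed_by"])
```

Under this sketch, a question about "the film directed by X starring Y" can be aligned with the abstract concept `directed_by/starring` without ever revealing the entity's name to the LLM.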

NeurIPS Conference 2025 Conference Paper

CoIDO: Efficient Data Selection for Visual Instruction Tuning via Coupled Importance-Diversity Optimization

  • Yichen Yan
  • Ming Zhong
  • Qi Zhu
  • Xiaoling Gu
  • Jinpeng Chen
  • Huan Li

Multimodal large language models (MLLMs) rely heavily on instruction tuning to align vision and language capabilities, yet the computational cost of training on large-scale datasets remains a major bottleneck. Existing data selection methods aim to mitigate this by selecting important and diverse subsets, but they often suffer from two critical drawbacks: high computational overhead from processing the entire dataset and suboptimal data selection due to separate treatment of importance and diversity. We introduce CoIDO, a novel dual-objective framework that jointly optimizes data importance and diversity to overcome these challenges. Unlike existing approaches that require costly evaluations across the whole dataset, CoIDO employs a lightweight plug-in scorer. This scorer is trained on just a small random sample of data to learn the distribution of the candidate set, drastically reducing computational demands. By leveraging a homoscedastic uncertainty-based formulation, CoIDO effectively balances importance and diversity during training, enabling the scorer to assign CoIDO scores to all data points. This unified scoring approach allows for direct ranking and selection of the most valuable subsets, completely bypassing the need for specialized algorithms. In our experiments, we trained the CoIDO scorer using only 20% of randomly sampled data. Once trained, CoIDO was applied to the entire dataset to select a 20% subset for instruction tuning. On the widely used LLaVA-1.5-7B model across ten downstream tasks, this selected subset achieved an impressive 98.2% of the performance of full-data fine-tuning, on average. Moreover, CoIDO outperforms all competitors in terms of both efficiency (lowest training FLOPs) and aggregated accuracy. Our code is available at: https://github.com/SuDIS-ZJU/CoIDO
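The abstract names a "homoscedastic uncertainty-based formulation" for balancing the two objectives but does not spell it out. A common form of such weighting (in the style of learned task-uncertainty weighting; the function and argument names here are assumptions, not CoIDO's published loss) scales each loss by a learned precision and regularizes the log-variances:

```python
import math

def coupled_objective(loss_importance, loss_diversity, log_var_imp, log_var_div):
    """Homoscedastic-uncertainty weighting of two objectives: each loss is
    scaled by a learned precision exp(-log_var), and each log-variance is
    added back as a regularizer so the precisions cannot collapse to zero."""
    return (math.exp(-log_var_imp) * loss_importance + log_var_imp
            + math.exp(-log_var_div) * loss_diversity + log_var_div)

# With both log-variances at 0, the two losses are simply summed:
total = coupled_objective(0.8, 0.5, 0.0, 0.0)  # -> 1.3
```

In training, `log_var_imp` and `log_var_div` would be learnable scalars, letting the optimizer trade off the importance and diversity terms automatically instead of hand-tuning a mixing weight.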

NeurIPS Conference 2025 Conference Paper

IR-OptSet: An Optimization-Sensitive Dataset for Advancing LLM-Based IR Optimizer

  • Zi Yang
  • Lei Qiu
  • Fang Lyu
  • Ming Zhong
  • Zhilei Chai
  • Haojie Zhou
  • Huimin Cui
  • Xiaobing Feng

Compiler optimization is essential for improving program performance, yet modern compilers still depend on manually crafted transformation rules over intermediate representations (IRs). As compilers grow in complexity, maintaining these rule-based optimizations becomes increasingly labor-intensive and difficult to scale. Recent advances in large language models (LLMs) offer a promising alternative, but their effectiveness in compiler optimization remains limited, primarily due to the lack of IR-oriented datasets that expose models to diverse transformation samples in real-world scenarios (optimization-sensitive samples), hindering LLMs from learning rich and generalizable optimization strategies. In this paper, we introduce IR-OptSet, the first public optimization-sensitive dataset for advancing LLM-based IR optimizers. It comprises 170K LLVM IR samples from open-source repositories across 8 representative optimization domains. IR-OptSet defines two core tasks: Code Analysis and Optimized Code Generation, and provides tools for correctness verification, performance evaluation, and dataset expansion. In our experiments, fine-tuning three representative LLMs on IR-OptSet leads to significant accuracy improvements across both tasks. Moreover, the LLM fine-tuned with IR-OptSet outperforms the traditional compiler at the -O3 level in 64 test cases in terms of performance. Further analysis reveals that IR-OptSet provides greater transformation diversity and representativeness than three widely used IR-oriented datasets, highlighting its potential to drive model-based IR optimization. IR-OptSet is publicly available at https://huggingface.co/datasets/YangziResearch/IR-OptSet.

NeurIPS Conference 2024 Conference Paper

ComBack: A Versatile Dataset for Enhancing Compiler Backend Development Efficiency

  • Ming Zhong
  • Fang Lyu
  • Lulin Wang
  • Hongna Geng
  • Lei Qiu
  • Huimin Cui
  • Xiaobing Feng

Compiler backends are tasked with generating executable machine code for processors. With the proliferation of diverse processors, it is imperative for programmers to tailor specific compiler backends to accommodate each one. Meanwhile, compiler backend development is a laborious and time-consuming task, lacking effective automation methods. Although language models have demonstrated strong abilities in code-related tasks, the lack of appropriate datasets for compiler backend development limits the application of language models in this field. In this paper, we introduce ComBack, the first public dataset designed for improving the compiler backend development capabilities of language models. ComBack includes 178 backends for mainstream compilers and three tasks: statement-level completion, next-statement suggestion, and code generation, representing common development scenarios. We conducted experiments by fine-tuning six pre-trained language models with ComBack, demonstrating its effectiveness in enhancing model accuracy across the three tasks. We further evaluated the top-performing model (CodeT5+) across the three tasks for new targets, comparing its accuracy with conventional methods (Fork-Flow), ChatGPT-3.5-Turbo, and Code-LLaMA-34B-Instruct. Remarkably, fine-tuned CodeT5+ with only 220M parameters on ComBack significantly outperformed Fork-Flow methods and surpassed ChatGPT and Code-LLaMA. This suggests potential efficiency improvements in compiler development. ComBack is available at https://huggingface.co/datasets/docz1105/ComBack.

TMLR Journal 2024 Journal Article

Multi-LoRA Composition for Image Generation

  • Ming Zhong
  • Yelong Shen
  • Shuohang Wang
  • Yadong Lu
  • Yizhu Jiao
  • Siru Ouyang
  • Donghan Yu
  • Jiawei Han

Low-Rank Adaptation (LoRA) is extensively utilized in text-to-image models for the accurate rendition of specific elements like distinct characters or unique styles in generated images. Nonetheless, existing methods face challenges in effectively composing multiple LoRAs, especially as the number of LoRAs to be integrated grows, thus hindering the creation of complex imagery. In this paper, we study multi-LoRA composition through a decoding-centric perspective. We present two training-free methods: LoRA Switch, which alternates between different LoRAs at each denoising step, and LoRA Composite, which simultaneously incorporates all LoRAs to guide more cohesive image synthesis. To evaluate the proposed approaches, we establish ComposLoRA, a new comprehensive testbed as part of this research. It features a diverse range of LoRA categories with 480 composition sets. Utilizing an evaluation framework based on GPT-4V, our findings demonstrate a clear improvement in performance with our methods over the prevalent baseline, particularly evident when increasing the number of LoRAs in a composition. The code, benchmarks, LoRA weights, and all evaluation details are available on our project website: https://maszhongming.github.io/Multi-LoRA-Composition.
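The LoRA Switch idea (one LoRA active at a time, alternating across denoising steps) can be sketched as a simple round-robin schedule. The function name and round-robin order are an illustrative assumption based on the abstract's description, not the released implementation.

```python
def lora_switch_schedule(lora_names, num_denoising_steps):
    """LoRA Switch keeps exactly one LoRA active per denoising step,
    cycling through the given LoRAs in round-robin order; each step's
    entry names the adapter whose weights are enabled for that step."""
    return [lora_names[step % len(lora_names)]
            for step in range(num_denoising_steps)]

schedule = lora_switch_schedule(["character", "style", "background"], 5)
# -> ["character", "style", "background", "character", "style"]
```

LoRA Composite, by contrast, would keep all LoRAs enabled at every step and combine their guidance signals, which is why it tends to produce more cohesive compositions at higher per-step cost.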

IJCAI Conference 2023 Conference Paper

Mimicking the Thinking Process for Emotion Recognition in Conversation with Prompts and Paraphrasing

  • Ting Zhang
  • Zhuang Chen
  • Ming Zhong
  • Tieyun Qian

Emotion recognition in conversation, which aims to predict the emotion for all utterances, has attracted considerable research attention in recent years. It is a challenging task since the recognition of the emotion in one utterance involves many complex factors, such as the conversational context, the speaker's background, and the subtle differences between emotion labels. In this paper, we propose a novel framework which mimics the thinking process when modeling these factors. Specifically, we first comprehend the conversational context with a history-oriented prompt to selectively gather information from predecessors of the target utterance. We then model the speaker's background with an experience-oriented prompt to retrieve similar utterances from all conversations. We finally differentiate the subtle label semantics with a paraphrasing mechanism to elicit the intrinsic label-related knowledge. We conducted extensive experiments on three benchmarks. The empirical results demonstrate the superiority of our proposed framework over the state-of-the-art baselines.

AAAI Conference 2022 Conference Paper

DialogLM: Pre-trained Model for Long Dialogue Understanding and Summarization

  • Ming Zhong
  • Yang Liu
  • Yichong Xu
  • Chenguang Zhu
  • Michael Zeng

Dialogue is an essential part of human communication and cooperation. Existing research mainly focuses on short dialogue scenarios in a one-on-one fashion. However, multi-person interactions in the real world, such as meetings or interviews, frequently run over a few thousand words. There is still a lack of corresponding research and powerful tools to understand and process such long dialogues. Therefore, in this work, we present a pre-training framework for long dialogue understanding and summarization. Considering the nature of long conversations, we propose a window-based denoising approach for generative pre-training. For a dialogue, it corrupts a window of text with dialogue-inspired noise, and guides the model to reconstruct this window based on the content of the remaining conversation. Furthermore, to process longer input, we augment the model with sparse attention which is combined with conventional attention in a hybrid manner. We conduct extensive experiments on five datasets of long dialogues, covering the tasks of dialogue summarization, abstractive question answering, and topic segmentation. Experimentally, we show that our pre-trained model DialogLM significantly surpasses the state-of-the-art models across datasets and tasks. Source code and all the pretrained models are available on our GitHub repository (https://github.com/microsoft/DialogLM).
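The core of the window-based denoising objective can be sketched as follows: pick a contiguous window of utterances, replace it with a mask, and keep the original window as the reconstruction target. This is a minimal sketch of one noise type; the actual DialogLM combines several dialogue-inspired corruptions (the function name, mask token, and sampling scheme here are assumptions).

```python
import random

def corrupt_window(turns, window_size, mask_token="<mask>", seed=None):
    """Window-based denoising for generative pre-training: choose a
    contiguous window of `window_size` utterances, replace it with a
    single mask token, and return (corrupted dialogue, original window)
    so the model learns to reconstruct the window from the rest."""
    rng = random.Random(seed)
    start = rng.randrange(len(turns) - window_size + 1)
    corrupted = turns[:start] + [mask_token] + turns[start + window_size:]
    target = turns[start:start + window_size]
    return corrupted, target

turns = ["A: hi", "B: hello", "A: agenda?", "B: budget review"]
corrupted, target = corrupt_window(turns, window_size=2, seed=7)
```

In a real pre-training pipeline, `corrupted` would be the encoder input and `target` the decoder's generation target, with the additional noises (e.g. speaker masking or turn shuffling) applied inside the window before masking.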

AAAI Conference 2021 Conference Paper

Enhancing Scientific Papers Summarization with Citation Graph

  • Chenxin An
  • Ming Zhong
  • Yiran Chen
  • Danqing Wang
  • Xipeng Qiu
  • Xuanjing Huang

Previous work on text summarization in the scientific domain has mainly focused on the content of the input document, seldom considering its citation network. However, scientific papers are full of uncommon domain-specific terms, making it almost impossible for the model to understand their true meaning without the help of the relevant research community. In this paper, we redefine the task of scientific paper summarization by utilizing the citation graph and propose a citation graph-based summarization model (CGSUM) which can incorporate the information of both the source paper and its references. In addition, we construct a novel scientific paper summarization dataset, Semantic Scholar Network (SSN), which contains 141K research papers in different domains and 661K citation relationships. The entire dataset constitutes a large connected citation graph. Extensive experiments show that our model can achieve competitive performance when compared with pretrained models even with a simple architecture. The results also indicate that the citation graph is crucial to better understanding the content of papers and generating high-quality summaries.