Arrow Research search

Author name cluster

Yao Wan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

12 papers
1 author row

Possible papers (12)

AAAI Conference 2026 Conference Paper

Are We on the Right Way to Assess Document Retrieval-Augmented Generation?

  • Wenxuan Shen
  • Mingjia Wang
  • Yaochen Wang
  • Dongping Chen
  • Junjie Yang
  • Yao Wan
  • Weiwei Lin

Retrieval-Augmented Generation (RAG) systems using Multimodal Large Language Models (MLLMs) show great promise for complex document understanding, yet their development is critically hampered by inadequate evaluation. Current benchmarks often focus on specific parts of the document RAG system and use synthetic data with incomplete ground-truth and evidence labels, and therefore fail to reflect real-world bottlenecks and challenges. To overcome these limitations, we introduce Double-Bench: a new large-scale, multilingual, and multimodal evaluation system that produces fine-grained assessments of each component within document RAG systems. It comprises 3,276 documents (72,880 pages) and 5,168 single- and multi-hop queries across 6 languages and 4 document types, with streamlined dynamic-update support for potential data-contamination issues. Queries are grounded in exhaustively scanned evidence pages and verified by human experts to ensure maximum quality and completeness. Our comprehensive experiments across 9 state-of-the-art embedding models, 4 MLLMs, and 4 end-to-end document RAG frameworks demonstrate that the gap between text and visual embedding models is narrowing, highlighting the need for stronger document retrieval models. Our findings also reveal an over-confidence dilemma in current document RAG frameworks, which tend to provide answers even without evidence support. We hope our fully open-source Double-Bench provides a rigorous foundation for future research on advanced document RAG systems. We plan to retrieve timely corpora and release new benchmarks on an annual basis.

TMLR Journal 2026 Journal Article

Wikipedia in the Era of LLMs: Evolution and Risks

  • Siming Huang
  • Yuliang Xu
  • Mingmeng Geng
  • Yao Wan
  • Dongping Chen

In this paper, we present a comprehensive analysis and monitoring framework for the impact of Large Language Models (LLMs) on Wikipedia, examining the evolution of Wikipedia through existing data and using simulations to explore potential risks. We begin by analyzing article content and page views to study the recent changes in Wikipedia and assess the impact of LLMs. Subsequently, we evaluate how LLMs affect various Natural Language Processing (NLP) tasks related to Wikipedia, including machine translation and retrieval-augmented generation (RAG). Our findings and simulation results reveal that Wikipedia articles have been affected by LLMs, with an impact of approximately 1% in certain categories. If the machine translation benchmark based on Wikipedia is influenced by LLMs, the scores of the models may become inflated, and the comparative results among models could shift. Moreover, the effectiveness of RAG might decrease if the knowledge has been contaminated by LLMs. While LLMs have not yet fully changed Wikipedia's language and knowledge structures, we believe that our empirical findings signal the need for careful consideration of potential future risks in NLP research.

TAAS Journal 2025 Journal Article

AdapCP: Collaborative Inference with Adaptive CNN Partition on Distributed Edge Servers

  • Sifan Zhao
  • Dezhong Yao
  • Yao Wan
  • Gang Wu
  • Hai Jin

Due to the limited resources of end devices, the task of Convolutional Neural Network (CNN) inference on the end side is moving towards edge-end collaboration. However, existing collaborative methods mainly focus on offloading CNN inference tasks from end devices to a single edge server, which leads to inefficient use of computational resources among nearby edge servers. Moreover, offloading the CNN inference task to a single third-party server may raise privacy concerns. To address these challenges, we propose AdapCP, a framework that introduces a collaborative and adaptive parallel acceleration strategy utilizing the end device and multiple edge servers. AdapCP consists of two stages: (1) offloading to nearby servers and (2) parallel processing of the CNN inference. For the offloading phase, we use integer linear programming to find the partition points at the inter-layer level. For the parallel phase, we first investigate intra-layer structural splitting methods tailored for both convolutional and fully connected layers. Then, we employ a Deep Deterministic Policy Gradient (DDPG) algorithm based on the Dirichlet distribution to decide the partition points. Finally, we set a periodic update index to enhance AdapCP's adaptability to dynamic environments. Empirical evaluations conducted on the Jetson Nano demonstrate that AdapCP significantly reduces the total latency of CNN inference by an average factor of 2.21× compared to existing solutions.
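The inter-layer offloading decision above can be illustrated with a toy latency model: given per-layer device and server compute times and per-layer activation sizes, pick the cut after which execution moves to the server so that end-to-end latency is minimized. AdapCP solves this with integer linear programming over real profiles; the brute-force enumeration, layer costs, and bandwidth below are illustrative stand-ins, not the paper's actual formulation.

```python
# Toy sketch of an inter-layer partition-point search (hypothetical numbers).
def best_partition(device_ms, server_ms, act_kb, bw_kbps):
    """Pick the layer index after which to offload, minimizing
    device time + activation transfer time + server time.

    act_kb[i] is the size of the activation crossing cut i
    (act_kb[0] is the raw input, for a full offload)."""
    n = len(device_ms)
    best_cut, best_lat = 0, float("inf")
    for cut in range(n + 1):          # cut == n means run everything on-device
        dev = sum(device_ms[:cut])    # layers executed locally
        srv = sum(server_ms[cut:])    # layers executed on the edge server
        xfer = 0.0 if cut == n else act_kb[cut] / bw_kbps * 1000.0  # ms
        lat = dev + xfer + srv
        if lat < best_lat:
            best_cut, best_lat = cut, lat
    return best_cut, best_lat

cut, latency = best_partition(
    device_ms=[20, 20, 20], server_ms=[1, 1, 1],
    act_kb=[100, 50, 10, 1], bw_kbps=1000)
print(f"offload after layer {cut}: {latency:.1f} ms")  # offload after layer 2: 51.0 ms
```

With these made-up profiles, cutting after layer 2 balances the slow device against the shrinking activation sizes; a real solver would additionally handle multiple servers and parallel branches.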

NeurIPS Conference 2025 Conference Paper

AnomalyCoT: A Multi-Scenario Chain-of-Thought Dataset for Multimodal Large Language Models

  • Jiaxi Cheng
  • Yuliang Xu
  • Shoupeng Wang
  • Tao Ma
  • Yuchen He
  • Jinghe Zhang
  • Sihang Cai
  • Jiawei Zhen

Industrial Anomaly Detection (IAD) is an indispensable quality control technology in modern production processes. Recently, on account of the outstanding visual comprehension and cross-domain knowledge transfer capabilities of multimodal large language models (MLLMs), existing studies have explored the application of MLLMs in the IAD domain and established several multimodal IAD datasets. However, although the latest datasets contain various fundamental IAD tasks, they formulate tasks in a general question-and-answer format that lacks a rigorous reasoning process, and they are relatively limited in the diversity of scenarios, which restricts their reliability in practical applications. In this paper, we propose AnomalyCoT, a multimodal Chain-of-Thought (CoT) dataset for multi-scenario IAD tasks. It consists of 37,565 IAD samples with CoT data and is defined by challenging composite IAD tasks. Meanwhile, the CoT data for each sample provides precise coordinates of anomaly regions, thereby improving visual comprehension of defects across different types. AnomalyCoT is constructed through a systematic pipeline involving multiple manual operations. Based on AnomalyCoT, we conducted a comprehensive evaluation of various mainstream MLLMs and fine-tuned representative models in different ways. The final results show that Gemini-2.0-flash achieved the best performance in the direct evaluation with an accuracy of 59.6%, while Llama 3.2-Vision achieved the best performance after LoRA fine-tuning with an accuracy of 94.0%. Among all the fine-tuned models, the average accuracy improvement reaches 36.5%, demonstrating the potential of integrating CoT datasets in future applications within the IAD field. The code and data are available at https://github.com/Zhaolutuan/AnomalyCoT.

NeurIPS Conference 2025 Conference Paper

Seeking and Updating with Live Visual Knowledge

  • Mingyang Fu
  • Yuyang Peng
  • Dongping Chen
  • Zetong Zhou
  • Benlin Liu
  • Yao Wan
  • Zhou Zhao
  • Philip S Yu

The visual world around us constantly evolves, from real-time news and social media trends to global infrastructure changes visible through satellite imagery and augmented reality enhancements. However, Multimodal Large Language Models (MLLMs), which automate many tasks, struggle to stay current, limited by the cutoff dates of their fixed training datasets. To quantify this stagnation, we introduce LiveVQA, a first-of-its-kind dataset featuring 107,143 samples across 12 categories, specifically designed to support research on both seeking and updating live visual knowledge. Drawing from recent news articles, video platforms, and academic publications from April 2024 to May 2025, LiveVQA enables evaluation of how models handle the latest visual information beyond their knowledge boundaries and how current methods help to update them. Our comprehensive benchmarking of 17 state-of-the-art MLLMs reveals significant performance gaps on content beyond the knowledge cutoff, while tool-use and agentic visual-seeking frameworks yield an average improvement of 327%. Furthermore, we explore parameter-efficient fine-tuning methods to update MLLMs with new visual knowledge, diving deeply into the critical balance between adapter capacity and model capability. All experimental data and source code are publicly available at https://livevqa.github.io.

NeurIPS Conference 2024 Conference Paper

HonestLLM: Toward an Honest and Helpful Large Language Model

  • Chujie Gao
  • Siyuan Wu
  • Yue Huang
  • Dongping Chen
  • Qihui Zhang
  • Zhengyan Fu
  • Yao Wan
  • Lichao Sun

Large Language Models (LLMs) have achieved remarkable success across various industries and applications, owing to their exceptional generative capabilities. Nevertheless, honesty and helpfulness, which ensure safe and useful real-world deployments, remain longstanding cornerstones in practice. In this paper, we first establish comprehensive principles for honest LLMs and create HoneSet, a dataset of 930 queries across six categories designed to evaluate LLMs' ability to maintain honesty. We then improve the honesty and helpfulness of LLMs in both training-free and fine-tuning settings. Specifically, we propose a training-free method named Curiosity-Driven Prompting, which enables LLMs to express their internal confusion and uncertainty about a given query and then optimize their responses. Moreover, we propose a two-stage fine-tuning approach, inspired by curriculum learning, to enhance honesty and helpfulness: the method first teaches LLMs to distinguish between honest and dishonest responses, and then trains them to respond more helpfully. Experimental results demonstrate that both proposed methods improve the helpfulness of LLMs while maintaining their honesty. Our research paves the way for more reliable and trustworthy LLMs in real-world applications.

NeurIPS Conference 2024 Conference Paper

Pandora's Box: Towards Building Universal Attackers against Real-World Large Vision-Language Models

  • Daizong Liu
  • Mingyu Yang
  • Xiaoye Qu
  • Pan Zhou
  • Xiang Fang
  • Keke Tang
  • Yao Wan
  • Lichao Sun

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across a wide range of multimodal understanding tasks. Nevertheless, these models are susceptible to adversarial examples. In real-world applications, existing LVLM attackers generally rely on detailed prior knowledge of the model to generate effective perturbations. Moreover, these attacks are task-specific, leading to significant costs for designing perturbations. Motivated by this research gap and practical demands, in this paper we make the first attempt to build a universal attacker against real-world LVLMs, focusing on two critical aspects: (i) restricting access to only the LVLM inputs and outputs, and (ii) devising a universal adversarial patch that is task-agnostic and can deceive any LVLM-driven task when applied to various inputs. Specifically, we start by initializing the location and pattern of the adversarial patch through random sampling, guided by the semantic distance between the model's output and the target label. Subsequently, we maintain a consistent patch location while refining the pattern to enhance semantic resemblance to the target. In particular, our approach incorporates a diverse set of LVLM task inputs as query samples to approximate the patch gradient, capitalizing on the importance of distinct inputs. In this way, the optimized patch is universally adversarial across different tasks and prompts, relying solely on gradient estimates queried from the model. Extensive experiments verify the strong universal adversarial capabilities of our proposed attack on prevalent LVLMs, including LLaVA, MiniGPT-4, Flamingo, and BLIP-2, spanning a spectrum of tasks, all without delving into the details of the model structures.
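The query-only patch refinement described above can be illustrated with a generic zeroth-order (finite-difference) gradient estimate over random directions. The `query_loss` below is a toy quadratic stand-in for querying a real LVLM and measuring the semantic distance between its output and the target label; the sampling scheme is a standard black-box sketch, not the paper's exact procedure.

```python
# Generic zeroth-order optimization sketch: only scalar loss queries,
# no access to model internals or gradients.
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=8)  # stand-in for the target label's embedding

def query_loss(patch):
    # Placeholder for: apply patch to inputs, query the LVLM, and
    # measure semantic distance between its output and the target.
    return float(np.sum((patch - target) ** 2))

def estimate_grad(patch, sigma=0.01, n_dirs=32):
    """Finite-difference gradient estimate from paired loss queries
    along random Gaussian directions."""
    grad = np.zeros_like(patch)
    for _ in range(n_dirs):
        u = rng.normal(size=patch.shape)
        delta = query_loss(patch + sigma * u) - query_loss(patch - sigma * u)
        grad += delta / (2 * sigma) * u
    return grad / n_dirs

patch = np.zeros(8)
for _ in range(200):                      # query-based refinement loop
    patch -= 0.05 * estimate_grad(patch)

print(query_loss(patch))  # far below the initial loss
```

Each refinement step costs 2 × `n_dirs` queries, which is why such attacks reuse diverse task inputs as query samples rather than estimating a fresh gradient per task.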

AAAI Conference 2022 Conference Paper

DANets: Deep Abstract Networks for Tabular Data Classification and Regression

  • Jintai Chen
  • Kuanlun Liao
  • Yao Wan
  • Danny Z. Chen
  • Jian Wu

Tabular data are ubiquitous in real-world applications. Although many commonly used neural components (e.g., convolution) and extensible neural networks (e.g., ResNet) have been developed by the machine learning community, few of them are effective for tabular data and few designs are adequately tailored for tabular data structures. In this paper, we propose a novel and flexible neural component for tabular data, called the Abstract Layer (ABSTLAY), which learns to explicitly group correlative input features and generate higher-level features for semantics abstraction. We also design a structure re-parameterization method to compress a trained ABSTLAY, reducing the computational complexity by a clear margin in the inference phase. A special basic block is built using ABSTLAYs, and we construct a family of Deep Abstract Networks (DANets) for tabular data classification and regression by stacking such blocks. In DANets, a special shortcut path is introduced to fetch information from raw tabular features, assisting feature interactions across different levels. Comprehensive experiments on seven real-world tabular datasets show that our ABSTLAY and DANets are effective for tabular data classification and regression, and their computational complexity is superior to that of competitive methods. Besides, we evaluate the performance gains of DANets as they go deeper, verifying the extendibility of our method. Our code is available at https://github.com/WhatAShot/DANet.
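The feature-grouping idea behind ABSTLAY can be sketched as a soft assignment of input features to groups, followed by a projection of the grouped features to a higher level. In DANets the grouping masks are learned end-to-end; in this sketch they are random, and all names and shapes are illustrative only.

```python
# Toy numpy sketch of soft feature grouping + abstraction
# (random masks stand in for DANets' learned ones).
import numpy as np

rng = np.random.default_rng(0)

def abstlay(x, n_groups, out_dim):
    """x: (batch, n_features) -> (batch, out_dim) abstracted features."""
    n_feat = x.shape[1]
    # Soft feature-group assignment: each group is a softmax over features.
    logits = rng.normal(size=(n_groups, n_feat))
    masks = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    grouped = x @ masks.T                       # (batch, n_groups)
    # Project the correlative groups to higher-level features.
    w = rng.normal(size=(n_groups, out_dim)) * 0.1
    return np.maximum(grouped @ w, 0.0)         # ReLU nonlinearity

x = rng.normal(size=(4, 10))                    # 4 rows, 10 raw features
h = abstlay(x, n_groups=3, out_dim=8)
print(h.shape)  # (4, 8)
```

Stacking such blocks, with a shortcut back to the raw features, gives the deep-abstraction structure the abstract describes.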

TIST Journal 2022 Journal Article

FedBERT: When Federated Learning Meets Pre-training

  • Yuanyishu Tian
  • Yao Wan
  • Lingjuan Lyu
  • Dezhong Yao
  • Hai Jin
  • Lichao Sun

The fast growth of pre-trained models (PTMs) has brought natural language processing (NLP) to a new era, and PTMs have become a dominant technique for various NLP applications. Any user can download the weights of a PTM and then fine-tune them for a task on the local side. However, pre-training a model relies heavily on access to large-scale training data and requires a vast amount of computing resources. These strict requirements make it impossible for any single client to pre-train such a model. To enable clients with limited computing capability to participate in pre-training a large model, we propose a new learning approach, FedBERT, that takes advantage of federated learning and split learning to pre-train BERT in a federated way. FedBERT prevents sharing of raw data while obtaining excellent performance. Extensive experiments on seven GLUE tasks demonstrate that FedBERT maintains its effectiveness without communicating the sensitive local data of clients.
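The split-learning idea behind this setup can be sketched in a few lines: each client keeps its input layer (and thus its raw tokens) local, and only intermediate hidden states cross the network to a server that runs the heavy model body. The shapes and the single-matrix "body" below are hypothetical simplifications, not the actual FedBERT architecture.

```python
# Minimal numpy sketch of a split-learning forward pass:
# raw data stays on the client; only activations reach the server.
import numpy as np

rng = np.random.default_rng(0)

class Client:
    """Holds the embedding layer locally (one per federated client)."""
    def __init__(self, vocab, dim):
        self.emb = rng.normal(size=(vocab, dim)) * 0.02

    def forward(self, token_ids):
        # Only hidden states, never the raw tokens/text, are sent out.
        return self.emb[token_ids]

class Server:
    """Runs the compute-heavy shared body on clients' hidden states."""
    def __init__(self, dim):
        self.w = rng.normal(size=(dim, dim)) * 0.02

    def forward(self, hidden):
        return np.maximum(hidden @ self.w, 0.0)  # toy "transformer body"

client, server = Client(vocab=100, dim=16), Server(dim=16)
h = client.forward(np.array([1, 5, 7]))  # local embedding lookup
out = server.forward(h)                  # server-side heavy compute
print(out.shape)  # (3, 16)
```

In the federated variant, many such clients share the server-side body, so each contributes to pre-training without ever exposing its local corpus.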

AAAI Conference 2021 Conference Paper

KG-BART: Knowledge Graph-Augmented BART for Generative Commonsense Reasoning

  • Ye Liu
  • Yao Wan
  • Lifang He
  • Hao Peng
  • Philip S. Yu

Generative commonsense reasoning, which aims to empower machines to generate sentences with the capacity to reason over a set of concepts, is a critical bottleneck for text generation. Even state-of-the-art pre-trained language generation models struggle at this task and often produce implausible and anomalous sentences. One reason is that they rarely consider incorporating knowledge graphs, which can provide rich relational information among commonsense concepts. To promote the ability of commonsense reasoning for text generation, we propose KG-BART, a novel knowledge graph-augmented pre-trained language generation model that encompasses the complex relations of concepts through the knowledge graph and produces more logical and natural sentences as output. Moreover, KG-BART can leverage graph attention to aggregate rich concept semantics, enhancing model generalization on unseen concept sets. Experiments on the benchmark CommonGen dataset verify the effectiveness of our proposed approach in comparison with several strong pre-trained language generation models; in particular, KG-BART outperforms BART by 5.80 and 4.60 points in terms of BLEU-3 and BLEU-4, respectively. Moreover, we show that the context generated by our model can serve as background scenarios to benefit downstream commonsense QA tasks.