Arrow Research search

Author name cluster

Nan Tang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

11 papers
2 author rows

Possible papers (11)

AAAI Conference 2026 Conference Paper

Explainable Oracle Bone Script Recognition via Multimodal Pictographic Reasoning

  • Yin Wu
  • Zhengxuan Zhang
  • Jiayu Chen
  • Chang Xu
  • Yuyu Luo
  • Nan Tang
  • Hui Xiong

Oracle Bone Script, East Asia's earliest mature writing system from over 3,500 years ago, encodes ancient cognition through visual metaphors, yet remains largely undeciphered and inaccessible, severing modern society from its cultural roots. Traditional AI methods, while accurate in classification, treat glyphs as opaque data, neglecting their pictographic essence and failing to foster public understanding—exacerbating a heritage crisis amid linguistic evolution. We pioneer a paradigm shift toward AI-driven cultural democratization, introducing OracleVis, the first human-validated multimodal dataset of glyph-image-explanation triplets, curated through expert collaborations to overcome data scarcity, bias, and incompleteness in archaeological sources. Building on this, OBS-VM, an explainability-centric multimodal large language model fine-tuned on Qwen2-VL-7B, models pictographic reasoning by balancing semantic fidelity with interpretive transparency, transforming black-box predictions into cognition-aligned narratives. Rigorous evaluations, including benchmarks and a user study with 24 non-experts, reveal our system's superiority: it outperforms GPT-4o in pictographic rationality (3.79 vs. 3.58 in human evaluation) and achieves a 35.3% relative improvement in recognition accuracy, while interactive learning boosts knowledge gains (+5.5 vs. +1.7), interest (+1.9 vs. +0.4), and confidence (+2.0 vs. +0.3) over static methods. This work illuminates AI's potential to bridge ancient wisdom and contemporary audiences, redefining heritage preservation as an inclusive, socially impactful endeavor that turns cultural alienation into enlightened engagement.

TMLR Journal 2025 Journal Article

CyberThreat-Eval: Can Large Language Models Automate Real-World Threat Research?

  • Xiangsen Chen
  • Xuan Feng
  • Shuo Chen
  • Matthieu Maitre
  • Sudipto Rakshit
  • Diana Duvieilh
  • Ashley Picone
  • Nan Tang

Analyzing Open Source Intelligence (OSINT) from large volumes of data is critical for drafting and publishing comprehensive CTI reports. This process usually follows a three-stage workflow: triage, deep search, and TI drafting. While Large Language Models (LLMs) offer a promising route toward automation, existing benchmarks still have limitations. These benchmarks often consist of tasks that do not reflect real-world analyst workflows. For example, human analysts rarely receive tasks in the form of multiple-choice questions. Also, existing benchmarks often rely on model-centric metrics that emphasize lexical overlap rather than the actionable, detailed insights essential for security analysts. Moreover, they typically fail to cover the complete three-stage workflow. To address these issues, we introduce CyberThreat-Eval, collected from the daily CTI workflow of a world-leading company. This expert-annotated benchmark assesses LLMs on practical tasks across all three stages and uses analyst-centric metrics that measure factual accuracy, content quality, and operational cost. Our evaluation with this benchmark reveals important limitations of current LLMs. For example, LLMs often lack the nuanced expertise required to handle complex details and struggle to distinguish between correct and incorrect information. To address these challenges, the CTI workflow incorporates both external ground-truth databases and human expert knowledge. TRA allows human experts to iteratively provide feedback for continuous improvement. The code of the CyberThreat-Eval benchmark is available at https://github.com/secintelligence/CyberThreat-Eval.

TMLR Journal 2025 Journal Article

Interactive Large Language Models for Reliable Answering under Incomplete Context

  • Jing-Cheng Pang
  • Heng-Bo Fan
  • Pengyuan Wang
  • Jia-Hao Xiao
  • Nan Tang
  • Si-Hang Yang
  • Chengxing Jia
  • Ming-Kun Xie

The rise of large language models (LLMs) has revolutionized the way humans interact with artificial intelligence systems. However, their reliability in sensitive applications—such as personal consultations or clinical decision-making—remains limited. A critical shortfall lies in LLMs’ inherent lack of interactivity: these models generate responses even when essential context or domain-specific knowledge is absent, risking inaccurate or misleading outputs. A potential approach to mitigate this issue is to enable LLMs to pose clarifying questions, thereby uncovering the missing information required to provide accurate responses. However, previous methods tend to greedily prompt LLMs to ask questions, which burdens the user with potentially irrelevant questions and makes the system less flexible. In this paper, we introduce LaMSeI (Language Model with Selective Interaction), a method that enhances LLMs’ ability to judge when interaction is necessary under ambiguous or incomplete contexts. LaMSeI measures the LLM’s uncertainty about the user query and interacts with the user only when that uncertainty is high. Additionally, we incorporate active learning techniques to select the most informative questions from a pool of candidates, effectively uncovering the missing context. Our empirical studies across various challenging question answering benchmarks, where LLMs are posed queries with incomplete context, demonstrate the effectiveness of LaMSeI: it improves answer accuracy from 31.9% to 50.9%, outperforming other leading question-answering frameworks. Moreover, in experiments involving human participants, LaMSeI consistently generates answers superior or comparable to baselines in more than 82% of cases. We further verify the performance of LaMSeI on various LLMs, such as LLAMA2, LLAMA3, Vicuna, and GPT-3.5, highlighting its ability to improve interactive language models.
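
The core mechanism described above is uncertainty-gated interaction: answer directly when the model is confident, ask a clarifying question otherwise. The sketch below illustrates that idea with a toy entropy-over-samples proxy for uncertainty; the `sample_answers` stub, the threshold, and the candidate-selection step are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of uncertainty-gated clarification (not the paper's code).
# `sample_answers` and `candidates` are stand-ins for real LLM calls and outputs.
from collections import Counter
import math

def sample_answers(query: str, k: int = 8) -> list[str]:
    """Stand-in for sampling k answers from an LLM at temperature > 0."""
    return ["Paris", "Paris", "Lyon", "Paris", "Paris", "Marseille", "Paris", "Paris"]

def answer_entropy(answers: list[str]) -> float:
    """Shannon entropy over sampled answers: higher means more uncertain."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def answer_or_ask(query: str, candidates: list[str], threshold: float = 0.8) -> str:
    answers = sample_answers(query)
    if answer_entropy(answers) <= threshold:
        # Confident enough: answer directly with the majority vote.
        return Counter(answers).most_common(1)[0][0]
    # Uncertain: ask a clarifying question. The paper scores candidates to pick
    # the most informative one; here we simply take the first as a placeholder.
    return f"Clarifying question: {candidates[0]}"

print(answer_or_ask("Which city should I visit?", ["Which country are you traveling in?"]))
```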

ICLR Conference 2025 Conference Paper

Learning View-invariant World Models for Visual Robotic Manipulation

  • Jing-Cheng Pang
  • Nan Tang
  • Kaiyuan Li
  • Yuting Tang
  • Xin-Qiang Cai
  • Zhen-Yu Zhang
  • Gang Niu 0001
  • Masashi Sugiyama

Robotic manipulation tasks often rely on visual inputs from cameras to perceive the environment. However, previous approaches still suffer from performance degradation when the camera’s viewpoint changes during manipulation. In this paper, we propose ReViWo (Representation learning for View-invariant World model), which leverages multi-view data to learn robust representations for control under viewpoint disturbance. ReViWo uses an autoencoder framework that reconstructs target images through an architecture combining a view-invariant representation (VIR) with a view-dependent representation. To train ReViWo, we collect multi-view data in simulators with known view labels and simultaneously train on Open X-Embodiment datasets without view labels. The VIR is then used to train a world model on pre-collected manipulation data and a policy through interaction with the world model. We evaluate the effectiveness of ReViWo in various viewpoint disturbance scenarios, including control under novel camera positions and frequent camera shaking, using the Meta-world & PandaGym environments, and we also conduct experiments on a real-world ALOHA robot. The results demonstrate that ReViWo maintains robust performance under viewpoint disturbance, while baseline methods suffer from significant performance degradation. Furthermore, we show that the VIR captures task-relevant state information and remains stable for observations from novel viewpoints, validating the efficacy of the ReViWo approach.
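
To make the representation split concrete, here is a minimal sketch of a two-branch autoencoder in this spirit: the decoder reconstructs one view from the view-invariant code of another view plus that view's own view-dependent code. The toy MLP backbone, layer sizes, and training objective are assumptions for illustration, not ReViWo's actual architecture.

```python
# Minimal two-branch autoencoder sketch (view-invariant + view-dependent codes).
import torch
import torch.nn as nn

class ViewSplitAutoencoder(nn.Module):
    def __init__(self, latent: int = 32):
        super().__init__()
        # Shared image encoder backbone (toy MLP over flattened 64x64 RGB frames).
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, 256), nn.ReLU())
        self.to_invariant = nn.Linear(256, latent)   # task/state content shared across views
        self.to_view = nn.Linear(256, latent)        # camera/view-specific factors
        self.decoder = nn.Sequential(
            nn.Linear(2 * latent, 256), nn.ReLU(), nn.Linear(256, 64 * 64 * 3)
        )

    def forward(self, img_a: torch.Tensor, img_b: torch.Tensor) -> torch.Tensor:
        # Reconstruct view B from the *invariant* code of view A plus the
        # *view-dependent* code of view B, pushing shared scene content into
        # the invariant branch.
        h_a, h_b = self.backbone(img_a), self.backbone(img_b)
        z_inv, z_view = self.to_invariant(h_a), self.to_view(h_b)
        recon = self.decoder(torch.cat([z_inv, z_view], dim=-1))
        return recon.view(img_b.shape)

model = ViewSplitAutoencoder()
a = torch.randn(4, 3, 64, 64)  # same scenes, camera A
b = torch.randn(4, 3, 64, 64)  # same scenes, camera B
loss = nn.functional.mse_loss(model(a, b), b)
loss.backward()
```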

NeurIPS Conference 2025 Conference Paper

nvBench 2.0: Resolving Ambiguity in Text-to-Visualization through Stepwise Reasoning

  • Tianqi Luo
  • Chuhan Huang
  • Leixian Shen
  • Boyan Li
  • Shuyu Shen
  • Wei Zeng
  • Nan Tang
  • Yuyu Luo

Text-to-Visualization (Text2VIS) enables users to create visualizations from natural language queries, making data insights more accessible. However, Text2VIS faces challenges in interpreting ambiguous queries, as users often express their visualization needs in imprecise language. To address this challenge, we introduce nvBench 2.0, a new benchmark designed to evaluate Text2VIS systems in scenarios involving ambiguous queries. nvBench 2.0 includes 7,878 natural language queries and 24,076 corresponding visualizations, derived from 780 tables across 153 domains. It is built using a controlled ambiguity-injection pipeline that generates ambiguous queries through a reverse-generation workflow. By starting with unambiguous seed visualizations and selectively injecting ambiguities, the pipeline yields multiple valid interpretations for each query, with each ambiguous query traceable to its corresponding visualization through step-wise reasoning paths. We evaluate various Large Language Models (LLMs) on their ability to perform ambiguous Text2VIS tasks using nvBench 2.0. We also propose Step-Text2Vis, an LLM-based model trained on nvBench 2.0, which enhances performance in ambiguous scenarios through step-wise preference optimization. Our results show that Step-Text2Vis outperforms all baselines, setting a new state of the art for ambiguous Text2VIS tasks. Our source code and data are available at https://nvbench2.github.io/
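
A toy illustration of the reverse-generation idea: take an unambiguous seed visualization, drop details from its query, and enumerate the chart specifications that remain valid interpretations. The spec fields and the specific ambiguity types below are invented for illustration and are not the benchmark's schema.

```python
# Reverse-generation sketch: one underspecified query, several valid chart specs.
import copy

seed_spec = {"chart": "bar", "x": "country", "y": "sales", "agg": "sum"}  # unambiguous seed

def inject_ambiguity(spec: dict):
    """Drop details so several chart specs remain valid for one query."""
    ambiguous_query = "Show sales per country."          # chart type and aggregate left unstated
    interpretations = []
    for chart in ("bar", "line"):                        # chart-type ambiguity
        for agg in ("sum", "avg"):                       # aggregation ambiguity
            candidate = copy.deepcopy(spec)
            candidate.update({"chart": chart, "agg": agg})
            interpretations.append(candidate)
    return ambiguous_query, interpretations

query, valid_specs = inject_ambiguity(seed_spec)
print(query)
for s in valid_specs:
    print(s)
```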

IJCAI Conference 2025 Conference Paper

RAMer: Reconstruction-based Adversarial Model for Multi-party Multi-modal Multi-label Emotion Recognition

  • Xudong Yang
  • Yizhang Zhu
  • Hanfeng Liu
  • Zeyi Wen
  • Nan Tang
  • Yuyu Luo

Conventional multi-modal multi-label emotion recognition (MMER) assumes complete access to visual, textual, and acoustic modalities. However, real-world multi-party settings often violate this assumption, as non-speakers frequently lack acoustic and textual inputs, leading to a significant degradation in model performance. Existing approaches also tend to unify heterogeneous modalities into a single representation, overlooking each modality’s unique characteristics. To address these challenges, we propose RAMer (Reconstruction-based Adversarial Model for Emotion Recognition), which refines multi-modal representations by not only exploring modality commonality and specificity but, crucially, by leveraging reconstructed features, enhanced by contrastive learning, to overcome data incompleteness and enrich feature quality. RAMer also introduces a personality auxiliary task to complement missing modalities using modality-level attention, improving emotion reasoning. To further strengthen the model's ability to capture label and modality interdependency, we propose a stack shuffle strategy to enrich correlations between labels and modality-specific features. Experiments on three benchmarks, i.e., MEmoR, CMU-MOSEI, and M³ED, demonstrate that RAMer achieves state-of-the-art performance in dyadic and multi-party MMER scenarios.

NeurIPS Conference 2025 Conference Paper

Time Travel is Cheating: Going Live with DeepFund for Real-Time Fund Investment Benchmarking

  • Changlun Li
  • Yao SHI
  • Chen Wang
  • Qiqi Duan
  • Runke RUAN
  • Weijie Huang
  • Haonan Long
  • Lijun Huang

Large Language Models (LLMs) have demonstrated notable capabilities across financial tasks, including financial report summarization, earnings call transcript analysis, and asset classification. However, their real-world effectiveness in managing complex fund investment remains inadequately assessed. A fundamental limitation of existing benchmarks for evaluating LLM-driven trading strategies is their reliance on historical back-testing, which inadvertently enables LLMs to "time travel"—leveraging future information embedded in their training corpora—resulting in possible information leakage and overly optimistic performance estimates. To address this issue, we introduce DeepFund, a live fund benchmark designed to rigorously evaluate LLMs under real-time market conditions. Using a multi-agent architecture, DeepFund connects directly to real-time stock market data—specifically data published after each model’s pretraining cutoff—to ensure fair and leakage-free evaluation. Empirical tests on nine flagship LLMs from leading global institutions across multiple investment dimensions—including ticker-level analysis, investment decision-making, portfolio management, and risk control—reveal significant practical challenges. Notably, even cutting-edge models such as DeepSeek-V3 and Claude-3.7-Sonnet incur net trading losses within DeepFund's real-time evaluation environment, underscoring the present limitations of LLMs for active fund management. Our code is available at https://github.com/HKUSTDial/DeepFund.
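
The leakage-free setup hinges on a simple rule: a model only ever sees market data published after its own pretraining cutoff. A minimal sketch of that filter, with made-up cutoff dates and records, is shown below; it is not DeepFund's actual data pipeline.

```python
# "No time travel" sketch: filter a live feed by each model's pretraining cutoff.
from datetime import date

PRETRAIN_CUTOFF = {              # assumed cutoffs, purely illustrative
    "model-a": date(2024, 6, 1),
    "model-b": date(2024, 12, 1),
}

market_feed = [                  # (publication date, ticker, price) records from a live feed
    (date(2024, 5, 20), "AAPL", 190.1),
    (date(2024, 11, 3), "AAPL", 221.4),
    (date(2025, 2, 14), "AAPL", 233.9),
]

def visible_records(model: str):
    """Only records strictly after the model's cutoff may enter its context."""
    cutoff = PRETRAIN_CUTOFF[model]
    return [r for r in market_feed if r[0] > cutoff]

for m in PRETRAIN_CUTOFF:
    print(m, visible_records(m))
```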

NeurIPS Conference 2024 Conference Paper

Are Large Language Models Good Statisticians?

  • Yizhang Zhu
  • Shiyin Du
  • Boyan Li
  • Yuyu Luo
  • Nan Tang

Large Language Models (LLMs) have demonstrated impressive capabilities across a range of scientific tasks including mathematics, physics, and chemistry. Despite their successes, the effectiveness of LLMs in handling complex statistical tasks remains systematically under-explored. To bridge this gap, we introduce StatQA, a new benchmark designed for statistical analysis tasks. StatQA comprises 11,623 examples tailored to evaluate LLMs' proficiency in specialized statistical tasks and their applicability assessment capabilities, particularly for hypothesis testing methods. We systematically experiment with representative LLMs using various prompting strategies and show that even state-of-the-art models such as GPT-4o achieve a best performance of only 64.83%, indicating significant room for improvement. Notably, while open-source LLMs (e.g., LLaMA-3) show limited capability, fine-tuned models exhibit marked improvements, outperforming all in-context learning-based methods (e.g., GPT-4o). Moreover, our comparative human experiments highlight a striking contrast in error types between LLMs and humans: LLMs primarily make applicability errors, whereas humans mostly make statistical task confusion errors. This divergence highlights distinct areas of proficiency and deficiency, suggesting that combining LLM and human expertise could lead to complementary strengths, inviting further investigation into their collaborative potential. Our source code and data are available at https://statqa.github.io/.
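
One plausible way to score the applicability-assessment part of such a benchmark is to compare the set of statistical methods a model proposes against a gold set. The sketch below uses a set-level F1 for this; the actual StatQA metric and label format may differ.

```python
# Set-level F1 between predicted and gold applicable statistical methods (illustrative).
def method_selection_score(predicted: set[str], gold: set[str]) -> float:
    """F1 over method sets: rewards proposing applicable tests and only those."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {"Welch t-test", "Mann-Whitney U test"}
pred = {"Welch t-test", "Chi-square test"}
print(method_selection_score(pred, gold))  # 0.5
```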

NeurIPS Conference 2024 Conference Paper

CRAG - Comprehensive RAG Benchmark

  • Xiao Yang
  • Kai Sun
  • Hao Xin
  • Yushi Sun
  • Nikita Bhalla
  • Xiangsen Chen
  • Sajal Choudhary
  • Rongze D. Gui

Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution to alleviate Large Language Models' (LLMs') lack of knowledge. Existing RAG datasets, however, do not adequately represent the diverse and dynamic nature of real-world Question Answering (QA) tasks. To bridge this gap, we introduce the Comprehensive RAG Benchmark (CRAG), a factual question answering benchmark of 4,409 question-answer pairs and mock APIs to simulate web and Knowledge Graph (KG) search. CRAG is designed to encapsulate a diverse array of questions across five domains and eight question categories, reflecting varied entity popularity from popular to long-tail, and temporal dynamism ranging from years to seconds. Our evaluation on this benchmark highlights the gap to fully trustworthy QA: most advanced LLMs achieve no more than 34% accuracy on CRAG, and adding RAG in a straightforward manner improves accuracy only to 44%. State-of-the-art industry RAG solutions answer only 63% of questions without any hallucination. CRAG also reveals much lower accuracy in answering questions about facts with higher dynamism, lower popularity, or higher complexity, suggesting future research directions. The CRAG benchmark laid the groundwork for a KDD Cup 2024 challenge, which attracted thousands of participants and submissions. We are committed to maintaining CRAG to serve research communities in advancing RAG and general QA solutions. CRAG is available at https://github.com/facebookresearch/CRAG/.
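
The mock-API design means the system under test retrieves from fixed stand-in services rather than the live web, so every run sees the same evidence. Below is a toy stand-in for that idea; the real CRAG APIs and their schemas differ.

```python
# Toy mock retrieval endpoints (web + knowledge graph) for reproducible RAG evaluation.
MOCK_KG = {
    "Marie Curie": {"born": "1867-11-07", "field": "physics and chemistry"},
}
MOCK_WEB = {
    "marie curie nobel": ["She won Nobel Prizes in Physics (1903) and Chemistry (1911)."],
}

def kg_lookup(entity: str) -> dict:
    """Mock knowledge-graph endpoint: structured facts keyed by entity name."""
    return MOCK_KG.get(entity, {})

def web_search(query: str) -> list[str]:
    """Mock web-search endpoint: returns snippet strings for a query."""
    return MOCK_WEB.get(query.lower(), [])

# A RAG system under test calls these instead of live services, so every run
# retrieves from exactly the same evidence pool.
print(kg_lookup("Marie Curie"))
print(web_search("Marie Curie Nobel"))
```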

NeurIPS Conference 2024 Conference Paper

KALM: Knowledgeable Agents by Offline Reinforcement Learning from Large Language Model Rollouts

  • Jing-Cheng Pang
  • Si-Hang Yang
  • Kaiyuan Li
  • Xiong-Hui Chen
  • Nan Tang
  • Yang Yu

Reinforcement learning (RL) traditionally trains agents using interaction data, which limits their capabilities to the scope of the training data. To create more knowledgeable agents, leveraging knowledge from large language models (LLMs) is a promising direction. Despite various attempts to combine LLMs with RL, there is commonly a semantic gap between action signals and LLM tokens, which hinders their integration. This paper introduces a novel approach, KALM (Knowledgeable Agents from Language Model Rollouts), to learn knowledgeable agents by bridging this gap. KALM extracts knowledge from LLMs in the form of imaginary rollouts, which agents can learn from through offline RL. To overcome the limitation that LLMs are inherently text-based and may be incompatible with numerical environmental data, KALM fine-tunes the LLM to perform bidirectional translation between textual goals and rollouts. This process enables the LLM to understand the environment better, facilitating the generation of meaningful rollouts. Experiments on robotic manipulation tasks demonstrate that KALM allows agents to rephrase complex goals and tackle novel tasks requiring new optimal behaviors. KALM achieves a 46% success rate in completing 1,400 diverse novel goals, significantly outperforming the 26% success rate of baseline methods. Project homepage: https://kalmneurips2024.github.io.
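
The central idea is to treat LLM-imagined trajectories as extra offline RL data. The sketch below stubs out the rollout generator and appends its outputs to an offline buffer; the transition format, reward scheme, and the stub itself are assumptions for illustration, not KALM's implementation.

```python
# Rollouts-as-data sketch: LLM-imagined trajectories feed an offline RL buffer.
import random

def llm_imagine_rollout(goal: str, horizon: int = 5):
    """Stand-in for a fine-tuned LLM translating a textual goal into a rollout."""
    rollout = []
    for t in range(horizon):
        state = [random.random() for _ in range(3)]        # toy numeric observation
        action = [random.random() for _ in range(2)]       # toy action vector
        reward = 1.0 if t == horizon - 1 else 0.0          # sparse goal-reaching reward
        rollout.append({"goal": goal, "state": state, "action": action, "reward": reward})
    return rollout

offline_buffer = []  # real interaction data would already live here
for goal in ["open the drawer", "push the block to the target"]:
    offline_buffer.extend(llm_imagine_rollout(goal))

print(len(offline_buffer), "transitions available for offline RL")
```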