Arrow Research search

Author name cluster

Wenjun Wu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

15 papers
2 author rows

Possible papers

15

AAAI Conference 2026 Conference Paper

AerialVLA: A Vision-Language-Action Model for Aerial Navigation with Online Dialogue

  • Jinyu Chen
  • Hongyu Li
  • Zongheng Tang
  • Xiaoduo Li
  • Wenjun Wu
  • Si Liu

Visual Dialogue Navigation (VDN) aims to enable agents to reach target locations through dialogue with humans. Integrating VDN into Unmanned Aerial Vehicle (UAV) systems enhances human-machine interaction by enabling intuitive, hands-free operation, unlocking a broad range of applications. However, existing VDN models for UAVs can only navigate based on dialogue history and lack the proactive interaction capabilities needed to correct trajectories. Moreover, their sequential observation-history recording mechanism struggles to accurately localize landmarks observed in the historical context, leading to ineffective use of referential information in new user instructions. To address these limitations, we present AerialVLA, an end-to-end UAV navigation framework integrating dialogue comprehension, action decision-making, and navigational question generation. AerialVLA comprises three core components: i) a Progress-Driven Navigation-Query Alternation mechanism that autonomously determines the optimal questioning timing through navigation progress estimation; ii) a History Spatial-Temporal Fusion module that extracts discriminative spatial-temporal representations from long-horizon observation sequences; and iii) an Online Task-Driven Augmentation strategy that mitigates training-data scarcity through action-conditioned data augmentation. Experimental results demonstrate that AerialVLA achieves state-of-the-art navigation performance while exhibiting effective dialogue capabilities. Moreover, to better evaluate an agent's proactive dialogue and navigation abilities, our evaluation benchmark, UAV Navigation with Online Dialogue (UNOD), incorporates an online dialogue interaction module. UNOD assesses UAV agents' real-time questioning capabilities by using an Air Commander Large Language Model to simulate human-UAV interactions during testing.
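
For intuition, a minimal Python sketch of the ask-or-act gating described above: question generation fires when a learned progress estimate stalls. All names here (AskOrActController, propose_action, generate_question) are hypothetical placeholders, not the paper's implementation, which uses a learned alternation mechanism.

from collections import deque

class AskOrActController:
    """Hypothetical controller: ask a clarifying question when estimated
    navigation progress has stalled, otherwise keep acting."""
    def __init__(self, stall_window=5, stall_eps=0.02):
        self.history = deque(maxlen=stall_window)
        self.stall_eps = stall_eps

    def decide(self, progress, propose_action, generate_question):
        # progress: scalar in [0, 1] from a learned progress-estimation head
        self.history.append(progress)
        full = len(self.history) == self.history.maxlen
        stalled = full and (max(self.history) - min(self.history) < self.stall_eps)
        if stalled:
            self.history.clear()  # avoid asking on consecutive steps
            return "ask", generate_question()
        return "act", propose_action()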

JBHI Journal 2026 Journal Article

CLDAE: A Two-Stage EEG-based Emotion Recognition Framework Combining Contrastive Learning and Dual-Attention Encoder

  • Rongqi Cao
  • Jian He
  • Yu Liang
  • Xiyuan Hu
  • Tianhao Peng
  • Wenjun Wu
  • Shuang Niu
  • Shahid Mumtaz

Electroencephalogram (EEG)-based emotion recognition systems face a persistent challenge in maintaining robust performance across subjects (generalization) and within subjects (personalization). Existing models for cross-subject recognition generally struggle to adapt to individual-specific neural signatures, while models optimized for within-subject performance typically require a large amount of personalized data. To address these limitations, this study proposes an EEG-based emotion recognition framework, CLDAE, that integrates a contrastive learning strategy and a dual-attention feature extraction mechanism. The CLDAE framework includes two stages: contrastive learning pre-training and emotion recognition fine-tuning. During the pre-training stage, a data augmentation method that combines EEG signals from different subjects is used to generate new training samples. To extract discriminative features from the augmented data, the dual-attention encoder combines temporal and channel attention mechanisms. After pre-training, CLDAE is fine-tuned for the final recognition tasks. The proposed framework is verified by experiments on two public datasets (DEAP and SEED-IV) and a private dataset (MAN). The experimental results demonstrate that CLDAE achieves competitive performance in both within-subject and cross-subject emotion recognition, with 95.12% and 75.29% accuracy on the MAN dataset, respectively, outperforming the baseline methods. These results validate the effectiveness of the proposed framework in both settings.
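
A minimal PyTorch sketch of the two pre-training ingredients named above, under the assumption that the cross-subject augmentation is a mixup-style blend and the contrastive objective is NT-Xent; CLDAE's actual augmentation, encoder, and loss may differ.

import torch
import torch.nn.functional as F

def mix_subjects(x_a, x_b, alpha=0.5):
    """Blend EEG windows from two subjects into a new training sample.
    x_a, x_b: (batch, channels, time) tensors. Assumed mixup-style blend."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * x_a + (1 - lam) * x_b

def nt_xent(z1, z2, tau=0.1):
    """NT-Xent contrastive loss over two views' embeddings, shape (batch, dim)."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)
    sim = z @ z.t() / tau
    sim.fill_diagonal_(float("-inf"))  # mask self-similarity
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)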

AAAI Conference 2026 Conference Paper

Decoupling Understanding from Reasoning via Problem Space Mapping for Small-Scale Model Reasoning

  • Li Wang
  • Changhao Zhang
  • Zengqi Xiu
  • Kai Lu
  • Xin Yu
  • Kui Zhang
  • Wenjun Wu

Despite recent advances in the reasoning capabilities of Large Language Models (LLMs), improving the reasoning ability of Small Language Models (SLMs, e.g., up to 1.5B parameters) remains challenging. A key obstacle lies in the complexity and variability of natural language: essentially equivalent problems often appear in diverse surface forms, obscured by redundant or distracting details. This imposes a dual burden on SLMs: they must first extract the core problem from complex linguistic input, and then perform reasoning based on that understanding. The resulting vast and noisy problem space hinders optimization, particularly for models with limited capacity. To address this, we propose a new framework that decouples understanding from reasoning by mapping natural language problems into a canonical problem space, a semantically simplified yet expressive domain. This enables SLMs to focus on reasoning over standardized inputs, free from linguistic variability. Within this framework, we introduce DURIT (Decoupled Understanding from Reasoning via Iterative Training), a three-step algorithm that iteratively (1) maps natural language problems into the canonical space via reinforcement learning, (2) aligns reasoning trajectories through self-distillation, and (3) trains reasoning policies in the problem space. The mapper and reasoner are co-trained in an alternating loop throughout this process. Experiments show that DURIT substantially improves SLMs' performance on both in-domain and out-of-domain mathematical and logical reasoning tasks. Beyond improving reasoning capability, DURIT also improves the robustness of reasoning, validating the decoupling of understanding from reasoning as an effective strategy for strengthening SLMs.
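
The alternating loop reads roughly like the callback skeleton below; the three update functions stand in for the paper's RL, self-distillation, and policy-training stages, and are placeholders rather than the authors' code.

def durit_loop(update_mapper_rl, self_distill, train_reasoner, problems, n_rounds=3):
    """Alternate the three DURIT stages; each argument is a training
    callback supplied by the caller."""
    for _ in range(n_rounds):
        update_mapper_rl(problems)   # (1) RL: learn to map problems to canonical form
        self_distill(problems)       # (2) align reasoning traces across raw/canonical inputs
        train_reasoner(problems)     # (3) train the reasoning policy in the problem space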

AAAI Conference 2026 Conference Paper

Encode Geometric Diagram as Geo-Graph in Geometry Problem Solving

  • Wenjun Wu
  • Lingling Zhang
  • Bo Zhao
  • Bo Li
  • Xinyu Zhang
  • Yaqiang Wu

Geometry problem solving has become a prominent research topic in recent years because it requires machines to combine geometric abstraction, multi-modal reasoning, and mathematical capabilities. Most research focuses on the fusion of multi-modal data or the synergistic combination of neural and symbolic systems for performance improvement. However, neglecting the unique characteristics of geometric diagrams, which distinguish them from natural images, impedes further exploitation of the critical information they contain. In this work, we introduce the novel concept of the geo-graph and propose the Geo-Graph Geometry Problem Solving model, which encodes the geometric diagram from a new perspective. The geo-graph is designed to capture the semantic, structural, and spatial information in the diagram, which is crucial to the subsequent reasoning stage. To facilitate the model's comprehension of the actual layout of the geometric diagram, spatial and connecting attentions are devised to serve as intrinsic knowledge guidance for feature propagation. An additional cross-modal attention serves as external guidance, conditioning the geo-graph encoding on the specific problem target. Fused multi-modal features are then fed into a commonly used encoder-decoder framework for final solution generation. The model is first trained with three carefully designed pre-training tasks to establish its fundamental knowledge of the geo-graph, leveraging numerous varied samples generated through a geo-graph-based augmentation method. Experiments on popular geometry problem solving datasets demonstrate the effectiveness and superiority of our model for geometric diagram encoding.
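
To make the geo-graph notion concrete, here is a hypothetical container matching the abstract's description, with nodes for geometric primitives carrying semantic labels and edges carrying structural or spatial relations; the field layout is a guess, not the paper's schema.

from dataclasses import dataclass, field

@dataclass
class GeoGraph:
    nodes: dict = field(default_factory=dict)   # id -> {"type": ..., "label": ...}
    edges: list = field(default_factory=list)   # (src, dst, relation) triples

    def add_node(self, nid, ntype, label=""):
        self.nodes[nid] = {"type": ntype, "label": label}  # e.g. point, line, circle

    def add_edge(self, src, dst, relation):
        self.edges.append((src, dst, relation))  # e.g. ("P1", "L1", "endpoint_of")

# Usage: g = GeoGraph(); g.add_node("P1", "point", "A"); g.add_edge("P1", "L1", "on")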

AAAI Conference 2026 Conference Paper

UrbanNav: Learning Language-Guided Embodied Urban Navigation from Web-Scale Human Trajectories

  • Yanghong Mei
  • Yirong Yang
  • Longteng Guo
  • Qunbo Wang
  • Ming-Ming Yu
  • Xingjian He
  • Wenjun Wu
  • Jing Liu

Navigating complex urban environments using natural language instructions poses significant challenges for embodied agents, including noisy language instructions, ambiguous spatial references, diverse landmarks, and dynamic street scenes. Current visual navigation methods are typically limited to simulated or off-street environments, and often rely on precise goal formats, such as specific coordinates or images. This limits their effectiveness for autonomous agents like last-mile delivery robots navigating unfamiliar cities. To address these limitations, we introduce UrbanNav, a scalable framework that trains embodied agents to follow free-form language instructions in diverse urban settings. Leveraging web-scale city walking videos, we develop a scalable annotation pipeline that aligns human navigation trajectories with language instructions grounded in real-world landmarks. UrbanNav encompasses over 1,500 hours of navigation data and 3 million instruction-trajectory-landmark triplets, capturing a wide range of urban scenarios. Our model learns robust navigation policies to tackle complex urban scenarios, demonstrating superior spatial reasoning, robustness to noisy instructions, and generalization to unseen urban settings. Experimental results show that UrbanNav significantly outperforms existing methods, highlighting the potential of large-scale web video data to enable language-guided, real-world urban navigation for embodied agents.

NeurIPS Conference 2025 Conference Paper

C-NAV: Towards Self-Evolving Continual Object Navigation in Open World

  • MingMing Yu
  • Fei Zhu
  • Wenzhuo Liu
  • Yirong Yang
  • Qunbo Wang
  • Wenjun Wu
  • Jing Liu

Embodied agents are expected to perform object navigation in dynamic, open-world environments. However, existing approaches typically rely on static trajectories and a fixed set of object categories during training, overlooking the real-world requirement for continual adaptation to evolving scenarios. To facilitate related studies, we introduce a continual object navigation benchmark, which requires agents to acquire navigation skills for new object categories while avoiding catastrophic forgetting of previously learned knowledge. To tackle this challenge, we propose C-Nav, a continual visual navigation framework that integrates two key innovations: (1) a dual-path anti-forgetting mechanism, comprising feature distillation, which aligns multi-modal inputs into a consistent representation space to ensure representation consistency, and feature replay, which retains temporal features within the action decoder to ensure policy consistency; and (2) an adaptive sampling strategy that selects diverse and informative experiences, thereby reducing redundancy and minimizing memory overhead. Extensive experiments across multiple model architectures demonstrate that C-Nav consistently outperforms existing approaches, achieving superior performance even compared to baselines with full trajectory retention, while significantly lowering memory requirements. The code will be publicly available at https://bigtree765.github.io/C-Nav-project.
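
A hedged sketch of the dual-path idea: distill current fusion features against a frozen copy of the previous model, and replay stored temporal features through the action decoder. The encode/decode method names are placeholders, and C-Nav's actual losses may differ.

import torch
import torch.nn.functional as F

def anti_forgetting_loss(model, old_model, obs, replay_feats, replay_actions):
    feats = model.encode(obs)                         # multi-modal fusion features
    with torch.no_grad():
        old_feats = old_model.encode(obs)             # frozen previous-task model
    distill = F.mse_loss(feats, old_feats)            # representation consistency
    logits = model.decode(replay_feats)               # decode stored temporal features
    policy = F.cross_entropy(logits, replay_actions)  # policy consistency
    return distill + policy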

NeurIPS Conference 2025 Conference Paper

Causal-R: A Causal-Reasoning Geometry Problem Solver for Optimized Solution Exploration

  • Wenjun Wu
  • Lingling Zhang
  • Bo Zhao
  • Muye Huang
  • Qianying Wang
  • Jun Liu

The task of geometry problem solving has been a long-standing focus in the automated mathematics community and draws growing attention due to its complexity for both symbolic and neural models. Although prior studies have explored various effective approaches for enhancing problem-solving performance, two fundamental challenges essential to practical application remain unaddressed. First, the multi-step reasoning gap between the initial geometric conditions and the ultimate problem goal leads to a large search space for solution exploration. Second, obtaining multiple interpretable and shorter solutions remains an open problem. In this work, we introduce the Causal-Reasoning Geometry Problem Solver to overcome these challenges. Specifically, we propose the Causal Graph Reasoning theory to perform symbolic reasoning before problem solving. Several causal graphs are constructed according to a predefined rule base, where each graph is composed of primitive nodes, causal edges, and prerequisite edges. By applying causal graph deduction from the initial conditions, the reachability status of nodes is iteratively propagated along causal edges until the target nodes are reached, yielding feasible causal deduction paths. In this way, the search space of solutions is compressed from the beginning, the end, and the intermediate reasoning paths, while ensuring the interpretability and variety of solutions. To achieve this, we further propose Forward Matrix Deduction, which transforms the causal graphs into matrices and vectors and applies matrix operations to iteratively update the status values of reachable nodes. Finally, multiple solutions can be generated by tracing back from the target nodes after validation. Experiments demonstrate the effectiveness of our method in obtaining multiple shorter and interpretable solutions. Code will be released upon acceptance.
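
One reading of Forward Matrix Deduction as a toy fixed-point computation: a node becomes reachable once at least one causal parent is reachable and all prerequisite parents are reachable. The paper's actual matrix encoding may differ.

import numpy as np

def forward_deduction(causal, prereq, initial, max_iters=100):
    """causal, prereq: (n, n) 0/1 matrices with entry [i, j] = edge i -> j.
    initial: length-n 0/1 vector of conditions given in the problem."""
    status = initial.astype(int)
    need = prereq.sum(axis=0)                 # prerequisites required per node
    for _ in range(max_iters):
        fired = (causal.T @ status) > 0       # some causal parent is reachable
        met = (prereq.T @ status) >= need     # all prerequisites are reachable
        new = ((status > 0) | (fired & met)).astype(int)
        if (new == status).all():
            break                             # fixed point: nothing new derivable
        status = new
    return status > 0                         # reachability flag for every node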

NeurIPS Conference 2025 Conference Paper

ChartSketcher: Reasoning with Multimodal Feedback and Reflection for Chart Understanding

  • Muye Huang
  • Lingling Zhang
  • Jie Ma
  • Han Lai
  • Fangzhi Xu
  • Yifei Li
  • Wenjun Wu
  • Yaqiang Wu

Charts are high-density visualization carriers for complex data, serving as a crucial medium for information extraction and analysis. Automated chart understanding poses significant challenges to existing multimodal large language models (MLLMs) due to the need for precise and complex visual reasoning. Current step-by-step reasoning models primarily focus on text-based logical reasoning for chart understanding. However, they struggle to refine or correct their reasoning when errors stem from flawed visual understanding, as they lack the ability to leverage multimodal interaction for deeper comprehension. Inspired by human cognitive behavior, we propose ChartSketcher, a multimodal feedback-driven step-by-step reasoning method designed to address these limitations. ChartSketcher is a chart understanding model that employs Sketch-CoT, enabling MLLMs to annotate intermediate reasoning steps directly onto charts using a programmatic sketching library, iteratively feeding these visual annotations back into the reasoning process. This mechanism enables the model to visually ground its reasoning and refine its understanding over multiple steps. We employ a two-stage training strategy: a cold start phase to learn sketch-based reasoning patterns, followed by off-policy reinforcement learning to enhance reflection and generalization. Experiments demonstrate that ChartSketcher achieves promising performance on chart understanding benchmarks and general vision tasks, providing an interactive and interpretable approach to chart comprehension.
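
As described, the Sketch-CoT loop alternates reasoning with drawing annotations back onto the chart. The snippet below uses Pillow as a stand-in for the paper's programmatic sketching library, and the model interface (reason, step.box, step.final_answer) is invented for illustration.

from PIL import ImageDraw

def sketch_cot(model, chart, question, max_steps=4):
    for _ in range(max_steps):
        step = model.reason(chart, question)  # one reasoning step over the current image
        if step.final_answer is not None:
            return step.final_answer
        if step.box is not None:              # annotate the region this step refers to
            ImageDraw.Draw(chart).rectangle(step.box, outline="red", width=3)
    return model.reason(chart, question).final_answer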

AAAI Conference 2025 Conference Paper

EvoChart: A Benchmark and a Self-Training Approach Towards Real-World Chart Understanding

  • Muye Huang
  • Han Lai
  • Xinyu Zhang
  • Wenjun Wu
  • Jie Ma
  • Lingling Zhang
  • Jun Liu

Chart understanding enables automated data analysis for humans, which requires models to achieve highly accurate visual comprehension. While existing Visual Language Models (VLMs) have shown progress in chart understanding, the lack of high-quality training data and comprehensive evaluation benchmarks hinders VLM chart comprehension. In this paper, we introduce EvoChart, a novel self-training method for generating synthetic chart data to enhance VLMs' capabilities in real-world chart comprehension. We also propose EvoChart-QA, a novel benchmark for measuring models' chart comprehension abilities in real-world scenarios. Specifically, EvoChart is a unique self-training data synthesis approach that simultaneously produces a high-quality training corpus and a high-performance chart understanding model. EvoChart-QA consists of 650 distinct real-world charts collected from 140 different websites and 1,250 expert-curated questions that focus on chart understanding. Experimental results on various open-source and proprietary VLMs tested on EvoChart-QA demonstrate that even the best proprietary model, GPT-4o, achieves only 49.8% accuracy. Moreover, the EvoChart method significantly boosts the performance of open-source VLMs on real-world chart understanding tasks, achieving 54.2% accuracy on EvoChart-QA.
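
One self-training round might look like the callback skeleton below: synthesize candidate chart-QA data, retrain, and keep only samples the refreshed model handles consistently. This is a schematic reading of the abstract, not the released pipeline.

def evochart_round(synthesize, train, self_check, corpus):
    candidates = synthesize()                 # new synthetic chart-QA examples
    model = train(corpus + candidates)        # retrain on old + candidate data
    kept = [ex for ex in candidates if self_check(model, ex)]  # drop noisy samples
    return model, corpus + kept               # grow the corpus for the next round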

NeurIPS Conference 2025 Conference Paper

Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation

  • Siwei Wen
  • Junyan Ye
  • Peilin Feng
  • Hengrui Kang
  • Zichen Wen
  • Yize Chen
  • Jiang Wu
  • Wenjun Wu

With the rapid advancement of Artificial Intelligence Generated Content (AIGC) technologies, synthetic images have become increasingly prevalent in everyday life, posing new challenges for authenticity assessment and detection. Despite the effectiveness of existing methods in evaluating image authenticity and locating forgeries, these approaches often lack human interpretability and do not fully address the growing complexity of synthetic data. To tackle these challenges, we introduce FakeVLM, a specialized large multimodal model designed for both general synthetic image and DeepFake detection tasks. FakeVLM not only excels in distinguishing real from fake images but also provides clear, natural language explanations for image artifacts, enhancing interpretability. Additionally, we present FakeClue, a comprehensive dataset containing over 100,000 images across seven categories, annotated with fine-grained artifact clues in natural language. FakeVLM demonstrates performance comparable to expert models while eliminating the need for additional classifiers, making it a robust solution for synthetic data detection. Extensive evaluations across multiple datasets confirm the superiority of FakeVLM in both authenticity classification and artifact explanation tasks, setting a new benchmark for synthetic image detection. The code, model weights, and dataset can be found here: https://github.com/opendatalab/FakeVLM.

ICLR Conference 2025 Conference Paper

Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology

  • Xiangyu Wang
  • Donglin Yang
  • Ziqin Wang
  • Hohin Kwan
  • Jinyu Chen
  • Wenjun Wu
  • Hongsheng Li 0001
  • Yue Liao

Developing agents capable of navigating to a target location based on language instructions and visual information, known as vision-language navigation (VLN), has attracted widespread interest. Most research has focused on ground-based agents, while UAV-based VLN remains relatively underexplored. Recent efforts in UAV vision-language navigation predominantly adopt ground-based VLN settings, relying on predefined discrete action spaces and neglecting the inherent disparities in agent movement dynamics and in the complexity of navigation tasks between ground and aerial environments. To address these disparities and challenges, we propose solutions from three perspectives: platform, benchmark, and methodology. To enable realistic UAV trajectory simulation in VLN tasks, we propose the OpenUAV platform, which features diverse environments, realistic flight control, and extensive algorithmic support. We further construct a target-oriented VLN dataset consisting of approximately 12k trajectories on this platform, serving as the first dataset specifically designed for realistic UAV VLN tasks. To tackle the challenges posed by complex aerial environments, we propose an assistant-guided UAV object search benchmark called UAV-Need-Help, which provides varying levels of guidance information to help UAVs better accomplish realistic VLN tasks. We also propose a UAV navigation LLM that, given multi-view images, task descriptions, and assistant instructions, leverages the multimodal understanding capabilities of the MLLM to jointly process visual and textual information and perform hierarchical trajectory generation. Our method significantly outperforms the baseline models in evaluation, while a considerable gap remains between our results and those achieved by human operators, underscoring the challenge presented by the UAV-Need-Help task.

NeurIPS Conference 2025 Conference Paper

UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning

  • Xiangyu Wang
  • Donglin Yang
  • Yue Liao
  • Wenhao Zheng
  • Wenjun Wu
  • Bin Dai
  • Hongsheng Li
  • Si Liu

Unmanned Aerial Vehicles (UAVs) are evolving into language-interactive platforms, enabling more intuitive forms of human-drone interaction. While prior works have primarily focused on high-level planning and long-horizon navigation, we shift attention to language-guided fine-grained trajectory control, where UAVs execute short-range, reactive flight behaviors in response to language instructions. We formalize this problem as the Flying-on-a-Word (Flow) task and introduce UAV imitation learning as an effective approach. In this framework, UAVs learn fine-grained control policies by mimicking expert pilot trajectories paired with atomic language instructions. To support this paradigm, we present UAV-Flow, the first real-world benchmark for language-conditioned, fine-grained UAV control. It includes a task formulation, a large-scale dataset collected in diverse environments, a deployable control framework, and a simulation suite for systematic evaluation. Our design enables UAVs to closely imitate the precise, expert-level flight trajectories of human pilots and supports direct deployment without a sim-to-real gap. We conduct extensive experiments on UAV-Flow, benchmarking VLN and VLA paradigms. Results show that VLA models are superior to VLN baselines and highlight the critical role of spatial grounding in the fine-grained Flow setting.
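
In its simplest form, imitation learning on instruction-trajectory pairs reduces to behavior cloning; the PyTorch step below shows that generic formulation (the policy signature and batch keys are assumptions), not the authors' released training code.

import torch
import torch.nn.functional as F

def bc_step(policy, optimizer, batch):
    """batch: dict with 'obs' images, 'instr' token ids, and 'actions'
    expert waypoint targets of shape (B, T, action_dim)."""
    pred = policy(batch["obs"], batch["instr"])  # predicted waypoints (B, T, action_dim)
    loss = F.mse_loss(pred, batch["actions"])    # imitate expert pilot trajectories
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()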

AAAI Conference 2025 Conference Paper

VProChart: Answering Chart Question Through Visual Perception Alignment Agent and Programmatic Solution Reasoning

  • Muye Huang
  • Lingling Zhang
  • Han Lai
  • Wenjun Wu
  • Xinyu Zhang
  • Jun Liu

Charts are widely used for data visualization across various fields, including education, research, and business. Chart Question Answering (CQA) is an emerging task focused on the automatic interpretation and reasoning of data presented in charts. However, chart images are inherently difficult to interpret, and chart-related questions often involve complex logical and numerical reasoning, which hinders the performance of existing models. This paper introduces VProChart, a novel framework designed to address these challenges in CQA by integrating a lightweight Visual Perception Alignment Agent (VPAgent) and a Programmatic Solution Reasoning approach. VPAgent aligns and models chart elements based on principles of human visual perception, enhancing the understanding of chart context. The Programmatic Solution Reasoning approach leverages large language models (LLMs) to transform natural language reasoning questions into structured solution programs, facilitating precise numerical and logical reasoning. Extensive experiments on benchmark datasets such as ChartQA and PlotQA demonstrate that VProChart significantly outperforms existing methods, highlighting its capability in understanding and reasoning with charts.
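
Programmatic solution reasoning, as described, can be illustrated by the minimal pattern below: the LLM emits a small solution program that is then executed to obtain the answer. The llm_generate callable, the prompt, and the variable contract are placeholders; the actual pipeline is more involved.

def answer_by_program(llm_generate, question, chart_table):
    prompt = (f"Write Python that computes the answer to: {question}\n"
              f"The chart data is available as the variable `data`: {chart_table}\n"
              f"Assign the result to a variable named `answer`.")
    program = llm_generate(prompt)
    scope = {"data": chart_table}
    exec(program, scope)  # caution: execute model-written code only in a sandbox
    return scope["answer"]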

AAAI Conference 2024 Conference Paper

Leveraging Partial Symmetry for Multi-Agent Reinforcement Learning

  • Xin Yu
  • Rongye Shi
  • Pu Feng
  • Yongkai Tian
  • Simin Li
  • Shuhao Liao
  • Wenjun Wu

Incorporating symmetry as an inductive bias into multi-agent reinforcement learning (MARL) has led to improvements in generalization, data efficiency, and physical consistency. While prior research has succeeded in using a perfect symmetry prior, partial symmetry in the multi-agent domain remains unexplored. To fill this gap, we introduce the partially symmetric Markov game, a new subclass of the Markov game. We then theoretically show that the performance error introduced by utilizing symmetry in MARL is bounded, implying that the symmetry prior can still be useful even under partial symmetry. Motivated by this insight, we propose the Partial Symmetry Exploitation (PSE) framework, which adaptively incorporates the symmetry prior in MARL under different symmetry-breaking conditions. Specifically, by adaptively adjusting the exploitation of symmetry, our framework improves the sample efficiency and overall performance of MARL algorithms. Extensive experiments demonstrate the superior performance of the proposed framework over baselines. Finally, we implement the proposed framework on a real-world multi-robot testbed to show its superiority.
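
For intuition, exploiting an exact rotational symmetry amounts to cheap data augmentation on transitions, as in the toy snippet below; PSE's contribution is adaptively weighting such augmented data when the symmetry only holds partially, which the snippet does not show.

import numpy as np

ROT90 = np.array([[0.0, -1.0], [1.0, 0.0]])  # 90-degree rotation in the plane

def augment_transition(states, actions, rewards):
    """states, actions: (n_agents, 2) arrays of 2-D positions / velocity commands.
    Under exact rotational symmetry, rewards are unchanged by the rotation."""
    return states @ ROT90.T, actions @ ROT90.T, rewards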

IJCAI Conference 2023 Conference Paper

Diagram Visual Grounding: Learning to See with Gestalt-Perceptual Attention

  • Xin Hu
  • Lingling Zhang
  • Jun Liu
  • Xinyu Zhang
  • Wenjun Wu
  • Qianying Wang

Diagram visual grounding aims to capture the correlation between a language expression and local objects in the diagram, and plays an important role in applications such as textbook question answering and cross-modal retrieval. Most diagrams consist of a few colors and simple geometries. This results in sparse low-level visual features, which further widens the gap between low-level visual and high-level semantic features of diagrams. This phenomenon makes diagram visual grounding challenging. To solve these issues, we propose a gestalt-perceptual attention model to align diagram objects with language expressions. For low-level visual features, inspired by gestalt principles that model the human visual system, we build a gestalt-perception graph network to complement the features learned by the traditional backbone network. For high-level semantic features, we design a multi-modal context attention mechanism to facilitate the interaction between diagrams and language expressions, so as to enhance the semantics of diagrams. Finally, guided by diagram features and linguistic embeddings, the target query is gradually decoded to generate the coordinates of the referred object. Through comprehensive experiments on diagrams and natural images, we demonstrate that the proposed model achieves superior performance over the competitors. Our code will be released at https://github.com/AIProCode/GPA.