Arrow Research search

Author name cluster

Wen Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

20 papers
2 author rows

Possible papers


JBHI Journal 2026 Journal Article

A Novel Grasping Robot Control Method Using Motion Execution BCI Combining Knowledge Reasoning

  • Rui Li
  • Jing Liu
  • Jinli Liu
  • Shiqiang Yang
  • Weiping Liu
  • Ke Deng
  • Wen Wang

Recently, with the growing number of disabled people, brain-controlled technology offers a novel way to help patients restore their daily abilities. However, conventional brain-controlled systems based on motion-related tasks lack intelligence in real-world environments. To address this problem, this study proposes a shared-control system combining a precise hand movement (PHM)-based brain-computer interface (BCI) and a knowledge-driven reasoning method. Six types of precise hand movements were selected to design a novel motion-execution paradigm for the BCI system. A feature intermediate fusion convolutional neural network was employed to accurately decode the electroencephalogram. Furthermore, a shared-control grasping technology, based on knowledge-based reasoning combined with the PHM-based BCI system, was designed for the grasping robot, enhancing the system's intelligence and versatility in selecting objects. To verify the improvement of the proposed method, experiments were conducted with 15 healthy subjects and 2 patients. The proposed method achieved an average accuracy of 82.80 ± 6.08%, with the highest accuracy reaching 94.27%. All the experimental results demonstrate the effectiveness of the proposed shared-control method.

AAAI Conference 2026 Conference Paper

GUI-G²: Gaussian Reward Modeling for GUI Grounding

  • Fei Tang
  • Zhangxuan Gu
  • Zhengxi Lu
  • Xuyang Liu
  • Shuheng Shen
  • Changhua Meng
  • Wen Wang
  • Wenqi Zhang

Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current reinforcement learning approaches use binary rewards that treat elements as hit-or-miss targets, creating sparse signals that ignore the continuous nature of spatial interactions. Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G2), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane. GUI-G2 incorporates two synergistic mechanisms: Gaussian point rewards model precise localization through exponentially decaying distributions centered on element centroids, while coverage rewards assess spatial alignment by measuring the overlap between predicted Gaussian distributions and target regions. To handle diverse element scales, we develop an adaptive variance mechanism that calibrates reward distributions based on element dimensions. This framework transforms GUI grounding from sparse binary classification to dense continuous optimization, where Gaussian distributions generate rich gradient signals that guide models toward optimal interaction positions. Extensive experiments across ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks demonstrate that GUI-G2 substantially outperforms the state-of-the-art method UI-TARS-72B, with the most significant improvement of 24.7% on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides superior robustness to interface variations and enhanced generalization to unseen layouts, establishing a new paradigm for spatial reasoning in GUI interaction tasks.
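The Gaussian point reward with adaptive variance can be sketched in a few lines. This is an illustrative reading of the abstract, not the authors' code: the scale factor `k` tying the standard deviation to element size is an assumption.

```python
import math

def gaussian_point_reward(pred, center, size, k=0.5):
    """Exponentially decaying reward centered on the element centroid.

    pred, center: (x, y) predicted click and element centroid.
    size: (w, h) element width and height; the standard deviation scales
    with size (the adaptive-variance idea). The factor k is a guess.
    """
    (px, py), (cx, cy), (w, h) = pred, center, size
    sx, sy = max(k * w, 1e-6), max(k * h, 1e-6)  # adaptive std dev per axis
    return math.exp(-((px - cx) ** 2 / (2 * sx ** 2)
                      + (py - cy) ** 2 / (2 * sy ** 2)))
```

A click on the centroid earns reward 1.0, and the reward decays smoothly as the click drifts, so every prediction receives a gradient signal instead of a hard hit-or-miss 0/1.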

AAAI Conference 2026 Conference Paper

Say More with Less: Variable-Frame-Rate Speech Tokenization via Adaptive Clustering and Implicit Duration Coding

  • Rui-Chen Zheng
  • Wenrui Liu
  • Hui-Peng Du
  • Qinglin Zhang
  • Chong Deng
  • Qian Chen
  • Wen Wang
  • Yang Ai

Existing speech tokenizers typically assign a fixed number of tokens per second, regardless of the varying information density or temporal fluctuations in the speech signal. This uniform token allocation mismatches the intrinsic structure of speech, where information is distributed unevenly over time. To address this, we propose VARSTok, a VAriable-frame-Rate Speech Tokenizer that adapts token allocation based on local feature similarity. VARSTok introduces two key innovations: (1) a temporal-aware density peak clustering algorithm that adaptively segments speech into variable-length units, and (2) a novel implicit duration coding scheme that embeds both content and temporal span into a single token index, eliminating the need for auxiliary duration predictors. Extensive experiments show that VARSTok significantly outperforms strong fixed-rate baselines. Notably, it achieves superior reconstruction naturalness while using up to 23% fewer tokens than a 40 Hz fixed-frame-rate baseline. VARSTok further yields lower word error rates and improved naturalness in zero-shot text-to-speech synthesis. To the best of our knowledge, this is the first work to demonstrate that a fully dynamic, variable-frame-rate acoustic speech tokenizer can be seamlessly integrated into downstream speech language models.
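The implicit duration coding idea, packing content and temporal span into one token index, admits a simple sketch. The base-`max_dur` packing below is an assumed realization for illustration; the paper's actual scheme may differ.

```python
def encode_token(content_id: int, duration: int, max_dur: int = 8) -> int:
    """Pack a cluster id and its frame span into a single token index.

    Assumed scheme: interleave duration into the index base-max_dur,
    so no auxiliary duration predictor is needed at decode time.
    """
    assert 1 <= duration <= max_dur
    return content_id * max_dur + (duration - 1)

def decode_token(index: int, max_dur: int = 8) -> tuple[int, int]:
    """Recover (content_id, duration) from a packed token index."""
    return index // max_dur, index % max_dur + 1
```

The round trip is lossless, at the cost of a vocabulary `max_dur` times larger than the content codebook.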

AAAI Conference 2026 Conference Paper

SpeakerLM: End-to-End Versatile Speaker Diarization and Recognition with Multimodal Large Language Models

  • Han Yin
  • Yafeng Chen
  • Chong Deng
  • Luyao Cheng
  • Hui Wang
  • Chao-Hong Tan
  • Qian Chen
  • Wen Wang

The Speaker Diarization and Recognition (SDR) task aims to predict "who spoke when and what" within an audio clip, which is a crucial task in various real-world multi-speaker scenarios such as meeting transcription and dialogue systems. Existing SDR systems typically adopt a cascaded framework, combining multiple modules such as speaker diarization (SD) and automatic speech recognition (ASR). The cascaded systems suffer from several limitations, such as error propagation, difficulty in handling overlapping speech, and lack of joint optimization for exploring the synergy between SD and ASR tasks. To address these limitations, we introduce SpeakerLM, a unified multimodal large language model for SDR that jointly performs SD and ASR in an end-to-end manner. Moreover, to facilitate diverse real-world scenarios, we incorporate a flexible speaker registration mechanism into SpeakerLM, enabling SDR under different speaker registration settings. SpeakerLM is progressively developed with a multi-stage training strategy on large-scale real data. Extensive experiments show that SpeakerLM demonstrates strong data scaling capability and generalizability, outperforming state-of-the-art cascaded baselines on both in-domain and out-of-domain public SDR benchmarks. Furthermore, experimental results show that the proposed speaker registration mechanism effectively ensures robust SDR performance of SpeakerLM across diverse speaker registration conditions and varying numbers of registered speakers.

NeurIPS Conference 2025 Conference Paper

ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World

  • Weixiang Yan
  • Haitian Liu
  • Tengxiao Wu
  • Qian Chen
  • Wen Wang
  • Haoyuan Chai
  • Jiayi Wang

Large language models (LLMs) have achieved significant performance progress in various natural language processing applications. However, LLMs still struggle to meet the strict requirements for accuracy and reliability in the medical field and face many challenges in clinical applications. Existing clinical diagnostic evaluation benchmarks for evaluating medical agents powered by LLMs have severe limitations. To address these limitations, we introduce ClinicalLab, a comprehensive clinical diagnosis agent alignment suite. ClinicalLab includes ClinicalBench, an end-to-end multi-departmental clinical diagnostic evaluation benchmark for evaluating medical agents and LLMs. ClinicalBench is based on real cases that cover 24 departments and 150 diseases. We ensure that ClinicalBench does not have data leakage. ClinicalLab also includes four novel metrics (ClinicalMetrics) for evaluating the effectiveness of LLMs in clinical diagnostic tasks. We evaluate 17 general and medical-domain LLMs and find that their performance varies significantly across different departments. Based on these findings, in ClinicalLab, we propose ClinicalAgent, an end-to-end clinical agent that aligns with real-world clinical diagnostic practices. We systematically investigate the performance and applicable scenarios of variants of ClinicalAgent on ClinicalBench. Our findings demonstrate the importance of aligning with modern medical practices in designing medical agents.

AAAI Conference 2025 Conference Paper

CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification

  • Yuchen Tian
  • Weixiang Yan
  • Qian Yang
  • Xuandong Zhao
  • Qian Chen
  • Wen Wang
  • Ziyang Luo
  • Lei Ma

Large Language Models (LLMs) have made significant progress in code generation, offering developers groundbreaking automated programming support. However, LLMs often generate code that is syntactically correct and even semantically plausible, but may not execute as expected or fulfill specified requirements. This phenomenon of hallucinations in the code domain has not been systematically explored. To advance the community's understanding and research on this issue, we introduce the concept of code hallucinations and propose a classification method for code hallucination based on execution verification. We categorize code hallucinations into four main types: mapping, naming, resource, and logic hallucinations, with each category further divided into different subcategories to understand and address the unique challenges faced by LLMs in code generation with finer granularity. Additionally, we present a dynamic detection algorithm called CodeHalu designed to detect and quantify code hallucinations. We also introduce the CodeHaluEval benchmark, which includes 8,883 samples from 699 tasks, to systematically and quantitatively evaluate code hallucinations. By evaluating 17 popular LLMs using this benchmark, we reveal significant differences in their accuracy and reliability in code generation, offering detailed insights for further improving the code generation capabilities of LLMs.
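Execution-based verification of the kind the abstract describes can be sketched as follows. The entry-point name `solution`, the category labels, and the unsandboxed `exec` are illustrative assumptions, not the CodeHalu implementation.

```python
def detect_hallucination(code_str, test_cases):
    """Run generated code against sample I/O and flag failures by stage.

    Returns None when all tests pass, otherwise a coarse label in the
    spirit of the paper's taxonomy (labels here are our shorthand).
    """
    env = {}
    try:
        exec(code_str, env)            # define the candidate function
        fn = env["solution"]           # assumed entry-point name
    except Exception:
        return "mapping_or_naming"     # fails before any test runs
    for args, expected in test_cases:
        try:
            if fn(*args) != expected:
                return "logic"         # executes, but wrong output
        except Exception:
            return "resource"          # runtime failure during execution
    return None                         # no hallucination detected
```

The key point mirrored here is that syntactic validity is checked separately from behavioral correctness, so "plausible but wrong" code is caught by running it.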

NeurIPS Conference 2025 Conference Paper

Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration

  • Hao Zhong
  • Muzhi Zhu
  • Zongze Du
  • Zheng Huang
  • Canyu Zhao
  • Mingyu Liu
  • Wen Wang
  • Hao Chen

Long-horizon video-audio reasoning and fine-grained pixel understanding impose conflicting requirements on omnimodal models: dense temporal coverage demands many low-resolution frames, whereas precise grounding calls for high-resolution inputs. We tackle this trade-off with a two-system architecture: a Global Reasoning System selects informative keyframes and rewrites the task at low spatial cost, while a Detail Understanding System performs pixel-level grounding on the selected high-resolution snippets. Because "optimal" keyframe selection and reformulation are ambiguous and hard to supervise, we formulate them as a reinforcement-learning (RL) problem and present Omni-R1, an end-to-end RL framework built on Group Relative Policy Optimization. Omni-R1 trains the Global Reasoning System through hierarchical rewards obtained via online collaboration with the Detail Understanding System, requiring only one epoch of RL on small task splits. Experiments on two challenging benchmarks, Referring Audio-Visual Segmentation (RefAVS) and Reasoning Video Object Segmentation (REVOS), show that Omni-R1 not only surpasses strong supervised baselines but also outperforms specialized state-of-the-art models, while substantially improving out-of-domain generalization and mitigating multimodal hallucination. Our results demonstrate the first successful application of RL to large-scale omnimodal reasoning and highlight a scalable path toward universal foundation models.

AAAI Conference 2025 Conference Paper

Recording for Eyes, Not Echoing to Ears: Contextualized Spoken-to-Written Conversion of ASR Transcripts

  • Jiaqing Liu
  • Chong Deng
  • Qinglin Zhang
  • Shilin Zhou
  • Qian Chen
  • Hai Yu
  • Wen Wang

Automatic Speech Recognition (ASR) transcripts exhibit recognition errors and various spoken language phenomena such as disfluencies, ungrammatical sentences, and incomplete sentences, hence suffering from poor readability. To improve readability, we propose a Contextualized Spoken-to-Written conversion (CoS2W) task to address ASR and grammar errors and also transfer the informal text into the formal style with content preserved, utilizing contexts and auxiliary information. This task naturally matches the in-context learning capabilities of Large Language Models (LLMs). To facilitate comprehensive comparisons of various LLMs, we construct a document-level Spoken-to-Written conversion of ASR Transcripts Benchmark (SWAB) dataset. Using SWAB, we study the impact of different granularity levels on the CoS2W performance, and propose methods to exploit contexts and auxiliary information to enhance the outputs. Experimental results reveal that LLMs have the potential to excel in the CoS2W task, particularly in grammaticality and formality, and our methods enable LLMs to make effective use of contexts and auxiliary information. We further investigate the effectiveness of using LLMs as evaluators and find that LLM evaluators show strong correlations with human evaluations on rankings of faithfulness and formality, which validates the reliability of LLM evaluators for the CoS2W task.

ECAI Conference 2025 Conference Paper

Solving the Min-Max Multiple Traveling Salesmen Problem via Learning-Based Path Generation and Optimal Splitting

  • Wen Wang
  • Xiangchen Wu
  • Liang Wang 0006
  • Hao Hu 0001
  • Xianping Tao
  • Linghao Zhang

This study addresses the Min-Max Multiple Traveling Salesmen Problem (m3-TSP), which aims to coordinate tours for multiple salesmen such that the length of the longest tour is minimized. Due to its NP-hard nature, exact solvers become impractical under the assumption that P ≠ NP. As a result, learning-based approaches have gained traction for their ability to rapidly generate high-quality approximate solutions. Among these, two-stage methods combine learning-based components with classical solvers, simplifying the learning objective. However, this decoupling often disrupts consistent optimization, potentially degrading solution quality. To address this issue, we propose a novel two-stage framework named Generate-and-Split (GaS), which integrates reinforcement learning (RL) with an optimal splitting algorithm in a joint training process. The splitting algorithm offers near-linear scalability with respect to the number of cities and guarantees optimal splitting in Euclidean space for any given path. To facilitate the joint optimization of the RL component with the algorithm, we adopt an LSTM-enhanced model architecture to address partial observability. Extensive experiments show that the proposed GaS framework significantly outperforms existing learning-based approaches in both solution quality and transferability.
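The splitting step, cutting one generated path into m contiguous tours so the longest is minimized, can be illustrated with the classic binary-search-plus-greedy scheme. This is a simplified stand-in, assuming integer leg lengths and ignoring depot return legs, not the authors' Euclidean splitting algorithm.

```python
def optimal_split(leg_lengths, m):
    """Split an ordered path into at most m contiguous segments so the
    longest segment is as short as possible.

    Binary search over the answer with a greedy feasibility check, giving
    near-linear O(n log(total)) scaling in the number of legs.
    """
    def segments_needed(cap):
        # Greedily pack legs into segments of total length <= cap.
        count, cur = 1, 0
        for leg in leg_lengths:
            if cur + leg > cap:
                count, cur = count + 1, leg
            else:
                cur += leg
        return count

    lo, hi = max(leg_lengths), sum(leg_lengths)
    while lo < hi:
        mid = (lo + hi) // 2
        if segments_needed(mid) <= m:
            hi = mid          # feasible: try a smaller longest segment
        else:
            lo = mid + 1      # infeasible: longest segment must grow
    return lo                  # optimal length of the longest segment
```

The greedy check is exact for contiguous splits, which is why the two-stage decomposition (generate a path, then split it optimally) is attractive as a learning target.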

NeurIPS Conference 2025 Conference Paper

ThinkSound: Chain-of-Thought Reasoning in Multimodal LLMs for Audio Generation and Editing

  • Huadai Liu
  • Kaicheng Luo
  • Jialei Wang
  • Wen Wang
  • Qian Chen
  • Zhou Zhao
  • Wei Xue

While end-to-end video-to-audio generation has greatly improved, producing high-fidelity audio that authentically captures the nuances of visual content remains challenging. As with professionals in the creative industries, this generation requires sophisticated reasoning about elements such as visual dynamics, acoustic environments, and temporal relationships. We present ThinkSound, a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: foundational foley generation that creates semantically coherent soundscapes, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. At each stage, a multimodal large language model generates contextually aligned CoT reasoning that guides a unified audio foundation model. Furthermore, we introduce AudioCoT, a comprehensive dataset with structured reasoning annotations that establishes connections between visual content, textual descriptions, and sound synthesis. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics, and excels in the out-of-distribution Movie Gen Audio benchmark. The project page is available at https://ThinkSound-Project.github.io.

NeurIPS Conference 2025 Conference Paper

Towards Minimizing Feature Drift in Model Merging: Layer-wise Task Vector Fusion for Adaptive Knowledge Integration

  • Wenju Sun
  • Qingyong Li
  • Wen Wang
  • Yang Liu
  • Yangliao Geng
  • Boyang Li

Multi-task model merging aims to consolidate knowledge from multiple fine-tuned task-specific experts into a unified model while minimizing performance degradation. Existing methods primarily approach this by minimizing differences between task-specific experts and the unified model, either from a parameter-level or a task-loss perspective. However, parameter-level methods exhibit a significant performance gap compared to the upper bound, while task-loss approaches entail costly secondary training procedures. In contrast, we observe that performance degradation closely correlates with feature drift, i.e., differences in feature representations of the same sample caused by model merging. Motivated by this observation, we propose Layer-wise Optimal Task Vector Merging (LOT Merging), a technique that explicitly minimizes feature drift between task-specific experts and the unified model in a layer-by-layer manner. LOT Merging can be formulated as a convex quadratic optimization problem, enabling us to analytically derive closed-form solutions for the parameters of linear and normalization layers. Consequently, LOT Merging achieves efficient model consolidation through basic matrix operations. Extensive experiments across vision and vision-language benchmarks demonstrate that LOT Merging significantly outperforms baseline methods, achieving improvements of up to 4.4% (ViT-B/32) over state-of-the-art approaches. The source code is available at https://github.com/SunWenJu123/model-merging.
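For a linear layer, the convex feature-drift objective the abstract describes has a textbook least-squares solution. The sketch below shows one plausible form, assuming per-task input features X_i and weights W_i; the regularizer eps and variable names are our additions, not the paper's code.

```python
import numpy as np

def lot_merge_linear(features, weights, eps=1e-6):
    """Closed-form merge of linear-layer weights minimizing feature drift.

    Solves   min_W  sum_i || X_i W - X_i W_i ||_F^2,
    whose minimizer is
        W* = (sum_i X_i^T X_i + eps I)^{-1} sum_i X_i^T X_i W_i.
    eps ridge-regularizes the (possibly singular) summed Gram matrix.
    """
    d = weights[0].shape[0]
    A = eps * np.eye(d)
    B = np.zeros_like(weights[0], dtype=float)
    for X, W in zip(features, weights):
        G = X.T @ X          # per-task Gram matrix of layer inputs
        A += G
        B += G @ W
    return np.linalg.solve(A, B)
```

With identical inputs across tasks this reduces to averaging the task weights, and in general each task's weights are weighted by how its features occupy the input space, which is the sense in which the merge is "layer-wise optimal."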

IROS Conference 2024 Conference Paper

Research on Autonomous Navigation of Dual-mode Wheel-legged Robot

  • Wen Wang
  • Xiaobin Xu 0003
  • Ziheng Chen 0009
  • Jian Yang 0032
  • Yingying Ran
  • Zhiying Tan
  • Minzhou Luo

In order to improve the terrain adaptability and energy efficiency of wheel-legged robots in complex environments, a dual-mode navigation system based on a robot energy consumption model is proposed. Firstly, obstacle trafficability is evaluated according to the maximum obstacle-crossing capability of the robot, and the two-dimensional grid map is preprocessed. Secondly, the established energy consumption model is integrated into the evaluation function of the A* algorithm, and the eight-adjacency expansion mode is improved to search surrounding nodes according to the obstacle characteristics of the robot. In the obstacle-crossing area, the obstacle-crossing and obstacle-bypassing modes are intelligently switched based on the principle of minimum energy consumption. Finally, a dual-mode robot navigation system is built, and the experimental results show that the proposed navigation system reduces the average energy consumption, path length, and steering angle by 16.8%, 24.7%, and 31.18%, respectively.
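Folding an energy model into the A* evaluation function f(n) = g(n) + h(n) can be sketched as below. This is a minimal 4-adjacency illustration with a made-up per-cell energy grid; the paper's improved eight-adjacency expansion and obstacle handling are omitted.

```python
import heapq

def energy_astar(grid_energy, start, goal):
    """A* over a grid where the step cost is the energy to enter a cell.

    g(n) accumulates entry energies; h(n) is the Manhattan distance
    scaled by the minimum cell energy, which keeps it admissible.
    Returns the minimum total energy from start to goal, or None.
    """
    rows, cols = len(grid_energy), len(grid_energy[0])
    e_min = min(min(row) for row in grid_energy)
    h = lambda r, c: e_min * (abs(r - goal[0]) + abs(c - goal[1]))
    open_set = [(h(*start), 0, start)]   # (f, g, cell)
    best = {start: 0}
    while open_set:
        _, g, (r, c) = heapq.heappop(open_set)
        if (r, c) == goal:
            return g
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                ng = g + grid_energy[nr][nc]      # energy-based step cost
                if ng < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = ng
                    heapq.heappush(open_set, (ng + h(nr, nc), ng, (nr, nc)))
    return None
```

Cells requiring costly obstacle-crossing get high energy values, so the planner detours around them exactly when bypassing is cheaper, mirroring the mode-switching principle of minimum energy consumption.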

JBHI Journal 2022 Journal Article

Dynamic Sepsis Prediction for Intensive Care Unit Patients Using XGBoost-Based Model With Novel Time-Dependent Features

  • Shuhui Liu
  • Bo Fu
  • Wen Wang
  • Mei Liu
  • Xin Sun

Sepsis is a systemic inflammatory response caused by pathogens such as bacteria. Because its pathogenesis is unclear, the clinical manifestations of patients vary greatly, and the alarming incidence and mortality pose a great threat to patients and medical systems, especially in the ICU (Intensive Care Unit). Traditional judgment criteria suffer from low specificity. Artificial intelligence models could greatly improve the accuracy of sepsis prediction and judgment. Based on the XGBoost machine learning framework, taking demographic, vital-sign, laboratory-test, and medical-intervention data as input, this paper proposes a novel model for dynamically predicting sepsis and assessing risk. To realize the model, two methods for feature construction are introduced. For the observed time-series data of vital signs and laboratory tests, the time-dependent method constructs time-dependent features after statistical screening. For the clinical intervention data, a statistical counting method is applied to construct count-dependent features. Moreover, a new objective function is proposed for the XGBoost framework, and the first-order and second-order gradients of the objective function are given for model training. Compared with current state-of-the-art methods, the proposed model has the best performance, with AUROC improved by 5.4% on the MIMIC-III dataset and 2.1% on the PhysioNet Challenge 2019 dataset. The model's data processing and training methods can be conveniently applied in different electronic health record systems and have wide application prospects.
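Time-dependent feature construction of the kind the abstract describes can be illustrated with simple windowed statistics over a vital-sign series. The specific statistics (window mean, standard deviation, first difference) are our examples, not the paper's screened feature set.

```python
from statistics import mean, stdev

def time_dependent_features(series, window=6):
    """Summarize the most recent observation window of a vital-sign series.

    Returns a feature dict a gradient-boosting model (e.g. XGBoost)
    could consume at each dynamic prediction step.
    """
    recent = series[-window:]
    return {
        "last": recent[-1],                               # latest value
        "win_mean": mean(recent),                         # window average
        "win_std": stdev(recent) if len(recent) > 1 else 0.0,
        "delta": recent[-1] - recent[0],                  # crude trend
    }
```

Recomputing such features as each new observation arrives is what makes the prediction "dynamic": the model sees a fresh snapshot of recent physiology at every step.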

AAAI Conference 2021 Conference Paper

Graph-Based Tri-Attention Network for Answer Ranking in CQA

  • Wei Zhang
  • Zeyuan Chen
  • Chao Dong
  • Wen Wang
  • Hongyuan Zha
  • Jianyong Wang

In community-based question answering (CQA) platforms, automatic answer ranking for a given question is critical for finding potentially popular answers in early times. The mainstream approaches learn to generate answer ranking scores based on the matching degree between question and answer representations as well as the influence of respondents. However, they encounter two main limitations: (1) Correlations between answers in the same question are often overlooked. (2) Question and respondent representations are built independently of specific answers before affecting answer representations. To address the limitations, we devise a novel graph-based tri-attention network, namely GTAN, which has two innovations. First, GTAN proposes to construct a graph for each question and learn answer correlations from each graph through graph neural networks (GNNs). Second, based on the representations learned from GNNs, an alternating tri-attention method is developed to alternatively build target-aware respondent representations, answer-specific question representations, and context-aware answer representations by attention computation. GTAN finally integrates the above representations to generate answer ranking scores. Experiments on three real-world CQA datasets demonstrate GTAN significantly outperforms state-of-the-art answer ranking methods, validating the rationality of the network architecture.

IJCAI Conference 2018 Conference Paper

Learning Sequential Correlation for User Generated Textual Content Popularity Prediction

  • Wen Wang
  • Wei Zhang
  • Jun Wang
  • Junchi Yan
  • Hongyuan Zha

Popularity prediction of user generated textual content is critical for prioritizing information in the web, which alleviates heavy information overload for ordinary readers. Most previous studies model each content instance separately for prediction and thus overlook the sequential correlations between instances of a specific user. In this paper, we go deeper into this problem based on two observations for each user, i.e., sequential content correlation and sequential popularity correlation. We propose a novel deep sequential model called User Memory-augmented recurrent Attention Network (UMAN). This model encodes the two correlations by updating external user memories, which are further leveraged for target text representation learning and popularity prediction. The experimental results on several real-world datasets validate the benefits of considering these correlations and demonstrate that UMAN achieves the best performance among several strong competitors.