Arrow Research search

Author name cluster

Zehao Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers
2 author rows

Possible papers


AAAI Conference 2026 Conference Paper

Multi-level Style Preference Optimization: An Adaptive Detection Framework for Human-Machine Hybrid Text

  • Zehao Wang
  • Lianwei Wu
  • Wenbo An
  • Hang Zhang
  • Yaxiong Wang

Large language model (LLM) generated texts now rival human quality, creating four text categories: purely machine-generated, machine-rewritten, machine-polished, and human-written content. Traditional detection methods face significant challenges in human-machine hybrid scenarios where LLMs perform rewriting or polishing, as existing approaches focus on single-level features and fail to capture subtle, multi-layered machine traces. To address this, we propose the Multi-level Style Preference Optimization (MSPO) framework, capturing machine style features at multiple granularities: sequence-level (overall consistency), phrase-level (distinctive n-gram patterns), and lexical-level (word selection distributions). We further incorporate four text complexity indicators (Type-Token Ratio, Average Sentence Length, Average Word Length, and Punctuation Ratio) to dynamically adjust optimization parameters based on human-machine text complexity differences, enhancing adaptability across diverse text types. Additionally, we construct a comprehensive detection dataset spanning three representative domains (scientific writing, news articles, and creative writing) across four text types (human-written, purely machine-generated, machine-rewritten, and machine-polished), generated using state-of-the-art LLMs for robust evaluation. Experimental results demonstrate that MSPO significantly outperforms existing methods across all text types. On the challenging rewritten texts, MSPO achieves up to 82.14% AUROC, representing an improvement of 11.15 percentage points over the strongest baseline ImBD, while maintaining robust cross-domain generalizability across scientific, news, and creative writing domains.
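The four text complexity indicators named in the abstract above can be illustrated with a minimal Python sketch. The abstract does not give exact definitions, so the tokenization rule, the sentence splitter, and the function name `complexity_indicators` are assumptions made for illustration, not the authors' implementation.

```python
import re
import string
from typing import Dict


def complexity_indicators(text: str) -> Dict[str, float]:
    """Compute four surface-level complexity indicators for a text.

    Assumed definitions (not taken from the paper):
      - words are runs of letters, digits, or apostrophes, lowercased
      - sentences are split on '.', '!' and '?'
      - punctuation ratio is punctuation characters over non-space characters
    """
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    punctuation = [ch for ch in text if ch in string.punctuation]
    non_space = [ch for ch in text if not ch.isspace()]
    return {
        # Distinct words divided by total words.
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
        # Mean number of words per sentence.
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        # Mean number of characters per word.
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        # Share of non-space characters that are punctuation.
        "punctuation_ratio": len(punctuation) / len(non_space) if non_space else 0.0,
    }
```

Calling `complexity_indicators("The cat sat. The dog sat!")` yields one value per indicator. How MSPO maps such values onto its optimization parameters is not described in the abstract and is not sketched here.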

AAAI Conference 2026 Conference Paper

Verb Mirage: Unveiling and Assessing Verb Concept Hallucinations in Multimodal Large Language Models

  • Zehao Wang
  • Xinpeng Liu
  • Yudonglin Zhang
  • Xiaoqian Wu
  • Zhou Fang
  • Yifan Fang
  • Junfu Pu
  • Cewu Lu

Multimodal Large Language Models (MLLMs) have garnered significant attention recently and demonstrate outstanding capabilities in various tasks such as OCR, VQA, and captioning. However, hallucination remains a persistent issue. While numerous methods have been proposed to mitigate hallucinations, achieving notable improvements, these methods primarily focus on mitigating hallucinations related to object/noun concepts. Verb concepts, which are crucial for understanding human actions, have been largely overlooked. In this paper, to the best of our knowledge, we are the first to investigate the verb hallucination phenomenon of MLLMs from various perspectives. Our findings reveal that most state-of-the-art MLLMs suffer from severe verb hallucination. To assess the effectiveness of existing mitigation methods for object concept hallucination in relation to verb hallucination, we evaluated these methods and found that they do not effectively address verb hallucination. To address this issue, we propose a baseline method based on fine-tuning with rich verb knowledge, achieving decent superiority. Experimental results demonstrate that our method significantly reduces hallucinations related to verbs.

IROS Conference 2025 Conference Paper

3DWSNet: A Novel 3D Wavelet Spiking Neural Network for Event-based Action Recognition

  • Junkang Fang
  • Yonghao Dang
  • Wending Zhao
  • Bo Yu
  • Zehao Wang
  • Jianqin Yin

In robotics applications, event cameras provide low-latency and high-dynamic-range sensing by asynchronously detecting brightness changes, making them well-suited for capturing fast motions and subtle cues in dynamic environments. However, most existing Spiking Neural Network (SNN)-based methods enhance spatial information by stacking multiple frames of events, while neglecting the explicit modeling of high- and low-frequency components in the event stream. To address this limitation, we propose a 3D Wavelet Spiking Neural Network (3DWSNet), which integrates a 3D wavelet transform with a cascaded Wavelet Spiking Convolution (WSC) module as its core. Specifically, the 3D wavelet transform decomposes input data into eight frequency sub-bands across spatial and temporal dimensions, enabling the model to preserve fine-grained high-frequency details while enriching low-frequency motion representations. The cascaded WSC architecture further improves the extraction of multi-scale spatio-temporal features by integrating information from feature maps at different resolutions. Extensive experiments show that our 3DWSNet significantly outperforms state-of-the-art SNN methods on the CIFAR-10, CIFAR-100, DVS128 Gesture, and CIFAR10-DVS datasets. The source code will be publicly released soon.
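The decomposition of a spatio-temporal volume into eight frequency sub-bands, as described in the abstract above, can be illustrated with a one-level 3D Haar transform. This is a sketch under assumptions: the abstract does not name the wavelet family, so Haar is chosen here for simplicity, and the function name `haar3d`, the band naming, and the averaging normalization are all hypothetical.

```python
from itertools import product


def haar3d(volume):
    """One-level 3D Haar decomposition of volume[t][y][x] (all dims even).

    Returns a dict mapping band names such as "LLL" or "HLH" (temporal,
    vertical, horizontal filter) to a half-resolution volume. Haar and the
    1/8 averaging normalization are illustrative assumptions.
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    if T % 2 or H % 2 or W % 2:
        raise ValueError("all dimensions must be even")
    bands = {}
    for ft, fy, fx in product("LH", repeat=3):
        out = [[[0.0] * (W // 2) for _ in range(H // 2)] for _ in range(T // 2)]
        for t, y, x in product(range(T // 2), range(H // 2), range(W // 2)):
            acc = 0.0
            # Visit the 2x2x2 block that maps to output position (t, y, x).
            for dt, dy, dx in product((0, 1), repeat=3):
                sign = 1
                # A high-pass filter subtracts the second sample of each pair.
                if ft == "H" and dt:
                    sign = -sign
                if fy == "H" and dy:
                    sign = -sign
                if fx == "H" and dx:
                    sign = -sign
                acc += sign * volume[2 * t + dt][2 * y + dy][2 * x + dx]
            out[t][y][x] = acc / 8.0
        bands[ft + fy + fx] = out
    return bands
```

For a volume that is constant in space but changes between two frames, only the "LLL" band (the local average) and the "HLL" band (the temporal difference) are non-zero, which reflects the separation of low-frequency content from temporal high-frequency detail.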

TMLR Journal 2025 Journal Article

Diversity-Driven View Subset Selection for Indoor Novel View Synthesis

  • Zehao Wang
  • Han Zhou
  • Matthew B. Blaschko
  • Tinne Tuytelaars
  • Minye Wu

Novel view synthesis of indoor scenes can be achieved by capturing a monocular video sequence of the environment. However, redundant information caused by artificial movements in the input video data reduces the efficiency of scene modeling. To address this, we formulate the problem as a combinatorial optimization task for view subset selection. In this work, we propose a novel subset selection framework that integrates a comprehensive diversity-based measurement with well-designed utility functions. We provide a theoretical analysis of these utility functions and validate their effectiveness through extensive experiments. Furthermore, we introduce IndoorTraj, a novel dataset designed for indoor novel view synthesis, featuring complex and extended trajectories that simulate intricate human behaviors. Experiments on IndoorTraj show that our framework consistently outperforms baseline strategies while using only 5–20% of the data, highlighting its remarkable efficiency and effectiveness.

ICLR Conference 2025 Conference Paper

Knowledge Graph Finetuning Enhances Knowledge Manipulation in Large Language Models

  • Hanzhu Chen
  • Xu Shen 0001
  • Jie Wang 0005
  • Zehao Wang
  • Qitan Lv
  • Junjie He
  • Rong Wu
  • Feng Wu 0001

Despite the impressive performance of general large language models (LLMs), many of their applications in specific domains (e.g., low-data and knowledge-intensive) still confront significant challenges. Supervised fine-tuning (SFT), where a general LLM is further trained on a small labeled dataset to adapt to specific tasks or domains, has shown great power for developing domain-specific LLMs. However, existing SFT data primarily consist of Question and Answer (Q&A) pairs, which makes it difficult for LLMs to comprehend the correlation and logic of knowledge underlying the Q&A. To address this challenge, we propose a conceptually flexible and general framework to boost SFT, namely Knowledge Graph-Driven Supervised Fine-Tuning (KG-SFT). The key idea of KG-SFT is to generate high-quality explanations for each Q&A pair via a structured knowledge graph to enhance the knowledge comprehension and manipulation of LLMs. Specifically, KG-SFT consists of three components: Extractor, Generator, and Detector. For a given Q&A pair, (i) Extractor first identifies entities within the Q&A pair and extracts relevant reasoning subgraphs from external KGs, (ii) Generator then produces corresponding fluent explanations utilizing these reasoning subgraphs, and (iii) finally, Detector performs sentence-level knowledge conflict detection on these explanations to guarantee their reliability. KG-SFT focuses on generating high-quality explanations to improve the quality of Q&A pairs, which reveals a promising direction for supplementing existing data augmentation methods. Extensive experiments on fifteen different domains and six different languages demonstrate the effectiveness of KG-SFT, leading to an accuracy improvement of up to 18% and an average of 8.7% in low-data scenarios.

NeurIPS Conference 2025 Conference Paper

LogicTree: Improving Complex Reasoning of LLMs via Instantiated Multi-step Synthetic Logical Data

  • Zehao Wang
  • Lin Yang
  • Jie Wang
  • Kehan Wang
  • Hanzhu Chen
  • Bin Wang
  • Jianye Hao
  • Defu Lian

Despite their remarkable performance on various tasks, Large Language Models (LLMs) still struggle with logical reasoning, particularly in complex and multi-step reasoning processes. Among various efforts to enhance LLMs' reasoning capabilities, synthesizing large-scale, high-quality logical reasoning datasets has emerged as a promising direction. However, existing methods often rely on predefined templates for logical reasoning data generation, limiting their adaptability to real-world scenarios. To address this limitation, we propose LogicTree, a novel framework for efficiently synthesizing multi-step logical reasoning datasets that excel in both complexity and instantiation. By iteratively searching for applicable logic rules based on structural pattern matching to perform backward deduction, LogicTree constructs multi-step logic trees that capture complex reasoning patterns. Furthermore, we employ a two-stage LLM-based approach to instantiate various real-world scenarios for each logic tree, generating consistent real-world reasoning processes that carry contextual significance. This helps LLMs develop generalizable logical reasoning abilities across diverse scenarios rather than merely memorizing templates. Experiments on multiple benchmarks demonstrate that our approach achieves an average improvement of 9.4% in accuracy on complex logical reasoning tasks.

ICRA Conference 2024 Conference Paper

K-BMPC: Derivative-based Koopman Bilinear Model Predictive Control For Tractor-trailer Trajectory Tracking With Unknown Parameters

  • Zehao Wang
  • Han Zhang 0056
  • Jingchuan Wang

Nonlinear dynamics bring difficulties to controller design for control-affine systems such as tractor-trailer vehicles, especially when the parameters in the dynamics are unknown. To address this constraint, we propose a derivative-based lifting function construction method and show that the corresponding infinite-dimensional Koopman bilinear model over the lifting function is equivalent to the original control-affine system. Further, we analyze the propagation and bounds of state prediction errors caused by the truncation in derivative order. The identified finite-dimensional Koopman bilinear model then serves as the predictive model in the next step. Koopman Bilinear Model Predictive Control (K-BMPC) is proposed to solve the trajectory tracking problem. We linearize the bilinear model around the estimation of the lifted state and control input. Then the bilinear Model Predictive Control problem is approximated by a quadratic programming problem. Further, the estimation is updated at each iteration until convergence is reached. Moreover, we implement our algorithm on a tractor-trailer system, taking into account the longitudinal and side slip effects. The open-loop simulation shows that the proposed Koopman bilinear model captures the dynamics with unknown parameters and has good prediction performance. Closed-loop tracking results show that the proposed K-BMPC exhibits elevated tracking precision with commendable computational efficiency. The experimental results demonstrate the feasibility of K-BMPC.

EAAI Journal 2023 Journal Article

Cross-modal information balance-aware reasoning network for image-text retrieval

  • Xueyang Qin
  • Lishuang Li
  • Fei Hao
  • Guangyao Pang
  • Zehao Wang

As a fundamental multimodal task, image-text retrieval bridges the gap between vision and language. Current mainstream methods exploit attention mechanisms to discover potential alignments between visual regions and textual words while ignoring the imbalance of image-text information. To this end, we propose a Cross-modal Information Balance-aware Reasoning Network (CIBRN), adopting information balance and similarity reasoning mechanisms to distinguish matched and unmatched image-text pairs. Specifically, a two-stage information balance scheme is employed to balance image-text information. In the first stage, a Graph Convolutional Network (GCN) with multiple convolution kernels is used to convert elements that only exist in a single modality into common elements to achieve intra-modal information balance indirectly. In the second stage, we propose an information “Add-Reduce” mechanism to realize inter-modal information balance by adding a random feature based on Gaussian distribution to each textual “word” and reducing fixed-length information from each visual “region”. Subsequently, a block-based hierarchical matching method and mean-based fully connected layers are proposed to reason the relevance of images and texts. Extensive experiments on two benchmark datasets, i.e., Flickr30K and MSCOCO, demonstrate the effectiveness of the proposed model CIBRN and achieve advanced results compared to the state-of-the-art method, with a gain of 7.0% and 3.0% on rSum, respectively.

AAAI Conference 2023 Conference Paper

Layout-Aware Dreamer for Embodied Visual Referring Expression Grounding

  • Mingxiao Li
  • Zehao Wang
  • Tinne Tuytelaars
  • Marie-Francine Moens

In this work, we study the problem of Embodied Referring Expression Grounding, where an agent needs to navigate in a previously unseen environment and localize a remote object described by a concise high-level natural language instruction. When facing such a situation, a human tends to imagine what the destination may look like and to explore the environment based on prior knowledge of the environmental layout, such as the fact that a bathroom is more likely to be found near a bedroom than a kitchen. We have designed an autonomous agent called Layout-aware Dreamer (LAD), including two novel modules, that is, the Layout Learner and the Goal Dreamer, to mimic this cognitive decision process. The Layout Learner learns to infer the room category distribution of neighboring unexplored areas along the path for coarse layout estimation, which effectively introduces layout common sense of room-to-room transitions to our agent. To learn an effective exploration of the environment, the Goal Dreamer imagines the destination beforehand. Our agent achieves new state-of-the-art performance on the public leaderboard of the REVERIE dataset in challenging unseen test environments, improving navigation success rate (SR) by 4.02% and remote grounding success (RGS) by 3.43% compared to the previous state of the art. The code is released at https://github.com/zehao-wang/LAD.

ICLR Conference 2022 Conference Paper

Sparsity Winning Twice: Better Robust Generalization from More Efficient Training

  • Tianlong Chen 0001
  • Zhenyu Zhang 0015
  • Pengjun Wang
  • Santosh Balachandra
  • Haoyu Ma
  • Zehao Wang
  • Zhangyang Wang

Recent studies demonstrate that deep networks, even robustified by the state-of-the-art adversarial training (AT), still suffer from large robust generalization gaps, in addition to the much more expensive training costs than standard training. In this paper, we investigate this intriguing problem from a new perspective, i.e., injecting appropriate forms of sparsity during adversarial training. We introduce two alternatives for sparse adversarial training: (i) static sparsity, by leveraging recent results from the lottery ticket hypothesis to identify critical sparse subnetworks arising from the early training; (ii) dynamic sparsity, by allowing the sparse subnetwork to adaptively adjust its connectivity pattern (while sticking to the same sparsity ratio) throughout training. We find both static and dynamic sparse methods to yield a win-win: substantially shrinking the robust generalization gap and alleviating the robust overfitting, meanwhile significantly saving training and inference FLOPs. Extensive experiments validate our proposals with multiple network architectures on diverse datasets, including CIFAR-10/100 and Tiny-ImageNet. For example, our methods reduce robust generalization gap and overfitting by 34.44% and 4.02%, with comparable robust/standard accuracy boosts and 87.83%/87.82% training/inference FLOPs savings on CIFAR-100 with ResNet-18. Besides, our approaches can be organically combined with existing regularizers, establishing new state-of-the-art results in AT. All code is included.