Arrow Research

Author name cluster

Wenhao Jiang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

23 papers
2 author rows

Possible papers (23)

NeurIPS Conference 2025 Conference Paper

AdmTree: Compressing Lengthy Context with Adaptive Semantic Trees

  • Yangning Li
  • Shaoshen Chen
  • Yinghui Li
  • Yankai Chen
  • Hai-Tao Zheng
  • Hui Wang
  • Wenhao Jiang
  • Philip S Yu

The quadratic complexity of self-attention limits Large Language Models (LLMs) in processing long contexts, a capability vital for many advanced applications. Context compression aims to mitigate this computational barrier while preserving essential semantic information. However, existing methods often falter: explicit methods can sacrifice local detail, while implicit ones may exhibit positional biases, struggle with information degradation, or fail to capture long-range semantic dependencies. We introduce AdmTree, a novel framework for adaptive, hierarchical context compression designed with a core focus on maintaining high semantic fidelity while keeping efficiency. AdmTree dynamically segments input based on information density, employing gist tokens to summarize variable-length segments as leaves in a semantic binary tree. This structure, combined with a lightweight aggregation mechanism and a frozen backbone LLM (minimizing new trainable parameters), enables efficient hierarchical abstraction of the context. By effectively preserving fine-grained details alongside global semantic coherence, mitigating position bias, and adapting dynamically to content, AdmTree comprehensively preserves the semantic information of lengthy contexts.
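
To make the hierarchy concrete, here is a minimal sketch of the tree-building idea described above, under heavy simplifying assumptions: information density is approximated by per-token embedding norms, and the learned gist tokens are replaced by mean pooling. Function names and the budget parameter are illustrative, not from the paper.

```python
import numpy as np

def segment_by_density(embs, budget=4):
    """Split a (T, d) embedding sequence into `budget` contiguous segments,
    placing boundaries so each segment carries roughly equal total "density"
    (here: L2 norm per token, a stand-in for information density)."""
    density = np.linalg.norm(embs, axis=1)
    cum = np.cumsum(density) / density.sum()
    bounds = [int(np.searchsorted(cum, k / budget)) for k in range(1, budget)]
    return np.split(embs, sorted(set(b for b in bounds if 0 < b < len(embs))))

def build_tree(segments):
    """Aggregate leaf gists bottom-up into a binary tree; each parent is a
    lightweight aggregation (mean) of its children, mimicking hierarchical
    abstraction over the context."""
    level = [seg.mean(axis=0) for seg in segments]   # leaf gists
    tree = [level]
    while len(level) > 1:
        level = [np.mean(level[i:i + 2], axis=0) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree          # tree[-1][0] is the root summary of the whole context

embs = np.random.randn(128, 16)                      # toy "token embeddings"
tree = build_tree(segment_by_density(embs))
print(len(tree), tree[-1][0].shape)
```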

AAAI Conference 2025 Conference Paper

EXCGEC: A Benchmark for Edit-Wise Explainable Chinese Grammatical Error Correction

  • Jingheng Ye
  • Shang Qin
  • Yinghui Li
  • Xuxin Cheng
  • Libo Qin
  • Hai-Tao Zheng
  • Ying Shen
  • Peng Xing

Existing studies explore the explainability of Grammatical Error Correction (GEC) only in limited scenarios: they ignore the interaction between corrections and explanations and have not established a corresponding comprehensive benchmark. To bridge the gap, this paper first introduces the task of EXplainable GEC (EXGEC), which focuses on the integral role of correction and explanation tasks. To facilitate the task, we propose EXCGEC, a tailored benchmark for Chinese EXGEC consisting of 8,216 explanation-augmented samples featuring the design of hybrid edit-wise explanations. We then benchmark several series of LLMs in multi-task learning settings, including post-explaining and pre-explaining. To promote the development of the task, we also build a comprehensive evaluation suite by leveraging existing automatic metrics and conducting human evaluation experiments to demonstrate the human consistency of the automatic metrics for free-text explanations. Our experiments reveal the effectiveness of evaluating free-text explanations using traditional metrics like METEOR and ROUGE, and the inferior performance of multi-task models compared to the pipeline solution, indicating the challenge of establishing positive transfer between the two tasks.

NeurIPS Conference 2025 Conference Paper

SCOUT: Teaching Pre-trained Language Models to Enhance Reasoning via Flow Chain-of-Thought

  • Guanghao Li
  • Wenhao Jiang
  • Mingfeng Chen
  • Yan Li
  • Hao Yu
  • Shuting Dong
  • Tao Ren
  • Ming Tang

Chain-of-Thought (CoT) prompting improves the reasoning performance of large language models (LLMs) by encouraging step-by-step thinking. However, CoT-based methods depend on intermediate reasoning steps, which limits scalability and generalization. Recent work explores recursive reasoning, where LLMs reuse internal layers across iterations to refine latent representations without explicit CoT supervision. While promising, these approaches often require costly pretraining and lack a principled framework for how reasoning should evolve across iterations. We address this gap by introducing Flow Chain-of-Thought (Flow CoT), a reasoning paradigm that models recursive inference as a progressive trajectory of latent cognitive states. Flow CoT frames each iteration as a distinct cognitive stage, deepening reasoning across iterations without relying on manual supervision. To realize this, we propose SCOUT (Stepwise Cognitive Optimization Using Teachers), a lightweight fine-tuning framework that enables Flow CoT-style reasoning without the need for pretraining. SCOUT uses progressive distillation to align each iteration with a teacher of appropriate capacity, and a cross-attention-based retrospective module that integrates outputs from previous iterations while preserving the model’s original computation flow. Experiments across eight reasoning benchmarks show that SCOUT consistently improves both accuracy and explanation quality, achieving up to 1.8% gains under fine-tuning. Qualitative analyses further reveal that SCOUT enables progressively deeper reasoning across iterations, refining both belief formation and explanation granularity. These results not only validate the effectiveness of SCOUT, but also demonstrate the practical viability of Flow CoT as a scalable framework for enhancing reasoning in LLMs.
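
A rough torch sketch of this recipe as we read the abstract: one shared block is applied recursively, a retrospective cross-attention module mixes in the states of earlier iterations, and each iteration's output is distilled toward a progressively stronger teacher. Module choices, sizes, and the stand-in teacher targets are all assumptions for illustration.

```python
import torch, torch.nn as nn

class RecursiveReasoner(nn.Module):
    def __init__(self, d=64, iters=3):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.retro = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.iters = iters

    def forward(self, x):
        states, h = [], x
        for _ in range(self.iters):
            h = self.block(h)
            if states:  # retrospective module: attend over earlier iterations
                mem = torch.cat(states, dim=1)
                h = h + self.retro(h, mem, mem, need_weights=False)[0]
            states.append(h)
        return states   # one latent "cognitive state" per iteration

model = RecursiveReasoner()
states = model(torch.randn(2, 10, 64))

# Progressive distillation: iteration k is aligned with teacher k, with
# stronger teachers for later iterations; random tensors stand in for
# actual teacher outputs here.
teachers = [torch.randn_like(s) for s in states]
loss = sum(nn.functional.mse_loss(s, t) for s, t in zip(states, teachers))
loss.backward()
```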

ICML Conference 2025 Conference Paper

TimeStacker: A Novel Framework with Multilevel Observation for Capturing Nonstationary Patterns in Time Series Forecasting

  • Qinglong Liu
  • Cong Xu 0004
  • Wenhao Jiang
  • Kaixuan Wang
  • Lin Ma 0003
  • Haifeng Li 0001

Real-world time series inherently exhibit significant non-stationarity, posing substantial challenges for forecasting. To address this issue, this paper proposes a novel prediction framework, TimeStacker, designed to overcome the limitations of existing models in capturing the characteristics of non-stationary signals. By employing a unique stacking mechanism, TimeStacker effectively captures global signal features while thoroughly exploring local details. Furthermore, the framework integrates a frequency-based self-attention module, significantly enhancing its feature modeling capabilities. Experimental results demonstrate that TimeStacker achieves outstanding performance across multiple real-world datasets, including those from the energy, finance, and weather domains. It not only delivers superior predictive accuracy but also exhibits remarkable advantages with fewer parameters and higher computational efficiency.
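
As a toy illustration of the two ingredients named above (multilevel observation and frequency-domain attention), the sketch below views a series at several stacked resolutions and runs plain scaled dot-product attention over rFFT coefficients instead of time steps. This is our simplified reading, not the paper's actual formulation.

```python
import numpy as np

def freq_self_attention(x):
    """x: (T,) series. Attend over the frequency components of its rFFT."""
    X = np.fft.rfft(x)                                 # complex spectrum
    q = k = v = np.stack([X.real, X.imag], axis=1)     # (F, 2) features
    scores = q @ k.T / np.sqrt(q.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                  # softmax over frequencies
    out = w @ v
    return np.fft.irfft(out[:, 0] + 1j * out[:, 1], n=len(x))

t = np.linspace(0, 8 * np.pi, 256)
x = np.sin(t) + 0.3 * np.sin(5 * t)                    # toy multi-scale signal
levels = [x[::s] for s in (1, 2, 4)]                   # stacked multilevel views
outs = [freq_self_attention(lv) for lv in levels]
print([o.shape for o in outs])
```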

JBHI Journal 2024 Journal Article

Insula-Medial Prefrontal Cortex Functional Connectivity Modulated by Transcutaneous Auricular Vagus Nerve Stimulation: An fMRI Study

  • Yujiao Zhang
  • Pan Lin
  • Ruimin Wang
  • Jiang Zhou
  • Xiaoquan Xu
  • Wenhao Jiang
  • Xiongying Pu
  • Sheng Ge

Transcutaneous auricular vagus nerve stimulation (taVNS) is an emerging neuromodulation technology that has been reported to be beneficial in the treatment of diseases by several studies, but its exact mechanism of action is still unclear. It has been demonstrated that taVNS can influence interoceptive signals. Notably, the processing of interoceptive signals is directly related to many diseases, such as depression, anxiety, and insomnia. The insula and the medial prefrontal cortex (MPFC) communicate during the bottom-up transmission of taVNS-induced signals, and both play a role in interoceptive signal processing. By focusing on the insula and MPFC, our research is the first to detail the potential interactions between interoceptive signal processing and the neuromodulation effects of taVNS, providing novel insights into the neurobiological mechanisms of taVNS. Two functional connectivity (FC) analyses (region of interest-based and seed-based) were used in this study. We observed that negative connectivity between the insula and the MPFC was significantly weakened following taVNS, while there were no statistical changes in the sham group. Our findings elucidate potential mechanisms linking vagal activity with intrinsic FC among specific brain regions and networks. Specifically, our results indicate that taVNS may enhance the ability to flexibly balance interoceptive awareness and cognitive experiences by modulating the FC between the insula and MPFC. The modulation effects may impact body-brain interactions, suggesting the mechanism of taVNS in therapeutic applications.

ICLR Conference 2024 Conference Paper

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

  • Bin Zhu
  • Bin Lin 0014
  • Munan Ning
  • Yang Yan
  • Jiaxi Cui
  • Hongfa Wang
  • Yatian Pang
  • Wenhao Jiang

Video-language (VL) pretraining has achieved remarkable improvement in multiple downstream tasks. However, the current VL pretraining framework is hard to extend to multiple modalities (N modalities, N ≥ 3) beyond vision and language. We thus propose LanguageBind, taking language as the bind across different modalities, because the language modality is well-explored and contains rich semantics. Specifically, we freeze the language encoder acquired by VL pretraining and then train encoders for other modalities with contrastive learning. As a result, all modalities are mapped to a shared feature space, implementing multi-modal semantic alignment. While LanguageBind ensures that we can extend VL modalities to N modalities, we also need a high-quality dataset with alignment data pairs centered on language. We thus propose VIDAL-10M, a dataset of 10 million samples pairing video, infrared, depth, and audio with their corresponding language. In our VIDAL-10M, all videos are from short video platforms with complete semantics rather than truncated segments from long videos, and all the video, depth, infrared, and audio modalities are aligned to their textual descriptions. LanguageBind has achieved superior performance on a wide range of 15 benchmarks covering video, audio, depth, and infrared. Moreover, multiple experiments have provided evidence for the effectiveness of LanguageBind in achieving indirect alignment and complementarity among diverse modalities.
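
The core training recipe (freeze the language tower, train each new modality tower contrastively against it) is easy to sketch. Below, toy linear encoders stand in for the real architectures, and a symmetric InfoNCE loss pulls paired depth/caption features together; everything here is illustrative.

```python
import torch
import torch.nn.functional as F

d = 32
text_encoder = torch.nn.Linear(64, d)
for p in text_encoder.parameters():
    p.requires_grad = False                 # language tower stays frozen

depth_encoder = torch.nn.Linear(48, d)      # one of the N new modality towers

texts, depths = torch.randn(8, 64), torch.randn(8, 48)
zt = F.normalize(text_encoder(texts), dim=-1)
zd = F.normalize(depth_encoder(depths), dim=-1)

logits = zd @ zt.t() / 0.07                 # temperature-scaled similarities
labels = torch.arange(8)                    # i-th depth pairs with i-th caption
loss = (F.cross_entropy(logits, labels) +
        F.cross_entropy(logits.t(), labels)) / 2
loss.backward()                             # only the depth tower gets gradients
```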

ECAI Conference 2024 Conference Paper

MCAHNN: Multi-Channel EEG Emotion Recognition Using Attention Mechanism Based on Householder Reflection

  • Qinglong Liu
  • Wenhao Jiang
  • Shihang Ding
  • Kaixuan Wang
  • Hongjian Bo
  • Cong Xu 0004
  • Lin Ma 0003
  • Haifeng Li 0001

Emotions are integral to human cognition, exerting a profound influence on physiological responses, cognitive processes, and decision-making capabilities. Electroencephalography (EEG)-based emotion classification provides a significant methodological approach for the exploration of emotional states. Despite its potential, most current methodologies face challenges in delineating the representational patterns across different brain regions and in effectively classifying emotions from EEG signals. In response, a novel model for emotion recognition is proposed in this paper, which utilizes a multi-channel attention mechanism, designated as MCAHNN. This model incorporates Householder Reflection to enhance the attention mechanism, facilitating the extraction of inter-channel EEG features and simulating inter-regional brain dynamics. Furthermore, 1D convolution is employed to analyze intra-channel relationships. The proposed model has been evaluated on the publicly available DEAP dataset and further tested on the SEED dataset. Experimental results confirm that the MCAHNN model achieves state-of-the-art performance, demonstrating its effectiveness in classifying emotions within multi-center datasets. Code is publicly available at https://github.com/Oreoreoreor/MCAHNN.
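
For readers unfamiliar with the construction, a Householder reflection H = I - 2vv^T/(v^Tv) is an orthogonal transform defined by a single vector. The sketch below applies one across EEG channels before a plain channel-wise attention step; this is only our reading of the abstract, and the real model's parameterization will differ.

```python
import numpy as np

def householder(v):
    v = v / np.linalg.norm(v)
    return np.eye(len(v)) - 2.0 * np.outer(v, v)   # orthogonal: H @ H.T = I

rng = np.random.default_rng(0)
C, T = 32, 128                       # EEG channels x time samples
x = rng.standard_normal((C, T))

H = householder(rng.standard_normal(C))
xr = H @ x                           # reflect along the channel dimension

# scaled dot-product attention across channels on the reflected features
scores = xr @ xr.T / np.sqrt(T)
w = np.exp(scores - scores.max(axis=1, keepdims=True))
w /= w.sum(axis=1, keepdims=True)
out = w @ x                          # channel-mixed representation, (C, T)
print(np.allclose(H @ H.T, np.eye(C)), out.shape)
```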

ICML Conference 2022 Conference Paper

DynaMixer: A Vision MLP Architecture with Dynamic Mixing

  • Ziyu Wang
  • Wenhao Jiang
  • Yiming Zhu
  • Li Yuan 0007
  • Yibing Song
  • Wei Liu 0005

Recently, MLP-like vision models have achieved promising performances on mainstream visual recognition tasks. In contrast with vision transformers and CNNs, the success of MLP-like models shows that simple information fusion operations among tokens and channels can yield a good representation power for deep recognition models. However, existing MLP-like models fuse tokens through static fusion operations, lacking adaptability to the contents of the tokens to be mixed. Thus, customary information fusion procedures are not effective enough. To this end, this paper presents an efficient MLP-like network architecture, dubbed DynaMixer, resorting to dynamic information fusion. Critically, we propose a procedure, on which the DynaMixer model relies, to dynamically generate mixing matrices by leveraging the contents of all the tokens to be mixed. To reduce the time complexity and improve the robustness, a dimensionality reduction technique and a multi-segment fusion mechanism are adopted. Our proposed DynaMixer model (97M parameters) achieves 84.3% top-1 accuracy on the ImageNet-1K dataset without extra training data, performing favorably against the state-of-the-art vision MLP models. When the number of parameters is reduced to 26M, it still achieves 82.7% top-1 accuracy, surpassing the existing MLP-like models with a similar capacity. The code is available at https://github.com/ziyuwwang/DynaMixer.
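
The contrast with static mixing is easy to show in code: below, the token-mixing matrix is generated from the tokens themselves after a per-token dimensionality reduction, which is the gist of the dynamic mixing described above. Shapes, the reduction size, and the softmax normalization are illustrative choices; the multi-segment fusion mechanism is omitted.

```python
import torch, torch.nn as nn

class DynaMix(nn.Module):
    def __init__(self, n_tokens, dim, red=2):
        super().__init__()
        self.reduce = nn.Linear(dim, red)              # per-token reduction
        self.gen = nn.Linear(n_tokens * red, n_tokens * n_tokens)

    def forward(self, x):                              # x: (B, N, D)
        B, N, _ = x.shape
        z = self.reduce(x).reshape(B, -1)              # (B, N * red)
        M = self.gen(z).reshape(B, N, N).softmax(-1)   # content-dependent mixing
        return M @ x                                   # dynamic token fusion

mixer = DynaMix(n_tokens=16, dim=64)
y = mixer(torch.randn(2, 16, 64))
print(y.shape)                                         # torch.Size([2, 16, 64])
```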

ICML Conference 2022 Conference Paper

VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix

  • Teng Wang 0007
  • Wenhao Jiang
  • Zhichao Lu
  • Feng Zheng 0001
  • Ran Cheng 0004
  • Chengguo Yin
  • Ping Luo 0002

Existing vision-language pre-training (VLP) methods primarily rely on paired image-text datasets, which are either annotated with enormous human labor or crawled from the internet and then subjected to elaborate data cleaning. To reduce the dependency on well-aligned image-text pairs, it is promising to directly leverage large-scale text-only and image-only corpora. This paper proposes a data augmentation method, namely cross-modal CutMix (CMC), for implicit cross-modal alignment learning in unpaired VLP. Specifically, CMC transforms natural sentences in the textual view into a multi-modal view, where visually-grounded words in a sentence are randomly replaced by diverse image patches with similar semantics. There are several appealing properties of the proposed CMC. First, it enhances data diversity while keeping the semantic meaning intact for tackling problems where aligned data are scarce; second, by attaching cross-modal noise to uni-modal data, it guides models to learn token-level interactions across modalities for better denoising. Furthermore, we present a new unpaired VLP method, dubbed VLMixer, that integrates CMC with contrastive learning to pull together the uni-modal and multi-modal views for better instance-level alignment among different modalities. Extensive experiments on five downstream tasks show that VLMixer surpasses previous state-of-the-art unpaired VLP methods.
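
A minimal sketch of the CMC augmentation as described: with some probability, each visually-grounded word is swapped for an image-patch embedding with matching semantics, yielding a multi-modal view of the sentence. The grounded-word set and patch bank below are toy stand-ins for the paper's pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
vocab = {"dog": 0, "runs": 1, "on": 2, "grass": 3}
word_emb = rng.standard_normal((len(vocab), d))
patch_bank = {w: rng.standard_normal(d) for w in ("dog", "grass")}  # grounded

def cross_modal_cutmix(tokens, p=0.5):
    """Return a multi-modal sequence: each grounded word is replaced, with
    probability p, by a semantically matching image-patch embedding."""
    seq = []
    for w in tokens:
        if w in patch_bank and rng.random() < p:
            seq.append(("patch", patch_bank[w]))       # cross-modal noise
        else:
            seq.append(("word", word_emb[vocab[w]]))
    return seq

mixed = cross_modal_cutmix(["dog", "runs", "on", "grass"])
print([kind for kind, _ in mixed])
```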

IJCAI Conference 2021 Conference Paper

Multi-Target Invisibly Trojaned Networks for Visual Recognition and Detection

  • Xinzhe Zhou
  • Wenhao Jiang
  • Sheng Qi
  • Yadong Mu

Visual backdoor attack is a recently emerging task which aims to implant trojans in a deep neural model. A trojaned model responds to a trojan-invoking trigger in a fully predictable manner while functioning normally otherwise. As a key motivating fact for this work, most triggers adopted in existing methods, such as a learned patterned block that overlays a benign image, can be easily noticed by humans. In this work, we take image recognition and detection as the demonstration tasks, building trojaned networks that are significantly less human-perceptible and can simultaneously attack multiple targets in an image. The main technical contributions are twofold. First, under a relaxed attack mode, we formulate trigger embedding as an image steganography-and-steganalysis problem that conceals a secret image in another image in a decipherable and almost invisible way. Specifically, a variable number of different triggers can be encoded into the same secret image and fed to an encoder module that performs the steganography. Second, we propose a generic split-and-merge scheme for training a trojaned model. Neurons are split into two sets, trained either for normal image recognition / detection or for trojaning the model. To merge them, we propose the novel approach of hiding trojan neurons within the nullspace of the normal ones, such that the two sets do not interfere with each other and the resultant model exhibits similar parameter statistics to a clean model. Comprehensive experiments are conducted on the datasets PASCAL VOC and Microsoft COCO (for detection) and a subset of ImageNet (for recognition). All results clearly demonstrate the effectiveness of our proposed visual trojan method.
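
The nullspace trick admits a small numeric illustration: a weight perturbation drawn from the nullspace of the clean-input subspace leaves clean behavior unchanged while altering responses to inputs outside that subspace (as a trigger would be). This is a simplified linear reading, not the paper's training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 8
X = rng.standard_normal((n, 3)) @ rng.standard_normal((3, d))   # clean inputs (rank 3)
W = rng.standard_normal((4, d))                                 # normal neurons

# basis of the nullspace of the clean-input subspace
_, s, Vt = np.linalg.svd(X, full_matrices=True)
null = Vt[(s > 1e-10).sum():]          # directions the clean data never uses

D = rng.standard_normal((4, null.shape[0])) @ null   # trojan hidden in nullspace
W_troj = W + D

print(np.allclose(X @ W.T, X @ W_troj.T))       # True: clean outputs unchanged
trigger = null[0]                               # an input outside the clean span
print(np.allclose(trigger @ W.T, trigger @ W_troj.T))   # False: trojan responds
```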

IJCAI Conference 2021 Conference Paper

Self-Supervised Video Action Localization with Adversarial Temporal Transforms

  • Guoqiang Gong
  • Liangfeng Zheng
  • Wenhao Jiang
  • Yadong Mu

Weakly-supervised temporal action localization aims to locate intervals of action instances with only video-level action labels for training. However, the localization results generated from video classification networks are often inaccurate due to the lack of temporal boundary annotations for actions. Our motivating insight is that the temporal boundary of an action should be stably predicted under various temporal transforms. This inspires a self-supervised equivariant transform consistency constraint. We design a set of temporal transform operations, ranging from naive temporal down-sampling to learnable attention-piloted time warping. In our model, a localization network aims to perform well under all transforms, and a policy network is designed to choose, at each iteration, a temporal transform that adversarially yields localization results inconsistent with those of the localization network. Additionally, we devise a self-refine module to enhance the completeness of action intervals by harnessing temporal and semantic contexts. Experimental results on THUMOS14 and ActivityNet demonstrate that our model consistently outperforms state-of-the-art weakly-supervised temporal action localization methods.
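
The consistency constraint itself is compact: predictions should commute with the temporal transform. In the sketch below, a toy convolutional localizer is penalized when transforming-then-predicting disagrees with predicting-then-transforming; the adversarial policy network that picks the hardest transform is omitted.

```python
import torch, torch.nn as nn

f = nn.Conv1d(1, 1, kernel_size=5, padding=2)    # toy localization network
g = lambda t: t[..., ::2]                        # a temporal transform (downsample)

x = torch.randn(1, 1, 64)                        # (batch, channel, time)
pred_then_transform = g(torch.sigmoid(f(x)))     # transform the prediction
transform_then_pred = torch.sigmoid(f(g(x)))     # predict on the transformed clip
consistency = nn.functional.mse_loss(pred_then_transform, transform_then_pred)
consistency.backward()     # train the localizer to be stable under g
```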

AAAI Conference 2020 Conference Paper

Recurrent Nested Model for Sequence Generation

  • Wenhao Jiang
  • Lin Ma
  • Wei Lu

Depth has been shown to be beneficial to neural network models. In this paper, we attempt to deepen the encoder-decoder model for sequence generation. We propose a module that can be plugged in between the encoder and decoder to increase the depth of the whole model. The proposed module follows a nested structure, which is divided into blocks, with each block containing several recurrent transition steps. To reduce the training difficulty and preserve the necessary information for the decoder during transitions, inter-block connections and intra-block connections are constructed in our model. The inter-block connections provide the thought vectors from the current block to all the subsequent blocks. The intra-block connections connect all the hidden states entering the current block to the current transition step. The advantages of our model are illustrated on the image captioning and code captioning tasks.
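
A compact sketch of the nested module as described, with illustrative dimensions and a GRU cell standing in for the recurrent transition: each block receives an aggregate of the thought vectors from all earlier blocks (inter-block), and each transition step inside a block re-reads the state that entered it (intra-block, simplified here).

```python
import torch, torch.nn as nn

d, steps, blocks = 32, 2, 3
cell = nn.GRUCell(d, d)

thought = torch.randn(1, d)                   # encoder output ("thought vector")
block_outputs = [thought]
for _ in range(blocks):
    h = torch.stack(block_outputs).mean(0)    # inter-block: all prior blocks
    entering = h
    for _ in range(steps):                    # intra-block transition steps
        h = cell(entering, h)                 # each step sees the block input
    block_outputs.append(h)

decoder_init = block_outputs[-1]              # handed to the decoder
print(decoder_init.shape)
```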

AAAI Conference 2020 Conference Paper

Temporally Grounding Language Queries in Videos by Contextual Boundary-Aware Prediction

  • Jingwen Wang
  • Lin Ma
  • Wenhao Jiang

The task of temporally grounding language queries in videos is to temporally localize the best matched video segment corresponding to a given language query (sentence). It requires a model to simultaneously perform visual and linguistic understanding. Previous work predominantly ignores the precision of segment localization: sliding-window based methods use predefined search window sizes, which suffer from redundant computation, while existing anchor-based approaches fail to yield precise localization. We address this issue by proposing an end-to-end boundary-aware model, which uses a lightweight branch to predict semantic boundaries corresponding to the given linguistic information. To better detect semantic boundaries, we propose to aggregate contextual information by explicitly modeling the relationship between the current element and its neighbors. The most confident segments are subsequently selected based on both anchor and boundary predictions at the testing stage. The proposed model, dubbed Contextual Boundary-aware Prediction (CBP), outperforms its competitors by a clear margin on three public datasets.

AAAI Conference 2019 Conference Paper

Hierarchical Photo-Scene Encoder for Album Storytelling

  • Bairui Wang
  • Lin Ma
  • Wei Zhang
  • Wenhao Jiang
  • Feng Zhang

In this paper, we propose a novel model with a hierarchical photo-scene encoder and a reconstructor for the task of album storytelling. The photo-scene encoder contains two sub-encoders, namely the photo and scene encoders, which are stacked together and behave hierarchically to fully exploit the structure information of the photos within an album. Specifically, the photo encoder generates a semantic representation for each photo while exploiting temporal relationships among them. The scene encoder, relying on the obtained photo representations, is responsible for detecting the scene changes and generating scene representations. Subsequently, the decoder dynamically and attentively summarizes the encoded photo and scene representations to generate a sequence of album representations, based on which a story consisting of multiple coherent sentences is generated. In order to fully extract the useful semantic information from an album, a reconstructor is employed to reproduce the summarized album representations based on the hidden states of the decoder. The proposed model can be trained in an end-to-end manner, and it improves over the state of the art on the public visual storytelling (VIST) dataset. Ablation studies further demonstrate the effectiveness of the proposed hierarchical photo-scene encoder and reconstructor.

AAAI Conference 2018 Conference Paper

Learning to Guide Decoding for Image Captioning

  • Wenhao Jiang
  • Lin Ma
  • Xinpeng Chen
  • Hanwang Zhang
  • Wei Liu

Recently, much progress has been made in image captioning, and an encoder-decoder framework has achieved outstanding performance for this task. In this paper, we propose an extension of the encoder-decoder framework by adding a component called the guiding network. The guiding network models the attribute properties of input images, and its output is leveraged to compose the input of the decoder at each time step. The guiding network can be plugged into the current encoder-decoder framework and trained in an end-to-end manner. Hence, the guiding vector can be adaptively learned according to the signal from the decoder, enabling it to embed information from both the image and the language. Additionally, discriminative supervision can be employed to further improve the quality of the guidance. The advantages of our proposed approach are verified by experiments carried out on the MS COCO dataset.
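
The plug-in nature of the component is easy to picture: a guiding vector v computed from image features is concatenated with the word embedding at every decoder step. The modules below are toy placeholders for the paper's encoder and guiding network, and the dimensions are made up.

```python
import torch, torch.nn as nn

d_img, d_word, d_hid = 128, 32, 64
guide = nn.Sequential(nn.Linear(d_img, 64), nn.ReLU(), nn.Linear(64, 32))
decoder = nn.LSTMCell(d_word + 32, d_hid)     # the decoder sees [word; v]

img_feat = torch.randn(1, d_img)
v = guide(img_feat)                           # guiding vector, learned end-to-end
h = c = torch.zeros(1, d_hid)
for _ in range(5):                            # unroll a few decoding steps
    word = torch.randn(1, d_word)             # stand-in word embedding
    h, c = decoder(torch.cat([word, v], dim=-1), (h, c))
print(h.shape)
```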

IJCAI Conference 2017 Conference Paper

Theoretic Analysis and Extremely Easy Algorithms for Domain Adaptive Feature Learning

  • Wenhao Jiang
  • Cheng Deng
  • Wei Liu
  • Feiping Nie
  • Fu-lai Chung
  • Heng Huang

Domain adaptation problems arise in a variety of applications, where a training dataset from the source domain and a test dataset from the target domain typically follow different distributions. The primary difficulty in designing effective learning models to solve such problems lies in how to bridge the gap between the source and target distributions. In this paper, we provide a comprehensive analysis of feature learning algorithms used in conjunction with linear classifiers for domain adaptation. Our analysis shows that in order to achieve good adaptation performance, the second moments of the source domain distribution and target domain distribution should be similar. Based on our new analysis, a novel and extremely easy feature learning algorithm for domain adaptation is proposed. Furthermore, our algorithm is extended by leveraging multiple layers, leading to another feature learning algorithm. We evaluate the effectiveness of the proposed algorithms in terms of domain adaptation tasks on Amazon review and spam datasets from the ECML/PKDD 2006 discovery challenge.
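
The second-moment criterion admits a quick numeric check: after a linear feature map that whitens the source and recolors it with the target statistics, the two second moments E[xx^T] coincide exactly. This is one easy realization consistent with the stated analysis; the paper's own algorithm may differ in its details.

```python
import numpy as np

rng = np.random.default_rng(0)
Xs = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 5))   # source
Xt = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 5))   # target

Cs = Xs.T @ Xs / len(Xs)          # source second moment
Ct = Xt.T @ Xt / len(Xt)          # target second moment

def mat_pow(C, p):
    """Symmetric matrix power via eigendecomposition."""
    w, V = np.linalg.eigh(C)
    return V @ np.diag(w ** p) @ V.T

A = mat_pow(Cs, -0.5) @ mat_pow(Ct, 0.5)   # whiten source, recolor to target
Xs_adapted = Xs @ A
Cs_new = Xs_adapted.T @ Xs_adapted / len(Xs)
print(np.allclose(Cs_new, Ct))             # True: second moments now match
```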

AAAI Conference 2016 Conference Paper

The l2,1-Norm Stacked Robust Autoencoders for Domain Adaptation

  • Wenhao Jiang
  • Hongchang Gao
  • Fu-lai Chung
  • Heng Huang

Recently, deep learning methods that employ stacked denoising autoencoders (SDAs) have been successfully applied to domain adaptation. Remarkable performance on multi-domain sentiment analysis datasets has been reported, making deep learning a promising approach to domain adaptation problems. SDAs are distinguished by learning robust data representations for recovering the original features that have been artificially corrupted with noise. The idea has been further exploited to marginalize out the random corruptions by a state-of-the-art method called mSDA. In this paper, a deep learning method for domain adaptation called ℓ2,1-norm stacked robust autoencoders (ℓ2,1-SRA) is proposed to learn useful representations for domain adaptation tasks. Each layer of ℓ2,1-SRA contains two steps: a robust linear reconstruction step based on ℓ2,1 robust regression, and a non-linear squashing transformation step. The experimental results demonstrate that the proposed method is very effective on multiple cross-domain classification datasets, including the Amazon review dataset, the spam dataset from the ECML/PKDD 2006 discovery challenge, and the 20 newsgroups dataset.
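
For reference, the ℓ2,1 norm of a residual matrix is the sum of the ℓ2 norms of its rows, so whole corrupted samples are down-weighted rather than having their errors squared. Below is a generic iteratively re-weighted sketch of an ℓ2,1 linear reconstruction step; the paper's full layer additionally applies a non-linear squashing transformation, omitted here.

```python
import numpy as np

def l21(E):
    return np.linalg.norm(E, axis=1).sum()    # sum of row-wise l2 norms

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
W_true = rng.standard_normal((8, 4))
Y = X @ W_true
Y[:5] += 30 * rng.standard_normal((5, 4))    # a few grossly corrupted samples

W = np.linalg.lstsq(X, Y, rcond=None)[0]     # plain least-squares start
for _ in range(20):                          # IRLS for the l2,1 objective
    r = np.linalg.norm(Y - X @ W, axis=1)
    dwt = 1.0 / np.maximum(r, 1e-8)          # outlier rows get tiny weight
    Xw = X * dwt[:, None]
    W = np.linalg.solve(X.T @ Xw, Xw.T @ Y)
print(l21(Y - X @ W), np.linalg.norm(W - W_true))
```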

IJCAI Conference 2015 Conference Paper

Robust Dictionary Learning with Capped l1-Norm

  • Wenhao Jiang
  • Feiping Nie
  • Heng Huang

Expressing data vectors as sparse linear combinations of basis elements (a dictionary) is widely used in machine learning, signal processing, and statistics. It has been found that dictionaries learned from data are more effective than off-the-shelf ones, and dictionary learning has become an important tool for computer vision. Traditional dictionary learning methods use a quadratic loss function, which is known to be sensitive to outliers; hence they cannot learn good dictionaries when outliers exist. In this paper, aiming at learning dictionaries resistant to outliers, we propose capped ℓ1-norm based dictionary learning and an efficient iterative re-weighted algorithm to solve the problem. We provide theoretical analysis and carry out extensive experiments on real-world and synthetic datasets to show the effectiveness of our method.
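
For concreteness, the capped ℓ1 loss is loss(r) = min(|r|, eps): residuals beyond the cap stop contributing, so gross outliers cannot dominate the fit. The sketch below runs a re-weighted sparse-coding step under that loss with a fixed toy dictionary; the dictionary update and the paper's exact algorithm are omitted.

```python
import numpy as np

def capped_l1(r, eps):
    return np.minimum(np.abs(r), eps).sum()   # residuals beyond eps are capped

rng = np.random.default_rng(0)
D = rng.standard_normal((10, 5))              # fixed toy dictionary
a_true = np.array([1.0, 0.0, 0.0, -2.0, 0.0])
x = D @ a_true
x[0] += 100.0                                 # one grossly corrupted entry
eps = 10.0

a = np.zeros(5)
for _ in range(10):                           # iterative re-weighting
    r = x - D @ a
    w = np.where(np.abs(r) < eps, 1.0 / np.maximum(np.abs(r), 1e-6), 0.0)
    sw = np.sqrt(w)                           # zero weight drops capped rows
    a = np.linalg.lstsq(D * sw[:, None], sw * x, rcond=None)[0]
print(np.round(a, 2), capped_l1(x - D @ a, eps))
```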