Arrow Research — Search

Author name cluster

Yidong Chen

Papers possibly associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity-disambiguation profile.

10 papers
1 author row

Possible papers

10

AAAI Conference 2026 Conference Paper

Efficient and Adaptive Simultaneous Speech Translation with Fully Unidirectional Architecture

  • Biao Fu
  • Donglei Yu
  • Minpeng Liao
  • Chengxi Li
  • Xinjie Chen
  • Yidong Chen
  • Kai Fan
  • Xiaodong Shi

Simultaneous speech translation (SimulST) produces translations incrementally while processing partial speech input. Although large language models (LLMs) have shown strong capabilities in offline translation tasks, applying them to SimulST poses notable challenges. Existing LLM-based SimulST approaches either incur significant computational overhead due to repeated re-encoding by a bidirectional speech encoder, or depend on a fixed read/write policy, limiting both efficiency and performance. In this work, we introduce Efficient and Adaptive Simultaneous Speech Translation (EASiST) with a fully unidirectional architecture comprising both the speech encoder and the LLM. EASiST includes a multi-latency data curation strategy to generate semantically aligned SimulST training samples and redefines SimulST as an interleaved generation task with explicit read/write tokens. To facilitate adaptive inference, we incorporate a lightweight policy head that dynamically predicts read/write actions. Additionally, we employ a multi-stage training strategy to align speech-text modalities and optimize both translation and policy behavior. Experiments on both in-domain (MuST-C) and out-of-domain (Europarl-ST) En-De and En-Es datasets demonstrate that EASiST offers superior latency-quality trade-offs compared to several strong baselines.
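The adaptive policy the abstract describes amounts to a small classifier on top of the LLM's hidden state that decides, at each step, whether to read more speech or write a translation token. The sketch below is a hypothetical illustration of such a policy head; the class name, hidden size, and toy decoding step are assumptions, not the EASiST implementation.

```python
import torch
import torch.nn as nn

class PolicyHead(nn.Module):
    """Illustrative read/write policy head (hypothetical, not the EASiST release).

    Given the decoder's last hidden state, it scores two actions:
    0 = READ (wait for more speech), 1 = WRITE (emit a translation token).
    """
    def __init__(self, hidden_size: int = 1024):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 4),
            nn.GELU(),
            nn.Linear(hidden_size // 4, 2),
        )

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # last_hidden: (batch, hidden_size) -> action logits: (batch, 2)
        return self.classifier(last_hidden)

# Toy decision step: in a real loop this would run after each new speech chunk.
policy = PolicyHead(hidden_size=1024)
state = torch.randn(1, 1024)            # stand-in for an LLM hidden state
action = policy(state).argmax(dim=-1)   # 0 = READ, 1 = WRITE
print("action:", "WRITE" if action.item() == 1 else "READ")
```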

AAAI Conference 2026 Conference Paper

PLaST: Towards Paralinguistic-aware Speech Translation

  • Yi Li
  • Rui Zhao
  • Ruiquan Zhang
  • Jinsong Su
  • Daimeng Wei
  • Min Zhang
  • Yidong Chen

Speech translation (ST) aims to translate speech from a source language into text in the target language. Speech signals naturally contain paralinguistic cues beyond linguistic content, which can influence or even alter the interpretation of a lexically identical sentence and thereby yield distinct translations. However, existing ST models lack direct and sufficient modeling of paralinguistic information, which limits their ability to perceive paralinguistic cues and understand speech comprehensively, leading to degraded translation performance. In response, we propose Paralinguistic-aware Speech Translation (PLaST), a novel dual-branch framework that directly leverages paralinguistic cues beyond the linguistic content. Specifically, PLaST employs a speech encoder and a style extractor to independently generate linguistic and paralinguistic representations, respectively. To obtain a purified linguistic representation aligned with the text representation, a hierarchical Optimal Transport (OT) is applied to the layer-wise outputs of an LLM decoder. The paralinguistic information is then retrieved and refined with an Attention-based Retrieval (AR) module, with the linguistic representation serving as queries to enable joint guidance for semantic understanding and translation generation. PLaST outperforms the strong baseline by an average of 5.0 directional and 4.5 global contrastive likelihood scores on the paralinguistic-sensitive benchmark ContraProST, demonstrating its superior capability in paralinguistic perception. Further experiments on the standard speech translation benchmark CoVoST-2 show that PLaST generalizes well to typical ST scenarios.
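The Attention-based Retrieval step can be pictured as cross-attention where linguistic features query paralinguistic (style) features. The following is a minimal sketch under that assumption; the class name, dimensions, and residual fusion are illustrative and not taken from the PLaST code.

```python
import torch
import torch.nn as nn

class AttentionRetrieval(nn.Module):
    """Sketch of an attention-based retrieval step (assumed form, not the PLaST code).

    Linguistic features act as queries; paralinguistic (style) features act as
    keys/values, so each linguistic position retrieves the style cues it needs.
    """
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, linguistic, paralinguistic):
        # linguistic: (B, T_text, dim); paralinguistic: (B, T_style, dim)
        retrieved, _ = self.attn(query=linguistic, key=paralinguistic, value=paralinguistic)
        return self.norm(linguistic + retrieved)   # residual fusion of the retrieved cues

module = AttentionRetrieval()
fused = module(torch.randn(2, 10, 512), torch.randn(2, 4, 512))
print(fused.shape)  # torch.Size([2, 10, 512])
```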

AAAI Conference 2026 Conference Paper

RFI: Rectified Flow Intervention for Mitigating Object Hallucination in Large Vision-Language Models

  • Junyu Cheng
  • Zhibiao Liang
  • Yidong Chen
  • Shuangyin Li

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding and generation by integrating visual and textual data. However, these models frequently exhibit object hallucination: generating outputs that are inconsistent with the input image. Existing methods for mitigating hallucinations still suffer from two key limitations: dynamic approaches based on logits or attention mechanisms risk suppressing valuable linguistic priors, whereas static methods that employ fixed intervention vectors lack the flexibility to adapt to diverse images and questions. To address these issues, we propose RFI (Rectified Flow Intervention), a novel approach that harnesses the linear-trajectory design of rectified flow for input-specific adaptation and employs gradient correction to ensure coherent generation, effectively combining the adaptability of dynamic methods with the stability of static ones. RFI dynamically predicts latent-space intervention vectors while requiring only a single forward pass through the LVLM per question, achieving computational efficiency (1.09x latency overhead for 100 new tokens). Extensive experiments show that RFI significantly reduces hallucinations and outperforms existing advanced methods, highlighting its effectiveness as a lightweight plug-and-play method for reducing LVLM hallucination in practical applications.
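To make the "linear trajectory" idea concrete, here is a generic rectified-flow training step: a velocity network is trained along straight paths between two latent states so that one integration step moves the first toward the second. This is a sketch of the general technique the abstract builds on, not the RFI method itself; the latents, dimensions, and network are hypothetical.

```python
import torch
import torch.nn as nn

# Generic rectified-flow step (illustration of the linear-trajectory idea, NOT RFI).
# Along the straight path x_t = (1 - t) * x0 + t * x1, the velocity network is
# trained to predict the constant displacement x1 - x0.
dim = 256
velocity_net = nn.Sequential(nn.Linear(dim + 1, 512), nn.SiLU(), nn.Linear(512, dim))
optimizer = torch.optim.Adam(velocity_net.parameters(), lr=1e-4)

x0 = torch.randn(32, dim)          # hypothetical "uncorrected" latent state
x1 = torch.randn(32, dim)          # hypothetical target "intervened" latent state
t = torch.rand(32, 1)
x_t = (1 - t) * x0 + t * x1

pred_velocity = velocity_net(torch.cat([x_t, t], dim=-1))
loss = ((pred_velocity - (x1 - x0)) ** 2).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()

# At inference, a single Euler step along the learned field would yield an
# input-specific correction vector: x0 + velocity_net(cat([x0, t=0])).
```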

AAAI Conference 2024 Conference Paper

Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment

  • Rui Zhao
  • Liang Zhang
  • Biao Fu
  • Cong Hu
  • Jinsong Su
  • Yidong Chen

Sign language translation (SLT) aims to convert continuous sign language videos into textual sentences. As a typical multi-modal task, there exists an inherent modality gap between sign language videos and spoken language text, which makes cross-modal alignment between the visual and textual modalities crucial. However, previous studies tend to rely on an intermediate sign gloss representation to help alleviate the cross-modal problem, thereby neglecting direct alignment across modalities, which may lead to compromised results. To address this issue, we propose a novel framework based on a Conditional Variational Autoencoder for SLT (CV-SLT) that facilitates direct and sufficient cross-modal alignment between sign language videos and spoken language text. Specifically, our CV-SLT consists of two paths with two Kullback-Leibler (KL) divergences to regularize the outputs of the encoder and decoder, respectively. In the prior path, the model relies solely on visual information to predict the target text; in the posterior path, it simultaneously encodes visual information and textual knowledge to reconstruct the target text. The first KL divergence optimizes the conditional variational autoencoder and regularizes the encoder outputs, while the second KL divergence performs self-distillation from the posterior path to the prior path, ensuring the consistency of decoder outputs. We further enhance the integration of textual information into the posterior path by employing a shared Attention Residual Gaussian Distribution (ARGD), which treats the textual information in the posterior path as a residual component relative to the prior path. Extensive experiments conducted on public datasets demonstrate the effectiveness of our framework, achieving new state-of-the-art results while significantly alleviating the cross-modal representation discrepancy. The code and models are available at https://github.com/rzhao-zhsq/CV-SLT.
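The two-KL objective can be summarized as: reconstruction from the posterior path, a latent KL between the posterior (video + text) and prior (video-only) Gaussians, and an output-level KL that distills the posterior path into the prior path. The function below is a sketch in that spirit, with assumed shapes, weighting, and names rather than the released CV-SLT loss.

```python
import torch
import torch.nn.functional as F

def two_kl_loss(prior_logits, post_logits, mu_p, logvar_p, mu_q, logvar_q, target):
    """Sketch of a two-KL objective in the spirit of CV-SLT (assumed, not the release)."""
    # Reconstruction of the target text from the posterior path.
    ce = F.cross_entropy(post_logits.transpose(1, 2), target)
    # KL( q || p ) between the posterior and prior diagonal Gaussians (encoder regularization).
    kl_latent = 0.5 * (
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0
    ).sum(-1).mean()
    # Self-distillation: pull prior-path predictions toward (detached) posterior-path ones.
    kl_output = F.kl_div(
        F.log_softmax(prior_logits, dim=-1),
        F.softmax(post_logits.detach(), dim=-1),
        reduction="batchmean",
    )
    return ce + kl_latent + kl_output

B, T, V, Z = 2, 5, 100, 64
loss = two_kl_loss(
    torch.randn(B, T, V), torch.randn(B, T, V),
    torch.randn(B, Z), torch.zeros(B, Z),
    torch.randn(B, Z), torch.zeros(B, Z),
    torch.randint(0, V, (B, T)),
)
print(loss.item())
```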

AAAI Conference 2024 Conference Paper

Layer-Wise Representation Fusion for Compositional Generalization

  • Yafang Zheng
  • Lei Lin
  • Shuangtao Li
  • Yuxuan Yuan
  • Zhaohong Lai
  • Shan Liu
  • Biao Fu
  • Yidong Chen

Existing neural models are known to struggle with compositional generalization (CG), i.e., the ability to systematically generalize to unseen compositions of seen components. A key reason for failure on CG is that the syntactic and semantic representations of sequences in the uppermost layers of both the encoder and decoder are entangled. However, previous work concentrates on separating the learning of syntax and semantics instead of exploring the reasons behind this representation entanglement (RE) problem in order to solve it. We explain why it exists by analyzing how representations evolve from the bottom to the top of the Transformer layers. We find that the "shallow" residual connections within each layer fail to fuse previous layers' information effectively, leading to information forgetting between layers and, in turn, to the RE problem. Inspired by this, we propose LRF, a novel Layer-wise Representation Fusion framework for CG, which learns to fuse previous layers' information back into the encoding and decoding process by introducing a fuse-attention module at each encoder and decoder layer. LRF achieves promising results on two realistic benchmarks, empirically demonstrating the effectiveness of our proposal. Code is available at https://github.com/thinkaboutzero/LRF.
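One way to read "fuse-attention" is as attention that, for each position, looks back over that position's representations from all earlier layers instead of relying only on the residual stream. The module below is a minimal sketch under that reading; its name, shapes, and fusion rule are assumptions, not the LRF implementation.

```python
import torch
import torch.nn as nn

class FuseAttention(nn.Module):
    """Illustrative fuse-attention module (assumed form, not the LRF release).

    Each position attends over the stack of all previous layers' representations
    of that same position, rather than relying on a plain residual connection.
    """
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, current, layer_history):
        # current: (B, T, dim); layer_history: list of (B, T, dim), one per earlier layer.
        B, T, D = current.shape
        # Stack history along a "layer" axis and fold positions into the batch:
        # (B*T, num_layers, D), so attention runs across layers, per position.
        memory = torch.stack(layer_history, dim=2).reshape(B * T, len(layer_history), D)
        query = current.reshape(B * T, 1, D)
        fused, _ = self.attn(query, memory, memory)
        return self.norm(current + fused.reshape(B, T, D))

fuse = FuseAttention()
history = [torch.randn(2, 7, 512) for _ in range(3)]
out = fuse(torch.randn(2, 7, 512), history)
print(out.shape)  # torch.Size([2, 7, 512])
```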

IJCAI Conference 2023 Conference Paper

Efficient Sign Language Translation with a Curriculum-based Non-autoregressive Decoder

  • Pei Yu
  • Liang Zhang
  • Biao Fu
  • Yidong Chen

Most existing studies on Sign Language Translation (SLT) employ an AutoRegressive Decoding Mechanism (AR-DM) to generate target sentences. However, the main disadvantage of the AR-DM is its high inference latency. To address this problem, we introduce a Non-AutoRegressive Decoding Mechanism (NAR-DM) into SLT, which generates the whole sentence at once. Meanwhile, to improve its decoding ability, we integrate the advantages of curriculum learning and the NAR-DM and propose a Curriculum-based NAR Decoder (CND). Specifically, the lower layers of the CND are expected to predict simple tokens that can be predicted correctly using source-side information alone, while the upper layers predict complex tokens based on the lower layers' predictions. Therefore, our CND significantly reduces the model's inference latency while maintaining competitive performance. Moreover, to further boost the performance of our CND, we propose a mutual learning framework containing two decoders, i.e., an AR decoder and our CND. We jointly train the two decoders and minimize the KL divergence between their outputs, which enables our CND to learn forward sequential knowledge from the strengthened AR decoder. Experimental results on PHOENIX2014T and CSL-Daily demonstrate that our model consistently outperforms all competitive baselines and achieves 7.92/8.02× speed-ups compared to the AR SLT model, respectively. Our source code is available at https://github.com/yp20000921/CND.
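The mutual-learning part of the training objective can be sketched as cross-entropy for both decoders plus a KL term that couples their output distributions. The function below is an illustrative assumption (names, masking, and weighting are hypothetical), not the released CND training code.

```python
import torch
import torch.nn.functional as F

def mutual_learning_loss(ar_logits, nar_logits, target, pad_id=0):
    """Sketch of a joint AR/NAR mutual-learning objective (assumed, not the CND code).

    Both decoders get cross-entropy on the target sentence; a symmetric KL term
    pulls each decoder's distribution toward the other's, letting the NAR decoder
    absorb forward sequential knowledge from the AR decoder.
    """
    mask = target.ne(pad_id)                       # ignore padding positions
    ce_ar = F.cross_entropy(ar_logits[mask], target[mask])
    ce_nar = F.cross_entropy(nar_logits[mask], target[mask])
    kl = 0.5 * (
        F.kl_div(F.log_softmax(nar_logits[mask], -1),
                 F.softmax(ar_logits[mask].detach(), -1), reduction="batchmean")
        + F.kl_div(F.log_softmax(ar_logits[mask], -1),
                   F.softmax(nar_logits[mask].detach(), -1), reduction="batchmean")
    )
    return ce_ar + ce_nar + kl

B, T, V = 2, 6, 1000
target = torch.randint(1, V, (B, T))
loss = mutual_learning_loss(torch.randn(B, T, V), torch.randn(B, T, V), target)
print(loss.item())
```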

IJCAI Conference 2023 Conference Paper

Exploring Effective Inter-Encoder Semantic Interaction for Document-Level Relation Extraction

  • Liang Zhang
  • Zijun Min
  • Jinsong Su
  • Pei Yu
  • Ante Wang
  • Yidong Chen

In document-level relation extraction (RE), models are required to correctly predict implicit relations in documents via relational reasoning. To this end, many graph-based methods have been proposed for this task. Despite their success, these methods still suffer from several drawbacks: 1) the interaction between the document encoder and the graph encoder is usually unidirectional and insufficient; 2) the graph encoders often fail to capture the global context of nodes in the document graph. In this paper, we propose a document-level RE model with a Graph-Transformer Network (GTN). The GTN includes two core sublayers: 1) a graph-attention sublayer that simultaneously models the global and local contexts of nodes in the document graph; 2) a cross-attention sublayer, enabling the GTN to capture non-entity clue information from the document encoder. Furthermore, we introduce two auxiliary training tasks to enhance the bidirectional semantic interaction between the document encoder and the GTN: 1) graph node reconstruction, which effectively trains our cross-attention sublayer to enhance the semantic transfer from the document encoder to the GTN; 2) structure-aware adversarial knowledge distillation, by which we effectively transfer the structural information of the GTN to the document encoder. Experimental results on four benchmark datasets prove the effectiveness of our model. Our source code is available at https://github.com/DeepLearnXMU/DocRE-BSI.
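A rough picture of one GTN layer: adjacency-masked attention for local neighbours, unmasked attention for global context, and cross-attention from graph nodes to the document encoder's token states. The layer below is a sketch under those assumptions; the class name, dimensions, and the way local and global attention are combined are hypothetical, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GTNLayer(nn.Module):
    """Sketch of one Graph-Transformer layer (assumed form, not the released model)."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, nodes, token_states, adjacency):
        # nodes: (B, N, dim); token_states: (B, T, dim); adjacency: (B, N, N) bool.
        # Local context: attention restricted to graph neighbours (True in attn_mask = blocked).
        attn_mask = ~adjacency.repeat_interleave(self.local_attn.num_heads, dim=0)
        local, _ = self.local_attn(nodes, nodes, nodes, attn_mask=attn_mask)
        # Global context: unrestricted attention over all nodes.
        global_ctx, _ = self.global_attn(nodes, nodes, nodes)
        nodes = self.norm1(nodes + local + global_ctx)
        # Non-entity clues retrieved from the document encoder's token states.
        clues, _ = self.cross_attn(nodes, token_states, token_states)
        return self.norm2(nodes + clues)

layer = GTNLayer()
adjacency = torch.eye(5, dtype=torch.bool).unsqueeze(0).expand(2, -1, -1)
out = layer(torch.randn(2, 5, 256), torch.randn(2, 30, 256), adjacency)
print(out.shape)  # torch.Size([2, 5, 256])
```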

AAAI Conference 2023 Conference Paper

Exploring Self-Distillation Based Relational Reasoning Training for Document-Level Relation Extraction

  • Liang Zhang
  • Jinsong Su
  • Zijun Min
  • Zhongjian Miao
  • Qingguo Hu
  • Biao Fu
  • Xiaodong Shi
  • Yidong Chen

Document-level relation extraction (RE) aims to extract relational triples from a document. One of its primary challenges is to predict implicit relations between entities, which are not explicitly expressed in the document but can usually be extracted through relational reasoning. Previous methods mainly model relational reasoning implicitly through the interaction among entities or entity pairs. However, they suffer from two deficiencies: 1) they often consider only one reasoning pattern, whose coverage of relational triples is limited; 2) they do not explicitly model the process of relational reasoning. In this paper, to deal with the first problem, we propose a document-level RE model with a reasoning module whose core unit is the reasoning multi-head self-attention unit. This unit is a variant of conventional multi-head self-attention and uses four attention heads to model four common reasoning patterns, respectively, which covers more relational triples than previous methods. To address the second issue, we propose a self-distillation training framework containing two branches that share parameters. In the first branch, we randomly mask some entity-pair feature vectors in the document and then train our reasoning module to infer their relations by exploiting the feature information of other related entity pairs, thereby explicitly modeling the process of relational reasoning. However, because the additional masking operation is not used during testing, it creates an input gap between training and testing that would hurt model performance. To reduce this gap, we perform conventional supervised training without masking in the second branch and use a Kullback-Leibler divergence loss to minimize the difference between the predictions of the two branches. Finally, we conduct comprehensive experiments on three benchmark datasets, whose results demonstrate that our model consistently outperforms all competitive baselines. Our source code is available at https://github.com/DeepLearnXMU/DocRE-SD.
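The two-branch idea reduces to: run the reasoning module once on masked entity-pair features and once on unmasked ones, supervise both with the relation labels, and add a KL term between the two branches' predictions. The sketch below assumes a stand-in reasoning module and hypothetical shapes; it is not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def self_distillation_step(reasoning_module, pair_feats, labels, mask_prob=0.15):
    """Sketch of the two-branch self-distillation idea (assumed, not the paper's code).

    Branch 1 masks some entity-pair feature vectors and must recover their relations
    from related pairs; branch 2 runs without masking; a KL term keeps the branches'
    predictions close, shrinking the masking-induced train/test gap.
    """
    B, P, D = pair_feats.shape
    mask = torch.rand(B, P) < mask_prob
    masked_feats = pair_feats.masked_fill(mask.unsqueeze(-1), 0.0)

    logits_masked = reasoning_module(masked_feats)   # (B, P, num_relations)
    logits_plain = reasoning_module(pair_feats)

    ce = F.cross_entropy(logits_masked.flatten(0, 1), labels.flatten()) + \
         F.cross_entropy(logits_plain.flatten(0, 1), labels.flatten())
    kl = F.kl_div(F.log_softmax(logits_masked, -1),
                  F.softmax(logits_plain.detach(), -1), reduction="batchmean")
    return ce + kl

# Toy stand-in for the reasoning module (the paper uses a reasoning
# multi-head self-attention unit; an MLP is used here only to keep the sketch short).
reasoner = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 10))
loss = self_distillation_step(reasoner, torch.randn(2, 20, 128), torch.randint(0, 10, (2, 20)))
print(loss.item())
```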

AAAI Conference 2018 Conference Paper

Deep Semantic Role Labeling With Self-Attention

  • Zhixing Tan
  • Mingxuan Wang
  • Jun Xie
  • Yidong Chen
  • Xiaodong Shi

Semantic Role Labeling (SRL) is believed to be a crucial step towards natural language understanding and has been widely studied. In recent years, end-to-end SRL with recurrent neural networks (RNNs) has gained increasing attention. However, it remains a major challenge for RNNs to handle structural information and long-range dependencies. In this paper, we present a simple and effective architecture for SRL that aims to address these problems. Our model is based on self-attention, which can directly capture the relationship between two tokens regardless of their distance. Our single model achieves F1 = 83.4 on the CoNLL-2005 shared task dataset and F1 = 82.7 on the CoNLL-2012 shared task dataset, outperforming the previous state-of-the-art results by 1.8 and 1.0 F1 points, respectively. Moreover, our model is computationally efficient, with a parsing speed of 50K tokens per second on a single Titan X GPU.
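In outline, a self-attention SRL model is a Transformer encoder over the token sequence followed by a per-token role classifier, so every token can attend to every other token directly. The model below is a minimal sketch in that spirit; the sizes, label count, and layer configuration are hypothetical and do not reproduce the paper's architecture.

```python
import torch
import torch.nn as nn

class SelfAttentionSRL(nn.Module):
    """Minimal self-attention SRL tagger (a sketch, not the paper's model).

    Each token attends to every other token directly, so long-range
    predicate-argument dependencies need not pass through recurrence.
    """
    def __init__(self, vocab_size=10000, num_labels=64, dim=256, layers=4, heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        encoder_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        self.tagger = nn.Linear(dim, num_labels)   # per-token role labels (hypothetical label set)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> label logits: (batch, seq_len, num_labels)
        return self.tagger(self.encoder(self.embed(token_ids)))

model = SelfAttentionSRL()
logits = model(torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 64])
```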