Author name cluster

Biao Fu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

6 papers

2 author rows

AAAI Conference 2026 Conference Paper

Efficient and Adaptive Simultaneous Speech Translation with Fully Unidirectional Architecture

Biao Fu
Donglei Yu
Minpeng Liao
Chengxi Li
Xinjie Chen
Yidong Chen
Kai Fan
Xiaodong Shi

Simultaneous speech translation (SimulST) produces translations incrementally while processing partial speech input. Although large language models (LLMs) have shown strong capabilities in offline translation tasks, applying them to SimulST poses notable challenges. Existing LLM-based SimulST approaches either incur significant computational overhead from repeatedly re-encoding the input with a bidirectional speech encoder, or depend on a fixed read/write policy, which limits both efficiency and performance. In this work, we introduce Efficient and Adaptive Simultaneous Speech Translation (EASiST) with a fully unidirectional architecture, including both the speech encoder and the LLM. EASiST includes a multi-latency data curation strategy to generate semantically aligned SimulST training samples and redefines SimulST as an interleaved generation task with explicit read/write tokens. To facilitate adaptive inference, we incorporate a lightweight policy head that dynamically predicts read/write actions. Additionally, we employ a multi-stage training strategy to align speech-text modalities and optimize both translation and policy behavior. Experiments on both in-domain (MuST-C) and out-of-domain (Europarl-ST) En-De and En-Es datasets demonstrate that EASiST offers superior latency-quality trade-offs compared to several strong baselines.
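The interleaved read/write formulation in the abstract above can be illustrated with a small sketch. This is not the EASiST implementation: the token names `<read>` and `<write>`, the `interleave` function, and the `toy_policy` rule are all hypothetical, and the toy rule merely stands in for the learned policy head.

```python
from typing import Callable, List

# Hypothetical action tokens; the abstract only says "explicit read/write tokens".
READ, WRITE = "<read>", "<write>"


def interleave(
    speech_chunks: List[str],
    target_tokens: List[str],
    write_prob: Callable[[int, int], float],
    threshold: float = 0.5,
) -> List[str]:
    """Build one interleaved read/write sequence.

    write_prob(n_read, n_written) stands in for a policy head: it returns
    the probability of writing next, given how many chunks have been read
    and how many target tokens have been written so far.
    """
    sequence: List[str] = []
    n_read = n_written = 0
    while n_written < len(target_tokens):
        source_exhausted = n_read == len(speech_chunks)
        # Once the source is fully read, the only remaining action is WRITE.
        if source_exhausted or write_prob(n_read, n_written) >= threshold:
            sequence += [WRITE, target_tokens[n_written]]
            n_written += 1
        else:
            sequence += [READ, speech_chunks[n_read]]
            n_read += 1
    return sequence


def toy_policy(n_read: int, n_written: int) -> float:
    # Toy rule (an assumption, not a learned head): write only once the
    # reader is at least two chunks ahead of the writer.
    return 1.0 if n_read - n_written >= 2 else 0.0


demo = interleave(["s1", "s2", "s3"], ["t1", "t2"], toy_policy)
print(demo)
```

Replacing `toy_policy` with a model-based probability would give an adaptive policy in the sense described above, whereas a rule that depends only on the two counters remains a fixed policy.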


AAAI Conference 2024 Conference Paper

Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment

Rui Zhao
Liang Zhang
Biao Fu
Cong Hu
Jinsong Su
Yidong Chen

Sign language translation (SLT) aims to convert continuous sign language videos into textual sentences. As a typical multi-modal task, there exists an inherent modality gap between sign language videos and spoken language text, which makes the cross-modal alignment between visual and textual modalities crucial. However, previous studies tend to rely on an intermediate sign gloss representation to help alleviate the cross-modal problem, thereby neglecting the alignment across modalities, which may lead to compromised results. To address this issue, we propose a novel framework based on Conditional Variational autoencoder for SLT (CV-SLT) that facilitates direct and sufficient cross-modal alignment between sign language videos and spoken language text. Specifically, our CV-SLT consists of two paths with two Kullback-Leibler (KL) divergences to regularize the outputs of the encoder and decoder, respectively. In the prior path, the model solely relies on visual information to predict the target text; whereas in the posterior path, it simultaneously encodes visual information and textual knowledge to reconstruct the target text. The first KL divergence optimizes the conditional variational autoencoder and regularizes the encoder outputs, while the second KL divergence performs a self-distillation from the posterior path to the prior path, ensuring the consistency of decoder outputs. We further enhance the integration of textual information to the posterior path by employing a shared Attention Residual Gaussian Distribution (ARGD), which considers the textual information in the posterior path as a residual component relative to the prior path. Extensive experiments conducted on public datasets demonstrate the effectiveness of our framework, achieving new state-of-the-art results while significantly alleviating the cross-modal representation discrepancy. The code and models are available at https://github.com/rzhao-zhsq/CV-SLT.


ECAI Conference 2024 Conference Paper

Improving Non-Autoregressive Sign Language Translation with Random Ordering Progressive Prediction Pretraining

Pei Yu
Changhao Lai
Cong Hu
Shan Liu
Liang Zhang
Biao Fu
Yidong Chen

Recently, the Non-AutoRegressive (NAR) decoding mechanism, effectively reducing the inference latency of text generation, has been applied to Sign Language Translation (SLT). Typically, the current best NAR SLT model using a Curriculum-based Non-autoregressive Decoder (CND) outperforms AutoRegressive (AR) baselines in speed and performance. Although it has been proven that AutoRegressive Pre-trained Language Models (AR-PLMs) further boost the performance of AR SLT models, combining NAR Pretrained Language Models (NAR-PLMs) with NAR SLT models remains challenging due to (1) existing NAR-PLMs’ inability to model token dependencies between decoder layers, crucial for NAR SLT models using CND; (2) the modality gap between the decoder’s inputs of the NAR-PLMs and NAR SLT models. To address these, we propose a Random Ordering Progressive Prediction Pre-training task for NAR SLT models using CND, enabling the decoder to predict target sequences in diverse orderings and enhancing the modeling of target token dependencies between layers. Moreover, we propose a CTC-enhanced Soft Copy method to incorporate target-side information in the decoder’s inputs, alleviating the modality gap. Experimental results on PHOENIX-2014T and CSL-Daily demonstrate that our model consistently outperforms all strong baselines and achieves competitive performance with AR SLT models equipped with AR-PLMs.


AAAI Conference 2024 Conference Paper

Layer-Wise Representation Fusion for Compositional Generalization

Yafang Zheng
Lei Lin
Shuangtao Li
Yuxuan Yuan
Zhaohong Lai
Shan Liu
Biao Fu
Yidong Chen

Existing neural models have been shown to struggle with compositional generalization (CG), i.e., the ability to systematically generalize to unseen compositions of seen components. A key reason for failure on CG is that the syntactic and semantic representations of sequences in the uppermost layers of both the encoder and decoder are entangled. However, previous work concentrates on separating the learning of syntax and semantics instead of exploring the reasons behind the representation entanglement (RE) problem to solve it. We explain why it exists by analyzing the representation evolving mechanism from the bottom to the top of the Transformer layers. We find that the "shallow" residual connections within each layer fail to fuse previous layers' information effectively, leading to information forgetting between layers and, in turn, to the RE problem. Inspired by this, we propose LRF, a novel Layer-wise Representation Fusion framework for CG, which learns to fuse previous layers' information back into the encoding and decoding process effectively through introducing a fuse-attention module at each encoder and decoder layer. LRF achieves promising results on two realistic benchmarks, empirically demonstrating the effectiveness of our proposal. Code is available at https://github.com/thinkaboutzero/LRF.


IJCAI Conference 2023 Conference Paper

Efficient Sign Language Translation with a Curriculum-based Non-autoregressive Decoder

Pei Yu
Liang Zhang
Biao Fu
Yidong Chen

Most existing studies on Sign Language Translation (SLT) employ an AutoRegressive Decoding Mechanism (AR-DM) to generate target sentences. However, the main disadvantage of the AR-DM is high inference latency. To address this problem, we introduce a Non-AutoRegressive Decoding Mechanism (NAR-DM) into SLT, which generates the whole sentence at once. Meanwhile, to improve its decoding ability, we integrate the advantages of curriculum learning and NAR-DM and propose a Curriculum-based NAR Decoder (CND). Specifically, the lower layers of the CND are expected to predict simple tokens that could be predicted correctly using source-side information solely. Meanwhile, the upper layers could predict complex tokens based on the lower layers' predictions. Therefore, our CND significantly reduces the model's inference latency while maintaining its competitive performance. Moreover, to further boost the performance of our CND, we propose a mutual learning framework containing two decoders, i.e., an AR decoder and our CND. We jointly train the two decoders and minimize the KL divergence between their outputs, which enables our CND to learn the forward sequential knowledge from the strengthened AR decoder. Experimental results on PHOENIX2014T and CSL-Daily demonstrate that our model consistently outperforms all competitive baselines and achieves 7.92×/8.02× speed-ups over the AR SLT model, respectively. Our source code is available at https://github.com/yp20000921/CND.


AAAI Conference 2023 Conference Paper

Exploring Self-Distillation Based Relational Reasoning Training for Document-Level Relation Extraction

Liang Zhang
Jinsong Su
Zijun Min
Zhongjian Miao
Qingguo Hu
Biao Fu
Xiaodong Shi
Yidong Chen

Document-level relation extraction (RE) aims to extract relational triples from a document. One of its primary challenges is to predict implicit relations between entities, which are not explicitly expressed in the document but can usually be extracted through relational reasoning. Previous methods mainly model relational reasoning implicitly through the interaction among entities or entity pairs. However, they suffer from two deficiencies: 1) they often consider only one reasoning pattern, whose coverage of relational triples is limited; 2) they do not explicitly model the process of relational reasoning. In this paper, to deal with the first problem, we propose a document-level RE model with a reasoning module that contains a core unit, the reasoning multi-head self-attention unit. This unit is a variant of the conventional multi-head self-attention and utilizes four attention heads to model four common reasoning patterns, respectively, which can cover more relational triples than previous methods. Then, to address the second issue, we propose a self-distillation training framework, which contains two branches sharing parameters. In the first branch, we first randomly mask some entity pair feature vectors in the document, and then train our reasoning module to infer their relations by exploiting the feature information of other related entity pairs. By doing so, we can explicitly model the process of relational reasoning. However, because the additional masking operation is not used during testing, it causes an input gap between training and testing scenarios, which would hurt the model performance. To reduce this gap, we perform conventional supervised training without the masking operation in the second branch and utilize a Kullback-Leibler divergence loss to minimize the difference between the predictions of the two branches. Finally, we conduct comprehensive experiments on three benchmark datasets, and the results demonstrate that our model consistently outperforms all competitive baselines. Our source code is available at https://github.com/DeepLearnXMU/DocRE-SD
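The two-branch consistency term described in the abstract above can be sketched in plain Python. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, and the KL direction shown, KL(unmasked || masked), is an assumption because the abstract does not state which direction is used.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    # KL(p || q) = sum_i p_i * log(p_i / q_i); eps guards against log(0).
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(masked_logits: Sequence[float], unmasked_logits: Sequence[float]) -> float:
    # Assumed direction: KL(unmasked || masked). The abstract only says the
    # loss minimizes the difference between the two branches' predictions.
    return kl_divergence(softmax(unmasked_logits), softmax(masked_logits))


# Relation logits for one entity pair from each branch (made-up numbers).
print(consistency_loss([2.0, 0.5, 0.1], [1.5, 0.7, 0.2]))
```

The loss is zero when both branches produce the same distribution and grows as their predictions diverge, which is what lets it narrow the gap between masked training inputs and unmasked test inputs.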
