Author name cluster

Nan Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

18 papers

2 author rows

AAAI Conference 2026 Conference Paper

Optimal Look-back Horizon for Time Series Forecasting in Federated Learning

Dahao Tang
Nan Yang
Yanli Li
Zhiyu Zhu
Zhibo Jin
Dong Yuan

Selecting an appropriate look-back horizon remains a fundamental challenge in time series forecasting (TSF), particularly in federated learning scenarios where data is decentralized, heterogeneous, and often non-independent. While recent work has explored horizon selection by preserving forecasting-relevant information in an intrinsic space, these approaches are primarily restricted to centralized and independently distributed settings. This paper presents a principled framework for adaptive horizon selection in federated time series forecasting through an intrinsic space formulation. We introduce a synthetic data generator that captures essential temporal structures in client data, including autoregressive dependencies, seasonality, and trend, while incorporating client-specific heterogeneity. Building on this model, we define a transformation that maps time series windows into an intrinsic representation space with well-defined geometric and statistical properties. We then derive a decomposition of the forecasting loss into a Bayesian term, which reflects irreducible uncertainty, and an approximation term, which accounts for finite-sample effects and limited model capacity. Our analysis shows that while increasing the look-back horizon improves the identifiability of deterministic patterns, it also increases approximation error due to higher model complexity and reduced sample efficiency. We prove that the total forecasting loss is minimized at the smallest horizon where the irreducible loss starts to saturate, while the approximation loss continues to rise. This work provides a rigorous theoretical foundation for adaptive horizon selection for time series forecasting in federated learning.

PDF Details DOI

AAAI Conference 2026 Conference Paper

Scaling Law Analysis in Federated Learning: How to Select the Optimal Model Size?

Xuanyu Chen
Nan Yang
Shuai Wang
Dong Yuan

The recent success of large language models (LLMs) has sparked a growing interest in training large-scale models. As the model size continues to scale, concerns are growing about the depletion of high-quality, well-curated training data. This has led practitioners to explore training approaches like Federated Learning (FL), which can leverage the abundant data on edge devices while maintaining privacy. However, the decentralization of training datasets in FL introduces challenges to scaling large models, a topic that remains under-explored. This paper fills this gap and provides qualitative insights on generalizing the previous model scaling experience to federated learning scenarios. Specifically, we derive a PAC-Bayes (Probably Approximately Correct Bayesian) upper bound for the generalization error of models trained with stochastic algorithms in federated settings and quantify the impact of distributed training data on the optimal model size by finding the analytic solution of model size that minimizes this bound. Our theoretical results demonstrate that the optimal model size has a negative power law relationship with the number of clients if the total training compute is unchanged. Besides, we also find that switching to FL with the same training compute will inevitably reduce the upper bound of generalization performance that the model can achieve through training, and that estimating the optimal model size in federated scenarios should depend on the average training compute across clients. Furthermore, we also empirically validate the correctness of our results with extensive training runs on different models, network settings, and datasets.

PDF Details DOI

AAAI Conference 2026 Conference Paper

Synthetic Forgetting Without Access: A Few-Shot Zero-Glance Framework for Machine Unlearning

Qipeng Song
Nan Yang
Ziqi Xu
Yue Li
WEI SHAO
Feng Xia

Machine unlearning aims to eliminate the influence of specific data from trained models to ensure privacy compliance. However, most existing methods assume full access to the original training dataset, which is often impractical. We address a more realistic yet challenging setting: few-shot zero-glance, where only a small subset of the retained data is available and the forget set is entirely inaccessible. We introduce GFOES, a novel framework comprising a Generative Feedback Network (GFN) and a two-phase fine-tuning procedure. GFN synthesises Optimal Erasure Samples (OES), which induce high loss on target classes, enabling the model to forget class-specific knowledge without access to the original forget data, while preserving performance on retained classes. The two-phase fine-tuning procedure enables aggressive forgetting in the first phase, followed by utility restoration in the second. Experiments on three image classification datasets demonstrate that GFOES achieves effective forgetting at both logit and representation levels, while maintaining strong performance using only 5% of the original data. Our framework offers a practical and scalable solution for privacy-preserving machine learning under data-constrained conditions.

PDF Details DOI

NeurIPS Conference 2025 Conference Paper

Chain-of-Retrieval Augmented Generation

Liang Wang
Haonan Chen
Nan Yang
Xiaolong Huang
Zhicheng Dou
Furu Wei

This paper introduces an approach for training o1-like RAG models that retrieve and reason over relevant information step by step before generating the final answer. Conventional RAG methods usually perform a single retrieval step before the generation process, which limits their effectiveness in addressing complex queries due to imperfect retrieval results. In contrast, our proposed method, CoRAG (Chain-of-Retrieval Augmented Generation), allows the model to dynamically reformulate the query based on the evolving state. To train CoRAG effectively, we utilize rejection sampling to automatically generate intermediate retrieval chains, thereby augmenting existing RAG datasets that only provide the correct final answer. At test time, we propose various decoding strategies to scale the model's test-time compute by controlling the length and number of sampled retrieval chains. Experimental results across multiple benchmarks validate the efficacy of CoRAG, particularly in multi-hop question answering tasks, where we observe more than $10$ points improvement in EM score compared to strong baselines. On the KILT benchmark, CoRAG establishes a new state-of-the-art performance across a diverse range of knowledge-intensive tasks. Furthermore, we offer comprehensive analyses to understand the scaling behavior of CoRAG, laying the groundwork for future research aimed at developing factual and grounded foundation models.

PDF Details

AAAI Conference 2025 Conference Paper

Class and Attribute-Aware Logit Adjustment for Generalized Long-Tail Learning

Xiaoling Zhou
Ou Wu
Nan Yang

Compared to conventional long-tail learning, which focuses on addressing class-wise imbalances, generalized long-tail (GLT) learning considers that samples within each class still conform to long-tailed distributions due to varying attributes, known as attribute imbalance. In the presence of such imbalance, the assumption of equivalence between the class-conditional probability densities of the training and testing sets is no longer tenable. Existing GLT approaches typically employ regularization techniques to avoid directly modeling the class-conditional probability density (CCPD) ratio between training and test data, leading to suboptimal performance. This study aims to directly estimate this ratio, for which a novel class-attribute aware logit-adjusted (CALA) loss incorporating both the CCPD ratio and the class priors is presented. Two new GLT learning methods, named Heuristic-CALA and Meta-CALA, are then proposed, which estimate the CCPD ratio in the CALA loss by leveraging the neighborhood information of samples. Extensive experiments across diverse scenarios susceptible to class and attribute imbalances showcase the state-of-the-art performance of Meta-CALA. Furthermore, while Heuristic-CALA exhibits inferior performance compared to Meta-CALA, it incurs only negligible additional training time compared to the Cross-Entropy loss, yet surpasses existing methods by a significant margin.

PDF Details DOI

ICLR Conference 2025 Conference Paper

Narrowing Information Bottleneck Theory for Multimodal Image-Text Representations Interpretability

Zhiyu Zhu
Zhibo Jin
Jiayu Zhang 0001
Nan Yang
Jiahao Huang
Jianlong Zhou
Fang Chen 0001

The task of identifying multimodal image-text representations has garnered increasing attention, particularly with models such as CLIP (Contrastive Language-Image Pretraining), which demonstrate exceptional performance in learning complex associations between images and text. Despite these advancements, ensuring the interpretability of such models is paramount for their safe deployment in real-world applications, such as healthcare. While numerous interpretability methods have been developed for unimodal tasks, these approaches often fail to transfer effectively to multimodal contexts due to inherent differences in the representation structures. Bottleneck methods, well-established in information theory, have been applied to enhance CLIP's interpretability. However, they are often hindered by strong assumptions or intrinsic randomness. To overcome these challenges, we propose the Narrowing Information Bottleneck Theory, a novel framework that fundamentally redefines the traditional bottleneck approach. This theory is specifically designed to satisfy contemporary attribution axioms, providing a more robust and reliable solution for improving the interpretability of multimodal models. In our experiments, compared to state-of-the-art methods, our approach enhances image interpretability by an average of 9\%, text interpretability by an average of 58.83\%, and accelerates processing speed by 63.95\%. Our code is publicly accessible at https://github.com/LMBTough/NIB.

Details

EAAI Journal 2025 Journal Article

Point cloud semantic segmentation network based on graph convolution and attention mechanism

Nan Yang
Yong Wang
Lei Zhang
Bin Jiang

Point cloud data provides rich three-dimensional spatial information. Accurate three-dimensional point cloud semantic segmentation algorithms enhance environmental understanding and perception, with wide-ranging applications in autonomous driving and scene analysis. However, Graph Neural Networks often struggle to retain semantic relationships among neighboring points during feature extraction, potentially leading to the omission of critical features during aggregation. To address these challenges, we propose a novel network, the Feature-Enhanced Residual Attention Network. This network includes an innovative graph convolution module, the Neighborhood-Enhanced Convolutional Aggregation Module, which utilizes K-Nearest Neighbor and Dilated K-Nearest Neighbor techniques to construct diverse dynamic graphs and aggregate features, thereby prioritizing essential information. This approach significantly enhances the expressiveness and generalization capabilities of the network. Additionally, we introduce a new spatial attention module designed to capture semantic relationships among points. Experimental results demonstrate that the Feature-Enhanced Residual Attention Network outperforms benchmark models, achieving an average intersection ratio of 61. 3% and an overall accuracy of 86. 7% on the Stanford Large-Scale Three-dimensional Indoor Spaces dataset, thereby significantly improving semantic segmentation performance.

Details DOI

AAAI Conference 2025 Conference Paper

SMamba: Sparse Mamba for Event-based Object Detection

Nan Yang
Yang Wang
Zhanwen Liu
Meng Li
Yisheng An
Xiangmo Zhao

Transformer-based methods have achieved remarkable performance in event-based object detection, owing to the global modeling ability. However, they neglect the influence of non-event and noisy regions and process them uniformly, leading to high computational overhead. To mitigate computation cost, some researchers propose window attention based sparsification strategies to discard unimportant regions, which sacrifices the global modeling ability and results in suboptimal performance. To achieve better trade-off between accuracy and efficiency, we propose Sparse Mamba (SMamba), which performs adaptive sparsification to reduce computational effort while maintaining global modeling capability. Specifically, a Spatio-Temporal Continuity Assessment module is proposed to measure the information content of tokens and discard uninformative ones by leveraging the spatiotemporal distribution differences between activity and noise events. Based on the assessment results, an Information-Prioritized Local Scan strategy is designed to shorten the scan distance between high-information tokens, facilitating interactions among them in the spatial dimension. Furthermore, to extend the global interaction from 2D space to 3D representations, a Global Channel Interaction module is proposed to aggregate channel information from a global spatial perspective. Results on three datasets (Gen1, 1Mpx, and eTram) demonstrate that our model outperforms other methods in both performance and efficiency.

PDF Details DOI

AAAI Conference 2024 Conference Paper

Learning to Rank in Generative Retrieval

Yongqi Li
Nan Yang
Liang Wang
Furu Wei
Wenjie Li

Generative retrieval stands out as a promising new paradigm in text retrieval that aims to generate identifier strings of relevant passages as the retrieval target. This generative paradigm taps into powerful generative language models, distinct from traditional sparse or dense retrieval methods. However, only learning to generate is insufficient for generative retrieval. Generative retrieval learns to generate identifiers of relevant passages as an intermediate goal and then converts predicted identifiers into the final passage rank list. The disconnect between the learning objective of autoregressive models and the desired passage ranking target leads to a learning gap. To bridge this gap, we propose a learning-to-rank framework for generative retrieval, dubbed LTRGR. LTRGR enables generative retrieval to learn to rank passages directly, optimizing the autoregressive model toward the final passage ranking target via a rank loss. This framework only requires an additional learning-to-rank training phase to enhance current generative retrieval systems and does not add any burden to the inference stage. We conducted experiments on three public benchmarks, and the results demonstrate that LTRGR achieves state-of-the-art performance among generative retrieval methods. The code and checkpoints are released at https://github.com/liyongqi67/LTRGR.

PDF Details DOI

JMLR Journal 2024 Journal Article

Pygmtools: A Python Graph Matching Toolkit

Runzhong Wang
Ziao Guo
Wenzheng Pan
Jiale Ma
Yikai Zhang
Nan Yang
Qi Liu
Longxuan Wei

Graph matching aims to find node-to-node matching among multiple graphs, which is a fundamental yet challenging problem. To facilitate graph matching in scientific research and industrial applications, pygmtools is released, which is a Python graph matching toolkit that implements a comprehensive collection of two-graph matching and multi-graph matching solvers, covering both learning-free solvers as well as learning-based neural graph matching solvers. Our implementation supports numerical backends including Numpy, PyTorch, Jittor, Paddle, runs on Windows, MacOS and Linux, and is friendly to install and configure. Comprehensive documentations covering beginner's guide, API reference and examples are available online. pygmtools is open-sourced under Mulan PSL v2 license. [abs] [ pdf ][ bib ] [ code ] &copy JMLR 2024. ( edit, beta )

PDF Details

AAAI Conference 2023 Conference Paper

Combining Adversaries with Anti-adversaries in Training

Xiaoling Zhou
Nan Yang
Ou Wu

Adversarial training is an effective learning technique to improve the robustness of deep neural networks. In this study, the influence of adversarial training on deep learning models in terms of fairness, robustness, and generalization is theoretically investigated under more general perturbation scope that different samples can have different perturbation directions (the adversarial and anti-adversarial directions) and varied perturbation bounds. Our theoretical explorations suggest that the combination of adversaries and anti-adversaries (samples with anti-adversarial perturbations) in training can be more effective in achieving better fairness between classes and a better tradeoff between robustness and generalization in some typical learning scenarios (e.g., noisy label learning and imbalance learning) compared with standard adversarial training. On the basis of our theoretical findings, a more general learning objective that combines adversaries and anti-adversaries with varied bounds on each training sample is presented. Meta learning is utilized to optimize the combination weights. Experiments on benchmark datasets under different learning scenarios verify our theoretical findings and the effectiveness of the proposed methodology.

PDF Details DOI

NeurIPS Conference 2020 Conference Paper

MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

Wenhui Wang
Furu Wei
Li Dong
Hangbo Bao
Nan Yang
Ming Zhou

Pre-trained language models (e. g. , BERT (Devlin et al. , 2018) and its variants) have achieved remarkable success in varieties of NLP tasks. However, these models usually consist of hundreds of millions of parameters which brings challenges for fine-tuning and online serving in real-life applications due to latency and capacity constraints. In this work, we present a simple and effective approach to compress large Transformer (Vaswani et al. , 2017) based pre-trained models, termed as deep self-attention distillation. The small model (student) is trained by deeply mimicking the self-attention module, which plays a vital role in Transformer networks, of the large model (teacher). Specifically, we propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student. Furthermore, we introduce the scaled dot-product between values in the self-attention module as the new deep self-attention knowledge, in addition to the attention distributions (i. e. , the scaled dot-product of queries and keys) that have been used in existing works. Moreover, we show that introducing a teacher assistant (Mirzadeh et al. , 2019) also helps the distillation of large pre-trained Transformer models. Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines in different parameter size of student models. In particular, it retains more than 99% accuracy on SQuAD 2. 0 and several GLUE benchmark tasks using 50% of the Transformer parameters and computations of the teacher model. We also obtain competitive results in applying deep self-attention distillation to multilingual pre-trained models.

PDF Details

AAAI Conference 2019 Conference Paper

Read + Verify: Machine Reading Comprehension with Unanswerable Questions

Minghao Hu
Furu Wei
Yuxing Peng
Zhen Huang
Nan Yang
Dongsheng Li

Machine reading comprehension with unanswerable questions aims to abstain from answering when no answer can be inferred. In addition to extract answers, previous works usually predict an additional “no-answer” probability to detect unanswerable cases. However, they fail to validate the answerability of the question by verifying the legitimacy of the predicted answer. To address this problem, we propose a novel read-then-verify system, which not only utilizes a neural reader to extract candidate answers and produce noanswer probabilities, but also leverages an answer verifier to decide whether the predicted answer is entailed by the input snippets. Moreover, we introduce two auxiliary losses to help the reader better handle answer extraction as well as noanswer detection, and investigate three different architectures for the answer verifier. Our experiments on the SQuAD 2. 0 dataset show that our system obtains a score of 74. 2 F1 on test set, achieving state-of-the-art results at the time of submission (Aug. 28th, 2018).

PDF Details

NeurIPS Conference 2019 Conference Paper

Unified Language Model Pre-training for Natural Language Understanding and Generation

Li Dong
Nan Yang
Wenhui Wang
Furu Wei
Xiaodong Liu
Yu Wang
Jianfeng Gao
Ming Zhou

This paper presents a new Unified pre-trained Language Model (UniLM) that can be fine-tuned for both natural language understanding and generation tasks. The model is pre-trained using three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. The unified modeling is achieved by employing a shared Transformer network and utilizing specific self-attention masks to control what context the prediction conditions on. UniLM compares favorably with BERT on the GLUE benchmark, and the SQuAD 2. 0 and CoQA question answering tasks. Moreover, UniLM achieves new state-of-the-art results on five natural language generation datasets, including improving the CNN/DailyMail abstractive summarization ROUGE-L to 40. 51 (2. 04 absolute improvement), the Gigaword abstractive summarization ROUGE-L to 35. 75 (0. 86 absolute improvement), the CoQA generative question answering F1 score to 82. 5 (37. 1 absolute improvement), the SQuAD question generation BLEU-4 to 22. 12 (3. 75 absolute improvement), and the DSTC7 document-grounded dialog response generation NIST-4 to 2. 67 (human performance is 2. 65). The code and pre-trained models are available at https: //github. com/microsoft/unilm.

PDF Details

AAAI Conference 2018 Conference Paper

S-Net: From Answer Extraction to Answer Synthesis for Machine Reading Comprehension

Chuanqi Tan
Furu Wei
Nan Yang
Bowen Du
Weifeng Lv
Ming Zhou

In this paper, we present a novel approach to machine reading comprehension for the MS-MARCO dataset. Unlike the SQuAD dataset that aims to answer a question with exact text spans in a passage, the MS-MARCO dataset deﬁnes the task as answering a question from multiple passages and the words in the answer are not necessary in the passages. We therefore develop an extraction-then-synthesis framework to synthesize answers from extraction results. Speciﬁcally, the answer extraction model is ﬁrst employed to predict the most important sub-spans from the passage as evidence, and the answer synthesis model takes the evidence as additional features along with the question and passage to further elaborate the ﬁnal answers. We build the answer extraction model with state-ofthe-art neural networks for single passage reading comprehension, and propose an additional task of passage ranking to help answer extraction in multiple passages. The answer synthesis model is based on the sequence-to-sequence neural networks with extracted evidences as features. Experiments show that our extraction-then-synthesis method outperforms state-of-the-art methods.

PDF Details

AAAI Conference 2018 Conference Paper

Sequential Copying Networks

Qingyu Zhou
Nan Yang
Furu Wei
Ming Zhou

Copying mechanism shows effectiveness in sequence-tosequence based neural network models for text generation tasks, such as abstractive sentence summarization and question generation. However, existing works on modeling copying or pointing mechanism only considers single word copying from the source sentences. In this paper, we propose a novel copying framework, named Sequential Copying Networks (SeqCopyNet), which not only learns to copy single words, but also copies sequences from the input sentence. It leverages the pointer networks to explicitly select a subspan from the source side to target side, and integrates this sequential copying mechanism to the generation process in the encoder-decoder paradigm. Experiments on abstractive sentence summarization and question generation tasks show that the proposed SeqCopyNet can copy meaningful spans and outperforms the baseline models.

PDF Details

AAAI Conference 2016 Conference Paper

Jointly Modeling Topics and Intents with Global Order Structure

Bei Chen
Jun Zhu
Nan Yang
Tian Tian
Ming Zhou
Bo Zhang

Modeling document structure is of great importance for discourse analysis and related applications. The goal of this research is to capture the document intent structure by modeling documents as a mixture of topic words and rhetorical words. While the topics are relatively unchanged through one document, the rhetorical functions of sentences usually change following certain orders in discourse. We propose GMM-LDA, a topic modeling based Bayesian unsupervised model, to analyze the document intent structure cooperated with order information. Our model is ﬂexible that has the ability to combine the annotations and do supervised learning. Additionally, entropic regularization can be introduced to model the signiﬁcant divergence between topics and intents. We perform experiments in both unsupervised and supervised settings, results show the superiority of our model over several state-of-the-art baselines.

PDF Details

IJCAI Conference 2015 Conference Paper

Modeling Mention, Context and Entity with Neural Networks for Entity Disambiguation

Yaming Sun
Lei Lin
Duyu Tang
Nan Yang
Zhenzhou Ji
Xiaolong Wang

Given a query consisting of a mention (name string) and a background document, entity disambiguation calls for linking the mention to an entity from reference knowledge base like Wikipedia. Existing studies typically use hand-crafted features to represent mention, context and entity, which is laborintensive and weak to discover explanatory factors of data. In this paper, we address this problem by presenting a new neural network approach. The model takes consideration of the semantic representations of mention, context and entity, encodes them in continuous vector space and effectively leverages them for entity disambiguation. Specifically, we model variable-sized contexts with convolutional neural network, and embed the positions of context words to factor in the distance between context word and mention. Furthermore, we employ neural tensor network to model the semantic interactions between context and mention. We conduct experiments for entity disambiguation on two benchmark datasets from TAC-KBP 2009 and 2010. Experimental results show that our method yields state-of-the-art performances on both datasets.

PDF Details