Author name cluster

Xiaoling Zhou

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers

2 author rows

AAAI Conference 2026 Conference Paper

ASKD: Reinforcement Learning-Style Knowledge Distillation with Quality-Adaptive Skewness

Mingjie Zhang
Xiaoling Zhou
Yuxiao Luo
Yiyu Liu
Shikun Zhang
Wei Ye

Knowledge distillation (KD) is a widely adopted technique for transferring the capabilities of large teacher models to smaller student models, thereby significantly reducing inference costs and memory consumption. However, existing KD methods are all constrained by an inherent greedy optimization objective, rooted in the assumption of teacher superiority: "Trust all teacher-generated outputs (TGOs)" and "Distrust any student-generated outputs (SGOs) unsupported by the teacher". We propose ASKD, a novel KD method with adaptive skewness determined by sample quality, refining this objective to: "Learn TGOs proportionally to their quality, and distrust only low-quality unsupported SGOs". ASKD comprises three key components: (1) A reinforcement learning-style optimization formulation to mitigate the inherent approximation bias in sample-based Kullback-Leibler (KL) divergence approximations used by previous KD methods; (2) Well-designed quality supervision signals to map and achieve adaptive skewness in skewed KL loss, pioneering the usage of sample quality to adjust learning magnitudes; (3) A gradient-clip function on high-quality SGOs for findings that high-quality SGOs in KL loss fail to yield positive updates and even cause adverse effects on some samples. Extensive experiments indicate that ASKD builds high-performance student models across various tasks, including instruction following, mathematical reasoning, and code generation, outperforming state-of-the-art methods comprehensively and surpassing GRPO-like approaches that use advantages as multiplicative factors. We also provide detailed mathematical proofs demonstrating properties such as Lipschitz continuity of the update coefficient and uniform convergence of the loss function, ensuring theoretical rigor for key components of ASKD.

PDF Details DOI

NeurIPS Conference 2025 Conference Paper

Boosting Resilience of Large Language Models through Causality-Driven Robust Optimization

Xiaoling Zhou
Mingjie Zhang
Zhemg Lee
Yuncheng Hua
chengli xing
Wei Ye
Flora Salim
Shikun Zhang

Large language models (LLMs) have achieved remarkable achievements across diverse applications; however, they remain plagued by spurious correlations and the generation of hallucinated content. Despite extensive efforts to enhance the resilience of LLMs, existing approaches either rely on indiscriminate fine-tuning of all parameters, resulting in parameter inefficiency and lack of specificity, or depend on post-processing techniques that offer limited adaptability and flexibility. This study introduces a novel Causality-driven Robust Optimization (CdRO) approach that selectively updates model components sensitive to causal reasoning, enhancing model causality while preserving valuable pretrained knowledge to mitigate overfitting. Our method begins by identifying the parameter components within LLMs that capture causal relationships, achieved through comparing the training dynamics of parameter matrices associated with the original samples, as well as augmented counterfactual and paraphrased variants. These comparisons are then fed into a lightweight logistic regression model, optimized in real time to dynamically identify and adapt the causal components within LLMs. The identified parameters are subsequently optimized using an enhanced policy optimization algorithm, where the reward function is designed to jointly promote both model generalization and robustness. Extensive experiments across various tasks using twelve different LLMs demonstrate the superior performance of our framework, underscoring its significant effectiveness in reducing the model’s dependence on spurious associations and mitigating hallucinations.

PDF Details

AAAI Conference 2025 Conference Paper

Class and Attribute-Aware Logit Adjustment for Generalized Long-Tail Learning

Xiaoling Zhou
Ou Wu
Nan Yang

Compared to conventional long-tail learning, which focuses on addressing class-wise imbalances, generalized long-tail (GLT) learning considers that samples within each class still conform to long-tailed distributions due to varying attributes, known as attribute imbalance. In the presence of such imbalance, the assumption of equivalence between the class-conditional probability densities of the training and testing sets is no longer tenable. Existing GLT approaches typically employ regularization techniques to avoid directly modeling the class-conditional probability density (CCPD) ratio between training and test data, leading to suboptimal performance. This study aims to directly estimate this ratio, for which a novel class-attribute aware logit-adjusted (CALA) loss incorporating both the CCPD ratio and the class priors is presented. Two new GLT learning methods, named Heuristic-CALA and Meta-CALA, are then proposed, which estimate the CCPD ratio in the CALA loss by leveraging the neighborhood information of samples. Extensive experiments across diverse scenarios susceptible to class and attribute imbalances showcase the state-of-the-art performance of Meta-CALA. Furthermore, while Heuristic-CALA exhibits inferior performance compared to Meta-CALA, it incurs only negligible additional training time compared to the Cross-Entropy loss, yet surpasses existing methods by a significant margin.

PDF Details DOI

ICLR Conference 2025 Conference Paper

HaDeMiF: Hallucination Detection and Mitigation in Large Language Models

Xiaoling Zhou
Mingjie Zhang
Zhemg Lee
Wei Ye 0004
Shikun Zhang

The phenomenon of knowledge hallucinations has raised substantial concerns about the security and reliability of deployed large language models (LLMs). Current methods for detecting hallucinations primarily depend on manually designed individual metrics, such as prediction uncertainty and consistency, and fall short in effectively calibrating model predictions, thus constraining their detection accuracy and applicability in practical applications. In response, we propose an advanced framework, termed HaDeMiF, for detecting and mitigating hallucinations in LLMs. Specifically, hallucinations within the output and semantic spaces of LLMs are comprehensively captured through two compact networks—a novel, interpretable tree model known as the Deep Dynamic Decision Tree (D3T) and a Multilayer Perceptron (MLP)—which take as input a set of prediction characteristics and the hidden states of tokens, respectively. The predictions of LLMs are subsequently calibrated using the outputs from the D3T and MLP networks, aiming to mitigate hallucinations and enhance model calibration. HaDeMiF can be applied during both the inference and fine-tuning phases of LLMs, introducing less than 2% of the parameters relative to the LLMs through the training of two small-scale networks. Extensive experiments conclusively demonstrate the effectiveness of our framework in hallucination detection and model calibration across text generation tasks with responses of varying lengths.

Details

IJCAI Conference 2025 Conference Paper

Robustness to Spurious Correlations via Dynamic Knowledge Transfer

Xiaoling Zhou
Wei Ye
Zhemg Lee
Shikun Zhang

Spurious correlations pose a significant challenge to the robustness of statistical models, often resulting in unsatisfactory performance when distributional shifts occur between training and testing data. To address this, we propose to transfer knowledge across spuriously correlated categories within the deep feature space. Specifically, samples' deep features are enriched using semantic vectors extracted from both their respective category distributions and those of their spuriously correlated counterparts, enabling the generation of diverse class-specific factual and counterfactual augmented deep features. We then demonstrate the feasibility of optimizing a surrogate robust loss instead of conducting explicit augmentations by considering an infinite number of augmentations. As spurious correlations between samples and classes evolve during training, we develop a reinforcement learning-based training framework called Dynamic Knowledge Transfer (DKT) to facilitate dynamic adjustments in the direction and intensity of knowledge transfer. Within this framework, a target network is trained using the derived robust loss to enhance robustness, while a strategy network generates sample-wise augmentation strategies in a dynamic and automatic way. Extensive experiments validate the effectiveness of the DKT framework in mitigating spurious correlations, achieving state-of-the-art performance across three typical learning scenarios susceptible to such correlations.

PDF Details DOI

IJCAI Conference 2024 Conference Paper

Boosting Model Resilience via Implicit Adversarial Data Augmentation

Xiaoling Zhou
Wei Ye
Zhemg Lee
Rui Xie
Shikun Zhang

Data augmentation plays a pivotal role in enhancing and diversifying training data. Nonetheless, consistently improving model performance in varied learning scenarios, especially those with inherent data biases, remains challenging. To address this, we propose to augment the deep features of samples by incorporating their adversarial and anti-adversarial perturbation distributions, enabling adaptive adjustment in the learning difficulty tailored to each sample’s specific characteristics. We then theoretically reveal that our augmentation process approximates the optimization of a surrogate loss function as the number of augmented copies increases indefinitely. This insight leads us to develop a meta-learning-based framework for optimizing classifiers with this novel loss, introducing the effects of augmentation while bypassing the explicit augmentation process. We conduct extensive experiments across four common biased learning scenarios: long-tail learning, generalized long-tail learning, noisy label learning, and subpopulation shift learning. The empirical results demonstrate that our method consistently achieves state-of-the-art performance, highlighting its broad adaptability.

PDF Details DOI

AAAI Conference 2023 Conference Paper

Combining Adversaries with Anti-adversaries in Training

Xiaoling Zhou
Nan Yang
Ou Wu

Adversarial training is an effective learning technique to improve the robustness of deep neural networks. In this study, the influence of adversarial training on deep learning models in terms of fairness, robustness, and generalization is theoretically investigated under more general perturbation scope that different samples can have different perturbation directions (the adversarial and anti-adversarial directions) and varied perturbation bounds. Our theoretical explorations suggest that the combination of adversaries and anti-adversaries (samples with anti-adversarial perturbations) in training can be more effective in achieving better fairness between classes and a better tradeoff between robustness and generalization in some typical learning scenarios (e.g., noisy label learning and imbalance learning) compared with standard adversarial training. On the basis of our theoretical findings, a more general learning objective that combines adversaries and anti-adversaries with varied bounds on each training sample is presented. Meta learning is utilized to optimize the combination weights. Experiments on benchmark datasets under different learning scenarios verify our theoretical findings and the effectiveness of the proposed methodology.

PDF Details DOI

AAAI Conference 2020 Conference Paper

A Recurrent Model for Collective Entity Linking with Adaptive Features

Xiaoling Zhou
Yukai Miao
Wei Wang
Jianbin Qin

The vast amount of web data enables us to build knowledge bases with unprecedented quality and coverage. Named Entity Disambiguation (NED) is an important task that automatically resolves ambiguous mentions in free text to correct target entries in the knowledge base. Traditional machine learning based methods for NED were outperformed and made obsolete by the state-of-the-art deep learning based models. However, deep learning models are more complex, requiring large amount of training data and lengthy training and parameter tuning time. In this paper, we revisit traditional machine learning techniques and propose a light-weight, tuneable and time-efﬁcient method without using deep learning or deep learning generated features. We propose novel adaptive features that focus on extracting discriminative features to better model similarities between candidate entities and the mention’s context. We learn a local ranking model based on traditional and the new adaptive features based on the learning-to-rank framework. While arriving at linking decisions individually via the local model, our method also takes into consideration the correlation between decisions by running multiple recurrent global models, which can be deemed as a learned local search method. Our method attains performances comparable to the state-of-the-art deep learning-based methods on NED benchmark datasets while being signiﬁcantly faster to train.

PDF Details

AAAI Conference 2019 Conference Paper

Antonym-Synonym Classification Based on New Sub-Space Embeddings

Muhammad Asif Ali
Yifang Sun
Xiaoling Zhou
Wei Wang
Xiang Zhao

Distinguishing antonyms from synonyms is a key challenge for many NLP applications focused on the lexical-semantic relation extraction. Existing solutions relying on large-scale corpora yield low performance because of huge contextual overlap of antonym and synonym pairs. We propose a novel approach entirely based on pre-trained embeddings. We hypothesize that the pre-trained embeddings comprehend a blend of lexical-semantic information and we may distill the task-specific information using Distiller, a model proposed in this paper. Later, a classifier is trained based on features constructed from the distilled sub-spaces along with some word level features to distinguish antonyms from synonyms. Experimental results show that the proposed model outperforms existing research on antonym synonym distinction in both speed and performance.

PDF Details