Arrow Research search

Author name cluster

Yiping Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers
2 author rows

Possible papers (9)

JBHI Journal 2025 Journal Article

DynSeizureGAT: Multi-Band Dynamic Graph Attention Network for Interpretable Seizure Detection and Analysis of Drug-Resistant Epilepsy Using SEEG

  • Yiping Wang
  • Jinjie Guo
  • Ziyu Jia
  • Gongpeng Cao
  • Yanfeng Yang
  • Guixia Kang
  • Jinguo Huang

The dynamic propagation of epileptic discharges complicates Drug-Resistant Epilepsy (DRE) seizure detection using traditional machine learning methods and Stereotactic Electroencephalography (SEEG). Several challenges remain unresolved in prior studies: (1) incomplete representations of epileptic brain network features; (2) a lack of flexible, dynamic mechanisms for learning how brain networks evolve; and (3) the absence of model interpretation that corresponds with seizure mechanisms. In response, we propose a novel multi-band dynamic graph attention network, DynSeizureGAT, to detect and analyze DRE seizures with precision and interpretability. Specifically, a seizure network sequence is first constructed by integrating a multi-band directed transfer function matrix and enhanced epileptic index node features. Second, a dynamic graph attention module is integrated to dynamically weigh the contributions of various spatial scales. Third, spatial-spectral-temporal attention mechanisms enhance the model's capacity to characterize and interpret the ictal and interictal states. Extensive experiments are conducted on a large-scale public clinical SEEG dataset (OpenNeuro). The proposed model demonstrates high seizure detection performance, achieving an average of 94.6% accuracy, 93.4% sensitivity, and 96.4% specificity. In addition, the importance of frequency bands and dynamic abnormal connectivity patterns is quantified and visualized, which contributes to explainability. Experimental results indicate that DynSeizureGAT learns dynamic propagation features that correspond with seizure propagation mechanisms, and is promising for assisting DRE epileptogenic zone localization.

JBHI Journal 2025 Journal Article

Efficient Slice-Patch Selection Transformer for Interpretable Alzheimer's Disease Diagnosis Using Structural MRI

  • Gongpeng Cao
  • Manli Zhang
  • Yiping Wang
  • Yuting Zhang
  • Jinjie Guo
  • Jinguo Huang
  • Guixia Kang

Structural magnetic resonance imaging (sMRI) plays a crucial role in the early screening of Alzheimer's disease (AD). Recent advances in vision Transformers (ViTs) demonstrate strong potential for sMRI-based computer-aided diagnosis by capturing long-range dependencies. However, their high computational demands and opaque decision-making hinder clinical application. To address these challenges, we explore the inherent structural redundancy in brain sMRI and propose an efficient and interpretable slice-patch selection Transformer (SPSFormer) framework that selectively focuses on task-relevant sMRI slices and patches, significantly reducing the computational overhead of existing ViTs. Specifically, SPSFormer employs lightweight learnable scorers placed before and within the ViT recognizer to estimate the importance of slice and patch candidates. Subsequently, a perturbed-maximum based differentiable Top-k operator is constructed to select the top-scoring elements for end-to-end training. We conduct rigorous cross-dataset validation (NACC, ADNI, AIBL) to evaluate generalizability. Across DeiT and Swin recognizers, SPSFormer reduces required GFLOPs by approximately $2-4\times$ while maintaining diagnostic accuracy. Analysis of learned selection policies highlights key regions (e.g., hippocampus, parahippocampal gyrus, amygdala, thalamus) consistent with established AD neuropathology, supporting interpretability. The model's predicted AD probability shows significant associations with cognitive and biomarker measures, confirming neurobiological validity, and offers prognostic value: higher predicted AD probability is associated with shorter time to conversion from mild cognitive impairment to AD. These findings suggest that coupling high computational efficiency with intrinsic explainability offers a promising direction toward clinically deployable, trustworthy artificial intelligence for AD detection.
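The perturbed-maximum Top-k relaxation mentioned in the abstract can be illustrated with a minimal sketch: averaging the hard top-k indicator over Gaussian-perturbed copies of the scores yields a soft selection mask whose expectation is differentiable in the scores. The function name, noise scale, and sample count below are illustrative choices, not the paper's implementation.

```python
import numpy as np

def perturbed_topk(scores, k, sigma=0.5, n_samples=100, rng=None):
    """Sketch of a perturbed-maximum Top-k relaxation: average the hard
    top-k indicator over Gaussian-perturbed copies of the scores.
    Returns a soft mask in [0, 1] that sums to k."""
    rng = rng or np.random.default_rng()
    noisy = scores[None, :] + sigma * rng.normal(size=(n_samples, len(scores)))
    mask = np.zeros_like(noisy)
    idx = np.argsort(-noisy, axis=1)[:, :k]   # hard top-k per perturbed sample
    np.put_along_axis(mask, idx, 1.0, axis=1)
    return mask.mean(axis=0)                  # soft, differentiable-in-expectation
```

In a training loop, a framework with automatic differentiation would propagate gradients through the expectation (e.g., via the straight-through or perturbed-optimizer estimator); the numpy version above only shows the forward pass.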

AAAI Conference 2025 Conference Paper

Infer Human’s Intentions Before Following Natural Language Instructions

  • Yanming Wan
  • Yue Wu
  • Yiping Wang
  • Jiayuan Mao
  • Natasha Jaques

For AI agents to be helpful to humans, they should be able to follow natural language instructions to complete everyday cooperative tasks in human environments. However, real human instructions inherently possess ambiguity, because the human speakers assume sufficient prior knowledge about their hidden goals and intentions. Standard language grounding and planning methods fail to address such ambiguities because they do not model human internal goals as additional partially observable factors in the environment. We propose a new framework, Follow Instructions with Social and Embodied Reasoning (FISER), aiming for better natural language instruction following in collaborative embodied tasks. Our framework makes explicit inferences about human goals and intentions as intermediate reasoning steps. We implement a set of Transformer-based models and evaluate them over a challenging benchmark, HandMeThat. We empirically demonstrate that using social reasoning to explicitly infer human intentions before making action plans surpasses purely end-to-end approaches. We also compare our implementation with strong baselines, including Chain of Thought prompting on the largest available pre-trained language models, and find that FISER provides better performance on the embodied social reasoning tasks under investigation, reaching the state-of-the-art on HandMeThat.

NeurIPS Conference 2025 Conference Paper

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

  • Yiping Wang
  • Qing Yang
  • Zhiyuan Zeng
  • Liliang Ren
  • Liyuan Liu
  • Baolin Peng
  • Hao Cheng
  • Xuehai He

We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs). Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0\% to 73.6\% (8.6\% improvement beyond format correction), and improves the average performance across six common mathematical reasoning benchmarks from 17.6\% to 35.7\% (7.0\% non-format gain). This result matches the performance obtained using the 1.2k DeepScaleR subset (MATH500: 73.6\%, average: 35.9\%), which contains the aforementioned example. Furthermore, RLVR with only two examples even slightly exceeds these results (MATH500: 74.8\%, average: 36.6\%). Similar substantial improvements are observed across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different math examples. In addition, we identify some interesting phenomena during 1-shot RLVR, including cross-category generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated, a phenomenon we term \textit{post-saturation generalization}. Moreover, we verify that the effectiveness of 1-shot RLVR primarily arises from the policy gradient loss, distinguishing it from the "grokking" phenomenon. We also show the critical role of promoting exploration (e.g., by incorporating entropy loss with an appropriate coefficient) in 1-shot RLVR training. We further discuss related observations about format correction, label robustness, and prompt modification. These findings can inspire future work on RLVR efficiency and encourage a re-examination of recent progress and the underlying mechanisms in RLVR. Our code, models, and data are open source at https://github.com/ypwang61/One-Shot-RLVR.
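The exploration-promoting entropy term the abstract mentions can be sketched in a simplified scalar form: a REINFORCE-style policy-gradient loss plus an entropy bonus weighted by a coefficient. All names, the scalar setup, and the coefficient value below are illustrative; this is not the paper's training code.

```python
import numpy as np

def pg_loss_with_entropy(logits, action, advantage, beta=0.01):
    """Sketch of an entropy-regularized policy-gradient objective:
    minimize -advantage * log pi(action) - beta * H(pi), so a larger
    beta rewards higher-entropy (more exploratory) policies."""
    p = np.exp(logits - logits.max())
    p /= p.sum()                          # softmax policy over actions/tokens
    log_p = np.log(p)
    pg = -advantage * log_p[action]       # REINFORCE-style term
    entropy = -(p * log_p).sum()          # policy entropy H(pi)
    return pg - beta * entropy            # entropy bonus promotes exploration
```

Picking an "appropriate coefficient" for `beta` matters because too small a bonus leaves exploration unchanged, while too large a bonus flattens the policy and swamps the reward signal.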

NeurIPS Conference 2024 Conference Paper

CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning

  • Yiping Wang
  • Yifang Chen
  • Wendan Yan
  • Alex Fang
  • Wenjing Zhou
  • Kevin Jamieson
  • Simon S. Du

Data selection has emerged as a core issue for large-scale visual-language model pretraining (e.g., CLIP), particularly with noisy web-curated datasets. Three main data selection approaches are: (1) leveraging external non-CLIP models to aid data selection, (2) training new CLIP-style embedding models that are more effective at selecting high-quality data than the original OpenAI CLIP model, and (3) designing better metrics or strategies universally applicable to any CLIP embedding without requiring specific model properties (e.g., CLIPScore is one popular metric). While the first two approaches have been extensively studied, the third remains under-explored. In this paper, we advance the third approach by proposing two new methods. First, instead of classical CLIP scores that only consider the alignment between two modalities from a single sample, we introduce $\textbf{negCLIPLoss}$, a method inspired by the CLIP training loss that adds the alignment between one sample and its contrastive pairs as an extra normalization term to CLIPScore for better quality measurement. Second, when downstream tasks are known, we propose a new norm-based metric, $\textbf{NormSim}$, to measure the similarity between pretraining data and target data. We test our methods on the data selection benchmark DataComp [Gadre et al., 2023]. Compared to the best baseline using only OpenAI's CLIP-L/14, our methods achieve a 5.3\% improvement on ImageNet-1k and a 2.8\% improvement on 38 downstream evaluation tasks. Moreover, both $\textbf{negCLIPLoss}$ and $\textbf{NormSim}$ are compatible with existing techniques. By combining our methods with the current best methods DFN [Fang et al., 2023] and HYPE [Kim et al., 2024], we can boost average performance on downstream tasks by 0.9\%, achieving a new state-of-the-art on the DataComp-medium benchmark.
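The negCLIPLoss idea can be sketched roughly as follows, assuming a simplified batch-level form: a pair's classical CLIPScore (cosine similarity) is reduced by log-partition terms over the sample's contrastive pairs, mirroring the structure of the CLIP training loss. The function names, the temperature value, and the exact normalization below are illustrative, not the paper's precise formula.

```python
import numpy as np

def clip_score(img, txt):
    """Classical CLIPScore: cosine similarity of one image/text pair."""
    return float(img @ txt / (np.linalg.norm(img) * np.linalg.norm(txt)))

def neg_clip_loss(img_emb, txt_emb, i, tau=0.01):
    """Sketch of a negCLIPLoss-style quality score for sample i: its
    CLIPScore minus a normalization built from its similarity to the
    other (contrastive) pairs in a batch."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sims = img @ txt.T                      # pairwise cosine similarities
    align = sims[i, i]                      # classical CLIPScore of pair i
    # log-partition terms over i's row (image->texts) and column (text->images)
    norm_i2t = tau * np.log(np.exp(sims[i] / tau).sum())
    norm_t2i = tau * np.log(np.exp(sims[:, i] / tau).sum())
    return align - 0.5 * (norm_i2t + norm_t2i)
```

Intuitively, a pair that is well aligned with itself but also generically similar to many other captions (a common failure of CLIPScore on templated web text) gets penalized by the normalization terms.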

ICLR Conference 2024 Conference Paper

JoMA: Demystifying Multilayer Transformers via Joint Dynamics of MLP and Attention

  • Yuandong Tian
  • Yiping Wang
  • Zhenyu Zhang 0015
  • Beidi Chen
  • Simon S. Du

We propose Joint MLP/Attention (JoMA) dynamics, a novel mathematical framework for understanding the training procedure of multilayer Transformer architectures. This is achieved by integrating out the self-attention layer in Transformers, producing a modified dynamics of the MLP layers only. JoMA removes unrealistic assumptions in previous analyses (e.g., lack of residual connections), and predicts that attention first becomes sparse (to learn salient tokens), then dense (to learn less salient tokens) in the presence of nonlinear activations, while in the linear case it is consistent with existing works. We leverage JoMA to qualitatively explain how tokens are combined to form hierarchies in multilayer Transformers when the input tokens are generated by a latent hierarchical generative model. Experiments on models trained on real-world datasets (Wikitext2/Wikitext103) and various pre-trained models (OPT, Pythia) verify our theoretical findings. The code is at https://github.com/facebookresearch/luckmatters/tree/yuandong3.

ICML Conference 2023 Conference Paper

Improved Active Multi-Task Representation Learning via Lasso

  • Yiping Wang
  • Yifang Chen 0001
  • Kevin Jamieson 0001
  • Simon S. Du

To leverage the copious amount of data from source tasks and overcome the scarcity of target task samples, representation learning based on multi-task pretraining has become a standard approach in many applications. However, most existing works design a source task selection strategy from a purely empirical perspective. Recently, Chen et al., 2022 gave the first active multi-task representation learning (A-MTRL) algorithm, which adaptively samples from source tasks and can provably reduce the total sample complexity using the L2-regularized-target-source-relevance parameter $\nu^2$. However, their work is theoretically suboptimal in terms of total source sample complexity and is less practical in some real-world scenarios where sparse training source task selection is desired. In this paper, we address both issues. Specifically, we show the strict dominance of the L1-regularized-relevance-based ($\nu^1$-based) strategy by giving a lower bound for the $\nu^2$-based strategy. When $\nu^1$ is unknown, we propose a practical algorithm that uses the LASSO program to estimate $\nu^1$. Our algorithm successfully recovers the optimal result in the known case. In addition to our sample complexity results, we also characterize the potential of our $\nu^1$-based strategy in sample-cost-sensitive settings. Finally, we provide experiments on real-world computer vision datasets to illustrate the effectiveness of our proposed method.
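The LASSO-based relevance estimation can be sketched as follows, under simplifying assumptions (a linear setup where column $t$ of `Z` is the target data passed through source task $t$'s representation); this is an illustrative coordinate-descent LASSO, not the paper's exact estimator, and the names and regularization value are hypothetical.

```python
import numpy as np

def lasso_relevance(Z, y, lam=0.1, n_iter=200):
    """Illustrative sketch: estimate a sparse source-relevance vector nu
    by coordinate-descent LASSO, i.e. minimize
        0.5 * ||y - Z @ nu||^2 + lam * ||nu||_1.
    Sparsity in nu then suggests which source tasks to sample from."""
    n, T = Z.shape
    nu = np.zeros(T)
    col_sq = (Z ** 2).sum(axis=0)
    for _ in range(n_iter):
        for t in range(T):
            r = y - Z @ nu + Z[:, t] * nu[t]   # residual excluding task t
            rho = Z[:, t] @ r
            # soft-thresholding update for coordinate t
            nu[t] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[t]
    return nu
```

The soft-thresholding step is what produces exact zeros, which is the property that makes an L1-regularized relevance vector attractive when sparse source task selection is desired.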

NeurIPS Conference 2023 Conference Paper

Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer

  • Yuandong Tian
  • Yiping Wang
  • Beidi Chen
  • Simon S. Du

The Transformer architecture has shown impressive performance in multiple research domains and has become the backbone of many neural network models. However, there is limited understanding of how it works. In particular, with a simple predictive loss, how the representation emerges from the gradient \emph{training dynamics} remains a mystery. In this paper, for a 1-layer transformer with one self-attention layer plus one decoder layer, we analyze its SGD training dynamics for the task of next token prediction in a mathematically rigorous manner. We open the black box of the dynamic process of how the self-attention layer combines input tokens, and reveal the nature of the underlying inductive bias. More specifically, under the assumptions that (a) there is no positional encoding, (b) the input sequence is long, and (c) the decoder layer learns faster than the self-attention layer, we prove that self-attention acts as a \emph{discriminative scanning algorithm}: starting from uniform attention, it gradually attends more to distinct key tokens for a specific next token to be predicted, and pays less attention to common key tokens that occur across different next tokens. Among distinct tokens, it progressively drops attention weights, following the order of low to high co-occurrence between the key and the query token in the training set. Interestingly, this procedure does not lead to winner-takes-all, but stops due to a \emph{phase transition} that is controllable by the learning rate of the decoder layer, leaving an (almost) fixed token combination. We verify this \textbf{\emph{scan and snap}} dynamics on synthetic and real-world data (WikiText-103).

NeurIPS Conference 2022 Conference Paper

C-Mixup: Improving Generalization in Regression

  • Huaxiu Yao
  • Yiping Wang
  • Linjun Zhang
  • James Y. Zou
  • Chelsea Finn

Improving the generalization of deep networks is an important open challenge, particularly in domains without plentiful data. The mixup algorithm improves generalization by linearly interpolating a pair of examples and their corresponding labels. These interpolated examples augment the original training set. Mixup has shown promising results in various classification tasks, but systematic analysis of mixup in regression remains underexplored. Using mixup directly on regression labels can result in arbitrarily incorrect labels. In this paper, we propose a simple yet powerful algorithm, C-Mixup, to improve generalization on regression tasks. In contrast with vanilla mixup, which picks training examples for mixing with uniform probability, C-Mixup adjusts the sampling probability based on the similarity of the labels. Our theoretical analysis confirms that C-Mixup with label similarity obtains a smaller mean squared error in supervised regression and meta-regression than vanilla mixup and using feature similarity. Another benefit of C-Mixup is that it can improve out-of-distribution robustness, where the test distribution is different from the training distribution. By selectively interpolating examples with similar labels, it mitigates the effects of domain-associated information and yields domain-invariant representations. We evaluate C-Mixup on eleven datasets, ranging from tabular to video data. Compared to the best prior approach, C-Mixup achieves 6.56%, 4.76%, and 5.82% improvements in in-distribution generalization, task generalization, and out-of-distribution robustness, respectively. Code is released at https://github.com/huaxiuyao/C-Mixup.
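The label-similarity sampling step can be sketched in a few lines, assuming a Gaussian kernel on label distance (the bandwidth and Beta parameters below are illustrative choices, not the paper's tuned values):

```python
import numpy as np

def c_mixup_pair(X, y, i, bandwidth=1.0, alpha=2.0, rng=None):
    """Sketch of one C-Mixup step: pick a partner for example i with
    probability proportional to a Gaussian kernel on label distance,
    then mixup the features and labels as in vanilla mixup."""
    rng = rng or np.random.default_rng()
    # Gaussian kernel on label distance: closer labels -> higher probability
    d2 = (y - y[i]) ** 2
    p = np.exp(-d2 / (2 * bandwidth ** 2))
    p[i] = 0.0                      # don't mix an example with itself
    p /= p.sum()
    j = rng.choice(len(y), p=p)
    lam = rng.beta(alpha, alpha)    # mixing coefficient, as in vanilla mixup
    x_mix = lam * X[i] + (1 - lam) * X[j]
    y_mix = lam * y[i] + (1 - lam) * y[j]
    return x_mix, y_mix
```

Because partners are drawn from a neighborhood in label space, the interpolated label stays close to plausible regression targets, avoiding the arbitrarily incorrect labels that uniform mixup can produce.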