Arrow Research search

Author name cluster

Yang Bai

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

20 papers
2 author rows

Possible papers

20

JMLR Journal 2026 Journal Article

Classification Under Local Differential Privacy with Model Reversal and Model Averaging

  • Caihong Qin
  • Yang Bai

Local differential privacy has become a central topic in data privacy research, offering strong privacy guarantees by perturbing user data at the source and removing the need for a trusted curator. However, the noise introduced by local differential privacy often significantly reduces data utility. To address this issue, we reinterpret private learning under local differential privacy as a transfer learning problem, where the noisy data serve as the source domain and the unobserved clean data as the target. We propose novel techniques specifically designed for local differential privacy to improve classification performance without compromising privacy: (1) a noised binary feedback-based evaluation mechanism for estimating dataset utility; (2) model reversal, which salvages underperforming classifiers by inverting their decision boundaries; and (3) model averaging, which assigns weights to multiple reversed classifiers based on their estimated utility. We provide theoretical excess risk bounds under local differential privacy and demonstrate how our methods reduce this risk. Empirical results on both simulated and real-world datasets show substantial improvements in classification accuracy.
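
A minimal sketch of the model reversal and utility-weighted averaging ideas under simplifying assumptions: binary labels in {-1, +1}, and a scalar utility estimate per classifier standing in for the paper's noised binary feedback mechanism. All names are hypothetical.

```python
import numpy as np

def reverse_if_useful(clf_predict, X, utility):
    """Model reversal: a classifier whose estimated utility (an accuracy
    proxy) is below chance is salvaged by flipping its binary decisions."""
    preds = clf_predict(X)            # labels in {-1, +1}
    if utility < 0.5:                 # worse than random -> invert the boundary
        return -preds, 1.0 - utility
    return preds, utility

def averaged_prediction(classifiers, utilities, X):
    """Model averaging: weight each (possibly reversed) classifier by its
    estimated utility and take the sign of the weighted vote."""
    votes = np.zeros(len(X))
    for clf_predict, u in zip(classifiers, utilities):
        preds, u_eff = reverse_if_useful(clf_predict, X, u)
        votes += (u_eff - 0.5) * preds    # margin above chance as the weight
    return np.sign(votes)
```

After reversal every component votes no worse than chance, so even otherwise useless classifiers can only help the weighted vote.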

AAAI Conference 2026 Conference Paper

Note2Chat: Improving LLMs for Multi-Turn Clinical History Taking Using Medical Notes

  • Yang Zhou
  • Zhenting Sheng
  • Mingrui Tan
  • Yuting Song
  • Jun Zhou
  • Yu Heng Kwan
  • Lian Leng Low
  • Yang Bai

Effective clinical history taking is a foundational yet underexplored component of clinical reasoning. While large language models (LLMs) have shown promise on static benchmarks, they often fall short in dynamic, multi-turn diagnostic settings that require iterative questioning and hypothesis refinement. To address this gap, we propose Note2Chat, a note-driven framework that trains LLMs to conduct structured history taking and diagnosis by learning from widely available medical notes. Instead of relying on scarce and sensitive dialogue data, we convert real-world medical notes into high-quality doctor-patient dialogues using a decision tree-guided generation and refinement pipeline. We then propose a three-stage fine-tuning strategy combining supervised learning, simulated data augmentation, and preference learning. Furthermore, we propose a novel single-turn reasoning paradigm that reframes history taking as a sequence of single-turn reasoning problems. This design enhances interpretability and enables local supervision, dynamic adaptation, and greater sample efficiency. Experimental results show that our method substantially improves clinical reasoning, achieving gains of +16.9 F1 and +21.0 Top-1 diagnostic accuracy over GPT-4o.
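
A toy sketch of the single-turn reasoning reframing described above, assuming a simple (speaker, utterance) dialogue format; the helper and the example dialogue are hypothetical:

```python
def dialogue_to_single_turn(turns):
    """Reframe multi-turn history taking as independent single-turn
    problems: each example maps the conversation so far to the doctor's
    next action (a follow-up question or a final diagnosis)."""
    examples = []
    context = []
    for speaker, utterance in turns:
        if speaker == "doctor":
            examples.append({"input": "\n".join(context),
                             "target": utterance})
        context.append(f"{speaker}: {utterance}")
    return examples

# Hypothetical toy dialogue: yields 2 single-turn training examples.
turns = [("patient", "I have had chest pain for two days."),
         ("doctor", "Does the pain worsen with exertion?"),
         ("patient", "Yes, when I climb stairs."),
         ("doctor", "Likely diagnosis: stable angina; recommend ECG.")]
print(len(dialogue_to_single_turn(turns)))
```

This decomposition is what enables local supervision: each doctor turn becomes its own training target rather than one long trajectory.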

ICLR Conference 2025 Conference Paper

Image-level Memorization Detection via Inversion-based Inference Perturbation

  • Yue Jiang
  • Haokun Lin
  • Yang Bai
  • Bo Peng 0002
  • Zhili Liu
  • Yueming Lyu
  • Yong Yang
  • Xing Zheng

Recent studies have discovered that widely used text-to-image diffusion models can replicate training samples during image generation, a phenomenon known as memorization. Existing detection methods primarily focus on identifying memorized prompts. However, in real-world scenarios, image owners may need to verify whether their proprietary or personal images have been memorized by the model, even in the absence of paired prompts or related metadata. We refer to this challenge as image-level memorization detection, where current methods relying on original prompts fall short. In this work, we uncover two characteristics of memorized images after perturbing the inference procedure: lower similarity to the original images and larger magnitudes of TCNP. Building on these insights, we propose Inversion-based Inference Perturbation (IIP), a new framework for image-level memorization detection. Our approach uses unconditional DDIM inversion to derive latent codes that contain the core semantic information of the original images and optimizes random prompt embeddings to introduce effective perturbation. Memorized images exhibit distinct characteristics within the proposed pipeline, providing a robust basis for detection. To support this task, we construct a comprehensive setup for image-level memorization detection, carefully curating datasets to simulate realistic memorization scenarios. Using this setup, we evaluate our IIP framework across three different memorization settings, demonstrating its state-of-the-art performance in identifying memorized images, even in the presence of data augmentation attacks.
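
A schematic scoring rule consistent with the two characteristics above; the weighting and its combination are assumptions for illustration, not the paper's exact detector:

```python
import numpy as np

def iip_score(sim_to_original, noise_pred_magnitudes, w=0.5):
    """Memorized images show LOWER similarity to the original under
    perturbed inference and LARGER noise-prediction magnitudes, so both
    signals push the score up (hypothetical linear combination)."""
    return (w * (1.0 - sim_to_original)
            + (1.0 - w) * float(np.mean(noise_pred_magnitudes)))

def is_memorized(score, threshold):
    """Flag an image once its combined score exceeds a tuned threshold."""
    return score > threshold
```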

IROS Conference 2025 Conference Paper

RoboEnvision: A Long-Horizon Video Generation Model for Multi-Task Robot Manipulation

  • Liudi Yang
  • Yang Bai
  • George Eskandar
  • Fengyi Shen
  • Mohammad Altillawi
  • Dong Chen
  • Soumajit Majumder
  • Ziyuan Liu

We address the problem of generating long-horizon videos for robotic manipulation tasks. Text-to-video diffusion models have made significant progress in photorealism, language understanding, and motion generation but struggle with long-horizon robotic tasks. Recent works use video diffusion models for high-quality simulation data and predictive rollouts in robot planning. However, these works predict short sequences of the robot achieving one task and employ an autoregressive paradigm to extend to the long horizon, leading to error accumulation in both the generated video and the execution. To overcome these limitations, we propose a novel pipeline that bypasses the need for autoregressive generation. We achieve this through a threefold contribution: 1) we first decompose the high-level goals into smaller atomic tasks and generate keyframes aligned with these instructions. A second diffusion model then interpolates between each pair of consecutive keyframes, yielding the long-horizon video. 2) We propose a semantics-preserving attention module to maintain consistency between the keyframes. 3) We design a lightweight policy model to regress the robot joint states from generated videos. Our approach achieves state-of-the-art results on two benchmarks in video quality and consistency while outperforming previous policy models on long-horizon tasks.
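
A skeleton of the non-autoregressive pipeline as described, with every model stubbed as a hypothetical callable:

```python
def generate_long_horizon_video(goal, decompose, keyframe_model,
                                interp_model, policy):
    """Schematic RoboEnvision-style pipeline (all callables are stand-ins):
    decompose the goal into atomic tasks, generate one keyframe per
    instruction, interpolate between consecutive keyframes instead of
    rolling out autoregressively, then regress joint states per frame."""
    tasks = decompose(goal)                      # high-level goal -> atomic instructions
    keyframes = [keyframe_model(t) for t in tasks]
    video = []
    for a, b in zip(keyframes, keyframes[1:]):   # fill between each keyframe pair
        video.extend(interp_model(a, b))
    joint_states = [policy(frame) for frame in video]
    return video, joint_states
```

Because every in-between segment is anchored by two generated keyframes, errors cannot compound across the horizon the way they do in frame-by-frame autoregression.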

IROS Conference 2025 Conference Paper

RoboSwap: A GAN-driven Video Diffusion Framework For Unsupervised Robot Arm Swapping

  • Yang Bai
  • Liudi Yang
  • George Eskandar
  • Fengyi Shen
  • Dong Chen
  • Mohammad Altillawi
  • Ziyuan Liu
  • Gitta Kutyniok

Recent advancements in generative models have revolutionized video synthesis and editing. However, the scarcity of diverse, high-quality datasets continues to hinder video-conditioned robotic learning, limiting cross-platform generalization. In this work, we address the challenge of swapping a robotic arm in one video with another, a key step for cross-embodiment learning. Unlike previous methods that depend on paired video demonstrations in the same environmental settings, our proposed framework, RoboSwap, operates on unpaired data from diverse environments, alleviating data collection needs. RoboSwap introduces a novel video editing pipeline integrating both GANs and diffusion models, combining their respective advantages. Specifically, we segment robotic arms from their backgrounds and train an unpaired GAN model to translate one robotic arm to another. The translated arm is blended with the original video background and refined with a diffusion model to enhance coherence, motion realism, and object interaction. The GAN and diffusion stages are trained independently. Our experiments demonstrate that RoboSwap outperforms state-of-the-art video and image editing models on three benchmarks in terms of both structural coherence and motion consistency, thereby offering a robust solution for generating reliable, cross-embodiment data in robotic learning.
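
The stage ordering described above, as a hedged sketch with hypothetical callables:

```python
def roboswap_pipeline(video, segment, gan_translate, blend, diffusion_refine):
    """Schematic stage order (all callables are stand-ins): segment the arm
    from each frame, translate it with the unpaired GAN, blend it back onto
    the original background, then refine the whole clip with the diffusion
    model for coherence and motion realism."""
    frames_out = []
    for frame in video:
        arm, background = segment(frame)         # isolate the robot arm
        swapped = gan_translate(arm)             # unpaired arm-to-arm translation
        frames_out.append(blend(swapped, background))
    return diffusion_refine(frames_out)          # independently trained refiner
```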

AAAI Conference 2025 Conference Paper

VQA4CIR: Boosting Composed Image Retrieval with Visual Question Answering

  • Chun-Mei Feng
  • Yang Bai
  • Tao Luo
  • Zhen Li
  • Salman Khan
  • Wangmeng Zuo
  • Rick Siow Mong Goh
  • Yong Liu

Although progress has been made in Composed Image Retrieval (CIR), we empirically find that a certain percentage of failed retrieval results are inconsistent with their relative captions. To address this issue, this work provides a Visual Question Answering (VQA) perspective to boost the performance of CIR. The resulting VQA4CIR is a post-processing approach and can be directly plugged into existing CIR methods. Given the top-C retrieved images from a CIR method, VQA4CIR aims to mitigate the adverse effect of retrieved images that are inconsistent with the relative caption. To find such images, we resort to a "QA generation → VQA" self-verification pipeline. For QA generation, we fine-tune an LLM (e.g., LLaMA) to generate several pairs of questions and answers from each relative caption. We then fine-tune an LVLM (e.g., LLaVA) to obtain the VQA model. By feeding a retrieved image and a question to the VQA model, one can identify images inconsistent with the relative caption whenever the VQA answer disagrees with the answer in the QA pair. Consequently, CIR performance can be boosted by demoting the ranks of inconsistently retrieved images. Experimental results show that our proposed method outperforms state-of-the-art CIR methods on the CIRR and Fashion-IQ datasets.
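
A sketch of the self-verification reranking step, assuming exact string match between VQA answers and generated QA answers (the paper's actual consistency check may be more permissive):

```python
def rerank_with_vqa(retrieved, qa_pairs, vqa_model):
    """Self-verification reranking: demote retrieved images whose VQA
    answers contradict the answers derived from the relative caption.
    `vqa_model(image, question)` is a hypothetical callable returning text."""
    consistent, inconsistent = [], []
    for image in retrieved:
        ok = all(vqa_model(image, q).strip().lower() == a.strip().lower()
                 for q, a in qa_pairs)
        (consistent if ok else inconsistent).append(image)
    return consistent + inconsistent   # inconsistent images move to the back
```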

AAAI Conference 2024 Conference Paper

An Empirical Study of CLIP for Text-Based Person Search

  • Min Cao
  • Yang Bai
  • Ziyin Zeng
  • Mang Ye
  • Min Zhang

Text-based Person Search (TBPS) aims to retrieve person images using natural language descriptions. Recently, Contrastive Language-Image Pretraining (CLIP), a universal large cross-modal vision-language pre-training model, has performed remarkably across various cross-modal downstream tasks due to its powerful cross-modal semantic learning capacity. TBPS, as a fine-grained cross-modal retrieval task, has likewise seen a surge of CLIP-based research. To explore the potential of vision-language pre-training models for downstream TBPS tasks, this paper makes the first attempt to conduct a comprehensive empirical study of CLIP for TBPS, contributing a straightforward, incremental, yet strong TBPS-CLIP baseline to the TBPS community. We revisit critical design considerations under CLIP, including data augmentation and loss function. The model, with the aforementioned designs and practical training tricks, can attain satisfactory performance without any sophisticated modules. We also conduct probing experiments on TBPS-CLIP in model generalization and model compression, demonstrating the effectiveness of TBPS-CLIP from various aspects. This work is expected to provide empirical insights and highlight future CLIP-based TBPS research.

AAAI Conference 2024 Conference Paper

What Makes Good Collaborative Views? Contrastive Mutual Information Maximization for Multi-Agent Perception

  • Wanfang Su
  • Lixing Chen
  • Yang Bai
  • Xi Lin
  • Gaolei Li
  • Zhe Qu
  • Pan Zhou

Multi-agent perception (MAP) allows autonomous systems to understand complex environments by interpreting data from multiple sources. This paper investigates intermediate collaboration for MAP with a specific focus on exploring "good" properties of collaborative view (i.e., post-collaboration feature) and its underlying relationship to individual views (i.e., pre-collaboration features), which were treated as an opaque procedure by most existing works. We propose a novel framework named CMiMC (Contrastive Mutual Information Maximization for Collaborative Perception) for intermediate collaboration. The core philosophy of CMiMC is to preserve discriminative information of individual views in the collaborative view by maximizing mutual information between pre- and post-collaboration features while enhancing the efficacy of collaborative views by minimizing the loss function of downstream tasks. In particular, we define multi-view mutual information (MVMI) for intermediate collaboration that evaluates correlations between collaborative views and individual views on both global and local scales. We establish CMiMNet based on multi-view contrastive learning to realize estimation and maximization of MVMI, which assists the training of a collaborative encoder for voxel-level feature fusion. We evaluate CMiMC on V2X-Sim 1.0, and it improves the SOTA average precision by 3.08% and 4.44% at 0.5 and 0.7 IoU (Intersection-over-Union) thresholds, respectively. In addition, CMiMC can reduce communication volume to 1/32 while achieving performance comparable to SOTA. Code and Appendix are released at https://github.com/77SWF/CMiMC.
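
A minimal InfoNCE-style objective between matched pre- and post-collaboration features, in the spirit of MVMI maximization; this toy version ignores the paper's global/local split and exact estimator:

```python
import torch
import torch.nn.functional as F

def infonce_collaboration_objective(pre_feats, post_feats, temperature=0.1):
    """Contrastive (InfoNCE) objective between matched individual
    (pre-collaboration) and collaborative (post-collaboration) features;
    maximizing it (i.e., minimizing the cross-entropy) tightens a lower
    bound on their mutual information, preserving per-agent information.
    pre_feats, post_feats: (N, D) matched pairs."""
    pre = F.normalize(pre_feats, dim=1)
    post = F.normalize(post_feats, dim=1)
    logits = pre @ post.t() / temperature      # (N, N) similarity matrix
    labels = torch.arange(len(pre))            # positives on the diagonal
    return -F.cross_entropy(logits, labels)    # maximize this quantity
```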

AAAI Conference 2023 Conference Paper

AutoGraph: Optimizing DNN Computation Graph for Parallel GPU Kernel Execution

  • Yuxuan Zhao
  • Qi Sun
  • Zhuolun He
  • Yang Bai
  • Bei Yu

Deep learning frameworks optimize the computation graphs and intra-operator computations to boost the inference performance on GPUs, while inter-operator parallelism is usually ignored. In this paper, a unified framework, AutoGraph, is proposed to obtain highly optimized computation graphs in favor of parallel executions of GPU kernels. A novel dynamic programming algorithm, combined with backtracking search, is adopted to explore the optimal graph optimization solution, with the fast performance estimation from the mixed critical path cost. Accurate runtime information based on GPU Multi-Stream launched with CUDA Graph is utilized to determine the convergence of the optimization. Experimental results demonstrate that our method achieves up to 3.47x speedup over existing graph optimization methods. Moreover, AutoGraph outperforms state-of-the-art parallel kernel launch frameworks by up to 1.26x.
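
A schematic of the memoized search over graph-rewrite states; the helpers are hypothetical, and it assumes the rewrites form a finite state space:

```python
def optimize_graph(initial_state, neighbors, critical_path_cost):
    """Memoized backtracking search over graph-rewrite states.
    `initial_state` is a hashable graph encoding, `neighbors(state)` yields
    states reachable by one rewrite, and `critical_path_cost` is the fast
    mixed-critical-path estimate used to score candidates (all stand-ins
    for AutoGraph's actual components)."""
    best = {}

    def search(state):
        if state in best:
            return best[state]
        best[state] = critical_path_cost(state)   # cost of stopping here
        for nxt in neighbors(state):              # backtrack over rewrites
            best[state] = min(best[state], search(nxt))
        return best[state]

    return search(initial_state)
```

In the paper, the fast analytical estimate is then cross-checked against accurate Multi-Stream/CUDA Graph runtime measurements to decide convergence.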

NeurIPS Conference 2023 Conference Paper

Implicit Regularization in Over-Parameterized Support Vector Machine

  • Yang Sui
  • Xin He
  • Yang Bai

In this paper, we design a regularization-free algorithm for high-dimensional support vector machines (SVMs) by integrating over-parameterization with Nesterov's smoothing method, and provide theoretical guarantees for the induced implicit regularization phenomenon. In particular, we construct an over-parameterized hinge loss function and estimate the true parameters by leveraging regularization-free gradient descent on this loss function. The utilization of Nesterov's method enhances the computational efficiency of our algorithm, especially in terms of determining the stopping criterion and reducing computational complexity. With appropriate choices of initialization, step size, and smoothness parameter, we demonstrate that unregularized gradient descent achieves a near-oracle statistical convergence rate. Additionally, we verify our theoretical findings through a variety of numerical experiments and compare the proposed method with explicit regularization. Our results illustrate the advantages of employing implicit regularization via gradient descent in conjunction with over-parameterization in sparse SVMs.
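
A sketch of regularization-free gradient descent on a Nesterov-smoothed (Huberized) hinge loss under a Hadamard over-parameterization beta = u * v; the exact parameterization, initialization, and schedule in the paper may differ:

```python
import numpy as np

def smoothed_hinge_grad(z, gamma):
    """Derivative w.r.t. the margin z of the Nesterov-smoothed hinge:
    linear for z <= 1-gamma, quadratic blend on (1-gamma, 1), zero after."""
    return np.where(z <= 1 - gamma, -1.0,
                    np.where(z < 1.0, -(1.0 - z) / gamma, 0.0))

def overparam_svm(X, y, gamma=0.1, lr=0.01, init=1e-3, steps=2000):
    """Unregularized GD on the over-parameterized smoothed hinge loss.
    Small initialization plus early stopping supplies the implicit
    sparsity regularization; y must be in {-1, +1}."""
    n, p = X.shape
    u = np.full(p, init)
    v = np.full(p, init)
    for _ in range(steps):
        beta = u * v
        z = y * (X @ beta)                   # margins
        gl = smoothed_hinge_grad(z, gamma)   # (n,)
        grad_beta = X.T @ (gl * y) / n       # d loss / d beta
        u, v = u - lr * grad_beta * v, v - lr * grad_beta * u
    return u * v
```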

IJCAI Conference 2023 Conference Paper

RaSa: Relation and Sensitivity Aware Representation Learning for Text-based Person Search

  • Yang Bai
  • Min Cao
  • Daming Gao
  • Ziqiang Cao
  • Chen Chen
  • Zhenfeng Fan
  • Liqiang Nie
  • Min Zhang

Text-based person search aims to retrieve the specified person images given a textual description. The key to tackling such a challenging task is to learn powerful multi-modal representations. Towards this, we propose a Relation and Sensitivity aware representation learning method (RaSa), including two novel tasks: Relation-Aware learning (RA) and Sensitivity-Aware learning (SA). For one thing, existing methods cluster representations of all positive pairs without distinction and overlook the noise problem caused by weak positive pairs, where the text and the paired image have noisy correspondences, thus leading to overfitting. RA offsets the overfitting risk by introducing a novel positive relation detection task (i.e., learning to distinguish strong and weak positive pairs). For another, learning representations that are invariant under data augmentation (i.e., insensitive to some transformations) is a common practice for improving robustness in existing methods. Beyond that, we encourage the representation to perceive sensitive transformations via SA (i.e., learning to detect replaced words), thus promoting its robustness. Experiments demonstrate that RaSa outperforms existing state-of-the-art methods by 6.94%, 4.45% and 15.35% in terms of Rank@1 on the CUHK-PEDES, ICFG-PEDES and RSTPReid datasets, respectively. Code is available at: https://github.com/Flame-Chasers/RaSa.
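
Hedged sketches of the two auxiliary objectives, with the heads and inputs as hypothetical stand-ins:

```python
import torch
import torch.nn.functional as F

def relation_aware_loss(joint_emb, relation_head, is_strong):
    """RA sketch: a binary head learns to distinguish strong positives
    (original image-text pairs) from weak ones (e.g., the text paired with
    a different image of the same person)."""
    logits = relation_head(joint_emb)            # (N, 2)
    return F.cross_entropy(logits, is_strong.long())

def sensitivity_aware_loss(token_emb, detect_head, replaced_mask):
    """SA sketch: a token-level head detects which words were replaced,
    making the representation sensitive to this transformation."""
    logits = detect_head(token_emb).squeeze(-1)  # (N, L)
    return F.binary_cross_entropy_with_logits(logits, replaced_mask.float())
```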

NeurIPS Conference 2022 Conference Paper

Untargeted Backdoor Watermark: Towards Harmless and Stealthy Dataset Copyright Protection

  • Yiming Li
  • Yang Bai
  • Yong Jiang
  • Yong Yang
  • Shu-Tao Xia
  • Bo Li

Deep neural networks (DNNs) have demonstrated their superiority in practice. Arguably, the rapid development of DNNs has largely benefited from high-quality (open-sourced) datasets, based on which researchers and developers can easily evaluate and improve their learning methods. Since data collection is usually time-consuming and expensive, how to protect dataset copyrights is of great significance and worth further exploration. In this paper, we revisit dataset ownership verification. We find that existing verification methods introduce new security risks in DNNs trained on the protected dataset, due to the targeted nature of poison-only backdoor watermarks. To alleviate this problem, we explore an untargeted backdoor watermarking scheme, where the abnormal model behaviors are not deterministic. Specifically, we introduce two dispersibilities and prove their correlation, based on which we design the untargeted backdoor watermark under both poisoned-label and clean-label settings. We also discuss how to use the proposed untargeted backdoor watermark for dataset ownership verification. Experiments on benchmark datasets verify the effectiveness of our methods and their resistance to existing backdoor defenses.
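
A simplified poisoned-label sketch of the untargeted idea: triggered samples receive a random wrong label instead of one fixed target class, so the watermark's abnormal behavior is non-deterministic. The paper's actual construction is derived from its dispersibility analysis; this is only an illustration.

```python
import random

def untargeted_poison(dataset, add_trigger, num_classes, rate=0.1, seed=0):
    """Poison a fraction `rate` of (x, y) samples: stamp the trigger and
    assign a RANDOM incorrect label, avoiding the single fixed target that
    makes classic backdoor watermarks exploitable."""
    rng = random.Random(seed)
    poisoned = []
    for x, y in dataset:
        if rng.random() < rate:
            wrong = rng.choice([c for c in range(num_classes) if c != y])
            poisoned.append((add_trigger(x), wrong))
        else:
            poisoned.append((x, y))
    return poisoned
```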

NeurIPS Conference 2021 Conference Paper

Clustering Effect of Adversarial Robust Models

  • Yang Bai
  • Xin Yan
  • Yong Jiang
  • Shu-Tao Xia
  • Yisen Wang

Adversarial robustness has received increasing attention along with the study of adversarial examples. So far, existing works show that robust models not only obtain robustness against various adversarial attacks but also boost performance in some downstream tasks. However, the underlying mechanism of adversarial robustness is still not clear. In this paper, we interpret adversarial robustness from the perspective of linear components, and find that there exist some statistical properties common to robust models. Specifically, robust models show an obvious hierarchical clustering effect on their linearized sub-networks, obtained by removing or replacing all non-linear components (e.g., batch normalization, max pooling, or activation layers). Based on these observations, we propose a novel understanding of adversarial robustness and apply it to more tasks, including domain adaptation and robustness boosting. Experimental evaluations demonstrate the rationality and superiority of our proposed clustering strategy. Our code is available at https://github.com/bymavis/Adv_Weight_NeurIPS2021.
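
One plausible way to operationalize the clustering analysis, as an illustration rather than the paper's exact procedure: hierarchically cluster the per-class weight vectors of the linearized network by cosine distance.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def class_weight_clustering(linearized_weights, num_clusters):
    """After removing the non-linear components, the end-to-end map is
    (approximately) one matrix whose rows are per-class weight vectors;
    average-linkage clustering on their cosine distances exposes the
    hierarchical structure robust models are observed to exhibit."""
    W = linearized_weights / np.linalg.norm(linearized_weights, axis=1,
                                            keepdims=True)
    dist = 1.0 - W @ W.T                          # pairwise cosine distance
    condensed = dist[np.triu_indices(len(W), 1)]  # scipy's condensed form
    Z = linkage(condensed, method="average")
    return fcluster(Z, t=num_clusters, criterion="maxclust")
```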

AAAI Conference 2021 Short Paper

Global Fusion Attention for Vision and Language Understanding (Student Abstract)

  • Zixin Guo
  • Chen Liang
  • Ziyu Wan
  • Yang Bai

We extend the popular transformer architecture to a multimodal model, processing both visual and textual inputs. We propose a new attention mechanism on a Transformer-based architecture for joint vision and language understanding tasks. Our model fuses multi-level comprehension between images and texts in a weighted manner, which better captures their internal relationships. Experiments on the benchmark VQA dataset CLEVR demonstrate the effectiveness of the proposed attention mechanism. We also observe improvements in the sample efficiency of reinforcement learning in experiments on grounded language understanding tasks on the BabyAI platform.

YNIMG Journal 2021 Journal Article

Spontaneous transient brain states in EEG source space in disorders of consciousness

  • Yang Bai
  • Jianghong He
  • Xiaoyu Xia
  • Yong Wang
  • Yi Yang
  • Haibo Di
  • Xiaoli Li
  • Ulf Ziemann

Spontaneous transient states were recently identified by functional magnetic resonance imaging and magnetoencephalography in healthy subjects. They organize and coordinate neural activity in brain networks. How spontaneous transient states are altered in abnormal brain conditions is unknown. Here, we conducted a transient state analysis on resting-state electroencephalography (EEG) source space and developed a state transfer analysis for patients with disorders of consciousness (DOC). These analyses uncovered different neural coordination patterns, including spatial power patterns, temporal dynamics, spectral shifts, and connectivity construction varying at potentially very fast (millisecond) time scales, in groups with different consciousness levels: healthy subjects, patients in minimally conscious state (MCS), and patients with vegetative state/unresponsive wakefulness syndrome (VS/UWS). Machine learning based on transient state features revealed high classification accuracy between MCS and VS/UWS. This study developed a methodology for transient state analysis in EEG source space and applied it to abnormal brain conditions. The findings correlate spontaneous transient states with human consciousness and suggest potential roles for transient states in brain disease assessment.

IJCAI Conference 2020 Conference Paper

Infobox-to-text Generation with Tree-like Planning based Attention Network

  • Yang Bai
  • Ziran Li
  • Ning Ding
  • Ying Shen
  • Hai-Tao Zheng

We study the problem of infobox-to-text generation that aims to generate a textual description from a key-value table. Representing the input infobox as a sequence, previous neural methods using end-to-end models without order-planning suffer from the problems of incoherence and inadaptability to disordered input. Recent planning-based models only implement static order-planning to guide the generation, which may cause error propagation between planning and generation. To address these issues, we propose a Tree-like PLanning based Attention Network (Tree-PLAN) which leverages both static order-planning and dynamic tuning to guide the generation. A novel tree-like tuning encoder is designed to dynamically tune the static order-plan for better planning by merging the most relevant attributes together layer by layer. Experiments conducted on two datasets show that our model outperforms previous methods on both automatic and human evaluation, and demonstrate that our model has better adaptability to disordered input.
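
A toy sketch of the layer-by-layer merging idea behind the tree-like tuning encoder; the relevance scorer and pair-encoding function are hypothetical stand-ins:

```python
def tree_like_tuning(attributes, relevance, encode_pair):
    """Repeatedly merge the two most relevant adjacent attribute nodes
    into one encoded node, layer by layer, until a single root encodes
    the dynamically tuned order-plan."""
    nodes = list(attributes)
    while len(nodes) > 1:
        i = max(range(len(nodes) - 1),
                key=lambda k: relevance(nodes[k], nodes[k + 1]))
        merged = encode_pair(nodes[i], nodes[i + 1])
        nodes[i:i + 2] = [merged]           # replace the pair with its parent
    return nodes[0]                         # root of the tuning tree
```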

YNICL Journal 2017 Journal Article

TDCS modulates cortical excitability in patients with disorders of consciousness

  • Yang Bai
  • Xiaoyu Xia
  • Jiannan Kang
  • Yi Yang
  • Jianghong He
  • Xiaoli Li

Transcranial direct current stimulation (tDCS) has been reported to be a promising technique for consciousness improvement for patients with disorders of consciousness (DOC). However, there has been no direct electrophysiological evidence to demonstrate the efficacy of tDCS on patients with DOC. Therefore, we aim to measure the cortical excitability changes induced by tDCS in patients with DOC, to find electrophysiological evidence supporting the therapeutic efficacy of tDCS on patients with DOC. In this study, we enrolled sixteen patients with DOC, including nine vegetative state (VS) and seven minimally conscious state (MCS) (six females and ten males). TMS-EEG was applied to assess cortical excitability changes after twenty minutes of anodal tDCS of the left dorsolateral prefrontal cortex. Global cerebral excitability were calculated to quantify cortical excitability in the temporal domain: four time intervals (0-100, 100-200, 200-300, 300-400 ms). Then local cerebral excitability in the significantly altered time windows were investigated (frontal, left/right hemispheres, central, and posterior). Compared to baseline and sham stimulation, we found that global cerebral excitability increased in early time windows (0-100 and 100-200 ms) for patients with MCS; for the patients with VS, global cerebral excitability increased in the 0-100 ms interval but decreased in the 300-400 ms interval. The local cerebral excitability was significantly different between MCS and VS. The results indicated that tDCS can effectively modulate the cortical excitability of patients with DOC; and the changes in excitability in temporal and spatial domains are different between patients with MCS and those with VS.

IROS Conference 2012 Conference Paper

Location and orientation estimation with an electrosense robot

  • Yonatan Silverman
  • James Snyder
  • Yang Bai
  • Malcolm A. MacIver

We have designed an underwater robot that uses perturbations of an emitted electric field to sense, localize, and map its environment. This system is inspired by weakly electric fish, which emit an electric field to sense objects, localize prey, and communicate. When nearby objects distort the electric field, electroreceptors (fish) or voltage sensors (robot) detect these perturbations. Further analysis of the perturbations can reveal information about the associated target, such as size, shape, and distance. One difficulty with extracting distance-to-target for our robotic electrosense platform is that the measurements depend on the orientation of the robot with respect to the object. We solve this problem by applying techniques from range-only SLAM, with modifications for some of the ways in which electrosense differs from the sensors typically used. We use two different Bayesian filters to estimate the orientation and position separately. Using this approach, we show that our electrosense robot can accurately localize and orient itself, and improve its estimates of position and orientation using motion.
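
A minimal range-only measurement update in the spirit of the filters described; this generic particle-filter step is a stand-in, not the paper's exact estimators:

```python
import numpy as np

def particle_filter_range_update(particles, weights, landmark, r_meas, sigma):
    """Range-only SLAM-style update: reweight 2-D position particles by a
    Gaussian likelihood of the measured distance to a known landmark,
    then resample to concentrate on consistent hypotheses."""
    d = np.linalg.norm(particles - landmark, axis=1)   # predicted ranges
    weights = weights * np.exp(-0.5 * ((d - r_meas) / sigma) ** 2)
    weights = weights / weights.sum()
    idx = np.random.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```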

IROS Conference 2012 Conference Paper

Sensing capacitance of underwater objects in bio-inspired electrosense

  • Yang Bai
  • James Snyder
  • Yonatan Silverman
  • Michael A. Peshkin
  • Malcolm A. MacIver

Certain electric fish use a self-generated AC electric field to navigate and hunt. Thousands of sensors on the surface of the fish's body detect the pattern of amplitude and phase distortions of the field caused by nearby objects. Prior research has suggested that phase distortions may be especially useful for recognition of live objects. Here we present the first study of the utility of phase information in a robotic implementation of active electrosense. Using our robotic implementation, we investigated how the phase information depends on the frequency of the emitted field, the conductivity of the surrounding water, and object properties. An analytical model was developed that provides a qualitative explanation of these dependencies. We show that in certain situations phase information enables discrimination between two objects that are otherwise very similar in the amplitude of their electric images. We also show the utility of probing objects with multiple frequencies.

IROS Conference 2012 Conference Paper

Underwater object tracking using electrical impedance tomography

  • James Snyder
  • Yonatan Silverman
  • Yang Bai
  • Malcolm A. MacIver

Few effective technologies exist for sensing in dark or murky underwater situations. For this reason, we have been exploring the use of a novel biologically-inspired approach to non-visual sensing based on the detection of perturbations to a self-generated electric field. This is used by many species of neotropical nocturnal freshwater fish. This approach, termed active electrosense, presents unique challenges for sensing and tracking of nearby objects. We explore two methods for estimating the velocity of objects through active electrosense. The first of these methods uses a simple cross-correlation method, which depends on the uniformity of the electric field. We show some of the ramifications of making this assumption for a self-generated field around a cylindrical pod-shaped sensor in a rectangular tank. We then evaluate the use of methods developed for electrical impedance tomography (EIT) for localization and tracking. This is an unusual application of EIT in that typical applications involve surrounding the volume of interest (such as the thorax of humans) with sensor/emitters. Here, rather than this “outside in” approach, we are using EIT “inside out.” In simulation, we nonetheless find significant improvements in the accuracy of estimated velocity when using the EIT approach. Additionally, we demonstrate how EIT may be used for accurate position estimation. Under the conditions evaluated, the computation time for inversion is low enough to make its use feasible as a primary position and velocity estimator in an online system or as a secondary system to augment a computationally inexpensive estimator.
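
A compact version of the baseline cross-correlation velocity estimate, assuming a 1-D sensor array and the uniform-field approximation the abstract notes:

```python
import numpy as np

def velocity_from_xcorr(signal_t0, signal_t1, sensor_spacing, dt):
    """Estimate object velocity from two consecutive sensor-array
    snapshots: the cross-correlation lag that best aligns them gives the
    displacement in sensor units, converted to distance per time."""
    corr = np.correlate(signal_t1, signal_t0, mode="full")
    lag = np.argmax(corr) - (len(signal_t0) - 1)   # shift in sensor indices
    return lag * sensor_spacing / dt
```

The EIT-based approach replaces this alignment heuristic with a physical inversion of the field, which is why it remains accurate when the uniform-field assumption breaks down.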