Arrow Research search

Author name cluster

Xiao Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

31 papers
2 author rows

Possible papers (31)

AAAI Conference 2026 Conference Paper

EHRStruct: A Comprehensive Benchmark Framework for Evaluating Large Language Models on Structured Electronic Health Record Tasks

  • Xiao Yang
  • Xuejiao Zhao
  • Zhiqi Shen

Structured Electronic Health Record (EHR) data stores patient information in relational tables and plays a central role in clinical decision-making. Recent advances have explored the use of large language models (LLMs) to process such data, showing promise across various clinical tasks. However, the absence of standardized evaluation frameworks and clearly defined tasks makes it difficult to systematically assess and compare LLM performance on structured EHR data. To address these evaluation challenges, we introduce EHRStruct, a benchmark specifically designed to evaluate LLMs on structured EHR tasks. EHRStruct defines 11 representative tasks spanning diverse clinical needs and includes 2,200 task-specific evaluation samples derived from two widely used EHR datasets. We use EHRStruct to evaluate 20 advanced and representative LLMs, covering both general and medical models. We further analyze key factors influencing model performance, including input formats, few-shot generalization, and fine-tuning strategies, and compare results with 11 state-of-the-art LLM-based enhancement methods for structured data reasoning. Our results indicate that many structured EHR tasks place high demands on the understanding and reasoning capabilities of LLMs. In response, we propose SEMaster, a code-augmented method that achieves state-of-the-art performance and offers practical insights to guide future research.

AAAI Conference 2026 Conference Paper

ReflexDiffusion: Reflection-Enhanced Trajectory Planning for High-lateral-acceleration Scenarios in Autonomous Driving

  • Xuemei Yao
  • Xiao Yang
  • Jianbin Sun
  • Liuwei Xie
  • Xuebin Shao
  • Xiyu Fang
  • Hang Su
  • Kewei Yang

Generating safe and reliable trajectories for autonomous vehicles in long-tail scenarios remains a significant challenge, particularly for high-lateral-acceleration maneuvers such as sharp turns that represent critical safety situations. Existing trajectory planners exhibit systematic failures in these scenarios because of data imbalance: vehicle dynamics, road geometry, and environmental constraints in high-risk situations are insufficiently represented in training data, leading to suboptimal or unsafe trajectory predictions when vehicles operate near their physical limits. In this paper, we introduce ReflexDiffusion, a novel inference-stage framework that enhances diffusion-based trajectory planners through reflective adjustment. Our method introduces a gradient-based adjustment mechanism during the iterative denoising process: after each standard trajectory update, we compute the gradient between conditional and unconditional noise predictions to explicitly amplify critical conditioning signals, including road curvature and lateral vehicle dynamics. This amplification enforces strict adherence to physical constraints, particularly improving stability during high-lateral-acceleration maneuvers where precise vehicle-road interaction is paramount. Evaluated on the nuPlan Test14-hard benchmark, ReflexDiffusion achieves a 14.1% improvement in driving score for high-lateral-acceleration scenarios compared to state-of-the-art methods. This demonstrates that inference-time trajectory optimization can effectively compensate for training-data sparsity by dynamically reinforcing safety-critical constraints at the handling limits. The framework's architecture-agnostic design enables direct deployment across existing diffusion-based planners, offering a practical solution for improving autonomous vehicle safety in challenging driving conditions.
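The reflective adjustment the abstract describes (amplifying the gap between conditional and unconditional noise predictions at each denoising step) is reminiscent of classifier-free guidance. A minimal, hypothetical sketch of such an amplification rule, not the paper's actual planner; all names and the toy linear update are assumptions:

```python
import numpy as np

def amplified_noise(eps_cond, eps_uncond, scale=1.5):
    # Push the noise estimate along the (conditional - unconditional)
    # direction to amplify conditioning signals; scale > 1 strengthens
    # adherence to constraints such as road curvature.
    return eps_uncond + scale * (eps_cond - eps_uncond)

def denoise_step(x, eps_cond, eps_uncond, step_size=0.1, scale=1.5):
    # Toy linear update standing in for one diffusion denoising step.
    return x - step_size * amplified_noise(eps_cond, eps_uncond, scale)
```

With scale = 1 this reduces to the standard conditional update; larger scales correspond to the stronger constraint enforcement the abstract reports.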

AAAI Conference 2026 Conference Paper

SkyMoE: A Vision-Language Foundation Model for Enhancing Geospatial Interpretation with Mixture of Experts

  • Jiaqi Liu
  • Ronghao Fu
  • Lang Sun
  • Haoran Liu
  • Xiao Yang
  • Weipeng Zhang
  • Xu Na
  • Zhuoran Duan

The emergence of large vision-language models (VLMs) has significantly enhanced the efficiency and flexibility of geospatial interpretation. However, general-purpose VLMs remain suboptimal for remote sensing (RS) tasks. Existing geospatial VLMs typically adopt a unified modeling strategy and struggle to differentiate between task types and interpretation granularities, limiting their ability to balance local detail perception and global contextual understanding. In this paper, we present SkyMoE, a Mixture-of-Experts (MoE) vision-language model tailored for multimodal, multi-task RS interpretation. SkyMoE employs an adaptive router that generates task- and granularity-aware routing instructions, enabling specialized large language model experts to handle diverse sub-tasks. To further promote expert decoupling and granularity sensitivity, we introduce a context-disentangled augmentation strategy that creates contrastive pairs between local and global features, guiding experts toward level-specific representation learning. We also construct MGRS-Bench, a comprehensive benchmark covering multiple RS interpretation tasks and granularity levels, to evaluate generalization in complex scenarios. Extensive experiments on 21 public datasets demonstrate that SkyMoE achieves state-of-the-art performance across tasks, validating its adaptability, scalability, and superior multi-granularity understanding in remote sensing.

AAAI Conference 2025 Conference Paper

A Generalizable Anomaly Detection Method in Dynamic Graphs

  • Xiao Yang
  • Xuejiao Zhao
  • Zhiqi Shen

Anomaly detection aims to identify deviations from normal patterns within data. This task is particularly crucial in dynamic graphs, which are common in applications like social networks and cybersecurity, due to their evolving structures and complex relationships. Although recent deep learning-based methods have shown promising results in anomaly detection on dynamic graphs, they often lack generalizability. In this study, we propose GeneralDyG, a method that samples temporal ego-graphs and sequentially extracts structural and temporal features to address the three key challenges in achieving generalizability: Data Diversity, Dynamic Feature Capture, and Computational Cost. Extensive experimental results demonstrate that our proposed GeneralDyG significantly outperforms state-of-the-art methods on four real-world datasets.
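As a rough illustration of the temporal ego-graph sampling the abstract mentions, here is a hypothetical sketch; the function name and the `(u, v, timestamp)` edge format are assumptions, not the paper's interface:

```python
from collections import defaultdict

def temporal_ego_graph(edges, center, t, k_hop=2):
    # Build an undirected adjacency from edges that occurred no later
    # than time t, then expand k hops outward from the center node.
    adj = defaultdict(set)
    for u, v, ts in edges:
        if ts <= t:
            adj[u].add(v)
            adj[v].add(u)
    frontier, seen = {center}, {center}
    for _ in range(k_hop):
        frontier = {n for u in frontier for n in adj[u]} - seen
        seen |= frontier
    return seen
```

Structural and temporal features would then be extracted from the sampled node set rather than from the full evolving graph.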

IJCAI Conference 2025 Conference Paper

GraphProt: Certified Black-Box Shielding Against Backdoored Graph Models

  • Xiao Yang
  • Yuni Lai
  • Kai Zhou
  • Gaolei Li
  • Jianhua Li
  • Hang Zhang

Graph learning models have been empirically proven to be vulnerable to backdoor threats, wherein adversaries submit trigger-embedded inputs to manipulate the model predictions. Current graph backdoor defenses manifest several limitations: 1) dependence on model-related details, 2) necessitation of additional fine-tuning, and 3) reliance on extra explainability tools, all of which are infeasible under stringent privacy policies. To address these limitations, we propose GraphProt, a certified black-box defense method to suppress backdoor attacks on GNN-based graph classifiers. GraphProt operates in a model-agnostic manner and leverages only the graph input. Specifically, GraphProt first applies a purpose-designed topology-feature filtration to mitigate graph anomalies. Subsequently, subgraphs are sampled via a strategy that integrates topology and features, followed by robust model inference through a majority-vote-based subgraph prediction ensemble. Our results across benchmark attacks and datasets show that GraphProt effectively reduces attack success rates while preserving regular graph classification accuracy.
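The majority-vote ensemble step can be caricatured as follows; this is a generic sketch of voting over sampled subgraphs, assuming a hypothetical black-box `predict` callable, and it omits GraphProt's filtration and certified guarantees:

```python
import random
from collections import Counter

def subgraph_majority_vote(nodes, predict, n_samples=7, keep=0.7, seed=0):
    # Sample several random node subsets, query the black-box graph
    # classifier on each, and return the most common predicted label.
    rng = random.Random(seed)
    k = max(1, int(keep * len(nodes)))
    votes = [predict(rng.sample(nodes, k)) for _ in range(n_samples)]
    return Counter(votes).most_common(1)[0][0]
```

The intuition is that a trigger confined to a few nodes rarely survives in most sampled subgraphs, so the majority label reflects the clean prediction.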

NeurIPS Conference 2025 Conference Paper

GUARDIAN: Safeguarding LLM Multi-Agent Collaborations with Temporal Graph Modeling

  • Jialong Zhou
  • Lichao Wang
  • Xiao Yang

The emergence of large language models (LLMs) enables the development of intelligent agents capable of engaging in complex and multi-turn dialogues. However, multi-agent collaboration faces critical safety challenges, such as hallucination amplification and error injection and propagation. This paper presents GUARDIAN, a unified method for detecting and mitigating multiple safety concerns in GUARDing Intelligent Agent collaboratioNs. By modeling the multi-agent collaboration process as a discrete-time temporal attributed graph, GUARDIAN explicitly captures the propagation dynamics of hallucinations and errors. The unsupervised encoder-decoder architecture incorporating an incremental training paradigm learns to reconstruct node attributes and graph structures from latent embeddings, enabling the identification of anomalous nodes and edges with unparalleled precision. Moreover, we introduce a graph abstraction mechanism based on the Information Bottleneck Theory, which compresses temporal interaction graphs while preserving essential patterns. Extensive experiments demonstrate GUARDIAN's effectiveness in safeguarding LLM multi-agent collaborations against diverse safety vulnerabilities, achieving state-of-the-art accuracy with efficient resource utilization. The code is available at https://github.com/JialongZhou666/GUARDIAN.

IJCAI Conference 2025 Conference Paper

How to Enable LLM with 3D Capacity? A Survey of Spatial Reasoning in LLM

  • Jirong Zha
  • Yuxuan Fan
  • Xiao Yang
  • Chen Gao
  • Xinlei Chen

3D spatial understanding is essential in real-world applications such as robotics, autonomous vehicles, virtual reality, and medical imaging. Recently, Large Language Models (LLMs), having demonstrated remarkable success across various domains, have been leveraged to enhance 3D understanding tasks, showing potential to surpass traditional computer vision methods. In this survey, we present a comprehensive review of methods integrating LLMs with 3D spatial understanding. We propose a taxonomy that categorizes existing methods into three branches: image-based methods deriving 3D understanding from 2D visual data, point cloud-based methods working directly with 3D representations, and hybrid modality-based methods combining multiple data streams. We systematically review representative methods along these categories, covering data representations, architectural modifications, and training strategies that bridge textual and 3D modalities. Finally, we discuss current limitations, including dataset scarcity and computational challenges, while highlighting promising research directions in spatial perception, multi-modal fusion, and real-world applications.

NeurIPS Conference 2025 Conference Paper

OSTAR: Optimized Statistical Text-classifier with Adversarial Resistance

  • Yuhan Yao
  • Feifei Kou
  • Lei Shi
  • Xiao Yang
  • Zhongbao Zhang
  • Suguo Zhu
  • Jiwei Zhang
  • Lirong Qiu

The advancements in generative models and real-world attacks involving machine-generated text (MGT) create a demand for more robust detection methods. Existing MGT detection methods for adversarial environments primarily consist of manually designed statistical-based methods and fine-tuned classifier-based approaches. Statistical-based methods extract intrinsic features but suffer from rigid decision boundaries vulnerable to adaptive attacks, while fine-tuned classifiers achieve outstanding performance at the cost of overfitting to superficial textual features. We argue that the key to detection in current adversarial environments lies in extracting intrinsic invariant features and ensuring that the classifier possesses dynamic adaptability. Accordingly, we propose OSTAR, a novel MGT detection framework designed for adversarial environments, composed of a statistically enhanced classifier and Multi-Faceted Contrastive Learning (MFCL). On the classifier side, our Multi-Dimensional Statistical Profiling (MDSP) module extracts intrinsic differences between human and machine texts, complementing classifiers with useful, stable features. On the model optimization side, the MFCL strategy enhances robustness by contrasting feature variations before and after text attacks, jointly optimizing the statistical feature mapping and baseline pre-trained models. Experimental results on three public datasets under various adversarial scenarios demonstrate that our framework outperforms existing MGT detection methods, achieving state-of-the-art performance and robustness against attacks. The code is available at https://github.com/BUPT-SN/OSTAR.

NeurIPS Conference 2025 Conference Paper

PhysDrive: A Multimodal Remote Physiological Measurement Dataset for In-vehicle Driver Monitoring

  • Wang Wang
  • Xiao Yang
  • Qingyong Hu
  • Jack Tang
  • Can Liu
  • Dengbo He
  • Yuntao Wang
  • Yingcong Chen

Robust and unobtrusive in-vehicle physiological monitoring is crucial for ensuring driving safety and user experience. While remote physiological measurement (RPM) offers a promising non-invasive solution, its translation to real-world driving scenarios is critically constrained by the scarcity of comprehensive datasets. Existing resources are often limited in scale, modality diversity, the breadth of biometric annotations, and the range of captured conditions, thereby omitting inherent real-world challenges in driving. Here, we present PhysDrive, the first large-scale multimodal dataset for contactless in-vehicle physiological sensing with dedicated consideration of various modality settings and driving factors. PhysDrive collects data from 48 drivers, including synchronized RGB, near-infrared camera, and raw mmWave radar data, accompanied by six synchronized ground truths (ECG, BVP, Respiration, HR, RR, and SpO2). It covers a wide spectrum of naturalistic driving conditions, including driver motions, dynamic natural light, vehicle types, and road conditions. We extensively evaluate both signal‑processing and deep‑learning methods on PhysDrive, establishing a comprehensive benchmark across all modalities, and release full open‑source code with compatibility for mainstream public toolboxes. We envision PhysDrive will serve as a foundational resource and accelerate research on multimodal driver monitoring and smart‑cockpit systems.

NeurIPS Conference 2025 Conference Paper

R&D-Agent-Quant: A Multi-Agent Framework for Data-Centric Factors and Model Joint Optimization

  • Yuante Li
  • Xu Yang
  • Xiao Yang
  • Xisen Wang
  • Weiqing Liu
  • Jiang Bian

Financial markets pose fundamental challenges for asset return prediction due to their high dimensionality, non-stationarity, and persistent volatility. Despite advances in large language models and multi-agent systems, current quantitative research pipelines suffer from limited automation, weak interpretability, and fragmented coordination across key components such as factor mining and model innovation. In this paper, we propose R&D-Agent for Quantitative Finance, in short R&D-Agent(Q), the first data-centric multi-agent framework designed to automate the full-stack research and development of quantitative strategies via coordinated factor-model co-optimization. R&D-Agent(Q) decomposes the quant process into two iterative stages: a Research stage that dynamically sets goal-aligned prompts, formulates hypotheses based on domain priors, and maps them to concrete tasks, and a Development stage that employs a code-generation agent, Co-STEER, to implement task-specific code, which is then executed in real-market backtests. The two stages are connected through a feedback stage that thoroughly evaluates experimental outcomes and informs subsequent iterations, with a multi-armed bandit scheduler for adaptive direction selection. Empirically, R&D-Agent(Q) achieves up to 2× higher annualized returns than classical factor libraries using 70% fewer factors, and outperforms state-of-the-art deep time-series models on real markets. Its joint factor–model optimization delivers a strong balance between predictive accuracy and strategy robustness. Our code is available at: https://github.com/microsoft/RD-Agent.

NeurIPS Conference 2025 Conference Paper

WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world scenarios

  • Eun Chang
  • Zhuangqun Huang
  • Yiwei Liao
  • Sagar Bhavsar
  • Amogh Param
  • Tammy Stark
  • Adel Ahmadyan
  • Xiao Yang

We introduce WearVQA, the first benchmark specifically designed to evaluate the visual question answering (VQA) capabilities of multi-modal AI assistants on wearable devices like smart glasses. Unlike prior benchmarks that focus on high-quality, third-person imagery, WearVQA reflects the unique challenges of ego-centric interaction, where visual inputs may be occluded, poorly lit, unzoomed, or blurry, and questions are grounded in realistic wearable use cases. The benchmark comprises 2,500 carefully curated image-question-answer triplets, spanning 7 diverse image domains including both text-centric and general scenes, 10 cognitive task types ranging from basic recognition to various forms of reasoning, and 6 common wearables-specific image quality issues. All questions are designed to be answerable using only the visual input and common sense. WearVQA is paired with a rigorous LLM-as-a-judge evaluation framework with 96% labeling accuracy. Open-source and proprietary multi-modal LLMs achieved a QA accuracy as low as 24–52% on WearVQA, with substantial drops on lower-quality images and reasoning-heavy tasks. These observations position WearVQA as a comprehensive and challenging benchmark for guiding technical advancement towards robust, real-world multi-modal wearables AI systems.

NeurIPS Conference 2024 Conference Paper

BPQP: A Differentiable Convex Optimization Framework for Efficient End-to-End Learning

  • Jianming Pan
  • Zeqi Ye
  • Xiao Yang
  • Xu Yang
  • Weiqing Liu
  • Lewen Wang
  • Jiang Bian

Data-driven decision-making processes increasingly utilize end-to-end learnable deep neural networks to render final decisions. Sometimes, the output of the forward functions in certain layers is determined by the solutions to mathematical optimization problems, leading to the emergence of differentiable optimization layers that permit gradient back-propagation. However, real-world scenarios often involve large-scale datasets and numerous constraints, presenting significant challenges. Current methods for differentiating optimization problems typically rely on implicit differentiation, which necessitates costly computations on the Jacobian matrices, resulting in low efficiency. In this paper, we introduce BPQP, a differentiable convex optimization framework designed for efficient end-to-end learning. To enhance efficiency, we reformulate the backward pass as a simplified and decoupled quadratic programming problem by leveraging the structural properties of the Karush–Kuhn–Tucker (KKT) matrix. This reformulation enables the use of first-order optimization algorithms in calculating the backward pass gradients, allowing our framework to potentially utilize any state-of-the-art solver. As solver technologies evolve, BPQP can continuously adapt and improve its efficiency. Extensive experiments on both simulated and real-world datasets demonstrate that BPQP achieves a significant improvement in efficiency—typically an order of magnitude faster in overall execution time compared to other differentiable optimization layers. Our results not only highlight the efficiency gains of BPQP but also underscore its superiority over differentiable optimization layer baselines.

NeurIPS Conference 2024 Conference Paper

CRAG - Comprehensive RAG Benchmark

  • Xiao Yang
  • Kai Sun
  • Hao Xin
  • Yushi Sun
  • Nikita Bhalla
  • Xiangsen Chen
  • Sajal Choudhary
  • Rongze D. Gui

Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution to alleviate Large Language Models' (LLMs) lack of knowledge. Existing RAG datasets, however, do not adequately represent the diverse and dynamic nature of real-world Question Answering (QA) tasks. To bridge this gap, we introduce the Comprehensive RAG Benchmark (CRAG), a factual question answering benchmark of 4,409 question-answer pairs and mock APIs to simulate web and Knowledge Graph (KG) search. CRAG is designed to encapsulate a diverse array of questions across five domains and eight question categories, reflecting varied entity popularity from popular to long-tail, and temporal dynamism ranging from years to seconds. Our evaluation on this benchmark highlights the gap to fully trustworthy QA. Whereas most advanced LLMs achieve $\le 34\%$ accuracy on CRAG, adding RAG in a straightforward manner improves the accuracy only to 44%. State-of-the-art industry RAG solutions answer only 63% of questions without any hallucination. CRAG also reveals much lower accuracy in answering questions regarding facts with higher dynamism, lower popularity, or higher complexity, suggesting future research directions. The CRAG benchmark laid the groundwork for a KDD Cup 2024 challenge, attracting thousands of participants and submissions. We commit to maintaining CRAG to serve research communities in advancing RAG solutions and general QA solutions. CRAG is available at https://github.com/facebookresearch/CRAG/.

NeurIPS Conference 2024 Conference Paper

Diffusion Models are Certifiably Robust Classifiers

  • Huanran Chen
  • Yinpeng Dong
  • Shitong Shao
  • Zhongkai Hao
  • Xiao Yang
  • Hang Su
  • Jun Zhu

Generative learning, recognized for its effective modeling of data distributions, offers inherent advantages in handling out-of-distribution instances, especially for enhancing robustness to adversarial attacks. Among these, diffusion classifiers, utilizing powerful diffusion models, have demonstrated superior empirical robustness. However, a comprehensive theoretical understanding of their robustness is still lacking, raising concerns about their vulnerability to stronger future attacks. In this study, we prove that diffusion classifiers possess $O(1)$ Lipschitzness, and establish their certified robustness, demonstrating their inherent resilience. To achieve non-constant Lipschitzness, thereby obtaining much tighter certified robustness, we generalize diffusion classifiers to classify Gaussian-corrupted data. This involves deriving the evidence lower bounds (ELBOs) for these distributions, approximating the likelihood using the ELBO, and calculating classification probabilities via Bayes' theorem. Experimental results show the superior certified robustness of these Noised Diffusion Classifiers (NDCs). Notably, we achieve over 80\% and 70\% certified robustness on CIFAR-10 under adversarial perturbations with \(\ell_2\) norms less than 0.25 and 0.5, respectively, using a single off-the-shelf diffusion model without any additional data.
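The final classification step the abstract outlines (approximate the class-conditional log-likelihood log p(x|y) with a per-class ELBO, then apply Bayes' theorem) can be sketched generically as follows; this is an illustrative softmax-over-ELBOs, not the authors' implementation:

```python
import numpy as np

def classify_via_elbo(elbo_per_class, log_prior=None):
    # Treat each class's ELBO as a surrogate for log p(x|y), add the
    # log prior, and normalize with a softmax to obtain p(y|x).
    elbo = np.asarray(elbo_per_class, dtype=float)
    if log_prior is None:
        log_prior = np.zeros_like(elbo)  # uniform prior over classes
    logits = elbo + log_prior
    logits -= logits.max()               # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```

Because the ELBO lower-bounds the log-likelihood, the resulting posterior is an approximation; the paper's analysis is about how tight this scheme can be made under Gaussian corruption.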

NeurIPS Conference 2024 Conference Paper

GuardT2I: Defending Text-to-Image Models from Adversarial Prompts

  • Yijun Yang
  • Ruiyuan Gao
  • Xiao Yang
  • Jianyuan Zhong
  • Qiang Xu

Recent advancements in Text-to-Image models have raised significant safety concerns about their potential misuse for generating inappropriate or Not-Safe-For-Work content, despite existing countermeasures such as Not-Safe-For-Work classifiers or model fine-tuning for inappropriate concept removal. Addressing this challenge, our study unveils GuardT2I, a novel moderation framework that adopts a generative approach to enhance Text-to-Image models’ robustness against adversarial prompts. Instead of making a binary classification, GuardT2I utilizes a large language model to conditionally transform text guidance embeddings within the Text-to-Image models into natural language for effective adversarial prompt detection, without compromising the models’ inherent performance. Our extensive experiments reveal that GuardT2I outperforms leading commercial solutions like OpenAI-Moderation and Microsoft Azure Moderator by a significant margin across diverse adversarial scenarios. Our framework is available at https://github.com/cure-lab/GuardT2I.

NeurIPS Conference 2024 Conference Paper

Improving Robustness of 3D Point Cloud Recognition from a Fourier Perspective

  • Yibo Miao
  • Yinpeng Dong
  • Jinlai Zhang
  • Lijia Yu
  • Xiao Yang
  • Xiao-Shan Gao

Although 3D point cloud recognition has achieved substantial progress on standard benchmarks, typical models are vulnerable to point cloud corruptions, leading to security threats in real-world applications. To improve the corruption robustness, various data augmentation methods have been studied, but they are mainly limited to the spatial domain. As the point cloud has low information density and significant spatial redundancy, it is challenging to analyze the effects of corruptions. In this paper, we focus on the frequency domain to observe the underlying structure of point clouds and their corruptions. Through graph Fourier transform (GFT), we observe a correlation between the corruption robustness of point cloud recognition models and their sensitivity to different frequency bands, which is measured by the GFT spectrum of the model’s Jacobian matrix. To reduce the sensitivity and improve the corruption robustness, we propose Frequency Adversarial Training (FAT) that adopts frequency-domain adversarial examples as data augmentation to train robust point cloud recognition models against corruptions. Theoretically, we provide a guarantee of FAT on its out-of-distribution generalization performance. Empirically, we conduct extensive experiments with various network architectures to validate the effectiveness of FAT, which achieves new state-of-the-art results.
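The graph Fourier transform underlying this analysis is standard: eigendecompose the combinatorial graph Laplacian and project the signal onto its eigenbasis, with small eigenvalues corresponding to smooth (low-frequency) components. A minimal dense-matrix sketch for small graphs, independent of the paper's code:

```python
import numpy as np

def graph_fourier_transform(adj, signal):
    # L = D - A for an undirected weighted graph; its eigenvectors
    # form the graph Fourier basis, ordered by eigenvalue (frequency).
    adj = np.asarray(adj, dtype=float)
    laplacian = np.diag(adj.sum(axis=1)) - adj
    eigvals, eigvecs = np.linalg.eigh(laplacian)  # symmetric -> real, sorted spectrum
    spectrum = eigvecs.T @ np.asarray(signal, dtype=float)
    return eigvals, spectrum
```

For a point cloud, `adj` would come from a k-nearest-neighbor graph over the points, and `signal` from per-point coordinates or features.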

ICML Conference 2024 Conference Paper

MagicPose: Realistic Human Poses and Facial Expressions Retargeting with Identity-aware Diffusion

  • Di Chang
  • Yichun Shi
  • Quankai Gao
  • Hongyi Xu
  • Jessica Fu
  • Guoxian Song
  • Qing Yan
  • Yizhe Zhu

In this work, we propose MagicPose, a diffusion-based model for 2D human pose and facial expression retargeting. Specifically, given a reference image, we aim to generate a person’s new images by controlling the poses and facial expressions while keeping the identity unchanged. To this end, we propose a two-stage training strategy to disentangle human motions and appearance (e.g., facial expressions, skin tone, and dressing), consisting of (1) the pre-training of an appearance-control block and (2) learning appearance-disentangled pose control. Our novel design enables robust appearance control over generated human images, including body, facial attributes, and even background. By leveraging the prior knowledge of image diffusion models, MagicPose generalizes well to unseen human identities and complex poses without the need for additional fine-tuning. Moreover, the proposed model is easy to use and can be considered as a plug-in module/extension to Stable Diffusion. The project website is here. The code is available here.

NeurIPS Conference 2024 Conference Paper

MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models

  • Yichi Zhang
  • Yao Huang
  • Yitong Sun
  • Chang Liu
  • Zhe Zhao
  • Zhengwei Fang
  • Yifan Wang
  • Huanran Chen

Despite the superior capabilities of Multimodal Large Language Models (MLLMs) across diverse tasks, they still face significant trustworthiness challenges. Yet, current literature on the assessment of trustworthy MLLMs remains limited, lacking a holistic evaluation to offer thorough insights into future improvements. In this work, we establish MultiTrust, the first comprehensive and unified benchmark on the trustworthiness of MLLMs across five primary aspects: truthfulness, safety, robustness, fairness, and privacy. Our benchmark employs a rigorous evaluation strategy that addresses both multimodal risks and cross-modal impacts, encompassing 32 diverse tasks with self-curated datasets. Extensive experiments with 21 modern MLLMs reveal some previously unexplored trustworthiness issues and risks, highlighting the complexities introduced by the multimodality and underscoring the necessity for advanced methodologies to enhance their reliability. For instance, typical proprietary models still struggle with the perception of visually confusing images and are vulnerable to multimodal jailbreaking and adversarial attacks; MLLMs are more inclined to disclose privacy in text and reveal ideological and cultural biases even when paired with irrelevant images in inference, indicating that the multimodality amplifies the internal risks from base LLMs. Additionally, we release a scalable toolbox for standardized trustworthiness research, aiming to facilitate future advancements in this important field. Code and resources are publicly available at: https://multi-trust.github.io/.

ICLR Conference 2024 Conference Paper

MVDream: Multi-view Diffusion for 3D Generation

  • Yichun Shi
  • Peng Wang
  • Jianglong Ye
  • Long Mai
  • Kejie Li
  • Xiao Yang

We introduce MVDream, a diffusion model that is able to generate consistent multi-view images from a given text prompt. Learning from both 2D and 3D data, a multi-view diffusion model can achieve the generalizability of 2D diffusion models and the consistency of 3D renderings. We demonstrate that such a multi-view diffusion model is implicitly a generalizable 3D prior agnostic to 3D representations. It can be applied to 3D generation via Score Distillation Sampling, significantly enhancing the consistency and stability of existing 2D-lifting methods. It can also learn new concepts from a few 2D examples, akin to DreamBooth, but for 3D generation.

NeurIPS Conference 2023 Conference Paper

On Evaluating Adversarial Robustness of Large Vision-Language Models

  • Yunqing Zhao
  • Tianyu Pang
  • Chao Du
  • Xiao Yang
  • Chongxuan Li
  • Ngai-Man (Man) Cheung
  • Min Lin

Large vision-language models (VLMs) such as GPT-4 have achieved unprecedented performance in response generation, especially with visual inputs, enabling more creative and adaptable interaction than large language models such as ChatGPT. Nonetheless, multimodal generation exacerbates safety concerns, since adversaries may successfully evade the entire system by subtly manipulating the most vulnerable modality (e.g., vision). To this end, we propose evaluating the robustness of open-source large VLMs in the most realistic and high-risk setting, where adversaries have only black-box system access and seek to deceive the model into returning the targeted responses. In particular, we first craft targeted adversarial examples against pretrained models such as CLIP and BLIP, and then transfer these adversarial examples to other VLMs such as MiniGPT-4, LLaVA, UniDiffuser, BLIP-2, and Img2Prompt. In addition, we observe that black-box queries on these VLMs can further improve the effectiveness of targeted evasion, resulting in a surprisingly high success rate for generating targeted responses. Our findings provide a quantitative understanding regarding the adversarial vulnerability of large VLMs and call for a more thorough examination of their potential security flaws before deployment in practice. Our project page: https://yunqing-me.github.io/AttackVLM/.

AAAI Conference 2023 Conference Paper

RGBD1K: A Large-Scale Dataset and Benchmark for RGB-D Object Tracking

  • Xue-Feng Zhu
  • Tianyang Xu
  • Zhangyong Tang
  • Zucheng Wu
  • Haodong Liu
  • Xiao Yang
  • Xiao-Jun Wu
  • Josef Kittler

RGB-D object tracking has attracted considerable attention recently, achieving promising performance thanks to the symbiosis between visual and depth channels. However, given a limited amount of annotated RGB-D tracking data, most state-of-the-art RGB-D trackers are simple extensions of high-performance RGB-only trackers, without fully exploiting the underlying potential of the depth channel in the offline training stage. To address the dataset deficiency issue, a new RGB-D dataset named RGBD1K is released in this paper. The RGBD1K contains 1,050 sequences with about 2.5M frames in total. To demonstrate the benefits of training on a larger RGB-D data set in general, and RGBD1K in particular, we develop a transformer-based RGB-D tracker, named SPT, as a baseline for future visual object tracking studies using the new dataset. The results of extensive experiments using the SPT tracker demonstrate the potential of the RGBD1K dataset to improve the performance of RGB-D tracking, inspiring future developments of effective tracker designs. The dataset and codes will be available on the project homepage: https://github.com/xuefeng-zhu5/RGBD1K.

ICRA Conference 2022 Conference Paper

A Novel Multimodal Human-Exoskeleton Interface Based on EEG and sEMG Activity for Rehabilitation Training

  • Kecheng Shi
  • Rui Huang 0008
  • Fengjun Mu
  • Zhinan Peng
  • Ke Huang
  • Yizhe Qin
  • Xiao Yang
  • Hong Cheng 0002

Despite advances in the field of human-robot interfaces (HRI) based on biological neural signals, the use of the electroencephalography (EEG) signal alone to help a robotic exoskeleton predict limb movement is not yet mature enough for rehabilitation training, due to its unreliability. Multimodal HRI is a very recent solution for enhancing the performance of single-modal HRI; such interfaces normally combine the EEG signal with the surface electromyography (sEMG) signal. However, their use for lower-limb movement prediction in hemiplegia is still limited, and the deep fusion of sEMG and EEG signal features has been ignored. This paper proposes a Dense co-attention mechanism-based Multimodal Enhance fusion Network (DMEFNet) for lower-limb movement prediction in hemiplegia. DMEFNet realizes the mapping and deep fusion between sEMG and EEG signal features and achieves highly accurate movement prediction for the lower limbs. A sEMG and EEG data acquisition experiment and an incomplete asynchronous data collection paradigm are designed to verify the effectiveness of DMEFNet. The experimental results show that DMEFNet has good movement prediction performance in both within-subject and cross-subject settings, reaching accuracies of 82.96% and 88.44%, respectively.

AAAI Conference 2022 Conference Paper

DDG-DA: Data Distribution Generation for Predictable Concept Drift Adaptation

  • Wendi Li
  • Xiao Yang
  • Weiqing Liu
  • Yingce Xia
  • Jiang Bian

In many real-world scenarios, we often deal with streaming data that is sequentially collected over time. Due to the non-stationary nature of the environment, the streaming data distribution may change in unpredictable ways, which is known as concept drift. To handle concept drift, previous methods first detect when/where the concept drift happens and then adapt models to fit the distribution of the latest data. However, there are still many cases in which some underlying factors of environment evolution are predictable, making it possible to model the future concept drift trend of the streaming data; such cases are not fully explored in previous work. In this paper, we propose a novel method, DDG-DA, that can effectively forecast the evolution of the data distribution and improve the performance of models. Specifically, we first train a predictor to estimate the future data distribution, then leverage it to generate training samples, and finally train models on the generated data. We conduct experiments on three real-world tasks (forecasting stock price trends, electricity load, and solar irradiance) and obtain significant improvements on multiple widely used models.

NeurIPS Conference 2021 Conference Paper

Accumulative Poisoning Attacks on Real-time Data

  • Tianyu Pang
  • Xiao Yang
  • Yinpeng Dong
  • Hang Su
  • Jun Zhu

Collecting training data from untrusted sources exposes machine learning services to poisoning adversaries, who maliciously manipulate training data to degrade the model accuracy. When trained on offline datasets, poisoning adversaries have to inject the poisoned data in advance before training, and the order of feeding these poisoned batches into the model is stochastic. In contrast, practical systems are more usually trained/fine-tuned on sequentially captured real-time data, in which case poisoning adversaries could dynamically poison each data batch according to the current model state. In this paper, we focus on the real-time settings and propose a new attacking strategy, which affiliates an accumulative phase with poisoning attacks to secretly (i.e., without affecting accuracy) magnify the destructive effect of a (poisoned) trigger batch. By mimicking online learning and federated learning on MNIST and CIFAR-10, we show that model accuracy significantly drops by a single update step on the trigger batch after the accumulative phase. Our work validates that a well-designed but straightforward attacking strategy can dramatically amplify the poisoning effects, with no need to explore complex techniques.

NeurIPS Conference 2020 Conference Paper

Boosting Adversarial Training with Hypersphere Embedding

  • Tianyu Pang
  • Xiao Yang
  • Yinpeng Dong
  • Kun Xu
  • Jun Zhu
  • Hang Su

Adversarial training (AT) is one of the most effective defenses against adversarial attacks for deep learning models. In this work, we advocate incorporating the hypersphere embedding (HE) mechanism into the AT procedure by regularizing the features onto compact manifolds, which constitutes a lightweight yet effective module to blend in the strength of representation learning. Our extensive analyses reveal that AT and HE are well coupled to benefit the robustness of the adversarially trained models from several aspects. We validate the effectiveness and adaptability of HE by embedding it into the popular AT frameworks including PGD-AT, ALP, and TRADES, as well as the FreeAT and FastAT strategies. In the experiments, we evaluate our methods under a wide range of adversarial attacks on the CIFAR-10 and ImageNet datasets, which verifies that integrating HE can consistently enhance the model robustness for each AT framework with little extra computation.

AAAI Conference 2019 Conference Paper

Adversarial Training for Community Question Answer Selection Based on Multi-Scale Matching

  • Xiao Yang
  • Madian Khabsa
  • Miaosen Wang
  • Wei Wang
  • Ahmed Hassan Awadallah
  • Daniel Kifer
  • C. Lee Giles

Community-based question answering (CQA) websites represent an important source of information. As a result, the problem of matching the most valuable answers to their corresponding questions has become an increasingly popular research topic. We frame this task as a binary (relevant/irrelevant) classification problem and present an adversarial training framework to alleviate the label imbalance issue. We employ a generative model to iteratively sample a subset of challenging negative samples to fool our classification model. Both models are alternately optimized using the REINFORCE algorithm. The proposed method is completely different from previous ones, in which negative samples in the training set are used directly or uniformly down-sampled. Further, we propose Multi-scale Matching, which explicitly inspects the correlation between words and n-grams at different levels of granularity. We evaluate the proposed method on the SemEval 2016 and SemEval 2017 datasets and achieve state-of-the-art or similar performance.

IJCAI Conference 2017 Conference Paper

Learning to Read Irregular Text with Attention Mechanisms

  • Xiao Yang
  • Dafang He
  • Zihan Zhou
  • Daniel Kifer
  • C. Lee Giles

We present a robust end-to-end neural-based model to attentively recognize text in natural images. In particular, we focus on accurately identifying irregular (perspectively distorted or curved) text, which has not been well addressed in the previous literature. Previous research on text reading often works with regular (horizontal and frontal) text and does not adequately generalize to processing text with perspective distortion or curving effects. Our work proposes to overcome this difficulty by introducing two learning components: (1) an auxiliary dense character detection task that helps to learn text-specific visual patterns, and (2) an alignment loss that provides guidance to the training of an attention model. We show with experiments that these two components are crucial for achieving fast convergence and high classification accuracy for irregular text recognition. Our model outperforms previous work on two irregular-text datasets, SVT-Perspective and CUTE80, and is also highly competitive on several regular-text datasets containing primarily horizontal and frontal text.

AAAI Conference 2013 Conference Paper

Personalized Recommendation Based on Co-Ranking and Query-Based Collaborative Diffusion

  • Xiao Yang
  • Zhaoxin Zhang
  • Qiang Wang

In this paper, we present an adaptive graph-based personalized recommendation method based on co-ranking and query-based collaborative diffusion. By utilizing the unique network structure of the n-partite heterogeneous graph, we address the problem of personalized recommendation through a two-layer ranking process, with the help of a reasonable measure of high- and low-order relationships and an analysis of how the user's preferences are represented in the graph. The experiments show that this algorithm outperforms traditional CF methods, achieves competitive performance compared with many model-based and graph-based recommendation methods, and offers better scalability and flexibility.