Arrow Research search

Author name cluster

Ya Zhang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

36 papers
2 author rows

Possible papers

36

TMLR Journal 2026 Journal Article

A Tighter Bound for Reward Learning in Reinforcement Learning from Human Feedback

  • Guoxi Chen
  • Xing Chen
  • Bo An
  • Ya Zhang

As a key component of reinforcement learning from human feedback (RLHF), reward learning directly influences the final learned policy. Unfortunately, existing theoretical estimation error bounds in reward learning rely on the complexity of the reward function class, unattainable optimal parameters, or non-zero constants independent of sample size, leading to uncomputable bounds that are meaningless for reward function classes with unknown complexity. To address this issue, this paper presents an analysis of parameter estimation for reward learning in RLHF under general function approximation, without imposing restrictions on the complexity of the reward function class. A tighter bound is provided without non-zero terms independent of the sample size. The optimal parameters are eliminated by applying linear approximation around the learned parameters. Additionally, the relationship between the preference dataset and the learned parameters is further examined to demonstrate how to efficiently collect data based on the current learned parameters. Inspired by the theoretical results, a novel offline RLHF algorithm with parameter constraints is proposed, restricting parameters to the valid space defined by the dataset. Furthermore, an online RLHF algorithm is proposed to iteratively optimize parameter learning and improve data collection efficiency. This work provides a tighter bound than previous studies and offers theoretical guidance for online data collection under general function approximation.
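
Reward learning in RLHF is most commonly cast as maximum-likelihood estimation under the Bradley-Terry preference model; the sketch below shows that standard objective for concreteness. The abstract does not state the paper's exact formulation, so the notation and loss here are assumptions, not the paper's definition.

```latex
% Standard Bradley-Terry reward-learning objective over N preference pairs
% (assumed formulation; notation is illustrative, not taken from the paper).
\mathcal{L}(\theta)
  = -\frac{1}{N}\sum_{i=1}^{N}
    \log \sigma\!\bigl( r_\theta(x_i, y_i^{+}) - r_\theta(x_i, y_i^{-}) \bigr),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}},
```

where $r_\theta$ scores a prompt-response pair and $(y_i^{+}, y_i^{-})$ are the preferred and rejected responses for prompt $x_i$; the estimation-error bounds discussed in the abstract concern how close the learned parameters are to the optimal ones.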

AAAI Conference 2026 Conference Paper

MedS³: Towards Medical Slow Thinking with Self-Evolved Soft Dual-sided Process Supervision

  • Shuyang Jiang
  • Yusheng Liao
  • Zhe Chen
  • Ya Zhang
  • Yanfeng Wang
  • Yu Wang

Medical language models face critical barriers to real-world clinical reasoning applications. Mainstream efforts, however, fall short in task coverage, lack fine-grained supervision for intermediate reasoning steps, and rely on proprietary systems, leaving them far from a versatile, credible, and efficient language model for clinical reasoning. To this end, we propose MedS3, a self-evolving framework that imparts robust reasoning capabilities to small, deployable models. Starting with 8,000 curated instances sampled via a curriculum strategy across five medical domains and 16 datasets, we use a small base policy model to conduct Monte Carlo Tree Search (MCTS) for constructing rule-verifiable reasoning trajectories. Self-explored reasoning trajectories ranked by node values are used to bootstrap the policy model via reinforcement fine-tuning and preference learning. Moreover, we introduce a soft dual process reward model that incorporates value dynamics: steps that degrade node value are penalized, enabling fine-grained identification of reasoning errors even when the final answer is correct. Experiments on eleven benchmarks show that MedS3 outperforms the previous state-of-the-art medical model by +6.45 accuracy points and surpasses 32B-scale general-purpose reasoning models by +8.57 points. Additional empirical analysis further demonstrates that MedS3 achieves robust and faithful reasoning behavior.

AAAI Conference 2026 Conference Paper

Versatile Vision-Language Model for 3D Computed Tomography

  • Jiayu Lei
  • Ziqing Fan
  • Yanyong Zhang
  • Weidi Xie
  • Ya Zhang
  • Yanfeng Wang

Representation learning serves as a foundational component of medical vision-language models (MVLMs), enabling cross-modal alignment, semantic consistency, and enhanced generalization capabilities for downstream tasks. As generalist models rapidly evolve, there is a pressing need to unify diverse downstream tasks, such as diagnosis, segmentation, report generation, and multiple-choice answering, within a cohesive framework, demanding more efficient and versatile visual representation learning. However, current MVLMs predominantly follow CLIP-style vision pretraining, failing to leverage heterogeneous data resources with multi-dimensional imaging and diverse annotation forms. Moreover, there is no systematic analysis of efficient vision encoder design across varied downstream applications, including diagnosis, segmentation, and text generation tasks, particularly for volumetric imaging like Computed Tomography (CT). In addition, current MVLMs exhibit constrained voxel-level capabilities and lack an effective multi-task instruction tuning framework capable of achieving robust performance across various downstream tasks. To address these challenges, we propose CTInstruct, a novel MVLM employing a hybrid ResNet-ViT encoder with multi-granular vision-language pretraining for efficient heterogeneous data modeling, and unified instruction tuning that jointly optimizes discriminative, generative, and voxel-level reasoning for volumetric medical imaging. CTInstruct achieves SOTA performance across 8 CT benchmarks, setting a new standard for data-efficient multimodal learning in medical imaging.

IROS Conference 2025 Conference Paper

CoPAD: Multi-source Trajectory Fusion and Cooperative Trajectory Prediction with Anchor-oriented Decoder in V2X Scenarios

  • Kangyu Wu
  • Jiaqi Qiao
  • Ya Zhang

Recently, data-driven trajectory prediction methods have achieved remarkable results, significantly advancing the development of autonomous driving. However, the instability of single-vehicle perception introduces certain limitations to trajectory prediction. In this paper, a novel lightweight framework for cooperative trajectory prediction, CoPAD, is proposed. This framework incorporates a fusion module based on the Hungarian algorithm and Kalman filtering, along with the Past Time Attention (PTA) module, mode attention module and anchor-oriented decoder (AoD). It effectively performs early fusion on multi-source trajectory data from vehicles and road infrastructure, yielding trajectories with high completeness and accuracy. The PTA module can efficiently capture potential interaction information among historical trajectories, and the mode attention module is proposed to enrich the diversity of predictions. Additionally, the decoder based on sparse anchors is designed to generate the final complete trajectories. Extensive experiments show that CoPAD achieves state-of-the-art performance on the DAIR-V2X-Seq dataset, validating the effectiveness of the model for cooperative trajectory prediction in V2X scenarios.
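
For illustration, the kind of multi-source track association a Hungarian-algorithm fusion module performs can be sketched as below. This is a hedged sketch only: the cost design (last-position distance), the gating threshold, and all function names are assumptions, not CoPAD's actual implementation.

```python
# Hypothetical sketch of Hungarian-algorithm trajectory association for fusing
# ego-vehicle and infrastructure tracks (illustrative; cost design is assumed).
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_tracks(ego_tracks, infra_tracks, max_cost=5.0):
    """Match ego tracks to infrastructure tracks by distance between last positions."""
    # ego_tracks, infra_tracks: lists of (T, 2) arrays of x/y positions.
    cost = np.array([[np.linalg.norm(e[-1] - i[-1]) for i in infra_tracks]
                     for e in ego_tracks])
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
    # Keep only matches whose cost falls below a gating threshold; matched pairs
    # would then be smoothed jointly, e.g. with a Kalman filter.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]
```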

JBHI Journal 2025 Journal Article

Interpretable Brain MRI Report Generation Anchored by Lesion Topography

  • Jiayu Lei
  • Xiaoman Zhang
  • Chaoyi Wu
  • Lisong Dai
  • Ya Zhang
  • Yanyong Zhang
  • Yanfeng Wang
  • Weidi Xie

Radiologists face increasing workloads that make accurate and timely report generation both critical and challenging. This paper presents a novel system for grounded automatic brain MRI report generation, with contributions in three key areas: First, we release RadGenome-Brain MRI, a benchmark dataset featuring multi-modal scans, expert-annotated abnormality masks, and radiology reports with region-level grounding to support fine-grained, explainable report generation. Second, we propose AutoRG-Brain, the first brain MRI report generation framework that combines automatic anomaly segmentation with a visual prompting-based language model to produce structured, anatomically grounded findings. Third, we conduct extensive quantitative and expert evaluations across segmentation and reporting tasks, and demonstrate in real clinical settings that our system significantly enhances junior radiologists' ability to detect subtle abnormalities and compose high-quality reports, narrowing the gap with senior doctors. All code, models, and datasets will be publicly released to facilitate future research and development.

NeurIPS Conference 2025 Conference Paper

Learning to Instruct for Visual Instruction Tuning

  • Zhihan Zhou
  • Feng Hong
  • Jiaan Luo
  • Yushi Ye
  • Jiangchao Yao
  • Dongsheng Li
  • Bo Han
  • Ya Zhang

We propose L2T, an advancement of visual instruction tuning (VIT). While VIT equips Multimodal LLMs (MLLMs) with promising multimodal capabilities, the current design choices for VIT often result in overfitting and shortcut learning, potentially degrading performance. This gap arises from an overemphasis on instruction-following abilities, while neglecting the proactive understanding of visual information. Inspired by this, L2T adopts a simple yet effective approach by incorporating the loss function into both the instruction and response sequences. It seamlessly expands the training data and regularizes the MLLMs against over-reliance on language priors. Based on this merit, L2T achieves a significant relative improvement of up to 9% on comprehensive multimodal benchmarks, requiring no additional training data and incurring negligible computational overhead. Surprisingly, L2T attains exceptional fundamental visual capabilities, yielding up to an 18% improvement in captioning performance, while simultaneously alleviating hallucination in MLLMs. GitHub code: https://github.com/Feng-Hong/L2T.
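
The change described in the abstract, applying the language-modeling loss to instruction tokens as well as response tokens, can be sketched as follows. Shapes, the weighting scheme, and the helper name are illustrative assumptions rather than the released implementation.

```python
# Hedged sketch: cross-entropy over both instruction and response tokens, instead of
# masking the instruction span as standard visual instruction tuning does.
import torch
import torch.nn.functional as F

def l2t_style_loss(logits, labels, response_mask, instr_weight=1.0):
    # logits: (B, T, V); labels: (B, T); response_mask: (B, T) bool, True on response tokens.
    shift_logits = logits[:, :-1].reshape(-1, logits.size(-1))
    shift_labels = labels[:, 1:].reshape(-1)
    token_loss = F.cross_entropy(shift_logits, shift_labels, reduction="none")
    resp = response_mask[:, 1:].reshape(-1).float()
    # Standard VIT would use weights = resp (instruction tokens ignored);
    # here instruction tokens also contribute, weighted by instr_weight.
    weights = resp + instr_weight * (1.0 - resp)
    return (token_loss * weights).sum() / weights.sum()
```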

AAMAS Conference 2025 Conference Paper

Prompt Tuning with Diffusion for Few-Shot Pre-trained Policy Generalization

  • Shengchao Hu
  • Wanru Zhao
  • Weixiong Lin
  • Li Shen
  • Ya Zhang
  • Dacheng Tao

Offline reinforcement learning (RL) methods harness previous experiences to derive an optimal policy, forming the foundation for pretrained large-scale models (PLMs). When adapting to novel tasks, PLMs leverage expert trajectories as prompts to accelerate adaptation. While various prompt-tuning techniques aim to improve prompt quality, their effectiveness is often limited by initialization constraints, restricting exploration and potentially leading to suboptimal solutions. To eliminate dependence on the initial prompt, we reframe prompt-tuning as conditional generative modeling, where prompts are generated from random noise. Our proposed Prompt Diffuser employs a conditional diffusion model to generate high-quality prompts. Central to our framework is trajectory reconstruction and the seamless integration of downstream task guidance during training. Experimental results validate Prompt Diffuser’s effectiveness, demonstrating strong performance in meta-RL tasks.

NeurIPS Conference 2025 Conference Paper

RAD: Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis

  • Haolin Li
  • Tianjie Dai
  • Zhe Chen
  • Siyuan Du
  • Jiangchao Yao
  • Ya Zhang
  • Yanfeng Wang

Clinical diagnosis is a highly specialized discipline requiring both domain expertise and strict adherence to rigorous guidelines. While current AI-driven medical research predominantly focuses on knowledge graphs or natural text pretraining paradigms to incorporate medical knowledge, these approaches primarily rely on implicitly encoded knowledge within model parameters, neglecting task-specific knowledge required by diverse downstream tasks. To address this limitation, we propose Retrieval-Augmented Diagnosis (RAD), a novel framework that explicitly injects external knowledge into multimodal models directly on downstream tasks. Specifically, RAD operates through three key mechanisms: retrieval and refinement of disease-centered knowledge from multiple medical sources, a guideline-enhanced contrastive loss that constrains the latent distance between multi-modal features and guideline knowledge, and a dual transformer decoder that employs guidelines as queries to steer cross-modal fusion, aligning the models with clinical diagnostic workflows from guideline acquisition to feature extraction and decision-making. Moreover, recognizing the lack of quantitative evaluation of interpretability for multimodal diagnostic models, we introduce a set of criteria to assess interpretability from both image and text perspectives. Extensive evaluations across four datasets with different anatomies demonstrate RAD's generalizability, achieving state-of-the-art performance. Furthermore, RAD enables the model to concentrate more precisely on abnormal regions and critical indicators, ensuring evidence-based, trustworthy diagnosis. Our code is available at https://github.com/tdlhl/RAD.

NeurIPS Conference 2025 Conference Paper

SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation

  • Zhenjie Mao
  • Yang Yuhuan
  • Chaofan Ma
  • Dongsheng Jiang
  • Jiangchao Yao
  • Ya Zhang
  • Yanfeng Wang

Referring Image Segmentation (RIS) aims to segment the target object in an image given a natural language expression. While recent methods leverage pre-trained vision backbones and more training corpus to achieve impressive results, they predominantly focus on simple expressions—short, clear noun phrases like “red car” or “left girl”. This simplification often reduces RIS to a key word/concept matching problem, limiting the model’s ability to handle referential ambiguity in expressions. In this work, we identify two challenging real-world scenarios: object-distracting expressions, which involve multiple entities with contextual cues, and category-implicit expressions, where the object class is not explicitly stated. To address the challenges, we propose a novel framework, SaFiRe, which mimics the human two-phase cognitive process—first forming a global understanding, then refining it through detail-oriented inspection. This is naturally supported by Mamba’s scan-then-update property, which aligns with our phased design and enables efficient multi-cycle refinement with linear complexity. We further introduce aRefCOCO, a new benchmark designed to evaluate RIS models under ambiguous referring expressions. Extensive experiments on both standard and proposed datasets demonstrate the superiority of SaFiRe over state-of-the-art baselines. Project page: https://zhenjiemao.github.io/SaFiRe/.

AAAI Conference 2024 Conference Paper

MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models

  • Yan Cai
  • Linlin Wang
  • Ye Wang
  • Gerard de Melo
  • Ya Zhang
  • Yanfeng Wang
  • Liang He

The emergence of various medical large language models (LLMs) in the medical domain has highlighted the need for unified evaluation standards, as manual evaluation of LLMs proves to be time-consuming and labor-intensive. To address this issue, we introduce MedBench, a comprehensive benchmark for the Chinese medical domain, comprising 40,041 questions sourced from authentic examination exercises and medical reports of diverse branches of medicine. In particular, this benchmark is composed of four key components: the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, the Doctor In-Charge Qualification Examination, and real-world clinic cases encompassing examinations, diagnoses, and treatments. MedBench replicates the educational progression and clinical practice experiences of doctors in Mainland China, thereby establishing itself as a credible benchmark for assessing the mastery of knowledge and reasoning abilities in medical language learning models. We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which culminate in the following findings: (1) Chinese medical LLMs underperform on this benchmark, highlighting the need for significant advances in clinical knowledge and diagnostic precision. (2) Several general-domain LLMs surprisingly possess considerable medical knowledge. These findings elucidate both the capabilities and limitations of LLMs within the context of MedBench, with the ultimate goal of aiding the medical research community.

NeurIPS Conference 2024 Conference Paper

Probabilistic Conformal Distillation for Enhancing Missing Modality Robustness

  • Mengxi Chen
  • Fei Zhang
  • Zihua Zhao
  • Jiangchao Yao
  • Ya Zhang
  • Yanfeng Wang

Multimodal models trained on modality-complete data are plagued with severe performance degradation when encountering modality-missing data. Prevalent cross-modal knowledge distillation-based methods precisely align the representation of modality-missing data and that of its modality-complete counterpart to enhance robustness. However, due to the irreparable information asymmetry, this determinate alignment is too stringent, easily inducing modality-missing features to capture spurious factors erroneously. In this paper, a novel multimodal Probabilistic Conformal Distillation (PCD) method is proposed, which considers the inherent indeterminacy in this alignment. Given a modality-missing input, our goal is to learn the unknown Probability Density Function (PDF) of the mapped variables in the modality-complete space, rather than relying on the brute-force point alignment. Specifically, PCD models the modality-missing feature as a probabilistic distribution, enabling it to satisfy two characteristics of the PDF. One is the extremes of probabilities of modality-complete feature points on the PDF, and the other is the geometric consistency between the modeled distributions and the peak points of different PDFs. Extensive experiments on a range of benchmark datasets demonstrate the superiority of PCD over state-of-the-art methods. Code is available at: https://github.com/mxchen-mc/PCD.

NeurIPS Conference 2024 Conference Paper

Revive Re-weighting in Imbalanced Learning by Density Ratio Estimation

  • Jiaan Luo
  • Feng Hong
  • Jiangchao Yao
  • Bo Han
  • Ya Zhang
  • Yanfeng Wang

In deep learning, model performance often deteriorates when trained on highly imbalanced datasets, especially when evaluation metrics require robust generalization across underrepresented classes. To address the challenges posed by imbalanced data distributions, this study introduces a novel method utilizing density ratio estimation for dynamic class weight adjustment, termed Re-weighting with Density Ratio (RDR). Our method adaptively adjusts the importance of each class during training, mitigates overfitting on dominant classes, and enhances model adaptability across diverse datasets. Extensive experiments conducted on various large-scale benchmark datasets validate the effectiveness of our method. Results demonstrate substantial improvements in generalization capabilities, particularly under severely imbalanced conditions.

NeurIPS Conference 2024 Conference Paper

TAIA: Large Language Models are Out-of-Distribution Data Learners

  • Shuyang Jiang
  • Yusheng Liao
  • Ya Zhang
  • Yanfeng Wang
  • Yu Wang

Fine-tuning on task-specific question-answer pairs is a predominant method for enhancing the performance of instruction-tuned large language models (LLMs) on downstream tasks. However, in certain specialized domains, such as healthcare or harmless content generation, it is nearly impossible to obtain a large volume of high-quality data that matches the downstream distribution. To improve the performance of LLMs in data-scarce domains with domain-mismatched data, we re-evaluated the Transformer architecture and discovered that not all parameter updates during fine-tuning contribute positively to downstream performance. Our analysis reveals that within the self-attention and feed-forward networks, only the fine-tuned attention parameters are particularly beneficial when the training set's distribution does not fully align with the test set. Based on this insight, we propose an effective inference-time intervention method: Training All parameters but Inferring with only Attention (TAIA). We empirically validate TAIA using two general instruction-tuning datasets and evaluate it on seven downstream tasks involving math, reasoning, and knowledge understanding across LLMs of different parameter sizes and fine-tuning techniques. Our comprehensive experiments demonstrate that TAIA achieves superior improvements compared to both the fully fine-tuned model and the base model in most scenarios, with significant performance gains. The high tolerance of TAIA to data mismatches makes it resistant to jailbreaking tuning and enhances specialized tasks using general data. Code is available at https://github.com/pixas/TAIA_LLM.
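
The inference-time intervention named in the abstract, training all parameters but inferring with only the fine-tuned attention, can be sketched as a parameter swap: copy every non-attention weight back from the base model after fine-tuning. The keyword-based filter below is an assumption and would need to match the real model's parameter names.

```python
# Hedged sketch of the TAIA idea: keep fine-tuned attention weights, revert everything
# else (feed-forward, embeddings, norms) to the base model before inference.
import torch

@torch.no_grad()
def taia_merge(finetuned_model, base_model, attn_keywords=("attn", "attention")):
    base_state = base_model.state_dict()
    for name, param in finetuned_model.named_parameters():
        if not any(k in name.lower() for k in attn_keywords):
            param.copy_(base_state[name])  # revert non-attention parameters
    return finetuned_model
```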

NeurIPS Conference 2023 Conference Paper

Asynchrony-Robust Collaborative Perception via Bird's Eye View Flow

  • Sizhe Wei
  • Yuxi Wei
  • Yue Hu
  • Yifan Lu
  • Yiqi Zhong
  • Siheng Chen
  • Ya Zhang

Collaborative perception can substantially boost each agent's perception ability by facilitating communication among multiple agents. However, temporal asynchrony among agents is inevitable in the real world due to communication delays, interruptions, and clock misalignments. This issue causes information mismatch during multi-agent fusion, seriously shaking the foundation of collaboration. To address this issue, we propose CoBEVFlow, an asynchrony-robust collaborative perception system based on bird's eye view (BEV) flow. The key intuition of CoBEVFlow is to compensate for motion in order to align the asynchronous collaboration messages sent by multiple agents. To model the motion in a scene, we propose BEV flow, a collection of motion vectors, one for each spatial location. Based on BEV flow, asynchronous perceptual features can be reassigned to appropriate positions, mitigating the impact of asynchrony. CoBEVFlow has two advantages: (i) CoBEVFlow can handle asynchronous collaboration messages sent at irregular, continuous time stamps without discretization; and (ii) with BEV flow, CoBEVFlow only transports the original perceptual features, instead of generating new perceptual features, avoiding additional noise. To validate CoBEVFlow's efficacy, we create IRregular V2V (IRV2V), the first synthetic collaborative perception dataset with various temporal asynchronies that simulate different real-world scenarios. Extensive experiments conducted on both IRV2V and the real-world dataset DAIR-V2X show that CoBEVFlow consistently outperforms other baselines and is robust in extremely asynchronous settings. The code is available at https://github.com/MediaBrain-SJTU/CoBEVFlow.

NeurIPS Conference 2023 Conference Paper

AttrSeg: Open-Vocabulary Semantic Segmentation via Attribute Decomposition-Aggregation

  • Chaofan Ma
  • Yang Yuhuan
  • Chen Ju
  • Fei Zhang
  • Ya Zhang
  • Yanfeng Wang

Open-vocabulary semantic segmentation is a challenging task that requires segmenting novel object categories at inference time. Recent works explore vision-language pre-training to handle this task, but suffer from unrealistic assumptions in practical scenarios, i.e., low-quality textual category names. For example, this paradigm assumes that new textual categories will be accurately and completely provided, and exist in lexicons during pre-training. However, exceptions often arise: ambiguity from brief or incomplete names, new words that are not present in the pre-trained lexicons, and categories that are difficult for users to describe. To address these issues, this work proposes a novel attribute decomposition-aggregation framework, AttrSeg, inspired by human cognition in understanding new concepts. Specifically, in the decomposition stage, we decouple class names into diverse attribute descriptions to complement semantic contexts from multiple perspectives. Two attribute construction strategies are designed: using large language models for common categories, and manual labelling for human-invented categories. In the aggregation stage, we group diverse attributes into an integrated global description, to form a discriminative classifier that distinguishes the target object from others. A hierarchical aggregation architecture is further proposed to achieve multi-level aggregation, leveraging a meticulously designed clustering module. The final result is obtained by computing the similarity between aggregated attributes and image embeddings. To evaluate the effectiveness, we annotate three datasets with attribute descriptions, and conduct extensive experiments and ablation studies. The results show the superior performance of attribute decomposition-aggregation. We refer readers to the latest arXiv version at https://arxiv.org/abs/2309.00096.

NeurIPS Conference 2023 Conference Paper

Combating Representation Learning Disparity with Geometric Harmonization

  • Zhihan Zhou
  • Jiangchao Yao
  • Feng Hong
  • Ya Zhang
  • Bo Han
  • Yanfeng Wang

Self-supervised learning (SSL), as an effective paradigm of representation learning, has achieved tremendous success on various curated datasets in diverse scenarios. Nevertheless, when facing the long-tailed distribution in real-world applications, it is still hard for existing methods to capture transferable and robust representations. The reason is that vanilla SSL methods that pursue sample-level uniformity easily lead to representation learning disparity, where head classes with a huge number of samples dominate the feature regime while tail classes with a small number of samples passively collapse. To address this problem, we propose a novel Geometric Harmonization (GH) method to encourage category-level uniformity in representation learning, which is more benign to the minority and almost does not hurt the majority under a long-tailed distribution. Specifically, GH measures the population statistics of the embedding space on top of self-supervised learning, and then infers a fine-grained instance-wise calibration to constrain the space expansion of head classes and avoid the passive collapse of tail classes. Our proposal does not alter the setting of SSL and can be easily integrated into existing methods in a low-cost manner. Extensive results on a range of benchmark datasets show the effectiveness of GH with high tolerance to distribution skewness.

TMLR Journal 2023 Journal Article

Contrastive Attraction and Contrastive Repulsion for Representation Learning

  • Huangjie Zheng
  • Xu Chen
  • Jiangchao Yao
  • Hongxia Yang
  • Chunyuan Li
  • Ya Zhang
  • Hao Zhang
  • Ivor Tsang

Contrastive learning (CL) methods effectively learn data representations in a self-supervised manner, where the encoder contrasts each positive sample over multiple negative samples via a one-vs-many softmax cross-entropy loss. By leveraging large amounts of unlabeled image data, recent CL methods have achieved promising results when pretrained on large-scale datasets, such as ImageNet. However, most of them consider augmented views from the same instance to be positive pairs, while views from other instances are negative ones. Such a binary partition insufficiently considers the relation between samples and tends to yield worse performance when generalized to images in the wild. In this paper, to further improve the performance of CL and enhance its robustness on various datasets, we propose a doubly CL strategy that contrasts positive samples and negative ones within themselves separately. We realize this strategy with contrastive attraction and contrastive repulsion (CACR), which makes the query not only exert a greater force to attract more distant positive samples but also do so to repel closer negative samples. Theoretical analysis reveals that CACR generalizes CL's behavior by positive attraction and negative repulsion. It further considers the intra-contrastive relation within the positive and negative pairs to narrow the gap between the sampled and true distribution, which is important when datasets are less curated. Extensive large-scale experiments on standard vision tasks show that CACR not only consistently outperforms existing CL methods on benchmark datasets, but also shows better robustness when generalized to imbalanced image datasets.

TMLR Journal 2023 Journal Article

Federated Learning under Partially Disjoint Data via Manifold Reshaping

  • Ziqing Fan
  • Jiangchao Yao
  • Ruipeng Zhang
  • Lingjuan Lyu
  • Yanfeng Wang
  • Ya Zhang

Statistical heterogeneity severely limits the performance of federated learning (FL), motivating several explorations, e.g., FedProx, MOON, and FedDyn, to alleviate this problem. Despite their effectiveness, the considered scenario generally requires samples from almost all classes during the local training of each client, although some covariate shifts may exist among clients. In fact, the natural case of partially class-disjoint data (PCDD), where each client contributes a few classes (instead of all classes) of samples, is practical yet underexplored. Specifically, the unique collapse and invasion characteristics of PCDD can induce a biased optimization direction in local training, which hampers the efficiency of federated learning. To address this dilemma, we propose a manifold reshaping approach called FedMR to calibrate the feature space of local training. Our FedMR adds two interplaying losses to vanilla federated learning: one is the intra-class loss to decorrelate feature dimensions for anti-collapse, and the other is the inter-class loss to guarantee a proper margin among categories in the feature expansion. We conduct extensive experiments on a range of datasets to demonstrate that our FedMR achieves much higher accuracy and better communication efficiency.

NeurIPS Conference 2023 Conference Paper

Federated Learning with Bilateral Curation for Partially Class-Disjoint Data

  • Ziqing Fan
  • Ruipeng Zhang
  • Jiangchao Yao
  • Bo Han
  • Ya Zhang
  • Yanfeng Wang

Partially class-disjoint data (PCDD), a common yet under-explored data formation where each client contributes a part of classes (instead of all classes) of samples, severely challenges the performance of federated algorithms. Without full classes, the local objective will contradict the global objective, yielding the angle collapse problem for locally missing classes and the space waste problem for locally existing classes. As far as we know, none of the existing methods can intrinsically mitigate PCDD challenges to achieve holistic improvement in the bilateral views (both global view and local view) of federated learning. To address this dilemma, we are inspired by the strong generalization of the simplex Equiangular Tight Frame (ETF) on imbalanced data, and propose a novel approach called FedGELA where the classifier is globally fixed as a simplex ETF while locally adapted to the personal distributions. Globally, FedGELA provides fair and equal discrimination for all classes and avoids inaccurate updates of the classifier, while locally it utilizes the space of locally missing classes for locally existing classes. We conduct extensive experiments on a range of datasets to demonstrate that our FedGELA achieves promising performance (average improvement of 3.9% over FedAvg and 1.5% over the best baselines) and provide both local and global convergence guarantees.
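
For reference, the simplex Equiangular Tight Frame that the abstract fixes as the global classifier has a standard closed-form construction; a minimal NumPy sketch of that construction is below (an illustration of the textbook formula, not the authors' code).

```python
# Minimal sketch of a simplex ETF classifier: K unit-norm prototypes with pairwise
# cosine similarity -1/(K-1), the structure FedGELA fixes globally (illustrative only).
import numpy as np

def simplex_etf(num_classes: int, feat_dim: int) -> np.ndarray:
    """Return a (feat_dim, num_classes) matrix whose columns form a simplex ETF."""
    assert feat_dim >= num_classes
    # Orthonormal columns from the QR decomposition of a random Gaussian matrix.
    U, _ = np.linalg.qr(np.random.randn(feat_dim, num_classes))
    centering = np.eye(num_classes) - np.ones((num_classes, num_classes)) / num_classes
    M = np.sqrt(num_classes / (num_classes - 1)) * U @ centering
    return M  # M.T @ M has ones on the diagonal and -1/(K-1) elsewhere
```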

JBHI Journal 2023 Journal Article

Self-Supervised Tumor Segmentation With Sim2Real Adaptation

  • Xiaoman Zhang
  • Weidi Xie
  • Chaoqin Huang
  • Ya Zhang
  • Xin Chen
  • Qi Tian
  • Yanfeng Wang

This paper targets self-supervised tumor segmentation. We make the following contributions: (i) inspired by the observation that tumors are often characterised independently of their contexts, we propose a novel proxy task, “layer-decomposition”, that closely matches the goal of the downstream task, and design a scalable pipeline for generating synthetic tumor data for pre-training; (ii) we propose a two-stage Sim2Real training regime for unsupervised tumor segmentation, where we first pre-train a model with simulated tumors, and then adopt a self-training strategy for downstream data adaptation; (iii) when evaluating on different tumor segmentation benchmarks, e.g., BraTS2018 for brain tumor segmentation and LiTS2017 for liver tumor segmentation, our approach achieves state-of-the-art segmentation performance under the unsupervised setting. While transferring the model for tumor segmentation under a low-annotation regime, the proposed approach also outperforms all existing self-supervised approaches; (iv) we conduct extensive ablation studies to analyse the critical components in data simulation, and validate the necessity of different proxy tasks. We demonstrate that, with sufficient texture randomization in simulation, a model trained on synthetic data can effortlessly generalise to datasets with real tumors.

NeurIPS Conference 2023 Conference Paper

Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic Segmentation

  • Fei Zhang
  • Tianfei Zhou
  • Boyang Li
  • Hao He
  • Chaofan Ma
  • Tianjiao Zhang
  • Jiangchao Yao
  • Ya Zhang

This paper studies the problem of weakly open-vocabulary semantic segmentation (WOVSS), which learns to segment objects of arbitrary classes using mere image-text pairs. Existing works turn to enhance the vanilla vision transformer by introducing explicit grouping recognition, i.e., employing several group tokens/centroids to cluster the image tokens and perform the group-text alignment. Nevertheless, these methods suffer from a granularity inconsistency regarding the usage of group tokens, which are aligned in an all-to-one vs. one-to-one manner during the training and inference phases, respectively. We argue that this discrepancy arises from the lack of elaborate supervision for each group token. To bridge this granularity gap, this paper explores explicit supervision for the group tokens from prototypical knowledge. To this end, this paper proposes the non-learnable prototypical regularization (NPR), where non-learnable prototypes are estimated from source features to serve as supervision and enable contrastive matching of the group tokens. This regularization encourages the group tokens to segment objects with less redundancy and capture more comprehensive semantic regions, leading to increased compactness and richness. Based on NPR, we propose the prototypical guidance segmentation network (PGSeg) that incorporates multi-modal regularization by leveraging prototypical sources from both images and texts at different levels, progressively enhancing the segmentation capability with diverse prototypical patterns. Experimental results show that our proposed method achieves state-of-the-art performance on several benchmark datasets.

NeurIPS Conference 2021 Conference Paper

Collaborative Uncertainty in Multi-Agent Trajectory Forecasting

  • Bohan Tang
  • Yiqi Zhong
  • Ulrich Neumann
  • Gang Wang
  • Siheng Chen
  • Ya Zhang

Uncertainty modeling is critical in trajectory-forecasting systems for both interpretation and safety reasons. To better predict the future trajectories of multiple agents, recent works have introduced interaction modules to capture interactions among agents. This approach leads to correlations among the predicted trajectories. However, the uncertainty brought by such correlations is neglected. To fill this gap, we propose a novel concept, collaborative uncertainty (CU), which models the uncertainty resulting from the interaction module. We build a general CU-based framework to make a prediction model learn the future trajectory and the corresponding uncertainty. The CU-based framework is integrated as a plugin module into current state-of-the-art (SOTA) systems and deployed in two special cases based on multivariate Gaussian and Laplace distributions. In each case, we conduct extensive experiments on two synthetic datasets and two public, large-scale benchmarks of trajectory forecasting. The results are promising: 1) The results on synthetic datasets show that the CU-based framework allows the model to nicely rebuild the ground-truth distribution. 2) The results on trajectory forecasting benchmarks demonstrate that the CU-based framework steadily helps SOTA systems improve their performance. Specifically, the proposed CU-based framework helps VectorNet improve by 57 cm in terms of Final Displacement Error on the nuScenes dataset. 3) The visualization results of CU illustrate that the value of CU is highly related to the amount of interactive information among agents.
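
As a rough illustration of training a trajectory predictor to output a distribution rather than a point estimate, the Gaussian special case can be sketched with a negative log-likelihood over a predicted mean and covariance. This is only a generic sketch; the paper's collaborative-uncertainty parameterization of cross-agent correlations is more structured than this.

```python
# Illustrative multivariate-Gaussian NLL for trajectory prediction with a learned
# covariance (generic sketch; not the paper's CU parameterization).
import torch

def gaussian_nll(mu, scale_tril, target):
    # mu: (B, D) predicted means; scale_tril: (B, D, D) lower-triangular Cholesky
    # factors of the predicted covariances; target: (B, D) ground-truth positions.
    dist = torch.distributions.MultivariateNormal(mu, scale_tril=scale_tril)
    return -dist.log_prob(target).mean()
```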

AAAI Conference 2021 Conference Paper

Invariant Teacher and Equivariant Student for Unsupervised 3D Human Pose Estimation

  • Chenxin Xu
  • Siheng Chen
  • Maosen Li
  • Ya Zhang

We propose a novel method based on a teacher-student learning framework for 3D human pose estimation without any 3D annotation or side information. To solve this unsupervised learning problem, the teacher network adopts pose-dictionary-based modeling for regularization to estimate a physically plausible 3D pose. To handle the decomposition ambiguity in the teacher network, we propose a cycle-consistent architecture promoting a 3D rotation-invariant property to train the teacher network. To further improve the estimation accuracy, the student network adopts a novel graph convolution network for flexibility to directly estimate the 3D coordinates. Another cycle-consistent architecture promoting a 3D rotation-equivariant property is adopted to exploit geometry consistency, together with knowledge distillation from the teacher network to improve the pose estimation performance. We conduct extensive experiments on Human3.6M and MPI-INF-3DHP. Our method reduces the 3D joint prediction error by 11.4% compared to state-of-the-art unsupervised methods and also outperforms many weakly-supervised methods that use side information on Human3.6M. Code will be available at https://github.com/sjtuxcx/ITES.

JBHI Journal 2020 Journal Article

A Novel MKL Method for GBM Prognosis Prediction by Integrating Histopathological Image and Multi-Omics Data

  • Ya Zhang
  • Ao Li
  • Jie He
  • Minghui Wang

Glioblastoma multiforme (GBM) is one of the most malignant brain tumors, with very short expected survival. To improve patients' clinical treatment and their quality of life after surgery, researchers have developed numerous in silico models and tools for predicting GBM prognosis based on molecular datasets, with great success. However, pathology still plays the most critical role in cancer diagnosis and prognosis in the clinic at present. Recent advances in storing and processing histopathological images have drawn the attention of researchers. Models based on histopathological images have been developed and show great potential for computer-aided pathological diagnosis. However, models based on both molecular data and histopathological images that can predict GBM prognosis with high accuracy are not yet available. In our previous research, we successfully used a simple MKL method to integrate multi-omics data to improve GBM prognosis prediction. In this paper, we develop a novel multiple kernel learning (MKL) method, named histopathological integrating multiple kernel learning (HI-MKL), that can integrate both histopathological images and multi-omics data efficiently. Using datasets from The Cancer Genome Atlas project, we build a system that predicts GBM prognosis with high accuracy. Our research shows that HI-MKL is an accurate, robust, and generalizable MKL method, which performs well on the GBM prognosis task.

NeurIPS Conference 2020 Conference Paper

Graph Cross Networks with Vertex Infomax Pooling

  • Maosen Li
  • Siheng Chen
  • Ya Zhang
  • Ivor Tsang

We propose a novel graph cross network (GXN) to achieve comprehensive feature learning from multiple scales of a graph. Based on trainable hierarchical representations of a graph, GXN enables the interchange of intermediate features across scales to promote information flow. Two key ingredients of GXN include a novel vertex infomax pooling (VIPool), which creates multiscale graphs in a trainable manner, and a novel feature-crossing layer, enabling feature interchange across scales. The proposed VIPool selects the most informative subset of vertices based on the neural estimation of mutual information between vertex features and neighborhood features. The intuition is that a vertex is informative when it can maximally reflect its neighboring information. The proposed feature-crossing layer fuses intermediate features between two scales for mutual enhancement by improving information flow and enriching multiscale features at hidden layers. The cross shape of the feature-crossing layer distinguishes GXN from many other multiscale architectures. Experimental results show that the proposed GXN improves the classification accuracy by 2.12% and 1.15% on average for graph classification and vertex classification, respectively. Based on the same network, the proposed VIPool consistently outperforms other graph-pooling methods.

AAAI Conference 2019 Conference Paper

Safeguarded Dynamic Label Regression for Noisy Supervision

  • Jiangchao Yao
  • Hao Wu
  • Ya Zhang
  • Ivor W. Tsang
  • Jun Sun

Learning with noisy labels is imperative in the Big Data era since it reduces expensive labor on accurate annotations. A previous approach, learning with a noise transition, enjoys theoretical guarantees when applied to the class-conditional noise scenario. However, this approach critically depends on an accurate pre-estimated noise transition, which is usually impractical. A subsequent improvement adapts the pre-estimation in the form of a Softmax layer during training. However, the parameters in the Softmax layer are heavily tweaked, yield fragile performance, and easily get stuck in undesired local minima. To overcome this issue, we propose a Latent Class-Conditional Noise model (LCCN) that models the noise transition in a Bayesian form. By projecting the noise transition into a Dirichlet-distributed space, the learning is constrained on a simplex instead of an ad hoc parametric space. Furthermore, we specially deduce a dynamic label regression method for LCCN to iteratively infer the latent true labels and jointly train the classifier and model the noise. Our approach theoretically safeguards the bounded update of the noise transition, which avoids arbitrary tuning via a batch of samples. Extensive experiments have been conducted on controllable noise data with the CIFAR-10 and CIFAR-100 datasets, and on agnostic noise data with the Clothing1M and WebVision17 datasets. Experimental results have demonstrated that the proposed model outperforms several state-of-the-art methods.

AAAI Conference 2019 Conference Paper

Understanding VAEs in Fisher-Shannon Plane

  • Huangjie Zheng
  • Jiangchao Yao
  • Ya Zhang
  • Ivor W. Tsang
  • Jia Wang

In information theory, Fisher information and Shannon information (entropy) are respectively used to quantify the uncertainty associated with distribution modeling and the uncertainty in specifying the outcome of given variables. These two quantities are complementary and are jointly applied to information behavior analysis in most cases. The uncertainty property in information asserts a fundamental trade-off between Fisher information and Shannon information, which sheds light on the relationship between the encoder and the decoder in variational auto-encoders (VAEs). In this paper, we investigate VAEs in the Fisher-Shannon plane, and demonstrate that representation learning and log-likelihood estimation are intrinsically related to these two information quantities. Through extensive qualitative and quantitative experiments, we provide a better understanding of VAEs in tasks such as high-resolution reconstruction and representation learning from the perspective of Fisher information and Shannon information. We further propose a variant of VAEs, termed the Fisher auto-encoder (FAE), for practical needs to balance Fisher information and Shannon information. Our experimental results have demonstrated its promise in improving reconstruction accuracy and avoiding the non-informative latent codes observed in previous works.

IJCAI Conference 2018 Conference Paper

Collaborative Learning for Weakly Supervised Object Detection

  • Jiajie Wang
  • Jiangchao Yao
  • Ya Zhang
  • Rui Zhang

Weakly supervised object detection has recently received much attention, since it only requires image-level labels instead of the bounding-box labels consumed in strongly supervised learning. Nevertheless, the savings in labeling expense usually come at the cost of model accuracy. In this paper, we propose a simple but effective weakly supervised collaborative learning framework to resolve this problem, which trains a weakly supervised learner and a strongly supervised learner jointly by enforcing partial feature sharing and prediction consistency. For object detection, taking a WSDDN-like architecture as the weakly supervised detector sub-network and a Faster-RCNN-like architecture as the strongly supervised detector sub-network, we propose an end-to-end Weakly Supervised Collaborative Detection Network. As there is no strong supervision available to train the Faster-RCNN-like sub-network, a new prediction consistency loss is defined to enforce consistency of predictions between the two sub-networks as well as within the Faster-RCNN-like sub-network. At the same time, the two detectors are designed to partially share features to further guarantee model consistency at the perceptual level. Extensive experiments on the PASCAL VOC 2007 and 2012 datasets have demonstrated the effectiveness of the proposed framework.

NeurIPS Conference 2018 Conference Paper

Masking: A New Perspective of Noisy Supervision

  • Bo Han
  • Jiangchao Yao
  • Gang Niu
  • Mingyuan Zhou
  • Ivor Tsang
  • Ya Zhang
  • Masashi Sugiyama

It is important to learn various types of classifiers given training data with noisy labels. Noisy labels, in the most popular noise model hitherto, are corrupted from ground-truth labels by an unknown noise transition matrix. Thus, by estimating this matrix, classifiers can escape from overfitting those noisy labels. However, such estimation is practically difficult, due to either the indirect nature of two-step approaches or insufficient data to afford end-to-end approaches. In this paper, we propose a human-assisted approach called ''Masking'' that conveys human cognition of invalid class transitions and naturally speculates the structure of the noise transition matrix. To this end, we derive a structure-aware probabilistic model incorporating a structure prior, and solve the challenges of structure extraction and structure alignment. Thanks to Masking, we only estimate unmasked noise transition probabilities and the burden of estimation is tremendously reduced. We conduct extensive experiments on CIFAR-10 and CIFAR-100 with three noise structures as well as the industrial-level Clothing1M with agnostic noise structure, and the results show that Masking can improve the robustness of classifiers significantly.
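
The core idea, estimating only the noise-transition entries that a human-provided structure allows, can be illustrated with a simple masked, row-normalized transition matrix; the paper's structure-aware probabilistic model is richer than this sketch, and the helper names are hypothetical.

```python
# Hedged sketch of "Masking": a boolean mask of plausible class transitions restricts
# the noise transition matrix, so only unmasked entries carry learnable probability mass.
import torch
import torch.nn.functional as F

def masked_transition(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # logits: (C, C) free parameters; mask: (C, C) bool, True where a transition is
    # allowed (each row must allow at least one entry, e.g. the diagonal).
    logits = logits.masked_fill(~mask, float("-inf"))
    return F.softmax(logits, dim=1)  # row-stochastic noise transition matrix

def noisy_posterior(clean_probs: torch.Tensor, transition: torch.Tensor) -> torch.Tensor:
    # p(noisy label | x) = p(clean label | x) @ transition, shape (B, C) @ (C, C).
    return clean_probs @ transition
```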

TIST Journal 2017 Journal Article

Stopping Criterion for Active Learning with Model Stability

  • Yexun Zhang
  • Wenbin Cai
  • Wenquan Wang
  • Ya Zhang

Active learning selectively labels the most informative instances, aiming to reduce the cost of data annotation. While much effort has been devoted to active sampling functions, relatively limited attention has been paid to when the learning process should stop. In this article, we focus on the stopping criterion of active learning and propose a model-stability-based criterion, that is, when a model does not change with inclusion of additional training instances. The challenge lies in how to measure the model change without labeling additional instances and training new models. Inspired by the stochastic gradient update rule, we use the gradient of the loss function at each candidate example to measure its effect on model change. We propose to stop active learning when the model change brought by any of the remaining unlabeled examples is lower than a given threshold. We apply the proposed stopping criterion to two popular classifiers: logistic regression (LR) and support vector machines (SVMs). In addition, we theoretically analyze the stability and generalization ability of the model obtained by our stopping criterion. Substantial experiments on various UCI benchmark datasets and ImageNet datasets have demonstrated that the proposed approach is highly effective.
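
A minimal sketch of such a stopping check for logistic regression is given below. Because candidate examples are unlabeled, this sketch uses the expected gradient length under the model's own predictive distribution; that choice, like the function names, is an assumption rather than the article's exact definition.

```python
# Hedged sketch of a model-stability stopping criterion for logistic regression:
# stop when no remaining unlabeled example could induce a gradient update larger
# than a threshold (expected gradient length is an assumed surrogate).
import numpy as np

def expected_gradient_norm(w, x):
    """Expected norm of the log-loss gradient (p - y) * x over the model's prediction."""
    p = 1.0 / (1.0 + np.exp(-w @ x))  # P(y = 1 | x)
    return p * np.linalg.norm((p - 1.0) * x) + (1.0 - p) * np.linalg.norm(p * x)

def should_stop(w, unlabeled_pool, threshold=1e-3):
    return all(expected_gradient_norm(w, x) < threshold for x in unlabeled_pool)
```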

IS Journal 2014 Journal Article

Behavior Informatics: A New Perspective

  • Longbing Cao
  • Thorsten Joachims
  • Can Wang
  • Eric Gaussier
  • Jinjiu Li
  • Yuming Ou
  • Dan Luo
  • Reza Zafarani

This installment of Trends & Controversies provides an array of perspectives on the latest research in behavior informatics. Longbing Cao introduces the work in "Behavior Informatics: A New Perspective." Then, in "Behavior Computing," Longbing Cao and Thorsten Joachims provide a basic overview of the topic. Next is "Coupled Behavior Representation, Modeling, Analysis, and Reasoning" by Can Wang, Longbing Cao, Eric Gaussier, Jinjiu Li, Yuming Ou, and Dan Luo. The fourth article is "Behavior Analysis in Social Media," by Reza Zafarani and Huan Liu. The fifth article is "Group Recommendation and Behavior," by Guandong Xu and Zhiang Wu. Gabriella Pasi wrote the sixth article, "Web Search and Behavior." The seventh article, "Behaviors of IPTV Users," is by Ya Zhang, Xiaokang Yang, and Hongyuan Zha. Finally, "Should Behavioral Models of Terror Groups Be Disclosed?" is by Edoardo Serra and V. S. Subrahmanian.

AAAI Conference 2014 Conference Paper

Feature Selection at the Discrete Limit

  • Miao Zhang
  • Chris Ding
  • Ya Zhang
  • Feiping Nie

Feature selection plays an important role in many machine learning and data mining applications. In this paper, we propose to use the L2,p norm for feature selection, with emphasis on small p. As p → 0, feature selection becomes a discrete feature selection problem. We provide two algorithms, a proximal gradient algorithm and a rank-one update algorithm, the latter being more efficient at large regularization λ. We provide closed-form solutions of the proximal operator at p = 0 and p = 1/2. Experiments on real-life datasets show that features selected at small p consistently outperform features selected at p = 1, the standard L2,1 approach, and other popular feature selection methods.
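
Written out, the regularized objective the abstract refers to takes the following form; the notation (least-squares data term, rows of W indexing features) is an assumption for concreteness, since the abstract does not fix it.

```latex
% L_{2,p}-regularized feature selection (assumed notation): row-wise sparsity of W
% selects features, and small p pushes the problem toward discrete selection.
\min_{W}\; \|XW - Y\|_F^2 \;+\; \lambda \|W\|_{2,p}^{p},
\qquad
\|W\|_{2,p}^{p} \;=\; \sum_{i=1}^{d} \Bigl(\sum_{j} W_{ij}^{2}\Bigr)^{p/2}
\;=\; \sum_{i=1}^{d} \|w^{i}\|_{2}^{p}.
```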

TIST Journal 2013 Journal Article

Reorder user's tweets

  • Keyi Shen
  • Jianmin Wu
  • Ya Zhang
  • Yiping Han
  • Xiaokang Yang
  • Li Song
  • Xiao Gu

Twitter displays the tweets a user received in reverse chronological order, which is not always the best choice. As Twitter is full of messages of very different qualities, many informative or relevant tweets might be flooded or displayed at the bottom while some nonsense buzz might be ranked higher. In this work, we present a supervised learning method for personalized tweet reordering based on user interests. User activities on Twitter, in terms of tweeting, retweeting, and replying, are leveraged to obtain the training data for reordering models. Through exploring a rich set of social and personalized features, we model the relevance of tweets by minimizing the pairwise loss of relevant and irrelevant tweets. The tweets are then reordered according to the predicted relevance scores. Experimental results with real Twitter user activities demonstrate the effectiveness of our method. The new method achieved an accuracy gain of more than 30% over Twitter's default time-based ordering.
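
The pairwise formulation mentioned in the abstract can be sketched with a simple logistic pairwise loss over linear relevance scores; the feature design and the specific surrogate loss are assumptions here, not details given in the abstract.

```python
# Hedged sketch of pairwise learning-to-rank for tweet reordering: push relevant tweets
# to score above irrelevant ones under a linear model (loss choice is assumed).
import numpy as np

def pairwise_logistic_loss(w, relevant_feats, irrelevant_feats):
    # Each row of relevant_feats / irrelevant_feats is one tweet's feature vector.
    losses = []
    for xr in relevant_feats:
        for xi in irrelevant_feats:
            margin = w @ xr - w @ xi
            losses.append(np.logaddexp(0.0, -margin))  # log(1 + exp(-margin)), stable
    return float(np.mean(losses))
```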

NeurIPS Conference 2005 Conference Paper

Size Regularized Cut for Data Clustering

  • Yixin Chen
  • Ya Zhang
  • Xiang Ji

We present a novel spectral clustering method that enables users to incorporate prior knowledge of the size of clusters into the clustering process. The cost function, which is named size regularized cut (SRcut), is defined as the sum of the inter-cluster similarity and a regularization term measuring the relative size of two clusters. Finding a partition of the data set to minimize SRcut is proved to be NP-complete. An approximation algorithm is proposed to solve a relaxed version of the optimization problem as an eigenvalue problem. Evaluations over different data sets demonstrate that the method is not sensitive to outliers and performs better than normalized cut.