Arrow Research search

Author name cluster

Yanfeng Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

28 papers
2 author rows

Possible papers

28

AAAI Conference 2026 Conference Paper

MedS³: Towards Medical Slow Thinking with Self-Evolved Soft Dual-sided Process Supervision

  • Shuyang Jiang
  • Yusheng Liao
  • Zhe Chen
  • Ya Zhang
  • Yanfeng Wang
  • Yu Wang

Medical language models face critical barriers to real-world clinical reasoning applications. Mainstream efforts, however, fall short in task coverage, lack fine-grained supervision for intermediate reasoning steps, and rely on proprietary systems, leaving them far from a versatile, credible, and efficient language model for clinical reasoning. To this end, we propose MedS3, a self-evolving framework that imparts robust reasoning capabilities to small, deployable models. Starting with 8,000 curated instances sampled via a curriculum strategy across five medical domains and 16 datasets, we use a small base policy model to conduct Monte Carlo Tree Search (MCTS) for constructing rule-verifiable reasoning trajectories. Self-explored reasoning trajectories ranked by node values are used to bootstrap the policy model via reinforcement fine-tuning and preference learning. Moreover, we introduce a soft dual process reward model that incorporates value dynamics: steps that degrade node value are penalized, enabling fine-grained identification of reasoning errors even when the final answer is correct. Experiments on eleven benchmarks show that MedS3 outperforms the previous state-of-the-art medical model by +6.45 accuracy points and surpasses 32B-scale general-purpose reasoning models by +8.57 points. Additional empirical analysis further demonstrates that MedS3 achieves robust and faithful reasoning behavior.
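
The value-dynamics reward is the most concrete mechanism in this abstract, so a minimal sketch may help. It assumes scalar value estimates attached to each MCTS node along one trajectory; the function name, penalty scheme, and outcome bonus are illustrative, not the paper's implementation.

    # Hedged sketch (not the authors' code): soft step-level rewards from MCTS
    # node values, penalizing value-degrading steps even when the final answer
    # is correct, as the MedS3 abstract describes.
    def soft_step_rewards(node_values, final_correct, penalty=0.5):
        """node_values: value estimates V(s_0)..V(s_T) along one trajectory.
        Returns one reward per reasoning step s_t -> s_{t+1}."""
        rewards = []
        for prev, curr in zip(node_values, node_values[1:]):
            delta = curr - prev
            # a step that lowers the node value is penalized, giving a
            # fine-grained error signal independent of the final answer
            rewards.append(penalty * delta if delta < 0 else delta)
        if rewards:
            rewards[-1] += 1.0 if final_correct else -1.0  # outcome signal
        return rewards

    print(soft_step_rewards([0.2, 0.5, 0.3, 0.8], final_correct=True))
    # roughly [0.3, -0.1, 1.5]: the value-degrading middle step is penalized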

AAAI Conference 2026 Conference Paper

Versatile Vision-Language Model for 3D Computed Tomography

  • Jiayu Lei
  • Ziqing Fan
  • Yanyong Zhang
  • Weidi Xie
  • Ya Zhang
  • Yanfeng Wang

Representation learning serves as a foundational component of medical vision-language models (MVLMs), enabling cross-modal alignment, semantic consistency, and enhanced generalization capabilities for downstream tasks. As generalist models rapidly evolve, there is a pressing need to unify diverse downstream tasks, such as diagnosis, segmentation, report generation, and multiple-choice question answering, within a cohesive framework, demanding more efficient and versatile visual representation learning. However, current MVLMs predominantly follow CLIP-style vision pretraining, failing to leverage heterogeneous data resources with multi-dimensional imaging and diverse annotation forms. Moreover, there is no systematic analysis of efficient vision encoder design across varied downstream applications, including diagnosis, segmentation, and text generation tasks, particularly for volumetric imaging like Computed Tomography (CT). In addition, current MVLMs exhibit constrained voxel-level capabilities and lack an effective multi-task instruction-tuning framework capable of achieving robust performance across various downstream tasks. To address these challenges, we propose CTInstruct, a novel MVLM employing a hybrid ResNet-ViT encoder with multi-granular vision-language pretraining for efficient heterogeneous data modeling, and unified instruction tuning that jointly optimizes discriminative, generative, and voxel-level reasoning for volumetric medical imaging. CTInstruct achieves SOTA performance across 8 CT benchmarks, setting a new standard for data-efficient multimodal learning in medical imaging.

JBHI Journal 2025 Journal Article

Interpretable Brain MRI Report Generation Anchored by Lesion Topography

  • Jiayu Lei
  • Xiaoman Zhang
  • Chaoyi Wu
  • Lisong Dai
  • Ya Zhang
  • Yanyong Zhang
  • Yanfeng Wang
  • Weidi Xie

Radiologists face increasing workloads that make accurate and timely report generation both critical and challenging. This paper presents a novel system for grounded automatic brain MRI report generation, with contributions in three key areas: First, we release RadGenome-Brain MRI, a benchmark dataset featuring multi-modal scans, expert-annotated abnormality masks, and radiology reports with region-level grounding to support fine-grained, explainable report generation. Second, we propose AutoRG-Brain, the first brain MRI report generation framework that combines automatic anomaly segmentation with a visual prompting-based language model to produce structured, anatomically grounded findings. Third, we conduct extensive quantitative and expert evaluations across segmentation and reporting tasks, and demonstrate in real clinical settings that our system significantly enhances junior radiologists' ability to detect subtle abnormalities and compose high-quality reports, narrowing the gap with senior doctors. All code, models, and datasets will be publicly released to facilitate future research and development.

NeurIPS Conference 2025 Conference Paper

Learning to Instruct for Visual Instruction Tuning

  • Zhihan Zhou
  • Feng Hong
  • Jiaan Luo
  • Yushi Ye
  • Jiangchao Yao
  • Dongsheng Li
  • Bo Han
  • Ya Zhang

We propose L2T, an advancement of visual instruction tuning (VIT). While VIT equips Multimodal LLMs (MLLMs) with promising multimodal capabilities, the current design choices for VIT often result in overfitting and shortcut learning, potentially degrading performance. This gap arises from an overemphasis on instruction-following abilities, while neglecting the proactive understanding of visual information. Inspired by this, L2T adopts a simple yet effective approach by incorporating the loss function into both the instruction and response sequences. It seamlessly expands the training data and regularizes the MLLMs against over-reliance on language priors. Based on this merit, L2T achieves a significant relative improvement of up to 9% on comprehensive multimodal benchmarks, requiring no additional training data and incurring negligible computational overhead. Surprisingly, L2T attains exceptional fundamental visual capabilities, yielding up to an 18% improvement in captioning performance while simultaneously alleviating hallucination in MLLMs. GitHub code: https://github.com/Feng-Hong/L2T.
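
The key design change named here, computing the loss over instruction tokens as well as response tokens, is small enough to show directly. Below is a minimal PyTorch sketch assuming token-level cross-entropy and a boolean mask over instruction positions; the shapes and flag name are illustrative, not the authors' code.

    import torch
    import torch.nn.functional as F

    def vit_loss(logits, labels, instruction_mask, mask_instructions=True):
        """logits: (B, T, V); labels: (B, T); instruction_mask: (B, T) bool,
        True where a token belongs to the instruction."""
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            labels.reshape(-1),
            reduction="none",
        ).reshape(labels.shape)
        if mask_instructions:
            # conventional VIT: supervise response tokens only
            loss = loss[~instruction_mask]
        # L2T-style (mask_instructions=False): keep instruction tokens in the
        # loss too, so the model must also predict, i.e. understand, the
        # instruction given the image
        return loss.mean()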

JBHI Journal 2025 Journal Article

Prototype-Driven Hard-Sample Contrastive Learning for Camera-Based Respiratory Imaging Analysis

  • Dongmin Huang
  • Ming Xia
  • Liping Pan
  • Qiqiong Wang
  • Xiaoyan Song
  • Xiaoting Tao
  • Yanfeng Wang
  • Kun Qiao

Respiratory spatial patterns describe the distribution and dynamics of lung conditions, and monitoring their asymmetry or irregularities enables a more comprehensive assessment of respiratory function. The feasibility of combining camera pixel-array sensing with machine learning for analyzing respiratory spatial patterns has been demonstrated; however, this approach faces challenges in patient generalization due to limited clinical data and individual respiratory variability. Data augmentation methods may address this by synthesizing new data, but they risk destroying the symmetry or regular semantic information of respiratory patterns. To address this, we propose a prototype-driven hard-sample contrastive learning (PHCL) method tailored for camera-based respiratory imaging analysis. It first separates samples into simple and hard-to-learn samples using prototypes and a Gini-index distance measurement. It then synthesizes new features by blending simple samples from one class with hard samples from another to construct a transition boundary between classes, broadening the feature distribution. Finally, it employs contrastive learning to emphasize feature consistency between prototypes and hard same-class samples from different subjects, mitigating individual respiratory variability and refining class boundaries. Extensive experiments were conducted in the neonatal intensive care unit and the thoracic surgery department, where PHCL outperforms image augmentation and advanced feature augmentation methods by 1-10% in both accuracy and F1-score. Our work provides valuable insights into the analysis of asymmetric and irregular respiratory activities.
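
The blending step described above admits a tiny sketch, assuming 'simple' and 'hard' samples have already been separated and featurized; the mixing range below is a made-up choice, not the paper's.

    import torch

    def blend_features(simple_feat, hard_feat, low=0.4, high=0.6):
        """simple_feat: (B, D) features of simple samples from one class;
        hard_feat: (B, D) features of hard samples from another class.
        Returns synthetic features near the transition boundary."""
        lam = torch.empty(simple_feat.size(0), 1).uniform_(low, high)
        return lam * simple_feat + (1 - lam) * hard_feat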

NeurIPS Conference 2025 Conference Paper

RAD: Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis

  • Haolin Li
  • Tianjie Dai
  • Zhe Chen
  • Siyuan Du
  • Jiangchao Yao
  • Ya Zhang
  • Yanfeng Wang

Clinical diagnosis is a highly specialized discipline requiring both domain expertise and strict adherence to rigorous guidelines. While current AI-driven medical research predominantly focuses on knowledge graphs or natural text pretraining paradigms to incorporate medical knowledge, these approaches primarily rely on implicitly encoded knowledge within model parameters, neglecting the task-specific knowledge required by diverse downstream tasks. To address this limitation, we propose Retrieval-Augmented Diagnosis (RAD), a novel framework that explicitly injects external knowledge into multimodal models directly on downstream tasks. Specifically, RAD operates through three key mechanisms: retrieval and refinement of disease-centered knowledge from multiple medical sources, a guideline-enhanced contrastive loss that constrains the latent distance between multi-modal features and guideline knowledge, and a dual transformer decoder that employs guidelines as queries to steer cross-modal fusion, aligning the model with clinical diagnostic workflows from guideline acquisition to feature extraction and decision-making. Moreover, recognizing the lack of quantitative evaluation of interpretability for multimodal diagnostic models, we introduce a set of criteria to assess interpretability from both image and text perspectives. Extensive evaluations across four datasets with different anatomies demonstrate RAD's generalizability, achieving state-of-the-art performance. Furthermore, RAD enables the model to concentrate more precisely on abnormal regions and critical indicators, ensuring evidence-based, trustworthy diagnosis. Our code is available at https://github.com/tdlhl/RAD.
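
Of the three mechanisms listed, the guideline-enhanced contrastive loss is the easiest to sketch. The following is an illustrative symmetric InfoNCE between fused case features and retrieved guideline embeddings; the names, shapes, and temperature are assumptions, not the released code.

    import torch
    import torch.nn.functional as F

    def guideline_contrastive_loss(case_feats, guideline_feats, temperature=0.07):
        """case_feats: (B, D) fused image-text features; guideline_feats: (B, D)
        embeddings of each case's retrieved disease guideline."""
        case = F.normalize(case_feats, dim=-1)
        guide = F.normalize(guideline_feats, dim=-1)
        logits = case @ guide.t() / temperature        # (B, B) similarities
        targets = torch.arange(case.size(0), device=case.device)
        # pull each case toward its own guideline, push from the others,
        # in both directions
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))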

NeurIPS Conference 2025 Conference Paper

SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation

  • Zhenjie Mao
  • Yang Yuhuan
  • Chaofan Ma
  • Dongsheng Jiang
  • Jiangchao Yao
  • Ya Zhang
  • Yanfeng Wang

Referring Image Segmentation (RIS) aims to segment the target object in an image given a natural language expression. While recent methods leverage pre-trained vision backbones and larger training corpora to achieve impressive results, they predominantly focus on simple expressions—short, clear noun phrases like “red car” or “left girl”. This simplification often reduces RIS to a key word/concept matching problem, limiting the model’s ability to handle referential ambiguity in expressions. In this work, we identify two challenging real-world scenarios: object-distracting expressions, which involve multiple entities with contextual cues, and category-implicit expressions, where the object class is not explicitly stated. To address the challenges, we propose a novel framework, SaFiRe, which mimics the human two-phase cognitive process—first forming a global understanding, then refining it through detail-oriented inspection. This is naturally supported by Mamba’s scan-then-update property, which aligns with our phased design and enables efficient multi-cycle refinement with linear complexity. We further introduce aRefCOCO, a new benchmark designed to evaluate RIS models under ambiguous referring expressions. Extensive experiments on both standard and proposed datasets demonstrate the superiority of SaFiRe over state-of-the-art baselines. Project page: https://zhenjiemao.github.io/SaFiRe/.

NeurIPS Conference 2025 Conference Paper

Universal Video Temporal Grounding with Generative Multi-modal Large Language Models

  • Zeqian Li
  • Shangzhe Di
  • Zhonghua Zhai
  • Weilin Huang
  • Yanfeng Wang
  • Weidi Xie

This paper presents a computational model for universal video temporal grounding, which accurately localizes temporal moments in videos based on natural language queries (e.g., questions or descriptions). Unlike existing methods that are often limited to specific video domains or durations, we propose UniTime, a robust and universal video grounding model leveraging the strong vision-language understanding capabilities of generative Multi-modal Large Language Models (MLLMs). Our model effectively handles videos of diverse views, genres, and lengths while comprehending complex language queries. The key contributions include: (i) We consider steering strong MLLMs for temporal grounding in videos. To enable precise timestamp outputs, we incorporate temporal information by interleaving timestamp tokens with video tokens. (ii) By training the model to handle videos with different input granularities through adaptive frame scaling, our approach achieves robust temporal grounding for both short and long videos. (iii) Comprehensive experiments show that UniTime outperforms state-of-the-art approaches in both zero-shot and dataset-specific finetuned settings across five public temporal grounding benchmarks. (iv) When employed as a preliminary moment retriever for long-form video question-answering (VideoQA), UniTime significantly improves VideoQA accuracy, highlighting its value for complex video understanding tasks.
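
Contribution (i), interleaving timestamp tokens with video tokens, can be shown with a toy sketch; the token format and fps handling are invented for illustration.

    def interleave_timestamps(frame_tokens, fps=1.0):
        """frame_tokens: list of per-frame token lists. Returns one sequence
        with a textual timestamp token preceding each frame's tokens."""
        sequence = []
        for i, tokens in enumerate(frame_tokens):
            sequence.append(f"<t={i / fps:.1f}s>")  # timestamp for this frame
            sequence.extend(tokens)
        return sequence

    print(interleave_timestamps([["v00", "v01"], ["v10", "v11"]], fps=2.0))
    # ['<t=0.0s>', 'v00', 'v01', '<t=0.5s>', 'v10', 'v11']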

AAAI Conference 2025 Conference Paper

VRVVC: Variable-Rate NeRF-Based Volumetric Video Compression

  • Qiang Hu
  • Houqiang Zhong
  • Zihan Zheng
  • Xiaoyun Zhang
  • Zhengxue Cheng
  • Li Song
  • Guangtao Zhai
  • Yanfeng Wang

Neural Radiance Field (NeRF)-based volumetric video has revolutionized visual media by delivering photorealistic Free-Viewpoint Video (FVV) experiences that provide audiences with unprecedented immersion and interactivity. However, the substantial data volumes pose significant challenges for storage and transmission. Existing solutions typically optimize NeRF representation and compression independently or focus on a single fixed rate-distortion (RD) tradeoff. In this paper, we propose VRVVC, a novel end-to-end, jointly optimized variable-rate framework for volumetric video compression that achieves variable bitrates using a single model while maintaining superior RD performance. Specifically, VRVVC introduces a compact tri-plane implicit residual representation for inter-frame modeling of long-duration dynamic scenes, effectively reducing temporal redundancy. We further propose a variable-rate residual representation compression scheme that leverages a learnable quantization and a tiny MLP-based entropy model. This approach enables variable bitrates by using predefined Lagrange multipliers to manage the quantization error of all latent representations. Finally, we present an end-to-end progressive training strategy combined with a multi-rate-distortion loss function to optimize the entire framework. Extensive experiments demonstrate that VRVVC achieves a wide range of variable bitrates within a single model and surpasses the RD performance of existing methods across various datasets.
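
A hedged sketch of the multi-rate-distortion objective described above, with one predefined Lagrange multiplier per rate point; the multiplier values and tensor shapes are illustrative, not the paper's settings.

    import torch

    def multi_rd_loss(rates, distortions, lambdas=(0.001, 0.005, 0.01, 0.05)):
        """rates, distortions: 1-D tensors with one entry per rate point that
        the single variable-rate model produced in the same forward pass."""
        losses = [r + lam * d for r, d, lam in zip(rates, distortions, lambdas)]
        return torch.stack(losses).mean()  # optimize all rate points jointly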

NeurIPS Conference 2024 Conference Paper

FedLLM-Bench: Realistic Benchmarks for Federated Learning of Large Language Models

  • Rui Ye
  • Rui Ge
  • Xinyu Zhu
  • Jingyi Chai
  • Yaxin Du
  • Yang Liu
  • Yanfeng Wang
  • Siheng Chen

Federated learning has enabled multiple parties to collaboratively train large language models without directly sharing their data (FedLLM). Following this training paradigm, the community has put massive effort into diverse aspects including framework, performance, and privacy. However, an unpleasant fact is that there are currently no realistic datasets and benchmarks for FedLLM, and previous works all rely on artificially constructed datasets, failing to capture properties in real-world scenarios. Addressing this, we propose FedLLM-Bench, which involves 8 training methods, 4 training datasets, and 6 evaluation metrics, to offer a comprehensive testbed for the FedLLM community. FedLLM-Bench encompasses three datasets (e.g., a user-annotated multilingual dataset) for federated instruction tuning and one dataset (e.g., a user-annotated preference dataset) for federated preference alignment, with client counts ranging from 38 to 747. Our datasets incorporate several representative diversities: language, quality, quantity, instruction, length, embedding, and preference, capturing properties in real-world scenarios. Based on FedLLM-Bench, we conduct experiments on all datasets to benchmark existing FL methods and provide empirical insights (e.g., multilingual collaboration). We believe that our FedLLM-Bench can benefit the FedLLM community by reducing required efforts, providing a practical testbed, and promoting fair comparisons. Code and datasets are available at https://github.com/rui-ye/FedLLM-Bench.

NeurIPS Conference 2024 Conference Paper

Language-Driven Interactive Traffic Trajectory Generation

  • Junkai Xia
  • Chenxin Xu
  • Qingyao Xu
  • Yanfeng Wang
  • Siheng Chen

Realistic trajectory generation with natural language control is pivotal for advancing autonomous vehicle technology. However, previous methods focus on individual traffic participant trajectory generation, thus failing to account for the complexity of interactive traffic dynamics. In this work, we propose InteractTraj, the first language-driven traffic trajectory generator that can generate interactive traffic trajectories. InteractTraj interprets abstract trajectory descriptions into concrete, formatted, interaction-aware numerical codes and learns a mapping between these formatted codes and the final interactive trajectories. To interpret language descriptions, we propose a language-to-code encoder with a novel interaction-aware encoding strategy. To produce interactive traffic trajectories, we propose a code-to-trajectory decoder with interaction-aware feature aggregation that synergizes vehicle interactions with the environmental map and vehicle movements. Extensive experiments show that our method outperforms previous SoTA methods, offering a more realistic generation of interactive traffic trajectories with high controllability via diverse natural language commands.

AAAI Conference 2024 Conference Paper

MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models

  • Yan Cai
  • Linlin Wang
  • Ye Wang
  • Gerard de Melo
  • Ya Zhang
  • Yanfeng Wang
  • Liang He

The emergence of various medical large language models (LLMs) in the medical domain has highlighted the need for unified evaluation standards, as manual evaluation of LLMs proves to be time-consuming and labor-intensive. To address this issue, we introduce MedBench, a comprehensive benchmark for the Chinese medical domain, comprising 40,041 questions sourced from authentic examination exercises and medical reports of diverse branches of medicine. In particular, this benchmark is composed of four key components: the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, the Doctor In-Charge Qualification Examination, and real-world clinic cases encompassing examinations, diagnoses, and treatments. MedBench replicates the educational progression and clinical practice experiences of doctors in Mainland China, thereby establishing itself as a credible benchmark for assessing the mastery of knowledge and reasoning abilities of medical LLMs. We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which culminate in the following findings: (1) Chinese medical LLMs underperform on this benchmark, highlighting the need for significant advances in clinical knowledge and diagnostic precision. (2) Several general-domain LLMs surprisingly possess considerable medical knowledge. These findings elucidate both the capabilities and limitations of LLMs within the context of MedBench, with the ultimate goal of aiding the medical research community.

NeurIPS Conference 2024 Conference Paper

Probabilistic Conformal Distillation for Enhancing Missing Modality Robustness

  • Mengxi Chen
  • Fei Zhang
  • Zihua Zhao
  • Jiangchao Yao
  • Ya Zhang
  • Yanfeng Wang

Multimodal models trained on modality-complete data are plagued with severe performance degradation when encountering modality-missing data. Prevalent cross-modal knowledge distillation-based methods precisely align the representation of modality-missing data and that of its modality-complete counterpart to enhance robustness. However, due to the irreparable information asymmetry, this determinate alignment is too stringent, easily inducing modality-missing features to capture spurious factors erroneously. In this paper, a novel multimodal Probabilistic Conformal Distillation (PCD) method is proposed, which considers the inherent indeterminacy in this alignment. Given a modality-missing input, our goal is to learn the unknown Probability Density Function (PDF) of the mapped variables in the modality-complete space, rather than relying on the brute-force point alignment. Specifically, PCD models the modality-missing feature as a probabilistic distribution, enabling it to satisfy two characteristics of the PDF. One is the extremes of probabilities of modality-complete feature points on the PDF, and the other is the geometric consistency between the modeled distributions and the peak points of different PDFs. Extensive experiments on a range of benchmark datasets demonstrate the superiority of PCD over state-of-the-art methods. Code is available at: https://github.com/mxchen-mc/PCD.
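
As a rough illustration of distribution-level versus point alignment (a generic Gaussian variant, not necessarily PCD's exact formulation): predict a mean and variance for the modality-missing feature, then score how likely the modality-complete feature is under that distribution.

    import torch

    def gaussian_alignment_loss(mu, log_var, complete_feat):
        """mu, log_var: (B, D) predicted Gaussian for the modality-missing
        input; complete_feat: (B, D) feature from the modality-complete branch."""
        # negative log-likelihood (up to an additive constant) of the complete
        # feature under N(mu, exp(log_var)); replaces brute-force point alignment
        nll = 0.5 * (log_var + (complete_feat - mu) ** 2 / log_var.exp())
        return nll.sum(dim=-1).mean()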

NeurIPS Conference 2024 Conference Paper

Revive Re-weighting in Imbalanced Learning by Density Ratio Estimation

  • Jiaan Luo
  • Feng Hong
  • Jiangchao Yao
  • Bo Han
  • Ya Zhang
  • Yanfeng Wang

In deep learning, model performance often deteriorates when trained on highly imbalanced datasets, especially when evaluation metrics require robust generalization across underrepresented classes. To address the challenges posed by imbalanced data distributions, this study introduces a novel method utilizing density ratio estimation for dynamic class weight adjustment, termed Re-weighting with Density Ratio (RDR). Our method adaptively adjusts the importance of each class during training, mitigating overfitting on dominant classes and enhancing model adaptability across diverse datasets. Extensive experiments conducted on various large-scale benchmark datasets validate the effectiveness of our method. Results demonstrate substantial improvements in generalization capabilities, particularly under severely imbalanced conditions.
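
A static toy version of density-ratio re-weighting may clarify the idea: weight each sample's loss by an estimate of p_test(y)/p_train(y). RDR estimates the ratio dynamically during training; the balanced-test-prior assumption below is for illustration only.

    import torch
    import torch.nn.functional as F

    def density_ratio_loss(logits, targets, train_class_counts):
        counts = torch.as_tensor(train_class_counts, dtype=torch.float)
        p_train = counts / counts.sum()
        p_test = torch.full_like(p_train, 1.0 / len(p_train))  # assumed balanced
        weights = (p_test / p_train)[targets]   # density ratio per sample
        per_sample = F.cross_entropy(logits, targets, reduction="none")
        return (weights * per_sample).mean()    # rare classes weigh more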

NeurIPS Conference 2024 Conference Paper

TAIA: Large Language Models are Out-of-Distribution Data Learners

  • Shuyang Jiang
  • Yusheng Liao
  • Ya Zhang
  • Yanfeng Wang
  • Yu Wang

Fine-tuning on task-specific question-answer pairs is a predominant method for enhancing the performance of instruction-tuned large language models (LLMs) on downstream tasks. However, in certain specialized domains, such as healthcare or harmless content generation, it is nearly impossible to obtain a large volume of high-quality data that matches the downstream distribution. To improve the performance of LLMs in data-scarce domains with domain-mismatched data, we re-evaluated the Transformer architecture and discovered that not all parameter updates during fine-tuning contribute positively to downstream performance. Our analysis reveals that within the self-attention and feed-forward networks, only the fine-tuned attention parameters are particularly beneficial when the training set's distribution does not fully align with the test set. Based on this insight, we propose an effective inference-time intervention method: Training All parameters but Inferring with only Attention (TAIA). We empirically validate TAIA using two general instruction-tuning datasets and evaluate it on seven downstream tasks involving math, reasoning, and knowledge understanding across LLMs of different parameter sizes and fine-tuning techniques. Our comprehensive experiments demonstrate that TAIA achieves superior improvements compared to both the fully fine-tuned model and the base model in most scenarios, with significant performance gains. The high tolerance of TAIA to data mismatches makes it resistant to jailbreaking tuning and enhances specialized tasks using general data. Code is available at https://github.com/pixas/TAIA_LLM.
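
The inference-time intervention itself is easy to sketch: after full fine-tuning, keep the tuned weights only for attention modules and revert everything else to the base model. The module-name matching below is hypothetical and model-dependent.

    def taia_merge(base_state, tuned_state, attn_keyword="attn"):
        """base_state, tuned_state: state dicts of the base and fully
        fine-tuned models. Returns a dict using tuned weights for attention
        parameters and base weights elsewhere (e.g., the FFNs)."""
        merged = {}
        for name, base_param in base_state.items():
            if attn_keyword in name:
                merged[name] = tuned_state[name]  # keep fine-tuned attention
            else:
                merged[name] = base_param         # revert all other updates
        return merged

    # usage: model.load_state_dict(taia_merge(base.state_dict(), tuned.state_dict()))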

NeurIPS Conference 2024 Conference Paper

WebUOT-1M: Advancing Deep Underwater Object Tracking with A Million-Scale Benchmark

  • Chunhui Zhang
  • Li Liu
  • Guanjie Huang
  • Hao Wen
  • Xi Zhou
  • Yanfeng Wang

Underwater Object Tracking (UOT) is essential for identifying and tracking submerged objects in underwater videos, but existing datasets are limited in scale, diversity of target categories, and scenarios covered, impeding the development of advanced tracking algorithms. To bridge this gap, we take the first step and introduce WebUOT-1M, i.e., the largest public UOT benchmark to date, sourced from complex and realistic underwater environments. It comprises 1.1 million frames across 1,500 video clips filtered from 408 target categories, largely surpassing previous UOT datasets, e.g., UVOT400. Through meticulous manual annotation and verification, we provide high-quality bounding boxes for underwater targets. Additionally, WebUOT-1M includes language prompts for video sequences, expanding its application areas, e.g., underwater vision-language tracking. Given that most existing trackers are designed for open-air conditions and perform poorly in underwater environments due to domain gaps, we propose a novel framework that uses omni-knowledge distillation to train a student Transformer model effectively. To the best of our knowledge, this framework is the first to effectively transfer open-air domain knowledge to a UOT model through knowledge distillation, as demonstrated by results on both existing UOT datasets and the newly proposed WebUOT-1M. We have thoroughly tested WebUOT-1M with 30 deep trackers, showcasing its potential as a benchmark for future UOT research. The complete dataset, along with codes and tracking results, is publicly accessible at https://github.com/983632847/Awesome-Multimodal-Object-Tracking.

NeurIPS Conference 2023 Conference Paper

AttrSeg: Open-Vocabulary Semantic Segmentation via Attribute Decomposition-Aggregation

  • Chaofan Ma
  • Yang Yuhuan
  • Chen Ju
  • Fei Zhang
  • Ya Zhang
  • Yanfeng Wang

Open-vocabulary semantic segmentation is a challenging task that requires segmenting novel object categories at inference time. Recent works explore vision-language pre-training to handle this task, but suffer from unrealistic assumptions in practical scenarios, i.e., low-quality textual category names. For example, this paradigm assumes that new textual categories will be accurately and completely provided, and will exist in the lexicons used during pre-training. However, exceptions often arise: brief or incomplete names can be ambiguous, new words may be absent from the pre-trained lexicons, and some categories are difficult for users to describe. To address these issues, this work proposes a novel attribute decomposition-aggregation framework, AttrSeg, inspired by human cognition in understanding new concepts. Specifically, in the decomposition stage, we decouple class names into diverse attribute descriptions to complement semantic contexts from multiple perspectives. Two attribute construction strategies are designed: using large language models for common categories, and manual labelling for human-invented categories. In the aggregation stage, we group diverse attributes into an integrated global description, to form a discriminative classifier that distinguishes the target object from others. A hierarchical aggregation architecture is further proposed to achieve multi-level aggregation, leveraging a meticulously designed clustering module. The final result is obtained by computing the similarity between aggregated attributes and image embeddings. To evaluate the effectiveness, we annotate three datasets with attribute descriptions, and conduct extensive experiments and ablation studies. The results show the superior performance of attribute decomposition-aggregation. We refer readers to the latest arXiv version at https://arxiv.org/abs/2309.00096.
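
The final scoring step, matching aggregated attributes against the image embedding, can be sketched simply; mean pooling stands in for the paper's hierarchical aggregation, so treat this as a simplified stand-in.

    import torch
    import torch.nn.functional as F

    def attribute_score(image_emb, attribute_embs):
        """image_emb: (D,) image embedding; attribute_embs: (A, D) embeddings
        of one class's attribute descriptions. Returns a cosine score."""
        class_emb = F.normalize(attribute_embs.mean(0), dim=-1)  # aggregate
        return torch.dot(F.normalize(image_emb, dim=-1), class_emb)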

NeurIPS Conference 2023 Conference Paper

Combating Representation Learning Disparity with Geometric Harmonization

  • Zhihan Zhou
  • Jiangchao Yao
  • Feng Hong
  • Ya Zhang
  • Bo Han
  • Yanfeng Wang

Self-supervised learning (SSL) as an effective paradigm of representation learning has achieved tremendous success on various curated datasets in diverse scenarios. Nevertheless, when facing the long-tailed distribution in real-world applications, it is still hard for existing methods to capture transferable and robust representations. The reason is that vanilla SSL methods pursuing sample-level uniformity easily lead to representation learning disparity, where head classes with huge sample numbers dominate the feature regime while tail classes with small sample numbers passively collapse. To address this problem, we propose a novel Geometric Harmonization (GH) method to encourage category-level uniformity in representation learning, which is more benign to the minority and almost does not hurt the majority under long-tailed distributions. Specifically, GH measures the population statistics of the embedding space on top of self-supervised learning, and then infers a fine-grained instance-wise calibration to constrain the space expansion of head classes and avoid the passive collapse of tail classes. Our proposal does not alter the setting of SSL and can be easily integrated into existing methods in a low-cost manner. Extensive results on a range of benchmark datasets show the effectiveness of GH with high tolerance to distribution skewness.

TMLR Journal 2023 Journal Article

Federated Learning under Partially Disjoint Data via Manifold Reshaping

  • Ziqing Fan
  • Jiangchao Yao
  • Ruipeng Zhang
  • Lingjuan Lyu
  • Yanfeng Wang
  • Ya Zhang

Statistical heterogeneity severely limits the performance of federated learning (FL), motivating several explorations, e.g., FedProx, MOON, and FedDyn, to alleviate this problem. Despite their effectiveness, the scenario they consider generally requires samples from almost all classes during the local training of each client, although some covariate shifts may exist among clients. In fact, the natural case of partially class-disjoint data (PCDD), where each client contributes a few classes (instead of all classes) of samples, is practical yet underexplored. Specifically, the unique collapse and invasion characteristics of PCDD can induce a biased optimization direction in local training, which undermines the efficiency of federated learning. To address this dilemma, we propose a manifold reshaping approach called FedMR to calibrate the feature space of local training. Our FedMR adds two interplaying losses to vanilla federated learning: one is the intra-class loss to decorrelate feature dimensions for anti-collapse, and the other is the inter-class loss to guarantee a proper margin among categories in the feature expansion. We conduct extensive experiments on a range of datasets to demonstrate that our FedMR achieves much higher accuracy and better communication efficiency.
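
The two regularizers admit simple generic forms; the versions below (an off-diagonal correlation penalty and a pairwise margin on class means) are illustrative stand-ins, not FedMR's exact losses.

    import torch
    import torch.nn.functional as F

    def intra_class_decorrelation(feats):
        """feats: (N, D) features of one class; penalizing off-diagonal
        correlation decorrelates feature dimensions (anti-collapse)."""
        z = (feats - feats.mean(0)) / (feats.std(0) + 1e-6)
        corr = (z.t() @ z) / feats.size(0)              # (D, D)
        off_diag = corr - torch.diag(torch.diag(corr))
        return (off_diag ** 2).sum()

    def inter_class_margin(class_means, margin=1.0):
        """class_means: (C, D); push pairwise class distances above a margin."""
        dists = torch.cdist(class_means, class_means)   # (C, C)
        mask = ~torch.eye(len(class_means), dtype=torch.bool)
        return F.relu(margin - dists[mask]).mean()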

NeurIPS Conference 2023 Conference Paper

Federated Learning with Bilateral Curation for Partially Class-Disjoint Data

  • Ziqing Fan
  • Ruipeng Zhang
  • Jiangchao Yao
  • Bo Han
  • Ya Zhang
  • Yanfeng Wang

Partially class-disjoint data (PCDD), a common yet under-explored data formation where each client contributes a part of classes (instead of all classes) of samples, severely challenges the performance of federated algorithms. Without full classes, the local objective will contradict the global objective, yielding the angle collapse problem for locally missing classes and the space waste problem for locally existing classes. As far as we know, none of the existing methods can intrinsically mitigate PCDD challenges to achieve holistic improvement in the bilateral views (both global view and local view) of federated learning. To address this dilemma, we are inspired by the strong generalization of the simplex Equiangular Tight Frame (ETF) on imbalanced data, and propose a novel approach called FedGELA where the classifier is globally fixed as a simplex ETF while locally adapted to the personal distributions. Globally, FedGELA provides fair and equal discrimination for all classes and avoids inaccurate updates of the classifier, while locally it utilizes the space of locally missing classes for locally existing classes. We conduct extensive experiments on a range of datasets to demonstrate that our FedGELA achieves promising performance (average improvement of 3.9% over FedAvg and 1.5% over the best baselines) and provides both local and global convergence guarantees.
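
The globally fixed classifier can be made concrete: the standard simplex-ETF construction W = sqrt(K/(K-1)) * U (I - 11^T/K) yields unit-norm class vectors with equal pairwise angles. The construction below is standard; its use as a fixed federated classifier follows the abstract.

    import torch

    def simplex_etf(num_classes, dim):
        """Returns a (num_classes, dim) matrix whose rows are unit-norm and
        maximally equiangular (pairwise cosine -1/(K-1))."""
        assert dim >= num_classes
        K = num_classes
        U, _ = torch.linalg.qr(torch.randn(dim, K))  # orthonormal columns
        M = torch.eye(K) - torch.ones(K, K) / K      # center away from the 1-vector
        return (K / (K - 1)) ** 0.5 * (U @ M).t()    # (K, dim)

    W = simplex_etf(num_classes=10, dim=64)
    print(torch.allclose(W @ W.t(), (torch.eye(10) - 0.1) * 10 / 9, atol=1e-5))
    # True: unit diagonal, off-diagonal cosine -1/9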

JBHI Journal 2023 Journal Article

Self-Supervised Tumor Segmentation With Sim2Real Adaptation

  • Xiaoman Zhang
  • Weidi Xie
  • Chaoqin Huang
  • Ya Zhang
  • Xin Chen
  • Qi Tian
  • Yanfeng Wang

This paper targets self-supervised tumor segmentation. We make the following contributions: (i) inspired by the observation that tumors are often characterised independently of their contexts, we propose a novel proxy task, “layer-decomposition”, that closely matches the goal of the downstream task, and design a scalable pipeline for generating synthetic tumor data for pre-training; (ii) we propose a two-stage Sim2Real training regime for unsupervised tumor segmentation, where we first pre-train a model with simulated tumors and then adopt a self-training strategy for downstream data adaptation; (iii) when evaluating on different tumor segmentation benchmarks, e.g., BraTS2018 for brain tumor segmentation and LiTS2017 for liver tumor segmentation, our approach achieves state-of-the-art segmentation performance under the unsupervised setting. When transferring the model for tumor segmentation under a low-annotation regime, the proposed approach also outperforms all existing self-supervised approaches; (iv) we conduct extensive ablation studies to analyse the critical components in data simulation, and validate the necessity of different proxy tasks. We demonstrate that, with sufficient texture randomization in simulation, a model trained on synthetic data can effortlessly generalise to datasets with real tumors.

NeurIPS Conference 2023 Conference Paper

Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic Segmentation

  • Fei Zhang
  • Tianfei Zhou
  • Boyang Li
  • Hao He
  • Chaofan Ma
  • Tianjiao Zhang
  • Jiangchao Yao
  • Ya Zhang

This paper studies the problem of weakly open-vocabulary semantic segmentation (WOVSS), which learns to segment objects of arbitrary classes using mere image-text pairs. Existing works turn to enhance the vanilla vision transformer by introducing explicit grouping recognition, i.e., employing several group tokens/centroids to cluster the image tokens and perform the group-text alignment. Nevertheless, these methods suffer from a granularity inconsistency regarding the usage of group tokens, which are aligned in an all-to-one vs. one-to-one manner during the training and inference phases, respectively. We argue that this discrepancy arises from the lack of elaborate supervision for each group token. To bridge this granularity gap, this paper explores explicit supervision for the group tokens from prototypical knowledge. To this end, this paper proposes the non-learnable prototypical regularization (NPR) where non-learnable prototypes are estimated from source features to serve as supervision and enable contrastive matching of the group tokens. This regularization encourages the group tokens to segment objects with less redundancy and capture more comprehensive semantic regions, leading to increased compactness and richness. Based on NPR, we propose the prototypical guidance segmentation network (PGSeg) that incorporates multi-modal regularization by leveraging prototypical sources from both images and texts at different levels, progressively enhancing the segmentation capability with diverse prototypical patterns. Experimental results show that our proposed method achieves state-of-the-art performance on several benchmark datasets.

AAAI Conference 2022 Conference Paper

Handwritten Mathematical Expression Recognition via Attention Aggregation Based Bi-directional Mutual Learning

  • Xiaohang Bian
  • Bo Qin
  • Xiaozhe Xin
  • Jianwu Li
  • Xuefeng Su
  • Yanfeng Wang

Handwritten mathematical expression recognition aims to automatically generate LaTeX sequences from given images. Currently, attention-based encoder-decoder models are widely used in this task. They typically generate target sequences in a left-to-right (L2R) manner, leaving the right-to-left (R2L) contexts unexploited. In this paper, we propose an Attention aggregation based Bi-directional Mutual learning Network (ABM) which consists of one shared encoder and two parallel inverse decoders (L2R and R2L). The two decoders are enhanced via mutual distillation, which involves one-to-one knowledge transfer at each training step, making full use of the complementary information from the two inverse directions. Moreover, in order to deal with mathematical symbols at diverse scales, an Attention Aggregation Module (AAM) is proposed to effectively integrate multi-scale coverage attentions. Notably, in the inference phase, given that the model has already learned knowledge from two inverse directions, we only use the L2R branch for inference, keeping the original parameter size and inference speed. Extensive experiments demonstrate that our proposed approach achieves recognition accuracies of 56.85% on CROHME 2014, 52.92% on CROHME 2016, and 53.96% on CROHME 2019 without data augmentation or model ensembling, substantially outperforming the state-of-the-art methods. The source code is available at https://github.com/XH-B/ABM.
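
Mutual distillation between the two decoders can be sketched as a symmetric soft-label KL term; aligning positions by simply reversing the R2L sequence, and the temperature value, are simplifications rather than the paper's exact scheme.

    import torch
    import torch.nn.functional as F

    def mutual_distill_loss(l2r_logits, r2l_logits, tau=2.0):
        """l2r_logits, r2l_logits: (B, T, V) per-step symbol logits from the
        two decoders; the R2L sequence is flipped to align positions."""
        r2l_aligned = r2l_logits.flip(dims=[1])
        log_p_l2r = F.log_softmax(l2r_logits / tau, dim=-1)
        log_p_r2l = F.log_softmax(r2l_aligned / tau, dim=-1)
        # each branch learns from the other's softened, detached predictions
        kl_a = F.kl_div(log_p_l2r, log_p_r2l.detach().exp(), reduction="batchmean")
        kl_b = F.kl_div(log_p_r2l, log_p_l2r.detach().exp(), reduction="batchmean")
        return tau * tau * (kl_a + kl_b)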

AAAI Conference 2021 Conference Paper

Inferring Emotion from Large-scale Internet Voice Data: A Semi-supervised Curriculum Augmentation based Deep Learning Approach

  • Suping Zhou
  • Jia Jia
  • Zhiyong Wu
  • Zhihan Yang
  • Yanfeng Wang
  • Wei Chen
  • Fanbo Meng
  • Shuo Huang

Effective emotion inference from user queries helps give more personified responses in Voice Dialogue Applications (VDAs). The tremendous number of VDA users brings in diverse emotion expressions. How can high emotion-inference performance be achieved on large-scale internet voice data in VDAs? Traditionally, research on speech emotion recognition is based on acted voice datasets, which have limited speakers but strong and clear emotion expressions. Inspired by this, in this paper we propose a novel approach that leverages acted voice data with strong emotion expressions to enhance large-scale unlabeled internet voice data with diverse emotion expressions for emotion inference. Specifically, we propose a novel semi-supervised multi-modal curriculum augmentation deep learning framework. First, to learn more general emotion cues, we adopt a curriculum learning based epoch-wise training strategy, which trains our model guided by strong and balanced emotion samples from acted voice data and subsequently leverages weak and unbalanced emotion samples from internet voice data. Second, to employ more diverse emotion expressions, we design a Multi-path Mixmatch Multimodal Deep Neural Network (MMMD), which effectively learns feature representations for multiple modalities and trains labeled and unlabeled data with hybrid semi-supervised methods for superior generalisation and robustness. Experiments on an internet voice dataset with 500,000 utterances show our method outperforms several alternative baselines (+10.09% in terms of F1), while an acted corpus with 2,397 utterances contributes 4.35%. To further compare our method with state-of-the-art techniques on traditional acted voice datasets, we also conduct experiments on the public IEMOCAP dataset. The results reveal the effectiveness of the proposed approach.

ECAI Conference 2020 Conference Paper

Learning Contextualized Sentence Representations for Document-Level Neural Machine Translation

  • Pei Zhang
  • Xu Zhang
  • Wei Chen 0071
  • Jian Yu
  • Yanfeng Wang
  • Deyi Xiong

Document-level machine translation incorporates intersentential dependencies into the translation of a source sentence. In this paper, we propose a new framework to model cross-sentence dependencies by training neural machine translation (NMT) to predict both the target translation and surrounding sentences of a source sentence. By enforcing the NMT model to predict source context, we want the model to learn “contextualized” source sentence representations that capture document-level dependencies on the source side. We further propose two different methods to learn and integrate such contextualized sentence embeddings into NMT: a joint training method that jointly trains an NMT model with the source context prediction model and a pre-training & fine-tuning method that pretrains the source context prediction model on a large-scale monolingual document corpus and then fine-tunes it with the NMT model. Experiments on Chinese-English and English-German translation show that both methods can substantially improve the translation quality over a strong document-level Transformer baseline.

AAAI Conference 2017 Conference Paper

Multi-Task Deep Learning for User Intention Understanding in Speech Interaction Systems

  • Yishuang Ning
  • Jia Jia
  • Zhiyong Wu
  • Runnan Li
  • Yongsheng An
  • Yanfeng Wang
  • Helen Meng

Speech interaction systems have been gaining popularity in recent years. The main purpose of these systems is to generate more satisfactory responses according to users’ speech utterances, in which the most critical problem is to analyze user intention. Research shows that user intention conveyed through speech is expressed not only by content but is also closely related to users’ speaking manners (e.g., with or without acoustic emphasis). How to incorporate these heterogeneous attributes to infer user intention remains an open problem. In this paper, we define Intention Prominence (IP) as the semantic combination of focus by text and emphasis by speech, and propose a multi-task deep learning framework to predict IP. Specifically, we first use long short-term memory (LSTM), which is capable of modeling long short-term contextual dependencies, to detect focus and emphasis, and incorporate the tasks for focus and emphasis detection with multi-task learning (MTL) to reinforce the performance of each other. We then employ a Bayesian network (BN) to incorporate multimodal features (focus, emphasis, and location reflecting users’ dialect conventions) to predict IP based on feature correlations. Experiments on a dataset of 135,566 utterances collected from the real-world Sogou Voice Assistant illustrate that our method outperforms the comparison methods by 6.9-24.5% in terms of F1-measure. Moreover, a real practice in the Sogou Voice Assistant indicates that our method can improve the performance on user intention understanding by 7%.