Arrow Research

Author name cluster

Chen Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

33 papers
2 author rows

Possible papers (33)

AAAI Conference 2026 Conference Paper

Deep Research Arena: The First Exam of LLMs’ Research Abilities via Seminar-Grounded Tasks

  • Haiyuan Wan
  • Chen Yang
  • Junchi Yu
  • Meiqi Tu
  • Jiaxuan Lu
  • Di Yu
  • Jianbao Cao
  • Ben Gao

Deep research agents have attracted growing attention for their potential to orchestrate multi-stage research workflows, spanning literature synthesis, methodological design, and empirical verification. Despite these strides, evaluating their research capability faithfully is rather challenging due to the difficulty of collecting frontier research questions that genuinely capture researchers’ attention and intellectual curiosity. To address this gap, we introduce DeepResearch Arena, a benchmark grounded in academic seminars that capture rich expert discourse and interaction, better reflecting real-world research environments and reducing the risk of data leakage. To automatically construct DeepResearch Arena, we propose a Multi-Agent Hierarchical Task Generation (MAHTG) system that extracts research-worthy inspirations from seminar transcripts and translates them into high-quality research tasks, ensuring traceable task formulation while filtering noise. With the MAHTG system, we curate DeepResearch Arena with over 10,000 high-quality research tasks from over 200 academic seminars, spanning 12 disciplines, such as literature, history, and science. Our extensive evaluation shows that DeepResearch Arena presents substantial challenges for current state-of-the-art agents, with clear performance gaps observed across different models.

AAAI Conference 2026 Conference Paper

Dereflection Any Image with Diffusion Priors and Diversified Data

  • Jichen Hu
  • Chen Yang
  • Zanwei Zhou
  • Jiemin Fang
  • Qi Tian
  • Wei Shen

Reflection removal from a single image remains highly challenging due to the complex entanglement between target scenes and unwanted reflections. Despite significant progress, existing methods are hindered by the scarcity of high-quality, diverse data and insufficient restoration priors, resulting in limited generalization across various real-world scenarios. In this paper, we propose Dereflection Any Image, a comprehensive solution with an efficient data preparation pipeline and a generalizable model for robust reflection removal. First, we introduce a dataset named Diverse Reflection Removal (DRR), created by randomly rotating reflective media in target scenes, enabling variation of reflection angles and intensities and setting a new benchmark in scale, quality, and diversity. Second, we propose a diffusion-based framework with one-step diffusion for deterministic outputs and fast inference. To ensure stable learning, we design a three-stage progressive training strategy, including reflection-invariant finetuning to encourage consistent outputs across the varying reflection patterns that characterize our dataset. Extensive experiments show that our method achieves SOTA performance on both common benchmarks and challenging in-the-wild images, demonstrating superior generalization across diverse real-world scenes.

AAAI Conference 2026 Conference Paper

Escaping the CAM Shadow: Uncertainty-Guided Reliable Learning for Weakly Supervised Semantic Segmentation

  • Luyao Chang
  • Leiting Chen
  • Chen Yang
  • Chuan Zhou

Weakly supervised semantic segmentation (WSSS) suffers from an inherent mismatch between coarse image-level annotations and dense pixel-level predictions. To bridge this gap, existing methods primarily focus on generating refined class activation maps (CAMs) as pseudo-labels. However, we argue that this focus is insufficient, as it overlooks a critical component: the segmentation decoder. The decoder is typically trained through superficial alignment of predictions with pseudo-labels in the logit space. Given the noisy nature of such labels, this naive supervision leads to error accumulation and limits performance. To address this, we propose an Uncertainty-Guided Reliable Learning (UGRL) framework that exerts dual control to reshape the learning process, achieving robust supervision that escapes the CAM shadow. The cornerstone of UGRL is a prototype-driven uncertainty modeling module that estimates the reliability of class-wise supervision. The modeled uncertainty enables two synergistic control mechanisms. First, it adaptively modulates classification and segmentation losses, encouraging the model to learn from more trustworthy signals. Second, it guides the structuring of the decoder’s feature space. Rather than relying solely on superficial alignment, UGRL enforces deeper representation alignment by applying contrastive learning on reliable pixels. This enables rich semantic transfer to fine-grained segmentation details. Extensive experiments on PASCAL VOC and MS COCO demonstrate that our method surpasses other state-of-the-art WSSS methods.
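
The loss-modulation mechanism lends itself to a compact illustration. Below is a minimal sketch, assuming an exp(-u) weighting rule; the function name and the weighting choice are ours, not the paper's:

    import torch

    def uncertainty_weighted_loss(pixel_loss, uncertainty):
        # pixel_loss, uncertainty: (B, H, W); higher uncertainty = less reliable.
        weight = torch.exp(-uncertainty)   # reliable pixels keep weight near 1
        return (weight * pixel_loss).mean()

Down-weighting rather than hard-masking keeps some gradient signal from uncertain regions while letting reliable pixels dominate training.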

AAAI Conference 2026 Conference Paper

Few-step Flow for 3D Generation via Marginal-Data Transport Distillation

  • Zanwei Zhou
  • Taoran Yi
  • Jiemin Fang
  • Chen Yang
  • Lingxi Xie
  • Xinggang Wang
  • Wei Shen
  • Qi Tian

Flow-based 3D generation models typically require dozens of sampling steps during inference. Though few-step distillation methods, particularly Consistency Models (CMs), have achieved substantial advancements in accelerating 2D diffusion models, they remain under-explored for more complex 3D generation tasks. In this study, we propose a novel framework, MDT-dist, for few-step 3D flow distillation. Our approach is built upon a primary objective: distilling the pretrained model to learn the Marginal-Data Transport. Directly learning this objective requires integrating the velocity fields, but this integral is intractable to compute. Therefore, we propose two optimizable objectives, Velocity Matching (VM) and Velocity Distillation (VD), which equivalently convert the optimization target from the transport level to the velocity level and the distribution level, respectively. Velocity Matching (VM) learns to stably match the velocity fields between the student and the teacher, but inevitably provides biased gradient estimates. Velocity Distillation (VD) further enhances the optimization process by leveraging the learned velocity fields to perform probability density distillation. When evaluated on the pioneering 3D generation framework TRELLIS, our method reduces the sampling steps of each flow transformer from 25 to 1–2, achieving 0.68s (1 step x2) and 0.94s (2 steps x2) latency with 9.0x and 6.5x speedups on an A800, while preserving high visual and geometric fidelity. Experiments demonstrate that our method significantly outperforms existing CM distillation methods and enables TRELLIS to achieve superior performance in few-step 3D generation.
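
The abstract names the VM objective without stating it. As a hedged illustration in our own notation (not the paper's exact formulation), a velocity-matching loss between a student field $v_\theta$ and a frozen teacher field $v_\phi$ along the flow would typically take the form

    $\mathcal{L}_{\mathrm{VM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\, x_t}\big[\, \| v_\theta(x_t, t) - v_\phi(x_t, t) \|_2^2 \,\big]$

so that the student reproduces the teacher's marginal transport while using far fewer integration steps at inference.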

AAAI Conference 2026 Conference Paper

Learning from Human Gaze: Human-like Robot Social Navigation in Dense Crowds

  • Zhecheng Yu
  • Yan Lyu
  • Chen Yang
  • Tao Chen
  • Yishuang Zhang
  • Bo Ling
  • Peng Wang
  • Guanyu Gao

Robot navigation in dense crowds requires understanding social cues that humans naturally use, yet existing methods struggle with real-world complexity. We investigate two questions: (1) Where do pedestrians look when navigating crowds? and (2) Can eye tracking improve robot navigation? To answer these questions, we introduce GazeNav, an egocentric dataset collected via wearable eye trackers, featuring synchronized video, gaze, and trajectories in crowded environments. Analysis reveals that the gaze of pedestrians is closely related to the semantic presence and movement of other individuals, exhibiting distinct attention patterns across navigation behaviors. Building on this, we propose Gaze2Nav, a modular framework that first predicts human gaze to infer socially salient pedestrians, then incorporates the semantic attention into motion planning alongside visual inputs. Our method achieves 87.6% salient pedestrian prediction accuracy and reduces trajectory error by 15.4% over state-of-the-art baselines. By aligning with human gaze, our framework improves both performance and interpretability, advancing toward human-like, socially intelligent robot navigation.

AAAI Conference 2026 Conference Paper

WorldGrow: Generating Infinite 3D World

  • Sikuang Li
  • Chen Yang
  • Jiemin Fang
  • Taoran Yi
  • Jia Lu
  • Jiazhong Cen
  • Lingxi Xie
  • Wei Shen

We tackle the challenge of generating infinitely extendable 3D worlds: large, continuous environments with coherent geometry and realistic appearance. Existing methods face key challenges: 2D-lifting approaches suffer from geometric and appearance inconsistencies across views, 3D implicit representations are hard to scale up, and current 3D foundation models are mostly object-centric, limiting their applicability to scene-level generation. Our key insight is to leverage strong generation priors from pre-trained 3D models for structured scene block generation. To this end, we propose WorldGrow, a hierarchical framework for unbounded 3D scene synthesis. Our method features three core components: (1) a data curation pipeline that extracts high-quality scene blocks for training, making the 3D structured latent representations suitable for scene generation; (2) a 3D block inpainting mechanism that enables context-aware scene extension; and (3) a coarse-to-fine generation strategy that ensures both global layout plausibility and local geometric/textural fidelity. Evaluated on the large-scale 3D-FRONT dataset, WorldGrow achieves SOTA performance in geometry reconstruction, while uniquely supporting infinite scene generation with photorealistic and structurally consistent outputs. These results highlight its capability for constructing large-scale virtual environments and potential for building future world models.

NeurIPS Conference 2025 Conference Paper

Black-Box Membership Inference Attack for LVLMs via Prior Knowledge-Calibrated Memory Probing

  • Jinhua Yin
  • Peiru Yang
  • Chen Yang
  • Huili Wang
  • Zhiyang Hu
  • Shangguang Wang
  • Yongfeng Huang
  • Tao Qi

Large vision-language models (LVLMs) derive their capabilities from extensive training on vast corpora of visual and textual data. Empowered by large-scale parameters, these models often exhibit strong memorization of their training data, rendering them susceptible to membership inference attacks (MIAs). Existing MIA methods for LVLMs typically operate under white- or gray-box assumptions, extracting likelihood-based features for suspected data samples from the target LVLM. However, mainstream LVLMs generally only expose generated outputs while concealing internal computational features during inference, limiting the applicability of these methods. In this work, we propose the first black-box MIA framework for LVLMs, based on a prior knowledge-calibrated memory probing mechanism. The core idea is to assess the model's memorization of the private semantic information embedded within the suspected image data, which is unlikely to be inferred from general world knowledge alone. We conducted extensive experiments across four LVLMs and three datasets. Empirical results demonstrate that our method effectively identifies training data of LVLMs in a purely black-box setting and even achieves performance comparable to gray-box and white-box methods. Further analysis reveals the robustness of our method against potential adversarial manipulations and the effectiveness of its design choices. Our code and data are available at https://github.com/spmede/KCMP.
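
The calibration idea (compare what the model recalls about an image's private details against what is guessable from general knowledge alone) can be sketched in a few lines. A minimal sketch; query_lvlm and the probe format are hypothetical, and the paper's actual probing protocol may differ:

    def membership_score(query_lvlm, probes):
        # probes: list of (question, private_answer, image) triples about
        # details of the suspected image that world knowledge alone
        # should not reveal.
        scores = []
        for question, answer, image in probes:
            # Recall rate when the model is asked about the suspected image.
            with_image = float(answer in query_lvlm(question, image=image))
            # Prior: is the detail guessable without the image at all?
            prior = float(answer in query_lvlm(question, image=None))
            scores.append(with_image - prior)
        # A high calibrated score suggests the image was in training data.
        return sum(scores) / len(scores)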

NeurIPS Conference 2025 Conference Paper

Geometric Imbalance in Semi-Supervised Node Classification

  • Liang Yan
  • Shengzhong Zhang
  • Bisheng Li
  • Menglin Yang
  • Chen Yang
  • Min Zhou
  • Weiyang Ding
  • Yutong Xie

Class imbalance in graph data presents a significant challenge for effective node classification, particularly in semi-supervised scenarios. In this work, we formally introduce the concept of geometric imbalance, which captures how message passing on class-imbalanced graphs leads to geometric ambiguity among minority-class nodes in the Riemannian manifold embedding space. We provide a rigorous theoretical analysis of geometric imbalance on the Riemannian manifold and propose a unified framework that explicitly mitigates it through pseudo-label alignment, node reordering, and ambiguity filtering. Extensive experiments on diverse benchmarks show that our approach consistently outperforms existing methods, especially under severe class imbalance. Our findings offer new theoretical insights and practical tools for robust semi-supervised node classification.

ICML Conference 2025 Conference Paper

Latent Imputation before Prediction: A New Computational Paradigm for De Novo Peptide Sequencing

  • Ye Du
  • Chen Yang
  • Nanxi Yu
  • Wanyu Lin
  • Qian Zhao
  • Shujun Wang

De novo peptide sequencing is a fundamental computational technique for ascertaining amino acid sequences of peptides directly from tandem mass spectrometry data, eliminating the need for reference databases. Cutting-edge models encode the observed mass spectra into latent representations from which peptides are predicted auto-regressively. However, the issue of missing fragmentation, attributable to factors such as suboptimal fragmentation efficiency and instrumental constraints, presents a formidable challenge in practical applications. To tackle this obstacle, we propose a novel computational paradigm called Latent Imputation before Prediction (LIPNovo). LIPNovo is devised to compensate for missing fragmentation information within observed spectra before executing the final peptide prediction. Rather than generating raw missing data, LIPNovo performs imputation in the latent space, guided by the theoretical peak profile of the target peptide sequence. The imputation process is conceptualized as a set-prediction problem, utilizing a set of learnable peak queries to reason about the relationships among observed peaks and directly generate the latent representations of theoretical peaks through optimal bipartite matching. In this way, LIPNovo manages to supplement missing information during inference and thus boosts performance. Despite its simplicity, experiments on three benchmark datasets demonstrate that LIPNovo outperforms state-of-the-art methods by large margins. Code is available at https://github.com/usr922/LIPNovo.
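
The matching step of such set-prediction formulations is commonly solved with the Hungarian algorithm. A minimal sketch of that step; the shapes and the cost definition are our assumptions, not the paper's exact formulation:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_queries_to_peaks(query_latents, theoretical_latents):
        # query_latents: (Q, D) array; theoretical_latents: (P, D), Q >= P.
        diff = query_latents[:, None, :] - theoretical_latents[None, :, :]
        cost = (diff ** 2).sum(axis=-1)            # (Q, P) pairwise squared L2
        rows, cols = linear_sum_assignment(cost)   # optimal one-to-one matching
        return rows, cols                          # matched (query, peak) indices

Matched queries can then be supervised to reproduce the latent representations of their assigned theoretical peaks.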

NeurIPS Conference 2025 Conference Paper

MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

  • Ziyang Ma
  • Yinghao Ma
  • Yanqiao Zhu
  • Chen Yang
  • Yi-Wen Chao
  • Ruiyang Xu
  • Wenxi Chen
  • Yuanzhe Chen

We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across massive multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error corrections and quality checks to ensure high quality. Unlike existing benchmarks that are limited to specific domains of sound, music, or speech, MMAR extends them to a broad spectrum of real-world audio scenarios, including mixed-modality combinations of sound, music, and speech. Each question in MMAR is hierarchically categorized across four reasoning layers: Signal, Perception, Semantic, and Cultural, with additional sub-categories within each layer to reflect task diversity and complexity. To further foster research in this area, we annotate every question with a Chain-of-Thought (CoT) rationale to promote future advancements in audio reasoning. Each item in the benchmark demands multi-step deep reasoning beyond surface-level understanding. Moreover, a part of the questions requires graduate-level perceptual and domain-specific knowledge, elevating the benchmark's difficulty and depth. We evaluate MMAR using a broad set of models, including Large Audio-Language Models (LALMs), Large Audio Reasoning Models (LARMs), Omni Language Models (OLMs), Large Language Models (LLMs), and Large Reasoning Models (LRMs), with audio caption inputs. The performance of these models on MMAR highlights the benchmark's challenging nature, and our analysis further reveals critical limitations of understanding and reasoning capabilities among current models. These findings underscore the urgent need for greater research attention in audio-language reasoning, including both data and algorithm innovation. We hope MMAR will serve as a catalyst for future advances in this important but little-explored area.

ICML Conference 2025 Conference Paper

Provable Zero-Shot Generalization in Offline Reinforcement Learning

  • Zhiyong Wang
  • Chen Yang
  • John C. S. Lui
  • Dongruo Zhou

In this work, we study offline reinforcement learning (RL) with the zero-shot generalization (ZSG) property, where the agent has access to an offline dataset including experiences from different environments, and the goal of the agent is to train a policy over the training environments that performs well on test environments without further interaction. Existing work has shown that classical offline RL fails to generalize to new, unseen environments. We propose pessimistic empirical risk minimization (PERM) and pessimistic proximal policy optimization (PPPO), which leverage pessimistic policy evaluation to guide policy learning and enhance generalization. We show that both PERM and PPPO are capable of finding a near-optimal policy with ZSG. Our result serves as a first step in understanding the foundation of the generalization phenomenon in offline reinforcement learning.
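
For intuition, pessimistic policy evaluation typically lower-bounds a policy's value with an uncertainty penalty before optimizing. A generic template in our notation (PERM and PPPO instantiate pessimism in their own specific ways):

    $\hat{V}^{\pi}(s_1) = \widehat{\mathbb{E}}^{\pi}\Big[\sum_{h=1}^{H} r_h\Big] - \beta \sum_{h=1}^{H} \widehat{\mathbb{E}}^{\pi}\big[\Gamma_h(s_h, a_h)\big], \qquad \hat{\pi} = \arg\max_{\pi} \hat{V}^{\pi}(s_1)$

where $\Gamma_h$ quantifies how poorly the offline dataset covers $(s_h, a_h)$, so the learned policy avoids regions the data cannot certify.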

UAI Conference 2025 Conference Paper

Sample and Computationally Efficient Continuous-Time Reinforcement Learning with General Function Approximation

  • Runze Zhao
  • Yue Yu
  • Adams Yiyue Zhu
  • Chen Yang
  • Dongruo Zhou

Continuous-time reinforcement learning (CTRL) provides a principled framework for sequential decision-making in environments where interactions evolve continuously over time. Despite its empirical success, the theoretical understanding of CTRL remains limited, especially in settings with general function approximation. In this work, we propose a model-based CTRL algorithm that achieves both sample and computational efficiency. Our approach leverages optimism-based confidence sets to establish the first sample complexity guarantee for CTRL with general function approximation, showing that a near-optimal policy can be learned with a suboptimality gap of $\tilde{O}(\sqrt{d_{\mathcal{R}} + d_{\mathcal{F}}}N^{-1/2})$ using $N$ measurements, where $d_{\mathcal{R}}$ and $d_{\mathcal{F}}$ denote the distributional Eluder dimensions of the reward and dynamic functions, respectively, capturing the complexity of general function approximation in reinforcement learning. Moreover, we introduce structured policy updates and an alternative measurement strategy that significantly reduce the number of policy updates and rollouts while maintaining competitive sample efficiency. Our proposed algorithms are validated through experiments on continuous control tasks and diffusion model fine-tuning, demonstrating comparable performance with significantly fewer policy updates and rollouts.

AAAI Conference 2025 Conference Paper

Segment Any 3D Gaussians

  • Jiazhong Cen
  • Jiemin Fang
  • Chen Yang
  • Lingxi Xie
  • Xiaopeng Zhang
  • Wei Shen
  • Qi Tian

This paper presents SAGA (Segment Any 3D GAussians), a highly efficient 3D promptable segmentation method based on 3D Gaussian Splatting (3D-GS). Given 2D visual prompts as input, SAGA can segment the corresponding 3D target represented by 3D Gaussians within 4 ms. This is achieved by attaching a scale-gated affinity feature to each 3D Gaussian to endow it with a new property for multi-granularity segmentation. Specifically, a scale-aware contrastive training strategy is proposed for the scale-gated affinity feature learning. It 1) distills the segmentation capability of the Segment Anything Model (SAM) from 2D masks into the affinity features and 2) employs a soft scale gate mechanism to deal with multi-granularity ambiguity in 3D segmentation by adjusting the magnitude of each feature channel according to a specified 3D physical scale. Evaluations demonstrate that SAGA achieves real-time multi-granularity segmentation with quality comparable to state-of-the-art methods. As one of the first methods addressing promptable segmentation in 3D-GS, the simplicity and effectiveness of SAGA pave the way for future advancements in this field.
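
The soft scale gate admits a compact illustration: channel magnitudes of a per-Gaussian affinity feature are modulated by a gate conditioned on a 3D physical scale. A minimal sketch with illustrative layer sizes; the module name and architecture are our assumptions, not the paper's:

    import torch
    import torch.nn as nn

    class ScaleGatedAffinity(nn.Module):
        def __init__(self, feat_dim=32):
            super().__init__()
            # Map a scalar 3D physical scale to per-channel gates in (0, 1).
            self.gate = nn.Sequential(nn.Linear(1, feat_dim), nn.Sigmoid())

        def forward(self, affinity, scale):
            # affinity: (N, feat_dim) per-Gaussian features; scale: (1,) tensor.
            return affinity * self.gate(scale)  # gate broadcasts over N Gaussians

Varying the scale input changes which feature channels dominate, which is one way to realize the multi-granularity behavior the abstract describes.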

NeurIPS Conference 2025 Conference Paper

SeniorTalk: A Chinese Conversation Dataset with Rich Annotations for Super-Aged Seniors

  • Chen Yang
  • Hui Wang
  • Shiyao Wang
  • Junyang Chen
  • Jiabei He
  • Jiaming Zhou
  • Xi Yang
  • Yequan Wang

While voice technologies increasingly serve aging populations, current systems exhibit significant performance gaps due to inadequate training data capturing elderly-specific vocal characteristics like presbyphonia and dialectal variations. The limited data available on super-aged individuals in existing elderly speech datasets, coupled with overly simple recording styles and annotation dimensions, exacerbates this issue. To address the critical scarcity of speech data from individuals aged 75 and above, we introduce SeniorTalk, a carefully annotated Chinese spoken dialogue dataset. This dataset contains 55.53 hours of speech from 101 natural conversations involving 202 participants, ensuring a strategic balance across gender, region, and age. Through detailed annotation across multiple dimensions, it can support a wide range of speech tasks. We perform extensive experiments on speaker verification, speaker diarization, speech recognition, and speech editing tasks, offering crucial insights for the development of speech technologies targeting this age group. Code is available at https://github.com/flageval-baai/SeniorTalk and data at https://huggingface.co/datasets/evan0617/seniortalk.

ICRA Conference 2025 Conference Paper

WcDT: World-Centric Diffusion Transformer for Traffic Scene Generation

  • Chen Yang
  • Yangfan He
  • Aaron Xuxiang Tian
  • Dong Chen
  • Jianhui Wang
  • Tianyu Shi
  • Arsalan Heydarian
  • Pei Liu

In this paper, we introduce a novel approach for autonomous driving trajectory generation by harnessing the complementary strengths of diffusion probabilistic models (a.k.a. diffusion models) and transformers. Our proposed framework, termed the “World-centric Diffusion Transformer” (WcDT), optimizes the entire trajectory generation process, from feature extraction to model inference. To enhance scene diversity and stochasticity, the historical trajectory data is first preprocessed into “Agent Move Statement” form and encoded into latent space using Denoising Diffusion Probabilistic Models (DDPM) enhanced with Diffusion with Transformer (DiT) blocks. Then, the latent features, historical trajectories, HD map features, and historical traffic signal information are fused with various transformer-based encoders that are used to enhance the interactions of agents with other elements in the traffic scene. The encoded traffic scenes are then decoded by a trajectory decoder to generate multimodal future trajectories. Comprehensive experimental results show that the proposed approach exhibits superior performance in generating both realistic and diverse trajectories, showing its potential for integration into autonomous driving simulation systems. Our code is available at https://github.com/yangchen1997/WcDT.
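
For reference, the standard DDPM forward (noising) process that this kind of latent encoding builds on is

    $q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\big), \qquad \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$

with noise schedule $\beta_s$; the DiT blocks then learn the reverse process in the latent trajectory space. The specific schedule and parameterization used by WcDT are not given in the abstract.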

NeurIPS Conference 2024 Conference Paper

GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks

  • Yu Zhang
  • Changhao Pan
  • Wenxiang Guo
  • Ruiqi Li
  • Zhiyuan Zhu
  • Jialei Wang
  • Wenhao Xu
  • Jingyu Lu

The scarcity of high-quality and multi-task singing datasets significantly hinders the development of diverse controllable and personalized singing tasks, as existing singing datasets suffer from low quality, limited diversity of languages and singers, absence of multi-technique information and realistic music scores, and poor task suitability. To tackle these problems, we present GTSinger, a large Global, multi-Technique, free-to-use, high-quality singing corpus with realistic music scores, designed for all singing tasks, along with its benchmarks. Particularly, (1) we collect 80.59 hours of high-quality singing voices, forming the largest recorded singing dataset; (2) 20 professional singers across nine widely spoken languages offer diverse timbres and styles; (3) we provide controlled comparison and phoneme-level annotations of six commonly used singing techniques, helping technique modeling and control; (4) GTSinger offers realistic music scores, assisting real-world musical composition; (5) singing voices are accompanied by manual phoneme-to-audio alignments, global style labels, and 16.16 hours of paired speech for various singing tasks. Moreover, to facilitate the use of GTSinger, we conduct four benchmark experiments: technique-controllable singing voice synthesis, technique recognition, style transfer, and speech-to-singing conversion.

AAMAS Conference 2023 Conference Paper

Learning Individual Difference Rewards in Multi-Agent Reinforcement Learning

  • Chen Yang
  • Guangkai Yang
  • Junge Zhang

We investigate explicit solutions to the multi-agent credit assignment problem. Specifically, we assign each agent individual difference rewards in addition to the team reward so as to distinguish the contributions of different agents to the team. We present a novel reward decomposition network to estimate the influence of each agent’s action on the team reward and distribute difference rewards accordingly. Furthermore, we combine difference rewards with the actor-critic framework and propose a new approach called learning individual difference rewards (LIDR). We evaluate LIDR on a set of StarCraft II micromanagement problems. Results show that LIDR significantly outperforms previous state-of-the-art methods.
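
For context, the classical difference reward that such decomposition networks learn to approximate is, in the standard notation of the difference-rewards literature (not the paper's own),

    $D_i(z) = G(z) - G(z_{-i})$

where $G$ is the team reward and $z_{-i}$ denotes the joint state-action with agent $i$'s action replaced by a default (null) action. Computing $G(z_{-i})$ exactly requires counterfactual simulation, which is why estimating it with a learned network is attractive.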

NeurIPS Conference 2023 Conference Paper

Rethinking Semi-Supervised Imbalanced Node Classification from Bias-Variance Decomposition

  • Divin Yan
  • Gengchen Wei
  • Chen Yang
  • Shengzhong Zhang
  • Zengfeng Huang

This paper introduces a new approach to address the issue of class imbalance in graph neural networks (GNNs) for learning on graph-structured data. Our approach integrates imbalanced node classification and bias-variance decomposition, establishing a theoretical framework that closely relates data imbalance to model variance. We also leverage a graph augmentation technique to estimate the variance and design a regularization term to alleviate the impact of imbalance. Exhaustive tests are conducted on multiple benchmarks, including naturally imbalanced datasets and public-split class-imbalanced datasets, demonstrating that our approach outperforms state-of-the-art methods in various imbalanced scenarios. This work provides a novel theoretical perspective for addressing the problem of imbalanced node classification in GNNs.
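
For reference, the standard bias-variance decomposition under squared loss, which the paper adapts to node classification, is

    $\mathbb{E}_{\mathcal{D},\varepsilon}\big[(y - \hat{f}_{\mathcal{D}}(x))^2\big] = \big(f(x) - \bar{f}(x)\big)^2 + \mathbb{E}_{\mathcal{D}}\big[(\hat{f}_{\mathcal{D}}(x) - \bar{f}(x))^2\big] + \sigma^2, \qquad \bar{f}(x) = \mathbb{E}_{\mathcal{D}}[\hat{f}_{\mathcal{D}}(x)]$

with targets $y = f(x) + \varepsilon$ and noise variance $\sigma^2$. The paper's contribution is tying the variance term to class imbalance and estimating it via graph augmentation.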

NeurIPS Conference 2023 Conference Paper

Segment Anything in 3D with NeRFs

  • Jiazhong Cen
  • Zanwei Zhou
  • Jiemin Fang
  • Chen Yang
  • Wei Shen
  • Lingxi Xie
  • Dongsheng Jiang
  • Xiaopeng Zhang

Recently, the Segment Anything Model (SAM) emerged as a powerful vision foundation model capable of segmenting anything in 2D images. This paper aims to generalize SAM to segment 3D objects. Rather than replicating the data acquisition and annotation procedure, which is costly in 3D, we design an efficient solution, leveraging the Neural Radiance Field (NeRF) as a cheap and off-the-shelf prior that connects multi-view 2D images to the 3D space. We refer to the proposed solution as SA3D, for Segment Anything in 3D. It requires only a manual segmentation prompt (e.g., rough points) for the target object in a single view, which is used to generate its 2D mask in this view with SAM. Next, SA3D alternately performs mask inverse rendering and cross-view self-prompting across various views to iteratively complete the 3D mask of the target object, constructed with voxel grids. The former projects the 2D mask obtained by SAM in the current view onto the 3D mask with guidance from the density distribution learned by the NeRF; the latter automatically extracts reliable prompts, as input to SAM, from the NeRF-rendered 2D mask in another view. We show in experiments that SA3D adapts to various scenes and achieves 3D segmentation within minutes. Our research offers a generic and efficient methodology to lift a 2D vision foundation model to 3D, as long as the 2D model can steadily address promptable segmentation across multiple views.
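
The mask inverse rendering step admits a compact illustration: push each ray's 2D mask value back onto the voxels it traverses, weighted by the NeRF's volume-rendering weights. A minimal sketch; the tensor names and the additive accumulation rule are our assumptions, not SA3D's exact update:

    import torch

    def mask_inverse_render(voxel_mask, sample_idx, render_weights, mask_2d):
        # voxel_mask: (V,) flattened 3D mask grid, updated in place.
        # sample_idx: (R, S) long tensor, voxel hit by each sample on each ray.
        # render_weights: (R, S) volume-rendering weights from the NeRF density.
        # mask_2d: (R,) SAM mask value of the pixel each ray passes through.
        contrib = render_weights * mask_2d[:, None]
        voxel_mask.index_add_(0, sample_idx.reshape(-1), contrib.reshape(-1))
        return voxel_mask

Weighting by the rendering weights concentrates mask evidence near surfaces, where the density (and hence the weight) peaks.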

AAAI Conference 2022 Conference Paper

TDv2: A Novel Tree-Structured Decoder for Offline Mathematical Expression Recognition

  • Changjie Wu
  • Jun Du
  • Yunqing Li
  • Jianshu Zhang
  • Chen Yang
  • Bo Ren
  • Yiqing Hu

In recent years, tree decoders have become more popular than LaTeX string decoders in the field of handwritten mathematical expression recognition (HMER), as they can capture the hierarchical tree structure of mathematical expressions. However, previous tree decoders converted the tree structure labels into a fixed and ordered sequence, which could not make full use of the diversified expression of tree labels. In this study, we propose a novel tree decoder (TDv2) to fully utilize the tree structure labels. Compared with previous tree decoders, this new model does not require a fixed priority for different branches of a node during training and inference, which can effectively improve the model's generalization capability. The input and output of the model make full use of the tree structure labels, so there is no need to find the parent node in the decoding process, which simplifies decoding and adds a priori information to help predict each node. We verified the effectiveness of each part of the model through comprehensive ablation experiments and attention visualization analysis. On the authoritative CROHME 14/16/19 datasets, our method achieves state-of-the-art results.

AAAI Conference 2021 Conference Paper

LRSC: Learning Representations for Subspace Clustering

  • Changsheng Li
  • Chen Yang
  • Bo Liu
  • Ye Yuan
  • Guoren Wang

Deep learning based subspace clustering methods have attracted increasing attention in recent years, where a basic theme is to non-linearly map data into a latent space and then uncover subspace structures based upon the data self-expressiveness property. However, almost all existing deep subspace clustering methods rely only on target domain data and resort to shallow neural networks for modeling data, leaving substantial room to design more effective representation learning mechanisms tailored for subspace clustering. In this paper, we propose a novel subspace clustering framework that learns precise sample representations. In contrast to previous approaches, the proposed method leverages external data by constructing many relevant tasks to guide the training of the encoder, motivated by the idea of meta-learning. Considering the limited network depth of current deep subspace clustering models, we also distill knowledge from a deeper network trained on the external data and transfer it into the shallower model. To reach these two goals, we propose a new loss function that realizes them in a unified framework. Moreover, we construct a new auxiliary task for self-supervised training of the model, so that its representation ability can be further improved. Extensive experiments are performed on four publicly available datasets, and experimental results clearly demonstrate the efficacy of our method compared to state-of-the-art methods.
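
For reference, the self-expressiveness property the abstract builds on is usually written as the objective (standard notation from the subspace clustering literature, not the paper's own):

    $\min_{C}\ \tfrac{1}{2}\|Z - ZC\|_F^2 + \lambda \|C\|_p \quad \text{s.t.}\ \ \mathrm{diag}(C) = 0$

where $Z$ holds the latent representations produced by the encoder, and the learned coefficient matrix $C$ induces the affinity $|C| + |C|^\top$ used for spectral clustering.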

JBHI Journal 2021 Journal Article

Mutual-Prototype Adaptation for Cross-Domain Polyp Segmentation

  • Chen Yang
  • Xiaoqing Guo
  • Meilu Zhu
  • Bulat Ibragimov
  • Yixuan Yuan

Accurate segmentation of polyps from colonoscopy images provides useful information for the diagnosis and treatment of colorectal cancer. Although deep learning methods have advanced automatic polyp segmentation, their performance often degrades when applied to new data acquired from different scanners or sequences (target domain). As manual annotation is tedious and labor-intensive for a new target domain, leveraging knowledge learned from the labeled source domain to promote performance in the unlabeled target domain is highly desirable. In this work, we propose a mutual-prototype adaptation network to eliminate domain shifts in multi-center and multi-device colonoscopy images. We first devise a mutual-prototype alignment (MPA) module with a prototype relation function to refine features through self-domain and cross-domain information in a coarse-to-fine process. Two auxiliary modules, progressive self-training (PST) and disentangled reconstruction (DR), are then proposed to improve the segmentation performance. The PST module selects reliable pseudo labels through a novel uncertainty-guided self-training loss to obtain accurate prototypes in the target domain. The DR module reconstructs the original images jointly from prediction results and private prototypes to maintain semantic consistency and provide complementary supervision. We extensively evaluate the proposed model on three conventional colonoscopy datasets: CVC-DB, Kvasir-SEG, and ETIS-Larib. The comprehensive experimental results demonstrate that the proposed model outperforms state-of-the-art methods.
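
Prototypes in adaptation methods of this kind are commonly obtained by masked average pooling of features under the predicted class probability. A minimal sketch of that common construction; the abstract does not give the paper's exact prototype relation function:

    import torch

    def class_prototype(features, prob_map, eps=1e-6):
        # features: (B, C, H, W); prob_map: (B, 1, H, W) class probability.
        weighted = (features * prob_map).sum(dim=(0, 2, 3))   # (C,)
        return weighted / (prob_map.sum() + eps)              # normalized prototype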

ICRA Conference 2020 Conference Paper

Modeling and Experiments on the Swallowing and Disgorging Characteristics of an Underwater Continuum Manipulator

  • Haihang Wang
  • He Xu
  • Fengshu Yu
  • Xin Li
  • Chen Yang
  • Siqing Chen
  • Junlong Chen
  • Yonghui Zhang

Soft robots use compliant materials to perform motions and behaviors not typically achievable by rigid robots. An underwater, compliant, multi-segment continuum manipulator that can bend, swallow, and disgorge is developed in this study. The manipulator is driven by McKibben water hydraulic artificial muscles (WHAMs). The mechanical properties of the WHAM are tested and analyzed experimentally. A kinematic model, which accounts for the variable-diameter structure of the soft grippers, is established to simulate the behavior of the manipulator during the bending, swallowing, and disgorging procedures. A mouth-tongue collaborative soft robot, assembled with another single-segment soft robot arm, is presented, and its distinctive functions are verified experimentally.

IROS Conference 2017 Conference Paper

Development of an inexpensive tri-axial force sensor for minimally invasive surgery

  • Lu Li
  • Bocheng Yu
  • Chen Yang
  • Prasad Vagdargi
  • Rangaprasad Arun Srivatsan
  • Howie Choset

This work presents the design and evaluation of a low-cost tri-axial force sensor that has been developed to regain the sense of touch in minimally invasive surgery (MIS). The force sensor uses an array of force sensitive resistors (FSRs) with a mechanically pre-loaded structure to perform the force sensing. The sensor has built-in signal conditioning circuitry to provide on-board power regulation, programmable signal amplification, and analog-to-digital conversion. The sensor is inexpensive and highly sensitive to low-amplitude forces, which is critical in surgical applications. We validate the efficacy of the sensor with two surgical applications: robotic palpation for stiffness mapping, and obstacle avoidance for a highly articulated robotic probe (HARP). The results show that the sensor is capable of accurately detecting stiff inclusions embedded in tissue, as well as detecting obstacles and helping the HARP safely navigate around them.