Arrow Research

Author name cluster

Chen Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

33 papers
2 author rows

Possible papers (33)

AAAI Conference 2026 Conference Paper

Deep Research Arena: The First Exam of LLMs’ Research Abilities via Seminar-Grounded Tasks

  • Haiyuan Wan
  • Chen Yang
  • Junchi Yu
  • Meiqi Tu
  • Jiaxuan Lu
  • Di Yu
  • Jianbao Cao
  • Ben Gao

Deep research agents have attracted growing attention for their potential to orchestrate multi-stage research workflows, spanning literature synthesis, methodological design, and empirical verification. Despite these strides, evaluating their research capability faithfully is rather challenging due to the difficulty of collecting frontier research questions that genuinely capture researchers’ attention and intellectual curiosity. To address this gap, we introduce DeepResearch Arena, a benchmark grounded in academic seminars that capture rich expert discourse and interaction, better reflecting real-world research environments and reducing the risk of data leakage. To automatically construct DeepResearch Arena, we propose a Multi-Agent Hierarchical Task Generation (MAHTG) system that extracts research-worthy inspirations from seminar transcripts and translates them into high-quality research tasks, ensuring traceable task formulation while filtering noise. With the MAHTG system, we curate DeepResearch Arena with over 10,000 high-quality research tasks from over 200 academic seminars, spanning 12 disciplines, such as literature, history, and science. Our extensive evaluation shows that DeepResearch Arena presents substantial challenges for current state-of-the-art agents, with clear performance gaps observed across different models.

AAAI Conference 2026 Conference Paper

Dereflection Any Image with Diffusion Priors and Diversified Data

  • Jichen Hu
  • Chen Yang
  • Zanwei Zhou
  • Jiemin Fang
  • Qi Tian
  • Wei Shen

Reflection removal from a single image remains highly challenging due to the complex entanglement between target scenes and unwanted reflections. Despite significant progress, existing methods are hindered by the scarcity of high-quality, diverse data and insufficient restoration priors, resulting in limited generalization across various real-world scenarios. In this paper, we propose Dereflection Any Image, a comprehensive solution with an efficient data preparation pipeline and a generalizable model for robust reflection removal. First, we introduce a dataset named Diverse Reflection Removal (DRR), created by randomly rotating reflective media in target scenes, enabling variation of reflection angles and intensities and setting a new benchmark in scale, quality, and diversity. Second, we propose a diffusion-based framework with one-step diffusion for deterministic outputs and fast inference. To ensure stable learning, we design a three-stage progressive training strategy, including reflection-invariant finetuning to encourage consistent outputs across the varying reflection patterns that characterize our dataset. Extensive experiments show that our method achieves SOTA performance on both common benchmarks and challenging in-the-wild images, demonstrating superior generalization across diverse real-world scenes.

AAAI Conference 2026 Conference Paper

Escaping the CAM Shadow: Uncertainty-Guided Reliable Learning for Weakly Supervised Semantic Segmentation

  • Luyao Chang
  • Leiting Chen
  • Chen Yang
  • Chuan Zhou

Weakly supervised semantic segmentation (WSSS) suffers from an inherent mismatch between coarse image-level annotations and dense pixel-level predictions. To bridge this gap, existing methods primarily focus on generating refined class activation maps (CAMs) as pseudo-labels. However, we argue that this focus is insufficient, as it overlooks a critical component: the segmentation decoder. The decoder is typically trained through superficial alignment of predictions with pseudo-labels in the logit space. Given the noisy nature of such labels, this naive supervision leads to error accumulation and limits performance. To address this, we propose an Uncertainty-Guided Reliable Learning (UGRL) framework that exerts dual control to reshape the learning process, achieving robust supervision that escapes the CAM shadow. The cornerstone of UGRL is a prototype-driven uncertainty modeling module that estimates the reliability of class-wise supervision. The modeled uncertainty enables two synergistic control mechanisms. First, it adaptively modulates classification and segmentation losses, encouraging the model to learn from more trustworthy signals. Second, it guides the structuring of the decoder’s feature space. Rather than relying solely on superficial alignment, UGRL enforces deeper representation alignment by applying contrastive learning on reliable pixels. This enables rich semantic transfer to fine-grained segmentation details. Extensive experiments on PASCAL VOC and MS COCO demonstrate that our method surpasses other state-of-the-art WSSS methods.
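
The loss-modulation mechanism lends itself to a compact illustration. Below is a minimal sketch, assuming an exp(-u) weighting rule; the function name and the weighting choice are ours, not the paper's:

    import torch

    def uncertainty_weighted_loss(pixel_loss, uncertainty):
        # pixel_loss, uncertainty: (B, H, W); higher uncertainty = less reliable.
        weight = torch.exp(-uncertainty)   # reliable pixels keep weight near 1
        return (weight * pixel_loss).mean()

Down-weighting rather than hard-masking keeps some gradient signal from uncertain regions while letting reliable pixels dominate training.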

AAAI Conference 2026 Conference Paper

Few-step Flow for 3D Generation via Marginal-Data Transport Distillation

  • Zanwei Zhou
  • Taoran Yi
  • Jiemin Fang
  • Chen Yang
  • Lingxi Xie
  • Xinggang Wang
  • Wei Shen
  • Qi Tian

Flow-based 3D generation models typically require dozens of sampling steps during inference. Though few-step distillation methods, particularly Consistency Models (CMs), have achieved substantial advancements in accelerating 2D diffusion models, they remain under-explored for more complex 3D generation tasks. In this study, we propose a novel framework, MDT-dist, for few-step 3D flow distillation. Our approach is built upon a primary objective: distilling the pretrained model to learn the Marginal-Data Transport. Directly learning this objective requires integrating the velocity fields, but this integral is intractable to compute. Therefore, we propose two optimizable objectives, Velocity Matching (VM) and Velocity Distillation (VD), which equivalently convert the optimization target from the transport level to the velocity level and the distribution level, respectively. Velocity Matching (VM) learns to stably match the velocity fields between the student and the teacher, but inevitably provides biased gradient estimates. Velocity Distillation (VD) further enhances the optimization process by leveraging the learned velocity fields to perform probability density distillation. When evaluated on the pioneering 3D generation framework TRELLIS, our method reduces the sampling steps of each flow transformer from 25 to 1–2, achieving 0.68s (1 step x2) and 0.94s (2 steps x2) latency with 9.0x and 6.5x speedups on an A800, while preserving high visual and geometric fidelity. Experiments demonstrate that our method significantly outperforms existing CM distillation methods and enables TRELLIS to achieve superior performance in few-step 3D generation.
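
The abstract names the VM objective without stating it. As a hedged illustration in our own notation (not the paper's exact formulation), a velocity-matching loss between a student field $v_\theta$ and a frozen teacher field $v_\phi$ along the flow would typically take the form

    $\mathcal{L}_{\mathrm{VM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\, x_t}\big[\, \| v_\theta(x_t, t) - v_\phi(x_t, t) \|_2^2 \,\big]$

so that the student reproduces the teacher's marginal transport while using far fewer integration steps at inference.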

AAAI Conference 2026 Conference Paper

Learning from Human Gaze: Human-like Robot Social Navigation in Dense Crowds

  • Zhecheng Yu
  • Yan Lyu
  • Chen Yang
  • Tao Chen
  • Yishuang Zhang
  • Bo Ling
  • Peng Wang
  • Guanyu Gao

Robot navigation in dense crowds requires understanding social cues that humans naturally use, yet existing methods struggle with real-world complexity. We investigate two questions: (1) Where do pedestrians look when navigating crowds? and (2) Can eye tracking improve robot navigation? To answer these questions, we introduce GazeNav, an egocentric dataset collected via wearable eye trackers, featuring synchronized video, gaze, and trajectories in crowded environments. Analysis reveals that the gaze of pedestrians is closely related to the semantic presence and movement of other individuals, exhibiting distinct attention patterns across navigation behaviors. Building on this, we propose Gaze2Nav, a modular framework that first predicts human gaze to infer socially salient pedestrians, then incorporates the semantic attention into motion planning alongside visual inputs. Our method achieves 87.6% salient pedestrian prediction accuracy and reduces trajectory error by 15.4% over state-of-the-art baselines. By aligning with human gaze, our framework improves both performance and interpretability, advancing toward human-like, socially intelligent robot navigation.

AAAI Conference 2026 Conference Paper

WorldGrow: Generating Infinite 3D World

  • Sikuang Li
  • Chen Yang
  • Jiemin Fang
  • Taoran Yi
  • Jia Lu
  • Jiazhong Cen
  • Lingxi Xie
  • Wei Shen

We tackle the challenge of generating infinitely extendable 3D worlds: large, continuous environments with coherent geometry and realistic appearance. Existing methods face key challenges: 2D-lifting approaches suffer from geometric and appearance inconsistencies across views, 3D implicit representations are hard to scale up, and current 3D foundation models are mostly object-centric, limiting their applicability to scene-level generation. Our key insight is to leverage strong generation priors from pre-trained 3D models for structured scene block generation. To this end, we propose WorldGrow, a hierarchical framework for unbounded 3D scene synthesis. Our method features three core components: (1) a data curation pipeline that extracts high-quality scene blocks for training, making the 3D structured latent representations suitable for scene generation; (2) a 3D block inpainting mechanism that enables context-aware scene extension; and (3) a coarse-to-fine generation strategy that ensures both global layout plausibility and local geometric/textural fidelity. Evaluated on the large-scale 3D-FRONT dataset, WorldGrow achieves SOTA performance in geometry reconstruction, while uniquely supporting infinite scene generation with photorealistic and structurally consistent outputs. These results highlight its capability for constructing large-scale virtual environments and potential for building future world models.

NeurIPS Conference 2025 Conference Paper

Black-Box Membership Inference Attack for LVLMs via Prior Knowledge-Calibrated Memory Probing

  • Jinhua Yin
  • Peiru Yang
  • Chen Yang
  • Huili Wang
  • Zhiyang Hu
  • Shangguang Wang
  • Yongfeng Huang
  • Tao Qi

Large vision-language models (LVLMs) derive their capabilities from extensive training on vast corpora of visual and textual data. Empowered by large-scale parameters, these models often exhibit strong memorization of their training data, rendering them susceptible to membership inference attacks (MIAs). Existing MIA methods for LVLMs typically operate under white- or gray-box assumptions, extracting likelihood-based features for suspected data samples from the target LVLM. However, mainstream LVLMs generally only expose generated outputs while concealing internal computational features during inference, limiting the applicability of these methods. In this work, we propose the first black-box MIA framework for LVLMs, based on a prior knowledge-calibrated memory probing mechanism. The core idea is to assess the model's memorization of the private semantic information embedded within the suspected image data, which is unlikely to be inferred from general world knowledge alone. We conducted extensive experiments across four LVLMs and three datasets. Empirical results demonstrate that our method effectively identifies training data of LVLMs in a purely black-box setting and even achieves performance comparable to gray-box and white-box methods. Further analysis reveals the robustness of our method against potential adversarial manipulations and the effectiveness of its design choices. Our code and data are available at https://github.com/spmede/KCMP.
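
The calibration idea (compare what the model recalls about an image's private details against what is guessable from general knowledge alone) can be sketched in a few lines. A minimal sketch; query_lvlm and the probe format are hypothetical, and the paper's actual probing protocol may differ:

    def membership_score(query_lvlm, probes):
        # probes: list of (question, private_answer, image) triples about
        # details of the suspected image that world knowledge alone
        # should not reveal.
        scores = []
        for question, answer, image in probes:
            # Recall rate when the model is asked about the suspected image.
            with_image = float(answer in query_lvlm(question, image=image))
            # Prior: is the detail guessable without the image at all?
            prior = float(answer in query_lvlm(question, image=None))
            scores.append(with_image - prior)
        # A high calibrated score suggests the image was in training data.
        return sum(scores) / len(scores)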

NeurIPS Conference 2025 Conference Paper

Geometric Imbalance in Semi-Supervised Node Classification

  • Liang Yan
  • Shengzhong Zhang
  • Bisheng Li
  • Menglin Yang
  • Chen Yang
  • Min Zhou
  • Weiyang Ding
  • Yutong Xie

Class imbalance in graph data presents a significant challenge for effective node classification, particularly in semi-supervised scenarios. In this work, we formally introduce the concept of geometric imbalance, which captures how message passing on class-imbalanced graphs leads to geometric ambiguity among minority-class nodes in the Riemannian manifold embedding space. We provide a rigorous theoretical analysis of geometric imbalance on the Riemannian manifold and propose a unified framework that explicitly mitigates it through pseudo-label alignment, node reordering, and ambiguity filtering. Extensive experiments on diverse benchmarks show that our approach consistently outperforms existing methods, especially under severe class imbalance. Our findings offer new theoretical insights and practical tools for robust semi-supervised node classification.

ICML Conference 2025 Conference Paper

Latent Imputation before Prediction: A New Computational Paradigm for De Novo Peptide Sequencing

  • Ye Du
  • Chen Yang
  • Nanxi Yu
  • Wanyu Lin
  • Qian Zhao
  • Shujun Wang

De novo peptide sequencing is a fundamental computational technique for ascertaining amino acid sequences of peptides directly from tandem mass spectrometry data, eliminating the need for reference databases. Cutting-edge models encode the observed mass spectra into latent representations from which peptides are predicted auto-regressively. However, the issue of missing fragmentation, attributable to factors such as suboptimal fragmentation efficiency and instrumental constraints, presents a formidable challenge in practical applications. To tackle this obstacle, we propose a novel computational paradigm called Latent Imputation before Prediction (LIPNovo). LIPNovo is devised to compensate for missing fragmentation information within observed spectra before executing the final peptide prediction. Rather than generating raw missing data, LIPNovo performs imputation in the latent space, guided by the theoretical peak profile of the target peptide sequence. The imputation process is conceptualized as a set-prediction problem, utilizing a set of learnable peak queries to reason about the relationships among observed peaks and directly generate the latent representations of theoretical peaks through optimal bipartite matching. In this way, LIPNovo manages to supplement missing information during inference and thus boosts performance. Despite its simplicity, experiments on three benchmark datasets demonstrate that LIPNovo outperforms state-of-the-art methods by large margins. Code is available at https://github.com/usr922/LIPNovo.
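
The matching step of such set-prediction formulations is commonly solved with the Hungarian algorithm. A minimal sketch of that step; the shapes and the cost definition are our assumptions, not the paper's exact formulation:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_queries_to_peaks(query_latents, theoretical_latents):
        # query_latents: (Q, D) array; theoretical_latents: (P, D), Q >= P.
        diff = query_latents[:, None, :] - theoretical_latents[None, :, :]
        cost = (diff ** 2).sum(axis=-1)            # (Q, P) pairwise squared L2
        rows, cols = linear_sum_assignment(cost)   # optimal one-to-one matching
        return rows, cols                          # matched (query, peak) indices

Matched queries can then be supervised to reproduce the latent representations of their assigned theoretical peaks.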

NeurIPS Conference 2025 Conference Paper

MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

  • Ziyang Ma
  • Yinghao Ma
  • Yanqiao Zhu
  • Chen Yang
  • Yi-Wen Chao
  • Ruiyang Xu
  • Wenxi Chen
  • Yuanzhe Chen

We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across massive multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error corrections and quality checks to ensure high quality. Unlike existing benchmarks that are limited to specific domains of sound, music, or speech, MMAR extends them to a broad spectrum of real-world audio scenarios, including mixed-modality combinations of sound, music, and speech. Each question in MMAR is hierarchically categorized across four reasoning layers: Signal, Perception, Semantic, and Cultural, with additional sub-categories within each layer to reflect task diversity and complexity. To further foster research in this area, we annotate every question with a Chain-of-Thought (CoT) rationale to promote future advancements in audio reasoning. Each item in the benchmark demands multi-step deep reasoning beyond surface-level understanding. Moreover, a part of the questions requires graduate-level perceptual and domain-specific knowledge, elevating the benchmark's difficulty and depth. We evaluate MMAR using a broad set of models, including Large Audio-Language Models (LALMs), Large Audio Reasoning Models (LARMs), Omni Language Models (OLMs), Large Language Models (LLMs), and Large Reasoning Models (LRMs), with audio caption inputs. The performance of these models on MMAR highlights the benchmark's challenging nature, and our analysis further reveals critical limitations of understanding and reasoning capabilities among current models. These findings underscore the urgent need for greater research attention in audio-language reasoning, including both data and algorithm innovation. We hope MMAR will serve as a catalyst for future advances in this important but little-explored area.

ICML Conference 2025 Conference Paper

Provable Zero-Shot Generalization in Offline Reinforcement Learning

  • Zhiyong Wang
  • Chen Yang
  • John C. S. Lui
  • Dongruo Zhou

In this work, we study offline reinforcement learning (RL) with the zero-shot generalization (ZSG) property, where the agent has access to an offline dataset including experiences from different environments, and the goal of the agent is to train a policy over the training environments that performs well on test environments without further interaction. Existing work has shown that classical offline RL fails to generalize to new, unseen environments. We propose pessimistic empirical risk minimization (PERM) and pessimistic proximal policy optimization (PPPO), which leverage pessimistic policy evaluation to guide policy learning and enhance generalization. We show that both PERM and PPPO are capable of finding a near-optimal policy with ZSG. Our result serves as a first step in understanding the foundation of the generalization phenomenon in offline reinforcement learning.
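
For intuition, pessimistic policy evaluation typically lower-bounds a policy's value with an uncertainty penalty before optimizing. A generic template in our notation (PERM and PPPO instantiate pessimism in their own specific ways):

    $\hat{V}^{\pi}(s_1) = \widehat{\mathbb{E}}^{\pi}\Big[\sum_{h=1}^{H} r_h\Big] - \beta \sum_{h=1}^{H} \widehat{\mathbb{E}}^{\pi}\big[\Gamma_h(s_h, a_h)\big], \qquad \hat{\pi} = \arg\max_{\pi} \hat{V}^{\pi}(s_1)$

where $\Gamma_h$ quantifies how poorly the offline dataset covers $(s_h, a_h)$, so the learned policy avoids regions the data cannot certify.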

UAI Conference 2025 Conference Paper

Sample and Computationally Efficient Continuous-Time Reinforcement Learning with General Function Approximation

  • Runze Zhao
  • Yue Yu
  • Adams Yiyue Zhu
  • Chen Yang
  • Dongruo Zhou

Continuous-time reinforcement learning (CTRL) provides a principled framework for sequential decision-making in environments where interactions evolve continuously over time. Despite its empirical success, the theoretical understanding of CTRL remains limited, especially in settings with general function approximation. In this work, we propose a model-based CTRL algorithm that achieves both sample and computational efficiency. Our approach leverages optimism-based confidence sets to establish the first sample complexity guarantee for CTRL with general function approximation, showing that a near-optimal policy can be learned with a suboptimality gap of $\tilde{O}(\sqrt{d_{\mathcal{R}} + d_{\mathcal{F}}}N^{-1/2})$ using $N$ measurements, where $d_{\mathcal{R}}$ and $d_{\mathcal{F}}$ denote the distributional Eluder dimensions of the reward and dynamic functions, respectively, capturing the complexity of general function approximation in reinforcement learning. Moreover, we introduce structured policy updates and an alternative measurement strategy that significantly reduce the number of policy updates and rollouts while maintaining competitive sample efficiency. Our proposed algorithms are validated through experiments on continuous control tasks and diffusion model fine-tuning, demonstrating comparable performance with significantly fewer policy updates and rollouts.

AAAI Conference 2025 Conference Paper

Segment Any 3D Gaussians

  • Jiazhong Cen
  • Jiemin Fang
  • Chen Yang
  • Lingxi Xie
  • Xiaopeng Zhang
  • Wei Shen
  • Qi Tian

This paper presents SAGA (Segment Any 3D GAussians), a highly efficient 3D promptable segmentation method based on 3D Gaussian Splatting (3D-GS). Given 2D visual prompts as input, SAGA can segment the corresponding 3D target represented by 3D Gaussians within 4 ms. This is achieved by attaching a scale-gated affinity feature to each 3D Gaussian to endow it with a new property for multi-granularity segmentation. Specifically, a scale-aware contrastive training strategy is proposed for the scale-gated affinity feature learning. It 1) distills the segmentation capability of the Segment Anything Model (SAM) from 2D masks into the affinity features and 2) employs a soft scale gate mechanism to deal with multi-granularity ambiguity in 3D segmentation by adjusting the magnitude of each feature channel according to a specified 3D physical scale. Evaluations demonstrate that SAGA achieves real-time multi-granularity segmentation with quality comparable to state-of-the-art methods. As one of the first methods addressing promptable segmentation in 3D-GS, the simplicity and effectiveness of SAGA pave the way for future advancements in this field.
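
The soft scale gate admits a compact illustration: channel magnitudes of a per-Gaussian affinity feature are modulated by a gate conditioned on a 3D physical scale. A minimal sketch with illustrative layer sizes; the module name and architecture are our assumptions, not the paper's:

    import torch
    import torch.nn as nn

    class ScaleGatedAffinity(nn.Module):
        def __init__(self, feat_dim=32):
            super().__init__()
            # Map a scalar 3D physical scale to per-channel gates in (0, 1).
            self.gate = nn.Sequential(nn.Linear(1, feat_dim), nn.Sigmoid())

        def forward(self, affinity, scale):
            # affinity: (N, feat_dim) per-Gaussian features; scale: (1,) tensor.
            return affinity * self.gate(scale)  # gate broadcasts over N Gaussians

Varying the scale input changes which feature channels dominate, which is one way to realize the multi-granularity behavior the abstract describes.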

NeurIPS Conference 2025 Conference Paper

SeniorTalk: A Chinese Conversation Dataset with Rich Annotations for Super-Aged Seniors

  • Chen Yang
  • Hui Wang
  • Shiyao Wang
  • Junyang Chen
  • Jiabei He
  • Jiaming Zhou
  • Xi Yang
  • Yequan Wang

While voice technologies increasingly serve aging populations, current systems exhibit significant performance gaps due to inadequate training data capturing elderly-specific vocal characteristics like presbyphonia and dialectal variations. The limited data available on super-aged individuals in existing elderly speech datasets, coupled with overly simple recording styles and annotation dimensions, exacerbates this issue. To address the critical scarcity of speech data from individuals aged 75 and above, we introduce SeniorTalk, a carefully annotated Chinese spoken dialogue dataset. This dataset contains 55.53 hours of speech from 101 natural conversations involving 202 participants, ensuring a strategic balance across gender, region, and age. Through detailed annotation across multiple dimensions, it can support a wide range of speech tasks. We perform extensive experiments on speaker verification, speaker diarization, speech recognition, and speech editing tasks, offering crucial insights for the development of speech technologies targeting this age group. Code is available at https://github.com/flageval-baai/SeniorTalk and data at https://huggingface.co/datasets/evan0617/seniortalk.

ICRA Conference 2025 Conference Paper

WcDT: World-Centric Diffusion Transformer for Traffic Scene Generation

  • Chen Yang
  • Yangfan He
  • Aaron Xuxiang Tian
  • Dong Chen
  • Jianhui Wang
  • Tianyu Shi
  • Arsalan Heydarian
  • Pei Liu

In this paper, we introduce a novel approach for autonomous driving trajectory generation by harnessing the complementary strengths of diffusion probabilistic models (a.k.a. diffusion models) and transformers. Our proposed framework, termed the “World-centric Diffusion Transformer” (WcDT), optimizes the entire trajectory generation process, from feature extraction to model inference. To enhance scene diversity and stochasticity, the historical trajectory data is first preprocessed into “Agent Move Statement” form and encoded into latent space using Denoising Diffusion Probabilistic Models (DDPM) enhanced with Diffusion with Transformer (DiT) blocks. Then, the latent features, historical trajectories, HD map features, and historical traffic signal information are fused with various transformer-based encoders that are used to enhance the interactions of agents with other elements in the traffic scene. The encoded traffic scenes are then decoded by a trajectory decoder to generate multimodal future trajectories. Comprehensive experimental results show that the proposed approach exhibits superior performance in generating both realistic and diverse trajectories, showing its potential for integration into autonomous driving simulation systems. Our code is available at https://github.com/yangchen1997/WcDT.
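
For reference, the standard DDPM forward (noising) process that this kind of latent encoding builds on is

    $q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\big), \qquad \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$

with noise schedule $\beta_s$; the DiT blocks then learn the reverse process in the latent trajectory space. The specific schedule and parameterization used by WcDT are not given in the abstract.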

NeurIPS Conference 2024 Conference Paper

GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks

  • Yu Zhang
  • Changhao Pan
  • Wenxiang Guo
  • Ruiqi Li
  • Zhiyuan Zhu
  • Jialei Wang
  • Wenhao Xu
  • Jingyu Lu

The scarcity of high-quality and multi-task singing datasets significantly hinders the development of diverse controllable and personalized singing tasks, as existing singing datasets suffer from low quality, limited diversity of languages and singers, absence of multi-technique information and realistic music scores, and poor task suitability. To tackle these problems, we present GTSinger, a large Global, multi-Technique, free-to-use, high-quality singing corpus with realistic music scores, designed for all singing tasks, along with its benchmarks. Particularly, (1) we collect 80.59 hours of high-quality singing voices, forming the largest recorded singing dataset; (2) 20 professional singers across nine widely spoken languages offer diverse timbres and styles; (3) we provide controlled comparison and phoneme-level annotations of six commonly used singing techniques, helping technique modeling and control; (4) GTSinger offers realistic music scores, assisting real-world musical composition; (5) singing voices are accompanied by manual phoneme-to-audio alignments, global style labels, and 16.16 hours of paired speech for various singing tasks. Moreover, to facilitate the use of GTSinger, we conduct four benchmark experiments: technique-controllable singing voice synthesis, technique recognition, style transfer, and speech-to-singing conversion.

AAMAS Conference 2023 Conference Paper

Learning Individual Difference Rewards in Multi-Agent Reinforcement Learning

  • Chen Yang
  • Guangkai Yang
  • Junge Zhang

We investigate explicit solutions to the multi-agent credit assignment problem. Specifically, we assign each agent individual difference rewards in addition to the team reward so as to distinguish the contributions of different agents to the team. We present a novel reward decomposition network to estimate the influence of each agent’s action on the team reward and distribute difference rewards accordingly. Furthermore, we combine difference rewards with the actor-critic framework and propose a new approach called learning individual difference rewards (LIDR). We evaluate LIDR on a set of StarCraft II micromanagement problems. Results show that LIDR significantly outperforms previous state-of-the-art methods.
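
For context, the classical difference reward that such decomposition networks learn to approximate is, in the standard notation of the difference-rewards literature (not the paper's own),

    $D_i(z) = G(z) - G(z_{-i})$

where $G$ is the team reward and $z_{-i}$ denotes the joint state-action with agent $i$'s action replaced by a default (null) action. Computing $G(z_{-i})$ exactly requires counterfactual simulation, which is why estimating it with a learned network is attractive.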

NeurIPS Conference 2023 Conference Paper

Rethinking Semi-Supervised Imbalanced Node Classification from Bias-Variance Decomposition

  • Divin Yan
  • Gengchen Wei
  • Chen Yang
  • Shengzhong Zhang
  • Zengfeng Huang

This paper introduces a new approach to address the issue of class imbalance in graph neural networks (GNNs) for learning on graph-structured data. Our approach integrates imbalanced node classification and bias-variance decomposition, establishing a theoretical framework that closely relates data imbalance to model variance. We also leverage a graph augmentation technique to estimate the variance and design a regularization term to alleviate the impact of imbalance. Exhaustive tests are conducted on multiple benchmarks, including naturally imbalanced datasets and public-split class-imbalanced datasets, demonstrating that our approach outperforms state-of-the-art methods in various imbalanced scenarios. This work provides a novel theoretical perspective for addressing the problem of imbalanced node classification in GNNs.
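
For reference, the standard bias-variance decomposition under squared loss, which the paper adapts to node classification, is

    $\mathbb{E}_{\mathcal{D},\varepsilon}\big[(y - \hat{f}_{\mathcal{D}}(x))^2\big] = \big(f(x) - \bar{f}(x)\big)^2 + \mathbb{E}_{\mathcal{D}}\big[(\hat{f}_{\mathcal{D}}(x) - \bar{f}(x))^2\big] + \sigma^2, \qquad \bar{f}(x) = \mathbb{E}_{\mathcal{D}}[\hat{f}_{\mathcal{D}}(x)]$

with targets $y = f(x) + \varepsilon$ and noise variance $\sigma^2$. The paper's contribution is tying the variance term to class imbalance and estimating it via graph augmentation.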

NeurIPS Conference 2023 Conference Paper

Segment Anything in 3D with NeRFs

  • Jiazhong Cen
  • Zanwei Zhou
  • Jiemin Fang
  • Chen Yang
  • Wei Shen
  • Lingxi Xie
  • Dongsheng Jiang
  • Xiaopeng Zhang

Recently, the Segment Anything Model (SAM) emerged as a powerful vision foundation model capable of segmenting anything in 2D images. This paper aims to generalize SAM to segment 3D objects. Rather than replicating the data acquisition and annotation procedure, which is costly in 3D, we design an efficient solution, leveraging the Neural Radiance Field (NeRF) as a cheap and off-the-shelf prior that connects multi-view 2D images to the 3D space. We refer to the proposed solution as SA3D, for Segment Anything in 3D. It requires only a manual segmentation prompt (e.g., rough points) for the target object in a single view, which is used to generate its 2D mask in this view with SAM. Next, SA3D alternately performs mask inverse rendering and cross-view self-prompting across various views to iteratively complete the 3D mask of the target object, constructed with voxel grids. The former projects the 2D mask obtained by SAM in the current view onto the 3D mask with guidance from the density distribution learned by the NeRF; the latter automatically extracts reliable prompts, as input to SAM, from the NeRF-rendered 2D mask in another view. We show in experiments that SA3D adapts to various scenes and achieves 3D segmentation within minutes. Our research offers a generic and efficient methodology to lift a 2D vision foundation model to 3D, as long as the 2D model can steadily address promptable segmentation across multiple views.
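
The mask inverse rendering step admits a compact illustration: push each ray's 2D mask value back onto the voxels it traverses, weighted by the NeRF's volume-rendering weights. A minimal sketch; the tensor names and the additive accumulation rule are our assumptions, not SA3D's exact update:

    import torch

    def mask_inverse_render(voxel_mask, sample_idx, render_weights, mask_2d):
        # voxel_mask: (V,) flattened 3D mask grid, updated in place.
        # sample_idx: (R, S) long tensor, voxel hit by each sample on each ray.
        # render_weights: (R, S) volume-rendering weights from the NeRF density.
        # mask_2d: (R,) SAM mask value of the pixel each ray passes through.
        contrib = render_weights * mask_2d[:, None]
        voxel_mask.index_add_(0, sample_idx.reshape(-1), contrib.reshape(-1))
        return voxel_mask

Weighting by the rendering weights concentrates mask evidence near surfaces, where the density (and hence the weight) peaks.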

AAAI Conference 2022 Conference Paper

TDv2: A Novel Tree-Structured Decoder for Offline Mathematical Expression Recognition

  • Changjie Wu
  • Jun Du
  • Yunqing Li
  • Jianshu Zhang
  • Chen Yang
  • Bo Ren
  • Yiqing Hu

In recent years, tree decoders have become more popular than LaTeX string decoders in the field of handwritten mathematical expression recognition (HMER), as they can capture the hierarchical tree structure of mathematical expressions. However, previous tree decoders converted the tree structure labels into a fixed and ordered sequence, which could not make full use of the diversified expression of tree labels. In this study, we propose a novel tree decoder (TDv2) to fully utilize the tree structure labels. Compared with previous tree decoders, this new model does not require a fixed priority for different branches of a node during training and inference, which can effectively improve the model's generalization capability. The input and output of the model make full use of the tree structure labels, so there is no need to find the parent node in the decoding process, which simplifies decoding and adds a priori information to help predict each node. We verified the effectiveness of each part of the model through comprehensive ablation experiments and attention visualization analysis. On the authoritative CROHME 14/16/19 datasets, our method achieves state-of-the-art results.

AAAI Conference 2021 Conference Paper

LRSC: Learning Representations for Subspace Clustering

  • Changsheng Li
  • Chen Yang
  • Bo Liu
  • Ye Yuan
  • Guoren Wang

Deep learning based subspace clustering methods have attracted increasing attention in recent years, where a basic theme is to non-linearly map data into a latent space and then uncover subspace structures based upon the data self-expressiveness property. However, almost all existing deep subspace clustering methods rely only on target domain data and resort to shallow neural networks for modeling data, leaving substantial room to design more effective representation learning mechanisms tailored for subspace clustering. In this paper, we propose a novel subspace clustering framework that learns precise sample representations. In contrast to previous approaches, the proposed method leverages external data by constructing many relevant tasks to guide the training of the encoder, motivated by the idea of meta-learning. Considering the limited network depth of current deep subspace clustering models, we also distill knowledge from a deeper network trained on the external data and transfer it into the shallower model. To reach these two goals, we propose a new loss function that realizes them in a unified framework. Moreover, we construct a new auxiliary task for self-supervised training of the model, so that its representation ability can be further improved. Extensive experiments are performed on four publicly available datasets, and experimental results clearly demonstrate the efficacy of our method compared to state-of-the-art methods.
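
For reference, the self-expressiveness property the abstract builds on is usually written as the objective (standard notation from the subspace clustering literature, not the paper's own):

    $\min_{C}\ \tfrac{1}{2}\|Z - ZC\|_F^2 + \lambda \|C\|_p \quad \text{s.t.}\ \ \mathrm{diag}(C) = 0$

where $Z$ holds the latent representations produced by the encoder, and the learned coefficient matrix $C$ induces the affinity $|C| + |C|^\top$ used for spectral clustering.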

JBHI Journal 2021 Journal Article

Mutual-Prototype Adaptation for Cross-Domain Polyp Segmentation

  • Chen Yang
  • Xiaoqing Guo
  • Meilu Zhu
  • Bulat Ibragimov
  • Yixuan Yuan

Accurate segmentation of polyps from colonoscopy images provides useful information for the diagnosis and treatment of colorectal cancer. Although deep learning methods have advanced automatic polyp segmentation, their performance often degrades when applied to new data acquired from different scanners or sequences (target domain). As manual annotation is tedious and labor-intensive for a new target domain, leveraging knowledge learned from the labeled source domain to promote performance in the unlabeled target domain is highly desirable. In this work, we propose a mutual-prototype adaptation network to eliminate domain shifts in multi-center and multi-device colonoscopy images. We first devise a mutual-prototype alignment (MPA) module with a prototype relation function to refine features through self-domain and cross-domain information in a coarse-to-fine process. Two auxiliary modules, progressive self-training (PST) and disentangled reconstruction (DR), are then proposed to improve the segmentation performance. The PST module selects reliable pseudo labels through a novel uncertainty-guided self-training loss to obtain accurate prototypes in the target domain. The DR module reconstructs the original images jointly from prediction results and private prototypes to maintain semantic consistency and provide complementary supervision. We extensively evaluate the proposed model on three conventional colonoscopy datasets: CVC-DB, Kvasir-SEG, and ETIS-Larib. The comprehensive experimental results demonstrate that the proposed model outperforms state-of-the-art methods.
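
Prototypes in adaptation methods of this kind are commonly obtained by masked average pooling of features under the predicted class probability. A minimal sketch of that common construction; the abstract does not give the paper's exact prototype relation function:

    import torch

    def class_prototype(features, prob_map, eps=1e-6):
        # features: (B, C, H, W); prob_map: (B, 1, H, W) class probability.
        weighted = (features * prob_map).sum(dim=(0, 2, 3))   # (C,)
        return weighted / (prob_map.sum() + eps)              # normalized prototype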

ICRA Conference 2020 Conference Paper

Modeling and Experiments on the Swallowing and Disgorging Characteristics of an Underwater Continuum Manipulator

  • Haihang Wang
  • He Xu
  • Fengshu Yu
  • Xin Li
  • Chen Yang
  • Siqing Chen
  • Junlong Chen
  • Yonghui Zhang

Soft robots use compliant materials to perform motions and behaviors not typically achievable by rigid robots. An underwater, compliant, multi-segment continuum manipulator that can bend, swallow, and disgorge is developed in this study. The manipulator is driven by McKibben water hydraulic artificial muscles (WHAMs). The mechanical properties of the WHAM are tested and analyzed experimentally. A kinematic model, which accounts for the variable-diameter structure of the soft grippers, is established to simulate the behavior of the manipulator during the bending, swallowing, and disgorging procedures. A mouth-tongue collaborative soft robot, assembled with another single-segment soft robot arm, is presented, and its distinctive functions are verified experimentally.

IROS Conference 2017 Conference Paper

Development of an inexpensive tri-axial force sensor for minimally invasive surgery

  • Lu Li
  • Bocheng Yu
  • Chen Yang
  • Prasad Vagdargi
  • Rangaprasad Arun Srivatsan
  • Howie Choset

This work presents the design and evaluation of a low-cost tri-axial force sensor that has been developed to regain the sense of touch in minimally invasive surgery (MIS). The force sensor uses an array of force sensitive resistors (FSRs) with a mechanically pre-loaded structure to perform the force sensing. The sensor has built-in signal conditioning circuitry to provide on-board power regulation, programmable signal amplification, and analog-to-digital conversion. The sensor is inexpensive and highly sensitive to low-amplitude forces, which is critical in surgical applications. We validate the efficacy of the sensor with two surgical applications: robotic palpation for stiffness mapping, and obstacle avoidance for a highly articulated robotic probe (HARP). The results show that the sensor is capable of accurately detecting stiff inclusions embedded in tissue, as well as detecting obstacles and helping the HARP safely navigate around them.