Arrow Research

Author name cluster

Ta-Ying Cheng

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers
1 author row

Possible papers (5)

AAAI 2026 Conference Paper

Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs

  • Chun-Hsiao Yeh
  • Chenyu Wang
  • Shengbang Tong
  • Ta-Ying Cheng
  • Ruoyu Wang
  • Tianzhe Chu
  • Yuexiang Zhai
  • Yubei Chen

Multi-view understanding, the ability to reconcile visual information across diverse viewpoints for effective navigation, manipulation, and 3D scene comprehension, is a fundamental challenge for Multi-Modal Large Language Models (MLLMs) intended to serve as embodied agents. While recent MLLMs have shown impressive advances in high-level reasoning and planning, they frequently fall short when confronted with multi-view geometric consistency and cross-view correspondence. To comprehensively evaluate the challenges of MLLMs in multi-view scene reasoning, we introduce All-Angles Bench, a carefully human-annotated benchmark with over 2,100 question-answer pairs from 90 diverse, real-world scenes. Our broad evaluation across 38 general-purpose and 3D spatial reasoning MLLMs reveals a substantial performance gap compared to humans. More critically, our analysis identifies two root failure modes: (1) cross-view object mismatch, the inability to establish consistent object correspondence across views; and (2) cross-view spatial misalignment, the failure to infer accurate camera poses and spatial layouts. These findings underscore a lack of multi-view awareness in current MLLMs, calling for architectural innovations beyond prompt tuning alone. We believe that our benchmark offers valuable insights toward building spatially intelligent MLLMs.
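
The benchmark itself is not reproduced on this page, but as a rough illustration of how a multiple-choice, multi-view QA benchmark of this kind is typically consumed, here is a minimal Python sketch; the entry schema, field names, and the ask_mllm callable are assumptions of mine, not the paper's released format.

from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class MultiViewQA:
    """Hypothetical schema for one benchmark entry (assumed, not the released format)."""
    image_paths: Sequence[str]   # multiple views of the same scene
    question: str
    choices: Sequence[str]       # multiple-choice options
    answer_idx: int              # index of the correct choice

def evaluate(entries: Sequence[MultiViewQA],
             ask_mllm: Callable[[Sequence[str], str, Sequence[str]], int]) -> float:
    """Return accuracy of a model callable over the benchmark entries.

    ask_mllm is any user-supplied function that takes (views, question, choices)
    and returns the index of the chosen option.
    """
    correct = 0
    for entry in entries:
        pred = ask_mllm(entry.image_paths, entry.question, entry.choices)
        correct += int(pred == entry.answer_idx)
    return correct / max(len(entries), 1)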

NeurIPS 2024 Conference Paper

SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors

  • Chenyang Ma
  • Kai Lu
  • Ta-Ying Cheng
  • Niki Trigoni
  • Andrew Markham

Current state-of-the-art spatial reasoning-enhanced VLMs are trained to excel at spatial visual question answering (VQA). However, we believe that higher-level 3D-aware tasks, such as articulating dynamic scene changes and motion planning, require a fundamental and explicit 3D understanding beyond current spatial VQA datasets. In this work, we present SpatialPIN, a framework designed to enhance the spatial reasoning capabilities of VLMs through prompting and interacting with priors from multiple 3D foundation models in a zero-shot, training-free manner. Extensive experiments demonstrate that our spatial reasoning-imbued VLM performs well on various forms of spatial VQA and can extend to help in various downstream robotics tasks such as pick and stack and trajectory planning.
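
As a hedged sketch of what a zero-shot, training-free pipeline of this general shape could look like, the Python code below serializes outputs of 3D priors into a text prompt for a VLM; the injected callables (estimate_depth, segment_objects, query_vlm) and the prompt wording are placeholders of my own, not SpatialPIN's actual interface.

from typing import Callable, Dict, List

def spatial_prompt_pipeline(
    image_path: str,
    question: str,
    estimate_depth: Callable[[str], Dict[str, float]],   # object name -> approx. depth (m)
    segment_objects: Callable[[str], List[str]],          # object names present in the image
    query_vlm: Callable[[str, str], str],                 # (image_path, prompt) -> answer
) -> str:
    """Training-free sketch: enrich a VLM prompt with cues from 3D foundation models.

    All three callables are user-supplied stand-ins for off-the-shelf priors
    (e.g. a monocular depth estimator, an open-vocabulary segmenter, a VLM).
    """
    objects = segment_objects(image_path)
    depths = estimate_depth(image_path)
    # Serialize the 3D cues into plain text the VLM can condition on.
    cues = "; ".join(f"{name} at ~{depths.get(name, float('nan')):.2f} m" for name in objects)
    prompt = (
        f"Scene objects and approximate depths: {cues}.\n"
        f"Using these 3D cues, answer: {question}"
    )
    return query_vlm(image_path, prompt)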

NeurIPS 2024 Conference Paper

Towards Learning Group-Equivariant Features for Domain Adaptive 3D Detection

  • Sangyun Shin
  • Yuhang He
  • Madhu Vankadari
  • Ta-Ying Cheng
  • Qian Xie
  • Andrew Markham
  • Niki Trigoni

The performance of 3D object detection in large outdoor point clouds deteriorates significantly in an unseen environment due to the inter-domain gap. To address these challenges, most existing methods for domain adaptation harness self-training schemes and attempt to bridge the gap by focusing on a single factor that causes the inter-domain gap, such as objects' sizes, shapes, and foreground density variation. However, the resulting adaptations suggest that there is still a substantial inter-domain gap left to be minimized. We argue that this is due to two limitations: 1) biased pseudo-label collection from self-training, and 2) multiple factors jointly contributing to how an object is perceived in the unseen target domain. In this work, we propose a grouping-exploration framework, Group Explorer Domain Adaptation (GroupEXP-DA), to address these two issues. Specifically, our grouping divides the available label sets into multiple clusters and ensures that all of them receive equal learning attention through the group-equivariant spatial feature, preventing dominant object types from causing imbalance problems. Moreover, the grouping learns to divide objects by considering inherent factors in a data-driven manner, rather than treating each factor separately as existing works do. On top of the group-equivariant spatial feature, which selectively detects objects similar to the input group, we additionally introduce an explorative group update strategy that reduces false negative detections in the target domain, further narrowing the inter-domain gap. During inference, only the learned group features are needed to construct the group-equivariant spatial feature, making our method a simple add-on applicable to most existing detectors. We show how each module contributes to substantially bridging the inter-domain gap compared to existing works across large urban outdoor datasets such as NuScenes, Waymo, and KITTI.
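
As an illustration of the data-driven grouping idea only (not the paper's actual learned module), the Python sketch below clusters ground-truth boxes by simple geometric attributes and derives per-group sampling weights so that no dominant object type swamps training; the attribute choice and the use of k-means are assumptions of mine.

import numpy as np
from sklearn.cluster import KMeans

def group_labels(boxes: np.ndarray, num_groups: int = 4, seed: int = 0):
    """Cluster 3D boxes (N, 3) described by (length, width, height) into groups
    and return (group_ids, per_sample_weights) giving each group equal attention.

    This is a minimal stand-in for a learned, data-driven grouping.
    """
    km = KMeans(n_clusters=num_groups, random_state=seed, n_init=10)
    group_ids = km.fit_predict(boxes)
    counts = np.bincount(group_ids, minlength=num_groups).astype(float)
    # Inverse-frequency weights: rare groups are up-weighted so every group
    # contributes roughly equally to sampling or the loss.
    weights = (counts.sum() / (num_groups * counts))[group_ids]
    return group_ids, weights

# Example: boxes with a dominant small-object regime and a few rare large objects.
boxes = np.vstack([np.random.rand(90, 3) + 1.0, np.random.rand(10, 3) + 5.0])
gids, w = group_labels(boxes, num_groups=2)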

NeurIPS 2023 Conference Paper

Multi-body SE(3) Equivariance for Unsupervised Rigid Segmentation and Motion Estimation

  • Jia-Xing Zhong
  • Ta-Ying Cheng
  • Yuhang He
  • Kai Lu
  • Kaichen Zhou
  • Andrew Markham
  • Niki Trigoni

A truly generalizable approach to rigid segmentation and motion estimation is fundamental to 3D understanding of articulated objects and moving scenes. In view of the closely intertwined relationship between segmentation and motion estimates, we present an SE(3) equivariant architecture and a training strategy to tackle this task in an unsupervised manner. Our architecture is composed of two interconnected, lightweight heads. These heads predict segmentation masks using point-level invariant features and estimate motion from SE(3) equivariant features, all without the need for category information. Our training strategy is unified and can be implemented online, which jointly optimizes the predicted segmentation and motion by leveraging the interrelationships among scene flow, segmentation mask, and rigid transformations. We conduct experiments on four datasets to demonstrate the superiority of our method. The results show that our method excels in both model performance and computational efficiency, with only 0.25M parameters and 0.92G FLOPs. To the best of our knowledge, this is the first work designed for category-agnostic part-level SE(3) equivariance in dynamic point clouds.
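
The joint optimization described above couples scene flow, segmentation masks, and rigid transformations; as a rough Python sketch of that coupling (my own simplification, not the paper's architecture), the code below fits a rigid SE(3) transform per soft segment from predicted flow via a weighted Kabsch solve and measures how well the rigid fits explain the flow.

import numpy as np

def weighted_rigid_fit(p: np.ndarray, q: np.ndarray, w: np.ndarray):
    """Fit R, t minimizing sum_i w_i * ||R p_i + t - q_i||^2 (weighted Kabsch)."""
    w = w / (w.sum() + 1e-8)
    p_bar, q_bar = w @ p, w @ q
    h = (p - p_bar).T @ (w[:, None] * (q - q_bar))
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return r, q_bar - r @ p_bar

def rigid_consistency(points: np.ndarray, flow: np.ndarray, soft_masks: np.ndarray) -> float:
    """Residual between predicted flow and the best per-segment rigid explanation.

    points: (N, 3), flow: (N, 3), soft_masks: (N, K) soft segmentation.
    A low value means the segmentation and motion agree on a multi-body rigid scene.
    """
    targets = points + flow
    residual = 0.0
    for k in range(soft_masks.shape[1]):
        r, t = weighted_rigid_fit(points, targets, soft_masks[:, k])
        pred = points @ r.T + t
        residual += float((soft_masks[:, k] * np.linalg.norm(pred - targets, axis=1) ** 2).sum())
    return residual / points.shape[0]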

AAAI 2022 Conference Paper

Pose Adaptive Dual Mixup for Few-Shot Single-View 3D Reconstruction

  • Ta-Ying Cheng
  • Hsuan-Ru Yang
  • Niki Trigoni
  • Hwann-Tzong Chen
  • Tyng-Luh Liu

We present a pose adaptive few-shot learning procedure and a two-stage data interpolation regularization, termed Pose Adaptive Dual Mixup (PADMix), for single-image 3D reconstruction. While augmentations via interpolating feature-label pairs are effective in classification tasks, they fall short in shape predictions, potentially due to inconsistencies between interpolated products of two images and volumes when rendering viewpoints are unknown. PADMix targets this issue with two sets of mixup procedures performed sequentially. We first perform an input mixup which, combined with a pose adaptive learning procedure, is helpful in learning 2D feature extraction and pose adaptive latent encoding. The stagewise training allows us to build upon the pose-invariant representations to perform a follow-up latent mixup under one-to-one correspondences between features and ground-truth volumes. PADMix significantly outperforms previous literature in few-shot settings on the ShapeNet dataset and sets new benchmarks on the more challenging real-world Pix3D dataset.
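
For readers unfamiliar with the underlying mixup operation, the Python sketch below shows generic input mixup and a latent/volume mixup sharing one interpolation coefficient, in the spirit of the two-stage procedure described above; the tensor shapes, the Beta parameter, and the assumption that latents and ground-truth volumes are mixed with the same lambda are my own simplifications, not the exact PADMix recipe.

import numpy as np

def mixup_pair(a: np.ndarray, b: np.ndarray, lam: float) -> np.ndarray:
    """Convex combination of two same-shaped arrays (standard mixup)."""
    return lam * a + (1.0 - lam) * b

def padmix_like_step(images, latents, volumes, alpha: float = 0.4, rng=None):
    """Two-stage mixup sketch: mix inputs, then mix (latent, volume) pairs.

    images:  (B, H, W, 3) input views
    latents: (B, D) pose-invariant latent codes (assumed already encoded)
    volumes: (B, V, V, V) ground-truth occupancy volumes
    """
    rng = rng or np.random.default_rng()
    perm = rng.permutation(len(images))
    lam = rng.beta(alpha, alpha)
    # Stage 1: input mixup on images (labels are mixed analogously during training).
    mixed_images = mixup_pair(images, images[perm], lam)
    # Stage 2: latent mixup with one-to-one mixing of the ground-truth volumes.
    mixed_latents = mixup_pair(latents, latents[perm], lam)
    mixed_volumes = mixup_pair(volumes, volumes[perm], lam)
    return mixed_images, mixed_latents, mixed_volumes, lam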