Arrow Research search

Author name cluster

Yizhou Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

33 papers
2 author rows

Possible papers

33

IROS Conference 2025 Conference Paper

A Novel Robot Hand with Hoeckens Linkages and Soft Phalanges for Scooping and Self-Adaptive Grasping in Environmental Constraints

  • Wentao Guo
  • Yizhou Wang
  • Wenzeng Zhang

This paper presents a novel underactuated adaptive robotic hand, the Hoeckens-A Hand, which integrates the Hoeckens mechanism, a double-parallelogram linkage, and a specialized four-bar linkage to achieve three adaptive grasping modes: parallel pinching, asymmetric scooping, and enveloping grasping. The Hoeckens-A Hand requires only a single linear actuator, leveraging passive mechanical intelligence to ensure adaptability and compliance in unstructured environments. Specifically, the vertical motion of the Hoeckens mechanism introduces compliance, the double-parallelogram linkage ensures line contact at the fingertip, and the four-bar amplification system enables natural transitions between the grasping modes. Additionally, a mesh-textured silicone phalanx further enhances the hand's ability to envelop objects of various shapes and sizes. The study employs detailed kinematic analysis to optimize the push angle and design the linkage lengths for optimal performance. Simulations validated the design by analyzing fingertip motion and confirming smooth transitions between grasping modes, and the grasping force was analyzed using power equations to deepen understanding of the system's performance. Experimental validation with a 3D-printed prototype demonstrates the three grasping modes in various scenarios under environmental constraints, verifying the hand's grasping stability and broad applicability.

AAAI Conference 2025 Conference Paper

Autoregressive Sequence Modeling for 3D Medical Image Representation

  • Siwen Wang
  • Churan Wang
  • Fei Gao
  • Lixian Su
  • Fandong Zhang
  • Yizhou Wang
  • Yizhou Yu

Three-dimensional (3D) medical images, such as Computed Tomography (CT) and Magnetic Resonance Imaging (MRI), are essential for clinical applications. However, the need for diverse and comprehensive representations is particularly pronounced when considering the variability across different organs, diagnostic tasks, and imaging modalities. How to effectively interpret the intricate contextual information and extract meaningful insights from these images remains an open challenge to the community. While current self-supervised learning methods have shown potential, they often consider an image as a whole, thereby overlooking the extensive, complex relationships among local regions within one or multiple images. In this work, we introduce a pioneering method for learning 3D medical image representations through an autoregressive pre-training framework. Our approach sequences various 3D medical images based on spatial, contrast, and semantic correlations, treating them as interconnected visual tokens within a token sequence. By employing an autoregressive sequence modeling task, we predict the next visual token in the sequence, which allows our model to deeply understand and integrate the contextual information inherent in 3D medical images. Additionally, we implement a random startup strategy to avoid overestimating token relationships and to enhance the robustness of learning. The effectiveness of our approach is demonstrated by superior performance over competing methods on nine downstream tasks on public datasets.

NeurIPS Conference 2025 Conference Paper

Don't Just Chase “Highlighted Tokens” in MLLMs: Revisiting Visual Holistic Context Retention

  • Xin Zou
  • Di Lu
  • Yizhou Wang
  • Yibo Yan
  • Yuanhuiyi Lyu
  • Xu Zheng
  • Linfeng Zhang
  • Xuming Hu

Despite their powerful capabilities, multimodal large language models (MLLMs) suffer from considerable computational overhead due to their reliance on massive visual tokens. Recent studies have explored token pruning to alleviate this problem, which typically uses text-vision cross-attention or [CLS] attention to assess and discard redundant visual tokens. In this work, we identify a critical limitation of such attention-first pruning approaches, i.e., they tend to preserve semantically similar tokens, resulting in pronounced performance drops under high pruning rates. To this end, we propose HoloV, a simple yet effective, plug-and-play visual token pruning framework for efficient inference. Distinct from previous attention-first schemes, HoloV rethinks token retention from a holistic perspective. By adaptively distributing the pruning budget across different spatial crops, HoloV ensures that the retained tokens capture the global visual context rather than isolated salient features. This strategy minimizes representational collapse and maintains task-relevant information even under aggressive pruning. Experimental results demonstrate that our HoloV achieves superior performance across various tasks, MLLM architectures, and pruning ratios compared to SOTA methods. For instance, LLaVA-1.5 equipped with HoloV preserves 95.8% of the original performance after pruning 88.9% of visual tokens, achieving superior efficiency-accuracy trade-offs.
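The holistic retention idea in the abstract (distributing a fixed pruning budget across spatial crops and keeping the locally top-scoring tokens, rather than ranking all tokens globally) can be pictured with a toy sketch. This is not the authors' HoloV implementation; the function name, grid size, and saliency scores below are illustrative assumptions.

```python
import numpy as np

def holistic_prune(scores: np.ndarray, keep_ratio: float, grid: int = 2) -> np.ndarray:
    """Toy sketch: keep tokens per spatial crop instead of globally.

    scores: (H, W) saliency per visual token (e.g. attention weights);
    assumes H and W are divisible by `grid` for brevity.
    Returns a boolean (H, W) mask of retained tokens.
    """
    H, W = scores.shape
    mask = np.zeros_like(scores, dtype=bool)
    hs, ws = H // grid, W // grid
    for i in range(grid):
        for j in range(grid):
            crop = scores[i*hs:(i+1)*hs, j*ws:(j+1)*ws]
            k = max(1, int(round(keep_ratio * crop.size)))
            # keep only the top-k tokens inside this crop
            flat = crop.ravel()
            top = np.argpartition(flat, -k)[-k:]
            sub = np.zeros(crop.size, dtype=bool)
            sub[top] = True
            mask[i*hs:(i+1)*hs, j*ws:(j+1)*ws] = sub.reshape(crop.shape)
    return mask
```

With a global top-k, one highly salient region could absorb the entire budget; the per-crop allocation guarantees every spatial region keeps some representatives, which is the "holistic context" the abstract refers to.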

NeurIPS Conference 2025 Conference Paper

GeneMAN: Generalizable Single-Image 3D Human Reconstruction from Multi-Source Human Data

  • Wentao Wang
  • Hang Ye
  • Fangzhou Hong
  • Xue Yang
  • Jianfu Zhang
  • Yizhou Wang
  • Ziwei Liu
  • Liang Pan

Given a single in-the-wild human photo, it remains a challenging task to reconstruct a high-fidelity 3D human model. Existing methods face difficulties including a) the varying body proportions captured by in-the-wild human images; b) diverse personal belongings within the shot; and c) ambiguities in human postures and inconsistency in human textures. In addition, the scarcity of high-quality human data intensifies the challenge. To address these problems, we propose a Generalizable image-to-3D huMAN reconstruction framework, dubbed GeneMAN, built upon a comprehensive multi-source collection of high-quality human data, including 3D scans, multi-view videos, single photos, and our generated synthetic human data. GeneMAN encompasses three key modules. 1) Without relying on parametric human models (e.g., SMPL), GeneMAN first trains a human-specific text-to-image diffusion model and a view-conditioned diffusion model, serving as GeneMAN's 2D and 3D human priors for reconstruction, respectively. 2) With the help of the pretrained human prior models, the Geometry Initialization-&-Sculpting pipeline is leveraged to recover high-quality 3D human geometry from a single image. 3) To achieve high-fidelity 3D human textures, GeneMAN employs the Multi-Space Texture Refinement pipeline, consecutively refining textures in the latent and pixel spaces. Extensive experimental results demonstrate that GeneMAN can generate high-quality 3D human models from a single image input, outperforming prior state-of-the-art methods. Notably, GeneMAN shows much better generalizability in dealing with in-the-wild images, often yielding high-quality 3D human models in natural poses with common items, regardless of the body proportions in the input images.

IJCAI Conference 2025 Conference Paper

Human-Centric Foundation Models: Perception, Generation and Agentic Modeling

  • Shixiang Tang
  • Yizhou Wang
  • Lu Chen
  • Yuan Wang
  • Sida Peng
  • Dan Xu
  • Wanli Ouyang

Human understanding and generation are critical for modeling digital humans and humanoid embodiments. Recently, Human-centric Foundation Models (HcFMs), inspired by the success of generalist models such as large language and vision models, have emerged to unify diverse human-centric tasks into a single framework, surpassing traditional task-specific approaches. In this survey, we present a comprehensive overview of HcFMs by proposing a taxonomy that categorizes current approaches into four groups: (1) Human-centric Perception Foundation Models that capture fine-grained features for multi-modal 2D and 3D understanding; (2) Human-centric AIGC Foundation Models that generate high-fidelity, diverse human-related content; (3) Unified Perception and Generation Models that integrate these capabilities to enhance both human understanding and synthesis; and (4) Human-centric Agentic Foundation Models that extend beyond perception and generation to learn human-like intelligence and interactive behaviors for humanoid embodied tasks. We review state-of-the-art techniques and discuss emerging challenges and future research directions. This survey aims to serve as a roadmap for researchers and practitioners working towards more robust, versatile, and intelligent modeling of digital humans and embodiments. Website: https://github.com/HumanCentricModels/Awesome-Human-Centric-Foundation-Models/

ICML Conference 2025 Conference Paper

Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models

  • Xin Zou 0001
  • Yizhou Wang
  • Yibo Yan
  • Yuanhuiyi Lyu
  • Kening Zheng
  • Sirui Huang
  • Junkai Chen
  • Peijie Jiang

Despite their impressive capabilities, Multimodal Large Language Models (MLLMs) are prone to hallucinations, i.e., generated content that is nonsensical or unfaithful to input sources. Unlike in LLMs, hallucinations in MLLMs often stem from the sensitivity of the text decoder to visual tokens, leading to a phenomenon akin to "amnesia" about visual information. To address this issue, we propose MemVR, a novel decoding paradigm inspired by common cognition: when the memory of an image seen a moment before is forgotten, people look at it again for factual answers. Following this principle, we treat visual tokens as supplementary evidence, re-injecting them into the MLLM through the Feed-Forward Network (FFN) as "key-value memory" at the middle trigger layer. This look-twice mechanism is triggered when the model exhibits high uncertainty during inference, effectively enhancing factual alignment. Comprehensive experimental evaluations demonstrate that MemVR significantly mitigates hallucination across various MLLMs and excels on general benchmarks without incurring additional time overhead.
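The trigger logic described above (re-inject visual evidence only when the decoder is uncertain) can be sketched with a small, framework-free example. All names, shapes, and the entropy threshold here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def memvr_step(hidden, logits, visual_tokens, W_inject, tau=2.0):
    """Toy sketch of the look-twice idea: when the next-token
    distribution is too uncertain (entropy above tau), re-inject
    pooled visual evidence into the hidden state at a middle layer.
    Shapes: hidden (d,), logits (V,), visual_tokens (n, dv),
    W_inject (dv, d) -- all hypothetical."""
    p = softmax(logits)
    entropy = -(p * np.log(p + 1e-9)).sum()
    if entropy > tau:  # trigger only under high uncertainty
        visual_memory = visual_tokens.mean(axis=0) @ W_inject
        hidden = hidden + visual_memory
    return hidden, entropy
```

A confident (peaked) distribution leaves the hidden state untouched, so the mechanism adds no work on easy steps, matching the abstract's claim of no extra time overhead in the common case.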

NeurIPS Conference 2025 Conference Paper

Reinforced Context Order Recovery for Adaptive Reasoning and Planning

  • Long Ma
  • Fangwei Zhong
  • Yizhou Wang

Modern causal language models, along with the rapidly developing family of discrete diffusion models, can now produce a wide variety of interesting and useful content. However, these families of models are predominantly trained to output tokens in a fixed (left-to-right) or random order, which may deviate from the logical order in which the tokens are originally generated. In this paper, we observe that current causal and diffusion models encounter difficulties on problems that require adaptive token generation orders to solve tractably, which we characterize with the $\mathcal{V}$-information framework. Motivated by this, we propose Reinforced Context Order Recovery (ReCOR), a reinforcement-learning-based framework that extracts adaptive, data-dependent token generation orders from text data without annotations. Self-supervised by token prediction statistics, ReCOR estimates the hardness of predicting every unfilled token and adaptively selects the next token during both training and inference. Experiments on challenging reasoning and planning datasets demonstrate the superior performance of ReCOR compared with baselines, sometimes outperforming oracle models supervised with the ground-truth order.
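The order-selection step the abstract describes can be pictured with a greedy stand-in: among the not-yet-generated positions, pick the one the model is currently most confident about. The per-position hardness scores are assumed here to be predictive entropies; the real ReCOR learns this selection with reinforcement learning rather than a fixed greedy rule.

```python
def next_position(entropies, filled):
    """Greedy stand-in for an adaptive order policy: return the index
    of the unfilled position with the lowest predictive entropy, i.e.,
    the token the model currently finds easiest to predict. Returns
    None when every position is already filled."""
    best, best_h = None, float("inf")
    for pos, h in enumerate(entropies):
        if pos not in filled and h < best_h:
            best, best_h = pos, h
    return best
```

A generation loop would repeatedly call this, fill the chosen position, and re-estimate the entropies conditioned on the tokens placed so far, so the order adapts to the data instead of being fixed left-to-right.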

IROS Conference 2024 Conference Paper

A Novel Geometrical Structure Robot Hand for Linear-parallel Pinching and Coupled Self-adaptive Hybrid Grasping

  • Shi Chen
  • Bihao Zhang
  • Kehan Feng
  • Yizhou Wang
  • Jiayun Li
  • Wenzeng Zhang

Current robot grippers capable of self-adaptive or coupled grasping often cannot perform linear-parallel pinching at the physical end of the gripper, a mode widely used in industrial applications. For this reason, this paper introduces a gripper with hybrid grasping modes: the LPCSA hand. It achieves three grasping modes: linear-parallel pinching, coupled grasping, and self-adaptive grasping. The design cleverly couples two Chebyshev linear mechanisms to enable flat movement at the end of the finger, and utilizes the deformability of a parallelogram to achieve self-adaptive grasping. Furthermore, the gripper uses an idle stroke and a special component to facilitate switching between the three modes. The linear-parallel pinching function is suitable for pinching objects of different sizes on a desktop, the self-adaptive grasping mode can adapt to objects of various shapes and sizes, and the coupled grasping mode enables fast grasping of irregular objects. This paper also analyzes the kinematics and dynamics of the LPCSA hand and, through experiments, demonstrates that the LPCSA hand has a wide grasping space and stable performance.

NeurIPS Conference 2024 Conference Paper

Richelieu: Self-Evolving LLM-Based Agents for AI Diplomacy

  • Zhenyu Guan
  • Xiangyu Kong
  • Fangwei Zhong
  • Yizhou Wang

Diplomacy is one of the most sophisticated activities in human society, involving complex interactions among multiple parties that require skills in social reasoning, negotiation, and long-term strategic planning. Previous AI agents have demonstrated their ability to handle multi-step games and large action spaces in multi-agent tasks. However, diplomacy involves a staggering magnitude of decision spaces, especially considering the negotiation stage required. While recent agents based on large language models (LLMs) have shown potential in various applications, they still struggle with extended planning periods in complex multi-agent settings. Leveraging recent technologies for LLM-based agents, we aim to explore AI's potential to create a human-like agent capable of executing comprehensive multi-agent missions by integrating three fundamental capabilities: 1) strategic planning with memory and reflection; 2) goal-oriented negotiation with social reasoning; and 3) augmenting memory through self-play games for self-evolution without a human in the loop. Project page: https://sites.google.com/view/richelieu-diplomacy.

NeurIPS Conference 2023 Conference Paper

BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset

  • Jiaming Ji
  • Mickel Liu
  • Josef Dai
  • Xuehai Pan
  • Chi Zhang
  • Ce Bian
  • Boyuan Chen
  • Ruiyang Sun

In this paper, we introduce the BeaverTails dataset, aimed at fostering research on safety alignment in large language models (LLMs). This dataset uniquely separates annotations of helpfulness and harmlessness for question-answering pairs, thus offering distinct perspectives on these crucial attributes. In total, we have gathered safety meta-labels for 333,963 question-answer (QA) pairs and 361,903 pairs of expert comparison data for both the helpfulness and harmlessness metrics. We further showcase applications of BeaverTails in content moderation and reinforcement learning with human feedback (RLHF), emphasizing its potential for practical safety measures in LLMs. We believe this dataset provides vital resources for the community, contributing towards the safe development and deployment of LLMs. Our project page is available at the following URL: https://sites.google.com/view/pku-beavertails.

NeurIPS Conference 2023 Conference Paper

Causal Discovery from Subsampled Time Series with Proxy Variables

  • Mingzhou Liu
  • Xinwei Sun
  • Lingjing Hu
  • Yizhou Wang

Inferring causal structures from time series data is the central interest of many scientific inquiries. A major barrier to such inference is the problem of subsampling, i.e., the frequency of measurement is much lower than that of causal influence. To overcome this problem, numerous methods have been proposed, yet they were either limited to the linear case or failed to achieve identifiability. In this paper, we propose a constraint-based algorithm that can identify the entire causal structure from subsampled time series, without any parametric constraint. Our observation is that the challenge of subsampling arises mainly from hidden variables at the unobserved time steps. Meanwhile, every hidden variable has an observed proxy, which is essentially itself at some observable time in the future, benefiting from the temporal structure. Based on these observations, we can leverage the proxies to remove the bias induced by the hidden variables and hence achieve identifiability. Following this intuition, we propose a proxy-based causal discovery algorithm. Our algorithm is nonparametric and can achieve full causal identification. Its theoretical advantages are reflected in synthetic and real-world experiments.

NeurIPS Conference 2023 Conference Paper

ChimpACT: A Longitudinal Dataset for Understanding Chimpanzee Behaviors

  • Xiaoxuan Ma
  • Stephan Kaufhold
  • Jiajun Su
  • Wentao Zhu
  • Jack Terwilliger
  • Andres Meza
  • Yixin Zhu
  • Federico Rossano

Understanding the behavior of non-human primates is crucial for improving animal welfare, modeling social behavior, and gaining insights into distinctively human and phylogenetically shared behaviors. However, the lack of datasets on non-human primate behavior hinders in-depth exploration of primate social interactions, posing challenges to research on our closest living relatives. To address these limitations, we present ChimpACT, a comprehensive dataset for quantifying the longitudinal behavior and social relations of chimpanzees within a social group. Spanning from 2015 to 2018, ChimpACT features videos of a group of over 20 chimpanzees residing at the Leipzig Zoo, Germany, with a particular focus on documenting the developmental trajectory of one young male, Azibo. ChimpACT is both comprehensive and challenging, consisting of 163 videos with a cumulative 160,500 frames, each richly annotated with detection, identification, pose estimation, and fine-grained spatiotemporal behavior labels. We benchmark representative methods of three tracks on ChimpACT: (i) tracking and identification, (ii) pose estimation, and (iii) spatiotemporal action detection of the chimpanzees. Our experiments reveal that ChimpACT offers ample opportunities for both devising new methods and adapting existing ones to solve fundamental computer vision tasks applied to chimpanzee groups, such as detection, pose estimation, and behavior analysis, ultimately deepening our comprehension of communication and sociality in non-human primates.

AAAI Conference 2023 Conference Paper

RSPT: Reconstruct Surroundings and Predict Trajectory for Generalizable Active Object Tracking

  • Fangwei Zhong
  • Xiao Bi
  • Yudi Zhang
  • Wei Zhang
  • Yizhou Wang

Active Object Tracking (AOT) aims to maintain a specific relation between the tracker and object(s) by autonomously controlling the motion system of a tracker given observations. It is widely used in applications such as mobile robots and autonomous driving. However, building a generalizable active tracker that works robustly across various scenarios remains a challenge, particularly in unstructured environments with cluttered obstacles and diverse layouts. To realize this, we argue that the key is to construct a state representation that can model the geometric structure of the surroundings and the dynamics of the target. To this end, we propose a framework called RSPT that forms a structure-aware motion representation by Reconstructing Surroundings and Predicting the target Trajectory. Moreover, we further enhance the generalization of the policy network by training with an asymmetric dueling mechanism. Empirical results show that RSPT outperforms existing methods in unseen environments, especially those with cluttered obstacles and diverse layouts. We also demonstrate good sim-to-real transfer when deploying RSPT in real-world scenarios.

NeurIPS Conference 2023 Conference Paper

Social Motion Prediction with Cognitive Hierarchies

  • Wentao Zhu
  • Jason Qin
  • Yuke Lou
  • Hang Ye
  • Xiaoxuan Ma
  • Hai Ci
  • Yizhou Wang

Humans exhibit a remarkable capacity for anticipating the actions of others and planning their own actions accordingly. In this study, we strive to replicate this ability by addressing the social motion prediction problem. We introduce a new benchmark, a novel formulation, and a cognition-inspired framework. We present Wusi, a 3D multi-person motion dataset under the context of team sports, which features intense and strategic human interactions and diverse pose distributions. By reformulating the problem from a multi-agent reinforcement learning perspective, we incorporate behavioral cloning and generative adversarial imitation learning to boost learning efficiency and generalization. Furthermore, we take into account the cognitive aspects of the human social action planning process and develop a cognitive hierarchy framework to predict strategic human social interactions. We conduct comprehensive experiments to validate the effectiveness of our proposed dataset and approach.

AAAI Conference 2022 Conference Paper

Causal Intervention for Subject-Deconfounded Facial Action Unit Recognition

  • Yingjie Chen
  • Diqi Chen
  • Tao Wang
  • Yizhou Wang
  • Yun Liang

Subject-invariant facial action unit (AU) recognition remains challenging because the data distribution varies across subjects. In this paper, we propose a causal inference framework for subject-invariant facial action unit recognition. To illustrate the causal effects in the AU recognition task, we formulate the causalities among facial images, subjects, latent AU semantic relations, and estimated AU occurrence probabilities via a structural causal model. By constructing such a causal diagram, we clarify the causal effect among variables and propose a plug-in causal intervention module, CIS, to deconfound the confounder Subject in the causal diagram. Extensive experiments conducted on two commonly used AU benchmark datasets, BP4D and DISFA, show the effectiveness of our CIS, and the model with CIS inserted, CISNet, achieves state-of-the-art performance.

IJCAI Conference 2022 Conference Paper

Intrinsic Image Decomposition by Pursuing Reflectance Image

  • Tzu-Heng Lin
  • Pengxiao Wang
  • Yizhou Wang

Intrinsic image decomposition is a fundamental problem for many computer vision applications. While recent deep-learning-based methods have achieved very promising results on synthetic, densely labeled datasets, results on real-world datasets are still far from human-level performance. This is mostly because collecting dense supervision on a real-world dataset is impossible; only a sparse set of pairwise human judgements is typically available, and it is very difficult for models to learn in such settings. In this paper, we investigate the possibility of using only reflectance images for supervision during training, which greatly reduces the demand for labeled data. To achieve this goal, we investigate reflectance images in depth and find that they actually comprise two components: flat surfaces carrying low-frequency information, and boundaries carrying high-frequency details. We therefore propose to disentangle the learning of these two components of the reflectance images. We argue that through this procedure the reflectance images can be better modeled, and in the meantime the shading images, though not supervised, also achieve decent results. Extensive experiments show that our proposed network outperforms the current state of the art by a large margin on the most challenging real-world IIW dataset. We also find, surprisingly, that on the densely labeled datasets (MIT and MPI-Sintel), our network achieves state-of-the-art results on both reflectance and shading images even though we only apply supervision to the reflectance images during training.

AAAI Conference 2022 Conference Paper

LUNA: Localizing Unfamiliarity Near Acquaintance for Open-Set Long-Tailed Recognition

  • Jiarui Cai
  • Yizhou Wang
  • Hung-Min Hsu
  • Jenq-Neng Hwang
  • Kelsey Magrane
  • Craig S Rose

The predefined, artificially balanced training classes in object recognition have limited capability in modeling real-world scenarios, where objects follow imbalanced distributions with unknown classes. In this paper, we discuss a promising solution to the Open-set Long-Tailed Recognition (OLTR) task utilizing metric learning. First, we propose a distribution-sensitive loss, which weighs the tail classes more heavily to decrease the intra-class distance in the feature space. Building upon these concentrated feature clusters, a local-density-based metric is introduced, called Localizing Unfamiliarity Near Acquaintance (LUNA), to measure the novelty of a testing sample. LUNA is flexible with different cluster sizes and is reliable on the cluster boundary by considering neighbors of different properties. Moreover, contrary to most existing works that treat open-set detection as a simple binary decision, LUNA is a quantitative measurement with interpretable meaning. Our proposed method exceeds the state-of-the-art algorithm by 4-6% in closed-set recognition accuracy and by 4% in F-measure under the open-set setting on public benchmark datasets, including our newly introduced fine-grained OLTR dataset of marine species (MS-LT), the first naturally distributed OLTR dataset revealing the genuine genetic relationships of the classes.

NeurIPS Conference 2022 Conference Paper

MATE: Benchmarking Multi-Agent Reinforcement Learning in Distributed Target Coverage Control

  • Xuehai Pan
  • Mickel Liu
  • Fangwei Zhong
  • Yaodong Yang
  • Song-Chun Zhu
  • Yizhou Wang

We introduce the Multi-Agent Tracking Environment (MATE), a novel multi-agent environment that simulates target coverage control problems in the real world. MATE hosts an asymmetric cooperative-competitive game consisting of two groups of learning agents--"cameras" and "targets"--with opposing interests. Specifically, "cameras", a group of directional sensors, are mandated to actively control their directional perception areas to maximize the coverage rate of targets. On the other side, "targets" are mobile agents that aim to transport cargo between multiple randomly assigned warehouses while minimizing exposure to the camera sensor networks. To showcase the practicality of MATE, we benchmark multi-agent reinforcement learning (MARL) algorithms from different aspects, including cooperation, communication, scalability, robustness, and asymmetric self-play. We start by reporting results for cooperative tasks using MARL algorithms (MAPPO, IPPO, QMIX, MADDPG) and the results after augmenting with multi-agent communication protocols (TarMAC, I2C). We then evaluate the effectiveness of popular self-play techniques (PSRO, fictitious self-play) in an asymmetric zero-sum competitive game. This process of co-evolution between cameras and targets helps to realize a less exploitable camera network. We also observe the emergence of different roles among the target agents when incorporating I2C into target-target communication. MATE is written purely in Python and integrated with the OpenAI Gym API to enhance user-friendliness. Our project is released at https://github.com/UnrealTracking/mate.
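Since MATE follows the OpenAI Gym API, interaction takes the usual reset/step form. The stub environment below only imitates that interface so the loop is runnable; it is not the MATE package, and its observation, reward, and horizon are made-up placeholders.

```python
class StubCameraEnv:
    """Stand-in following the classic Gym API:
    reset() -> obs, step(action) -> (obs, reward, done, info)."""
    def __init__(self, horizon: int = 5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0]  # dummy observation

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0  # pretend coverage reward
        done = self.t >= self.horizon
        return [float(self.t)], reward, done, {}

def rollout(env, policy):
    """Run one episode with a callable policy(obs) -> action."""
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done, _info = env.step(policy(obs))
        total += reward
    return total
```

Any agent exposing a `policy(obs) -> action` callable can be evaluated this way; in the real environment the observation and action spaces are those of the camera and target agents.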

IJCAI Conference 2022 Conference Paper

MemREIN: Rein the Domain Shift for Cross-Domain Few-Shot Learning

  • Yi Xu
  • Lichen Wang
  • Yizhou Wang
  • Can Qin
  • Yulun Zhang
  • Yun Fu

Few-shot learning aims to enable models to generalize to new categories (query instances) with only limited labeled samples (support instances) from each category. Metric-based mechanisms are a promising direction, comparing feature embeddings via different metrics, but they often fail to generalize to unseen domains due to the considerable domain gap. In this paper, we propose a novel framework, MemREIN, which considers Memorized, Restitution, and Instance Normalization for cross-domain few-shot learning. Specifically, an instance normalization algorithm is explored to alleviate feature dissimilarity, which provides initial model generalization ability. However, naively normalizing the features would lose fine-grained discriminative knowledge between different classes. To this end, a memorized module is further proposed to separate out the most refined knowledge and remember it. Then, a restitution module is utilized to restore discrimination ability from the learned knowledge. A novel reverse contrastive learning strategy is proposed to stabilize the distillation process. Extensive experiments on five popular benchmark datasets demonstrate that MemREIN well addresses the domain shift challenge and significantly improves performance, by up to 16.43% compared with state-of-the-art baselines.

AAAI Conference 2022 Conference Paper

MoCaNet: Motion Retargeting In-the-Wild via Canonicalization Networks

  • Wentao Zhu
  • Zhuoqian Yang
  • Ziang Di
  • Wayne Wu
  • Yizhou Wang
  • Chen Change Loy

We present a novel framework that brings the 3D motion retargeting task from controlled environments to in-the-wild scenarios. In particular, our method is capable of retargeting body motion from a character in a 2D monocular video to a 3D character without using any motion capture system or 3D reconstruction procedure. It is designed to leverage massive online videos for unsupervised training, requiring neither 3D annotations nor motion-body pairing information. The proposed method is built upon two novel canonicalization operations, structure canonicalization and view canonicalization. Trained with the canonicalization operations and the derived regularizations, our method learns to factorize a skeleton sequence into three independent semantic subspaces, i.e., motion, structure, and view angle. The disentangled representation enables motion retargeting from 2D to 3D with high precision. Our method achieves superior performance on motion transfer benchmarks with large body variations and challenging actions. Notably, the canonicalized skeleton sequence could serve as a disentangled and interpretable representation of human motion that benefits action analysis and motion retrieval.

NeurIPS Conference 2022 Conference Paper

Unsupervised Object Detection Pretraining with Joint Object Priors Generation and Detector Learning

  • Yizhou Wang
  • Meilin Chen
  • Shixiang Tang
  • Feng Zhu
  • Haiyang Yang
  • Lei Bai
  • Rui Zhao
  • Yunfeng Yan

Unsupervised pretraining methods for object detection aim to learn object discrimination and localization abilities from large amounts of images. Typically, recent works design pretext tasks that supervise the detector to predict defined object priors. They normally leverage heuristic methods to produce the object priors, e.g., selective search, which separates prior generation from detector learning and leads to sub-optimal solutions. In this work, we propose a novel object detection pretraining framework that generates object priors and learns detectors jointly, producing accurate object priors from the model itself. Specifically, region priors are extracted via attention maps from the encoder, which highlight foregrounds, and instance priors are the selected high-quality output bounding boxes of the detection decoder. By assuming objects are instances in the foreground, we can generate object priors from both region and instance priors. Moreover, our object priors are jointly refined along with the detector optimization. With better object priors as supervision, the model achieves better detection capability, which in turn promotes better object prior generation. Our method improves over competitive approaches by +1.3 AP and +1.7 AP in the 1% and 10% COCO low-data object detection regimes.

NeurIPS Conference 2020 Conference Paper

Learning Multi-Agent Coordination for Enhancing Target Coverage in Directional Sensor Networks

  • Jing Xu
  • Fangwei Zhong
  • Yizhou Wang

Maximum target coverage by adjusting the orientation of distributed sensors is an important problem in directional sensor networks (DSNs). The problem is challenging because targets usually move randomly while the coverage range of sensors is limited in both angle and distance. Sensors must therefore be coordinated to achieve good target coverage with low power consumption, e.g., no missing targets and little redundant coverage. To this end, we propose Hierarchical Target-oriented Multi-Agent Coordination (HiT-MAC), which decomposes the target coverage problem into two-level tasks: target assignment by a coordinator and tracking of the assigned targets by executors. Specifically, the coordinator periodically monitors the environment globally and allocates targets to each executor; in turn, each executor only needs to track its assigned targets. To learn HiT-MAC effectively with reinforcement learning, we further introduce several practical techniques, including a self-attention module and marginal contribution approximation for the coordinator, and a goal-conditional observation filter for the executor. Empirical results demonstrate the advantage of HiT-MAC in coverage rate, learning efficiency, and scalability compared to baselines. We also conduct an ablative analysis of the effectiveness of the introduced components in the framework.

AAAI Conference 2020 Conference Paper

Pose-Assisted Multi-Camera Collaboration for Active Object Tracking

  • Jing Li
  • Jing Xu
  • Fangwei Zhong
  • Xiangyu Kong
  • Yu Qiao
  • Yizhou Wang

Active Object Tracking (AOT) is crucial to many vision-based applications, e.g., mobile robots and intelligent surveillance. However, a number of challenges arise when deploying active tracking in complex scenarios, e.g., the target being frequently occluded by obstacles. In this paper, we extend single-camera AOT to a multi-camera setting, where cameras track a target in a collaborative fashion. To achieve effective collaboration among cameras, we propose a novel Pose-Assisted Multi-Camera Collaboration System, which enables a camera to cooperate with the others by sharing camera poses for active object tracking. In the system, each camera is equipped with two controllers and a switcher: the vision-based controller tracks targets based on observed images, and the pose-based controller moves the camera in accordance with the poses of the other cameras. At each step, the switcher decides which of the two controllers' actions to take according to the visibility of the target. The experimental results demonstrate that our system outperforms all the baselines and is capable of generalizing to unseen environments. The code and demo videos are available at https://sites.google.com/view/pose-assistedcollaboration.

NeurIPS Conference 2019 Conference Paper

L_DMI: A Novel Information-theoretic Loss Function for Training Deep Nets Robust to Label Noise

  • Yilun Xu
  • Peng Cao
  • Yuqing Kong
  • Yizhou Wang

Accurately annotating a large-scale dataset is notoriously expensive in both time and money. Although acquiring a low-quality-annotated dataset can be much cheaper, using such a dataset without particular treatment often badly damages the performance of trained models. Various methods have been proposed for learning with noisy labels. However, most methods only handle limited kinds of noise patterns, require auxiliary information or steps (e.g., knowing or estimating the noise transition matrix), or lack theoretical justification. In this paper, we propose a novel information-theoretic loss function, L_DMI, for training deep neural networks robust to label noise. The core of L_DMI is a generalized version of mutual information, termed Determinant-based Mutual Information (DMI), which is not only information-monotone but also relatively invariant. To the best of our knowledge, L_DMI is the first loss function that is provably robust to instance-independent label noise, regardless of noise pattern, and it can be applied to any existing classification neural network straightforwardly without any auxiliary information. In addition to the theoretical justification, we empirically show that L_DMI outperforms all other counterparts in classification tasks on both image and natural language datasets, including Fashion-MNIST, CIFAR-10, Dogs vs. Cats, and MR, with a variety of synthesized noise patterns and noise amounts, as well as on the real-world dataset Clothing1M.
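For intuition only (this is a sketch of the determinant-based idea, not the authors' released implementation), the loss can be written as the negative log-determinant of the empirical joint matrix between one-hot labels and predicted class distributions; the toy data, the 1e-12 stabilizer, and the specific probabilities below are illustrative assumptions:

```python
import numpy as np

def l_dmi(probs, labels, num_classes):
    """Sketch of a determinant-based mutual-information loss.

    probs: (N, C) softmax outputs; labels: (N,) integer labels.
    Returns -log|det(U)|, where U is the empirical joint matrix
    between one-hot labels and predicted class distributions.
    """
    one_hot = np.eye(num_classes)[labels]        # (N, C)
    U = one_hot.T @ probs / len(labels)          # empirical joint, (C, C)
    return -np.log(np.abs(np.linalg.det(U)) + 1e-12)

# toy check: informative predictions give a lower loss than
# uniform (uninformative) predictions, whose joint matrix is rank-1
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=300)
good = np.eye(3)[y] * 0.94 + 0.02                # near one-hot, rows sum to 1
uniform = np.full((300, 3), 1.0 / 3)
assert l_dmi(good, y, 3) < l_dmi(uniform, y, 3)
```

The determinant is what makes the loss insensitive to a class-conditional relabeling of the targets, which is the intuition behind robustness to instance-independent noise.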

AAAI Conference 2018 Conference Paper

RAN4IQA: Restorative Adversarial Nets for No-Reference Image Quality Assessment

  • Hongyu Ren
  • Diqi Chen
  • Yizhou Wang

Inspired by the free-energy brain theory (Friston, Kilner, and Harrison 2006), which implies that the human visual system (HVS) tends to reduce uncertainty and restore perceptual details upon seeing a distorted image (Friston 2010), we propose the restorative adversarial net (RAN), a GAN-based model for no-reference image quality assessment (NR-IQA). RAN, which mimics this process of the HVS, consists of three components: a restorator, a discriminator, and an evaluator. The restorator restores and reconstructs input distorted image patches, while the discriminator distinguishes the reconstructed patches from pristine distortion-free patches. After restoration, we observe that the perceptual distance between the restored and the distorted patches is monotonic with respect to the distortion level, and we define the Gain of Restoration (GoR) based on this phenomenon. The evaluator predicts a perceptual score by extracting feature representations from the distorted and restored patches to measure GoR. Eventually, the quality score of an input image is estimated as a weighted sum of the patch scores. Experimental results on Waterloo Exploration, LIVE, and TID2013 show the effectiveness and generalization ability of RAN compared to state-of-the-art NR-IQA models.

AAAI Conference 2017 Conference Paper

Learning Discriminative Activated Simplices for Action Recognition

  • Chenxu Luo
  • Chang Ma
  • Chunyu Wang
  • Yizhou Wang

We address the task of action recognition from a sequence of 3D human poses. This task is challenging, first because poses of the same class can have large intra-class variations, caused either by inaccurate 3D pose estimation or by varied performing styles. In addition, different actions, e.g., walking vs. jogging, may share similar poses, which makes the representation insufficiently discriminative to differentiate the actions. To solve these problems, we propose a novel representation of 3D poses by a mixture of Discriminative Activated Simplices (DAS). Each DAS consists of a few bases and represents pose data by their convex combinations. The discriminative power of DAS is realized first by learning discriminative bases across classes with a block-diagonal constraint enforced on the basis coefficient matrix. Second, the DAS provides a tight characterization of the pose manifolds, reducing the chance of generating overlapping DAS between similar classes. We justify the power of the model on benchmark datasets and observe consistent performance improvements.

TIST Journal 2016 Journal Article

Efficient Generalized Fused Lasso and Its Applications

  • Bo Xin
  • Yoshinobu Kawahara
  • Yizhou Wang
  • Lingjing Hu
  • Wen Gao

Generalized fused lasso (GFL) penalizes variables with ℓ1 norms based both on the variables and their pairwise differences. GFL is useful when applied to data where prior information is expressed using a graph over the variables. However, the existing GFL algorithms incur high computational costs and do not scale to high-dimensional problems. In this study, we propose a fast and scalable algorithm for GFL. Based on the fact that the fusion penalty is the Lovász extension of a cut function, we show that the key building block of the optimization is equivalent to recursively solving graph-cut problems. Thus, we use a parametric flow algorithm to solve GFL in an efficient manner. Runtime comparisons demonstrate a significant speedup compared to existing GFL algorithms. Moreover, the proposed optimization framework is very general; by designing different cut functions, we also discuss the extension of GFL to directed graphs. Exploiting the scalability of the proposed algorithm, we demonstrate applications to the diagnosis of Alzheimer’s disease (AD) and to video background subtraction (BS). In the AD problem, we formulate the diagnosis of AD as GFL-regularized classification. Our experimental evaluations demonstrate that the diagnosis performance is promising, and the selected critical voxels are well structured, i.e., connected, consistent across cross-validation, and in agreement with prior pathological knowledge. In the BS problem, GFL naturally models arbitrary foregrounds without a predefined grouping of the pixels. Even with simple background models, e.g., a sparse linear combination of former frames, we achieve state-of-the-art performance on several public datasets.
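For concreteness, the GFL objective described above can be written down directly: a squared-loss fit term, an ℓ1 penalty on the variables, and an ℓ1 penalty on pairwise differences along the edges of a prior graph. The chain graph, the weights `lam1`/`lam2`, and the candidate solutions below are illustrative assumptions, and no solver is shown (the paper's contribution is the parametric-flow optimization, not the objective itself):

```python
import numpy as np

def gfl_objective(x, y, edges, lam1, lam2):
    """Evaluate the generalized fused lasso objective (sketch):
    0.5 * ||y - x||^2  +  lam1 * sum_i |x_i|
                       +  lam2 * sum_{(i,j) in E} |x_i - x_j|."""
    fit = 0.5 * np.sum((y - x) ** 2)
    sparsity = lam1 * np.sum(np.abs(x))
    fusion = lam2 * sum(abs(x[i] - x[j]) for i, j in edges)
    return fit + sparsity + fusion

# chain graph over 5 variables: the fusion term favors
# piecewise-constant solutions over copying the noisy data
y = np.array([0.1, 0.2, 2.0, 2.1, 1.9])
edges = [(i, i + 1) for i in range(4)]
flat = np.array([0.0, 0.0, 2.0, 2.0, 2.0])   # piecewise-constant candidate
assert gfl_objective(flat, y, edges, 0.05, 0.5) < gfl_objective(y, y, edges, 0.05, 0.5)
```

Replacing the chain with an arbitrary edge set is exactly how graph-structured prior information enters the model.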

NeurIPS Conference 2016 Conference Paper

Maximal Sparsity with Deep Networks?

  • Bo Xin
  • Yizhou Wang
  • Wen Gao
  • David Wipf
  • Baoyuan Wang

The iterations of many sparse estimation algorithms are comprised of a fixed linear filter cascaded with a thresholding nonlinearity, which collectively resemble a typical neural network layer. Consequently, a lengthy sequence of algorithm iterations can be viewed as a deep network with shared, hand-crafted layer weights. It is therefore quite natural to examine the degree to which a learned network model might act as a viable surrogate for traditional sparse estimation in domains where ample training data is available. While the possibility of a reduced computational budget is readily apparent when a ceiling is imposed on the number of layers, our work primarily focuses on estimation accuracy. In particular, it is well-known that when a signal dictionary has coherent columns, as quantified by a large RIP constant, then most tractable iterative algorithms are unable to find maximally sparse representations. In contrast, we demonstrate both theoretically and empirically the potential for a trained deep network to recover minimal $\ell_0$-norm representations in regimes where existing methods fail. The resulting system, which can effectively learn novel iterative sparse estimation algorithms, is deployed on a practical photometric stereo estimation problem, where the goal is to remove sparse outliers that can disrupt the estimation of surface normals from a 3D scene.

AAAI Conference 2016 Conference Paper

Robust Complex Behaviour Modeling at 90Hz

  • Xiangyu Kong
  • Yizhou Wang
  • Tao Xiang

Modeling complex crowd behaviour for tasks such as rare event detection has received increasing interest. However, existing methods are limited because (1) they are sensitive to noise, often resulting in a large number of false alarms; and (2) they rely on elaborate models, leading to high computational cost and making them unsuitable for processing a large number of video inputs in real time. In this paper, we overcome these limitations by introducing a novel complex behaviour modeling framework, which consists of a Binarized Cumulative Directional (BCD) feature as representation, novel spatial and temporal context modeling via iterative correlation maximization, and a set of behaviour models, each being a simple Bernoulli distribution. Despite its simplicity, our experiments on three benchmark datasets show that the framework significantly outperforms the state-of-the-art for both temporal video segmentation and rare event detection. Importantly, it is extremely efficient, reaching 90 Hz on a normal PC platform using MATLAB.

IJCAI Conference 2015 Conference Paper

Quantized Correlation Hashing for Fast Cross-Modal Search

  • Botong Wu
  • Qiang Yang
  • Wei-Shi Zheng
  • Yizhou Wang
  • Jingdong Wang

Cross-modal hashing is designed to facilitate fast search across domains. In this work, we present a cross-modal hashing approach, called quantized correlation hashing (QCH), which takes into consideration both the quantization loss over domains and the relation between domains. Unlike previous approaches that optimize the quantizer independently of maximizing domain correlation, our approach optimizes both processes simultaneously. The underlying relation between the domains that describe the same objects is established by maximizing the correlation between the hash codes across the domains. The resulting multi-modal objective function is transformed into a unimodal formulation, which is optimized through an alternating procedure. Experimental results on three real-world datasets demonstrate that our approach outperforms state-of-the-art multi-modal hashing methods.

AAAI Conference 2015 Conference Paper

Stable Feature Selection from Brain sMRI

  • Bo Xin
  • Lingjing Hu
  • Yizhou Wang
  • Wen Gao

Neuroimage analysis usually involves learning thousands or even millions of variables using only a limited number of samples. In this regard, sparse models, e.g., the lasso, are applied to select the optimal features and achieve high diagnosis accuracy. The lasso, however, usually yields independent, unstable features. Stability, a manifestation of the reproducibility of statistical results under reasonable perturbations to the data and the model (Yu 2013), is an important focus in statistics, especially in the analysis of high-dimensional data. In this paper, we explore a nonnegative generalized fused lasso model for stable feature selection in the diagnosis of Alzheimer’s disease. In addition to sparsity, our model incorporates two important pathological priors: the spatial cohesion of lesion voxels and the positive correlation between the features and the disease labels. To optimize the model, we propose an efficient algorithm by proving a novel link between total variation and fast network flow algorithms via conic duality. Experiments show that the proposed nonnegative model performs much better at exploring the intrinsic structure of data via selecting stable features, compared with other state-of-the-art methods.

AAAI Conference 2014 Conference Paper

Efficient Generalized Fused Lasso and its Application to the Diagnosis of Alzheimer’s Disease

  • Bo Xin
  • Yoshinobu Kawahara
  • Yizhou Wang
  • Wen Gao

Generalized fused lasso (GFL) penalizes variables with L1 norms based both on the variables and their pairwise differences. GFL is useful when applied to data where prior information is expressed using a graph over the variables. However, the existing GFL algorithms incur high computational costs and do not scale to high-dimensional problems. In this study, we propose a fast and scalable algorithm for GFL. Based on the fact that the fusion penalty is the Lovász extension of a cut function, we show that the key building block of the optimization is equivalent to recursively solving parametric graph-cut problems. Thus, we use a parametric flow algorithm to solve GFL in an efficient manner. Runtime comparisons demonstrated a significant speed-up compared with the existing GFL algorithms. By exploiting the scalability of the proposed algorithm, we formulated the diagnosis of Alzheimer’s disease as GFL. Our experimental evaluations demonstrated that the diagnosis performance was promising and that the selected critical voxels were well structured, i.e., connected, consistent according to cross-validation, and in agreement with prior clinical knowledge.