Arrow Research search

Author name cluster

Xiaohan Yu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers
1 author row

Possible papers (9)

JBHI Journal 2026 Journal Article

Improving Medical Visual Representation Learning With Pathological-Level Cross-Modal Alignment and Correlation Exploration

  • Jun Wang
  • Lixing Zhu
  • Xiaohan Yu
  • Abhir Bhalerao
  • Yulan He

Learning medical visual representations from image-report pairs through joint learning has garnered increasing research attention due to its potential for transferring acquired knowledge to various downstream medical tasks. Previous works have predominantly focused on instance-wise or token-wise cross-modal alignment, often neglecting the importance of pathological-level consistency. This paper presents PLACE, a novel framework that promotes Pathological-Level Alignment and enriches fine-grained details via Correlation Exploration without additional human annotations. Specifically, we propose a novel pathological-level cross-modal alignment (PCMA) approach to maximize the consistency of pathology observations from both images and reports. To facilitate this, a Visual Pathology Observation Extractor is introduced to extract visual pathological observation representations from localized tokens. The PCMA module operates independently of any external disease annotations, enhancing the generalizability and robustness of our method. Furthermore, we design a proxy task that requires the model to identify correlations among image patches, thereby enriching the fine-grained details crucial for various downstream tasks. Experimental results demonstrate that our proposed framework achieves new state-of-the-art performance on multiple downstream tasks, including classification, image-to-text retrieval, semantic segmentation, object detection and report generation.
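
The pathological-level alignment described above is, at its core, a contrastive objective between image-side and report-side pathology observation embeddings. Below is a minimal, hypothetical sketch of such a symmetric InfoNCE-style loss; the name `pcma_loss`, the mean-pooling of observation tokens, and the tensor shapes are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a pathological-level cross-modal alignment (PCMA) style
# loss, assuming image-side and report-side pathology observation embeddings of
# shape (batch, num_observations, dim). Not the authors' code.
import torch
import torch.nn.functional as F

def pcma_loss(img_obs: torch.Tensor, txt_obs: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over per-sample pathology observation embeddings."""
    # Pool observation tokens into one pathology-level vector per sample.
    img = F.normalize(img_obs.mean(dim=1), dim=-1)   # (B, D)
    txt = F.normalize(txt_obs.mean(dim=1), dim=-1)   # (B, D)
    logits = img @ txt.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # Align image->report and report->image; matched pairs sit on the diagonal.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Usage: loss = pcma_loss(torch.randn(8, 14, 256), torch.randn(8, 14, 256))
```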

AAAI Conference 2026 Conference Paper

SparseSurf: Sparse-View 3D Gaussian Splatting for Surface Reconstruction

  • Meiying Gu
  • Jiawei Zhang
  • Jiahe Li
  • Xiaohan Yu
  • Haonan Luo
  • Jin Zheng
  • Xiao Bai

Recent advances in optimizing Gaussian Splatting for scene geometry have enabled efficient reconstruction of detailed surfaces from images. However, when input views are sparse, such optimization is prone to overfitting, leading to suboptimal reconstruction quality. Existing approaches address this challenge by employing flattened Gaussian primitives to better fit surface geometry, combined with depth regularization to alleviate geometric ambiguities under limited viewpoints. Nevertheless, the increased anisotropy inherent in flattened Gaussians exacerbates overfitting in sparse-view scenarios, hindering accurate surface fitting and degrading novel view synthesis performance. In this paper, we propose SparseSurf, a method that reconstructs more accurate and detailed surfaces while preserving high-quality novel view rendering. Our key insight is to introduce Stereo Geometry-Texture Alignment, which bridges rendering quality and geometry estimation, thereby jointly enhancing both surface reconstruction and view synthesis. In addition, we present a Pseudo-Feature Enhanced Geometry Consistency that enforces multi-view geometric consistency by incorporating both training and unseen views, effectively mitigating overfitting caused by sparse supervision. Extensive experiments on the DTU, BlendedMVS, and Mip-NeRF360 datasets demonstrate that our method achieves state-of-the-art performance.

NeurIPS Conference 2025 Conference Paper

Eve3D: Elevating Vision Models for Enhanced 3D Surface Reconstruction via Gaussian Splatting

  • Jiawei Zhang
  • Youmin Zhang
  • Fabio Tosi
  • Meiying Gu
  • Jiahe Li
  • Xiaohan Yu
  • Jin Zheng
  • Xiao Bai

We present Eve3D, a novel framework for dense 3D reconstruction based on 3D Gaussian Splatting (3DGS). While most existing methods rely on imperfect priors derived from pre-trained vision models, Eve3D fully leverages these priors by jointly optimizing both the priors and the 3DGS backbone. This joint optimization creates a mutually reinforcing cycle: the priors enhance the quality of 3DGS, which in turn refines the priors, further improving the reconstruction. Additionally, Eve3D introduces a novel optimization step based on bundle adjustment, overcoming the limitations of the highly local supervision in standard 3DGS pipelines. Eve3D achieves state-of-the-art results in surface reconstruction and novel view synthesis on the Tanks & Temples, DTU, and Mip-NeRF360 datasets while retaining fast convergence, highlighting an unprecedented trade-off between accuracy and speed.

NeurIPS Conference 2025 Conference Paper

GeoSVR: Taming Sparse Voxels for Geometrically Accurate Surface Reconstruction

  • Jiahe Li
  • Jiawei Zhang
  • Youmin Zhang
  • Xiao Bai
  • Jin Zheng
  • Xiaohan Yu
  • Lin Gu

Reconstructing accurate surfaces with radiance fields has achieved remarkable progress in recent years. However, prevailing approaches, primarily based on Gaussian Splatting, are increasingly constrained by representational bottlenecks. In this paper, we introduce GeoSVR, an explicit voxel-based framework that explores and extends the under-investigated potential of sparse voxels for achieving accurate, detailed, and complete surface reconstruction. Sparse voxels naturally preserve coverage completeness and geometric clarity, although challenges also arise from absent scene constraints and the locality of surface refinement. To ensure correct scene convergence, we first propose a Voxel-Uncertainty Depth Constraint that maximizes the effect of monocular depth cues while presenting a voxel-oriented uncertainty to avoid quality degradation, enabling effective and robust scene constraints while preserving highly accurate geometries. Subsequently, Sparse Voxel Surface Regularization is designed to enhance geometric consistency for tiny voxels and facilitate the voxel-based formation of sharp and accurate surfaces. Extensive experiments demonstrate our superior performance compared to existing methods across diverse challenging scenarios, excelling in geometric accuracy, detail preservation, and reconstruction completeness while maintaining high efficiency. Code is available at https://github.com/Fictionarry/GeoSVR.
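
As a rough illustration of how a monocular depth cue can be applied softly rather than enforced everywhere, the sketch below implements an uncertainty-weighted L1 depth loss in the spirit of the Voxel-Uncertainty Depth Constraint; the per-pixel uncertainty map, its [0, 1] range, and the weighting scheme are assumptions for illustration, not the paper's formulation.

```python
# Minimal sketch of an uncertainty-weighted depth constraint: rendered depth is
# pulled toward a monocular depth prior, down-weighted where uncertainty is high.
# The uncertainty source and weighting are assumptions, not the authors' method.
import torch

def uncertainty_depth_loss(rendered_depth: torch.Tensor,
                           mono_depth: torch.Tensor,
                           uncertainty: torch.Tensor,
                           eps: float = 1e-6) -> torch.Tensor:
    """L1 depth loss modulated by (1 - uncertainty), with uncertainty in [0, 1]."""
    weight = (1.0 - uncertainty).clamp(min=0.0)
    return (weight * (rendered_depth - mono_depth).abs()).sum() / (weight.sum() + eps)

# Usage: loss = uncertainty_depth_loss(torch.rand(1, 480, 640),
#                                      torch.rand(1, 480, 640),
#                                      torch.rand(1, 480, 640))
```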

IJCAI Conference 2025 Conference Paper

Revisiting Continual Ultra-fine-grained Visual Recognition with Pre-trained Models

  • Pengcheng Zhang
  • Xiaohan Yu
  • Meiying Gu
  • Yuchen Wu
  • Yongsheng Gao
  • Xiao Bai

Continual ultra-fine-grained visual recognition (C-UFG) aims to continuously learn to categorize an increasing number of cultivars (VC-UFG) and to consistently recognize crops across reproductive stages (HC-UFG), which is a fundamental goal of intelligent agriculture. Despite the progress made in general continual learning, C-UFG remains an underexplored problem. This work establishes the first comprehensive C-UFG benchmark using massive soy leaf data. By analyzing recent pre-trained model (PTM) based continual learning methods on the proposed benchmark, we propose two simple yet effective PTM-based methods to boost the performance of VC-UFG and HC-UFG, respectively. On top of those, we integrate the two methods into one unified framework and propose the first unified model, Unic, that is capable of tackling the C-UFG problem where VC-UFG and HC-UFG co-exist in a single continual learning sequence. To understand the effectiveness of the proposed methods, we first evaluate the models on the VC-UFG and HC-UFG challenges and then test the proposed Unic on a unified C-UFG challenge. Experimental results demonstrate that the proposed methods achieve superior performance for C-UFG. The code is available at https://github.com/PatrickZad/unicufg.

AAAI Conference 2025 Conference Paper

Visual Perturbation for Text-Based Person Search

  • Pengcheng Zhang
  • Xiaohan Yu
  • Xiao Bai
  • Jin Zheng

Text-based person search (TBPS) aims at locating a person described by natural language in uncropped scene images. Recent works for TBPS mainly focus on aligning multi-granularity vision and language representations, neglecting a key discrepancy between training and inference: training learns to unify vision and language features where the visual side covers all clues described by the language, whereas inference matches image-text pairs where the images may capture only part of the described clues due to perturbations such as occlusions, background clutter and misaligned boundaries. To alleviate this issue, we present ViPer: a Visual Perturbation network that learns to match language descriptions with perturbed visual clues. On top of a CLIP-driven baseline, we design three visual perturbation modules: (1) Spatial ViPer, which varies person proposals and produces visual features with misaligned boundaries, (2) Attentive ViPer, which estimates visual attention on the fly and manipulates attentive visual tokens within a proposal to produce global features under visual perturbations, and (3) Fine-grained ViPer, which learns to recover masked visual clues from detailed language descriptions to encourage matching language features with perturbed visual features at the fine granularity. This overall framework thus simulates real-world scenarios at the training stage to minimize the discrepancy and improve the generalization ability of the model. Experimental results demonstrate that the proposed method clearly surpasses previous TBPS methods on the PRW-TBPS and CUHK-SYSU-TBPS datasets.
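
As a toy illustration of the spatial-perturbation idea, the sketch below randomly jitters a person proposal so that crops seen during training have deliberately misaligned boundaries; the box format, jitter range, and function name are assumptions and not the paper's settings.

```python
# Hypothetical sketch of boundary perturbation for person proposals: shift and
# clamp a box so cropped features simulate misaligned detections at train time.
import random

def jitter_box(box, img_w: int, img_h: int, max_shift: float = 0.1):
    """Randomly shift a proposal (x1, y1, x2, y2) by up to max_shift of its size."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    dx = random.uniform(-max_shift, max_shift) * w
    dy = random.uniform(-max_shift, max_shift) * h
    # Clamp the perturbed box to the image extent while keeping it non-empty.
    x1 = min(max(x1 + dx, 0), img_w - 1)
    y1 = min(max(y1 + dy, 0), img_h - 1)
    x2 = min(max(x2 + dx, x1 + 1), img_w)
    y2 = min(max(y2 + dy, y1 + 1), img_h)
    return (x1, y1, x2, y2)

# Usage: jitter_box((120, 40, 220, 320), img_w=640, img_h=480)
```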

IJCAI Conference 2023 Conference Paper

CLE-ViT: Contrastive Learning Encoded Transformer for Ultra-Fine-Grained Visual Categorization

  • Xiaohan Yu
  • Jun Wang
  • Yongsheng Gao

Ultra-fine-grained visual classification (ultra-FGVC) targets classifying sub-grained categories of fine-grained objects. This inevitably requires discriminative representation learning within a limited training set. Exploring intrinsic features from the object itself, e.g., predicting the rotation of a given image, has demonstrated great progress towards learning discriminative representations. Yet none of these works consider explicit supervision for learning mutual information at the instance level. To this end, this paper introduces CLE-ViT, a novel contrastive learning encoded transformer, to address the fundamental problem in ultra-FGVC. The core design is a self-supervised module that performs self-shuffling and masking and then distinguishes these altered images from other images. This drives the model to learn an optimized feature space that has a large inter-class distance while remaining tolerant to intra-class variations. By incorporating this self-supervised module, the network acquires more knowledge from the intrinsic structure of the input data, which improves the generalization ability without requiring extra manual annotations. CLE-ViT demonstrates strong performance on 7 publicly available datasets, showing its effectiveness in the ultra-FGVC task. The code is available at https://github.com/Markin-Wang/CLEViT.
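
A minimal sketch of the shuffle-and-mask augmentation described above is given below, assuming inputs whose height and width divide evenly into a regular patch grid; the patch size, mask ratio, and zero-masking are illustrative choices rather than the authors' code.

```python
# Hypothetical sketch of a self-shuffling and masking augmentation: permute the
# patch grid of each image and zero out a random subset of patches, producing an
# altered view for contrastive comparison. Settings are assumptions.
import torch

def shuffle_and_mask(images: torch.Tensor, patch: int = 56, mask_ratio: float = 0.3) -> torch.Tensor:
    """Randomly permute the patch grid of each image and zero a subset of patches."""
    b, c, h, w = images.shape
    gh, gw = h // patch, w // patch
    # (B, C, H, W) -> (B, gh*gw, C, patch, patch)
    patches = (images.view(b, c, gh, patch, gw, patch)
                     .permute(0, 2, 4, 1, 3, 5)
                     .reshape(b, gh * gw, c, patch, patch))
    out = torch.empty_like(patches)
    for i in range(b):
        perm = torch.randperm(gh * gw, device=images.device)
        shuffled = patches[i, perm]                      # copy via advanced indexing
        keep = torch.rand(gh * gw, device=images.device) > mask_ratio
        shuffled[~keep] = 0.0                            # mask dropped patches
        out[i] = shuffled
    # Reassemble the patch grid back into an image.
    return (out.view(b, gh, gw, c, patch, patch)
               .permute(0, 3, 1, 4, 2, 5)
               .reshape(b, c, h, w))

# Usage: altered = shuffle_and_mask(torch.randn(4, 3, 224, 224))
```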

AAAI Conference 2020 Conference Paper

Patchy Image Structure Classification Using Multi-Orientation Region Transform

  • Xiaohan Yu
  • Yang Zhao
  • Yongsheng Gao
  • Shengwu Xiong
  • Xiaohui Yuan

Exterior contour and interior structure are both vital features for classifying objects. However, most of the existing methods consider exterior contour features and internal structure features separately, and thus fail to function when classifying patchy image structures that have similar contours and flexible structures. To address the above limitations, this paper proposes a novel Multi-Orientation Region Transform (MORT), which can effectively characterize both contour and structure features simultaneously, for patchy image structure classification. MORT is performed over multiple orientation regions at multiple scales to effectively integrate patchy features, and thus enables a better description of the shape in a coarse-to-fine manner. Moreover, the proposed MORT can be extended to combine with deep convolutional neural network techniques for further enhancement of classification accuracy. Very encouraging experimental results on the challenging ultra-fine-grained cultivar recognition task, insect wing recognition task, and large-variation butterfly recognition task demonstrate the effectiveness and superiority of the proposed MORT over state-of-the-art methods in classifying patchy image structures. Our code and three patchy image structure datasets are available at https://github.com/XiaohanYu-GU/MReT2019.
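
For intuition, the sketch below gathers simple multi-orientation, multi-scale region statistics around the image centroid; the sector/scale layout and the use of mean intensity as the region descriptor are assumptions and only loosely mirror the transform described in the paper.

```python
# Illustrative sketch of pooling statistics over orientation regions at several
# radial scales around the centroid of a grayscale image. Not the MORT algorithm
# itself; layout and descriptor choice are assumptions.
import numpy as np

def orientation_region_features(img: np.ndarray, n_orient: int = 8, n_scale: int = 3) -> np.ndarray:
    """Mean intensity over angular sectors at several radial scales (flattened)."""
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r = np.hypot(yy - cy, xx - cx)
    theta = np.arctan2(yy - cy, xx - cx)                          # angle in [-pi, pi]
    sector = ((theta + np.pi) / (2 * np.pi) * n_orient).astype(int) % n_orient
    scale = np.minimum((r / (r.max() + 1e-6) * n_scale).astype(int), n_scale - 1)
    feats = np.zeros((n_orient, n_scale))
    for o in range(n_orient):
        for s in range(n_scale):
            mask = (sector == o) & (scale == s)
            feats[o, s] = img[mask].mean() if mask.any() else 0.0
    return feats.ravel()

# Usage: orientation_region_features(np.random.rand(128, 128))
```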

AAAI Conference 2019 Conference Paper

Lattice CNNs for Matching Based Chinese Question Answering

  • Yuxuan Lai
  • Yansong Feng
  • Xiaohan Yu
  • Zheng Wang
  • Kun Xu
  • Dongyan Zhao

Short text matching often faces the challenge that the two texts exhibit great word mismatch and expression diversity, which is further aggravated in languages like Chinese, where there is no natural space to segment words explicitly. In this paper, we propose a novel lattice-based CNN model (LCNs) to utilize the multi-granularity information inherent in the word lattice while maintaining a strong ability to deal with the introduced noisy information, for matching-based question answering in Chinese. We conduct extensive experiments on both document-based question answering and knowledge-based question answering tasks, and experimental results show that the LCNs models can significantly outperform state-of-the-art matching models and strong baselines by taking advantage of their better ability to distill rich yet discriminative information from the word lattice input.
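
To make the word-lattice input concrete, the toy sketch below enumerates lattice edges for a Chinese sentence by matching every in-vocabulary span, so a downstream CNN can pool over multiple segmentation granularities; the vocabulary, maximum word length, and edge format are assumed for illustration.

```python
# Toy sketch of building a word lattice: every dictionary word matching a span
# becomes an edge (start, end, word); single characters keep the lattice connected.
# The vocabulary and max word length are illustrative assumptions.
def build_lattice(sentence: str, vocab: set, max_len: int = 4):
    """Return lattice edges as (start, end, word) covering all in-vocab spans."""
    edges = []
    for i in range(len(sentence)):
        edges.append((i, i + 1, sentence[i]))                     # character edge
        for j in range(i + 2, min(i + max_len, len(sentence)) + 1):
            if sentence[i:j] in vocab:
                edges.append((i, j, sentence[i:j]))               # word edge
    return edges

# Example: build_lattice("南京市长江大桥",
#                        {"南京", "南京市", "市长", "长江", "大桥", "长江大桥"})
```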