Arrow Research search

Author name cluster

Xu Yan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

12 papers
2 author rows

Possible papers (12)

NeurIPS Conference 2025 Conference Paper

SQS: Enhancing Sparse Perception Models via Query-based Splatting in Autonomous Driving

  • Haiming Zhang
  • Yiyao Zhu
  • Wending Zhou
  • Xu Yan
  • Yingjie Cai
  • Bingbing Liu
  • Shuguang Cui
  • Zhen Li

Sparse Perception Models (SPMs) adopt a query-driven paradigm that forgoes explicit dense BEV or volumetric construction, enabling highly efficient computation and accelerated inference. In this paper, we introduce SQS, a novel query-based splatting pre-training specifically designed to advance SPMs in autonomous driving. SQS introduces a plug-in module that predicts 3D Gaussian representations from sparse queries during pre-training, leveraging self-supervised splatting to learn fine-grained contextual features through the reconstruction of multi-view images and depth maps. During fine-tuning, the pre-trained Gaussian queries are seamlessly integrated into downstream networks via query interaction mechanisms that explicitly connect pre-trained queries with task-specific queries, effectively accommodating the diverse requirements of occupancy prediction and 3D object detection. Extensive experiments on autonomous driving benchmarks demonstrate that SQS delivers considerable performance gains across multiple query-based 3D perception tasks, notably in occupancy prediction and 3D object detection, outperforming prior state-of-the-art pre-training approaches by a significant margin (i.e., +1.3 mIoU on occupancy prediction and +1.0 NDS on 3D detection).
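
As a rough illustration of the plug-in idea described above (not the authors' code), a head of this kind could map sparse query embeddings to per-query 3D Gaussian parameters for splatting. The class name, dimensions, and parameterization below are assumptions made for the sketch.

```python
# Hypothetical sketch: sparse queries -> 3D Gaussian parameters.
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    def __init__(self, d_model: int = 256):
        super().__init__()
        # One linear projection per Gaussian attribute.
        self.mean = nn.Linear(d_model, 3)      # 3D center
        self.scale = nn.Linear(d_model, 3)     # per-axis extent
        self.rot = nn.Linear(d_model, 4)       # quaternion
        self.opacity = nn.Linear(d_model, 1)
        self.color = nn.Linear(d_model, 3)

    def forward(self, queries: torch.Tensor) -> dict:
        # queries: (batch, num_queries, d_model)
        return {
            "mean": self.mean(queries),
            "scale": torch.exp(self.scale(queries)),  # keep scales positive
            "rot": nn.functional.normalize(self.rot(queries), dim=-1),
            "opacity": torch.sigmoid(self.opacity(queries)),
            "color": torch.sigmoid(self.color(queries)),
        }

head = GaussianHead()
gaussians = head(torch.randn(2, 900, 256))  # e.g., 900 sparse queries
```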

AAAI Conference 2024 Conference Paper

RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation

  • Haiming Zhang
  • Xu Yan
  • Dongfeng Bai
  • Jiantao Gao
  • Pan Wang
  • Bingbing Liu
  • Shuguang Cui
  • Zhen Li

3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images. However, image-based scene perception encounters significant challenges in achieving accurate prediction due to the absence of geometric priors. In this paper, we address this issue by exploring cross-modal knowledge distillation in this task, i.e., we leverage a stronger multi-modal model to guide the visual model during training. In practice, we observe that directly applying feature or logit alignment, proposed and widely used in bird's-eye-view (BEV) perception, does not yield satisfactory results. To overcome this problem, we introduce RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction. By employing differentiable volume rendering, we generate depth and semantic maps in perspective views and propose two novel consistency criteria between the rendered outputs of teacher and student models. Specifically, the depth consistency loss aligns the termination distributions of the rendered rays, while the semantic consistency loss mimics the intra-segment similarity guided by vision foundation models (VFMs). Experimental results on the nuScenes dataset demonstrate the effectiveness of our proposed method in improving various 3D occupancy prediction approaches, e.g., our proposed methodology enhances our baseline by 2.2% mIoU and achieves 50% on the Occ3D benchmark.
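
A minimal sketch of the depth-consistency idea, assuming both teacher and student expose per-ray termination weight distributions from differentiable volume rendering; this is an illustration of the abstract's description, not RadOcc's actual loss.

```python
# Hypothetical depth-consistency distillation between rendered rays.
import torch
import torch.nn.functional as F

def depth_consistency_loss(student_weights: torch.Tensor,
                           teacher_weights: torch.Tensor,
                           eps: float = 1e-8) -> torch.Tensor:
    # weights: (num_rays, num_samples), each row approximately summing to 1.
    s_log = student_weights.clamp_min(eps).log()
    t = teacher_weights.clamp_min(eps)
    # KL(teacher || student): pull the student's ray termination
    # distribution toward the teacher's.
    return F.kl_div(s_log, t, reduction="batchmean")
```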

ECAI Conference 2024 Conference Paper

TabMedBERT: A Tabular Knowledge Enhanced Biomedical Pretrained Language Model

  • Xu Yan
  • Lei Geng
  • Ziqiang Cao
  • Juntao Li 0005
  • Wenjie Li 0002
  • Sujian Li
  • Xinjie Zhou
  • Yang Yang 0074

Most existing biomedical language models are trained on plain text with general learning goals such as random word infilling, failing to sufficiently capture the knowledge in the biomedical corpus. Since biomedical articles usually contain many tables summarising the main entities and their relations, in this paper we propose a Tabular knowledge enhanced bioMedical pretrained language model, called TabMedBERT. Specifically, we align entities between table cells and article text spans with pre-defined rules. Then we add two table-related self-supervised tasks to integrate tabular knowledge into the language model: Entity Infilling (EI) and Table Cloze Test (TCT). While EI masks tokens within aligned entities in the article, TCT converts aligned entities in the table layout into a cloze test by erasing one entity and prompting the model to extract the appropriate span to fill in the blank. Experimental results demonstrate that TabMedBERT surpasses all competing language models without adding parameters, establishing a new state-of-the-art performance of 85.59% (+1.29%) on the BLURB biomedical NLP benchmark and 7 additional information extraction datasets. Moreover, the model architecture for TCT provides a straightforward solution for revising information extraction with paired entities.
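
To make the Table Cloze Test concrete, here is a hypothetical illustration of constructing one TCT example: erase one aligned entity from a linearized table row and keep it as the extraction target. The linearization format and the [BLANK] token are assumptions, not the paper's exact scheme.

```python
# Hypothetical TCT example construction from a table row.
def make_tct_example(row_cells: list[str], erase_idx: int) -> tuple[str, str]:
    answer = row_cells[erase_idx]          # the erased entity is the label
    cells = list(row_cells)
    cells[erase_idx] = "[BLANK]"           # cloze position to be filled
    cloze = " | ".join(cells)              # simple row linearization
    return cloze, answer

cloze, answer = make_tct_example(["aspirin", "inhibits", "COX-1"], 2)
# cloze == "aspirin | inhibits | [BLANK]", answer == "COX-1"
```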

AAAI Conference 2024 Conference Paper

X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-Modal Knowledge Transfer

  • Linglin Jing
  • Ying Xue
  • Xu Yan
  • Chaoda Zheng
  • Dong Wang
  • Ruimao Zhang
  • Zhigang Wang
  • Hui Fang

The field of 4D point cloud understanding is rapidly developing with the goal of analyzing dynamic 3D point cloud sequences. However, it remains a challenging task due to the sparsity and lack of texture in point clouds. Moreover, the irregularity of point clouds makes it difficult to align temporal information within video sequences. To address these issues, we propose a novel cross-modal knowledge transfer framework, called X4D-SceneFormer. This framework enhances 4D scene understanding by transferring texture priors from RGB sequences using a Transformer architecture with temporal relationship mining. Specifically, the framework is designed with a dual-branch architecture, consisting of a 4D point cloud transformer and a Gradient-aware Image Transformer (GIT). The GIT combines visual texture and temporal correlation features to offer rich semantics and dynamics for better point cloud representation. During training, we employ multiple knowledge transfer techniques, including temporal consistency losses and masked self-attention, to strengthen the knowledge transfer between modalities. This leads to enhanced performance during inference using single-modal 4D point cloud inputs. Extensive experiments demonstrate the superior performance of our framework on various 4D point cloud video understanding tasks, including action recognition, action segmentation and semantic segmentation. The results achieve first place, i.e., 85.3% (+7.9%) accuracy and 47.3% (+5.0%) mIoU for 4D action segmentation and semantic segmentation, on the HOI4D challenge, outperforming previous state-of-the-art methods by a large margin. We release the code at https://github.com/jinglinglingling/X4D.
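
As an illustration only, two of the training signals named above could be sketched as follows: a cross-modal distillation term pulling point features toward the image branch, plus a temporal consistency term between adjacent frames. Shapes, names, and the loss weighting are assumptions; the actual X4D-SceneFormer objectives are more elaborate.

```python
# Hypothetical cross-modal transfer + temporal consistency losses.
import torch
import torch.nn.functional as F

def transfer_losses(point_feats: torch.Tensor,   # (T, N, C) per-frame point features
                    image_feats: torch.Tensor    # (T, N, C) aligned image features
                    ) -> torch.Tensor:
    # Knowledge transfer: match the point branch to the (frozen) image branch.
    kd = F.mse_loss(point_feats, image_feats.detach())
    # Temporal consistency: discourage abrupt feature changes between frames.
    tc = F.mse_loss(point_feats[1:], point_feats[:-1])
    return kd + 0.1 * tc  # the weighting is an arbitrary placeholder
```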

AAAI Conference 2023 Conference Paper

Geometry-Aware Network for Domain Adaptive Semantic Segmentation

  • Yinghong Liao
  • Wending Zhou
  • Xu Yan
  • Zhen Li
  • Yizhou Yu
  • Shuguang Cui

Measuring and alleviating the discrepancies between the synthetic (source) and real scene (target) data is the core issue for domain adaptive semantic segmentation. Though recent works have introduced depth information in the source domain to reinforce the geometric and semantic knowledge transfer, they cannot extract the intrinsic 3D information of objects, including positions and shapes, merely based on 2D estimated depth. In this work, we propose a novel Geometry-Aware Network for Domain Adaptation (GANDA), leveraging more compact 3D geometric point cloud representations to shrink the domain gaps. In particular, we first utilize the auxiliary depth supervision from the source domain to obtain depth predictions in the target domain to accomplish structure-texture disentanglement. Beyond depth estimation, we explicitly exploit 3D topology on the point clouds generated from RGB-D images for further coordinate-color disentanglement and pseudo-label refinement in the target domain. Moreover, to improve the 2D classifier in the target domain, we perform domain-invariant geometric adaptation from source to target and unify the 2D semantic and 3D geometric segmentation results in the two domains. Note that our GANDA is plug-and-play in any existing UDA framework. Qualitative and quantitative results demonstrate that our model outperforms state-of-the-art methods on GTA5->Cityscapes and SYNTHIA->Cityscapes.
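
The geometric step this method relies on, lifting an RGB-D image to a 3D point cloud so its topology can be exploited, is standard pinhole back-projection; a minimal, GANDA-agnostic sketch:

```python
# Back-project a depth map to a point cloud using camera intrinsics.
import torch

def backproject(depth: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    # depth: (H, W) predicted depth map; K: (3, 3) camera intrinsics.
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]   # inverse of u = fx * x / z + cx
    y = (v - K[1, 2]) * z / K[1, 1]   # inverse of v = fy * y / z + cy
    return torch.stack([x, y, z], dim=-1).reshape(-1, 3)  # (H*W, 3) points
```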

NeurIPS Conference 2022 Conference Paper

Let Images Give You More: Point Cloud Cross-Modal Training for Shape Analysis

  • Xu Yan
  • Heshen Zhan
  • Chaoda Zheng
  • Jiantao Gao
  • Ruimao Zhang
  • Shuguang Cui
  • Zhen Li

Although recent point cloud analysis achieves impressive progress, the paradigm of representation learning from a single modality gradually meets its bottleneck. In this work, we take a step towards more discriminative 3D point cloud representation using 2D images, which inherently contain richer appearance information, e.g., texture, color, and shade. Specifically, this paper introduces a simple but effective point cloud cross-modality training (PointCMT) strategy, which utilizes view-images, i.e., rendered or projected 2D images of the 3D object, to boost point cloud classification. In practice, to effectively acquire auxiliary knowledge from view-images, we develop a teacher-student framework and formulate the cross-modal learning as a knowledge distillation problem. Through novel feature and classifier enhancement criteria, PointCMT eliminates the distribution discrepancy between different modalities and effectively avoids potential negative transfer. Note that PointCMT efficiently improves the point-only representation without any architecture modification. Sufficient experiments verify significant gains on various datasets based on several backbones, i.e., equipped with PointCMT, PointNet++ and PointMLP achieve state-of-the-art performance on two benchmarks, i.e., 94.4% and 86.7% accuracy on ModelNet40 and ScanObjectNN, respectively.
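
A minimal sketch of the teacher-student formulation described above: an image teacher guides a point-cloud student through a feature-level and a classifier-level criterion. The projection head, dimensions, and temperature are assumptions for illustration, not PointCMT's exact criteria.

```python
# Hypothetical feature- and classifier-level distillation terms.
import torch
import torch.nn as nn
import torch.nn.functional as F

proj = nn.Linear(256, 512)  # map student features into the teacher's space

def cross_modal_losses(student_feat, teacher_feat, student_logits,
                       teacher_logits, tau: float = 4.0):
    # Feature enhancement: align global embeddings across modalities.
    feat_loss = F.mse_loss(proj(student_feat), teacher_feat.detach())
    # Classifier enhancement: soften logits and match distributions.
    kd_loss = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                       F.softmax(teacher_logits.detach() / tau, dim=-1),
                       reduction="batchmean") * tau * tau
    return feat_loss + kd_loss
```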

IJCAI Conference 2022 Conference Paper

Multi-Graph Fusion Networks for Urban Region Embedding

  • Shangbin Wu
  • Xu Yan
  • Xiaoliang Fan
  • Shirui Pan
  • Shichao Zhu
  • Chuanpan Zheng
  • Ming Cheng
  • Cheng Wang

Learning embeddings for urban regions from human mobility data can reveal the functionality of regions and thereby enable correlated but distinct tasks such as crime prediction. Human mobility data contains rich and abundant information, which yields comprehensive region embeddings for cross-domain tasks. In this paper, we propose multi-graph fusion networks (MGFN) to enable cross-domain prediction tasks. First, we integrate the graphs with spatio-temporal similarity as mobility patterns through a mobility graph fusion module. Then, in the mobility pattern joint learning module, we design a multi-level cross-attention mechanism to learn comprehensive embeddings from multiple mobility patterns based on intra-pattern and inter-pattern messages. Finally, we conduct extensive experiments on real-world urban datasets. Experimental results demonstrate that the proposed MGFN outperforms state-of-the-art methods by up to a 12.35% improvement. https://github.com/wushangbin/MGFN
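
Illustration only: one plausible reading of "mobility graph fusion" is a learned weighted combination of several adjacency matrices followed by a message-passing step. MGFN's actual module is attention-based and richer; everything below is an assumption made to show the shape of the idea.

```python
# Hypothetical fusion of multiple mobility graphs into one, then one GCN step.
import torch
import torch.nn as nn

class SimpleGraphFusion(nn.Module):
    def __init__(self, num_graphs: int, dim: int):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_graphs))  # one weight per graph
        self.lin = nn.Linear(dim, dim)

    def forward(self, adjs: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # adjs: (num_graphs, N, N) mobility graphs; x: (N, dim) region features.
        w = torch.softmax(self.weights, dim=0)
        fused = (w[:, None, None] * adjs).sum(dim=0)  # (N, N) fused graph
        return torch.relu(self.lin(fused @ x))        # one message-passing step
```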

AAAI Conference 2021 Conference Paper

Plug-and-Play Domain Adaptation for Cross-Subject EEG-based Emotion Recognition

  • Li-Ming Zhao
  • Xu Yan
  • Bao-Liang Lu

Human emotion decoding in affective brain-computer interfaces suffers a major setback due to the inter-subject variability of electroencephalography (EEG) signals. Existing approaches usually require amassing extensive EEG data for each new subject, which is prohibitively time-consuming and makes for a poor user experience. To tackle this issue, we divide EEG representations into private components specific to each subject and shared emotional components that are universal to all subjects. According to this representation partition, we propose a plug-and-play domain adaptation method for dealing with inter-subject variability. In the training phase, subject-invariant emotional representations and private components of source subjects are separately captured by a shared encoder and private encoders. Furthermore, we build one emotion classifier on the shared partition and subjects' individual classifiers on the combination of these two partitions. In the calibration phase, the model requires only a small amount of unlabeled EEG data from incoming target subjects to model their private components. Therefore, besides the shared emotion classifier, we have another pipeline that uses the knowledge of source subjects through the similarity of private components. In the test phase, we integrate predictions of the shared emotion classifier with those of the individual-classifier ensemble after modulation by similarity weights. Experimental results on the SEED dataset show that our model shortens the calibration time to within a minute while maintaining recognition accuracy, all of which makes emotion decoding more generalizable and practicable.
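
A minimal sketch of the test-phase idea, assuming each source subject contributes an individual classifier and a private-component embedding: weight the source classifiers by how similar the target's private component is to each source's, then blend with the shared classifier. The cosine similarity and the equal blend are placeholder assumptions.

```python
# Hypothetical similarity-weighted ensemble at test time.
import torch
import torch.nn.functional as F

def ensemble_predict(shared_logits: torch.Tensor,      # (num_classes,)
                     source_logits: torch.Tensor,      # (num_sources, num_classes)
                     target_private: torch.Tensor,     # (d,)
                     source_privates: torch.Tensor     # (num_sources, d)
                     ) -> torch.Tensor:
    sims = F.cosine_similarity(target_private[None, :], source_privates, dim=-1)
    w = torch.softmax(sims, dim=0)                     # similarity weights
    individual = (w[:, None] * source_logits).sum(dim=0)
    return 0.5 * shared_logits + 0.5 * individual      # equal blend is a placeholder
```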

IJCAI Conference 2021 Conference Paper

PointLIE: Locally Invertible Embedding for Point Cloud Sampling and Recovery

  • Weibing Zhao
  • Xu Yan
  • Jiantao Gao
  • Ruimao Zhang
  • Jiayan Zhang
  • Zhen Li
  • Song Wu
  • Shuguang Cui

Point Cloud Sampling and Recovery (PCSR) is critical for massive real-time point cloud collection and processing, since raw data usually requires large storage and computation. This paper addresses a fundamental problem in PCSR: how to downsample a dense point cloud at arbitrary scales while preserving the local topology of discarded points in a case-agnostic manner (i.e., without additional storage for point relationships)? We propose a novel Locally Invertible Embedding (PointLIE) framework to unify point cloud sampling and upsampling into a single framework through bi-directional learning. Specifically, PointLIE decouples the local geometric relationships between the discarded points and the sampled points by progressively encoding the neighboring offsets into a latent variable. Once the latent variable is forced to obey a pre-defined distribution in the forward sampling path, the recovery can be achieved effectively through inverse operations. Taking the recovery-friendly sampled points and a latent embedding randomly drawn from the specified distribution as inputs, PointLIE can theoretically guarantee the fidelity of reconstruction and outperforms state-of-the-art methods quantitatively and qualitatively.
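
To show the invertibility idea in isolation (not PointLIE itself), here is an additive coupling step that folds neighbor offsets into a latent and can be inverted exactly; real PointLIE stacks learned, point-structured coupling layers and constrains the latent's distribution.

```python
# Hypothetical additive coupling: exact forward/inverse pair.
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, sampled: torch.Tensor, offsets: torch.Tensor):
        # Forward path: fold offset information into a latent variable.
        z = offsets + self.net(sampled)
        return sampled, z

    def inverse(self, sampled: torch.Tensor, z: torch.Tensor):
        # Exact inversion: recover the discarded offsets from the latent.
        return z - self.net(sampled)
```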

AAAI Conference 2021 Conference Paper

Sparse Single Sweep LiDAR Point Cloud Segmentation via Learning Contextual Shape Priors from Scene Completion

  • Xu Yan
  • Jiantao Gao
  • Jie Li
  • Ruimao Zhang
  • Zhen Li
  • Rui Huang
  • Shuguang Cui

LiDAR point cloud analysis is a core task for 3D computer vision, especially for autonomous driving. However, due to the severe sparsity and noise interference in a single-sweep LiDAR point cloud, accurate semantic segmentation is nontrivial to achieve. In this paper, we propose a novel sparse LiDAR point cloud semantic segmentation framework assisted by learned contextual shape priors. In practice, an initial semantic segmentation (SS) of a single-sweep point cloud can be achieved by any appealing network and then flows into the semantic scene completion (SSC) module as the input. By merging multiple frames in the LiDAR sequence as supervision, the optimized SSC module learns contextual shape priors from sequential LiDAR data, completing the sparse single-sweep point cloud to a dense one. Thus, it inherently improves SS optimization through fully end-to-end training. Besides, a Point-Voxel Interaction (PVI) module is proposed to further enhance the knowledge fusion between the SS and SSC tasks, i.e., promoting the interaction between the incomplete local geometry of the point cloud and the complete voxel-wise global structure. Furthermore, the auxiliary SSC and PVI modules can be discarded during inference without extra burden for SS. Extensive experiments confirm that our JS3C-Net achieves superior performance on both the SemanticKITTI and SemanticPOSS benchmarks, i.e., 4% and 3% improvements, respectively.
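
An illustrative sketch of one point-voxel interaction: gather each point's voxel feature by its voxel index and fuse it with the point feature. The module name and the concatenation-based fusion are assumptions, not JS3C-Net's exact PVI design.

```python
# Hypothetical point-voxel feature fusion.
import torch
import torch.nn as nn

class PointVoxelFuse(nn.Module):
    def __init__(self, point_dim: int, voxel_dim: int):
        super().__init__()
        self.fuse = nn.Linear(point_dim + voxel_dim, point_dim)

    def forward(self, point_feats, voxel_feats, point2voxel):
        # point_feats: (N, Cp); voxel_feats: (V, Cv);
        # point2voxel: (N,) index of the voxel containing each point.
        gathered = voxel_feats[point2voxel]                      # (N, Cv)
        return self.fuse(torch.cat([point_feats, gathered], -1))  # (N, Cp)
```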