
Author name cluster

Xinyu Xiao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers

Possible papers (5)

AAAI 2025 Conference Paper

Privacy-Preserving V2X Collaborative Perception Integrating Unknown Collaborators

  • Bin Lu
  • Xinyu Xiao
  • Changzhou Zhang
  • Yang Zhou
  • Zhiyu Xiang
  • Hangguan Shan
  • Eryun Liu

Vehicle-to-everything (V2X) collaborative perception has recently gained increasing attention in autonomous driving due to its ability to enhance scene understanding by integrating information from other collaborators, e.g., vehicles or infrastructure. Existing algorithms usually share deep features to achieve a trade-off between accuracy and bandwidth. However, most of these methods require joint training of all agents, which risks privacy leakage and is impractical in the real world. Sharing prediction results seems to be a direct solution, but its performance is suboptimal and sensitive to localization noise and communication delay. In this paper, we propose a privacy-preserving collaborative perception framework in which each agent is trained separately on its own dataset and the ego vehicle must integrate with completely unknown collaborators. Specifically, we propose MSD, a multi-scale feature fusion method combined with deformable attention, to better fuse the features of different agents. We also propose a plug-in domain adapter to align features from unknown collaborators to the ego domain. Extensive experiments on the challenging DAIR-V2X and V2V4Real benchmarks demonstrate that: 1) MSD achieves remarkable performance, outperforming other methods by at least 2.8% and 6.7% in AP@0.7 on DAIR-V2X and V2V4Real, respectively; 2) after domain adaptation, it significantly outperforms the No Fusion and Late Fusion baselines and can approach or even surpass the performance of joint training. Our framework truly achieves privacy-preserving collaboration, providing a new paradigm for the study of collaborative perception that is crucial for practical applications.
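To make the fusion idea concrete, here is a minimal PyTorch-style sketch of the two ingredients the abstract names: a plug-in adapter that maps a collaborator's features into the ego domain, and a simplified single-scale deformable fusion step in which each ego location samples the collaborator map at a few learned offsets. All class names, shapes, and design details are illustrative assumptions, not the paper's actual MSD implementation (which fuses across multiple scales).

```python
# Illustrative sketch only; names, shapes, and structure are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainAdapter(nn.Module):
    """Plug-in adapter: maps a collaborator's features toward the ego domain."""
    def __init__(self, channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return feat + self.net(feat)  # residual keeps the adapter lightweight

class DeformableFusion(nn.Module):
    """Simplified single-scale deformable fusion: each ego location samples the
    collaborator map at a few learned offsets and aggregates with attention."""
    def __init__(self, channels: int, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offsets = nn.Conv2d(channels, 2 * num_points, 3, padding=1)
        self.weights = nn.Conv2d(channels, num_points, 3, padding=1)

    def forward(self, ego: torch.Tensor, collab: torch.Tensor) -> torch.Tensor:
        b, c, h, w = ego.shape
        offsets = self.offsets(ego).view(b, self.num_points, 2, h, w)
        attn = self.weights(ego).softmax(dim=1)                # (b, p, h, w)
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
        base = torch.stack([xs, ys], dim=-1).to(ego)           # (h, w, 2)
        fused = torch.zeros_like(ego)
        for p in range(self.num_points):
            grid = base + offsets[:, p].permute(0, 2, 3, 1)    # (b, h, w, 2)
            sampled = F.grid_sample(collab, grid, align_corners=False)
            fused = fused + attn[:, p:p + 1] * sampled
        return ego + fused

ego, collab = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
adapter, fusion = DomainAdapter(64), DeformableFusion(64)
out = fusion(ego, adapter(collab))   # (1, 64, 32, 32)
```

In a multi-scale pipeline, one such fusion module would run per feature-pyramid level before the detection head; the adapter is the only piece that needs to be fit when a previously unseen collaborator joins.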

NeurIPS 2025 Conference Paper

RayFusion: Ray Fusion Enhanced Collaborative Visual Perception

  • Shaohong Wang
  • Bin Lu
  • Xinyu Xiao
  • Hanzhi Zhong
  • Bowen Pang
  • Tong Wang
  • Zhiyu Xiang
  • Hangguan Shan

Collaborative visual perception methods have gained widespread attention in the autonomous driving community in recent years due to their ability to address sensor limitation problems. However, the absence of explicit depth information often makes it difficult for camera-based perception systems, e.g., 3D object detection, to generate accurate predictions. To alleviate the ambiguity in depth estimation, we propose RayFusion, a ray-based fusion method for collaborative visual perception. Using ray occupancy information from collaborators, RayFusion reduces redundancy and false positive predictions along camera rays, enhancing the detection performance of purely camera-based collaborative perception systems. Comprehensive experiments show that our method consistently outperforms existing state-of-the-art models, substantially advancing the performance of collaborative visual perception. Our code will be made publicly available.
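A rough sense of how ray occupancy evidence could sharpen a camera's depth estimate fits in a few lines. This sketch (names, shapes, and the reweighting rule are assumptions, not taken from the paper) reweights an ego camera's per-ray depth distribution with a collaborator occupancy prior, suppressing depth bins a collaborator observed as free space:

```python
import torch

def fuse_ray_occupancy(ego_depth_logits: torch.Tensor,
                       collab_occupancy: torch.Tensor,
                       strength: float = 1.0) -> torch.Tensor:
    """Reweight the ego camera's per-ray depth distribution with collaborator
    ray-occupancy evidence, suppressing bins a collaborator saw as empty."""
    ego_prob = ego_depth_logits.softmax(dim=-1)          # (num_rays, depth_bins)
    prior = collab_occupancy.clamp(1e-6, 1.0) ** strength
    fused = ego_prob * prior
    return fused / fused.sum(dim=-1, keepdim=True)       # renormalize per ray

# Example: the collaborator marks the near bins of every ray as empty,
# so probability mass shifts toward the farther depth bins.
logits = torch.zeros(2, 6)                  # 2 rays, 6 depth bins, flat prior
occ = torch.tensor([[0.05, 0.05, 0.9, 0.9, 0.9, 0.9]]).expand(2, -1)
print(fuse_ray_occupancy(logits, occ))
```

Downweighting depth bins that another viewpoint observed as free is one way false positives along a camera ray can be pruned without sharing raw images.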

AAAI 2025 Short Paper

Towards Building Human-like Smart Agents in Modern 3D Video Games (Student Abstract)

  • Zhihang Sun
  • Shuhan Qi
  • Xinhao Huang
  • Xinyu Xiao
  • Jiajia Zhang
  • Xuan Wang
  • Peixi Peng

In recent years, reinforcement learning has been widely applied in the field of games. However, most studies focus on helping agents achieve victory, with less attention paid to whether the agents exhibit human-like characteristics. To build high-performing, human-like agents, we propose a method for learning the strategies of human players in modern three-dimensional video games. Our method uses a hierarchical framework: the lower level learns the basic behaviors and intentions of human players through imitation learning, while the higher level learns generalized policies through reinforcement learning. Compared with existing methods, our method demonstrates significant advantages in learning human-like strategies in complex environments.
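The hierarchy described above can be pictured as a two-level policy: low-level behaviors stand in for modules cloned from human demonstrations, and a high-level controller, trainable with reinforcement learning, selects among them. Everything here (class names, the epsilon-greedy tabular controller, the string observations) is an illustrative assumption rather than the authors' architecture:

```python
import random

class LowLevelBehavior:
    """Stand-in for a behavior learned from human demos via imitation learning."""
    def __init__(self, name: str):
        self.name = name

    def act(self, observation):
        return f"{self.name}:{observation}"  # placeholder for a learned policy

class HighLevelPolicy:
    """Epsilon-greedy controller over behaviors; its Q-table would be
    trained with reinforcement learning on the game's reward signal."""
    def __init__(self, behaviors, epsilon: float = 0.1):
        self.behaviors = behaviors
        self.epsilon = epsilon
        self.q = {}  # (state, behavior_index) -> estimated value

    def act(self, state, observation):
        if random.random() < self.epsilon:
            idx = random.randrange(len(self.behaviors))
        else:
            idx = max(range(len(self.behaviors)),
                      key=lambda i: self.q.get((state, i), 0.0))
        return self.behaviors[idx].act(observation)

agent = HighLevelPolicy([LowLevelBehavior("explore"), LowLevelBehavior("attack")])
print(agent.act(state="low_health", observation="enemy_visible"))
```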

NeurIPS 2019 Conference Paper

DetNAS: Backbone Search for Object Detection

  • Yukang Chen
  • Tong Yang
  • Xiangyu Zhang
  • Gaofeng Meng
  • Xinyu Xiao
  • Jian Sun

Object detectors are usually equipped with backbone networks designed for image classification. This can be sub-optimal because of the gap between the tasks of image classification and object detection. In this work, we present DetNAS, which uses Neural Architecture Search (NAS) to design better backbones for object detection. This is non-trivial because detection training typically needs ImageNet pre-training, while NAS systems require accuracies on the target detection task as supervisory signals. Based on the technique of the one-shot supernet, which contains all possible networks in the search space, we propose a framework for backbone search on object detection. We train the supernet under the typical detector training schedule: ImageNet pre-training and detection fine-tuning. Then, the architecture search is performed on the trained supernet, using the detection task as guidance. This framework makes NAS on backbones very efficient. In experiments, we show the effectiveness of DetNAS on various detectors, for instance, the one-stage RetinaNet and the two-stage FPN. We empirically find that networks searched on object detection show consistent superiority compared to those searched on ImageNet classification. The resulting architecture achieves superior performance to hand-crafted networks on COCO with much lower FLOPs complexity.
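The search recipe in the abstract (train a weight-sharing supernet by sampling one candidate path per step, then search over paths using detection accuracy as the fitness) can be illustrated with a small evolutionary loop. The layer and choice counts and the stand-in fitness below are assumptions for illustration; DetNAS would instead score each sampled backbone's detection accuracy using the already-trained shared supernet weights:

```python
import random

CHOICES_PER_LAYER = [4, 4, 4, 4]          # assumed: 4 candidate blocks per layer

def sample_path():
    """One backbone = one choice of block per layer of the supernet."""
    return tuple(random.randrange(n) for n in CHOICES_PER_LAYER)

def mutate(path, prob=0.25):
    """Resample each layer's choice with a small probability."""
    return tuple(random.randrange(CHOICES_PER_LAYER[i]) if random.random() < prob
                 else choice
                 for i, choice in enumerate(path))

def evolutionary_search(fitness, population=16, generations=10, parents=4):
    """Keep the best paths each generation, refill the pool with their mutants."""
    pool = [sample_path() for _ in range(population)]
    for _ in range(generations):
        pool.sort(key=fitness, reverse=True)
        elite = pool[:parents]
        pool = elite + [mutate(random.choice(elite))
                        for _ in range(population - parents)]
    return max(pool, key=fitness)

# Stand-in fitness for illustration; in DetNAS this would be the sampled
# backbone's detection metric evaluated with the shared supernet weights.
print(evolutionary_search(lambda path: -sum(path)))
```

Because every candidate inherits weights from the one supernet, evaluating a path needs only inference, which is what makes the backbone search efficient.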

AAAI 2019 Conference Paper

What and Where the Themes Dominate in Image

  • Xinyu Xiao
  • Lingfeng Wang
  • Shiming Xiang
  • Chunhong Pan

Image captioning aims to describe an image in natural language as a human would; it has benefited from advances in deep neural networks and achieved substantial progress in performance. However, the perspective of how humans describe a scene has not been fully considered in this task. In practice, a human description of a scene is tightly tied to both endogenous knowledge and exogenous salient objects, which implies that the content of the description is confined to the known salient objects. Inspired by this observation, this paper proposes a novel framework that explicitly applies the known salient objects in image captioning. Under this framework, the known salient objects serve as themes to guide description generation. According to the properties of a known salient object, a theme is composed of two components: its endogenous concept (what) and its exogenous spatial attention feature (where). Specifically, the prediction of each word is dominated by the concept and spatial attention feature of the corresponding theme during caption prediction. Moreover, we introduce a novel learning method, Distinctive Learning (DL), to give the generated captions more of the specificity of human descriptions. It formulates two constraints in the theme learning process to encourage distinctiveness between different images. Additionally, reinforcement learning is introduced into the framework to address the exposure bias problem between the training and testing modes. Extensive experiments on the COCO and Flickr30K datasets achieve superior results compared with state-of-the-art methods.
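The what/where conditioning can be pictured as one step of a decoder whose word prediction is driven by a theme's concept embedding (what) and a spatially attended image feature (where). This is a minimal sketch under assumed tensor shapes and an assumed LSTM decoder, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ThemeGuidedDecoder(nn.Module):
    """One decoding step conditioned on a theme: a concept embedding ("what")
    plus a spatially attended image feature ("where")."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, feat_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.LSTMCell(2 * embed_dim + feat_dim, hidden_dim)
        self.attn = nn.Linear(hidden_dim, feat_dim)   # scores regions vs. state
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, word, concept, regions, state):
        h, c = state
        scores = (regions @ self.attn(h).unsqueeze(-1)).squeeze(-1)   # (b, r)
        where = (scores.softmax(-1).unsqueeze(-1) * regions).sum(1)   # (b, feat)
        x = torch.cat([self.embed(word), concept, where], dim=-1)
        h, c = self.cell(x, (h, c))
        return self.out(h), (h, c)                    # word logits, new state

dec = ThemeGuidedDecoder(vocab_size=1000, embed_dim=32, hidden_dim=64, feat_dim=32)
state = (torch.zeros(1, 64), torch.zeros(1, 64))
logits, state = dec.step(word=torch.tensor([5]),         # previous word id
                         concept=torch.randn(1, 32),     # theme concept ("what")
                         regions=torch.randn(1, 9, 32),  # region features
                         state=state)
```

At each step the theme's concept vector steers which word is emitted, while the attention pools region features around the theme's location, matching the abstract's claim that both components dominate word prediction.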