Arrow Research

Author name cluster

Xiaofeng Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers
2 author rows

Possible papers

16

ICLR Conference 2025 Conference Paper

Do Large Language Models Truly Understand Geometric Structures?

  • Xiaofeng Wang
  • Yiming Wang 0011
  • Wenhong Zhu
  • Rui Wang 0015

Geometric ability is a significant challenge for large language models (LLMs) due to the need for advanced spatial comprehension and abstract thinking. Existing datasets primarily evaluate LLMs on their final answers, but final answers alone cannot measure true understanding of geometric structures, as LLMs can arrive at correct answers by coincidence. To fill this gap, we introduce the GeomRel dataset, designed to evaluate LLMs' understanding of geometric structures by isolating the core step of geometric relationship identification in problem solving. Using this benchmark, we conduct thorough evaluations of diverse LLMs and identify key limitations in understanding geometric structures. We further propose the Geometry Chain-of-Thought (GeoCoT) method, which enhances LLMs' ability to identify geometric relationships, resulting in significant performance improvements.
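The abstract does not reproduce the GeoCoT prompt itself. Purely to illustrate the idea of isolating relationship identification before answering, a geometry chain-of-thought prompt might be structured as in the sketch below; the template wording and helper name are assumptions, not the paper's.

```python
# Illustrative only: a hypothetical prompt in the spirit of GeoCoT.
# The wording is an assumption, not the paper's template.
GEO_COT_TEMPLATE = """You are solving a geometry problem.
Problem: {problem}

Step 1 - List every geometric relationship you can identify
(e.g., parallel lines, perpendicular lines, congruent segments, similar triangles).
Step 2 - Using only the relationships from Step 1, derive the answer.
Answer:"""

def build_geocot_prompt(problem: str) -> str:
    """Fill the template with a concrete problem statement."""
    return GEO_COT_TEMPLATE.format(problem=problem)

if __name__ == "__main__":
    print(build_geocot_prompt(
        "In triangle ABC, D is the midpoint of BC and AD is perpendicular to BC. "
        "What kind of triangle is ABC?"
    ))
```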

AAAI Conference 2025 Conference Paper

DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation

  • Guosheng Zhao
  • Xiaofeng Wang
  • Zheng Zhu
  • Xinze Chen
  • Guan Huang
  • Xiaoyi Bao
  • Xingang Wang

World models have demonstrated superiority in autonomous driving, particularly in the generation of multi-view driving videos. However, significant challenges still exist in generating customized driving videos. In this paper, we propose DriveDreamer-2, which incorporates a Large Language Model (LLM) to facilitate the creation of user-defined driving videos. Specifically, a trajectory generation function library is developed to produce trajectories that conform to user descriptions. Subsequently, an HDMap generator is designed to learn the mapping from trajectories to road structures. Ultimately, we propose the Unified Multi-View Model (UniMVM) to enhance temporal and spatial coherence in the generated multi-view driving videos. To the best of our knowledge, DriveDreamer-2 is the first world model to generate customized driving videos, and it can generate uncommon driving videos (e.g., vehicles abruptly cutting in) in a user-friendly manner. In addition, experimental results demonstrate that the generated videos enhance the training of driving perception methods (e.g., 3D detection and tracking). Furthermore, the video generation quality of DriveDreamer-2 surpasses that of other state-of-the-art methods, showcasing FID and FVD scores of 11.2 and 55.7, representing relative improvements of ~30% and ~50%.

NeurIPS Conference 2025 Conference Paper

EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation

  • Xiaofeng Wang
  • Kang Zhao
  • Feng Liu
  • Jiayu Wang
  • Guosheng Zhao
  • Xiaoyi Bao
  • Zheng Zhu
  • Yingya Zhang

Video generation has emerged as a promising tool for world simulation, leveraging visual data to replicate real-world environments. Within this context, egocentric video generation, which centers on the human perspective, holds significant potential for enhancing applications in virtual reality, augmented reality, and gaming. However, the generation of egocentric videos presents substantial challenges due to the dynamic nature of first-person viewpoints, the intricate diversity of actions, and the complex variety of scenes encountered. Existing datasets are inadequate for addressing these challenges effectively. To bridge this gap, we present EgoVid-5M, the first high-quality dataset specifically curated for egocentric video generation. EgoVid-5M encompasses over 5 million egocentric video clips and is enriched with detailed action annotations, including fine-grained kinematic control and high-level textual descriptions. To ensure the integrity and usability of the dataset, we implement a sophisticated data cleansing pipeline designed to maintain frame consistency, action coherence, and motion smoothness under egocentric conditions. Furthermore, we introduce EgoDreamer, which is capable of generating egocentric videos driven simultaneously by action descriptions and kinematic control signals. The EgoVid-5M dataset, associated action annotations, and all data cleansing metadata will be released for the advancement of research in egocentric video generation.
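The cleansing pipeline is only summarized above. As a toy sketch of what one such check (motion smoothness) could look like, the snippet below drops clips whose frame-to-frame motion changes abruptly; the threshold and the use of mean optical-flow magnitude are assumptions, not details from the paper.

```python
# Toy motion-smoothness filter; thresholds and features are assumptions.
import numpy as np

def motion_is_smooth(flow_magnitudes: np.ndarray, max_jerk: float = 2.0) -> bool:
    """flow_magnitudes: per-frame mean optical-flow magnitude of one clip.

    Keep the clip if the second difference of motion (a crude 'jerk' proxy)
    stays below the threshold, i.e. the egocentric camera motion is smooth.
    """
    if len(flow_magnitudes) < 3:
        return True
    jerk = np.abs(np.diff(flow_magnitudes, n=2))
    return float(jerk.max()) < max_jerk

print(motion_is_smooth(np.array([1.0, 1.1, 1.2, 1.1, 1.0])))  # True
print(motion_is_smooth(np.array([1.0, 1.1, 9.0, 1.1, 1.0])))  # False (motion spike)
```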

JBHI Journal 2025 Journal Article

Lightweight 2D Medical Image Segmentation via a Decoder Using Linear Deformable Convolution and Multi-scale Self-attention

  • Le Zou
  • Xiangxu Bu
  • Fengling Jiang
  • Zhize Wu
  • Lingma Sun
  • Kia Dashtipour
  • Mandar Gogate
  • Amir Hussain

Medical image segmentation models typically demand substantial computational resources, which presents a significant challenge in resource-constrained environments, particularly in developing countries. Consequently, the development of decoding mechanisms that are both computationally efficient and lightweight is imperative. However, the performance of medical image segmentation is frequently limited by the simplicity of decoder designs. Balancing the optimization of decoder architectures with the reduction of computational demands while maintaining high model accuracy remains a formidable challenge. In this context, we introduce a novel decoder that integrates linear deformable convolution and multi-scale self-attention (LDMSD). The multi-scale self-attention enhancement module within LDMSD leverages two distinct multi-scale self-attention mechanisms, thereby substantially improving the representational capacity of the feature maps. Furthermore, the decoder incorporates a linear deformable convolution attention-guided mechanism to augment the feature maps derived from skip connections. This mechanism effectively mitigates the inherent limitations of conventional convolution and enhances the model's ability to capture complex semantic relationships within the feature maps. Through this collaborative mechanism, LDMSD captures target information from both global and multi-scale perspectives and accurately locates target boundaries and structures while maintaining its lightweight nature. Experimental results demonstrate that LDMSD outperforms state-of-the-art decoders, reducing Floating Point Operations (FLOPs) by 77.36% and parameter count by 81.66% compared to the Cascaded Attention Decoder (CASCADE). To substantiate the efficacy of the proposed method, extensive experiments are conducted on six publicly available datasets. The results validate that the proposed method surpasses existing approaches in medical image segmentation tasks, in terms of both accuracy and computational efficiency.
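The decoder itself is not specified in the abstract. As a minimal sketch of the general idea of multi-scale self-attention (attention computed on downsampled copies of a feature map and fused back), under the assumption of a plain PyTorch implementation rather than the LDMSD architecture:

```python
# Assumption-laden illustration of multi-scale self-attention, not LDMSD itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleSelfAttention(nn.Module):
    def __init__(self, channels: int, scales=(1, 2, 4), heads: int = 4):
        super().__init__()
        self.scales = scales
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.fuse = nn.Conv2d(channels * len(scales), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        outs = []
        for s in self.scales:
            xs = F.avg_pool2d(x, s) if s > 1 else x           # downsample
            tokens = xs.flatten(2).transpose(1, 2)            # (B, HW/s^2, C)
            att, _ = self.attn(tokens, tokens, tokens)        # self-attention at this scale
            att = att.transpose(1, 2).reshape(b, c, h // s, w // s)
            outs.append(F.interpolate(att, size=(h, w), mode="bilinear",
                                      align_corners=False))   # back to full resolution
        return self.fuse(torch.cat(outs, dim=1))              # 1x1 conv fusion

x = torch.randn(1, 64, 32, 32)
print(MultiScaleSelfAttention(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```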

ICLR Conference 2025 Conference Paper

Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model

  • Wenhong Zhu
  • Zhiwei He 0002
  • Xiaofeng Wang
  • Pengfei Liu
  • Rui Wang 0015

Aligning language models (LMs) with human preferences has become a key area of research, enabling these models to better meet diverse user needs. Inspired by weak-to-strong generalization, where a strong LM fine-tuned on labels generated by a weaker model can consistently outperform its weak supervisor, we extend this idea to model alignment. In this work, we observe that the alignment behavior in weaker models can be effectively transferred to stronger models and even exhibit an amplification effect. Based on this insight, we propose a method called Weak-to-Strong Preference Optimization (WSPO), which achieves strong model alignment by learning the distribution differences before and after the alignment of the weak model. Experiments demonstrate that WSPO delivers outstanding performance, improving the win rate of Qwen2-7B-Instruct on Arena-Hard from 39.70 to 49.60, achieving a remarkable 47.04 length-controlled win rate on AlpacaEval 2, and scoring 7.33 on MT-bench. Our results suggest that it is feasible to use a weak model to elicit strong alignment ability in a stronger model. The code is available at https://github.com/zwhong714/weak-to-strong-preference-optimization.
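One way to read the "distribution differences" signal, as a minimal sketch only and not the exact WSPO objective (the linked repository has the real formulation): score a response by how much the aligned weak model prefers it over the unaligned weak model.

```python
# Illustrative reduction of the weak-to-strong idea; not the WSPO objective.
import torch

def weak_alignment_reward(logp_weak_aligned: torch.Tensor,
                          logp_weak_base: torch.Tensor) -> torch.Tensor:
    """Both inputs: per-token log-probs of the same response, shape (T,).

    Returns the summed log-ratio
    sum_t [ log p_aligned(y_t | x, y_<t) - log p_base(y_t | x, y_<t) ],
    which is positive when the aligned weak model favors the response.
    """
    return (logp_weak_aligned - logp_weak_base).sum()

# Toy usage with made-up log-probabilities.
aligned = torch.tensor([-1.2, -0.8, -0.5])
base = torch.tensor([-1.5, -1.4, -1.0])
print(weak_alignment_reward(aligned, base))  # tensor(1.4000)
```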

IJCAI Conference 2024 Conference Paper

Bridging Stereo Geometry and BEV Representation with Reliable Mutual Interaction for Semantic Scene Completion

  • Bohan Li
  • Yasheng Sun
  • Zhujin Liang
  • Dalong Du
  • Zhuanghui Zhang
  • Xiaofeng Wang
  • Yunnan Wang
  • Xin Jin

3D semantic scene completion (SSC) is an ill-posed perception task that requires inferring a dense 3D scene from limited observations. Previous camera-based methods struggle to predict accurate semantic scenes due to inherent geometric ambiguity and incomplete observations. In this paper, we resort to stereo matching techniques and bird's-eye-view (BEV) representation learning to address such issues in SSC. Complementary to each other, stereo matching mitigates geometric ambiguity with the epipolar constraint, while BEV representation enhances the hallucination ability for invisible regions with global semantic context. However, due to the inherent representation gap between stereo geometry and BEV features, it is non-trivial to bridge them for the dense prediction task of SSC. Therefore, we further develop a unified occupancy-based framework dubbed BRGScene, which effectively bridges these two representations with dense 3D volumes for reliable semantic scene completion. Specifically, we design a novel Mutual Interactive Ensemble (MIE) block for pixel-level reliable aggregation of stereo geometry and BEV features. Within the MIE block, a Bi-directional Reliable Interaction (BRI) module, enhanced with confidence re-weighting, is employed to encourage fine-grained interaction through mutual guidance. Besides, a Dual Volume Ensemble (DVE) module is introduced to facilitate complementary aggregation through channel-wise recalibration and multi-group voting. Our method outperforms all published camera-based methods on SemanticKITTI for semantic scene completion. Our code is available at https://github.com/Arlo0o/StereoScene.

AAAI Conference 2023 Conference Paper

Crafting Monocular Cues and Velocity Guidance for Self-Supervised Multi-Frame Depth Learning

  • Xiaofeng Wang
  • Zheng Zhu
  • Guan Huang
  • Xu Chi
  • Yun Ye
  • Ziwei Chen
  • Xingang Wang

Self-supervised monocular methods can efficiently learn depth information of weakly textured surfaces or reflective objects. However, the depth accuracy is limited due to the inherent ambiguity in monocular geometric modeling. In contrast, multi-frame depth estimation methods improve depth accuracy thanks to the success of Multi-View Stereo (MVS), which directly makes use of geometric constraints. Unfortunately, MVS often suffers from texture-less regions, non-Lambertian surfaces, and moving objects, especially in real-world video sequences without known camera motion and depth supervision. Therefore, we propose MOVEDepth, which exploits the MOnocular cues and VElocity guidance to improve multi-frame Depth learning. Unlike existing methods that enforce consistency between MVS depth and monocular depth, MOVEDepth boosts multi-frame depth learning by directly addressing the inherent problems of MVS. The key to our approach is to utilize monocular depth as a geometric prior to construct the MVS cost volume, and to adjust the depth candidates of the cost volume under the guidance of predicted camera velocity. We further fuse monocular depth and MVS depth by learning uncertainty in the cost volume, which results in a robust depth estimation against ambiguity in multi-view geometry. Extensive experiments show MOVEDepth achieves state-of-the-art performance: compared with Monodepth2 and PackNet, our method relatively improves the depth accuracy by 20% and 19.8% on the KITTI benchmark. MOVEDepth also generalizes to the more challenging DDAD benchmark, relatively outperforming ManyDepth by 7.2%. The code is available at https://github.com/JeffWang987/MOVEDepth.
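The abstract describes velocity-guided depth candidates around a monocular prior without giving the exact rule. A rough sketch of that idea, with an assumed (not the paper's) scaling between velocity and search range:

```python
# Sketch of monocular-prior, velocity-guided depth candidates for an MVS cost
# volume; the scaling rule and defaults below are assumptions.
import numpy as np

def depth_candidates(mono_depth: np.ndarray, velocity: float,
                     n_candidates: int = 8, base_range: float = 0.4) -> np.ndarray:
    """mono_depth: (H, W) monocular depth prior; velocity: camera speed in m/s.

    Returns (n_candidates, H, W) depth hypotheses centred on the prior.
    Faster motion yields a larger inter-frame baseline, so the relative
    search range around the prior is narrowed.
    """
    rel_range = base_range / (1.0 + velocity)          # assumed scaling rule
    offsets = np.linspace(-rel_range, rel_range, n_candidates)
    return mono_depth[None] * (1.0 + offsets[:, None, None])

prior = np.full((4, 4), 10.0)                          # toy 4x4 prior at 10 m
print(depth_candidates(prior, velocity=5.0).shape)     # (8, 4, 4)
```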

ICLR Conference 2023 Conference Paper

LiftedCL: Lifting Contrastive Learning for Human-Centric Perception

  • Ziwei Chen
  • Qiang Li 0024
  • Xiaofeng Wang
  • Wankou Yang

Human-centric perception targets the understanding of human body pose, shape, and segmentation. Pre-training a model on large-scale datasets and fine-tuning it on specific tasks has become a well-established paradigm in human-centric perception. Recently, self-supervised learning methods have re-investigated contrastive learning to achieve superior performance on various downstream tasks. When handling human-centric perception, there still remains untapped potential, since 3D human structure information is neglected during task-agnostic pre-training. In this paper, we propose Lifting Contrastive Learning (LiftedCL) to obtain 3D-aware human-centric representations which absorb 3D human structure information. In particular, to induce the learning process, a set of 3D skeletons is randomly sampled by resorting to a 3D human kinematic prior. With this set of generic 3D samples, 3D human structure information can be learned into 3D-aware representations through adversarial learning. Empirical results demonstrate that LiftedCL outperforms state-of-the-art self-supervised methods on four human-centric downstream tasks, including 2D and 3D human pose estimation (0.4% mAP and 1.8 mm MPJPE improvement on COCO 2D pose estimation and Human3.6M 3D pose estimation), human shape recovery and human parsing.

IROS Conference 2021 Conference Paper

CLMM-Net: Robust Cascaded LiDAR Map Matching based on Multi-Level Intensity Map

  • Kai Chen 0028
  • Lei He
  • Xiaofeng Wang
  • Yuqian Liu
  • Ming Zhao

LiDAR map matching (LMM) is a critical localization technique in autonomous driving, yet existing methods suffer in both accuracy and robustness when driving in scenes with poor structural information (e.g., highways). This paper puts forward a multi-level intensity-map-based cascaded network for LiDAR map matching in autonomous driving. The network uses an effective multi-level intensity map representation to compactly encode the appearance and structure information of point clouds, which effectively reduces position ambiguity in structure-less scenarios. In addition, the method leverages the multi-scale nature of deep neural networks and matches the online LiDAR observation with the offline map in a coarse-to-fine manner so as to balance time consumption and precision. Extensive experiments on diverse autonomous driving environments demonstrate the superiority of our proposed method over other existing state-of-the-art methods.
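The abstract does not detail the map representation. As a toy sketch of a multi-level intensity map, the snippet below rasterizes point-cloud reflectance into BEV grids at several cell sizes; the cell sizes and mean-intensity aggregation are assumptions, not CLMM-Net's actual design.

```python
# Toy multi-resolution BEV intensity maps; parameters are assumptions.
import numpy as np

def multi_level_intensity_maps(points: np.ndarray, intensities: np.ndarray,
                               extent: float = 50.0, cell_sizes=(0.5, 1.0, 2.0)):
    """points: (N, 2) x/y coordinates in metres; intensities: (N,) reflectance."""
    maps = []
    for cell in cell_sizes:
        bins = int(2 * extent / cell)
        edges = np.linspace(-extent, extent, bins + 1)
        # Sum of intensities and point counts per BEV cell, then mean intensity.
        s, _, _ = np.histogram2d(points[:, 0], points[:, 1],
                                 bins=[edges, edges], weights=intensities)
        n, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=[edges, edges])
        maps.append(s / np.maximum(n, 1))
    return maps

pts = np.random.uniform(-50, 50, size=(1000, 2))
inten = np.random.uniform(0, 1, size=1000)
print([m.shape for m in multi_level_intensity_maps(pts, inten)])
# [(200, 200), (100, 100), (50, 50)]
```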

NeurIPS Conference 2003 Conference Paper

Learning Near-Pareto-Optimal Conventions in Polynomial Time

  • Xiaofeng Wang
  • Tuomas Sandholm

We study how to learn to play a Pareto-optimal strict Nash equilibrium when there exist multiple equilibria and agents may have different preferences among the equilibria. We focus on repeated coordination games of non-identical interest where agents do not know the game structure up front and receive noisy payoffs. We design efficient near-optimal algorithms for both the perfect monitoring and the imperfect monitoring setting (where the agents only observe their own payoffs and the joint actions).
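For readers unfamiliar with the setting, a two-action coordination game of non-identical interest makes the difficulty concrete: both pure equilibria below are strict and Pareto-optimal, but the two agents rank them differently. The payoff numbers are an illustrative choice, not from the paper.

```python
# Enumerate pure-strategy Nash equilibria of a small 2-player matrix game.
import numpy as np

# payoff[i][a1][a2] = payoff to agent i when agent 1 plays a1 and agent 2 plays a2
payoff = np.array([[[2, 0],    # agent 1
                    [0, 1]],
                   [[1, 0],    # agent 2
                    [0, 2]]])

def pure_nash(payoff: np.ndarray):
    eqs = []
    for a1 in range(payoff.shape[1]):
        for a2 in range(payoff.shape[2]):
            best1 = payoff[0, a1, a2] >= payoff[0, :, a2].max()   # a1 is a best response
            best2 = payoff[1, a1, a2] >= payoff[1, a1, :].max()   # a2 is a best response
            if best1 and best2:
                eqs.append((a1, a2))
    return eqs

print(pure_nash(payoff))   # [(0, 0), (1, 1)]: agent 1 prefers (0, 0), agent 2 prefers (1, 1)
```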

NeurIPS Conference 2002 Conference Paper

Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games

  • Xiaofeng Wang
  • Tuomas Sandholm

Multiagent learning is a key problem in AI. In the presence of multiple Nash equilibria, even agents with non-conflicting interests may not be able to learn an optimal coordination policy. The problem is exacerbated if the agents do not know the game and independently receive noisy payoffs. So, multiagent reinforcement learning involves two interrelated problems: identifying the game and learning to play. In this paper, we present optimal adaptive learning, the first algorithm that converges to an optimal Nash equilibrium with probability 1 in any team Markov game. We provide a convergence proof, and show that the algorithm's parameters are easy to set to meet the convergence conditions.