Arrow Research

Author name cluster

Xiaofeng Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers
2 author rows

Possible papers

16

EAAI Journal 2025 Journal Article

A search method for fractured-vuggy reservoir inter-well connectivity path based on multi-modal multi-agent

  • Wenbin Jiang
  • Dongmei Zhang
  • Hong Cao
  • Xiaofeng Wang

The complex geological structure of carbonate reservoirs and the intricate fracture-vuggy configurations obscure inter-well connectivity, making its evaluation challenging. Conventional studies primarily rely on seismic static data to delineate fracture-vuggy reservoirs, but the limited recognition accuracy hampers the precise characterization of inter-well connectivity and the spatial configuration of fractures and vugs. To address this, this study constructs a 3D (three-dimensional) search environment, draws on multi-modal static and dynamic data, and proposes a multi-agent connected channel search model based on deep reinforcement learning. The model treats multiphase fluid as an agent and incorporates the Swin Transformer (Shifted Window Transformer) to extract large-scale fracture features from seismic data, providing global prior information for path search. A Graph Attention Network is established based on dynamic response relationships to extract spatial geological features, while a multi-head self-attention mechanism captures real-time fluid interactions in various directions. The model fuses multi-modal features, including seismic attributes and production data, to generate decisions and automatically search for inter-well connectivity channels. Experiments were conducted using the WE1 and WE5 well groups from the fault-controlled karst reservoirs in the Tahe oilfield, with results compared against tracer tests. The findings demonstrate that the proposed model's automatic search paths closely align with seismic data and tracer test results, effectively capturing the spatial distribution of fractures and vugs across different scales. This validates the model's effectiveness in evaluating inter-well connectivity in complex carbonate reservoirs.
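
To make the fusion step concrete, here is a minimal PyTorch sketch (not the authors' implementation; all dimensions and module choices are assumptions) of a policy head that mixes seismic, graph-based geological, and per-direction fluid features via multi-head self-attention and scores candidate search directions:

    import torch
    import torch.nn as nn

    class FusionPolicy(nn.Module):
        """Toy multi-modal policy head: fuses seismic, graph, and fluid
        features with multi-head self-attention, then scores candidate
        search directions (e.g., 6 axis-aligned moves in a 3D grid)."""

        def __init__(self, dim=64, n_heads=4):
            super().__init__()
            self.proj_seis = nn.Linear(128, dim)   # seismic (Swin-style) features
            self.proj_graph = nn.Linear(32, dim)   # graph-attention geological features
            self.proj_fluid = nn.Linear(16, dim)   # per-direction fluid state
            self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
            self.head = nn.Linear(dim, 1)

        def forward(self, seis, graph, fluid):
            # seis: (B, 128), graph: (B, 32), fluid: (B, n_dirs, 16)
            tokens = torch.cat(
                [self.proj_seis(seis).unsqueeze(1),
                 self.proj_graph(graph).unsqueeze(1),
                 self.proj_fluid(fluid)], dim=1)          # (B, 2 + n_dirs, dim)
            fused, _ = self.attn(tokens, tokens, tokens)   # cross-modal interaction
            return self.head(fused[:, 2:]).squeeze(-1)     # one logit per direction

    policy = FusionPolicy()
    logits = policy(torch.randn(2, 128), torch.randn(2, 32), torch.randn(2, 6, 16))
    print(logits.shape)  # torch.Size([2, 6])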

EAAI Journal 2025 Journal Article

Cyclic translations between pathomics and genomics improve automatic cancer diagnosis from whole slide images

  • Xinyu Hao
  • Hongming Xu
  • Xiaofeng Wang
  • Tong Wang
  • Timo Hamalainen
  • Fengyu Cong

Incorporating genomic characterization into histopathological image modeling brings substantial value to enhancing diagnostic accuracy and supporting the development of targeted and effective treatment strategies. However, prevailing multi-modal integration methods often assume the availability of both pathomics and genomics data in both training and testing phases, overlooking the challenge of data absence due to prohibitive costs. In this paper, we propose a multi-modal cyclic feature generation network (MCFGN) that facilitates cyclic translations between pathomics and genomics to acquire a unified representation of multi-modal data. This approach enables the use of pathological images alone as input to generate joint representations during the testing phase. First, we utilize a general-purpose, self-supervised vision encoder to embed histological image patches as distinctive visual tokens. Next, we hierarchically aggregate patch-level tokens to region-level and slide-level, generating improved whole slide image (WSI) representations. We build self-supervised Masked Autoencoders (MAE) to initialize the hierarchical aggregator. Finally, to incorporate genomic characterization into the learning process, we develop a novel cross-modal cyclic feature generation module to create an intermediate joint representation of pathological and genetic features for patient diagnosis. Evaluations have been conducted on two public datasets from The Cancer Genome Atlas Breast Invasive Carcinoma (TCGA-BRCA) and Non-Small Cell Lung Cancer (TCGA-NSCLC) for various diagnostic tasks, including cancer subtyping and biomarker status prediction. Experiments indicate that our MCFGN model improves predictive performance in cancer diagnosis using histological slides, yielding an 8.7% improvement in area under the curve (AUC) for the cancer subtyping task and a 14.1% gain for biomarker prediction.
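
A minimal sketch of the cyclic-translation idea, assuming simple MLP translators and MSE losses (the paper's MCFGN module is more elaborate): each modality embedding is translated to the other and back, and both translation and cycle errors are penalized. At test time, only the pathomics-to-genomics direction is needed to form a joint representation from images alone.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Hypothetical translators between modality embeddings (dims are made up).
    p2g = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 64))  # pathomics -> genomics
    g2p = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 256))  # genomics -> pathomics

    def cyclic_loss(path_emb, gene_emb):
        """Translation + cycle-consistency objective: each modality should map
        to the other and back to itself."""
        gene_hat = p2g(path_emb)                  # pathomics -> predicted genomics
        path_hat = g2p(gene_emb)                  # genomics -> predicted pathomics
        trans = F.mse_loss(gene_hat, gene_emb) + F.mse_loss(path_hat, path_emb)
        cycle = (F.mse_loss(g2p(gene_hat), path_emb) +
                 F.mse_loss(p2g(path_hat), gene_emb))
        return trans + cycle

    loss = cyclic_loss(torch.randn(8, 256), torch.randn(8, 64))
    loss.backward()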

ICLR Conference 2025 Conference Paper

Do Large Language Models Truly Understand Geometric Structures?

  • Xiaofeng Wang
  • Yiming Wang 0011
  • Wenhong Zhu
  • Rui Wang 0015

Geometric ability is a significant challenge for large language models (LLMs) due to the need for advanced spatial comprehension and abstract thinking. Existing datasets primarily evaluate LLMs on their final answers, but final-answer accuracy cannot truly measure understanding of geometric structures, as LLMs can arrive at correct answers by coincidence. To fill this gap, we introduce the GeomRel dataset, designed to evaluate LLMs’ understanding of geometric structures by isolating the core step of geometric relationship identification in problem-solving. Using this benchmark, we conduct thorough evaluations of diverse LLMs and identify key limitations in understanding geometric structures. We further propose the Geometry Chain-of-Thought (GeoCoT) method, which enhances LLMs’ ability to identify geometric relationships, resulting in significant performance improvements.
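
The paper's exact prompting scheme is not reproduced here, but a GeoCoT-style prompt plausibly looks like the following hypothetical template, which forces the model to enumerate geometric relationships before deriving an answer:

    # Hypothetical GeoCoT-style prompt: ask the model to list geometric
    # relationships explicitly before committing to an answer.
    def geocot_prompt(problem: str) -> str:
        return (
            "Solve the geometry problem in two stages.\n"
            "Stage 1: List every geometric relationship you can identify "
            "(parallel, perpendicular, congruent, similar, tangent, ...), "
            "one per line, citing the entities involved.\n"
            "Stage 2: Using only the relationships listed above, derive the "
            "answer step by step.\n\n"
            f"Problem: {problem}"
        )

    print(geocot_prompt("In triangle ABC, AB = AC and angle A = 40 degrees. Find angle B."))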

AAAI Conference 2025 Conference Paper

DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation

  • Guosheng Zhao
  • Xiaofeng Wang
  • Zheng Zhu
  • Xinze Chen
  • Guan Huang
  • Xiaoyi Bao
  • Xingang Wang

World models have demonstrated superiority in autonomous driving, particularly in the generation of multi-view driving videos. However, significant challenges still exist in generating customized driving videos. In this paper, we propose DriveDreamer-2, which incorporates a Large Language Model (LLM) to facilitate the creation of user-defined driving videos. Specifically, a trajectory generation function library is developed to produce trajectories that conform to user descriptions. Subsequently, an HDMap generator is designed to learn the mapping from trajectories to road structures. Ultimately, we propose the Unified Multi-View Model (UniMVM) to enhance temporal and spatial coherence in the generated multi-view driving videos. To the best of our knowledge, DriveDreamer-2 is the first world model to generate customized driving videos, and it can generate uncommon driving videos (e.g., a vehicle abruptly cutting in) in a user-friendly manner. Besides, experimental results demonstrate that the generated videos enhance the training of driving perception methods (e.g., 3D detection and tracking). Furthermore, the video generation quality of DriveDreamer-2 surpasses that of other state-of-the-art methods, with FID and FVD scores of 11.2 and 55.7, representing relative improvements of ~30% and ~50%.
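
As a toy illustration of a "trajectory generation function library" (hypothetical names and parameters; not from the paper), each entry is a parametric function returning waypoints that an LLM could select and parameterize from a user description such as "a vehicle cuts in abruptly":

    import numpy as np

    # Hypothetical library entries: each returns an (N, 2) array of (x, y)
    # waypoints in ego coordinates.
    def straight(v=10.0, t=4.0, hz=10):
        ts = np.linspace(0, t, int(t * hz))
        return np.stack([v * ts, np.zeros_like(ts)], axis=1)

    def cut_in(v=10.0, lateral=3.5, t=4.0, hz=10):
        """Adjacent vehicle merging abruptly into the ego lane."""
        ts = np.linspace(0, t, int(t * hz))
        # Smoothstep lateral profile: starts in the next lane, ends in ego lane.
        s = 3 * (ts / t) ** 2 - 2 * (ts / t) ** 3
        return np.stack([v * ts, lateral * (1 - s)], axis=1)

    LIBRARY = {"straight": straight, "cut_in": cut_in}
    waypoints = LIBRARY["cut_in"](v=12.0)
    print(waypoints.shape)  # (40, 2)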

NeurIPS Conference 2025 Conference Paper

EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation

  • Xiaofeng Wang
  • Kang Zhao
  • Feng Liu
  • Jiayu Wang
  • Guosheng Zhao
  • Xiaoyi Bao
  • Zheng Zhu
  • Yingya Zhang

Video generation has emerged as a promising tool for world simulation, leveraging visual data to replicate real-world environments. Within this context, egocentric video generation, which centers on the human perspective, holds significant potential for enhancing applications in virtual reality, augmented reality, and gaming. However, the generation of egocentric videos presents substantial challenges due to the dynamic nature of first-person viewpoints, the intricate diversity of actions, and the complex variety of scenes encountered. Existing datasets are inadequate for addressing these challenges effectively. To bridge this gap, we present EgoVid-5M, the first high-quality dataset specifically curated for egocentric video generation. EgoVid-5M encompasses over 5 million egocentric video clips and is enriched with detailed action annotations, including fine-grained kinematic control and high-level textual descriptions. To ensure the integrity and usability of the dataset, we implement a sophisticated data cleansing pipeline designed to maintain frame consistency, action coherence, and motion smoothness under egocentric conditions. Furthermore, we introduce EgoDreamer, which is capable of generating egocentric videos driven simultaneously by action descriptions and kinematic control signals. The EgoVid-5M dataset, associated action annotations, and all data cleansing metadata will be released for the advancement of research in egocentric video generation.
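
A heavily simplified sketch of what such cleansing checks might look like, using OpenCV dense optical flow for motion smoothness and normalized correlation for frame consistency; the thresholds and criteria are illustrative assumptions, not the paper's pipeline:

    import cv2
    import numpy as np

    def clip_is_clean(frames, max_flow=25.0, min_sim=0.7):
        """Toy cleansing checks: reject clips with violent motion (flow too
        large) or broken continuity (adjacent frames too dissimilar). `frames`
        is a list of grayscale uint8 arrays; thresholds are illustrative."""
        for prev, nxt in zip(frames, frames[1:]):
            flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            if np.linalg.norm(flow, axis=-1).mean() > max_flow:
                return False  # motion not smooth enough for stable training
            sim = cv2.matchTemplate(prev, nxt, cv2.TM_CCOEFF_NORMED).max()
            if sim < min_sim:
                return False  # scene cut or corrupted frame
        return True

    frames = [np.random.randint(0, 255, (128, 128), np.uint8) for _ in range(8)]
    print(clip_is_clean(frames))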

JBHI Journal 2025 Journal Article

Lightweight 2D Medical Image Segmentation via a Decoder Using Linear Deformable Convolution and Multi-scale Self-attention

  • Le Zou
  • Xiangxu Bu
  • Fengling Jiang
  • Zhize Wu
  • Lingma Sun
  • Kia Dashtipour
  • Mandar Gogate
  • Amir Hussain

Medical image segmentation models typically demand substantial computational resources, which presents a significant challenge in resource-constrained environments, particularly in developing countries. Consequently, the development of decoding mechanisms that are both computationally efficient and lightweight is imperative. However, the performance of medical image segmentation is frequently limited by the simplicity of decoder designs. Balancing the optimization of decoder architectures with the reduction of computational demands while maintaining high model accuracy remains a formidable challenge. In this context, we introduce a novel decoder that integrates linear deformable convolution and multi-scale self-attention (LDMSD). The multi-scale self-attention enhancement module within LDMSD leverages two distinct multi-scale self-attention mechanisms, thereby substantially improving the representational capacity of the feature maps. Furthermore, the decoder incorporates a linear deformable convolution attention-guided mechanism to augment the feature maps derived from skip connections. This mechanism effectively mitigates the inherent limitations of conventional convolution and enhances the model's ability to capture complex semantic relationships within the feature maps. Through this collaborative mechanism, LDMSD is able to capture target information from both global and multi-scale perspectives and accurately locate target boundaries and structures while remaining lightweight. Experimental results demonstrate that LDMSD outperforms state-of-the-art decoders in terms of performance metrics, achieving a reduction in Floating Point Operations (FLOPs) by 77.36% and in parameter count by 81.66% when compared to the Cascaded Attention Decoder (CASCADE). To substantiate the efficacy of the proposed method, extensive experiments are conducted on six publicly available datasets. The results validate that the proposed method surpasses existing approaches in medical image segmentation tasks, both in terms of accuracy and computational efficiency.
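
A rough sketch of the two ingredients named in the abstract, using torchvision's DeformConv2d and a two-scale self-attention (the real LDMSD decoder differs in structure and detail):

    import torch
    import torch.nn as nn
    from torchvision.ops import DeformConv2d

    class DeformAttnBlock(nn.Module):
        """Toy decoder block: deformable convolution on the skip features,
        then self-attention over tokens from two pooled scales."""

        def __init__(self, ch=32, heads=4):
            super().__init__()
            self.offset = nn.Conv2d(ch, 2 * 3 * 3, 3, padding=1)  # offsets for a 3x3 kernel
            self.deform = DeformConv2d(ch, ch, 3, padding=1)
            self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)

        def forward(self, x):
            x = self.deform(x, self.offset(x))            # adaptive sampling on skip features
            coarse = nn.functional.avg_pool2d(x, 2)        # second, coarser scale
            b, c, h, w = x.shape
            tokens = torch.cat([x.flatten(2), coarse.flatten(2)], dim=2).transpose(1, 2)
            fused, _ = self.attn(tokens, tokens, tokens)   # multi-scale self-attention
            return fused[:, :h * w].transpose(1, 2).reshape(b, c, h, w)

    block = DeformAttnBlock()
    print(block(torch.randn(1, 32, 16, 16)).shape)  # torch.Size([1, 32, 16, 16])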

EAAI Journal 2025 Journal Article

Redundancy-aware masked graph autoencoder for overlapping community detection in attributed networks

  • Hongkai Xie
  • Xinyi Ying
  • Xiaofeng Wang
  • Yuanyuan Qi
  • Wei Chen
  • Xiaofeng Huang
  • Junzheng Jiang
  • Daying Quan

Overlapping community detection is a critical task in complex network analysis, especially for real-world graphs where nodes participate in multiple communities. Recent advances in graph neural networks have shown great promise in this task by integrating structural and attribute information. However, existing methods still face significant challenges, particularly due to the presence of noisy or redundant nodes, which can introduce ambiguity in message propagation and lead to suboptimal community assignments. To address these issues, we propose an unsupervised overlapping community detection framework based on a redundancy-aware masked graph autoencoder (RMGAE). Our model introduces a random edge-masking mechanism to simulate the structural uncertainty in real-world networks, encouraging the model to capture more informative patterns. Furthermore, we design a redundancy-aware dual-decoder architecture that independently reconstructs both masked and unmasked subgraphs, enabling the model to suppress redundant signals while preserving essential community structures. A unified reconstruction loss is employed to guide the embedding learning process, effectively balancing fidelity to observed data with generalization to noisy links. Extensive experiments on multiple real-world datasets demonstrate that RMGAE outperforms state-of-the-art baselines in both detection accuracy and computational efficiency.
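
The edge-masking and reconstruction core can be sketched in a few lines of PyTorch (the dual-decoder design and redundancy weighting are omitted; this is an assumption-laden toy):

    import torch

    def mask_edges(edge_index, mask_ratio=0.3):
        """Randomly split an edge list into visible and masked sets, as in a
        masked graph autoencoder; edge_index is a (2, E) LongTensor."""
        E = edge_index.size(1)
        perm = torch.randperm(E)
        k = int(E * mask_ratio)
        return edge_index[:, perm[k:]], edge_index[:, perm[:k]]  # visible, masked

    def inner_product_decode(z, edge_index):
        """Score candidate edges from node embeddings z of shape (N, d)."""
        src, dst = edge_index
        return torch.sigmoid((z[src] * z[dst]).sum(-1))

    edge_index = torch.randint(0, 100, (2, 500))
    visible, masked = mask_edges(edge_index)
    z = torch.randn(100, 16, requires_grad=True)   # embeddings from an encoder
    loss = -torch.log(inner_product_decode(z, masked) + 1e-8).mean()  # reconstruct masked edges
    loss.backward()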

EAAI Journal 2025 Journal Article

Regional coverage balance and efficient worker recruitment for self-organized mobile crowdsourcing

  • Ruiqing Liu
  • Yonghong Wang
  • Xiaofeng Wang

With the widespread adoption of smart devices, self-organized mobile crowdsourcing has become a popular method for decentralized data collection, where mobile users autonomously participate in task completion by leveraging their mobility and proximity to tasks. A key challenge in this context is achieving regional coverage balance, ensuring that tasks are equitably distributed across geographic areas to prevent both underserved and overserved regions. However, focusing solely on regional coverage without considering user satisfaction can negatively impact the quality of service. On the other hand, optimal worker recruitment in self-organized mobile crowdsourcing can maximize the expected overall quality of service. To address this, we develop a framework that balances regional coverage while enhancing user satisfaction by predicting user trajectories and identifying optimal service providers. Additionally, we tackle the worker recruitment problem by formulating a model that maximizes the expected quality of service. Our approach incorporates two collaborative deep learning networks: we first employ Proximal Policy Optimization (PPO) for matching candidates and training batches during the training phase, and then use Long Short-Term Memory (LSTM) to extract learning patterns of candidates, aiding PPO in making more effective recruitment decisions. We evaluate the performance of the proposed approach through extensive experiments using real-world data and by comparing it with existing strategies from previous research. Simulation results demonstrate that our method significantly improves both coverage balance and service quality in large-scale, decentralized mobile crowdsourcing environments.
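
A minimal sketch of the LSTM-assisted scoring idea (hypothetical features and shapes; the PPO training loop is omitted): an LSTM summarizes each candidate's history, and a linear head produces recruitment logits that a PPO policy could act on.

    import torch
    import torch.nn as nn

    class CandidateScorer(nn.Module):
        """Toy recruitment scorer: an LSTM summarizes each candidate's history
        (e.g., past task quality, distance, acceptance), and a head scores
        candidates; such scores would feed PPO training in the paper's setup."""

        def __init__(self, feat=4, hidden=32):
            super().__init__()
            self.lstm = nn.LSTM(feat, hidden, batch_first=True)
            self.head = nn.Linear(hidden, 1)

        def forward(self, history):
            # history: (num_candidates, T, feat)
            _, (h, _) = self.lstm(history)
            return self.head(h[-1]).squeeze(-1)  # one logit per candidate

    scorer = CandidateScorer()
    logits = scorer(torch.randn(10, 20, 4))
    probs = torch.softmax(logits, dim=0)  # recruitment distribution over candidates
    print(probs.shape)  # torch.Size([10])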

ICLR Conference 2025 Conference Paper

Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model

  • Wenhong Zhu
  • Zhiwei He 0002
  • Xiaofeng Wang
  • Pengfei Liu
  • Rui Wang 0015

Aligning language models (LMs) with human preferences has become a key area of research, enabling these models to better meet diverse user needs. Inspired by weak-to-strong generalization, where a strong LM fine-tuned on labels generated by a weaker model can consistently outperform its weak supervisor, we extend this idea to model alignment. In this work, we observe that the alignment behavior in weaker models can be effectively transferred to stronger models and even exhibit an amplification effect. Based on this insight, we propose a method called Weak-to-Strong Preference Optimization (WSPO), which achieves strong model alignment by learning the distribution differences before and after the alignment of the weak model. Experiments demonstrate that WSPO delivers outstanding performance, improving the win rate of Qwen2-7B-Instruct on Arena-Hard from 39.70 to 49.60, achieving a remarkable 47.04 length-controlled win rate on AlpacaEval 2, and scoring 7.33 on MT-bench. Our results suggest that using the weak model to elicit a strong model with a high alignment ability is feasible. The code is available at https://github.com/zwhong714/weak-to-strong-preference-optimization.
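
A toy rendering of the central quantity, under the assumption that the alignment signal is the log-probability gap between the aligned and unaligned weak models and is plugged into a DPO-style pairwise loss (consult the paper and linked code for the actual WSPO objective):

    import torch

    def weak_alignment_signal(logp_weak_aligned, logp_weak_base):
        """Per-response alignment signal: how much more likely the aligned weak
        model makes a response than its unaligned counterpart. Inputs are summed
        token log-probabilities of the same response under the two weak models."""
        return logp_weak_aligned - logp_weak_base

    # Toy preference pair: the chosen response should carry a larger weak-model
    # alignment gain than the rejected one; a DPO-style loss can enforce this.
    beta = 0.1
    r_chosen = weak_alignment_signal(torch.tensor(-12.3), torch.tensor(-15.0))
    r_rejected = weak_alignment_signal(torch.tensor(-14.1), torch.tensor(-13.5))
    loss = -torch.nn.functional.logsigmoid(beta * (r_chosen - r_rejected))
    print(loss.item())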

IJCAI Conference 2024 Conference Paper

Bridging Stereo Geometry and BEV Representation with Reliable Mutual Interaction for Semantic Scene Completion

  • Bohan Li
  • Yasheng Sun
  • Zhujin Liang
  • Dalong Du
  • Zhuanghui Zhang
  • Xiaofeng Wang
  • Yunnan Wang
  • Xin Jin

3D semantic scene completion (SSC) is an ill-posed perception task that requires inferring a dense 3D scene from limited observations. Previous camera-based methods struggle to predict accurate semantic scenes due to inherent geometric ambiguity and incomplete observations. In this paper, we resort to the stereo matching technique and bird’s-eye-view (BEV) representation learning to address such issues in SSC. Complementary to each other, stereo matching mitigates geometric ambiguity with epipolar constraint while BEV representation enhances the hallucination ability for invisible regions with global semantic context. However, due to the inherent representation gap between stereo geometry and BEV features, it is non-trivial to bridge them for the dense prediction task of SSC. Therefore, we further develop a unified occupancy-based framework dubbed BRGScene, which effectively bridges these two representations with dense 3D volumes for reliable semantic scene completion. Specifically, we design a novel Mutual Interactive Ensemble (MIE) block for pixel-level reliable aggregation of stereo geometry and BEV features. Within the MIE block, a Bi-directional Reliable Interaction (BRI) module, enhanced with confidence re-weighting, is employed to encourage fine-grained interaction through mutual guidance. Besides, a Dual Volume Ensemble (DVE) module is introduced to facilitate complementary aggregation through channel-wise recalibration and multi-group voting. Our method outperforms all published camera-based methods on SemanticKITTI for semantic scene completion. Our code is available on https://github.com/Arlo0o/StereoScene.
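
A stripped-down sketch of the mutual-guidance idea (confidence re-weighting between two dense 3D volumes); the actual BRI/DVE modules are richer than this:

    import torch
    import torch.nn as nn

    class MutualGuidance(nn.Module):
        """Toy bi-directional reliable interaction: each 3D feature volume is
        re-weighted by a confidence map predicted from the other, so stereo
        geometry and BEV features guide one another before fusion."""

        def __init__(self, ch=16):
            super().__init__()
            self.conf_stereo = nn.Sequential(nn.Conv3d(ch, 1, 1), nn.Sigmoid())
            self.conf_bev = nn.Sequential(nn.Conv3d(ch, 1, 1), nn.Sigmoid())

        def forward(self, stereo_vol, bev_vol):
            # stereo_vol, bev_vol: (B, C, D, H, W) dense 3D volumes
            stereo_out = stereo_vol * self.conf_bev(bev_vol)   # BEV guides stereo
            bev_out = bev_vol * self.conf_stereo(stereo_vol)   # stereo guides BEV
            return stereo_out + bev_out                        # simple ensemble

    m = MutualGuidance()
    fused = m(torch.randn(1, 16, 8, 32, 32), torch.randn(1, 16, 8, 32, 32))
    print(fused.shape)  # torch.Size([1, 16, 8, 32, 32])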

EAAI Journal 2024 Journal Article

Deep learning based multi-source heterogeneous information fusion framework for online monitoring of surface quality in milling process

  • Xiaofeng Wang
  • Jihong Yan

The multi-sensor configuration enables a comprehensive description of the machining processes and thus improves the capability of quality prediction models. However, the structural heterogeneity of various sensor data imposes barriers to information fusion as well as model construction. This study developed a novel multi-source heterogeneous information fusion framework based on deep learning for the prediction of milling quality, in which thermal imaging is employed for the first time. Specifically, the preprocessing module extracts multi-domain features from structured time series data, and the convolutional neural network based module is assigned to extract information from unstructured data. After that, the multilayer perceptron technique is employed to realize feature enhancement and fusion of cross-domain characteristics. Experimental validation was performed on a vertical machining center and comprehensive comparison experiments were conducted. The proposed approach achieves the best performance (minimum mean absolute percentage error 0.33%) and exhibits great robustness (standard deviation 0.17%). In addition, various time–frequency processing methods and convolutional neural network architectures are exploited for better configuration and prediction performance. The results reveal the great potential of thermal imaging for roughness prediction, and the excellent prediction performance of the proposed framework demonstrates its superiority and effectiveness in practice.
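
A compact sketch of such a fusion architecture under assumed dimensions: an MLP branch for hand-crafted time-series features, a small CNN branch for thermal images, and an MLP head fusing both for roughness regression (not the paper's exact network):

    import torch
    import torch.nn as nn

    class FusionRegressor(nn.Module):
        """Toy heterogeneous fusion: MLP for multi-domain sensor features,
        CNN for thermal images, MLP head regressing surface roughness."""

        def __init__(self, n_feat=24):
            super().__init__()
            self.ts = nn.Sequential(nn.Linear(n_feat, 64), nn.ReLU())
            self.img = nn.Sequential(
                nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.head = nn.Sequential(nn.Linear(64 + 16, 32), nn.ReLU(), nn.Linear(32, 1))

        def forward(self, feats, thermal):
            return self.head(torch.cat([self.ts(feats), self.img(thermal)], dim=1))

    model = FusionRegressor()
    ra = model(torch.randn(4, 24), torch.randn(4, 1, 64, 64))
    print(ra.shape)  # torch.Size([4, 1])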

AAAI Conference 2023 Conference Paper

Crafting Monocular Cues and Velocity Guidance for Self-Supervised Multi-Frame Depth Learning

  • Xiaofeng Wang
  • Zheng Zhu
  • Guan Huang
  • Xu Chi
  • Yun Ye
  • Ziwei Chen
  • Xingang Wang

Self-supervised monocular methods can efficiently learn depth information of weakly textured surfaces or reflective objects. However, the depth accuracy is limited due to the inherent ambiguity in monocular geometric modeling. In contrast, multi-frame depth estimation methods improve depth accuracy thanks to the success of Multi-View Stereo (MVS), which directly makes use of geometric constraints. Unfortunately, MVS often suffers from texture-less regions, non-Lambertian surfaces, and moving objects, especially in real-world video sequences without known camera motion and depth supervision. Therefore, we propose MOVEDepth, which exploits the MOnocular cues and VElocity guidance to improve multi-frame Depth learning. Unlike existing methods that enforce consistency between MVS depth and monocular depth, MOVEDepth boosts multi-frame depth learning by directly addressing the inherent problems of MVS. The key to our approach is to utilize monocular depth as a geometric prior to construct the MVS cost volume, and adjust depth candidates of the cost volume under the guidance of predicted camera velocity. We further fuse monocular depth and MVS depth by learning uncertainty in the cost volume, which yields depth estimation that is robust to ambiguity in multi-view geometry. Extensive experiments show MOVEDepth achieves state-of-the-art performance: Compared with Monodepth2 and PackNet, our method relatively improves the depth accuracy by 20% and 19.8% on the KITTI benchmark. MOVEDepth also generalizes to the more challenging DDAD benchmark, relatively outperforming ManyDepth by 7.2%. The code is available at https://github.com/JeffWang987/MOVEDepth.
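
The candidate-adjustment idea can be illustrated with a toy function (shapes and the widening rule are assumptions, not the released code): depth hypotheses are centered on the monocular prior, and the search range grows with predicted velocity.

    import torch

    def depth_candidates(mono_depth, velocity, n=8, base_span=0.2):
        """Toy candidate adjustment: center depth hypotheses on the monocular
        prior and widen the search range with ego velocity (faster motion ->
        larger uncertainty). mono_depth: (B, H, W), velocity: (B,)."""
        span = base_span * (1.0 + velocity.abs())             # (B,)
        offsets = torch.linspace(-1.0, 1.0, n)                # (n,)
        scale = 1.0 + span[:, None] * offsets[None, :]        # (B, n)
        return mono_depth[:, None] * scale[:, :, None, None]  # (B, n, H, W)

    cands = depth_candidates(torch.rand(2, 48, 160) * 50, torch.tensor([5.0, 0.5]))
    print(cands.shape)  # torch.Size([2, 8, 48, 160])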

ICLR Conference 2023 Conference Paper

LiftedCL: Lifting Contrastive Learning for Human-Centric Perception

  • Ziwei Chen
  • Qiang Li 0024
  • Xiaofeng Wang
  • Wankou Yang

Human-centric perception targets the understanding of human body pose, shape, and segmentation. Pre-training the model on large-scale datasets and fine-tuning it on specific tasks has become a well-established paradigm in human-centric perception. Recently, self-supervised learning methods have re-investigated contrastive learning to achieve superior performance on various downstream tasks. When handling human-centric perception, there still remains untapped potential since 3D human structure information is neglected during the task-agnostic pre-training. In this paper, we propose the Lifting Contrastive Learning (LiftedCL) to obtain 3D-aware human-centric representations which absorb 3D human structure information. In particular, to induce the learning process, a set of 3D skeletons is randomly sampled by resorting to 3D human kinematic prior. With this set of generic 3D samples, 3D human structure information can be learned into 3D-aware representations through adversarial learning. Empirical results demonstrate that LiftedCL outperforms state-of-the-art self-supervised methods on four human-centric downstream tasks, including 2D and 3D human pose estimation (0.4% mAP and 1.8 mm MPJPE improvement on COCO 2D pose estimation and Human3.6M 3D pose estimation), human shape recovery and human parsing.

IROS Conference 2021 Conference Paper

CLMM-Net: Robust Cascaded LiDAR Map Matching based on Multi-Level Intensity Map

  • Kai Chen 0028
  • Lei He
  • Xiaofeng Wang
  • Yuqian Liu
  • Ming Zhao

LiDAR map matching (LMM) is a critical localization technique in autonomous driving, yet existing methods suffer in both accuracy and robustness when driving in scenes with poor structural information (e.g., highways). This paper puts forward a multi-level intensity map based cascaded network for LiDAR map matching in autonomous driving. The network uses an effective multi-level intensity map representation to compactly encode the appearance and structure information of point clouds, which effectively reduces the position ambiguity in structure-less scenarios. Besides, this method leverages the multi-scale nature of deep neural networks and matches the online LiDAR observation with the offline map in a coarse-to-fine manner so as to balance runtime and precision. Extensive experiments on diverse autonomous driving environments demonstrate the superiority of our proposed method over other existing state-of-the-art methods.

NeurIPS Conference 2003 Conference Paper

Learning Near-Pareto-Optimal Conventions in Polynomial Time

  • Xiaofeng Wang
  • Tuomas Sandholm

We study how to learn to play a Pareto-optimal strict Nash equilibrium when there exist multiple equilibria and agents may have different preferences among the equilibria. We focus on repeated coordination games of non-identical interest where agents do not know the game structure up front and receive noisy payoffs. We design efficient near-optimal algorithms for both the perfect monitoring and the imperfect monitoring setting (where the agents only observe their own payoffs and the joint actions).
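
A toy numerical illustration of the coordination problem (not the paper's algorithm): two agents estimate a noisy team game from samples and rely on a shared deterministic tie-breaking rule to settle on the same optimal equilibrium, the essence of learning a common convention.

    import numpy as np

    # Toy team coordination game with two optimal equilibria and noisy payoffs:
    # both (0,0) and (1,1) pay 10, but agents must coordinate on the same one.
    TRUE_PAYOFF = np.array([[10.0, 0.0],
                            [0.0, 10.0]])
    rng = np.random.default_rng(0)

    est = np.zeros((2, 2))
    counts = np.zeros((2, 2))
    for t in range(2000):
        a = rng.integers(2, size=2)                   # explore joint actions
        r = TRUE_PAYOFF[a[0], a[1]] + rng.normal(0, 1.0)
        counts[a[0], a[1]] += 1
        est[a[0], a[1]] += (r - est[a[0], a[1]]) / counts[a[0], a[1]]

    # Shared deterministic tie-breaking (here, lexicographic preference among
    # joint actions) lets both agents pick the same near-optimal equilibrium
    # without communication.
    best = max(np.ndindex(2, 2), key=lambda ja: (est[ja], -ja[0], -ja[1]))
    print("joint action both agents converge to:", best)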

NeurIPS Conference 2002 Conference Paper

Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games

  • Xiaofeng Wang
  • Tuomas Sandholm

Multiagent learning is a key problem in AI. In the presence of multiple Nash equilibria, even agents with non-conflicting interests may not be able to learn an optimal coordination policy. The problem is exacerbated if the agents do not know the game and independently receive noisy payoffs. So, multiagent reinforcement learning involves two interrelated problems: identifying the game and learning to play. In this paper, we present optimal adaptive learning, the first algorithm that converges to an optimal Nash equilibrium with probability 1 in any team Markov game. We provide a convergence proof, and show that the algorithm’s parameters are easy to set to meet the convergence conditions.