Arrow Research search

Author name cluster

Bowen Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

11 papers
2 author rows

Possible papers

11

AAAI Conference 2026 Conference Paper

LaTeX2Layout: High-Fidelity, Scalable Document Layout Annotation Pipeline for Layout Detection

  • Feijiang Han
  • Zelong Wang
  • Bowen Wang
  • Xinxin Liu
  • Skyler Cheung
  • Delip Rao
  • Chris Callison-Burch
  • Lyle Ungar

General-purpose Vision-Language Models (VLMs) are increasingly integral to modern AI systems for document understanding, yet their ability to perform fine-grained layout analysis remains severely underdeveloped. Overcoming this limitation requires large-scale, high-fidelity training datasets. However, current annotation methods that rely on parsing rendered PDFs are costly, error-prone, and difficult to scale. We propose a different paradigm: extracting ground-truth layout directly from the LaTeX compilation process rather than the final PDF. We present LaTeX2Layout, a generalizable procedural pipeline that recovers pixel-accurate bounding boxes and reading order from compiler traces. This enables the generation of a 140K-page dataset, including 120K programmatically generated synthetic variants that more than double the layout diversity of real-world data. Using this dataset, we fine-tune an efficient 3B-parameter VLM with an easy-to-hard curriculum that accelerates convergence. Our model achieves Kendall's tau=0.95 for reading order and mAP@50=0.91 for element grounding, delivering nearly 200% relative improvement over strong zero-shot baselines such as GPT-4o and Claude-3.7.
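The abstract reports Kendall's tau = 0.95 for reading-order prediction. As a point of reference (not code from the paper), Kendall's tau between a predicted and a ground-truth reading order can be computed by counting concordant versus discordant element pairs; a minimal sketch:

```python
from itertools import combinations

def kendall_tau(pred_order, true_order):
    """Kendall's tau between two total orders over the same elements.

    pred_order / true_order: lists of the same items, each giving a
    reading order (first element = read first). Returns a value in
    [-1, 1]; 1 means identical orders, -1 means exactly reversed.
    """
    # Map each element to its rank (position) in both orders.
    pred_rank = {e: i for i, e in enumerate(pred_order)}
    true_rank = {e: i for i, e in enumerate(true_order)}
    concordant = discordant = 0
    for a, b in combinations(true_order, 2):
        # Pair is concordant when both orders rank a and b the same way.
        agree = (pred_rank[a] - pred_rank[b]) * (true_rank[a] - true_rank[b])
        if agree > 0:
            concordant += 1
        else:
            discordant += 1
    n = len(true_order)
    return (concordant - discordant) / (n * (n - 1) / 2)
```

With total orders there are no rank ties, so every pair is either concordant or discordant; a score of 0.95 means the vast majority of element pairs are ordered consistently with the ground truth.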

IROS Conference 2025 Conference Paper

GS-SDF: LiDAR-Augmented Gaussian Splatting and Neural SDF for Geometrically Consistent Rendering and Reconstruction

  • Jianheng Liu
  • Yunfei Wan
  • Bowen Wang
  • Chunran Zheng
  • Jiarong Lin
  • Fu Zhang 0002

Digital twins are fundamental to the development of autonomous driving and embodied artificial intelligence. However, achieving high-granularity surface reconstruction and high-fidelity rendering remains a challenge. Gaussian splatting offers efficient photorealistic rendering but struggles with geometric inconsistencies due to fragmented primitives and sparse observational data in robotics applications. Existing regularization methods, which rely on render-derived constraints, often fail in complex environments. Moreover, effectively integrating sparse LiDAR data with Gaussian splatting remains challenging. We propose a unified LiDAR-visual system that synergizes Gaussian splatting with a neural signed distance field. Accurate LiDAR point clouds enable the trained neural signed distance field to provide a manifold geometry field. Building on this, we introduce an SDF-based Gaussian initialization for physically grounded primitive placement and a comprehensive geometric regularization for geometrically consistent rendering and reconstruction. Experiments demonstrate superior reconstruction accuracy and rendering quality across diverse trajectories. To benefit the community, the code is released at https://github.com/hku-mars/GS-SDF.
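The SDF-based Gaussian initialization places primitives near the surface implied by the signed distance field. The following is a heavily simplified illustration of that idea, not the paper's method: keep candidate points whose SDF magnitude is below a threshold, i.e. points near the zero level set (the function name and threshold are hypothetical):

```python
import numpy as np

def sdf_init_centers(points, sdf_fn, eps=0.05):
    """Toy SDF-guided initialization: keep candidate points whose
    |SDF| < eps, i.e. points lying close to the zero level set
    (the reconstructed surface), as Gaussian centers.

    points: (N, 3) array of candidate positions.
    sdf_fn: callable mapping a 3D point to its signed distance.
    """
    vals = np.array([sdf_fn(p) for p in points])
    return points[np.abs(vals) < eps]
```

For example, with the analytic SDF of a unit sphere, `lambda p: np.linalg.norm(p) - 1.0`, only candidates close to the sphere's surface survive the filter; the paper's actual placement is more elaborate, but this captures why an SDF yields physically grounded initial positions.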

ICRA Conference 2025 Conference Paper

Neural Surface Reconstruction and Rendering for LiDAR-Visual Systems

  • Jianheng Liu
  • Chunran Zheng
  • Yunfei Wan
  • Bowen Wang
  • Yixi Cai
  • Fu Zhang 0002

This paper presents a unified surface reconstruction and rendering framework for LiDAR-visual systems, integrating Neural Radiance Fields (NeRF) and Neural Distance Fields (NDF) to recover both appearance and structural information from posed images and point clouds. We address the structural visibility gap between NeRF and NDF by utilizing a visibility-aware occupancy map to classify space into free, occupied, visible unknown, and background regions. This classification facilitates the recovery of a complete appearance and structure of the scene. We unify the training of the NDF and NeRF using a spatially varying scale SDF-to-density transformation, providing levels of detail for both structure and appearance. The proposed method leverages the learned NDF for structure-aware NeRF training via an adaptive sphere-tracing sampling strategy for accurate structure rendering. In return, NeRF further refines the structure by recovering missing or fuzzy regions in the NDF. Extensive experiments demonstrate the superior quality and versatility of the proposed method across various scenarios. To benefit the community, the code will be released at https://github.com/hku-mars/M2Mapping.
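An SDF-to-density transformation converts signed distance into the volume density used by NeRF-style rendering. One common choice (used, e.g., in NeuS; the paper's exact form and its spatially varying scale are not specified here) is the logistic density of the SDF, whose scale parameter controls how tightly density concentrates around the surface:

```python
import math

def sdf_to_density(sdf, s):
    """Logistic-density mapping from signed distance to volume density,
    a common SDF-to-density choice (e.g. NeuS). The scale s controls
    sharpness: larger s concentrates density more tightly around the
    zero level set (the surface), where the density peaks at s / 4.
    """
    e = math.exp(-s * sdf)
    return s * e / (1.0 + e) ** 2
```

Letting `s` vary with position, as the abstract's "spatially varying scale" suggests, would allow coarser or finer levels of detail in different parts of the scene; here `s` is kept as a plain scalar for clarity.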

NeurIPS Conference 2025 Conference Paper

OpenCUA: Open Foundations for Computer-Use Agents

  • Xinyuan Wang
  • Bowen Wang
  • Dunjie Lu
  • Junlin Yang
  • Tianbao Xie
  • Junli Wang
  • Jiaqi Deng
  • Xiaole Guo

Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) capable of automating diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. As these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to open CUA frameworks to study their capabilities, limitations, and risks. To bridge this gap, we propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models. Our framework consists of: (1) an annotation infrastructure that seamlessly captures human computer-use demonstrations; (2) AgentNet, the first large-scale computer-use task dataset spanning 3 operating systems and 200+ applications and websites; (3) a scalable pipeline that transforms demonstrations into state–action pairs with reflective long Chain-of-Thought reasoning that sustains robust performance gains as data scales. Our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, OpenCUA-72B achieves an average success rate of 45.0% on OSWorld‑Verified, establishing a new state-of-the-art (SOTA) among open-source models. Further analysis confirms that our approach generalizes well across domains and benefits significantly from increased test-time computation. We release our annotation tool, datasets, code, and models to build open foundations for further CUA research.

AAAI Conference 2025 Conference Paper

Putting People in LLMs’ Shoes: Generating Better Answers via Question Rewriter

  • Junhao Chen
  • Bowen Wang
  • Zhouqiang Jiang
  • Yuta Nakashima

Large Language Models (LLMs) have demonstrated significant capabilities, particularly in the domain of question answering (QA). However, their effectiveness in QA is often undermined by the vagueness of user questions. To address this issue, we introduce single-round instance-level prompt optimization, referred to as the question rewriter. By enhancing the intelligibility of human questions for black-box LLMs, our question rewriter improves the quality of generated answers. The rewriter is optimized using direct preference optimization based on feedback collected from automatic criteria for evaluating generated answers; therefore, its training does not require costly human annotations. Experiments across multiple black-box LLMs and long-form question answering (LFQA) datasets demonstrate the efficacy of our method. This paper provides a practical framework for training question rewriters and sets a precedent for future explorations in prompt optimization within LFQA tasks.

IROS Conference 2025 Conference Paper

U-Snake: A Small-Sized Smart Underwater Snake Robot

  • Bowen Wang
  • Haobo Zuo
  • Changhong Fu 0001

With the rapid development of AI chips, underwater snake robots hold significant promise for navigating complex underwater environments, offering unique advantages in exploration, monitoring, and inspection tasks due to their flexible body and high mobility. However, existing underwater snake robots predominantly employ bulky mechanical configurations with expensive manufacturing costs, resulting in excessive power consumption and limited operational endurance with standard batteries, which impede their widespread adoption and limit their operational flexibility. Moreover, most path following methods used in underwater snake robots inadequately account for the dynamic changes in path curvature, leading to serious tracking errors in scenarios involving sharp turns or complex environments and failing to meet the demands of more intricate trajectories. To address the above issues, this work introduces the U-Snake, a small-sized smart underwater snake robot with a simple lightweight structure and a highly maneuverable controller, adapted to various complicated path following tasks. In particular, each joint of U-Snake is designed to be small and lightweight to achieve higher spatial utilization, and is covered by a convenient and efficient 3D-printed waterproof casing to achieve robust water resistance. In addition, a path following method based on the curvature of the path is designed to achieve high performance on various complicated trajectories. Furthermore, an integrated controller combines this method with the kinematics and dynamics models of U-Snake, enabling precise following of straight and curved paths. The experimental results demonstrate that the proposed control structure effectively guides U-Snake to follow the desired path.
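A curvature-aware path follower needs the local curvature of the reference path. One standard way to estimate it from a discretized path (a generic technique, not the paper's specific controller) is the three-point circumscribed-circle formula, where curvature is the reciprocal of the circumradius:

```python
import math

def curvature(p0, p1, p2):
    """Discrete curvature at p1 from three consecutive 2D path points:
    kappa = 4 * triangle_area / (product of side lengths) = 1 / circumradius.
    Returns 0 for degenerate (collinear or coincident) points.
    """
    ax, ay = p0
    bx, by = p1
    cx, cy = p2
    # Cross product magnitude = twice the triangle's area.
    area2 = abs((bx - ax) * (cy - ay) - (by - ay) * (cx - ax))
    a = math.dist(p1, p2)
    b = math.dist(p0, p2)
    c = math.dist(p0, p1)
    if a * b * c == 0:
        return 0.0
    return 2 * area2 / (a * b * c)
```

Points sampled from a circle of radius R yield curvature 1/R, while collinear points yield 0; a controller can use such estimates to slow down or tighten its tracking gains ahead of sharp turns.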

NeurIPS Conference 2024 Conference Paper

DiReCT: Diagnostic Reasoning for Clinical Notes via Large Language Models

  • Bowen Wang
  • Jiuyang Chang
  • Yiming Qian
  • Guoxin Chen
  • Junhao Chen
  • Zhouqiang Jiang
  • Jiahao Zhang
  • Yuta Nakashima

Large language models (LLMs) have recently showcased remarkable capabilities, spanning a wide range of tasks and applications, including those in the medical domain. Models like GPT-4 excel in medical question answering but may lack interpretability when handling complex tasks in real clinical settings. We thus introduce the diagnostic reasoning dataset for clinical notes (DiReCT), aiming at evaluating the reasoning ability and interpretability of LLMs compared to human doctors. It contains 511 clinical notes, each meticulously annotated by physicians, detailing the diagnostic reasoning process from observations in a clinical note to the final diagnosis. Additionally, a diagnostic knowledge graph is provided to offer essential knowledge for reasoning, which may not be covered in the training data of existing LLMs. Evaluations of leading LLMs on DiReCT bring out a significant gap between their reasoning ability and that of human doctors, highlighting the critical need for models that can reason effectively in real-world clinical scenarios.

AAAI Conference 2024 Conference Paper

DR-Label: Label Deconstruction and Reconstruction of GNN Models for Catalysis Systems

  • Bowen Wang
  • Chen Liang
  • Jiaze Wang
  • Jiezhong Qiu
  • Furui Liu
  • Shaogang Hao
  • Dong Li
  • Guangyong Chen

Attaining the equilibrium geometry of a catalyst-adsorbate system is key to fundamentally assessing its effective properties, such as adsorption energy. While machine learning methods with advanced representation or supervision strategies have been applied to boost and guide the relaxation processes of catalysis systems, existing methods that produce linearly aggregated geometry predictions are susceptible to edge-representation ambiguity, and are therefore vulnerable to graph variations. In this paper, we present a novel graph neural network (GNN) supervision and prediction strategy, DR-Label. Our approach mitigates the multiplicity of solutions in edge representation and encourages model predictions that are independent of graph structural variations. DR-Label first Deconstructs finer-grained equilibrium-state information for the model by projecting the node-level supervision signal onto each edge. In reverse, the model Reconstructs a more robust equilibrium-state prediction by converting edge-level predictions to the node level via a sphere-fitting algorithm. When applied to three fundamentally different models, DR-Label consistently enhanced performance. Leveraging the graph-structure invariance of the DR-Label strategy, we further propose DRFormer, which applies explicit intermediate positional updates and achieves a new state-of-the-art performance on the Open Catalyst 2020 (OC20) dataset and the Cu-based single-atom alloys CO adsorption (SAA) dataset. We expect our work to highlight vital principles for advancing geometric GNN models for catalysis systems and beyond. Our code is available at https://github.com/bowenwang77/DR-Label.
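The Reconstruction step converts per-edge predictions back to a node position via sphere fitting. The abstract does not give the algorithm, but the classic algebraic least-squares sphere fit illustrates the underlying operation: expanding |p - c|^2 = r^2 turns the fit into a linear system in the center c and the auxiliary unknown r^2 - |c|^2:

```python
import numpy as np

def fit_sphere(points):
    """Algebraic least-squares sphere fit. Returns (center, radius).

    |p - c|^2 = r^2 rearranges to 2 p . c + (r^2 - |c|^2) = |p|^2,
    which is linear in c and d := r^2 - |c|^2, so it can be solved
    directly with least squares (needs >= 4 non-degenerate points).
    """
    P = np.asarray(points, dtype=float)
    A = np.hstack([2 * P, np.ones((len(P), 1))])  # columns: 2x, 2y, 2z, 1
    b = (P ** 2).sum(axis=1)                      # |p|^2 for each point
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    center = x[:3]
    radius = np.sqrt(x[3] + center @ center)      # r = sqrt(d + |c|^2)
    return center, radius
```

In the DR-Label setting, one can picture each edge prediction constraining the distance from a neighbor to the node's equilibrium position, and the fitted sphere center aggregating those constraints into a single robust node-level estimate.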

IJCAI Conference 2023 Conference Paper

A Refined Upper Bound and Inprocessing for the Maximum K-plex Problem

  • Hua Jiang
  • Fusheng Xu
  • Zhifei Zheng
  • Bowen Wang
  • Wei Zhou

A k-plex of a graph G is an induced subgraph in which every vertex has at most k-1 nonadjacent vertices. The Maximum k-plex Problem (MKP) consists of finding a k-plex of the largest size, a problem that is NP-hard and has many applications. Existing exact algorithms mainly implement a branch-and-bound (BnB) approach and improve performance by integrating effective upper bounds and graph reduction rules. In this paper, we propose a refined upper bound, which can derive a tighter upper bound than existing methods, and an inprocessing strategy, which performs graph reduction incrementally. We implement a new BnB algorithm for MKP that employs the two components to reduce the search space. Extensive experiments show that both the refined upper bound and the inprocessing strategy are very efficient in the reduction of search space. The new algorithm outperforms the state-of-the-art algorithms on the tested benchmarks significantly.
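The k-plex definition itself is easy to state in code. A minimal verifier (for illustration only; the paper is about solving the much harder maximization problem): a vertex set is a k-plex if every member has at most k-1 non-neighbours inside the set.

```python
def is_k_plex(adj, subset, k):
    """Check whether `subset` induces a k-plex in the graph given by
    the adjacency dict `adj` (vertex -> set of neighbours): every
    vertex in the subset may have at most k-1 non-neighbours inside
    the subset (itself excluded).
    """
    s = set(subset)
    for v in s:
        non_neighbours = sum(1 for u in s if u != v and u not in adj[v])
        if non_neighbours > k - 1:
            return False
    return True
```

Note that a 1-plex is exactly a clique (zero non-neighbours allowed), so MKP generalizes the maximum clique problem; this is why BnB machinery with tight upper bounds, as in this paper, is the natural exact approach.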

NeurIPS Conference 2023 Conference Paper

CARE-MI: Chinese Benchmark for Misinformation Evaluation in Maternity and Infant Care

  • Tong Xiang
  • Liangzhi Li
  • Wangyue Li
  • Mingbai Bai
  • Lu Wei
  • Bowen Wang
  • Noa Garcia

The recent advances in natural language processing (NLP) have led to a new trend of applying large language models (LLMs) to real-world scenarios. While the latest LLMs are astonishingly fluent when interacting with humans, they suffer from the misinformation problem by unintentionally generating factually false statements. This can lead to harmful consequences, especially when produced within sensitive contexts, such as healthcare. Yet few previous works have focused on evaluating misinformation in the long-form (LF) generation of LLMs, especially for knowledge-intensive topics. Moreover, although LLMs have been shown to perform well in different languages, misinformation evaluation has been mostly conducted in English. To this end, we present a benchmark, CARE-MI, for evaluating LLM misinformation in: 1) a sensitive topic, specifically the maternity and infant care domain; and 2) a language other than English, namely Chinese. Most importantly, we provide an innovative paradigm for building LF generation evaluation benchmarks that can be transferred to other knowledge-intensive domains and low-resourced languages. Our proposed benchmark fills the gap between the extensive usage of LLMs and the lack of datasets for assessing the misinformation generated by these models. It contains 1,612 expert-checked questions, accompanied by human-selected references. Using our benchmark, we conduct extensive experiments and find that current Chinese LLMs are far from perfect on the topic of maternity and infant care. In an effort to minimize the reliance on human resources for performance evaluation, we offer off-the-shelf judgment models for automatically assessing the LF output of LLMs given benchmark questions. Moreover, we compare potential solutions for LF generation evaluation and provide insights for building better automated metrics.