Arrow Research

Author name cluster

Weiming Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
2 author rows

Possible papers (14)

AAAI Conference 2026 Conference Paper

DomainCQA: Crafting Knowledge-Intensive QA from Domain-Specific Charts

  • Yujing Lu
  • Ling Zhong
  • Jing Yang
  • Weiming Li
  • Peng Wei
  • Yongheng Wang
  • Manni Duan
  • Qing Zhang

Chart Question Answering (CQA) evaluates Multimodal Large Language Models (MLLMs) on visual understanding and reasoning over chart data. However, existing benchmarks mostly test surface-level parsing, such as reading labels and legends, while overlooking deeper scientific reasoning. We propose DomainCQA, a framework for constructing domain-specific CQA benchmarks that emphasize both visual comprehension and knowledge-intensive reasoning. It integrates complexity-aware chart selection, multitier QA generation, and expert validation. Applied to astronomy, DomainCQA yields AstroChart, a benchmark of 1,690 QA pairs over 482 charts, exposing persistent weaknesses in fine-grained perception, numerical reasoning, and domain knowledge integration across 21 MLLMs. Fine-tuning on AstroChart improves performance across fundamental and advanced tasks. Pilot QA sets in biochemistry, economics, medicine, and social science further demonstrate DomainCQA’s generality. Together, our results establish DomainCQA as a unified pipeline for constructing and augmenting domain-specific chart reasoning benchmarks.
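
To make the pipeline concrete, here is a minimal skeleton of a DomainCQA-style construction loop, assuming hypothetical helper functions for QA generation and expert validation; the complexity heuristic and tier names are illustrative, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Chart:
    chart_id: str
    n_series: int       # number of plotted data series
    n_annotations: int  # axis labels, legends, text marks

def complexity_score(chart: Chart) -> float:
    # Complexity-aware selection: prefer charts dense enough to support
    # knowledge-intensive questions (a heuristic, for illustration only).
    return chart.n_series + 0.5 * chart.n_annotations

def build_benchmark(charts, generate_qa, validate, min_score=3.0):
    """generate_qa(chart, tier) drafts QA pairs (e.g. with an MLLM);
    validate(qa) is the expert-validation step. Both are hypothetical."""
    selected = [c for c in charts if complexity_score(c) >= min_score]
    qa_pairs = []
    for chart in selected:
        for tier in ("fundamental", "advanced"):  # multi-tier generation
            qa_pairs += [qa for qa in generate_qa(chart, tier) if validate(qa)]
    return qa_pairs
```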

AAAI Conference 2025 Conference Paper

OAMaskFlow: Occlusion-Aware Motion Mask for Scene Flow

  • Xiongfeng Peng
  • Zhihua Liu
  • Weiming Li
  • Yamin Mao
  • Qiang Wang

Scene flow estimation methods have made significant progress by estimating pixel-wise 3D motion while implicitly learning a motion embedding within an end-to-end differentiable optimization framework. However, the implicitly learned motion embedding is insufficient for grouping pixels into rigid objects in challenging regions, such as occlusions and regions with inconsistent multi-view geometric properties. To address this issue, we propose a novel scene flow estimation method called OAMaskFlow with three novelties. First, we introduce the concept of an occlusion-aware motion (OAM) mask and generate its ground-truth annotation through photometric and geometric consistency. Second, we supervise the motion embedding with the OAM mask to learn an informative and reliable motion representation of the scene. Finally, a 3D motion propagation module propagates high-quality 3D motion from reliable pixels to the challenging occluded regions. Experiments show that OAMaskFlow reduces the EPE3D metric by 21.0% on the FlyingThings3D dataset and the SF-all metric by 24.3% on the KITTI scene flow benchmark compared with the baseline method RAFT-3D. Furthermore, we apply the OAM mask in simultaneous localization and mapping (SLAM) to improve the state-of-the-art method DROID-SLAM, reducing the ATE metric by 65.7% and 58.3% on the TartanAir monocular and stereo datasets, respectively.
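
As a rough illustration of how such a mask can be derived, the sketch below builds an occlusion-aware mask from photometric error and forward-backward flow consistency; the thresholds and nearest-neighbor warping are assumptions for brevity, not the paper's exact annotation procedure.

```python
import numpy as np

def warp(img, flow):
    """Sample img at positions displaced by flow (nearest-neighbor lookup)."""
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    x2 = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    y2 = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return img[y2, x2]

def oam_mask(img1, img2, flow_fw, flow_bw, photo_thresh=0.1, geo_thresh=1.0):
    # Photometric consistency: pixels whose warped appearance still matches.
    photo_err = np.abs(img1 - warp(img2, flow_fw)).mean(axis=-1)
    # Geometric consistency: forward flow followed by backward flow should
    # return (near) zero everywhere except in occluded regions.
    fb_err = np.linalg.norm(flow_fw + warp(flow_bw, flow_fw), axis=-1)
    # Reliable (non-occluded, consistent) pixels receive mask value 1.
    return ((photo_err < photo_thresh) & (fb_err < geo_thresh)).astype(np.float32)
```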

AAAI Conference 2024 Conference Paper

DOCTR: Disentangled Object-Centric Transformer for Point Scene Understanding

  • Xiaoxuan Yu
  • Hao Wang
  • Weiming Li
  • Qiang Wang
  • Soonyong Cho
  • Younghun Sung

Point scene understanding is a challenging task that processes a real-world scene point cloud to simultaneously segment each object, estimate its pose, and reconstruct its mesh. Recent state-of-the-art methods first segment each object and then process the objects independently through multiple stages for the different sub-tasks. This leads to a complex pipeline that is hard to optimize and makes it difficult to leverage relationship constraints between multiple objects. In this work, we propose a novel Disentangled Object-Centric TRansformer (DOCTR) that explores an object-centric representation to facilitate learning across multiple objects and multiple sub-tasks in a unified manner. Each object is represented as a query, and a Transformer decoder iteratively optimizes all queries while modeling their relationships. In particular, we introduce a semantic-geometry disentangled query (SGDQ) design that enables the query features to attend separately to the semantic and geometric information relevant to the corresponding sub-tasks. A hybrid bipartite matching module is employed to make full use of the supervision from all sub-tasks during training. Qualitative and quantitative experimental results demonstrate that our method achieves state-of-the-art performance on the challenging ScanNet dataset. Code is available at https://github.com/SAITPublic/DOCTR.
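
The sketch below illustrates one way an SGDQ-style decoder layer could split each object query into semantic and geometric halves that attend to separate feature streams; the dimensions, head counts, and two-stream split are expository assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class SGDQLayer(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        half = dim // 2
        self.sem_attn = nn.MultiheadAttention(half, 4, batch_first=True)
        self.geo_attn = nn.MultiheadAttention(half, 4, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 2), nn.ReLU(),
                                 nn.Linear(dim * 2, dim))

    def forward(self, queries, sem_feats, geo_feats):
        # queries: (B, num_objects, dim); each feature stream: (B, N, dim//2)
        q_sem, q_geo = queries.chunk(2, dim=-1)
        # Disentangled attention: the semantic half looks only at semantic
        # features, the geometric half only at geometric features.
        q_sem, _ = self.sem_attn(q_sem, sem_feats, sem_feats)
        q_geo, _ = self.geo_attn(q_geo, geo_feats, geo_feats)
        fused = torch.cat([q_sem, q_geo], dim=-1)
        return queries + self.ffn(fused)  # residual update of object queries
```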

ICRA Conference 2024 Conference Paper

DVI-SLAM: A Dual Visual Inertial SLAM Network

  • Xiongfeng Peng
  • Zhihua Liu
  • Weiming Li
  • Ping Tan
  • SoonYong Cho
  • Qiang Wang

Recent deep learning based visual simultaneous localization and mapping (SLAM) methods have made significant progress. However, how to make full use of visual information and better integrate it with an inertial measurement unit (IMU) in visual SLAM remains of research value. This paper proposes a novel deep SLAM network with dual visual factors. The basic idea is to integrate both a photometric factor and a re-projection factor into an end-to-end differentiable structure through a multi-factor data association module. We show that the proposed network dynamically learns and adjusts the confidence maps of both visual factors and that it can be further extended to include IMU factors as well. Extensive experiments validate that our method significantly outperforms state-of-the-art methods on several public datasets, including TartanAir, EuRoC, and ETH3D-SLAM. In particular, when the three factors are fused dynamically, the absolute trajectory error on the EuRoC dataset is reduced by 45.3% and 36.2% for the monocular and stereo configurations, respectively.
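
As a toy illustration of the dual-factor idea, the snippet below blends photometric and re-projection residuals with per-pixel learned confidence weights; the sigmoid gating and tensor shapes are assumptions, not the paper's data association module.

```python
import torch

def fused_residual(photo_res, reproj_res, conf_photo_logits, conf_reproj_logits):
    # Each residual: (B, H, W). Confidence logits are predicted by the
    # network and squashed to (0, 1); a low confidence down-weights one
    # factor, letting the other factor (or an IMU term) dominate there.
    w_photo = torch.sigmoid(conf_photo_logits)
    w_reproj = torch.sigmoid(conf_reproj_logits)
    return w_photo * photo_res + w_reproj * reproj_res
```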

NeurIPS Conference 2024 Conference Paper

Is Your HD Map Constructor Reliable under Sensor Corruptions?

  • Xiaoshuai Hao
  • Mengchuan Wei
  • Yifan Yang
  • Haimei Zhao
  • Hui Zhang
  • Yi Zhou
  • Qiang Wang
  • Weiming Li

Driving systems often rely on high-definition (HD) maps for precise environmental information, which is crucial for planning and navigation. While current HD map constructors perform well under ideal conditions, their resilience to real-world challenges, e.g., adverse weather and sensor failures, is not well understood, raising safety concerns. This work introduces MapBench, the first comprehensive benchmark designed to evaluate the robustness of HD map construction methods against various sensor corruptions. Our benchmark encompasses a total of 29 types of corruptions that can affect camera and LiDAR sensors. Extensive evaluations across 31 HD map constructors reveal significant performance degradation of existing methods under adverse weather conditions and sensor failures, underscoring critical safety concerns. We identify effective strategies for enhancing robustness, including innovative approaches that leverage multi-modal fusion, advanced data augmentation, and architectural techniques. These insights provide a pathway for developing more reliable HD map construction methods, which are essential for the advancement of autonomous driving technology. The benchmark toolkit and affiliated code and model checkpoints have been made publicly accessible.
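
A robustness sweep of this kind can be organized as below; the corruption callables and the metric function are placeholders for exposition, not the benchmark toolkit's actual API.

```python
def robustness_sweep(model, dataset, corruptions, metric):
    """corruptions: dict mapping a name (e.g. 'fog', 'motion_blur') to a
    callable that perturbs one sample; metric scores a model on data."""
    clean = metric(model, dataset)                  # score on clean data
    report = {}
    for name, corrupt in corruptions.items():
        corrupted = [corrupt(sample) for sample in dataset]
        score = metric(model, corrupted)
        report[name] = {"score": score, "drop": clean - score}
    return clean, report
```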

IROS Conference 2022 Conference Paper

Attention-guided RGB-D Fusion Network for Category-level 6D Object Pose Estimation

  • Hao Wang 0144
  • Weiming Li
  • Jiyeon Kim
  • Qiang Wang 0023

This work focuses on estimating the 6D poses and sizes of category-level objects from a single RGB-D image. How to exploit the complementary RGB and depth features plays an important role in this task yet remains an open question. Due to large intra-category texture and shape variations, an object instance at test time may have RGB and depth features different from those of the instances seen in training, which poses challenges to previous RGB-D fusion methods. To deal with this problem, an Attention-guided RGB-D Fusion Network (ARF-Net) is proposed in this work. Our key design is an ARF module that learns to adaptively fuse RGB and depth features under guidance from both structure-aware attention and relation-aware attention. Specifically, the structure-aware attention captures spatial relationships among object parts, and the relation-aware attention captures the RGB-to-depth correlations between appearance and geometric features. ARF-Net directly establishes canonical correspondences with a compact decoder based on the multi-modal features from the ARF module. Extensive experiments show that our method can effectively fuse RGB features into various popular point cloud encoders with consistent performance improvements. In particular, without reconstructing instance 3D models, our method with its relatively compact architecture outperforms all state-of-the-art models on the CAMERA25 and REAL275 benchmarks by a large margin.
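
The sketch below shows a single relation-aware cross-attention step in which point features gather appearance cues from RGB features; this one layer stands in for the full ARF module, and its structure here is an assumption rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class RelationAwareFusion(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, point_feats, rgb_feats):
        # point_feats: (B, N_points, dim), rgb_feats: (B, N_pixels, dim)
        # RGB-to-depth correlation: each point attends to the pixels whose
        # appearance features are most related to its geometry.
        fused, _ = self.cross_attn(point_feats, rgb_feats, rgb_feats)
        return self.norm(point_feats + fused)  # residual keeps geometry intact
```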

ICML Conference 2021 Conference Paper

Learning Generalized Intersection Over Union for Dense Pixelwise Prediction

  • Jiaqian Yu
  • Jingtao Xu
  • Yiwei Chen
  • Weiming Li
  • Qiang Wang 0023
  • ByungIn Yoo
  • Jae-Joon Han

The intersection over union (IoU) score, also named the Jaccard index, is one of the most fundamental evaluation measures in machine learning. The original IoU computation does not provide non-zero gradients and thus cannot be directly optimized by today's deep learning methods. Several recent works generalized IoU for bounding box regression, but they are not straightforward to adapt to pixelwise prediction. In particular, the original IoU fails to provide effective gradients for the non-overlapping and location-deviation cases, which results in a performance plateau. In this paper, we propose PixIoU, a generalized IoU for pixelwise prediction that is sensitive to the distance in non-overlapping cases and to the locations in the prediction. We prove that PixIoU holds many of the nice properties of the original IoU. To optimize PixIoU, we also propose a loss function that is proved to be submodular, so we can apply Lovász functions, the efficient surrogates for submodular functions, to learn this loss. Experimental results show consistent performance improvements from learning PixIoU over the original IoU on several pixelwise prediction tasks on Pascal VOC, VOT-2020, and Cityscapes.
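
To see why the original IoU is hard to optimize directly, compare it with the common soft-IoU toy loss below, which replaces hard set operations with probabilities and therefore has non-zero gradients; note that this toy version lacks PixIoU's distance and location sensitivity and is not the paper's loss.

```python
import torch

def soft_iou_loss(probs, target, eps=1e-6):
    # probs: (B, H, W) predicted foreground probabilities in [0, 1]
    # target: (B, H, W) binary ground-truth mask
    inter = (probs * target).sum(dim=(1, 2))
    union = (probs + target - probs * target).sum(dim=(1, 2))
    return 1.0 - ((inter + eps) / (union + eps)).mean()
```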

AAAI Conference 2020 Conference Paper

Synthetic Depth Transfer for Monocular 3D Object Pose Estimation in the Wild

  • Yueying Kao
  • Weiming Li
  • Qiang Wang
  • Zhouchen Lin
  • Wooshik Kim
  • Sunghoon Hong

Monocular object pose estimation is an important yet challenging computer vision problem. Depth features can provide useful information for pose estimation. However, existing methods rely on real depth images to extract depth features, which limits their use in many applications. In this paper, we aim to extract RGB and depth features from a single RGB image with the help of synthetic RGB-depth image pairs for object pose estimation. Specifically, a deep convolutional neural network is proposed with an RGB-to-Depth Embedding module and a Synthetic-Real Adaptation module. The embedding module is trained with synthetic pair data to learn a depth-oriented embedding space between RGB and depth images optimized for object pose estimation. The adaptation module further aligns feature distributions from synthetic to real data. Compared to existing methods, our method does not need any real depth images and can be trained easily with large-scale synthetic data. Extensive experiments and comparisons show that our method achieves the best performance on the challenging public PASCAL 3D+ dataset in all metrics, which substantiates the superiority of our method and the above modules.
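
A minimal sketch of the embedding idea follows, assuming a simple L2 alignment between the RGB and depth branches on synthetic pairs; the encoders and the exact loss are placeholders, not the paper's modules.

```python
import torch
import torch.nn.functional as F

def embedding_alignment_loss(rgb_encoder, depth_encoder, rgb_syn, depth_syn):
    # On synthetic RGB-depth pairs, pull the RGB branch's embedding toward
    # the depth branch's, so depth-like features can later be read from a
    # single RGB image at test time.
    z_rgb = rgb_encoder(rgb_syn)            # (B, D) embedding from RGB
    with torch.no_grad():
        z_depth = depth_encoder(depth_syn)  # (B, D) target depth embedding
    return F.mse_loss(z_rgb, z_depth)
```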

IJCAI Conference 2018 Conference Paper

An Appearance-and-Structure Fusion Network for Object Viewpoint Estimation

  • Yueying Kao
  • Weiming Li
  • Zairan Wang
  • Dongqing Zou
  • Ran He
  • Qiang Wang
  • Minsu Ahn
  • Sunghoon Hong

Automatic object viewpoint estimation from a single image is an important but challenging problem in the machine intelligence community. Although impressive performance has been achieved, current state-of-the-art methods still have difficulty dealing with the visual ambiguity and structure ambiguity in real-world images. To tackle these problems, we propose a novel Appearance-and-Structure Fusion network, called ASFnet, that estimates viewpoint by fusing both appearance and structure information. The structure information is encoded by precise semantic keypoints and helps address the visual ambiguity, while distinguishable appearance features contribute to overcoming the structure ambiguity. ASFnet integrates an appearance path and a structure path into an end-to-end network and allows deep features to effectively share supervision from these two complementary aspects. A convolutional layer is learned to fuse the results of the two paths adaptively. To balance the influence of the two supervision sources, a piecewise loss weight strategy is employed during training. Experimentally, our proposed network outperforms state-of-the-art methods on the public PASCAL 3D+ dataset, which verifies the effectiveness of our method and further corroborates the above propositions.
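
The fusion and weighting ideas might look like the sketch below, where a learned 1x1 convolution merges per-bin scores from the two paths and a piecewise schedule re-weights the auxiliary loss; the bin layout and schedule are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TwoPathFusion(nn.Module):
    def __init__(self):
        super().__init__()
        # A learned 1x1 convolution decides, per viewpoint bin, how much to
        # trust the appearance path versus the structure path.
        self.fuse = nn.Conv1d(2, 1, kernel_size=1)

    def forward(self, appearance_logits, structure_logits):
        # each input: (B, n_bins) viewpoint scores from one path
        stacked = torch.stack([appearance_logits, structure_logits], dim=1)
        return self.fuse(stacked).squeeze(1)  # (B, n_bins) fused scores

def piecewise_loss_weight(epoch, switch_epoch=10, w_early=1.0, w_late=0.3):
    # Piecewise schedule: keep the per-path auxiliary losses strong during
    # warm-up, then down-weight them so the fused head dominates.
    return w_early if epoch < switch_epoch else w_late
```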

IJCAI Conference 2018 Conference Paper

HCR-Net: A Hybrid of Classification and Regression Network for Object Pose Estimation

  • Zairan Wang
  • Weiming Li
  • Yueying Kao
  • Dongqing Zou
  • Qiang Wang
  • Minsu Ahn
  • Sunghoon Hong

Object pose estimation from a single image is a fundamental and challenging problem in computer vision and robotics. Current methods generally treat pose estimation as either a classification or a regression problem. However, regression-based methods usually suffer from imbalanced training data, while classification methods find it difficult to discriminate between nearby poses. In this paper, we propose HCR-Net, a hybrid CNN model that integrates a classification network and a regression network to deal with these issues. Our model is inspired by the observation that regression methods achieve better accuracy on homogeneously distributed datasets, while classification methods are more effective for coarse quantization of poses even when the dataset is not well balanced; the two kinds of methods essentially complement each other. We therefore integrate both into a single neural network in a hybrid fashion and train it end-to-end with two novel loss functions. As a result, our method surpasses the state-of-the-art methods, even with imbalanced training data and much less data augmentation. Experimental results on the challenging Pascal3D+ database demonstrate that our method outperforms the state of the art significantly, achieving improvements of up to 4% and 6% on the ACC and AVP metrics, respectively.
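
A toy version of the hybrid head is sketched below: a classifier picks a coarse azimuth bin and a regressor predicts a continuous offset within it. The bin count, losses, and mixing weight are assumptions; HCR-Net's actual heads and loss functions are more elaborate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridPoseHead(nn.Module):
    def __init__(self, feat_dim=512, n_bins=24):
        super().__init__()
        self.cls = nn.Linear(feat_dim, n_bins)  # coarse pose bin scores
        self.reg = nn.Linear(feat_dim, n_bins)  # per-bin angular offsets

    def forward(self, feats):
        return self.cls(feats), self.reg(feats)

def hybrid_loss(cls_logits, offsets, azimuth_deg, n_bins=24, alpha=1.0):
    # Classification handles coarse quantization; regression refines within
    # the ground-truth bin, so nearby poses remain distinguishable.
    bin_width = 360.0 / n_bins
    target_bin = (azimuth_deg / bin_width).long() % n_bins
    target_off = azimuth_deg / bin_width - target_bin.float()  # in [0, 1)
    pred_off = offsets.gather(1, target_bin.unsqueeze(1)).squeeze(1)
    return (F.cross_entropy(cls_logits, target_bin)
            + alpha * F.smooth_l1_loss(pred_off, target_off))
```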

ICRA Conference 2011 Conference Paper

An analytical solution to optimal focal distance in catadioptric imaging systems

  • Weiming Li
  • You-Fu Li 0001

Catadioptric imaging systems are important in many computer vision and robotics applications. This work addresses the issue of optimally setting the focal distance of the lens-based camera in a catadioptric imaging system so as to acquire the best-focused image. To this end, it is important to understand the spatial distribution of the virtual features formed by mirror reflection. It is known that the virtual features for an infinite range of scene depths are confined to a finite depth extent, named the caustic volume. In this work we further find that, for a variety of quadric-mirror-based catadioptric systems, when objects are located at a certain distance from the system, the corresponding virtual features can be considered to lie on the caustic volume boundary. We verify this property with real catadioptric images. Based on this property, an analytical solution is derived for the optimal focal distance setting, which in previous work could only be calculated by software simulation or numerical approaches. This solution is compared with the numerical solution of a previous method and is also verified by a simulation of the optical process.
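
Once the virtual features are taken to lie on the caustic boundary, the remaining focusing step can be illustrated with the standard thin-lens model; this is only an illustration, not the paper's derivation, which is specific to quadric mirrors.

```python
def image_distance(focal_length_mm, virtual_feature_mm):
    # Thin-lens equation 1/f = 1/d_v + 1/d_i: if virtual features sit on the
    # caustic boundary at distance d_v from the lens, best focus places the
    # image plane at d_i = f * d_v / (d_v - f).
    f, d_v = focal_length_mm, virtual_feature_mm
    return f * d_v / (d_v - f)

print(image_distance(8.0, 120.0))  # ~8.57 mm for f = 8 mm, d_v = 120 mm
```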

ICRA Conference 2011 Conference Paper

Generic radial distortion calibration of a novel single camera based panoramic stereoscopic system

  • Weiming Li
  • You-Fu Li 0001
  • Zhongwei Li
  • Dong Sun 0001
  • Beiwei Zhang 0001

This work presents a novel panoramic stereoscopic system consisting of a fisheye lens camera and a hyperbolic mirror installed co-axially. From the overlapping field of view captured through the fisheye lens and the reflection of the mirror, the position of an object point in 3D Euclidean space can be reconstructed once the system geometry is calibrated. To deal with the non-single-viewpoint issue in the catadioptric image, a generic radial distortion model is used to describe the imaging process with a series of viewing cones. The parameters of the viewing cones are estimated using a homography-based method with observations of an LCD panel at a few unknown positions. Following this, a closed-form solution for 3D reconstruction is combined with a non-linear optimization to obtain an optimal calibration. A prototype of the proposed design is constructed, and quantitative experiments are conducted to evaluate the calibration result in terms of 3D reconstruction precision. With the calibrated system, we also present potential robotic applications of the proposed design, such as 3D environment reconstruction over a 360-degree horizontal field of view.
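
The viewing-cone idea can be sketched as a non-parametric mapping from image radius to cone half-angle; the interpolation over calibrated control radii below is an assumption about how such a table would be used, not the paper's estimation procedure.

```python
import numpy as np

def ray_direction(u, v, cx, cy, radii, thetas):
    # radii/thetas: calibrated samples of the radius -> cone-angle mapping.
    # Each image radius gets its own viewing cone around the optical axis,
    # so no single effective viewpoint is assumed.
    r = np.hypot(u - cx, v - cy)
    theta = np.interp(r, radii, thetas)   # generic (non-parametric) model
    phi = np.arctan2(v - cy, u - cx)      # azimuth within the image plane
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])      # unit ray in the camera frame
```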

IROS Conference 2006 Conference Paper

Structure-Constrained Obstacles Recognition for Power Transmission Line Inspection Robot

  • Si-Yao Fu
  • Weiming Li
  • Yun-Chu Zhang
  • Zi-ze Liang
  • Zeng-Guang Hou
  • Min Tan 0001
  • Wenbo Ye
  • Lian Bo

An inspection robot crawling along a power transmission line must detect obstacles against the complex background and plan its behavior according to their types in order to negotiate them reliably. In most instances, however, detecting obstacles against such a background is a hard task. For this purpose, a novel and fast visual obstacle recognition algorithm is designed based on the structure of the 220 kV power transmission line. The basic principle and architecture of the algorithm are given. With this approach, three typical obstacles on the power transmission line, namely insulator strings, counterweights, and suspension clamps, can be recognized with high accuracy. Experiments on a real power transmission line show the method's effectiveness and its contribution to the process of a mobile robot negotiating obstacles.