Author name cluster

Peize Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers

2 author rows

AAAI Conference 2025 Conference Paper

Multimodal Hypothetical Summary for Retrieval-based Multi-image Question Answering

Peize Li
Qingyi Si
Peng Fu
Zheng Lin
Yan Wang

The retrieval-based multi-image question answering (QA) task involves retrieving multiple question-related images and synthesizing these images to generate an answer. Conventional "retrieve-then-answer" pipelines often suffer from cascading errors because the training objective of QA fails to optimize the retrieval stage. To address this issue, we propose a novel method to effectively introduce and reference retrieved information in QA. Given the image set to be retrieved, we employ a multimodal large language model (visual perspective) and a large language model (textual perspective) to obtain a multimodal hypothetical summary (MHyS) in question form and description form. By combining visual and textual perspectives, MHyS captures image content more specifically and replaces real images in retrieval, which eliminates the modality gap by transforming the task into text-to-text retrieval and helps improve retrieval. To better integrate retrieval with QA, we employ contrastive learning to align queries (questions) with MHyS. Moreover, we propose a coarse-to-fine strategy for calculating both sentence-level and word-level similarity scores, to further enhance retrieval and filter out irrelevant details. Our approach achieves a 3.7% absolute improvement over state-of-the-art methods on RETVQA and a 14.5% improvement over CLIP. Comprehensive experiments and detailed ablation studies demonstrate the superiority of our method.
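
The coarse-to-fine scoring idea in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the bag-of-words cosine used as the sentence-level score, the word-overlap ratio used as the word-level score, and the names `sentence_score`, `word_score` and `coarse_to_fine_rank` are all assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer that strips basic punctuation."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return [w for w in words if w]


def sentence_score(query, summary):
    """Coarse score (assumed): cosine similarity of bag-of-words vectors."""
    q, s = Counter(tokenize(query)), Counter(tokenize(summary))
    dot = sum(q[w] * s[w] for w in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_s = math.sqrt(sum(v * v for v in s.values()))
    return dot / (norm_q * norm_s) if norm_q and norm_s else 0.0


def word_score(query, summary):
    """Fine score (assumed): fraction of query words found in the summary."""
    q_words, s_words = set(tokenize(query)), set(tokenize(summary))
    return len(q_words & s_words) / len(q_words) if q_words else 0.0


def coarse_to_fine_rank(query, summaries, keep=2):
    """Keep the top `keep` summaries by coarse score, then rerank by fine score."""
    coarse = sorted(summaries, key=lambda s: sentence_score(query, s), reverse=True)[:keep]
    return sorted(coarse, key=lambda s: word_score(query, s), reverse=True)


# Hypothetical text summaries standing in for three candidate images.
summaries = [
    "a red bus parked on a street",
    "two dogs playing in a park",
    "a dog sleeping on a red sofa",
]
ranked = coarse_to_fine_rank("what color is the sofa the dog sleeps on", summaries)
print(ranked)
```

In a real system the two scores would come from learned sentence and token embeddings rather than word counts; the sketch only shows the two-stage structure.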

ICRA Conference 2024 Conference Paper

Multimodal Indoor Localization Using Crowdsourced Radio Maps

Zhaoguang Yi
Xiangyu Wen 0001
Qiyue Xia
Peize Li
Francisco Zampella
Firas Alsehly
Chris Xiaoxuan Lu

Indoor Positioning Systems (IPS) traditionally rely on odometry and building infrastructures like WiFi, often supplemented by building floor plans for increased accuracy. However, the limitation of floor plans in terms of availability and timeliness of updates challenges their wide applicability. In contrast, the proliferation of smartphones and WiFi-enabled robots has made crowdsourced radio maps – databases pairing locations with their corresponding Received Signal Strengths (RSS) – increasingly accessible. These radio maps not only provide WiFi fingerprint-location pairs but also encode movement regularities akin to the constraints imposed by floor plans. This work investigates the possibility of leveraging these radio maps as a substitute for floor plans in multimodal IPS. We introduce a new framework to address the challenges of radio map inaccuracies and sparse coverage. Our proposed system integrates an uncertainty-aware neural network model for WiFi localization and a bespoke Bayesian fusion technique for optimal fusion. Extensive evaluations on multiple real-world sites indicate a significant performance enhancement, with results showing approximately 25% improvement over the best baseline.
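
As a rough illustration of why uncertainty-aware estimates matter for fusion, the sketch below combines two independent Gaussian position estimates by precision weighting. This is the generic textbook product-of-Gaussians rule, not the bespoke fusion technique from the paper; the function name `fuse_gaussian` and the numbers in the usage lines are hypothetical.

```python
def fuse_gaussian(mean_a, var_a, mean_b, var_b):
    """Fuse two independent 1-D Gaussian estimates of the same quantity.

    Each estimate is weighted by its precision (1 / variance), so the more
    certain source dominates the fused mean.
    """
    if var_a <= 0 or var_b <= 0:
        raise ValueError("variances must be positive")
    precision = 1.0 / var_a + 1.0 / var_b
    fused_var = 1.0 / precision
    fused_mean = fused_var * (mean_a / var_a + mean_b / var_b)
    return fused_mean, fused_var


# Hypothetical example: an odometry estimate (less certain) and a WiFi
# estimate (more certain) of the x-coordinate, in metres.
odometry_x, odometry_var = 10.0, 4.0
wifi_x, wifi_var = 12.0, 1.0
fused_x, fused_var = fuse_gaussian(odometry_x, odometry_var, wifi_x, wifi_var)
print(fused_x, fused_var)
```

The fused variance is always smaller than either input variance, which is why a network that reports its own uncertainty is useful to a downstream fusion step.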

AAAI Conference 2024 Conference Paper

Object Attribute Matters in Visual Question Answering

Peize Li
Qingyi Si
Peng Fu
Zheng Lin
Yan Wang

Visual question answering (VQA) is a multimodal task that requires the joint comprehension of visual and textual information. However, integrating visual and textual semantics solely through attention layers is insufficient to comprehensively understand and align information from both modalities. Intuitively, object attributes can naturally serve as a bridge to unify them, which has been overlooked in previous research. In this paper, we propose a novel VQA approach from the perspective of utilizing object attributes, aiming to achieve better object-level visual-language alignment and multimodal scene understanding. Specifically, we design an attribute fusion module and a contrastive knowledge distillation module. The attribute fusion module constructs a multimodal graph neural network to fuse attributes and visual features through message passing. The enhanced object-level visual features contribute to solving fine-grained problems such as counting questions. The better object-level visual-language alignment aids in understanding multimodal scenes, thereby improving the model's robustness. Furthermore, to augment scene understanding and out-of-distribution performance, the contrastive knowledge distillation module introduces a series of implicit knowledge. We distill knowledge into attributes through a contrastive loss, which further strengthens the representation learning of attribute features and facilitates visual-linguistic alignment. Extensive experiments on six datasets, COCO-QA, VQAv2, VQA-CPv2, VQA-CPv1, VQAvs and TDIUC, show the superiority of the proposed method.
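
For readers unfamiliar with contrastive losses, the sketch below shows a common InfoNCE-style formulation for a single anchor. The abstract does not specify which contrastive loss the authors use, so this form, the function name `info_nce` and the `temperature` default are assumptions for illustration only.

```python
import math


def info_nce(similarities, positive_index, temperature=0.1):
    """InfoNCE-style loss for one anchor.

    `similarities` holds the anchor's similarity to every candidate;
    `positive_index` marks the matching candidate. The loss is the negative
    log-softmax of the positive candidate's scaled similarity.
    """
    logits = [s / temperature for s in similarities]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[positive_index]


# Hypothetical similarities between one attribute feature and three candidates.
aligned = info_nce([0.9, 0.1, 0.2], positive_index=0)
misaligned = info_nce([0.1, 0.9, 0.2], positive_index=0)
print(aligned, misaligned)
```

The loss is low when the positive pair is the most similar candidate and high otherwise, which is what pulls matched features together during training.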

IROS Conference 2023 Conference Paper

Feature-based Visual Odometry for Bronchoscopy: A Dataset and Benchmark

Jianning Deng
Peize Li
Kevin Dhaliwal
Chris Xiaoxuan Lu
Mohsen Khadem

Bronchoscopy is a medical procedure that involves the insertion of a flexible tube with a camera into the airways to survey, diagnose and treat lung diseases. Due to the complex branching anatomical structure of the bronchial tree and the similarity of the inner surfaces of the segmental airways, navigation systems are now being routinely used to guide the operator during procedures to access the lung periphery. Current navigation systems rely on sensor-integrated bronchoscopes to track the position of the bronchoscope in real time. This approach has limitations, including increased cost and limited use in non-specialized settings. To address this issue, researchers have proposed visual odometry algorithms to track the bronchoscope camera without the need for external sensors. However, due to the lack of publicly available datasets, limited progress has been made. To this end, we have developed a database of bronchoscopy videos in a phantom lung model and ex-vivo human lungs. The dataset contains 34 video sequences with over 23,000 frames with odometry ground truth data collected using electromagnetic tracking sensors. With our dataset, we empower the robotics and machine learning community to advance the field. We share our insights on challenges in endoscopic visual odometry. Furthermore, we provide benchmark results for this dataset. State-of-the-art feature extraction algorithms including SIFT, ORB, SuperPoint, Shi-Tomasi, and LoFTR are tested on this dataset. The benchmark results demonstrate that the LoFTR algorithm outperforms other approaches, but still has significant errors in the presence of rapid movements and occlusions.

IROS Conference 2022 Conference Paper

OdomBeyondVision: An Indoor Multi-modal Multi-platform Odometry Dataset Beyond the Visible Spectrum

Peize Li
Kaiwen Cai
Muhamad Risqi Utama Saputra
Zhuangzhuang Dai
Chris Xiaoxuan Lu

This paper presents a multimodal indoor odometry dataset, OdomBeyondVision, featuring multiple sensors across different spectra and collected with different mobile platforms. Not only does OdomBeyondVision contain traditional navigation sensors such as IMUs, mechanical LiDAR and an RGBD camera, it also includes several emerging sensors such as a single-chip mmWave radar, an LWIR thermal camera and a solid-state LiDAR. With the above sensors on UAV, UGV and handheld platforms, we recorded the multimodal odometry data and the corresponding movement trajectories in various indoor scenes and under different illumination conditions. We release exemplar radar, radar-inertial and thermal-inertial odometry implementations to demonstrate their results for future works to compare against and improve upon. The full dataset including toolkit and documentation is publicly available at: https://github.com/MAPS-Lab/OdomBeyondVision.
