Arrow Research search

Author name cluster

Botian Shi

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers
2 author rows

Possible papers

16

AAAI Conference 2026 Conference Paper

LeanRAG: Knowledge-Graph-Based Generation with Semantic Aggregation and Hierarchical Retrieval

  • Yaoze Zhang
  • Rong Wu
  • Pinlong Cai
  • Xiaoman Wang
  • Guohang Yan
  • Song Mao
  • Ding Wang
  • Botian Shi

Retrieval-Augmented Generation (RAG) plays a crucial role in grounding Large Language Models with external knowledge, but its effectiveness is often compromised by the retrieval of contextually flawed or incomplete information. To address this, knowledge graph-based RAG methods have evolved towards hierarchical structures, organizing knowledge into multi-level summaries. However, these approaches still suffer from two critical, unaddressed challenges: high-level conceptual summaries exist as disconnected "semantic islands", lacking the explicit relations needed for cross-community reasoning; and the retrieval process itself remains structurally unaware, often degenerating into an inefficient flat search that fails to exploit the graph's rich topology. To overcome these limitations, we introduce LeanRAG, a framework with a deeply collaborative design that combines knowledge aggregation and retrieval strategies. LeanRAG first employs a novel semantic aggregation algorithm that forms entity clusters and constructs new explicit relations among aggregation-level summaries, creating a fully navigable semantic network. A bottom-up, structure-guided retrieval strategy then anchors queries to the most relevant fine-grained entities and systematically traverses the graph's semantic pathways to gather concise yet contextually comprehensive evidence sets. LeanRAG mitigates the substantial overhead associated with path retrieval on graphs and minimizes redundant information retrieval. Extensive experiments on four challenging QA benchmarks from different domains demonstrate that LeanRAG significantly outperforms existing methods in response quality while reducing retrieval redundancy by 46%.
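
A minimal sketch of the bottom-up, structure-guided retrieval idea described in the abstract, assuming hypothetical entity_index, parent_of, and summary_relations structures; this is an illustration, not the authors' released implementation.

    # Illustrative bottom-up retrieval over a hierarchical knowledge graph.
    # `entity_index` (a nearest-neighbour index with a `.nearest()` method),
    # `parent_of`, and `summary_relations` are hypothetical stand-ins.
    from typing import Dict, List, Set

    def retrieve(query_vec, entity_index, parent_of: Dict[str, str],
                 summary_relations: Dict[str, List[str]], top_k: int = 5) -> Set[str]:
        # 1) Anchor the query to the most relevant fine-grained entities.
        anchors = entity_index.nearest(query_vec, k=top_k)
        evidence: Set[str] = set(anchors)
        # 2) Walk upward along the aggregation hierarchy, collecting summaries.
        for node in anchors:
            while node in parent_of:
                node = parent_of[node]
                evidence.add(node)
                # 3) Follow the explicit relations built between aggregation-level
                #    summaries so retrieval can cross "semantic islands".
                evidence.update(summary_relations.get(node, []))
        return evidence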

ICLR Conference 2025 Conference Paper

GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training

  • Renqiu Xia
  • Mingsheng Li
  • Hancheng Ye
  • Wenjie Wu
  • Hongbin Zhou
  • Jiakang Yuan
  • Tianshuo Peng
  • Xinyu Cai

Despite their proficiency in general tasks, Multi-modal Large Language Models (MLLMs) struggle with automatic Geometry Problem Solving (GPS), which demands understanding diagrams, interpreting symbols, and performing complex reasoning. This limitation arises from their pre-training on natural images and texts, along with the lack of automated verification in the problem-solving process. Besides, current geometric specialists are limited by their task-specific designs, making them less effective for broader geometric problems. To this end, we present GeoX, a multi-modal large model focusing on geometric understanding and reasoning tasks. Given the significant differences between geometric diagram-symbol and natural image-text, we introduce unimodal pre-training to develop a diagram encoder and symbol decoder, enhancing the understanding of geometric images and corpora. Furthermore, we introduce geometry-language alignment, an effective pre-training paradigm that bridges the modality gap between unimodal geometric experts. We propose a Generator-And-Sampler Transformer (GS-Former) to generate discriminative queries and eliminate uninformative representations from unevenly distributed geometric signals. Finally, GeoX benefits from visual instruction tuning, empowering it to take geometric images and questions as input and generate verifiable solutions. Experiments show that GeoX outperforms both generalists and geometric specialists on publicly recognized benchmarks, such as GeoQA, UniGeo, Geometry3K, and PGPS9k. Our data and code will be released soon to accelerate future research on automatic GPS.

ICRA Conference 2024 Conference Paper

An Extrinsic Calibration Method between LiDAR and GNSS/INS for Autonomous Driving

  • Jiahao Pi
  • Guohang Yan
  • Chengjie Wang 0009
  • Xinyu Cai
  • Botian Shi

Accurate and reliable sensor calibration is critical for fusing LiDAR and inertial measurements in autonomous driving. This paper proposes a novel three-stage extrinsic calibration method between LiDAR and GNSS/INS for autonomous driving. The first stage quickly calibrates the extrinsic parameters between the sensors through point cloud surface features, so that the extrinsic can be narrowed from a large initial error to a small error range in a short time. The second stage further calibrates the extrinsic parameters based on LiDAR-mapping space occupancy while removing motion distortion. In the final stage, the z-axis (the vertical direction relative to the ground plane) errors caused by the planar motion of the autonomous vehicle are corrected, and an accurate extrinsic parameter is finally obtained. Specifically, this method utilizes the planar features in the environment, making it possible to carry out calibration quickly. Experimental results on real-world datasets demonstrate the reliability and accuracy of our method. The code is open-sourced on GitHub at https://github.com/OpenCalib/LiDAR2INS.
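
A simplified stand-in for the first, planar-feature stage: refining a LiDAR-to-INS extrinsic by minimizing point-to-plane residuals with SciPy. The plane assignments are assumed given, and the real three-stage pipeline (mapping-based occupancy, z-axis correction) is not reproduced here.

    # Toy point-to-plane refinement of a LiDAR->INS extrinsic (yaw, pitch, roll, x, y, z).
    # `points` are LiDAR points already matched to planes (n·x + d = 0) in the INS frame.
    import numpy as np
    from scipy.optimize import least_squares
    from scipy.spatial.transform import Rotation as R

    def residuals(params, points, normals, ds):
        rot = R.from_euler("zyx", params[:3]).as_matrix()
        t = params[3:]
        transformed = points @ rot.T + t                        # LiDAR points in INS frame
        return np.einsum("ij,ij->i", transformed, normals) + ds  # signed plane distances

    def calibrate(points, normals, ds, init=np.zeros(6)):
        # points: (N,3), normals: (N,3) unit plane normals, ds: (N,) plane offsets
        return least_squares(residuals, init, args=(points, normals, ds)).x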

NeurIPS Conference 2024 Conference Paper

Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving

  • Jianbiao Mei
  • Yukai Ma
  • Xuemeng Yang
  • Licheng Wen
  • Xinyu Cai
  • Xin Li
  • Daocheng Fu
  • Bo Zhang

Autonomous driving has advanced significantly due to improvements in sensors, machine learning, and artificial intelligence. However, prevailing methods struggle with intricate scenarios and causal relationships, hindering adaptability and interpretability in varied environments. To address the above problems, we introduce LeapAD, a novel paradigm for autonomous driving inspired by the human cognitive process. Specifically, LeapAD emulates human attention by selecting critical objects relevant to driving decisions, simplifying environmental interpretation, and mitigating decision-making complexities. Additionally, LeapAD incorporates an innovative dual-process decision-making module, which consists of an Analytic Process (System-II) for thorough analysis and reasoning, along with a Heuristic Process (System-I) for swift and empirical processing. The Analytic Process leverages its logical reasoning to accumulate linguistic driving experience, which is then transferred to the Heuristic Process by supervised fine-tuning. Through reflection mechanisms and a growing memory bank, LeapAD continuously improves itself from past mistakes in a closed-loop environment. Closed-loop testing in CARLA shows that LeapAD outperforms all methods relying solely on camera input, requiring 1-2 orders of magnitude less labeled data. Experiments also demonstrate that as the memory bank expands, the Heuristic Process with only 1.8B parameters can inherit the knowledge from a GPT-4-powered Analytic Process and achieve continuous performance improvement. Project page: https://pjlab-adg.github.io/LeapAD
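
A schematic of the dual-process control flow described above, with hypothetical heuristic_llm (System-I), analytic_llm (System-II), memory, and simulator interfaces; the released LeapAD code is organized differently.

    # Dual-process driving step: fast System-I decides, System-II reflects on failures.
    # All objects here are hypothetical placeholders.
    def drive_step(scene_description, heuristic_llm, analytic_llm, memory, simulator):
        decision = heuristic_llm.decide(scene_description)
        outcome = simulator.execute(decision)          # e.g. one CARLA closed-loop step
        if outcome.failed:
            # Reflection: the Analytic Process reasons about the mistake; the corrected
            # experience grows the memory bank used to fine-tune System-I later.
            reflection = analytic_llm.reflect(scene_description, decision, outcome)
            memory.add(scene_description, reflection)
        return decision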

ICLR Conference 2024 Conference Paper

DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models

  • Licheng Wen
  • Daocheng Fu
  • Xin Li 0110
  • Xinyu Cai
  • Tao Ma 0002
  • Pinlong Cai
  • Min Dou
  • Botian Shi

Recent advancements in autonomous driving have relied on data-driven approaches, which are widely adopted but face challenges including dataset bias, overfitting, and uninterpretability. Drawing inspiration from the knowledge-driven nature of human driving, we explore the question of how to instill similar capabilities into autonomous driving systems and summarize a paradigm that integrates an interactive environment, a driver agent, and a memory component to address this question. Leveraging large language models (LLMs) with emergent abilities, we propose the DiLu framework, which combines a Reasoning and a Reflection module to enable the system to perform decision-making based on common-sense knowledge and to evolve continuously. Extensive experiments prove DiLu's capability to accumulate experience and demonstrate a significant advantage in generalization ability over reinforcement learning-based methods. Moreover, DiLu is able to directly acquire experiences from real-world datasets, which highlights its potential to be deployed in practical autonomous driving systems. To the best of our knowledge, we are the first to leverage knowledge-driven capability in decision-making for autonomous vehicles. Through the proposed DiLu framework, the LLM is strengthened to apply knowledge and to reason causally in the autonomous driving domain. Project page: https://pjlab-adg.github.io/DiLu/
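
A sketch of the memory-retrieval, reasoning, and reflection loop the abstract describes, with hypothetical llm, memory, and env interfaces rather than the project's actual API.

    # DiLu-style step: retrieve similar past experiences, reason, and reflect on failure.
    def dilu_step(observation, llm, memory, env):
        examples = memory.retrieve(observation, k=3)     # few-shot driving experiences
        decision = llm.reason(observation, few_shot=examples)
        result = env.step(decision)
        if result.success:
            memory.add(observation, decision)            # accumulate good experience
        else:
            corrected = llm.reflect(observation, decision, result)
            memory.update(observation, corrected)        # revise the faulty experience
        return decision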

IROS Conference 2024 Conference Paper

Realistic Rainy Weather Simulation for LiDARs in CARLA Simulator

  • Donglin Yang
  • Xinyu Cai
  • Zhenfeng Liu
  • Wentao Jiang
  • Bo Zhang 0069
  • Guohang Yan
  • Xing Gao 0005
  • Si Liu 0001

Data augmentation methods to enhance perception performance in adverse weather have recently attracted considerable attention. Most LiDAR data augmentation methods post-process existing datasets with physics-based models or machine-learning methods. However, due to the limited environmental annotations and the fixed vehicle trajectories in existing datasets, it is challenging to edit the scene and expand the diversity of traffic flows and scenarios. To this end, we propose a simulator-based physical modeling approach to augment LiDAR data in rainy weather, enhancing the performance of the perception model. We model the rainy weather effect in the CARLA simulator and establish a data collection pipeline for LiDAR. Furthermore, we pay special attention to the spray generated by vehicles in rainy weather and simulate this phenomenon through the Spray Emitter method we developed. In addition, considering the influence of different weather conditions on point cloud intensity, we develop a prediction network to forecast the intensity of the LiDAR echo. This enables us to complete the rainy weather simulation of 4D point cloud data. In experiments, we observe that the model augmented by our synthetic dataset improves 3D object detection performance in rainy weather. Both code and dataset are available at https://github.com/PJLab-ADG/PCSim#rainypcsim.
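
For intuition, a simple Beer-Lambert-style two-way attenuation of LiDAR echo intensity in rain; the paper instead trains a prediction network, so treat this only as an illustrative physics-based stand-in with assumed constants.

    # Rain-attenuated LiDAR intensity via two-way exponential extinction (illustrative).
    import numpy as np

    def attenuate_intensity(intensity, ranges_m, rain_rate_mm_h, k=0.01, gamma=0.6):
        # alpha: extinction coefficient [1/m], here an assumed power law of rain rate.
        alpha = k * rain_rate_mm_h ** gamma
        return intensity * np.exp(-2.0 * alpha * ranges_m)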

ICLR Conference 2024 Conference Paper

ReSimAD: Zero-Shot 3D Domain Transfer for Autonomous Driving with Source Reconstruction and Target Simulation

  • Bo Zhang 0069
  • Xinyu Cai
  • Jiakang Yuan
  • Donglin Yang
  • Jianfei Guo
  • Xiangchao Yan
  • Renqiu Xia
  • Botian Shi

Domain shifts such as sensor type changes and geographical situation variations are prevalent in Autonomous Driving (AD), which poses a challenge because an AD model relying on previous domain knowledge can hardly be deployed directly to a new domain without additional cost. In this paper, we provide a new perspective and approach for alleviating domain shifts by proposing a Reconstruction-Simulation-Perception (ReSimAD) scheme. Specifically, the implicit reconstruction process is based on knowledge from the previous old domain, aiming to convert the domain-related knowledge into domain-invariant representations, e.g., 3D scene-level meshes. Besides, the point cloud simulation process for multiple new domains is conditioned on the above reconstructed 3D meshes, so that target-domain-like simulation samples can be obtained, thus reducing the cost of collecting and annotating new-domain data for the subsequent perception process. For experiments, we consider different cross-domain situations such as Waymo-to-KITTI and Waymo-to-nuScenes to verify zero-shot target-domain perception using ReSimAD. Results demonstrate that our method is beneficial for boosting the domain generalization ability and is even promising for 3D pre-training. Code and simulated points are available at: https://github.com/PJLab-ADG/3DTrans

NeurIPS Conference 2024 Conference Paper

Training-Free Adaptive Diffusion with Bounded Difference Approximation Strategy

  • Hancheng Ye
  • Jiakang Yuan
  • Renqiu Xia
  • Xiangchao Yan
  • Tao Chen
  • Junchi Yan
  • Botian Shi
  • Bo Zhang

Diffusion models have recently achieved great success in the synthesis of high-quality images and videos. However, the existing denoising techniques in diffusion models are commonly based on step-by-step noise predictions, which suffer from high computation cost, resulting in prohibitive latency for interactive applications. In this paper, we propose AdaptiveDiffusion to relieve this bottleneck by adaptively reducing the noise prediction steps during the denoising process. Our method considers the potential of skipping as many noise prediction steps as possible while keeping the final denoised results identical to the original full-step ones. Specifically, the skipping strategy is guided by the third-order latent difference, which indicates the stability between timesteps during the denoising process and benefits the reuse of previous noise prediction results. Extensive experiments on image and video diffusion models demonstrate that our method can significantly speed up the denoising process while generating identical results to the original process, achieving up to an average 2-5x speedup without quality degradation. The code is available at https://github.com/UniModal4Reasoning/AdaptiveDiffusion
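
A minimal sketch of the skip criterion: when the third-order difference of recent latents is small, the previous noise prediction is reused instead of calling the network. unet, scheduler, and the threshold are hypothetical stand-ins for a real diffusion pipeline.

    # Adaptive denoising loop with a third-order latent-difference skip rule (sketch).
    import torch

    def adaptive_denoise(latent, unet, scheduler, timesteps, threshold=0.01):
        history, prev_eps = [], None
        for t in timesteps:
            history.append(latent)
            if len(history) >= 4 and prev_eps is not None:
                x0, x1, x2, x3 = history[-1], history[-2], history[-3], history[-4]
                third_diff = (x0 - 3 * x1 + 3 * x2 - x3).abs().mean()
                skip = third_diff < threshold              # latents are locally stable
            else:
                skip = False
            eps = prev_eps if skip else unet(latent, t)    # reuse last noise prediction
            latent = scheduler.step(eps, t, latent)        # hypothetical update rule
            prev_eps = eps
        return latent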

ICRA Conference 2024 Conference Paper

VeloVox: A Low-Cost and Accurate 4D Object Detector with Single-Frame Point Cloud of Livox LiDAR

  • Tao Ma 0002
  • Zhiwei Zheng
  • Hongbin Zhou
  • Xinyu Cai
  • Xuemeng Yang
  • Yikang Li 0002
  • Botian Shi
  • Hongsheng Li 0001

Combining motion prediction with LiDAR-based 3D object detection is an effective way to improve overall accuracy, especially for downstream autonomous driving tasks. The recent development of low-cost LiDARs (e.g., Livox LiDAR) enables us to explore such 4D perception systems with a lower budget and higher performance. In this paper, we propose a 4D object detector, VeloVox, to establish accurate object detection and velocity estimation from a single-frame point cloud of Livox LiDAR. Based on the non-repetitive scanning pattern and point-level temporal nature, we propose a two-stage module to enhance spatial-temporal point feature interaction along the time dimension. The aggregated feature also benefits a more accurate proposal refinement. To demonstrate the performance, we evaluate VeloVox against several SOTA detector-based baselines on our in-house dataset and on a synthesized dataset built with the CARLA simulator. Code will be released at https://github.com/PJLab-ADG/VeloVox.

ICRA Conference 2024 Conference Paper

Zero-training LiDAR-Camera Extrinsic Calibration Method Using Segment Anything Model

  • Zhaotong Luo
  • Guohang Yan
  • Xinyu Cai
  • Botian Shi

Extrinsic calibration for LiDAR and camera is an essential prerequisite for sensor fusion. Recently, automatic and target-less extrinsic calibration has become the mainstream of academic research. However, geometric feature-based methods still place requirements on the scene. Deep learning methods, while achieving high accuracy and good adaptability, rely on large annotated datasets and need additional training. We propose a novel LiDAR-camera calibration method using the Segment Anything Model (SAM) without additional training. With the automatically generated masks, we optimize the extrinsic parameters by maximizing the consistency score of the point attributes that fall on each mask. The point cloud attributes include intensity, normal vector, and segmentation class. Experiments on different real-world datasets demonstrate the accuracy and robustness of our proposed method. The code is available at https://github.com/OpenCalib/CalibAnything.
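
A sketch of the mask-consistency objective: LiDAR points are projected into the image with a candidate extrinsic, and low intensity variance inside each SAM mask is rewarded. Variable names and the single-attribute score are illustrative; see the linked repository for the actual method.

    # Consistency score of projected point intensities within SAM masks (illustrative).
    import numpy as np

    def consistency_score(points_xyz, intensities, extrinsic, K, masks):
        pts_h = np.hstack([points_xyz, np.ones((len(points_xyz), 1))])
        cam = (extrinsic @ pts_h.T).T[:, :3]              # points in the camera frame
        valid = cam[:, 2] > 0
        uv = (K @ cam[valid].T).T
        uv = (uv[:, :2] / uv[:, 2:3]).astype(int)
        inten = intensities[valid]

        score = 0.0
        for mask in masks:                                # boolean HxW arrays from SAM
            h, w = mask.shape
            inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
            sel = inside.copy()
            sel[inside] = mask[uv[inside, 1], uv[inside, 0]]
            if sel.sum() > 1:
                score -= inten[sel].var()                 # lower variance = more consistent
        return score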

NeurIPS Conference 2024 Conference Paper

ZOPP: A Framework of Zero-shot Offboard Panoptic Perception for Autonomous Driving

  • Tao Ma
  • Hongbin Zhou
  • Qiusheng Huang
  • Xuemeng Yang
  • Jianfei Guo
  • Bo Zhang
  • Min Dou
  • Yu Qiao

Offboard perception aims to automatically generate high-quality 3D labels for autonomous driving (AD) scenes. Existing offboard methods focus on 3D object detection with a closed-set taxonomy and fail to match human-level recognition capability on rapidly evolving perception tasks. Due to the heavy reliance on human labels and the prevalence of data imbalance and sparsity, a unified framework for offboard auto-labeling of the various elements in AD scenes that meets the distinct needs of perception tasks has not been fully explored. In this paper, we propose a novel multi-modal Zero-shot Offboard Panoptic Perception (ZOPP) framework for autonomous driving scenes. ZOPP integrates the powerful zero-shot recognition capabilities of vision foundation models and 3D representations derived from point clouds. To the best of our knowledge, ZOPP represents a pioneering effort in the domain of multi-modal panoptic perception and auto-labeling for autonomous driving scenes. We conduct comprehensive empirical studies and evaluations on the Waymo Open Dataset to validate the proposed ZOPP on various perception tasks. To further explore the usability and extensibility of our proposed ZOPP, we also conduct experiments in downstream applications. The results further demonstrate the great potential of our ZOPP for real-world scenarios. The source code will be released at https://github.com/PJLab-ADG/ZOPP.

NeurIPS Conference 2023 Conference Paper

AD-PT: Autonomous Driving Pre-Training with Large-scale Point Cloud Dataset

  • Jiakang Yuan
  • Bo Zhang
  • Xiangchao Yan
  • Botian Shi
  • Tao Chen
  • Yikang Li
  • Yu Qiao

It is a long-term vision of the Autonomous Driving (AD) community that perception models can learn from a large-scale point cloud dataset to obtain unified representations that achieve promising results on different tasks and benchmarks. Previous works mainly focus on the self-supervised pre-training pipeline, meaning that they perform pre-training and fine-tuning on the same benchmark, which makes it difficult to attain performance scalability and cross-dataset application for the pre-trained checkpoint. In this paper, for the first time, we are committed to building a large-scale pre-training point-cloud dataset with diverse data distribution, and meanwhile learning generalizable representations from such a diverse pre-training dataset. We formulate the point-cloud pre-training task as a semi-supervised problem, which leverages few-shot labeled and massive unlabeled point-cloud data to generate unified backbone representations that can be directly applied to many baseline models and benchmarks, decoupling the AD-related pre-training process from the downstream fine-tuning task. During backbone pre-training, by enhancing scene- and instance-level distribution diversity and exploiting the backbone's ability to learn from unknown instances, we achieve significant performance gains on a series of downstream perception benchmarks including Waymo, nuScenes, and KITTI, under different baseline models such as PV-RCNN++, SECOND, and CenterPoint.

AAAI Conference 2023 Conference Paper

LWSIS: LiDAR-Guided Weakly Supervised Instance Segmentation for Autonomous Driving

  • Xiang Li
  • Junbo Yin
  • Botian Shi
  • Yikang Li
  • Ruigang Yang
  • Jianbing Shen

Image instance segmentation is a fundamental research topic in autonomous driving, which is crucial for scene understanding and road safety. Advanced learning-based approaches often rely on costly 2D mask annotations for training. In this paper, we present a more artful framework, LiDAR-guided Weakly Supervised Instance Segmentation (LWSIS), which leverages off-the-shelf 3D data, i.e., point clouds together with 3D boxes, as natural weak supervision for training 2D image instance segmentation models. Our LWSIS not only exploits the complementary information in multimodal data during training but also significantly reduces the annotation cost of dense 2D masks. In detail, LWSIS consists of two crucial modules, Point Label Assignment (PLA) and Graph-based Consistency Regularization (GCR). The former module automatically assigns the 3D point cloud as 2D point-wise labels, while the latter further refines the predictions by enforcing geometry and appearance consistency of the multimodal data. Moreover, we conduct a secondary instance segmentation annotation on nuScenes, named nuInsSeg, to encourage further research on multimodal perception tasks. Extensive experiments on nuInsSeg, as well as the large-scale Waymo dataset, show that LWSIS can substantially improve existing weakly supervised segmentation models by only involving 3D data during training. Additionally, LWSIS can also be incorporated into 3D object detectors like PointPainting to boost 3D detection performance for free. The code and dataset are available at https://github.com/Serenos/LWSIS.
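
A sketch of the Point Label Assignment idea: LiDAR points falling inside an annotated 3D box are projected into the image and become weak point-wise labels for 2D segmentation. points_in_box and the other names are hypothetical; the full PLA module is more involved.

    # Project in-box LiDAR points to 2D point-wise labels (illustrative).
    import numpy as np

    def assign_point_labels(points_xyz, boxes, box_classes, extrinsic, K, points_in_box):
        pts_h = np.hstack([points_xyz, np.ones((len(points_xyz), 1))])
        cam = (extrinsic @ pts_h.T).T[:, :3]
        labels = []
        for box, cls in zip(boxes, box_classes):
            inside = points_in_box(points_xyz, box)       # hypothetical geometric test
            sel = inside & (cam[:, 2] > 0)
            uv = (K @ cam[sel].T).T
            uv = uv[:, :2] / uv[:, 2:3]
            labels.append((uv, cls))                      # per-box 2D point labels
        return labels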

NeurIPS Conference 2023 Conference Paper

RangePerception: Taming LiDAR Range View for Efficient and Accurate 3D Object Detection

  • Yeqi Bai
  • Ben Fei
  • Youquan Liu
  • Tao Ma
  • Yuenan Hou
  • Botian Shi
  • Yikang Li

LiDAR-based 3D detection methods currently use bird's-eye view (BEV) or range view (RV) as their primary basis. The former relies on voxelization and 3D convolutions, resulting in inefficient training and inference processes. Conversely, RV-based methods demonstrate higher efficiency due to their compactness and compatibility with 2D convolutions, but their performance still trails behind that of BEV-based methods. To eliminate this performance gap while preserving the efficiency of RV-based methods, this study presents an efficient and accurate RV-based 3D object detection framework termed RangePerception. Through meticulous analysis, this study identifies two critical challenges impeding the performance of existing RV-based methods: 1) there exists a natural domain gap between the 3D world coordinate used in output and the 2D range image coordinate used in input, generating difficulty in information extraction from range images; 2) native range images suffer from a vision corruption issue, affecting the detection accuracy of objects located on the margins of the range images. To address the key challenges above, we propose two novel algorithms named Range Aware Kernel (RAK) and Vision Restoration Module (VRM), which facilitate information flow from the range image representation to world-coordinate 3D detection results. With the help of RAK and VRM, our RangePerception achieves 3.25/4.18 higher averaged L1/L2 AP compared to the previous state-of-the-art RV-based method RangeDet on the Waymo Open Dataset. For the first time as an RV-based 3D detection method, RangePerception achieves slightly superior averaged AP compared with the well-known BEV-based method CenterPoint, and the inference speed of RangePerception is 1.3 times as fast as CenterPoint.
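
As background, the generic spherical projection that turns a LiDAR sweep into the range-image representation RV-based detectors operate on; this is the standard formulation, not the paper's Range Aware Kernel or Vision Restoration Module.

    # Spherical projection of a point cloud into a range image (generic formulation).
    import numpy as np

    def to_range_image(points_xyz, h=64, w=2048, fov_up=3.0, fov_down=-25.0):
        x, y, z = points_xyz[:, 0], points_xyz[:, 1], points_xyz[:, 2]
        r = np.linalg.norm(points_xyz, axis=1)
        yaw = np.arctan2(y, x)
        pitch = np.arcsin(np.clip(z / np.maximum(r, 1e-6), -1.0, 1.0))
        fov_up_r, fov_down_r = np.radians(fov_up), np.radians(fov_down)

        u = ((1.0 - (yaw + np.pi) / (2.0 * np.pi)) * w).astype(int) % w
        v = np.clip((fov_up_r - pitch) / (fov_up_r - fov_down_r) * h, 0, h - 1).astype(int)

        image = np.zeros((h, w), dtype=np.float32)
        image[v, u] = r                                   # keep the last point per pixel
        return image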

AAAI Conference 2020 Conference Paper

Functionality Discovery and Prediction of Physical Objects

  • Lei Ji
  • Botian Shi
  • Xianglin Guo
  • Xilin Chen

Functionality is a fundamental attribute of an object which indicates its capability to be used to perform specific actions. It is critical to empower robots with functionality knowledge for discovering appropriate objects for a task, e.g., cutting a cake using a knife. Existing research works have focused on understanding object functionality through human-object interaction from extensively annotated image or video data and are hard to scale up. In this paper, we (1) mine object-functionality knowledge through pattern-based and model-based methods from text, (2) introduce a novel task on physical object functionality prediction, which consumes an image and an action query to predict whether the object in the image can perform the action, and (3) propose a method to leverage the mined functionality knowledge for the new task. Our experimental results show the effectiveness of our methods.
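
A toy pattern-based miner in the spirit of the text-mining step described above, extracting (tool, action) pairs from "<verb> <object> using/with <tool>" phrases; the pattern and corpus are illustrative, not the paper's actual extraction rules.

    # Mine (tool, action) functionality pairs from text with a simple regex pattern.
    import re
    from collections import Counter

    PATTERN = re.compile(
        r"\b(\w+)\s+(?:a |an |the )?(\w+)\s+(?:using|with)\s+(?:a |an |the )?(\w+)\b")

    def mine_functionality(corpus):
        pairs = Counter()
        for sentence in corpus:
            for verb, _obj, tool in PATTERN.findall(sentence.lower()):
                pairs[(tool, verb)] += 1                  # e.g. ("knife", "cut")
        return pairs

    print(mine_functionality(["You can cut the cake using a knife."]))
    # Counter({('knife', 'cut'): 1})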

IJCAI Conference 2019 Conference Paper

Knowledge Aware Semantic Concept Expansion for Image-Text Matching

  • Botian Shi
  • Lei Ji
  • Pan Lu
  • Zhendong Niu
  • Nan Duan

Image-text matching is a vital cross-modality task in artificial intelligence and has attracted increasing attention in recent years. Existing works have shown that learning semantic concepts is useful to enhance image representation and can significantly improve the performance of both image-to-text and text-to-image retrieval. However, existing models simply detect semantic concepts from a given image and are less likely to handle long-tail and occluded concepts. Frequently co-occurring concepts in the same scene, e.g., bedroom and bed, can provide common-sense knowledge to discover other semantically related concepts. In this paper, we develop a Scene Concept Graph (SCG) by aggregating image scene graphs and extracting frequently co-occurring concept pairs as scene common-sense knowledge. Moreover, we propose a novel model that incorporates this knowledge to improve image-text matching. Specifically, semantic concepts are detected from images and then expanded by the SCG. After learning to select relevant contextual concepts, we fuse their representations with the image embedding feature and feed them into the matching module. Extensive experiments are conducted on the Flickr30K and MSCOCO datasets and prove that our model achieves state-of-the-art results due to the effectiveness of incorporating the external SCG.
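
A toy Scene Concept Graph: count concept co-occurrences across scenes and expand the concepts detected in an image with their strongest neighbours. Illustrative only; the paper builds its SCG from image scene-graph annotations and uses learned selection rather than raw counts.

    # Build a co-occurrence SCG and expand detected concepts (illustrative).
    from collections import defaultdict
    from itertools import combinations

    def build_scg(scenes):
        cooc = defaultdict(int)
        for concepts in scenes:                           # each scene: a set of concepts
            for a, b in combinations(sorted(set(concepts)), 2):
                cooc[(a, b)] += 1
        return cooc

    def expand(detected, cooc, top_n=3):
        scores = defaultdict(int)
        for (a, b), count in cooc.items():
            if a in detected and b not in detected:
                scores[b] += count
            elif b in detected and a not in detected:
                scores[a] += count
        return sorted(scores, key=scores.get, reverse=True)[:top_n]

    scg = build_scg([{"bedroom", "bed", "lamp"}, {"bedroom", "bed", "pillow"}])
    print(expand({"bedroom"}, scg))                       # e.g. ['bed', 'lamp', 'pillow']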