
ICRA 2025

E2B: A Single Modality Point-Based Tracker with Event Cameras

Conference Paper · Accepted Paper · Artificial Intelligence · Robotics

Abstract

High-speed object tracking is highly relevant across robotic domains such as drones and autonomous driving. Compared to conventional cameras, event cameras capture object motion at exceptionally high temporal resolution with relatively low power consumption, and they are immune to motion blur. Unfortunately, many existing methods adopt a frame-based approach that stacks events into Event Frames, which discards the sparsity and high temporal resolution of events. Such methods also rely on huge pre-trained backbones and plateau in performance while demanding unrealistically large networks and high power consumption, rendering them impractical for real-time use in battery-constrained robotic scenarios. In this paper, we propose an efficient and effective single-modality tracker based on a Point Cloud representation, named E2B (Event to Box). By directly handling the raw output of event cameras without data-format transformation, E2B leverages the events' coordinate guidance to accurately map Event Cloud features to 2D bounding boxes. Moreover, E2B incorporates a pyramid structure into its multi-stage feature extraction architecture to track objects effectively across diverse scales. In experiments, E2B performs outstandingly on two large-scale and one synthetic event-based tracking datasets, covering both indoor and outdoor environments, as well as rigid and non-rigid objects.
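The abstract's key idea is treating the raw event stream as a point cloud rather than stacking events into frames. As a rough illustration (not the authors' actual preprocessing, whose details are not given here), each event `(x, y, t, p)` can be mapped to a 3D point by normalizing pixel coordinates and the timestamp into a common range, keeping polarity as a per-point feature; the function name `events_to_cloud` and the sampling scheme below are assumptions for this sketch:

```python
import numpy as np

def events_to_cloud(events, width, height, max_points=1024):
    """Convert raw events (x, y, t, p) into a normalized 3D point cloud.

    Each event becomes a point (x, y, t) with all three coordinates
    scaled into [0, 1]; polarity is returned as a per-point feature.
    If more than `max_points` events fall in the window, a random
    subset is kept, preserving the sparsity of the representation.
    """
    ev = np.asarray(events, dtype=np.float64)  # shape (N, 4): x, y, t, p
    if len(ev) > max_points:
        idx = np.random.choice(len(ev), max_points, replace=False)
        ev = ev[idx]
    xyz = np.empty((len(ev), 3))
    xyz[:, 0] = ev[:, 0] / (width - 1)    # x normalized to [0, 1]
    xyz[:, 1] = ev[:, 1] / (height - 1)   # y normalized to [0, 1]
    t = ev[:, 2]
    span = t.max() - t.min()
    xyz[:, 2] = (t - t.min()) / span if span > 0 else 0.0  # t in [0, 1]
    return xyz, ev[:, 3]  # point coordinates and polarity features
```

A point-based backbone (e.g. a PointNet-style network) can then consume this cloud directly, avoiding the dense Event Frame conversion the abstract criticizes.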

Keywords

  • Point cloud compression
  • Power demand
  • Robot kinematics
  • Robot vision systems
  • Stacking
  • Cameras
  • Rendering (computer graphics)
  • Feature extraction
  • Real-time systems
  • Object tracking
  • Dynamic Vision Sensor
  • Point Cloud
  • Low Power Consumption
  • Diverse Scales
  • Cloud Features
  • Tracking Dataset
  • Event Frames
  • Point Cloud Representation
  • Convolutional Neural Network
  • K-nearest Neighbor
  • Pedestrian
  • Intersection Over Union
  • 3D Space
  • Multilayer Perceptron
  • Feature Points
  • Weight Coefficient
  • Temporal Domain
  • Feature Extraction Network
  • Fixed Time Interval
  • Search Region
  • Multi-stage Structure
  • Traditional Cameras
  • Feature Extraction Block
  • Point Cloud Features
  • Template Feature
  • Prediction Box
  • Template Region
  • Structural Hierarchy
  • Global Feature Extraction

Context

Venue
IEEE International Conference on Robotics and Automation
Archive span
1984-2025
Indexed papers
30179
Paper id
129679758423960663