Author name cluster

Ruize Han

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

4 papers
2 author rows

Possible papers (4)

NeurIPS 2025 Conference Paper

DOVTrack: Data-Efficient Open-Vocabulary Tracking

  • Zekun Qian
  • Ruize Han
  • Zhixiang Wang
  • Junhui Hou
  • Wei Feng

Open-Vocabulary Multi-Object Tracking (OVMOT) aims to detect and track objects across many categories, including categories not seen during training. A significant challenge in this domain is the lack of large-scale annotated video data for training. To address this challenge, this work aims to effectively train an OV tracker using only the existing limited and sparsely annotated video data. We propose a comprehensive training-sample-space expansion strategy that addresses the fundamental limitation of sparse annotations in OVMOT training. Specifically, for the association task, we develop a diffusion-based feature generation framework that synthesizes intermediate object features between sparsely annotated frames, expanding the training sample space by approximately 3× and enabling robust association learning from temporally continuous features. For the detection task, we introduce a dynamic group contrastive learning approach that generates diverse sample groups through affinity, dispersion, and adversarial grouping strategies, tripling the effective training samples for classification while maintaining sample quality. Additionally, we propose an adaptive localization loss that expands positive-sample coverage by lowering IoU thresholds while mitigating noise through confidence-based weighting. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the OVMOT benchmark, surpassing existing methods by 3.8% in the TETA metric without requiring additional data or annotations. The code will be available at https://github.com/zekunqian/DOVTrack.
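The adaptive localization loss described in the abstract is concrete enough to sketch. Below is a minimal, hypothetical PyTorch rendering of the idea (not the authors' code; the function name, the threshold value, and the L1 form are illustrative assumptions): positives are selected at a lower-than-usual IoU threshold, and each positive's box loss is down-weighted by the predicted confidence to suppress the noise that the looser threshold admits.

```python
import torch

def adaptive_loc_loss(pred_boxes, pred_scores, gt_boxes, ious, iou_thresh=0.4):
    """Confidence-weighted box regression over an expanded positive set.

    pred_boxes: (N, 4) predicted boxes; gt_boxes: (N, 4) matched ground truth.
    pred_scores: (N,) predicted confidences; ious: (N,) IoU with matched GT.
    iou_thresh is deliberately below the common 0.5 to admit more positives.
    """
    pos = ious >= iou_thresh                        # expanded positive set
    if not pos.any():
        return pred_boxes.sum() * 0.0               # keeps the graph, zero loss
    l1 = (pred_boxes[pos] - gt_boxes[pos]).abs().sum(dim=-1)  # per-box L1
    w = pred_scores[pos].detach()                   # confidence-based weighting
    return (w * l1).sum() / w.sum().clamp(min=1e-6)
```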

NeurIPS 2024 Conference Paper

OVT-B: A New Large-Scale Benchmark for Open-Vocabulary Multi-Object Tracking

  • Haiji Liang
  • Ruize Han

Open-vocabulary object perception has become an important topic in artificial intelligence; it aims to identify objects of novel classes that were not seen during training. Under this setting, open-vocabulary object detection (OVD) in single images has been widely studied, whereas open-vocabulary object tracking (OVT) in videos has received far less attention, partly because of the shortage of benchmarks. In this work, we build a new large-scale benchmark for open-vocabulary multi-object tracking, namely OVT-B. OVT-B contains 1,048 object categories and 1,973 videos with 637,608 bounding-box annotations, which is much larger than the only existing open-vocabulary tracking dataset, i.e., the OVTAO-val dataset (200+ categories, 900+ videos). The proposed OVT-B can serve as a new benchmark to pave the way for OVT research. We also develop a simple yet effective baseline method for OVT. It integrates motion features for object tracking, an important cue for MOT that is ignored by previous OVT methods. Experimental results verify the usefulness of the proposed benchmark and the effectiveness of our method. The benchmark is publicly available at https://github.com/Coo1Sea/OVT-B-Dataset.
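A minimal sketch of the kind of motion-aware association the baseline description suggests (not the released code; `associate`, the fusion weight `alpha`, and `min_score` are illustrative assumptions): appearance similarity is fused with a motion cue, here the IoU between detections and motion-predicted track boxes, and the fused score matrix is solved with the Hungarian algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(app_sim, motion_iou, alpha=0.5, min_score=0.3):
    """app_sim, motion_iou: (num_tracks, num_dets) matrices in [0, 1].

    motion_iou would come from IoU between current detections and boxes
    predicted by a per-track motion model (e.g., a Kalman filter).
    """
    fused = alpha * app_sim + (1.0 - alpha) * motion_iou
    rows, cols = linear_sum_assignment(-fused)      # maximize fused similarity
    return [(r, c) for r, c in zip(rows, cols) if fused[r, c] >= min_score]
```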

ICRA 2024 Conference Paper

Robust Collaborative Perception without External Localization and Clock Devices

  • Zixing Lei
  • Zhenyang Ni
  • Ruize Han
  • Shuo Tang
  • Chen Feng 0002
  • Siheng Chen
  • Yanfeng Wang 0001

Consistent spatial-temporal coordination across multiple agents is fundamental for collaborative perception, which seeks to improve perception abilities through information exchange among agents. To achieve this spatial-temporal alignment, traditional methods depend on external devices to provide localization and clock signals. However, hardware-generated signals can be vulnerable to noise and potentially malicious attacks, jeopardizing the precision of spatial-temporal alignment. Rather than relying on external hardware, this work proposes a novel approach: aligning agents by recognizing the inherent geometric patterns within their perceptual data. Following this spirit, we propose a robust collaborative perception system that operates independently of external localization and clock devices. The key module of our system, FreeAlign, constructs a salient object graph for each agent based on its detected boxes and uses a graph neural network to identify common subgraphs between agents, yielding accurate relative pose and time alignment. We validate FreeAlign on both real-world and simulated datasets. The results show that the FreeAlign-empowered robust collaborative perception system performs comparably to systems that rely on precise localization and clock devices. Code will be released.
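The final step of such a pipeline, recovering relative pose from matched objects, has a standard closed form worth sketching. This is not FreeAlign itself (the salient-object-graph construction and GNN subgraph matching are omitted; the function name is an assumption): given the centers of the same objects expressed in two agents' frames, a 2D rotation and translation follow from a Procrustes/Kabsch alignment.

```python
import numpy as np

def relative_pose_2d(pts_a, pts_b):
    """pts_a, pts_b: (N, 2) centers of matched objects in each agent's frame.

    Returns (R, t) such that pts_b ≈ pts_a @ R.T + t.
    """
    ca, cb = pts_a.mean(axis=0), pts_b.mean(axis=0)
    H = (pts_a - ca).T @ (pts_b - cb)               # 2x2 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = cb - R @ ca
    return R, t
```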

AAAI 2020 Conference Paper

Complementary-View Multiple Human Tracking

  • Ruize Han
  • Wei Feng
  • Jiewen Zhao
  • Zicheng Niu
  • Yujun Zhang
  • Liang Wan
  • Song Wang

The global trajectories of targets on the ground can be well captured from a top view at high altitude, e.g., by a drone-mounted camera, while their local detailed appearances can be better recorded from horizontal views, e.g., by a helmet camera worn by a person. This paper studies a new problem: multiple human tracking from a pair of top-view and horizontal-view videos taken at the same time. Our goal is to track the humans in both views and identify the same person across the two complementary views frame by frame, which is very challenging due to the very large difference between the two fields of view. In this paper, we model the data similarity within each view using appearance and motion reasoning, and across views using appearance and spatial reasoning. Combining them, we formulate the proposed multiple human tracking as a joint optimization problem, which can be solved by constrained integer programming. We collect a new dataset consisting of top-view and horizontal-view video pairs for performance evaluation, and the experimental results show the effectiveness of the proposed method.
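The cross-view matching step can be posed as a small binary integer program. The following is a hypothetical reduction (not the paper's full formulation, which jointly optimizes tracking within and across views; the similarity matrix and the use of SciPy's MILP solver are assumptions): maximize total cross-view similarity subject to each track in either view being matched at most once.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def match_across_views(sim):
    """sim: (n_top, n_horiz) cross-view similarity; returns matched index pairs."""
    n, m = sim.shape
    c = -sim.ravel()                                # milp minimizes, so negate
    A = np.zeros((n + m, n * m))
    for i in range(n):                              # top-view track i used <= once
        A[i, i * m:(i + 1) * m] = 1
    for j in range(m):                              # horizontal track j used <= once
        A[n + j, j::m] = 1
    res = milp(c,
               constraints=LinearConstraint(A, ub=np.ones(n + m)),
               integrality=np.ones(n * m),          # all variables integer
               bounds=Bounds(0, 1))                 # ...and binary
    x = res.x.reshape(n, m).round().astype(bool)
    return list(zip(*np.nonzero(x)))
```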