Author name cluster

Jiajun Deng

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

8 papers

2 author rows

ICML Conference 2025 Conference Paper

Hierarchical Masked Autoregressive Models with Low-Resolution Token Pivots

Guangting Zheng
Yehao Li
Yingwei Pan
Jiajun Deng
Ting Yao 0003
Yanyong Zhang
Tao Mei 0001

Autoregressive models have emerged as a powerful generative paradigm for visual generation. The current de facto standard of next-token prediction commonly operates over a single-scale sequence of dense image tokens and cannot exploit global context, especially when predicting early tokens. In this paper, we introduce a new autoregressive design that models a hierarchy from a few low-resolution image tokens to the typical dense image tokens, and we explore the hierarchical dependency across multi-scale image tokens. Technically, we present Hierarchical Masked Autoregressive models (Hi-MAR), which pivot on low-resolution image tokens to trigger hierarchical autoregressive modeling in a multi-phase manner. In the first phase, Hi-MAR learns to predict a few low-resolution image tokens, which function as intermediary pivots that reflect global structure. These pivots act as additional guidance that strengthens the next autoregressive modeling phase by shaping the global structural awareness of the typical dense image tokens. A new Diffusion Transformer head is further devised to amplify the global context among all tokens for mask token prediction. Extensive evaluations on both class-conditional and text-to-image generation tasks demonstrate that Hi-MAR outperforms typical AR baselines while requiring lower computational cost.


IJCAI Conference 2025 Conference Paper

Self-Classification Enhancement and Correction for Weakly Supervised Object Detection

Yufei Yin
Lechao Cheng
Wengang Zhou
Jiajun Deng
Zhou Yu
Houqiang Li

In recent years, weakly supervised object detection (WSOD) has attracted much attention due to its low labeling cost. The success of recent WSOD models is often ascribed to the two-stage multi-class classification (MCC) task, i.e., multiple instance learning and online classification refinement. Despite achieving non-trivial progress, these methods overlook potential classification ambiguities between these two MCC tasks and fail to leverage their unique strengths. In this work, we introduce a novel WSOD framework to address these two issues. First, we propose a self-classification enhancement module that integrates intra-class binary classification (ICBC) to bridge the gap between the two distinct MCC tasks. The ICBC task enhances the network’s discrimination between positive and mis-located samples in a class-wise manner and forges a mutually reinforcing relationship with the MCC task. Second, we propose a self-classification correction algorithm for inference, which combines the results of both MCC tasks to effectively reduce misclassified predictions. Extensive experiments on the prevalent VOC 2007 and 2012 datasets demonstrate the superior performance of our framework.


AAAI Conference 2024 Conference Paper

Cycle-Consistency Learning for Captioning and Grounding

Ning Wang
Jiajun Deng
Mingbo Jia

We show that visual grounding and image captioning, which act as two mutually inverse processes, can be bridged for collaborative training through careful design. Building on this idea, we introduce CyCo, a cycle-consistent learning framework that improves on the independent training pipelines of visual grounding and image captioning. The proposed framework (1) allows semi-weakly supervised training of visual grounding; (2) improves the performance of fully supervised visual grounding; and (3) yields a general captioning model that can describe arbitrary image regions. Extensive experiments show that our fully supervised grounding model achieves state-of-the-art performance, and that the semi-weakly supervised one is competitive with its fully supervised counterparts. Our image captioning model can freely describe image regions while also showing impressive performance on prevalent captioning benchmarks.


AAAI Conference 2024 Conference Paper

Revisiting Open-Set Panoptic Segmentation

Yufei Yin
Hao Chen
Wengang Zhou
Jiajun Deng
Haiming Xu
Houqiang Li

In this paper, we focus on the open-set panoptic segmentation (OPS) task to circumvent the data explosion problem. Unlike the closed-set setting, OPS aims to detect both known and unknown categories, where the latter are not annotated during training. Unlike existing work that selects only a few common categories as unknown ones, we move toward the real-world scenario by considering the various tail categories (~1k). To this end, we first build a new dataset with a long-tail distribution for the OPS task. Based on this dataset, we add a new class type for unknown classes and redefine the training annotations to make the OPS definition more complete and reasonable. Moreover, we analyze the influence of several significant factors in the OPS task and explore the upper bound of performance on unknown classes under different settings. Based on these analyses, we design an effective two-phase framework for the OPS task, comprising thing-agnostic map generation and unknown segment mining. We further adopt semi-supervised learning to improve OPS performance. Experimental results on different datasets validate the effectiveness of our method.


NeurIPS Conference 2023 Conference Paper

CLIP4HOI: Towards Adapting CLIP for Practical Zero-Shot HOI Detection

Yunyao Mao
Jiajun Deng
Wengang Zhou
Li Li
Yao Fang
Houqiang Li

Zero-shot Human-Object Interaction (HOI) detection aims to identify both seen and unseen HOI categories. A strong zero-shot HOI detector should not only discriminate novel interactions but also be robust to the positional distribution discrepancy between seen and unseen categories when locating human-object pairs. However, top-performing zero-shot HOI detectors rely on seen and predefined unseen categories to distill knowledge from CLIP and jointly locate human-object pairs without considering the potential positional distribution discrepancy, leading to impaired transferability. In this paper, we introduce CLIP4HOI, a novel framework for zero-shot HOI detection. CLIP4HOI is built on the vision-language model CLIP and addresses the above issues in two ways. First, to prevent the model from overfitting to the joint positional distribution of seen human-object pairs, we tackle zero-shot HOI detection in a disentangled two-stage paradigm. Specifically, humans and objects are identified independently, and all feasible human-object pairs are processed by a Human-Object interactor for pairwise proposal generation. Second, to facilitate better transferability, the CLIP model is adapted into a fine-grained HOI classifier for proposal discrimination, avoiding data-sensitive knowledge distillation. Finally, experiments on prevalent benchmarks show that CLIP4HOI outperforms previous approaches on both rare and unseen categories and sets a series of state-of-the-art records under a variety of zero-shot settings.


NeurIPS Conference 2023 Conference Paper

CluB: Cluster Meets BEV for LiDAR-Based 3D Object Detection

Yingjie Wang
Jiajun Deng
Yuenan Hou
Yao Li
Yu Zhang
Jianmin Ji
Wanli Ouyang
Yanyong Zhang

Currently, LiDAR-based 3D detectors are broadly categorized into two groups, namely, BEV-based detectors and cluster-based detectors. BEV-based detectors capture contextual information from the Bird's Eye View (BEV) and fill their center voxels via feature diffusion with a stack of convolution layers, which, however, weakens the capability of representing an object by its center point. Cluster-based detectors, on the other hand, exploit a voting mechanism and aggregate the foreground points into object-centric clusters for further prediction. In this paper, we explore how to effectively combine these two complementary representations into a unified framework. Specifically, we propose a new 3D object detection framework, referred to as CluB, which incorporates an auxiliary cluster-based branch into the BEV-based detector by enriching the object representation at both the feature and query levels. Technically, CluB consists of two steps. First, we construct a cluster feature diffusion module to establish the association between cluster features and BEV features in a subtle and adaptive fashion. Based on that, an imitation loss is introduced to distill object-centric knowledge from the cluster features to the BEV features. Second, we design a cluster query generation module to leverage the voting centers directly from the cluster branch, thus enriching the diversity of object queries. Meanwhile, a direction loss is employed to encourage a more accurate voting center for each cluster. Extensive experiments are conducted on the Waymo and nuScenes datasets, and CluB achieves state-of-the-art performance on both benchmarks.


AAAI Conference 2021 Conference Paper

Instance Mining with Class Feature Banks for Weakly Supervised Object Detection

Yufei Yin
Jiajun Deng
Wengang Zhou
Houqiang Li

Recent progress on weakly supervised object detection (WSOD) is characterized by formulating WSOD as a Multiple Instance Learning (MIL) problem and performing online refinement with the region proposals selected by MIL. However, MIL tends to select the most discriminative part rather than the entire instance as the top-scoring region proposal, which leads to weak localization capability for weakly supervised object detectors. We attribute this problem to the limited intra-class diversity within a single image. Specifically, due to the lack of annotated bounding boxes, the network tends to focus on the most common parts of each class and neglect the diverse parts of objects. To solve the problem, we introduce a novel Instance Mining with Class Feature Banks (IM-CFB) framework, which includes a Class Feature Banks (CFB) module and a Feature Guided Instance Mining (FGIM) algorithm. Concretely, the CFB module consists of sub-banks for each class, which are used to collect diversity information from a broader view. At the training stage, the RoI features of reliable region proposals are recorded and updated in the CFB. Then, FGIM leverages the features recorded in the CFB to improve the region proposal selection of the MIL branch. Extensive experiments conducted on two publicly available datasets, Pascal VOC 2007 and 2012, demonstrate the effectiveness of our method. More remarkably, our method achieves 54.3% mAP and 70.7% CorLoc on Pascal VOC 2007. When further re-trained with a Fast-RCNN detector, we obtain the best mAP and CorLoc reported to date, 55.8% and 72.2%, respectively.
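
The CorLoc figure quoted in this abstract is commonly computed as the fraction of images in which the top-scoring predicted box for a class overlaps some ground-truth box of that class with IoU of at least 0.5. The sketch below illustrates that common definition only; it is not taken from the paper, and the function names, the `(x1, y1, x2, y2)` box format, and the 0.5 default are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def corloc(top_boxes, gt_boxes, thresh=0.5):
    """Fraction of images whose top-scoring box matches any ground-truth box.

    top_boxes: one predicted box per image (hypothetical input layout).
    gt_boxes:  one list of ground-truth boxes per image, same order.
    """
    if not top_boxes:
        return 0.0
    hits = sum(
        1
        for top, gts in zip(top_boxes, gt_boxes)
        if any(iou(top, gt) >= thresh for gt in gts)
    )
    return hits / len(top_boxes)
```

In practice the score is computed per class over the images that contain that class and then averaged across classes; the sketch covers a single class.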


AAAI Conference 2021 Conference Paper

Voxel R-CNN: Towards High Performance Voxel-based 3D Object Detection

Jiajun Deng
Shaoshuai Shi
Peiwei Li
Wengang Zhou
Yanyong Zhang
Houqiang Li

Recent advances in 3D object detection rely heavily on how the 3D data are represented, i.e., voxel-based or point-based representation. Many existing high-performance 3D detectors are point-based because this structure can better retain precise point positions. Nevertheless, point-level features lead to high computation overheads due to unordered storage. In contrast, the voxel-based structure is better suited for feature extraction but often yields lower accuracy because the input data are divided into grids. In this paper, we take a slightly different viewpoint: we find that precise positioning of raw points is not essential for high-performance 3D object detection and that coarse voxel granularity can also offer sufficient detection accuracy. Bearing this view in mind, we devise a simple but effective voxel-based framework, named Voxel R-CNN. By taking full advantage of voxel features in a two-stage approach, our method achieves detection accuracy comparable to state-of-the-art point-based models, but at a fraction of the computation cost. Voxel R-CNN consists of a 3D backbone network, a 2D bird's-eye-view (BEV) Region Proposal Network, and a detection head. A voxel RoI pooling operation is devised to extract RoI features directly from voxel features for further refinement. Extensive experiments are conducted on the widely used KITTI Dataset and the more recent Waymo Open Dataset. Our results show that, compared to existing voxel-based methods, Voxel R-CNN delivers higher detection accuracy while maintaining a real-time frame processing rate, i.e., a speed of 25 FPS on an NVIDIA RTX 2080 Ti GPU. The code is available at https://github.com/djiajunustc/Voxel-R-CNN.
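
The abstract's remark that voxel-based methods divide the input data into grids can be illustrated with a generic voxelization step. The sketch below is not the Voxel R-CNN implementation: the function name `voxelize`, the mean-pooling of points per voxel, and the `origin` parameter are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def voxelize(points, voxel_size, origin=(0.0, 0.0, 0.0)):
    """Group 3D points into a regular grid and average the points per voxel.

    points:     iterable of (x, y, z) tuples.
    voxel_size: (sx, sy, sz) edge lengths of one voxel.
    origin:     grid corner that maps to voxel index (0, 0, 0).
    Returns a dict mapping integer voxel index -> mean point in that voxel.
    """
    buckets = defaultdict(list)
    for point in points:
        # Integer grid coordinate of the voxel containing this point.
        index = tuple(
            math.floor((coord - start) / size)
            for coord, start, size in zip(point, origin, voxel_size)
        )
        buckets[index].append(point)
    # Mean-pool each voxel: exact point positions are discarded at this stage.
    return {
        index: tuple(sum(axis) / len(pts) for axis in zip(*pts))
        for index, pts in buckets.items()
    }
```

Averaging is only one way to summarize a voxel; it makes visible the loss of precise point positions that the abstract argues is acceptable for detection.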
