Arrow Research Search

Author name cluster

Qingchao Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

11 papers
2 author rows

Possible papers (11)

ICRA 2025 · Conference Paper

CubeDN: Real-Time Drone Detection in 3D Space from Dual mmWave Radar Cubes

  • Yuan Fang
  • Fangzhan Shi
  • Xijia Wei
  • Qingchao Chen
  • Kevin Chetty
  • Simon Julier

As drone use has become more widespread, there is a critical need to ensure safety and security, and a key element of this is robust and accurate drone detection and localization. While cameras and other optical sensors such as LiDAR are commonly used for object detection, their performance degrades under adverse lighting and environmental conditions. This has generated interest in more reliable alternatives, such as millimeter-wave (mmWave) radar. Recent research on mmWave radar object detection has predominantly focused on 2D detection of road users. Although these systems demonstrate excellent performance for 2D problems, they lack the sensing capability to measure elevation, which is essential for 3D drone detection. To address this gap, we propose CubeDN, a single-stage end-to-end radar object detection network specifically designed for flying drones. CubeDN overcomes challenges such as poor elevation resolution by utilizing a dual radar configuration and a novel deep learning pipeline. It simultaneously detects, localizes, and classifies drones of two sizes, achieving decimeter-level tracking accuracy at closer ranges with an overall 95% average precision (AP) and 85% average recall (AR). Furthermore, CubeDN completes data processing and inference at 10 Hz, making it highly suitable for practical applications.
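
As a rough illustration of the dual-cube idea in this abstract, here is a minimal PyTorch sketch (not the authors' implementation; the cube resolution, network sizes, and the single-target head are all assumptions) of a shared encoder applied to two radar cubes whose features are fused for joint 3D localization and classification:

```python
# Minimal sketch (assumptions throughout): fuse two mmWave radar cubes
# with a shared 3D-CNN encoder and a simple single-target head.
import torch
import torch.nn as nn

class DualCubeDetector(nn.Module):
    def __init__(self, num_classes=2):  # two drone sizes, per the abstract
        super().__init__()
        # Shared encoder applied to each radar cube independently
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        # Fused features -> 3D position regression + class logits
        self.head = nn.Linear(64, 3 + num_classes)

    def forward(self, cube_a, cube_b):
        feat = torch.cat([self.encoder(cube_a), self.encoder(cube_b)], dim=1)
        out = self.head(feat)
        return out[:, :3], out[:, 3:]  # (xyz, class logits)

model = DualCubeDetector()
a = torch.randn(4, 1, 32, 32, 16)  # hypothetical cube resolution
b = torch.randn(4, 1, 32, 32, 16)
xyz, logits = model(a, b)
print(xyz.shape, logits.shape)  # (4, 3), (4, 2)
```

A real single-stage detector would predict dense per-cell outputs rather than a single target; the sketch only shows the dual-encoder fusion pattern.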

AAAI 2025 · Conference Paper

Joint Class-level and Instance-level Relationship Modeling for Novel Class Discovery

  • Jiaying Zhou
  • Qingchao Chen

Novel class discovery (NCD) aims to cluster unlabeled data with the help of a labeled set containing different but related classes. The key to solving NCD is knowledge transfer between the labeled and unlabeled sets. Since NCD requires that known and unknown classes be related, exploring class-level relationships between them is important for more effective knowledge transfer. However, most existing methods facilitate knowledge transfer either by learning a shared representation space or by modeling coarse-grained or asymmetric relationships between known and unknown classes, neglecting class-level relationships. To tackle these challenges, we propose a symmetric class-to-class relationship modeling and knowledge transfer method that achieves bidirectional knowledge transfer at the class level. Because class-level modeling often overlooks subtle distinctions between samples, we also propose pairwise similarity-based relationship modeling with a consistency constraint for instance-level knowledge transfer. Extensive experiments on CIFAR100 and three fine-grained datasets demonstrate that our method achieves significant performance improvements over state-of-the-art methods.
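
The instance-level component lends itself to a compact sketch. Below is a minimal, assumed form (not the paper's code) of a pairwise similarity-based consistency loss: the similarity structure among samples should agree across two augmented views of the same batch.

```python
# Minimal sketch (an assumption, not the paper's implementation) of an
# instance-level pairwise-similarity consistency loss.
import torch
import torch.nn.functional as F

def pairwise_consistency_loss(z1, z2):
    """z1, z2: (N, D) features of two augmented views of the same batch."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    s1 = z1 @ z1.t()  # (N, N) cosine-similarity matrix, view 1
    s2 = z2 @ z2.t()  # (N, N) cosine-similarity matrix, view 2
    return F.mse_loss(s1, s2)  # views should induce the same structure

z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(pairwise_consistency_loss(z1, z2).item())
```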

IJCAI 2024 · Conference Paper

3D Vision and Language Pretraining with Large-Scale Synthetic Data

  • Dejie Yang
  • Zhu Xu
  • Wentao Mo
  • Qingchao Chen
  • Siyuan Huang
  • Yang Liu

3D Vision-Language Pre-training (3D-VLP) aims to provide a pre-trained model that can bridge 3D scenes with natural language, an important technique for embodied intelligence. However, current 3D-VLP datasets are hindered by limited scene-level diversity and insufficient fine-grained annotations (only 1.2K scenes and 280K textual annotations in ScanScribe), primarily due to the labor-intensive collection and annotation of 3D scenes. To overcome these obstacles, we construct SynVL3D, a comprehensive synthetic scene-text corpus with 10K indoor scenes and 1M descriptions at the object, view, and room levels, which offers diverse scene data, rich textual descriptions, multi-grained 3D-text associations, and low collection cost. Utilizing the rich annotations in SynVL3D, we pre-train a simple and unified Transformer for aligning 3D and language with multi-grained pretraining tasks. Moreover, we propose a synthetic-to-real domain adaptation in the downstream fine-tuning process to address the domain shift. Through extensive experiments, we verify the effectiveness of our model design by achieving state-of-the-art performance on downstream tasks including visual grounding, dense captioning, and question answering. Code is available at: https://github.com/idejie/3DSyn
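
One plausible form of the alignment objective, sketched under the assumption that it is a symmetric contrastive (InfoNCE) loss over matched 3D-scene and text features (illustrative only, not the released pipeline; object-, view-, and room-level pairs would be handled the same way here):

```python
# Minimal sketch (assumed form) of a symmetric 3D-text contrastive loss.
import torch
import torch.nn.functional as F

def align_loss(scene_feats, text_feats, temperature=0.07):
    """scene_feats, text_feats: (N, D); row i of each side is a matched pair."""
    s = F.normalize(scene_feats, dim=1)
    t = F.normalize(text_feats, dim=1)
    logits = s @ t.t() / temperature     # (N, N) similarity matrix
    targets = torch.arange(len(s))       # diagonal pairs are positives
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

loss = align_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```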

AAAI 2024 · Conference Paper

Novel Class Discovery in Chest X-rays via Paired Images and Text

  • Jiaying Zhou
  • Yang Liu
  • Qingchao Chen

Novel class discovery (NCD) aims to identify new classes not defined during the model training phase with the help of knowledge of known classes. Many methods have been proposed and have notably boosted the performance of NCD on natural images. However, no work has been done on discovering new classes from medical images and disease categories, which is crucial for understanding and diagnosing specific diseases. Moreover, most existing methods only utilize information from the image modality and use labels as the only supervisory information. In this paper, we propose a multi-modal novel class discovery method based on paired images and text, inspired by the low classification accuracy of chest X-ray images and the relatively higher accuracy of the paired text. Specifically, we first pretrain the image encoder and text encoder with multi-modal contrastive learning on the entire dataset, and then generate pseudo-labels separately on the image and text branches. We utilize intra-modal consistency to assess the quality of the pseudo-labels and adjust the weights of the pseudo-labels from both branches to generate the final pseudo-labels for training. Experiments on eight subset splits of the MIMIC-CXR-JPG dataset show that our method improves the clustering performance on unlabeled classes by about 10% on average compared to state-of-the-art methods. Code is available at: https://github.com/zzzzzzzzjy/MMNCD-main.
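
The pseudo-label fusion step admits a simple sketch. The following is an assumed form (the consistency estimates and the convex weighting rule are illustrative, not the paper's exact scheme):

```python
# Minimal sketch (assumptions throughout) of fusing image- and text-branch
# pseudo-labels, weighting each branch by an intra-modal consistency score
# (e.g. agreement of cluster assignments across augmentations).
import torch

def fuse_pseudo_labels(p_img, p_txt, cons_img, cons_txt):
    """p_img, p_txt: (N, C) soft pseudo-labels from each branch.
    cons_img, cons_txt: scalar consistency estimates in [0, 1]."""
    w_img = cons_img / (cons_img + cons_txt + 1e-8)
    fused = w_img * p_img + (1.0 - w_img) * p_txt
    return fused.argmax(dim=1)  # hard pseudo-labels for training

p_img = torch.softmax(torch.randn(16, 5), dim=1)
p_txt = torch.softmax(torch.randn(16, 5), dim=1)
labels = fuse_pseudo_labels(p_img, p_txt, cons_img=0.6, cons_txt=0.8)
print(labels.shape)  # (16,)
```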

ICML 2024 · Conference Paper

Semantic-Aware Human Object Interaction Image Generation

  • Zhu Xu
  • Qingchao Chen
  • Yuxin Peng 0001
  • Yang Liu 0105

Recent text-to-image generative models have demonstrated remarkable abilities in generating realistic images. Despite their great success, these models struggle to generate high-fidelity images from prompts oriented toward human-object interaction (HOI). The difficulty of HOI generation arises from two aspects. First, the complexity and diversity of human poses challenge plausible human generation. Furthermore, untrustworthy generation of interaction boundary regions may lead to deficient HOI semantics. To tackle these problems, we propose a Semantic-Aware HOI generation framework, SA-HOI. It utilizes human pose quality and interaction boundary region information as guidance for the denoising process, thereby encouraging refinement in these regions to produce more plausible HOI images. On this basis, we establish an iterative inversion and image refinement pipeline to continually enhance generation quality. Further, we introduce a comprehensive benchmark for HOI generation, which comprises a dataset covering diverse and fine-grained HOI categories, along with multiple custom-tailored evaluation metrics for HOI generation. Experiments demonstrate that our method significantly improves generation quality under both HOI-specific and conventional image evaluation metrics. Code is available at https://github.com/XZPKU/SA-HOI.git

AAAI 2024 · Conference Paper

Semantic-Guided Novel Category Discovery

  • Weishuai Wang
  • Ting Lei
  • Qingchao Chen
  • Yang Liu

The Novel Category Discovery problem aims to cluster an unlabeled set with the help of a labeled set consisting of disjoint but related classes. However, existing models treat class names as discrete one-hot labels and ignore the semantic understanding of these classes. In this paper, we propose a new setting named Semantic-guided Novel Category Discovery (SNCD), which requires the model to not only cluster the unlabeled images but also semantically recognize these images based on a set of their class names. The first challenge we confront pertains to effectively leveraging the class names of unlabeled images, given the inherent gap between the visual and linguistic domains. To address this issue, we incorporate a semantic-aware recognition mechanism. This is achieved by constructing dynamic class-wise visual prototypes as well as a semantic similarity matrix that enables the projection of visual features into the semantic space. The second challenge originates from the granularity disparity between the classification and clustering tasks. To deal with this, we develop a semantic-aware clustering process to facilitate the exchange of knowledge between the two tasks. Through extensive experiments, we demonstrate the mutual benefits of the recognition and clustering tasks, which can be jointly optimized. Experimental results on multiple datasets confirm the effectiveness of our proposed method. Our code is available at https://github.com/wang-weishuai/Semantic-guided-NCD.
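
A minimal sketch of the projection into semantic space, assuming class names are embedded with a text encoder and compared to visual features by cosine similarity (illustrative only, not the SNCD implementation):

```python
# Minimal sketch (an assumption): score visual features against candidate
# class-name embeddings via a semantic similarity matrix.
import torch
import torch.nn.functional as F

def semantic_logits(visual_feats, class_name_embs, temperature=0.05):
    """visual_feats: (N, D) image features already mapped to the text
    embedding dimension; class_name_embs: (C, D) class-name embeddings."""
    v = F.normalize(visual_feats, dim=1)
    c = F.normalize(class_name_embs, dim=1)
    return v @ c.t() / temperature  # (N, C) semantic similarity matrix

logits = semantic_logits(torch.randn(8, 512), torch.randn(10, 512))
print(logits.argmax(dim=1))  # predicted class names for unlabeled images
```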

AAAI 2023 · Conference Paper

Phrase-Level Temporal Relationship Mining for Temporal Sentence Localization

  • Minghang Zheng
  • Sizhe Li
  • Qingchao Chen
  • Yuxin Peng
  • Yang Liu

In this paper, we address the problem of video temporal sentence localization, which aims to localize a target moment in a video according to a given language query. We observe that existing models suffer a sharp performance drop when dealing with simple phrases contained in the sentence. This reveals that existing models merely capture the annotation biases of the datasets and lack sufficient understanding of the semantic phrases in the query. To address this problem, we propose a phrase-level Temporal Relationship Mining (TRM) framework that exploits the temporal relationships relevant to each phrase and to the whole sentence, in order to better understand each semantic entity in the sentence. Specifically, we use phrase-level predictions to refine the sentence-level prediction, and use Multiple Instance Learning to improve the quality of the phrase-level predictions. We also exploit consistency and exclusiveness constraints between phrase-level and sentence-level predictions to regularize training, thus alleviating the ambiguity of each phrase prediction. The proposed approach sheds light on how machines can understand detailed phrases in a sentence and their compositions in general, rather than learning the annotation biases. Experiments on the ActivityNet Captions and Charades-STA datasets show the effectiveness of our method on both phrase and sentence temporal localization, and demonstrate better model interpretability and generalization when dealing with unseen compositions of seen concepts. Code can be found at https://github.com/minghangz/TRM.
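
To make the refinement and MIL ideas concrete, here is a minimal sketch under the assumption that each phrase and the full sentence produce per-frame relevance curves (function names and the aggregation rule are illustrative, not the TRM code):

```python
# Minimal sketch (assumed form) of phrase-level refinement plus a
# video-level Multiple Instance Learning objective for phrases.
import torch

def refine_sentence_scores(sentence_scores, phrase_scores, alpha=0.5):
    """sentence_scores: (T,) per-frame relevance for the whole query.
    phrase_scores: (P, T) per-frame relevance for each of P phrases."""
    phrase_agg = phrase_scores.mean(dim=0)  # phrases should co-occur
    return alpha * sentence_scores + (1 - alpha) * phrase_agg

def mil_loss(phrase_scores):
    """Each phrase should match somewhere in a positive video, so we
    maximize its best per-frame score (MIL with video-level labels)."""
    best = phrase_scores.max(dim=1).values  # (P,)
    return -torch.log(torch.sigmoid(best)).mean()

s = torch.rand(100)     # hypothetical 100-frame video
p = torch.rand(3, 100)  # 3 phrases extracted from the query
print(refine_sentence_scores(s, p).shape, mil_loss(p).item())
```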

AAAI 2022 · Conference Paper

Weakly Supervised Video Moment Localization with Contrastive Negative Sample Mining

  • Minghang Zheng
  • Yanjie Huang
  • Qingchao Chen
  • Yang Liu

Video moment localization aims to localize the video segment most related to a given free-form natural language query. The weakly supervised setting, where only a video-level description is available during training, is attracting increasing attention due to its lower annotation cost. Prior weakly supervised methods mainly use sliding windows to generate temporal proposals, which are independent of the video content and of low quality, and train the model to distinguish matched video-query pairs from unmatched pairs collected from different videos, neglecting that what the model really needs is to distinguish the unaligned segments within the same video. In this work, we propose a novel weakly supervised solution by introducing Contrastive Negative sample Mining (CNM). Specifically, we use a learnable Gaussian mask to generate positive samples, highlighting the video frames most related to the query, and consider the other frames of the video and the whole video as easy and hard negative samples, respectively. We then train our network with an Intra-Video Contrastive loss to make the positive and negative samples more discriminative. Our method has two advantages: (1) our proposal generation process with a learnable Gaussian mask is more efficient and yields higher-quality positive samples; (2) the more difficult intra-video negative samples enable our model to distinguish highly confusing scenes. Experiments on two datasets show the effectiveness of our method. Code can be found at https://github.com/minghangz/cnm.
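
The learnable Gaussian mask is easy to sketch. Below is an assumed, minimal form (in training, the center and width would be predicted by the network from the query and video rather than fixed as here):

```python
# Minimal sketch (not the CNM release): a Gaussian mask over frames
# defines the positive sample; unmasked frames are easy negatives and
# the whole video is the hard negative, per the abstract.
import torch

def gaussian_mask(center, width, num_frames):
    """center, width: scalars in (0, 1); returns a (num_frames,) mask."""
    t = torch.linspace(0, 1, num_frames)
    return torch.exp(-0.5 * ((t - center) / width) ** 2)

mask = gaussian_mask(center=torch.tensor(0.4),
                     width=torch.tensor(0.1), num_frames=64)
frames = torch.randn(64, 256)  # per-frame features
positive = (mask.unsqueeze(1) * frames).mean(dim=0)         # masked pooling
easy_neg = ((1 - mask).unsqueeze(1) * frames).mean(dim=0)   # other frames
hard_neg = frames.mean(dim=0)                               # whole video
print(positive.shape, easy_neg.shape, hard_neg.shape)
```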

AAAI 2021 · Conference Paper

Mind-the-Gap! Unsupervised Domain Adaptation for Text-Video Retrieval

  • Qingchao Chen
  • Yang Liu
  • Samuel Albanie

When can we expect a text-video retrieval system to work effectively on datasets that differ from its training domain? In this work, we investigate this question through the lens of unsupervised domain adaptation, in which the objective is to match natural language queries and video content in the presence of domain shift at query time. Such systems have significant practical applications, since they are capable of generalising to new data sources without requiring corresponding text annotations. We make the following contributions: (1) we propose the UDAVR (Unsupervised Domain Adaptation for Video Retrieval) benchmark and employ it to study the performance of text-video retrieval in the presence of domain shift; (2) we propose Concept-Aware-Pseudo-Query (CAPQ), a method for learning discriminative and transferable features that bridge these cross-domain discrepancies to enable effective target-domain retrieval using source-domain supervision; (3) we show that CAPQ outperforms alternative domain adaptation strategies on UDAVR.

AAAI 2020 · Conference Paper

Structure-Aware Feature Fusion for Unsupervised Domain Adaptation

  • Qingchao Chen
  • Yang Liu

Unsupervised Domain Adaptation (UDA) aims to learn and transfer generalized features from a labelled source domain to a target domain without any annotations. Existing methods align only the high-level representations, without exploiting the complex multi-class structure and the local spatial structure. This is problematic because 1) the model is prone to negative transfer when features from different classes are misaligned, and 2) missing the local spatial structure poses a major obstacle to fine-grained feature alignment. In this paper, we integrate the valuable information conveyed in the classifier prediction and local feature maps into the global feature representation and then perform a single mini-max game to make it domain invariant. In this way, the domain-invariant feature not only describes the holistic representation of the original image but also preserves mode structure and fine-grained spatial information. The feature integration is achieved by estimating and maximizing the mutual information (MI) among the global feature, local features, and classifier prediction simultaneously. As MI is hard to measure directly in high-dimensional spaces, we adopt a new objective function that implicitly maximizes the MI via an effective sampling strategy and discriminator design. Our STructure-Aware Feature Fusion (STAFF) network achieves state-of-the-art performance on various UDA datasets.
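
A minimal sketch of the MI-maximization mechanism, assuming a Deep InfoMax-style discriminator and shuffled in-batch negatives (the paper's exact sampling strategy and discriminator design may differ):

```python
# Minimal sketch (assumed): a discriminator scores (global feature, pooled
# local feature, classifier prediction) triples; matched triples are
# positives, batch-shuffled triples are negatives, and a binary
# cross-entropy bound on their MI is maximized.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MIDiscriminator(nn.Module):
    def __init__(self, g_dim, l_dim, c_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(g_dim + l_dim + c_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, g, l, c):
        return self.net(torch.cat([g, l, c], dim=1)).squeeze(1)

def mi_lower_bound(disc, g, l, c):
    perm = torch.randperm(g.size(0))
    pos = disc(g, l, c)        # matched triples
    neg = disc(g[perm], l, c)  # mismatched (shuffled) globals
    return (F.logsigmoid(pos).mean()
            + F.logsigmoid(-neg).mean())  # maximize this bound

disc = MIDiscriminator(256, 128, 10)
g, l, c = torch.randn(8, 256), torch.randn(8, 128), torch.randn(8, 10)
print(mi_lower_bound(disc, g, l, c).item())
```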

AAAI 2018 · Conference Paper

Dictionary Learning Inspired Deep Network for Scene Recognition

  • Yang Liu
  • Qingchao Chen
  • Wei Chen
  • Ian Wassell

Scene recognition remains one of the most challenging problems in image understanding. With the help of fully connected layers (FCL) and rectified linear units (ReLu), deep networks can extract the moderately sparse and discriminative feature representation required for scene recognition. However, few methods consider exploiting a sparsity model for learning the feature representation in order to provide enhanced discriminative capability. In this paper, we replace the conventional FCL and ReLu with a new dictionary learning layer, that is composed of a finite number of recurrent units to simultaneously enhance the sparse representation and discriminative abilities of features via the determination of optimal dictionaries. In addition, with the help of the structure of the dictionary, we propose a new label discriminative regressor to boost the discrimination ability. We also propose new constraints to prevent overfitting by incorporating the advantage of the Mahalanobis and Euclidean distances to balance the recognition accuracy and generalization performance. Our proposed approach is evaluated using various scene datasets and shows superior performance to many stateof-the-art approaches.