Arrow Research

Author name cluster

Qi Jia

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers
2 author rows

Possible papers (9)

AAAI Conference 2026 Conference Paper

Conditional Information Bottleneck for Multimodal Fusion: Overcoming Shortcut Learning in Sarcasm Detection

  • Yihua Wang
  • Qi Jia
  • Cong Xu
  • Feiyu Chen
  • Yuhan Liu
  • Haotian Zhang
  • Liang Jin
  • Lu Liu

Multimodal sarcasm detection is a complex task that requires distinguishing subtle complementary signals across modalities while filtering out irrelevant information. Many advanced methods rely on learning shortcuts from datasets rather than extracting intended sarcasm-related features. However, our experiments show that shortcut learning impairs the model's generalization in real-world scenarios. Furthermore, we reveal the weaknesses of current modality fusion strategies for multimodal sarcasm detection through systematic experiments, highlighting the necessity of focusing on effective modality fusion for complex emotion recognition. To address these challenges, we construct MUStARD++R by removing shortcut signals from MUStARD++. Then, a Multimodal Conditional Information Bottleneck (MCIB) model is introduced to enable efficient multimodal fusion for sarcasm detection. Experimental results show that the MCIB achieves the best performance without relying on shortcut learning.
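
The listing does not include any implementation, but the bottleneck idea in the abstract maps onto a compact variational sketch. The snippet below is a hypothetical, minimal illustration (class names, dimensions, and the `beta` weight are invented, and this is not the authors' MCIB): each modality is compressed into a stochastic latent whose KL penalty discourages shortcut-prone information, and the fused latents drive the sarcasm classifier.

```python
# Hypothetical sketch of an information-bottleneck fusion layer (not the
# authors' MCIB implementation): each modality is compressed into a
# stochastic latent whose KL term penalizes shortcut-prone information.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IBFusion(nn.Module):
    def __init__(self, dim_text, dim_audio, dim_latent, num_classes):
        super().__init__()
        # Each encoder outputs mean and log-variance of a Gaussian latent.
        self.enc_text = nn.Linear(dim_text, 2 * dim_latent)
        self.enc_audio = nn.Linear(dim_audio, 2 * dim_latent)
        self.classifier = nn.Linear(2 * dim_latent, num_classes)

    def _sample(self, stats):
        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # KL(q(z|x) || N(0, I)) upper-bounds the information kept about x.
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)
        return z, kl

    def forward(self, text, audio, labels, beta=1e-3):
        z_t, kl_t = self._sample(self.enc_text(text))
        z_a, kl_a = self._sample(self.enc_audio(audio))
        logits = self.classifier(torch.cat([z_t, z_a], dim=-1))
        loss = F.cross_entropy(logits, labels) + beta * (kl_t + kl_a).mean()
        return logits, loss

model = IBFusion(dim_text=64, dim_audio=32, dim_latent=16, num_classes=2)
text, audio = torch.randn(8, 64), torch.randn(8, 32)
labels = torch.randint(0, 2, (8,))
logits, loss = model(text, audio, labels)
loss.backward()
```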

AAAI Conference 2026 Conference Paper

RSOD: Reliability-Guided Sonar Image Object Detection with Extremely Limited Labels

  • Chengzhou Li
  • Ping Guo
  • Guanchen Meng
  • Qi Jia
  • Jinyuan Liu
  • Zhu Liu
  • Xiaokang Liu
  • Yu Liu

Object detection in sonar images is a key technology in underwater detection systems. Compared to natural images, sonar images contain fewer texture details and are more susceptible to noise, making it difficult for non-experts to distinguish subtle differences between classes. As a result, non-experts cannot provide precise annotations for sonar images. Designing effective object detection methods for sonar images with extremely limited labels is therefore particularly important. To address this, we propose a teacher-student framework called RSOD, which aims to fully learn the characteristics of sonar images and to develop a pseudo-label strategy suited to them, mitigating the impact of limited labels. First, RSOD calculates a reliability score by assessing the consistency of the teacher's predictions across different views. To leverage this score, we introduce an object-mixed pseudo-label method to tackle the shortage of labeled data in sonar images. Finally, we optimize the student's performance through a reliability-guided adaptive constraint. By taking full advantage of unlabeled data, the student performs well even with extremely limited labels. Notably, on the UATD dataset, our method achieves results competitive with our baseline trained on 100% of the labeled data while using only 5% of it. We also collected a new dataset to provide more valuable data for sonar research.
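
The reliability score is described only at a high level here. One plausible reading, sketched below under stated assumptions (the function name, the use of mean best-match IoU, and the example boxes are all invented, not RSOD's actual formulation), is to match the teacher's pseudo-boxes across two augmented views and treat their agreement as reliability.

```python
# Hypothetical sketch of a view-consistency reliability score: pseudo-boxes
# predicted by a teacher on two augmented views are matched by IoU, and the
# mean best-match IoU serves as the reliability of the teacher's predictions.
import torch
from torchvision.ops import box_iou

def reliability_score(boxes_view1: torch.Tensor, boxes_view2: torch.Tensor) -> float:
    """Boxes are (N, 4) in xyxy format, already mapped to a common frame."""
    if boxes_view1.numel() == 0 or boxes_view2.numel() == 0:
        return 0.0
    iou = box_iou(boxes_view1, boxes_view2)      # (N1, N2) pairwise IoU
    best = iou.max(dim=1).values                 # best match per view-1 box
    return best.mean().item()

v1 = torch.tensor([[10., 10., 50., 50.], [60., 60., 90., 90.]])
v2 = torch.tensor([[12., 11., 49., 52.], [100., 100., 120., 120.]])
score = reliability_score(v1, v2)  # consistent boxes -> reliable pseudo-labels
print(f"reliability = {score:.2f}")
```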

AAAI Conference 2026 Conference Paper

Time Series Class-Incremental Learning via Confidence-guided Mask Distillation and Prototype-guided Contrastive Learning

  • Yu Liu
  • Haoqin Yang
  • Jinping Sui
  • Hui Wang
  • Haipeng Li
  • Weimin Wang
  • Qi Jia

Class-incremental learning (CIL) has recently gained great attention in the field of time series classification. Existing CIL methods based on knowledge distillation exhibit an impressive ability to retain prior knowledge and overcome catastrophic forgetting; however, their effectiveness faces major challenges posed by time series data. Since temporal data is more susceptible to sensor errors and electronic noise, the distillation process may be significantly affected by noisy knowledge transfer. To address this issue, we propose a novel confidence-guided mask distillation (CMD) framework that prevents the inheritance of noise during distillation. The core of CMD is a dynamic masking mechanism guided by prediction confidence, which allocates higher weights to high-confidence time series and substantially suppresses the influence of low-confidence ones. Additionally, unlike prior work that simply passes a set of feature prototypes to the classifier, we develop prototype-guided contrastive learning (PCL) to alleviate classifier bias toward new classes, using extra contrastive constraints to push the feature distributions of old feature prototypes away from those of new-class features. Extensive experiments on three time-series datasets demonstrate that our method significantly outperforms other replay-free CIL approaches in both raising average accuracy and decreasing forgetting rate.
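
As a hedged illustration of the masking idea (the exact CMD weighting is not given in this abstract; the temperature, threshold, and soft-mask form below are assumptions), a per-sample distillation loss can be gated by the old model's prediction confidence so that noisy series contribute little to knowledge transfer.

```python
# A minimal sketch of confidence-guided mask distillation (assumed form,
# not the authors' exact CMD): per-sample distillation weights come from
# the old model's prediction confidence.
import torch
import torch.nn.functional as F

def cmd_loss(student_logits, teacher_logits, tau=2.0, threshold=0.6):
    probs = F.softmax(teacher_logits, dim=-1)
    conf = probs.max(dim=-1).values                      # teacher confidence
    mask = (conf > threshold).float() * conf             # dynamic soft mask
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="none",
    ).sum(dim=-1) * tau**2                               # per-sample KD loss
    return (mask * kd).sum() / mask.sum().clamp(min=1e-8)

student = torch.randn(16, 10, requires_grad=True)
teacher = torch.randn(16, 10)
loss = cmd_loss(student, teacher)
loss.backward()
```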

AAAI Conference 2025 Conference Paper

As Pseudo-Label Free as Possible: Leveraging Adaptive Feature Generation for Sparsely Annotated Object Detection

  • Shuilian Yao
  • Yu Liu
  • Qi Jia
  • Sihong Chen
  • Wei Zhuo

Compared to fully supervised object detection, training with sparse annotations typically leads to a decline in performance due to insufficient feature diversity. Existing sparsely annotated object detection (SAOD) methods often rely on pseudo-labeling strategies, but these pseudo-labels tend to introduce noise under extreme sparsity. To simultaneously avoid the impact of pseudo-label noise and enhance feature diversity, we propose a novel Adaptive Feature Generation (AdaptFG) model that generates features based on class names. This model integrates a pre-trained CLIP into a VAE-based feature generator, with its core innovation being an Adaptor that adaptively maps CLIP's semantic embeddings to the object detector domain. Additionally, we introduce inter-class relationship reasoning in the detector, which effectively mitigates misclassifications stemming from similar features. Extensive experimental results demonstrate that AdaptFG consistently outperforms state-of-the-art SAOD methods on the PASCAL VOC and MS COCO benchmarks.
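
AdaptFG's exact architecture is not given in this listing; the sketch below is a hypothetical simplification (the class name, layer sizes, and the plain-decoder stand-in for the VAE are all assumptions) of the core idea: an Adaptor maps frozen CLIP text embeddings into the detector's feature space, and a conditional generator samples diverse synthetic features for under-annotated classes.

```python
# Hypothetical sketch of generating detector-domain features from class-name
# embeddings: an Adaptor MLP maps frozen CLIP text embeddings into the
# detector feature space, and a conditional decoder samples diverse features.
import torch
import torch.nn as nn

class FeatureGenerator(nn.Module):
    def __init__(self, clip_dim=512, det_dim=256, noise_dim=64):
        super().__init__()
        self.adaptor = nn.Sequential(          # CLIP space -> detector space
            nn.Linear(clip_dim, det_dim), nn.ReLU(), nn.Linear(det_dim, det_dim)
        )
        self.decoder = nn.Sequential(          # condition + noise -> feature
            nn.Linear(det_dim + noise_dim, det_dim), nn.ReLU(),
            nn.Linear(det_dim, det_dim),
        )
        self.noise_dim = noise_dim

    def forward(self, text_emb, n_samples):
        cond = self.adaptor(text_emb)                          # (C, det_dim)
        cond = cond.repeat_interleave(n_samples, dim=0)
        z = torch.randn(cond.size(0), self.noise_dim)
        return self.decoder(torch.cat([cond, z], dim=-1))      # synthetic features

gen = FeatureGenerator()
text_emb = torch.randn(20, 512)          # stand-in for frozen CLIP class embeddings
fake_feats = gen(text_emb, n_samples=8)  # (160, 256) features to train the head
```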

EAAI Journal 2025 Journal Article

High-resolution multi-view stereo with multi-scale feature fusion

  • Dapeng Chen
  • Qi Jia
  • Hao Wu
  • Da Yu
  • Nanxuan Huang
  • Jia Liu

To enhance three-dimensional reconstruction of large-scale scenes and high-resolution images, we introduce a novel multi-view high-resolution three-dimensional reconstruction approach. Our method integrates the Swin Transformer into a Feature Pyramid Network to establish long-range feature dependencies, facilitate information exchange between different input positions, and enhance the global consistency of the feature representation, improving the efficiency of the feature extraction stage. Following this, we apply cost volume regularization to mitigate noise and compute depth maps. A Depth Optimization Module refines the predicted depth maps, enhancing their precision. Experimental results demonstrate the efficacy of our method in generating more accurate depth information, particularly in predicting high-resolution depth maps. Our approach uses these depth predictions to generate point clouds, enabling precise matching and reconstruction of multi-view images. Experiments conducted on public datasets validate the effectiveness and superiority of the proposed method.
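
The core mechanism, long-range dependencies inside a feature pyramid, can be sketched with a standard transformer encoder layer standing in for the paper's Swin blocks (an assumed simplification; the class name and shapes below are invented): flatten the coarsest pyramid level into tokens, apply global self-attention, and reshape back before cost-volume construction.

```python
# A minimal, assumed sketch of adding global attention to a pyramid level
# (a plain transformer encoder layer stands in for Swin Transformer blocks).
import torch
import torch.nn as nn

class GlobalContextLevel(nn.Module):
    def __init__(self, channels=64, heads=4):
        super().__init__()
        self.attn = nn.TransformerEncoderLayer(
            d_model=channels, nhead=heads, batch_first=True
        )

    def forward(self, feat):                      # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequence
        tokens = self.attn(tokens)                # global self-attention
        return tokens.transpose(1, 2).reshape(b, c, h, w)

level = GlobalContextLevel()
coarse = torch.randn(2, 64, 16, 20)               # coarsest FPN output
enhanced = level(coarse)                          # same shape, globally mixed
```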

ICML Conference 2025 Conference Paper

Info-Coevolution: An Efficient Framework for Data Model Coevolution

  • Ziheng Qin
  • Hailun Xu
  • Wei Chee Yew
  • Qi Jia
  • Yang Luo
  • Kanchan Sarkar
  • Danhui Guan
  • Kai Wang 0036

Machine learning relies heavily on data, yet the continuous growth of real-world data poses challenges for efficient dataset construction and training. A fundamental yet unsolved question is: given the current model and data, does a new data sample (or batch) need annotation or learning? Conventional approaches retain all available data, leading to suboptimal data and training efficiency. Active learning aims to reduce data redundancy by selecting a subset of samples to annotate, but it increases pipeline complexity and introduces bias. In this work, we propose Info-Coevolution, a novel framework that efficiently enables models and data to coevolve through online selective annotation without bias. Leveraging task-specific models (and open-source models), it selectively annotates and integrates online and web data to improve datasets efficiently. For real-world datasets like ImageNet-1K, Info-Coevolution reduces annotation and training costs by 32% without performance loss. It determines the saving ratio automatically, without tuning. With semi-supervised learning, it can further reduce the annotation ratio to 50%. We also explore retrieval-based dataset enhancement using unlabeled open-source data. Code is available at https://github.com/NUS-HPC-AI-Lab/Info-Coevolution/.
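
The actual selection criterion lives in the linked repository; as a hedged sketch of the general idea (the function name, entropy criterion, and threshold are assumptions, not Info-Coevolution's method), an incoming sample can be routed to annotation only when the current model's predictive uncertainty suggests it would be informative.

```python
# Hypothetical sketch of online selective annotation: a new sample is sent
# for annotation only when the current model's predictive entropy is high;
# confident samples are skipped or self-labeled.
import torch
import torch.nn.functional as F

@torch.no_grad()
def needs_annotation(model, x, entropy_threshold=1.0):
    probs = F.softmax(model(x), dim=-1)
    entropy = -(probs * probs.clamp(min=1e-12).log()).sum(dim=-1)
    return entropy > entropy_threshold            # (B,) boolean mask

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(32 * 32 * 3, 10))
batch = torch.randn(64, 3, 32, 32)
mask = needs_annotation(model, batch)
print(f"annotate {int(mask.sum())} of {len(batch)} incoming samples")
```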

NeurIPS Conference 2024 Conference Paper

Infer Induced Sentiment of Comment Response to Video: A New Task, Dataset and Baseline

  • Qi Jia
  • Baoyu Fan
  • Cong Xu
  • Lu Liu
  • Liang Jin
  • Guoguang Du
  • Zhenhua Guo
  • Yaqian Zhao

Existing video multi-modal sentiment analysis mainly focuses on the sentiment expressed by people within the video, yet often neglects the sentiment induced in viewers while watching it. The induced sentiment of viewers is essential for inferring the public response to videos and has broad applications in analyzing public societal sentiment, advertising effectiveness, and other areas. Micro videos and their related comments provide a rich application scenario for analyzing viewers' induced sentiment. In light of this, we introduce a novel research task, Multimodal Sentiment Analysis for Comment Response of Video Induced (MSA-CRVI), which aims to infer opinions and emotions from comments made in response to micro videos. Meanwhile, we manually annotate a dataset named Comment Sentiment toward Micro Video (CSMV) to support this research. To our knowledge, it is the largest video multi-modal sentiment dataset in terms of scale and video duration, containing 107,267 comments and 8,210 micro videos with a total video duration of 68.83 hours. Since inferring the induced sentiment of a comment should leverage the video content, we propose the Video Content-aware Comment Sentiment Analysis (VC-CSA) method as a baseline to address the challenges inherent in this new task. Extensive experiments demonstrate that our method shows significant improvements over other established baselines. We make the dataset and source code publicly available at https://github.com/IEIT-AGI/MSA-CRVI.
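
How VC-CSA conditions on video content is not spelled out here; a minimal sketch of one natural design (assumed, not the authors' architecture; the class name, dimensions, and label count are invented) is to let comment token embeddings cross-attend to video frame features before pooling into a sentiment prediction.

```python
# A minimal, assumed sketch of video-content-aware comment sentiment:
# comment tokens cross-attend to video frame features before pooling.
import torch
import torch.nn as nn

class VideoAwareSentiment(nn.Module):
    def __init__(self, dim=256, num_classes=7):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, comment_tokens, frame_feats):
        # Queries are comment tokens; keys/values are video frame features.
        attended, _ = self.cross_attn(comment_tokens, frame_feats, frame_feats)
        return self.head(attended.mean(dim=1))    # pooled sentiment logits

model = VideoAwareSentiment()
comments = torch.randn(4, 24, 256)                # (B, comment_len, dim)
frames = torch.randn(4, 32, 256)                  # (B, num_frames, dim)
logits = model(comments, frames)                  # (4, 7)
```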

NeurIPS Conference 2024 Conference Paper

Not Just Object, But State: Compositional Incremental Learning without Forgetting

  • Yanyi Zhang
  • Binglin Qiu
  • Qi Jia
  • Yu Liu
  • Ran He

Most incremental learners excessively prioritize object classes while neglecting the various states (e.g., color and material) attached to objects. As a result, they are limited in their ability to model state-object compositionality accurately. To remedy this limitation, we propose a novel task called Compositional Incremental Learning (composition-IL), which enables a model to recognize a variety of state-object compositions in an incremental learning fashion. Given the lack of suitable datasets, we re-organize two existing datasets and tailor them for composition-IL. We then propose a prompt-based Composition Incremental Learner (CompILer) to overcome ambiguous composition boundaries. Specifically, we exploit multi-pool prompt learning and enforce inter-pool prompt discrepancy and intra-pool prompt diversity. Besides, we devise object-injected state prompting, which injects object prompts to guide the selection of state prompts. Furthermore, we fuse the selected prompts by a generalized-mean strategy to eliminate irrelevant information learned in the prompts. Extensive experiments on two datasets exhibit state-of-the-art performance achieved by CompILer. Code and datasets are available at https://github.com/Yanyi-Zhang/CompILer.
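
CompILer's full multi-pool design is richer than what fits here; the sketch below is a hypothetical single-pool simplification (pool size, top-k, and the GeM exponent are assumptions): a query feature selects its top-k prompts by key similarity, and the selected prompts are fused with a learnable generalized-mean exponent instead of plain averaging.

```python
# Hypothetical sketch of one prompt pool with generalized-mean fusion:
# top-k prompts are selected by key similarity and fused with a learnable
# GeM exponent.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptPool(nn.Module):
    def __init__(self, pool_size=10, prompt_dim=256, top_k=3):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(pool_size, prompt_dim))
        self.prompts = nn.Parameter(torch.randn(pool_size, prompt_dim))
        self.p = nn.Parameter(torch.tensor(3.0))   # learnable GeM exponent
        self.top_k = top_k

    def forward(self, query):                      # query: (B, prompt_dim)
        sim = F.cosine_similarity(query.unsqueeze(1), self.keys.unsqueeze(0), dim=-1)
        idx = sim.topk(self.top_k, dim=-1).indices # (B, k) selected prompts
        chosen = self.prompts[idx]                 # (B, k, prompt_dim)
        # Generalized mean over the k prompts (positive values assumed).
        return chosen.clamp(min=1e-6).pow(self.p).mean(dim=1).pow(1.0 / self.p)

pool = PromptPool()
fused = pool(torch.randn(8, 256))                  # (8, 256) fused prompt
```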

AAAI Conference 2021 Conference Paper

DDRel: A New Dataset for Interpersonal Relation Classification in Dyadic Dialogues

  • Qi Jia
  • Hongru Huang
  • Kenny Q. Zhu

Interpersonal language style shifting in dialogues is an interesting and almost instinctive ability of humans. Understanding interpersonal relationships from language content is also a crucial step toward a deeper understanding of dialogues. Previous work mainly focuses on relation extraction between named entities in texts or within a single dialogue session. In this paper, we propose the task of relation classification of interlocutors based on their dialogues. We crawled movie scripts from IMSDb and annotated the relation label for each session according to 13 pre-defined relationships. The annotated dataset DDRel consists of 6,300 dyadic dialogue sessions between 694 pairs of speakers, with 53,126 utterances in total. We also construct session-level and pair-level relation classification tasks with widely accepted baselines. The experimental results show that both tasks are challenging for existing models and that the dataset will be useful for future research.
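
The session-level task reduces to text classification over a flattened dialogue; a minimal sketch under stated assumptions (the example sessions and relation labels are illustrative, not drawn from DDRel, and this trivial bag-of-words baseline is far weaker than the paper's baselines):

```python
# Hypothetical sketch of the session-level task format: each dyadic session
# is flattened into one text and classified into a relationship label.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sessions = [
    "A: Dad, can I borrow the car? B: Only if you're back by ten.",
    "A: Your resume looks strong. B: Thank you, I'm excited about the role.",
]
labels = ["child-parent", "interviewee-interviewer"]  # illustrative relations

baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
baseline.fit(sessions, labels)
print(baseline.predict(["A: Mom, I'm home! B: Dinner is on the table."]))
```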