Arrow Research search

Author name cluster

Qiang Hu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

11 papers
1 author row

Possible papers

11

AAAI Conference 2026 Conference Paper

On the Evaluation of Capability Estimation Methods for Large Language Models

  • Qiang Hu
  • Jin Wen
  • Yao Zhang
  • Maxime Cordy
  • Yongqiang Lyu

The emergence of large language models (LLMs) marks a transformative era in artificial intelligence (AI). However, systematically evaluating the capability of LLMs is challenging because it requires large amounts of labeled test data. To tackle this problem, in the conventional AI field, AutoEval has been proposed to estimate the capability of AI models without data labeling effort. Unfortunately, even though multiple AutoEval methods have been proposed, most are constructed for classification tasks and evaluated only on image datasets. As a result, their effectiveness for LLMs is unclear, as LLMs often target generation tasks. In this work, we introduce AEBench, the first AutoEval benchmark specifically designed to estimate the capability of LLMs using unlabeled test data. Besides existing AutoEval methods, AEBench also supports our own method, which exploits the correlation between data uncertainty and model ability for capability estimation. In total, AEBench covers 12 AutoEval methods and 120 method combinations. Based on AEBench, we conducted a comprehensive study to explore the usefulness of AutoEval on LLMs. Experimental results on 10 datasets demonstrated that our uncertainty feature-based methods perform the best, achieving the lowest estimation errors.
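The abstract describes AutoEval as estimating model capability from unlabeled test data. A minimal sketch of that general idea follows, assuming a simple uncertainty feature (mean predictive entropy) and a linear regressor calibrated on a few labeled sets; it illustrates the paradigm only, not the AEBench implementation.

```python
# Minimal AutoEval-style sketch (illustrative, not the AEBench method):
# map an uncertainty feature computed on unlabeled predictions to an accuracy estimate.
import numpy as np
from sklearn.linear_model import LinearRegression

def mean_entropy(probs):
    """Average predictive entropy over per-example probability vectors (n, n_classes)."""
    probs = np.clip(probs, 1e-12, 1.0)
    return float(np.mean(-(probs * np.log(probs)).sum(axis=1)))

def fit_estimator(probs_per_set, accuracies):
    """Calibrate on labeled sets where the true accuracy is known (hypothetical inputs)."""
    features = np.array([[mean_entropy(p)] for p in probs_per_set])
    return LinearRegression().fit(features, np.array(accuracies))

def estimate_accuracy(estimator, unlabeled_probs):
    """Predict accuracy for a new, unlabeled test set from its uncertainty feature."""
    return float(estimator.predict([[mean_entropy(unlabeled_probs)]])[0])
```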

AAAI Conference 2026 Conference Paper

Pairing-free Group-level Knowledge Distillation for Robust Gastrointestinal Lesion Classification in White-Light Endoscopy

  • Qiang Hu
  • Qimei Wang
  • Yingjie Guo
  • Qiang Li
  • Zhiwei Wang

White-Light Imaging (WLI) is the standard for endoscopic cancer screening, but Narrow-Band Imaging (NBI) offers superior diagnostic details. A key challenge is transferring knowledge from NBI to enhance WLI-only models, yet existing methods are critically hampered by their reliance on paired NBI-WLI images of the same lesion, a costly and often impractical requirement that leaves vast amounts of clinical data untapped. In this paper, we break this paradigm by introducing PaGKD, a novel Pairing-free Group-level Knowledge Distillation framework that enables effective cross-modal learning using unpaired WLI and NBI data. Instead of forcing alignment between individual, often semantically mismatched image instances, PaGKD operates at the group level to distill more complete and compatible knowledge across modalities. Central to PaGKD are two complementary modules: (1) Group-level Prototype Distillation (GKD-Pro) distills compact group representations by extracting modality-invariant semantic prototypes via shared lesion-aware queries; (2) Group-level Dense Distillation (GKD-Den) performs dense cross-modal alignment by guiding group-aware attention with activation-derived relation maps. Together, these modules enforce global semantic consistency and local structural coherence without requiring image-level correspondence. Extensive experiments on four clinical datasets demonstrate that PaGKD consistently and significantly outperforms state-of-the-art methods, boosting AUC by 3.3%, 1.1%, 2.8%, and 3.2%, respectively, establishing a new direction for cross-modal learning from unpaired data.
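A hedged sketch of the group-level prototype idea described above: shared learnable queries pool an unpaired batch of each modality's features into group prototypes, which are then aligned across modalities. The module names, loss choice, and teacher/student roles here are assumptions for illustration, not PaGKD's exact design.

```python
# Illustrative group-level prototype distillation sketch (details assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupPrototypes(nn.Module):
    def __init__(self, dim, num_prototypes=8):
        super().__init__()
        # Shared learnable queries standing in for the "lesion-aware queries";
        # dim is assumed divisible by the number of attention heads.
        self.queries = nn.Parameter(torch.randn(num_prototypes, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, feats):                  # feats: (N_images, dim) for one modality
        q = self.queries.unsqueeze(0)          # (1, P, dim)
        kv = feats.unsqueeze(0)                # (1, N, dim)
        protos, _ = self.attn(q, kv, kv)       # pool the group into P prototypes
        return protos.squeeze(0)

def prototype_distillation_loss(pool, wli_feats, nbi_feats):
    """Align group prototypes of unpaired WLI and NBI batches (MSE is an assumption)."""
    p_wli = pool(wli_feats)
    p_nbi = pool(nbi_feats).detach()           # treat the NBI side as the teacher (assumption)
    return F.mse_loss(p_wli, p_nbi)
```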

NeurIPS Conference 2025 Conference Paper

4DGCPro: Efficient Hierarchical 4D Gaussian Compression for Progressive Volumetric Video Streaming

  • Zihan Zheng
  • Zhenlong Wu
  • Houqiang Zhong
  • Yuan Tian
  • Ning Cao
  • Lan Xu
  • Jiangchao Yao
  • Xiaoyun Zhang

Achieving seamless viewing of high-fidelity volumetric video, comparable to 2D video experiences, remains an open challenge. Existing volumetric video compression methods either lack the flexibility to adjust quality and bitrate within a single model for efficient streaming across diverse networks and devices, or struggle with real-time decoding and rendering on lightweight mobile platforms. To address these challenges, we introduce 4DGCPro, a novel hierarchical 4D Gaussian compression framework that facilitates real-time mobile decoding and high-quality rendering via progressive volumetric video streaming in a single bitstream. Specifically, we propose a perceptually-weighted and compression-friendly hierarchical 4D Gaussian representation with motion-aware adaptive grouping to reduce temporal redundancy, preserve coherence, and enable scalable multi-level detail streaming. Furthermore, we present an end-to-end entropy-optimized training scheme, which incorporates layer-wise rate-distortion (RD) supervision and attribute-specific entropy modeling for efficient bitstream generation. Extensive experiments show that 4DGCPro enables flexible quality and variable bitrate within a single model, achieving real-time decoding and rendering on mobile devices while outperforming existing methods in RD performance across multiple datasets.
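The layer-wise rate-distortion supervision mentioned above can be summarized as each hierarchy level contributing its own distortion and rate terms. Below is a minimal sketch of that loss form, assuming one distortion value, one entropy-model bit estimate, and one trade-off weight per level; the exact terms in 4DGCPro may differ.

```python
# Hedged sketch of layer-wise rate-distortion supervision (loss form assumed).
import torch

def layerwise_rd_loss(distortions, bit_estimates, lambdas):
    """distortions, bit_estimates, lambdas: one scalar tensor per hierarchy level."""
    loss = torch.zeros(())
    for d, r, lam in zip(distortions, bit_estimates, lambdas):
        loss = loss + d + lam * r   # each level pays its distortion plus weighted rate
    return loss
```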

JBHI Journal 2025 Journal Article

Fine-Grained Temporal Site Monitoring in EGD Streams via Visual Time-Aware Embedding and Vision-Text Asymmetric Coworking

  • Fang Peng
  • Hongkuan Shi
  • Shiquan He
  • Qiang Hu
  • Ting Li
  • Fan Huang
  • Xinxia Feng
  • Mei Liu

Esophagogastroduodenoscopy (EGD) requires complete inspection of numerous upper gastrointestinal (UGI) sites for precise cancer screening. Automated temporal site monitoring for EGD assistance is thus in high demand, yet directly applying existing online action detection methods often fails. The key challenges are two-fold: 1) the global camera motion dominates, invalidating the temporal patterns derived from the object optical flows, and 2) the UGI sites are fine-grained, yielding highly homogenized appearances. In this paper, we propose an EGD-customized model, powered by two novel designs, i.e., Visual Time-aware Embedding plus Vision-text Asymmetric Coworking (VTE+VAC), for real-time accurate fine-grained UGI site monitoring. Concretely, VTE learns visual embeddings by differentiating frames via classification losses, and meanwhile by reordering the sampled time-agnostic frames to be temporally coherent via a ranking loss. Such a joint objective encourages VTE to capture the sequential relation without resorting to the inapplicable object optical flows, and thus to provide time-aware frame-wise embeddings. In the subsequent analysis, VAC uses a temporal sliding window, and extracts vision-text multimodal knowledge from each frame and its corresponding textualized prediction via the learned VTE and a frozen BERT. The text embeddings help provide more representative cues, but may also cause misdirection due to prediction errors. Thus, VAC randomly drops or replaces historical predictions to increase error tolerance and avoid collapsing onto the last few predictions. Qualitative and quantitative experiments demonstrate that the proposed method achieves superior performance compared to other state-of-the-art methods, with an average F1-score improvement of at least 7.66%.
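The abstract's error-tolerance mechanism, randomly dropping or replacing historical predictions before they are fed back as text cues, can be sketched as below. The probabilities and replacement scheme are assumptions for illustration, not the paper's exact settings.

```python
# Hedged sketch of history perturbation for error tolerance (hyperparameters assumed).
import random

def perturb_history(pred_history, num_classes, p_drop=0.1, p_replace=0.1):
    """pred_history: list of predicted site labels from earlier frames."""
    out = []
    for label in pred_history:
        r = random.random()
        if r < p_drop:
            continue                               # drop this historical prediction
        if r < p_drop + p_replace:
            label = random.randrange(num_classes)  # replace with a random site label
        out.append(label)
    return out
```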

NeurIPS Conference 2025 Conference Paper

Long-tailed Recognition with Model Rebalancing

  • Jiaan Luo
  • Feng Hong
  • Qiang Hu
  • Xiaofeng Cao
  • Feng Liu
  • Jiangchao Yao

Long-tailed recognition is ubiquitous and challenging in deep learning, and even in the downstream finetuning of foundation models, since the skewed class distribution generally prevents the model from generalizing to the tail classes. Despite the promise of previous methods based on data augmentation, loss rebalancing, decoupled training, etc., consistent improvement in broad scenarios such as multi-label long-tailed recognition remains difficult. In this study, we dive into the impact of model capacity under the long-tailed context, and propose a novel framework, Model Rebalancing (MORE), which mitigates imbalance by directly rebalancing the model's parameter space. Specifically, MORE introduces a low-rank parameter component to mediate the parameter space allocation guided by a tailored loss and sinusoidal reweighting schedule, but without increasing the overall model complexity or inference costs. Extensive experiments on diverse long-tailed benchmarks, spanning multi-class and multi-label tasks, demonstrate that MORE significantly improves generalization, particularly for tail classes, and effectively complements existing imbalance mitigation methods. These results highlight MORE's potential as a robust plug-and-play module in long-tailed settings.
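A hedged sketch of the low-rank parameter component described above: a frozen-shape linear layer is augmented with a trainable low-rank delta whose contribution is reweighted over training. The sinusoidal schedule form and initialization below are assumptions, not MORE's exact recipe.

```python
# Illustrative low-rank model-rebalancing sketch (schedule and loss details assumed).
import math
import torch
import torch.nn as nn

class LowRankRebalanced(nn.Module):
    def __init__(self, base_linear: nn.Linear, rank=4):
        super().__init__()
        self.base = base_linear
        out_f, in_f = base_linear.weight.shape
        self.A = nn.Parameter(torch.zeros(out_f, rank))          # low-rank factors
        self.B = nn.Parameter(torch.randn(rank, in_f) * 0.01)

    def forward(self, x, step, total_steps):
        # Sinusoidal reweighting of the low-rank branch over training (form assumed).
        w = 0.5 * (1.0 - math.cos(math.pi * step / total_steps))
        delta = self.A @ self.B                                   # (out_f, in_f)
        return nn.functional.linear(x, self.base.weight + w * delta, self.base.bias)
```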

AAAI Conference 2025 Conference Paper

MonoBox: Tightness-Free Box-Supervised Polyp Segmentation Using Monotonicity Constraint

  • Qiang Hu
  • Zhenyu Yi
  • Ying Zhou
  • Fan Huang
  • Mei Liu
  • Qiang Li
  • Zhiwei Wang

We propose MonoBox, an innovative box-supervised segmentation method constrained by monotonicity to liberate its training from the user-unfriendly box-tightness assumption. In contrast to conventional box-supervised segmentation, where the box edges must precisely touch the target boundaries, MonoBox leverages imprecisely-annotated boxes to achieve robust pixel-wise segmentation. The linchpin is that, within the noisy zones around box edges, MonoBox discards the traditional misguiding multiple-instance learning loss, and instead optimizes a carefully-designed objective, termed the monotonicity constraint. Along directions transitioning from foreground to background, this new constraint steers responses to adhere to a monotonically decreasing trend. Consequently, the originally unreliable learning within the noisy zones is transformed into a correct and effective monotonicity optimization. Moreover, an adaptive label correction is introduced, enabling MonoBox to enhance the tightness of box annotations using predicted masks from the previous epoch and dynamically shrink the noisy zones as training progresses. We verify MonoBox in the box-supervised segmentation task of polyps, where satisfying box-tightness is challenging due to the vague boundaries between the polyp and normal tissues. Experiments on both public synthetic and in-house real noisy datasets demonstrate that MonoBox outperforms other anti-noise state-of-the-art methods, improving Dice by at least 5.5% and 3.3%, respectively.
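The monotonicity constraint can be illustrated on a 1-D profile of predicted responses sampled from inside the box outwards: any increase between consecutive positions is penalized. This is a minimal sketch of that idea, not MonoBox's exact objective or sampling scheme.

```python
# Hedged sketch of a monotonicity constraint along a foreground-to-background profile.
import torch

def monotonicity_loss(profile):
    """profile: (L,) predicted responses ordered from foreground towards background."""
    increases = torch.relu(profile[1:] - profile[:-1])  # positive only where responses rise
    return increases.mean()
```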

JBHI Journal 2025 Journal Article

TTFNet: Temporal-Frequency Features Fusion Network for Speech Based Automatic Depression Recognition and Assessment

  • Xiyuan Chen
  • Zhuhong Shao
  • Yinan Jiang
  • Runsen Chen
  • Yunlong Wang
  • Bicao Li
  • Mingyue Niu
  • Hongguang Chen

Related studies have revealed that the phonological features of depressed patients differ from those of healthy individuals. With the increasing prevalence of depression, an objective and convenient approach for early screening is necessary. To this end, we propose an automatic depression detection method based on hybrid speech features extracted by deep learning, dubbed TTFNet. Firstly, to effectively mine the intrinsic relationships among multidimensional dynamic features in the frequency domain, the log-Mel spectrogram of raw speech and its related derivatives are encoded into a quaternion representation. Then, the innovatively designed quaternion VisionLSTM is utilized to capture their synergistic effects. Simultaneously, we integrate sLSTM with the pre-trained wav2vec 2.0 model to fully acquire the temporal features. In addition, to further exploit the complementarity between temporal and frequency features, we design an XConformer block for cross-sequence interactions, which ingeniously combines self-attention mechanisms and convolutional modules. Based on this block, the dual-path fusion module exploits the mutual reinforcement of features from different domains, thereby enhancing the generalization capability of the proposed model. Extensive experiments conducted on the AVEC 2013, AVEC 2014, DAIC-WOZ and E-DAIC datasets demonstrate that our method outperforms current state-of-the-art methods in both depression recognition and severity prediction tasks.
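The quaternion encoding mentioned above amounts to packing the log-Mel spectrogram and its derivatives into four components. A minimal sketch follows; the channel ordering and the choice to leave the real part empty are assumptions, not the paper's exact encoding.

```python
# Hedged sketch of packing log-Mel features into a 4-component quaternion-style tensor.
import numpy as np
import librosa

def quaternion_logmel(wave, sr, n_mels=80):
    mel = librosa.feature.melspectrogram(y=wave, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)
    d1 = librosa.feature.delta(logmel, order=1)       # first-order derivative
    d2 = librosa.feature.delta(logmel, order=2)       # second-order derivative
    real = np.zeros_like(logmel)                      # real part left empty (assumption)
    return np.stack([real, logmel, d1, d2], axis=0)   # (4, n_mels, frames)
```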

AAAI Conference 2025 Conference Paper

VRVVC: Variable-Rate NeRF-Based Volumetric Video Compression

  • Qiang Hu
  • Houqiang Zhong
  • Zihan Zheng
  • Xiaoyun Zhang
  • Zhengxue Cheng
  • Li Song
  • Guangtao Zhai
  • Yanfeng Wang

Neural Radiance Field (NeRF)-based volumetric video has revolutionized visual media by delivering photorealistic Free-Viewpoint Video (FVV) experiences that provide audiences with unprecedented immersion and interactivity. However, the substantial data volumes pose significant challenges for storage and transmission. Existing solutions typically optimize NeRF representation and compression independently or focus on a single fixed rate-distortion (RD) tradeoff. In this paper, we propose VRVVC, a novel end-to-end jointly optimized variable-rate framework for volumetric video compression that achieves variable bitrates using a single model while maintaining superior RD performance. Specifically, VRVVC introduces a compact tri-plane implicit residual representation for inter-frame modeling of long-duration dynamic scenes, effectively reducing temporal redundancy. We further propose a variable-rate residual representation compression scheme that leverages a learnable quantization and a tiny MLP-based entropy model. This approach enables variable bitrates by using predefined Lagrange multipliers to manage the quantization error of all latent representations. Finally, we present an end-to-end progressive training strategy combined with a multi-rate-distortion loss function to optimize the entire framework. Extensive experiments demonstrate that VRVVC achieves a wide range of variable bitrates within a single model and surpasses the RD performance of existing methods across various datasets.
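One common way to make quantization trainable end-to-end, as the learnable quantization mentioned above requires, is a straight-through estimator with a learnable step size. The sketch below shows that generic construction only; VRVVC's actual quantization and entropy model may differ.

```python
# Hedged sketch of learnable quantization with a straight-through estimator.
import torch
import torch.nn as nn

class LearnableQuant(nn.Module):
    def __init__(self, init_step=1.0):
        super().__init__()
        self.log_step = nn.Parameter(torch.tensor(float(init_step)).log())

    def forward(self, latent):
        step = self.log_step.exp()                   # positive, trainable step size
        hard = torch.round(latent / step) * step     # rounded latent used in the forward pass
        # Straight-through: forward returns the rounded value, backward sees identity.
        return latent + (hard - latent).detach()
```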

AAAI Conference 2024 Conference Paper

Dynamic Feature Pruning and Consolidation for Occluded Person Re-identification

  • Yuteng Ye
  • Hang Zhou
  • Jiale Cai
  • Chenxing Gao
  • Youjia Zhang
  • Junle Wang
  • Qiang Hu
  • Junqing Yu

Occluded person re-identification (ReID) is a challenging problem due to contamination from occluders. Existing approaches address the issue with prior knowledge cues, such as human body key points and semantic segmentations, which easily fail in the presence of heavy occlusion and other humans as occluders. In this paper, we propose a feature pruning and consolidation (FPC) framework to circumvent explicit human structure parsing. The framework mainly consists of a sparse encoder, a multi-view feature matching module, and a feature consolidation decoder. Specifically, the sparse encoder drops less important image tokens, mostly related to background noise and occluders, based solely on correlation within the class token attention. Subsequently, the matching stage relies on the preserved tokens produced by the sparse encoder to identify k-nearest neighbors in the gallery by measuring the combined image- and patch-level similarity. Finally, we use the feature consolidation module to compensate for pruned features using the identified neighbors, recovering essential information while disregarding disturbance from noise and occlusion. Experimental results demonstrate the effectiveness of our proposed framework on occluded, partial, and holistic Re-ID datasets. In particular, our method outperforms state-of-the-art results by at least 8.6% mAP and 6.0% Rank-1 accuracy on the challenging Occluded-Duke dataset.
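The token-dropping step described above can be sketched as ranking patch tokens by the class token's attention to them and keeping only the top fraction. This is an illustrative sketch; the keep-rate schedule and the layers at which FPC prunes are not reproduced here.

```python
# Hedged sketch of attention-based token pruning (keep ratio is an assumed hyperparameter).
import torch

def prune_tokens(tokens, cls_attn, keep_ratio=0.7):
    """tokens: (B, N, D) patch tokens; cls_attn: (B, N) class-token-to-patch attention."""
    num_keep = max(1, int(tokens.shape[1] * keep_ratio))
    idx = cls_attn.topk(num_keep, dim=1).indices                 # most-attended tokens per image
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])     # (B, num_keep, D)
    return tokens.gather(1, idx)                                 # keep only the selected tokens
```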