Arrow Research search

Author name cluster

Ran Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

8 papers
1 author row

Possible papers (8)

AAAI Conference 2026 Conference Paper

DeFB: Decomposed Feature Learning for Real-Time Multi-Person Eyeblink Detection in Untrimmed In-the-Wild Videos

  • Jinfang Gan
  • Wenzheng Zeng
  • Yang Xiao
  • Xintao Zhang
  • Chaoyang Zheng
  • Ran Zhao
  • Ran Wang
  • Min Du

Multi-person eyeblink detection in untrimmed in-the-wild videos is a recently emerged and challenging task. Because eyeblinks are far more spatio-temporally fine-grained than general actions, we empirically find that general action detectors, though effective in general domains, struggle with this task (i.e., Blink-AP < 2%). Specialized eyeblink detection methods alleviate this through fine-grained spatio-temporal operations. The SOTA method proposes a unified model combining instance-aware face localization and eyeblink detection through joint multi-task learning and feature sharing. While effective, it exhibits two critical limitations that may contribute to its unsatisfactory performance (i.e., Blink-AP = 10.11%): (1) Face localization and eyeblink detection require distinct spatio-temporal feature granularities, making joint modeling in a unified feature space suboptimal. (2) Eyeblink training can be strongly affected by unstable face-eye feature learning under the joint training paradigm. To address this, we propose DeFB, a decomposed feature learning paradigm with favorable effectiveness and efficiency: (1) We model faces and eyes in granularity-specific feature spaces, which enhances fine-grained perception while reducing computational cost compared to a unified feature space. (2) To mitigate face-eye feature learning instability, we adopt an asynchronous learning mechanism in which eye feature learning refines well-trained coarse face features, with shared queries acting as a bridge between stages to retain the efficient feature sharing of existing unified models. Compared with the SOTA method, DeFB doubles the performance (Blink-AP: 24.65% vs. 10.11%) while boosting efficiency by nearly 35%. DeFB can also be integrated as a plug-in to substantially augment the eyeblink detection capabilities of general action detectors.
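
As a rough illustration of the asynchronous, decomposed scheme this abstract describes, the PyTorch sketch below trains a coarse face branch first and then refines its shared queries with a fine-grained eyeblink branch. All module names, layer choices, dimensions, and the toy inputs are assumptions for illustration, not the authors' code.

import torch
import torch.nn as nn

class FaceBranch(nn.Module):
    """Coarse, face-level features and per-face queries (stage 1)."""
    def __init__(self, dim=256, num_queries=8):
        super().__init__()
        self.backbone = nn.Conv3d(3, dim, kernel_size=3, padding=1)
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.decoder = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)

    def forward(self, video):                      # video: (B, 3, T, H, W)
        feat = self.backbone(video)                # (B, C, T, H, W)
        tokens = feat.flatten(2).transpose(1, 2)   # (B, T*H*W, C)
        q = self.queries.unsqueeze(0).expand(video.size(0), -1, -1)
        return self.decoder(q, tokens)             # shared queries: (B, Q, C)

class EyeBranch(nn.Module):
    """Fine-grained eyeblink head that refines the shared queries (stage 2)."""
    def __init__(self, dim=256):
        super().__init__()
        self.refine = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.blink_head = nn.Linear(dim, 2)        # blink / no-blink logits

    def forward(self, shared_queries, eye_tokens):
        q = self.refine(shared_queries, eye_tokens)
        return self.blink_head(q)

# Asynchronous schedule: train FaceBranch first, then freeze it and train
# EyeBranch on top of its (now stable) shared queries.
face, eye = FaceBranch(), EyeBranch()
video = torch.randn(1, 3, 8, 16, 16)               # toy clip
with torch.no_grad():                              # stage-2 view: face branch frozen
    shared_queries = face(video)
eye_tokens = torch.randn(1, 64, 256)               # placeholder fine-grained eye features
blink_logits = eye(shared_queries, eye_tokens)
print(blink_logits.shape)                          # torch.Size([1, 8, 2])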

AAAI Conference 2026 Conference Paper

MoEG-HOI: Mixture of Expert Groups for One-Stage Hand-Object Interaction Motion Generation with Hand-Finger-Joint Semantic Guidance

  • Hang Xu
  • Yang Xiao
  • Changlong Jiang
  • Haohong Kuang
  • Kaidi Zhang
  • Min Du
  • Ran Wang

In this paper, MoEG-HOI is proposed as a novel method for the challenging 3D hand-object interaction (HOI) motion generation task, introducing Mixture-of-Experts (MoE) to this field for the first time. Almost all mainstream approaches to HOI motion generation leverage diffusion models for their strong generative ability. Nevertheless, due to HOI's fine-grained nature, training diffusion well in a one-stage way is non-trivial. Existing state-of-the-art (SOTA) methods (e.g., Text2HOI and MF-MDM) alleviate this mainly via a coarse-to-fine, multi-stage paradigm. Although effective and practical, this paradigm prevents end-to-end training for optimal performance. In contrast, MoEG-HOI applies MoE to address this in a one-stage way with end-to-end trainability, allowing each expert to specialize in certain distinct HOI patterns and thereby alleviating each individual expert's training difficulty. However, naively applying MoE is not optimal because of two issues: (1) for expert design, the original MoE cannot explicitly characterize the hand's articulated structure at the hand, finger, and joint levels; and (2) for the expert routing mechanism, the characteristics of varying HOI action classes and diffusion noise levels are not taken into account. To address the first issue, MoE's experts are organized into groups corresponding to motion generation for the hand, fingers, and joints respectively, under semantic guidance from global to local; to facilitate this, the HOI text description is refined at the Hand-Finger-Joint levels using an LLM. For the second, during MoE routing, the HOI action label and the diffusion noise level are used jointly to select experts, better reflecting inter-class action variation and the dynamics of diffusion generation. SOTA performance on the ARCTIC, GRAB and H2O datasets demonstrates the effectiveness of our method.
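
A minimal sketch of the grouped routing idea described above, under assumed shapes: experts are organized into hand/finger/joint groups, and the router conditions expert selection on the input feature, an action-class embedding, and the diffusion noise level. The class name GroupedMoE, the per-sample dispatch loop, and all dimensions are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedMoE(nn.Module):
    def __init__(self, dim=128, experts_per_group=4, num_actions=20, top_k=2):
        super().__init__()
        self.groups = nn.ModuleDict({
            name: nn.ModuleList(
                nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
                for _ in range(experts_per_group))
            for name in ("hand", "finger", "joint")
        })
        self.action_emb = nn.Embedding(num_actions, dim)
        self.noise_emb = nn.Linear(1, dim)          # embeds the diffusion noise level / timestep
        self.router = nn.Linear(3 * dim, 3 * experts_per_group)
        self.top_k = top_k
        self.experts_per_group = experts_per_group

    def forward(self, x, action_id, t):             # x: (B, dim), t: (B, 1) in [0, 1]
        cond = torch.cat([x, self.action_emb(action_id), self.noise_emb(t)], dim=-1)
        logits = self.router(cond).view(-1, 3, self.experts_per_group)
        weights = F.softmax(logits, dim=-1)         # per-group routing weights
        out = torch.zeros_like(x)
        for g, name in enumerate(("hand", "finger", "joint")):
            top_w, top_i = weights[:, g].topk(self.top_k, dim=-1)
            for k in range(self.top_k):
                for b in range(x.size(0)):          # naive per-sample dispatch, for clarity only
                    expert = self.groups[name][int(top_i[b, k])]
                    out[b] += top_w[b, k] * expert(x[b])
        return out

moe = GroupedMoE()
x = torch.randn(4, 128)
y = moe(x, action_id=torch.randint(0, 20, (4,)), t=torch.rand(4, 1))
print(y.shape)  # torch.Size([4, 128])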

JBHI Journal 2026 Journal Article

Towards Unconstrained Fall Detection Using Vision Language Model: Dataset, Theory and Practices

  • Shiman Wu
  • Tianyi Chen
  • Zhihao Zha
  • Bin Wu
  • Yixin Li
  • Ran Wang
  • Yanan Li
  • Chong Tian

Unconstrained fall detection is essential for real-world applications. However, it remains underexplored due to the scarcity of real-world fall data and the limited generalization ability of existing methods. To address these challenges, we first introduce HUST-FALL, a fine-grained text-video dataset for unconstrained fall detection, featuring diverse fall scenarios and rich semantic annotations. Building on this dataset, we propose Action-R1, a lightweight vision-language model that leverages structured textual guidance and reasoning to improve the understanding of fall events. In challenging cross-dataset tests, Action-R1 achieves an average F1 score of 0.827 on three benchmarks, significantly outperforming conventional CNN/RNN-based methods. Despite having only 1/16 the parameters, Action-R1 achieves competitive performance against MiniCPM-V 2.6, even surpassing it on UPFall by 116.22%. These results demonstrate that Action-R1 is a lightweight yet powerful solution for unconstrained fall detection in real-world scenarios.

AAAI Conference 2025 Conference Paper

GeCC: Generalized Contrastive Clustering with Domain Shifts Modeling

  • Yujie Chen
  • Wenhui Wu
  • Le Ou-Yang
  • Ran Wang
  • Debby D. Wang

Contrastive clustering performs clustering and data representation in a unified model, where instance- and cluster-level contrastive learning are conducted simultaneously. However, commonly used data augmentation methods, while making the contrastive mechanism effective, may cause representation learning to get stuck on domain-specific information, which further deteriorates clustering performance and limits generalization ability. To this end, we propose a new framework, named Generalized Contrastive Clustering with domain shifts modeling (GeCC), which can integrate diverse domain knowledge to improve clustering performance. Specifically, we first design a cluster-guided domain shifts modeling module to synthesize a reference view with diverse domain information. Then, we introduce instance representation and cluster assignment contrastive modules with well-designed attention weights to guide representation learning and clustering. In this way, our method can maximize the extraction of cluster-related information and avoid over-fitting to domain-specific features. Experimental results on four benchmark datasets demonstrate that our proposed method consistently outperforms other state-of-the-art methods.
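
For readers unfamiliar with the two contrastive terms mentioned above (instance-level over embeddings, cluster-level over soft-assignment columns), here is a minimal generic sketch in PyTorch; it omits GeCC's domain-shifts modeling and attention weights, and all tensors are random placeholders.

import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.5):
    """Symmetric InfoNCE between matched rows of a and b."""
    a, b = F.normalize(a, dim=1), F.normalize(b, dim=1)
    logits = a @ b.t() / temperature            # (N, N) similarity matrix
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# z1/z2: instance embeddings of two augmented views; p1/p2: soft cluster assignments.
z1, z2 = torch.randn(64, 128), torch.randn(64, 128)
p1 = torch.softmax(torch.randn(64, 10), dim=1)
p2 = torch.softmax(torch.randn(64, 10), dim=1)

instance_loss = info_nce(z1, z2)                 # pulls matched instances together
cluster_loss = info_nce(p1.t(), p2.t())          # pulls matched cluster columns together
print(float(instance_loss), float(cluster_loss))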

TIST Journal 2025 Journal Article

Integrated Image-Text Augmentation for Few-Shot Learning in Vision-Language Models

  • Ran Wang
  • Hua Zuo
  • Zhen Fang
  • Jie Lu

Vision-language models, such as the Contrastive Language-Image Pre-Training (CLIP) model, have achieved significant success in image classification tasks. CLIP demonstrates high expressive power in few-shot learning scenarios due to its pairing of text and image encoders. However, CLIP still suffers from over-fitting when trained with a limited number of samples. To mitigate this, image augmentation techniques have been proposed for few-shot learning tasks to prevent over-fitting by enriching the dataset. Existing image augmentation methods, primarily designed for single-modal image models, focus solely on transformations within the image itself. For CLIP, however, merely increasing visual variety without considering textual content can reduce generalization ability and may even mislead the model. To address this issue, we introduce a novel image augmentation approach, Integrated Image-Text Augmentation (ITA), for the CLIP model in few-shot learning tasks. This method generates new and diverse augmented images to increase the diversity of the training data and reduce over-fitting. Additionally, ITA establishes an alignment between the augmented images and their textual descriptions. Through this alignment, the model not only learns to recognize visual elements in the images but also understands the semantic connections between these elements and the text descriptions. This dual-modal approach enhances the model's flexibility and accuracy in few-shot learning tasks. Extensive experiments in few-shot image classification scenarios demonstrate that ITA shows significant improvements compared to various image augmentation techniques.
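
A minimal sketch of the image-text alignment idea described above, using random placeholder features instead of real CLIP encoders: augmented views are aligned with their class text embedding via a standard CLIP-style contrastive loss. The encoder stand-ins and the noise-based "augmentation" are assumptions, not the ITA method itself.

import torch
import torch.nn.functional as F

def clip_alignment_loss(image_feats, text_feats, temperature=0.07):
    """Contrastive alignment between matched image/text rows."""
    image_feats = F.normalize(image_feats, dim=1)
    text_feats = F.normalize(text_feats, dim=1)
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(image_feats.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Few-shot batch: original image features plus augmented-view features, each
# paired with the text embedding of its class description.
orig = torch.randn(16, 512)                        # placeholder CLIP image features
augmented = orig + 0.1 * torch.randn_like(orig)    # stand-in for an ITA-style augmented view
text = torch.randn(16, 512)                        # placeholder CLIP text features

loss = clip_alignment_loss(orig, text) + clip_alignment_loss(augmented, text)
print(float(loss))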

NeurIPS Conference 2025 Conference Paper

PandaPose: 3D Human Pose Lifting from a Single Image via Propagating 2D Pose Prior to 3D Anchor Space

  • Jinghong Zheng
  • Changlong Jiang
  • Yang Xiao
  • Jiaqi Li
  • Haohong Kuang
  • Hang Xu
  • Ran Wang
  • Zhiguo Cao

3D human pose lifting from a single RGB image is a challenging task in 3D vision. Existing methods typically establish a direct joint-to-joint mapping from 2D to 3D poses based on 2D features. This formulation suffers from two fundamental limitations: inevitable error propagation from the input predicted 2D pose to 3D predictions and inherent difficulties in handling self-occlusion cases. In this paper, we propose PandaPose, a 3D human pose lifting approach via propagating 2D pose prior to 3D anchor space as the unified intermediate representation. Specifically, our 3D anchor space comprises: (1) Joint-wise 3D anchors in the canonical coordinate system, providing accurate and robust priors to mitigate 2D pose estimation inaccuracies. (2) Depth-aware joint-wise feature lifting that hierarchically integrates depth information to resolve self-occlusion ambiguities. (3) The anchor-feature interaction decoder that incorporates 3D anchors with lifted features to generate unified anchor queries encapsulating the joint-wise 3D anchor set, visual cues and geometric depth information. The anchor queries are further employed to facilitate anchor-to-joint ensemble prediction. Experiments on three well-established benchmarks (i.e., Human3.6M, MPI-INF-3DHP and 3DPW) demonstrate the superiority of our proposition. The substantial reduction in error by 14.7% compared to SOTA methods on the challenging conditions of Human3.6M and qualitative comparisons further showcase the effectiveness and robustness of our approach.
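
A minimal, speculative sketch of anchor-based lifting in the spirit of this abstract: each joint carries a learnable set of 3D anchors, a per-joint query predicts soft weights over that set, and the 3D joint is the weighted anchor sum plus a learned offset. Anchor counts, dimensions, and module names are assumptions; depth-aware feature lifting and the interaction decoder are omitted.

import torch
import torch.nn as nn

class AnchorLifter(nn.Module):
    def __init__(self, num_joints=17, anchors_per_joint=32, dim=128):
        super().__init__()
        # Joint-wise 3D anchors in a canonical space (learnable here for simplicity).
        self.anchors = nn.Parameter(torch.randn(num_joints, anchors_per_joint, 3))
        self.query_proj = nn.Linear(2, dim)           # lifts 2D joint coordinates to queries
        self.weight_head = nn.Linear(dim, anchors_per_joint)
        self.offset_head = nn.Linear(dim, 3)

    def forward(self, pose2d):                        # pose2d: (B, J, 2)
        q = self.query_proj(pose2d)                   # (B, J, dim)
        w = self.weight_head(q).softmax(dim=-1)       # (B, J, A) anchor weights
        base = torch.einsum('bja,jac->bjc', w, self.anchors)  # anchor-to-joint ensemble
        return base + self.offset_head(q)             # (B, J, 3)

lifter = AnchorLifter()
pose2d = torch.rand(2, 17, 2)
pose3d = lifter(pose2d)
print(pose3d.shape)  # torch.Size([2, 17, 3])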

AAAI Conference 2024 Conference Paper

Long-Tailed Partial Label Learning by Head Classifier and Tail Classifier Cooperation

  • Yuheng Jia
  • Xiaorui Peng
  • Ran Wang
  • Min-Ling Zhang

In partial label learning (PLL), each instance is associated with a set of candidate labels, among which only one is correct. Traditional PLL methods almost all implicitly assume that the class distribution is balanced. However, in real-world applications the class distribution is imbalanced or long-tailed, leading to the long-tailed partial label learning problem. Previous methods address this problem mainly by improving the ability to learn the tail classes, which sacrifices the performance of the head classes; conversely, preserving the performance of the head classes may degrade the performance of the tail classes. Therefore, in this paper we construct two classifiers, i.e., a head classifier for maintaining the performance of the dominant classes and a tail classifier for improving the performance of the tail classes. We then propose a classifier weight estimation module that automatically estimates the shot belongingness (head class or tail class) of each sample and allocates the weights of the head classifier and tail classifier when making predictions. This cooperation improves prediction for both the head classes and the tail classes. Experiments on the benchmarks demonstrate that the proposed approach improves upon the accuracy of SOTA methods by a substantial margin. Code and data are available at: https://github.com/pruirui/HTC-LTPLL.
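
The cooperation step can be pictured with a small sketch: a per-sample weight estimator predicts how "head" or "tail" a sample looks and blends the two classifiers' softmax outputs accordingly. This is an illustrative stand-in, not the released code at the repository above; all layer sizes and names are assumptions.

import torch
import torch.nn as nn

class HeadTailCooperation(nn.Module):
    def __init__(self, feat_dim=128, num_classes=10):
        super().__init__()
        self.head_clf = nn.Linear(feat_dim, num_classes)   # stronger on frequent classes
        self.tail_clf = nn.Linear(feat_dim, num_classes)   # stronger on rare classes
        self.weight_estimator = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, feats):                               # feats: (B, feat_dim)
        w = self.weight_estimator(feats)                    # (B, 1), per-sample head weight
        p_head = self.head_clf(feats).softmax(dim=-1)
        p_tail = self.tail_clf(feats).softmax(dim=-1)
        return w * p_head + (1 - w) * p_tail                # blended prediction

model = HeadTailCooperation()
probs = model(torch.randn(4, 128))
print(probs.shape, probs.sum(dim=-1))   # each row still sums to 1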