Arrow Research search

Author name cluster

Lingling Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

6 papers
1 author row

Possible papers

6

AAAI Conference 2026 · Conference Paper

Disentangled Hypergraph-Guided Mamba Scanning for Fine-Grained Visual Recognition

  • Zhongwei Xiong
  • Hao Wang
  • Xiaoyan Yu
  • Lingling Li
  • Xuezhuan Zhao
  • Taisong Jin

Fine-grained Visual Recognition (FGVR) aims to distinguish between categories with subtle inter-class differences and large intra-class variations. While Vision Transformers with attention mechanisms have been widely adopted for FGVR, they usually suffer from high computational complexity and entangled global representations. Recent advancements in state-space models, exemplified by Mamba, have showcased substantial potential in vision-related tasks due to their linear scalability and rich sequence-modeling capacity. Building on this, we propose DHMamba, a novel Mamba-based FGVR method that leverages a hypergraph to guide selective scanning and strengthen Mamba's capability to model fine-grained semantics. Furthermore, a Disentangled Local Scanning (DLS) module is introduced that uses hyperedges to allocate distinct informative patches to independent channels, mitigating representational entanglement. Extensive experiments on multiple FGVR benchmarks demonstrate that DHMamba outperforms state-of-the-art methods, validating the efficacy of combining state-space modeling with hypergraph-based feature structuring.
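
As a rough illustration of the idea behind hyperedge-guided scanning, the sketch below groups patch tokens into semantic clusters and emits one scan sequence per group, so related patches are scanned contiguously rather than in raster order. This is not the authors' code: the k-means grouping (real hyperedges may overlap), the function name, and parameters like `num_edges` are illustrative assumptions.

```python
# Hypothetical sketch: hyperedge-style grouping of patch tokens for scanning.
import numpy as np

def hyperedge_scan_order(patches: np.ndarray, num_edges: int = 4, iters: int = 10):
    """patches: (N, D) patch features. Returns one index sequence per group,
    approximating hyperedges with disjoint k-means clusters (a simplification;
    true hyperedges can share patches)."""
    n, _ = patches.shape
    rng = np.random.default_rng(0)
    centers = patches[rng.choice(n, num_edges, replace=False)]
    for _ in range(iters):
        # Assign each patch to its nearest group centroid.
        dists = ((patches[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for k in range(num_edges):
            members = patches[assign == k]
            if len(members):
                centers[k] = members.mean(0)
    # Each group yields an independent scan sequence, so semantically related
    # patches are scanned contiguously instead of in raster order.
    return [np.where(assign == k)[0] for k in range(num_edges)]

tokens = np.random.randn(196, 64).astype(np.float32)  # e.g. a 14x14 patch grid
for k, order in enumerate(hyperedge_scan_order(tokens)):
    print(f"group {k}: {len(order)} patches, first indices {order[:5]}")
```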

AAAI Conference 2026 · Conference Paper

Evolving Semantic Propagation for Aerial Semantic 3D Gaussian Splatting

  • Zihan Gao
  • Lingling Li
  • Xu Liu
  • Fang Liu
  • Licheng Jiao
  • Puhua Chen
  • Wenping Ma
  • Shuyuan Yang

Semantic understanding of large-scale aerial scenes represents a critical challenge in 3D computer vision, hindered by the prohibitive cost of dense annotation. This paper introduces EvoPropGS, a novel approach for the semantic segmentation of 3D Gaussian Splatting models that requires only minimal supervision. Our core insight is to leverage the inherent structural repetitions within aerial environments to propagate semantic information from a sparse set of annotations across the entire 3D scene. Our approach constructs a prompt library by pairing SAM-generated mask candidates with DINOv2 feature embeddings from annotated views. For unannotated regions, we generate pseudo-labels by matching region proposals against these prompts via cosine similarity. We then formulate optimal prompt selection as a discrete optimization problem solved via evolutionary search, guided by a novel fitness function that evaluates both 3D consistency and 2D semantic coherence. Extensive experiments demonstrate that EvoPropGS achieves accurate segmentation with only 2 percent of pixels annotated.
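
The cosine-similarity pseudo-labeling step the abstract describes can be made concrete with a short sketch. This is an illustrative simplification, not the paper's code: the threshold `thresh=0.7`, the feature dimension, and the function name are assumptions.

```python
# Hypothetical sketch: pseudo-labels via cosine matching against a prompt library.
import numpy as np

def cosine_pseudo_labels(regions, library_feats, library_labels, thresh=0.7):
    """regions: (R, D) region-proposal embeddings; library_feats: (P, D) prompt
    embeddings; library_labels: (P,) class ids. Returns (R,) labels, with -1
    for regions below the similarity threshold (left unlabeled)."""
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    p = library_feats / np.linalg.norm(library_feats, axis=1, keepdims=True)
    sims = r @ p.T                      # (R, P) cosine similarities
    best = sims.argmax(1)
    labels = library_labels[best].copy()
    labels[sims.max(1) < thresh] = -1   # low-confidence regions stay unlabeled
    return labels

rng = np.random.default_rng(0)
lib = rng.normal(size=(20, 384)).astype(np.float32)   # e.g. DINOv2-small width
lab = rng.integers(0, 5, size=20)
print(cosine_pseudo_labels(rng.normal(size=(8, 384)), lib, lab))
```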

AAAI Conference 2026 · Conference Paper

HTTrack: Learning to Perceive Targets via Historical Trajectories in Satellite Video Tracking

  • Jiahao Wang
  • Fang Liu
  • Licheng Jiao
  • Hao Wang
  • Shuo Li
  • Xinyi Wang
  • Lingling Li
  • Puhua Chen

In recent years, the rapid progress of deep learning has driven notable advancements in satellite video tracking, a critical task for applications such as environmental monitoring, disaster management, and defense. Despite these strides, existing approaches remain constrained by their inability to handle dynamic challenges, such as target appearance variations, complex motion patterns, and occlusions. Traditional methods often rely on static template matching or overly complex update mechanisms, compromising their robustness and practicality in real-world scenarios. To address these limitations, we propose a paradigm shift in satellite video tracking by integrating historical trajectory knowledge with visual features. This fusion enhances the tracker's perceptual understanding of targets over time, enabling more adaptive and resilient tracking. By aligning spatial, temporal, and cross-modal information, our approach effectively bridges the gap between fragmented observations and coherent tracking performance, even under challenging conditions like small targets and cluttered backgrounds. Extensive experiments on multiple satellite video tracking benchmarks demonstrate the superiority of our method: HTTrack achieves success rates of 51.5% on SV248S, 52.9% on SatSOT, and 32.6% on VISO, significantly outperforming state-of-the-art trackers. These results mark a step toward robust, accurate, and scalable satellite video tracking.
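
A minimal sketch of the general idea of fusing trajectory history with appearance, not HTTrack's architecture: extrapolate the next position from past centers and use it as a motion prior that reweights the visual response map. The constant-velocity model, Gaussian prior, and `sigma` value are assumptions for illustration.

```python
# Hypothetical sketch: trajectory prior fused with a visual similarity map.
import numpy as np

def fuse_trajectory_prior(response, history, sigma=4.0):
    """response: (H, W) visual similarity map; history: list of (row, col)
    past target centers. Weights the response by a Gaussian motion prior
    centered at a linearly extrapolated next position."""
    (r0, c0), (r1, c1) = history[-2], history[-1]
    pred = (2 * r1 - r0, 2 * c1 - c0)   # constant-velocity extrapolation
    h, w = response.shape
    rows, cols = np.mgrid[0:h, 0:w]
    prior = np.exp(-((rows - pred[0]) ** 2 + (cols - pred[1]) ** 2) / (2 * sigma ** 2))
    fused = response * prior            # suppress visually similar distractors
    return np.unravel_index(fused.argmax(), fused.shape)

resp = np.random.rand(64, 64)
print(fuse_trajectory_prior(resp, [(30, 30), (32, 33)]))
```

The prior helps exactly where satellite video is hardest: when the target is a few pixels wide, appearance alone cannot separate it from look-alike clutter, but its recent motion can.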

AAAI Conference 2026 · Conference Paper

SOAR: Semi-Supervised Open-Vocabulary Aerial Object Detection via Dual-Aware Enhanced Prior Denoising

  • Xu Liu
  • Yihong Huang
  • Dan Zhang
  • Lingling Li
  • Long Sun
  • Licheng Jiao

Open-Vocabulary Object Detection (OVOD) shows promise in remote sensing (RS), but RS imagery poses distinct challenges, such as the predominance of background regions, sparse labels, limited semantic information, and difficulties in semi-supervised training. To tackle these challenges, we propose Semi-Supervised Open-Vocabulary Aerial Object Detection with Dual-Perception Prior Denoising (SOAR), which explicitly models the background embeddings of each scene to indirectly construct foreground priors, thereby capitalizing on the abundant background information present in RS imagery. We further introduce a query enhancement module that integrates language and foreground prior information to enhance the effectiveness of query selection and feature augmentation. During the decoding stage of semi-supervised training, we perform denoising and reconstruction of the foreground priors to generate pseudo-labels that support the training process. Additionally, we address the sparsity of label information through expansion and aggregation techniques, further improving model performance. Experimental evaluations reveal that, in the open-vocabulary object detection task on the DIOR dataset, our method achieves a mean Average Precision (mAP) of 68.5% and a Harmonic Mean (HM) of 55.9%, outperforming the previous state-of-the-art model's mAP of 61.6% and HM of 53.6%. Our approach offers a novel solution to the open-vocabulary challenge in aerial object detection.
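
To make the "model background to get foreground priors" idea concrete, here is a hedged sketch under simple assumptions: average the embeddings of (pseudo-)background patches into a scene background vector, then score foreground-ness as dissimilarity to it. The function name, the mean-pooled background model, and the mask source are all hypothetical simplifications of whatever SOAR actually does.

```python
# Hypothetical sketch: foreground prior from an explicit background embedding.
import numpy as np

def foreground_prior(patch_feats, background_mask):
    """patch_feats: (N, D) patch embeddings for one scene; background_mask:
    (N,) boolean, True where a patch is treated as background. Models the
    scene background explicitly and scores foreground-ness indirectly."""
    feats = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    bg = feats[background_mask].mean(0)
    bg /= np.linalg.norm(bg)
    # High dissimilarity to the scene background embedding => likely foreground.
    return 1.0 - feats @ bg

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 256)).astype(np.float32)
mask = rng.random(100) < 0.8          # RS scenes are background-dominated
print(foreground_prior(feats, mask)[:5])
```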

IJCAI Conference 2025 · Conference Paper

Language-Guided Hybrid Representation Learning for Visual Grounding on Remote Sensing Images

  • Biao Liu
  • Xu Liu
  • Lingling Li
  • Licheng Jiao
  • Fang Liu
  • Xinyu Sun
  • Youlin Huang

Visual grounding (VG) refers to detecting specific objects in images based on linguistic expressions, and it has profound significance for the advanced interpretation of natural images. In remote sensing image interpretation, visual grounding is limited by characteristics such as complex scenes and diverse object sizes. To solve this problem, we propose a novel remote sensing visual grounding (RSVG) framework, named the language-guided hybrid representation learning Transformer (LGFormer). Specifically, we design a multimodal dual-encoder Transformer structure called the adaptive multimodal feature fusion module. This structure innovatively integrates text and visual features as hybrid queries, enabling early-stage decoding queries to perceive the target position accurately. The information from the two modal encoders is then aggregated by the hybrid queries to obtain the final object embedding for coordinate regression. In addition, a multi-scale cross-modal feature enhancement module (MSCM) is designed to enhance the self-representation of the extracted text and visual features and align them semantically. For the hybrid queries, we use linguistic guidance to select visual features as the visual part and sentence-level features as the textual part. Finally, LGFormer achieves the best results among existing models on the DIOR-RSVG and OPT-RSVG datasets.
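
The hybrid-query construction can be illustrated with a short sketch: pick the visual tokens most aligned with the sentence embedding (the visual part) and pair each with the sentence-level feature (the textual part). This is an assumption-laden simplification, not LGFormer's implementation; `k`, the concatenation layout, and the function name are invented for illustration.

```python
# Hypothetical sketch: language-guided hybrid queries for a grounding decoder.
import numpy as np

def build_hybrid_queries(visual_tokens, sentence_feat, k=8):
    """visual_tokens: (N, D); sentence_feat: (D,). Selects the k visual tokens
    most aligned with the sentence embedding and pairs each with the
    sentence-level feature to form hybrid decoder queries."""
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    s = sentence_feat / np.linalg.norm(sentence_feat)
    topk = np.argsort(v @ s)[-k:]               # language-guided selection
    text_part = np.tile(sentence_feat, (k, 1))  # same sentence feature per query
    return np.concatenate([visual_tokens[topk], text_part], axis=1)  # (k, 2D)

rng = np.random.default_rng(0)
queries = build_hybrid_queries(rng.normal(size=(196, 256)), rng.normal(size=256))
print(queries.shape)  # (8, 512): hybrid queries fed to the decoder
```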

AAAI Conference 2024 · Conference Paper

ViLT-CLIP: Video and Language Tuning CLIP with Multimodal Prompt Learning and Scenario-Guided Optimization

  • Hao Wang
  • Fang Liu
  • Licheng Jiao
  • Jiahao Wang
  • Zehua Hao
  • Shuo Li
  • Lingling Li
  • Puhua Chen

Pre-trained vision-language (V-L) models such as CLIP have demonstrated impressive zero-shot performance on many downstream tasks. Since training CLIP-style contrastive models on video-text pairs is limited by high cost and data scale, recent approaches focus on efficiently transferring the image-based CLIP to the video domain. A major finding is that fine-tuning the pre-trained model for strong fully supervised performance degrades zero-shot, few-shot, and base-to-novel generalization, while freezing the backbone network to maintain generalization weakens fully supervised performance. Moreover, no single prompt-tuning branch consistently performs optimally. In this work, we propose a multimodal prompt learning scheme that balances supervised and generalized performance. Our prompting approach contains three parts: 1) independent prompts on both the vision and text branches to learn the language and visual contexts; 2) inter-modal prompt mapping to ensure mutual synergy; 3) reducing the discrepancy between the hand-crafted prompt ("a video of a person doing [CLS]") and the learnable prompt, to alleviate forgetting of essential video scenarios. Extensive validation in fully supervised, zero-shot, few-shot, and base-to-novel generalization settings for video recognition indicates that the proposed approach achieves competitive performance at lower compute cost.
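
A minimal sketch of the three listed ingredients, under stated assumptions rather than the paper's actual design: learnable text prompts, visual prompts produced by an inter-modal linear mapping, and an MSE regularizer toward a frozen hand-crafted prompt embedding. The class name, dimensions, and the choice of MSE are hypothetical.

```python
# Hypothetical sketch: multimodal prompts with inter-modal mapping (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalPrompts(nn.Module):
    """Learnable text prompts, visual prompts projected from them, and a
    regularizer pulling the learned prompt toward a frozen hand-crafted
    prompt embedding (e.g. "a video of a person doing [CLS]")."""
    def __init__(self, n_prompts=8, txt_dim=512, vis_dim=768):
        super().__init__()
        self.text_prompts = nn.Parameter(torch.randn(n_prompts, txt_dim) * 0.02)
        # Inter-modal mapping: visual prompts are derived from text prompts
        # so the two branches stay in synergy rather than drifting apart.
        self.txt_to_vis = nn.Linear(txt_dim, vis_dim)

    def forward(self, handcrafted_embed):
        vis_prompts = self.txt_to_vis(self.text_prompts)
        # Discrepancy loss: keep learned prompts near the hand-crafted one
        # to reduce forgetting of essential video scenarios.
        reg = F.mse_loss(self.text_prompts.mean(0), handcrafted_embed)
        return self.text_prompts, vis_prompts, reg

m = MultimodalPrompts()
t, v, reg = m(torch.randn(512))
print(t.shape, v.shape, reg.item())  # prompts prepended to CLIP token streams
```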