Author name cluster

Wenwei Jin

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

3 papers

2 author rows

NeurIPS Conference 2025 Conference Paper

FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models

Jintao Tong
Wenwei Jin
Pengda Qin
Anqi Li
Yixiong Zou
Yuhong Li
Yuhua Li
Ruixuan Li

Large vision-language models (LVLMs) excel at multimodal understanding but suffer from high computational costs due to redundant vision tokens. Existing pruning methods typically rely on single-layer attention scores to rank and prune redundant visual tokens to solve this inefficiency. However, as the interaction between tokens and layers is complicated, this raises a basic question: Is such a simple single-layer criterion sufficient to identify redundancy? To answer this question, we rethink the emergence of redundant visual tokens from a fundamental perspective: information flow, which models the interaction between tokens and layers by capturing how information moves between tokens across layers. We find (1) the CLS token acts as an information relay, which can simplify the complicated flow analysis; (2) the redundancy emerges progressively and dynamically via layer-wise attention concentration; and (3) relying solely on attention scores from single layers can lead to contradictory redundancy identification. Based on this, we propose FlowCut, an information-flow-aware pruning framework, mitigating the insufficiency of the current criterion for identifying redundant tokens and better aligning with the model's inherent behaviors. Extensive experiments show FlowCut achieves superior results, outperforming SoTA by 1. 6% on LLaVA-1. 5-7B with 88. 9% token reduction, and by 4. 3% on LLaVA-NeXT-7B with 94. 4% reduction, delivering 3. 2$\times$ speed-up in the prefilling stage. Our code is available at https: //github. com/TungChintao/FlowCut.

PDF Details

ICML Conference 2024 Conference Paper

Discrete Latent Perspective Learning for Segmentation and Detection

Deyi Ji
Feng Zhao 0004
Lanyun Zhu
Wenwei Jin
Hongtao Lu 0001
Jieping Ye

In this paper, we address the challenge of Perspective-Invariant Learning in machine learning and computer vision, which involves enabling a network to understand images from varying perspectives to achieve consistent semantic interpretation. While standard approaches rely on the labor-intensive collection of multi-view images or limited data augmentation techniques, we propose a novel framework, Discrete Latent Perspective Learning (DLPL), for latent multi-perspective fusion learning using conventional single-view images. DLPL comprises three main modules: Perspective Discrete Decomposition (PDD), Perspective Homography Transformation (PHT), and Perspective Invariant Attention (PIA), which work together to discretize visual features, transform perspectives, and fuse multi-perspective semantic information, respectively. DLPL is a universal perspective learning framework applicable to a variety of scenarios and vision tasks. Extensive experiments demonstrate that DLPL significantly enhances the network’s capacity to depict images across diverse scenarios (daily photos, UAV, auto-driving) and tasks (detection, segmentation).

Details

IJCAI Conference 2024 Conference Paper

PPTFormer: Pseudo Multi-Perspective Transformer for UAV Segmentation

Deyi Ji
Wenwei Jin
Hongtao Lu
Feng Zhao

The ascension of Unmanned Aerial Vehicles (UAVs) in various fields necessitates effective UAV image segmentation, which faces challenges due to the dynamic perspectives of UAV-captured images. Traditional segmentation algorithms falter as they cannot accurately mimic the complexity of UAV perspectives, and the cost of obtaining multi-perspective labeled datasets is prohibitive. To address these issues, we introduce the PPTFormer, a novel Pseudo Multi-Perspective Transformer network that revolutionizes UAV image segmentation. Our approach circumvents the need for actual multi-perspective data by creating pseudo perspectives for enhanced multi-perspective learning. The PPTFormer network boasts Perspective Decomposition, novel Perspective Prototypes, and a specialized encoder and decoder that together achieve superior segmentation results through Pseudo Multi-Perspective Attention (PMP Attention) and fusion. Our experiments demonstrate that PPTFormer achieves state-of-the-art performance across five UAV segmentation datasets, confirming its capability to effectively simulate UAV flight perspectives and significantly advance segmentation precision. This work presents a pioneering leap in UAV scene understanding and sets a new benchmark for future developments in semantic segmentation.

PDF Details DOI