Arrow Research search

Author name cluster

Deyi Ji

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
2 author rows

Possible papers (7)

AAAI 2026 · Conference Paper

Multi-Agent VLMs Guided Self-Training with PNU Loss for Low-Resource Offensive Content Detection

  • Han Wang
  • Deyi Ji
  • Junyu Lu
  • Lanyun Zhu
  • Hailong Zhang
  • Haiyang Wu
  • Liqun Liu
  • Peng Shu

Accurate detection of offensive content on social media demands high-quality labeled data; however, such data is often scarce due to the low prevalence of offensive instances and the high cost of manual annotation. To address this low-resource challenge, we propose a self-training framework that leverages abundant unlabeled data through collaborative pseudo-labeling. Starting with a lightweight classifier trained on limited labeled data, our method iteratively assigns pseudo-labels to unlabeled instances with the support of Multi-Agent Vision-Language Models (MA-VLMs). Unlabeled data on which the classifier and MA-VLMs agree are designated as the Agreed-Unknown set, while conflicting samples form the Disagreed-Unknown set. To enhance label reliability, MA-VLMs simulate dual perspectives, moderator and user, capturing both regulatory and subjective viewpoints. The classifier is optimized using a novel Positive-Negative-Unlabeled (PNU) loss, which jointly exploits labeled, Agreed-Unknown, and Disagreed-Unknown data while mitigating pseudo-label noise. Experiments on benchmark datasets demonstrate that our framework substantially outperforms baselines under limited supervision and approaches the performance of large-scale models.
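The abstract does not spell out the exact form of the PNU loss, so the following is only a rough illustrative sketch, assuming the objective is a weighted sum of a supervised term and down-weighted terms over the Agreed-Unknown and Disagreed-Unknown pseudo-labeled sets; the function names and weights are hypothetical, not the paper's formulation.

```python
# Illustrative only: a hypothetical PNU-style objective combining labeled data
# with pseudo-labeled Agreed-Unknown and Disagreed-Unknown sets.
import numpy as np

def logistic_loss(logits, labels):
    """Binary logistic loss; labels are 0/1, logits are raw scores."""
    margins = (2 * np.asarray(labels) - 1) * np.asarray(logits)
    return float(np.mean(np.log1p(np.exp(-margins))))

def pnu_style_loss(logits_lab, y_lab,
                   logits_agree, y_agree,
                   logits_disagree, y_disagree,
                   w_agree=0.5, w_disagree=0.1):
    """Weighted combination; Disagreed-Unknown samples (classifier vs. MA-VLMs
    conflict) get a smaller weight because their pseudo-labels are noisier."""
    loss = logistic_loss(logits_lab, y_lab)
    if len(logits_agree):
        loss += w_agree * logistic_loss(logits_agree, y_agree)
    if len(logits_disagree):
        loss += w_disagree * logistic_loss(logits_disagree, y_disagree)
    return loss
```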

ICML 2025 · Conference Paper

CPCF: A Cross-Prompt Contrastive Framework for Referring Multimodal Large Language Models

  • Lanyun Zhu
  • Deyi Ji
  • Tianrun Chen
  • Haiyang Wu
  • De Wen Soh
  • Jun Liu 0036

Referring MLLMs extend conventional multimodal large language models by allowing them to receive referring visual prompts and generate responses tailored to the indicated regions. However, these models often suffer from suboptimal performance because their responses are drawn toward misleading areas adjacent to or similar to the target region. This work introduces CPCF, a novel framework to address this issue and achieve superior results. CPCF contrasts outputs generated from the indicated visual prompt with those from contrastive prompts sampled from misleading regions, effectively suppressing the influence of erroneous information outside the target region on response generation. To further enhance the effectiveness and efficiency of the framework, several novel designs are proposed, including a prompt extraction network to automatically identify suitable contrastive prompts, a self-training method that leverages unlabeled data to improve training quality, and a distillation approach to reduce the additional computational overhead associated with contrastive decoding. Incorporating these designs, CPCF achieves state-of-the-art performance, as demonstrated by extensive experiments across multiple benchmarks. Project page: https://lanyunzhu.site/CPCF/
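The contrastive step resembles standard contrastive decoding; a minimal sketch under that assumption is given below (the interpolation form and the value of alpha are illustrative, not the paper's exact formulation).

```python
import numpy as np

def contrastive_logits(logits_target, logits_contrastive, alpha=0.5):
    """Boost tokens supported by the indicated prompt and suppress tokens
    favored by contrastive prompts sampled from misleading regions."""
    return (1.0 + alpha) * np.asarray(logits_target) - alpha * np.asarray(logits_contrastive)

# toy usage: average the logits of several contrastive prompts before contrasting
logits_t = np.array([2.0, 0.5, -1.0])
logits_c = np.mean([np.array([1.8, 1.2, -0.5]),
                    np.array([2.1, 1.0, -0.8])], axis=0)
adjusted = contrastive_logits(logits_t, logits_c)
```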

NeurIPS 2025 · Conference Paper

Retrv-R1: A Reasoning-Driven MLLM Framework for Universal and Efficient Multimodal Retrieval

  • Lanyun Zhu
  • Deyi Ji
  • Tianrun Chen
  • Haiyang Wu
  • Shiqi Wang

The success of DeepSeek-R1 demonstrates the immense potential of using reinforcement learning (RL) to enhance LLMs' reasoning capabilities. This paper introduces Retrv-R1, the first R1-style MLLM specifically designed for multimodal universal retrieval, achieving higher performance by employing step-by-step reasoning to produce more accurate retrieval results. We find that directly applying the methods of DeepSeek-R1 to retrieval tasks is not feasible, mainly due to (1) the high computational cost caused by the large token consumption required for multiple candidates with reasoning processes, and (2) the instability and suboptimal results when directly applying RL to train for retrieval tasks. To address these issues, Retrv-R1 introduces an information compression module with a details inspection mechanism, which enhances computational efficiency by reducing the number of tokens while ensuring that critical information for challenging candidates is preserved. Additionally, a new training paradigm is proposed, including an activation stage using a retrieval-tailored synthetic CoT dataset for more effective optimization, followed by RL with a novel curriculum reward to improve both performance and efficiency. Incorporating these novel designs, Retrv-R1 achieves SOTA performance, high efficiency, and strong generalization ability, as demonstrated by extensive experiments across multiple benchmarks and tasks.
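The abstract only names a "curriculum reward" that improves both performance and efficiency; the sketch below is a hypothetical illustration of how such a reward might be scheduled (the constants, the linear ramp, and the token-budget term are all assumptions).

```python
def curriculum_reward(correct: bool, tokens_used: int, step: int,
                      max_tokens: int = 2048, warmup_steps: int = 1000) -> float:
    """Hypothetical schedule: reward correctness early, phase in efficiency later."""
    accuracy_term = 1.0 if correct else 0.0
    efficiency_term = max(0.0, 1.0 - tokens_used / max_tokens)
    lam = 0.5 * min(1.0, step / warmup_steps)   # efficiency weight ramps from 0 to 0.5
    return (1.0 - lam) * accuracy_term + lam * efficiency_term
```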

ICML 2024 · Conference Paper

Discrete Latent Perspective Learning for Segmentation and Detection

  • Deyi Ji
  • Feng Zhao 0004
  • Lanyun Zhu
  • Wenwei Jin
  • Hongtao Lu 0001
  • Jieping Ye

In this paper, we address the challenge of Perspective-Invariant Learning in machine learning and computer vision, which involves enabling a network to understand images from varying perspectives to achieve consistent semantic interpretation. While standard approaches rely on the labor-intensive collection of multi-view images or limited data augmentation techniques, we propose a novel framework, Discrete Latent Perspective Learning (DLPL), for latent multi-perspective fusion learning using conventional single-view images. DLPL comprises three main modules: Perspective Discrete Decomposition (PDD), Perspective Homography Transformation (PHT), and Perspective Invariant Attention (PIA), which work together to discretize visual features, transform perspectives, and fuse multi-perspective semantic information, respectively. DLPL is a universal perspective learning framework applicable to a variety of scenarios and vision tasks. Extensive experiments demonstrate that DLPL significantly enhances the network’s capacity to depict images across diverse scenarios (daily photos, UAV, auto-driving) and tasks (detection, segmentation).
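As a loose illustration of the perspective-transformation idea (a stand-in, not the paper's PHT module), a feature map can be warped with a 3x3 homography; the matrix values below are arbitrary.

```python
import numpy as np
import cv2  # opencv-python

def warp_with_homography(feat: np.ndarray, homography: np.ndarray) -> np.ndarray:
    """Warp an HxWxC map with a 3x3 homography (illustrative stand-in for PHT)."""
    h, w = feat.shape[:2]
    return cv2.warpPerspective(feat, homography, (w, h))

# toy usage: a mild synthetic perspective change on a random 3-channel map
feat = np.random.rand(64, 64, 3).astype(np.float32)
H = np.array([[1.0, 0.05, 0.0],
              [0.02, 1.0, 0.0],
              [1e-4, 1e-4, 1.0]], dtype=np.float32)
warped = warp_with_homography(feat, H)
```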

IJCAI 2024 · Conference Paper

PPTFormer: Pseudo Multi-Perspective Transformer for UAV Segmentation

  • Deyi Ji
  • Wenwei Jin
  • Hongtao Lu
  • Feng Zhao

The growing adoption of Unmanned Aerial Vehicles (UAVs) across various fields necessitates effective UAV image segmentation, which faces challenges due to the dynamic perspectives of UAV-captured images. Traditional segmentation algorithms falter because they cannot accurately mimic the complexity of UAV perspectives, and the cost of obtaining multi-perspective labeled datasets is prohibitive. To address these issues, we introduce PPTFormer, a novel Pseudo Multi-Perspective Transformer network for UAV image segmentation. Our approach circumvents the need for actual multi-perspective data by creating pseudo perspectives for enhanced multi-perspective learning. The PPTFormer network combines Perspective Decomposition, novel Perspective Prototypes, and a specialized encoder and decoder that together achieve superior segmentation results through Pseudo Multi-Perspective Attention (PMP Attention) and fusion. Our experiments demonstrate that PPTFormer achieves state-of-the-art performance across five UAV segmentation datasets, confirming its ability to effectively simulate UAV flight perspectives and significantly advance segmentation precision. This work represents a pioneering step in UAV scene understanding and sets a new benchmark for future developments in semantic segmentation.
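One plausible, purely hypothetical way to create pseudo perspectives from single-view images is to sample random homographies by jittering the image corners; the paper's actual construction may differ.

```python
import numpy as np
import cv2  # opencv-python

def random_pseudo_perspectives(h: int, w: int, n: int = 4,
                               jitter: float = 0.1, seed: int = 0):
    """Sample n random homographies by perturbing the four image corners."""
    rng = np.random.default_rng(seed)
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    mats = []
    for _ in range(n):
        shift = (rng.uniform(-jitter, jitter, size=(4, 2)) * [w, h]).astype(np.float32)
        mats.append(cv2.getPerspectiveTransform(corners, corners + shift))
    return mats
```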

IJCAI 2023 · Conference Paper

Guided Patch-Grouping Wavelet Transformer with Spatial Congruence for Ultra-High Resolution Segmentation

  • Deyi Ji
  • Feng Zhao
  • Hongtao Lu

Most existing ultra-high resolution (UHR) segmentation methods struggle to balance memory cost against local characterization accuracy; our proposed Guided Patch-Grouping Wavelet Transformer (GPWFormer) takes both into account and achieves impressive performance. GPWFormer is a Transformer (T)-CNN (C) mutual learning framework, in which T takes the whole UHR image as input and harvests both local details and fine-grained long-range contextual dependencies, while C takes the downsampled image as input to learn the category-wise deep context. For high inference speed and low computational complexity, T partitions the original UHR image into patches, groups them dynamically, and learns the low-level local details with the lightweight multi-head Wavelet Transformer (WFormer) network. Fine-grained long-range contextual dependencies are also captured during this process, since patches that are far apart in the spatial domain can be assigned to the same group. In addition, masks produced by C are used to guide the patch grouping, providing a heuristic decision. Moreover, congruence constraints between the two branches are exploited to maintain spatial consistency among the patches. Overall, we stack the multi-stage process in a pyramid manner. Experiments show that GPWFormer outperforms existing methods with significant improvements on five benchmark datasets.
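As a rough sketch of mask-guided patch grouping: the grouping rule below (dominant predicted class per patch) is an assumption for illustration, not the paper's exact heuristic.

```python
import numpy as np

def group_patches_by_mask(mask: np.ndarray, patch: int = 64):
    """Group patches by the dominant class in a coarse guidance mask, so
    spatially distant patches of the same category can share a group."""
    h, w = mask.shape
    groups = {}
    for y in range(0, h - h % patch, patch):
        for x in range(0, w - w % patch, patch):
            tile = mask[y:y + patch, x:x + patch]
            dominant = int(np.bincount(tile.ravel()).argmax())
            groups.setdefault(dominant, []).append((y, x))
    return groups

# toy usage: a fake 2-class guidance mask
mask = (np.random.rand(512, 512) > 0.5).astype(np.int64)
print({cls: len(locs) for cls, locs in group_patches_by_mask(mask).items()})
```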

AAAI 2021 · Conference Paper

Context-Aware Graph Convolution Network for Target Re-identification

  • Deyi Ji
  • Haoran Wang
  • Hanzhe Hu
  • Weihao Gan
  • Wei Wu
  • Junjie Yan

Most existing re-identification methods focus on learning robust and discriminative features with deep convolutional networks. However, many of them consider content similarity in isolation and fail to utilize the context information of the query and gallery sets, e.g., probe-gallery and gallery-gallery relations, so hard samples may not be resolved well given limited or even misleading information. In this paper, we present a novel Context-Aware Graph Convolution Network (CAGCN), where probe-gallery relations are encoded into the graph nodes and the graph edge connections are controlled by the gallery-gallery relations. In this way, hard samples can be addressed via context information flowing among other, easier samples during graph reasoning. Specifically, we adopt an effective hard gallery sampler that obtains high recall for positive samples while keeping a reasonable graph size, which also alleviates the imbalance problem in training at low computational complexity. Experiments show that the proposed method achieves state-of-the-art performance on both person and vehicle re-identification datasets in a plug-and-play fashion with limited overhead.
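A minimal sketch of the graph construction as described (top-k probe-gallery recall for nodes, gallery-gallery similarity gating the edges); the threshold, k, and the cosine-similarity choice are assumptions, not the paper's exact settings.

```python
import numpy as np

def build_context_graph(probe_feat, gallery_feats, k=10, tau=0.7):
    """Nodes: top-k gallery candidates recalled by probe-gallery similarity
    (hard-sampler style). Edges: gallery-gallery similarities above tau."""
    sims = gallery_feats @ probe_feat           # probe-gallery relations
    nodes = np.argsort(-sims)[:k]               # keep a small candidate graph
    gg = gallery_feats[nodes] @ gallery_feats[nodes].T
    adj = (gg > tau).astype(np.float32)         # gallery-gallery relations gate edges
    np.fill_diagonal(adj, 1.0)
    return nodes, adj

# toy usage with L2-normalized random features
rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 128))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
probe = rng.normal(size=128)
probe /= np.linalg.norm(probe)
nodes, adj = build_context_graph(probe, gallery)
```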