Author name cluster

Junlong Du

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers

2 author rows

ICLR Conference 2025 Conference Paper

ToVE: Efficient Vision-Language Learning via Knowledge Transfer from Vision Experts

Yuanchen Wu
Junlong Du
Ke Yan
Shouhong Ding
Xiaoqiang Li 0002

Vision-language (VL) learning requires extensive visual perception capabilities, such as fine-grained object recognition and spatial perception. Recent works typically rely on training huge models on massive datasets to develop these capabilities. As a more efficient alternative, this paper proposes a new framework that Transfers the knowledge from a hub of Vision Experts (ToVE) for efficient VL learning, leveraging pre-trained vision expert models to promote visual perception capability. Specifically, building on a frozen CLIP image encoder that provides vision tokens for image-conditioned language generation, ToVE introduces a hub of multiple vision experts and a token-aware gating network that dynamically routes expert knowledge to vision tokens. In the transfer phase, we propose a "residual knowledge transfer" strategy, which not only preserves the generalizability of the vision tokens but also allows selective detachment of low-contributing experts to improve inference efficiency. Further, we explore to merge these expert knowledge to a single CLIP encoder, creating a knowledge-merged CLIP that produces more informative vision tokens without expert inference during deployment. Experiment results across various VL tasks demonstrate that the proposed ToVE achieves competitive performance with two orders of magnitude fewer training data.

Details

AAAI Conference 2024 Conference Paper

MmAP: Multi-Modal Alignment Prompt for Cross-Domain Multi-Task Learning

Yi Xin
Junlong Du
Qiang Wang
Ke Yan
Shouhong Ding

Multi-Task Learning (MTL) is designed to train multiple correlated tasks simultaneously, thereby enhancing the performance of individual tasks. Typically, a multi-task network structure consists of a shared backbone and task-specific decoders. However, the complexity of the decoders increases with the number of tasks. To tackle this challenge, we integrate the decoder-free vision-language model CLIP, which exhibits robust zero-shot generalization capability. Recently, parameter-efficient transfer learning methods have been extensively explored with CLIP for adapting to downstream tasks, where prompt tuning showcases strong potential. Nevertheless, these methods solely fine-tune a single modality (text or visual), disrupting the modality structure of CLIP. In this paper, we first propose Multi-modal Alignment Prompt (MmAP) for CLIP, which aligns text and visual modalities during fine-tuning process. Building upon MmAP, we develop an innovative multi-task prompt learning framework. On the one hand, to maximize the complementarity of tasks with high similarity, we utilize a gradient-driven task grouping method that partitions tasks into several disjoint groups and assign a group-shared MmAP to each group. On the other hand, to preserve the unique characteristics of each task, we assign an task-specific MmAP to each task. Comprehensive experiments on two large multi-task learning datasets demonstrate that our method achieves significant performance improvements compared to full fine-tuning while only utilizing approximately ~ 0.09% of trainable parameters.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

V-PETL Bench: A Unified Visual Parameter-Efficient Transfer Learning Benchmark

Yi Xin
Siqi Luo
Xuyang Liu
Yuntao Du
Haodi Zhou
Xinyu Cheng
Christina Lee
Junlong Du

Parameter-efficient transfer learning (PETL) methods show promise in adapting a pre-trained model to various downstream tasks while training only a few parameters. In the computer vision (CV) domain, numerous PETL algorithms have been proposed, but their direct employment or comparison remains inconvenient. To address this challenge, we construct a Unified Visual PETL Benchmark (V-PETL Bench) for the CV domain by selecting 30 diverse, challenging, and comprehensive datasets from image recognition, video action recognition, and dense prediction tasks. On these datasets, we systematically evaluate 25 dominant PETL algorithms and open-source a modular and extensible codebase for fair evaluation of these algorithms. V-PETL Bench runs on NVIDIA A800 GPUs and requires approximately 310 GPU days. We release all the benchmark, making it more efficient and friendly to researchers. Additionally, V-PETL Bench will be continuously updated for new PETL algorithms and CV tasks.

PDF Details DOI

AAAI Conference 2024 Conference Paper

VMT-Adapter: Parameter-Efficient Transfer Learning for Multi-Task Dense Scene Understanding

Yi Xin
Junlong Du
Qiang Wang
Zhiwen Lin
Ke Yan

Large-scale pre-trained models have achieved remarkable success in various computer vision tasks. A standard approach to leverage these models is to fine-tune all model parameters for downstream tasks, which poses challenges in terms of computational and storage costs. Recently, inspired by Natural Language Processing (NLP), parameter-efficient transfer learning has been successfully applied to vision tasks. However, most existing techniques primarily focus on single-task adaptation, and despite limited research on multi-task adaptation, these methods often exhibit suboptimal training/inference efficiency. In this paper, we first propose an once-for-all Vision Multi-Task Adapter (VMT-Adapter), which strikes approximately O(1) training and inference efficiency w.r.t task number. Concretely, VMT-Adapter shares the knowledge from multiple tasks to enhance cross-task interaction while preserves task-specific knowledge via independent knowledge extraction modules. Notably, since task-specific modules require few parameters, VMT-Adapter can handle an arbitrary number of tasks with a negligible increase of trainable parameters. We also propose VMT-Adapter-Lite, which further reduces the trainable parameters by learning shared parameters between down- and up-projections. Extensive experiments on four dense scene understanding tasks demonstrate the superiority of VMT-Adapter(-Lite), achieving a 3.96% (1.34%) relative improvement compared to single-task full fine-tuning, while utilizing merely ～1% (0.36%) trainable parameters of the pre-trained model.

PDF Details DOI

AAAI Conference 2022 Conference Paper

Lifelong Person Re-identification by Pseudo Task Knowledge Preservation

Wenhang Ge
Junlong Du
Ancong Wu
Yuqiao Xian
Ke Yan
Feiyue Huang
Wei-Shi Zheng

In real world, training data for person re-identification (Re- ID) is collected discretely with spatial and temporal variations, which requires a model to incrementally learn new knowledge without forgetting old knowledge. This problem is called lifelong person re-identification (LReID). Variations of illumination and background for images of each task exhibit task-specific image style and lead to task-wise domain gap. In addition to missing data from the old tasks, task-wise domain gap is a key factor for catastrophic forgetting in LReID, which is ignored in existing approaches for LReID. The model tends to learn task-specific knowledge with task-wise domain gap, which results in stability and plasticity dilemma. To overcome this problem, we cast LReID as a domain adaptation problem and propose a pseudo task knowledge preservation framework to alleviate the domain gap. Our framework is based on a pseudo task transformation module which maps the features of the new task into the feature space of the old tasks to complement the limited saved exemplars of the old tasks. With extra transformed features in the task-specific feature space, we propose a task-specific domain consistency loss to implicitly alleviate the task-wise domain gap for learning task-shared knowledge instead of task-specific one. Furthermore, to guide knowledge preservation with the feature distributions of the old tasks, we propose to preserve knowledge on extra pseudo tasks which jointly distills knowledge and discriminates identity, in order to achieve a better tradeoff between stability and plasticity for lifelong learning with task-wise domain gap. Extensive experiments demonstrate the superiority of our method 1 as compared with the stateof-the-art lifelong learning and LReID methods.

PDF Details