Arrow Research search

Author name cluster

Yiming Cui

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
1 author row

Possible papers

NeurIPS Conference 2025 Conference Paper

All You Need is One: Capsule Prompt Tuning with a Single Vector

  • Yiyang Liu
  • James Liang
  • Heng Fan
  • Wenhao Yang
  • Yiming Cui
  • Xiaotian Han
  • Lifu Huang
  • Dongfang Liu

Prompt-based learning has emerged as a parameter-efficient finetuning (PEFT) approach to facilitate Large Language Model (LLM) adaptation to downstream tasks by conditioning generation with task-aware guidance. Despite its successes, current prompt-based learning methods heavily rely on laborious grid searching for optimal prompt length and typically require a considerable number of prompts, introducing additional computational burden. Worse yet, our pioneer findings indicate that the task-aware prompt design is inherently limited by its absence of instance-aware information, leading to a subtle attention interplay with the input sequence. In contrast, simply incorporating instance-aware information as a part of the guidance can enhance the prompt-tuned model performance without additional fine-tuning. Moreover, we find an interesting phenomenon, namely "attention anchor", that incorporating instance-aware tokens at the earliest position of the sequence can successfully preserve strong attention to critical structural information and exhibit more active attention interaction with all input tokens. In light of our observation, we introduce Capsule Prompt-Tuning (CaPT), an efficient and effective solution that leverages off-the-shelf, informative instance semantics into prompt-based learning. Our approach innovatively integrates both instance-aware and task-aware information in a nearly parameter-free manner (i.e., one single capsule prompt). Empirical results demonstrate that our method can exhibit superior performance across various language tasks (e.g., 84.03% average accuracy on T5-Large), serving as an "attention anchor," while enjoying high parameter efficiency (e.g., 0.003% of model parameters on Llama3.2-1B).

NeurIPS Conference 2023 Conference Paper

ClusterFormer: Clustering As A Universal Visual Learner

  • James Liang
  • Yiming Cui
  • Qifan Wang
  • Tong Geng
  • Wenguan Wang
  • Dongfang Liu

This paper presents ClusterFormer, a universal vision model that is based on the Clustering paradigm with TransFormer. It comprises two novel designs: 1) recurrent cross-attention clustering, which reformulates the cross-attention mechanism in Transformer and enables recursive updates of cluster centers to facilitate strong representation learning; and 2) feature dispatching, which uses the updated cluster centers to redistribute image features through similarity-based metrics, resulting in a transparent pipeline. This elegant design streamlines an explainable and transferable workflow, capable of tackling heterogeneous vision tasks (i.e., image classification, object detection, and image segmentation) with varying levels of clustering granularity (i.e., image-, box-, and pixel-level). Empirical results demonstrate that ClusterFormer outperforms various well-known specialized architectures, achieving 83.41% top-1 acc. over ImageNet-1K for image classification, 54.2% and 47.0% mAP over MS COCO for object detection and instance segmentation, 52.4% mIoU over ADE20K for semantic segmentation, and 55.8% PQ over COCO Panoptic for panoptic segmentation. This work aims to initiate a paradigm shift in universal visual understanding and to benefit the broader field.
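
The two designs named in this abstract can be illustrated with a toy sketch. It is not the paper's implementation: it assumes plain dot-product similarity, a softmax over cluster centers, and a residual-style dispatch, and the function names (`update_centers`, `recurrent_clustering`, `dispatch`) are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def update_centers(features, centers):
    """One clustering step: softly assign features to centers, then re-average.

    Assumption: the assignment is a softmax over centers of dot-product
    similarities; the paper's exact attention form may differ.
    """
    assign = [softmax([dot(f, c) for c in centers]) for f in features]
    dim = len(features[0])
    new_centers = []
    for k in range(len(centers)):
        weights = [a[k] for a in assign]
        total = sum(weights)  # strictly positive because softmax outputs are > 0
        new_centers.append(
            [sum(w * f[d] for w, f in zip(weights, features)) / total
             for d in range(dim)]
        )
    return new_centers, assign


def recurrent_clustering(features, centers, steps=3):
    """Apply the center update recursively (steps must be >= 1)."""
    assign = None
    for _ in range(steps):
        centers, assign = update_centers(features, centers)
    return centers, assign


def dispatch(features, centers, assign):
    """Redistribute center information back to each feature by its assignment.

    Assumption: a residual sum; this is one of several plausible choices.
    """
    dim = len(features[0])
    return [
        [f[d] + sum(a[k] * centers[k][d] for k in range(len(centers)))
         for d in range(dim)]
        for f, a in zip(features, assign)
    ]
```

Calling `recurrent_clustering(features, centers)` followed by `dispatch(features, new_centers, assign)` returns one updated vector per input feature, with the same dimensionality as the input.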

IJCAI Conference 2022 Conference Paper

GL-RG: Global-Local Representation Granularity for Video Captioning

  • Liqi Yan
  • Qifan Wang
  • Yiming Cui
  • Fuli Feng
  • Xiaojun Quan
  • Xiangyu Zhang
  • Dongfang Liu

Video captioning is a challenging task as it needs to accurately transform visual understanding into natural language description. To date, state-of-the-art methods inadequately model global-local representation across video frames for caption generation, leaving plenty of room for improvement. In this work, we approach the video captioning task from a new perspective and propose a GL-RG framework for video captioning, namely a Global-Local Representation Granularity. Our GL-RG demonstrates three advantages over the prior efforts: 1) we explicitly exploit extensive visual representations from different video ranges to improve linguistic expression; 2) we devise a novel global-local encoder to produce rich semantic vocabulary to obtain a descriptive granularity of video contents across frames; 3) we develop an incremental training strategy which organizes model learning in an incremental fashion to incur an optimal captioning behavior. Experimental results on the challenging MSR-VTT and MSVD datasets show that our GL-RG outperforms recent state-of-the-art methods by a significant margin. Code is available at https://github.com/ylqi/GL-RG.

AAAI Conference 2021 Conference Paper

DenserNet: Weakly Supervised Visual Localization Using Multi-Scale Feature Aggregation

  • Dongfang Liu
  • Yiming Cui
  • Liqi Yan
  • Christos Mousas
  • Baijian Yang
  • Yingjie Chen

In this work, we introduce a Denser Feature Network (DenserNet) for visual localization. Our work provides three principal contributions. First, we develop a convolutional neural network (CNN) architecture which aggregates feature maps at different semantic levels for image representations. Using denser feature maps, our method can produce more keypoint features and increase image retrieval accuracy. Second, our model is trained end-to-end without pixel-level annotation other than positive and negative GPS-tagged image pairs. We use a weakly supervised triplet ranking loss to learn discriminative features and encourage keypoint feature repeatability for image representation. Finally, our method is computationally efficient as our architecture has shared features and parameters during forward propagation. Our method is flexible and can be crafted on a light-weighted backbone architecture to achieve appealing efficiency with a small penalty on accuracy. Extensive experimental results indicate that our method sets a new state-of-the-art on four challenging large-scale localization benchmarks and three image retrieval benchmarks with the same level of supervision. The code is available at https://github.com/goodproj13/DenserNet.
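
The triplet ranking loss mentioned in this abstract can be sketched in its generic form. This is a minimal illustration, not DenserNet's code: it assumes squared Euclidean distance between image descriptors, a hinge with a fixed margin, and a sum over all negatives, and the function names and the margin value are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two descriptor vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_ranking_loss(query, positive, negatives, margin=0.1):
    """Hinge-style ranking loss over one query, one positive, many negatives.

    Each negative contributes only if it is not at least `margin` farther
    from the query than the positive is (assumed formulation).
    """
    d_pos = squared_distance(query, positive)
    return sum(
        max(0.0, d_pos + margin - squared_distance(query, neg))
        for neg in negatives
    )
```

A negative that already lies well beyond the positive contributes zero, so only "hard" negatives that violate the margin drive the gradient in a trained model.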

AAAI Conference 2020 Conference Paper

Discriminative Sentence Modeling for Story Ending Prediction

  • Yiming Cui
  • Wanxiang Che
  • Wei-Nan Zhang
  • Ting Liu
  • Shijin Wang
  • Guoping Hu

Story Ending Prediction is a task that needs to select an appropriate ending for the given story, which requires the machine to understand the story and sometimes needs commonsense knowledge. To tackle this task, we propose a new neural network called Diff-Net for better modeling the differences of each ending in this task. The proposed model could discriminate two endings in three semantic levels: contextual representation, story-aware representation, and discriminative representation. Experimental results on the Story Cloze Test dataset show that the proposed model significantly outperforms various systems by a large margin, and detailed ablation studies are given for better understanding our model. We also carefully examine the traditional and BERT-based models on both SCT v1.0 and v1.5 with interesting findings that may potentially help future studies.

AAAI Conference 2019 Conference Paper

Convolutional Spatial Attention Model for Reading Comprehension with Multiple-Choice Questions

  • Zhipeng Chen
  • Yiming Cui
  • Wentao Ma
  • Shijin Wang
  • Guoping Hu

Machine Reading Comprehension (MRC) with multiple-choice questions requires the machine to read a given passage and select the correct answer among several candidates. In this paper, we propose a novel approach called Convolutional Spatial Attention (CSA) model which can better handle the MRC with multiple-choice questions. The proposed model could fully extract the mutual information among the passage, question, and the candidates, to form the enriched representations. Furthermore, to merge various attention results, we propose to use convolutional operation to dynamically summarize the attention values within the different size of regions. Experimental results show that the proposed model could give substantial improvements over various state-of-the-art systems on both RACE and SemEval-2018 Task11 datasets.

IJCAI Conference 2019 Conference Paper

Exploiting Persona Information for Diverse Generation of Conversational Responses

  • Haoyu Song
  • Wei-Nan Zhang
  • Yiming Cui
  • Dong Wang
  • Ting Liu

In human conversations, due to their personalities in mind, people can easily carry out and maintain the conversations. Given conversational context with persona information, how a chatbot can exploit that information to generate diverse and sustainable conversations is still a non-trivial task. Previous work on persona-based conversational models successfully makes use of predefined persona information and has shown great promise in delivering more realistic responses. However, these models all learn with the assumption that given a source input, there is only one target response, whereas in human conversations there are many appropriate responses to a given input message. In this paper, we propose a memory-augmented architecture to exploit persona information from context and incorporate a conditional variational autoencoder model together to generate diverse and sustainable conversations. We evaluate the proposed model on a benchmark persona-chat dataset. Both automatic and human evaluations show that our model can deliver more diverse and more engaging persona-based responses than baseline approaches.