Arrow Research search

Author name cluster

Kaicheng Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers
2 author rows

Possible papers (5)

AAAI 2026 · Conference Paper

UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning

  • Tiancheng Gu
  • Kaicheng Yang
  • Kaichen Zhang
  • Xiang An
  • Ziyong Feng
  • Yueyi Zhang
  • Weidong Cai
  • Jiankang Deng

Universal multimodal embedding models are essential in various tasks. Existing approaches typically use in-batch mining to identify hard negatives by measuring the similarity of query-candidate pairs. However, these methods often struggle to capture subtle semantic differences among candidates and lack diversity in negative samples. Moreover, the embeddings exhibit limited discriminative ability in distinguishing false and hard negatives. In this paper, we leverage the advanced understanding capabilities of MLLMs to enhance representation learning, and present a novel Universal Multimodal Embedding (UniME-V2) model. Our approach first constructs a potential hard negative set through global retrieval. We then introduce the MLLM-as-a-Judge mechanism, which utilizes MLLMs to assess the semantic alignment of query-candidate pairs and generate soft semantic matching scores. These scores serve as a foundation for hard negative mining, mitigating the impact of false negatives and enabling the identification of diverse, high-quality hard negatives. Furthermore, the semantic matching scores are used as soft labels to mitigate the rigid one-to-one mapping constraint. By aligning the similarity matrix with the soft semantic matching score matrix, the model learns semantic distinctions among candidates, significantly enhancing its discriminative capacity. To further improve performance, we propose UniME-V2-Reranker, a reranking model trained on our mined hard negatives through a joint pairwise and listwise optimization approach. We conduct comprehensive experiments on the MMEB benchmark and multiple retrieval tasks, demonstrating that our method achieves state-of-the-art performance across all tasks.
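
As a minimal sketch of the soft-label idea in this abstract, the snippet below aligns a query-candidate similarity matrix with softened MLLM judge scores via a KL objective. The function name, tensor shapes, and temperature are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch: MLLM judge scores as soft labels for embedding learning.
import torch
import torch.nn.functional as F

def soft_alignment_loss(query_emb, cand_emb, judge_scores, temperature=0.05):
    """KL-align the query-candidate similarity matrix with soft judge scores.

    query_emb:    (B, D) L2-normalized query embeddings
    cand_emb:     (B, K, D) L2-normalized candidate embeddings per query
    judge_scores: (B, K) semantic matching scores produced by an MLLM judge
    """
    # (B, K) cosine similarities between each query and its candidate set
    sim = torch.einsum("bd,bkd->bk", query_emb, cand_emb) / temperature
    target = F.softmax(judge_scores / temperature, dim=-1)  # soft labels
    return F.kl_div(F.log_softmax(sim, dim=-1), target, reduction="batchmean")

# Toy usage with random tensors
B, K, D = 4, 8, 16
q = F.normalize(torch.randn(B, D), dim=-1)
c = F.normalize(torch.randn(B, K, D), dim=-1)
print(soft_alignment_loss(q, c, torch.rand(B, K)))
```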

AAAI 2026 · Conference Paper

ViCToR: Improving Visual Comprehension via Token Reconstruction for Pretraining LMMs

  • Yin Xie
  • Kaicheng Yang
  • Peirou Liang
  • Xiang An
  • Yongle Zhao
  • Yumeng Wang
  • Ziyong Feng
  • Roy Miles

Large Multimodal Models (LMMs) often face a modality representation gap during pretraining: while language embeddings remain stable, visual representations are highly sensitive to contextual noise (e.g., background clutter). To address this issue, we introduce a dedicated visual comprehension stage through ViCToR (Visual Comprehension via Token Reconstruction), a novel pretraining framework for LMMs. ViCToR employs a learnable visual token pool and utilizes the Hungarian matching algorithm to select semantically relevant tokens from this pool for visual token replacement. Furthermore, by integrating a visual token reconstruction loss with dense semantic supervision, ViCToR learns tokens that retain high visual detail, thereby enhancing the large language model's (LLM's) understanding of visual information. After pretraining on 3 million publicly accessible images and captions, ViCToR achieves state-of-the-art results, improving over LLaVA-NeXT-8B by 10.4%, 3.2%, and 7.2% on the MMStar, SEED-I, and RealWorldQA benchmarks, respectively.
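
As a rough illustration of the token-replacement step described above, the sketch below uses Hungarian matching (scipy's linear_sum_assignment) to pair visual tokens with entries of a learnable token pool by cosine similarity. The sizes and variable names are assumptions for the example, not the ViCToR implementation.

```python
# Hypothetical sketch: Hungarian matching between visual tokens and a token pool.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

pool_size, num_tokens, dim = 256, 64, 32
token_pool = torch.nn.Parameter(torch.randn(pool_size, dim))  # learnable visual token pool
visual_tokens = torch.randn(num_tokens, dim)                  # tokens from the vision encoder

# Cost = negative cosine similarity; Hungarian matching assigns each visual
# token to a distinct, semantically closest pool token.
cost = -(F.normalize(visual_tokens, dim=-1) @ F.normalize(token_pool, dim=-1).T)
row_idx, col_idx = linear_sum_assignment(cost.detach().numpy())

# Replace the matched visual tokens with their assigned pool tokens; a
# reconstruction loss would then ask the model to recover the originals.
replaced = visual_tokens.clone()
replaced[row_idx] = token_pool.data[col_idx]
print(replaced.shape)  # torch.Size([64, 32])
```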

ICML 2025 · Conference Paper

BiMaCoSR: Binary One-Step Diffusion Model Leveraging Flexible Matrix Compression for Real Super-Resolution

  • Kai Liu 0034
  • Kaicheng Yang
  • Zheng Chen 0014
  • Zhiteng Li
  • Yong Guo
  • Wenbo Li 0001
  • Linghe Kong
  • Yulun Zhang 0001

While super-resolution (SR) methods based on diffusion models (DM) have demonstrated inspiring performance, their deployment is impeded by heavy memory and computation demands. Recent works apply two kinds of methods to compress or accelerate the DM. One compresses the DM into 1-bit, known as binarization, alleviating the storage and computation pressure. The other distills the multi-step DM into only one step, significantly speeding up the inference process. Nonetheless, it remains impractical to deploy the DM on resource-limited edge devices. To address this problem, we propose BiMaCoSR, which combines binarization and one-step distillation to obtain extreme compression and acceleration. To prevent the catastrophic collapse of the model caused by binarization, we propose a sparse matrix branch (SMB) and a low-rank matrix branch (LRMB). Both auxiliary branches pass full-precision (FP) information, but in different ways: SMB absorbs the extreme values and its output is high rank, carrying abundant FP information, whereas LRMB is inspired by LoRA, is initialized with the top-r SVD components, and outputs a low-rank representation. The computation and storage overhead of our proposed branches can be safely ignored. Comprehensive comparison experiments show that BiMaCoSR outperforms current state-of-the-art binarization methods and achieves competitive performance compared with the FP one-step model. Moreover, we achieve excellent compression and acceleration: BiMaCoSR attains a 23.8x compression ratio and a 27.4x speedup over its FP counterpart. Our code and model are available at https://github.com/Kai-Liu001/BiMaCoSR
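
The layer structure the abstract describes can be sketched as a binarized weight plus two full-precision branches: a sparse branch holding the extreme values (SMB) and a low-rank branch initialized from the top-r SVD (LRMB). The class below is a minimal, assumed formulation for illustration, not the released BiMaCoSR code; the rank and sparsity ratio are arbitrary.

```python
# Hypothetical sketch of a binarized linear layer with SMB and LRMB branches.
import torch
import torch.nn as nn

class BinaryLinearWithBranches(nn.Module):
    def __init__(self, fp_weight, rank=8, sparse_ratio=0.01):
        super().__init__()
        # Binary branch: sign of the FP weight with a single learnable scale.
        self.register_buffer("w_bin", torch.sign(fp_weight))
        self.scale = nn.Parameter(fp_weight.abs().mean())
        # Sparse matrix branch (SMB): keep only the largest-magnitude entries.
        k = max(1, int(sparse_ratio * fp_weight.numel()))
        thresh = fp_weight.abs().flatten().kthvalue(fp_weight.numel() - k + 1).values
        self.register_buffer("w_sparse",
                             torch.where(fp_weight.abs() >= thresh, fp_weight,
                                         torch.zeros_like(fp_weight)))
        # Low-rank matrix branch (LRMB): initialized with the top-r SVD components.
        U, S, Vh = torch.linalg.svd(fp_weight, full_matrices=False)
        self.A = nn.Parameter(U[:, :rank] * S[:rank])  # (out_features, r)
        self.B = nn.Parameter(Vh[:rank, :])            # (r, in_features)

    def forward(self, x):
        w = self.scale * self.w_bin + self.w_sparse + self.A @ self.B
        return x @ w.T

layer = BinaryLinearWithBranches(torch.randn(64, 128))
print(layer(torch.randn(4, 128)).shape)  # torch.Size([4, 64])
```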

AAAI 2025 · Conference Paper

CLIP-CID: Efficient CLIP Distillation via Cluster-Instance Discrimination

  • Kaicheng Yang
  • Tiancheng Gu
  • Xiang An
  • Haiqiang Jiang
  • Xiangzi Dai
  • Ziyong Feng
  • Weidong Cai
  • Jiankang Deng

Contrastive Language-Image Pre-training (CLIP) has achieved excellent performance over a wide range of tasks. However, the effectiveness of CLIP heavily relies on a substantial corpus of pre-training data, resulting in notable consumption of computational resources. Although knowledge distillation has been widely applied in single-modality models, how to efficiently extend knowledge distillation to vision-language foundation models with extensive data remains relatively unexplored. In this paper, we introduce CLIP-CID, a novel distillation mechanism that effectively transfers knowledge from a large vision-language foundation model to a smaller model. We initially propose a simple but efficient image semantic balance method to reduce transfer learning bias and improve distillation efficiency. This method filters out 43.7% of image-text pairs from the LAION400M dataset while maintaining superior performance. After that, we leverage cluster-instance discrimination to facilitate knowledge transfer from the teacher model to the student model, thereby empowering the student model to acquire a holistic semantic comprehension of the pre-training data. Experimental results demonstrate that CLIP-CID achieves state-of-the-art performance on various downstream tasks including linear probing and zero-shot classification.
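
As a hedged illustration of cluster-instance discrimination in a distillation setting, the sketch below combines an instance-level alignment term (student vs. teacher embeddings) with a cluster-level term that classifies student embeddings into teacher-derived cluster centroids. The exact loss form, names, and temperature are assumptions, not the paper's formulation.

```python
# Hypothetical sketch: instance-level plus cluster-level distillation terms.
import torch
import torch.nn.functional as F

def cluster_instance_distill(student_emb, teacher_emb, centroids, tau=0.05):
    """student_emb, teacher_emb: (B, D); centroids: (C, D); all L2-normalized."""
    # Instance-level: pull each student embedding toward its teacher counterpart.
    inst_loss = (1 - (student_emb * teacher_emb).sum(dim=-1)).mean()
    # Cluster-level: classify students into the teacher's clusters, using the
    # teacher's nearest centroid as the pseudo label.
    logits = student_emb @ centroids.T / tau             # (B, C)
    pseudo = (teacher_emb @ centroids.T).argmax(dim=-1)  # (B,)
    return inst_loss + F.cross_entropy(logits, pseudo)

# Toy usage with random tensors
B, D, C = 8, 32, 16
s = F.normalize(torch.randn(B, D), dim=-1)
t = F.normalize(torch.randn(B, D), dim=-1)
cent = F.normalize(torch.randn(C, D), dim=-1)
print(cluster_instance_distill(s, t, cent))
```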

NeurIPS 2025 · Conference Paper

UniViT: Unifying Image and Video Understanding in One Vision Encoder

  • Feilong Tang
  • Xiang An
  • Haolin Yang
  • Yin Xie
  • Kaicheng Yang
  • Ming Hu
  • Zheng Cheng
  • Xingyu Zhou

Despite the impressive progress of recent pretraining methods on multimodal tasks, existing approaches are inherently biased towards either spatial modeling (e.g., CLIP) or temporal modeling (e.g., V-JEPA), limiting their joint capture of spatial details and temporal dynamics. To this end, we propose UniViT, a cluster-driven unified self-supervised learning framework that effectively captures the structured semantics of both image spatial content and video temporal dynamics through event-level and object-level clustering and discrimination. Specifically, we leverage offline clustering to generate semantic clusters across both modalities. For videos, multi-granularity event-level clustering progressively expands from single-event to structured multi-event segments, capturing coarse-to-fine temporal semantics; for images, object-level clustering captures fine-grained spatial semantics. However, while global clustering provides semantically consistent clusters, it lacks modeling of structured semantic relations (e.g., temporal event structures). To address this, we introduce a contrastive objective that leverages these semantic clusters as pseudo-label supervision to explicitly enforce structural constraints, including temporal event relations and spatial object co-occurrences, capturing structured semantics beyond categories. Meanwhile, UniViT jointly embeds structured object-level and event-level semantics into a unified representation space. Furthermore, UniViT introduces two key components: (i) Unified Rotary Position Embedding integrates relative positional embedding with frequency-aware dimension allocation to support position-invariant semantic learning and enhance the stability of structured semantics in the discrimination stage; and (ii) Variable Spatiotemporal Streams adapt to inputs of varying frame lengths, addressing the rigidity of conventional fixed-input approaches. Extensive experiments across varying model scales demonstrate that UniViT achieves state-of-the-art performance on linear probing, attentive probing, question answering, and spatial understanding tasks.
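
To illustrate how offline cluster assignments can serve as pseudo-label supervision in a contrastive objective, here is a minimal sketch of a cluster-pseudo-label contrastive loss, where samples sharing an event or object cluster are treated as positives. The formulation is an assumption for exposition, not UniViT's actual objective.

```python
# Hypothetical sketch: contrastive loss supervised by cluster pseudo labels.
import torch
import torch.nn.functional as F

def cluster_contrastive_loss(emb, cluster_ids, tau=0.1):
    """emb: (N, D) L2-normalized embeddings; cluster_ids: (N,) pseudo labels."""
    sim = emb @ emb.T / tau
    self_mask = torch.eye(len(emb), dtype=torch.bool)
    pos_mask = (cluster_ids[:, None] == cluster_ids[None, :]) & ~self_mask
    # Log-probability of each pair, excluding self-similarity from the denominator.
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, float("-inf")),
                                     dim=1, keepdim=True)
    # Average over positives, skipping anchors whose cluster has no other member.
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0
    loss = -(log_prob * pos_mask).sum(dim=1)[valid] / pos_counts[valid]
    return loss.mean()

# Toy usage with random embeddings and three pseudo clusters
emb = F.normalize(torch.randn(12, 16), dim=-1)
print(cluster_contrastive_loss(emb, torch.randint(0, 3, (12,))))
```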