LLaVA-OneVision: Easy Visual Task Transfer

Bo Li; Yuanhan Zhang; Dong Guo; Renrui Zhang; Feng Li; Hao Zhang; Kaichen Zhang; Peiyuan Zhang; Yanwei Li; Ziwei Liu; Chunyuan Li

TMLR 2025

Abstract

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations from the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities and scenarios, yielding emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.

Context

Venue: Transactions on Machine Learning Research
Paper id: 1110520274115249229