
Author name cluster

Junke Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers
1 author row

Possible papers

5

NeurIPS 2025 Conference Paper

OmniGen-AR: AutoRegressive Any-to-Image Generation

  • Junke Wang
  • Xun Wang
  • Qiushan Guo
  • Peize Sun
  • Weilin Huang
  • Zuxuan Wu
  • Yu-Gang Jiang

Autoregressive (AR) models have demonstrated strong potential in visual generation, offering competitive performance with simple architectures and optimization objectives. However, existing methods are typically limited to single-modality conditions, e.g., text or category labels, restricting their applicability in real-world scenarios that demand image synthesis from diverse forms of control. In this work, we present OmniGen-AR, the first unified autoregressive framework for Any-to-Image generation. By discretizing various visual conditions through a shared visual tokenizer and text prompts with a text tokenizer, OmniGen-AR supports a broad spectrum of conditional inputs within a single model, including text (text-to-image generation), spatial signals (segmentation-to-image and depth-to-image), and visual context (image editing, frame prediction, and text-to-video generation). To mitigate the risk of information leakage from condition tokens to content tokens, we introduce Disentangled Causal Attention (DCA), which separates the full-sequence causal mask into condition causal attention and content causal attention. It serves as a training-time regularizer without affecting standard next-token prediction at inference. With this design, OmniGen-AR achieves new state-of-the-art results across a range of benchmarks, e.g., 0.63 on GenEval and 80.02 on VBench, demonstrating its effectiveness in flexible and high-fidelity visual generation.
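
The abstract only says that the full-sequence causal mask is split into a condition part and a content part; one minimal way to read that is a block-diagonal split of the usual lower-triangular mask. The sketch below illustrates that reading with hypothetical function names; how the two blocks interact during training is not stated in the abstract and is deliberately left out, so this is not the paper's implementation.

```python
import torch

def dca_training_mask(n_cond: int, n_content: int) -> torch.Tensor:
    """Illustrative reading of Disentangled Causal Attention (DCA).

    Assumption: condition tokens occupy positions [0, n_cond) and content
    tokens occupy [n_cond, n_cond + n_content). The full-sequence causal
    mask is split into two causal blocks; cross-block attention is omitted
    here because the abstract does not specify it.
    Returns a boolean mask where True means attention is allowed.
    """
    total = n_cond + n_content
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Condition tokens attend causally among themselves.
    mask[:n_cond, :n_cond] = torch.tril(torch.ones(n_cond, n_cond, dtype=torch.bool))
    # Content tokens attend causally among themselves.
    mask[n_cond:, n_cond:] = torch.tril(torch.ones(n_content, n_content, dtype=torch.bool))
    return mask

def inference_mask(total: int) -> torch.Tensor:
    """Standard full-sequence causal mask for next-token prediction at inference."""
    return torch.tril(torch.ones(total, total, dtype=torch.bool))

# Either mask can be passed as attn_mask to torch.nn.functional.scaled_dot_product_attention.
```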

NeurIPS 2025 Conference Paper

Perception Encoder: The best visual embeddings are not at the output of the network

  • Daniel Bolya
  • Po-Yao Huang
  • Peize Sun
  • Jang Hyun Cho
  • Andrea Madotto
  • Chen Wei
  • Tengyu Ma
  • Jiale Zhi

We introduce Perception Encoder (PE), a family of state-of-the-art vision encoders for image and video understanding. Traditionally, vision encoders have relied on a variety of pretraining objectives, each excelling at different downstream tasks. Surprisingly, after scaling a carefully tuned image pretraining recipe and refining with a robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network. To draw them out, we introduce two alignment methods: language alignment for multimodal language modeling, and spatial alignment for dense prediction. Together, our PE family of models achieves state-of-the-art results on a wide variety of tasks, including zero-shot image and video classification and retrieval; document, image, and video Q&A; and spatial tasks such as detection, tracking, and depth estimation. We release our models, code, and novel dataset of synthetically and human-annotated videos: https://github.com/facebookresearch/perception_models
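
The central observation is that the most useful embeddings sit in intermediate layers rather than at the encoder's output. A generic way to read such features out of a PyTorch model is a forward hook, sketched below with a stand-in encoder; the class, layer index, and shapes are illustrative assumptions and unrelated to the released PE code.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in vision encoder: a plain stack of transformer blocks (placeholder)."""
    def __init__(self, dim: int = 64, depth: int = 6):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(depth)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for blk in self.blocks:
            x = blk(x)
        return x

def grab_intermediate(model: TinyEncoder, tokens: torch.Tensor, layer_idx: int) -> torch.Tensor:
    """Return hidden states after block `layer_idx` instead of the final output."""
    captured = {}
    def hook(_module, _inputs, output):
        captured["h"] = output
    handle = model.blocks[layer_idx].register_forward_hook(hook)
    with torch.no_grad():
        model(tokens)
    handle.remove()
    return captured["h"]

enc = TinyEncoder()
feats = grab_intermediate(enc, torch.randn(2, 16, 64), layer_idx=3)  # shape (2, 16, 64)
```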

NeurIPS 2024 Conference Paper

OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

  • Junke Wang
  • Yi Jiang
  • Zehuan Yuan
  • Binyue Peng
  • Zuxuan Wu
  • Yu-Gang Jiang

A tokenizer, serving as a translator that maps intricate visual data into a compact latent space, lies at the core of visual generative models. Based on the finding that existing tokenizers are tailored to either image or video inputs, this paper presents OmniTokenizer, a transformer-based tokenizer for joint image and video tokenization. OmniTokenizer is designed with a spatial-temporal decoupled architecture, which integrates window attention and causal attention for spatial and temporal modeling, respectively. To exploit the complementary nature of image and video data, we further propose a progressive training strategy, where OmniTokenizer is first trained on image data at a fixed resolution to develop the spatial encoding capacity and then jointly trained on image and video data at multiple resolutions to learn the temporal dynamics. OmniTokenizer, for the first time, handles both image and video inputs within a unified framework and proves the possibility of realizing their synergy. Extensive experiments demonstrate that OmniTokenizer achieves state-of-the-art (SOTA) reconstruction performance on various image and video datasets, e.g., 1.11 reconstruction FID on ImageNet and 42 reconstruction FVD on UCF-101, beating the previous SOTA methods by 13% and 26%, respectively. Additionally, we also show that when integrated with OmniTokenizer, both language model-based approaches and diffusion models can realize advanced visual synthesis performance, underscoring the superiority and versatility of our method.
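
The two-stage progressive training strategy (image-only at a fixed resolution, then joint image-video training at multiple resolutions) can be summarized as a schedule like the one below. The dataloaders and the `reconstruction_loss(batch, resolution)` interface are placeholders assumed for illustration, not OmniTokenizer's actual API.

```python
import random
from itertools import cycle

def progressive_training(tokenizer, optimizer, image_loader, video_loader,
                         image_steps: int = 10_000, joint_steps: int = 10_000,
                         fixed_resolution: int = 256, joint_resolutions=(128, 256, 384)):
    """Two-stage schedule in the spirit of the described strategy.

    Stage 1 trains on images at one fixed resolution to build spatial encoding;
    stage 2 alternates image and video batches and samples a resolution per step
    to learn temporal dynamics.
    """
    image_iter, video_iter = cycle(image_loader), cycle(video_loader)

    for _ in range(image_steps):                              # stage 1: image-only
        loss = tokenizer.reconstruction_loss(next(image_iter), resolution=fixed_resolution)
        optimizer.zero_grad(); loss.backward(); optimizer.step()

    for step in range(joint_steps):                           # stage 2: joint, multi-resolution
        batch = next(image_iter) if step % 2 == 0 else next(video_iter)
        resolution = random.choice(joint_resolutions)         # vary the training resolution
        loss = tokenizer.reconstruction_loss(batch, resolution=resolution)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
```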

NeurIPS 2022 Conference Paper

OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

  • Junke Wang
  • DongDong Chen
  • Zuxuan Wu
  • Chong Luo
  • Luowei Zhou
  • Yucheng Zhao
  • Yujia Xie
  • Ce Liu

This paper presents OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture. It adopts a unified transformer-based visual encoder for both image and video inputs, and thus can perform joint image-language and video-language pretraining. We demonstrate, for the first time, that such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer (e.g., using image-language to help video-language). To this end, we propose a decoupled joint pretraining of image-language and video-language to effectively decompose vision-language modeling into spatial and temporal dimensions and obtain performance boosts on both image and video tasks. Moreover, we introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), and video-label (e.g., video action recognition) data together, so that both supervised and noisily supervised pretraining data are utilized as much as possible. Without incurring extra task-specific adaptors, OmniVL can simultaneously support visual-only tasks (e.g., image classification, video action recognition), cross-modal alignment tasks (e.g., image/video-text retrieval), and multi-modal understanding and generation tasks (e.g., image/video question answering, captioning). We evaluate OmniVL on a wide range of downstream tasks and achieve state-of-the-art or competitive results with similar model size and data scale.
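
The unified contrastive loss is described as pooling image-text, video-text, image-label, and video-label data. One common way to fold label data into a CLIP-style objective is to treat samples sharing a class as extra positives; the sketch below shows that idea generically and is an interpretation under that assumption, not the paper's UniVLC implementation.

```python
import torch
import torch.nn.functional as F

def unified_contrastive_loss(vis_emb: torch.Tensor, txt_emb: torch.Tensor,
                             group_ids: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE that also admits label-derived positives.

    vis_emb, txt_emb: (N, D) embeddings of paired visual/text items.
    group_ids: (N,) ints; web-crawled pairs get unique ids, labelled samples share
    their class id, so same-class items count as positives (an assumption about how
    supervised data could enter the loss, not the paper's exact recipe).
    """
    v = F.normalize(vis_emb, dim=-1)
    t = F.normalize(txt_emb, dim=-1)
    logits = v @ t.t() / temperature                          # (N, N) similarity matrix
    pos = (group_ids[:, None] == group_ids[None, :]).float()  # positive-pair mask
    log_prob_v2t = F.log_softmax(logits, dim=1)
    log_prob_t2v = F.log_softmax(logits.t(), dim=1)
    # Average log-likelihood over all positives for each anchor, in both directions.
    loss_v2t = -(pos * log_prob_v2t).sum(1) / pos.sum(1)
    loss_t2v = -(pos * log_prob_t2v).sum(1) / pos.sum(1)
    return 0.5 * (loss_v2t.mean() + loss_t2v.mean())
```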