Author name cluster

Lu Jiang 0004

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers

1 author row

ICLR Conference 2024 Conference Paper

Language Model Beats Diffusion - Tokenizer is key to visual generation

Lijun Yu
José Lezama
Nitesh Bharadwaj Gundavarapu
Luca Versari
Kihyuk Sohn
David Minnen
Yong Cheng
Agrim Gupta

While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce \modelname{}, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary. Equipped with this new tokenizer, we show that LLMs outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. In addition, we demonstrate that our tokenizer surpasses the previously top-performing video tokenizer on two more tasks: (1) video compression comparable to the next-generation video codec (VCC) according to human evaluations, and (2) learning effective representations for action recognition tasks.

Details

ICML Conference 2024 Conference Paper

VideoPoet: A Large Language Model for Zero-Shot Video Generation

Dan Kondratyuk
Lijun Yu
Xiuye Gu
José Lezama
Jonathan Huang
Grant Schindler
Rachel Hornung
Vighnesh Birodkar

We present VideoPoet, a language model capable of synthesizing high-quality video from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs – including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model’s state-of-the-art capabilities in zero-shot video generation, specifically highlighting the ability to generate high-fidelity motions. Project page: http: //sites. research. google/videopoet/

Details

ICLR Conference 2023 Conference Paper

Discrete Predictor-Corrector Diffusion Models for Image Synthesis

José Lezama
Tim Salimans
Lu Jiang 0004
Huiwen Chang
Jonathan Ho
Irfan Essa

We introduce Discrete Predictor-Corrector diffusion models (DPC), extending predictor-corrector samplers in Gaussian diffusion models to the discrete case. Predictor-corrector samplers are a class of samplers for diffusion models, which improve on ancestral samplers by correcting the sampling distribution of intermediate diffusion states using MCMC methods. In DPC, the Langevin corrector, which does not have a direct counterpart in discrete space, is replaced with a discrete MCMC transition defined by a learned corrector kernel. The corrector kernel is trained to make the correction steps achieve asymptotic convergence, in distribution, to the correct marginal of the intermediate diffusion states. Equipped with DPC, we revisit recent transformer-based non-autoregressive generative models through the lens of discrete diffusion, and find that DPC can alleviate the compounding decoding error due to the parallel sampling of visual tokens. Our experiments show that DPC improves upon existing discrete latent space models for class-conditional image generation on ImageNet, and outperforms continuous diffusion models and GANs, according to standard metrics and user preference studies.

Details

ICML Conference 2023 Conference Paper

Muse: Text-To-Image Generation via Masked Generative Transformers

Huiwen Chang
Han Zhang 0010
Jarred Barber
Aaron Maschinot
José Lezama
Lu Jiang 0004
Ming-Hsuan Yang 0001
Kevin Murphy 0002

We present Muse, a text-to-image Transformermodel that achieves state-of-the-art image genera-tion performance while being significantly moreefficient than diffusion or autoregressive models. Muse is trained on a masked modeling task indiscrete token space: given the text embeddingextracted from a pre-trained large language model(LLM), Muse learns to predict randomly maskedimage tokens. Compared to pixel-space diffusionmodels, such as Imagen and DALL-E 2, Muse issignificantly more efficient due to the use of dis-crete tokens and requires fewer sampling itera-tions; compared to autoregressive models such asParti, Muse is more efficient due to the use of par-allel decoding. The use of a pre-trained LLM en-ables fine-grained language understanding, whichtranslates to high-fidelity image generation andthe understanding of visual concepts such as ob-jects, their spatial relationships, pose, cardinalityetc. Our 900M parameter model achieves a newSOTA on CC3M, with an FID score of 6. 06. TheMuse 3B parameter model achieves an FID of7. 88 on zero-shot COCO evaluation, along with aCLIP score of 0. 32. Muse also directly enables anumber of image editing applications without theneed to fine-tune or invert the model: inpainting, outpainting, and mask-free editing. More resultsand videos demonstrating editing are available at https: //muse-icml. github. io/

Details

ICLR Conference 2022 Conference Paper

Discrete Representations Strengthen Vision Transformer Robustness

Chengzhi Mao
Lu Jiang 0004
Mostafa Dehghani 0001
Carl Vondrick
Rahul Sukthankar
Irfan Essa

Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition. While recent studies suggest that ViTs are more robust than their convolutional counterparts, our experiments find that ViTs are overly reliant on local features (\eg, nuisances and texture) and fail to make adequate use of global context (\eg, shape and structure). As a result, ViTs fail to generalize to out-of-distribution, real-world data. To address this deficiency, we present a simple and effective architecture modification to ViT's input layer by adding discrete tokens produced by a vector-quantized encoder. Different from the standard continuous pixel tokens, discrete tokens are invariant under small perturbations and contain less information individually, which promote ViTs to learn global information that is invariant. Experimental results demonstrate that adding discrete representation on four architecture variants strengthens ViT robustness by up to 12\% across seven ImageNet robustness benchmarks while maintaining the performance on ImageNet.

Details

ICLR Conference 2022 Conference Paper

ViTGAN: Training GANs with Vision Transformers

Kwonjoon Lee
Huiwen Chang
Lu Jiang 0004
Han Zhang 0010
Zhuowen Tu
Ce Liu 0001

Recently, Vision Transformers (ViTs) have shown competitive performance on image recognition while requiring less vision-specific inductive biases. In this paper, we investigate if such performance can be extended to image generation. To this end, we integrate the ViT architecture into generative adversarial networks (GANs). For ViT discriminators, we observe that existing regularization methods for GANs interact poorly with self-attention, causing serious instability during training. To resolve this issue, we introduce several novel regularization techniques for training GANs with ViTs. For ViT generators, we examine architectural choices for latent and pixel mapping layers to faciliate convergence. Empirically, our approach, named ViTGAN, achieves comparable performance to the leading CNN- based GAN models on three datasets: CIFAR-10, CelebA, and LSUN bedroom.

Details

ICML Conference 2021 Conference Paper

Self-supervised and Supervised Joint Training for Resource-rich Machine Translation

Yong Cheng
Wei Wang
Lu Jiang 0004
Wolfgang Macherey

Self-supervised pre-training of text representations has been successfully applied to low-resource Neural Machine Translation (NMT). However, it usually fails to achieve notable gains on resource-rich NMT. In this paper, we propose a joint training approach, F2-XEnDec, to combine self-supervised and supervised learning to optimize NMT models. To exploit complementary self-supervised signals for supervised learning, NMT models are trained on examples that are interbred from monolingual and parallel sentences through a new process called crossover encoder-decoder. Experiments on two resource-rich translation benchmarks, WMT’14 English-German and WMT’14 English-French, demonstrate that our approach achieves substantial improvements over several strong baseline methods and obtains a new state of the art of 46. 19 BLEU on English-French when incorporating back translation. Results also show that our approach is capable of improving model robustness to input perturbations such as code-switching noise which frequently appears on the social media.

Details

ICML Conference 2020 Conference Paper

Beyond Synthetic Noise: Deep Learning on Controlled Noisy Labels

Lu Jiang 0004
Di Huang
Mason Liu
Weilong Yang

Performing controlled experiments on noisy data is essential in understanding deep learning across noise levels. Due to the lack of suitable datasets, previous research has only examined deep learning on controlled synthetic label noise, and real-world label noise has never been studied in a controlled setting. This paper makes three contributions. First, we establish the first benchmark of controlled real-world label noise from the web. This new benchmark enables us to study the web label noise in a controlled setting for the first time. The second contribution is a simple but effective method to overcome both synthetic and real noisy labels. We show that our method achieves the best result on our dataset as well as on two public benchmarks (CIFAR and WebVision). Third, we conduct the largest study by far into understanding deep neural networks trained on noisy labels across different noise levels, noise types, network architectures, and training settings.

Details

ICML Conference 2018 Conference Paper

MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels

Lu Jiang 0004
Zhengyuan Zhou
Thomas Leung
Li-Jia Li 0001
Li Fei-Fei 0001

Recent deep networks are capable of memorizing the entire data even when the labels are completely random. To overcome the overfitting on corrupted labels, we propose a novel technique of learning another neural network, called MentorNet, to supervise the training of the base deep networks, namely, StudentNet. During training, MentorNet provides a curriculum (sample weighting scheme) for StudentNet to focus on the sample the label of which is probably correct. Unlike the existing curriculum that is usually predefined by human experts, MentorNet learns a data-driven curriculum dynamically with StudentNet. Experimental results demonstrate that our approach can significantly improve the generalization performance of deep networks trained on corrupted training data. Notably, to the best of our knowledge, we achieve the best-published result on WebVision, a large benchmark containing 2. 2 million images of real-world noisy labels.

Details