Arrow Research search

Author name cluster

Lijun Yu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

8 papers
2 author rows

Possible papers (8)

JMLR Journal 2025 Journal Article

Unified Discrete Diffusion for Categorical Data

  • Lingxiao Zhao
  • Xueying Ding
  • Lijun Yu
  • Leman Akoglu

Discrete diffusion models have attracted significant attention for their application to naturally discrete data, such as language and graphs. While discrete-time discrete diffusion has been established for some time, it was only recently that Campbell et al. (2022) introduced the first framework for continuous-time discrete diffusion. However, their training and backward sampling processes differ significantly from those of the discrete-time version, requiring nontrivial approximations for tractability. In this paper, we first introduce a series of generalizations and simplifications of the evidence lower bound (ELBO) that facilitate more accurate and easier optimization for both discrete- and continuous-time discrete diffusion. We further establish a unification of discrete- and continuous-time discrete diffusion through a shared forward process and backward parameterization. Thanks to this unification, the continuous-time diffusion can now utilize the exact and efficient backward process developed for the discrete-time case, avoiding the need for costly and inexact approximations. Similarly, the discrete-time diffusion now also employs the MCMC corrector, which was previously exclusive to the continuous-time case. Extensive experiments and ablations demonstrate significant improvements, and we open-source our code at https://github.com/LingxiaoShawn/USD3.
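
The unification rests on a shared forward corruption process for categorical data. Below is a minimal sketch of the standard discrete-time forward marginal q(x_t | x_0) under a uniform transition kernel; the schedule, sizes, and function names are illustrative assumptions, not the authors' USD3 API.

```python
# Hedged sketch: discrete-time forward marginal for categorical diffusion with
# a uniform transition kernel. Toy sizes and an assumed beta schedule.
import torch

K, T = 32, 100                          # number of categories, diffusion steps
betas = torch.linspace(1e-4, 0.5, T)    # assumed per-step corruption schedule

def q_xt_given_x0(x0: torch.Tensor, t: int) -> torch.Tensor:
    """q(x_t | x_0): keep x0 with prob alpha_bar_t, else resample uniformly.
    The kept class also receives the uniform mass, so rows sum to 1."""
    alpha_bar = torch.prod(1.0 - betas[: t + 1]).item()
    probs = torch.full((*x0.shape, K), (1.0 - alpha_bar) / K)
    probs.scatter_(-1, x0.unsqueeze(-1), alpha_bar + (1.0 - alpha_bar) / K)
    return probs

x0 = torch.randint(0, K, (4, 16))       # a batch of categorical sequences
xt = torch.distributions.Categorical(q_xt_given_x0(x0, t=50)).sample()
```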

NeurIPS Conference 2024 Conference Paper

A Versatile Diffusion Transformer with Mixture of Noise Levels for Audiovisual Generation

  • Gwanghyun Kim
  • Alonso Martinez
  • Yu-Chuan Su
  • Brendan Jou
  • José Lezama
  • Agrim Gupta
  • Lijun Yu
  • Lu Jiang

Training diffusion models for audiovisual sequences allows for a range of generation tasks by learning conditional distributions of various input-output combinations of the two modalities. Nevertheless, this strategy often requires training a separate model for each task, which is expensive. Here, we propose a novel training approach to effectively learn arbitrary conditional distributions in the audiovisual space. Our key contribution lies in how we parameterize the diffusion timestep in the forward diffusion process. Instead of the standard fixed diffusion timestep, we propose applying variable diffusion timesteps across the temporal dimension and across modalities of the inputs. This formulation offers flexibility to introduce variable noise levels for various portions of the input, hence the term mixture of noise levels. We propose a transformer-based audiovisual latent diffusion model and show that it can be trained in a task-agnostic fashion using our approach to enable a variety of audiovisual generation tasks at inference time. Experiments demonstrate the versatility of our method in tackling cross-modal and multimodal interpolation tasks in the audiovisual space. Notably, our proposed approach surpasses baselines in generating temporally and perceptually consistent samples conditioned on the input. Project page: neurips13025.github.io
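
The timestep parameterization is easy to picture in code. Below is a minimal sketch of the mixture-of-noise-levels idea, assuming standard DDPM-style corruption: one timestep is drawn per (modality, temporal segment) chunk rather than per sample. Shapes and names are placeholders, not the paper's implementation.

```python
# Hedged sketch: variable diffusion timesteps across modalities and time.
import torch

def mixed_timesteps(batch: int, n_modalities: int, n_segments: int, T: int = 1000):
    """One diffusion timestep per (modality, segment) chunk, not per sample."""
    return torch.randint(0, T, (batch, n_modalities, n_segments))

def add_noise(z: torch.Tensor, t: torch.Tensor, alphas_bar: torch.Tensor):
    """z: (B, M, S, D) latent chunks; t: (B, M, S) timesteps. DDPM-style
    corruption applied chunk-wise with each chunk's own timestep."""
    a = alphas_bar[t].unsqueeze(-1)                    # (B, M, S, 1)
    return a.sqrt() * z + (1 - a).sqrt() * torch.randn_like(z)

alphas_bar = torch.cumprod(1 - torch.linspace(1e-4, 0.02, 1000), dim=0)
z = torch.randn(2, 2, 8, 64)                           # e.g. audio+video, 8 segments
zt = add_noise(z, mixed_timesteps(2, 2, 8), alphas_bar)
```

Setting all entries of t equal recovers the usual fixed-timestep objective, while holding some chunks at timestep 0 conditions on them, which is roughly how one model can cover many input-output combinations.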

NeurIPS Conference 2024 Conference Paper

Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization

  • Kai Hu
  • Weichen Yu
  • Yining Li
  • Tianjun Yao
  • Xiang Li
  • Wenhe Liu
  • Lijun Yu
  • Zhiqiang Shen

Recent research indicates that large language models (LLMs) are susceptible to jailbreaking attacks that can generate harmful content. This paper introduces a novel token-level attack method, Adaptive Dense-to-Sparse Constrained Optimization (ADC), which has been shown to successfully jailbreak multiple open-source LLMs. Drawing inspiration from the difficulties of discrete token optimization, our method relaxes the discrete jailbreak optimization into a continuous optimization process while gradually increasing the sparsity of the optimizing vectors. This technique effectively bridges the gap between discrete and continuous space optimization. Experimental results demonstrate that our method is more effective and efficient than state-of-the-art token-level methods. On HarmBench, our approach achieves the highest attack success rate on seven out of eight LLMs compared to the latest jailbreak methods. Trigger Warning: This paper contains model behavior that can be offensive in nature.
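
Below is a minimal sketch of the dense-to-sparse relaxation with a stand-in objective: each adversarial token is a continuous point on the vocabulary simplex, and the number of entries allowed nonzero mass shrinks over the run. The loss, schedule, and all names are placeholders rather than the ADC implementation.

```python
# Hedged sketch: relax discrete token search into continuous simplex
# optimization, then anneal the support size from dense toward one-hot.
import torch

V, L = 1000, 20                          # toy vocab size and suffix length
logits = torch.zeros(L, V, requires_grad=True)
target = torch.randint(0, V, (L,))       # stand-in for the real attack objective
opt = torch.optim.Adam([logits], lr=0.1)

def project_topk(p: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the k largest entries per position and renormalize."""
    vals, idx = p.topk(k, dim=-1)
    sparse = torch.zeros_like(p).scatter(-1, idx, vals)
    return sparse / sparse.sum(-1, keepdim=True)

for step in range(300):
    p = torch.softmax(logits, dim=-1)    # dense point on the simplex
    k = max(1, V >> (step // 30))        # halve the allowed support periodically
    p = project_topk(p, k)
    loss = torch.nn.functional.nll_loss(torch.log(p + 1e-9), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Once k reaches 1, each position is effectively discrete again, so the final iterate can be read off as an ordinary token sequence.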

ICLR Conference 2024 Conference Paper

Language Model Beats Diffusion - Tokenizer is Key to Visual Generation

  • Lijun Yu
  • José Lezama
  • Nitesh Bharadwaj Gundavarapu
  • Luca Versari
  • Kihyuk Sohn
  • David Minnen
  • Yong Cheng
  • Agrim Gupta

While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary. Equipped with this new tokenizer, we show that LLMs outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. In addition, we demonstrate that our tokenizer surpasses the previously top-performing video tokenizer on two more tasks: (1) video compression comparable to the next-generation video codec (VVC) according to human evaluations, and (2) learning effective representations for action recognition tasks.
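
The tokenizer's role can be stated as an interface: pixels in, discrete ids drawn from a vocabulary shared by images and videos out. The toy sketch below uses plain nearest-neighbor vector quantization for illustration only; MAGVIT-v2's actual architecture and quantization scheme differ.

```python
# Hedged sketch: a toy pixels-to-token-ids interface, not the paper's model.
import torch

class ToyVideoTokenizer(torch.nn.Module):
    def __init__(self, vocab_size: int = 512, dim: int = 16):
        super().__init__()
        self.codebook = torch.nn.Embedding(vocab_size, dim)
        self.enc = torch.nn.Conv3d(3, dim, kernel_size=4, stride=4)  # downsample T, H, W

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        """video: (B, 3, T, H, W) -> token ids: (B, T', H', W')."""
        z = self.enc(video).permute(0, 2, 3, 4, 1)       # (B, T', H', W', dim)
        d = torch.cdist(z.flatten(0, 3), self.codebook.weight)
        return d.argmin(-1).view(z.shape[:-1])

ids = ToyVideoTokenizer()(torch.randn(1, 3, 8, 64, 64))  # ids become LLM inputs
```

Treating a still image as a one-frame video is one way a single token vocabulary can serve both modalities.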

TMLR Journal 2024 Journal Article

MaskBit: Embedding-free Image Generation via Bit Tokens

  • Mark Weber
  • Lijun Yu
  • Qihang Yu
  • Xueqing Deng
  • Xiaohui Shen
  • Daniel Cremers
  • Liang-Chieh Chen

Masked transformer models for class-conditional image generation have become a compelling alternative to diffusion models. Typically comprising two stages - an initial VQGAN model for transitioning between latent space and image space, and a subsequent Transformer model for image generation within latent space - these frameworks offer promising avenues for image synthesis. In this study, we present two primary contributions: first, an empirical and systematic examination of VQGANs, leading to a modernized VQGAN; second, a novel embedding-free generation network operating directly on bit tokens - a binary quantized representation of tokens with rich semantics. The first contribution furnishes a transparent, reproducible, and high-performing VQGAN model, enhancing accessibility and matching the performance of current state-of-the-art methods while revealing previously undisclosed details. The second contribution demonstrates that embedding-free image generation using bit tokens achieves a new state-of-the-art FID of 1.52 on the ImageNet 256×256 benchmark, with a compact generator model of a mere 305M parameters. The code for this project is available at https://github.com/markweberdev/maskbit.
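
A minimal sketch of what bit tokens could look like: quantize each latent channel to a sign bit, so a C-channel latent becomes a C-bit integer token with no learned embedding lookup. Shapes and details here are assumptions, not the MaskBit code.

```python
# Hedged sketch: binary "bit token" packing and unpacking.
import torch

def to_bit_tokens(z: torch.Tensor) -> torch.Tensor:
    """z: (B, C, H, W) continuous latents -> (B, H, W) integer tokens."""
    bits = (z > 0).long()                              # one sign bit per channel
    weights = 2 ** torch.arange(z.shape[1])            # binary place values
    return (bits.permute(0, 2, 3, 1) * weights).sum(-1)

def from_bit_tokens(tok: torch.Tensor, C: int) -> torch.Tensor:
    """Unpack integer tokens back into +-1 bit vectors, no embedding table."""
    bits = (tok.unsqueeze(-1) >> torch.arange(C)) & 1
    return (bits.float() * 2 - 1).permute(0, 3, 1, 2)  # (B, C, H, W)

tok = to_bit_tokens(torch.randn(1, 12, 16, 16))        # 12 bits -> 4096-way vocab
z_hat = from_bit_tokens(tok, C=12)
```

Because every token is just its own binary code, a generator can consume the unpacked bit vector directly instead of indexing a learned embedding table, which is the "embedding-free" part.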

ICML Conference 2024 Conference Paper

VideoPoet: A Large Language Model for Zero-Shot Video Generation

  • Dan Kondratyuk
  • Lijun Yu
  • Xiuye Gu
  • José Lezama
  • Jonathan Huang
  • Grant Schindler
  • Rachel Hornung
  • Vighnesh Birodkar

We present VideoPoet, a language model capable of synthesizing high-quality video from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs, including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting the ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
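
As a rough illustration of the decoder-only, mixed-modality setup, the sketch below concatenates discrete tokens from several modalities into one sequence by offsetting each modality into a disjoint id range. Vocabulary sizes and special tokens are invented placeholders, not VideoPoet's.

```python
# Hedged sketch: one flat token sequence over text, video, and audio ids.
import torch

TEXT_V, VIDEO_V, AUDIO_V = 32000, 8192, 4096   # assumed vocabulary sizes
BOS, BOV, BOA = 0, 1, 2                        # assumed special token ids

def build_sequence(text, video, audio):
    """Offset each modality into its own id range, then concatenate."""
    return torch.cat([
        torch.tensor([BOS]), text + 3,
        torch.tensor([BOV]), video + 3 + TEXT_V,
        torch.tensor([BOA]), audio + 3 + TEXT_V + VIDEO_V,
    ])

seq = build_sequence(torch.randint(0, TEXT_V, (8,)),
                     torch.randint(0, VIDEO_V, (64,)),
                     torch.randint(0, AUDIO_V, (16,)))
# seq can then be trained with ordinary next-token prediction.
```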

ICLR Conference 2023 Conference Paper

Score-based Continuous-time Discrete Diffusion Models

  • Haoran Sun
  • Lijun Yu
  • Bo Dai 0001
  • Dale Schuurmans
  • Hanjun Dai

Score-based modeling through stochastic differential equations (SDEs) has provided a new perspective on diffusion models, and demonstrated superior performance on continuous data. However, the gradient of the log-likelihood function, i.e., the score function, is not properly defined for discrete spaces. This makes it non-trivial to adapt SDEs with score functions to categorical data. In this paper, we extend diffusion models to discrete variables by introducing a stochastic jump process where the reverse process denoises via a continuous-time Markov chain. This formulation admits an analytical simulation during backward sampling. To learn the reverse process, we extend score matching to general categorical data, and show that an unbiased estimator can be obtained via simple matching of the conditional marginal distributions. We demonstrate the effectiveness of the proposed method on a set of synthetic and real-world music and image benchmarks.
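
For the uniform-kernel special case, such a jump process has a closed-form marginal, which makes analytical simulation concrete. A minimal sketch, assuming uniform resampling at a fixed rate; the learned reverse process and the score-matching objective are not reproduced here.

```python
# Hedged sketch: exact marginal sampling for a uniform-rate jump process.
import math
import torch

def ctmc_marginal_sample(x0: torch.Tensor, t: float, K: int, rate: float = 1.0):
    """x_t ~ q(x_t | x_0): with prob exp(-rate * t) no jump has occurred and
    x0 is kept; otherwise the state is resampled uniformly over K categories."""
    keep = math.exp(-rate * t)
    u = torch.rand(x0.shape)
    resampled = torch.randint(0, K, x0.shape)
    return torch.where(u < keep, x0, resampled)

K = 10
xt = ctmc_marginal_sample(torch.randint(0, K, (4, 32)), t=0.5, K=K)
```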

NeurIPS Conference 2023 Conference Paper

SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs

  • Lijun Yu
  • Yong Cheng
  • Zhiruo Wang
  • Vivek Kumar
  • Wolfgang Macherey
  • Yanping Huang
  • David Ross
  • Irfan Essa

In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the rich semantic meaning and the fine-grained details needed for visual reconstruction, effectively translating the visual content into a language comprehensible to the LLM, and empowering it to perform a wide array of multimodal tasks. Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT-3.5 on a diverse set of image understanding and generation tasks. Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%.
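
The core conversion can be read as a nearest-neighbor lookup into the frozen LLM's word-embedding table, so each visual feature becomes an actual word id the LLM already understands. The sketch below uses toy stand-ins for the encoder output and the embedding table, not PaLM 2's.

```python
# Hedged sketch: quantize visual features against a frozen LLM vocabulary.
import torch

vocab_emb = torch.randn(32000, 64)     # stand-in for frozen LLM word embeddings
patch_feats = torch.randn(16, 64)      # stand-in encoder output, one per patch

def to_lexical_tokens(feats: torch.Tensor, emb: torch.Tensor) -> torch.Tensor:
    """Nearest-neighbor lookup: (N, D) features -> (N,) LLM word ids."""
    return torch.cdist(feats, emb).argmin(dim=-1)

word_ids = to_lexical_tokens(patch_feats, vocab_emb)
# The "pyramid" repeats this at several scales: coarse levels carry semantics,
# finer levels add the detail needed for pixel reconstruction.
```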