Arrow Research search
Back to ICML

ICML 2024

FiT: Flexible Vision Transformer for Diffusion Model

Conference Paper Accept (Spotlight) Artificial Intelligence ยท Machine Learning

Abstract

In the context of this reality, existing diffusion models, such as Diffusion Transformers, often face challenges when processing image resolutions outside of their trained domain. To overcome this limitation, we present the Flexible Vision Transformer (FiT), a transformer architecture specifically designed for generating images with unrestricted resolutions and aspect ratios. Unlike traditional methods that perceive images as static-resolution grids, FiT conceptualizes images as sequences of dynamically-sized tokens. This perspective enables a flexible training strategy that effortlessly adapts to diverse aspect ratios during both training and inference phases, thus promoting resolution generalization and eliminating biases induced by image cropping. Enhanced by a meticulously adjusted network structure and the integration of training-free extrapolation techniques, FiT exhibits remarkable flexibility in resolution extrapolation generation. Comprehensive experiments demonstrate the exceptional performance of FiT across a broad range of resolutions. Repository available at https: //github. com/whlzy/FiT.

Authors

Keywords

No keywords are indexed for this paper.

Context

Venue
International Conference on Machine Learning
Archive span
1993-2025
Indexed papers
16471
Paper id
474545798239677519