Arrow Research search

Author name cluster

David J. Fleet

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

24 papers
2 author rows

Possible papers

24

ICLR Conference 2025 Conference Paper

Controlling Space and Time with Diffusion Models

  • Daniel Watson
  • Saurabh Saxena
  • Lala Li
  • Andrea Tagliasacchi
  • David J. Fleet

We present 4DiM, a cascaded diffusion model for 4D novel view synthesis (NVS), supporting generation with arbitrary camera trajectories and timestamps, in natural scenes, conditioned on one or more images. With a novel architecture and sampling procedure, we enable training on a mixture of 3D (with camera pose), 4D (pose+time) and video (time but no pose) data, which greatly improves generalization to unseen images and camera pose trajectories over prior works which generally operate in limited domains (e.g., object centric). 4DiM is the first-ever NVS method with intuitive metric-scale camera pose control enabled by our novel calibration pipeline for structure-from-motion-posed data. Experiments demonstrate that 4DiM outperforms prior 3D NVS models both in terms of image fidelity and pose alignment, while also enabling the generation of scene dynamics. 4DiM provides a general framework for a variety of tasks including single-image-to-3D, two-image-to-video (interpolation and extrapolation), and pose-conditioned video-to-video translation, which we illustrate qualitatively on a variety of scenes. See https://4d-diffusion.github.io for video samples.

AAAI Conference 2025 Conference Paper

High-Resolution Frame Interpolation with Patch-based Cascaded Diffusion

  • Junhwa Hur
  • Charles Herrmann
  • Saurabh Saxena
  • Janne Kontkanen
  • Wei-Sheng Lai
  • Yichang Shih
  • Michael Rubinstein
  • David J. Fleet

Despite the recent progress, existing frame interpolation methods still struggle with processing extremely high resolution input and handling challenging cases such as repetitive textures, thin objects, and large motion. To address these issues, we introduce a patch-based cascaded pixel diffusion model for high resolution frame interpolation, HiFI, that excels in these scenarios while achieving competitive performance on standard benchmarks. Cascades, which generate a series of images from low to high resolution, can help significantly with large or complex motion that require both global context for a coarse solution and detailed context for high resolution output. However, contrary to prior work on cascaded diffusion models which perform diffusion on increasingly large resolutions, we use a single model that always performs diffusion at the same resolution and upsamples by processing patches of the inputs and the prior solution. At inference time, this drastically reduces memory usage and allows a single model, solving both frame interpolation (base model’s task) and spatial up-sampling, saving training cost as well. HiFI excels at high-resolution images and complex repeated textures that require global context, achieving comparable or state-of-the-art performance on various benchmarks (Vimeo, Xiph, X-Test, and SEPE-8K). We further introduce a new dataset, LaMoR, that focuses on particularly challenging cases, and HiFI significantly outperforms other baselines.

JBHI Journal 2025 Journal Article

Personalized Video-Based Hand Taxonomy Using Egocentric Video in the Wild

  • Mehdy Dousty
  • David J. Fleet
  • José Zariffa

Objective: Hand function is central to interactions with our environment. Developing a comprehensive model of hand grasps in naturalistic environments is crucial across various disciplines, including robotics, ergonomics, and rehabilitation. Creating such a taxonomy poses challenges due to the significant variation in grasping strategies that individuals may employ. For instance, individuals with impaired hands, such as those with spinal cord injuries (SCI), may develop unique grasps not used by unimpaired individuals. These grasping techniques may differ from person to person, influenced by variable sensorimotor impairment, creating a need for personalized methods of analysis. Method: This study aimed to automatically identify the dominant distinct hand grasps for each individual without reliance on a priori taxonomies, by applying semantic clustering to egocentric video. Egocentric video recordings collected in the homes of 19 individuals with cervical SCI were used to cluster grasping actions with semantic significance. A deep learning model integrating posture and appearance data was employed to create a personalized hand taxonomy. Results: Quantitative analysis reveals a cluster purity of 67.6% ± 24.2% with 18.0% ± 21.8% redundancy. Qualitative assessment revealed meaningful clusters in video content. Discussion: This methodology provides a flexible and effective strategy to analyze hand function in the wild, with applications in clinical assessment and in-depth characterization of human-environment interactions in a variety of contexts.

NeurIPS Conference 2024 Conference Paper

CryoSPIN: Improving Ab-Initio Cryo-EM Reconstruction with Semi-Amortized Pose Inference

  • Shayan Shekarforoush
  • David B. Lindell
  • Marcus A. Brubaker
  • David J. Fleet

Cryo-EM is an increasingly popular method for determining the atomic resolution 3D structure of macromolecular complexes (e.g., proteins) from noisy 2D images captured by an electron microscope. The computational task is to reconstruct the 3D density of the particle, along with 3D pose of the particle in each 2D image, for which the posterior pose distribution is highly multi-modal. Recent developments in cryo-EM have focused on deep learning for which amortized inference has been used to predict pose. Here, we address key problems with this approach, and propose a new semi-amortized method, cryoSPIN, in which reconstruction begins with amortized inference and then switches to a form of auto-decoding to refine poses locally using stochastic gradient descent. Through evaluation on synthetic datasets, we demonstrate that cryoSPIN is able to handle multi-modal pose distributions during the amortized inference stage, while the later, more flexible stage of direct pose optimization yields faster and more accurate convergence of poses compared to baselines. On experimental data, we show that cryoSPIN outperforms the state-of-the-art cryoAI in speed and reconstruction quality.

ICLR Conference 2024 Conference Paper

Directly Fine-Tuning Diffusion Models on Differentiable Rewards

  • Kevin Clark
  • Paul Vicol
  • Kevin Swersky
  • David J. Fleet

We present Direct Reward Fine-Tuning (DRaFT), a simple and effective method for fine-tuning diffusion models to maximize differentiable reward functions, such as scores from human preference models. We first show that it is possible to backpropagate the reward function gradient through the full sampling procedure, and that doing so achieves strong performance on a variety of rewards, outperforming reinforcement learning-based approaches. We then propose more efficient variants of DRaFT: DRaFT-K, which truncates backpropagation to only the last K steps of sampling, and DRaFT-LV, which obtains lower-variance gradient estimates for the case when K=1. We show that our methods work well for a variety of reward functions and can be used to substantially improve the aesthetic quality of images generated by Stable Diffusion 1.4. Finally, we draw connections between our approach and prior work, providing a unifying perspective on the design space of gradient-based fine-tuning algorithms.

TMLR Journal 2024 Journal Article

Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models

  • Cristina Nader Vasconcelos
  • Abdullah Rashwan
  • Austin Waters
  • Trevor Walker
  • Keyang Xu
  • Jimmy Yan
  • Rui Qian
  • Yeqing Li

We address the long-standing problem of how to learn effective pixel-based image diffusion models at scale, introducing a remarkably simple greedy method for stable training of large-scale, high-resolution models, without the need for cascaded super-resolution components. The key insight stems from careful pre-training of core components, namely, those responsible for text-to-image alignment vs. high resolution rendering. We first demonstrate the benefits of scaling a Shallow UNet, with no down(up)-sampling enc(dec)oder. Scaling its deep core layers is shown to improve alignment, object structure, and composition. Building on this core model, we propose a greedy algorithm that grows the architecture into high resolution end-to-end models, while preserving the integrity of the pre-trained representation, stabilizing training, and reducing the need for large high-resolution datasets. This enables a single stage model capable of generating high-resolution images without the need of a super-resolution cascade. Our key results rely on public datasets and show that we are able to train non-cascaded models up to 8B parameters with no further regularization schemes. Vermeer, our full pipeline model trained with internal datasets to produce 1024×1024 images, without cascades, is preferred by 44.0% vs. 21.4% human evaluators over SDXL.

JBHI Journal 2024 Journal Article

Hand Grasp Classification in Egocentric Video After Cervical Spinal Cord Injury

  • Mehdy Dousty
  • David J. Fleet
  • José Zariffa

Objective: The hand function of individuals with spinal cord injury (SCI) plays a crucial role in their independence and quality of life. Wearable cameras provide an opportunity to analyze hand function in non-clinical environments. Summarizing the video data and documenting dominant hand grasps and their usage frequency would allow clinicians to quickly and precisely analyze hand function. Method: We introduce a new hierarchical model to summarize the grasping strategies of individuals with SCI at home. The first level classifies hand-object interaction using hand-object contact estimation. We developed a new deep model in the second level by incorporating hand postures and hand-object contact points using contextual information. Results: In the first hierarchical level, a mean of 86% ± 1.0% was achieved among 17 participants. At the grasp classification level, the mean average accuracy was 66.2% ± 12.9%. The grasp classifier's performance was highly dependent on the participants, with accuracy varying from 41% to 78%. The highest grasp classification accuracy was obtained for the model with smoothed grasp classification, using a ResNet50 backbone architecture for the contextual head and a temporal pose head. Discussion: We introduce a novel algorithm that, for the first time, enables clinicians to analyze the quantity and type of hand movements in individuals with spinal cord injury at home. The algorithm can find applications in other research fields, including robotics, and most neurological diseases that affect hand function, notably, stroke and Parkinson's.

ICML Conference 2023 Conference Paper

Scalable Adaptive Computation for Iterative Generation

  • Allan Jabri
  • David J. Fleet
  • Ting Chen

Natural data is redundant yet predominant architectures tile computation uniformly across their input and output space. We propose the Recurrent Interface Network (RIN), an attention-based architecture that decouples its core computation from the dimensionality of the data, enabling adaptive computation for more scalable generation of high-dimensional data. RINs focus the bulk of computation (i.e., global self-attention) on a set of latent tokens, using cross-attention to read and write (i.e., route) information between latent and data tokens. Stacking RIN blocks allows bottom-up (data to latent) and top-down (latent to data) feedback, leading to deeper and more expressive routing. While this routing introduces challenges, this is less problematic in recurrent computation settings where the task (and routing problem) changes gradually, such as iterative generation with diffusion models. We show how to leverage recurrence by conditioning the latent tokens at each forward pass of the reverse diffusion process with those from prior computation, i.e., latent self-conditioning. RINs yield state-of-the-art pixel diffusion models for image and video generation, scaling to 1024×1024 images without cascades or guidance, while being domain-agnostic and up to 10× more efficient than 2D and 3D U-Nets.

TMLR Journal 2023 Journal Article

Synthetic Data from Diffusion Models Improves ImageNet Classification

  • Shekoofeh Azizi
  • Simon Kornblith
  • Chitwan Saharia
  • Mohammad Norouzi
  • David J. Fleet

Deep generative models are becoming increasingly powerful, now generating diverse, high fidelity, photo-realistic samples given text prompts. Nevertheless, samples from such models have not been shown to significantly improve model training for challenging and well-studied discriminative tasks like ImageNet classification. In this paper we show that augmenting the ImageNet training set with samples from a generative diffusion model can yield substantial improvements in ImageNet classification accuracy over strong ResNet and Vision Transformer baselines. To this end we explore the fine-tuning of large-scale text-to-image diffusion models, yielding class-conditional ImageNet models with state-of-the-art FID score (1.76 at 256×256 resolution) and Inception Score (239 at 256×256). The model also yields a new state-of-the-art in Classification Accuracy Scores, i.e., ImageNet test accuracy for a ResNet-50 architecture trained solely on synthetic data (64.96 top-1 accuracy for 256×256 samples, improving to 69.24 for 1024×1024 samples). Adding up to three times as many synthetic samples as real training samples consistently improves ImageNet classification accuracy across multiple architectures.

NeurIPS Conference 2023 Conference Paper

The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation

  • Saurabh Saxena
  • Charles Herrmann
  • Junhwa Hur
  • Abhishek Kar
  • Mohammad Norouzi
  • Deqing Sun
  • David J. Fleet

Denoising diffusion probabilistic models have transformed image generation with their impressive fidelity and diversity. We show that they also excel in estimating optical flow and monocular depth, surprisingly without task-specific architectures and loss functions that are predominant for these tasks. Compared to the point estimates of conventional regression-based methods, diffusion models also enable Monte Carlo inference, e.g., capturing uncertainty and ambiguity in flow and depth. With self-supervised pre-training, the combined use of synthetic and real data for supervised training, and technical innovations (infilling and step-unrolled denoising diffusion training) to handle noisy-incomplete training data, one can train state-of-the-art diffusion models for depth and optical flow estimation, with additional zero-shot coarse-to-fine refinement for high resolution estimates. Extensive experiments focus on quantitative performance against benchmarks, ablations, and the model's ability to capture uncertainty and multimodality, and impute missing values. Our model obtains a state-of-the-art relative depth error of 0.074 on the indoor NYU benchmark and an Fl-all score of 3.26% on the KITTI optical flow benchmark, about 25% better than the best published method.

NeurIPS Conference 2022 Conference Paper

A Unified Sequence Interface for Vision Tasks

  • Ting Chen
  • Saurabh Saxena
  • Lala Li
  • Tsung-Yi Lin
  • David J. Fleet
  • Geoffrey E. Hinton

While language tasks are naturally expressed in a single, unified, modeling framework, i.e., generating sequences of tokens, this has not been the case in computer vision. As a result, there is a proliferation of distinct architectures and loss functions for different vision tasks. In this work we show that a diverse set of "core" computer vision tasks can also be unified if formulated in terms of a shared pixel-to-sequence interface. We focus on four tasks, namely, object detection, instance segmentation, keypoint detection, and image captioning, all with diverse types of outputs, e.g., bounding boxes or dense masks. Despite that, by formulating the output of each task as a sequence of discrete tokens with a unified interface, we show that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization. To solve a specific task, we use a short prompt as task description, and the sequence output adapts to the prompt so it can produce task-specific output. We show that such a model can achieve competitive performance compared to well-established task-specific models.

JMLR Journal 2022 Journal Article

Cascaded Diffusion Models for High Fidelity Image Generation

  • Jonathan Ho
  • Chitwan Saharia
  • William Chan
  • David J. Fleet
  • Mohammad Norouzi
  • Tim Salimans

We show that cascaded diffusion models are capable of generating high fidelity images on the class-conditional ImageNet generation benchmark, without any assistance from auxiliary image classifiers to boost sample quality. A cascaded diffusion model comprises a pipeline of multiple diffusion models that generate images of increasing resolution, beginning with a standard diffusion model at the lowest resolution, followed by one or more super-resolution diffusion models that successively upsample the image and add higher resolution details. We find that the sample quality of a cascading pipeline relies crucially on conditioning augmentation, our proposed method of data augmentation of the lower resolution conditioning inputs to the super-resolution models. Our experiments show that conditioning augmentation prevents compounding error during sampling in a cascaded model, helping us to train cascading pipelines achieving FID scores of 1.48 at 64x64, 3.52 at 128x128 and 4.88 at 256x256 resolutions, outperforming BigGAN-deep, and classification accuracy scores of 63.02% (top-1) and 84.06% (top-5) at 256x256, outperforming VQ-VAE-2.

NeurIPS Conference 2022 Conference Paper

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

  • Chitwan Saharia
  • William Chan
  • Saurabh Saxena
  • Lala Li
  • Jay Whang
  • Emily L. Denton
  • Kamyar Ghasemipour
  • Raphael Gontijo Lopes

We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g., T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment. To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. With DrawBench, we compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2, and find that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment.

ICLR Conference 2022 Conference Paper

Pix2seq: A Language Modeling Framework for Object Detection

  • Ting Chen
  • Saurabh Saxena
  • Lala Li
  • David J. Fleet
  • Geoffrey E. Hinton

We present Pix2Seq, a simple and generic framework for object detection. Unlike existing approaches that explicitly integrate prior knowledge about the task, we cast object detection as a language modeling task conditioned on the observed pixel inputs. Object descriptions (e.g., bounding boxes and class labels) are expressed as sequences of discrete tokens, and we train a neural network to perceive the image and generate the desired sequence. Our approach is based mainly on the intuition that if a neural network knows about where and what the objects are, we just need to teach it how to read them out. Beyond the use of task-specific data augmentations, our approach makes minimal assumptions about the task, yet it achieves competitive results on the challenging COCO dataset, compared to highly specialized and well optimized detection algorithms.
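The idea of expressing a bounding box and class label as a sequence of discrete tokens can be illustrated with a minimal sketch. The names `box_to_tokens` and `tokens_to_box`, the `(ymin, xmin, ymax, xmax)` ordering, the bin count, and the choice to place class tokens after the coordinate vocabulary are all assumptions made for illustration; the abstract does not specify them.

```python
def box_to_tokens(box, class_id, num_bins=1000):
    """Quantize one normalized box plus its class into five integer tokens.

    box is (ymin, xmin, ymax, xmax) with every value in [0, 1] (assumed layout).
    Coordinate tokens occupy [0, num_bins); class tokens start at num_bins.
    """
    coord_tokens = []
    for c in box:
        t = int(round(c * (num_bins - 1)))           # map [0, 1] onto a bin index
        coord_tokens.append(min(num_bins - 1, max(0, t)))  # clamp stray values
    return coord_tokens + [num_bins + class_id]


def tokens_to_box(tokens, num_bins=1000):
    """Invert box_to_tokens, up to quantization error."""
    *coords, class_token = tokens
    box = tuple(t / (num_bins - 1) for t in coords)
    return box, class_token - num_bins


# Hypothetical usage: one box with class 3, using 101 bins for readability.
tokens = box_to_tokens((0.0, 0.25, 0.5, 1.0), class_id=3, num_bins=101)
box, class_id = tokens_to_box(tokens, num_bins=101)
```

A sequence model would then be trained to emit such tokens one at a time given the image; that part is outside the scope of this sketch.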

NeurIPS Conference 2022 Conference Paper

Residual Multiplicative Filter Networks for Multiscale Reconstruction

  • Shayan Shekarforoush
  • David Lindell
  • David J. Fleet
  • Marcus A. Brubaker

Coordinate networks like Multiplicative Filter Networks (MFNs) and BACON offer some control over the frequency spectrum used to represent continuous signals such as images or 3D volumes. Yet, they are not readily applicable to problems for which coarse-to-fine estimation is required, including various inverse problems in which coarse-to-fine optimization plays a key role in avoiding poor local minima. We introduce a new coordinate network architecture and training scheme that enables coarse-to-fine optimization with fine-grained control over the frequency support of learned reconstructions. This is achieved with two key innovations. First, we incorporate skip connections so that structure at one scale is preserved when fitting finer-scale structure. Second, we propose a novel initialization scheme to provide control over the model frequency spectrum at each stage of optimization. We demonstrate how these modifications enable multiscale optimization for coarse-to-fine fitting to natural images. We then evaluate our model on synthetically generated datasets for the problem of single-particle cryo-EM reconstruction. We learn high resolution multiscale structures, on par with the state-of-the-art. Project webpage: https://shekshaa.github.io/ResidualMFN/.

NeurIPS Conference 2022 Conference Paper

Video Diffusion Models

  • Jonathan Ho
  • Tim Salimans
  • Alexey Gritsenko
  • William Chan
  • Mohammad Norouzi
  • David J. Fleet

Generating temporally coherent high fidelity video is an important milestone in generative modeling research. We make progress towards this milestone by proposing a diffusion model for video generation that shows very promising initial results. Our model is a natural extension of the standard image diffusion architecture, and it enables jointly training from image and video data, which we find to reduce the variance of minibatch gradients and speed up optimization. To generate long and higher resolution videos we introduce a new conditional sampling technique for spatial and temporal video extension that performs better than previously proposed methods. We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on established benchmarks for video prediction and unconditional video generation. Supplementary material is available at https://video-diffusion.github.io/.

ICML Conference 2021 Conference Paper

Unsupervised Part Representation by Flow Capsules

  • Sara Sabour
  • Andrea Tagliasacchi
  • Soroosh Yazdani
  • Geoffrey E. Hinton
  • David J. Fleet

Capsule networks aim to parse images into a hierarchy of objects, parts and relations. While promising, they remain limited by an inability to learn effective low level part descriptions. To address this issue we propose a way to learn primary capsule encoders that detect atomic parts from a single image. During training we exploit motion as a powerful perceptual cue for part definition, with an expressive decoder for part generation within a layered image model with occlusion. Experiments demonstrate robust part discovery in the presence of multiple objects, cluttered backgrounds, and occlusion. The learned part decoder is shown to infer the underlying shape masks, effectively filling in occluded regions of the detected shapes. We evaluate FlowCapsules on unsupervised part segmentation and unsupervised image classification.

NeurIPS Conference 2020 Conference Paper

Exemplar VAE: Linking Generative Models, Nearest Neighbor Retrieval, and Data Augmentation

  • Sajad Norouzi
  • David J. Fleet
  • Mohammad Norouzi

We introduce Exemplar VAEs, a family of generative models that bridge the gap between parametric and non-parametric, exemplar based generative models. Exemplar VAE is a variant of VAE with a non-parametric latent prior based on a Parzen window estimator. To sample from it, one first draws a random exemplar from a training set, then stochastically transforms that exemplar into a latent code and a new observation. We propose retrieval augmented training (RAT) as a way to speed up Exemplar VAE training by using approximate nearest neighbor search in the latent space to define a lower bound on log marginal likelihood. To enhance generalization, model parameters are learned using exemplar leave-one-out and subsampling. Experiments demonstrate the effectiveness of Exemplar VAEs on density estimation and representation learning. Importantly, generative data augmentation using Exemplar VAEs on permutation invariant MNIST and Fashion MNIST reduces classification error from 1.17% to 0.69% and from 8.56% to 8.16%.
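The sampling procedure described in this abstract (pick an exemplar, map it stochastically to a latent code, decode a new observation) can be sketched roughly as follows. This is only a toy illustration: `encode` and `decode` are hypothetical stand-ins for learned networks, and the isotropic Gaussian kernel with a fixed width `sigma` is an assumption, not a detail taken from the paper.

```python
import random


def sample_from_exemplar_prior(exemplars, encode, decode, sigma, rng):
    """Draw one sample using a Parzen-window-style prior centered on exemplars.

    exemplars: list of training points (each a list of floats)
    encode:    maps a data point to a latent mean (stand-in for an encoder)
    decode:    maps a latent code to an observation (stand-in for a decoder)
    sigma:     assumed kernel width of the Gaussian placed on each exemplar
    rng:       random.Random instance, passed in for reproducibility
    """
    x = rng.choice(exemplars)                              # 1. random exemplar
    mu = encode(x)                                         # 2. latent mean for it
    z = [m + sigma * rng.gauss(0.0, 1.0) for m in mu]      #    perturb in latent space
    return z, decode(z)                                    # 3. decode a new observation


# Toy invertible "networks" so the sketch runs without any ML library.
def toy_encode(x):
    return [x[0] + x[1], x[0] - x[1]]


def toy_decode(z):
    return [(z[0] + z[1]) / 2, (z[0] - z[1]) / 2]


data = [[0.0, 1.0], [2.0, 3.0]]
z, x_new = sample_from_exemplar_prior(data, toy_encode, toy_decode, 0.1, random.Random(0))
```

With `sigma` set to zero the toy sampler reproduces a training exemplar exactly, which makes the role of the kernel width easy to see: larger values move samples further from the stored exemplars.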

UAI Conference 2019 Conference Paper

Differentiable Probabilistic Models of Scientific Imaging with the Fourier Slice Theorem

  • Karen Ullrich
  • Rianne van den Berg
  • Marcus A. Brubaker
  • David J. Fleet
  • Max Welling

Scientific imaging techniques such as optical and electron microscopy and computed tomography (CT) scanning are used to study the 3D structure of an object through 2D observations. These observations are related to the original 3D object through orthogonal integral projections. For common 3D reconstruction algorithms, computational efficiency requires the modeling of the 3D structures to take place in Fourier space by applying the Fourier slice theorem. At present, it is unclear how to differentiate through the projection operator, and hence current learning algorithms cannot rely on gradient based methods to optimize 3D structure models. In this paper we show how back-propagation through the projection operator in Fourier space can be achieved. We demonstrate the validity of the approach with experiments on 3D reconstruction of proteins. We further extend our approach to learning probabilistic models of 3D objects. This allows us to predict regions of low sampling rates or estimate noise. A higher sample efficiency can be reached by utilizing the learned uncertainties of the 3D structure as an unsupervised estimate of the model fit. Finally, we demonstrate how the reconstruction algorithm can be extended with an amortized inference scheme on unknown attributes such as object pose. Through empirical studies we show that joint inference of the 3D structure and the object pose becomes more difficult when the ground truth object contains more symmetries. Due to the presence of, for instance, (approximate) rotational symmetries, the pose estimation can easily get stuck in local optima, inhibiting a fine-grained high-quality estimate of the 3D structure.

ICRA Conference 2005 Conference Paper

Learning Sensor Network Topology through Monte Carlo Expectation Maximization

  • Dimitri Marinakis
  • Gregory Dudek
  • David J. Fleet

We consider the problem of inferring sensor positions and a topological (i.e., qualitative) map of an environment given a set of cameras with non-overlapping fields of view. In this way, without prior knowledge of the environment nor the exact position of sensors within the environment, one can infer the topology of the environment, and common traffic patterns within it. In particular, we consider sensors stationed at the junctions of the hallways of a large building. We infer the sensor connectivity graph and the travel times between sensors (and hence the hallway topology) from the sequence of events caused by unlabeled agents (i.e., people) passing within view of the different sensors. We do this based on a first-order semi-Markov model of the agent's behavior. The paper describes a problem formulation and proposes a stochastic algorithm for its solution. The result of the algorithm is a probabilistic model of the sensor network connectivity graph and the underlying traffic patterns. We conclude with results from numerical simulations.

UAI Conference 2001 Conference Paper

Lattice Particle Filters

  • Dirk Ormoneit
  • Christiane Lemieux
  • David J. Fleet

A standard approach to approximate inference in state-space models is to apply a particle filter, e.g., the Condensation Algorithm. However, the performance of particle filters often varies significantly due to their stochastic nature. We present a class of algorithms, called lattice particle filters, that circumvent this difficulty by placing the particles deterministically according to a Quasi-Monte Carlo integration rule. We describe a practical realization of this idea, discuss its theoretical properties, and its efficiency. Experimental results with a synthetic 2D tracking problem show that the lattice particle filter is equivalent to a conventional particle filter that has between 10 and 60% more particles, depending on their "sparsity" in the state-space. We also present results on inferring 3D human motion from moving light displays.
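As a rough illustration of deterministic point placement by a quasi-Monte Carlo rule, the sketch below builds a 2D rank-1 Fibonacci lattice and uses it to estimate an integral over the unit square. It shows the integration rule only and is not the lattice particle filter itself; the Fibonacci lattice is an assumed, textbook choice of rule, and the function names are hypothetical.

```python
def fibonacci_lattice(k):
    """Return the N = F_k points of a 2D rank-1 Fibonacci lattice in [0, 1)^2.

    Point i is (i / N, (i * F_{k-1} mod N) / N), so the set is fully
    deterministic: no random number generator is involved.
    """
    a, n = 1, 1                      # consecutive Fibonacci numbers F_1, F_2
    for _ in range(k - 2):
        a, n = n, a + n              # advance to F_{k-1}, F_k
    return [(i / n, (i * a % n) / n) for i in range(n)]


def lattice_estimate(f, points):
    """Equal-weight average of f over the lattice points."""
    return sum(f(x, y) for x, y in points) / len(points)


# Hypothetical usage: estimate the integral of x*y over the unit square,
# whose exact value is 1/4, with the 233-point lattice (k = 13).
points = fibonacci_lattice(13)
estimate = lattice_estimate(lambda x, y: x * y, points)
```

Because the points are fixed, repeated runs give identical estimates, which is the property the abstract contrasts with the run-to-run variability of randomly placed particles.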