Author name cluster

Nataniel Ruiz

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers

2 author rows

TMLR Journal 2026 Journal Article

\texttt{Complex-Edit}: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark

Siwei Yang
Mude Hui
Bingchen Zhao
Yuyin Zhou
Nataniel Ruiz
Cihang Xie

We introduce Complex-Edit, a comprehensive benchmark designed to systematically evaluate instruction-based image editing models across instructions of varying complexity. To develop this benchmark, we harness GPT-4o to automatically collect a diverse set of editing instructions at scale. Our approach follows a well-structured "Chain-of-Edit" pipeline: we first generate individual atomic editing tasks independently and then integrate them to form cohesive, complex instructions. Additionally, we introduce a suite of metrics to assess various aspects of editing performance, along with a VLM-based auto-evaluation pipeline that supports large-scale assessments. Our benchmark yields several notable insights: 1) Open-source models significantly underperform relative to proprietary, closed-source models, with the performance gap widening as instruction complexity increases; 2) Increased instructional complexity primarily impairs the models’ ability to retain key elements from the input images; 3) Stronger models aren't necessarily more resilient towards higher complexity; 4) Decomposing a complex instruction into a sequence of atomic steps, executed in a step-by-step manner, substantially degrades performance across multiple metrics; 5) A straightforward Best-of-N selection strategy improves results for both direct editing and the step-by-step sequential approach; and 6) We observe a "curse of synthetic data": when synthetic data is involved in model training, the edited images from such models tend to appear increasingly synthetic as the complexity of the editing instructions rises --- a phenomenon that intriguingly also manifests in the latest GPT-Image-1's outputs. The code for evaluation and data generation, and the test set is released at https://github.com/UCSC-VLAA/Complex-Edit.

PDF Details

NeurIPS Conference 2025 Conference Paper

Noise Hypernetworks: Amortizing Test-Time Compute in Diffusion Models

Luca Eyring
Shyamgopal Karthik
Alexey Dosovitskiy
Nataniel Ruiz
Zeynep Akata

The new paradigm of test-time scaling has yielded remarkable breakthroughs in Large Language Models (LLMs) (e. g. reasoning models) and in generative vision models, allowing models to allocate additional computation during inference to effectively tackle increasingly complex problems. Despite the improvements of this approach, an important limitation emerges: the substantial increase in computation time makes the process slow and impractical for many applications. Given the success of this paradigm and its growing usage, we seek to preserve its benefits while eschewing the inference overhead. In this work we propose one solution to the critical problem of integrating test-time scaling knowledge into a model during post-training. Specifically, we replace reward guided test-time noise optimization in diffusion models with a Noise Hypernetwork that modulates initial input noise. We propose a theoretically grounded framework for learning this reward-tilted distribution for distilled generators, through a tractable noise-space objective that maintains fidelity to the base model while optimizing for desired characteristics. We show that our approach recovers a substantial portion of the quality gains from explicit test-time optimization at a fraction of the computational cost.

PDF Details

ICLR Conference 2025 Conference Paper

RB-Modulation: Training-Free Stylization using Reference-Based Modulation

Litu Rout
Yujia Chen 0001
Nataniel Ruiz
Abhishek Kumar
Constantine Caramanis
Sanjay Shakkottai
Wen-Sheng Chu

We propose Reference-Based Modulation (RB-Modulation), a new plug-and-play solution for training-free personalization of diffusion models. Existing training-free approaches exhibit difficulties in (a) style extraction from reference images in the absence of additional style or content text descriptions, (b) unwanted content leakage from reference style images, and (c) effective composition of style and content. RB-Modulation is built on a novel stochastic optimal controller where a style descriptor encodes the desired attributes through a terminal cost. The resulting drift not only overcomes the difficulties above, but also ensures high fidelity to the reference style and adheres to the given text prompt. We also introduce a cross-attention-based feature aggregation scheme that allows RB-Modulation to decouple content and style from the reference image. With theoretical justification and empirical evidence, our test-time optimization framework demonstrates precise extraction and control of *content* and *style* in a training-free manner. Further, our method allows a seamless composition of content and style, which marks a departure from the dependency on external adapters or ControlNets. See project page: https://rb-modulation.github.io/ for code and further details.

Details

ICLR Conference 2025 Conference Paper

Semantic Image Inversion and Editing using Rectified Stochastic Differential Equations

Litu Rout
Yujia Chen 0001
Nataniel Ruiz
Constantine Caramanis
Sanjay Shakkottai
Wen-Sheng Chu

Generative models transform random noise into images, while their inversion aims to reconstruct structured noise for recovery and editing. This paper addresses two key tasks: (i) *inversion* and (ii) *editing* of real images using stochastic equivalents of rectified flow models (e.g., Flux). While Diffusion Models (DMs) dominate the field of generative modeling for images, their inversion suffers from faithfulness and editability challenges due to nonlinear drift and diffusion. Existing DM inversion methods require costly training of additional parameters or test-time optimization of latent variables. Rectified Flows (RFs) offer a promising alternative to DMs, yet their inversion remains underexplored. We propose RF inversion using dynamic optimal control derived via a linear quadratic regulator, and prove that the resulting vector field is equivalent to a rectified stochastic differential equation. We further extend our framework to design a stochastic sampler for Flux. Our method achieves state-of-the-art performance in zero-shot inversion and editing, surpassing prior works in stroke-to-image synthesis and semantic image editing, with large-scale human evaluations confirming user preference. See our project page https://rf-inversion.github.io/ for code and demo.

Details

ICLR Conference 2025 Conference Paper

Unbounded: A Generative Infinite Game of Character Life Simulation

Jialu Li 0001
Yuanzhen Li
Neal Wadhwa
Yael Pritch
David E. Jacobs
Michael Rubinstein
Mohit Bansal
Nataniel Ruiz

We introduce the concept of a generative infinite game, a video game that transcends the traditional boundaries of finite, hard-coded systems by using generative models. Inspired by James P. Carse's distinction between finite and infinite games, we leverage recent advances in generative AI to create Unbounded: a game of character life simulation that is fully encapsulated in generative models. Specifically, Unbounded draws inspiration from sandbox life simulations and allows you to interact with your autonomous virtual character in a virtual world by feeding, playing with and guiding it - with open-ended mechanics generated by an LLM, some of which can be emergent. In order to develop Unbounded, we propose technical innovations in both the LLM and visual generation domains. Specifically, we present: (1) a specialized, distilled large language model (LLM) that dynamically generates game mechanics, narratives, and character interactions in real-time, and (2) a new dynamic regional image prompt Adapter (IP-Adapter) for vision models that ensures consistent yet flexible visual generation of a character across multiple environments. We evaluate our system through both qualitative and quantitative analysis, showing significant improvements in character life simulation, user instruction following, narrative coherence, and visual consistency for both characters and the environments compared to traditional related approaches.

Details

AAAI Conference 2023 Conference Paper

Practical Disruption of Image Translation Deepfake Networks

Nataniel Ruiz
Sarah Adel Bargal
Cihang Xie
Stan Sclaroff

By harnessing the latest advances in deep learning, image-to-image translation architectures have recently achieved impressive capabilities. Unfortunately, the growing representational power of these architectures has prominent unethical uses. Among these, the threats of (1) face manipulation ("DeepFakes") used for misinformation or pornographic use (2) "DeepNude" manipulations of body images to remove clothes from individuals, etc. Several works tackle the task of disrupting such image translation networks by inserting imperceptible adversarial attacks into the input image. Nevertheless, these works have limitations that may result in disruptions that are not practical in the real world. Specifically, most works generate disruptions in a white-box scenario, assuming perfect knowledge about the image translation network. The few remaining works that assume a black-box scenario require a large number of queries to successfully disrupt the adversary's image translation network. In this work we propose Leaking Transferable Perturbations (LTP), an algorithm that significantly reduces the number of queries needed to disrupt an image translation network by dynamically re-purposing previous disruptions into new query efficient disruptions.

PDF Details DOI

NeurIPS Conference 2023 Conference Paper

StyleDrop: Text-to-Image Synthesis of Any Style

Kihyuk Sohn
Lu Jiang
Jarred Barber
Kimin Lee
Nataniel Ruiz
Dilip Krishnan
Huiwen Chang
Yuanzhen Li

Pre-trained large text-to-image models synthesize impressive images with an appropriate use of text prompts. However, ambiguities inherent in natural language, and out-of-distribution effects make it hard to synthesize arbitrary image styles, leveraging a specific design pattern, texture or material. In this paper, we introduce StyleDrop, a method that enables the synthesis of images that faithfully follow a specific style using a text-to-image model. StyleDrop is extremely versatile and captures nuances and details of a user-provided style, such as color schemes, shading, design patterns, and local and global effects. StyleDrop works by efficiently learning a new style by fine-tuning very few trainable parameters (less than 1\% of total model parameters), and improving the quality via iterative training with either human or automated feedback. Better yet, StyleDrop is able to deliver impressive results even when the user supplies only a single image specifying the desired style. An extensive study shows that, for the task of style tuning text-to-image models, StyleDrop on Muse convincingly outperforms other methods, including DreamBooth and textual inversion on Imagen or Stable Diffusion. More results are available at our project website: https: //styledrop. github. io.

PDF Details

NeurIPS Conference 2023 Conference Paper

Subject-driven Text-to-Image Generation via Apprenticeship Learning

Wenhu Chen
Hexiang Hu
Yandong Li
Nataniel Ruiz
Xuhui Jia
Ming-Wei Chang
William W. Cohen

Recent text-to-image generation models like DreamBooth have made remarkable progress in generating highly customized images of a target subject, by fine-tuning an ``expert model'' for a given subject from a few examples. However, this process is expensive, since a new expert model must be learned for each subject. In this paper, we present SuTI, a Subject-driven Text-to-Image generator that replaces subject-specific fine tuning with {in-context} learning. Given a few demonstrations of a new subject, SuTI can instantly generate novel renditions of the subject in different scenes, without any subject-specific optimization. SuTI is powered by {apprenticeship learning}, where a single apprentice model is learned from data generated by a massive number of subject-specific expert models. Specifically, we mine millions of image clusters from the Internet, each centered around a specific visual subject. We adopt these clusters to train a massive number of expert models, each specializing in a different subject. The apprentice model SuTI then learns to imitate the behavior of these fine-tuned experts. SuTI can generate high-quality and customized subject-specific images 20x faster than optimization-based SoTA methods. On the challenging DreamBench and DreamBench-v2, our human evaluation shows that SuTI significantly outperforms existing models like InstructPix2Pix, Textual Inversion, Imagic, Prompt2Prompt, Re-Imagen and DreamBooth.

PDF Details

NeurIPS Conference 2022 Conference Paper

Finding Differences Between Transformers and ConvNets Using Counterfactual Simulation Testing

Nataniel Ruiz
Sarah Bargal
Cihang Xie
Kate Saenko
Stan Sclaroff

Modern deep neural networks tend to be evaluated on static test sets. One shortcoming of this is the fact that these deep neural networks cannot be easily evaluated for robustness issues with respect to specific scene variations. For example, it is hard to study the robustness of these networks to variations of object scale, object pose, scene lighting and 3D occlusions. The main reason is that collecting real datasets with fine-grained naturalistic variations of sufficient scale can be extremely time-consuming and expensive. In this work, we present Counterfactual Simulation Testing, a counterfactual framework that allows us to study the robustness of neural networks with respect to some of these naturalistic variations by building realistic synthetic scenes that allow us to ask counterfactual questions to the models, ultimately providing answers to questions such as "Would your classification still be correct if the object were viewed from the top? " or "Would your classification still be correct if the object were partially occluded by another object? ". Our method allows for a fair comparison of the robustness of recently released, state-of-the-art Convolutional Neural Networks and Vision Transformers, with respect to these naturalistic variations. We find evidence that ConvNext is more robust to pose and scale variations than Swin, that ConvNext generalizes better to our simulated domain and that Swin handles partial occlusion better than ConvNext. We also find that robustness for all networks improves with network scale and with data scale and variety. We release the Naturalistic Variation Object Dataset (NVD), a large simulated dataset of 272k images of everyday objects with naturalistic variations such as object pose, scale, viewpoint, lighting and occlusions. Project page: https: //counterfactualsimulation. github. io

PDF Details