Arrow Research search

Author name cluster

Manuel Brack

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

11 papers
2 author rows

Possible papers

11

ICML Conference 2025 Conference Paper

LlavaGuard: An Open VLM-based Framework for Safeguarding Vision Datasets and Models

  • Lukas Helff
  • Felix Friedrich
  • Manuel Brack
  • Kristian Kersting
  • Patrick Schramowski

This paper introduces Llavaguard, a suite of VLM-based vision safeguards that address the critical need for reliable tools in the era of large-scale data and models. To this end, we establish a novel open framework, describing a customizable safety taxonomy, data preprocessing, augmentation, and training setup. For teaching a VLM safeguard on safety, we further create a multimodal safety dataset with high-quality human expert annotations, where each image is labeled with a safety rating, category, and rationale. We also employ advanced augmentations to support context-specific assessments. The resulting Llavaguard models, ranging from 0. 5B to 7B, serve as a versatile tool for evaluating the safety compliance of visual content against flexible policies. In comprehensive experiments, Llavaguard outperforms both state-of-the-art safeguards and VLMs in accuracy and in flexibly handling different policies. Additionally, we demonstrate Llavaguard’s performance in two real-world applications: large-scale dataset annotation and moderation of text-to-image models. We make our entire framework, including the dataset, model weights, and training code, publicly available at https: //ml-research. github. io/human-centered-genai/projects/llavaguard.

NeurIPS Conference 2025 Conference Paper

Measuring and Guiding Monosemanticity

  • Ruben Härle
  • Felix Friedrich
  • Manuel Brack
  • Björn Deiseroth
  • Stephan Waeldchen
  • Patrick Schramowski
  • Kristian Kersting

There is growing interest in leveraging mechanistic interpretability and controllability to better understand and influence the internal dynamics of large language models (LLMs). However, current methods face fundamental challenges in reliably localizing and manipulating feature representations. Sparse Autoencoders (SAEs) have recently emerged as a promising direction for feature extraction at scale, yet they, too, are limited by incomplete feature isolation and unreliable monosemanticity. To systematically quantify these limitations, we introduce Feature Monosemanticity Score (FMS), a novel metric to quantify feature monosemanticity in latent representation. Building on these insights, we propose Guided Sparse Autoencoders (G-SAE), a method that conditions latent representations on labeled concepts during training. We demonstrate that reliable localization and disentanglement of target concepts within the latent space improve interpretability, detection of behavior, and control. Specifically, our evaluations on toxicity detection, writing style identification, and privacy attribute recognition show that G-SAE not only enhances monosemanticity but also enables more effective and fine-grained steering with less quality degradation. Our findings provide actionable guidelines for measuring and advancing mechanistic interpretability and control of LLMs.

NeurIPS Conference 2024 Conference Paper

DeiSAM: Segment Anything with Deictic Prompting

  • Hikaru Shindo
  • Manuel Brack
  • Gopika Sudhakaran
  • Devendra S. Dhami
  • Patrick Schramowski
  • Kristian Kersting

Large-scale, pre-trained neural networks have demonstrated strong capabilities in various tasks, including zero-shot image segmentation. To identify concrete objects in complex scenes, humans instinctively rely on deictic descriptions in natural language, i. e. , referring to something depending on the context such as "The object that is on the desk and behind the cup. ". However, deep learning approaches cannot reliably interpret such deictic representations due to their lack of reasoning capabilities in complex scenarios. To remedy this issue, we propose DeiSAM — a combination of large pre-trained neural networks with differentiable logic reasoners — for deictic promptable segmentation. Given a complex, textual segmentation description, DeiSAM leverages Large Language Models (LLMs) to generate first-order logic rules and performs differentiable forward reasoning on generated scene graphs. Subsequently, DeiSAM segments objects by matching them to the logically inferred image regions. As part of our evaluation, we propose the Deictic Visual Genome (DeiVG) dataset, containing paired visual input and complex, deictic textual prompts. Our empirical results demonstrate that DeiSAM is a substantial improvement over purely data-driven baselines for deictic promptable segmentation.

JAIR Journal 2024 Journal Article

Does CLIP Know My Face?

  • Dominik Hintersdorf
  • Lukas Struppek
  • Manuel Brack
  • Felix Friedrich
  • Patrick Schramowski
  • Kristian Kersting

With the rise of deep learning in various applications, privacy concerns around the protection of training data have become a critical area of research. Whereas prior studies have focused on privacy risks in single-modal models, we introduce a novel method to assess privacy for multi-modal models, specifically vision-language models like CLIP. The proposed Identity Inference Attack (IDIA) reveals whether an individual was included in the training data by querying the model with images of the same person. Letting the model choose from a wide variety of possible text labels, the model reveals whether it recognizes the person and, therefore, was used for training. Our large-scale experiments on CLIP demonstrate that individuals used for training can be identified with very high accuracy. We confirm that the model has learned to associate names with depicted individuals, implying the existence of sensitive information that can be extracted by adversaries. Our results highlight the need for stronger privacy protection in large-scale models and suggest that IDIAs can be used to prove the unauthorized use of data for training and to enforce privacy laws. This article appears in the AI & Society track.

IJCAI Conference 2024 Conference Paper

Exploiting Cultural Biases via Homoglyphs inText-to-Image Synthesis (Abstract Reprint)

  • Lukas Struppek
  • Dominik Hintersdorf
  • Felix Friedrich
  • Manuel Brack
  • Patrick Schramowski
  • Kristian Kersting

Models for text-to-image synthesis, such as DALL-E 2 and Stable Diffusion, have recently drawn a lot of interest from academia and the general public. These models are capable of producing high-quality images that depict a variety of concepts and styles when conditioned on textual descriptions. However, these models adopt cultural characteristics associated with specific Unicode scripts from their vast amount of training data, which may not be immediately apparent. We show that by simply inserting single non-Latin characters in the textual description, common models reflect cultural biases in their generated images. We analyze this behavior both qualitatively and quantitatively and identify a model’s text encoder as the root cause of the phenomenon. Such behavior can be interpreted as a model feature, offering users a simple way to customize the image generation and reflect their own cultural background. Yet, malicious users or service providers may also try to intentionally bias the image generation. One goal might be to create racist stereotypes by replacing Latin characters with similarly-looking characters from non-Latin scripts, so-called homoglyphs. To mitigate such unnoticed script attacks, we propose a novel homoglyph unlearning method to fine-tune a text encoder, making it robust against homoglyph manipulations.

TMLR Journal 2024 Journal Article

Learning by Self-Explaining

  • Wolfgang Stammer
  • Felix Friedrich
  • David Steinmann
  • Manuel Brack
  • Hikaru Shindo
  • Kristian Kersting

Much of explainable AI research treats explanations as a means for model inspection. Yet, this neglects findings from human psychology that describe the benefit of self-explanations in an agent’s learning process. Motivated by this, we introduce a novel workflow in the context of image classification, termed Learning by Self-Explaining (LSX). LSX utilizes aspects of self-refining AI and human-guided explanatory machine learning. The underlying idea is that a learner model, in addition to optimizing for the original predictive task, is further optimized based on explanatory feedback from an internal critic model. Intuitively, a learner’s explanations are considered “useful” if the internal critic can perform the same task given these explanations. We provide an overview of important components of LSX and, based on this, perform extensive experimental evaluations via three different example instantiations. Our results indicate improvements via Learning by Self-Explaining on several levels: in terms of model generalization, reducing the influence of confounding factors, and providing more task-relevant and faithful model explanations. Overall, our work provides evidence for the potential of self-explaining within the learning phase of an AI model.

NeurIPS Conference 2023 Conference Paper

ATMAN: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation

  • Björn Deiseroth
  • Mayukh Deb
  • Samuel Weinbach
  • Manuel Brack
  • Patrick Schramowski
  • Kristian Kersting

Generative transformer models have become increasingly complex, with large numbers of parameters and the ability to process multiple input modalities. Current methods for explaining their predictions are resource-intensive. Most crucially, they require prohibitively large amounts of additional memory, since they rely on backpropagation which allocates almost twice as much GPU memory as the forward pass. This makes it difficult, if not impossible, to use explanations in production. We present AtMan that provides explanations of generative transformer models at almost no extra cost. Specifically, AtMan is a modality-agnostic perturbation method that manipulates the attention mechanisms of transformers to produce relevance maps for the input with respect to the output prediction. Instead of using backpropagation, AtMan applies a parallelizable token-based search method relying on cosine similarity neighborhood in the embedding space. Our exhaustive experiments on text and image-text benchmarks demonstrate that AtMan outperforms current state-of-the-art gradient-based methods on several metrics while being computationally efficient. As such, AtMan is suitable for use in large model inference deployments.

JAIR Journal 2023 Journal Article

Exploiting Cultural Biases via Homoglyphs in Text-to-Image Synthesis

  • Lukas Struppek
  • Dominik Hintersdorf
  • Felix Friedrich
  • Manuel Brack
  • Patrick Schramowski
  • Kristian Kersting

Models for text-to-image synthesis, such as DALL-E 2 and Stable Diffusion, have recently drawn a lot of interest from academia and the general public. These models are capable of producing high-quality images that depict a variety of concepts and styles when conditioned on textual descriptions. However, these models adopt cultural characteristics associated with specific Unicode scripts from their vast amount of training data, which may not be immediately apparent. We show that by simply inserting single non-Latin characters in the textual description, common models reflect cultural biases in their generated images. We analyze this behavior both qualitatively and quantitatively and identify a model’s text encoder as the root cause of the phenomenon. Such behavior can be interpreted as a model feature, offering users a simple way to customize the image generation and reflect their own cultural background. Yet, malicious users or service providers may also try to intentionally bias the image generation. One goal might be to create racist stereotypes by replacing Latin characters with similarly-looking characters from non-Latin scripts, so-called homoglyphs. To mitigate such unnoticed script attacks, we propose a novel homoglyph unlearning method to fine-tune a text encoder, making it robust against homoglyph manipulations.

ICML Conference 2023 Conference Paper

ILLUME: Rationalizing Vision-Language Models through Human Interactions

  • Manuel Brack
  • Patrick Schramowski
  • Björn Deiseroth
  • Kristian Kersting

Bootstrapping from pre-trained language models has been proven to be an efficient approach for building vision-language models (VLM) for tasks such as image captioning or visual question answering. However, outputs of these models rarely align with user’s rationales for specific answers. In order to improve this alignment and reinforce commonsense reasons, we propose a tuning paradigm based on human interactions with machine-generated data. Our ILLUME executes the following loop: Given an image-question-answer prompt, the VLM samples multiple candidate rationales, and a human critic provides feedback via preference selection, used for fine-tuning. This loop increases the training data and gradually carves out the VLM’s rationalization capabilities that are aligned with human intent. Our exhaustive experiments demonstrate that ILLUME is competitive with standard supervised finetuning while using significantly fewer training data and only requiring minimal feedback.

NeurIPS Conference 2023 Conference Paper

MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation

  • Marco Bellagente
  • Manuel Brack
  • Hannah Teufel
  • Felix Friedrich
  • Björn Deiseroth
  • Constantin Eichenberg
  • Andrew M. Dai
  • Robert Baldock

The recent popularity of text-to-image diffusion models (DM) can largely be attributed to the intuitive interface they provide to users. The intended generation can be expressed in natural language, with the model producing faithful interpretations of text prompts. However, expressing complex or nuanced ideas in text alone can be difficult. To ease image generation, we propose MultiFusion that allows one to express complex and nuanced concepts with arbitrarily interleaved inputs of multiple modalities and languages. MultiFusion leverages pre-trained models and aligns them for integration into a cohesive system, thereby avoiding the need for extensive training from scratch. Our experimental results demonstrate the efficient transfer of capabilities from individual modules to the downstream model. Specifically, the fusion of all independent components allows the image generation module to utilize multilingual, interleaved multimodal inputs despite being trained solely on monomodal data in a single language.

NeurIPS Conference 2023 Conference Paper

SEGA: Instructing Text-to-Image Models using Semantic Guidance

  • Manuel Brack
  • Felix Friedrich
  • Dominik Hintersdorf
  • Lukas Struppek
  • Patrick Schramowski
  • Kristian Kersting

Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user’s intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) generalizes to any generative architecture using classifier-free guidance. More importantly, it allows for subtle and extensive edits, composition and style changes, and optimizing the overall artistic conception. We demonstrate SEGA’s effectiveness on both latent and pixel-based diffusion models such as Stable Diffusion, Paella, and DeepFloyd-IF using a variety of tasks, thus providing strong evidence for its versatility and flexibility.