Arrow Research search

Author name cluster

Bernt Schiele

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

43 papers
2 author rows

Possible papers (43)

TMLR Journal 2026 Journal Article

Beyond Accuracy: What Matters in Designing Well-Behaved Image Classification Models?

  • Robin Hesse
  • Doğukan Bağcı
  • Bernt Schiele
  • Simone Schaub-Meyer
  • Stefan Roth

Deep learning has become an essential part of computer vision, with deep neural networks (DNNs) excelling in predictive performance. However, they often fall short in other critical quality dimensions, such as robustness, calibration, or fairness. While existing studies have focused on a subset of these quality dimensions, none have explored a more general form of "well-behavedness" of DNNs. With this work, we address this gap by simultaneously studying nine different quality dimensions for image classification. Through a large-scale study, we provide a bird's-eye view by analyzing 326 backbone models and how different training paradigms and model architectures affect these quality dimensions. We reveal various new insights, such as: (i) vision-language models exhibit high class balance on ImageNet-1k classification and strong robustness against domain changes; (ii) training models initialized with weights obtained through self-supervised learning is an effective strategy to improve most considered quality dimensions; and (iii) the training dataset size is a major driver for most of the quality dimensions. We conclude our study by introducing the QUBA score (Quality Understanding Beyond Accuracy), a novel metric that ranks models across multiple dimensions of quality, enabling tailored recommendations based on specific user needs.
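
For a sense of how a multi-dimensional quality score can rank models, here is a hypothetical sketch, not the paper's actual QUBA definition: rank-normalize each quality dimension across models, then average with user-chosen weights. The dimension names and weighting scheme are illustrative assumptions.

```python
import numpy as np

def quality_score(metrics: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """metrics: (n_models, n_dims), higher is better; weights: (n_dims,)."""
    ranks = metrics.argsort(axis=0).argsort(axis=0)   # 0 = worst per dimension
    normed = ranks / (metrics.shape[0] - 1)           # map ranks to [0, 1]
    return normed @ (weights / weights.sum())         # weighted average in [0, 1]

# Toy data: three models scored on e.g. accuracy, calibration, robustness.
models = np.array([[0.76, 0.91, 0.40],
                   [0.81, 0.55, 0.62],
                   [0.79, 0.70, 0.71]])
print(quality_score(models, weights=np.array([1.0, 1.0, 2.0])))
```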

NeurIPS Conference 2025 Conference Paper

FaCT: Faithful Concept Traces for Explaining Neural Network Decisions

  • Amin Parchami-Araghi
  • Sukrut Rao
  • Jonas Fischer
  • Bernt Schiele

Deep networks have shown remarkable performance across a wide range of tasks, yet obtaining a global, concept-level understanding of how they function remains a key challenge. Many post-hoc concept-based approaches have been introduced to understand their workings, yet they are not always faithful to the model. Further, they make restrictive assumptions on the concepts a model learns, such as class-specificity, small spatial extent, or alignment to human expectations. In this work, we put emphasis on the faithfulness of such concept-based explanations and propose a new model with model-inherent mechanistic concept explanations. Our concepts are shared across classes and, from any layer, both their contribution to the logit and their input visualization can be faithfully traced. We also leverage foundation models to propose a new concept-consistency metric, C$^2$-Score, that can be used to evaluate concept-based methods. We show that, compared to prior work, our concepts are quantitatively more consistent and users find our concepts to be more interpretable, all while retaining competitive ImageNet performance.

ICLR Conference 2025 Conference Paper

How to Probe: Simple Yet Effective Techniques for Improving Post-hoc Explanations

  • Siddhartha Gairola
  • Moritz Böhle
  • Francesco Locatello
  • Bernt Schiele

Post-hoc importance attribution methods are a popular tool for “explaining” Deep Neural Networks (DNNs) and are inherently based on the assumption that the explanations can be applied independently of how the models were trained. Contrarily, in this work we bring forward empirical evidence that challenges this very notion. Surprisingly, we discover a strong dependency: the training details of a pre-trained model’s classification layer (<10% of model parameters) play a crucial role, much more than the pre-training scheme itself. This is of high practical relevance: (1) as techniques for pre-training models are becoming increasingly diverse, understanding the interplay between these techniques and attribution methods is critical; (2) it sheds light on an important yet overlooked assumption of post-hoc attribution methods which can drastically impact model explanations and how they are eventually interpreted. With this finding, we also present simple yet effective adjustments to the classification layers that can significantly enhance the quality of model explanations. We validate our findings across several visual pre-training frameworks (fully-supervised, self-supervised, contrastive vision-language training) and analyse how they impact explanations for a wide range of attribution methods on a diverse set of evaluation metrics.

NeurIPS Conference 2025 Conference Paper

KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models

  • Yongliang Wu
  • Zonghui Li
  • Xinting Hu
  • Xinyu Ye
  • Xianfang Zeng
  • Gang Yu
  • Wenbo Zhu
  • Bernt Schiele

Recent advances in multi-modal generative models have enabled significant progress in instruction-based image editing. However, while these models produce visually plausible outputs, their capacity for knowledge-based reasoning editing tasks remains under-explored. In this paper, we introduce KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a diagnostic benchmark designed to assess models through a cognitively informed lens. Drawing from educational theory, KRIS-Bench categorizes editing tasks across three foundational knowledge types: Factual, Conceptual, and Procedural. Based on this taxonomy, we design 22 representative tasks spanning 7 reasoning dimensions and release 1,267 high-quality annotated editing instances. To support fine-grained evaluation, we propose a comprehensive protocol that incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints and calibrated through human studies. Empirical results on nine state-of-the-art models reveal significant gaps in reasoning performance, highlighting the need for knowledge-centric benchmarks to advance the development of intelligent image editing systems.

ICML Conference 2025 Conference Paper

Pixel-level Certified Explanations via Randomized Smoothing

  • Alaa Anani
  • Tobias Lorenz 0002
  • Mario Fritz
  • Bernt Schiele

Post-hoc attribution methods aim to explain deep learning predictions by highlighting influential input pixels. However, these explanations are highly non-robust: small, imperceptible input perturbations can drastically alter the attribution map while maintaining the same prediction. This vulnerability undermines their trustworthiness and calls for rigorous robustness guarantees of pixel-level attribution scores. We introduce the first certification framework that guarantees pixel-level robustness for any black-box attribution method using randomized smoothing. By sparsifying and smoothing attribution maps, we reformulate the task as a segmentation problem and certify each pixel’s importance against $\ell_2$-bounded perturbations. We further propose three evaluation metrics to assess certified robustness, localization, and faithfulness. An extensive evaluation of 12 attribution methods across 5 ImageNet models shows that our certified attributions are robust, interpretable, and faithful, enabling reliable use in downstream tasks. Our code is at https://github.com/AlaaAnani/certified-attributions.
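
A minimal sketch of the smoothing idea, not the authors' exact certification procedure: sparsify each noisy attribution map to its top-K pixels and collect per-pixel votes under Gaussian input noise; the actual framework turns such votes into formal $\ell_2$ certificates. The `attribute` function is a stand-in for any black-box attribution method.

```python
import numpy as np

def smoothed_importance(x, attribute, sigma=0.25, n=100, top_frac=0.1, rng=None):
    """Per-pixel frequency of being among the top-K attributed pixels under noise."""
    rng = np.random.default_rng(rng)
    votes = np.zeros(x.shape, dtype=float)
    k = int(top_frac * x.size)
    for _ in range(n):
        a = attribute(x + sigma * rng.standard_normal(x.shape))
        thresh = np.partition(a.ravel(), -k)[-k]   # top-K sparsification
        votes += (a >= thresh)
    return votes / n

# Toy usage: a dummy attribution that just returns the noisy input magnitude.
x = np.random.rand(8, 8)
p = smoothed_importance(x, attribute=lambda z: np.abs(z), n=200, rng=0)
certified = p > 0.5   # majority-vote "important" pixels (no formal bound here)
```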

ICLR Conference 2025 Conference Paper

Samba: Synchronized Set-of-Sequences Modeling for Multiple Object Tracking

  • Mattia Segù
  • Luigi Piccinelli
  • Siyuan Li 0008
  • Yung-Hsu Yang
  • Luc Van Gool
  • Bernt Schiele

Multiple object tracking in complex scenarios - such as coordinated dance performances, team sports, or dynamic animal groups - presents unique challenges. In these settings, objects frequently move in coordinated patterns, occlude each other, and exhibit long-term dependencies in their trajectories. However, how to model long-range dependencies within tracklets, interdependencies among tracklets, and the associated temporal occlusions remains a key open research question. To this end, we introduce Samba, a novel linear-time set-of-sequences model designed to jointly process multiple tracklets by synchronizing the multiple selective state-spaces used to model each tracklet. Samba autoregressively predicts the future track query for each sequence while maintaining synchronized long-term memory representations across tracklets. By integrating Samba into a tracking-by-propagation framework, we propose SambaMOTR, the first tracker effectively addressing the aforementioned issues, including long-range dependencies, tracklet interdependencies, and temporal occlusions. Additionally, we introduce an effective technique for dealing with uncertain observations (MaskObs) and an efficient training recipe to scale SambaMOTR to longer sequences. By modeling long-range dependencies and interactions among tracked objects, SambaMOTR implicitly learns to track objects accurately through occlusions without any hand-crafted heuristics. Our approach significantly surpasses prior state-of-the-art on the DanceTrack, BFT, and SportsMOT datasets.

NeurIPS Conference 2025 Conference Paper

Solving Inverse Problems with FLAIR

  • Julius Erbach
  • Dominik Narnhofer
  • Andreas Dombos
  • Bernt Schiele
  • Jan Eric Lenssen
  • Konrad Schindler

Flow-based latent generative models such as Stable Diffusion 3 are able to generate images with remarkable quality, even enabling photorealistic text-to-image generation. Their impressive performance suggests that these models should also constitute powerful priors for inverse imaging problems, but that approach has not yet led to comparable fidelity. There are several key obstacles: (i) the data likelihood term is usually intractable; (ii) learned generative models cannot be directly conditioned on the distorted observations, leading to conflicting objectives between data likelihood and prior; and (iii) the reconstructions can deviate from the observed data. We present FLAIR, a novel, training-free variational framework that leverages flow-based generative models as a prior for inverse problems. To that end, we introduce a variational objective for flow matching that is agnostic to the type of degradation, and combine it with deterministic trajectory adjustments to guide the prior towards regions which are more likely under the posterior. To enforce exact consistency with the observed data, we decouple the optimization of the data fidelity and regularization terms. Moreover, we introduce a time-dependent calibration scheme in which the strength of the regularization is modulated according to off-line accuracy estimates. Results on standard imaging benchmarks demonstrate that FLAIR consistently outperforms existing diffusion- and flow-based methods in terms of reconstruction quality and sample diversity. Source code is available at https://inverseflair.github.io/.

ICML Conference 2025 Conference Paper

Spatial Reasoning with Denoising Models

  • Christopher Wewer
  • Bartlomiej Pogodzinski
  • Bernt Schiele
  • Jan Eric Lenssen

We introduce Spatial Reasoning Models (SRMs), a framework to perform reasoning over sets of continuous variables via denoising generative models. SRMs infer continuous representations of a set of unobserved variables, given observations of the observed variables. Current generative models on spatial domains, such as diffusion and flow matching models, often collapse to hallucination in the case of complex distributions. To measure this, we introduce a set of benchmark tasks that test the quality of complex reasoning in generative models and can quantify hallucination. The SRM framework allows us to report key findings about the importance of sequentialization in generation, the associated generation order, and the sampling strategies during training. It demonstrates, for the first time, that the order of generation can successfully be predicted by the denoising network itself. Using these findings, we can increase the accuracy of specific reasoning tasks from <1% to >50%. Our project website provides additional videos, code, and the benchmark datasets.

ICLR Conference 2025 Conference Paper

TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

  • Haiyang Wang
  • Yue Fan
  • Muhammad Ferjad Naeem
  • Yongqin Xian
  • Jan Eric Lenssen
  • Liwei Wang 0001
  • Federico Tombari
  • Bernt Schiele

Transformers have become the predominant architecture in foundation models due to their excellent performance across various domains. However, the substantial cost of scaling these models remains a significant concern. This problem arises primarily from their dependence on a fixed number of parameters within linear projections. When architectural modifications (e.g., channel dimensions) are introduced, the entire model typically requires retraining from scratch. As model sizes continue growing, this strategy results in increasingly high computational costs and becomes unsustainable. To overcome this problem, we introduce Tokenformer, a natively scalable architecture that leverages the attention mechanism not only for computations among input tokens but also for interactions between tokens and model parameters, thereby enhancing architectural flexibility. By treating model parameters as tokens, we replace all the linear projections in Transformers with our token-parameter attention layer, where input tokens act as queries and model parameters as keys and values. This reformulation allows for progressive and efficient scaling without necessitating retraining from scratch. Our model scales from 124M to 1.4B parameters by incrementally adding new key-value parameter pairs, achieving performance comparable to Transformers trained from scratch while greatly reducing training costs. Code and models are available at https://github.com/Haiyang-W/TokenFormer.git.
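
As a rough illustration of the token-parameter attention described above, here is a minimal sketch under stated assumptions (the paper's exact nonlinearity, normalization, and initialization of new parameter rows differ): input tokens act as queries over learnable key/value parameter tokens, and scaling up means appending new rows rather than retraining.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenParamAttention(nn.Module):
    """Input tokens attend over learnable key/value parameter tokens."""
    def __init__(self, dim: int, num_param_tokens: int):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_param_tokens, dim) * dim ** -0.5)
        self.values = nn.Parameter(torch.randn(num_param_tokens, dim) * dim ** -0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        attn = F.softmax(x @ self.keys.t() / x.shape[-1] ** 0.5, dim=-1)
        return attn @ self.values

    @torch.no_grad()
    def grow(self, extra: int):
        """Append key/value parameter pairs; existing parameters are kept as-is."""
        d = self.keys.shape[1]
        self.keys = nn.Parameter(torch.cat([self.keys, torch.zeros(extra, d)]))
        self.values = nn.Parameter(torch.cat([self.values, torch.zeros(extra, d)]))

layer = TokenParamAttention(dim=64, num_param_tokens=128)
out = layer(torch.randn(2, 10, 64))  # -> (2, 10, 64)
layer.grow(32)                       # scale up without touching old parameters
```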

ICML Conference 2024 Conference Paper

Adaptive Hierarchical Certification for Segmentation using Randomized Smoothing

  • Alaa Anani
  • Tobias Lorenz 0002
  • Bernt Schiele
  • Mario Fritz

Certification for machine learning means proving that no adversarial sample can evade a model within a given range under certain conditions, a necessity for safety-critical domains. Common certification methods for segmentation use a flat set of fine-grained classes, leading to high abstain rates due to model uncertainty across many classes. We propose a novel, more practical setting, which certifies pixels within a multi-level hierarchy and adaptively relaxes the certification to a coarser level for unstable components that classic methods would abstain from, effectively lowering the abstain rate whilst providing more certified, semantically meaningful information. We mathematically formulate the problem setup, introduce an adaptive hierarchical certification algorithm, and prove the correctness of its guarantees. Since certified accuracy does not take the loss of information into account for coarser classes, we introduce the Certified Information Gain ($\mathrm{CIG}$) metric, which is proportional to the class granularity level. Our extensive experiments on the datasets Cityscapes, PASCAL-Context, ACDC and COCO-Stuff demonstrate that our adaptive algorithm achieves a higher $\mathrm{CIG}$ and lower abstain rate compared to the current state-of-the-art certification method. Our code can be found here: https://github.com/AlaaAnani/adaptive-certify.

NeurIPS Conference 2024 Conference Paper

B-cosification: Transforming Deep Neural Networks to be Inherently Interpretable

  • Shreyash Arya
  • Sukrut Rao
  • Moritz Böhle
  • Bernt Schiele

B-cos Networks have been shown to be effective for obtaining highly human-interpretable explanations of model decisions by architecturally enforcing stronger alignment between inputs and weights. B-cos variants of convolutional networks (CNNs) and vision transformers (ViTs), which primarily replace linear layers with B-cos transformations, perform competitively to their respective standard variants while also yielding explanations that are faithful by design. However, it has so far been necessary to train these models from scratch, which is increasingly infeasible in the era of large, pre-trained foundation models. In this work, inspired by the architectural similarities between standard DNNs and B-cos networks, we propose ‘B-cosification’, a novel approach to transform existing pre-trained models to become inherently interpretable. We perform a thorough study of design choices to perform this conversion, both for convolutional neural networks and vision transformers. We find that B-cosification can yield models that are on par with B-cos models trained from scratch in terms of interpretability, while often outperforming them in terms of classification performance at a fraction of the training cost. Subsequently, we apply B-cosification to a pre-trained CLIP model, and show that, even with limited data and compute cost, we obtain a B-cosified version that is highly interpretable and competitive on zero-shot performance across a variety of datasets. We release our code and pre-trained model weights at https://github.com/shrebox/B-cosification.
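
For intuition, here is a minimal sketch of a single B-cos linear unit in the spirit of this line of work: the usual dot product is scaled by |cos(x, w)|^(B-1), which architecturally rewards weight vectors that align with their inputs. B-cosification itself additionally converts and fine-tunes the weights of an existing pre-trained model, which this sketch does not attempt.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BcosLinear(nn.Module):
    """Simplified B-cos unit: out = |cos(x, w)|^(B-1) * (w . x), unit-norm rows w."""
    def __init__(self, in_features: int, out_features: int, b: float = 2.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.b = b

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = F.normalize(self.weight, dim=1)        # unit-norm weight rows
        lin = F.linear(x, w)                       # w . x
        cos = lin / (x.norm(dim=-1, keepdim=True) + 1e-6)
        return lin * cos.abs().pow(self.b - 1.0)

x = torch.randn(4, 16)
y = BcosLinear(16, 8)(x)                           # -> (4, 8)
```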

ICLR Conference 2024 Conference Paper

On Adversarial Training without Perturbing all Examples

  • Max Maria Losch
  • Mohamed Omran
  • David Stutz
  • Mario Fritz
  • Bernt Schiele

Adversarial training is the de-facto standard for improving robustness against adversarial examples. This usually involves a multi-step adversarial attack applied on each example during training. In this paper, we explore only constructing adversarial examples (AEs) on a subset of the training examples. That is, we split the training set in two subsets $A$ and $B$, train models on both ($A\cup B$) but construct AEs only for examples in $A$. Starting with $A$ containing only a single class, we systematically increase the size of $A$ and consider splitting by class and by examples. We observe that: (i) adversarial robustness transfers by difficulty and to classes in $B$ that have never been adversarially attacked during training; (ii) there is a tendency for hard examples to provide better robustness transfer than easy examples, yet we find this tendency to diminish with increasing complexity of datasets; (iii) generating AEs on only 50% of training data is sufficient to recover most of the baseline AT performance, even on ImageNet. We observe similar transfer properties across tasks, where generating AEs on only 30% of data can recover baseline robustness on the target task. We evaluate our subset analysis on a wide variety of image datasets like CIFAR-10, CIFAR-100, ImageNet-200 and show transfer to SVHN, Oxford-Flowers-102 and Caltech-256. In contrast to conventional practice, our experiments indicate that the utility of computing AEs varies by class and examples and that weighting examples from $A$ higher than $B$ provides high transfer performance. Code is available at http://github.com/mlosch/SAT.

NeurIPS Conference 2024 Conference Paper

Scribbles for All: Benchmarking Scribble Supervised Segmentation Across Datasets

  • Wolfgang Boettcher
  • Lukas Hoyer
  • Ozan Unal
  • Jan Eric Lenssen
  • Bernt Schiele

In this work, we introduce Scribbles for All, a label and training data generation algorithm for semantic segmentation trained on scribble labels. Training or fine-tuning semantic segmentation models with weak supervision has become an important topic recently and was subject to significant advances in model quality. In this setting, scribbles are a promising label type for achieving high-quality segmentation results while requiring much lower annotation effort than the usual pixel-wise dense semantic segmentation annotations. The main limitation of scribbles as a source of weak supervision is the lack of challenging datasets for scribble segmentation, which hinders the development of novel methods and conclusive evaluations. To overcome this limitation, Scribbles for All provides scribble labels for several popular segmentation datasets and provides an algorithm to automatically generate scribble labels for any dataset with dense annotations, paving the way for new insights and model advancements in the field of weakly supervised segmentation. In addition to providing the datasets and algorithm, we evaluate state-of-the-art segmentation models on our datasets and show that models trained with our synthetic labels perform competitively with respect to models trained on manual labels. Thus, our datasets enable state-of-the-art research into methods for scribble-labeled semantic segmentation. The datasets, scribble generation algorithm, and baselines are publicly available at https://github.com/wbkit/Scribbles4All.

ICLR Conference 2023 Conference Paper

FreeMatch: Self-adaptive Thresholding for Semi-supervised Learning

  • Yidong Wang 0003
  • Hao Chen 0102
  • Qiang Heng
  • Wenxin Hou
  • Yue Fan
  • Zhen Wu 0002
  • Jindong Wang 0001
  • Marios Savvides

Semi-supervised Learning (SSL) has witnessed great success owing to the impressive performances brought by various methods based on pseudo labeling and consistency regularization. However, we argue that existing methods might fail to utilize the unlabeled data effectively, since they either use a pre-defined/fixed threshold or an ad-hoc threshold-adjusting scheme, resulting in inferior performance and slow convergence. We first analyze a motivating example to obtain intuitions on the relationship between the desirable threshold and the model's learning status. Based on the analysis, we propose FreeMatch to adjust the confidence threshold in a self-adaptive manner according to the model's learning status. We further introduce a self-adaptive class fairness regularization penalty to encourage diverse predictions during the early training stage. Extensive experiments indicate the superiority of FreeMatch, especially when labeled data are extremely rare. FreeMatch achieves 5.78%, 13.59%, and 1.28% error rate reduction over the latest state-of-the-art method FlexMatch on CIFAR-10 with 1 label per class, STL-10 with 4 labels per class, and ImageNet with 100 labels per class, respectively. Moreover, FreeMatch can also boost the performance of imbalanced SSL. The code can be found at https://github.com/microsoft/Semi-supervised-learning.
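
A minimal sketch of the self-adaptive thresholding idea as described in the abstract, with FreeMatch's per-class modulation and fairness regularization omitted: the global threshold is an EMA of the model's own confidence on unlabeled data, so it starts low and rises as the model improves.

```python
import torch

class SelfAdaptiveThreshold:
    """EMA of mean max-probability on unlabeled data as confidence threshold."""
    def __init__(self, num_classes: int, momentum: float = 0.999):
        self.tau = 1.0 / num_classes      # start at uniform confidence
        self.m = momentum

    def update(self, probs_unlabeled: torch.Tensor):
        conf, pseudo = probs_unlabeled.max(dim=-1)
        self.tau = self.m * self.tau + (1 - self.m) * conf.mean().item()
        mask = conf >= self.tau           # which pseudo-labels to train on
        return pseudo, mask

thr = SelfAdaptiveThreshold(num_classes=10)
probs = torch.softmax(torch.randn(64, 10), dim=-1)
pseudo, mask = thr.update(probs)
```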

AAAI Conference 2023 Conference Paper

Online Hyperparameter Optimization for Class-Incremental Learning

  • Yaoyao Liu
  • Yingying Li
  • Bernt Schiele
  • Qianru Sun

Class-incremental learning (CIL) aims to train a classification model while the number of classes increases phase-by-phase. An inherent challenge of CIL is the stability-plasticity tradeoff, i.e., CIL models should remain stable to retain old knowledge and plastic to absorb new knowledge. However, none of the existing CIL models can achieve the optimal tradeoff in different data-receiving settings, where typically the training-from-half (TFH) setting needs more stability, but the training-from-scratch (TFS) setting needs more plasticity. To this end, we design an online learning method that can adaptively optimize the tradeoff without knowing the setting a priori. Specifically, we first introduce the key hyperparameters that influence the tradeoff, e.g., knowledge distillation (KD) loss weights, learning rates, and classifier types. Then, we formulate the hyperparameter optimization process as an online Markov Decision Process (MDP) problem and propose a specific algorithm to solve it. We apply local estimated rewards and the classic bandit algorithm Exp3 to address the issues that arise when applying online MDP methods to the CIL protocol. Our method consistently improves top-performing CIL methods in both TFH and TFS settings, e.g., boosting the average accuracy of TFH and TFS by 2.2 percentage points on ImageNet-Full, compared to the state-of-the-art. Code is provided at https://class-il.mpi-inf.mpg.de/online/.
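
The combination of local rewards with Exp3 lends itself to a compact sketch. Below is the classic Exp3 bandit (Auer et al.), not the paper's full online-MDP formulation, applied to choosing among candidate hyperparameter configurations phase by phase; the candidate set and the reward signal here are illustrative placeholders.

```python
import numpy as np

class Exp3:
    """Classic Exp3 bandit; rewards assumed in [0, 1]."""
    def __init__(self, n_arms: int, gamma: float = 0.1, rng=None):
        self.w = np.ones(n_arms)
        self.gamma = gamma
        self.rng = np.random.default_rng(rng)

    def probs(self) -> np.ndarray:
        p = self.w / self.w.sum()
        return (1 - self.gamma) * p + self.gamma / len(self.w)

    def select(self) -> int:
        return int(self.rng.choice(len(self.w), p=self.probs()))

    def update(self, arm: int, reward: float):
        p = self.probs()[arm]
        # Importance-weighted reward estimate for the pulled arm only.
        self.w[arm] *= np.exp(self.gamma * reward / (p * len(self.w)))

configs = [0.1, 0.5, 1.0]             # e.g. candidate KD-loss weights
bandit = Exp3(len(configs), rng=0)
for _ in range(5):                    # one pull per incremental phase
    arm = bandit.select()
    reward = np.random.rand()         # placeholder for phase validation accuracy
    bandit.update(arm, reward)
```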

ICLR Conference 2023 Conference Paper

SoftMatch: Addressing the Quantity-Quality Tradeoff in Semi-supervised Learning

  • Hao Chen 0102
  • Ran Tao 0013
  • Yue Fan
  • Yidong Wang 0003
  • Jindong Wang 0001
  • Bernt Schiele
  • Xing Xie 0001
  • Bhiksha Raj

The critical challenge of Semi-Supervised Learning (SSL) is how to effectively leverage the limited labeled data and massive unlabeled data to improve the model's generalization performance. In this paper, we first revisit the popular pseudo-labeling methods via a unified sample weighting formulation and demonstrate the inherent quantity-quality trade-off problem of pseudo-labeling with thresholding, which may prohibit learning. To this end, we propose SoftMatch to overcome the trade-off by maintaining both high quantity and high quality of pseudo-labels during training, effectively exploiting the unlabeled data. We derive a truncated Gaussian function to weight samples based on their confidence, which can be viewed as a soft version of the confidence threshold. We further enhance the utilization of weakly-learned classes by proposing a uniform alignment approach. In experiments, SoftMatch shows substantial improvements across a wide variety of benchmarks, including image, text, and imbalanced classification.
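
The truncated Gaussian weighting can be sketched compactly. The version below is a simplification of the abstract's description: confidences above the estimated mean receive full weight, and lower confidences are down-weighted smoothly rather than discarded; the paper estimates the mean and variance with an EMA, which is elided here.

```python
import torch

def truncated_gaussian_weight(conf: torch.Tensor, mu: float, sigma: float) -> torch.Tensor:
    """Soft sample weights: 1 above the mean, Gaussian decay below it."""
    w = torch.exp(-((conf - mu) ** 2) / (2 * sigma ** 2))
    return torch.where(conf >= mu, torch.ones_like(conf), w)

probs = torch.softmax(torch.randn(8, 10), dim=-1)
conf = probs.max(dim=-1).values
weights = truncated_gaussian_weight(conf, mu=conf.mean().item(),
                                    sigma=conf.std().item())
```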

ICLR Conference 2023 Conference Paper

Temperature Schedules for self-supervised contrastive methods on long-tail data

  • Anna Kukleva
  • Moritz Böhle
  • Bernt Schiele
  • Hildegard Kuehne
  • Christian Rupprecht 0001

Most approaches for self-supervised learning (SSL) are optimised on curated balanced datasets, e.g. ImageNet, despite the fact that natural data usually exhibits long-tail distributions. In this paper, we analyse the behaviour of one of the most popular variants of SSL, i.e. contrastive methods, on imbalanced data. In particular, we investigate the role of the temperature parameter $\tau$ in the contrastive loss, by analysing the loss through the lens of average distance maximisation, and find that a large $\tau$ emphasises group-wise discrimination, whereas a small $\tau$ leads to a higher degree of instance discrimination. While $\tau$ has thus far been treated exclusively as a constant hyperparameter, in this work, we propose to employ a dynamic $\tau$ and show that a simple cosine schedule can yield significant improvements in the learnt representations. Such a schedule results in a constant 'task switching' between an emphasis on instance discrimination and group-wise discrimination and thereby ensures that the model learns both group-wise features, as well as instance-specific details. Since frequent classes benefit from the former, while infrequent classes require the latter, we find this method to consistently improve separation between the classes in long-tail data without any additional computational cost.
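
A minimal sketch of such a dynamic temperature, assuming a simple cosine form; the bounds and period below are illustrative constants, not values from the paper.

```python
import math

def tau_cosine(step: int, period: int = 1000,
               tau_min: float = 0.1, tau_max: float = 1.0) -> float:
    """Cosine schedule: tau oscillates between tau_max (group-wise emphasis)
    and tau_min (instance-wise emphasis) with the given period."""
    return tau_min + 0.5 * (tau_max - tau_min) * (1 + math.cos(2 * math.pi * step / period))

# tau starts at tau_max, dips to tau_min mid-period, and returns.
taus = [tau_cosine(s) for s in (0, 250, 500, 750, 1000)]
```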

ICRA Conference 2023 Conference Paper

Test-time Domain Adaptation for Monocular Depth Estimation

  • Zhi Li 0055
  • Shaoshuai Shi
  • Bernt Schiele
  • Dengxin Dai

Test-time domain adaptation, i.e., adapting source-pretrained models to the test data on-the-fly in a source-free, unsupervised manner, is a highly practical yet very challenging task. Due to the domain gap between source and target data, inference quality on the target domain can drop drastically, especially in terms of the absolute scale of depth. In addition, unsupervised adaptation can degrade the model performance due to inaccurate pseudo labels. Furthermore, the model can suffer from catastrophic forgetting when errors are accumulated over time. We propose a test-time domain adaptation framework for monocular depth estimation which achieves both stability and adaptation performance by benefiting from both self-training of the supervised branch and pseudo labels from the self-supervised branch, and is able to tackle the above problems: our scale alignment scheme aligns the input features between source and target data, correcting the absolute scale inference on the target domain; with a pseudo label consistency check, we select confident pixels, thus improving pseudo label quality; regularisation and self-training schemes are applied to help avoid catastrophic forgetting. Without requiring further supervision on the target domain, our method adapts the source-trained models to the test data with significant improvements over the direct inference results, providing scale-aware depth map outputs that outperform the state of the art. Code is available at https://github.com/Malefikus/ada-depth.

ICLR Conference 2023 Conference Paper

Towards Robust Object Detection Invariant to Real-World Domain Shifts

  • Qi Fan
  • Mattia Segù
  • Yu-Wing Tai
  • Fisher Yu 0001
  • Chi-Keung Tang
  • Bernt Schiele
  • Dengxin Dai

Safety-critical applications such as autonomous driving require robust object detection invariant to real-world domain shifts. Such shifts can be regarded as different domain styles, which can vary substantially due to environment changes and sensor noise, yet deep models only know the training domain style. Such a domain style gap impedes object detection generalization on diverse real-world domains. Existing classification domain generalization (DG) methods cannot effectively solve the robust object detection problem, because they either rely on multiple source domains with large style variance or destroy the content structures of the original images. In this paper, we analyze and investigate effective solutions to overcome domain style overfitting for robust object detection without the above shortcomings. Our method, dubbed Normalization Perturbation (NP), perturbs the channel statistics of source domain low-level features to synthesize various latent styles, so that the trained deep model can perceive diverse potential domains and generalize well even without observations of target domain data in training. This approach is motivated by the observation that feature channel statistics of the target domain images deviate around the source domain statistics. We further explore the style-sensitive channels for effective style synthesis. Normalization Perturbation relies only on a single source domain and is surprisingly simple and effective, contributing a practical solution by effectively adapting or generalizing classification DG methods to robust object detection. Extensive experiments demonstrate the effectiveness of our method for generalizing object detectors under real-world domain shifts.

NeurIPS Conference 2022 Conference Paper

Assaying Out-Of-Distribution Generalization in Transfer Learning

  • Florian Wenzel
  • Andrea Dittadi
  • Peter Gehler
  • Carl-Johann Simon-Gabriel
  • Max Horn
  • Dominik Zietlow
  • David Kernert
  • Chris Russell

Since out-of-distribution generalization is a generally ill-posed problem, various proxy targets (e.g., calibration, adversarial robustness, algorithmic corruptions, invariance across shifts) were studied across different research programs resulting in different recommendations. While sharing the same aspirational goal, these approaches have never been tested under the same experimental conditions on real data. In this paper, we take a unified view of previous work, highlighting message discrepancies that we address empirically, and providing recommendations on how to measure the robustness of a model and how to improve it. To this end, we collect 172 publicly available dataset pairs for training and out-of-distribution evaluation of accuracy, calibration error, adversarial attacks, environment invariance, and synthetic corruptions. We fine-tune over 31k networks, from nine different architectures in the many- and few-shot setting. Our findings confirm that in- and out-of-distribution accuracies tend to increase jointly, but show that their relation is largely dataset-dependent, and in general more nuanced and more complex than posited by previous, smaller scale studies.

AAAI Conference 2022 Conference Paper

Keypoint Message Passing for Video-Based Person Re-identification

  • Di Chen
  • Andreas Doering
  • Shanshan Zhang
  • Jian Yang
  • Juergen Gall
  • Bernt Schiele

Video-based person re-identification (re-ID) is an important technique in visual surveillance systems that aims to match video snippets of people captured by different cameras. Existing methods are mostly based on convolutional neural networks (CNNs), whose building blocks either process local neighbor pixels at a time, or, when 3D convolutions are used to model temporal information, suffer from the misalignment problem caused by person movement. In this paper, we propose to overcome the limitations of normal convolutions with a human-oriented graph method. Specifically, features located at person joint keypoints are extracted and connected as a spatial-temporal graph. These keypoint features are then updated by message passing from their connected nodes with a graph convolutional network (GCN). During training, the GCN can be attached to any CNN-based person re-ID model to assist representation learning on feature maps, whilst it can be dropped after training for better inference speed. Our method brings significant improvements over the CNN-based baseline model on the MARS dataset with generated person keypoints and a newly annotated dataset: PoseTrackReID. It also sets a new state of the art in terms of top-1 accuracy and mean average precision in comparison to prior works.

NeurIPS Conference 2022 Conference Paper

Motion Transformer with Global Intention Localization and Local Movement Refinement

  • Shaoshuai Shi
  • Li Jiang
  • Dengxin Dai
  • Bernt Schiele

Predicting the multimodal future behavior of traffic participants is essential for robotic vehicles to make safe decisions. Existing works either directly predict future trajectories based on latent features or utilize dense goal candidates to identify the agent's destinations; the former strategy converges slowly, since all motion modes are derived from the same feature, while the latter has efficiency issues, since its performance relies heavily on the density of goal candidates. In this paper, we propose the Motion TRansformer (MTR) framework that models motion prediction as the joint optimization of global intention localization and local movement refinement. Instead of using goal candidates, MTR incorporates spatial intention priors by adopting a small set of learnable motion query pairs. Each motion query pair takes charge of trajectory prediction and refinement for a specific motion mode, which stabilizes the training process and facilitates better multimodal predictions. Experiments show that MTR achieves state-of-the-art performance on both the marginal and joint motion prediction challenges, ranking 1st on the leaderboards of the Waymo Open Motion Dataset. Code will be available at https://github.com/sshaoshuai/MTR.

NeurIPS Conference 2022 Conference Paper

USB: A Unified Semi-supervised Learning Benchmark for Classification

  • Yidong Wang
  • Hao Chen
  • Yue Fan
  • Wang Sun
  • Ran Tao
  • Wenxin Hou
  • Renjie Wang
  • Linyi Yang

Semi-supervised learning (SSL) improves model generalization by leveraging massive unlabeled data to augment limited labeled samples. However, popular SSL evaluation protocols are currently often constrained to computer vision (CV) tasks. In addition, previous work typically trains deep neural networks from scratch, which is time-consuming and environmentally unfriendly. To address the above issues, we construct a Unified SSL Benchmark (USB) for classification by selecting 15 diverse, challenging, and comprehensive tasks from CV, natural language processing (NLP), and audio processing (Audio), on which we systematically evaluate the dominant SSL methods, and also open-source a modular and extensible codebase for fair evaluation of these SSL methods. We further provide the pre-trained versions of the state-of-the-art neural models for CV tasks to make the cost affordable for further tuning. USB enables the evaluation of a single SSL algorithm on more tasks from multiple domains but with less cost. Specifically, on a single NVIDIA V100, only 39 GPU days are required to evaluate FixMatch on 15 tasks in USB while 335 GPU days (279 GPU days on 4 CV datasets except for ImageNet) are needed on 5 CV tasks with TorchSSL.

NeurIPS Conference 2021 Conference Paper

RMM: Reinforced Memory Management for Class-Incremental Learning

  • Yaoyao Liu
  • Bernt Schiele
  • Qianru Sun

Class-Incremental Learning (CIL) [38] trains classifiers under a strict memory budget: in each incremental phase, learning is done for new data, most of which is abandoned to free space for the next phase. The preserved data are exemplars used for replaying. However, existing methods use a static and ad hoc strategy for memory allocation, which is often sub-optimal. In this work, we propose a dynamic memory management strategy that is optimized for the incremental phases and different object classes. We call our method reinforced memory management (RMM), leveraging reinforcement learning. RMM training is not naturally compatible with CIL, as the past and future data are strictly non-accessible during the incremental phases. We solve this by training the policy function of RMM on pseudo CIL tasks, e.g., the tasks built on the data of the zeroth phase, and then applying it to target tasks. RMM propagates two levels of actions: Level-1 determines how to split the memory between old and new classes, and Level-2 allocates memory for each specific class. In essence, it is an optimizable and general method for memory management that can be used in any replaying-based CIL method. For evaluation, we plug RMM into two top-performing baselines (LUCIR+AANets and POD+AANets [28]) and conduct experiments on three benchmarks (CIFAR-100, ImageNet-Subset, and ImageNet-Full). Our results show clear improvements, e.g., boosting POD+AANets by 3.6%, 4.4%, and 1.9% in the 25-Phase settings of the above benchmarks, respectively. The code is available at https://class-il.mpi-inf.mpg.de/rmm/.

ICLR Conference 2021 Conference Paper

You Only Need Adversarial Supervision for Semantic Image Synthesis

  • Edgar Schönfeld
  • Vadim Sushko
  • Dan Zhang 0003
  • Juergen Gall
  • Bernt Schiele
  • Anna Khoreva

Despite their recent successes, GAN models for semantic image synthesis still suffer from poor image quality when trained with only adversarial supervision. Historically, additionally employing the VGG-based perceptual loss has helped to overcome this issue, significantly improving the synthesis quality, but at the same time limiting the progress of GAN models for semantic image synthesis. In this work, we propose a novel, simplified GAN model, which needs only adversarial supervision to achieve high quality results. We re-design the discriminator as a semantic segmentation network, directly using the given semantic label maps as the ground truth for training. By providing stronger supervision to the discriminator as well as to the generator through spatially- and semantically-aware discriminator feedback, we are able to synthesize images of higher fidelity with better alignment to their input label maps, making the use of the perceptual loss superfluous. Moreover, we enable high-quality multi-modal image synthesis through global and local sampling of a 3D noise tensor injected into the generator, which allows complete or partial image change. We show that images synthesized by our model are more diverse and follow the color and texture distributions of real images more closely. We achieve an average improvement of $6$ FID and $5$ mIoU points over the state of the art across different datasets using only adversarial supervision.

NeurIPS Conference 2020 Conference Paper

Attribute Prototype Network for Zero-Shot Learning

  • Wenjia Xu
  • Yongqin Xian
  • Jiuniu Wang
  • Bernt Schiele
  • Zeynep Akata

From the beginning of zero-shot learning research, visual attributes have been shown to play an important role. In order to better transfer attribute-based knowledge from known to unknown classes, we argue that an image representation with integrated attribute localization ability would be beneficial for zero-shot learning. To this end, we propose a novel zero-shot representation learning framework that jointly learns discriminative global and local features using only class-level attributes. While a visual-semantic embedding layer learns global features, local features are learned through an attribute prototype network that simultaneously regresses and decorrelates attributes from intermediate features. We show that our locality augmented image representations achieve a new state-of-the-art on three zero-shot learning benchmarks. As an additional benefit, our model points to the visual evidence of the attributes in an image, e.g., for the CUB dataset, confirming the improved attribute localization ability of our image representation.

ICML Conference 2020 Conference Paper

Confidence-Calibrated Adversarial Training: Generalizing to Unseen Attacks

  • David Stutz
  • Matthias Hein 0001
  • Bernt Schiele

Adversarial training yields robust models against a specific threat model, e.g., $L_\infty$ adversarial examples. Typically, robustness does not generalize to previously unseen threat models, e.g., other $L_p$ norms or larger perturbations. Our confidence-calibrated adversarial training (CCAT) tackles this problem by biasing the model towards low-confidence predictions on adversarial examples. By allowing examples with low confidence to be rejected, robustness generalizes beyond the threat model employed during training. CCAT, trained only on $L_\infty$ adversarial examples, increases robustness against larger $L_\infty$, $L_2$, $L_1$ and $L_0$ attacks, adversarial frames, distal adversarial examples and corrupted examples, and yields better clean accuracy compared to adversarial training. For a thorough evaluation, we developed novel white- and black-box attacks directly attacking CCAT by maximizing confidence. For each threat model, we use $7$ attacks with up to $50$ restarts and $5000$ iterations and report the worst-case robust test error, extended to our confidence-thresholded setting, across all attacks.

NeurIPS Conference 2020 Conference Paper

Deep Wiener Deconvolution: Wiener Meets Deep Learning for Image Deblurring

  • Jiangxin Dong
  • Stefan Roth
  • Bernt Schiele

We present a simple and effective approach for non-blind image deblurring, combining classical techniques and deep learning. In contrast to existing methods that deblur the image directly in the standard image space, we propose to perform an explicit deconvolution process in a feature space by integrating a classical Wiener deconvolution framework with learned deep features. A multi-scale feature refinement module then predicts the deblurred image from the deconvolved deep features, progressively recovering detail and small-scale structures. The proposed model is trained in an end-to-end manner and evaluated on scenarios with both simulated and real-world image blur. Our extensive experimental results show that the proposed deep Wiener deconvolution network facilitates deblurred results with visibly fewer artifacts. Moreover, our approach quantitatively outperforms state-of-the-art non-blind image deblurring methods by a wide margin.

AAAI Conference 2020 Conference Paper

Hierarchical Online Instance Matching for Person Search

  • Di Chen
  • Shanshan Zhang
  • Wanli Ouyang
  • Jian Yang
  • Bernt Schiele

Person search is a challenging task which requires retrieving a person’s image and the corresponding position from an image dataset. It consists of two sub-tasks: pedestrian detection and person re-identification (re-ID). One of the key challenges is to properly combine the two sub-tasks into a unified framework. Existing works usually adopt a straightforward strategy by concatenating a detector and a re-ID model directly, either into an integrated model or into separate models. We argue that simply concatenating detection and re-ID is a sub-optimal solution, and we propose a Hierarchical Online Instance Matching (HOIM) loss which exploits the hierarchical relationship between detection and re-ID to guide the learning of our network. Our novel HOIM loss function harmonizes the objectives of the two sub-tasks and encourages better feature learning. In addition, we improve the loss update policy by introducing Selective Memory Refreshment (SMR) for unlabeled persons, which takes advantage of the potential discrimination power of unlabeled data. From the experiments on two standard person search benchmarks, i.e., CUHK-SYSU and PRW, we achieve state-of-the-art performance, which justifies the effectiveness of our proposed HOIM loss on learning robust features.

ICLR Conference 2020 Conference Paper

Prediction Poisoning: Towards Defenses Against DNN Model Stealing Attacks

  • Tribhuvanesh Orekondy
  • Bernt Schiele
  • Mario Fritz

High-performance Deep Neural Networks (DNNs) are increasingly deployed in many real-world applications, e.g., cloud prediction APIs. Recent advances in model functionality stealing attacks via black-box access (i.e., inputs in, predictions out) threaten the business model of such applications, which require a lot of time, money, and effort to develop. Existing defenses take a passive role against stealing attacks, such as by truncating predicted information. We find such passive defenses ineffective against DNN stealing attacks. In this paper, we propose the first defense which actively perturbs predictions targeted at poisoning the training objective of the attacker. We find our defense effective across a wide range of challenging datasets and DNN model stealing attacks, and additionally outperforming existing defenses. Our defense is the first that can withstand highly accurate model stealing attacks for tens of thousands of queries, amplifying the attacker's error rate up to a factor of 85$\times$ with minimal impact on the utility for benign users.

NeurIPS Conference 2019 Conference Paper

Learning to Self-Train for Semi-Supervised Few-Shot Classification

  • Xinzhe Li
  • Qianru Sun
  • Yaoyao Liu
  • Qin Zhou
  • Shibao Zheng
  • Tat-Seng Chua
  • Bernt Schiele

Few-shot classification (FSC) is challenging due to the scarcity of labeled training data (e.g., only one labeled data point per class). Meta-learning has been shown to achieve promising results by learning to initialize a classification model for FSC. In this paper we propose a novel semi-supervised meta-learning method called learning to self-train (LST) that leverages unlabeled data and specifically meta-learns how to cherry-pick and label such unlabeled data to further improve performance. To this end, we train the LST model through a large number of semi-supervised few-shot tasks. On each task, we train a few-shot model to predict pseudo labels for unlabeled data, and then iterate the self-training steps on labeled and pseudo-labeled data with each step followed by fine-tuning. We additionally learn a soft weighting network (SWN) to optimize the self-training weights of pseudo labels so that better ones can contribute more to gradient descent optimization. We evaluate our LST method on two ImageNet benchmarks for semi-supervised few-shot classification and achieve large improvements over the state-of-the-art.

NeurIPS Conference 2018 Conference Paper

Adversarial Scene Editing: Automatic Object Removal from Weak Supervision

  • Rakshith Shetty
  • Mario Fritz
  • Bernt Schiele

While great progress has been made recently in automatic image manipulation, it has been limited to object centric images like faces or structured scene datasets. In this work, we take a step towards general scene-level image editing by developing an automatic interaction-free object removal model. Our model learns to find and remove objects from general scene images using image-level labels and unpaired data in a generative adversarial network (GAN) framework. We achieve this with two key contributions: a two-stage editor architecture consisting of a mask generator and image in-painter that co-operate to remove objects, and a novel GAN based prior for the mask generator that allows us to flexibly incorporate knowledge about object shapes. We experimentally show on two datasets that our method effectively removes a wide variety of objects using weak supervision only.

AAAI Conference 2018 Conference Paper

Long-Term Image Boundary Prediction

  • Apratim Bhattacharyya
  • Mateusz Malinowski
  • Bernt Schiele
  • Mario Fritz

Boundary estimation in images and videos has been a very active topic of research, and organizing visual information into boundaries and segments is believed to be a cornerstone of visual perception. While prior work has focused on estimating boundaries for observed frames, our work aims at predicting boundaries of future unobserved frames. This requires our model to learn about the fate of boundaries and corresponding motion patterns – including a notion of “intuitive physics”. We experiment on natural video sequences along with synthetic sequences with deterministic physics-based and agent-based motions. While not being our primary goal, we also show that fusion of RGB and boundary prediction leads to improved RGB predictions.

NeurIPS Conference 2017 Conference Paper

Pose Guided Person Image Generation

  • Liqian Ma
  • Xu Jia
  • Qianru Sun
  • Bernt Schiele
  • Tinne Tuytelaars
  • Luc Van Gool

This paper proposes the novel Pose Guided Person Generation Network (PG$^2$) that allows to synthesize person images in arbitrary poses, based on an image of that person and a novel pose. Our generation framework PG$^2$ utilizes the pose information explicitly and consists of two key stages: pose integration and image refinement. In the first stage the condition image and the target pose are fed into a U-Net-like network to generate an initial but coarse image of the person with the target pose. The second stage then refines the initial and blurry result by training a U-Net-like generator in an adversarial way. Extensive experimental results on both 128$\times$64 re-identification images and 256$\times$256 fashion photos show that our model generates high-quality person images with convincing details.

ICML Conference 2016 Conference Paper

Generative Adversarial Text to Image Synthesis

  • Scott E. Reed
  • Zeynep Akata
  • Xinchen Yan
  • Lajanugen Logeswaran
  • Bernt Schiele
  • Honglak Lee

Automatic synthesis of realistic images from text would be interesting and useful, but current AI systems are still far from this goal. However, in recent years generic and powerful recurrent neural network architectures have been developed to learn discriminative text feature representations. Meanwhile, deep convolutional generative adversarial networks (GANs) have begun to generate highly compelling images of specific categories such as faces, album covers, room interiors and flowers. In this work, we develop a novel deep architecture and GAN formulation to effectively bridge these advances in text and image modeling, translating visual concepts from characters to pixels. We demonstrate the capability of our model to generate plausible images of birds and flowers from detailed text descriptions.

NeurIPS Conference 2016 Conference Paper

Learning What and Where to Draw

  • Scott Reed
  • Zeynep Akata
  • Santosh Mohan
  • Samuel Tenka
  • Bernt Schiele
  • Honglak Lee

Generative Adversarial Networks (GANs) have recently demonstrated the capability to synthesize compelling real-world images, such as room interiors, album covers, manga, faces, birds, and flowers. While existing models can synthesize images based on global constraints such as a class label or caption, they do not provide control over pose or object location. We propose a new model, the Generative Adversarial What-Where Network (GAWWN), that synthesizes images given instructions describing what content to draw in which location. We show high-quality 128 × 128 image synthesis on the Caltech-UCSD Birds dataset, conditioned on both informal text descriptions and also object location. Our system exposes control over both the bounding box around the bird and its constituent parts. By modeling the conditional distributions over part locations, our system also enables conditioning on arbitrary subsets of parts (e.g., only the beak and tail), yielding an efficient interface for picking part locations.

NeurIPS Conference 2015 Conference Paper

Efficient Output Kernel Learning for Multiple Tasks

  • Pratik Kumar Jawanpuria
  • Maksim Lapin
  • Matthias Hein
  • Bernt Schiele

The paradigm of multi-task learning is that one can achieve better generalization by learning tasks jointly and thus exploiting the similarity between the tasks rather than learning them independently of each other. While previously the relationship between tasks had to be user-defined in the form of an output kernel, recent approaches jointly learn the tasks and the output kernel. As the output kernel is a positive semidefinite matrix, the resulting optimization problems are not scalable in the number of tasks as an eigendecomposition is required in each step. Using the theory of positive semidefinite kernels we show in this paper that for a certain class of regularizers on the output kernel, the constraint of being positive semidefinite can be dropped as it is automatically satisfied for the relaxed problem. This leads to an unconstrained dual problem which can be solved efficiently. Experiments on several multi-task and multi-class data sets illustrate the efficacy of our approach in terms of computational efficiency as well as generalization performance.

NeurIPS Conference 2015 Conference Paper

Top-k Multiclass SVM

  • Maksim Lapin
  • Matthias Hein
  • Bernt Schiele

Class ambiguity is typical in image classification problems with a large number of classes. When classes are difficult to discriminate, it makes sense to allow k guesses and evaluate classifiers based on the top-k error instead of the standard zero-one loss. We propose top-k multiclass SVM as a direct method to optimize for top-k performance. Our generalization of the well-known multiclass SVM is based on a tight convex upper bound of the top-k error. We propose a fast optimization scheme based on an efficient projection onto the top-k simplex, which is of its own interest. Experiments on five datasets show consistent improvements in top-k accuracy compared to various baselines.
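
To make the objective concrete, here is a simplified top-k hinge surrogate in the spirit of the paper (the exact tight convex bound and the fast top-k simplex projection used for optimization differ in details): average the k largest per-class hinge terms 1 + s_j - s_y, with the true class contributing zero, so the loss shrinks to zero once the true score beats all but the k-1 strongest competitors by a sufficient margin.

```python
import numpy as np

def topk_hinge(scores: np.ndarray, y: int, k: int = 5) -> float:
    """Simplified top-k hinge: mean of the k largest margin violations."""
    a = 1.0 + scores - scores[y]   # per-class hinge terms
    a[y] = 0.0                     # true class contributes zero
    topk = np.sort(a)[-k:]         # k largest violations
    return max(0.0, topk.mean())

scores = np.random.randn(100)      # e.g. scores over 100 classes
print(topk_hinge(scores, y=3, k=5))
```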

NeurIPS Conference 2013 Conference Paper

Transfer Learning in a Transductive Setting

  • Marcus Rohrbach
  • Sandra Ebert
  • Bernt Schiele

Category models for objects or activities typically rely on supervised learning requiring sufficiently large training sets. Transferring knowledge from known categories to novel classes with no or only a few labels, however, is far less researched, even though it is a common scenario. In this work, we extend transfer learning with semi-supervised learning to exploit unlabeled instances of (novel) categories with no or only a few labeled instances. Our proposed approach, Propagated Semantic Transfer, combines three main ingredients. First, we transfer information from known to novel categories by incorporating external knowledge, such as linguistic or expert-specified information, e.g., by a mid-level layer of semantic attributes. Second, we exploit the manifold structure of novel classes. More specifically, we adapt a graph-based learning algorithm - so far only used for semi-supervised learning - to zero-shot and few-shot learning. Third, we improve the local neighborhood in such graph structures by replacing the raw feature-based representation with a mid-level object- or attribute-based representation. We evaluate our approach on three challenging datasets in two different applications, namely on Animals with Attributes and ImageNet for image classification and on MPII Composites for activity recognition. Our approach consistently outperforms state-of-the-art transfer and semi-supervised approaches on all datasets.

IROS Conference 2010 Conference Paper

Vision based victim detection from unmanned aerial vehicles

  • Mykhaylo Andriluka
  • Paul Schnitzspan
  • Johannes Meyer
  • Stefan Kohlbrecher
  • Karen Petersen
  • Oskar von Stryk
  • Stefan Roth 0001
  • Bernt Schiele

Finding injured humans is one of the primary goals of any search and rescue operation. The aim of this paper is to address the task of automatically finding people lying on the ground in images taken from the on-board camera of an unmanned aerial vehicle (UAV). In this paper we evaluate various state-of-the-art visual people detection methods in the context of vision based victim detection from a UAV. The top performing approaches in this comparison are those that rely on flexible part-based representations and discriminatively trained part detectors. We discuss their strengths and weaknesses and demonstrate that by combining multiple models we can increase the reliability of the system. We also demonstrate that the detection performance can be substantially improved by integrating the height and pitch information provided by on-board sensors. Jointly these improvements allow us to significantly boost the detection performance over the current de-facto standard, which provides a substantial step towards making autonomous victim detection for UAVs practical.

ICRA Conference 1998 Conference Paper

Position Estimation Using Principal Components of Range Data

  • James L. Crowley
  • Frank Wallner
  • Bernt Schiele

Describes an approach to mobile robot position estimation based on principal component analysis of laser range data. An eigenspace is constructed from the principal components of a large number of range data sets. The structure of an environment, as seen by a range sensor, is represented as a family of surfaces in this space. Subsequent range data sets from the environment project as a point in this space. Associating this point to the family of surfaces gives a set of candidate positions and orientations (poses) for the sensor. These candidate poses correspond to positions and orientations in the environment which have similar range profiles. A Kalman filter can be used to select the most likely candidate pose based on coherence with small movements. The first part of this paper describes how a relatively small number of depth profiles of an environment can be used to generate a complete eigenspace. This space is used to build a representation of the range scan profiles obtained from a regular grid of positions and orientations (poses). This representation has the form of a family of surfaces (a manifold). This representation converts the problem of associating a range profile to possible positions and orientations into a table lookup. As a side benefit, the method provides a simple means to detect obstacles in a range profile. The final section of the paper reviews the use of estimation theory to determine the correct pose hypothesis by tracking.
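
The pipeline in this abstract translates into a short sketch: build a PCA basis from range scans taken at known grid poses, project a new scan into that space, and look up the nearest stored poses as candidates for a Kalman filter to disambiguate. Data shapes and the nearest-neighbour lookup below are assumptions for illustration.

```python
import numpy as np

def build_eigenspace(scans: np.ndarray, n_components: int = 8):
    """scans: (n_poses, n_beams) range profiles collected at known grid poses."""
    mean = scans.mean(axis=0)
    _, _, vt = np.linalg.svd(scans - mean, full_matrices=False)
    basis = vt[:n_components]             # principal components of range data
    coords = (scans - mean) @ basis.T     # each pose becomes a point in eigenspace
    return mean, basis, coords

def candidate_poses(scan, mean, basis, coords, poses, top=3):
    """Project a new scan and return the most similar stored grid poses."""
    q = (scan - mean) @ basis.T
    d = np.linalg.norm(coords - q, axis=1)
    return poses[np.argsort(d)[:top]]

scans = np.random.rand(500, 180)          # toy data: 500 poses, 180 range beams
poses = np.random.rand(500, 3)            # (x, y, theta) for each scan
mean, basis, coords = build_eigenspace(scans)
print(candidate_poses(scans[42], mean, basis, coords, poses))
```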

IROS Conference 1996 Conference Paper

Where to look next and what to look for

  • Bernt Schiele
  • James L. Crowley

The authors (1996) introduced the use of multidimensional receptive field histograms for probabilistic object recognition. In this paper we reverse the object recognition problem by asking the question "where should we look?", when we want to verify the presence of an object, to track an object, or to actively explore a scene. This paper describes the statistical framework from which we obtain a network of salient points for an object. This network of salient points may be used for fixation control in the context of active object recognition.

ICRA Conference 1994 Conference Paper

A Comparison of Position Estimation Techniques Using Occupancy Grids

  • Bernt Schiele
  • James L. Crowley

A mobile robot requires perception of its local environment for both sensor-based locomotion and position estimation. Occupancy grids, based on ultrasonic range data, provide a robust description of the local environment for locomotion. Unfortunately, current techniques for position estimation based on occupancy grids are both unreliable and computationally expensive. This paper reports on experiments with four techniques for position estimation using occupancy grids. A world modeling technique based on combining global and local occupancy grids is described. Techniques are described for extracting line segments from an occupancy grid based on a Hough transform. The use of an extended Kalman filter for position estimation is then adapted to this framework. Four matching techniques are presented for obtaining the innovation vector required by the Kalman filter equations. Experimental results show that matching of segments extracted from both the local and global occupancy grids gives results which are superior to a direct matching of grids, or to a mixed matching of segments to grids.