Author name cluster

Juergen Gall

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

12 papers

2 author rows

ICML Conference 2025 Conference Paper

Canonical Rank Adaptation: An Efficient Fine-Tuning Strategy for Vision Transformers

Lokesh Veeramacheneni
Moritz Wolter
Hildegard Kuehne
Juergen Gall

Modern methods for fine-tuning a Vision Transformer (ViT) like Low-Rank Adaptation (LoRA) and its variants demonstrate impressive performance. However, these methods ignore the high-dimensional nature of Multi-Head Attention (MHA) weight tensors. To address this limitation, we propose Canonical Rank Adaptation (CaRA). CaRA leverages tensor mathematics, first by tensorising the transformer into two different tensors: one for projection layers in MHA and the other for feed-forward layers. Second, the tensorised formulation is fine-tuned using the low-rank adaptation in Canonical-Polyadic Decomposition (CPD) form. Employing CaRA efficiently minimizes the number of trainable parameters. Experimentally, CaRA outperforms existing Parameter-Efficient Fine-Tuning (PEFT) methods in visual classification benchmarks such as Visual Task Adaptation Benchmark (VTAB)-1k and Fine-Grained Visual Categorization (FGVC).
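
As a rough illustration of why an update in Canonical-Polyadic form needs few trainable parameters, the sketch below rebuilds a dense 3-way update tensor from rank-R factor matrices and compares parameter counts. The tensor dimensions, the rank, and the name `cp_update` are hypothetical choices for illustration; this is not the paper's tensorisation or training code.

```python
import numpy as np


def cp_update(factors):
    """Rebuild a dense 3-way tensor from CP factors A (I, R), B (J, R), C (K, R).

    The result is the sum over r of the outer products a_r x b_r x c_r.
    """
    a, b, c = factors
    return np.einsum("ir,jr,kr->ijk", a, b, c)


rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
dims, rank = (12, 64, 768), 4
factors = [rng.standard_normal((d, rank)) for d in dims]

delta = cp_update(factors)                 # dense update tensor
cp_params = sum(d * rank for d in dims)    # parameters stored as CP factors
dense_params = int(np.prod(dims))          # parameters of the dense tensor

print(delta.shape, cp_params, dense_params)
```

With these assumed sizes, the factors hold 3,376 values while the dense tensor they represent holds 589,824.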

ICLR Conference 2025 Conference Paper

Fréchet Wavelet Distance: A Domain-Agnostic Metric for Image Generation

Lokesh Veeramacheneni
Moritz Wolter
Hildegard Kuehne
Juergen Gall

Modern metrics for generative learning like Fréchet Inception Distance (FID) and DINOv2-Fréchet Distance (FD-DINOv2) demonstrate impressive performance. However, they suffer from various shortcomings, like a bias towards specific generators and datasets. To address this problem, we propose the Fréchet Wavelet Distance (FWD) as a domain-agnostic metric based on the Wavelet Packet Transform ($\mathcal{W}_p$). FWD provides a view across a broad spectrum of frequencies in images with a high resolution, preserving both spatial and textural aspects. Specifically, we use $\mathcal{W}_p$ to project generated and real images to the packet coefficient space. We then compute the Fréchet distance with the resultant coefficients to evaluate the quality of a generator. This metric is general-purpose and dataset-domain agnostic, as it does not rely on any pre-trained network, while being more interpretable due to its ability to compute Fréchet distance per packet, enhancing transparency. From an extensive evaluation of a wide variety of generators across various datasets, we conclude that the proposed FWD generalizes and is more robust to domain shifts and various corruptions than other metrics.
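
The Fréchet distance step mentioned above can be sketched as follows, assuming the usual FID-style convention of fitting a Gaussian to each set of feature vectors and returning the squared distance between the two fits. The wavelet packet projection is omitted here; the inputs are plain feature arrays, and the function name `frechet_distance` is a hypothetical choice, not the authors' implementation.

```python
import numpy as np


def frechet_distance(x, y):
    """Squared Frechet distance between Gaussian fits of two feature sets.

    x, y: arrays of shape (num_samples, num_features).
    Computes |mu_x - mu_y|^2 + Tr(S_x + S_y - 2 (S_x S_y)^(1/2)).
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)

    # Tr((S_x S_y)^(1/2)) equals the sum of square roots of the eigenvalues
    # of S_x S_y, which are real and non-negative for PSD covariances.
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x) + np.trace(cov_y) - 2.0 * trace_sqrt)


rng = np.random.default_rng(0)
real = rng.standard_normal((500, 4))
shifted = real + 3.0  # same covariance, mean moved by 3 in each of 4 dimensions

print(frechet_distance(real, real))
print(frechet_distance(real, shifted))
```

Identical sets give a distance of approximately zero, and a pure mean shift contributes only the squared norm of the shift.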

AAAI Conference 2025 Conference Paper

Hierarchical Vector Quantization for Unsupervised Action Segmentation

Federico Spurio
Emad Bahrami
Gianpiero Francesca
Juergen Gall

In this work, we address unsupervised temporal action segmentation, which segments a set of long, untrimmed videos into semantically meaningful segments that are consistent across videos. While recent approaches combine representation learning and clustering in a single step for this task, they do not cope with large variations within temporal segments of the same class. To address this limitation, we propose a novel method, termed Hierarchical Vector Quantization (HVQ), that consists of two subsequent vector quantization modules. This results in a hierarchical clustering where the additional subclusters cover the variations within a cluster. We demonstrate that our approach captures the distribution of segment lengths much better than the state of the art. To this end, we introduce a new metric based on the Jensen-Shannon Distance (JSD) for unsupervised temporal action segmentation. We evaluate our approach on three public datasets, namely Breakfast, YouTube Instructional and IKEA ASM. Our approach outperforms the state of the art in terms of F1 score, recall and JSD.
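
For readers unfamiliar with the Jensen-Shannon Distance, the sketch below computes it between two histograms, for example histograms of predicted and ground-truth segment lengths. The binning and any aggregation across videos are not specified in this abstract, so the inputs here are hypothetical, and the function is a generic textbook definition rather than the paper's metric.

```python
import numpy as np


def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance (base 2) between two non-negative histograms.

    Both inputs are normalised to sum to 1. The result lies in [0, 1].
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute nothing to the KL divergence.
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return float(np.sqrt(max(divergence, 0.0)))


# Hypothetical segment-length histograms over the same four bins.
predicted = [10, 30, 40, 20]
reference = [5, 25, 45, 25]
print(jensen_shannon_distance(predicted, reference))
```

The distance is symmetric, equals 0 for identical histograms, and reaches 1 when the two histograms share no bins.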

EAAI Journal 2025 Journal Article

Multi-modal temporal action segmentation for manufacturing scenarios

Laura Romeo
Roberto Marani
Anna Gina Perri
Juergen Gall

Industrial robots have become prevalent in manufacturing due to their advantages of accuracy, speed, and reduced operator fatigue. Nevertheless, human operators play a crucial role in primary production lines. This study focuses on the temporal segmentation of human actions, aiming to identify the physical and cognitive behavior of operators working alongside collaborative robots. While existing literature explores temporal action segmentation datasets, there is a lack of evaluation for manufacturing tasks. This work assesses six state-of-the-art action segmentation models using the Human Action Multi-Modal Monitoring in Manufacturing (HA4M) dataset, where subjects assemble an industrial object in realistic manufacturing scenarios. By employing Cross-Subject and Cross-Location approaches, the study not only demonstrates the effectiveness of these models in industrial settings but also introduces a new benchmark for evaluating generalization across different subjects and locations. The evaluation further includes new videos in simulated industrial locations, assessed with both fully and semi-supervised learning approaches. The findings reveal that the Multi-Stage Temporal Convolutional Network ++ (MS-TCN++) and the Action Segmentation Transformer (ASFormer) architectures exhibit high performance in supervised and semi-supervised learning settings, also using new data, particularly when trained with Skeletal features, advancing the capabilities of temporal action segmentation in real-world manufacturing environments. This research lays the foundation for addressing video activity understanding challenges in manufacturing and presents opportunities for future investigations.

ICRA Conference 2024 Conference Paper

A Multimodal Handover Failure Detection Dataset and Baselines

Santosh Thoduka
Nico Hochgeschwender
Juergen Gall
Paul G. Plöger

An object handover between a robot and a human is a coordinated action which is prone to failure for reasons such as miscommunication, incorrect actions and unexpected object properties. Existing works on handover failure detection and prevention focus on preventing failures due to object slip or external disturbances. However, there is a lack of datasets and evaluation methods that consider unpreventable failures caused by the human participant. To address this deficit, we present the multimodal Handover Failure Detection dataset, which consists of failures induced by the human participant, such as ignoring the robot or not releasing the object. We also present two baseline methods for handover failure detection: (i) a video classification method using 3D CNNs and (ii) a temporal action segmentation approach which jointly classifies the human action, robot action and overall outcome of the action. The results show that video is an important modality, but using force-torque data and gripper position helps improve failure detection and action segmentation accuracy.

NeurIPS Conference 2024 Conference Paper

Identifying Spatio-Temporal Drivers of Extreme Events

Mohamad H. Eddin
Juergen Gall

The spatio-temporal relations of impacts of extreme events and their drivers in climate data are not fully understood and there is a need for machine learning approaches to identify such spatio-temporal relations from data. The task, however, is very challenging since there are time delays between extremes and their drivers, and the spatial response of such drivers is inhomogeneous. In this work, we propose a first approach and benchmarks to tackle this challenge. Our approach is trained end-to-end to predict spatio-temporally extremes and spatio-temporally drivers in the physical input variables jointly. By enforcing the network to predict extremes from spatio-temporal binary masks of identified drivers, the network successfully identifies drivers that are correlated with extremes. We evaluate our approach on three newly created synthetic benchmarks, where two of them are based on remote sensing or reanalysis climate data, and on two real-world reanalysis datasets. The source code and datasets are publicly available at the project page https://hakamshams.github.io/IDE.

IJCAI Conference 2023 Conference Paper

PowerBEV: A Powerful Yet Lightweight Framework for Instance Prediction in Bird’s-Eye View

Peizheng Li
Shuxiao Ding
Xieyuanli Chen
Niklas Hanselmann
Marius Cordts
Juergen Gall

Accurately perceiving instances and predicting their future motion are key tasks for autonomous vehicles, enabling them to navigate safely in complex urban traffic. While bird’s-eye view (BEV) representations are commonplace in perception for autonomous driving, their potential in a motion prediction setting is less explored. Existing approaches for BEV instance prediction from surround cameras rely on a multi-task auto-regressive setup coupled with complex post-processing to predict future instances in a spatio-temporally consistent manner. In this paper, we depart from this paradigm and propose an efficient novel end-to-end framework named PowerBEV, which differs in several design choices aimed at reducing the inherent redundancy in previous methods. First, rather than predicting the future in an auto-regressive fashion, PowerBEV uses a parallel, multi-scale module built from lightweight 2D convolutional networks. Second, we show that segmentation and centripetal backward flow are sufficient for prediction, simplifying previous multi-task objectives by eliminating redundant output modalities. Building on this output representation, we propose a simple, flow warping-based post-processing approach which produces more stable instance associations across time. Through this lightweight yet powerful design, PowerBEV outperforms state-of-the-art baselines on the NuScenes Dataset and poses an alternative paradigm for BEV instance prediction. We made our code publicly available at: https://github.com/EdwardLeeLPZ/PowerBEV.

AAAI Conference 2022 Conference Paper

Keypoint Message Passing for Video-Based Person Re-identification

Di Chen
Andreas Doering
Shanshan Zhang
Jian Yang
Juergen Gall
Bernt Schiele

Video-based person re-identification (re-ID) is an important technique in visual surveillance systems which aims to match video snippets of people captured by different cameras. Existing methods are mostly based on convolutional neural networks (CNNs), whose building blocks either process local neighbor pixels at a time, or, when 3D convolutions are used to model temporal information, suffer from the misalignment problem caused by person movement. In this paper, we propose to overcome the limitations of normal convolutions with a human-oriented graph method. Specifically, features located at person joint keypoints are extracted and connected as a spatial-temporal graph. These keypoint features are then updated by message passing from their connected nodes with a graph convolutional network (GCN). During training, the GCN can be attached to any CNN-based person re-ID model to assist representation learning on feature maps, whilst it can be dropped after training for better inference speed. Our method brings significant improvements over the CNN-based baseline model on the MARS dataset with generated person keypoints and a newly annotated dataset: PoseTrackReID. It also defines a new state-of-the-art method in terms of top-1 accuracy and mean average precision in comparison to prior works.

AAAI Conference 2022 Conference Paper

Ranking Info Noise Contrastive Estimation: Boosting Contrastive Learning via Ranked Positives

David T. Hoffmann
Nadine Behrmann
Juergen Gall
Thomas Brox
Mehdi Noroozi

This paper introduces Ranking Info Noise Contrastive Estimation (RINCE), a new member in the family of InfoNCE losses that preserves a ranked ordering of positive samples. In contrast to the standard InfoNCE loss, which requires a strict binary separation of the training pairs into similar and dissimilar samples, RINCE can exploit information about a similarity ranking for learning a corresponding embedding space. We show that the proposed loss function learns favorable embeddings compared to the standard InfoNCE whenever at least noisy ranking information can be obtained or when the definition of positives and negatives is blurry. We demonstrate this for a supervised classification task with additional superclass labels and noisy similarity scores. Furthermore, we show that RINCE can also be applied to unsupervised training with experiments on unsupervised representation learning from videos. In particular, the embedding yields higher classification accuracy, retrieval rates and performs better in out-of-distribution detection than the standard InfoNCE loss.
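
For context, the standard InfoNCE loss that RINCE is contrasted with can be sketched as below: one positive is scored against a set of negatives through a softmax over temperature-scaled cosine similarities. This is the single-positive baseline only, not RINCE itself; the function name `info_nce`, the temperature value, and the toy vectors are assumptions made for illustration.

```python
import numpy as np


def info_nce(query, positive, negatives, temperature=0.1):
    """Standard InfoNCE loss for one query with one positive and N negatives.

    query, positive: arrays of shape (dim,); negatives: array of shape (N, dim).
    Returns -log softmax probability assigned to the positive.
    """
    def normalize(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    q = normalize(np.asarray(query, dtype=float))
    candidates = normalize(np.vstack([positive, negatives]).astype(float))

    # Cosine similarities scaled by the temperature; row 0 is the positive.
    logits = candidates @ q / temperature
    logits = logits - logits.max()  # numerical stability
    log_prob_positive = logits[0] - np.log(np.exp(logits).sum())
    return float(-log_prob_positive)


query = np.array([1.0, 0.0])
easy = info_nce(query, np.array([1.0, 0.0]), np.array([[-1.0, 0.0]]))
hard = info_nce(query, np.array([1.0, 0.0]), np.array([[1.0, 0.0]] * 3))
print(easy, hard)
```

When the negatives are indistinguishable from the positive, the loss equals the log of the number of candidates; when they are well separated, it approaches zero. The binary positive/negative split visible here is the restriction that the abstract says RINCE relaxes by using ranked positives.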

IROS Conference 2021 Conference Paper

Using Visual Anomaly Detection for Task Execution Monitoring

Santosh Thoduka
Juergen Gall
Paul G. Plöger

Execution monitoring is essential for robots to detect and respond to failures. Since it is impossible to enumerate all failures for a given task, we learn from successful executions of the task to detect visual anomalies during runtime. Our method learns to predict the motions that occur during the nominal execution of a task, including camera and robot body motion. A probabilistic U-Net architecture is used to learn to predict optical flow, and the robot’s kinematics and 3D model are used to model camera and body motion. The errors between the observed and predicted motion are used to calculate an anomaly score. We evaluate our method on a dataset of a robot placing a book on a shelf, which includes anomalies such as falling books, camera occlusions, and robot disturbances. We find that modeling camera and body motion, in addition to the learning-based optical flow prediction, results in an improvement of the area under the receiver operating characteristic curve from 0.752 to 0.804, and the area under the precision-recall curve from 0.467 to 0.549.

ICLR Conference 2021 Conference Paper

You Only Need Adversarial Supervision for Semantic Image Synthesis

Edgar Schönfeld
Vadim Sushko
Dan Zhang
Juergen Gall
Bernt Schiele
Anna Khoreva

Despite their recent successes, GAN models for semantic image synthesis still suffer from poor image quality when trained with only adversarial supervision. Historically, additionally employing the VGG-based perceptual loss has helped to overcome this issue, significantly improving the synthesis quality, but at the same time limiting the progress of GAN models for semantic image synthesis. In this work, we propose a novel, simplified GAN model, which needs only adversarial supervision to achieve high quality results. We re-design the discriminator as a semantic segmentation network, directly using the given semantic label maps as the ground truth for training. By providing stronger supervision to the discriminator as well as to the generator through spatially- and semantically-aware discriminator feedback, we are able to synthesize images of higher fidelity with better alignment to their input label maps, making the use of the perceptual loss superfluous. Moreover, we enable high-quality multi-modal image synthesis through global and local sampling of a 3D noise tensor injected into the generator, which allows complete or partial image change. We show that images synthesized by our model are more diverse and follow the color and texture distributions of real images more closely. We achieve an average improvement of $6$ FID and $5$ mIoU points over the state of the art across different datasets using only adversarial supervision.

NeurIPS Conference 2011 Conference Paper

Learning Probabilistic Non-Linear Latent Variable Models for Tracking Complex Activities

Angela Yao
Juergen Gall
Luc Van Gool
Raquel Urtasun

A common approach for handling the complexity and inherent ambiguities of 3D human pose estimation is to use pose priors learned from training data. Existing approaches, however, are either too simplistic (linear), too complex to learn, or can only learn latent spaces from "simple data", i.e., single activities such as walking or running. In this paper, we present an efficient stochastic gradient descent algorithm that is able to learn probabilistic non-linear latent spaces composed of multiple activities. Furthermore, we derive an incremental algorithm for the online setting which can update the latent space without extensive relearning. We demonstrate the effectiveness of our approach on the task of monocular and multi-view tracking and show that our approach outperforms the state of the art.
