Author name cluster

Daniel Sonntag

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

17 papers

2 author rows

AAAI Conference 2026 Conference Paper

Reinforce Trustworthiness in Multimodal Emotional Support System

Huy M. Le
Dat Tien Nguyen
Ngan T. T. Vo
Tuan D. Q. Nguyen
Nguyen Le Binh
Duy Minh Ho Nguyen
Daniel Sonntag
Lizi Liao

In today’s world, emotional support is increasingly essential, yet it remains challenging for both those seeking help and those offering it. Multimodal approaches to emotional support show great promise by integrating diverse data sources to provide empathetic, contextually relevant responses, fostering more effective interactions. However, current methods have notable limitations, often relying solely on text or converting other data types into text, or providing emotion recognition only, thus overlooking the full potential of multimodal inputs. Moreover, many studies prioritize response generation without accurately identifying critical emotional support elements or ensuring the reliability of outputs. To overcome these issues, we introduce MultiMood, a new framework that (i) leverages multimodal embeddings from video, audio, and text to predict emotional components and to produce responses aligned with professional therapeutic standards. To improve trustworthiness, we (ii) incorporate novel psychological criteria and apply Reinforcement Learning (RL) to optimize large language models (LLMs) for consistent adherence to these standards. We also (iii) analyze several advanced LLMs to assess their multimodal emotional support capabilities. Experimental results show that MultiMood achieves state-of-the-art results on the MESC and DFEW datasets, while RL-driven trustworthiness improvements are validated through human and LLM evaluations, demonstrating its superior capability in applying a multimodal framework in this domain.

PDF Details DOI

TMLR Journal 2026 Journal Article

The Speed-up Factor: A Quantitative Multi-Iteration Active Learning Performance Metric

Hannes Kath
Thiago S. Gouvêa
Daniel Sonntag

Machine learning models excel with abundant annotated data, but annotation is often costly and time-intensive. Active learning (AL) aims to improve the performance-to-annotation ratio by using query methods (QMs) to iteratively select the most informative samples. While AL research focuses mainly on QM development, the evaluation of this iterative process lacks appropriate performance metrics. This work reviews eight years of AL evaluation literature and formally introduces the speed-up factor, a quantitative multi-iteration QM performance metric that indicates the fraction of samples needed to match random sampling performance. Using four datasets from diverse domains and seven QMs of various types, we empirically evaluate the speed-up factor and compare it with state-of-the-art AL performance metrics. The results confirm the assumptions underlying the speed-up factor, demonstrate its accuracy in capturing the described fraction, and reveal its superior stability across iterations.
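
The abstract above describes the speed-up factor as the fraction of samples needed to match random sampling performance. A minimal sketch of that idea follows, assuming learning curves are given as lists of (number of labeled samples, performance) points and that linear interpolation between points is acceptable. The function names and the interpolation are illustrative assumptions, not the paper's formal definition.

```python
def performance_at(curve, n):
    """Linearly interpolate the performance of a learning curve at n samples.

    `curve` is a list of (num_samples, performance) pairs sorted by num_samples.
    """
    if n <= curve[0][0]:
        return curve[0][1]
    for (n0, p0), (n1, p1) in zip(curve, curve[1:]):
        if n <= n1:
            return p0 + (p1 - p0) * (n - n0) / (n1 - n0)
    return curve[-1][1]


def samples_to_reach(curve, target):
    """Smallest interpolated sample count at which the curve reaches `target`.

    Returns None if the curve never reaches the target.
    """
    prev_n, prev_p = curve[0]
    if prev_p >= target:
        return float(prev_n)
    for n, p in curve[1:]:
        if p >= target:
            # p > prev_p holds here, so the division is safe.
            return prev_n + (target - prev_p) * (n - prev_n) / (p - prev_p)
        prev_n, prev_p = n, p
    return None


def sample_fraction(qm_curve, random_curve, n_random):
    """Fraction of n_random samples the query method needs to match random sampling."""
    target = performance_at(random_curve, n_random)
    needed = samples_to_reach(qm_curve, target)
    return None if needed is None else needed / n_random


# Hypothetical curves: accuracy in percent after 100, 200, and 400 labeled samples.
random_curve = [(100, 60.0), (200, 70.0), (400, 80.0)]
qm_curve = [(100, 60.0), (200, 80.0), (400, 90.0)]
fraction = sample_fraction(qm_curve, random_curve, 400)
```

With these made-up curves the query method reaches the performance random sampling attains at 400 samples after 200 samples, so the fraction is 0.5.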

PDF Details

NeurIPS Conference 2025 Conference Paper

ExGra-Med: Extended Context Graph Alignment for Medical Vision-Language Models

Duy M. H. Nguyen
Nghiem Diep
Trung Nguyen
Hoang-Bao Le
Tai Nguyen
Anh-Tien Nguyen
TrungTin Nguyen
Nhat Ho

State-of-the-art medical multi-modal LLMs (med-MLLMs), such as LLaVA-Med and BioMedGPT, primarily depend on scaling model size and data volume, with training driven largely by autoregressive objectives. However, we reveal that this approach can lead to weak vision-language alignment, making these models overly dependent on costly instruction-following data. To address this, we introduce ExGra-Med, a novel multi-graph alignment framework that jointly aligns images, instruction responses, and extended captions in the latent space, advancing semantic grounding and cross-modal coherence. To scale to large LLMs (e.g., LLaMa-7B), we develop an efficient end-to-end training scheme using black-box gradient estimation, enabling fast and scalable optimization. Empirically, ExGra-Med matches LLaVA-Med’s performance using just 10% of pre-training data, achieving a 20.13% gain on VQA-RAD and approaching full-data performance. It also outperforms strong baselines like BioMedGPT and RadFM on visual chatbot and zero-shot classification tasks, demonstrating its promise for efficient, high-quality vision-language integration in medical AI.

PDF Details

NeurIPS Conference 2025 Conference Paper

How Many Tokens Do 3D Point Cloud Transformer Architectures Really Need?

Tuan Tran Anh
Duy M. H. Nguyen
Hoai-Chau Tran
Michael Barz
Khoa D Doan
Roger Wattenhofer
Vien Ngo
Mathias Niepert

Recent advances in 3D point cloud transformers have led to state-of-the-art results in tasks such as semantic segmentation and reconstruction. However, these models typically rely on dense token representations, incurring high computational and memory costs during training and inference. In this work, we present the finding that tokens are remarkably redundant, leading to substantial inefficiency. We introduce GitMerge3D, a globally informed graph token merging method that can reduce the token count by up to 90–95% while maintaining competitive performance. This finding challenges the prevailing assumption that more tokens inherently yield better performance and highlights that many current models are over-tokenized and under-optimized for scalability. We validate our method across multiple 3D vision tasks and show consistent improvements in computational efficiency. This work is the first to assess redundancy in large-scale 3D transformer models, providing insights into the development of more efficient 3D foundation architectures. Our code and checkpoints are publicly available at https://gitmerge3d.github.io/.

PDF Details

TMLR Journal 2025 Journal Article

MGPATH: A Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot Whole Slide Pathology Classification

Anh-Tien Nguyen
Duy Minh Ho Nguyen
Nghiem Tuong Diep
Trung Quoc Nguyen
Nhat Ho
Jacqueline Michelle Metsch
Miriam Cindy Maurer
Daniel Sonntag

Whole slide pathology image classification presents challenges due to gigapixel image sizes and limited annotation labels, hindering model generalization. This paper introduces a prompt learning method to adapt large vision-language models for few-shot pathology classification. We first extend the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology image tiles, into a vision-language model by adding adaptors and aligning it with medical text encoders via contrastive learning on 923K image-text pairs. The model is then used to extract visual features and text embeddings from few-shot annotations and is fine-tuned with learnable prompt embeddings. Unlike prior methods that combine prompts with frozen features using prefix embeddings or self-attention, we propose multi-granular attention that compares interactions between learnable prompts with individual image patches and groups of them. This approach improves the model’s ability to capture both fine-grained details and broader context, enhancing its recognition of complex patterns across sub-regions. To further improve accuracy, we leverage (unbalanced) optimal transport-based visual-text distance to secure model robustness by mitigating perturbations that might occur during the data augmentation process. Empirical experiments on lung, kidney, and breast pathology modalities validate the effectiveness of our approach; we surpass several of the latest competitors and consistently improve performance across diverse architectures, including CLIP, PLIP, and Prov-GigaPath integrated PLIP. We release our implementations and pre-trained models at https://github.com/HauschildLab/MGPATH.

PDF Details

NeurIPS Conference 2025 Conference Paper

Mitigating Reward Over-optimization in Direct Alignment Algorithms with Importance Sampling

Nguyen Phuc
Ngoc-Hieu Nguyen
Duy M. H. Nguyen
Anji Liu
An Mai
Thanh Binh Nguyen
Daniel Sonntag
Khoa D Doan

Recently, Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization (DPO) have emerged as alternatives to the standard Reinforcement Learning from Human Feedback (RLHF) for aligning large language models (LLMs) with human values. Surprisingly, while DAAs do not use a separate proxy reward model as in RLHF, their performance can still deteriorate over the course of training, an over-optimization phenomenon found in RLHF where the learning policy exploits the overfitting to inaccuracies of the reward model to achieve high rewards. One attributed source of over-optimization in DAAs is the under-constrained nature of their offline optimization, which can gradually shift probability mass toward non-preferred responses not presented in the preference dataset. This paper proposes a novel importance-sampling approach to mitigate the distribution shift problem of offline DAAs. This approach, called IS-DAAs, multiplies the DAA objective with an importance ratio that accounts for the reference policy distribution. IS-DAAs additionally avoid the high variance issue associated with importance sampling by clipping the importance ratio to a maximum value. Our extensive experiments demonstrate that IS-DAAs can effectively mitigate over-optimization, especially under low regularization strength, and achieve better performance than other methods designed to address this problem.
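
The two mechanisms named in the abstract above, multiplying a DAA objective by an importance ratio and clipping that ratio to a maximum value, can be illustrated on a single preference pair. The sketch below uses the standard DPO loss as the DAA objective. The exact form of the ratio (policy over reference probability of both responses), the clipping threshold, and all function names are assumptions made for illustration and may differ from the paper's formulation.

```python
import math


def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    Arguments are sequence log-probabilities of the preferred (w) and
    non-preferred (l) responses under the policy and the reference model.
    """
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


def clipped_importance_weight(pol_w, pol_l, ref_w, ref_l, max_ratio=2.0):
    """Assumed importance ratio: policy vs. reference probability of the pair,
    clipped from above to limit variance."""
    log_ratio = (pol_w + pol_l) - (ref_w + ref_l)
    return min(math.exp(log_ratio), max_ratio)


def is_weighted_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1, max_ratio=2.0):
    """DPO loss multiplied by the clipped importance weight (illustrative only)."""
    weight = clipped_importance_weight(pol_w, pol_l, ref_w, ref_l, max_ratio)
    return weight * dpo_loss(pol_w, pol_l, ref_w, ref_l, beta)


# When the policy equals the reference, the weight is 1 and the loss is log(2).
loss_at_reference = is_weighted_loss(-5.0, -7.0, -5.0, -7.0)
```

This scalar version only shows how the weight enters the loss; it does not cover how the weight is handled during gradient computation.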

PDF Details

ICML Conference 2025 Conference Paper

On Zero-Initialized Attention: Optimal Prompt and Gating Factor Estimation

Nghiem Tuong Diep
Huy Nguyen
Chau Nguyen
Minh Le
Duy Minh Ho Nguyen
Daniel Sonntag
Mathias Niepert
Nhat Ho

LLaMA-Adapter has recently emerged as an efficient fine-tuning technique for LLaMA models, leveraging zero-initialized attention to stabilize training and enhance performance. However, despite its empirical success, the theoretical foundations of zero-initialized attention remain largely unexplored. In this paper, we provide a rigorous theoretical analysis, establishing a connection between zero-initialized attention and mixture-of-expert models. We prove that both linear and non-linear prompts, along with gating functions, can be optimally estimated, with non-linear prompts offering greater flexibility for future applications. Empirically, we validate our findings on the open LLM benchmarks, demonstrating that non-linear prompts outperform linear ones. Notably, even with limited training data, both prompt types consistently surpass vanilla attention, highlighting the robustness and adaptability of zero-initialized attention.

Details

NeurIPS Conference 2024 Conference Paper

Accelerating Transformers with Spectrum-Preserving Token Merging

Hoai-Chau Tran
Duy M. Nguyen
TrungTin Nguyen
Ngan Le
Pengtao Xie
Daniel Sonntag
James Zou
Binh T. Nguyen

Increasing the throughput of the Transformer architecture, a foundational component used in numerous state-of-the-art models for vision and language tasks (e.g., GPT, LLaVa), is an important problem in machine learning. One recent and effective strategy is to merge token representations within Transformer models, aiming to reduce computational and memory requirements while maintaining accuracy. Prior work has proposed algorithms based on Bipartite Soft Matching (BSM), which divides tokens into distinct sets and merges the top k similar tokens. However, these methods have significant drawbacks, such as sensitivity to token-splitting strategies and damage to informative tokens in later layers. This paper presents a novel paradigm called PiToMe, which prioritizes the preservation of informative tokens using an additional metric termed the energy score. This score identifies large clusters of similar tokens as high-energy, indicating potential candidates for merging, while smaller (unique and isolated) clusters are considered low-energy and preserved. Experimental findings demonstrate that PiToMe saves 40-60% of the FLOPs of the base models while exhibiting superior off-the-shelf performance on image classification (0.5% average performance drop of ViT-MAEH compared to 2.6% for baselines), image-text retrieval (0.3% average performance drop of CLIP on Flickr30k compared to 4.5% for other methods), and analogously in visual question answering with LLaVa-7B. Furthermore, PiToMe is theoretically shown to preserve intrinsic spectral properties of the original token space under mild conditions.
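
For orientation, here is a simplified sketch of the Bipartite Soft Matching baseline that the abstract above describes as prior work: tokens are divided into two sets, and the k most similar cross-set pairs are merged. The alternating split, cosine similarity, and merging by averaging are assumptions for illustration. This is not PiToMe itself, whose energy score is not specified in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bipartite_soft_matching(tokens, k):
    """Merge the k most similar A-to-B token pairs (simplified BSM sketch).

    Tokens are split alternately into sets A and B. Each A token proposes its
    most similar B token; the k strongest proposals are merged by averaging.
    Requires at least two tokens. Token order is not preserved.
    """
    set_a, set_b = tokens[0::2], tokens[1::2]
    edges = []
    for i, a in enumerate(set_a):
        j = max(range(len(set_b)), key=lambda idx: cosine(a, set_b[idx]))
        edges.append((cosine(a, set_b[j]), i, j))
    edges.sort(reverse=True)  # strongest similarity first
    merge_target = {i: j for _, i, j in edges[:k]}

    groups = [[b] for b in set_b]  # each B token collects the A tokens merged into it
    kept = []
    for i, a in enumerate(set_a):
        if i in merge_target:
            groups[merge_target[i]].append(a)
        else:
            kept.append(a)
    merged_b = [[sum(col) / len(g) for col in zip(*g)] for g in groups]
    return kept + merged_b


# Four toy 2-D tokens; merging one pair leaves three tokens.
toy_tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
reduced = bipartite_soft_matching(toy_tokens, k=1)
```

In this toy input the first two tokens are identical, so they form the strongest pair and are merged into one.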

PDF Details DOI

IJCAI Conference 2024 Conference Paper

Demo: Enhancing Wildlife Acoustic Data Annotation Efficiency through Transfer and Active Learning

Hannes Kath
Patricia P. Serafini
Ivan B. Campos
Thiago S. Gouvêa
Daniel Sonntag

Passive Acoustic Monitoring (PAM) has become a key technology in wildlife monitoring, generating large amounts of acoustic data. However, the effective application of machine learning methods for sound event detection in PAM datasets is highly dependent on the accessibility of annotated data, a process that can be labour intensive. As a team of domain experts and machine learning researchers, in this paper we present a no-code annotation tool designed for PAM datasets that incorporates transfer learning and active learning strategies to address the data annotation challenge inherent in PAM. Transfer learning is applied to use pre-trained models to compute meaningful embeddings from the PAM audio files. Active learning iteratively identifies the most informative samples and then presents them to the user for annotation. This iterative approach improves the performance of the model compared to random sample selection. In a preliminary evaluation of the tool, a domain expert annotated part of a real PAM data set. Compared to conventional tools, the workflow of the proposed tool showed a speed improvement of 2-4 times. Further enhancements, such as the incorporation of sound examples, have the potential to further improve efficiency.

PDF Details DOI

ICML Conference 2024 Conference Paper

Structure-Aware E(3)-Invariant Molecular Conformer Aggregation Networks

Duy Minh Ho Nguyen
Nina Lukashina
Tai Nguyen
An T. Le
TrungTin Nguyen
Nhat Ho
Jan Peters
Daniel Sonntag

A molecule’s 2D representation consists of its atoms, their attributes, and the molecule’s covalent bonds. A 3D (geometric) representation of a molecule is called a conformer and consists of its atom types and Cartesian coordinates. Every conformer has a potential energy, and the lower this energy, the more likely it occurs in nature. Most existing machine learning methods for molecular property prediction consider either 2D molecular graphs or 3D conformer structure representations in isolation. Inspired by recent work on using ensembles of conformers in conjunction with 2D graph representations, we propose E(3)-invariant molecular conformer aggregation networks. The method integrates a molecule’s 2D representation with that of multiple of its conformers. Contrary to prior work, we propose a novel 2D–3D aggregation mechanism based on a differentiable solver for the Fused Gromov-Wasserstein Barycenter problem and the use of an efficient conformer generation method based on distance geometry. We show that the proposed aggregation mechanism is E(3) invariant and propose an efficient GPU implementation. Moreover, we demonstrate that the aggregation mechanism helps to significantly outperform state-of-the-art molecule property prediction methods on established datasets.

Details

IJCAI Conference 2023 Conference Paper

A Human-in-the-Loop Tool for Annotating Passive Acoustic Monitoring Datasets

Hannes Kath
Thiago S. Gouvêa
Daniel Sonntag

Deep learning methods are well suited for data analysis in several domains, but application is often limited by technical entry barriers and the availability of large annotated datasets. We present an interactive machine learning tool for annotating passive acoustic monitoring datasets created for wildlife monitoring, which are time-consuming and costly to annotate manually. The tool, designed as a web application, consists of an interactive user interface implementing a human-in-the-loop workflow. Class label annotations provided manually as bounding boxes drawn over a spectrogram are consumed by a deep generative model (DGM) that learns a low-dimensional representation of the input data, as well as the available class labels. The learned low-dimensional representation is displayed as an interactive interface element, where new bounding boxes can be efficiently generated by the user with lasso-selection; alternatively, the DGM can propose new, automatically generated bounding boxes on demand. The user can accept, edit, or reject annotations suggested by the model, thus owning final judgement. Generated annotations can be used to fine-tune the underlying model, thus closing the loop. Investigations of the prediction accuracy and first empirical experiments show promising results on an artificial data set, laying the ground for application to a real life scenario.

PDF Details DOI

IJCAI Conference 2023 Conference Paper

Interactive Machine Learning Solutions for Acoustic Monitoring of Animal Wildlife in Biosphere Reserves

Thiago S. Gouvêa
Hannes Kath
Ilira Troshani
Bengt Lüers
Patricia P. Serafini
Ivan B. Campos
André S. Afonso
Sergio M. F. M. Leandro

Biodiversity loss is taking place at accelerated rates globally, and a business-as-usual trajectory will lead to missing internationally established conservation goals. Biosphere reserves are sites designed to be of global significance in terms of both the biodiversity within them and their potential for sustainable development, and are therefore ideal places for the development of local solutions to global challenges. While the protection of biodiversity is a primary goal of biosphere reserves, adequate information on the state and trends of biodiversity remains a critical gap for adaptive management in biosphere reserves. Passive acoustic monitoring (PAM) is an increasingly popular method for continued, reproducible, scalable, and cost-effective monitoring of animal wildlife. PAM adoption is on the rise, but its data management and analysis requirements pose a barrier for adoption for most agencies tasked with monitoring biodiversity. As an interdisciplinary team of machine learning scientists and ecologists experienced with PAM and working at biosphere reserves in marine and terrestrial ecosystems on three different continents, we report on the co-development of interactive machine learning tools for semi-automated assessment of animal wildlife.

PDF Details DOI

AAAI Conference 2023 Conference Paper

Joint Self-Supervised Image-Volume Representation Learning with Intra-inter Contrastive Clustering

Duy M. H. Nguyen
Hoang Nguyen
Truong T. N. Mai
Tri Cao
Binh T. Nguyen
Nhat Ho
Paul Swoboda
Shadi Albarqouni

Collecting large-scale medical datasets with fully annotated samples for training of deep networks is prohibitively expensive, especially for 3D volume data. Recent breakthroughs in self-supervised learning (SSL) offer the ability to overcome the lack of labeled training samples by learning feature representations from unlabeled data. However, most current SSL techniques in the medical field have been designed for either 2D images or 3D volumes. In practice, this restricts the capability to fully leverage unlabeled data from numerous sources, which may include both 2D and 3D data. Additionally, the use of these pre-trained networks is constrained to downstream tasks with compatible data dimensions. In this paper, we propose a novel framework for unsupervised joint learning on 2D and 3D data modalities. Given a set of 2D images or 2D slices extracted from 3D volumes, we construct an SSL task based on a 2D contrastive clustering problem for distinct classes. The 3D volumes are exploited by computing vectored embedding at each slice and then assembling a holistic feature through deformable self-attention mechanisms in Transformer, which allows long-range dependencies between slices inside 3D volumes to be incorporated. These holistic features are further utilized to define a novel 3D clustering agreement-based SSL task and masking embedding prediction inspired by pre-trained language models. Experiments on downstream tasks, such as 3D brain segmentation, lung nodule detection, 3D heart structures segmentation, and abnormal chest X-ray detection, demonstrate the effectiveness of our joint 2D and 3D SSL approach. We improve plain 2D Deep-ClusterV2 and SwAV by a significant margin and also surpass various modern 2D and 3D SSL approaches.

PDF Details DOI

NeurIPS Conference 2023 Conference Paper

LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching

Duy M. H. Nguyen
Hoang Nguyen
Nghiem Diep
Tan Ngoc Pham
Tri Cao
Binh Nguyen
Paul Swoboda
Nhat Ho

Obtaining large pre-trained models that can be fine-tuned to new tasks with limited annotated samples has remained an open challenge for medical imaging data. While pre-trained networks on ImageNet and vision-language foundation models trained on web-scale data are the prevailing approaches, their effectiveness on medical tasks is limited due to the significant domain shift between natural and medical images. To bridge this gap, we introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets. We have collected approximately 1.3 million medical images from 55 publicly available datasets, covering a large number of organs and modalities such as CT, MRI, X-ray, and Ultrasound. We benchmark several state-of-the-art self-supervised algorithms on this dataset and propose a novel self-supervised contrastive learning algorithm using a graph-matching formulation. The proposed approach makes three contributions: (i) it integrates prior pair-wise image similarity metrics based on local and global information; (ii) it captures the structural constraints of feature embeddings through a loss function constructed through a combinatorial graph-matching objective; and (iii) it can be trained efficiently end-to-end using modern gradient-estimation techniques for black-box solvers. We thoroughly evaluate the proposed LVM-Med on 15 downstream medical tasks ranging from segmentation and classification to object detection, in both in-distribution and out-of-distribution settings. LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models. For challenging tasks such as Brain Tumor Classification or Diabetic Retinopathy Grading, LVM-Med improves previous vision-language models trained on 1 billion masks by 6-7% while using only a ResNet-50.

PDF Details

AIIM Journal 2019 Journal Article

An architecture of open-source tools to combine textual information extraction, faceted search and information visualisation

Daniel Sonntag
Hans-Jürgen Profitlich

Details DOI

IJCAI Conference 2017 Conference Paper

Speech-based Medical Decision Support in VR using a Deep Neural Network (Demonstration)

Alexander Prange
Michael Barz
Daniel Sonntag

We present a speech dialogue system that facilitates medical decision support for doctors in a virtual reality (VR) application. The therapy prediction is based on a recurrent neural network model that incorporates the examination history of patients. A central supervised patient database provides input to our predictive model and allows us, first, to add new examination reports by a pen-based mobile application on-the-fly, and second, to get therapy prediction results in real-time. This demo includes a visualisation of patient records, radiology image data, and the therapy prediction results in VR.

PDF Details

IJCAI Conference 2009 Conference Paper

Daniel Sonntag

Dialogue-based Question Answering (QA) is a highly complex task that brings together a QA system including various natural language processing components (i.e., components for question classification, information extraction, and retrieval) with dialogue systems for effective and natural communication. The dialogue-based access is difficult to establish when the QA system in use is complex and combines many different answer services with different quality and access characteristics. For example, some questions are processed by open-domain QA services with a broad coverage. Others should be processed by using a domain-specific instance ontology for more reliable answers. Different answer services may change their characteristics over time and the dialogue reaction models have to be updated according to that. To solve this problem, we developed introspective methods to integrate adaptable models of the answer services. We evaluated the impact of the learned models on the dialogue performance, i.e., whether the adaptable models can be used for a more convenient dialogue formulation process. We show significant effectiveness improvements in the resulting dialogues when using the machine learning (ML) models. Examples are provided in the context of the generation of system-initiative feedback to user questions and answers, as provided by heterogeneous information services.

PDF Details