Arrow Research search

Author name cluster

Maarten de Rijke

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

40 papers
2 author rows

Possible papers

40

TMLR Journal 2026 Journal Article

A Simple and Effective Reinforcement Learning Method for Text-to-Image Diffusion Fine-tuning

  • Shashank Gupta
  • Chaitanya Ahuja
  • Tsung-Yu Lin
  • Sreya Dutta Roy
  • Harrie Oosterhuis
  • Maarten de Rijke
  • Satya Narayan Shukla

Reinforcement learning (RL)-based fine-tuning has emerged as a powerful approach for aligning diffusion models with black-box objectives. Proximal policy optimization (PPO) is a popular choice of method for policy optimization. While effective in terms of performance and sample complexity, PPO is highly sensitive to hyper-parameters and involves substantial computational overhead. REINFORCE, on the other hand, mitigates some implementation complexities, such as high memory overhead and sensitive hyper-parameter tuning, but has suboptimal performance due to high variance and, crucially, sample inefficiency, which is the primary notion of efficiency we study in this work. While the variance of REINFORCE can be reduced by sampling multiple actions per input prompt and using a baseline correction term, it still suffers from sample inefficiency. To address these challenges, we systematically analyze the sample efficiency-effectiveness trade-off between REINFORCE and PPO, and propose leave-one-out PPO (LOOP), a novel RL method for diffusion fine-tuning. LOOP combines variance reduction techniques from REINFORCE, such as sampling multiple actions per input prompt and a baseline correction term, with the robustness and sample efficiency of PPO via clipping and importance sampling. Our results demonstrate that LOOP effectively improves diffusion models on various black-box objectives, and achieves a better balance between sample efficiency and final performance.
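
A minimal sketch of the two ingredients named above — a leave-one-out baseline computed over multiple samples per prompt and a PPO-style clipped, importance-weighted update — might look as follows. The function name, tensor shapes, and clipping constant are illustrative assumptions, not the authors' implementation.

    import torch

    def loop_style_loss(log_probs_new, log_probs_old, rewards, clip_eps=0.2):
        """Illustrative LOOP-style objective for one prompt (not the paper's code).

        log_probs_new: (K,) log-probabilities of K sampled trajectories under the current policy
        log_probs_old: (K,) log-probabilities under the policy that generated the samples
        rewards:       (K,) black-box rewards for the K samples
        """
        K = rewards.shape[0]
        # Leave-one-out baseline: for each sample, the mean reward of the other K-1 samples.
        baseline = (rewards.sum() - rewards) / (K - 1)
        advantages = rewards - baseline
        # PPO-style clipped, importance-weighted surrogate.
        ratio = torch.exp(log_probs_new - log_probs_old)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
        return -torch.min(unclipped, clipped).mean()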

AAAI Conference 2026 Conference Paper

Thinking Forward and Backward: Multi-Objective Reinforcement Learning for Retrieval-Augmented Reasoning

  • Wenda Wei
  • Yu-An Liu
  • Ruqing Zhang
  • Jiafeng Guo
  • Lixin Su
  • Shuaiqiang Wang
  • Dawei Yin
  • Maarten de Rijke

Retrieval-augmented generation (RAG) has proven to be effective in mitigating hallucinations in large language models, yet its effectiveness remains limited in complex, multi-step reasoning scenarios. Recent efforts have incorporated search-based interactions into RAG, enabling iterative reasoning with real-time retrieval. However, most approaches rely on outcome-based supervision, offering no explicit guidance for intermediate steps, which often leads to reward hacking and degraded response quality. We propose Bi-RAR, a novel retrieval-augmented reasoning framework that evaluates each intermediate step jointly in both forward and backward directions. To assess the information completeness of each step, we introduce a bidirectional information distance grounded in Kolmogorov complexity, approximated via language model generation probabilities. This quantification measures both how far the current reasoning is from the answer and how well it addresses the question. To optimize reasoning under these bidirectional signals, we adopt a multi-objective reinforcement learning framework with a cascading reward structure that emphasizes early trajectory alignment. Empirical results on seven question answering benchmarks demonstrate that Bi-RAR surpasses previous methods and enables efficient interaction and reasoning with the search engine during training and inference.
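
As a rough illustration of approximating a bidirectional information distance with language-model probabilities, each intermediate step could be scored with negative log-likelihoods in both directions; the lm.logprob helper and the exact conditioning below are hypothetical assumptions, not the paper's definition.

    def neg_log_likelihood(text, context, lm):
        # -log p(text | context); lm.logprob is a hypothetical helper returning the
        # summed token log-probability of `text` given `context`.
        return -lm.logprob(text, context=context)

    def bidirectional_distance(question, step, answer, lm):
        # Forward: how much information is still missing to reach the answer from this step.
        forward = neg_log_likelihood(answer, context=question + "\n" + step, lm=lm)
        # Backward: how well the intermediate step is accounted for by the question.
        backward = neg_log_likelihood(step, context=question, lm=lm)
        return forward, backward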

JAIR Journal 2026 Journal Article

Understanding AI Trustworthiness: A Scoping Review of AIES & FAccT Articles

  • Siddharth Mehrotra
  • Jin Huang
  • Xuelong Fu
  • Roel Dobbe
  • Clara I. Sánchez
  • Maarten de Rijke

Background: Trustworthy AI serves as a foundational pillar for two major AI ethics conferences: AIES and FAccT. Current research often adopts techno-centric approaches, focusing primarily on technical attributes such as accuracy, reliability, robustness, and fairness, while overlooking the sociotechnical dimensions critical to understanding AI trustworthiness in real-world contexts. Objectives: This scoping review aims to examine how the AIES and FAccT communities conceptualize, measure, and validate AI trustworthiness, identifying major gaps and opportunities for advancing a holistic understanding of trustworthy AI systems. Methods: We conduct a scoping review of the AIES and FAccT conference proceedings to date, systematically analyzing how trustworthiness is defined, operationalized, and applied across different research domains. Our analysis focuses on conceptualization approaches, measurement methods, verification and validation techniques, application areas, and underlying values. Results: While significant progress has been made in defining technical attributes such as transparency, accountability, and robustness, our findings reveal critical gaps. Current research predominantly emphasizes technical precision at the expense of social and ethical considerations. The sociotechnical nature of AI systems remains underexplored, and trustworthiness emerges as a contested concept shaped by those with the power to define it. Conclusions: An interdisciplinary approach combining technical rigor with social, cultural, and institutional considerations is essential for advancing trustworthy AI. We propose actionable measures for the AI ethics community to adopt holistic frameworks that genuinely address the complex interplay between AI systems and society, ultimately promoting responsible technological development that benefits all stakeholders.

AAAI Conference 2025 Conference Paper

Attack-in-the-Chain: Bootstrapping Large Language Models for Attacks Against Black-Box Neural Ranking Models

  • Yu-An Liu
  • Ruqing Zhang
  • Jiafeng Guo
  • Maarten de Rijke
  • Yixing Fan
  • Xueqi Cheng

Neural ranking models (NRMs) have been shown to be highly effective in terms of retrieval performance. Unfortunately, they have also displayed a higher degree of sensitivity to attacks than previous generation models. To help expose and address this lack of robustness, we introduce a novel ranking attack framework named Attack-in-the-Chain, which tracks interactions between large language models (LLMs) and NRMs based on chain-of-thought (CoT) prompting to generate adversarial examples under black-box settings. Our approach starts by identifying anchor documents with higher ranking positions than the target document as nodes in the reasoning chain. We then dynamically assign the number of perturbation words to each node and prompt LLMs to execute attacks. Finally, we verify the attack performance of all nodes at each reasoning step and proceed to generate the next reasoning step. Empirical results on two web search benchmarks show the effectiveness of our method.

UAI Conference 2025 Conference Paper

Distributional Reinforcement Learning with Dual Expectile-Quantile Regression

  • Sami Jullien
  • Romain Deffayet
  • Jean-Michel Renders
  • Paul Groth
  • Maarten de Rijke

Distributional reinforcement learning (RL) has proven useful in multiple benchmarks as it enables approximating the full distribution of returns and extracts rich feedback from environment samples. The commonly used quantile regression approach to distributional RL – based on asymmetric $L_1$ losses – provides a flexible and effective way of learning arbitrary return distributions. In practice, it is often improved by using a more efficient, asymmetric hybrid $L_1$-$L_2$ Huber loss for quantile regression. However, by doing so, distributional estimation guarantees vanish, and we empirically observe that the estimated distribution rapidly collapses to its mean. Indeed, asymmetric $L_2$ losses, corresponding to expectile regression, cannot be readily used for distributional temporal difference learning. Motivated by the efficiency of $L_2$-based learning, we propose to jointly learn expectiles and quantiles of the return distribution in a way that allows efficient learning while keeping an estimate of the full distribution of returns. We prove that our proposed operator converges to the distributional Bellman operator in the limit of infinite estimated quantile and expectile fractions, and we benchmark a practical implementation on a toy example and at scale. On the Atari benchmark, our approach matches the performance of the Huber-based IQN-1 baseline after $200$M training frames but avoids distributional collapse and keeps estimates of the full distribution of returns.
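
For reference, the two asymmetric losses involved are standard: the pinball (asymmetric L1) loss targets quantiles, while the asymmetric L2 loss targets expectiles. A small sketch of the per-sample losses, with the paper's joint learning scheme omitted:

    import numpy as np

    def quantile_loss(err, tau):
        # Asymmetric L1 (pinball) loss for the tau-quantile; err = target - prediction.
        return np.where(err >= 0, tau * err, (tau - 1) * err)

    def expectile_loss(err, tau):
        # Asymmetric L2 loss for the tau-expectile; err = target - prediction.
        return np.where(err >= 0, tau, 1 - tau) * err ** 2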

AAAI Conference 2025 Conference Paper

ExcluIR: Exclusionary Neural Information Retrieval

  • Wenhao Zhang
  • Mengqi Zhang
  • Shiguang Wu
  • Jiahuan Pei
  • Zhaochun Ren
  • Maarten de Rijke
  • Zhumin Chen
  • Pengjie Ren

Exclusion is an important and universal linguistic skill that humans use to express what they do not want. There is little research on exclusionary retrieval, where users express what they do not want to be part of the results produced for their queries. We investigate the scenario of exclusionary retrieval in document retrieval for the first time. We present ExcluIR, a set of resources for exclusionary retrieval, consisting of an evaluation benchmark and a training set for helping retrieval models to comprehend exclusionary queries. The evaluation benchmark includes 3,452 high-quality exclusionary queries, each of which has been manually annotated. The training set contains 70,293 exclusionary queries, each paired with a positive document and a negative document. We conduct detailed experiments and analyses, obtaining three main observations: (i) existing retrieval models with different architectures struggle to comprehend exclusionary queries effectively; (ii) although integrating our training data can improve the performance of retrieval models on exclusionary retrieval, there still exists a gap compared to human performance; and (iii) generative retrieval models have a natural advantage in handling exclusionary queries.

ICLR Conference 2025 Conference Paper

MACPO: Weak-to-Strong Alignment via Multi-Agent Contrastive Preference Optimization

  • Yougang Lyu
  • Lingyong Yan
  • Zihan Wang 0002
  • Dawei Yin 0001
  • Pengjie Ren
  • Maarten de Rijke
  • Zhaochun Ren

As large language models (LLMs) are rapidly advancing and achieving near-human capabilities on specific tasks, aligning them with human values is becoming more urgent. In scenarios where LLMs outperform humans, we face a weak-to-strong alignment problem where we need to effectively align strong student LLMs through weak supervision generated by weak teachers. Existing alignment methods mainly focus on strong-to-weak alignment and self-alignment settings, and it is impractical to adapt them to the much harder weak-to-strong alignment setting. To fill this gap, we propose a multi-agent contrastive preference optimization (MACPO) framework. MACPO enables weak teachers and strong students to learn from each other by iteratively reinforcing unfamiliar positive behaviors while penalizing familiar negative ones. To achieve this, we devise a mutual positive behavior augmentation strategy to encourage weak teachers and strong students to learn from each other's positive behavior and provide higher-quality positive behavior for the next iteration. Additionally, we propose a hard negative behavior construction strategy to induce weak teachers and strong students to generate familiar negative behavior by fine-tuning on negative behavioral data. Experimental results on the HH-RLHF and PKU-SafeRLHF datasets, evaluated using both automatic metrics and human judgments, demonstrate that MACPO simultaneously improves the alignment performance of strong students and weak teachers. Moreover, as the number of weak teachers increases, MACPO achieves better weak-to-strong alignment performance through more iterative optimization rounds.

TMLR Journal 2024 Journal Article

Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning

  • Maurits Bleeker
  • Mariya Hendriksen
  • Andrew Yates
  • Maarten de Rijke

Vision-language models (VLMs) mainly rely on contrastive training to learn general-purpose representations of images and captions. We focus on the situation when one image is associated with several captions, each caption containing both information shared among all captions and unique information per caption about the scene depicted in the image. In such cases, it is unclear whether contrastive losses are sufficient for learning task-optimal representations that contain all the information provided by the captions or whether the contrastive learning setup encourages the learning of a simple shortcut that minimizes contrastive loss. We introduce synthetic shortcuts for vision-language: a training and evaluation framework where we inject synthetic shortcuts into image-text data. We show that contrastive VLMs trained from scratch or fine-tuned with data containing these synthetic shortcuts mainly learn features that represent the shortcut. Hence, contrastive losses are not sufficient to learn task-optimal representations, i.e., representations that contain all task-relevant information shared between the image and associated captions. We examine two methods to reduce shortcut learning in our training and evaluation framework: (i) latent target decoding and (ii) implicit feature modification. We show empirically that both methods improve performance on the evaluation task, but only partly reduce shortcut learning when training and evaluating with our shortcut learning framework. Hence, we show the difficulty and challenge of our shortcut learning framework for contrastive vision-language representation learning.

NeurIPS Conference 2024 Conference Paper

Generative Retrieval Meets Multi-Graded Relevance

  • Yubao Tang
  • Ruqing Zhang
  • Jiafeng Guo
  • Maarten de Rijke
  • Wei Chen
  • Xueqi Cheng

Generative retrieval represents a novel approach to information retrieval, utilizing an encoder-decoder architecture to directly produce relevant document identifiers (docids) for queries. While this method offers benefits, current implementations are limited to scenarios with binary relevance data, overlooking the potential for documents to have multi-graded relevance. Extending generative retrieval to accommodate multi-graded relevance poses challenges, including the need to reconcile likelihood probabilities for docid pairs and the possibility of multiple relevant documents sharing the same identifier. To address these challenges, we introduce a new framework called GRaded Generative Retrieval (GR$^2$). Our approach focuses on two key components: ensuring relevant and distinct identifiers, and implementing multi-graded constrained contrastive training. Firstly, we aim to create identifiers that are both semantically relevant and sufficiently distinct to represent individual documents effectively. This is achieved by jointly optimizing the relevance and distinctness of docids through a combination of docid generation and autoencoder models. Secondly, we incorporate information about the relationship between relevance grades to guide the training process. Specifically, we leverage a constrained contrastive training strategy to bring the representations of queries and the identifiers of their relevant documents closer together, based on their respective relevance grades. Extensive experiments on datasets with both multi-graded and binary relevance demonstrate the effectiveness of our method.

AAAI Conference 2024 Conference Paper

Perturbation-Invariant Adversarial Training for Neural Ranking Models: Improving the Effectiveness-Robustness Trade-Off

  • Yu-An Liu
  • Ruqing Zhang
  • Mingkun Zhang
  • Wei Chen
  • Maarten de Rijke
  • Jiafeng Guo
  • Xueqi Cheng

Neural ranking models (NRMs) have shown great success in information retrieval (IR). But their predictions can easily be manipulated using adversarial examples, which are crafted by adding imperceptible perturbations to legitimate documents. This vulnerability raises significant concerns about their reliability and hinders the widespread deployment of NRMs. By incorporating adversarial examples into training data, adversarial training has become the de facto defense approach to adversarial attacks against NRMs. However, this defense mechanism is subject to a trade-off between effectiveness and adversarial robustness. In this study, we establish theoretical guarantees regarding the effectiveness-robustness trade-off in NRMs. We decompose the robust ranking error into two components, i.e., a natural ranking error for effectiveness evaluation and a boundary ranking error for assessing adversarial robustness. Then, we define the perturbation invariance of a ranking model and prove it to be a differentiable upper bound on the boundary ranking error for attainable computation. Informed by our theoretical analysis, we design a novel perturbation-invariant adversarial training (PIAT) method for ranking models to achieve a better effectiveness-robustness trade-off. We design a regularized surrogate loss, in which one term encourages the effectiveness to be maximized while the regularization term encourages the output to be smooth, so as to improve adversarial robustness. Experimental results on several ranking models demonstrate the superiority of PIAT compared to existing adversarial defenses.
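
A hedged sketch of such a regularized surrogate — a pairwise ranking loss on clean documents plus a smoothness term tying scores under perturbation to the clean scores — is given below; the specific loss terms and weighting are illustrative, not the PIAT formulation itself.

    import torch
    import torch.nn.functional as F

    def piat_style_loss(scores_clean, scores_perturbed, pairwise_labels, reg_weight=1.0):
        """Illustrative regularized surrogate for one query.

        scores_clean, scores_perturbed: (N,) ranking scores for N candidate documents.
        pairwise_labels: (N, N) with 1 where document i should rank above document j, else 0.
        """
        diff = scores_clean.unsqueeze(1) - scores_clean.unsqueeze(0)   # (N, N) score margins
        effectiveness = (pairwise_labels * F.softplus(-diff)).sum()    # pairwise logistic ranking loss
        smoothness = F.mse_loss(scores_perturbed, scores_clean)        # perturbation invariance
        return effectiveness + reg_weight * smoothness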

ECAI Conference 2024 Conference Paper

QFMTS: Generating Query-Focused Summaries over Multi-Table Inputs

  • Weijia Zhang 0004
  • Vaishali Pal
  • Jia-Hong Huang
  • Evangelos Kanoulas
  • Maarten de Rijke

Table summarization is a crucial task aimed at condensing information from tabular data into concise and comprehensible textual summaries. However, existing approaches often fall short of adequately meeting users’ information and quality requirements and tend to overlook the complexities of real-world queries. In this paper, we propose a novel method to address these limitations by introducing query-focused multi-table summarization. Our approach, which comprises a table serialization module, a summarization controller, and a large language model (LLM), utilizes textual queries and multiple tables to generate query-dependent table summaries tailored to users’ information needs. To facilitate research in this area, we present a comprehensive dataset specifically tailored for this task, consisting of 4,909 query-summary pairs, each associated with multiple tables. Through extensive experiments using our curated dataset, we demonstrate the effectiveness of our proposed method compared to baseline approaches. Our findings offer insights into the challenges of complex table reasoning for precise summarization, contributing to the advancement of research in query-focused multi-table summarization.

TMLR Journal 2023 Journal Article

A Simulation Environment and Reinforcement Learning Method for Waste Reduction

  • Sami Jullien
  • Mozhdeh Ariannezhad
  • Paul Groth
  • Maarten de Rijke

In retail (e.g., grocery stores, apparel shops, online retailers), inventory managers have to balance short-term risk (no items to sell) with long-term risk (over-ordering leading to product waste). This balancing task is made especially hard due to the lack of information about future customer purchases. In this paper, we study the problem of restocking a grocery store’s inventory with perishable items over time, from a distributional point of view. The objective is to maximize sales while minimizing waste, with uncertainty about the actual consumption by customers. This problem is of high relevance today, given the growing demand for food and the impact of food waste on the environment, the economy, and purchasing power. We frame inventory restocking as a new reinforcement learning task that exhibits stochastic behavior conditioned on the agent’s actions, making the environment partially observable. We make two main contributions. First, we introduce a new reinforcement learning environment, RetaiL, based on real grocery store data and expert knowledge. This environment is highly stochastic, and presents a unique challenge for reinforcement learning practitioners. We show that uncertainty about the future behavior of the environment is not handled well by classical supply chain algorithms, and that distributional approaches are a good way to account for the uncertainty. Second, we introduce GTDQN, a distributional reinforcement learning algorithm that learns a generalized lambda distribution over the reward space. GTDQN provides a strong baseline for our environment. It outperforms other distributional reinforcement learning approaches in this partially observable setting, in both overall reward and generated waste.

AAAI Conference 2023 Conference Paper

Contrastive Learning Reduces Hallucination in Conversations

  • Weiwei Sun
  • Zhengliang Shi
  • Shen Gao
  • Pengjie Ren
  • Maarten de Rijke
  • Zhaochun Ren

Pre-trained language models (LMs) store knowledge in their parameters and can generate informative responses when used in conversational systems. However, LMs suffer from the problem of “hallucination:” they may generate plausible-looking statements that are irrelevant or factually incorrect. To address this problem, we propose a contrastive learning scheme, named MixCL. A novel mixed contrastive objective is proposed to explicitly optimize the implicit knowledge elicitation process of LMs, and thus reduce their hallucination in conversations. We also examine negative sampling strategies of retrieved hard negatives and model-generated negatives. We conduct experiments on Wizard-of-Wikipedia, a public, open-domain knowledge-grounded dialogue benchmark, and assess the effectiveness of MixCL. MixCL effectively reduces the hallucination of LMs in conversations and achieves the highest performance among LM-based dialogue agents in terms of relevancy and factuality. We show that MixCL achieves comparable performance to state-of-the-art KB-based approaches while enjoying notable advantages in terms of efficiency and scalability.

AAAI Conference 2023 Conference Paper

Feature-Level Debiased Natural Language Understanding

  • Yougang Lyu
  • Piji Li
  • Yechang Yang
  • Maarten de Rijke
  • Pengjie Ren
  • Yukun Zhao
  • Dawei Yin
  • Zhaochun Ren

Natural language understanding (NLU) models often rely on dataset biases rather than intended task-relevant features to achieve high performance on specific datasets. As a result, these models perform poorly on datasets outside the training distribution. Some recent studies address this issue by reducing the weights of biased samples during the training process. However, these methods still encode biased latent features in representations and neglect the dynamic nature of bias, which hinders model prediction. We propose an NLU debiasing method, named debiasing contrastive learning (DCT), to simultaneously alleviate the above problems based on contrastive learning. We devise a debiasing positive sampling strategy to mitigate biased latent features by selecting the least similar biased positive samples. We also propose a dynamic negative sampling strategy to capture the dynamic influence of biases by employing a bias-only model to dynamically select the most similar biased negative samples. We conduct experiments on three NLU benchmark datasets. Experimental results show that DCT outperforms state-of-the-art baselines on out-of-distribution datasets while maintaining in-distribution performance. We also verify that DCT can reduce biased latent features from the model's representation.

TMLR Journal 2023 Journal Article

Reducing Predictive Feature Suppression in Resource-Constrained Contrastive Image-Caption Retrieval

  • Maurits Bleeker
  • Andrew Yates
  • Maarten de Rijke

To train image-caption retrieval (ICR) methods, contrastive loss functions are a common choice for optimization functions. Unfortunately, contrastive ICR methods are vulnerable to predictive feature suppression. Predictive features are features that correctly indicate the similarity between a query and a candidate item. However, in the presence of multiple predictive features during training, encoder models tend to suppress redundant predictive features, since these features are not needed to learn to discriminate between positive and negative pairs. We introduce an approach to reduce predictive feature suppression for resource-constrained ICR methods: latent target decoding (LTD). We add an additional decoder to the contrastive ICR framework, to reconstruct the input caption in a latent space of a general-purpose sentence encoder, which prevents the image and caption encoder from suppressing predictive features. We implement the LTD objective as an optimization constraint, to ensure that the reconstruction loss is below a bound value while primarily optimizing for the contrastive loss. Importantly, LTD does not depend on additional training data or expensive (hard) negative mining strategies. Our experiments show that, unlike reconstructing the input caption in the input space, LTD reduces predictive feature suppression, measured by obtaining higher recall@k, r-precision, and nDCG scores than a contrastive ICR baseline. Moreover, we show that LTD should be implemented as an optimization constraint instead of a dual optimization objective. Finally, we show that LTD can be used with different contrastive learning losses and a wide variety of resource-constrained ICR methods.

AIJ Journal 2022 Journal Article

Bayesian feature interaction selection for factorization machines

  • Yifan Chen
  • Yang Wang
  • Pengjie Ren
  • Meng Wang
  • Maarten de Rijke

Factorization machines are a generic supervised method that can effectively model feature interactions and are used for a wide range of tasks in artificial intelligence, such as prediction and inference. However, handling combinations of features is expensive due to the exponential growth of feature interactions with the order. By nature, not all feature interactions are equally useful for prediction. Recently, a large number of methods that perform feature interaction selection have attracted great attention because of their effectiveness at filtering out useless feature interactions. Current feature interaction selection methods suffer from the following limitations: (1) they assume that all users share the same feature interactions; and (2) they select pairwise feature interactions only. In this paper, we propose novel Bayesian variable selection methods, targeting feature interaction selection for factorization machines, which effectively reduce the number of interactions. We study personalized feature interaction selection to account for individual preferences, and further extend the model to investigate higher-order feature interaction selection on higher-order factorization machines. We provide empirical evidence for the advantages of the proposed Bayesian feature interaction selection methods using different prediction tasks.

AAAI Conference 2022 Conference Paper

FOCUS: Flexible Optimizable Counterfactual Explanations for Tree Ensembles

  • Ana Lucic
  • Harrie Oosterhuis
  • Hinda Haned
  • Maarten de Rijke

Model interpretability has become an important problem in Machine Learning (ML) due to the increased effect that algorithmic decisions have on humans. Counterfactual explanations can help users understand not only why ML models make certain decisions, but also how these decisions can be changed. We frame the problem of finding counterfactual explanations as a gradient-based optimization task and extend previous work that could only be applied to differentiable models. In order to accommodate non-differentiable models such as tree ensembles, we use probabilistic model approximations in the optimization framework. We introduce an approximation technique that is effective for finding counterfactual explanations for predictions of the original model and show that our counterfactual examples are significantly closer to the original instances than those produced by other methods specifically designed for tree ensembles.
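
The general recipe — relax hard tree splits with sigmoids so the ensemble becomes differentiable, then run gradient descent on the input — can be sketched as follows; the single-split "ensemble", temperature, and distance weight are simplifying assumptions rather than the paper's exact procedure.

    import torch

    def soft_stump(x, feature, threshold, left_val, right_val, temperature=0.1):
        # Sigmoid relaxation of one split: smoothly interpolates between the two leaf values.
        p_right = torch.sigmoid((x[feature] - threshold) / temperature)
        return (1 - p_right) * left_val + p_right * right_val

    def find_counterfactual(x0, stumps, target=1.0, steps=200, lr=0.05, dist_weight=0.1):
        # Gradient search for a nearby input whose soft ensemble prediction reaches `target`.
        x = x0.detach().clone().requires_grad_(True)   # x0: float tensor of feature values
        opt = torch.optim.Adam([x], lr=lr)
        for _ in range(steps):
            pred = torch.stack([soft_stump(x, *s) for s in stumps]).mean()
            loss = (pred - target) ** 2 + dist_weight * torch.norm(x - x0)
            opt.zero_grad()
            loss.backward()
            opt.step()
        return x.detach()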

TMLR Journal 2022 Journal Article

sigmoidF1: A Smooth F1 Score Surrogate Loss for Multilabel Classification

  • Gabriel Bénédict
  • Hendrik Vincent Koops
  • Daan Odijk
  • Maarten de Rijke

Multilabel classification is the task of attributing multiple labels to examples via predictions. Current models formulate a reduction of the multilabel setting into either multiple binary classifications or multiclass classification, allowing for the use of existing loss functions (sigmoid, cross-entropy, logistic, etc.). These multilabel classification reductions do not accommodate the prediction of varying numbers of labels per example. Moreover, the loss functions are distant estimates of the performance metrics. We propose sigmoidF1, a loss function that is an approximation of the F1 score that (i) is smooth and tractable for stochastic gradient descent, (ii) naturally approximates a multilabel metric, and (iii) estimates both label suitability and label counts. We show that any confusion matrix metric can be formulated with a smooth surrogate. We evaluate the proposed loss function on text and image datasets, and with a variety of metrics, to account for the complexity of multilabel classification evaluation. sigmoidF1 outperforms other loss functions on one text and two image datasets over several metrics. These results show the effectiveness of using inference-time metrics as loss functions for non-trivial classification problems like multilabel classification.
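
A compact sketch of a smooth F1 surrogate in this spirit — soft true-positive, false-positive, and false-negative counts computed from sigmoid-activated scores — is shown below; the beta and eta parameters follow the tunable-sigmoid idea, but the details should be read as an approximation of the paper's loss, not its reference implementation.

    import torch

    def sigmoid_f1_loss(logits, targets, beta=1.0, eta=0.0):
        # logits, targets: (batch, num_labels), targets in {0, 1}.
        probs = torch.sigmoid(beta * (logits + eta))      # soft label predictions
        tp = (probs * targets).sum()
        fp = (probs * (1 - targets)).sum()
        fn = ((1 - probs) * targets).sum()
        soft_f1 = 2 * tp / (2 * tp + fp + fn + 1e-8)
        return 1 - soft_f1                                # minimize 1 - soft F1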

IJCAI Conference 2021 Conference Paper

Unifying Online and Counterfactual Learning to Rank: A Novel Counterfactual Estimator that Effectively Utilizes Online Interventions (Extended Abstract)

  • Harrie Oosterhuis
  • Maarten de Rijke

State-of-the-art Learning to Rank (LTR) methods for optimizing ranking systems based on user interactions are divided into online approaches – that learn by direct interaction – and counterfactual approaches – that learn from historical interactions. We propose a novel intervention-aware estimator to bridge this online/counterfactual division. The estimator corrects for the effect of position bias, trust bias, and item-selection bias by using corrections based on the behavior of the logging policy and on online interventions: changes to the logging policy made during the gathering of click data. Our experimental results show that, unlike existing counterfactual LTR methods, the intervention-aware estimator can greatly benefit from online interventions. To the best of our knowledge, this is the first method that is shown to be highly effective in both online and counterfactual scenarios.

ECAI Conference 2020 Conference Paper

Bidirectional Scene Text Recognition with a Single Decoder

  • Maurits J. R. Bleeker
  • Maarten de Rijke

Scene Text Recognition (STR) is the problem of recognizing the correct word or character sequence in a cropped word image. To obtain more robust output sequences, the notion of bidirectional STR has been introduced. So far, bidirectional STRs have been implemented by using two separate decoders; one for left-to-right decoding and one for right-to-left. Having two separate decoders for almost the same task with the same output space is undesirable from a computational and optimization point of view. We introduce the Bidirectional Scene Text Transformer (Bi-STET), a novel bidirectional STR method with a single decoder for bidirectional text decoding. With its single decoder, Bi-STET outperforms methods that apply bidirectional decoding by using two separate decoders while also being more efficient than those methods. Furthermore, we achieve or beat state-of-the-art (SOTA) methods on all STR benchmarks with Bi-STET. Finally, we provide analyses and insights into the performance of Bi-STET.

AAAI Conference 2020 Conference Paper

RefNet: A Reference-Aware Network for Background Based Conversation

  • Chuan Meng
  • Pengjie Ren
  • Zhumin Chen
  • Christof Monz
  • Jun Ma
  • Maarten de Rijke

Existing conversational systems tend to generate generic responses. Recently, Background Based Conversations (BBCs) have been introduced to address this issue. Here, the generated responses are grounded in some background information. The proposed methods for BBCs are able to generate more informative responses; however, they either cannot generate natural responses or have difficulties in locating the right background information. In this paper, we propose a Reference-aware Network (RefNet) to address both issues. Unlike existing methods that generate responses token by token, RefNet incorporates a novel reference decoder that provides an alternative way to learn to directly select a semantic unit (e.g., a span containing complete semantic information) from the background. Experimental results show that RefNet significantly outperforms state-of-the-art methods in terms of both automatic and human evaluations, indicating that RefNet can generate more appropriate and human-like responses.

ECAI Conference 2020 Conference Paper

Retrospective and Prospective Mixture-of-Generators for Task-Oriented Dialogue Response Generation

  • Jiahuan Pei
  • Pengjie Ren
  • Christof Monz
  • Maarten de Rijke

Dialogue response generation (DRG) is a critical component of task-oriented dialogue systems (TDSs). Its purpose is to generate proper natural language responses given some context, e.g., historical utterances, system states, etc. State-of-the-art work focuses on how to better tackle DRG in an end-to-end way. Typically, such studies assume that each token is drawn from a single distribution over the output vocabulary, which may not always be optimal. Responses vary greatly with different intents, e.g., domains, system actions. We propose a novel mixture-of-generators network (MoGNet) for DRG, where we assume that each token of a response is drawn from a mixture of distributions. MoGNet consists of a chair generator and several expert generators. Each expert is specialized for DRG w.r.t. a particular intent. The chair coordinates multiple experts and combines the output they have generated to produce more appropriate responses. We propose two strategies to help the chair make better decisions, namely, a retrospective mixture-of-generators (RMoG) and a prospective mixture-of-generators (PMoG). The former only considers the historical expert-generated responses until the current time step while the latter also considers possible expert-generated responses in the future by encouraging exploration. In order to differentiate experts, we also devise a global-and-local (GL) learning scheme that forces each expert to be specialized towards a particular intent using a local loss and trains the chair and all experts to coordinate using a global loss. We carry out extensive experiments on the MultiWOZ benchmark dataset. MoGNet significantly outperforms state-of-the-art methods in terms of both automatic and human evaluations, demonstrating its effectiveness for DRG.

AAAI Conference 2020 Conference Paper

Thinking Globally, Acting Locally: Distantly Supervised Global-to-Local Knowledge Selection for Background Based Conversation

  • Pengjie Ren
  • Zhumin Chen
  • Christof Monz
  • Jun Ma
  • Maarten de Rijke

Background Based Conversations (BBCs) have been introduced to help conversational systems avoid generating overly generic responses. In a BBC, the conversation is grounded in a knowledge source. A key challenge in BBCs is Knowledge Selection (KS): given a conversational context, try to find the appropriate background knowledge (a text fragment containing related facts or comments, etc.) based on which to generate the next response. Previous work addresses KS by employing attention and/or pointer mechanisms. These mechanisms use a local perspective, i.e., they select a token at a time based solely on the current decoding state. We argue for the adoption of a global perspective, i.e., pre-selecting some text fragments from the background knowledge that could help determine the topic of the next response. We enhance KS in BBCs by introducing a Global-to-Local Knowledge Selection (GLKS) mechanism. Given a conversational context and background knowledge, we first learn a topic transition vector to encode the most likely text fragments to be used in the next response, which is then used to guide the local KS at each decoding timestamp. In order to effectively learn the topic transition vector, we propose a distantly supervised learning schema. Experimental results show that the GLKS model significantly outperforms state-of-the-art methods in terms of both automatic and human evaluation. More importantly, GLKS achieves this without requiring any extra annotations, which demonstrates its high degree of scalability.

UAI Conference 2019 Conference Paper

BubbleRank: Safe Online Learning to Re-Rank via Implicit Click Feedback

  • Chang Li 0003
  • Branislav Kveton
  • Tor Lattimore
  • Ilya Markov
  • Maarten de Rijke
  • Csaba Szepesvári
  • Masrour Zoghi

In this paper, we study the problem of safe online learning to re-rank, where user feedback is used to improve the quality of displayed lists. Learning to rank has traditionally been studied in two settings. In the offline setting, rankers are typically learned from relevance labels created by judges. This approach has generally become standard in industrial applications of ranking, such as search. However, this approach lacks exploration and thus is limited by the information content of the offline training data. In the online setting, an algorithm can experiment with lists and learn from feedback on them in a sequential fashion. Bandit algorithms are well-suited for this setting but they tend to learn user preferences from scratch, which results in a high initial cost of exploration. This poses an additional challenge of safe exploration in ranked lists. We propose BubbleRank, a bandit algorithm for safe re-ranking that combines the strengths of both the offline and online settings. The algorithm starts with an initial base list and improves it online by gradually exchanging higher-ranked less attractive items for lower-ranked more attractive items. We prove an upper bound on the n-step regret of BubbleRank that degrades gracefully with the quality of the initial base list. Our theoretical findings are supported by extensive experiments on a large-scale real-world click dataset.
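
The core exchange idea can be sketched as follows; the click statistic and the acceptance rule are simplified placeholders for the paper's confidence-based test.

    import random

    def bubblerank_step(ranking, click_wins):
        """One illustrative re-ranking pass: compare adjacent items and swap a pair only
        when accumulated click evidence says the lower item is more attractive.

        ranking: list of item ids, best first.
        click_wins: dict mapping (i, j) to the number of impressions in which item i was
                    clicked while item j, shown adjacent to it, was not (a hypothetical
                    statistic gathered from display logs).
        """
        start = random.choice([0, 1])   # randomize which adjacent pairs are compared
        for pos in range(start, len(ranking) - 1, 2):
            upper, lower = ranking[pos], ranking[pos + 1]
            if click_wins.get((lower, upper), 0) > click_wins.get((upper, lower), 0):
                ranking[pos], ranking[pos + 1] = lower, upper
        return ranking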

IJCAI Conference 2019 Conference Paper

Cascading Non-Stationary Bandits: Online Learning to Rank in the Non-Stationary Cascade Model

  • Chang Li
  • Maarten de Rijke

Non-stationarity appears in many online applications such as web search and advertising. In this paper, we study the online learning to rank problem in a non-stationary environment where user preferences change abruptly at an unknown moment in time. We consider the problem of identifying the K most attractive items and propose cascading non-stationary bandits, an online learning variant of the cascading model, where a user browses a ranked list from top to bottom and clicks on the first attractive item. We propose two algorithms for solving this non-stationary problem: CascadeDUCB and CascadeSWUCB. We analyze their performance and derive gap-dependent upper bounds on the n-step regret of these algorithms. We also establish a lower bound on the regret for cascading non-stationary bandits and show that both algorithms match the lower bound up to a logarithmic factor. Finally, we evaluate their performance on a real-world web search click dataset.
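
As an illustration, a sliding-window UCB statistic per item — forgetting observations older than the window so that abrupt preference changes are discarded, as in CascadeSWUCB — might be maintained like this; the exploration bonus is a generic choice, not necessarily the constant used in the paper's analysis.

    import math
    from collections import deque

    class SlidingWindowUCB:
        """Illustrative per-item statistics for a sliding-window cascading bandit."""

        def __init__(self, window=1000, c=0.5):
            self.window, self.c, self.obs = window, c, deque()

        def update(self, clicked):
            self.obs.append(1.0 if clicked else 0.0)
            if len(self.obs) > self.window:
                self.obs.popleft()       # forget observations outside the window

        def ucb(self, t):
            if not self.obs:
                return 1.0               # optimistic value for unexplored items
            n = len(self.obs)
            mean = sum(self.obs) / n
            return mean + self.c * math.sqrt(math.log(min(t, self.window)) / n)

The ranker would then display the K items with the highest UCB values at round t (t starting at 1) and update the statistics of the items examined by the user.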

AAAI Conference 2019 Conference Paper

Dialogue Generation: From Imitation Learning to Inverse Reinforcement Learning

  • Ziming Li
  • Julia Kiseleva
  • Maarten de Rijke

The performance of adversarial dialogue generation models relies on the quality of the reward signal produced by the discriminator. The reward signal from a poor discriminator can be very sparse and unstable, which may lead the generator to fall into a local optimum or to produce nonsense replies. To alleviate the first problem, we first extend a recently proposed adversarial dialogue generation method to an adversarial imitation learning solution. Then, in the framework of adversarial inverse reinforcement learning, we propose a new reward model for dialogue generation that can provide a more accurate and precise reward signal for generator training. We evaluate the performance of the resulting model with automatic metrics and human evaluations in two annotation settings. Our experimental results demonstrate that our model can generate more high-quality responses and achieve higher overall performance than the state-of-the-art.

AAAI Conference 2019 Conference Paper

RepeatNet: A Repeat Aware Neural Recommendation Machine for Session-Based Recommendation

  • Pengjie Ren
  • Zhumin Chen
  • Jing Li
  • Zhaochun Ren
  • Jun Ma
  • Maarten de Rijke

Recurrent neural networks for session-based recommendation have attracted a lot of attention recently because of their promising performance. Repeat consumption is a common phenomenon in many recommendation scenarios (e.g., e-commerce, music, and TV program recommendations), where the same item is re-consumed repeatedly over time. However, no previous studies have emphasized repeat consumption with neural networks. An effective neural approach is needed to decide when to perform repeat recommendation. In this paper, we incorporate a repeat-explore mechanism into neural networks and propose a new model, called RepeatNet, with an encoder-decoder structure. RepeatNet integrates a regular neural recommendation approach in the decoder with a new repeat recommendation mechanism that can choose items from a user’s history and recommend them at the right time. We report on extensive experiments on three benchmark datasets. RepeatNet outperforms state-of-the-art baselines on all three datasets in terms of MRR and Recall. Furthermore, as the dataset size and the repeat ratio increase, the improvements of RepeatNet over the baselines also increase, which demonstrates its advantage in handling repeat recommendation scenarios.

ICML Conference 2018 Conference Paper

Finding Influential Training Samples for Gradient Boosted Decision Trees

  • Boris Sharchilev
  • Yury Ustinovskiy
  • Pavel Serdyukov
  • Maarten de Rijke

We address the problem of finding influential training samples for a particular case of tree ensemble-based models, e.g., Random Forest (RF) or Gradient Boosted Decision Trees (GBDT). A natural way of formalizing this problem is studying how the model’s predictions change upon leave-one-out retraining, leaving out each individual training sample. Recent work has shown that, for parametric models, this analysis can be conducted in a computationally efficient way. We propose several ways of extending this framework to non-parametric GBDT ensembles under the assumption that tree structures remain fixed. Furthermore, we introduce a general scheme of obtaining further approximations to our method that balance the trade-off between performance and computational complexity. We evaluate our approaches on various experimental setups and use-case scenarios and demonstrate both the quality of our approach to finding influential training samples in comparison to the baselines and its computational efficiency.
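
The naive leave-one-out formalization the paper starts from can be written directly — retrain without sample i and compare predictions — as in the sketch below; the paper's contribution is a fast approximation of this quantity that avoids retraining, which this brute-force version does not attempt.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    def loo_influence(X, y, x_test, i, **gbdt_params):
        # Brute-force leave-one-out influence of training sample i on the prediction for x_test.
        full = GradientBoostingRegressor(random_state=0, **gbdt_params).fit(X, y)
        mask = np.arange(len(y)) != i
        loo = GradientBoostingRegressor(random_state=0, **gbdt_params).fit(X[mask], y[mask])
        x_test = np.asarray(x_test).reshape(1, -1)
        return full.predict(x_test)[0] - loo.predict(x_test)[0]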

NeurIPS Conference 2015 Conference Paper

Copeland Dueling Bandits

  • Masrour Zoghi
  • Zohar Karnin
  • Shimon Whiteson
  • Maarten de Rijke

A version of the dueling bandit problem is addressed in which a Condorcet winner may not exist. Two algorithms are proposed that instead seek to minimize regret with respect to the Copeland winner, which, unlike the Condorcet winner, is guaranteed to exist. The first, Copeland Confidence Bound (CCB), is designed for small numbers of arms, while the second, Scalable Copeland Bandits (SCB), works better for large-scale problems. We provide theoretical results bounding the regret accumulated by CCB and SCB, both substantially improving existing results. Such existing results either offer bounds of the form O(K log T) but require restrictive assumptions, or offer bounds of the form O(K^2 log T) without requiring such assumptions. Our results offer the best of both worlds: O(K log T) bounds without restrictive assumptions.
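
For concreteness, the Copeland winner of a known preference matrix is simply the arm with the most pairwise victories; CCB and SCB add confidence bounds so the winner can be found from noisy duels, which this sketch omits.

    import numpy as np

    def copeland_winner(p):
        """Copeland winner of a K x K preference matrix, where p[i, j] is the probability
        that arm i beats arm j. A Copeland winner always exists, even when no arm beats
        every other arm (i.e., no Condorcet winner exists)."""
        wins = (p > 0.5).sum(axis=1) - np.diag(p > 0.5)   # count victories, ignore self-comparisons
        return int(np.argmax(wins)), wins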

ECAI Conference 2014 Conference Paper

Detecting the Reputation Polarity of Microblog Posts

  • Cristina Garbacea
  • Manos Tsagkias
  • Maarten de Rijke

We address the task of detecting the reputation polarity of social media updates, that is, deciding whether the content of an update has positive or negative implications for the reputation of a given entity. Typical approaches to this task include sentiment lexicons and linguistic features. However, they fall short in the social media domain because of its unedited and noisy nature, and, more importantly, because reputation polarity is not only encoded in sentiment-bearing words but it is also embedded in other word usage. To this end, automatic methods for extracting discriminative features for reputation polarity detection can play a role. We propose a data-driven, supervised approach for extracting textual features, which we use to train a reputation polarity classifier. Experiments on the RepLab 2013 collection show that our model outperforms the state-of-the-art method based on sentiment analysis by 20% accuracy.

ICML Conference 2014 Conference Paper

Relative Upper Confidence Bound for the K-Armed Dueling Bandit Problem

  • Masrour Zoghi
  • Shimon Whiteson
  • Rémi Munos
  • Maarten de Rijke

This paper proposes a new method for the K-armed dueling bandit problem, a variation on the regular K-armed bandit problem that offers only relative feedback about pairs of arms. Our approach extends the Upper Confidence Bound algorithm to the relative setting by using estimates of the pairwise probabilities to select a promising arm and applying Upper Confidence Bound with the winner as a benchmark. We prove a sharp finite-time regret bound of order O(K log t) on a very general class of dueling bandit problems that matches a lower bound proven in (Yue et al., 2012). In addition, our empirical results using real data from an information retrieval application show that it greatly outperforms the state of the art.
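
A sketch of the optimistic selection underlying this approach — upper confidence bounds on pairwise win probabilities, with any arm whose bounds never drop below 1/2 kept as a potential champion — is given below; the bonus form and tie-breaking are illustrative assumptions rather than the exact algorithm.

    import numpy as np

    def rucb_candidate_arm(wins, t, alpha=0.51):
        """wins[i, j] = number of duels in which arm i beat arm j; t = current round (>= 2)."""
        n = wins + wins.T
        with np.errstate(divide="ignore", invalid="ignore"):
            mean = np.where(n > 0, wins / n, 0.5)
            bonus = np.where(n > 0, np.sqrt(alpha * np.log(t) / n), 1.0)
        ucb = np.minimum(mean + bonus, 1.0)
        np.fill_diagonal(ucb, 0.5)
        candidates = np.flatnonzero((ucb >= 0.5).all(axis=1))   # arms that might still be best
        if candidates.size == 0:
            return int(np.random.randint(len(wins)))
        return int(np.random.choice(candidates))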

ECAI Conference 2008 Conference Paper

Finding Key Bloggers, One Post At A Time

  • Wouter Weerkamp
  • Krisztian Balog
  • Maarten de Rijke

User generated content in general, and blogs in particular, form an interesting and relatively little explored domain for mining knowledge. We address the task of blog distillation: to find blogs that are principally devoted to a given topic, as opposed to blogs that merely happen to discuss the topic in passing. Working in the setting of statistical language modeling, we model the task by aggregating a blogger's blog posts to collect evidence of relevance to the topic and persistence of interest in the topic. This approach achieves state-of-the-art performance. On top of this baseline, we extend our model by incorporating a number of blog-specific features, concerning document structure, social structure, and temporal structure. These blog-specific features yield further improvements.

IJCAI Conference 2007 Conference Paper

  • Krisztian Balog
  • Maarten de Rijke

The profile of an individual is a record of the types and areas of skills of that individual ("topical profile") plus a description of her collaboration network ("social profile"). In this paper we define and formalize the task of automatically determining an expert profile of a person from a heterogeneous corpus made up of a large organization's intranet. We propose multiple models for addressing the topical profiling task. Our main methods build on ideas from information retrieval, while refinements bring in filtering (allowing an area into a person's profile only if she is among the top ranking experts in the area). An evaluation based on the W3C-corpus made available by TREC, shows significant improvements of the refined methods over the baseline. We apply our profiling algorithms to significantly enhance the performance of a state-of-the-art expert finding algorithm and to help users of an operational expert search system find the person they would contact, given a specific problem, topic or information need. Finally, we address the task of determining a social profile for a given person, using graph-based methods.

TIME Conference 2004 Conference Paper

CTL Model Checking for Processing Simple XPath Queries

  • Loredana Afanasiev
  • Massimo Franceschet
  • Maarten Marx
  • Maarten de Rijke

The eXtensible Markup Language (XML) was designed to describe the content of a document and its hierarchical structure, and the XML Path language (XPath) is a language for selecting elements from XML documents. There is a close connection between the query processing problem for XPath and the model checking problem for temporal logics. Both boil down to checking which nodes of a graph satisfy a property. We investigate the potential of a technique based on computation tree logic (CTL) model checking for evaluating queries expressed in (a subset of) XPath. To this aim, we isolate a simple fragment of XPath that is naturally embeddable into CTL. We report on experiments based on the model checker NuSMV, and compare our results with alternative academic XPath processors. We comment on the advantages and drawbacks of the application of our model checking-based approach to XPath processing.

TIME Conference 2003 Conference Paper

Hybrid Logics on Linear Structures: Expressivity and Complexity

  • Massimo Franceschet
  • Maarten de Rijke
  • Bernd-Holger Schlingloff

We investigate expressivity and complexity of hybrid logics on linear structures. Hybrid logics are an enrichment of modal logics with certain first-order features which are algorithmically well behaved. Therefore, they are well suited for the specification of certain properties of computational systems. We show that hybrid logics are more expressive than usual modal and temporal logics on linear structures, and exhibit a hierarchy of hybrid languages. We determine the complexities of the satisfiability problem for these languages and define an existential fragment of hybrid logic for which satisfiability is still NP-complete. Finally, we examine the linear time model checking problem for hybrid logics and its complexity.

KER Journal 1997 Journal Article

Formalisms for multi-agent systems

  • Mark d'Inverno
  • Michael Fisher
  • Alessio Lomuscio
  • Michael Luck
  • Maarten de Rijke
  • Mark Ryan
  • Michael Wooldridge

As computer scientists, our goals are motivated by the desire to improve computer systems in some way: making them easier to design and implement, more robust and less prone to error, easier to use, faster, cheaper, and so on. In the field of multi-agent systems, our goal is to build systems capable of flexible autonomous decision making, with societies of such systems cooperating with one another. There is a lot of formal theory in the area, but it is often not obvious what such theories should represent and what role the theory is intended to play. Theories of agents are often abstract and obtuse and not related to concrete computational models.

CSL Conference 1996 Conference Paper

A Proof System for Finite Trees

  • Patrick Blackburn
  • Wilfried Meyer-Viol
  • Maarten de Rijke

In this paper we introduce a description language for finite trees. Although we briefly note some of its intended applications, the main goal of the paper is to provide it with a sound and complete proof system. We do so using standard axioms from modal provability logic and modal logics of programs, and prove completeness by extending techniques due to Van Benthem and Meyer-Viol (1994) and Blackburn and Meyer-Viol (1994). We conclude with a proof of the EXPTIME-completeness of the satisfiability problem, and a discussion of issues related to complexity and theorem proving.