Author name cluster

Federico Cabitza

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
2 author rows

Possible papers

AIIM Journal 2026 Journal Article

Calibration-informed metrics for instance-level predictive reliability in medical AI

  • Federico Cabitza

Conventional performance metrics in clinical decision support systems, such as accuracy or sensitivity, fail to reflect the reliability of individual predictions, an essential concern for clinicians operating in high-stakes environments. We introduce a calibration-informed framework featuring two novel metrics: the Local Predictive Value (LPV) and the Credible Predictive Value (CPV). LPV estimates the empirical reliability of a prediction by assessing the observed correctness frequency in the neighborhood of its confidence score. CPV refines this estimate using a Bayesian approach, integrating global predictive values as priors to produce a posterior distribution over correctness probabilities. LPV offers a descriptive, data-driven view of local reliability, while CPV provides a belief-adjusted estimate that mitigates overfitting to sparse local data. Applied to benchmark medical imaging datasets, these metrics yielded locally adaptive, interpretable reliability estimates. Divergences between LPV and CPV identified cases where local evidence was insufficient or misleading, highlighting how Bayesian smoothing stabilizes the estimate in such cases. By combining local calibration with Bayesian inference, LPV and CPV advance the development of medical AI systems that are not only accurate but also interpretable and trustworthy at the individual case level.
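
The listing gives no formulas for LPV and CPV, so the following is only a minimal sketch of the idea described in the abstract, not the authors' definitions. It assumes a fixed-radius neighborhood around the confidence score (`radius`), a Beta prior centred on the overall correctness rate, and a pseudo-count `prior_strength`; all three are hypothetical choices made for illustration.

```python
from typing import Sequence, Tuple


def lpv_and_cpv(
    conf: float,
    val_conf: Sequence[float],
    val_correct: Sequence[bool],
    radius: float = 0.05,          # assumed neighborhood half-width
    prior_strength: float = 10.0,  # assumed pseudo-count of the prior
) -> Tuple[float, float]:
    """Return (local estimate, Bayesian-smoothed estimate) for one prediction."""
    # Local evidence: validation predictions with a similar confidence score.
    hits = total = 0
    for c, ok in zip(val_conf, val_correct):
        if abs(c - conf) <= radius:
            total += 1
            hits += int(ok)

    # LPV-like value: observed correctness frequency in the neighborhood.
    lpv = hits / total if total else float("nan")

    # Prior centred on the global correctness rate (a simplification of
    # "global predictive values as priors").
    global_rate = sum(val_correct) / len(val_correct)
    alpha = prior_strength * global_rate + hits
    beta = prior_strength * (1.0 - global_rate) + (total - hits)

    # CPV-like value: posterior mean of a Beta(alpha, beta) distribution.
    cpv = alpha / (alpha + beta)
    return lpv, cpv
```

With few neighbors the smoothed value stays close to the global rate, and with no neighbors it equals it, which is the kind of divergence between the two estimates the abstract refers to.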

AAAI Conference 2026 Conference Paper

Too Sure for Our Own Good: A User Study on AI Confidence and Human Reliance

  • Caterina Fregosi
  • Lucia Vicente
  • Andrea Campagner
  • Federico Cabitza

Achieving appropriate human reliance on Artificial Intelligence (AI) systems remains a central challenge in Human-Computer Interaction. Confidence scores—indicators of an AI system’s certainty in its recommendations—have been proposed as a means to help users calibrate their trust and reliance on AI Decision Support Systems (DSS). However, limited research has explored how well-calibrated versus miscalibrated confidence scores affect human decision-making. We report a study examining the effects of confidence calibration on user reliance, decision accuracy, and perceived utility of an AI DSS. In a within-subjects experiment involving 184 participants solving logic puzzles, we found that well-calibrated confidence scores significantly improved decision accuracy (+20%, 95% CI: [0.18, 0.23]), whereas miscalibrated scores yielded minimal accuracy gains (+2%, 95% CI: [-0.00, 0.04]) and increased vulnerability to automation bias and conservatism bias. Participants were more likely to accept AI recommendations when high confidence was expressed, even when those recommendations were incorrect, resulting in errors. Conversely, miscalibrated and low-confidence recommendations increased conservatism bias, leading users to reject even accurate AI suggestions. Perceived utility of the AI system was higher when confidence levels were high (p < 0.001) and when confidence was well-calibrated (p = 0.002). These findings underscore the importance of designing AI systems with properly calibrated confidence cues to improve human-AI collaboration and mitigate reliance-related biases.

ECAI Conference 2025 Conference Paper

An Evidence-Theoretic Framework for Online Learning from Expert Advice

  • Andrea Campagner
  • Francesca Arredondo
  • Davide Ciucci
  • Federico Cabitza

The use of belief function theory (BFT) in machine learning has gained attention as researchers seek more principled foundations for decision-making in uncertain environments. However, research has mostly focused on the setting of batch learning. In contrast, and to our knowledge for the first time in the literature, this article studies the application of BFT to online (machine) learning. Within this context, online learning from expert advice (LEA) offers a framework where learners iteratively update their predictions based on experts’ input and (adversarially labeled) observed outcomes. Despite extensive study and strong theoretical results, the epistemological underpinnings of LEA remain largely heuristic. This work addresses this gap by proposing BFT as a formal foundation for LEA. We report a theoretical and algorithmic integration of BFT into LEA, showing that classical LEA algorithms such as Halving and Weighted Majority can be derived as special cases of evidential reasoning. We further introduce two novel LEA algorithms, Evidential Halving and Evidential Weighted Majority, which fully exploit BFT and support cautious prediction through abstention. Under mild assumptions, these new algorithms achieve improved regret bounds over traditional methods. These findings open a new direction in online learning by leveraging the full expressive power of BFT to design theoretically grounded algorithms.
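
For readers unfamiliar with the classical baselines named in the abstract, here is a minimal sketch of the standard deterministic Weighted Majority algorithm for binary predictions. It is the textbook method, not the evidential variants proposed in the paper, whose details are not given in this listing. The function name and the penalty parameter `beta` are illustrative.

```python
from typing import List, Sequence, Tuple


def weighted_majority(
    expert_preds: Sequence[Sequence[int]],
    outcomes: Sequence[int],
    beta: float = 0.5,
) -> Tuple[int, List[float]]:
    """Run classical Weighted Majority on 0/1 predictions.

    expert_preds[t][i] is expert i's prediction at round t.
    Returns (number of learner mistakes, final expert weights).
    """
    weights = [1.0] * len(expert_preds[0])
    mistakes = 0
    for preds, y in zip(expert_preds, outcomes):
        # Predict with the weighted vote of the experts (ties go to 1).
        vote_one = sum(w for w, p in zip(weights, preds) if p == 1)
        vote_zero = sum(weights) - vote_one
        guess = 1 if vote_one >= vote_zero else 0
        mistakes += int(guess != y)
        # Multiply the weight of every expert that erred by beta.
        weights = [w * beta if p != y else w for w, p in zip(weights, preds)]
    return mistakes, weights
```

Setting `beta=0.0` removes every expert that errs, which recovers the Halving algorithm when at least one expert is always correct.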

AIIM Journal 2024 Journal Article

Never tell me the odds: Investigating pro-hoc explanations in medical decision making

  • Federico Cabitza
  • Chiara Natali
  • Lorenzo Famiglini
  • Andrea Campagner
  • Valerio Caccavella
  • Enrico Gallazzi

This paper examines a kind of explainable AI centered on what we term pro-hoc explanations: a form of support that offers alternative explanations (one for each possible outcome) instead of a single post-hoc explanation following specific advice. Specifically, our support mechanism utilizes explanations by examples, featuring analogous cases for each category in a binary setting. Pro-hoc explanations are an instance of what we call frictional AI, a general class of decision support aimed at achieving a useful compromise between the increase of decision effectiveness and the mitigation of cognitive risks, such as over-reliance, automation bias and deskilling. To illustrate an instance of frictional AI, we conducted an empirical user study to investigate its impact on the task of radiological detection of vertebral fractures in x-rays. Our study engaged 16 orthopedists in a ‘human-first, second-opinion’ interaction protocol. In this protocol, clinicians first made initial assessments of the x-rays without AI assistance and then provided their final diagnosis after considering the pro-hoc explanations. Our findings indicate that physicians, particularly those with less experience, perceived pro-hoc XAI support as significantly beneficial, even though it did not notably enhance their diagnostic accuracy. However, their increased confidence in final diagnoses suggests a positive overall impact. Given the promisingly high effect size observed, our results advocate for further research into pro-hoc explanations specifically, and into the broader concept of frictional AI.

AIIM Journal 2023 Journal Article

Rams, hounds and white boxes: Investigating human–AI collaboration protocols in medical diagnosis

  • Federico Cabitza
  • Andrea Campagner
  • Luca Ronzio
  • Matteo Cameli
  • Giulia Elena Mandoli
  • Maria Concetta Pastore
  • Luca Maria Sconfienza
  • Duarte Folgado

In this paper, we study human–AI collaboration protocols, a design-oriented construct aimed at establishing and evaluating how humans and AI can collaborate in cognitive tasks. We applied this construct in two user studies involving 12 specialist radiologists (the knee MRI study) and 44 ECG readers of varying expertise (the ECG study), who evaluated 240 and 20 cases, respectively, in different collaboration configurations. We confirm the utility of AI support but find that XAI can be associated with a “white-box paradox”, producing a null or detrimental effect. We also find that the order of presentation matters: AI-first protocols are associated with higher diagnostic accuracy than human-first protocols, and with higher accuracy than both humans and AI alone. Our findings identify the best conditions for AI to augment human diagnostic skills, rather than trigger dysfunctional responses and cognitive biases that can undermine decision effectiveness.

AAAI Conference 2023 Conference Paper

Toward a Perspectivist Turn in Ground Truthing for Predictive Computing

  • Federico Cabitza
  • Andrea Campagner
  • Valerio Basile

Most current Artificial Intelligence applications are based on supervised Machine Learning (ML), which is ultimately grounded in data annotated by small teams of experts or large ensembles of volunteers. Annotations are often aggregated by majority vote; however, recent evaluation studies have shown this to be often problematic. In this article, we describe and advocate for a different paradigm, which we call perspectivism: this counters the removal of disagreement and, consequently, the assumption of correctness of traditionally aggregated gold-standard datasets, and proposes the adoption of methods that preserve divergence of opinions and integrate multiple perspectives in the ground truthing process of ML development. Drawing on the previous work that inspired it, mainly from the crowdsourcing and multi-rater labeling settings, we survey the state of the art and describe the potential of our proposal not only for the more subjective tasks (e.g., those related to human language) but also for tasks commonly understood as objective (e.g., medical decision making). We present the main benefits of adopting a perspectivist stance in ML, as well as possible disadvantages, and various ways in which such a stance can be implemented in practice. Finally, we share a set of recommendations and outline a research agenda to advance the perspectivist stance in ML.
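
The contrast between majority-vote aggregation and preserving disagreement can be illustrated with a small sketch. The soft-label representation below (per-class vote fractions) is just one simple way to keep multiple perspectives and is an assumption made for illustration; the abstract does not commit to a specific method, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def majority_label(votes: Sequence[Hashable]) -> Hashable:
    """Collapse all annotations into the single most frequent label."""
    return Counter(votes).most_common(1)[0][0]


def soft_label(votes: Sequence[Hashable],
               classes: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Keep the disagreement: the fraction of raters choosing each class."""
    counts = Counter(votes)
    return {c: counts[c] / len(votes) for c in classes}


# Three raters label the same case; one of them disagrees.
votes = ["fracture", "fracture", "normal"]
hard = majority_label(votes)                     # dissent is discarded
soft = soft_label(votes, ["fracture", "normal"]) # dissent is preserved
```

A model trained on `soft` can be penalized for being overconfident on cases where the raters themselves were split, whereas `hard` presents the same case as if it were unambiguous.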

ECAI Conference 2023 Conference Paper

Towards a Rigorous Calibration Assessment Framework: Advancements in Metrics, Methods, and Use

  • Lorenzo Famiglini
  • Andrea Campagner
  • Federico Cabitza

Calibration is paramount in developing and validating Machine Learning models, particularly in sensitive domains such as medicine. Despite its significance, existing calibration metrics have known shortcomings in their interpretation and theoretical properties. This article introduces a novel and comprehensive framework to assess the calibration of Machine and Deep Learning models that addresses these limitations. The proposed framework is based on a modification of the Expected Calibration Error (ECE), called the Estimated Calibration Index (ECI), which builds on and extends prior research. ECI was initially formulated for binary settings, and we adapted it to fit multiclass settings. ECI offers a more nuanced and informative measure, both locally and globally, of a model’s tendency towards over/underconfidence. The paper first outlines the issues related to the prevalent definitions of ECE, including potential biases that may arise in the evaluation of their measures. Then, we present the results of a series of experiments conducted to demonstrate the effectiveness of the proposed framework in supporting a more accurate understanding of a model’s calibration level. Additionally, we discuss how to address and potentially mitigate some biases in calibration assessment.
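
As background for the metric this paper modifies, the sketch below computes the commonly used equal-width binned Expected Calibration Error. It is the standard ECE, not the ECI, whose definition is not given in this listing; the function name and the default of 10 bins are illustrative choices.

```python
from typing import Sequence


def expected_calibration_error(
    conf: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width binned ECE: weighted mean of |accuracy - confidence|."""
    n = len(conf)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(conf, correct):
        # Confidence 1.0 is placed in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, ok))

    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece
```

Because the absolute gap is taken per bin, this quantity does not indicate whether a model is over- or underconfident, and its value depends on the binning; these are among the kinds of limitations the abstract alludes to.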