Arrow Research

Author name cluster

Francesco Bacchiocchi

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

11 papers
2 author rows

Possible papers (11)

ICML 2025 · Conference Paper

Contract Design Under Approximate Best Responses

  • Francesco Bacchiocchi
  • Jiarui Gan
  • Matteo Castiglioni
  • Alberto Marchesi 0001
  • Nicola Gatti 0001

Principal-agent problems model scenarios where a principal aims at incentivizing an agent to take costly, unobservable actions through the provision of payments. Such interactions are ubiquitous in several real-world applications, ranging from blockchain to the delegation of machine learning tasks. In this paper, we initiate the study of hidden-action principal-agent problems under approximate best responses, in which the agent may select any action that is not too suboptimal given the principal’s payment scheme (a.k.a. contract). Our main result is a polynomial-time algorithm to compute an optimal contract under approximate best responses. This is perhaps surprising, as computing an optimal commitment under approximate best responses is known to be computationally intractable in Stackelberg games. We also investigate the learnability of contracts under approximate best responses, by providing a no-regret learning algorithm for a natural application scenario where the principal does not know anything about the environment.
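The approximate-best-response condition is easy to state concretely. Below is a minimal Python sketch (toy numbers and names, not the paper's algorithm) of the set of ε-best responses available to the agent under a given contract, assuming finite actions and outcomes.

```python
import numpy as np

def epsilon_best_responses(P, costs, contract, eps):
    """P[a, o]: probability of outcome o under action a;
    costs[a]: agent's cost of action a; contract[o]: payment on outcome o."""
    utilities = P @ contract - costs  # agent's expected utility per action
    # Any action within eps of the best one is an approximate best response.
    return np.flatnonzero(utilities >= utilities.max() - eps)

# Toy instance: 3 actions, 2 outcomes.
P = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])
costs = np.array([0.0, 0.1, 0.3])
contract = np.array([0.0, 0.5])  # pay 0.5 on the favorable outcome
print(epsilon_best_responses(P, costs, contract, eps=0.05))  # -> [1 2]
```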

NeurIPS 2025 · Conference Paper

Markov Persuasion Processes: Learning to Persuade From Scratch

  • Francesco Bacchiocchi
  • Francesco Emanuele Stradi
  • Matteo Castiglioni
  • Alberto Marchesi
  • Nicola Gatti

In Bayesian persuasion, an informed sender strategically discloses information to a receiver so as to persuade them to undertake desirable actions. Recently, Markov persuasion processes (MPPs) have been introduced to capture sequential scenarios where a sender faces a stream of myopic receivers in a Markovian environment. The MPPs studied so far in the literature suffer from issues that prevent them from being fully operational in practice, e.g., they assume that the sender knows the receivers' rewards. We fix such issues by addressing MPPs where the sender has no knowledge about the environment. We design a learning algorithm for the sender, working with partial feedback. We prove that its regret with respect to an optimal information-disclosure policy grows sublinearly in the number of episodes, as is the case for the loss in persuasiveness cumulated while learning. Moreover, we provide lower bounds for our setting matching the guarantees of our algorithm.

NeurIPS 2025 · Conference Paper

Online Bilateral Trade With Minimal Feedback: Don’t Waste Seller’s Time

  • Francesco Bacchiocchi
  • Matteo Castiglioni
  • Roberto Colomboni
  • Alberto Marchesi

Online learning algorithms for designing optimal bilateral trade mechanisms have recently received significant attention. This paper addresses a key inefficiency in prior two-bit feedback models, which synchronously query both the buyer and the seller for their willingness to trade. This approach is inherently inefficient as it offers a trade to the seller even if the buyer rejects the offer. We propose an asynchronous mechanism that queries the seller only if the buyer has already accepted the offer. Consequently, the mechanism receives one bit of feedback from the buyer and a "censored" bit from the seller---a signal richer than the standard one-bit (trade/no-trade) feedback, but less informative than the two-bit model. Assuming independent valuations with bounded densities---the same distributional conditions underlying the two-bit results of Cesa-Bianchi et al. [2024a]---we design an algorithm that achieves $\tilde{O}(T^{2/3})$ regret against the best fixed price in hindsight. This matches the lower bound for the strictly richer two-bit model, showing that our mechanism elicits the minimal feedback necessary to attain optimal rates.
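The round structure described above can be sketched directly. The toy Python round below (illustrative names and prices, not the paper's exact protocol) shows how the seller's bit is censored whenever the buyer rejects.

```python
def trade_round(price, buyer_value, seller_value):
    buyer_accepts = buyer_value >= price   # one bit, always observed
    if not buyer_accepts:
        # Seller is never queried: their bit stays censored.
        return {"buyer": False, "seller": None, "trade": False}
    seller_accepts = seller_value <= price  # observed only on buyer acceptance
    return {"buyer": True, "seller": seller_accepts, "trade": seller_accepts}

print(trade_round(price=0.5, buyer_value=0.7, seller_value=0.4))  # trade occurs
print(trade_round(price=0.5, buyer_value=0.3, seller_value=0.4))  # seller undisturbed
```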

NeurIPS 2024 · Conference Paper

Bandits with Ranking Feedback

  • Davide Maran
  • Francesco Bacchiocchi
  • Francesco Emanuele Stradi
  • Matteo Castiglioni
  • Nicola Gatti
  • Marcello Restelli

In this paper, we introduce a novel variation of multi-armed bandits called bandits with ranking feedback. Unlike traditional bandits, this variation provides feedback to the learner that allows them to rank the arms based on previous pulls, without quantifying numerically the difference in performance. This type of feedback is well-suited for scenarios where the arms' values cannot be precisely measured using metrics such as monetary scores, probabilities, or occurrences. Common examples include human preferences in matchmaking problems. Furthermore, its investigation answers the theoretical question of whether numerical rewards are crucial in bandit settings. In particular, we study the problem of designing no-regret algorithms with ranking feedback both in the stochastic and adversarial settings. We show that, with stochastic rewards, differently from what happens with non-ranking feedback, no algorithm can suffer a logarithmic regret in the time horizon $T$ in the instance-dependent case. Furthermore, we provide two algorithms. The first, namely DREE, guarantees a superlogarithmic regret in $T$ in the instance-dependent case, thus matching our lower bound, while the second, namely R-LPE, guarantees a regret of $\widetilde{\mathcal{O}}(\sqrt{T})$ in the instance-independent case. Remarkably, we show that no algorithm can have an optimal regret bound in both the instance-dependent and instance-independent cases. Finally, we prove that no algorithm can achieve a sublinear regret when the rewards are adversarial.
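To make the feedback model concrete, here is a small Python simulation (illustrative only, not DREE or R-LPE), assuming the ranking is induced by the empirical means of past pulls.

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.3, 0.5, 0.7])  # hidden Bernoulli means
totals = np.zeros(3)               # hidden cumulative rewards
counts = np.zeros(3)               # hidden pull counts

def pull(arm):
    totals[arm] += rng.binomial(1, means[arm])
    counts[arm] += 1
    means_hat = np.divide(totals, counts, out=np.zeros(3), where=counts > 0)
    # Feedback: arms ordered from best to worst, with no numbers attached.
    return list(np.argsort(-means_hat))

ranking = None
for t in range(30):
    ranking = pull(t % 3)          # round-robin pulls for the demo
print("ranking after 30 pulls:", ranking)
```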

ICLR 2024 · Conference Paper

Learning Optimal Contracts: How to Exploit Small Action Spaces

  • Francesco Bacchiocchi
  • Matteo Castiglioni
  • Alberto Marchesi 0001
  • Nicola Gatti 0001

We study principal-agent problems in which a principal commits to an outcome-dependent payment scheme---called contract---in order to induce an agent to take a costly, unobservable action leading to favorable outcomes. We consider a generalization of the classical (single-round) version of the problem in which the principal interacts with the agent by committing to contracts over multiple rounds. The principal has no information about the agent, and they have to learn an optimal contract by only observing the outcome realized at each round. We focus on settings in which the size of the agent's action space is small. We design an algorithm that learns an approximately-optimal contract with high probability in a number of rounds polynomial in the size of the outcome space, when the number of actions is constant. Our algorithm solves an open problem by Zhu et al. [2022]. Moreover, it can also be employed to provide a $\widetilde{\mathcal{O}}(T^{4/5})$ regret bound in the related online learning setting in which the principal aims at maximizing their cumulative utility, thus considerably improving previously-known regret bounds.
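The underlying interaction protocol can be sketched as follows (a toy Python loop with made-up numbers, not the paper's learning algorithm): the principal commits to a contract, the agent best-responds out of view, and only the realized outcome is observed.

```python
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.9, 0.1], [0.2, 0.8]])  # hidden outcome distributions per action
costs = np.array([0.0, 0.2])            # hidden action costs
reward = np.array([0.0, 1.0])           # principal's reward per outcome

def play_round(contract):
    a = int(np.argmax(P @ contract - costs))  # agent best-responds, unobserved
    o = rng.choice(len(contract), p=P[a])     # only this outcome is observed
    return o, reward[o] - contract[o]         # principal's realized utility

for t in range(5):
    print(play_round(contract=np.array([0.0, 0.4])))
```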

NeurIPS 2024 · Conference Paper

Online Bayesian Persuasion Without a Clue

  • Francesco Bacchiocchi
  • Matteo Bollini
  • Matteo Castiglioni
  • Alberto Marchesi
  • Nicola Gatti

We study online Bayesian persuasion problems in which an informed sender repeatedly faces a receiver with the goal of influencing their behavior through the provision of payoff-relevant information. Previous works assume that the sender has knowledge about either the prior distribution over states of nature or the receiver's utilities, or both. We relax such unrealistic assumptions by considering settings in which the sender knows nothing about either the prior or the receiver. We design an algorithm that achieves sublinear---in the number of rounds $T$---regret with respect to an optimal signaling scheme, and we also provide a collection of lower bounds showing that the guarantees of such an algorithm are tight. Our algorithm works by searching a suitable space of signaling schemes in order to learn the receiver's best responses. To do this, we leverage a non-standard representation of signaling schemes that allows us to cleverly overcome the challenge of knowing nothing about the prior over states of nature and the receiver's utilities. Finally, our results also allow us to derive lower/upper bounds on the sample complexity of learning signaling schemes in a related Bayesian persuasion PAC-learning problem.
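For reference, the underlying one-shot model is simple to write down. The sketch below (toy prior, scheme, and utilities; the paper's algorithm operates without knowing any of these) shows how a signaling scheme induces a posterior to which the receiver best-responds.

```python
import numpy as np

prior = np.array([0.6, 0.4])    # distribution over 2 states of nature
scheme = np.array([[0.9, 0.1],  # P(signal | state): the signaling scheme
                   [0.3, 0.7]])
U_recv = np.array([[1.0, 0.0],  # receiver utility[state, action]
                   [0.0, 1.0]])

def receiver_best_response(signal):
    post = prior * scheme[:, signal]
    post /= post.sum()                    # Bayesian posterior over states
    return int(np.argmax(post @ U_recv))  # best response to the posterior

for s in range(2):
    print(f"signal {s}: receiver plays action {receiver_best_response(s)}")
```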

IJCAI 2024 · Conference Paper

Online Learning with Off-Policy Feedback in Adversarial MDPs

  • Francesco Bacchiocchi
  • Francesco Emanuele Stradi
  • Matteo Papini
  • Alberto Maria Metelli
  • Nicola Gatti

In this paper, we face the challenge of online learning in adversarial Markov decision processes with off-policy feedback. In this setting, the learner chooses a policy, but, differently from the traditional on-policy setting, the environment is explored by means of a different, fixed, and possibly unknown policy (named the colleague's policy). The off-policy feedback presents an additional issue that is not present in traditional settings: the learner is charged with the regret of its chosen policy but observes only the rewards gained by the colleague's policy. First, we present a lower bound for the proposed setting, showing that the optimal sublinear regret must depend on the dissimilarity between the optimal policy in hindsight and the colleague's policy. Then, we propose novel algorithms that, by employing pessimistic estimators---commonly adopted in the offline reinforcement learning literature---ensure sublinear regret bounds depending on the desired dissimilarity, even when the colleague's policy is unknown.
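As a rough illustration of the pessimistic-estimator idea, the sketch below works in a one-state (bandit-like) simplification: it importance-weights the rewards observed under the colleague's policy and subtracts an uncertainty penalty. This is a generic construction under assumed names, not the paper's estimator.

```python
import numpy as np

def pessimistic_value(pi, pi_b, actions, rewards, beta=1.0):
    """pi, pi_b: action distributions of the learner's and colleague's policies;
    actions, rewards: data generated by the colleague's policy pi_b."""
    w = pi[actions] / pi_b[actions]  # importance weights
    est = np.mean(w * rewards)       # unbiased off-policy estimate
    # Pessimism: subtract a confidence-style penalty; it grows with the
    # spread of the weights, i.e., with the dissimilarity of pi from pi_b.
    return est - beta * np.std(w * rewards) / np.sqrt(len(rewards))

rng = np.random.default_rng(2)
pi_b = np.array([0.5, 0.5])  # colleague's policy
pi = np.array([0.9, 0.1])    # learner's policy
actions = rng.choice(2, size=1000, p=pi_b)
rewards = np.where(actions == 0, 0.8, 0.2) + rng.normal(0.0, 0.05, size=1000)
print(pessimistic_value(pi, pi_b, actions, rewards))  # close to, but below, 0.74
```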

EWRL 2023 · Workshop Paper

Online Adversarial MDPs with Off-Policy Feedback and Known Transitions

  • Francesco Bacchiocchi
  • Francesco Emanuele Stradi
  • Matteo Papini
  • Alberto Maria Metelli
  • Nicola Gatti

In this paper, we face the challenge of online learning in adversarial Markov decision processes with known transitions and off-policy feedback. In this setting, the learner chooses a policy, but, differently from the traditional on-policy setting, the environment is explored by means of a different, fixed, and possibly unknown policy (named the colleague's policy), whose losses are revealed to the learner. The off-policy feedback presents an additional technical issue that is not present in traditional exploration-exploitation trade-off problems: the learner is charged with the regret of its chosen policy (w.r.t. a comparator policy) but observes only the losses suffered by the colleague's policy. Nevertheless, we propose novel algorithms that, by employing pessimistic estimators---commonly adopted in the offline reinforcement learning literature---ensure sublinear regret bounds depending on the dissimilarity between any comparator policy and the colleague's policy, even when the latter is unknown.

EWRL 2023 · Workshop Paper

Online Learning in Autoregressive Dynamics

  • Francesco Bacchiocchi
  • Gianmarco Genalti
  • Davide Maran
  • Marco Mussi
  • Marcello Restelli
  • Nicola Gatti
  • Alberto Maria Metelli

Autoregressive processes naturally arise in a large variety of real-world scenarios, including stock markets, sales forecasting, weather prediction, advertising, and pricing. When facing a sequential decision-making problem in such a context, the temporal dependence between consecutive observations should be properly accounted for to guarantee convergence to the optimal policy. In this work, we propose a novel online learning setting, namely, Autoregressive Bandits (ARBs), in which the observed reward is governed by an autoregressive process of order $k$, whose parameters depend on the chosen action. We show that, under mild assumptions on the reward process, the optimal policy can be conveniently computed. Then, we devise a new optimistic regret minimization algorithm, namely, AutoRegressive Upper Confidence Bound (AR-UCB), that suffers sublinear regret of order $\widetilde{\mathcal{O}} \left( \frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^2}\right)$, where $T$ is the optimization horizon, $n$ is the number of actions, and $\Gamma < 1$ is a stability index of the process. Finally, we empirically evaluate our algorithm in both synthetic and real-world domains, illustrating its advantages w.r.t. relevant bandit baselines.
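The reward model is straightforward to simulate. The Python sketch below (illustrative coefficients, not AR-UCB itself) generates rewards from an order-$k$ autoregressive process whose parameters depend on the chosen action, with coefficients kept small so the process is stable.

```python
import numpy as np

rng = np.random.default_rng(3)
k = 2
# Per-action AR parameters (bias, coefficients); small coefficients keep the
# process stable, mirroring the Gamma < 1 condition.
gamma = {0: (0.5, np.array([0.4, 0.2])),
         1: (0.8, np.array([0.1, 0.1]))}

history = [0.0] * k
for t in range(10):
    a = int(rng.integers(2))             # arbitrary policy for the demo
    bias, coefs = gamma[a]
    past = np.array(history[-k:][::-1])  # most recent observation first
    z = bias + coefs @ past + rng.normal(0.0, 0.01)
    history.append(z)
    print(f"t={t} action={a} reward={z:.3f}")
```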

IJCAI 2022 · Conference Paper

Public Signaling in Bayesian Ad Auctions

  • Francesco Bacchiocchi
  • Matteo Castiglioni
  • Alberto Marchesi
  • Giulia Romano
  • Nicola Gatti

We study signaling in Bayesian ad auctions, in which bidders' valuations depend on a random, unknown state of nature. The auction mechanism has complete knowledge of the actual state of nature, and it can send signals to bidders so as to disclose information about the state and increase revenue. For instance, a state may collectively encode some features of the user that are known to the mechanism only, since the latter has access to data sources inaccessible to the bidders. We study the problem of computing how the mechanism should send signals to bidders in order to maximize revenue. While this problem has already been addressed in the easier setting of second-price auctions, to the best of our knowledge, our work is the first to explore ad auctions with more than one slot. In this paper, we focus on public signaling and VCG mechanisms, under which bidders truthfully report their valuations. We start with a negative result, showing that, in general, the problem does not admit a PTAS unless P = NP, even when bidders' valuations are known to the mechanism. The rest of the paper is devoted to settings in which such a negative result can be circumvented. First, we prove that, with known valuations, the problem can indeed be solved in polynomial time when either the number of states $d$ or the number of slots $m$ is fixed. Moreover, in the same setting, we provide an FPTAS for the case in which bidders are single-minded, but $d$ and $m$ can be arbitrary. Then, we switch to the random valuations setting, in which valuations are randomly drawn according to some probability distribution. In this case, we show that the problem admits an FPTAS, a PTAS, and a QPTAS, when, respectively, $d$ is fixed, $m$ is fixed, and bidders' valuations are bounded away from zero.
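To fix ideas on public signaling, the sketch below (toy prior, scheme, and valuations; the slot allocation via VCG is omitted) computes the common posterior induced by a public signal and each bidder's posterior-expected valuation.

```python
import numpy as np

prior = np.array([0.5, 0.5])    # 2 states of nature
scheme = np.array([[0.8, 0.2],  # P(signal | state): public signaling scheme
                   [0.2, 0.8]])
vals = np.array([[1.0, 0.2],    # valuation[bidder, state]
                 [0.3, 0.9]])

def posterior_valuations(signal):
    post = prior * scheme[:, signal]
    post /= post.sum()  # common posterior induced by the public signal
    return vals @ post  # each bidder's posterior-expected valuation

for s in range(2):
    print(f"signal {s}: expected valuations {posterior_valuations(s)}")
```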