Author name cluster

Alexandre Drouin

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

17 papers
2 author rows

Possible papers

NeurIPS Conference 2025 Conference Paper

Causal Differentiating Concepts: Interpreting LM Behavior via Causal Representation Learning

  • Navita Goyal
  • Hal Daumé
  • Alexandre Drouin
  • Dhanya Sridhar

Language model activations entangle concepts that mediate their behavior, making it difficult to interpret these factors, which has implications for generalizability and robustness. We introduce an approach for disentangling these concepts without supervision. Existing methods for concept discovery often rely on external labels, contrastive prompts, or known causal structures, which limits their scalability and biases them toward predefined, easily annotatable features. In contrast, we propose a new unsupervised algorithm that identifies causal differentiating concepts—interpretable latent directions in LM activations that must be changed to elicit a different model behavior. These concepts are discovered using a constrained contrastive learning objective, guided by the insight that eliciting a target behavior requires only sparse changes to the underlying concepts. We formalize this notion and show that, under a particular assumption about the sparsity of these causal differentiating concepts, our method learns disentangled representations that align with human-interpretable factors influencing LM decisions. We empirically show the ability of our method to recover ground-truth causal factors in synthetic and semi-synthetic settings. Additionally, we illustrate the utility of our method through a case study on refusal behavior in language models. Our approach offers a scalable and interpretable lens into the internal workings of LMs, providing a principled foundation for interpreting language model behavior.

ICML Conference 2025 Conference Paper

Context is Key: A Benchmark for Forecasting with Essential Textual Information

  • Andrew Robert Williams
  • Arjun Ashok
  • Étienne Marcotte
  • Valentina Zantedeschi
  • Jithendaraa Subramanian
  • Roland Riachi
  • James Requeima
  • Alexandre Lacoste

Forecasting is a critical task in decision-making across numerous domains. While historical numerical data provide a start, they fail to convey the complete context for reliable and accurate predictions. Human forecasters frequently rely on additional information, such as background knowledge and constraints, which can efficiently be communicated through natural language. However, in spite of recent progress with LLM-based forecasters, their ability to effectively integrate this textual information remains an open question. To address this, we introduce "Context is Key" (CiK), a time-series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context, requiring models to integrate both modalities; crucially, every task in CiK requires understanding textual context to be solved successfully. We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters, and propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark. Our experiments highlight the importance of incorporating contextual information, demonstrate surprising performance when using LLM-based forecasting models, and also reveal some of their critical shortcomings. This benchmark aims to advance multimodal forecasting by promoting models that are both accurate and accessible to decision-makers with varied technical expertise. The benchmark can be visualized at https://servicenow.github.io/context-is-key-forecasting/v0.

ICML Conference 2025 Conference Paper

Generalization Bounds via Meta-Learned Model Representations: PAC-Bayes and Sample Compression Hypernetworks

  • Benjamin Leblanc
  • Mathieu Bazinet
  • Nathaniel D'Amours
  • Alexandre Drouin
  • Pascal Germain

Both PAC-Bayesian and Sample Compress learning frameworks have been shown to be instrumental for deriving tight (non-vacuous) generalization bounds for neural networks. We leverage these results in a meta-learning scheme, relying on a hypernetwork that outputs the parameters of a downstream predictor from a dataset input. The originality of our approach lies in the investigated hypernetwork architectures that encode the dataset before decoding the parameters: (1) a PAC-Bayesian encoder that expresses a posterior distribution over a latent space, (2) a Sample Compress encoder that selects a small sample of the dataset input along with a message from a discrete set, and (3) a hybrid between both approaches motivated by a new Sample Compress theorem handling continuous messages. The latter theorem exploits the pivotal information transiting at the encoder-decoder junction in order to compute generalization guarantees for each downstream predictor obtained by our meta-learning scheme.

NeurIPS Conference 2025 Conference Paper

How to Train Your LLM Web Agent: A Statistical Diagnosis

  • Dheeraj Vattikonda
  • Santhoshi Ravichandran
  • Emiliano Penaloza
  • Hadi Nekoei
  • Thibault de Chezelles
  • Megh Thakkar
  • Nicolas Gontier
  • Miguel Muñoz-Mármol

Large language model (LLM) agents for web interfaces have advanced rapidly, yet open-source systems still lag behind proprietary agents. Bridging this gap is key to enabling customizable, efficient, and privacy-preserving agents. Two challenges hinder progress: the reproducibility issues in RL and LLM agent training, where results often depend on sensitive factors like seeds and decoding parameters, and the focus of prior work on single-step tasks, overlooking the complexities of web-based, multi-step decision-making. We address these gaps by providing a statistically driven study of training LLM agents for web tasks. Our two-stage pipeline combines imitation learning from a Llama 3.3 70B teacher with on-policy fine-tuning via Group Relative Policy Optimization (GRPO) on a Llama 3.1 8B student. Through 240 configuration sweeps and rigorous bootstrapping, we chart the first compute allocation curve for open-source LLM web agents. Our findings show that dedicating one-third of compute to teacher traces and the rest to RL improves MiniWoB++ success by 6 points and closes 60% of the gap to GPT-4o on WorkArena, while cutting GPU costs by 45%. We introduce a principled hyperparameter sensitivity analysis, offering actionable guidelines for robust and cost-effective agent training.

ICLR Conference 2025 Conference Paper

InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation

  • Gaurav Sahu
  • Abhay Puri
  • Juan A. Rodríguez
  • Amirhossein Abaskohi
  • Mohammad Chegini
  • Alexandre Drouin
  • Perouz Taslakian
  • Valentina Zantedeschi

Data analytics is essential for extracting valuable insights from data that can assist organizations in making effective decisions. We introduce InsightBench, a benchmark dataset with three key features. First, it consists of 100 datasets representing diverse business use cases such as finance and incident management, each accompanied by a carefully curated set of insights planted in the datasets. Second, unlike existing benchmarks focusing on answering single queries, InsightBench evaluates agents based on their ability to perform end-to-end data analytics, including formulating questions, interpreting answers, and generating a summary of insights and actionable steps. Third, we conducted comprehensive quality assurance to ensure that each dataset in the benchmark had clear goals and included relevant and meaningful questions and analysis. Furthermore, we implement a two-way evaluation mechanism using LLaMA-3 as an effective, open-source evaluator to assess agents’ ability to extract insights. We also propose AgentPoirot, our baseline data analysis agent capable of performing end-to-end data analytics. Our evaluation on InsightBench shows that AgentPoirot outperforms existing approaches (such as Pandas Agent) that focus on resolving single queries. We also compare the performance of open- and closed-source LLMs and various evaluation strategies. Overall, this benchmark serves as a testbed to motivate further development in comprehensive automated data analytics and can be accessed here: https://github.com/ServiceNow/insight-bench.

TMLR Journal 2025 Journal Article

The BrowserGym Ecosystem for Web Agent Research

  • Thibault Le Sellier de Chezelles
  • Maxime Gasse
  • Alexandre Lacoste
  • Massimo Caccia
  • Alexandre Drouin
  • Léo Boisvert
  • Megh Thakkar
  • Tom Marty

The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. In an earlier work, Drouin et al. (2024) introduced BrowserGym, which aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature and includes AgentLab, a complementary framework that aids in agent creation, testing, and analysis. Our proposed ecosystem offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, supporting more reliable comparisons and facilitating in-depth analysis of agent behaviors, and could result in more adaptable, capable agents, ultimately accelerating innovation in LLM-driven automation. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks made available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI and Anthropic’s latest models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.

ICLR Conference 2024 Conference Paper

TACTiS-2: Better, Faster, Simpler Attentional Copulas for Multivariate Time Series

  • Arjun Ashok
  • Étienne Marcotte
  • Valentina Zantedeschi
  • Nicolas Chapados
  • Alexandre Drouin

We introduce a new model for multivariate probabilistic time series prediction, designed to flexibly address a range of tasks including forecasting, interpolation, and their combinations. Building on copula theory, we propose a simplified objective for the recently-introduced transformer-based attentional copulas (TACTiS), wherein the number of distributional parameters now scales linearly with the number of variables instead of factorially. The new objective requires the introduction of a training curriculum, which goes hand-in-hand with necessary changes to the original architecture. We show that the resulting model has significantly better training dynamics and achieves state-of-the-art performance across diverse real-world forecasting tasks, while maintaining the flexibility of prior work, such as seamless handling of unaligned and unevenly-sampled time series. Code is made available at https://github.com/ServiceNow/TACTiS.

NeurIPS Conference 2024 Conference Paper

WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks

  • Léo Boisvert
  • Megh Thakkar
  • Maxime Gasse
  • Massimo Caccia
  • Thibault L. De Chezelles
  • Quentin Cappart
  • Nicolas Chapados
  • Alexandre Lacoste

The ability of large language models (LLMs) to mimic human-like intelligence has led to a surge in LLM-based autonomous agents. Though recent LLMs seem capable of planning and reasoning given user instructions, their effectiveness in applying these capabilities for autonomous task solving remains underexplored. This is especially true in enterprise settings, where automated agents hold the promise of a high impact. To fill this gap, we propose WorkArena++, a novel benchmark consisting of 682 tasks corresponding to realistic workflows routinely performed by knowledge workers. WorkArena++ is designed to evaluate the planning, problem-solving, logical/arithmetic reasoning, retrieval, and contextual understanding abilities of web agents. Our empirical studies across state-of-the-art LLMs and vision-language models (VLMs), as well as human workers, reveal several challenges for such models to serve as useful assistants in the workplace. In addition to the benchmark, we provide a mechanism to effortlessly generate thousands of ground-truth observation/action traces, which can be used for fine-tuning existing models. Overall, we expect this work to serve as a useful resource to help the community progress towards capable autonomous agents. The benchmark can be found at https://github.com/ServiceNow/WorkArena.

ICML Conference 2024 Conference Paper

WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks?

  • Alexandre Drouin
  • Maxime Gasse
  • Massimo Caccia
  • Issam H. Laradji
  • Manuel Del Verme
  • Tom Marty
  • David Vázquez
  • Nicolas Chapados

We study the use of large language model-based agents for interacting with software via web browsers. Unlike prior work, we focus on measuring the agents’ ability to perform tasks that span the typical daily work of knowledge workers utilizing enterprise software systems. To this end, we propose WorkArena, a remote-hosted benchmark of 33 tasks based on the widely-used ServiceNow platform. We also introduce BrowserGym, an environment for the design and evaluation of such agents, offering a rich set of actions as well as multimodal observations. Our empirical evaluation reveals that while current agents show promise on WorkArena, there remains a considerable gap towards achieving full task automation. Notably, our analysis uncovers a significant performance disparity between open and closed-source LLMs, highlighting a critical area for future exploration and development in the field.

NeurIPS Conference 2023 Conference Paper

GEO-Bench: Toward Foundation Models for Earth Monitoring

  • Alexandre Lacoste
  • Nils Lehmann
  • Pau Rodriguez
  • Evan Sherwin
  • Hannah Kerner
  • Björn Lütjens
  • Jeremy Irvin
  • David Dao

Recent progress in self-supervision has shown that pre-training large neural networks on vast amounts of unsupervised data can lead to substantial increases in generalization to downstream tasks. Such models, recently coined foundation models, have been transformational to the field of natural language processing. Variants have also been proposed for image data, but their applicability to remote sensing tasks is limited. To stimulate the development of foundation models for Earth monitoring, we propose a benchmark comprised of six classification and six segmentation tasks, which were carefully curated and adapted to be both relevant to the field and well-suited for model evaluation. We accompany this benchmark with a robust methodology for evaluating models and reporting aggregated results to enable a reliable assessment of progress. Finally, we report results for 20 baselines to gain information about the performance of existing models. We believe that this benchmark will be a driver of progress across a variety of Earth monitoring tasks.

ICML Conference 2023 Conference Paper

Regions of Reliability in the Evaluation of Multivariate Probabilistic Forecasts

  • Étienne Marcotte
  • Valentina Zantedeschi
  • Alexandre Drouin
  • Nicolas Chapados

Multivariate probabilistic time series forecasts are commonly evaluated via proper scoring rules, i.e., functions that are minimal in expectation for the ground-truth distribution. However, this property is not sufficient to guarantee good discrimination in the non-asymptotic regime. In this paper, we provide the first systematic finite-sample study of proper scoring rules for time series forecasting evaluation. Through a power analysis, we identify the “region of reliability” of a scoring rule, i.e., the set of practical conditions where it can be relied on to identify forecasting errors. We carry out our analysis on a comprehensive synthetic benchmark, specifically designed to test several key discrepancies between ground-truth and forecast distributions, and we gauge the generalizability of our findings to real-world tasks with an application to an electricity production problem. Our results reveal critical shortcomings in the evaluation of multivariate probabilistic forecasts as commonly performed in the literature.

ICML Conference 2022 Conference Paper

TACTiS: Transformer-Attentional Copulas for Time Series

  • Alexandre Drouin
  • Étienne Marcotte
  • Nicolas Chapados

The estimation of time-varying quantities is a fundamental component of decision making in fields such as healthcare and finance. However, the practical utility of such estimates is limited by how accurately they quantify predictive uncertainty. In this work, we address the problem of estimating the joint predictive distribution of high-dimensional multivariate time series. We propose a versatile method, based on the transformer architecture, that estimates joint distributions using an attention-based decoder that provably learns to mimic the properties of non-parametric copulas. The resulting model has several desirable properties: it can scale to hundreds of time series, supports both forecasting and interpolation, can handle unaligned and non-uniformly sampled data, and can seamlessly adapt to missing data during training. We demonstrate these properties empirically and show that our model produces state-of-the-art predictions on multiple real-world datasets.

NeurIPS Conference 2020 Conference Paper

Differentiable Causal Discovery from Interventional Data

  • Philippe Brouillard
  • Sébastien Lachapelle
  • Alexandre Lacoste
  • Simon Lacoste-Julien
  • Alexandre Drouin

Learning a causal directed acyclic graph from data is a challenging task that involves solving a combinatorial problem for which the solution is not always identifiable. A new line of work reformulates this problem as a continuous constrained optimization one, which is solved via the augmented Lagrangian method. However, most methods based on this idea do not make use of interventional data, which can significantly alleviate identifiability issues. This work constitutes a new step in this direction by proposing a theoretically-grounded method based on neural networks that can leverage interventional data. We illustrate the flexibility of the continuous-constrained framework by taking advantage of expressive neural architectures such as normalizing flows. We show that our approach compares favorably to the state of the art in a variety of settings, including perfect and imperfect interventions for which the targeted nodes may even be unknown.

NeurIPS Conference 2020 Conference Paper

In search of robust measures of generalization

  • Gintare Karolina Dziugaite
  • Alexandre Drouin
  • Brady Neal
  • Nitarshan Rajkumar
  • Ethan Caballero
  • Linbo Wang
  • Ioannis Mitliagkas
  • Daniel M. Roy

One of the principal scientific challenges in deep learning is explaining generalization, i.e., why the particular way the community now trains networks to achieve small training error also leads to small error on held-out data from the same population. It is widely appreciated that some worst-case theories -- such as those based on the VC dimension of the class of predictors induced by modern neural network architectures -- are unable to explain empirical performance. A large volume of work aims to close this gap, primarily by developing bounds on generalization error, optimization error, and excess risk. When evaluated empirically, however, most of these bounds are numerically vacuous. Focusing on generalization bounds, this work addresses the question of how to evaluate such bounds empirically. Jiang et al. (2020) recently described a large-scale empirical study aimed at uncovering potential causal relationships between bounds/measures and generalization. Building on their study, we highlight where their proposed methods can obscure failures and successes of generalization measures in explaining generalization. We argue that generalization measures should instead be evaluated within the framework of distributional robustness.

NeurIPS Conference 2020 Conference Paper

Synbols: Probing Learning Algorithms with Synthetic Datasets

  • Alexandre Lacoste
  • Pau Rodríguez López
  • Frederic Branchaud-Charron
  • Parmida Atighehchian
  • Massimo Caccia
  • Issam Hadj Laradji
  • Alexandre Drouin
  • Matthew Craddock

Progress in the field of machine learning has been fueled by the introduction of benchmark datasets pushing the limits of existing algorithms. Enabling the design of datasets to test specific properties and failure modes of learning algorithms is thus a problem of high interest, as it has a direct impact on innovation in the field. In this sense, we introduce Synbols (Synthetic Symbols), a tool for rapidly generating new datasets with a rich composition of latent features rendered in low-resolution images. Synbols leverages the large number of symbols available in the Unicode standard and the wide range of artistic fonts provided by the open font community. Our tool's high-level interface provides a language for rapidly generating new distributions on the latent features, including various types of textures and occlusions. To showcase the versatility of Synbols, we use it to dissect the limitations and flaws in standard learning algorithms in various learning setups including supervised learning, active learning, out-of-distribution generalization, unsupervised representation learning, and object counting.

NeurIPS Conference 2017 Conference Paper

Maximum Margin Interval Trees

  • Alexandre Drouin
  • Toby Hocking
  • Francois Laviolette

Learning a regression function using censored or interval-valued output data is an important problem in fields such as genomics and medicine. The goal is to learn a real-valued prediction function, and the training output labels indicate an interval of possible values. Whereas most existing algorithms for this task are linear models, in this paper we investigate learning nonlinear tree models. We propose to learn a tree by minimizing a margin-based discriminative objective function, and we provide a dynamic programming algorithm for computing the optimal solution in log-linear time. We show empirically that this algorithm achieves state-of-the-art speed and prediction accuracy in a benchmark of several data sets.

IJCAI Conference 2013 Conference Paper

Accelerated Robust Point Cloud Registration in Natural Environments through Positive and Unlabeled Learning

  • Maxime Latulippe
  • Alexandre Drouin
  • Philippe Giguère
  • François Laviolette

Localization of a mobile robot is crucial for autonomous navigation. Using laser scanners, this can be facilitated by the pairwise alignment of consecutive scans. In this paper, we are interested in improving this scan alignment in challenging natural environments. For this purpose, local descriptors are generally effective as they facilitate point matching. However, we show that in some natural environments, many of them are likely to be unreliable, which affects the accuracy and robustness of the results. Therefore, we propose to filter the unreliable descriptors as a prior step to alignment. Our approach uses a fast machine learning algorithm, trained on-the-fly under the positive and unlabeled learning paradigm without the need for human intervention. Our results show that the number of descriptors can be significantly reduced, while increasing the proportion of reliable ones, thus speeding up and improving the robustness of the scan alignment process.