Arrow Research search

Author name cluster

Milind Tambe

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

257 papers
2 author rows

Possible papers

257

JAAMAS Journal 2026 Journal Article

An Automated Teamwork Infrastructure for Heterogeneous Software Agents and Humans

  • David V. Pynadath
  • Milind Tambe

Abstract Agent integration architectures enable a heterogeneous, distributed set of agents to work together to address problems of greater complexity than those addressed by the individual agents themselves. Unfortunately, integrating software agents and humans to perform real-world tasks in a large-scale system remains difficult, especially due to three main challenges: ensuring robust execution in the face of a dynamic environment, providing abstract task specifications without all the low-level coordination details, and finding appropriate agents for inclusion in the overall system. To address these challenges, our Teamcore project provides the integration architecture with general-purpose teamwork coordination capabilities. We make each agent team-ready by providing it with a proxy capable of general teamwork reasoning. Thus, a key novelty and strength of our framework is that powerful teamwork capabilities are built into its foundations by providing the proxies themselves with a teamwork model. Given this teamwork model, the Teamcore proxies address the first agent integration challenge, robust execution, by automatically generating the required coordination actions for the agents they represent. We can also exploit the proxies' reusable general teamwork knowledge to address the second agent integration challenge. Through team-oriented programming, a developer specifies a hierarchical organization and its goals and plans, abstracting away from coordination details. Finally, KARMA, our Knowledgeable Agent Resources Manager Assistant, can aid the developer in conquering the third agent integration challenge by locating agents that match the specified organization's requirements. Our integration architecture enables teamwork among agents with no coordination capabilities, and it establishes and automates consistent teamwork among agents with some coordination capabilities.
Thus, team-oriented programming provides a level of abstraction that can be used on top of previous approaches to agent-oriented programming. We illustrate how the Teamcore architecture successfully addressed the challenges of agent integration in two application domains: simulated rehearsal of a military evacuation mission and facilitation of human collaboration.

JAAMAS Journal 2026 Journal Article

Automated Assistants for Analyzing Team Behaviors

  • Ranjit Nair
  • Milind Tambe
  • Taylor Raines

Abstract Multi-agent teamwork is critical in a large number of agent applications, including training, education, virtual enterprises and collective robotics. The complex interactions of agents in a team as well as with other agents make it extremely difficult for human developers to understand and analyze agent-team behavior. It has thus become increasingly important to develop tools that can help humans analyze, evaluate, and understand team behaviors. However, the problem of automated team analysis is largely unaddressed in previous work. In this article, we identify several key constraints faced by team analysts. Most fundamentally, multiple types of models of team behavior are necessary to analyze different granularities of team events, including agent actions, interactions, and global performance. In addition, effective ways of presenting the analysis to humans are critical and the presentation techniques depend on the model being presented. Finally, analysis should be independent of underlying team architecture and implementation. We also demonstrate an approach to addressing these constraints by building an automated team analyst called ISAAC for post-hoc, off-line agent-team analysis. ISAAC acquires multiple, heterogeneous team models via machine learning over teams' external behavior traces, where the specific learning techniques are tailored to the particular model learned. Additionally, ISAAC employs multiple presentation techniques that can aid human understanding of the analyses. ISAAC also provides feedback on team improvement in two novel ways: (i) It supports principled “what-if” reasoning about possible agent improvements; (ii) It allows the user to compare different teams based on their patterns of interactions. This paper presents ISAAC's general conceptual framework, motivating its design, as well as its concrete application in two domains: (i) RoboCup Soccer; (ii) software agent teams participating in a simulated evacuation scenario.
In the RoboCup domain, ISAAC was used prior to and during the RoboCup '99 tournament, and was awarded the RoboCup Scientific Challenge Award. In the evacuation domain, ISAAC was used to analyze patterns of message exchanges among software agents, illustrating the generality of ISAAC's techniques. We present detailed algorithms and experimental results from ISAAC's application.

JMLR Journal 2026 Journal Article

Contrasting Local and Global Modeling with Machine Learning and Satellite Data: A Case Study Estimating Tree Canopy Height in African Savannas

  • Esther Rolf
  • Lucia Gordon
  • Milind Tambe
  • Andrew Davies

While advances in machine learning with satellite imagery (SatML) are facilitating environmental monitoring at a global scale, developing SatML models that are accurate and useful for local regions remains critical to understanding and acting on an ever-changing planet. As increasing attention and resources are being devoted to training SatML models with global data, it is important to understand when improvements in global models will make it easier to train or fine-tune models that are accurate in specific regions. To explore this question, we design the first study that explicitly contrasts local and global training paradigms for SatML, through a case study of tree canopy height (TCH) mapping in the Karingani Game Reserve, Mozambique. We find that recent advances in global TCH mapping do not necessarily translate to better local modeling abilities in our study region. Specifically, small models trained only with locally-collected data outperform published global TCH maps, and even outperform globally pretrained models that we fine-tune using local data. Analyzing these results further, we identify specific points of conflict and synergy between local and global modeling paradigms that can inform future research toward aligning local and global performance objectives in geospatial machine learning.

JAAMAS Journal 2026 Journal Article

Experiences Acquired in the Design of RoboCup Teams: A Comparison of Two Fielded Teams

  • Stacy Marsella
  • Milind Tambe
  • Ion Muslea

Abstract Increasingly, multi-agent systems are being designed for a variety of complex, dynamic domains. Effective agent interactions in such domains raise some of the most fundamental research challenges for agent-based systems, in teamwork, multi-agent learning and agent modelling. The RoboCup research initiative, particularly the simulation league, has been proposed to pursue such multi-agent research challenges, using the common testbed of simulation soccer. Despite the significant popularity of RoboCup within the research community, general lessons have not often been extracted from participation in RoboCup. This is what we attempt to do here. We have fielded two teams, ISIS97 and ISIS98, in RoboCup competitions. Both teams finished among the top four in these competitions. We compare the teams, and attempt to analyze and generalize the lessons learned. This analysis reveals several surprises, pointing out lessons for teamwork and for multi-agent learning.

IS Journal 2026 Journal Article

Generative Artificial Intelligence for Social Impact

  • Lingkai Kong
  • Cheol Woo Kim
  • Davin Choo
  • Milind Tambe

Artificial intelligence for social impact has achieved compelling results in public health, conservation, and security, yet scaling these successes remains difficult due to a persistent deployment bottleneck. We characterize this bottleneck through three coupled gaps: observational scarcity resulting from limited or unreliable data, policy synthesis challenges involving combinatorial decisions and nonstationarity, and the friction of human–AI alignment when incorporating tacit expert knowledge and dynamic constraints. We argue that generative AI offers a unified pathway to bridge these gaps. Large language model agents assist in human–AI alignment by translating natural-language guidance into executable objectives and constraints for downstream planners, while diffusion models generate realistic synthetic data and support uncertainty-aware modeling to improve policy robustness and transfer across deployments. Together, these tools enable scalable, adaptable, and human-aligned AI systems for resource optimization in high-stakes settings.

AAAI Conference 2026 Conference Paper

Optimizing Health Coverage in Ethiopia: A Learning-augmented Approach and Persistent Proportionality Under an Online Budget

  • Davin Choo
  • Yohai Trabelsi
  • Fentabil Getnet
  • Samson Warkaye Lamma
  • Wondesen Nigatu
  • Kasahun Sime
  • Lisa Matay
  • Milind Tambe

As part of nationwide efforts aligned with the United Nations' Sustainable Development Goal 3 on Universal Health Coverage, Ethiopia's Ministry of Health is strengthening health posts to expand access to essential healthcare services. However, only a fraction of this health system strengthening effort can be implemented each year due to limited budgets and other competing priorities, hence the need for an optimization framework to guide prioritization across the regions of Ethiopia. In this paper, we develop a tool, Health Access Resource Planner (HARP), based on a principled decision-support optimization framework for sequential facility planning that aims to maximize population coverage under budget uncertainty while satisfying region-specific proportionality targets at every time step. We then propose two algorithms: (i) a learning-augmented approach that improves upon expert recommendations at any single step; and (ii) a greedy algorithm for multi-step planning, both with strong worst-case approximation guarantees. In collaboration with the Ethiopian Public Health Institute and Ministry of Health, we demonstrated the empirical efficacy of our method on three regions across various planning scenarios.

AAAI Conference 2026 Conference Paper

Preference Robustness for DPO with Applications to Public Health

  • Cheol Woo Kim
  • Shresth Verma
  • Mauricio Tec
  • Milind Tambe

We study an LLM fine-tuning task for designing reward functions for sequential resource allocation problems in public health, guided by human preferences expressed in natural language. This setting presents a challenging testbed for alignment due to complex and ambiguous objectives and limited data availability. We propose DPO-PRO, a robust fine-tuning algorithm based on Direct Preference Optimization (DPO), which accounts for uncertainty in the preference distribution using a lightweight Distributionally Robust Optimization (DRO) formulation. Unlike prior DRO-based variants, DPO-PRO focuses solely on uncertainty in preferences, avoiding unnecessary conservatism and incurring negligible computational overhead. We evaluate DPO-PRO on a real-world maternal mobile health program operated by the non-profit organization ARMMAN, as well as on standard alignment benchmarks. Experimental results demonstrate that our method consistently improves robustness to noisy preference signals compared to existing DPO variants. Moreover, DPO-PRO achieves comparable performance to a prior self-reflection-based baseline for reward function design, while requiring significantly lower inference-time cost.
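The robustness-over-preferences idea can be illustrated with a toy sketch. Assuming, hypothetically, that uncertainty is modeled as a label-flip probability lying in an interval `[0, eps]`, the robust loss is the worst-case mixture of the DPO logistic loss on the observed and flipped preference; the actual DPO-PRO objective is the one defined in the paper.

```python
import math

def dpo_loss(delta):
    # Standard DPO logistic loss, where delta is the implicit reward margin:
    # beta * (log-ratio of the chosen response minus log-ratio of the rejected).
    return -math.log(1.0 / (1.0 + math.exp(-delta)))

def dpo_pro_loss(delta, eps=0.2):
    # Toy DRO over preference noise (hypothetical formulation): the observed
    # label may be flipped with probability q <= eps, and we take the worst
    # case. The mixture is linear in q, so the maximum sits at an endpoint.
    return max((1.0 - q) * dpo_loss(delta) + q * dpo_loss(-delta)
               for q in (0.0, eps))
```

Note how a confident margin (large positive `delta`) is discounted under the robust loss, since the adversary places weight `eps` on the flipped preference; at `eps = 0` the standard DPO loss is recovered.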

JAAMAS Journal 2026 Journal Article

Towards Flexible Teamwork in Persistent Teams: Extended Report

  • Milind Tambe
  • Weixiong Zhang

Abstract Teamwork is a critical capability in multi-agent environments. Many such environments mandate that the agents and agent-teams must be persistent, i.e., exist over long periods of time. Agents in such persistent teams are bound together by their long-term common interests and goals. This paper focuses on flexible teamwork in such persistent teams. Unfortunately, while previous work has investigated flexible teamwork, persistent teams remain unexplored. For flexible teamwork, one promising approach that has emerged is model-based, i.e., providing agents with general models of teamwork that explicitly specify their commitments in teamwork. Such models enable agents to autonomously reason about coordination. Unfortunately, for persistent teams, such models may lead to coordination and communication actions that, while locally optimal, are highly problematic for the team's long-term goals. We present a decision-theoretic technique based on Markov decision processes to enable persistent teams to overcome such limitations of the model-based approach. In particular, agents reason about expected team utilities of future team states that are projected to result from actions recommended by the teamwork model, as well as lower-cost (or higher-cost) variations on these actions. To accommodate real-time constraints, this reasoning is done in an any-time fashion. Implemented examples from an analytic search tree and some real-world domains are presented.

AAAI Conference 2026 Conference Paper

VORTEX: Aligning Task Utility and Human Preferences Through LLM-Guided Reward Shaping

  • Guojun Xiong
  • Milind Tambe

In social impact optimization, AI decision systems often rely on solvers that optimize well-calibrated mathematical objectives. However, these solvers cannot directly accommodate evolving human preferences, typically expressed in natural language rather than formal constraints. Recent approaches address this by using large language models (LLMs) to generate new reward functions from preference descriptions. While flexible, they risk sacrificing the system's core utility guarantees. In this paper, we propose VORTEX, a language-guided reward shaping framework that preserves established optimization goals while adaptively incorporating human feedback. By formalizing the problem as multi-objective optimization, we use LLMs to iteratively generate shaping rewards based on verbal reinforcement and text-gradient prompt updates. This allows stakeholders to steer decision behavior via natural language without modifying solvers or specifying trade-off weights. We provide theoretical guarantees that VORTEX converges to Pareto-optimal trade-offs between utility and preference satisfaction. Empirical results in real-world allocation tasks demonstrate that VORTEX outperforms baselines in satisfying human-aligned coverage goals while maintaining high task performance. This work introduces a practical and theoretically grounded paradigm for human-AI collaborative optimization guided by natural language.

NeurIPS Conference 2025 Conference Paper

Adaptive Frontier Exploration on Graphs with Applications to Network-Based Disease Testing

  • Davin Choo
  • Yuqi Pan
  • Tonghan Wang
  • Milind Tambe
  • Alastair van Heerden
  • Cheryl Johnson

We study a sequential decision-making problem on an $n$-node graph $\mathcal{G}$ where each node has an unknown label from a finite set $\mathbf{\Omega}$, drawn from a joint distribution $\mathcal{P}$ that is Markov with respect to $\mathcal{G}$. At each step, selecting a node reveals its label and yields a label-dependent reward. The goal is to adaptively choose nodes to maximize expected accumulated discounted rewards. We impose a frontier exploration constraint, where actions are limited to neighbors of previously selected nodes, reflecting practical constraints in settings such as contact tracing and robotic exploration. We design a Gittins index-based policy that applies to general graphs and is provably optimal when $\mathcal{G}$ is a forest. Our implementation runs in $\mathcal{O}(n^2 \cdot |\mathbf{\Omega}|^2)$ time while using $\mathcal{O}(n \cdot |\mathbf{\Omega}|^2)$ oracle calls to $\mathcal{P}$ and $\mathcal{O}(n^2 \cdot |\mathbf{\Omega}|)$ space. Experiments on synthetic and real-world graphs show that our method consistently outperforms natural baselines, including in non-tree, budget-limited, and undiscounted settings. For example, in HIV testing simulations on real-world sexual interaction networks, our policy detects nearly all positive cases with only half the population tested, substantially outperforming other baselines.
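The frontier constraint itself is simple to state in code. The sketch below takes a caller-supplied myopic `score` in place of the paper's Gittins index (the index computation is the paper's contribution and is not reproduced here); `adj` is an adjacency list, and a node's label is revealed only when it is selected.

```python
def frontier_explore(adj, labels, reward, score, start, budget, gamma=0.95):
    # Frontier exploration: after the first pick, only neighbors of
    # already-selected nodes are eligible, as in contact tracing.
    selected = {start}
    total = reward(labels[start])        # the start node's label is revealed
    frontier = set(adj[start]) - selected
    disc = gamma
    for _ in range(budget - 1):
        if not frontier:
            break
        node = max(frontier, key=score)  # stand-in for the Gittins index
        selected.add(node)
        total += disc * reward(labels[node])
        disc *= gamma
        frontier = (frontier | set(adj[node])) - selected
    return selected, total

# Path graph 0-1-2-3; labels are revealed on selection.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
labels = {0: 1, 1: 0, 2: 1, 3: 1}
sel, tot = frontier_explore(adj, labels, reward=lambda x: x,
                            score=lambda v: v, start=0, budget=4, gamma=1.0)
print(sel, tot)  # {0, 1, 2, 3} 3
```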

AAMAS Conference 2025 Conference Paper

Bayesian Collaborative Bandits with Thompson Sampling for Improved Outreach in Maternal Health

  • Arpan Dasgupta
  • Gagan Jain
  • Arun Suggala
  • Karthikeyan Shanmugam
  • Milind Tambe
  • Aparna Taneja

Mobile health (mHealth) programs face a critical challenge in optimizing the timing of automated health information calls to beneficiaries. This challenge has been formulated as a collaborative multi-armed bandit problem, requiring online learning of a low-rank reward matrix. Existing solutions often rely on heuristic combinations of offline matrix completion and exploration strategies. In this work, we propose a principled Bayesian approach using Thompson Sampling for this collaborative bandit problem. Our method leverages prior information through efficient Gibbs sampling for posterior inference over the low-rank matrix factors, enabling faster convergence. We demonstrate significant improvements over state-of-the-art baselines on a real-world dataset from the world's largest maternal mHealth program. Our approach achieves a 16% reduction in the number of calls compared to existing methods and a 47% reduction compared to the deployed random policy. This efficiency gain translates to a potential increase in program capacity by 0.5 to 1.4 million beneficiaries, granting them access to vital antenatal and post-natal care information. Furthermore, we observe a 7% and 29% improvement in beneficiary retention (an extremely hard metric to impact) compared to state-of-the-art and deployed baselines, respectively. Synthetic simulations further demonstrate the superiority of our approach, particularly in low-data regimes and in effectively utilizing prior information. We also provide a theoretical analysis of our algorithm in a special setting using Eluder dimension.
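As a simplified illustration of the Thompson-sampling idea (not the paper's Gibbs sampler, which alternates over both low-rank factors), suppose the arm factors `V` were known: each beneficiary's latent factor then has a conjugate Gaussian posterior, and one sampling step looks like this.

```python
import numpy as np

def thompson_step(V, A, b, sigma2=1.0, rng=None):
    # One Thompson-sampling step for a single beneficiary under the reward
    # model r = u . v_arm + Gaussian noise, with ridge statistics (A, b).
    # Simplification: the arm factors V are treated as known here; the paper
    # instead Gibbs-samples both low-rank factors jointly.
    rng = rng if rng is not None else np.random.default_rng(0)
    cov = sigma2 * np.linalg.inv(A)
    mean = np.linalg.inv(A) @ b
    u_sample = rng.multivariate_normal(mean, cov)  # draw from the posterior
    return int(np.argmax(V @ u_sample))            # act greedily on the draw

def update(A, b, v_arm, r):
    # Conjugate (ridge-statistic) update after observing reward r.
    return A + np.outer(v_arm, v_arm), b + r * v_arm
```

Starting from `A = np.eye(d)` and `b = np.zeros(d)`, alternating `thompson_step` and `update` concentrates the posterior, which is the mechanism behind the faster convergence claimed above.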

ECAI Conference 2025 Conference Paper

Beyond Listenership: AI-Predicted Interventions Drive Improvements in Maternal Health Behaviours

  • Arpan Dasgupta
  • Sarvesh Gharat
  • Neha Madhiwalla
  • Aparna Hegde
  • Milind Tambe
  • Aparna Taneja

Automated voice calls with health information are a proven method for disseminating maternal and child health information among beneficiaries and are deployed in several programs around the world. However, these programs often suffer from beneficiary dropoffs and poor engagement. In previous work, through real-world trials, we showed that an AI model, specifically a restless bandit model, could identify beneficiaries who would benefit most from live service call interventions, preventing dropoffs and boosting engagement. However, one key question has remained open so far: does such improved listenership via AI-targeted interventions translate into beneficiaries’ improved knowledge and health behaviors? We present a first study that shows not only listenership improvements due to AI interventions, but also simultaneously links these improvements to health behavior changes. Specifically, we demonstrate that AI-scheduled interventions, which enhance listenership, lead to statistically significant improvements in beneficiaries’ health behaviors such as taking iron or calcium supplements in the postnatal period, as well as understanding of critical health topics during pregnancy and infancy. This underscores the potential of AI to drive meaningful improvements in maternal and child health.

NeurIPS Conference 2025 Conference Paper

Composite Flow Matching for Reinforcement Learning with Shifted-Dynamics Data

  • Lingkai Kong
  • Haichuan Wang
  • Tonghan Wang
  • Guojun Xiong
  • Milind Tambe

Incorporating pre-collected offline data can substantially improve the sample efficiency of reinforcement learning (RL), but its benefits can break down when the transition dynamics in the offline dataset differ from those encountered online. Existing approaches typically mitigate this issue by penalizing or filtering offline transitions in regions with large dynamics gap. However, their dynamics-gap estimators often rely on KL divergence or mutual information, which can be ill-defined when offline and online dynamics have mismatched support. To address this challenge, we propose CompFlow, a principled framework built on the theoretical connection between flow matching and optimal transport. Specifically, we model the online dynamics as a conditional flow built upon the output distribution of a pretrained offline flow, rather than learning it directly from a Gaussian prior. This composite structure provides two advantages: (1) improved generalization when learning online dynamics under limited interaction data, and (2) a well-defined and stable estimate of the dynamics gap via the Wasserstein distance between offline and online transitions. Building on this dynamics-gap estimator, we further develop an optimistic active data collection strategy that prioritizes exploration in high-gap regions, and show theoretically that it reduces the performance gap to the optimal policy. Empirically, CompFlow consistently outperforms strong baselines across a range of RL benchmarks with shifted-dynamics data.
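The support-mismatch point is easy to see in one dimension: the empirical Wasserstein-1 distance below stays finite even when the two samples have disjoint supports, exactly where a KL-based gap estimate would be ill-defined. This is only a 1-D stand-in for the paper's flow-matching-based estimator.

```python
import numpy as np

def w1_gap(offline_next, online_next):
    # Empirical 1-D Wasserstein-1 distance between equally sized samples of
    # next states: sort both samples and average the pointwise gaps.
    # Unlike KL, this remains well-defined with non-overlapping supports.
    a = np.sort(np.asarray(offline_next, dtype=float))
    b = np.sort(np.asarray(online_next, dtype=float))
    assert a.shape == b.shape, "sketch assumes equal sample counts"
    return float(np.mean(np.abs(a - b)))

print(w1_gap([0.0, 1.0], [2.0, 3.0]))  # 2.0: disjoint supports, finite gap
```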

AAAI Conference 2025 Conference Paper

Context in Public Health for Underserved Communities: A Bayesian Approach to Online Restless Bandits

  • Biyonka Liang
  • Lily Xu
  • Aparna Taneja
  • Milind Tambe
  • Lucas Janson

Public health programs often provide interventions to encourage program adherence, and effectively allocating interventions is vital for producing the greatest overall health outcomes, especially in underserved communities where resources are limited. Such resource allocation problems are often modeled as restless multi-armed bandits (RMABs) with unknown underlying transition dynamics, hence requiring online reinforcement learning (RL). We present Bayesian Learning for Contextual RMABs (BCoR), an online RL approach for RMABs that novelly combines techniques in Bayesian modeling with Thompson sampling to flexibly model the complex RMAB settings present in public health program adherence problems, namely context and non-stationarity. BCoR's key strength is the ability to leverage shared information within and between arms to learn the unknown RMAB transition dynamics quickly in intervention-scarce settings with relatively short time horizons, which is common in public health applications. Empirically, BCoR achieves substantially higher finite-sample performance over a range of experimental settings, including a setting using real-world adherence data that was developed in collaboration with ARMMAN, an NGO in India which runs a large-scale maternal mHealth program, showcasing BCoR's practical utility and potential for real-world deployment.

AAAI Conference 2025 Conference Paper

Evaluating Index-based Treatment Allocation in Underresourced Communities

  • Niclas Boehmer
  • Yash Nair
  • Sanket Shah
  • Lucas Janson
  • Aparna Taneja
  • Milind Tambe

In many applications of AI for Social Impact (e.g., when allocating spots in support programs for underserved communities), resources are scarce and an allocation policy is needed to decide who receives a resource. Before being deployed at scale, a rigorous evaluation of an AI-powered allocation policy is vital. In this paper, we introduce the methods necessary to evaluate index-based allocation policies, which allocate a limited number of resources to those who need them the most. Such policies create dependencies between agents, rendering standard statistical tests invalid and ineffective. Addressing the arising practical and technical challenges, we describe an efficient estimator and methods for drawing valid statistical conclusions. Our extensive experiments validate our methodology in practical settings while also showcasing its statistical power. We conclude by proposing and empirically verifying extensions of our methodology that enable us to reevaluate a past randomized control trial conducted with 10,000 beneficiaries for an mHealth program for pregnant women. Our new methodology allows us to draw previously invisible conclusions when comparing two different ML allocation policies.

AAMAS Conference 2025 Conference Paper

Finite-Horizon Single-Pull Restless Bandits: An Efficient Index Policy For Scarce Resource Allocation

  • Guojun Xiong
  • Haichuan Wang
  • Yuqi Pan
  • Saptarshi Mandal
  • Sanket Shah
  • Niclas Boehmer
  • Milind Tambe

Restless multi-armed bandits (RMABs) have been highly successful in optimizing sequential resource allocation across many domains. However, in many practical settings with highly scarce resources, where each agent can only receive at most one resource, such as healthcare intervention programs, the standard RMAB framework falls short. To tackle such scenarios, we introduce Finite-Horizon Single-Pull RMABs (SPRMABs), a novel variant in which each arm can only be pulled once. This single-pull constraint introduces additional complexity, rendering many existing RMAB solutions suboptimal or ineffective. To address this shortcoming, we propose using dummy states that expand the system and enforce the one-pull constraint. We then design a lightweight index policy for this expanded system. For the first time, we demonstrate that our index policy achieves a sub-linearly decaying average optimality gap of $\tilde{\mathcal{O}}(1/\rho^{1/2})$ for a finite number of arms, where $\rho$ is the scaling factor for each arm cluster. Extensive simulations validate the proposed method, showing robust performance across various domains compared to existing benchmarks.
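The dummy-state construction can be sketched directly: give each arm an extra absorbing "spent" state and reroute the pull action into it, so a second pull is impossible by construction. The exact state/action encoding below is illustrative; the index computation on the expanded system follows the paper.

```python
def add_dummy_state(P, R):
    # Expand one arm's two-action MDP with an absorbing "spent" state so the
    # arm can be pulled at most once. P[a][s][s'] are transition probabilities
    # for action a (1 = pull), R[a][s] are rewards. Simplified sketch.
    n = len(P[0])                                   # original state count
    P2 = {a: [list(row) + [0.0] for row in P[a]] for a in (0, 1)}
    for a in (0, 1):
        P2[a].append([0.0] * n + [1.0])             # the spent state absorbs
    for s in range(n):
        P2[1][s] = [0.0] * n + [1.0]                # any pull leads to spent
    R2 = {a: list(R[a]) + [0.0] for a in (0, 1)}    # spent state pays nothing
    return P2, R2
```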

ICML Conference 2025 Conference Paper

Navigating the Social Welfare Frontier: Portfolios for Multi-objective Reinforcement Learning

  • Cheol Woo Kim
  • Jai Moondra
  • Shresth Verma
  • Madeleine Pollack
  • Lingkai Kong
  • Milind Tambe
  • Swati Gupta

In many real-world applications of Reinforcement Learning (RL), deployed policies have varied impacts on different stakeholders, creating challenges in reaching consensus on how to effectively aggregate their preferences. Generalized $p$-means form a widely used class of social welfare functions for this purpose, with broad applications in fair resource allocation, AI alignment, and decision-making. This class includes well-known welfare functions such as Egalitarian, Nash, and Utilitarian welfare. However, selecting the appropriate social welfare function is challenging for decision-makers, as the structure and outcomes of optimal policies can be highly sensitive to the choice of $p$. To address this challenge, we study the concept of an $\alpha$-approximate portfolio in RL, a set of policies that are approximately optimal across the family of generalized $p$-means for all $p \in [-\infty, 1]$. We propose algorithms to compute such portfolios and provide theoretical guarantees on the trade-offs among approximation factor, portfolio size, and computational efficiency. Experimental results on synthetic and real-world datasets demonstrate the effectiveness of our approach in summarizing the policy space induced by varying $p$ values, empowering decision-makers to navigate this landscape more effectively.
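The generalized $p$-means named above have a compact closed form, $M_p(u) = (\frac{1}{n}\sum_i u_i^p)^{1/p}$ for positive utilities, with the limits $p \to 0$ and $p \to -\infty$ giving the Nash (geometric-mean) and Egalitarian (minimum) welfare. A small sketch:

```python
import numpy as np

def p_mean(utilities, p):
    # Generalized p-mean welfare of a positive utility vector:
    #   p = 1      -> Utilitarian (arithmetic mean)
    #   p -> 0     -> Nash (geometric mean)
    #   p -> -inf  -> Egalitarian (minimum)
    u = np.asarray(utilities, dtype=float)
    if p == float("-inf"):
        return float(u.min())
    if p == 0:
        return float(np.exp(np.mean(np.log(u))))
    return float(np.mean(u ** p) ** (1.0 / p))

u = [1.0, 4.0, 9.0]
print(p_mean(u, 1))              # utilitarian: 14/3
print(p_mean(u, 0))              # Nash: 36 ** (1/3)
print(p_mean(u, float("-inf")))  # egalitarian: 1.0
```

The sensitivity to $p$ mentioned in the abstract is visible even in this tiny example: the three welfare values differ substantially, so the policy that optimizes each can differ too.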

AAMAS Conference 2025 Conference Paper

On Diffusion Models for Multi-Agent Partial Observability: Shared Attractors, Error Bounds, and Composite Flow

  • Tonghan Wang
  • Heng Dong
  • Yanchen Jiang
  • David C. Parkes
  • Milind Tambe

Multiagent systems grapple with partial observability (PO), and the decentralized POMDP (Dec-POMDP) model highlights the fundamental nature of this challenge. Whereas recent approaches to addressing PO have appealed to deep learning models, providing a rigorous understanding of how these models and their approximation errors affect agents' handling of PO and their interactions remains a challenge. In addressing this challenge, we investigate reconstructing global states from local action-observation histories in Dec-POMDPs using diffusion models. We first find that diffusion models conditioned on local history represent possible states as stable fixed points. In collectively observable (CO) Dec-POMDPs, individual diffusion models conditioned on agents' local histories share a unique fixed point corresponding to the global state, while in non-CO settings, shared fixed points yield a distribution of possible states given joint history. We further find that, with deep learning approximation errors, fixed points can deviate from true states and the deviation is negatively correlated to the Jacobian rank. Inspired by this low-rank property, we bound a deviation by constructing a surrogate linear regression model that approximates the local behavior of a diffusion model. With this bound, we propose a composite diffusion process iterating over agents with theoretical convergence guarantees to the true state.

AAAI Conference 2025 System Paper

PRIORITY2REWARD: Incorporating Healthworker Preferences for Resource Allocation Planning

  • Shresth Verma
  • Alayna Nguyen
  • Niclas Boehmer
  • Lingkai Kong
  • Milind Tambe

In this paper, we present PRIORITY2REWARD, a Large Language Model (LLM)-based application which incorporates health worker preferences for resource allocation planning in public health programs. LLMs are increasingly used to design reward functions based on human preferences in Reinforcement Learning problems. We focus on LLM-designed rewards for Restless Multi-Armed Bandits, a framework for allocating limited resources among agents. In the context of public health, our approach empowers grassroots health workers to tailor automated allocation decisions to community needs. We showcase a simulated application of PRIORITY2REWARD for a large-scale mobile health program in India. The tool allows health workers to enter natural language preferences and leverages LLMs to search for reward functions aligned with these preferences. Our tool then dynamically showcases how the LLM-generated reward function modifies the policy outcomes with respect to different demographic groups in the population. This can help inform policy implementation at a community level.

ICLR Conference 2025 Conference Paper

Reinforcement learning with combinatorial actions for coupled restless bandits

  • Lily Xu
  • Bryan Wilder
  • Elias B. Khalil
  • Milind Tambe

Reinforcement learning (RL) has increasingly been applied to solve real-world planning problems, with progress in handling large state spaces and time horizons. However, a key bottleneck in many domains is that RL methods cannot accommodate large, combinatorially structured action spaces. In such settings, even representing the set of feasible actions at a single step may require a complex discrete optimization formulation. We leverage recent advances in embedding trained neural networks into optimization problems to propose SEQUOIA, an RL algorithm that directly optimizes for long-term reward over the feasible action space. Our approach embeds a Q-network into a mixed-integer program to select a combinatorial action in each timestep. Here, we focus on planning over restless bandits, a class of planning problems which capture many real-world examples of sequential decision making. We introduce coRMAB, a broader class of restless bandits with combinatorial actions that cannot be decoupled across the arms of the restless bandit, requiring direct solving over the joint, exponentially large action space. We empirically validate SEQUOIA on four novel restless bandit problems with combinatorial constraints: multiple interventions, path constraints, bipartite matching, and capacity constraints. Our approach significantly outperforms existing methods—which cannot address sequential planning and combinatorial selection simultaneously—by an average of 24.8% on these difficult instances.
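To see why a mixed-integer program is needed at each step, consider the brute-force alternative: enumerate every feasible combinatorial action and evaluate the learned Q-function on each. The sketch below does exactly that for a simple budget constraint; it is exponential in the number of arms, which is what SEQUOIA's Q-network-in-a-MIP formulation avoids. `q_value` is a hypothetical learned Q-function.

```python
from itertools import combinations

def best_action_bruteforce(q_value, n_arms, budget):
    # Enumerate all subsets of arms of size <= budget (the feasible
    # combinatorial actions under a budget constraint) and return the one
    # maximizing the Q-value. Exponential in n_arms; SEQUOIA instead embeds
    # the trained Q-network in a mixed-integer program.
    feasible = (frozenset(c)
                for k in range(budget + 1)
                for c in combinations(range(n_arms), k))
    return max(feasible, key=q_value)

# Toy Q-function: arm 2 is worth 10, every other arm is worth 1.
q = lambda action: sum(10 if i == 2 else 1 for i in action)
print(best_action_bruteforce(q, n_arms=4, budget=2))  # frozenset({0, 2})
```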

UAI Conference 2025 Conference Paper

Robust Optimization with Diffusion Models for Green Security

  • Lingkai Kong
  • Haichuan Wang
  • Yuqi Pan
  • Cheol Woo Kim
  • Mingxiao Song
  • Alayna Nguyen
  • Tonghan Wang 0001
  • Haifeng Xu

In green security, defenders must forecast adversarial behavior, such as poaching, illegal logging, and illegal fishing, to plan effective patrols. These behaviors are often highly uncertain and complex. Prior work has leveraged game theory to design robust patrol strategies to handle uncertainty, but existing adversarial behavior models primarily rely on Gaussian processes or linear models, which lack the expressiveness needed to capture intricate behavioral patterns. To address this limitation, we propose a conditional diffusion model for adversary behavior modeling, leveraging its strong distribution-fitting capabilities. To the best of our knowledge, this is the first application of diffusion models in the green security domain. Integrating diffusion models into game-theoretic optimization, however, presents new challenges, including a constrained mixed strategy space and the need to sample from an unnormalized distribution to estimate utilities. To tackle these challenges, we introduce a mixed strategy of mixed strategies and employ a twisted Sequential Monte Carlo (SMC) sampler for accurate sampling. Theoretically, our algorithm is guaranteed to converge to an \(\epsilon\)-equilibrium with high probability using a finite number of iterations and samples. Empirically, we evaluate our approach on both synthetic and real-world poaching datasets, demonstrating its effectiveness.

AAAI Conference 2025 Conference Paper

The Bandit Whisperer: Communication Learning for Restless Bandits

  • Yunfan Zhao
  • Tonghan Wang
  • Dheeraj Mysore Nagaraj
  • Aparna Taneja
  • Milind Tambe

Applying Reinforcement Learning (RL) to Restless Multi-Arm Bandits (RMABs) offers a promising avenue for addressing allocation problems with resource constraints and temporal dynamics. However, classic RMAB models largely overlook the challenges of (systematic) data errors - a common occurrence in real-world scenarios due to factors like varying data collection protocols and intentional noise for differential privacy. We demonstrate that conventional RL algorithms used to train RMABs can struggle to perform well in such settings. To solve this problem, we propose the first communication learning approach in RMABs, where we study which arms, when involved in communication, are most effective in mitigating the influence of such systematic data errors. In our setup, the arms receive Q-function parameters from similar arms as messages to guide behavioral policies, steering Q-function updates. We learn communication strategies by considering the joint utility of messages across all pairs of arms and using a Q-network architecture that decomposes the joint utility. Both theoretical and empirical evidence validate the effectiveness of our method in significantly improving RMAB performance across diverse problems.

IS Journal 2025 Journal Article

The Next Wave of AI for Social Impact: Challenges and Opportunities

  • Milind Tambe
  • Fei Fang
  • Andrew Perrault
  • Bryan Wilder

The burgeoning field of artificial intelligence for social impact (AI4SI) represents a significant evolution in artificial intelligence, prioritizing measurable positive impact for vulnerable and under-resourced populations. This article examines the historical context and recent surge in AI4SI, driven by technological advancements and a growing awareness of societal challenges. It highlights the crucial role of interdisciplinary collaboration, ethical considerations, and the potential of emerging AI trends in addressing issues such as poverty, health, and environmental sustainability. Furthermore, the article delves into key research questions and challenges facing the field, including the need for contextually relevant AI design, overcoming data limitations, ensuring scalable and sustainable deployments in resource-constrained environments, and establishing robust evaluation frameworks. Realizing the full potential of AI to address pressing societal needs in the coming decade and beyond will hinge on effectively navigating these challenges and fostering a deeply impact-driven approach to research and development.

AAMAS Conference 2025 Conference Paper

Towards Foundation-model-based Multiagent System to Accelerate AI for Social Impact

  • Yunfan Zhao
  • Niclas Boehmer
  • Aparna Taneja
  • Milind Tambe

AI for social impact (AI4SI) offers significant potential for addressing complex societal challenges in areas such as public health, agriculture, education, conservation, and public safety. However, existing AI4SI research is often labor-intensive and resource-demanding, limiting its accessibility and scalability; the standard approach is to design a (base-level) system tailored to a specific AI4SI problem. We propose the development of a novel meta-level multi-agent system designed to accelerate the development of such base-level systems, thereby reducing the computational cost and the burden on social impact domain experts and AI researchers. Leveraging advancements in foundation models and large language models, our proposed approach focuses on resource allocation problems, providing help across the full AI4SI pipeline from problem formulation through solution design to impact evaluation. We highlight the ethical considerations and challenges inherent in deploying such systems and emphasize the importance of a human-in-the-loop approach to ensure the responsible and effective application of AI systems.

UAI Conference 2025 Conference Paper

What is the Right Notion of Distance between Predict-then-Optimize Tasks?

  • Paula Rodriguez Diaz
  • Lingkai Kong
  • Kai Wang 0040
  • David Alvarez-Melis
  • Milind Tambe

Comparing datasets is a fundamental task in machine learning, essential for various learning paradigms, from evaluating train and test datasets for model generalization to using dataset similarity for detecting data drift. While traditional notions of dataset distances offer principled measures of similarity, their utility has largely been assessed through prediction error minimization. However, in Predict-then-Optimize (PtO) frameworks, where predictions serve as inputs for downstream optimization tasks, model performance is measured through decision regret rather than prediction error. In this work, we propose OTD$^3$ (Optimal Transport Decision-aware Dataset Distance), a novel dataset distance that incorporates downstream decisions in addition to features and labels. We show that traditional feature-label distances lack informativeness in PtO settings, while OTD$^3$ more effectively captures adaptation success. We also derive a PtO-specific adaptation bound based on this distance. Empirically, we show that our proposed distance accurately predicts model transferability across three different PtO tasks from the literature. Code is available at https://github.com/paularodr/OTD3

NeurIPS Conference 2024 Conference Paper

A Decision-Language Model (DLM) for Dynamic Restless Multi-Armed Bandit Tasks in Public Health

  • Nikhil Behari
  • Edwin Zhang
  • Yunfan Zhao
  • Aparna Taneja
  • Dheeraj Nagaraj
  • Milind Tambe

Restless multi-armed bandits (RMAB) have demonstrated success in optimizing resource allocation for large beneficiary populations in public health settings. Unfortunately, RMAB models lack flexibility to adapt to evolving public health policy priorities. Concurrently, Large Language Models (LLMs) have emerged as adept automated planners across domains of robotic control and navigation. In this paper, we propose a Decision Language Model (DLM) for RMABs, enabling dynamic fine-tuning of RMAB policies in public health settings using human-language commands. We propose using LLMs as automated planners to (1) interpret human policy preference prompts, (2) propose reward functions as code for a multi-agent RMAB environment, and (3) iterate on the generated reward functions using feedback from grounded RMAB simulations. We illustrate the application of DLM in collaboration with ARMMAN, an India-based non-profit promoting preventative care for pregnant mothers, which currently relies on RMAB policies to optimally allocate health worker calls to low-resource populations. We conduct a technology demonstration in simulation using the Gemini Pro model, showing DLM can dynamically shape policy outcomes using only human prompts as input.

ECAI Conference 2024 Conference Paper

Combining Diverse Information for Coordinated Action: Stochastic Bandit Algorithms for Heterogeneous Agents

  • Lucia Gordon
  • Esther Rolf
  • Milind Tambe

Stochastic multi-agent multi-armed bandits typically assume that the rewards from each arm follow a fixed distribution, regardless of which agent pulls the arm. However, in many real-world settings, rewards can depend on the sensitivity of each agent to their environment. In medical screening, disease detection rates can vary by test type; in preference matching, rewards can depend on user preferences; and in environmental sensing, observation quality can vary across sensors. Since past work does not specify how to allocate agents with heterogeneous but known sensitivities of these types in a stochastic bandit setting, we introduce a UCB-style algorithm, Min-Width, which aggregates information from diverse agents. In doing so, we address the joint challenges of (i) aggregating the rewards, which follow different distributions for each agent-arm pair, and (ii) coordinating the assignments of agents to arms. Min-Width facilitates efficient collaboration among heterogeneous agents, exploiting the known structure in the agents’ reward functions to weight their rewards accordingly. We analyze the regret of Min-Width and conduct pseudo-synthetic and fully synthetic experiments to study the performance of different levels of information sharing. Our results confirm that the gains to modeling agent heterogeneity tend to be greater when the sensitivities are more varied across agents, while combining more information does not always improve performance.

AAMAS Conference 2024 Conference Paper

Efficient Public Health Intervention Planning Using Decomposition-Based Decision-focused Learning

  • Sanket Shah
  • Arun Suggala
  • Milind Tambe
  • Aparna Taneja

The declining participation of beneficiaries over time is a key concern in public health programs. A popular strategy for improving retention is to have health workers ‘intervene’ on beneficiaries at risk of dropping out. However, the availability and time of these health workers are limited resources. As a result, there has been a line of research on optimizing these limited intervention resources using Restless Multi-Armed Bandits (RMABs). The key technical barrier to using this framework in practice lies in the need to estimate the beneficiaries’ RMAB parameters from historical data. Recent research has shown that Decision-Focused Learning (DFL), which focuses on maximizing the beneficiaries’ adherence rather than predictive accuracy, improves the performance of intervention targeting using RMABs. Unfortunately, these gains come at a high computational cost because of the need to solve and evaluate the RMAB in each DFL training step. In this paper, we provide a principled way to exploit the structure of RMABs to speed up intervention planning by cleverly decoupling the planning for different beneficiaries. We use real-world data from an Indian NGO, ARMMAN, to show that our approach is up to two orders of magnitude faster than the state-of-the-art approach while also yielding superior model performance. This would enable the NGO to scale up deployments using DFL to potentially millions of mothers, ultimately advancing progress toward UN SDG 3.1.

ECAI Conference 2024 Conference Paper

Escape Sensing Games: Detection-vs-Evasion in Security Applications

  • Niclas Boehmer
  • Minbiao Han
  • Haifeng Xu
  • Milind Tambe

Traditional game-theoretic research for security applications primarily focuses on the allocation of external protection resources to defend targets. This work puts forward the study of a new class of games centered around strategically arranging targets to protect them against a constrained adversary, with motivations from varied domains such as peacekeeping resource transit and cybersecurity. Specifically, we introduce Escape Sensing Games (ESGs). In ESGs, a blue player manages the order in which targets pass through a channel, while her opponent tries to capture the targets using a set of sensors that need some time to recharge after each activation. We present a thorough computational study of ESGs. Among others, we show that it is NP-hard to compute best responses and equilibria. Nevertheless, we propose a variety of effective (heuristic) algorithms whose quality we demonstrate in extensive computational experiments.

UAI Conference 2024 Conference Paper

Group Fairness in Predict-Then-Optimize Settings for Restless Bandits

  • Shresth Verma
  • Yunfan Zhao
  • Sanket Shah
  • Niclas Boehmer
  • Aparna Taneja
  • Milind Tambe

Restless multi-arm bandits (RMABs) are a model for sequentially allocating a limited number of resources to agents modeled as Markov Decision Processes. RMABs have applications in cellular networks, anti-poaching, and in particular, healthcare. For such high-stakes use cases, allocations are often required to treat different groups of agents (e.g., defined by sensitive attributes) fairly. In addition to the fairness challenge, agents’ transition probabilities are often unknown and need to be learned in real-world problems. Thus, group fairness in RMABs requires us to simultaneously learn transition probabilities and how much budget we allocate to each group. Overcoming this key challenge ignored by previous work, we develop a decision-focused-learning pipeline to solve equitable RMABs, using a novel budget allocation algorithm to prevent disparity between groups. Our results on both synthetic and real-world large-scale datasets demonstrate that incorporating fair planning into the learning step greatly improves equity with little sacrifice in utility.

AAMAS Conference 2024 Conference Paper

Improving Mobile Maternal and Child Health Care Programs: Collaborative Bandits for Time Slot Selection

  • Soumyabrata Pal
  • Milind Tambe
  • Arun Suggala
  • Karthikeyan Shanmugam
  • Aparna Taneja

Maternal and child health is a global priority, reflected in the UN Sustainable Development Goal 3.1. Mobile health (mHealth) programs, using automated voice messages, are a vital tool for NGOs to disseminate health information in underserved communities. However, these programs face challenges: limited beneficiary phone access and unknown time preferences hinder timely outreach, leading to poor engagement. We address this by formulating the time preference inference problem as a multi-agent multi-armed bandit optimization problem, where beneficiaries are modeled as agents, and time slots as arms. We introduce a novel online collaborative filtering framework that infers preferred time slots by collaborating across beneficiaries to quickly identify their preferred time slots. To highlight the scope and impact of this problem, we are working with Kilkari, the world’s largest maternal and child mHealth program serving millions in India every week. Kilkari faces substantial reattempt costs to improve call answer rates. Through extensive experiments on real-world data obtained from Kilkari, we demonstrate that our collaborative bandit framework significantly outperforms both existing policies used by the NGO, and popular non-collaborative bandit algorithms (e.g., Upper Confidence Bound), both in terms of number of call retries, saving critical bandwidth that enables wider outreach, and by rapidly learning optimal time slots, improving beneficiary engagement and retention.

AAAI Conference 2024 Conference Paper

Leaving the Nest: Going beyond Local Loss Functions for Predict-Then-Optimize

  • Sanket Shah
  • Bryan Wilder
  • Andrew Perrault
  • Milind Tambe

Predict-then-Optimize is a framework for using machine learning to perform decision-making under uncertainty. The central research question it asks is, "How can we use the structure of a decision-making task to tailor ML models for that specific task?" To this end, recent work has proposed learning task-specific loss functions that capture this underlying structure. However, current approaches make restrictive assumptions about the form of these losses and their impact on ML model behavior. These assumptions lead to approaches with high computational cost and, when they are violated in practice, poor performance. In this paper, we propose solutions to these issues, avoiding the aforementioned assumptions and utilizing the ML model's features to increase the sample efficiency of learning loss functions. We empirically show that our method achieves state-of-the-art results in four domains from the literature, often requiring an order of magnitude fewer samples than comparable methods from past work. Moreover, our approach outperforms the best existing method by nearly 200% when the localness assumption is broken.

ICML Conference 2024 Conference Paper

Position: Application-Driven Innovation in Machine Learning

  • David Rolnick
  • Alán Aspuru-Guzik
  • Sara Beery
  • Bistra Dilkina
  • Priya L. Donti
  • Marzyeh Ghassemi
  • Hannah Kerner
  • Claire Monteleoni

In this position paper, we argue that application-driven research has been systemically under-valued in the machine learning community. As applications of machine learning proliferate, innovative algorithms inspired by specific real-world challenges have become increasingly important. Such work offers the potential for significant impact not merely in domains of application but also in machine learning itself. In this paper, we describe the paradigm of application-driven research in machine learning, contrasting it with the more standard paradigm of methods-driven research. We illustrate the benefits of application-driven machine learning and how this approach can productively synergize with methods-driven work. Despite these benefits, we find that reviewing, hiring, and teaching practices in machine learning often hold back application-driven innovation. We outline how these processes may be improved.

ICML Conference 2024 Conference Paper

Position: Social Environment Design Should be Further Developed for AI-based Policy-Making

  • Edwin Zhang
  • Sadie Zhao
  • Tonghan Wang 0001
  • Safwan Hossain
  • Henry Gasztowtt
  • Stephan Zheng
  • David C. Parkes
  • Milind Tambe

Artificial Intelligence (AI) holds promise as a technology that can be used to improve government and economic policy-making. This paper proposes a new research agenda towards this end by introducing Social Environment Design, a general framework for the use of AI in automated policy-making that connects with the Reinforcement Learning, EconCS, and Computational Social Choice communities. The framework seeks to capture general economic environments, includes voting on policy objectives, and gives a direction for the systematic analysis of government and economic policy through AI simulation. We highlight key open problems for future research in AI-based policymaking. By solving these challenges, we hope to achieve various social welfare objectives, thereby promoting more ethical and responsible decision making.

IJCAI Conference 2024 Conference Paper

Towards a Pretrained Model for Restless Bandits via Multi-arm Generalization

  • Yunfan Zhao
  • Nikhil Behari
  • Edward Hughes
  • Edwin Zhang
  • Dheeraj Nagaraj
  • Karl Tuyls
  • Aparna Taneja
  • Milind Tambe

Restless multi-arm bandits (RMABs) are a class of resource allocation problems with broad application in areas such as healthcare, online advertising, and anti-poaching. We explore several important questions, such as how to handle arms opting in and opting out over time without frequent retraining from scratch, and how to deal with continuous state settings with nonlinear reward functions, which appear naturally in practical contexts. We address these questions by developing a pre-trained model (PreFeRMAB) based on a novel combination of three key ideas: (i) to enable fast generalization, we train agents to learn from each other's experience; (ii) to accommodate streaming RMABs, we derive a new update rule for a crucial $\lambda$-network; (iii) to handle more complex continuous state settings, we design the algorithm to automatically define an abstract state based on raw observation and reward data. PreFeRMAB allows general zero-shot ability on previously unseen RMABs, and can be fine-tuned on specific instances in a more sample-efficient way than retraining from scratch. We theoretically prove the benefits of multi-arm generalization and empirically demonstrate the advantages of our approach on several challenging, real-world inspired problems.

AAMAS Conference 2024 Conference Paper

Towards Zero Shot Learning in Restless Multi-armed Bandits

  • Yunfan Zhao
  • Nikhil Behari
  • Edward Hughes
  • Edwin Zhang
  • Dheeraj Nagaraj
  • Karl Tuyls
  • Aparna Taneja
  • Milind Tambe

Restless multi-arm bandits (RMABs), a class of resource allocation problems with broad application in areas such as healthcare, online advertising, and anti-poaching, have recently been studied from a multi-agent reinforcement learning perspective. Prior RMAB research suffers from several limitations, e.g., it fails to adequately address continuous states, and requires retraining from scratch when arms opt in and opt out over time, a common challenge in many real-world applications. We propose a neural network-based pre-trained model that has general zero-shot ability on a wide range of previously unseen RMABs.

NeurIPS Conference 2024 Conference Paper

Transcendence: Generative Models Can Outperform The Experts That Train Them

  • Edwin Zhang
  • Vincent Zhu
  • Naomi Saphra
  • Anat Kleiman
  • Benjamin L. Edelman
  • Milind Tambe
  • Sham Kakade
  • Eran Malach

Generative models are trained with the simple objective of imitating the conditional probability distribution induced by the data they are trained on. Therefore, when trained on data generated by humans, we may not expect the artificial model to outperform the humans on their original objectives. In this work, we study the phenomenon of transcendence: when a generative model achieves capabilities that surpass the abilities of the experts generating its data. We demonstrate transcendence by training an autoregressive transformer to play chess from game transcripts, and show that the trained model can sometimes achieve better performance than all players in the dataset. We theoretically prove that transcendence is enabled by low-temperature sampling, and rigorously assess this experimentally. Finally, we discuss other sources of transcendence, laying the groundwork for future investigation of this phenomenon in a broader setting.

AAMAS Conference 2023 Conference Paper

A Learning Approach to Complex Contagion Influence Maximization

  • Haipeng Chen
  • Bryan Wilder
  • Wei Qiu
  • Bo An
  • Eric Rice
  • Milind Tambe

Influence maximization (IM) aims to find a set of seed nodes in a social network that maximizes the influence spread. While most IM problems focus on classical influence cascades (e.g., Independent Cascade and Linear Threshold) which assume individual influence cascade probability is independent of the number of neighbors, recent studies by sociologists show that many influence cascades follow a pattern called complex contagion (CC), where influence cascade probability is much higher when more neighbors are influenced. Nonetheless, there are very limited studies on complex contagion influence maximization (CCIM) problems. This is partly because CC is non-submodular, the solution of which has been an open challenge. In this study, we propose the first reinforcement learning (RL) approach to CCIM. We find that a key obstacle in applying existing RL approaches to CCIM is the reward sparseness issue, which comes from two distinct sources. We then design a new RL algorithm that uses the CCIM problem structure to address the issue. Empirical results show that our approach achieves the state-of-the-art performance on four real-world networks.

AAMAS Conference 2023 Conference Paper

AI-driven Prices for Externalities and Sustainability in Production Markets

  • Panayiotis Danassis
  • Aris Filos-Ratsikas
  • Haipeng Chen
  • Milind Tambe
  • Boi Faltings

Markets do not account for negative externalities; indirect costs that some participants impose on others, such as the cost of overappropriating a common-pool resource (which diminishes future stock, and thus harvest, for everyone). Quantifying appropriate interventions to market prices has proven to be quite challenging. We propose a practical approach to computing market prices and allocations via a deep reinforcement learning policymaker agent, operating in an environment of other learning agents. Our policymaker allows us to tune the prices with regard to diverse objectives such as sustainability and resource wastefulness, fairness, buyers’ and sellers’ welfare, etc. As a highlight of our findings, our policymaker is significantly more successful in maintaining resource sustainability, compared to the market equilibrium outcome, in scarce resource environments.

IJCAI Conference 2023 Conference Paper

Complex Contagion Influence Maximization: A Reinforcement Learning Approach

  • Haipeng Chen
  • Bryan Wilder
  • Wei Qiu
  • Bo An
  • Eric Rice
  • Milind Tambe

In influence maximization (IM), the goal is to find a set of seed nodes in a social network that maximizes the influence spread. While most IM problems focus on classical influence cascades (e.g., Independent Cascade and Linear Threshold) which assume individual influence cascade probability is independent of the number of neighbors, recent studies by sociologists show that many influence cascades follow a pattern called complex contagion (CC), where influence cascade probability is much higher when more neighbors are influenced. Nonetheless, there are very limited studies for complex contagion influence maximization (CCIM) problems. This is partly because CC is non-submodular, the solution of which has been an open challenge. In this study, we propose the first reinforcement learning (RL) approach to CCIM. We find that a key obstacle in applying existing RL approaches to CCIM is the reward sparseness issue, which comes from two distinct sources. We then design a new RL algorithm that uses the CCIM problem structure to address the issue. Empirical results show that our approach achieves the state-of-the-art performance on 9 real-world networks.

AAMAS Conference 2023 Conference Paper

Fairness for Workers Who Pull the Arms: An Index Based Policy for Allocation of Restless Bandit Tasks

  • Arpita Biswas
  • Jackson A. Killian
  • Paula Rodriguez Diaz
  • Susobhan Ghosh
  • Milind Tambe

Motivated by applications such as machine repair, project monitoring, and anti-poaching patrol scheduling, we study intervention planning of stochastic processes under resource constraints. This planning problem has previously been modeled as restless multi-armed bandits (RMAB), where each arm is an intervention-dependent Markov Decision Process. However, the existing literature assumes all intervention resources belong to a single uniform pool, limiting their applicability to real-world settings where interventions are carried out by a set of workers, each with their own costs, budgets, and intervention effects. In this work, we consider a novel RMAB setting, called multi-worker restless bandits (MWRMAB) with heterogeneous workers. The goal is to plan an intervention schedule that maximizes the expected reward while satisfying budget constraints on each worker as well as fairness in terms of the load assigned to each worker. Our contributions are two-fold: (1) we provide a multi-worker extension of the Whittle index to tackle heterogeneous costs and per-worker budget and (2) we develop an index-based scheduling policy to achieve fairness. Further, we evaluate our method on various cost structures and show that our method significantly outperforms other baselines in terms of fairness without sacrificing much in reward accumulated.

IJCAI Conference 2023 Conference Paper

Find Rhinos without Finding Rhinos: Active Learning with Multimodal Imagery of South African Rhino Habitats

  • Lucia Gordon
  • Nikhil Behari
  • Samuel Collier
  • Elizabeth Bondi-Kelly
  • Jackson A. Killian
  • Catherine Ressijac
  • Peter Boucher
  • Andrew Davies

Much of Earth's charismatic megafauna is endangered by human activities, particularly the rhino, which is at risk of extinction due to the poaching crisis in Africa. Monitoring rhinos' movement is crucial to their protection but has unfortunately proven difficult because rhinos are elusive. Therefore, instead of tracking rhinos, we propose the novel approach of mapping communal defecation sites, called middens, which give information about rhinos' spatial behavior valuable to anti-poaching, management, and reintroduction efforts. This paper provides the first-ever mapping of rhino midden locations by building classifiers to detect them using remotely sensed thermal, RGB, and LiDAR imagery in passive and active learning settings. As existing active learning methods perform poorly due to the extreme class imbalance in our dataset, we design MultimodAL, an active learning system employing a ranking technique and multimodality to achieve competitive performance with passive learning models with 94% fewer labels. Our methods could therefore save over 76 hours in labeling time when used on a similarly-sized dataset. Unexpectedly, our midden map reveals that rhino middens are not randomly distributed throughout the landscape; rather, they are clustered. Consequently, ranger patrols should be targeted at areas with high midden densities to strengthen anti-poaching efforts, in line with UN Target 15.7.

AAAI Conference 2023 Conference Paper

Flexible Budgets in Restless Bandits: A Primal-Dual Algorithm for Efficient Budget Allocation

  • Paula Rodriguez Diaz
  • Jackson A. Killian
  • Lily Xu
  • Arun Sai Suggala
  • Aparna Taneja
  • Milind Tambe

Restless multi-armed bandits (RMABs) are an important model to optimize allocation of limited resources in sequential decision-making settings. Typical RMABs assume the budget --- the number of arms pulled --- to be fixed for each step in the planning horizon. However, for realistic real-world planning, resources are not necessarily limited at each planning step; we may be able to distribute surplus resources in one round to an earlier or later round. In real-world planning settings, this flexibility in budget is often constrained to within a subset of consecutive planning steps, e.g., weekly planning of a monthly budget. In this paper we define a general class of RMABs with flexible budget, which we term F-RMABs, and provide an algorithm to optimally solve for them. We derive a min-max formulation to find optimal policies for F-RMABs and leverage gradient primal-dual algorithms to solve for reward-maximizing policies with flexible budgets. We introduce a scheme to sample expected gradients to apply primal-dual algorithms to the F-RMAB setting and make an otherwise computationally expensive approach tractable. Additionally, we provide heuristics that trade off solution quality for efficiency and present experimental comparisons of different F-RMAB solution approaches.

ICML Conference 2023 Conference Paper

Improved Policy Evaluation for Randomized Trials of Algorithmic Resource Allocation

  • Aditya Mate
  • Bryan Wilder
  • Aparna Taneja
  • Milind Tambe

We consider the task of evaluating policies of algorithmic resource allocation through randomized controlled trials (RCTs). Such policies are tasked with optimizing the utilization of limited intervention resources, with the goal of maximizing the benefits derived. Evaluation of such allocation policies through RCTs proves difficult, notwithstanding the scale of the trial, because the individuals’ outcomes are inextricably interlinked through resource constraints controlling the policy decisions. Our key contribution is to present a new estimator leveraging our proposed novel concept, that involves retrospective reshuffling of participants across experimental arms at the end of an RCT. We identify conditions under which such reassignments are permissible and can be leveraged to construct counterfactual trials, whose outcomes can be accurately ascertained, for free. We prove theoretically that such an estimator is more accurate than common estimators based on sample means – we show that it returns an unbiased estimate and simultaneously reduces variance. We demonstrate the value of our approach through empirical experiments on synthetic, semisynthetic as well as real case study data and show improved estimation accuracy across the board.

AAMAS Conference 2023 Conference Paper

Indexability is Not Enough for Whittle: Improved, Near-Optimal Algorithms for Restless Bandits

  • Abheek Ghosh
  • Dheeraj Nagaraj
  • Manish Jain
  • Milind Tambe

We study the problem of planning restless multi-armed bandits (RMABs) with multiple actions. This is a popular model for multiagent systems with applications like multi-channel communication, monitoring and machine maintenance tasks, and healthcare. Whittle index policies, which are based on Lagrangian relaxations, are widely used in these settings due to their simplicity and near-optimality under certain conditions. In this work, we first show that Whittle index policies can fail in simple and practically relevant RMAB settings, even when the RMABs are indexable. We discuss why the optimality guarantees fail and why asymptotic optimality may not translate well to practically relevant planning horizons. We then propose an alternate planning algorithm based on the mean-field method, which can provably and efficiently obtain near-optimal policies with a large number of arms, without the stringent structural assumptions required by the Whittle index policies. This borrows ideas from existing research with some improvements: our approach is hyper-parameter free, and we provide an improved non-asymptotic analysis which has: (a) no requirement for exogenous hyper-parameters and tighter polynomial dependence on known problem parameters; (b) high probability bounds which show that the reward of the policy is reliable; and (c) matching sub-optimality lower bounds for this algorithm with respect to the number of arms, thus demonstrating the tightness of our bounds. Our extensive experimental analysis shows that the mean-field approach matches or outperforms other baselines.

IJCAI Conference 2023 Conference Paper

Limited Resource Allocation in a Non-Markovian World: The Case of Maternal and Child Healthcare

  • Panayiotis Danassis
  • Shresth Verma
  • Jackson A. Killian
  • Aparna Taneja
  • Milind Tambe

The success of many healthcare programs depends on participants' adherence. We consider the problem of scheduling interventions in low resource settings (e.g., placing timely support calls from health workers) to increase adherence and/or engagement. Past works have successfully developed several classes of Restless Multi-armed Bandit (RMAB) based solutions for this problem. Nevertheless, all past RMAB approaches assume that the participants' behaviour follows the Markov property. We demonstrate significant deviations from the Markov assumption on real-world data on a maternal health awareness program from our partner NGO, ARMMAN. Moreover, we extend RMABs to continuous state spaces, a previously understudied area. To tackle the generalised non-Markovian RMAB setting we (i) model each participant's trajectory as a time-series, (ii) leverage the power of time-series forecasting models to learn complex patterns and dynamics to predict future states, and (iii) propose the Time-series Arm Ranking Index (TARI) policy, a novel algorithm that selects the RMAB arms that will benefit the most from an intervention, given our future state predictions. We evaluate our approach on both synthetic data, and a secondary analysis on real data from ARMMAN, and demonstrate significant increase in engagement compared to the SOTA, deployed Whittle index solution. This translates to 16.3 hours of additional content listened, 90.8% more engagement drops prevented, and reaching more than twice as many high dropout-risk beneficiaries.

AAMAS Conference 2023 Conference Paper

Modeling Robustness in Decision-Focused Learning as a Stackelberg Game

  • Sonja Johnson-Yu
  • Kai Wang
  • Jessie Finocchiaro
  • Aparna Taneja
  • Milind Tambe

Predict-then-optimize is a common paradigm for optimization tasks situated in incomplete informational settings, in which an agent estimates missing parameters and then optimizes over these predicted parameters. One proposed improvement to this predict-then-optimize framework is decision-focused learning, which establishes an end-to-end learning pipeline, allowing a predictive model to be tailored to the particular optimization task. The behavior of this predict-then-optimize framework in the presence of noise, however, is not well-understood. This is problematic because many data collection and annotation systems are inherently noisy, and the introduction of such noise could lead to poor downstream optimization. In this work, we aim to present results on robustness to label noise in decision-focused learning and traditional predict-then-optimize tasks using a Stackelberg game as the underlying framework of explanation. Our results suggest that playing the Stackelberg game in anticipation of label noise yields robustness in the predict-then-optimize framework at large, and that the optimal decision-focused learning Stackelberg solution continues to outperform the optimal traditional predict-then-optimize Stackelberg solution.

AAAI Conference 2023 Conference Paper

Optimistic Whittle Index Policy: Online Learning for Restless Bandits

  • Kai Wang
  • Lily Xu
  • Aparna Taneja
  • Milind Tambe

Restless multi-armed bandits (RMABs) extend multi-armed bandits to allow for stateful arms, where the state of each arm evolves restlessly with different transitions depending on whether that arm is pulled. Solving RMABs requires information on transition dynamics, which are often unknown upfront. To plan in RMAB settings with unknown transitions, we propose the first online learning algorithm based on the Whittle index policy, using an upper confidence bound (UCB) approach to learn transition dynamics. Specifically, we estimate confidence bounds of the transition probabilities and formulate a bilinear program to compute optimistic Whittle indices using these estimates. Our algorithm, UCWhittle, achieves sublinear O(H√(T log T)) frequentist regret to solve RMABs with unknown transitions in T episodes with a constant horizon H. Empirically, we demonstrate that UCWhittle leverages the structure of RMABs and the Whittle index policy solution to achieve better performance than existing online learning baselines across three domains, including one constructed from a real-world maternal and childcare dataset.
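The Whittle index that recurs throughout these RMAB papers has a compact textbook characterization: the index of a state is the passive-action subsidy at which acting and not acting become equally attractive. The following is a minimal sketch for a single two-state arm with known transitions, assuming indexability; it illustrates the generic construction only, not the bilinear-program computation described in the abstract above.

```python
import numpy as np

def arm_q_values(P, R, lam, gamma=0.95, iters=500):
    # Value iteration for one 2-state restless arm, paying subsidy `lam`
    # for the passive action. P[a][s] = P(next state = 1 | action a, state s).
    V = np.zeros(2)
    Q = np.zeros((2, 2))
    for _ in range(iters):
        for s in (0, 1):
            for a in (0, 1):
                ev = P[a][s] * V[1] + (1 - P[a][s]) * V[0]
                Q[s, a] = R[s] + (lam if a == 0 else 0.0) + gamma * ev
        V = Q.max(axis=1)
    return Q

def whittle_index(P, R, state, lo=-5.0, hi=5.0, tol=1e-6):
    # Binary search for the subsidy at which the arm is indifferent
    # between acting and staying passive in `state`.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        Q = arm_q_values(P, R, mid)
        if Q[state, 1] > Q[state, 0]:  # acting still preferred: raise subsidy
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With a positive-reward "engaged" state 1 and an action that raises the chance of reaching it, the index of state 0 comes out positive: the planner would need to be paid to stay passive there.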

AAMAS Conference 2023 Conference Paper

Restless Multi-Armed Bandits for Maternal and Child Health: Results from Decision-Focused Learning

  • Shresth Verma
  • Aditya Mate
  • Kai Wang
  • Neha Madhiwalla
  • Aparna Hegde
  • Aparna Taneja
  • Milind Tambe

Mobile Health Awareness programs in underserved communities often suffer from diminishing engagement over time and health workers have to make live service calls to encourage beneficiaries’ participation. Owing to health workers’ limited availability, we consider the optimization problem of scheduling live service calls in a Maternal and Child Health Awareness Program and model it using Restless Multi-Armed Bandits (RMAB). Since the parameters of the RMAB formulation are unknown, a model is learnt to first predict the parameters of the RMAB problem, which is subsequently solved using the Whittle Index algorithm. However, this Predict-then-Optimize framework maximises predictive accuracy rather than the quality of the final solution. Decision Focused Learning (DFL) solves this mismatch by integrating the optimization problem in the learning pipeline. Previous works have only shown the applicability of DFL in simulation settings. In collaboration with an NGO, we conduct a large-scale field study consisting of 9000 beneficiaries for 6 weeks and track key engagement metrics in a mobile health awareness program. To the best of our knowledge this is the first real-world study involving Decision Focused Learning. We demonstrate that beneficiaries in the DFL group experience statistically significant reductions in cumulative engagement drop, while those in the Predict-then-Optimize group do not. This establishes the practicality of decision-focused learning for real-world problems. We also demonstrate that DFL learns a better decision boundary between the RMAB actions, and strategically predicts parameters for arms which contribute most to the final decision outcome.

AAAI Conference 2023 Conference Paper

Robust Planning over Restless Groups: Engagement Interventions for a Large-Scale Maternal Telehealth Program

  • Jackson A. Killian
  • Arpita Biswas
  • Lily Xu
  • Shresth Verma
  • Vineet Nair
  • Aparna Taneja
  • Aparna Hegde
  • Neha Madhiwalla

In 2020, maternal mortality in India was estimated to be as high as 130 deaths per 100K live births, nearly twice the UN's target. To improve health outcomes, the non-profit ARMMAN sends automated voice messages to expecting and new mothers across India. However, 38% of mothers stop listening to these calls, missing critical preventative care information. To improve engagement, ARMMAN employs health workers to intervene by making service calls, but workers can only call a fraction of the 100K enrolled mothers. Partnering with ARMMAN, we model the problem of allocating limited interventions across mothers as a restless multi-armed bandit (RMAB), where the realities of large scale and model uncertainty present key new technical challenges. We address these with GROUPS, a double oracle–based algorithm for robust planning in RMABs with scalable grouped arms. Robustness over grouped arms requires several methodological advances. First, to adversarially select stochastic group dynamics, we develop a new method to optimize Whittle indices over transition probability intervals. Second, to learn group-level RMAB policy best responses to these adversarial environments, we introduce a weighted index heuristic. Third, we prove a key theoretical result that planning over grouped arms achieves the same minimax regret-optimal strategy as planning over individual arms, under a technical condition. Finally, using real-world data from ARMMAN, we show that GROUPS produces robust policies that reduce minimax regret by up to 50%, halving the number of preventable missed voice messages to connect more mothers with life-saving maternal health information.

AAAI Conference 2023 Conference Paper

Scalable Decision-Focused Learning in Restless Multi-Armed Bandits with Application to Maternal and Child Health

  • Kai Wang
  • Shresth Verma
  • Aditya Mate
  • Sanket Shah
  • Aparna Taneja
  • Neha Madhiwalla
  • Aparna Hegde
  • Milind Tambe

This paper studies restless multi-armed bandit (RMAB) problems with unknown arm transition dynamics but with known correlated arm features. The goal is to learn a model to predict transition dynamics given features, where the Whittle index policy solves the RMAB problems using predicted transitions. However, prior works often learn the model by maximizing the predictive accuracy instead of final RMAB solution quality, causing a mismatch between training and evaluation objectives. To address this shortcoming, we propose a novel approach for decision-focused learning in RMAB that directly trains the predictive model to maximize the Whittle index solution quality. We present three key contributions: (i) we establish differentiability of the Whittle index policy to support decision-focused learning; (ii) we significantly improve the scalability of decision-focused learning approaches in sequential problems, specifically RMAB problems; (iii) we apply our algorithm to a previously collected dataset of maternal and child health to demonstrate its performance. Indeed, our algorithm is the first for decision-focused learning in RMAB that scales to real-world problem sizes.

IJCAI Conference 2022 Conference Paper

ADVISER: AI-Driven Vaccination Intervention Optimiser for Increasing Vaccine Uptake in Nigeria

  • Vineet Nair
  • Kritika Prakash
  • Michael Wilbur
  • Aparna Taneja
  • Corrine Namblard
  • Oyindamola Adeyemo
  • Abhishek Dubey
  • Abiodun Adereni

More than 5 million children under five years die from largely preventable or treatable medical conditions every year, with an overwhelmingly large proportion of deaths occurring in under-developed countries with low vaccination uptake. One of the United Nations' sustainable development goals (SDG 3) aims to end preventable deaths of newborns and children under five years of age. We focus on Nigeria, where the rate of infant mortality is appalling. We collaborate with HelpMum, a large non-profit organization in Nigeria, to design and optimize the allocation of heterogeneous health interventions under uncertainty to increase vaccination uptake, the first such collaboration in Nigeria. Our framework, ADVISER: AI-Driven Vaccination Intervention Optimiser, is based on an integer linear program that seeks to maximize the cumulative probability of successful vaccination. Our optimization formulation is intractable in practice. We present a heuristic approach that enables us to solve the problem for real-world use-cases. We also present theoretical bounds for the heuristic method. Finally, we show that the proposed approach outperforms baseline methods in terms of vaccination uptake through experimental evaluation. HelpMum is currently planning a pilot program based on our approach to be deployed in the largest city of Nigeria, which would be the first deployment of an AI-driven vaccination uptake program in the country and hopefully, pave the way for other data-driven programs to improve health outcomes in Nigeria.

AAAI Conference 2022 Conference Paper

Coordinating Followers to Reach Better Equilibria: End-to-End Gradient Descent for Stackelberg Games

  • Kai Wang
  • Lily Xu
  • Andrew Perrault
  • Michael K. Reiter
  • Milind Tambe

A growing body of work in game theory extends the traditional Stackelberg game to settings with one leader and multiple followers who play a Nash equilibrium. Standard approaches for computing equilibria in these games reformulate the followers’ best response as constraints in the leader’s optimization problem. These reformulation approaches can sometimes be effective, but make limiting assumptions on the followers’ objectives and the equilibrium reached by followers, e.g., uniqueness, optimism, or pessimism. To overcome these limitations, we run gradient descent to update the leader’s strategy by differentiating through the equilibrium reached by followers. Our approach generalizes to any stochastic equilibrium selection procedure that chooses from multiple equilibria, where we compute the stochastic gradient by back-propagating through a sampled Nash equilibrium using the solution to a partial differential equation to establish the unbiasedness of the stochastic gradient. Using the unbiased gradient estimate, we implement the gradient-based approach to solve three Stackelberg problems with multiple followers. Our approach consistently outperforms existing baselines to achieve higher utility for the leader.

NeurIPS Conference 2022 Conference Paper

Decision-Focused Learning without Decision-Making: Learning Locally Optimized Decision Losses

  • Sanket Shah
  • Kai Wang
  • Bryan Wilder
  • Andrew Perrault
  • Milind Tambe

Decision-Focused Learning (DFL) is a paradigm for tailoring a predictive model to a downstream optimization task that uses its predictions in order to perform better on that specific task. The main technical challenge associated with DFL is that it requires being able to differentiate through the optimization problem, which is difficult due to discontinuous solutions and other challenges. Past work has largely gotten around this issue by handcrafting task-specific surrogates to the original optimization problem that provide informative gradients when differentiated through. However, the need to handcraft surrogates for each new task limits the usability of DFL. In addition, there are often no guarantees about the convexity of the resulting surrogates and, as a result, training a predictive model using them can lead to inferior local optima. In this paper, we do away with surrogates altogether and instead learn loss functions that capture task-specific information. To the best of our knowledge, ours is the first approach that entirely replaces the optimization component of decision-focused learning with a loss that is automatically learned. Our approach (a) only requires access to a black-box oracle that can solve the optimization problem and is thus generalizable, and (b) can be convex by construction and so can be easily optimized over. We evaluate our approach on three resource allocation problems from the literature and find that our approach outperforms learning without taking into account task-structure in all three domains, and even hand-crafted surrogates from the literature.
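The "learned, convex-by-construction loss" idea in the abstract above can be illustrated with a toy sketch: sample candidate predictions, query the black-box oracle for the decision regret each induces, and fit a convex function to those observations. Everything here, including the quadratic parameterization and the function name `fit_quadratic_loss`, is an illustrative assumption, not the paper's actual method.

```python
import numpy as np

def fit_quadratic_loss(y_true, y_samples, decision_regret):
    # Fit L(y_hat) = a * (y_hat - y_true)^2 + b to the decision regret
    # observed at sampled predictions (least squares, then clamp a >= 0
    # so the learned loss is convex in y_hat by construction).
    z = (np.asarray(y_samples, dtype=float) - y_true) ** 2
    A = np.vstack([z, np.ones_like(z)]).T
    (a, b), *_ = np.linalg.lstsq(A, np.asarray(decision_regret, dtype=float),
                                 rcond=None)
    return max(a, 0.0), b
```

The fitted loss can then be minimized with ordinary gradient descent when training the predictive model, sidestepping differentiation through the optimizer itself.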

AAMAS Conference 2022 Conference Paper

Efficient Algorithms for Finite Horizon and Streaming Restless Multi-Armed Bandit Problems

  • Aditya S. Mate
  • Arpita Biswas
  • Christoph Siebenbrunner
  • Susobhan Ghosh
  • Milind Tambe

We propose Streaming Bandits, a Restless Multi-Armed Bandit (RMAB) framework in which heterogeneous arms may arrive and leave the system after staying on for a finite lifetime. Streaming Bandits naturally capture the health-intervention planning problem, where health workers must manage the health outcomes of a patient cohort while new patients join and existing patients leave the cohort each day. Our contributions are as follows: (1) We derive conditions under which our problem satisfies indexability, a precondition that guarantees the existence and asymptotic optimality of the Whittle Index solution for RMABs. We establish the conditions using a polytime reduction of the Streaming Bandit setup to regular RMABs. (2) We further prove a phenomenon that we call index decay — whereby the Whittle index values are low for short residual lifetimes — driving the intuition underpinning our algorithm. (3) We propose a novel and efficient algorithm to compute the index-based solution for Streaming Bandits. Unlike previous methods, our algorithm does not rely on solving the costly finite horizon problem on each arm of the RMAB, thereby lowering the computational complexity compared to existing methods. (4) Finally, we evaluate our approach via simulations run on real-world data sets from a tuberculosis patient monitoring task and an intervention planning task for improving maternal healthcare, in addition to other synthetic domains. Across the board, our algorithm achieves a 2-orders-of-magnitude speed-up over existing methods while maintaining the same solution quality. The full paper is available at: https://arxiv.org/pdf/2103.04730.pdf

IJCAI Conference 2022 Conference Paper

Evolutionary Approach to Security Games with Signaling

  • Adam Żychowski
  • Jacek Mańdziuk
  • Elizabeth Bondi
  • Aravind Venugopal
  • Milind Tambe
  • Balaraman Ravindran

Green Security Games have become a popular way to model scenarios involving the protection of natural resources, such as wildlife. Sensors (e.g. drones equipped with cameras) have also begun to play a role in these scenarios by providing real-time information. Incorporating both human and sensor defender resources strategically is the subject of recent work on Security Games with Signaling (SGS). However, current methods to solve SGS do not scale well in terms of time or memory. We therefore propose a novel approach to SGS, which, for the first time in this domain, employs an Evolutionary Computation paradigm: EASGS. EASGS effectively searches the huge SGS solution space via suitable solution encoding in a chromosome and a specially-designed set of operators. The operators include three types of mutations, each focusing on a particular aspect of the SGS solution, optimized crossover and a local coverage improvement scheme (a memetic aspect of EASGS). We also introduce a new set of benchmark games, based on dense or locally-dense graphs that reflect real-world SGS settings. In the majority of 342 test game instances, EASGS outperforms state-of-the-art methods, including a reinforcement learning method, in terms of time scalability, nearly constant memory utilization, and quality of the returned defender's strategies (expected payoffs).

AAAI Conference 2022 Conference Paper

Field Study in Deploying Restless Multi-Armed Bandits: Assisting Non-profits in Improving Maternal and Child Health

  • Aditya Mate
  • Lovish Madaan
  • Aparna Taneja
  • Neha Madhiwalla
  • Shresth Verma
  • Gargi Singh
  • Aparna Hegde
  • Pradeep Varakantham

The widespread availability of cell phones has enabled nonprofits to deliver critical health information to their beneficiaries in a timely manner. This paper describes our work to assist non-profits that employ automated messaging programs to deliver timely preventive care information to beneficiaries (new and expecting mothers) during pregnancy and after delivery. Unfortunately, a key challenge in such information delivery programs is that a significant fraction of beneficiaries drop out of the program. Yet, non-profits often have limited health-worker resources (time) to place crucial service calls for live interaction with beneficiaries to prevent such engagement drops. To assist non-profits in optimizing this limited resource, we developed a Restless Multi-Armed Bandits (RMABs) system. One key technical contribution in this system is a novel clustering method of offline historical data to infer unknown RMAB parameters. Our second major contribution is evaluation of our RMAB system in collaboration with an NGO, via a real-world service quality improvement study. The study compared strategies for optimizing service calls to 23003 participants over a period of 7 weeks to reduce engagement drops. We show that the RMAB group provides statistically significant improvement over other comparison groups, reducing engagement drops by ∼30%. To the best of our knowledge, this is the first study demonstrating the utility of RMABs in real-world public health settings. We are transitioning our RMAB system to the NGO for real-world use.

AAMAS Conference 2022 Conference Paper

Networked Restless Multi-Armed Bandits for Mobile Interventions

  • Han-Ching Ou
  • Christoph Siebenbrunner
  • Jackson Killian
  • Meredith B. Brooks
  • David Kempe
  • Yevgeniy Vorobeychik
  • Milind Tambe

Motivated by a broad class of mobile intervention problems, we propose and study restless multi-armed bandits (RMABs) with network effects. In our model, arms are partially recharging and connected through a graph, so that pulling one arm also improves the state of neighboring arms, significantly extending the previously studied setting of fully recharging bandits with no network effects. In mobile interventions, network effects may arise due to regular population movements (such as commuting between home and work). We show that network effects in RMABs induce strong reward coupling that is not accounted for by existing solution methods. We propose a new solution approach for networked RMABs, exploiting concavity properties which arise under natural assumptions on the structure of intervention effects. We provide sufficient conditions for optimality of our approach in idealized settings and demonstrate that it empirically outperforms state-of-the-art baselines in three mobile intervention domains using real-world graphs.

IJCAI Conference 2022 Conference Paper

Ranked Prioritization of Groups in Combinatorial Bandit Allocation

  • Lily Xu
  • Arpita Biswas
  • Fei Fang
  • Milind Tambe

Preventing poaching through ranger patrols protects endangered wildlife, directly contributing to the UN Sustainable Development Goal 15 of life on land. Combinatorial bandits have been used to allocate limited patrol resources, but existing approaches overlook the fact that each location is home to multiple species in varying proportions, so a patrol benefits each species to differing degrees. When some species are more vulnerable, we ought to offer more protection to these animals; unfortunately, existing combinatorial bandit approaches do not offer a way to prioritize important species. To bridge this gap, (1) We propose a novel combinatorial bandit objective that trades off reward maximization against prioritization over species, which we call ranked prioritization. We show this objective can be expressed as a weighted linear sum of Lipschitz-continuous reward functions. (2) We provide RankedCUCB, an algorithm to select combinatorial actions that optimize our prioritization-based objective, and prove that it achieves asymptotic no-regret. (3) We demonstrate empirically that RankedCUCB leads to up to 38% improvement in outcomes for endangered species using real-world wildlife conservation data. Along with adapting to other challenges such as preventing illegal logging and overfishing, our no-regret algorithm addresses the general combinatorial bandit problem with a weighted linear objective.
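The combinatorial UCB machinery underlying this line of work can be sketched generically: keep a running mean and pull count per location, and each round patrol the k locations with the highest upper confidence bound on the (weighted) reward. This is plain CUCB with illustrative weights standing in for species priorities, not the paper's RankedCUCB objective; all names below are assumptions.

```python
import math
import random

def cucb_select(means, counts, t, k, weights):
    # One CUCB step: choose the k arms with the highest upper confidence
    # bound on the weighted reward.
    ucb = [weights[i] * (means[i] + math.sqrt(1.5 * math.log(t) / counts[i]))
           for i in range(len(means))]
    return sorted(range(len(means)), key=lambda i: -ucb[i])[:k]

def run_cucb(probs, weights, k, T, seed=0):
    # Simulate T rounds of patrolling k of len(probs) locations with
    # Bernoulli rewards; returns how often each location was chosen.
    rng = random.Random(seed)
    n = len(probs)
    means, counts, picks = [0.0] * n, [0] * n, [0] * n
    for i in range(n):  # pull every arm once to initialise estimates
        counts[i], means[i] = 1, float(rng.random() < probs[i])
    for t in range(2, T + 2):
        for i in cucb_select(means, counts, t, k, weights):
            picks[i] += 1
            r = float(rng.random() < probs[i])
            counts[i] += 1
            means[i] += (r - means[i]) / counts[i]  # incremental mean update
    return picks
```

With equal weights this reduces to reward-maximizing CUCB: over many rounds the high-probability locations absorb almost all of the patrol budget.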

UAI Conference 2022 Conference Paper

Restless and uncertain: Robust policies for restless bandits via deep multi-agent reinforcement learning

  • Jackson A. Killian
  • Lily Xu
  • Arpita Biswas
  • Milind Tambe

We introduce robustness in restless multi-armed bandits (RMABs), a popular model for constrained resource allocation among independent stochastic processes (arms). Nearly all RMAB techniques assume stochastic dynamics are precisely known. However, in many real-world settings, dynamics are estimated with significant uncertainty, e.g., via historical data, which can lead to bad outcomes if ignored. To address this, we develop an algorithm to compute minimax regret–robust policies for RMABs. Our approach uses a double oracle framework (oracles for agent and nature), which is often used for single-process robust planning but requires significant new techniques to accommodate the combinatorial nature of RMABs. Specifically, we design a deep reinforcement learning (RL) algorithm, DDLPO, which tackles the combinatorial challenge by learning an auxiliary “λ-network” in tandem with policy networks per arm, greatly reducing sample complexity, with guarantees on convergence. DDLPO, of general interest, implements our reward-maximizing agent oracle. We then tackle the challenging regret-maximizing nature oracle, a non-stationary RL challenge, by formulating it as a multi-agent RL problem between a policy optimizer and adversarial nature. This formulation is of general interest—we solve it for RMABs by creating a multi-agent extension of DDLPO with a shared critic. We show our approaches work well in three experimental domains.

UAI Conference 2022 Conference Paper

Solving structured hierarchical games using differential backward induction

  • Zun Li 0002
  • Feiran Jia
  • Aditya Mate
  • Shahin Jabbari
  • Mithun Chakraborty
  • Milind Tambe
  • Yevgeniy Vorobeychik

From large-scale organizations to decentralized political systems, hierarchical strategic decision making is commonplace. We introduce a novel class of structured hierarchical games (SHGs) that formally capture such hierarchical strategic interactions. In an SHG, each player is a node in a tree, and strategic choices of players are sequenced from root to leaves, with root moving first, followed by its children, then followed by their children, and so on until the leaves. A player’s utility in an SHG depends on its own decision, and on the choices of its parent and all the tree leaves. SHGs thus generalize simultaneous-move games, as well as Stackelberg games with many followers. We leverage the structure of both the sequence of player moves as well as payoff dependence to develop a gradient-based back propagation-style algorithm, which we call Differential Backward Induction (DBI), for approximating equilibria of SHGs. We provide a sufficient condition for convergence of DBI and demonstrate its efficacy in finding approximate equilibrium solutions to several SHG models of hierarchical policy-making problems.

AAMAS Conference 2021 Conference Paper

Active Screening for Recurrent Diseases: A Reinforcement Learning Approach

  • Han-Ching Ou
  • Haipeng Chen
  • Shahin Jabbari
  • Milind Tambe

Active screening is a common approach in controlling the spread of recurring infectious diseases such as tuberculosis and influenza. In this approach, health workers periodically select a subset of population for screening. However, given the limited number of health workers, only a small subset of the population can be visited in any given time period. Given the recurrent nature of the disease and rapid spreading, the goal is to minimize the number of infections over a long time horizon. Active screening can be formalized as a sequential combinatorial optimization over the network of people and their connections. The main computational challenges in this formalization arise from i) the combinatorial nature of the problem, ii) the need of sequential planning and iii) the uncertainties in the infectiousness states of the population. Previous works on active screening fail to scale to large time horizons while fully considering the future effect of current interventions. In this paper, we propose a novel reinforcement learning (RL) approach based on Deep Q-Networks (DQN), with several innovative adaptations that are designed to address the above challenges. First, we use graph convolutional networks (GCNs) to represent the Q-function that exploit the node correlations of the underlying contact network. Second, to avoid solving a combinatorial optimization problem in each time period, we decompose the node set selection as a sub-sequence of decisions, and further design a two-level RL framework that solves the problem in a hierarchical way. Finally, to speed up the slow convergence of RL which arises from reward sparseness, we incorporate ideas from curriculum learning into our hierarchical RL approach. We evaluate our RL algorithm on several real-world networks. Results show that our RL algorithm can scale up to 10 times the problem size of state-of-the-art (the variant that considers the effect of future interventions but is unscalable) in terms of planning time horizon. Meanwhile, it outperforms state-of-the-art (the variant that scales up but does not consider the effect of future interventions) by up to 33% in solution quality.
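The sequential decomposition described in this abstract — replacing one size-k combinatorial choice with k one-node-at-a-time choices — can be sketched generically. Here `q_fn` stands in for a learned Q-function over (nodes already chosen, candidate node); the name and greedy scoring rule are illustrative assumptions, not the paper's trained GCN.

```python
def select_nodes(q_fn, candidates, k):
    # Decompose a size-k combinatorial screening action into k sequential
    # choices, each conditioned on the nodes already chosen.
    chosen = []
    for _ in range(k):
        best = max((c for c in candidates if c not in chosen),
                   key=lambda c: q_fn(chosen, c))
        chosen.append(best)
    return chosen
```

This shrinks the per-step action space from C(n, k) joint actions to k passes over at most n candidates, which is what makes RL training tractable at scale.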

AAMAS Conference 2021 Conference Paper

Beyond "To Act or Not to Act": Fast Lagrangian Approaches to General Multi-Action Restless Bandits

  • Jackson A. Killian
  • Andrew Perrault
  • Milind Tambe

This paper presents new algorithms and theoretical results for solutions to Multi-action Multi-armed Restless Bandits, an important but insufficiently studied generalization of traditional Multi-armed Restless Bandits (MARBs). Though MARBs are popular for modeling many problems, they are restricted to binary actions, i.e., "to act or not to act". This renders them unable to capture critical complexities faced by planners in real domains, such as a system manager balancing maintenance, repair, and job scheduling, or a health worker deciding among treatments for a given patient. Limited previous work on Multi-action MARBs has only been specialized to subproblems. Here we derive multiple algorithms for use on general Multi-action MARBs using Lagrangian relaxation techniques, leading to the following contributions: (i) We develop BLam, a bound-optimization algorithm which leverages problem convexity to quickly and provably converge to the well-performing Lagrange policy; (ii) We develop SampleLam, a fast sampling technique for estimating the Lagrange policy, and derive a concentration bound to investigate its convergence properties; (iii) We derive best- and worst-case computational complexities for our algorithms as well as our main competitor; (iv) We provide experimental results comparing our algorithms to baselines on simulated distributions, including one motivated by a real-world community health intervention task. Our approach achieves significant, up to ten-fold speedups over more general methods without sacrificing performance and is widely applicable across general Multi-action MARBs. Code is available at https://github.com/killian-34/MAMARB-Lagrange-Policies.

AAAI Conference 2021 Conference Paper

Clinical Trial of an AI-Augmented Intervention for HIV Prevention in Youth Experiencing Homelessness

  • Bryan Wilder
  • Laura Onasch-Vera
  • Graham Diguiseppi
  • Robin Petering
  • Chyna Hill
  • Amulya Yadav
  • Eric Rice
  • Milind Tambe

Youth experiencing homelessness (YEH) are subject to substantially greater risk of HIV infection, compounded both by their lack of access to stable housing and the disproportionate representation of youth of marginalized racial, ethnic, and gender identity groups among YEH. A key goal for health equity is to improve adoption of protective behaviors in this population. One promising strategy for intervention is to recruit peer leaders from the population of YEH to promote behaviors such as condom usage and regular HIV testing to their social contacts. This raises a computational question: which youth should be selected as peer leaders to maximize the overall impact of the intervention? We developed an artificial intelligence system to optimize such social network interventions in a community health setting. We conducted a clinical trial enrolling 713 YEH at drop-in centers in a large US city. The clinical trial compared interventions planned with the algorithm to those where the highest-degree nodes in the youths’ social network were recruited as peer leaders (the standard method in public health) and to an observation-only control group. Results from the clinical trial show that youth in the AI group experience statistically significant reductions in key risk behaviors for HIV transmission, while those in the other groups do not. This provides, to our knowledge, the first empirical validation of the usage of AI methods to optimize social network interventions for health. We conclude by discussing lessons learned over the course of the project which may inform future attempts to use AI in community-level interventions.

UAI Conference 2021 Conference Paper

Contingency-aware influence maximization: A reinforcement learning approach

  • Haipeng Chen 0001
  • Wei Qiu 0001
  • Han-Ching Ou
  • Bo An 0001
  • Milind Tambe

The influence maximization (IM) problem aims at finding a subset of seed nodes in a social network that maximizes the spread of influence. In this study, we focus on a sub-class of IM problems in which it is uncertain whether the nodes are willing to be seeds when invited, called contingency-aware IM. Such contingency-aware IM is critical for applications run by non-profit organizations in low-resource communities (e.g., spreading awareness of disease prevention). Despite initial successes, a major practical obstacle to promoting the solutions to more communities is the tremendous runtime of the greedy algorithms and the non-profits' lack of access to high-performance computing (HPC) in the field: whenever there is a new social network, the non-profits usually do not have the HPC resources to recalculate the solutions. Motivated by this, and inspired by the line of work that uses reinforcement learning (RL) to address combinatorial optimization on graphs, we formalize the problem as a Markov Decision Process (MDP) and use RL to learn an IM policy over historically seen networks that generalizes to unseen networks with negligible runtime at the test phase. To fully exploit the properties of our targeted problem, we propose two technical innovations that improve on existing methods: state abstraction and theoretically grounded reward shaping. Empirical results show that our method achieves influence as high as the state-of-the-art methods for contingency-aware IM, while having negligible runtime at the test phase.

AAAI Conference 2021 Conference Paper

Dual-Mandate Patrols: Multi-Armed Bandits for Green Security

  • Lily Xu
  • Elizabeth Bondi
  • Fei Fang
  • Andrew Perrault
  • Kai Wang
  • Milind Tambe

Conservation efforts in green security domains to protect wildlife and forests are constrained by the limited availability of defenders (i.e., patrollers), who must patrol vast areas to protect against attackers (e.g., poachers or illegal loggers). Defenders must choose how much time to spend in each region of the protected area, balancing exploration of infrequently visited regions and exploitation of known hotspots. We formulate the problem as a stochastic multi-armed bandit, where each action represents a patrol strategy, enabling us to guarantee the rate of convergence of the patrolling policy. However, a naive bandit approach would compromise short-term performance for long-term optimality, resulting in animals poached and forests destroyed. To speed up performance, we leverage smoothness in the reward function and decomposability of actions. We show a synergy between Lipschitz continuity and decomposition, as each aids the convergence of the other. In doing so, we bridge the gap between combinatorial and Lipschitz bandits, presenting a no-regret approach that tightens existing guarantees while optimizing for short-term performance. We demonstrate that our algorithm, LIZARD, improves performance on real-world poaching data from Cambodia.
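The way smoothness tightens confidence bounds can be sketched as follows: under a Lipschitz assumption, a rarely visited region can borrow a tight bound from a well-sampled neighbor (a simplified illustration of the idea, not the LIZARD algorithm itself; all names below are hypothetical):

```python
import math

def lipschitz_ucb(means, counts, positions, L, t):
    """Upper confidence bounds tightened by Lipschitz smoothness: arm i's
    bound is the minimum over all arms j of mean_j + conf_j + L*d(i, j),
    so a rarely pulled arm inherits a tight bound from a nearby,
    well-sampled one."""
    naive = [m + math.sqrt(2 * math.log(t) / max(n, 1))
             for m, n in zip(means, counts)]
    return [min(naive[j] + L * abs(positions[i] - positions[j])
                for j in range(len(means)))
            for i in range(len(means))]

# arm 1 was pulled once, but the nearby arm 0 was pulled 100 times:
bounds = lipschitz_ucb(means=[0.5, 0.5], counts=[100, 1],
                       positions=[0.0, 1.0], L=0.1, t=100)
# bounds[1] is far below the naive bound 0.5 + sqrt(2*ln(100)/1) ~ 3.5
```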

AAAI Conference 2021 Conference Paper

Fair Influence Maximization: a Welfare Optimization Approach

  • Aida Rahmattalabi
  • Shahin Jabbari
  • Himabindu Lakkaraju
  • Phebe Vayanos
  • Max Izenberg
  • Ryan Brown
  • Eric Rice
  • Milind Tambe

Several behavioral, social, and public health interventions, such as suicide/HIV prevention or community preparedness against natural disasters, leverage social network information to maximize outreach. Algorithmic influence maximization techniques have been proposed to aid with the choice of “peer leaders” or “influencers” in such interventions. Yet, traditional algorithms for influence maximization have not been designed with these interventions in mind. As a result, they may disproportionately exclude minority communities from the benefits of the intervention. This has motivated research on fair influence maximization. Existing techniques come with two major drawbacks. First, they require committing to a single fairness measure. Second, these measures are typically imposed as strict constraints leading to undesirable properties such as wastage of resources. To address these shortcomings, we provide a principled characterization of the properties that a fair influence maximization algorithm should satisfy. In particular, we propose a framework based on social welfare theory, wherein the cardinal utilities derived by each community are aggregated using the isoelastic social welfare functions. Under this framework, the trade-off between fairness and efficiency can be controlled by a single inequality aversion design parameter. We then show under what circumstances our proposed principles can be satisfied by a welfare function. The resulting optimization problem is monotone and submodular and can be solved efficiently with optimality guarantees. Our framework encompasses as special cases leximin and proportional fairness. Extensive experiments on synthetic and real-world datasets including a case study on landslide risk management demonstrate the efficacy of the proposed framework.
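The isoelastic aggregation described above has a compact form; a minimal sketch shows how a single inequality-aversion parameter trades efficiency against fairness (illustrative only; the paper's exact parameterization may differ):

```python
import math

def isoelastic_welfare(utilities, aversion):
    """Aggregate per-community utilities with an isoelastic social
    welfare function; `aversion` >= 0 is the inequality-aversion
    parameter (0 = utilitarian sum, 1 = log welfare)."""
    if aversion == 1.0:
        return sum(math.log(u) for u in utilities)
    a = 1.0 - aversion
    return sum(u ** a / a for u in utilities)

# equal utilities vs. unequal utilities with the same total:
even = isoelastic_welfare([2.0, 2.0], aversion=2.0)
uneven = isoelastic_welfare([3.0, 1.0], aversion=2.0)
print(even > uneven)  # inequality is penalized -> True
```

With `aversion=0` the function reduces to the plain sum of utilities (pure efficiency); larger values increasingly favor the worst-off community.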

IJCAI Conference 2021 Conference Paper

Learn to Intervene: An Adaptive Learning Policy for Restless Bandits in Application to Preventive Healthcare

  • Arpita Biswas
  • Gaurav Aggarwal
  • Pradeep Varakantham
  • Milind Tambe

In many public health settings, it is important for patients to adhere to health programs, such as taking medications and periodic health checks. Unfortunately, beneficiaries may gradually disengage from such programs, which is detrimental to their health. A concrete example of gradual disengagement has been observed by an organization that carries out a free automated call-based program for spreading preventive care information among pregnant women. Many women stop picking up calls after being enrolled for a few months. To avoid such disengagements, it is important to provide timely interventions. Such interventions are often expensive and can be provided to only a small fraction of the beneficiaries. We model this scenario as a restless multi-armed bandit (RMAB) problem, where each beneficiary is assumed to transition from one state to another depending on the intervention. Moreover, since the transition probabilities are unknown a priori, we propose a Whittle index based Q-Learning mechanism and show that it converges to the optimal solution. Our method improves over existing learning-based methods for RMABs on multiple benchmarks from literature and also on the maternal healthcare dataset.
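The Whittle-index-based Q-learning idea can be sketched for a single two-state arm: learn Q-values and read an index estimate off as the gap between acting and staying passive (a simplified illustration under these assumptions, not the paper's exact mechanism; all names are hypothetical):

```python
import random

def learn_arm(p_good, episodes=5000, alpha=0.1, gamma=0.9,
              subsidy=0.0, eps=0.2, seed=0):
    """Tabular Q-learning for one two-state restless arm.
    p_good[s][a] = probability of landing in the 'good' state (1) from
    state s under action a (0 = passive, 1 = intervene); reward is 1 in
    the good state, plus a subsidy for staying passive."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0], [0.0, 0.0]]  # Q[state][action]
    s = 0
    for _ in range(episodes):
        # epsilon-greedy action selection
        a = rng.randrange(2) if rng.random() < eps else max((0, 1), key=lambda x: Q[s][x])
        s2 = 1 if rng.random() < p_good[s][a] else 0
        r = float(s2) + (subsidy if a == 0 else 0.0)
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2
    # Whittle-style index estimate: how much intervening beats staying passive
    index = [Q[st][1] - Q[st][0] for st in (0, 1)]
    return Q, index

Q, index = learn_arm(p_good=[[0.2, 0.9], [0.2, 0.9]])
```

In the multi-arm setting, one would run such a learner per beneficiary and intervene on the arms with the largest index estimates each round.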

AAMAS Conference 2021 Conference Paper

Learning Index Policies for Restless Bandits with Application to Maternal Healthcare

  • Arpita Biswas
  • Gaurav Aggarwal
  • Pradeep Varakantham
  • Milind Tambe

In many community health settings, it is crucial to have a systematic monitoring and intervention process to ensure that patients adhere to healthcare programs, such as periodic health checks or taking medications. When these interventions are expensive, they can be provided to only a fixed small fraction of the patients at any period of time. Hence, it is important to carefully choose the beneficiaries who should be provided with interventions and when. We model this scenario as a restless multi-armed bandit (RMAB) problem, where each beneficiary is assumed to transition from one state to another depending on the intervention provided to them. In practice, the transition probabilities are unknown a priori, and hence we propose a mechanism that balances the explore-exploit trade-off. Empirically, we find that our proposed mechanism outperforms the baseline intervention scheme on a maternal healthcare dataset.

NeurIPS Conference 2021 Conference Paper

Learning MDPs from Features: Predict-Then-Optimize for Sequential Decision Making by Reinforcement Learning

  • Kai Wang
  • Sanket Shah
  • Haipeng Chen
  • Andrew Perrault
  • Finale Doshi-Velez
  • Milind Tambe

In the predict-then-optimize framework, the objective is to train a predictive model, mapping from environment features to parameters of an optimization problem, which maximizes decision quality when the optimization is subsequently solved. Recent work on decision-focused learning shows that embedding the optimization problem in the training pipeline can improve decision quality and help generalize better to unseen tasks compared to relying on an intermediate loss function for evaluating prediction quality. We study the predict-then-optimize framework in the context of sequential decision problems (formulated as MDPs) that are solved via reinforcement learning. In particular, we are given environment features and a set of trajectories from training MDPs, which we use to train a predictive model that generalizes to unseen test MDPs without trajectories. Two significant computational challenges arise in applying decision-focused learning to MDPs: (i) large state and action spaces make it infeasible for existing techniques to differentiate through MDP problems, and (ii) the high-dimensional policy space, as parameterized by a neural network, makes differentiating through a policy expensive. We resolve the first challenge by sampling provably unbiased derivatives to approximate and differentiate through optimality conditions, and the second challenge by using a low-rank approximation to the high-dimensional sample-based derivatives. We implement both Bellman-based and policy gradient-based decision-focused learning on three different MDP problems with missing parameters, and show that decision-focused learning performs better in generalization to unseen tasks.

AAMAS Conference 2021 Conference Paper

Reinforcement Learning for Unified Allocation and Patrolling in Signaling Games with Uncertainty

  • Aravind Venugopal
  • Elizabeth Bondi
  • Harshavardhan Kamarthi
  • Keval Dholakia
  • Balaraman Ravindran
  • Milind Tambe

Green Security Games (GSGs) have been successfully used in the protection of valuable resources such as fisheries, forests, and wildlife. Real-world deployment involves both resource allocation and subsequent coordinated patrolling with communication in the presence of real-time, uncertain information. Previous game models do not address both of these stages simultaneously. Furthermore, adopting existing solution strategies is difficult since they do not scale well for larger, more complex variants of the game models. We propose a novel GSG model to address these challenges. We also present a novel algorithm, CombSGPO, to compute a defender strategy for this game model. CombSGPO performs policy search over a multidimensional, discrete action space to compute an allocation strategy that is best suited to a best-response patrolling strategy for the defender, learnt by training a multi-agent Deep Q-Network. We show via experiments that CombSGPO converges to better strategies and is more scalable than comparable approaches. From a detailed analysis of the coordination and signaling behavior learnt by CombSGPO, we find that strategic signaling emerges in the final learnt strategy.

AAMAS Conference 2021 Conference Paper

Risk-Aware Interventions in Public Health: Planning with Restless Multi-Armed Bandits

  • Aditya Mate
  • Andrew Perrault
  • Milind Tambe

Community Health Workers (CHWs) form an important component of health-care systems globally, especially in low-resource settings. CHWs are often tasked with monitoring the health of and intervening on their patient cohort. Previous work has developed several classes of Restless Multi-Armed Bandits (RMABs) that are computationally tractable and indexable, a condition that guarantees asymptotic optimality, for solving such health monitoring and intervention problems (HMIPs). However, existing solutions to HMIPs fail to account for risk-sensitivity considerations of CHWs in the planning stage and may run the danger of ignoring some patients completely because they are deemed less valuable to intervene on. They also rely on patients reporting their state of adherence accurately when intervened upon. Towards tackling these issues, our contributions in this paper are as follows: (1) We develop an RMAB solution to HMIPs that allows for reward functions that are monotone increasing, rather than linear, in the belief state and also supports a wider class of observations. (2) We prove theoretical guarantees on the asymptotic optimality of our algorithm for any arbitrary reward function. Additionally, we show that for the specific reward function considered in previous work, our theoretical conditions are stronger than the state-of-the-art guarantees. (3) We show the applicability of these new results for addressing the three issues pertaining to risk-sensitive planning, equitable allocation, and reliance on perfect observations as highlighted above. We evaluate these techniques on both simulated as well as real data from a prevalent CHW task of monitoring adherence of tuberculosis patients to their prescribed medication in Mumbai, India, and show improved performance over the state of the art. Full paper and code are available at: https://github.com/AdityaMate/risk-aware-bandits.

UAI Conference 2021 Conference Paper

Robust reinforcement learning under minimax regret for green security

  • Lily Xu
  • Andrew Perrault
  • Fei Fang 0001
  • Haipeng Chen 0001
  • Milind Tambe

Green security domains feature defenders who plan patrols in the face of uncertainty about the adversarial behavior of poachers, illegal loggers, and illegal fishers. Importantly, the deterrence effect of patrols on adversaries’ future behavior makes patrol planning a sequential decision-making problem. Therefore, we focus on robust sequential patrol planning for green security following the minimax regret criterion, which has not been considered in the literature. We formulate the problem as a game between the defender and nature who controls the parameter values of the adversarial behavior and design an algorithm MIRROR to find a robust policy. MIRROR uses two reinforcement learning–based oracles and solves a restricted game considering limited defender strategies and parameter values. We evaluate MIRROR on real-world poaching data.
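For a finite restricted game, the minimax regret criterion at the core of this formulation reduces to a small computation over a payoff table (a toy sketch of the criterion itself, not of MIRROR's oracle-based algorithm):

```python
def minimax_regret_policy(payoff):
    """Pick the row (defender policy) minimizing worst-case regret over
    columns (nature's parameter settings), where
    regret(i, j) = best payoff in column j minus payoff[i][j]."""
    n_cols = len(payoff[0])
    col_best = [max(row[j] for row in payoff) for j in range(n_cols)]
    regrets = [max(col_best[j] - row[j] for j in range(n_cols))
               for row in payoff]
    best = min(range(len(payoff)), key=lambda i: regrets[i])
    return best, regrets

payoff = [[3, 0],
          [2, 2],
          [0, 3]]
best, regrets = minimax_regret_policy(payoff)
# col_best = [3, 3]; regrets = [3, 1, 3] -> policy 1 has minimax regret
```

Note that policy 1 is never the best response to either parameter setting, yet it is the robust choice: its regret is bounded whichever parameters nature picks.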

AAAI Conference 2021 Conference Paper

Tracking Disease Outbreaks from Sparse Data with Bayesian Inference

  • Bryan Wilder
  • Michael Mina
  • Milind Tambe

The COVID-19 pandemic provides new motivation for a classic problem in epidemiology: estimating the empirical rate of transmission during an outbreak (formally, the time-varying reproduction number) from case counts. While standard methods exist, they work best at coarse-grained national or state scales with abundant data, and struggle to accommodate the partial observability and sparse data common at finer scales (e.g., individual schools or towns). For example, case counts may be sparse when only a small fraction of infections are caught by a testing program. Or, whether an infected individual tests positive may depend on the kind of test and the point in time when they are tested. We propose a Bayesian framework which accommodates partial observability in a principled manner. Our model places a Gaussian process prior over the unknown reproduction number at each time step and models observations sampled from the distribution of a specific testing program. For example, our framework can accommodate a variety of kinds of tests (viral RNA, antibody, antigen, etc.) and sampling schemes (e.g., longitudinal or cross-sectional screening). Inference in this framework is complicated by the presence of tens or hundreds of thousands of discrete latent variables. To address this challenge, we propose an efficient stochastic variational inference method which relies on a novel gradient estimator for the variational objective. Experimental results for an example motivated by COVID-19 show that our method produces an accurate and well-calibrated posterior, while standard methods for estimating the reproduction number can fail badly.

NeurIPS Conference 2020 Conference Paper

Automatically Learning Compact Quality-aware Surrogates for Optimization Problems

  • Kai Wang
  • Bryan Wilder
  • Andrew Perrault
  • Milind Tambe

Solving optimization problems with unknown parameters often requires learning a predictive model to predict the values of the unknown parameters and then solving the problem using these values. Recent work has shown that including the optimization problem as a layer in the model training pipeline results in predictions of the unobserved parameters that lead to higher decision quality. Unfortunately, this process comes at a large computational cost because the optimization problem must be solved and differentiated through in each training iteration; furthermore, it may also sometimes fail to improve solution quality due to non-smoothness issues that arise when training through a complex optimization layer. To address these shortcomings, we learn a low-dimensional surrogate model of a large optimization problem by representing the feasible space in terms of meta-variables, each of which is a linear combination of the original variables. By training a low-dimensional surrogate model end-to-end, and jointly with the predictive model, we achieve: i) a large reduction in training and inference time; and ii) improved performance by focusing attention on the more important variables in the optimization and learning in a smoother space. Empirically, we demonstrate these improvements on a non-convex adversary modeling task, a submodular recommendation task and a convex portfolio optimization task.

NeurIPS Conference 2020 Conference Paper

Collapsing Bandits and Their Application to Public Health Intervention

  • Aditya Mate
  • Jackson Killian
  • Haifeng Xu
  • Andrew Perrault
  • Milind Tambe

We propose and study Collapsing Bandits, a new restless multi-armed bandit (RMAB) setting in which each arm follows a binary-state Markovian process with a special structure: when an arm is played, the state is fully observed, thus “collapsing” any uncertainty, but when an arm is passive, no observation is made, thus allowing uncertainty to evolve. The goal is to keep as many arms in the “good” state as possible by planning a limited budget of actions per round. Such Collapsing Bandits are natural models for many healthcare domains in which health workers must simultaneously monitor patients and deliver interventions in a way that maximizes the health of their patient cohort. Our main contributions are as follows: (i) Building on the Whittle index technique for RMABs, we derive conditions under which the Collapsing Bandits problem is indexable. Our derivation hinges on novel conditions that characterize when the optimal policies may take the form of either “forward” or “reverse” threshold policies. (ii) We exploit the optimality of threshold policies to build fast algorithms for computing the Whittle index, including a closed-form expression. (iii) We evaluate our algorithm on several data distributions, including data from a real-world healthcare task in which a worker must monitor and deliver interventions to maximize their patients’ adherence to tuberculosis medication. Our algorithm achieves a 3-order-of-magnitude speedup compared to state-of-the-art RMAB techniques, while achieving similar performance. The code is available at: https://github.com/AdityaMate/collapsing_bandits
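The “collapsing” belief dynamics can be sketched for a single binary-state arm: playing the arm reveals its state exactly, while a passive arm's belief relaxes toward the chain's stationary distribution (a minimal illustration; the parameter names below are hypothetical):

```python
def evolve_belief(belief, p_stay_good, p_become_good, steps):
    """Passive-arm belief update for a binary-state collapsing bandit:
    with no observation, the belief of being in the 'good' state evolves
    under the Markov chain b' = b*P(good|good) + (1-b)*P(good|bad)."""
    b = belief
    for _ in range(steps):
        b = b * p_stay_good + (1.0 - b) * p_become_good
    return b

def collapse(observed_good):
    """Playing the arm fully observes the state, collapsing uncertainty."""
    return 1.0 if observed_good else 0.0

# after a play observes the good state, belief decays toward the
# chain's stationary probability 0.3/(1 - 0.9 + 0.3) = 0.75:
b = collapse(True)
print(round(evolve_belief(b, 0.9, 0.3, 1), 3))   # 0.9
print(round(evolve_belief(b, 0.9, 0.3, 50), 3))  # 0.75
```

Threshold policies in this setting act on an arm once its belief crosses a cutoff, which is what makes fast Whittle-index computation possible.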

AAAI Conference 2020 Conference Paper

End-to-End Game-Focused Learning of Adversary Behavior in Security Games

  • Andrew Perrault
  • Bryan Wilder
  • Eric Ewing
  • Aditya Mate
  • Bistra Dilkina
  • Milind Tambe

Stackelberg security games are a critical tool for maximizing the utility of limited defense resources to protect important targets from an intelligent adversary. Motivated by green security, where the defender may only observe an adversary’s response to defense on a limited set of targets, we study the problem of learning a defense that generalizes well to a new set of targets with novel feature values and combinations. Traditionally, this problem has been addressed via a two-stage approach where an adversary model is trained to maximize predictive accuracy without considering the defender’s optimization problem. We develop an end-to-end game-focused approach, where the adversary model is trained to maximize a surrogate for the defender’s expected utility. We show both in theory and experimental results that our game-focused approach achieves higher defender expected utility than the two-stage alternative when there is limited data.

UAI Conference 2020 Conference Paper

Robust Spatial-Temporal Incident Prediction

  • Ayan Mukhopadhyay
  • Kai Wang 0040
  • Andrew Perrault
  • Mykel J. Kochenderfer
  • Milind Tambe
  • Yevgeniy Vorobeychik

Spatio-temporal incident prediction is a central issue in law enforcement, with applications in fighting crimes like poaching, human trafficking, illegal fishing, burglaries and smuggling. However, state-of-the-art approaches fail to account for evasion in response to predictive models, a common form of which is spatial shift in incident occurrence. We present a general approach for incident forecasting that is robust to spatial shifts. We propose two techniques for solving the resulting robust optimization problem: first, a constraint generation method guaranteed to yield an optimal solution, and second, a more scalable gradient-based approach. We then apply these techniques to both discrete-time and continuous-time robust incident forecasting. We evaluate our algorithms on two different real-world datasets, demonstrating that our approach is significantly more robust than conventional methods.

AAAI Conference 2020 Conference Paper

Solving Online Threat Screening Games using Constrained Action Space Reinforcement Learning

  • Sanket Shah
  • Arunesh Sinha
  • Pradeep Varakantham
  • Andrew Perrault
  • Milind Tambe

Large-scale screening for potential threats with limited resources and capacity for screening is a problem of interest at airports, seaports, and other ports of entry. Adversaries can observe screening procedures and arrive at a time when there will be gaps in screening due to limited resource capacities. To capture this game between ports and adversaries, this problem has been previously represented as a Stackelberg game, referred to as a Threat Screening Game (TSG). Given the significant complexity associated with solving TSGs and uncertainty in arrivals of customers, existing work has assumed that screenees arrive and are allocated security resources at the beginning of the time window. In practice, screenees such as airport passengers arrive in bursts correlated with flight time and are not bound by fixed time windows. To address this, we propose an online threat screening model in which the screening strategy is determined adaptively as each passenger arrives, while satisfying a hard bound on the acceptable risk of not screening a threat. To solve the online problem, we first reformulate it as a Markov Decision Process (MDP) in which the hard bound on risk translates to a constraint on the action space, and then solve the resultant MDP using Deep Reinforcement Learning (DRL). To this end, we provide a novel way to efficiently enforce linear inequality constraints on the action output in DRL. We show that our solution allows us to significantly reduce screenee wait time without compromising on the risk.
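One simple way to keep a learned screening action inside a linear risk constraint, in the spirit of enforcing inequality constraints on the DRL action output, is projection by uniform scaling (the paper's exact enforcement mechanism may differ; this is an illustrative sketch with hypothetical names):

```python
def enforce_risk_bound(action, risk, bound):
    """Project a screening-allocation action so the linear constraint
    sum(risk_i * action_i) <= bound holds, by uniform scaling back
    toward the feasible set. One simple feasibility trick."""
    total = sum(r * a for r, a in zip(risk, action))
    if total <= bound:
        return list(action)  # already feasible
    scale = bound / total
    return [a * scale for a in action]

# a raw network output with total risk 0.8*1.0 + 0.6*2.0 = 2.0
# gets scaled by 0.5 to meet the bound of 1.0:
a = enforce_risk_bound([0.8, 0.6], risk=[1.0, 2.0], bound=1.0)
```

Layered after the policy network's output, such a map guarantees every executed action respects the hard risk bound, while leaving already-feasible actions untouched.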

AAAI Conference 2020 Conference Paper

To Signal or Not To Signal: Exploiting Uncertain Real-Time Information in Signaling Games for Security and Sustainability

  • Elizabeth Bondi
  • Hoon Oh
  • Haifeng Xu
  • Fei Fang
  • Bistra Dilkina
  • Milind Tambe

Motivated by real-world deployment of drones for conservation, this paper advances the state-of-the-art in security games with signaling. The well-known defender-attacker security games framework can help in planning for such strategic deployments of sensors and human patrollers, and warning signals to ward off adversaries. However, we show that defenders can suffer significant losses when ignoring real-world uncertainties despite carefully planned security game strategies with signaling. In fact, defenders may perform worse than forgoing drones completely in this case. We address this shortcoming by proposing a novel game model that integrates signaling and sensor uncertainty; perhaps surprisingly, we show that defenders can still perform well via a signaling strategy that exploits uncertain real-time information. For example, even in the presence of uncertainty, the defender still has an informational advantage in knowing that she has or has not actually detected the attacker; and she can design a signaling scheme to “mislead” the attacker who is uncertain as to whether he has been detected. We provide theoretical results, a novel algorithm, scale-up techniques, and experimental results from simulation based on our ongoing deployment of a conservation drone system in South Africa.

AAMAS Conference 2019 Conference Paper

Broken Signals in Security Games: Coordinating Patrollers and Sensors in the Real World

  • Elizabeth Bondi
  • Hoon Oh
  • Haifeng Xu
  • Fei Fang
  • Bistra Dilkina
  • Milind Tambe

Mobile sensors, e.g., unmanned aerial vehicles (UAVs), are becoming increasingly important in security domains and can be used for tasks such as searching for poachers in conservation areas. Such mobile sensors augment human patrollers by assisting in surveillance and in signaling potentially deceptive information to adversaries, and their coordinated deployment could be modeled via the well-known security games framework. Unfortunately, real-world uncertainty in the sensor’s detection of adversaries and adversaries’ observation of the sensor’s signals present major challenges in the sensors’ use. This leads to significant detriments in security performance. We first discuss the current shortcomings in more detail, and then propose a novel game model that incorporates uncertainty with sensors. The defender strategy in this game model will consist of three interdependent stages: an allocation stage, a signaling stage, and a reaction stage.

AAMAS Conference 2019 Conference Paper

Deep Fictitious Play for Games with Continuous Action Spaces

  • Nitin Kamra
  • Umang Gupta
  • Kai Wang
  • Fei Fang
  • Yan Liu
  • Milind Tambe

Fictitious play has been a classic algorithm to solve two-player adversarial games with discrete action spaces. In this work we develop an approximate extension of fictitious play to two-player games with high-dimensional continuous action spaces. We use generative neural networks to approximate players’ best responses while also learning a differentiable approximate model to the players’ rewards given their actions. Both these networks are trained jointly with gradient-based optimization to emulate fictitious play. We explore our approach in zero-sum games, non zero-sum games and security game domains.

AAMAS Conference 2019 Conference Paper

Don't Put All Your Strategies in One Basket: Playing Green Security Games with Imperfect Prior Knowledge

  • Shahrzad Gholami
  • Amulya Yadav
  • Long Tran-Thanh
  • Bistra Dilkina
  • Milind Tambe

Security efforts for wildlife monitoring and protection of endangered species (e.g., elephants, rhinos, etc.) are constrained by limited resources available to law enforcement agencies. Recent progress in Green Security Games (GSGs) has led to patrol planning algorithms for strategic allocation of limited patrollers to deter adversaries in environmental settings. Unfortunately, previous approaches to these problems suffer from several limitations. Most notably, (i) previous work in GSG literature relies on exploitation of error-prone machine learning (ML) models of poachers’ behavior trained on (spatially) biased historical data; and (ii) online learning approaches for repeated security games (similar to GSGs) do not account for spatio-temporal scheduling constraints while planning patrols, potentially causing significant shortcomings in the effectiveness of the planned patrols. Thus, this paper makes the following novel contributions: (I) We propose MINION-sm, a novel online learning algorithm for GSGs which does not rely on any prior error-prone model of attacker behavior; instead, it builds an implicit model of the attacker on-the-fly while simultaneously generating scheduling-constraint-aware patrols. MINION-sm achieves a sublinear regret against an optimal hindsight patrol strategy. (II) We also propose MINION, a hybrid approach where our MINION-sm model and an ML model (based on historical data) are considered as two patrol planning experts and we obtain a balance between them based on their observed empirical performance. (III) We show that our online learning algorithms significantly outperform existing state-of-the-art solvers for GSGs.

NeurIPS Conference 2019 Conference Paper

End to end learning and optimization on graphs

  • Bryan Wilder
  • Eric Ewing
  • Bistra Dilkina
  • Milind Tambe

Real-world applications often combine learning and optimization problems on graphs. For instance, our objective may be to cluster the graph in order to detect meaningful communities (or solve other common graph optimization problems such as facility location, maxcut, and so on). However, graphs or related attributes are often only partially observed, introducing learning problems such as link prediction which must be solved prior to optimization. Standard approaches treat learning and optimization entirely separately, while recent machine learning work aims to predict the optimal solution directly from the inputs. Here, we propose an alternative decision-focused learning approach that integrates a differentiable proxy for common graph optimization problems as a layer in learned systems. The main idea is to learn a representation that maps the original optimization problem onto a simpler proxy problem that can be efficiently differentiated through. Experimental results show that our ClusterNet system outperforms both pure end-to-end approaches (that directly predict the optimal solution) and standard approaches that entirely separate learning and optimization. Code for our system is available at https://github.com/bwilder0/clusternet.

NeurIPS Conference 2019 Conference Paper

Exploring Algorithmic Fairness in Robust Graph Covering Problems

  • Aida Rahmattalabi
  • Phebe Vayanos
  • Anthony Fulginiti
  • Eric Rice
  • Bryan Wilder
  • Amulya Yadav
  • Milind Tambe

Fueled by algorithmic advances, AI algorithms are increasingly being deployed in settings subject to unanticipated challenges with complex social effects. Motivated by real-world deployments of AI-driven, social-network-based suicide prevention and landslide risk management interventions, this paper focuses on a robust graph covering problem subject to group fairness constraints. We show that, in the absence of fairness constraints, state-of-the-art algorithms for the robust graph covering problem result in biased node coverage: they tend to discriminate against individuals (nodes) based on membership in traditionally marginalized groups. To remediate this issue, we propose a novel formulation of the robust covering problem with fairness constraints and a tractable approximation scheme applicable to real-world instances. We provide a formal analysis of the price of group fairness (PoF) for this problem, where we show that uncertainty can lead to a greater PoF. We demonstrate the effectiveness of our approach on several real-world social networks. Our method yields competitive node coverage while significantly improving group fairness relative to state-of-the-art methods.

IJCAI Conference 2019 Conference Paper

Group-Fairness in Influence Maximization

  • Alan Tsang
  • Bryan Wilder
  • Eric Rice
  • Milind Tambe
  • Yair Zick

Influence maximization is a widely used model for information dissemination in social networks. Recent work has employed such interventions across a wide range of social problems, spanning public health, substance abuse, and international development (to name a few examples). A critical but understudied question is whether the benefits of such interventions are fairly distributed across different groups in the population; e.g., avoiding discrimination with respect to sensitive attributes such as race or gender. Drawing on legal and game-theoretic concepts, we introduce formal definitions of fairness in influence maximization. We provide an algorithmic framework to find solutions which satisfy fairness constraints, and in the process improve the state of the art for general multi-objective submodular maximization problems. Experimental results on real data from an HIV prevention intervention for homeless youth show that standard influence maximization techniques oftentimes neglect smaller groups which contribute less to overall utility, resulting in a disparity which our proposed algorithms substantially reduce.
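For reference, the standard greedy baseline for influence maximization under the independent cascade model (the textbook approach the abstract contrasts with, not the paper's fairness-constrained algorithm) can be sketched as follows; the toy graph and the comment on fairness are our own illustration:

```python
import random

def simulate_spread(graph, seeds, p=0.3, trials=200):
    """Monte-Carlo estimate of expected independent-cascade spread."""
    rng = random.Random(0)  # fixed seed so estimates are reproducible
    total = 0
    for _ in range(trials):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            node = frontier.pop()
            for nbr in graph[node]:
                if nbr not in active and rng.random() < p:
                    active.add(nbr)
                    frontier.append(nbr)
        total += len(active)
    return total / trials

def greedy_seeds(graph, k):
    """Textbook greedy seed selection maximizing total spread.

    A group-fair variant would instead track spread per demographic group
    and optimize for the worst-off group rather than the overall total.
    """
    seeds = []
    for _ in range(k):
        rest = [v for v in graph if v not in seeds]
        seeds.append(max(rest, key=lambda v: simulate_spread(graph, seeds + [v])))
    return seeds

# A hub community (node 0 with five spokes) plus a small two-node group:
# plain greedy seeds the hub, and the small group receives nothing.
graph = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0],
         4: [0], 5: [0], 6: [7], 7: [6]}
seeds = greedy_seeds(graph, 1)
```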

AAAI Conference 2019 Conference Paper

Melding the Data-Decisions Pipeline: Decision-Focused Learning for Combinatorial Optimization

  • Bryan Wilder
  • Bistra Dilkina
  • Milind Tambe

Creating impact in real-world settings requires artificial intelligence techniques to span the full pipeline from data, to predictive models, to decisions. These components are typically approached separately: a machine learning model is first trained via a measure of predictive accuracy, and then its predictions are used as input into an optimization algorithm which produces a decision. However, the loss function used to train the model may easily be misaligned with the end goal, which is to make the best decisions possible. Hand-tuning the loss function to align with optimization is a difficult and error-prone process (which is often skipped entirely). We focus on combinatorial optimization problems and introduce a general framework for decision-focused learning, where the machine learning model is directly trained in conjunction with the optimization algorithm to produce high-quality decisions. Technically, our contribution is a means of integrating common classes of discrete optimization problems into deep learning or other predictive models, which are typically trained via gradient descent. The main idea is to use a continuous relaxation of the discrete problem to propagate gradients through the optimization procedure. We instantiate this framework for two broad classes of combinatorial problems: linear programs and submodular maximization. Experimental results across a variety of domains show that decision-focused learning often leads to improved optimization performance compared to traditional methods. We find that standard measures of accuracy are not a reliable proxy for a predictive model’s utility in optimization, and our method’s ability to specify the true goal as the model’s training objective yields substantial dividends across a range of decision problems.
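A minimal numerical illustration (our own toy, not the paper's framework) of the accuracy/decision misalignment the abstract describes: two models with identical mean squared error that nonetheless induce different decisions and different realized utility.

```python
import numpy as np

# True (unknown at training time) values of four candidate items; the
# downstream decision is simply to pick one item, earning its true value.
true_vals = np.array([1.0, 0.9, 0.1, 0.0])

# Two hypothetical models with identical MSE (each off by 0.2 on one item)...
model_a = np.array([1.2, 0.9, 0.1, 0.0])   # over-predicts the best item
model_b = np.array([1.0, 1.1, 0.1, 0.0])   # over-predicts the runner-up

mse_a = float(np.mean((model_a - true_vals) ** 2))
mse_b = float(np.mean((model_b - true_vals) ** 2))

# ...yet model A's decision still selects the truly best item, while
# model B's error flips the argmax and costs real utility.
utility_a = float(true_vals[int(np.argmax(model_a))])
utility_b = float(true_vals[int(np.argmax(model_b))])
```

Decision-focused training, as described in the abstract, targets the decision utility directly instead of a symmetric error measure like MSE.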

AAAI Conference 2019 Conference Paper

On the Inducibility of Stackelberg Equilibrium for Security Games

  • Qingyu Guo
  • Jiarui Gan
  • Fei Fang
  • Long Tran-Thanh
  • Milind Tambe
  • Bo An

Strong Stackelberg equilibrium (SSE) is the standard solution concept of Stackelberg security games. As opposed to the weak Stackelberg equilibrium (WSE), the SSE assumes that the follower breaks ties in favor of the leader; this is widely acknowledged and justified by the assertion that the defender can often induce the attacker to choose a preferred action by making an infinitesimal adjustment to her strategy. Unfortunately, in security games with resource assignment constraints, the assertion might not be valid; it is possible that the defender cannot induce the desired outcome. As a result, many results claimed in the literature may be overly optimistic. To remedy this, we first formally define the utility guarantee of a defender strategy and provide examples showing that the utility of the SSE can be higher than its utility guarantee. Second, inspired by the analysis of the leader’s payoff by Von Stengel and Zamir (2004), we propose a solution concept called the inducible Stackelberg equilibrium (ISE), which has the highest utility guarantee and always exists. Third, we characterize the conditions under which the ISE coincides with the SSE, and show that in the general case the SSE can be far worse with respect to its utility guarantee. Moreover, introducing the ISE does not invalidate existing algorithmic results, as the problem of computing an ISE polynomially reduces to that of computing an SSE. We also provide an algorithmic implementation for computing the ISE, with which our experiments unveil the empirical advantage of the ISE over the SSE.
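For context, the classical multiple-LPs approach to computing an SSE in a simple security game without assignment constraints (the setting where the tie-breaking assertion does hold) can be sketched as below; the payoff argument names and the two-target toy instance are our own illustration, not the paper's code:

```python
import numpy as np
from scipy.optimize import linprog

def solve_sse(def_cov, def_unc, att_cov, att_unc, resources):
    """Strong Stackelberg equilibrium for a simple security game.

    Payoffs are per-target lists: `def_cov[t]` is the defender's utility if
    target t is attacked while covered, `def_unc[t]` if uncovered, and
    similarly for the attacker. The defender commits to marginal coverage
    c with sum(c) <= resources. For each candidate attack target t, solve
    an LP maximizing defender utility subject to t being an attacker best
    response; return the best feasible (value, target, coverage).
    """
    n = len(def_cov)
    best = (-np.inf, None, None)
    for t in range(n):
        # Maximize def_unc[t] + c[t]*(def_cov[t]-def_unc[t]); linprog minimizes.
        obj = np.zeros(n)
        obj[t] = -(def_cov[t] - def_unc[t])
        A_ub, b_ub = [], []
        for s in range(n):
            if s == t:
                continue
            # Attacker weakly prefers t over s: U_a(t) >= U_a(s).
            row = np.zeros(n)
            row[t] = -(att_cov[t] - att_unc[t])
            row[s] = att_cov[s] - att_unc[s]
            A_ub.append(row)
            b_ub.append(att_unc[t] - att_unc[s])
        A_ub.append(np.ones(n))       # total coverage budget
        b_ub.append(resources)
        res = linprog(obj, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                      bounds=[(0.0, 1.0)] * n)
        if res.success:
            val = def_unc[t] + res.x[t] * (def_cov[t] - def_unc[t])
            if val > best[0]:
                best = (val, t, res.x)
    return best

# Two identical targets, one resource: the SSE splits coverage evenly
# and the defender's expected utility is -0.5.
val, target, cov = solve_sse([0.0, 0.0], [-1.0, -1.0],
                             [0.0, 0.0], [1.0, 1.0], 1.0)
```

The paper's observation is that, once resource assignment constraints are added, the attacker may not actually be inducible to the target selected this way, which motivates the ISE.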

AAMAS Conference 2019 Conference Paper

Robust Peer-Monitoring on Graphs with an Application to Suicide Prevention in Social Networks

  • Aida Rahmattalabi
  • Phebe Vayanos
  • Anthony Fulginiti
  • Milind Tambe

We consider the problem of selecting a subset of nodes (individuals) in a (social) network that can act as monitors capable of “watching out” for their neighbors (friends) when the availability or performance of the chosen monitors is uncertain. Such problems arise, for example, in the context of “Gatekeeper Trainings” for suicide prevention. We formulate this problem as a two-stage robust optimization problem that aims to maximize the worst-case number of covered nodes. Our model is capable of incorporating domain-specific constraints, e.g., fairness constraints. We propose a practically tractable approximation scheme, and we provide empirical results that demonstrate the effectiveness of our approach.

AAMAS Conference 2019 Conference Paper

Using Game Theory in Real Time in the Real World: A Conservation Case Study

  • Elizabeth Bondi
  • Hoon Oh
  • Haifeng Xu
  • Fei Fang
  • Bistra Dilkina
  • Milind Tambe

In the real world, real-time data are now widely available, especially in security domains. Security cameras, aerial imagery, and even social media keep defenders informed when protecting important events, locations, and people. Further, advances in artificial intelligence have led to tools that can interpret these data automatically. Game-theoretic models, for example, have shown great success in security. However, most of them ignore real-time information. In this paper, we demonstrate the potential to use real-time information from imagery to better inform our decisions in game-theoretic models for security. As a concrete example, a conservation group called Air Shepherd uses conservation drones equipped with thermal infrared cameras to locate poachers at night and alert park rangers. They have also used lights aboard the drones to signal to poachers, warning them of the patrollers’ presence, which often deters the poachers. We propose a system that (i) allocates drones and humans strategically throughout a protected area, (ii) detects poachers in the thermal infrared videos recorded by the conservation drones flying through the protected area at the predetermined locations, and (iii) recommends moving to the location and/or signaling to the poacher that a patroller is nearby, depending on real-time detections. View the demonstration: http://bit.ly/aamas19-demo-bondi-et-al.

AAMAS Conference 2019 Conference Paper

Warning Time: Optimizing Strategic Signaling for Security Against Boundedly Rational Adversaries

  • Sarah Cooney
  • Phebe Vayanos
  • Thanh H. Nguyen
  • Cleotilde Gonzalez
  • Christian Lebiere
  • Edward A. Cranford
  • Milind Tambe

Defender-attacker Stackelberg security games (SSGs) have been applied for solving many real-world security problems. Recent work in SSGs has incorporated a deceptive signaling scheme into the SSG model, where the defender strategically reveals information about her defensive strategy to the attacker, in order to influence the attacker’s decision making for the defender’s own benefit. In this work, we study the problem of signaling in security games against a boundedly rational attacker.

AAMAS Conference 2018 Conference Paper

Activating the "Breakfast Club": Modeling Influence Spread in Natural-World Social Networks

  • Lily Hu
  • Bryan Wilder
  • Amulya Yadav
  • Eric Rice
  • Milind Tambe

While reigning models of diffusion have privileged the structure of a given social network as the key to informational exchange, real human interactions do not appear to take place on a single graph of connections. Using data collected from a pilot study of the spread of HIV awareness in social networks of homeless youth, we show that health information did not diffuse in the field according to the processes outlined by dominant models. Since physical network diffusion scenarios often diverge from their better-studied counterparts on digital networks, we propose an alternative Activation Jump Model (AJM) that describes information diffusion on physical networks from a multi-agent team perspective. Our model exhibits two main differentiating features from leading cascade and threshold models of influence spread: 1) the structural composition of a seed set team impacts each individual node’s influencing behavior, and 2) an influencing node may spread information to non-neighbors. We show that the AJM significantly outperforms existing models in its fit to the observed node-level influence data on the youth networks. We then prove theoretical results, showing that the AJM exhibits many well-behaved properties shared by dominant models. Our results suggest that the AJM presents a flexible and more accurate model of network diffusion that may better inform influence maximization in the field.

AAMAS Conference 2018 Conference Paper

Adversary Models Account for Imperfect Crime Data: Forecasting and Planning against Real-world Poachers

  • Shahrzad Gholami
  • Sara Mc Carthy
  • Bistra Dilkina
  • Andrew Plumptre
  • Milind Tambe
  • Margaret Driciru
  • Fred Wanyama
  • Aggrey Rwetsiba

Poachers are engaged in extinction-level wholesale slaughter, so it is critical to harness historical data for predicting poachers’ behavior. However, in these domains, data collected about adversarial actions are remarkably imperfect, where reported negative instances of crime may be mislabeled or uncertain. Unfortunately, past attempts to develop predictive and prescriptive models to address this problem suffer from shortcomings from a modeling perspective as well as in the implementability of their techniques. Most notably, these models (i) neglect the uncertainty in crime data, leading to inaccurate and biased predictions of adversary behavior, (ii) use coarse-grained crime analysis, and (iii) do not provide a convincing evaluation, as they only look at a single protected area. Additionally, (iv) they propose time-consuming techniques which cannot be directly integrated into low-resource outposts. In this innovative application paper, we (I) introduce a novel imperfect-observation aWare Ensemble (iWare-E) technique, which is designed to handle the uncertainty in crime information efficiently. This approach leads to superior accuracy for adversary behavior prediction (up to 34% increase in AUC) compared to the previous state-of-the-art. We also demonstrate the country-wide efficiency of the models and are the first to (II) evaluate our adversary behavioral model across different protected areas in Uganda, i.e., Murchison Falls and Queen Elizabeth National Park (totaling about 7,500 square km), as well as (III) on fine-grained temporal resolutions. Lastly, (IV) we provide a scalable planning algorithm to design fine-grained patrol routes for the rangers, which achieves up to 150% improvement in the number of predicted attacks detected.

IJCAI Conference 2018 Conference Paper

Bridging the Gap Between Theory and Practice in Influence Maximization: Raising Awareness about HIV among Homeless Youth

  • Amulya Yadav
  • Bryan Wilder
  • Eric Rice
  • Robin Petering
  • Jaih Craddock
  • Amanda Yoshioka-Maxwell
  • Mary Hemler
  • Laura Onasch-Vera

This paper reports on results obtained by deploying HEALER and DOSIM (two AI agents for social influence maximization) in the real world, which assist service providers in maximizing HIV awareness in real-world homeless-youth social networks. These agents recommend key "seed" nodes in social networks, i.e., homeless youth who would maximize HIV awareness in their real-world social network. While prior research on these agents published promising simulation results from the lab, the usability of these AI agents in the real world was unknown. This paper presents results from three real-world pilot studies involving 173 homeless youth across two different homeless shelters in Los Angeles. The results from these pilot studies illustrate that HEALER and DOSIM outperform the current modus operandi of service providers by ~160% in terms of information spread about HIV among homeless youth.

AAMAS Conference 2018 Conference Paper

Deceiving Cyber Adversaries: A Game Theoretic Approach

  • Aaron Schlenker
  • Omkar Thakoor
  • Haifeng Xu
  • Fei Fang
  • Milind Tambe
  • Long Tran-Thanh
  • Phebe Vayanos
  • Yevgeniy Vorobeychik

An important way cyber adversaries find vulnerabilities in modern networks is through reconnaissance, in which they attempt to identify configuration specifics of network hosts. To increase uncertainty of adversarial reconnaissance, the network administrator (henceforth, defender) can introduce deception into responses to network scans, such as obscuring certain system characteristics. We introduce a novel game-theoretic model of deceptive interactions of this kind between a defender and a cyber attacker, which we call the Cyber Deception Game. We consider both a powerful (rational) attacker, who is aware of the defender’s exact deception strategy, and a naive attacker who is not. We show that computing the optimal deception strategy is NP-hard for both types of attackers. For the case with a powerful attacker, we provide a mixed-integer linear program solution as well as a fast and effective greedy algorithm. Similarly, we provide complexity results and propose exact and heuristic approaches when the attacker is naive. Our extensive experimental analysis demonstrates the effectiveness of our approaches.

AAMAS Conference 2018 Conference Paper

End-to-End Influence Maximization in the Field

  • Bryan Wilder
  • Laura Onasch-Vera
  • Juliana Hudson
  • Jose Luna
  • Nicole Wilson
  • Robin Petering
  • Darlene Woo
  • Milind Tambe

This work aims to overcome the challenges in deploying influence maximization to support community-driven interventions. Influence maximization is a crucial technique used in preventative health interventions, such as HIV prevention amongst homeless youth. Drop-in centers for homeless youth train a subset of youth as peer leaders who will disseminate information about HIV through their social networks. The challenge is to find a small set of peer leaders who will have the greatest possible influence. While many algorithms have been proposed for influence maximization, none can be feasibly deployed by a service provider: existing algorithms require costly surveys of the entire social network of the youth to provide input data, and high-performance computing resources to run the algorithm itself. Both are crucial bottlenecks to widespread use of influence maximization in real-world interventions. To address the above challenges, this paper introduces the CHANGE agent for influence maximization. CHANGE handles the end-to-end process of influence maximization, from data collection to peer leader selection. Crucially, CHANGE only surveys a fraction of the youth to gather network data and minimizes computational cost while providing comparable performance to previously proposed algorithms. We carried out a pilot study of CHANGE in collaboration with a drop-in center serving homeless youth in a major U.S. city. CHANGE surveyed only 18% of the youth to construct its social network. However, the peer leaders it selected reached just as many youth as previously field-tested algorithms which surveyed the entire network. This is the first real-world study of a network sampling algorithm for influence maximization. Simulation results on real-world networks also support our claims.

AAMAS Conference 2018 Conference Paper

Equilibrium Refinement in Security Games with Arbitrary Scheduling Constraints

  • Kai Wang
  • Qingyu Guo
  • Phebe Vayanos
  • Milind Tambe
  • Bo An

Significant research effort in security games has focused on devising strategies that perform well even when the attacker deviates from optimal (rational) behavior. In most of these frameworks, a price needs to be paid to ensure robustness against this unpredictability. However, equilibrium refinement is an attractive alternative to boost solution robustness at no cost, even though it has not received as much attention in the security game literature. In this framework, resources are strategically allocated to secure an optimal outcome against a rational adversary while simultaneously protecting other targets to ensure good outcomes against boundedly rational or constrained attackers. Unfortunately, existing approaches for equilibrium refinement in security games cannot effectively address scheduling constraints that arise frequently in real-world applications. In this paper, we aim to fill this gap and make several key contributions. First, we show that existing approaches for equilibrium refinement can fail in the presence of scheduling constraints. Second, we investigate the properties of the best response of the attacker. Third, we leverage these properties to devise novel iterative algorithms to compute the optimally refined equilibrium, with polynomially many calls to an LP oracle for zero-sum games. Finally, we conduct extensive experimental evaluations that showcase i) the superior performance of our approach in the face of a boundedly rational attacker and ii) the attractive scalability properties of our algorithm, which can solve realistic-sized instances.

AAMAS Conference 2018 Conference Paper

Inducible Equilibrium for Security Games

  • Qingyu Guo
  • Jiarui Gan
  • Fei Fang
  • Long Tran-Thanh
  • Milind Tambe
  • Bo An

Strong Stackelberg equilibrium (SSE) is the standard solution concept of Stackelberg security games. The SSE assumes that the follower breaks ties in favor of the leader and this is widely acknowledged and justified by the assertion that the defender can often induce the attacker to choose a preferred action by making an infinitesimal adjustment to her strategy. Unfortunately, in security games with resource assignment constraints, the assertion might not be valid. To overcome this issue, inspired by the notion of inducibility and the pessimistic Stackelberg equilibrium [20, 21], this paper presents the inducible Stackelberg equilibrium (ISE), which is guaranteed to exist and avoids overoptimism as the outcome can always be induced with infinitesimal strategy deviation. Experimental evaluation unveils the significant overoptimism and sub-optimality of SSE and thus, verifies the advantage of the ISE as an alternative solution concept.

AAAI Conference 2018 Short Paper

Influence Maximization for Social Network Based Substance Abuse Prevention

  • Aida Rahmattalabi
  • Anamika Barman Adhikari
  • Phebe Vayanos
  • Milind Tambe
  • Eric Rice
  • Robin Baker

A major barrier to personalized Human Activity Recognition using wearable sensors is that the performance of the recognition model drops significantly upon adoption of the system by new users or changes in the physical/behavioral status of users. Therefore, the model needs to be retrained by collecting new labeled data in the new context. In this study, we develop a transfer learning framework using convolutional neural networks to build a personalized activity recognition model with minimal user supervision.

AAAI Conference 2018 Conference Paper

Maximizing Influence in an Unknown Social Network

  • Bryan Wilder
  • Nicole Immorlica
  • Eric Rice
  • Milind Tambe

In many real-world applications of influence maximization, practitioners intervene in a population whose social structure is initially unknown. This poses a multiagent systems challenge to act under uncertainty about how the agents are connected. We formalize this problem by introducing exploratory influence maximization, in which an algorithm queries individual network nodes (agents) to learn their links. The goal is to locate a seed set nearly as influential as the global optimum using very few queries. We show that this problem is intractable for general graphs. However, real-world networks typically have community structure, where nodes are arranged in densely connected subgroups. We present the ARISEN algorithm, which leverages community structure to find an influential seed set. Experiments on real-world networks of homeless youth, village populations in India, and others demonstrate ARISEN’s strong empirical performance. To formally demonstrate how ARISEN exploits community structure, we prove an approximation guarantee for ARISEN on graphs drawn from the Stochastic Block Model.
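A drastically simplified sketch of the query-limited setting (random sampling plus degree-based seeding; our own toy illustration, not the ARISEN algorithm, which additionally exploits community structure):

```python
import random

def exploratory_seed(query_neighbors, nodes, n_queries, n_seeds):
    """Pick seeds in an initially unknown network under a query budget.

    `query_neighbors(v)` reveals v's neighbor list (one costly query,
    e.g. surveying one individual). We query a random sample of nodes and
    seed those with the highest observed degree, illustrating that only a
    fraction of the network ever needs to be surveyed.
    """
    rng = random.Random(0)  # fixed seed for reproducibility
    sampled = rng.sample(nodes, n_queries)
    degree = {v: len(query_neighbors(v)) for v in sampled}
    return sorted(degree, key=degree.get, reverse=True)[:n_seeds]

# Toy network: node 0 is a hub. Querying all 8 nodes always finds it;
# a budget of 5 queries may miss it, which is the price of exploration.
graph = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0],
         4: [0], 5: [0], 6: [7], 7: [6]}
full_budget = exploratory_seed(graph.__getitem__, list(graph), 8, 1)
small_budget = exploratory_seed(graph.__getitem__, list(graph), 5, 1)
```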

AAMAS Conference 2018 Conference Paper

Mitigating the Curse of Correlation in Security Games by Entropy Maximization

  • Haifeng Xu
  • Shaddin Dughmi
  • Milind Tambe
  • Venil Loyd Noronha

In Stackelberg security games, a defender seeks to randomly allocate limited security resources to protect critical targets from an attack. In this paper, we study a fundamental, yet underexplored, phenomenon in security games, which we term the Curse of Correlation (CoC). Specifically, we observe that there are inevitable correlations among the protection status of different targets. Such correlation is a crucial concern, especially in spatio-temporal domains like conservation area patrolling, where attackers can surveil patrollers at certain areas and then infer their patrolling routes using such correlations. To mitigate this issue, we propose to design entropy-maximizing defending strategies for spatio-temporal security games, which frequently suffer from CoC. We prove that the problem is #P-hard in general. However, it admits efficient algorithms in well-motivated special settings.

IJCAI Conference 2018 Conference Paper

Near Real-Time Detection of Poachers from Drones in AirSim

  • Elizabeth Bondi
  • Ashish Kapoor
  • Debadeepta Dey
  • James Piavis
  • Shital Shah
  • Robert Hannaford
  • Arvind Iyer
  • Lucas Joppa

The unrelenting threat of poaching has led to increased development of new technologies to combat it. One such example is the use of thermal infrared cameras mounted on unmanned aerial vehicles (UAVs or drones) to spot poachers at night and report them to park rangers before they are able to harm any animals. However, monitoring the live video stream from these conservation UAVs all night is an arduous task. Therefore, we discuss SPOT (Systematic Poacher deTector), a novel application that augments conservation drones with the ability to automatically detect poachers and animals in near real time. SPOT illustrates the feasibility of building upon state-of-the-art AI techniques, such as Faster RCNN, to address the challenges of automatically detecting animals and poachers in infrared images. This paper reports (i) the design of SPOT, (ii) efficient processing techniques to ensure usability in the field, (iii) evaluation of SPOT based on historical videos and a real-world test run by the end-users, Air Shepherd, in the field, and (iv) the use of AirSim for live demonstration of SPOT. The promising results from a field test have led to a plan for larger-scale deployment in a national park in southern Africa. While SPOT is developed for conservation drones, its design and novel techniques have wider application for automated detection from UAV videos.

AAMAS Conference 2018 Conference Paper

Optimizing Network Structure for Preventative Health

  • Bryan Wilder
  • Han Ching Ou
  • Kayla de la Haye
  • Milind Tambe

Diseases such as heart disease, stroke, or diabetes affect hundreds of millions of people. Such conditions are strongly impacted by obesity, and establishing healthy lifestyle behaviors is a critical public health challenge with many applications. Changing health behaviors is inherently a multiagent problem since people’s behavior is strongly influenced by those around them. Hence, practitioners often attempt to modify the social network of a community by adding or removing edges in ways that will lead to desirable behavior change. To our knowledge, no previous work considers the algorithmic problem of finding the optimal set of edges to add and remove. We propose the RECONNECT algorithm, which efficiently finds high-quality solutions for a range of different network intervention problems. We evaluate RECONNECT in a highly realistic simulated environment based on the Antelope Valley region in California which draws on demographic, social, and health-related data. We find that RECONNECT outperforms an array of baseline policies, in some cases yielding a 150% improvement over the best alternative.

AAMAS Conference 2018 Conference Paper

Please be an Influencer? Contingency-Aware Influence Maximization

  • Amulya Yadav
  • Ritesh Noothigattu
  • Eric Rice
  • Laura Onasch-Vera
  • Leandro Soriano Marcolino
  • Milind Tambe

Most previous work on influence maximization in social networks assumes that the chosen influencers (or seed nodes) can be influenced with certainty (i.e., with no contingencies). In this paper, we focus on using influence maximization in public health domains for assisting low-resource communities, where contingencies are common. It is very difficult in these domains to ensure that the seed nodes are influenced, as influencing them entails contacting/convincing them to attend training sessions, which may not always be possible. Unfortunately, previous state-of-the-art algorithms for influence maximization are unusable in this setting. This paper tackles this challenge via the following four contributions: (i) we propose the Contingency Aware Influence Maximization problem and analyze it theoretically; (ii) we cast this problem as a Partially Observable Markov Decision Process and propose CAIMS (a novel POMDP planner) to solve it, which leverages a natural action space factorization associated with real-world social networks; and (iii) we provide extensive simulation results to compare CAIMS with existing state-of-the-art influence maximization algorithms. Finally, (iv) we provide results from a real-world feasibility trial conducted to evaluate CAIMS, in which key influencers in homeless youth social networks were influenced in order to spread awareness about HIV.

AAAI Conference 2018 Conference Paper

Policy Learning for Continuous Space Security Games Using Neural Networks

  • Nitin Kamra
  • Umang Gupta
  • Fei Fang
  • Yan Liu
  • Milind Tambe

A wealth of algorithms centered around (integer) linear programming have been proposed to compute equilibrium strategies in security games with discrete states and actions. However, in practice many domains possess continuous state and action spaces. In this paper, we consider a continuous space security game model with infinite-size action sets for players and present a novel deep learning based approach to extend the existing toolkit for solving security games. Specifically, we present (i) OptGradFP, a novel and general algorithm that searches for the optimal defender strategy in a parameterized continuous search space, and can also be used to learn policies over multiple game states simultaneously; (ii) OptGradFP-NN, a convolutional neural network based implementation of OptGradFP for continuous space security games. We demonstrate the potential to predict good defender strategies via experiments and analysis of OptGradFP and OptGradFP-NN on discrete and continuous game settings.

AAAI Conference 2018 Conference Paper

Preventing Infectious Disease in Dynamic Populations Under Uncertainty

  • Bryan Wilder
  • Sze-Chuan Suen
  • Milind Tambe

Treatable infectious diseases are a critical challenge for public health. Outreach campaigns can encourage undiagnosed patients to seek treatment but must be carefully targeted to make the most efficient use of limited resources. We present an algorithm to optimally allocate limited outreach resources among demographic groups in the population. The algorithm uses a novel multiagent model of disease spread which both captures the underlying population dynamics and is amenable to optimization. Our algorithm extends, with provable guarantees, to a stochastic setting where we have only a distribution over parameters such as the contact pattern between agents. We evaluate our algorithm on two instances where this distribution is inferred from real-world data: tuberculosis in India and gonorrhea in the United States. Our algorithm produces a policy which is predicted to avert an average of at least 8,000 person-years of tuberculosis and 20,000 person-years of gonorrhea annually compared to current policy.

AAAI Conference 2018 Conference Paper

SPOT Poachers in Action: Augmenting Conservation Drones With Automatic Detection in Near Real Time

  • Elizabeth Bondi
  • Fei Fang
  • Mark Hamilton
  • Debarun Kar
  • Donnabell Dmello
  • Jongmoo Choi
  • Robert Hannaford
  • Arvind Iyer

The unrelenting threat of poaching has led to increased development of new technologies to combat it. One such example is the use of long wave thermal infrared cameras mounted on unmanned aerial vehicles (UAVs or drones) to spot poachers at night and report them to park rangers before they are able to harm animals. However, monitoring the live video stream from these conservation UAVs all night is an arduous task. Therefore, we build SPOT (Systematic POacher deTector), a novel application that augments conservation drones with the ability to automatically detect poachers and animals in near real time. SPOT illustrates the feasibility of building upon state-of-the-art AI techniques, such as Faster RCNN, to address the challenges of automatically detecting animals and poachers in infrared images. This paper reports (i) the design and architecture of SPOT, (ii) a series of efforts towards more robust and faster processing to make SPOT usable in the field and provide detections in near real time, and (iii) evaluation of SPOT based on both historical videos and a real-world test run by the end users in the field. The promising results from the test in the field have led to a plan for larger-scale deployment in a national park in Botswana. While SPOT is developed for conservation drones, its design and novel techniques have wider application for automated detection from UAV videos.

IJCAI Conference 2018 Conference Paper

Stackelberg Security Games: Looking Beyond a Decade of Success

  • Arunesh Sinha
  • Fei Fang
  • Bo An
  • Christopher Kiekintveld
  • Milind Tambe

The Stackelberg Security Game (SSG) model has been immensely influential in security research since it was introduced roughly a decade ago. Furthermore, deployed SSG-based applications are among the most successful examples of game theory applied in the real world. We present a broad survey of recent technical advances in SSG and related literature, and then look to the future by highlighting new potential applications and open research problems in SSG.
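
For readers new to SSGs, the core computation the survey refers to can be sketched on a toy instance: the defender commits to a randomized coverage of targets, the attacker best-responds, and exact ties are broken in the defender's favor (the Strong Stackelberg Equilibrium convention). The two targets and all payoff numbers below are hypothetical.

```python
# Toy Strong Stackelberg Equilibrium for a 2-target, 1-resource security
# game (all payoffs hypothetical). The defender commits to coverage
# probabilities; the attacker attacks the target with the highest expected
# utility, breaking exact ties in the defender's favor.

TARGETS = [  # per-target payoffs: defender/attacker, covered/uncovered
    {"d_cov": 0.0, "d_unc": -10.0, "a_cov": -1.0, "a_unc": 5.0},
    {"d_cov": 0.0, "d_unc": -4.0,  "a_cov": -1.0, "a_unc": 2.0},
]

def expected_utils(c0):
    cov = [c0, 1.0 - c0]  # one resource split across the two targets
    att = [cov[i] * TARGETS[i]["a_cov"] + (1 - cov[i]) * TARGETS[i]["a_unc"]
           for i in range(2)]
    dfd = [cov[i] * TARGETS[i]["d_cov"] + (1 - cov[i]) * TARGETS[i]["d_unc"]
           for i in range(2)]
    return att, dfd

def defender_value(c0):
    att, dfd = expected_utils(c0)
    best = max(att)
    # SSE tie-breaking: among the attacker's best targets, assume the one
    # that is best for the defender is attacked.
    return max(dfd[i] for i in range(2) if abs(att[i] - best) < 1e-9)

best_c0 = max((i / 1000 for i in range(1001)), key=defender_value)
```

The grid search lands near coverage 2/3 on the high-value target, the point where the attacker is indifferent between targets; real solvers replace this 1-D search with linear or mixed-integer programming.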

AAAI Conference 2018 Conference Paper

Strategic Coordination of Human Patrollers and Mobile Sensors With Signaling for Security Games

  • Haifeng Xu
  • Kai Wang
  • Phebe Vayanos
  • Milind Tambe

Traditional security games concern the optimal randomized allocation of human patrollers, who can directly catch attackers or interdict attacks. Motivated by the emerging application of utilizing mobile sensors (e.g., UAVs) for patrolling, in this paper we propose the novel Sensor-Empowered security Game (SEG) model which captures the joint allocation of human patrollers and mobile sensors. Sensors differ from patrollers in that they cannot directly interdict attacks, but they can notify nearby patrollers (if any). Moreover, SEGs incorporate mobile sensors’ natural functionality of strategic signaling. On the technical side, we first prove that solving SEGs is NP-hard even in zero-sum cases. We then develop a scalable algorithm SEGer based on the branch-and-price framework with two key novelties: (1) a novel MILP formulation for the slave; (2) an efficient relaxation of the problem for pruning. To further accelerate SEGer, we design a faster combinatorial algorithm for the slave problem, which is provably a constant-approximation to the slave problem in zero-sum cases and serves as a useful heuristic for general-sum SEGs. Our experiments demonstrate the significant benefit of utilizing mobile sensors.

IJCAI Conference 2018 Conference Paper

The Price of Usability: Designing Operationalizable Strategies for Security Games

  • Sara Marie Mc Carthy
  • Corine M. Laan
  • Kai Wang
  • Phebe Vayanos
  • Arunesh Sinha
  • Milind Tambe

We consider the problem of allocating scarce security resources among heterogeneous targets to thwart a possible attack. It is well known that deterministic solutions to this problem, being highly predictable, are severely suboptimal. To mitigate this predictability, the game-theoretic security game model was proposed, which randomizes over pure (deterministic) strategies, causing confusion for the adversary. Unfortunately, such mixed strategies typically involve randomizing over a large number of strategies, requiring security personnel to be familiar with numerous protocols, making them hard to operationalize. Motivated by these practical considerations, we propose an approach for computing strategies that are easy to operationalize and that bridge the gap between the static solution and the optimal mixed strategy. These strategies only randomize over an optimally chosen subset of pure strategies whose cardinality is selected by the defender, enabling them to conveniently tune the trade-off between ease of operationalization and efficiency using a single design parameter. We show that the problem of computing such operationalizable strategies is NP-hard, formulate it as a mixed-integer optimization problem, provide an algorithm for computing epsilon-optimal equilibria, and an efficient heuristic. We evaluate the performance of our approach on the problem of screening for threats at airport checkpoints and show that the Price of Usability, i.e., the loss in optimality incurred to obtain a strategy that is easier to operationalize, is typically not high.
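
The support-size trade-off described here can be sketched by brute force on a toy zero-sum screening game: compare the defender's worst-case loss when the mixed strategy may use supports of size k = 1, 2, 3. The target values and the grid-search resolution below are made up, and the paper's actual formulation is a mixed-integer program, not enumeration.

```python
# Hypothetical "price of usability" illustration: a zero-sum toy game where
# each defender pure strategy covers one of three targets, and we compare
# the best mixed strategy restricted to supports of size k = 1, 2, 3.
# Larger supports are harder to operationalize but can reduce worst-case loss.

from itertools import combinations

VALUES = [5.0, 3.0, 1.0]  # attacker's payoff on an uncovered target (made up)

def loss(coverage):
    # Attacker best-responds to the coverage vector; defender loses that much.
    return max((1 - c) * v for c, v in zip(coverage, VALUES))

def best_loss_with_support(support, steps=200):
    # Grid search over mixing weights on the chosen support.
    if len(support) == 1:
        weights_list = [(1.0,)]
    elif len(support) == 2:
        weights_list = [(i / steps, 1 - i / steps) for i in range(steps + 1)]
    else:
        weights_list = [(i / steps, j / steps, 1 - (i + j) / steps)
                        for i in range(steps + 1)
                        for j in range(steps + 1 - i)]
    best = float("inf")
    for w in weights_list:
        cov = [0.0] * len(VALUES)
        for t, p in zip(support, w):
            cov[t] = p
        best = min(best, loss(cov))
    return best

def best_loss_with_k(k):
    return min(best_loss_with_support(s)
               for s in combinations(range(len(VALUES)), k))

losses = [best_loss_with_k(k) for k in (1, 2, 3)]
```

In this toy instance the loss drops from 3.0 with a single pure strategy to 1.875 with a support of two, and a third support element adds nothing, echoing the abstract's finding that the price of usability is typically not high.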

AAMAS Conference 2017 Conference Paper

Cloudy with a Chance of Poaching: Adversary Behavior Modeling and Forecasting with Real-World Poaching Data

  • Debarun Kar
  • Benjamin Ford
  • Shahrzad Gholami
  • Fei Fang
  • Andrew Plumptre
  • Milind Tambe
  • Margaret Driciru
  • Fred Wanyama

Wildlife conservation organizations task rangers to deter and capture wildlife poachers. Since rangers are responsible for patrolling vast areas, adversary behavior modeling can help more effectively direct future patrols. In this innovative application track paper, we present an adversary behavior modeling system, INTERCEPT (INTERpretable Classification Ensemble to Protect Threatened species), and provide the most extensive evaluation in the AI literature of one of the largest poaching datasets from Queen Elizabeth National Park (QENP) in Uganda, comparing INTERCEPT with its competitors; we also present results from a month-long test of INTERCEPT in the field. We present three major contributions. First, we present a paradigm shift in modeling and forecasting wildlife poacher behavior. Some of the latest work in the AI literature (and in Conservation) has relied on models similar to the Quantal Response model from Behavioral Game Theory for poacher behavior prediction. In contrast, INTERCEPT presents a behavior model based on an ensemble of decision trees (i) that more effectively predicts poacher attacks and (ii) that is more effectively interpretable and verifiable. We augment this model to account for spatial correlations and construct an ensemble of the best models, significantly improving performance. Second, we conduct an extensive evaluation on the QENP dataset, comparing 41 models in prediction performance over two years. Third, we present the results of deploying INTERCEPT for a one-month field test in QENP - a first for adversary behavior modeling applications in this domain. This field test has led to finding a poached elephant and more than a dozen snares (including a roll of elephant snares) before they were deployed, potentially saving the lives of multiple animals - including elephants.
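
The ensemble-of-decision-trees idea can be sketched in miniature with hand-built depth-1 stumps combined by majority vote. The features, thresholds, and grid cells below are invented; INTERCEPT learns its trees from ranger patrol data and additionally models spatial correlations.

```python
# Miniature sketch of an interpretable tree ensemble for attack prediction
# (not the INTERCEPT code): one hand-built depth-1 "stump" per feature,
# combined by majority vote. All features and thresholds are synthetic.

def make_stump(feature_index, threshold):
    # Predict "attack" (1) when the feature exceeds the threshold.
    return lambda x: 1 if x[feature_index] > threshold else 0

def majority_vote(stumps, x):
    votes = sum(stump(x) for stump in stumps)
    return 1 if 2 * votes > len(stumps) else 0

# Hypothetical grid-cell features: (animal density, road-proximity score,
# past attack count); each stump votes "attack" when its feature is high.
stumps = [make_stump(0, 0.5), make_stump(1, 0.5), make_stump(2, 0)]

hot_cell = (0.9, 0.8, 3)    # all indicators high
cold_cell = (0.1, 0.2, 0)   # all indicators low
mixed_cell = (0.9, 0.1, 0)  # only one indicator high
```

Each stump is individually readable ("attack likely where animal density exceeds 0.5"), which is the interpretability property the abstract contrasts with quantal-response-style behavioral models.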

IJCAI Conference 2017 Conference Paper

Don't Bury your Head in Warnings: A Game-Theoretic Approach for Intelligent Allocation of Cyber-security Alerts

  • Aaron Schlenker
  • Haifeng Xu
  • Mina Guirguis
  • Christopher Kiekintveld
  • Arunesh Sinha
  • Milind Tambe
  • Solomon Sonya
  • Darryl Balderas

In recent years, there have been a number of successful cyber attacks on enterprise networks by malicious actors which have caused severe damage. These networks have Intrusion Detection and Prevention Systems in place to protect them, but they are notorious for producing a high volume of alerts. These alerts must be investigated by cyber analysts to determine whether they are an attack or benign. Unfortunately, there are orders of magnitude more alerts generated than there are cyber analysts to investigate them. This trend is expected to continue into the future, creating a need for tools that find optimal assignments of the incoming alerts to analysts in the presence of a strategic adversary. We address this challenge with the following four contributions: (1) a cyber screening game (CSG) model for the cyber network protection domain, (2) an NP-hardness proof for computing the optimal strategy for the defender, (3) an algorithm that finds the optimal allocation of experts to alerts in the CSG, and (4) heuristic improvements for computing allocations in CSGs that achieve significant scale-up while, as we show empirically, closely matching the solution quality of the optimal algorithm.

IJCAI Conference 2017 Conference Paper

Maximizing Awareness about HIV in Social Networks of Homeless Youth with Limited Information

  • Amulya Yadav
  • Hau Chan
  • Albert Xin Jiang
  • Haifeng Xu
  • Eric Rice
  • Milind Tambe

This paper presents HEALER, a software agent that recommends sequential intervention plans for use by homeless shelters, which organize these interventions to raise awareness about HIV among homeless youth. HEALER's sequential plans (built using knowledge of social networks of homeless youth) choose intervention participants strategically to maximize influence spread, while reasoning about uncertainties in the network. While previous work presents influence maximization techniques to choose intervention participants, it does not address two real-world issues: (i) it completely fails to scale up to real-world sizes; and (ii) it does not handle deviations in execution of intervention plans. HEALER handles these issues via two major contributions: (i) HEALER casts this influence maximization problem as a POMDP and solves it using a novel planner which scales up to previously unsolvable real-world sizes; and (ii) HEALER allows shelter officials to modify its recommendations, and updates its future plans in a deviation-tolerant manner. HEALER was deployed in the real world in Spring 2016 with considerable success.

IJCAI Conference 2017 Conference Paper

Staying Ahead of the Game: Adaptive Robust Optimization for Dynamic Allocation of Threat Screening Resources

  • Sara Marie Mc Carthy
  • Phebe Vayanos
  • Milind Tambe

We consider the problem of dynamically allocating screening resources of different efficacies (e.g., magnetic or X-ray imaging) at checkpoints (e.g., at airports or ports) to successfully avert an attack by one of the screenees. Previously, the Threat Screening Game model was introduced to address this problem under the assumption that screenee arrival times are perfectly known. In reality, arrival times are uncertain, which severely impedes the implementability and performance of this approach. We thus propose a novel framework for dynamic allocation of threat screening resources that explicitly accounts for uncertainty in the screenee arrival times. We model the problem as a multistage robust optimization problem and propose a tractable solution approach using compact linear decision rules combined with robust reformulation and constraint randomization. We perform extensive numerical experiments which showcase that our approach outperforms (a) exact solution methods in terms of tractability, while incurring only a very minor loss in optimality, and (b) methods that ignore uncertainty in terms of both feasibility and optimality.

AAMAS Conference 2017 Conference Paper

Uncharted but not Uninfluenced: Influence Maximization with an Uncertain Network

  • Bryan Wilder
  • Amulya Yadav
  • Nicole Immorlica
  • Eric Rice
  • Milind Tambe

This paper focuses on new challenges in influence maximization inspired by non-profits’ use of social networks to effect behavioral change in their target populations. Influence maximization is a multiagent problem where the challenge is to select the most influential agents from a population connected by a social network. Specifically, our work is motivated by the problem of spreading messages about HIV prevention among homeless youth using their social network. We show how to compute solutions which are provably close to optimal when the parameters of the influence process are unknown. We then extend our algorithm to a dynamic setting where information about the network is revealed at each stage. Simulation experiments using real-world networks collected by a homeless shelter show the advantages of our approach.
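
The subroutine this abstract builds on, greedy seed selection with Monte Carlo spread estimation under the independent cascade model, can be sketched as follows. The network, edge probabilities, and trial counts are made up; the paper's contribution is the approximation guarantees under parameter uncertainty, not this basic loop.

```python
# Sketch of greedy influence maximization with Monte Carlo estimation of
# spread under the independent cascade model (illustrative only; the graph
# and activation probabilities are invented).

import random

# Hypothetical social network: node -> list of (neighbor, activation prob.)
GRAPH = {
    0: [(1, 0.8), (2, 0.5)],
    1: [(3, 0.6)],
    2: [(3, 0.4), (4, 0.7)],
    3: [(5, 0.5)],
    4: [(5, 0.3)],
    5: [],
}

def cascade(seeds, rng):
    # One random realization of the independent cascade process.
    active, frontier = set(seeds), list(seeds)
    while frontier:
        node = frontier.pop()
        for nbr, p in GRAPH[node]:
            if nbr not in active and rng.random() < p:
                active.add(nbr)
                frontier.append(nbr)
    return len(active)

def estimate_spread(seeds, trials=2000, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    return sum(cascade(seeds, rng) for _ in range(trials)) / trials

def greedy_seeds(k):
    seeds = []
    for _ in range(k):
        candidates = [n for n in GRAPH if n not in seeds]
        best = max(candidates, key=lambda n: estimate_spread(seeds + [n]))
        seeds.append(best)
    return seeds

chosen = greedy_seeds(2)
```

Greedy selection inherits a (1 - 1/e) approximation guarantee when the spread function is submodular; handling uncertainty in the edge probabilities themselves is what the paper adds on top of this loop.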

AAMAS Conference 2016 Conference Paper

CAPTURE: A New Predictive Anti-Poaching Tool for Wildlife Protection

  • Thanh H. Nguyen
  • Arunesh Sinha
  • Shahrzad Gholami
  • Andrew Plumptre
  • Lucas Joppa
  • Milind Tambe
  • Margaret Driciru
  • Fred Wanyama

Wildlife poaching presents a serious extinction threat to many animal species. Agencies (“defenders”) focused on protecting such animals need tools that help analyze, model and predict poacher activities, so they can more effectively combat such poaching; such tools could also assist in planning effective defender patrols, building on previous security games research. To that end, we have built a new predictive anti-poaching tool, CAPTURE (Comprehensive Anti-Poaching tool with Temporal and observation Uncertainty REasoning). CAPTURE provides four main contributions. First, CAPTURE’s modeling of poachers provides significant advances over previous models from behavioral game theory and conservation biology. This modeling accounts for: (i) the defender’s imperfect detection of poaching signs; (ii) complex temporal dependencies in the poacher’s behaviors; (iii) lack of knowledge of the number of poachers. Second, we provide two new heuristics, parameter separation and target abstraction, to reduce the computational complexity in learning the poacher models. Third, we present a new game-theoretic algorithm for computing the defender’s optimal patrolling given the complex poacher model. Finally, we present detailed models and analysis of real-world poaching data collected over 12 years in Queen Elizabeth National Park in Uganda to evaluate our new model’s prediction accuracy. This paper thus presents the largest dataset of real-world defender-adversary interactions analyzed in the security games literature. CAPTURE will be tested in Uganda in early 2016.

AAAI Conference 2016 Conference Paper

Deploying PAWS to Combat Poaching: Game-Theoretic Patrolling in Areas with Complex Terrain (Demonstration)

  • Fei Fang
  • Thanh Nguyen
  • Rob Pickles
  • Wai Lam
  • Gopalasamy Clements
  • Bo An
  • Amandeep Singh
  • Milind Tambe

The conservation of key wildlife species such as tigers and elephants is threatened by poaching activities. In many conservation areas, foot patrols are conducted to prevent poaching, but they may not be well planned to make the best use of the limited patrolling resources. While prior work has introduced PAWS (Protection Assistant for Wildlife Security) as a game-theoretic decision aid to design effective foot patrol strategies to protect wildlife, the patrol routes generated by PAWS may be difficult to follow in areas with complex terrain. Subsequent research has worked on the significant evolution of PAWS, from an emerging application to regularly deployed software. A key advance of the deployed version of PAWS is that it incorporates the complex terrain information and generates a strategy consisting of easy-to-follow routes. In this demonstration, we provide 1) a video introducing the PAWS system; 2) an interactive visualization of the patrol routes generated by PAWS in an example area with complex terrain; and 3) a machine-human competition in designing patrol strategy given complex terrain and animal distribution.

JAAMAS Journal 2016 Journal Article

Every team deserves a second chance: an extended study on predicting team performance

  • Leandro Soriano Marcolino
  • Aravind S. Lakshminarayanan
  • Milind Tambe

Abstract Voting among different agents is a powerful tool in problem solving, and it has been widely applied to improve the performance in finding the correct answer to complex problems. We present a novel benefit of voting, that has not been observed before: we can use the voting patterns to assess the performance of a team and predict their final outcome. This prediction can be executed at any moment during problem-solving and it is completely domain independent. Hence, it can be used to identify when a team is failing, allowing an operator to take remedial procedures (such as changing team members, the voting rule, or increasing the allocation of resources). We present three main theoretical results: (1) we show a theoretical explanation of why our prediction method works; (2) contrary to what would be expected based on a simpler explanation using classical voting models, we show that we can make accurate predictions irrespective of the strength (i.e., performance) of the teams, and that in fact, the prediction can work better for diverse teams composed of different agents than uniform teams made of copies of the best agent; (3) we show that the quality of our prediction increases with the size of the action space. We perform extensive experimentation in two different domains: Computer Go and Ensemble Learning. In Computer Go, we obtain high quality predictions about the final outcome of games. We analyze the prediction accuracy for three different teams with different levels of diversity and strength, and show that the prediction works significantly better for a diverse team. Additionally, we show that our method still works well when trained with games against one adversary, but tested with games against another, showing the generality of the learned functions. Moreover, we evaluate four different board sizes, and experimentally confirm better predictions in larger board sizes. We analyze in detail the learned prediction functions, and how they change according to each team and action space size. In order to show that our method is domain independent, we also present results in Ensemble Learning, where we make online predictions about the performance of a team of classifiers, while they are voting to classify sets of items. We study a set of classical classification algorithms from machine learning, in a data-set of hand-written digits, and we are able to make high-quality predictions about the final performance of two different teams. Since our approach is domain independent, it can be easily applied to a variety of other domains.

AAAI Conference 2016 Conference Paper

From the Lab to the Classroom and Beyond: Extending a Game-Based Research Platform for Teaching AI to Diverse Audiences

  • Nicole Sintov
  • Debarun Kar
  • Thanh Nguyen
  • Fei Fang
  • Kevin Hoffman
  • Arnaud Lyet
  • Milind Tambe

Recent years have seen increasing interest in AI from outside the AI community. This is partly due to applications based on AI that have been used in real-world domains, for example, the successful deployment of game theory-based decision aids in security domains. This paper describes our teaching approach for introducing the AI concepts underlying security games to diverse audiences. We adapted a game-based research platform that served as a testbed for recent research advances in computational game theory into a set of interactive role-playing games. We guided learners in playing these games as part of our teaching strategy, which also included didactic instruction and interactive exercises on broader AI topics. We describe our experience in applying this teaching approach to diverse audiences, including students of an urban public high school, university undergraduates, and security domain experts who protect wildlife. We evaluate our approach based on results from the games and participant surveys.

ECAI Conference 2016 Conference Paper

Get Me to My GATE on Time: Efficiently Solving General-Sum Bayesian Threat Screening Games

  • Aaron Schlenker
  • Matthew Brown 0002
  • Arunesh Sinha
  • Milind Tambe
  • Ruta Mehta

Threat Screening Games (TSGs) are used in domains where there is a set of individuals or objects to screen with a limited amount of screening resources available to screen them. TSGs are broadly applicable to domains like airport passenger screening, stadium screening, cargo container screening, etc. Previous work on TSGs focused only on the Bayesian zero-sum case and provided the MGA algorithm to solve these games. In this paper, we solve Bayesian general-sum TSGs, which we prove are NP-hard even when exploiting a compact marginal representation. We also present an algorithm based upon an adversary-type hierarchical tree decomposition and an efficient branch-and-bound search to solve Bayesian general-sum TSGs. With this we provide four contributions: (1) GATE, the first algorithm for solving Bayesian general-sum TSGs, which uses hierarchical type trees and a novel branch-and-bound search, (2) the Branch-and-Guide approach, which combines branch-and-bound search with the MGA algorithm for the first time, (3) heuristics based on properties of TSGs for accelerated computation of GATE, and (4) experimental results showing the scalability of GATE needed for real-world domains.

AAAI Conference 2016 Conference Paper

One Size Does Not Fit All: A Game-Theoretic Approach for Dynamically and Effectively Screening for Threats

  • Matthew Brown
  • Arunesh Sinha
  • Aaron Schlenker
  • Milind Tambe

An effective way of preventing attacks in secure areas is to screen for threats (people, objects) before entry, e.g., screening of airport passengers. However, screening every entity at the same level may be both ineffective and undesirable. The challenge then is to find a dynamic approach for randomized screening, allowing for more effective use of limited screening resources, leading to improved security. We address this challenge with the following contributions: (1) a threat screening game (TSG) model for general screening domains; (2) an NP-hardness proof for computing the optimal strategy of TSGs; (3) a scheme for decomposing TSGs into subgames to improve scalability; (4) a novel algorithm that exploits a compact game representation to efficiently solve TSGs, providing the optimal solution under certain conditions; and (5) an empirical comparison of our proposed algorithm against the current state-of-the-art optimal approach for large-scale game-theoretic resource allocation problems.

AAMAS Conference 2016 Conference Paper

Restless Poachers: Handling Exploration-Exploitation Tradeoffs in Security Domains

  • Yundi Qian
  • Chao Zhang
  • Bhaskar Krishnamachari
  • Milind Tambe

The success of Stackelberg Security Games (SSGs) in counterterrorism domains has inspired researchers’ interest in applying game-theoretic models to other security domains with frequent interactions between defenders and attackers, e.g., wildlife protection. Previous research optimizes defenders’ strategies by modeling this problem as a repeated Stackelberg game, capturing the special property of this domain — frequent interactions between defenders and attackers. However, this research fails to handle the exploration-exploitation tradeoff in this domain, caused by the fact that defenders only have knowledge of attack activities at targets they protect. This paper addresses this shortcoming and provides the following contributions: (i) We formulate the problem as a restless multi-armed bandit (RMAB) model to address this challenge. (ii) To use the Whittle index policy to plan patrol strategies in the RMAB, we provide two sufficient conditions for indexability and an algorithm to numerically evaluate indexability. (iii) Given indexability, we propose a binary search based algorithm to find the Whittle index policy efficiently.
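
The binary-search recipe in contribution (iii) can be illustrated on a single two-state arm: the Whittle index of a state is the passivity subsidy that makes acting and not acting equally attractive there. The transition probabilities, rewards, and discount below are invented, and this toy instance is simply assumed to be indexable rather than verified as the paper requires.

```python
# Whittle index for one toy two-state restless arm, found by binary search
# on the passivity subsidy. All numbers (transitions, rewards, discount)
# are made up, and indexability is assumed rather than checked.

P_PASSIVE = [0.1, 0.7]  # P(next state is "good" | current state), when passive
P_ACTIVE = [0.6, 0.9]   # same, when the arm is played
DISCOUNT = 0.9          # per-step reward equals the state label: 0 or 1

def action_values(state, subsidy, V):
    # Q-values of staying passive (earning the subsidy) vs. playing the arm.
    passive = subsidy + state + DISCOUNT * (
        P_PASSIVE[state] * V[1] + (1 - P_PASSIVE[state]) * V[0])
    active = state + DISCOUNT * (
        P_ACTIVE[state] * V[1] + (1 - P_ACTIVE[state]) * V[0])
    return passive, active

def q_values(state, subsidy, iters=500):
    # Value iteration for the single subsidized arm.
    V = [0.0, 0.0]
    for _ in range(iters):
        V = [max(action_values(0, subsidy, V)),
             max(action_values(1, subsidy, V))]
    return action_values(state, subsidy, V)

def whittle_index(state, lo=0.0, hi=5.0, tol=1e-4):
    # Smallest subsidy at which staying passive is as good as playing the arm.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        passive, active = q_values(state, mid)
        if passive >= active:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

idx_bad, idx_good = whittle_index(0), whittle_index(1)
```

In this instance the "bad" state gets the higher index because playing the arm improves its transition probabilities more, so the index policy would prioritize it; the paper's algorithms additionally certify indexability, which this sketch takes for granted.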

AAMAS Conference 2016 Conference Paper

Signaling in Bayesian Stackelberg Games

  • Haifeng Xu
  • Rupert Freeman
  • Vincent Conitzer
  • Shaddin Dughmi
  • Milind Tambe

Algorithms for solving Stackelberg games are used in an ever-growing variety of real-world domains. Previous work has extended this framework to allow the leader to commit not only to a distribution over actions, but also to a scheme for stochastically signaling information about these actions to the follower. This can result in higher utility for the leader. In this paper, we extend this methodology to Bayesian games, in which the leader, the follower, or both have payoff-relevant private information. This leads to novel variants of the model, for example by imposing an incentive compatibility constraint for each type to listen to the signal intended for it. We show that, in contrast to previous hardness results for the case without signaling [5, 16], we can solve unrestricted games in time polynomial in their natural representation. For security games, we obtain hardness results as well as efficient algorithms, depending on the settings. We show the benefits of our approach in experimental evaluations of our algorithms.

AAMAS Conference 2016 Conference Paper

SPECTRE: A Game Theoretic Framework for Preventing Collusion in Security Games (Demonstration)

  • Shahrzad Gholami
  • Bryan Wilder
  • Matthew Brown
  • Arunesh Sinha
  • Nicole Sintov
  • Milind Tambe

Several models have been proposed for Stackelberg security games (SSGs) and protection against perfectly rational and boundedly rational adversaries; however, none of these existing models addresses the destructive cooperation mechanism between adversaries. SPECTRE (Strategic Patrol planner to Extinguish Collusive ThREats) takes into account the synergistic destructive collusion between two groups of adversaries in security games. The framework is designed for efficient patrol scheduling for security agents in security games in the presence of collusion, and is mainly built upon game-theoretic approaches, optimization techniques, machine learning methods, and theories of human decision making under risk. A major advantage of SPECTRE is its use of real-world data from human subject experiments with participants on Amazon Mechanical Turk (AMT).

IJCAI Conference 2016 Conference Paper

Three Strategies to Success: Learning Adversary Models in Security Games

  • Nika Haghtalab
  • Fei Fang
  • Thanh H. Nguyen
  • Arunesh Sinha
  • Ariel D. Procaccia
  • Milind Tambe

State-of-the-art applications of Stackelberg security games - including wildlife protection - offer a wealth of data, which can be used to learn the behavior of the adversary. But existing approaches either make strong assumptions about the structure of the data, or gather new data through online algorithms that are likely to play severely suboptimal strategies. We develop a new approach to learning the parameters of the behavioral model of a bounded rational attacker (thereby pinpointing a near optimal strategy), by observing how the attacker responds to only three defender strategies. We also validate our approach using experiments on real and synthetic data.
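
The kind of behavioral-model learning this abstract describes can be sketched with the simplest quantal-response attacker: attack probabilities proportional to exp(λ·utility), with λ recovered by maximum likelihood over a grid. The target utilities, the true λ, and the noise-free "observations" below are fabricated for illustration; the paper learns a richer bounded-rationality model from observations of only three defender strategies.

```python
# Fitting a quantal-response attacker by grid-search maximum likelihood.
# Target utilities, the true lambda, and the idealized observations are
# all invented; this is a stand-in for the paper's learning setup.

import math

ATTACKER_UTILS = [3.0, 2.0, 1.0, 0.5]  # attacker's utility per target

def qr_probs(lam):
    # Attack probability proportional to exp(lam * utility).
    weights = [math.exp(lam * u) for u in ATTACKER_UTILS]
    total = sum(weights)
    return [w / total for w in weights]

TRUE_LAMBDA = 0.8
observed = qr_probs(TRUE_LAMBDA)  # noise-free attack frequencies

def log_likelihood(lam):
    return sum(f * math.log(p) for f, p in zip(observed, qr_probs(lam)))

grid = [i / 100 for i in range(301)]  # candidate lambdas in [0, 3]
best_lambda = max(grid, key=log_likelihood)
```

Because the observed frequencies here are generated noise-free from the model itself, the grid search recovers the true λ exactly; with real attack data the estimate would carry sampling error, which is where the paper's sample-complexity analysis matters.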

ECAI Conference 2016 Conference Paper

Toward Addressing Collusion Among Human Adversaries in Security Games

  • Shahrzad Gholami
  • Bryan Wilder
  • Matthew Brown 0002
  • Dana Thomas
  • Nicole D. Sintov
  • Milind Tambe

Stackelberg security games and related algorithms have been deployed by major security agencies, including the US Coast Guard, the Federal Air Marshal Service, and the Los Angeles Airport police, to protect against a single adversary or multiple, independent adversaries acting strategically. However, there are a variety of real-world security domains where adversaries may benefit from colluding in their actions against the defender. Given the potential negative effect of these collusive actions, the defender has an incentive to break up collusion by playing off the self-interest of individual adversaries. This paper deals with the problem of collusive security games for rational and boundedly rational adversaries. The theoretical results, verified with human subject experiments, show that a behavioral model that optimizes against boundedly rational adversaries provides demonstrably better-performing defender strategies against human subjects.

AAMAS Conference 2016 Conference Paper

Using Abstractions to Solve Opportunistic Crime Security Games at Scale

  • Chao Zhang
  • Victor Bucarey
  • Ayan Mukhopadhyay
  • Arunesh Sinha
  • Yundi Qian
  • Yevgeniy Vorobeychik
  • Milind Tambe

In this paper, we aim to deter urban crime by recommending optimal police patrol strategies against opportunistic criminals in large-scale urban problems. While previous work has tried to learn criminals’ behavior from real world data and generate patrol strategies against opportunistic crimes, it cannot scale up to large-scale urban problems. Our first contribution is a game abstraction framework that can handle opportunistic crimes in large-scale urban areas. In this game abstraction framework, we model the interaction between officers and opportunistic criminals as a game with discrete targets. By merging similar targets, we obtain an abstract game with fewer total targets. We use real world data to learn and plan against opportunistic criminals in this abstract game, and then propagate the results of this abstract game back to the original game. Our second contribution is the layer-generating algorithm used to merge targets as described in the framework above. This algorithm applies a mixed integer linear program (MILP) to merge similar and geographically neighboring targets in the large-scale problem. As our third contribution, we propose a planning algorithm that recommends a mixed strategy against opportunistic criminals. Finally, our fourth contribution is a heuristic propagation model to handle the problem of limited data we occasionally encounter in large-scale problems. As part of our collaboration with local police departments, we apply our model in two large-scale urban problems: a university campus and a city. Our approach provides high prediction accuracy in the real datasets; furthermore, we project significant crime rate reduction using our planning strategy compared to current police strategy.
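
The target-merging step of the abstraction framework can be caricatured with a greedy single-pass clustering that fuses targets that are geographically close and feature-similar. The coordinates, the attractiveness feature, and both thresholds are invented; the paper performs this layer generation with a MILP rather than a greedy pass.

```python
# Caricature of the abstraction step: greedily merge targets that are
# geographically close and have similar features. Data and thresholds are
# invented; the paper's layer-generating algorithm uses a MILP instead.

# Hypothetical targets: (x, y, crime_attractiveness)
targets = [
    (0.0, 0.0, 0.9),
    (0.1, 0.1, 0.8),  # near target 0, similar attractiveness -> merged
    (5.0, 5.0, 0.2),
    (5.1, 5.0, 0.3),  # near target 2 -> merged
    (9.0, 0.0, 0.5),  # isolated -> its own abstract target
]

def similar(a, b, max_dist=0.5, max_feat_gap=0.2):
    dist = ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    return dist <= max_dist and abs(a[2] - b[2]) <= max_feat_gap

def merge_targets(targets):
    clusters = []
    for t in targets:
        for cluster in clusters:
            # Join the first cluster whose representative (its first member)
            # is similar enough; otherwise start a new cluster.
            if similar(cluster[0], t):
                cluster.append(t)
                break
        else:
            clusters.append([t])
    return clusters

abstract_targets = merge_targets(targets)
```

Each resulting cluster becomes one target of the abstract game; learning and planning then run on the smaller game before results are propagated back, as the abstract describes.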

AAAI Conference 2015 Conference Paper

Combining Compact Representation and Incremental Generation in Large Games with Sequential Strategies

  • Branislav Bosansky
  • Albert Xin Jiang
  • Milind Tambe
  • Christopher Kiekintveld

Many search and security games played on a graph can be modeled as normal-form zero-sum games with strategies consisting of sequences of actions. The size of the strategy space provides a computational challenge when solving these games. This complexity is tackled either by using the compact representation of sequential strategies and linear programming, or by incremental strategy generation of iterative double-oracle methods. In this paper, we present a novel hybrid of these two approaches: the compact-strategy double-oracle (CS-DO) algorithm, which combines the advantages of the compact representation with incremental strategy generation. We experimentally compare CS-DO with the standard approaches and analyze the impact of the size of the support on the performance of the algorithms. Results show that CS-DO dramatically improves the convergence rate in games with non-trivial support.
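
The incremental-generation half of the hybrid can be sketched with a plain double oracle on rock-paper-scissors: start from one pure strategy per player, repeatedly solve the restricted game, and add each player's exact best response until nothing new appears. Here the restricted game is solved approximately by fictitious play rather than the LP-based compact representation that CS-DO uses.

```python
# Toy double oracle on rock-paper-scissors (not CS-DO itself; the restricted
# game is solved approximately by fictitious play instead of an LP).

PAYOFF = [  # row player's payoff: rock, paper, scissors
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]

def fictitious_play(rows, cols, iters=3000):
    # Approximate equilibrium of the restricted game spanned by rows x cols.
    row_counts = [0] * len(rows)
    col_counts = [0] * len(cols)
    row_counts[0] = col_counts[0] = 1
    for _ in range(iters):
        # Each side best-responds to the opponent's empirical mixture.
        r = max(range(len(rows)), key=lambda i: sum(
            PAYOFF[rows[i]][cols[j]] * col_counts[j] for j in range(len(cols))))
        c = min(range(len(cols)), key=lambda j: sum(
            PAYOFF[rows[i]][cols[j]] * row_counts[i] for i in range(len(rows))))
        row_counts[r] += 1
        col_counts[c] += 1
    return ([n / sum(row_counts) for n in row_counts],
            [n / sum(col_counts) for n in col_counts])

def double_oracle():
    rows, cols = [0], [0]  # start with a single pure strategy each
    while True:
        x, y = fictitious_play(rows, cols)
        # Exact best responses over the FULL strategy sets.
        br_row = max(range(3), key=lambda i: sum(
            PAYOFF[i][cols[j]] * y[j] for j in range(len(cols))))
        br_col = min(range(3), key=lambda j: sum(
            PAYOFF[rows[i]][j] * x[i] for i in range(len(rows))))
        grew = False
        if br_row not in rows:
            rows.append(br_row); grew = True
        if br_col not in cols:
            cols.append(br_col); grew = True
        if not grew:
            return rows, cols, x, y

rows, cols, x, y = double_oracle()
```

On rock-paper-scissors the oracle loop grows both supports to all three strategies and then stops, since every best response is already present; CS-DO's insight is to run this kind of loop over compactly represented sequential strategies instead of explicit matrix entries.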

AAAI Conference 2015 Conference Paper

Exploring Information Asymmetry in Two-Stage Security Games

  • Haifeng Xu
  • Zinovi Rabinovich
  • Shaddin Dughmi
  • Milind Tambe

Stackelberg security games have been widely deployed to protect real-world assets. The main solution concept there is the Strong Stackelberg Equilibrium (SSE), which optimizes the defender’s random allocation of limited security resources. However, solely deploying the SSE mixed strategy has limitations. In the extreme case, there are security games in which the defender is able to defend all the assets “almost perfectly” at the SSE, but she still sustains significant loss. In this paper, we propose an approach for improving the defender’s utility in such scenarios. Perhaps surprisingly, our approach is to strategically reveal to the attacker information about the sampled pure strategy. Specifically, we propose a two-stage security game model, where in the first stage the defender allocates resources and the attacker selects a target to attack, and in the second stage the defender strategically reveals local information about that target, potentially deterring the attacker’s attack plan. We then study how the defender can play optimally in both stages. We show, theoretically and experimentally, that the two-stage security game model allows the defender to achieve strictly better utility than SSE.

IJCAI Conference 2015 Conference Paper

Security Games with Information Leakage: Modeling and Computation

  • Haifeng Xu
  • Albert Xing Jiang
  • Arunesh Sinha
  • Zinovi Rabinovich
  • Shaddin Dughmi
  • Milind Tambe

Most models of Stackelberg security games assume that the attacker only knows the defender's mixed strategy, but is not able to observe (even partially) the instantiated pure strategy. Such partial observation of the deployed pure strategy – an issue we refer to as information leakage – is a significant concern in practical applications. While previous research on patrolling games has considered the attacker's real-time surveillance, our setting, and therefore our models and techniques, are fundamentally different. More specifically, after describing the information leakage model, we start with an LP formulation to compute the defender's optimal strategy in the presence of leakage. Perhaps surprisingly, we show that a key subproblem in solving this LP (more precisely, the defender oracle) is NP-hard even for the simplest of security game models. We then approach the problem from three possible directions: efficient algorithms for restricted cases, approximation algorithms, and heuristic sampling algorithms that improve upon the status quo. Our experiments confirm the necessity of handling information leakage and the advantage of our algorithms.

IJCAI Conference 2015 Conference Paper

When Security Games Go Green: Designing Defender Strategies to Prevent Poaching and Illegal Fishing

  • Fei Fang
  • Peter Stone
  • Milind Tambe

Building on the successful applications of Stackelberg Security Games (SSGs) to protect infrastructure, researchers have begun focusing on applying game theory to green security domains such as protection of endangered animals and fish stocks. Previous efforts in these domains optimize defender strategies based on the standard Stackelberg assumption that the adversaries become fully aware of the defender’s strategy before taking action. Unfortunately, this assumption is inappropriate since adversaries in green security domains often lack the resources to fully track the defender strategy. This paper (i) introduces Green Security Games (GSGs), a novel game model for green security domains with a generalized Stackelberg assumption; (ii) provides algorithms to plan effective sequential defender strategies — such planning was absent in previous work; (iii) proposes a novel approach to learn adversary models that further improves defender performance; and (iv) provides detailed experimental analysis of proposed approaches.

ICAPS Conference 2014 Conference Paper

Computing Solutions in Infinite-Horizon Discounted Adversarial Patrolling Games

  • Yevgeniy Vorobeychik
  • Bo An 0001
  • Milind Tambe
  • Satinder Singh 0001

Stackelberg games form the core of a number of tools deployed for computing optimal patrolling strategies in adversarial domains, such as the US Federal Air Marshall Service and the US Coast Guard. In traditional Stackelberg security game models the attacker knows only the probability that each target is covered by the defender, but is oblivious to the detailed timing of the coverage schedule. In many real-world situations, however, the attacker can observe the current location of the defender and can exploit this knowledge to reason about the defender's future moves. We show that this general modeling framework can be captured using adversarial patrolling games (APGs) in which the defender sequentially moves between targets, with moves constrained by a graph, while the attacker can observe the defender's current location and his (stochastic) policy concerning future moves. We offer a very general model of infinite-horizon discounted adversarial patrolling games. Our first contribution is to show that defender policies that condition only on the previous defense move (i.e., Markov stationary policies) can be arbitrarily suboptimal for general APGs. We then offer a mixed-integer non-linear programming (MINLP) formulation for computing optimal randomized policies for the defender that can condition on history of bounded, but arbitrary, length, as well as a mixed-integer linear programming (MILP) formulation to approximate these, with provable quality guarantees. Additionally, we present a non-linear programming (NLP) formulation for solving zero-sum APGs. We show experimentally that MILP significantly outperforms the MINLP formulation, and is, in turn, significantly outperformed by the NLP specialized to zero-sum games.

NeurIPS Conference 2014 Conference Paper

Diverse Randomized Agents Vote to Win

  • Albert Jiang
  • Leandro Soriano Marcolino
  • Ariel Procaccia
  • Tuomas Sandholm
  • Nisarg Shah
  • Milind Tambe

We investigate the power of voting among diverse, randomized software agents. With teams of computer Go agents in mind, we develop a novel theoretical model of two-stage noisy voting that builds on recent work in machine learning. This model allows us to reason about a collection of agents with different biases (determined by the first-stage noise models), which, furthermore, apply randomized algorithms to evaluate alternatives and produce votes (captured by the second-stage noise models). We analytically demonstrate that a uniform team, consisting of multiple instances of any single agent, must make a significant number of mistakes, whereas a diverse team converges to perfection as the number of agents grows. Our experiments, which pit teams of computer Go agents against strong agents, provide evidence for the effectiveness of voting when agents are diverse.
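The uniform-versus-diverse contrast in this abstract echoes a classic Condorcet-jury calculation: independent (diverse) voters compound their accuracy under majority rule, while identical copies of one agent cast perfectly correlated votes and gain nothing. A minimal sketch for binary alternatives and independent voters, using textbook probability rather than the paper's two-stage noise model:

```python
from math import comb

def majority_accuracy(p, n):
    """Probability that a strict majority of n independent voters,
    each correct with probability p, picks the correct binary alternative.
    Using odd n avoids ties."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))
```

With p = 0.6, a single agent is right 60% of the time, but a majority of 101 independent such voters is right over 95% of the time; the independence assumption is precisely what a uniform team of identical agents lacks.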

AAAI Conference 2014 Conference Paper

Give a Hard Problem to a Diverse Team: Exploring Large Action Spaces

  • Leandro Soriano Marcolino
  • Haifeng Xu
  • Albert Xin Jiang
  • Milind Tambe
  • Emma Bowring

Recent work has shown that diverse teams can outperform a uniform team made of copies of the best agent. However, there are fundamental questions that were not asked before. When should we use diverse or uniform teams? How does the performance change as the action space or the teams get larger? Hence, we present a new model of diversity for teams that is more general than previous models. We prove that the performance of a diverse team improves as the size of the action space gets larger. Concerning the size of the diverse team, we show that the performance converges exponentially fast to the optimal one as we increase the number of agents. We present synthetic experiments that allow us to gain further insights: even though a diverse team outperforms a uniform team when the size of the action space increases, the uniform team will eventually again play better than the diverse team for a large enough action space. We verify our predictions in a system of Go playing agents, where we show a diverse team that improves in performance as the board size increases, and eventually overcomes a uniform team.

AAAI Conference 2014 Conference Paper

Regret-Based Optimization and Preference Elicitation for Stackelberg Security Games with Uncertainty

  • Thanh Nguyen
  • Amulya Yadav
  • Bo An
  • Milind Tambe
  • Craig Boutilier

Stackelberg security games (SSGs) have been deployed in a number of real-world domains. One key challenge in these applications is the assessment of attacker payoffs, which may not be perfectly known. Previous work has studied SSGs with uncertain payoffs modeled by interval uncertainty and provided maximin-based robust solutions. In contrast, in this work we propose the use of the less conservative minimax regret decision criterion for such payoff-uncertain SSGs and present the first algorithms for computing minimax regret for SSGs. We also address the challenge of preference elicitation, using minimax regret to develop the first elicitation strategies for SSGs. Experimental results validate the effectiveness of our approaches.
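The minimax-regret criterion this abstract advocates can be illustrated by brute force: the regret of a strategy in a payoff scenario is its gap to that scenario's best strategy, and we pick the strategy minimizing the worst-case gap. A toy sketch over finite strategy and scenario sets (the paper's algorithms work over continuous mixed-strategy spaces and interval uncertainty; the names below are illustrative):

```python
def minimax_regret(utility, strategies, scenarios):
    """utility(s, w) -> defender payoff of strategy s under payoff scenario w.
    Returns (chosen strategy, its max regret) under the minimax-regret criterion."""
    # best achievable payoff in each scenario
    best = {w: max(utility(s, w) for s in strategies) for w in scenarios}
    def max_regret(s):
        return max(best[w] - utility(s, w) for w in scenarios)
    s_star = min(strategies, key=max_regret)
    return s_star, max_regret(s_star)
```

For a table where strategy a yields (10, 0) across two scenarios and b yields (4, 3), maximin picks b (worst case 3 vs. 0), while minimax regret picks a (worst regret 3 vs. 6), illustrating why the criterion is less conservative.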

AAAI Conference 2014 Conference Paper

Solving Zero-Sum Security Games in Discretized Spatio-Temporal Domains

  • Haifeng Xu
  • Fei Fang
  • Albert Jiang
  • Vincent Conitzer
  • Shaddin Dughmi
  • Milind Tambe

Among the many deployment areas of Stackelberg Security games, a major area involves games played out in space and time, which includes applications in multiple mobile defender resources protecting multiple mobile targets. Previous algorithms for such spatio-temporal security games fail to scale up, and little is known about the computational complexity properties of these problems. This paper provides a novel oracle-based algorithmic framework for a systematic study of different problem variants of computing optimal (minimax) strategies in spatio-temporal security games. Our framework enables efficient computation of a minimax strategy when the problem admits a polynomial-time oracle. Furthermore, for the cases in which efficient oracles are difficult to find, we propose approximations or prove hardness results.

ECAI Conference 2014 Conference Paper

Unleashing Dec-MDPs in Security Games: Enabling Effective Defender Teamwork

  • Eric Anyung Shieh
  • Albert Xin Jiang
  • Amulya Yadav
  • Pradeep Varakantham
  • Milind Tambe

Multiagent teamwork and defender-attacker security games are two areas that are currently receiving significant attention within multiagent systems research. Unfortunately, despite the need for effective teamwork among multiple defenders, little has been done to harness the teamwork research in security games. This paper is the first to remedy this situation by integrating the powerful teamwork mechanisms offered by Dec-MDPs into security games. We offer the following novel contributions in this paper: (i) New models of security games where a defender team's pure strategy is defined as a Dec-MDP policy for addressing coordination under uncertainty; (ii) New algorithms based on column generation that enable efficient generation of mixed strategies given this new model; (iii) Handling global events during defender execution for effective teamwork; (iv) Exploration of the robustness of randomized pure strategies. The paper opens the door to a potentially new area combining computational game theory and multiagent teamwork.

AAAI Conference 2013 Conference Paper

Analyzing the Effectiveness of Adversary Modeling in Security Games

  • Thanh Nguyen
  • Rong Yang
  • Amos Azaria
  • Sarit Kraus
  • Milind Tambe

Recent deployments of Stackelberg security games (SSG) have led to two competing approaches to handle boundedly rational human adversaries: (1) integrating models of human (adversary) decision-making into the game-theoretic algorithms, and (2) applying robust optimization techniques that avoid adversary modeling. A recent algorithm (MATCH) based on the second approach was shown to outperform the leading modeling-based algorithm even in the presence of a significant amount of data. Is there then any value in using human behavior models in solving SSGs? Through extensive experiments with 547 human subjects playing 11102 games in total, we emphatically answer the question in the affirmative, while providing the following key contributions: (i) we show that our algorithm, SU-BRQR, based on a novel integration of a human behavior model with the subjective utility function, significantly outperforms both MATCH and its improvements; (ii) we are the first to present experimental results with security intelligence experts, and find that even though the experts are more rational than the Amazon Turk workers, SU-BRQR still outperforms an approach assuming perfect rationality (and to a more limited extent MATCH); (iii) we show the advantage of SU-BRQR in a new, large game setting and demonstrate that sufficient data enables it to improve its performance over MATCH.

IJCAI Conference 2013 Conference Paper

Defender (Mis)Coordination in Security Games

  • Albert Xin Jiang
  • Ariel D. Procaccia
  • Yundi Qian
  • Nisarg Shah
  • Milind Tambe

We study security games with multiple defenders. To achieve maximum security, defenders must perfectly synchronize their randomized allocations of resources. However, in real-life scenarios (such as protection of the port of Boston) this is not the case. Our goal is to quantify the loss incurred by miscoordination between defenders, both theoretically and empirically. We introduce two notions that capture this loss under different assumptions: the price of miscoordination, and the price of sequential commitment. Generally speaking, our theoretical bounds indicate that the loss may be extremely high in the worst case, while our simulations establish a smaller yet significant loss in practice.

AAMAS Conference 2013 Conference Paper

Diversity Beats Strength? A Hands-on Experience with 9X9 Go

  • Leandro Soriano Marcolino
  • Douglass Chen
  • Albert Xin Jiang
  • Milind Tambe

Team formation is a critical step in deploying a multi-agent team. In some scenarios, agents coordinate by voting continuously. When forming such teams, should we focus on the diversity of the team or on the strength of each member? Can a team of diverse (and weak) agents outperform a uniform team of strong agents? In this demo, the user will be able to explore these questions by playing one of the most challenging board games: Go.

IJCAI Conference 2013 Conference Paper

Efficiently Solving Joint Activity Based Security Games

  • Eric Shieh
  • Manish Jain
  • Albert Xin Jiang
  • Milind Tambe

Despite recent successful real-world deployments of Stackelberg Security Games (SSGs), scale-up remains a fundamental challenge in this field. The latest techniques do not scale up to domains where multiple defenders must coordinate time-dependent joint activities. To address this challenge, this paper presents two branch-and-price algorithms for solving SSGs, SMARTO and SMARTH, with three novel features: (i) a column-generation approach that uses an ordered network of nodes (determined by solving the traveling salesman problem) to generate individual defender strategies; (ii) exploitation of iterative reward shaping of multiple coordinating defender units to generate coordinated strategies; (iii) generation of tighter upper-bounds for pruning by solving security games that only abide by key scheduling constraints. We provide extensive experimental results and formal analyses.

JAAMAS Journal 2013 Journal Article

Empirical evaluation of computational fear contagion models in crowd dispersions

  • Jason Tsai
  • Emma Bowring
  • Milind Tambe

Abstract In social psychology, emotional contagion describes the widely observed phenomenon of one person's emotions being influenced by surrounding people's emotions. While the overall effect is agreed upon, the underlying mechanism of the spread of emotions has seen little quantification and application to computational agents despite extensive evidence of its impacts in everyday life. In this paper, we examine computational models of emotional contagion by implementing two models (Bosse et al., European Council on Modeling and Simulation, pp. 212–218, 2009; Durupinar, From Audiences to Mobs: Crowd Simulation with Psychological Factors, PhD dissertation, Bilkent University, 2010) that draw from two separate lines of contagion research: thermodynamics-based and epidemiological-based. We first perform sensitivity tests on each model in an evacuation simulation, ESCAPES, showing both models to be reasonably robust to parameter variations, with certain exceptions. We then compare their ability to reproduce a real crowd panic scene in simulation, showing that the thermodynamics-style model (Bosse et al., 2009) produces superior results due to the ill-suited contagion mechanism at the core of epidemiological models. We also identify that a graduated effect of fear and proximity-based contagion effects are key to producing the superior results. We then reproduce the methodology on a second video, showing that the same results hold, implying generality of the conclusions reached in the first scene.

AAMAS Conference 2013 Conference Paper

Modeling Human Adversary Decision Making in Security Games: An Initial Report

  • Thanh H. Nguyen
  • James Pita
  • Rajiv Maheswaran
  • Milind Tambe
  • Amos Azaria
  • Sarit Kraus

Motivated by recent deployments of Stackelberg security games (SSGs), two competing approaches have emerged which either integrate models of human decision making into game-theoretic algorithms or apply robust optimization techniques that avoid adversary modeling. Recently, a robust technique (MATCH) has been shown to significantly outperform the leading modeling-based algorithms (e.g., Quantal Response (QR)) even in the presence of significant amounts of subject data. As a result, the effectiveness of using human behaviors in solving SSGs remains in question. We study this question in this paper.

IJCAI Conference 2013 Conference Paper

Multi-Agent Team Formation: Diversity Beats Strength?

  • Leandro Soriano Marcolino
  • Albert Xin Jiang
  • Milind Tambe

Team formation is a critical step in deploying a multi-agent team. In some scenarios, agents coordinate by voting continuously. When forming such teams, should we focus on the diversity of the team or on the strength of each member? Can a team of diverse (and weak) agents outperform a uniform team of strong agents? We propose a new model to address these questions. Our key contributions include: (i) we show that a diverse team can overcome a uniform team and we give the necessary conditions for it to happen; (ii) we present optimal voting rules for a diverse team; (iii) we perform synthetic experiments that demonstrate that both diversity and strength contribute to the performance of a team; (iv) we show experiments that demonstrate the usefulness of our model in one of the most difficult challenges for Artificial Intelligence: Computer Go.

IJCAI Conference 2013 Conference Paper

Scaling-Up Security Games with Boundedly Rational Adversaries: A Cutting-Plane Approach

  • Rong Yang
  • Albert Xin Jiang
  • Milind Tambe
  • Fernando Ordóñez

To improve the current real-world deployments of Stackelberg security games (SSGs), it is critical now to efficiently incorporate models of adversary bounded rationality in large-scale SSGs. Unfortunately, previously proposed branch-and-price approaches fail to scale-up given the non-convexity of such models, as we show with a realization called COCOMO. Therefore, we next present a novel cutting-plane algorithm called BLADE to scale-up SSGs with complex adversary models, with three key novelties: (i) an efficient scalable separation oracle to generate deep cuts; (ii) a heuristic that uses gradient to further improve the cuts; (iii) techniques for quality-efficiency tradeoff.

AAMAS Conference 2013 Conference Paper

Security Games with Contagion: Handling Asymmetric Information

  • Jason Tsai
  • Yundi Qian
  • Yevgeniy Vorobeychik
  • Christopher Kiekintveld
  • Milind Tambe

Counterinsurgency, which is the effort to mitigate support for an opposing organization, is a domain that has been studied recently; past work has modeled the problem as an influence blocking maximization game that features an influencer and a mitigator. While past work has introduced scalable heuristic techniques for generating effective strategies using a double-oracle algorithm, it has not addressed the issue of uncertainty and asymmetric information, which is the topic of this paper.

AAMAS Conference 2012 Conference Paper

A Robust Approach to Addressing Human Adversaries in Security Games

  • James Pita
  • Richard John
  • Rajiv Maheswaran
  • Milind Tambe
  • Rong Yang
  • Sarit Kraus

While game-theoretic approaches have been proposed for addressing complex security resource allocation problems, many of the standard game-theoretic assumptions fail to address the human adversaries whom security forces will likely face. To that end, approaches have been proposed that attempt to incorporate better models of human decision-making in these security settings. We take a new approach where, instead of trying to create a model of human decision-making, we leverage ideas from robust optimization techniques. In addition, we extend our approach and the previous best-performing approach to also address human anchoring biases under limited observation conditions. To evaluate our approach, we perform a comprehensive examination comparing the performance of our new approach against the current leading approaches to addressing human adversaries. Finally, in our experiments we present the first-ever analysis of some demographic information and personality measures that may influence decision making in security games.

ECAI Conference 2012 Conference Paper

A Robust Approach to Addressing Human Adversaries in Security Games

  • James Pita
  • Richard John
  • Rajiv T. Maheswaran
  • Milind Tambe
  • Sarit Kraus

Game-theoretic approaches have been proposed for addressing the complex problem of assigning limited security resources to protect a critical set of targets. However, many of the standard assumptions fail to address the human adversaries whom security forces will likely face. To address this challenge, previous research has attempted to integrate models of human decision-making into the game-theoretic algorithms for security settings. The current leading approach, based on experimental evaluation, is derived from a well-founded solution concept known as quantal response and is known as BRQR. One critical difficulty with opponent modeling in general is that, in security domains, information about potential adversaries is often sparse or noisy and, furthermore, the games themselves are highly complex and large in scale. Thus, we chose to examine a completely new approach to addressing human adversaries that avoids the complex task of modeling human decision-making. We leverage and modify robust optimization techniques to create a new type of optimization where the defender's loss for a potential deviation by the attacker is bounded by the distance of that deviation from the expected-value-maximizing strategy. To demonstrate the advantages of our approach, we introduce a systematic way to generate meaningful reward structures and compare our approach with BRQR in the most comprehensive investigation to date involving 104 security settings, where previous work has tested only up to 10 security settings. Our experimental analysis reveals our approach performing as well as or outperforming BRQR in over 90% of the security settings tested, and we demonstrate significant runtime benefits. These results are in favor of utilizing an approach based on robust optimization in these complex domains to avoid the difficulties of opponent modeling.

AAMAS Conference 2012 Conference Paper

A Unified Method for Handling Discrete and Continuous Uncertainty in Bayesian Stackelberg Games

  • Zhengyu Yin
  • Milind Tambe

Given their existing and potential real-world security applications, Bayesian Stackelberg games have received significant research interest. In these games, the defender acts as a leader, and the many different follower types model the uncertainty over discrete attacker types. Unfortunately, since solving such games is an NP-hard problem, scale-up has remained a difficult challenge. This paper scales up Bayesian Stackelberg games, providing a novel unified approach to handle uncertainty not only over discrete follower types but also other key continuously distributed real-world uncertainty, due to the leader's execution error, the follower's observation error, and continuous payoff uncertainty. To that end, this paper provides contributions in two parts. First, we present a new algorithm for Bayesian Stackelberg games, called HUNTER, to scale up the number of types. HUNTER combines the following five key features: i) efficient pruning via a best-first search of the leader's strategy space; ii) a novel linear program for computing tight upper bounds for this search; iii) using Benders decomposition for solving the upper-bound linear program efficiently; iv) efficient inheritance of Benders cuts from parent to child; v) an efficient heuristic branching rule. Our experiments show that HUNTER provides orders-of-magnitude speedups over the best existing methods to handle discrete follower types. In the second part, we show HUNTER's efficiency for Bayesian Stackelberg games can be exploited to also handle the continuous uncertainty using sample average approximation. We experimentally show that our HUNTER-based approach also outperforms the latest robust solution methods under continuously distributed uncertainty.

AAMAS Conference 2012 Conference Paper

Adversarial Patrolling Games

  • Yevgeniy Vorobeychik
  • Bo An
  • Milind Tambe

Defender-Attacker Stackelberg games are the foundations of tools deployed for computing optimal patrolling strategies in adversarial domains such as the United States Federal Air Marshals Service and the United States Coast Guard, among others. In Stackelberg game models of these systems the attacker knows only the probability that each target is covered by the defender, but is oblivious to the detailed timing of the coverage schedule. In many real-world situations, however, the attacker can observe the current location of the defender and can exploit this knowledge to reason about the defender's future moves. We study Stackelberg security games in which the defender sequentially moves between targets, with moves constrained by an exogenously specified graph, while the attacker can observe the defender's current location and his (stochastic) policy concerning future moves.

AAMAS Conference 2012 Conference Paper

AgentPolis: Towards a Platform for Fully Agent-based Modeling of Multi-Modal Transportation

  • Michal Jakob
  • Zbyněk Moler
  • Antonín Komenda
  • Zhengyu Yin
  • Albert Xin Jiang
  • Matthew Johnson
  • Michal Pěchouček

AgentPolis is a fully agent-based platform for modeling multi-modal transportation systems. It comprises a high-performance discrete-event simulation core, a cohesive set of high-level abstractions for building extensible agent-based models and a library of predefined components frequently used in transportation and mobility models. Together with a suite of supporting tools, AgentPolis enables rapid prototyping and execution of data-driven simulations of a wide range of mobility and transportation phenomena. We illustrate the capabilities of the platform on a model of fare inspection in public transportation networks.

JAAMAS Journal 2012 Journal Article

An extended study on multi-objective security games

  • Matthew Brown
  • Bo An
  • Milind Tambe

Abstract The burgeoning area of security games has focused on real-world domains where security agencies protect critical infrastructure from a diverse set of adaptive adversaries. In such domains, decision makers have multiple competing objectives they must consider, which may take different forms that are not readily comparable, including safety, cost, and public perception. Thus, it can be difficult to know how to weigh the different objectives when deciding on a security strategy. To address the challenges of these domains, we propose a fundamentally different solution concept, multi-objective security games (MOSGs). Instead of a single optimal solution, MOSGs have a set of Pareto-optimal (non-dominated) solutions referred to as the Pareto frontier, which can be generated by solving a sequence of constrained single-objective optimization problems (CSOPs). The Pareto frontier allows the decision maker to analyze the tradeoffs that exist between the multiple objectives. Our contributions include: (i) an algorithm, Iterative-ε-Constraints, for generating the sequence of CSOPs; (ii) an exact approach for solving a mixed-integer linear program (MILP) formulation of a CSOP; (iii) heuristics that achieve speedup by exploiting the structure of security games to further constrain the MILP; (iv) an approximate approach for solving a CSOP built off those same heuristics, increasing the scalability of our approach with quality guarantees. Additional contributions of this paper include proofs on the level of approximation, detailed experimental evaluation of the proposed approaches and heuristics, as well as a discussion of techniques for visualizing the Pareto frontier.

AAMAS Conference 2012 Conference Paper

Computing Optimal Strategy against Quantal Response in Security Games

  • Rong Yang
  • Fernando Ordóñez
  • Milind Tambe

To step beyond the first-generation deployments of attacker-defender security games -- for LAX Police, US FAMS and others -- it is critical that we relax the assumption of perfect rationality of the human adversary. Indeed, this assumption is a well-accepted limitation of classical game theory, and modeling human adversaries' bounded rationality is critical. To this end, quantal response (QR) has provided very promising results for modeling human bounded rationality. However, in computing optimal defender strategies in real-world security games against a QR model of attackers, we face difficulties including (1) solving a nonlinear non-convex optimization problem efficiently for massive real-world security games; and (2) addressing constraints on assigning security resources, which adds to the complexity of computing the optimal defender strategy. This paper presents two new algorithms to address these difficulties: GOSAQ can compute the globally optimal defender strategy against a QR model of attackers when there are no resource constraints and gives an efficient heuristic otherwise; PASAQ in turn provides an efficient approximation of the optimal defender strategy with or without resource constraints. These two novel algorithms are based on three key ideas: (i) use of a binary search method to solve the fractional optimization problem efficiently, (ii) construction of a convex optimization problem through a non-linear transformation, (iii) building a piecewise linear approximation of the non-linear terms in the problem. Additional contributions of this paper include proofs of approximation bounds, and detailed experimental results showing the advantages of GOSAQ and PASAQ in solution quality over the benchmark algorithm (BRQR) and the efficiency of PASAQ.
Given these results, PASAQ is at the heart of the PROTECT system, which is deployed for the US Coast Guard in the port of Boston, and is now headed to other ports.
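The binary-search idea in (i) can be sketched generically: a fractional objective max over x of N(x)/D(x) with D(x) > 0 exceeds a threshold t exactly when max over x of N(x) - t*D(x) is non-negative, so bisecting on t reduces the fractional problem to a sequence of simpler maximizations. A toy sketch where a finite candidate set stands in for the inner subproblem (in the papers' setting that subproblem is a linear or piecewise-linear program, not an enumeration):

```python
def maximize_ratio(N, D, xs, iters=60):
    """Bisection sketch for max_x N(x)/D(x) over a finite candidate set xs,
    assuming D(x) > 0 for all candidates."""
    ratios = [N(x) / D(x) for x in xs]
    lo, hi = min(ratios), max(ratios)  # attainable lower and upper bounds
    for _ in range(iters):
        t = (lo + hi) / 2
        # the ratio t is achievable iff some x satisfies N(x) - t*D(x) >= 0
        if max(N(x) - t * D(x) for x in xs) >= 0:
            lo = t
        else:
            hi = t
    return lo
```

For a finite set one could of course take the maximum ratio directly; the point is that the feasibility check `N(x) - t*D(x) >= 0` stays tractable even when `xs` is replaced by a continuous feasible region over which only linear objectives can be maximized.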

AAMAS Conference 2012 Conference Paper

Designing Better Strategies against Human Adversaries in Network Security Games

  • Rong Yang
  • Fei Fang
  • Albert Xin Jiang
  • Karthik Rajagopal
  • Milind Tambe
  • Rajiv Maheswaran

In a Network Security Game (NSG), security agencies must allocate limited resources to protect targets embedded in a network, such as important buildings in a city road network. A recent line of work relaxed the perfect-rationality assumption of the human adversary and showed significant advantages of incorporating bounded-rationality adversary models in non-networked security domains. Given that real-world NSGs are often extremely complex and hence very difficult for humans to solve, it is critical that we address human bounded rationality when designing defender strategies. To that end, the key contributions of this paper include: (i) comprehensive experiments with human subjects using a web-based game that we designed to simulate NSGs; (ii) new behavioral models of the human adversary in NSGs, which we train with the data collected from human experiments; (iii) new algorithms for computing the defender's optimal strategy against the new models.

AAMAS Conference 2012 Conference Paper

Detection of Suspicious Behavior from a Sparse Set of Multiagent Interactions

  • Boštjan Kaluža
  • Gal Kaminka
  • Milind Tambe

In many multiagent domains, no single observation event is sufficient to determine that the behavior of individuals is suspicious. Instead, suspiciousness must be inferred from a combination of multiple events, where events refer to the individual's interactions with other individuals. Hence, a detection system must employ a detector that combines evidence from multiple events, in contrast to most previous work, which focuses on the detection of a single, clearly suspicious event. This paper proposes a two-step detection system, where it first detects trigger events from multiagent interactions, and then combines the evidence to provide a degree of suspicion. The paper provides three key contributions: (i) proposes a novel detector that generalizes utility-based plan recognition with arbitrary utility functions, (ii) specifies conditions that any reasonable detector should satisfy, and (iii) analyzes three detectors and compares them with the proposed approach. The results on a simulated airport domain and a dangerous-driver domain show that our new algorithm outperforms other approaches in several settings.
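One generic way to combine evidence from several trigger events, in the spirit of the detector described above, is to sum log-likelihood ratios across observed events. This is an illustrative evidence-accumulation scheme (the function name and ratio values are assumptions), not the paper's exact utility-based detector:

```python
import math

def suspicion_score(event_likelihood_ratios):
    """Combine evidence from several trigger events into a single degree
    of suspicion by summing log-likelihood ratios. A positive score
    means the observed interactions are, in aggregate, more likely
    under the 'suspicious' hypothesis than under the 'normal' one."""
    return sum(math.log(r) for r in event_likelihood_ratios)

# Three trigger events, each with P(event | suspicious) / P(event | normal):
score = suspicion_score([3.0, 1.5, 2.0])
```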

AAMAS Conference 2012 Conference Paper

Emotional Contagion with Virtual Characters

  • Jason Tsai
  • Emma Bowring
  • Stacy Marsella
  • Milind Tambe

In social psychology, emotional contagion describes the widely observed phenomenon of one person’s emotions mimicking surrounding people’s emotions [8]. While it has been observed in human-human interactions, no known studies have examined its existence in agent-human interactions. As virtual characters make their way into high-risk, high-impact applications such as psychotherapy and military training with increasing frequency, the emotional impact of the agents’ expressions must be accurately understood to avoid undesirable repercussions.

AAMAS Conference 2012 Conference Paper

Game-theoretic Resource Allocation for Malicious Packet Detection in Computer Networks

  • Ondřej Vanĕk
  • Zhengyu Yin
  • Manish Jain
  • Branislav Bošanský
  • Milind Tambe
  • Michal Pĕchouček

We study the problem of optimal resource allocation for packet selection and inspection to detect potential threats in large computer networks with multiple computers of differing importance. An attacker tries to harm these targets by sending malicious packets from multiple entry points of the network; the defender thus needs to optimally allocate her resources to maximize the probability of malicious packet detection under network latency constraints. We formulate the problem as a graph-based security game with multiple resources of heterogeneous capabilities and propose a mathematical program for finding optimal solutions. We also propose \textsc{Grande}, a novel polynomial time algorithm that uses an approximated utility function to circumvent the limited scalability caused by the attacker's large strategy space and the non-linearity of the aforementioned mathematical program. \textsc{Grande} computes solutions with bounded error and scales up to problems of realistic sizes.
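As a toy illustration of the defender's objective above, the probability of catching a malicious packet at least once along its route grows with the per-link sampling rates; the latency constraints and the approximated utility function of GRANDE are not modeled in this sketch:

```python
def detection_probability(inspection_rates):
    """Probability that a malicious packet is inspected at least once
    along its route, given independent per-link packet-sampling rates.
    The defender wants to allocate inspection effort so this quantity
    is high on routes to important targets."""
    p_miss = 1.0
    for p in inspection_rates:
        p_miss *= (1.0 - p)  # packet evades every inspection point
    return 1.0 - p_miss
```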

AAMAS Conference 2012 Conference Paper

Multi-Objective Optimization for Security Games

  • Matthew Brown
  • Bo An
  • Christopher Kiekintveld
  • Fernando Ordóñez
  • Milind Tambe

The burgeoning area of security games has focused on real-world domains where security agencies protect critical infrastructure from a diverse set of adaptive adversaries. There are security domains where the payoffs for preventing the different types of adversaries may take different forms (seized money, reduced crime, saved lives, etc) which are not readily comparable. Thus, it can be difficult to know how to weigh the different payoffs when deciding on a security strategy. To address the challenges of these domains, we propose a fundamentally different solution concept, multi-objective security games (MOSG), which combines security games and multi-objective optimization. Instead of a single optimal solution, MOSGs have a set of Pareto optimal (non-dominated) solutions referred to as the Pareto frontier. The Pareto frontier can be generated by solving a sequence of constrained single-objective optimization problems (CSOP), where one objective is selected to be maximized while lower bounds are specified for the other objectives. Our contributions include: (i) an algorithm, Iterative $\epsilon$-Constraints, for generating the sequence of CSOPs; (ii) an exact approach for solving an MILP formulation of a CSOP (which also applies to multi-objective optimization in more general Stackelberg games); (iii) heuristics that achieve speedup by exploiting the structure of security games to further constrain a CSOP; (iv) an approximate approach for solving an algorithmic formulation of a CSOP, increasing the scalability of our approach with quality guarantees. Additional contributions of this paper include proofs on the level of approximation and detailed experimental evaluation of the proposed approaches.
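The Pareto frontier that the CSOP sequence enumerates is, by definition, the non-dominated subset of achievable payoff vectors. A minimal sketch of that dominance filter over an explicit candidate list (the iterative ε-constraint generation of candidates itself is not reproduced here):

```python
def pareto_frontier(solutions):
    """Return the non-dominated (Pareto optimal) payoff vectors from a
    list of candidates. A vector is dominated if some other candidate
    is at least as good in every objective and is not identical."""
    frontier = []
    for s in solutions:
        dominated = any(
            all(o >= v for o, v in zip(other, s)) and other != s
            for other in solutions
        )
        if not dominated:
            frontier.append(s)
    return frontier

# Two incomparable solutions and one dominated one:
frontier = pareto_frontier([(1, 2), (2, 1), (0, 0)])
```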

AAAI Conference 2012 Conference Paper

Patrol Strategies to Maximize Pristine Forest Area

  • Matthew Johnson
  • Fei Fang
  • Milind Tambe

Illegal extraction of forest resources is fought, in many developing countries, by patrols that try to make this activity less profitable, using the threat of confiscation. With a limited budget, officials will try to distribute the patrols throughout the forest intelligently, in order to most effectively limit extraction. Prior work in forest economics has formalized this as a Stackelberg game, one very different in character from the discrete Stackelberg problem settings previously studied in the multiagent literature. Specifically, the leader wishes to minimize the distance by which a profit-maximizing extractor will trespass into the forest—or to maximize the radius of the remaining “pristine” forest area. The follower’s costbenefit analysis of potential trespass distances is affected by the likelihood of being caught and suffering confiscation. In this paper, we give a near-optimal patrol allocation algorithm and a 1/2-approximation algorithm, the latter of which is more efficient and yields simpler, more practical patrol allocations. Our simulations indicate that these algorithms substantially outperform existing heuristic allocations.

AAMAS Conference 2012 Conference Paper

PROTECT: A Deployed Game Theoretic System to Protect the Ports of the United States

  • Eric Shieh
  • Bo An
  • Rong Yang
  • Milind Tambe
  • Craig Baldwin
  • Joseph DiRenzo
  • Ben Maule
  • Garrett Meyer

While three deployed applications of game theory for security have recently been reported at AAMAS, we as a community remain in the early stages of these deployments; there is a continuing need to understand the core principles for innovative security applications of game theory. Towards that end, this paper presents PROTECT, a game-theoretic system deployed by the United States Coast Guard (USCG) in the port of Boston for scheduling their patrols. USCG has termed the deployment of PROTECT in Boston a success, and efforts are underway to test it in the port of New York, with the potential for nationwide deployment. PROTECT is premised on an attacker-defender Stackelberg game model and offers five key innovations. First, this system is a departure from the assumption of perfect adversary rationality noted in previous work, relying instead on a quantal response (QR) model of the adversary's behavior --- to the best of our knowledge, this is the first real-world deployment of the QR model. Second, to improve PROTECT's efficiency, we generate a compact representation of the defender's strategy space, exploiting equivalence and dominance. Third, we show how to practically model a real maritime patrolling problem as a Stackelberg game. Fourth, our experimental results illustrate that PROTECT's QR model more robustly handles real-world uncertainties than a perfect rationality model. Finally, in evaluating PROTECT, this paper for the first time provides real-world data: (i) comparison of human-generated vs PROTECT security schedules, and (ii) results from an Adversarial Perspective Team's (human mock attackers) analysis.

AAAI Conference 2012 Conference Paper

PROTECT: An Application of Computational Game Theory for the Security of the Ports of the United States

  • Eric Shieh
  • Bo An
  • Rong Yang
  • Milind Tambe
  • Craig Baldwin
  • Joseph DiRenzo
  • Ben Maule
  • Garrett Meyer

Building upon previous security applications of computational game theory, this paper presents PROTECT, a game-theoretic system deployed by the United States Coast Guard (USCG) in the port of Boston for scheduling their patrols. USCG has termed the deployment of PROTECT in Boston a success, and efforts are underway to test it in the port of New York, with the potential for nationwide deployment. PROTECT is premised on an attacker-defender Stackelberg game model and offers five key innovations. First, this system is a departure from the assumption of perfect adversary rationality noted in previous work, relying instead on a quantal response (QR) model of the adversary’s behavior — to the best of our knowledge, this is the first real-world deployment of the QR model. Second, to improve PROTECT’s efficiency, we generate a compact representation of the defender’s strategy space, exploiting equivalence and dominance. Third, we show how to practically model a real maritime patrolling problem as a Stackelberg game. Fourth, our experimental results illustrate that PROTECT’s QR model more robustly handles real-world uncertainties than a perfect rationality model. Finally, in evaluating PROTECT, this paper provides real-world data: (i) comparison of human-generated vs PROTECT security schedules, and (ii) results from an Adversarial Perspective Team’s (human mock attackers) analysis.

AAMAS Conference 2012 Conference Paper

SAVES: A Sustainable Multiagent Application to Conserve Building Energy Considering Occupants

  • Jun-young Kwak
  • Pradeep Varakantham
  • Rajiv Maheswaran
  • Milind Tambe
  • Farrokh Jazizadeh
  • Geoffrey Kavulya
  • Laura Klein
  • Burcin Becerik-Gerber

This paper describes an innovative multiagent system called SAVES with the goal of conserving energy in commercial buildings. We specifically focus on an application to be deployed in an existing university building that provides several key novelties: (i) jointly performed with the university facility management team, SAVES is based on actual occupant preferences and schedules, actual energy consumption and loss data, real sensors and hand-held devices, etc.; (ii) it addresses novel scenarios that require negotiations with groups of building occupants to conserve energy; (iii) it focuses on a non-residential building, where human occupants do not have a direct financial incentive in saving energy and thus requires a different mechanism to effectively motivate occupants; and (iv) SAVES uses a novel algorithm for generating optimal MDP policies that explicitly consider multiple criteria optimization (energy and personal comfort) as well as uncertainty over occupant preferences when negotiating energy reduction - this combination of challenges has not been considered in previous MDP algorithms. In a validated simulation testbed, we show that SAVES substantially reduces the overall energy consumption compared to the existing control method while achieving comparable average satisfaction levels for occupants. As a real-world test, we provide results of a trial study where SAVES is shown to lead occupants to conserve energy in real buildings.

AAAI Conference 2012 Conference Paper

Security Games for Controlling Contagion

  • Jason Tsai
  • Thanh Nguyen
  • Milind Tambe

Many strategic actions carry a ‘contagious’ component beyond the immediate locale of the effort itself. Viral marketing and peacekeeping operations have both been observed to have a spreading effect. In this work, we use counterinsurgency as our illustrative domain. Defined as the effort to block the spread of support for an insurgency, such operations lack the manpower to defend the entire population and must focus on the opinions of a subset of local leaders. As past researchers of security resource allocation have done, we propose using game theory to develop such policies and model the interconnected network of leaders as a graph. Unlike this past work in security games, actions in these domains possess a probabilistic, non-local impact. To address this new class of security games, we combine recent research in influence blocking maximization with a double oracle approach and create novel heuristic oracles to generate mixed strategies for a real-world leadership network from Afghanistan, synthetic leadership networks, and a real social network. We find that leadership networks that exhibit highly interconnected clusters can be solved equally well by our heuristic methods, but our more sophisticated heuristics outperform simpler ones in less interconnected social networks.

AAAI Conference 2012 Conference Paper

Security Games with Limited Surveillance

  • Bo An
  • David Kempe
  • Christopher Kiekintveld
  • Eric Shieh
  • Satinder Singh
  • Milind Tambe
  • Yevgeniy Vorobeychik

Randomized first-mover strategies of Stackelberg games are used in several deployed applications to allocate limited resources for the protection of critical infrastructure. Stackelberg games model the fact that a strategic attacker can surveil and exploit the defender’s strategy, and randomization guards against the worst effects by making the defender less predictable. In accordance with the standard game-theoretic model of Stackelberg games, past work has typically assumed that the attacker has perfect knowledge of the defender’s randomized strategy and will react correspondingly. In light of the fact that surveillance is costly, risky, and delays an attack, this assumption is clearly simplistic: attackers will usually act on partial knowledge of the defender’s strategies. The attacker’s imperfect estimate could present opportunities and possibly also threats to a strategic defender. In this paper, we therefore begin a systematic study of security games with limited surveillance. We propose a natural model wherein an attacker forms or updates a belief based on observed actions, and chooses an optimal response. We investigate the model both theoretically and experimentally. In particular, we give mathematical programs to compute optimal attacker and defender strategies for a fixed observation duration, and show how to use them to estimate the attacker’s observation durations. Our experimental results show that the defender can achieve significant improvement in expected utility by taking the attacker’s limited surveillance into account, validating the motivation of our work.
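The simplest belief an attacker with limited surveillance can form is the empirical frequency of observed defender deployments. The sketch below uses that plain frequency estimate; the paper's actual belief-update model and the optimal-response computation are richer than this:

```python
from collections import Counter

def estimate_coverage(observations, targets):
    """An attacker who has watched only a few defender deployments
    estimates the defender's mixed strategy by empirical frequencies.
    Unobserved targets get probability zero, which is exactly the kind
    of estimation error a strategic defender can anticipate."""
    counts = Counter(observations)
    n = len(observations)
    return {t: counts[t] / n for t in targets}

# Three observed patrols over targets A, B, C:
belief = estimate_coverage(["A", "A", "B"], ["A", "B", "C"])
```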

AAAI Conference 2012 Conference Paper

The Deployment-to-Saturation Ratio in Security Games

  • Manish Jain
  • Kevin Leyton-Brown
  • Milind Tambe

Stackelberg security games form the backbone of systems like ARMOR, IRIS and PROTECT, which are in regular use by the Los Angeles International Airport Police, US Federal Air Marshal Service and the US Coast Guard respectively. An understanding of the runtime required by algorithms that power such systems is critical to furthering the application of game theory to other real-world domains. This paper identifies the concept of the deployment-to-saturation ratio in random Stackelberg security games, and shows that problem instances for which this ratio is 0.5 are computationally harder than instances with other deployment-to-saturation ratios for a wide range of different equilibrium computation methods, including (i) different previously published MIP algorithms, and (ii) different underlying solvers and solution mechanisms. This finding has at least two important implications. First, it is important for new algorithms to be evaluated on the hardest problem instances. We show that this has often not been done in the past, and introduce a publicly available benchmark suite to facilitate such comparisons. Second, we provide evidence that this computationally hard region is also one where optimization would be of most benefit to security agencies, and thus requires significant attention from researchers in this area. Furthermore, we use the concept of phase transitions to better understand this computationally hard region. We define a decision problem related to security games, and show that the probability that this problem has a solution exhibits a phase transition as the deployment-to-saturation ratio crosses 0.5. We also demonstrate that this phase transition is invariant to changes both in the domain and the domain representation, and that the phase transition point corresponds to the computationally hardest instances.

AAMAS Conference 2011 Conference Paper

A Double Oracle Algorithm for Zero-Sum Security Games on Graphs

  • Manish Jain
  • Dmytro Korzhyk
  • Ondřej Vanĕk
  • Vincent Conitzer
  • Michal Pĕchouček
  • Milind Tambe

In response to the Mumbai attacks of 2008, the Mumbai police have started to schedule a limited number of inspection checkpoints on the road network throughout the city. Algorithms for similar security-related scheduling problems have been proposed in recent literature, but security scheduling in networked domains when targets have varying importance remains an open problem at large. In this paper, we cast the network security problem as an attacker-defender zero-sum game. The strategy spaces for both players are exponentially large, so this requires the development of novel, scalable techniques. We first show that existing algorithms for approximate solutions can be arbitrarily bad in general settings. We present RUGGED (Randomization in Urban Graphs by Generating strategies for Enemy and Defender), the first scalable optimal solution technique for such network security games. Our technique is based on a double oracle approach and thus does not require the enumeration of the entire strategy space for either of the players. It scales up to realistic problem sizes, as is shown by our evaluation of maps of southern Mumbai obtained from GIS data.
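The double oracle loop can be sketched on a tiny zero-sum matrix game: restricted strategy sets grow only when a best-response oracle, searching the full strategy space, finds an improving strategy. In this illustrative sketch the restricted game is solved only approximately, by fictitious play, rather than by the LP-based method a real implementation such as RUGGED would use; all names and parameters are assumptions:

```python
def best_response(payoff, opponent_dist, strategies, maximize=True):
    """Pure best response in a zero-sum matrix game (payoff[row][col]
    is the row player's utility); the column player minimizes it."""
    def value(s):
        if maximize:
            return sum(p * payoff[s][c] for c, p in opponent_dist.items())
        return sum(p * payoff[r][s] for r, p in opponent_dist.items())
    return (max if maximize else min)(strategies, key=value)

def double_oracle(payoff, rows, cols, iters=200):
    """Double-oracle skeleton: keep small restricted strategy sets,
    solve the restricted game (approximately, via fictitious play),
    then ask oracles over the FULL strategy spaces for improving
    strategies; stop when neither oracle finds one."""
    R, C = {rows[0]}, {cols[0]}          # restricted strategy sets
    while True:
        # Approximately solve the restricted game by fictitious play.
        row_counts = {r: 0 for r in R}
        col_counts = {c: 0 for c in C}
        r_cur, c_cur = next(iter(R)), next(iter(C))
        for _ in range(iters):
            row_counts[r_cur] += 1
            col_counts[c_cur] += 1
            n = sum(row_counts.values())
            r_dist = {r: k / n for r, k in row_counts.items()}
            c_dist = {c: k / n for c, k in col_counts.items()}
            r_cur = best_response(payoff, c_dist, sorted(R), maximize=True)
            c_cur = best_response(payoff, r_dist, sorted(C), maximize=False)
        # Oracle step: best responses over the full strategy spaces.
        new_r = best_response(payoff, c_dist, rows, maximize=True)
        new_c = best_response(payoff, r_dist, cols, maximize=False)
        if new_r in R and new_c in C:    # no improving strategy exists
            return r_dist, c_dist
        R.add(new_r)
        C.add(new_c)

# Matching pennies: the unique equilibrium mixes roughly 50/50 on each side.
payoff = {"H": {"H": 1, "T": -1}, "T": {"H": -1, "T": 1}}
r_dist, c_dist = double_oracle(payoff, ["H", "T"], ["H", "T"])
```

The payoff of this structure is that, in large games, only a small fraction of the exponentially many strategies ever enters the restricted sets.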

AAMAS Conference 2011 Conference Paper

Approximation Methods for Infinite Bayesian Stackelberg Games: Modeling Distributional Payoff Uncertainty

  • Christopher Kiekintveld
  • Janusz Marecki
  • Milind Tambe

Game theory is fast becoming a vital tool for reasoning about complex real-world security problems, including critical infrastructure protection. The game models for these applications are constructed using expert analysis and historical data to estimate the values of key parameters, including the preferences and capabilities of terrorists. In many cases, it would be natural to represent uncertainty over these parameters using continuous distributions (such as uniform intervals or Gaussians). However, existing solution algorithms are limited to considering a small, finite number of possible attacker types with different payoffs. We introduce a general model of infinite Bayesian Stackelberg security games that allows payoffs to be represented using continuous payoff distributions. We then develop several techniques for finding approximate solutions for this class of games, and show empirically that our methods offer dramatic improvements over the current state of the art, providing new ways to improve the robustness of security game models.
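One natural bridge from continuous payoff distributions to existing finite-type solvers, consistent with the approximation theme of this paper, is to sample a finite set of attacker types from the distribution; the Gaussian parameters and function name below are illustrative assumptions:

```python
import random

def sample_attacker_types(mean_payoffs, stddev, n_samples, seed=0):
    """Approximate a continuous (here Gaussian) attacker payoff
    distribution by a finite set of sampled types, so a finite-type
    Bayesian Stackelberg solver can be applied to the sampled game.
    Each sampled type is one payoff value per target."""
    rng = random.Random(seed)  # seeded for reproducibility
    return [[rng.gauss(m, stddev) for m in mean_payoffs]
            for _ in range(n_samples)]

# Four sampled attacker types over two targets:
samples = sample_attacker_types([5.0, 2.0], stddev=1.0, n_samples=4)
```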

IJCAI Conference 2011 Conference Paper

Continuous Time Planning for Multiagent Teams with Temporal Constraints

  • Zhengyu Yin
  • Milind Tambe

Continuous state DEC-MDPs are critical for agent teams in domains involving resources such as time, but scaling them up is a significant challenge. To meet this challenge, we first introduce a novel continuous-time DEC-MDP model that exploits transition independence in domains with temporal constraints. More importantly, we present a new locally optimal algorithm called SPAC. Compared to the best previous algorithm, SPAC finds solutions of comparable quality substantially faster; SPAC also scales to larger teams of agents.

AAMAS Conference 2011 Conference Paper

ESCAPES - Evacuation Simulation with Children, Authorities, Parents, Emotions, and Social comparison

  • Jason Tsai
  • Natalie Fridman
  • Emma Bowring
  • Matthew Brown
  • Shira Epstein
  • Gal A. Kaminka
  • Stacy Marsella
  • Andrew Ogden

In creating an evacuation simulation for training and planning, realistic agents that reproduce known phenomena are required. Evacuation simulation in the airport domain requires additional features beyond most simulations, including the unique behaviors of first-time visitors who have incomplete knowledge of the area and families that do not necessarily adhere to often-assumed pedestrian behaviors. Evacuation simulations not customized for the airport domain do not incorporate the factors important to it, leading to inaccuracies when applied to it. In this paper, we describe ESCAPES, a multiagent evacuation simulation tool that incorporates four key features: (i) different agent types; (ii) emotional interactions; (iii) informational interactions; (iv) behavioral interactions. Our simulator reproduces phenomena observed in existing studies on evacuation scenarios and the features we incorporate substantially impact escape time. We use ESCAPES to model the International Terminal at Los Angeles International Airport (LAX) and receive high praise from security officials.

EUMAS Conference 2011 Conference Paper

Game Theory for Security: An Important Challenge for Multiagent Systems

  • Bo An 0001
  • Milind Tambe

The goal of this paper is to introduce a real-world challenge problem for researchers in multiagent systems and beyond, where our collective efforts may have a significant impact on activities in the real-world. The challenge is in applying game theory for security: Our goal is not only to introduce the problem, but also to provide exemplars of initial successes of deployed systems in this challenge problem arena, some key open research challenges and pointers to getting started in this research.

AAMAS Conference 2011 Conference Paper

GUARDS - Game Theoretic Security Allocation on a National Scale

  • James Pita
  • Milind Tambe
  • Christopher Kiekintveld
  • Shane Cullen
  • Erin Steigerwald

Building on research previously reported at AAMAS conferences, this paper describes an innovative application of a novel game-theoretic approach for a national scale security deployment. Working with the United States Transportation Security Administration (TSA), we have developed a new application called GUARDS to assist in resource allocation tasks for airport protection at over 400 United States airports. In contrast with previous efforts such as ARMOR and IRIS, which focused on one-off tailored applications and one security activity (e.g., canine patrol or checkpoints) per application, GUARDS faces three key issues: (i) reasoning about hundreds of heterogeneous security activities; (ii) reasoning over diverse potential threats; (iii) developing a system designed for hundreds of end-users. Since a national deployment precludes tailoring to specific airports, our key ideas are: (i) creating a new game-theoretic framework that allows for heterogeneous defender activities and compact modeling of a large number of threats; (ii) developing an efficient solution technique based on general purpose Stackelberg game solvers; (iii) taking a partially centralized approach for knowledge acquisition and development of the system. In doing so we develop a software scheduling assistant, GUARDS, designed to reason over two agents - the TSA and a potential adversary - and allocate the TSA's limited resources across hundreds of security activities in order to provide protection within airports. The scheduling assistant has been delivered to the TSA and is currently under evaluation and testing for scheduling practices at an undisclosed airport. If successful, the TSA intends to incorporate the system into their unpredictable scheduling practices nationwide. In this paper we discuss the design choices and challenges encountered during the implementation of GUARDS. GUARDS represents promising potential for transitioning years of academic research into a nationally deployed system.

IJCAI Conference 2011 Conference Paper

GUARDS - Innovative Application of Game Theory for National Airport Security

  • James Pita
  • Milind Tambe
  • Christopher Kiekintveld
  • Shane Cullen
  • Erin Steigerwald

We describe an innovative application of a novel game-theoretic approach for a \textit{national scale} security deployment. Working with the United States Transportation Security Administration (TSA), we have developed a new application called GUARDS to allocate the TSA's limited resources across hundreds of security activities to provide protection at over 400 United States airports. Similar security applications (e.g., ARMOR and IRIS) have focused on one-off tailored applications and one security activity (e.g., checkpoints) per application; GUARDS, on the other hand, faces three new key issues: (i) reasoning about hundreds of heterogeneous security activities; (ii) reasoning over diverse potential threats; (iii) developing a system designed for hundreds of end-users. Since a national deployment precludes tailoring to specific airports, our key ideas are: (i) creating a new game-theoretic framework that allows for heterogeneous defender activities and compact modeling of a large number of threats; (ii) developing an efficient solution technique based on general purpose Stackelberg game solvers; (iii) taking a partially centralized approach for knowledge acquisition. The scheduling assistant has been delivered to the TSA and is currently undergoing evaluation for scheduling practices at an undisclosed airport. If successful, the TSA intends to incorporate the system into their unpredictable scheduling practices nationwide.

AAMAS Conference 2011 Conference Paper

Improved Computational Models of Human Behavior in Security Games

  • Rong Yang
  • Christopher Kiekintveld
  • Fernando Ordonez
  • Milind Tambe
  • Richard John

It becomes critical to address human adversaries' bounded rationality in security games as the real-world deployment of such games spreads. To that end, the key contributions of this paper include: (i) new efficient algorithms for computing optimal strategic solutions using Prospect Theory and Quantal Response Equilibrium; (ii) the most comprehensive experiment to date studying the effectiveness of different models against human subjects for security games. Our new techniques outperform the leading contender for modeling human behavior in security games in experiments with human subjects.

IJCAI Conference 2011 Conference Paper

Improving Resource Allocation Strategy against Human Adversaries in Security Games

  • Rong Yang
  • Christopher Kiekintveld
  • Fernando Ordonez
  • Milind Tambe
  • Richard John

Recent real-world deployments of Stackelberg security games make it critical that we address human adversaries' bounded rationality in computing optimal strategies. To that end, this paper provides three key contributions: (i) new efficient algorithms for computing optimal strategic solutions using Prospect Theory and Quantal Response Equilibrium; (ii) the most comprehensive experiment to date studying the effectiveness of different models against human subjects for security games; and (iii) new techniques for generating representative payoff structures for behavioral experiments in generic classes of games. Our results with human subjects show that our new techniques outperform the leading contender for modeling human behavior in security games.

AAMAS Conference 2011 Conference Paper

Quality Guarantees for Region Optimal DCOP Algorithms

  • Meritxell Vinyals
  • Eric Shieh
  • Jesus Cerquides
  • Juan Antonio Rodriguez-Aguilar
  • Zhengyu Yin
  • Milind Tambe
  • Emma Bowring

k- and t-optimality algorithms provide solutions to DCOPs that are optimal in regions characterized by their size and distance, respectively. Moreover, they provide quality guarantees on their solutions. Here we generalise the k- and t-optimal framework to introduce C-optimality, a flexible framework that provides reward-independent quality guarantees for optima in regions characterised by any arbitrary criterion. Therefore, C-optimality allows us to explore the space of criteria (beyond size and distance) looking for those that lead to better solution qualities. We benefit from this larger space of criteria to propose a new criterion, the so-called size-bounded-distance criterion, which outperforms k- and t-optimality.

AAMAS Conference 2011 Conference Paper

Quality-bounded Solutions for Finite Bayesian Stackelberg Games: Scaling up

  • Manish Jain
  • Christopher Kiekintveld
  • Milind Tambe

The fastest known algorithm for solving general Bayesian Stackelberg games with a finite set of follower (adversary) types has seen direct practical use at the LAX airport for over 3 years; and currently, an (albeit non-Bayesian) algorithm for solving these games is also being used for scheduling air marshals on limited sectors of international flights by the US Federal Air Marshals Service. These algorithms find optimal randomized security schedules to allocate limited security resources to protect targets. As we scale up to larger domains, including the full set of flights covered by the Federal Air Marshals, it is critical to develop newer algorithms that scale up significantly beyond the limits of the current state-of-the-art of Bayesian Stackelberg solvers. In this paper, we present a novel technique based on a hierarchical decomposition and branch and bound search over the follower type space, which may be applied to different Stackelberg game solvers. We have applied this technique to different solvers, resulting in: (i) A new exact algorithm called HBGS that is orders of magnitude faster than the best known previous Bayesian solver for general Stackelberg games; (ii) A new exact algorithm called HBSA which extends the fastest known previous security game solver towards the Bayesian case; and (iii) Approximation versions of HBGS and HBSA that show significant improvements over these newer algorithms with only a 12% sacrifice in the practical solution quality.

AAAI Conference 2011 Conference Paper

Refinement of Strong Stackelberg Equilibria in Security Games

  • Bo An
  • Milind Tambe
  • Fernando Ordonez
  • Eric Shieh
  • Christopher Kiekintveld

Given the real-world deployments of attacker-defender Stackelberg security games, robustness to deviations from expected attacker behaviors has now emerged as a critically important issue. This paper provides four key contributions in this context. First, it identifies a fundamentally problematic aspect of current algorithms for security games. It shows that there are many situations where these algorithms face multiple equilibria, and they arbitrarily select one that may hand the defender a significant disadvantage, particularly if the attacker deviates from its equilibrium strategies due to unknown constraints. Second, for important subclasses of security games, it identifies situations where we will face such multiple equilibria. Third, to address these problematic situations, it presents two equilibrium refinement algorithms that can optimize the defender’s utility if the attacker deviates from equilibrium strategies. Finally, it experimentally illustrates that the refinement approach achieved significant robustness in consideration of attackers’ deviation due to unknown constraints.

AAAI Conference 2011 Conference Paper

Risk-Averse Strategies for Security Games with Execution and Observational Uncertainty

  • Zhengyu Yin
  • Manish Jain
  • Milind Tambe
  • Fernando Ordóñez

Attacker-defender Stackelberg games have become a popular game-theoretic approach for security, with deployments for the LAX Police, the FAMS and the TSA. Unfortunately, most of the existing solution approaches do not model two key uncertainties of the real world: there may be noise in the defender's execution of the suggested mixed strategy and/or the observations made by an attacker can be noisy. In this paper, we provide a framework to model these uncertainties, and demonstrate that previous strategies perform poorly in such uncertain settings. We also provide RECON, a novel algorithm that computes strategies for the defender that are robust to such uncertainties, and provide heuristics that further improve RECON's efficiency.

AAMAS Conference 2011 Conference Paper

Teamwork in Distributed POMDPs: Execution-time Coordination Under Model Uncertainty

  • Jun-young Kwak
  • Rong Yang
  • Zhengyu Yin
  • Matthew E. Taylor
  • Milind Tambe

Despite their worst-case NEXP-complete planning complexity, DEC-POMDPs remain a popular framework for multiagent teamwork. This paper introduces effective teamwork under model uncertainty (i.e., potentially inaccurate transition and observation functions) as a novel challenge for DEC-POMDPs and presents MODERN, the first execution-centric framework for DEC-POMDPs explicitly motivated by addressing such model uncertainty. MODERN's shift of coordination reasoning from planning-time to execution-time avoids the high cost of computing optimal plans whose promised quality may not be realized in practice. There are three key ideas in MODERN: (i) it maintains an exponentially smaller model of other agents' beliefs and actions than in previous work and then further reduces the computation-time and space expense of this model via bounded pruning; (ii) it reduces execution-time computation by exploiting BDI theories of teamwork, and limits communication to key trigger points; and (iii) it limits its decision-theoretic reasoning about communication to trigger points and uses a systematic markup to encourage extra communication at these points, thus reducing uncertainty among team members at trigger points.

AAMAS Conference 2010 Conference Paper

Asynchronous Algorithms for Approximate Distributed Constraint Optimization with Quality Bounds

  • Christopher Kiekintveld
  • Zhengyu Yin
  • Atul Kumar
  • Milind Tambe

Distributed Constraint Optimization (DCOP) is a popular framework for cooperative multi-agent decision making. DCOP is NP-hard, so an important line of work focuses on developing fast incomplete solution algorithms for large-scale applications. One of the few incomplete algorithms to provide bounds on solution quality is k-size optimality, which defines a local optimality criterion based on the size of the group of deviating agents. Unfortunately, the lack of a general-purpose algorithm and the commitment to forming groups based solely on group size has limited the use of k-size optimality. This paper introduces t-distance optimality, which departs from k-size optimality by using graph distance as an alternative criterion for selecting groups of deviating agents. This throws open a new research direction into the tradeoffs between different group selection and coordination mechanisms for incomplete DCOP algorithms. We derive theoretical quality bounds for t-distance optimality that improve known bounds for k-size optimality. In addition, we develop a new efficient asynchronous local search algorithm for finding both k-size and t-distance optimal solutions, allowing these concepts to be deployed in real applications. Indeed, empirical results show that this algorithm significantly outperforms the only existing algorithm for finding general k-size optimal solutions, which is also synchronous. Finally, we compare the algorithmic performance of k-size and t-distance optimality using this algorithm. We find that t-distance consistently converges to higher-quality solutions in the long run, but results are mixed on convergence speed; we identify cases where each converges faster.

AAMAS Conference 2010 Conference Paper

Robust Bayesian Methods for Stackelberg Security Games

  • Christopher Kiekintveld
  • Janusz Marecki
  • Milind Tambe

Recent work has applied game-theoretic models to real-world security problems at the Los Angeles International Airport (LAX) and Federal Air Marshals Service (FAMS). The analysis of these domains is based on input from domain experts intended to capture the best available intelligence information about potential terrorist activities and possible security countermeasures. Nevertheless, these models are subject to significant uncertainty, especially in security domains where intelligence about adversary capabilities and preferences is very difficult to gather. This uncertainty presents significant challenges for applying game-theoretic analysis in these domains. Our experimental results show that standard solution methods based on perfect-information assumptions are very sensitive to payoff uncertainty, resulting in low payoffs for the defender. We describe a model of Bayesian Stackelberg games that allows for general distributional uncertainty over the attacker's payoffs. We conduct an experimental analysis of two algorithms for approximating equilibria of these games, and show that the resulting solutions give much better results than the standard approach when there is payoff uncertainty.

AAAI Conference 2010 Conference Paper

Security Games with Arbitrary Schedules: A Branch and Price Approach

  • Manish Jain
  • Erim Kardes
  • Christopher Kiekintveld
  • Fernando Ordonez
  • Milind Tambe

Security games, an important class of Stackelberg games, are used in deployed decision-support tools by the LAX police and the Federal Air Marshals Service. The algorithms used to solve these games find optimal randomized schedules to allocate security resources for infrastructure protection. Unfortunately, state-of-the-art algorithms either fail to scale or fail to provide a correct solution for large problems with arbitrary scheduling constraints. We introduce ASPEN, a branch-and-price approach that overcomes these limitations based on two key contributions: (i) A column-generation approach that exploits a novel network flow representation, avoiding a combinatorial explosion of schedule allocations; (ii) A branch-and-bound algorithm that generates bounds via a fast algorithm for solving security games with relaxed scheduling constraints. ASPEN is the first known method for efficiently solving massive security games with arbitrary schedules.

AAMAS Conference 2010 Conference Paper

Stackelberg vs. Nash in Security Games: Interchangeability, Equivalence, and Uniqueness

  • Zhengyu Yin
  • Dmytro Korzhyk
  • Christopher Kiekintveld
  • Vincent Conitzer
  • Milind Tambe

There has been significant recent interest in game-theoretic approaches to security, with much of the recent research focused on utilizing the leader-follower Stackelberg game model; for example, these games are at the heart of major applications such as the ARMOR program deployed for security at the LAX airport since 2007 and the IRIS program in use by the US Federal Air Marshals (FAMS). The foundational assumption for using Stackelberg games is that security forces (leaders), acting first, commit to a randomized strategy, while their adversaries (followers) choose their best response after surveillance of this randomized strategy. Yet, in many situations, the followers may act without observation of the leader's strategy, essentially converting the game into a simultaneous-move game model. Previous work fails to address how a leader should compute her strategy given this fundamental uncertainty about the type of game faced. Focusing on the complex games that are directly inspired by real-world security applications, the paper provides four contributions in the context of a general class of security games. First, exploiting the structure of these security games, the paper shows that the Nash equilibria in security games are interchangeable, thus alleviating the equilibrium selection problem. Second, resolving the leader's dilemma, it shows that under a natural restriction on security games, any Stackelberg strategy is also a Nash equilibrium strategy; furthermore, the solution is unique in a class of real-world security games of which ARMOR is a key exemplar. Third, when faced with a follower that can attack multiple targets, many of these properties no longer hold. Fourth, our experimental results emphasize positive properties of games that do not fit our restrictions. Our contributions have major implications for the real-world applications.

AAAI Conference 2010 Conference Paper

Urban Security: Game-Theoretic Resource Allocation in Networked Domains

  • Jason Tsai
  • Zhengyu Yin
  • Jun-young Kwak
  • David Kempe
  • Christopher Kiekintveld
  • Milind Tambe

Law enforcement agencies frequently must allocate limited resources to protect targets embedded in a network, such as important buildings in a city road network. Since intelligent attackers may observe and exploit patterns in the allocation, it is crucial that the allocations be randomized. We cast this problem as an attacker-defender Stackelberg game: the defender’s goal is to obtain an optimal mixed strategy for allocating resources. The defender’s strategy space is exponential in the number of resources, and the attacker’s exponential in the network size. Existing algorithms are therefore useless for all but the smallest networks. We present a solution approach based on two key ideas: (i) A polynomial-sized game model obtained via an approximation of the strategy space, solved efficiently using a linear program; (ii) Two efficient techniques that map solutions from the approximate game to the original, with proofs of correctness under certain assumptions. We present in-depth experimental results, including an evaluation on part of the Mumbai road network.

AAMAS Conference 2010 Conference Paper

When Should There be a "Me" in "Team"? Distributed Multi-Agent Optimization Under Uncertainty

  • Matthew Taylor
  • Manish Jain
  • Yanqin Jin
  • Makoto Yokoo
  • Milind Tambe

Increasing teamwork between agents typically increases the performance of a multi-agent system, at the cost of increased communication and higher computational complexity. This work examines joint actions in the context of a multi-agent optimization problem where agents must cooperate to balance exploration and exploitation. Surprisingly, results show that increased teamwork can hurt agent performance, even when communication and computation costs are ignored; we term this the team uncertainty penalty. This paper introduces this phenomenon, analyzes it, and presents algorithms to reduce the effect of the penalty in our problem setting.

IJCAI Conference 2009 Conference Paper

  • Manish Jain
  • Matthew Taylor
  • Milind Tambe
  • Makoto Yokoo

Buoyed by recent successes in the area of distributed constraint optimization problems (DCOPs), this paper addresses challenges faced when applying DCOPs to real-world domains. Three fundamental challenges must be addressed for a class of real-world domains, requiring novel DCOP algorithms. First, agents may not know the payoff matrix and must explore the environment to determine rewards associated with variable settings. Second, agents may need to maximize total accumulated reward rather than instantaneous final reward. Third, limited time horizons disallow exhaustive exploration of the environment. We propose and implement a set of novel algorithms that combine decision-theoretic exploration approaches with DCOP-mandated coordination. In addition to simulation results, we implement these algorithms on robots, deploying DCOPs on a distributed mobile sensor network.

AAMAS Conference 2009 Conference Paper

Computing Optimal Randomized Resource Allocations for Massive Security Games

  • Christopher Kiekintveld
  • Manish Jain
  • Jason Tsai
  • James Pita
  • Fernando Ordóñez
  • Milind Tambe

Predictable allocations of security resources such as police officers, canine units, or checkpoints are vulnerable to exploitation by attackers. Recent work has applied game-theoretic methods to find optimal randomized security policies, including a fielded application at the Los Angeles International Airport (LAX). This approach has promising applications in many similar domains, including police patrolling for subway and bus systems, randomized baggage screening, and scheduling for the Federal Air Marshal Service (FAMS) on commercial flights. However, the existing methods scale poorly when the security policy requires coordination of many resources, which is central to many of these potential applications. We develop new models and algorithms that scale to much more complex instances of security games. The key idea is to use a compact model of security games, which allows exponential improvements in both memory and runtime relative to the best known algorithms for solving general Stackelberg games. We develop even faster algorithms for security games under payoff restrictions that are natural in many security domains. Finally, we introduce additional realistic scheduling constraints while retaining comparable performance improvements. The empirical evaluation comprises both random data and realistic instances of the FAMS and LAX problems. Our new methods scale to problems several orders of magnitude larger than the fastest known algorithm.
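The compact model mentioned in this abstract can be illustrated with per-target marginal coverage probabilities: the attacker best-responds to coverage levels rather than to an exponential set of joint resource allocations. A minimal sketch, assuming a hypothetical three-target instance (the payoff numbers are invented for illustration, not taken from the paper):

```python
def attacker_best_target(coverage, attacker_payoffs):
    """Best-responding attacker under a compact (marginal-coverage) model.

    coverage[t] is the probability that target t is defended;
    attacker_payoffs[t] = (payoff if t is uncovered, payoff if covered).
    """
    def expected_utility(t):
        c = coverage[t]
        uncovered, covered = attacker_payoffs[t]
        return (1 - c) * uncovered + c * covered
    return max(range(len(coverage)), key=expected_utility)

# Hypothetical instance: heavy coverage on the most attractive target
# pushes a rational attacker toward a lower-value one.
coverage = [0.8, 0.3, 0.1]
attacker_payoffs = [(10, -5), (6, -5), (4, -5)]
```

With these numbers the attacker's expected utilities are -2, 2.7, and 3.1, so the best response is the least valuable but least covered target; this deflection effect is what the defender's randomized allocation is optimized to exploit.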

AAMAS Conference 2009 Conference Paper

Effective Solutions for Real-World Stackelberg Games: When Agents Must Deal with Human Uncertainties

  • James Pita
  • Manish Jain
  • Fernando Ordóñez
  • Milind Tambe
  • Sarit Kraus
  • Reuma Magori-Cohen

How do we build multiagent algorithms for agent interactions with human adversaries? Stackelberg games are natural models for many important applications that involve human interaction, such as oligopolistic markets and security domains. In Stackelberg games, one player, the leader, commits to a strategy and the follower makes their decision with knowledge of the leader's commitment. Existing algorithms for Stackelberg games efficiently find optimal solutions (leader strategy), but they critically assume that the follower plays optimally. Unfortunately, in real-world applications, agents face human followers (adversaries) who, because of their bounded rationality and limited observation of the leader strategy, may deviate from their expected optimal response. Not taking into account these likely deviations when dealing with human adversaries can cause an unacceptable degradation in the leader's reward, particularly in security applications where these algorithms have seen real-world deployment. To address this crucial problem, this paper introduces three new mixed-integer linear programs (MILPs) for Stackelberg games to consider human adversaries, incorporating: (i) novel anchoring theories on human perception of probability distributions and (ii) robustness approaches for MILPs to address human imprecision. Since these new approaches consider human adversaries, traditional proofs of correctness or optimality are insufficient; instead, it is necessary to rely on empirical validation. To that end, this paper considers two settings based on real deployed security systems, and compares six different approaches (three new and three previous), in four different observability conditions, involving 98 human subjects playing 1360 games in total. The final conclusion was that a model incorporating both robustness and anchoring achieves statistically significantly better rewards while maintaining equivalent or faster solution speeds compared to existing approaches.

ICAPS Conference 2009 Conference Paper

Exploiting Coordination Locales in Distributed POMDPs via Social Model Shaping

  • Pradeep Varakantham
  • Jun-young Kwak
  • Matthew E. Taylor
  • Janusz Marecki
  • Paul Scerri
  • Milind Tambe

Distributed POMDPs provide an expressive framework for modeling multiagent collaboration problems, but NEXP-Complete complexity hinders their scalability and application in real-world domains. This paper introduces a subclass of distributed POMDPs, and TREMOR, an algorithm to solve such distributed POMDPs. The primary novelty of TREMOR is that agents plan individually with a single agent POMDP solver and use social model shaping to implicitly coordinate with other agents. Experiments demonstrate that TREMOR can provide solutions orders of magnitude faster than existing algorithms while achieving comparable, or even superior, solution quality.

AAMAS Conference 2009 Conference Paper

Planning with Continuous Resources for Agent Teams

  • Janusz Marecki
  • Milind Tambe

Many problems of multiagent planning under uncertainty require distributed reasoning with continuous resources and resource limits. Decentralized Markov Decision Problems (Dec-MDPs) are well-suited to address such problems, but unfortunately, prior Dec-MDP approaches either discretize resources at the expense of speed and quality guarantees, or avoid discretization only by limiting agents' action choices or interactions (e.g., the assumption of transition independence). To address these shortcomings, this paper proposes M-DPFP, a novel algorithm for planning with continuous resources for agent teams, with three key features: (i) it maintains the agent team interaction graph to identify and prune the suboptimal policies and to allow the agents to be transition dependent, (ii) it operates in a continuous space of probability functions to provide the error bound on the solution quality and finally (iii) it focuses the search for policies on the most relevant parts of this search space to allow for a systematic trade-off of solution quality for speed. Our experiments show that M-DPFP finds high quality solutions and exhibits superior performance when compared with a discretization-based approach. We also show that M-DPFP is applicable to solving problems that are beyond the scope of existing approaches.

AAMAS Conference 2009 Conference Paper

Sensitivity Analysis for Distributed Optimization with Resource Constraints

  • Emma Bowring
  • Zhengyu Yin
  • Rob Zinkov
  • Milind Tambe

Previous work in multiagent coordination has addressed the challenge of planning in domains where agents must optimize a global goal while satisfying local resource constraints. However, the imposition of resource constraints naturally raises the question of whether the agents could significantly improve their team performance if a few more resources were made available. Sensitivity analysis aims to answer that question. This paper focuses on sensitivity analysis in the context of the distributed coordination framework Multiply-Constrained DCOP (MC-DCOP). There are three main challenges in performing sensitivity analysis: (i) to perform it in a distributed fashion, (ii) to avoid re-solving an NP-hard MC-DCOP optimization from scratch, and (iii) to avoid considering unproductive uses for extra resources. To meet these challenges, this paper presents three types of locally optimal algorithms: link analysis, local reoptimization and local constraint propagation. These algorithms are distributed and avoid redundant computation by ascertaining just the effects of local perturbations on the original problem. Deploying our algorithms on a large number of MC-DCOP problems revealed several results. While our cheapest algorithm successfully identified quality improvements for a few problems, our more complex techniques were necessary to identify the best uses for additional resources. Furthermore, we identified two heuristics that can help identify a priori which agents might benefit most from additional resources: density rank, which works well when nodes received identical resources, and remaining resource rank, which works well when nodes received resources based on the number of neighbors they had.

AAAI Conference 2008 System Paper

ARMOR Security for Los Angeles International Airport

  • James Pita
  • Fernando Ordóñez
  • Milind Tambe
  • Praveen Paruchuri

Security at major locations of economic or political importance is a key concern around the world, particularly given the threat of terrorism. Limited security resources prevent full security coverage at all times, which allows adversaries to observe and exploit patterns in selective patrolling or monitoring, e.g., they can plan an attack avoiding existing patrols. Hence, randomized patrolling or monitoring is important, but randomization must provide distinct weights to different actions based on their complex costs and benefits. To this end, this demonstration showcases a promising transition of the latest in multi-agent algorithms into a deployed application. In particular, it exhibits a software assistant agent called ARMOR (Assistant for Randomized Monitoring over Routes) that casts this patrolling/monitoring problem as a Bayesian Stackelberg game, allowing the agent to appropriately weigh the different actions in randomization, as well as uncertainty over adversary types. ARMOR combines two key features: (i) It uses the fastest known solver for Bayesian Stackelberg games called DOBSS, where the dominant mixed strategies enable randomization; (ii) Its mixed-initiative based interface allows users to occasionally adjust or override the automated schedule based on their local constraints. ARMOR has been successfully deployed since August 2007 at the Los Angeles International Airport (LAX) to randomize checkpoints on the roadways entering the airport and canine patrol routes within the airport terminals.

AAMAS Conference 2008 Conference Paper

Deployed ARMOR Protection: The Application of a Game Theoretic Model for Security at the Los Angeles International Airport

  • James Pita
  • Manish Jain
  • Janusz Marecki
  • Fernando Ordóñez
  • Christopher Portway
  • Milind Tambe

Security at major locations of economic or political importance is a key concern around the world, particularly given the threat of terrorism. Limited security resources prevent full security coverage at all times, which allows adversaries to observe and exploit patterns in selective patrolling or monitoring, e.g., they can plan an attack avoiding existing patrols. Hence, randomized patrolling or monitoring is important, but randomization must provide distinct weights to different actions based on their complex costs and benefits. To this end, this paper describes a promising transition of the latest in multi-agent algorithms – in fact, an algorithm that represents a culmination of research presented at AAMAS – into a deployed application. In particular, it describes a software assistant agent called ARMOR (Assistant for Randomized Monitoring over Routes) that casts this patrolling/monitoring problem as a Bayesian Stackelberg game, allowing the agent to appropriately weigh the different actions in randomization, as well as uncertainty over adversary types. ARMOR combines three key features: (i) It uses the fastest known solver for Bayesian Stackelberg games called DOBSS, where the dominant mixed strategies enable randomization; (ii) Its mixed-initiative based interface allows users to occasionally adjust or override the automated schedule based on their local constraints; (iii) It alerts the users if mixed-initiative overrides appear to degrade the overall desired randomization. ARMOR has been successfully deployed since August 2007 at the Los Angeles International Airport (LAX) to randomize checkpoints on the roadways entering the airport and canine patrol routes within the airport terminals. This paper examines the information, design choices, challenges, and evaluation that went into designing ARMOR.

AAMAS Conference 2008 Conference Paper

Not All Agents Are Equal: Scaling up Distributed POMDPs for Agent Networks

  • Janusz Marecki
  • Tapana Gupta
  • Pradeep Varakantham
  • Milind Tambe
  • Makoto Yokoo

Many applications of networks of agents, including mobile sensor networks, unmanned air vehicles, and autonomous underwater vehicles, involve hundreds of agents acting collaboratively under uncertainty. Distributed Partially Observable Markov Decision Problems (Distributed POMDPs) are well-suited to address such applications, but so far, only limited scale-ups of up to five agents have been demonstrated. This paper escalates the scale-up, presenting an algorithm called FANS that increases the number of agents in distributed POMDPs for the first time into double digits. FANS is founded on finite state machines (FSMs) for policy representation and exploits these FSMs to provide three key contributions: (i) Not all agents within an agent network need the same expressivity of policy representation; FANS introduces novel heuristics to automatically vary the FSM size in different agents for scale-up; (ii) FANS illustrates efficient integration of its FSM-based policy search within algorithms that exploit agent network structure; (iii) FANS provides significant speedups in policy evaluation and heuristic computations within the network algorithms by exploiting the FSMs for dynamic programming. Experimental results show not only orders of magnitude improvements over the previous best known algorithms for smaller-scale domains (with similar solution quality), but also a scale-up into double digits in terms of the number of agents.

AAMAS Conference 2008 Conference Paper

On K-Optimal Distributed Constraint Optimization Algorithms: New Bounds and Algorithms

  • Emma Bowring
  • Jonathan Pearce
  • Christopher Portway
  • Manish Jain
  • Milind Tambe

Distributed constraint optimization (DCOP) is a promising approach to coordination, scheduling and task allocation in multiagent networks. In large-scale or low-bandwidth networks, finding the global optimum is often impractical. K-optimality is a promising new approach: for the first time it provides a set of locally optimal algorithms with quality guarantees as a fraction of the global optimum. Unfortunately, previous work in k-optimality did not address domains where we may have prior knowledge of reward structure, and it failed to provide quality guarantees or algorithms for domains with hard constraints (such as agents' local resource constraints). This paper addresses these shortcomings with three key contributions. It provides: (i) improved lower bounds on k-optima quality incorporating available prior knowledge of reward structure; (ii) lower bounds on k-optima quality for problems with hard constraints; and (iii) k-optimal algorithms for solving DCOPs with hard constraints and detailed experimental results on large-scale networks.

AAMAS Conference 2008 Conference Paper

Playing Games for Security: An Efficient Exact Algorithm for Solving Bayesian Stackelberg Games

  • Praveen Paruchuri
  • Jonathan Pearce
  • Janusz Marecki
  • Milind Tambe
  • Fernando Ordonez
  • Sarit Kraus

In a class of games known as Stackelberg games, one agent (the leader) must commit to a strategy that can be observed by the other agent (the follower or adversary) before the adversary chooses its own strategy. We consider Bayesian Stackelberg games, in which the leader is uncertain about the types of adversary it may face. Such games are important in security domains, where, for example, a security agent (leader) must commit to a strategy of patrolling certain areas, and a robber (follower) has a chance to observe this strategy over time before choosing its own strategy of where to attack. This paper presents an efficient exact algorithm for finding the optimal strategy for the leader to commit to in these games. This algorithm, DOBSS, is based on a novel and compact mixed-integer linear programming formulation. Compared to the most efficient algorithm known previously for this problem, DOBSS is not only faster, but also leads to higher quality solutions, and does not suffer from problems of infeasibility that were faced by this previous algorithm. Note that DOBSS is at the heart of the ARMOR system that is currently being tested for security scheduling at the Los Angeles International Airport.
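The commitment advantage this abstract describes can be seen even in a tiny game. The sketch below is a brute-force grid search over leader mixed strategies in a hypothetical 2x2 single-type game, not DOBSS's mixed-integer linear program; all payoff matrices are invented for illustration:

```python
def follower_best_response(p, F):
    """Follower's best action when the leader plays row 0 with probability p."""
    values = [p * F[0][a] + (1 - p) * F[1][a] for a in range(2)]
    return max(range(2), key=lambda a: values[a])

def optimal_commitment(L, F, steps=1000):
    """Grid-search the leader's mixed strategies; the follower best-responds."""
    best_p, best_u = 0.0, float("-inf")
    for i in range(steps + 1):
        p = i / steps
        a = follower_best_response(p, F)
        u = p * L[0][a] + (1 - p) * L[1][a]
        if u > best_u:
            best_p, best_u = p, u
    return best_p, best_u

# Hypothetical payoffs: the unique Nash equilibrium gives the leader 1,
# but committing to a mixed strategy yields nearly 2.5.
L = [[1, 3], [0, 2]]   # leader payoffs
F = [[1, 0], [0, 1]]   # follower payoffs
```

Here the grid search finds a leader strategy near p = 0.5 worth almost 2.5; an exact Stackelberg solver would return exactly 2.5, since in a strong Stackelberg equilibrium the follower breaks ties in the leader's favor.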

IJCAI Conference 2007 Conference Paper

  • Jonathan P. Pearce
  • Milind Tambe

A distributed constraint optimization problem (DCOP) is a formalism that captures the rewards and costs of local interactions within a team of agents. Because complete algorithms to solve DCOPs are unsuitable for some dynamic or anytime domains, researchers have explored incomplete DCOP algorithms that result in locally optimal solutions. One type of categorization of such algorithms, and the solutions they produce, is k-optimality; a k-optimal solution is one that cannot be improved by any deviation by k or fewer agents. This paper presents the first known guarantees on solution quality for k-optimal solutions. The guarantees are independent of the costs and rewards in the DCOP, and once computed can be used for any DCOP of a given constraint graph structure.
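The definition of k-optimality used in this line of work is easy to operationalize on small instances by enumerating every deviation of at most k agents. A minimal sketch, assuming an invented three-agent chain DCOP (the reward tables are hypothetical, chosen only to separate 2-optimality from 3-optimality):

```python
from itertools import combinations, product

def total_reward(assign, constraints):
    """Sum binary-constraint rewards; constraints maps (i, j) -> value table."""
    return sum(table[assign[i]][assign[j]] for (i, j), table in constraints.items())

def is_k_optimal(assign, domains, constraints, k):
    """True iff no group of at most k agents can deviate and strictly improve."""
    base = total_reward(assign, constraints)
    agents = range(len(assign))
    for size in range(1, k + 1):
        for group in combinations(agents, size):
            for values in product(*(domains[i] for i in group)):
                candidate = list(assign)
                for i, v in zip(group, values):
                    candidate[i] = v
                if total_reward(candidate, constraints) > base:
                    return False
    return True

# Hypothetical chain: all-zeros scores 10, all-ones scores 20, and any
# mixed assignment loses reward on a mismatched constraint.
domains = [[0, 1]] * 3
constraints = {(0, 1): [[5, 0], [0, 10]], (1, 2): [[5, 0], [0, 10]]}
```

On this instance the all-zeros assignment survives every deviation by one or two agents, so it is 2-optimal, yet all three agents jointly switching to ones strictly improves the reward, so it is not 3-optimal; the paper's guarantees bound how far such locally optimal solutions can fall below the global optimum.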

IJCAI Conference 2007 Conference Paper

  • Janusz Marecki
  • Sven Koenig
  • Milind Tambe

Agents often have to construct plans that obey deadlines or, more generally, resource limits for real-valued resources whose consumption can only be characterized by probability distributions, such as execution time or battery power. These planning problems can be modeled with continuous state Markov decision processes (MDPs) but existing solution methods are either inefficient or provide no guarantee on the quality of the resulting policy. We therefore present CPH, a novel solution method that solves the planning problems by first approximating with any desired accuracy the probability distributions over the resource consumptions with phase-type distributions, which use exponential distributions as building blocks. It then uses value iteration to solve the resulting MDPs by exploiting properties of exponential distributions to calculate the necessary convolutions accurately and efficiently while providing strong guarantees on the quality of the resulting policy. Our experimental feasibility study in a Mars rover domain demonstrates a substantial speedup over Lazy Approximation, which is currently the leading algorithm for solving continuous state MDPs with quality guarantees.

IJCAI Conference 2007 Conference Paper

  • Pradeep Varakantham
  • Rajiv Maheswaran
  • Tapana Gupta
  • Milind Tambe

While POMDPs (partially observable Markov decision problems) are a popular computational model with wide-ranging applications, the computational cost for optimal policy generation is prohibitive. Researchers are investigating ever-more efficient algorithms, yet many applications demand that such algorithms bound any loss in policy quality when chasing efficiency. To address this challenge, we present two new techniques. The first approximates in the value space to obtain solutions efficiently for a pre-specified error bound. Unlike existing techniques, our technique guarantees the resulting policy will meet this bound. Furthermore, it does not require costly computations to determine the quality loss of the policy. Our second technique prunes large tracts of belief space that are unreachable, allowing faster policy computation without any sacrifice in optimality. The combination of the two techniques, which are complementary to existing optimal policy generation algorithms, provides solutions with tight error bounds efficiently in domains where competing algorithms fail to provide such tight bounds.

AAMAS Conference 2007 Conference Paper

An Efficient Heuristic Approach for Security Against Multiple Adversaries

  • Praveen Paruchuri
  • Jonathan P. Pearce
  • Milind Tambe
  • Fernando Ordonez
  • Sarit Kraus

In adversarial multiagent domains, security, commonly defined as the ability to deal with intentional threats from other agents, is a critical issue. This paper focuses on domains where these threats come from unknown adversaries. These domains can be modeled as Bayesian games; much work has been done on finding equilibria for such games. However, it is often the case in multiagent security domains that one agent can commit to a mixed strategy which its adversaries observe before choosing their own strategies. In this case, the agent can maximize reward by finding an optimal strategy, without requiring equilibrium. Previous work has shown this problem of optimal strategy selection to be NP-hard. Therefore, we present a heuristic called ASAP, with three key advantages to address the problem. First, ASAP searches for the highest-reward strategy, rather than a Bayes-Nash equilibrium, allowing it to find feasible strategies that exploit the natural first-mover advantage of the game. Second, it provides strategies which are simple to understand, represent, and implement. Third, it operates directly on the compact, Bayesian game representation, without requiring conversion to normal form. We provide an efficient Mixed Integer Linear Program (MILP) implementation for ASAP, along with experimental results illustrating significant speedups and higher rewards over other approaches.
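
The commitment problem ASAP addresses can be seen in a two-player toy game: the leader commits to a mixed strategy, the follower observes it and best-responds, and ties are broken in the leader's favor. The brute-force grid search below (with illustrative payoffs of our own choosing, not the paper's MILP) finds the optimal commitment for a single follower type:

```python
# Row player (leader) commits to a mixed strategy; column player
# (follower) observes it and best-responds. Payoff matrices are
# hypothetical, chosen so that committing to a mixed strategy beats
# committing to either pure strategy.
LEADER   = [[2.0, 4.0],
            [1.0, 3.0]]
FOLLOWER = [[1.0, 0.0],
            [0.0, 1.0]]

def best_commitment(steps=100):
    """Grid search over the leader's mixed strategies p = P(row 0)."""
    best_p, best_value = None, float("-inf")
    for i in range(steps + 1):
        p = i / steps
        # Follower's expected payoff for each column.
        fol = [p * FOLLOWER[0][c] + (1 - p) * FOLLOWER[1][c] for c in (0, 1)]
        # Best responses, with ties broken in the leader's favor.
        top = max(fol)
        responses = [c for c in (0, 1) if abs(fol[c] - top) < 1e-12]
        value = max(p * LEADER[0][c] + (1 - p) * LEADER[1][c] for c in responses)
        if value > best_value:
            best_p, best_value = p, value
    return best_p, best_value

p, v = best_commitment()   # p = 0.5, v = 3.5 for these payoffs
```

The toy exhibits the first-mover advantage the abstract mentions: the optimal mixed commitment earns 3.5, while committing to either pure strategy earns only 3.0 or 2.0. ASAP replaces this enumeration with a MILP over the compact Bayesian-game representation, handling multiple follower types at once.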

AAMAS Conference 2007 Conference Paper

Letting loose a SPIDER on a network of POMDPs: Generating quality guaranteed policies

  • Pradeep Varakantham
  • Janusz Marecki
  • Yuichi Yabu
  • Milind Tambe
  • Makoto Yokoo

Distributed Partially Observable Markov Decision Problems (Distributed POMDPs) are a popular approach for modeling multi-agent systems acting in uncertain domains. Given the significant complexity of solving distributed POMDPs, particularly as we scale up the numbers of agents, one popular approach has focused on approximate solutions. Though this approach is efficient, the algorithms within this approach do not provide any guarantees on solution quality. A second less popular approach focuses on global optimality, but typical results are available only for two agents, and also at considerable computational cost. This paper overcomes the limitations of both these approaches by providing SPIDER, a novel combination of three key features for policy generation in distributed POMDPs: (i) it exploits agent interaction structure given a network of agents (i.e., allowing easier scale-up to larger numbers of agents); (ii) it uses a combination of heuristics to speed up policy search; and (iii) it allows quality guaranteed approximations, allowing a systematic tradeoff of solution quality for time. Experimental results show orders of magnitude improvement in performance when compared with previous global optimal algorithms.

AAMAS Conference 2007 Conference Paper

On Opportunistic Techniques for Solving Decentralized Markov Decision Processes with Temporal Constraints

  • Janusz Marecki
  • Milind Tambe

Decentralized Markov Decision Processes (DEC-MDPs) are a popular model of agent-coordination problems in domains with uncertainty and time constraints, but they are very difficult to solve. In this paper, we improve a state-of-the-art heuristic solution method for DEC-MDPs, called OC-DEC-MDP, that has recently been shown to scale up to larger DEC-MDPs. Our heuristic solution method, called Value Function Propagation (VFP), combines two orthogonal improvements of OC-DEC-MDP. First, it speeds up OC-DEC-MDP by an order of magnitude by maintaining and manipulating a value function for each state (as a function of time) rather than a separate value for each pair of state and time interval. Furthermore, it achieves better solution qualities than OC-DEC-MDP because, as our analytical results show, it does not overestimate the expected total reward like OC-DEC-MDP. We test both improvements independently in a crisis-management domain as well as for other types of domains. Our experimental results demonstrate a significant speedup of VFP over OC-DEC-MDP as well as higher solution qualities in a variety of situations.

JAAMAS Journal 2006 Journal Article

Privacy Loss in Distributed Constraint Reasoning: A Quantitative Framework for Analysis and its Applications

  • Rajiv T. Maheswaran
  • Jonathan P. Pearce
  • Milind Tambe

Abstract It is critical that agents deployed in real-world settings, such as businesses, offices, universities and research laboratories, protect their individual users’ privacy when interacting with other entities. Indeed, privacy is recognized as a key motivating factor in the design of several multiagent algorithms, such as in distributed constraint reasoning (including both algorithms for distributed constraint optimization (DCOP) and distributed constraint satisfaction (DisCSPs)), and researchers have begun to propose metrics for analysis of privacy loss in such multiagent algorithms. Unfortunately, a general quantitative framework to compare these existing metrics for privacy loss or to identify dimensions along which to construct new metrics is currently lacking. This paper presents three key contributions to address this shortcoming. First, the paper presents VPS (Valuations of Possible States), a general quantitative framework to express, analyze and compare existing metrics of privacy loss. Based on a state-space model, VPS is shown to capture various existing measures of privacy created for specific domains of DisCSPs. The utility of VPS is further illustrated through analysis of privacy loss in DCOP algorithms, when such algorithms are used by personal assistant agents to schedule meetings among users. In addition, VPS helps identify dimensions along which to classify and construct new privacy metrics and it also supports their quantitative comparison. Second, the article presents key inference rules that may be used in analysis of privacy loss in DCOP algorithms under different assumptions. Third, detailed experiments based on the VPS-driven analysis lead to the following key results: (i) decentralization by itself does not provide superior protection of privacy in DisCSP/DCOP algorithms when compared with centralization; instead, privacy protection also requires the presence of uncertainty about agents’ knowledge of the constraint graph. 
(ii) one needs to carefully examine the metrics chosen to measure privacy loss; the qualitative properties of privacy loss and hence the conclusions that can be drawn about an algorithm can vary widely based on the metric chosen. This paper should thus serve as a call to arms for further privacy research, particularly within the DisCSP/DCOP arena.

IJCAI Conference 2005 Conference Paper

Networked Distributed POMDPs: A Synergy of Distributed Constraint Optimization and POMDPs

  • Ranjit Nair
  • Pradeep Varakantham
  • Milind Tambe
  • Makoto Yokoo

In many real-world multiagent applications such as distributed sensor nets, a network of agents is formed based on each agent’s limited interactions with a small number of neighbors. While distributed POMDPs capture the real-world uncertainty in multiagent domains, they fail to exploit such locality of interaction. Distributed constraint optimization (DCOP) captures the locality of interaction but fails to capture planning under uncertainty. This paper presents a new model synthesized from distributed POMDPs and DCOPs, called Networked Distributed POMDPs (ND-POMDPs). Exploiting network structure enables us to present a distributed policy generation algorithm that performs local search.

AAAI Conference 2005 Conference Paper

Networked Distributed POMDPs: A Synthesis of Distributed Constraint Optimization and POMDPs

  • Ranjit Nair
  • Milind Tambe

In many real-world multiagent applications such as distributed sensor nets, a network of agents is formed based on each agent’s limited interactions with a small number of neighbors. While distributed POMDPs capture the real-world uncertainty in multiagent domains, they fail to exploit such locality of interaction. Distributed constraint optimization (DCOP) captures the locality of interaction but fails to capture planning under uncertainty. This paper presents a new model synthesized from distributed POMDPs and DCOPs, called Networked Distributed POMDPs (ND-POMDPs). Exploiting network structure enables us to present two novel algorithms for ND-POMDPs: a distributed policy generation algorithm that performs local search and a systematic policy search that is guaranteed to reach the global optimal.

AAAI Conference 1999 Conference Paper

Automated Team Analysis

  • Taylor Raines
  • Milind Tambe
  • Stacy Marsella

We have created an agent for analyzing and improving synthetic teams. The agent is built in a bottom-up fashion using little specific domain knowledge. In lieu of extensive domain knowledge, data mining and inductive learning techniques are used in an attempt to isolate the key issues determining the successes or failures of these teams. This approach has been applied to the RoboCup domain, with a current focus on analyzing shots on goal and with future plans for assists, passing, and general teamwork.

IJCAI Conference 1999 Conference Paper

Two Fielded Teams and Two Experts: A RoboCup Challenge Response from the Trenches

  • Milind Tambe
  • Gal A. Kaminka
  • Stacy Marsella
  • Ion Muslea
  • Taylor Raines

The RoboCup (robot world-cup soccer) effort, initiated to stimulate research in multi-agents and robotics, has blossomed into a significant effort of international proportions. RoboCup is simultaneously a fundamental research effort and a set of competitions for testing research ideas. At IJCAI'97, a broad research challenge was issued for the RoboCup synthetic agents, covering areas of multi-agent learning, teamwork and agent modeling. This paper outlines our attack on the entire breadth of the RoboCup research challenge, on all of its categories, in the form of two fielded, contrasting RoboCup teams, and two off-line soccer analysis agents. We compare the teams and the agents to generalize the lessons learned in learning, teamwork and agent modeling.

AAAI Conference 1997 Conference Paper

Agent Architectures for Flexible, Practical Teamwork

  • Milind Tambe

Teamwork in complex, dynamic, multi-agent domains mandates highly flexible coordination and communication. Simply fitting individual agents with precomputed coordination plans will not do, for their inflexibility can cause severe failures in teamwork, and their domain-specificity hinders reusability. Our central hypothesis is that the key to such flexibility and reusability is agent architectures with integrated teamwork capabilities. This fundamental shift in agent architectures is illustrated via an implemented candidate: STEAM. While STEAM is founded on the joint intentions theory, practical operationalization has required it to integrate several key novel concepts: (i) team synchronization to establish joint intentions; (ii) constructs for monitoring joint intentions and repair; and (iii) decision-theoretic communication selectivity (to pragmatically extend the joint intentions theory). Applications in three different complex domains, with empirical results, are presented.

IJCAI Conference 1997 Conference Paper

The RoboCup Synthetic Agent Challenge

  • Hiroaki Kitano
  • Milind Tambe
  • Peter Stone
  • Manuela Veloso
  • Silvia Coradeschi
  • Eiichi Osawa
  • Hitoshi Matsubara
  • Itsuki Noda

RoboCup Challenge offers a set of challenges for intelligent agent researchers using a friendly competition in a dynamic, real-time, multiagent domain. While RoboCup in general envisions longer range challenges over the next few decades, RoboCup Challenge presents three specific challenges for the next two years: (i) learning of individual agents and teams; (ii) multi-agent team planning and plan-execution in service of teamwork; and (iii) opponent modeling. RoboCup Challenge provides a novel opportunity for machine learning, planning, and multi-agent researchers: it not only supplies a concrete domain to evaluate their techniques, but also challenges researchers to evolve these techniques to face key constraints fundamental to this domain: real-time, uncertainty, and teamwork.

AAAI Conference 1996 Conference Paper

Tracking Dynamic Team Activity

  • Milind Tambe

AI researchers are striving to build complex multi-agent worlds with intended applications ranging from the RoboCup robotic soccer tournaments, to interactive virtual theatre, to large-scale real-world battlefield simulations. Agent tracking - monitoring other agents' actions and inferring their higher-level goals and intentions - is a central requirement in such worlds. While previous work has mostly focused on tracking individual agents, this paper goes beyond by focusing on agent teams. Team tracking poses the challenge of tracking a team's joint goals and plans. Dynamic, real-time environments add to the challenge, as ambiguities have to be resolved in real-time. The central hypothesis underlying the present work is that an explicit team-oriented perspective enables effective team tracking. This hypothesis is instantiated using the model tracing technology employed in tracking individual agents. Thus, to track team activities, team models are put to service. Team models are a concrete application of the joint intentions framework and enable an agent to track team activities, regardless of whether the agent is a collaborative participant or a non-participant in the team. To facilitate real-time ambiguity resolution with team models: (i) aspects of tracking are cast as constraint satisfaction problems to exploit constraint propagation techniques; and (ii) a cost minimality criterion is applied to constrain tracking search. Empirical results from two separate tasks in real-world, dynamic environments, one collaborative and one competitive, are provided.

IJCAI Conference 1995 Conference Paper

RESC: An Approach for Real-time, Dynamic Agent Tracking

  • Milind Tambe
  • Paul S. Rosenbloom

Agent tracking involves monitoring the observable actions of other agents as well as inferring their unobserved actions, plans, goals and behaviors. In a dynamic, real-time environment, an intelligent agent faces the challenge of tracking other agents' flexible mix of goal-driven and reactive behaviors, and doing so in real-time, despite ambiguities. This paper presents RESC (REal-time Situated Commitments), an approach that enables an intelligent agent to meet this challenge. RESC's situatedness derives from its constant uninterrupted attention to the current world situation: it always tracks other agents' on-going actions in the context of this situation. Despite ambiguities, RESC quickly commits to a single interpretation of the on-going actions (without an extensive examination of the alternatives), and uses that in service of interpretation of future actions. However, should its commitments lead to inconsistencies in tracking, it uses single-state backtracking to undo some of the commitments and repair the inconsistencies. Together, RESC's situatedness, immediate commitment, and single-state backtracking conspire in providing RESC its real-time character. RESC is implemented in the context of intelligent pilot agents participating in a real-world synthetic air-combat environment. Experimental results illustrating RESC's effectiveness are presented.

TIME Conference 1994 Conference Paper

Event Tracking for an Intelligent Automated Agent

  • Milind Tambe
  • Paul S. Rosenbloom

In a dynamic, multiagent environment, an automated intelligent agent is often faced with the possibility that other agents may instigate events that hinder or help the achievement of its own goals. To act intelligently in such an environment, an automated agent needs an event tracking capability to continually monitor the occurrence of such events and the temporal relationships among them. This capability enables an agent to infer the occurrence of important unobserved events as well as to obtain a better understanding of the interaction among events. This article focuses on event tracking in one complex and dynamic multiagent environment: the air‐combat simulation environment. It analyzes the challenges that an automated pilot agent must face when tracking events in this environment. This analysis reveals three new issues that have not been addressed in previous work in this area: (i) tracking events generated by agents’ flexible and reactive behaviors, (ii) tracking events in the context of continuous agent interactions, and (iii) tracking events in real time. This article proposes one solution to address these issues. One key idea in this solution is that the (architectural) mechanisms that an agent employs in generating its own flexible and reactive behaviors can be used to track other agents’ flexible and reactive behaviors in real time. A second key idea is the use of a world‐centered representation for modeling agent interactions. The solution is demonstrated using an implementation of an automated pilot agent.

AAAI Conference 1993 Conference Paper

On the Masking Effect

  • Milind Tambe

Machine learning approaches to knowledge compilation seek to improve the performance of problem-solvers by storing solutions to previously solved problems in an efficient, generalized form. The problem-solver retrieves these learned solutions in appropriate later situations to obtain results more efficiently. However, by relying on its learned knowledge to provide a solution, the problem-solver may miss an alternative solution of higher quality - one that could have been generated using the original (non-learned) problem-solving knowledge. This phenomenon is referred to as the masking effect of learning. In this paper, we examine a sequence of possible solutions for the masking effect. Each solution refines and builds on the previous one. The final solution is based on cascaded filters. When learned knowledge is retrieved, these filters alert the system about the inappropriateness of this knowledge so that the system can then derive a better alternative solution. We analyze conditions under which this solution will perform better than the others, and present experimental data supportive of the analysis. This investigation is based on a simulated robot domain called Groundworld.