Author name cluster

Robert Wilson

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

19 papers

2 author rows

NeurIPS Conference 2025 Conference Paper

Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations

Ji-An Li
Huadong Xiong
Robert Wilson
Marcelo G Mattar
Marcus K. Benna

Large language models (LLMs) can sometimes report the strategies they actually use to solve tasks, yet at other times seem unable to recognize those strategies that govern their behavior. This suggests a limited degree of metacognition, the capacity to monitor one's own cognitive processes for subsequent reporting and self-control. Metacognition enhances LLMs' capabilities in solving complex tasks but also raises safety concerns, as models may obfuscate their internal processes to evade neural-activation-based oversight (e.g., safety detector). Given society's increased reliance on these models, it is critical that we understand their metacognitive abilities. To address this, we introduce a neuroscience-inspired neurofeedback paradigm that uses in-context learning to quantify metacognitive abilities of LLMs to report and control their activation patterns. We demonstrate that their abilities depend on several factors: the number of in-context examples provided, the semantic interpretability of the neural activation direction (to be reported/controlled), and the variance explained by that direction. These directions span a "metacognitive space" with dimensionality much lower than the model's neural space, suggesting LLMs can monitor only a small subset of their neural activations. Our paradigm provides empirical evidence to quantify metacognition in LLMs, with significant implications for AI safety (e.g., adversarial attack and defense).

NeurIPS Conference 2025 Conference Paper

Large Language Models Think Too Fast To Explore Effectively

Lan Pan
Hanbo Xie
Robert Wilson

Large Language Models (LLMs) have emerged with many intellectual capacities. While numerous benchmarks assess their intelligence, limited attention has been given to their ability to explore—an essential capacity for discovering new information and adapting to novel environments in both natural and artificial systems. The extent to which LLMs can effectively explore, particularly in open-ended tasks, remains unclear. This study investigates whether LLMs can surpass humans in exploration during an open-ended task, using Little Alchemy 2 as a paradigm, where agents combine elements to discover new ones. Results show most LLMs underperform compared to humans, except for the o1 model, with those traditional LLMs relying primarily on uncertainty-driven strategies, unlike humans who balance uncertainty and empowerment. Results indicate that traditional reasoning-focused LLMs, such as GPT-4o, exhibit a significantly faster and less detailed reasoning process, limiting their exploratory performance. In contrast, the DeepSeek reasoning model demonstrates prolonged, iterative thought processes marked by repetitive analysis of combinations and past trials, reflecting a more thorough and human-like exploration strategy. Representational analysis of the models with Sparse Autoencoders (SAE) revealed that uncertainty and choices are represented at earlier transformer blocks, while empowerment values are processed later, causing LLMs to think too fast and make premature decisions, hindering effective exploration. These findings shed light on the limitations of LLM exploration and suggest directions for improving their adaptability.

RLDM Conference 2017 Conference Abstract

A causal role for right frontopolar cortex in directed, but not random, exploration

Wojciech Zajkowski
Malgorzata Kossut
Robert Wilson

The explore-exploit dilemma occurs anytime we must choose between exploring unknown options for information and exploiting known resources for reward. Previous work suggests that people use two different strategies to solve the explore-exploit dilemma: directed exploration, driven by information seeking, and random exploration, driven by decision noise. Here, we show that these two strategies rely on different neural systems. Using transcranial magnetic stimulation to selectively inhibit right frontopolar cortex, we were able to selectively inhibit directed exploration while leaving random exploration intact. This suggests a causal role for right frontopolar cortex in directed, but not random, exploration and that the systems underlying directed and random exploration are, at least partially, dissociable.

RLDM Conference 2017 Conference Abstract

Engagement matters: pupil and mental effort mediate depletion effect on subsequent physical tasks

Bryan Kromenacker
Robert Wilson

Self-control depletion theory claims to account for between-task performance changes in terms of the consumption of a limited cognitive resource. Dual-task designs have been used to demonstrate that increased self-control on an initial effortful task predicted a decreased use of self-control on a later categorically distinct effortful task, suggesting a limited resource model. These accounts struggle to identify specific mechanisms linking them to rational theories of effort, and the reported effect size has recently come into question. Subject engagement during the depleting task is often assumed, but systematic disengagement may account for inconsistencies in the observed effect. We recreated a common dual-task depletion paradigm using a computer-automated design allowing for measurement of individual task performance as well as pupil size. We found evidence that task engagement measures do indeed account for some individual variation in the depletion effect, offering a possible explanation for inconsistent group-level effects.

RLDM Conference 2017 Conference Abstract

Mechanisms of Overharvesting in Patch Foraging

Gary Kane
Aaron Bornstein
Amitai Shenhav
Robert Wilson
Nathaniel Daw
Jonathan Cohen

Serial stay-or-search decisions are ubiquitous across many domains, including decisions regarding employment, relationships, and foraging for resources or information. Studies of animal foraging, in which animals decide to harvest depleting rewards contained within a patch or to leave the patch in search of a new, full one, have revealed a consistent bias towards overharvesting, or staying in patches longer than is predicted by optimal foraging theory (the Marginal Value Theorem; MVT). Yet, the cognitive biases that lead to overharvesting are poorly understood. We attempt to determine the cognitive biases that underlie overharvesting in rats. We characterized rat foraging behavior in response to two basic manipulations in patch foraging tasks: travel time between reward sources and depletion rate of the source; and to two novel manipulations to the foraging environment: proportional changes to the size of rewards and length of delays, and placement of delays (pre- vs. post-reward). In response to the basic manipulations, rats qualitatively followed predictions of MVT, but stayed in patches longer than is predicted. In the latter two manipulations, rats deviated from predictions of MVT, exhibiting changes in behavior not predicted by MVT. We formally tested whether four separate cognitive biases (subjective costs, decreasing marginal utility for reward, discounting of future reward, and ignoring post-reward delays) could explain overharvesting in the former two manipulations and deviations from MVT in the latter two. All the biases tested explained overharvesting behavior in the former contexts, but only one bias, in which rats ignore post-reward delays, also explained deviations from MVT in the latter contexts. Our results reveal that multiple cognitive biases may contribute to overharvesting, but inaccurate estimation of post-reward delays provided the best explanation across all contexts.

RLDM Conference 2017 Conference Abstract

Spontaneous Blink Rate Correlates With Financial Risk Taking

Emily Sherman
Chrysta Andrade
Catie Sikora
Emily Long
Robert Wilson

Dopamine has long been thought to play a role in risky decision-making, with higher tonic levels of dopamine associated with more risk-seeking behavior. In this work, we aimed to shed more light on this relationship using spontaneous blink rate as an indirect measure of dopamine. In particular, we used video recording to measure blink rate and a decision-making survey to measure risk taking in 45 participants. Consistent with previous work linking dopamine to risky decisions, we found a strong positive correlation between blink rate and the number of risky choices a participant made. This correlation was not dependent on age or gender and was identical for both gain and loss framing. This work suggests that dopamine plays a crucial and quite general role in determining risk attitude across the population and validates this simple method of probing dopamine for decision-making research.

RLDM Conference 2017 Conference Abstract

What is the nature of decision noise in random exploration?

Siyu Wang
Robert Wilson

The explore-exploit tradeoff is a fundamental behavioral dilemma faced by all adaptive organisms. Should we explore new options in the hopes of finding a better meal, a better house or a better mate, or should we exploit the options we currently believe to be best? Striking the right balance between exploration and exploitation is a hard computational problem and there is significant interest in how humans and other animals make explore-exploit decisions in practice. One particularly effective strategy for solving the explore-exploit dilemma is choice randomization [1]. In this strategy, the decision process is corrupted by noise, meaning that high value “exploit” options are not always chosen and exploratory choices are sometimes made by chance. In theory, such “random exploration” can be surprisingly effective in explore-exploit problems and, if implemented correctly, can come close to optimal performance. Recent work suggests that humans actually use random exploration to solve simple explore-exploit problems [2]. Despite this progress, a number of questions remain about the nature of random exploration, as there are a number of ways in which seemingly stochastic choices could be generated. In one strategy, which we call the “external noise strategy”, participants could rely on stochasticity in the world and allow irrelevant features of the stimulus to drive choice. In another strategy, called the “internal noise strategy”, people could rely on stochastic processes within their own brains. In this work, we modified our recently published “Horizon Task” in such a way as to distinguish these two strategies. Using both a model-free and model-based analysis of human behavior we show that, while both types of noise are present in explore-exploit decisions, random exploration is dominated by internal noise. This suggests that random exploration depends on adaptive noise processes in the brain which are subject to (perhaps unconscious) cognitive control.

RLDM Conference 2017 Conference Abstract

Wide-eyed and wrong? Pupil dilation and imperfect evidence accumulation in auditory perceptual decision

Todd Hagen
Robert Wilson

Integrating evidence over time is crucial for effective decision making. For simple perceptual decisions, a large body of research suggests that humans and animals are capable of perfect evidence integration in some settings. Although there has been significant interest in the neural systems underlying the information integration process, the role of the norepinephrine (NE) system has been relatively neglected. Norepinephrine is an interesting candidate for investigating the information integration process because it may work to modulate the signal-to-noise ratio of perceptual information, and the accumulation process of such information. To investigate whether and how the temporal integration of evidence is modulated by the locus coeruleus-norepinephrine system, we measured pupil dilation (a putative correlate of NE tone) in humans making a series of decisions based on rapidly-presented auditory information in a modified version of the Poisson Clicks Task (PCT). Our results suggest that people weigh information equally on trials they ultimately answer correctly, and weigh early and late information relatively lower on trials they answer incorrectly. Preliminary individual difference results further suggest that high pupil diameter at the onset of the stimulus is associated with worse task performance, an association related to overall stimulus noise rather than the information integration process. These results coincide with previous work showing that humans are capable of perfect information integration, while pointing to a potential role of the NE system in conditions of imperfect integration.

RLDM Conference 2015 Conference Abstract

A Drift Diffusion Model of Proactive and Reactive Control in a Context-Dependent Two-Alternative Forced Choice Task

Olga Lositsky
Robert Wilson
Michael Shvartsman
Jonathan Cohen

Most of our everyday decisions rely crucially on context: foraging for food in the fridge may be appropriate at home, but not at someone else’s house. Yet the mechanism by which context modulates how we respond to stimuli remains a topic of intense investigation. In order to isolate such decisions experimentally, investigators have employed simple context-based decision-making tasks like the AX-Continuous Performance Test (AX-CPT). In this task, the correct response to a probe stimulus depends on a cue stimulus that appeared several seconds earlier. It has been proposed (Braver, 2007) that humans might employ two strategies to perform this task: one in which rule information is proactively maintained in working memory, and another one in which rule information is retrieved reactively at the time of probe onset. While this framework has inspired considerable investigation, it has not yet been committed to a formal model. Such a model would be valuable for testing quantitative predictions about the influence of proactive and reactive strategies on choice and reaction time behavior. To this end, we have built a drift diffusion model of behavior on the AX-CPT, in which evidence accumulation about a stimulus is modulated by context. We implemented proactive and reactive strategies as two distinct models: in the proactive variant, perception of the probe is modulated by the remembered cue; in the reactive variant, retrieval of the cue from memory is modulated by the perceived probe. Fitting these models to data shows that, counter-intuitively, behavior taken as a signature of reactive control is better fit by the proactive variant of the model, while proactive profiles of behavior are better fit by the reactive variant. We offer possible interpretations of this result, and use simulations to suggest experimental manipulations for which the two models make divergent predictions.

RLDM Conference 2015 Conference Abstract

Directed and random exploration in realistic environments

Paul Krueger
Alexandria Oliver
Jonathan Cohen
Robert Wilson

Many everyday decisions involve a tradeoff between exploiting well-known options and exploring lesser-known options in hopes of a better outcome. Our previous work has shown that humans use at least two strategies to address this dilemma: directed exploration, driven by information-seeking, and random exploration, driven by decision noise. However, in the interest of simplicity, our task had two artificial constraints (explicit cues for previous outcomes and numeric rewards) that are often not present in real-world decisions. In the current study, we relaxed these constraints to test whether our previous findings hold true in more ecologically valid situations. Our first experiment removed cues for previous outcomes while still using numeric rewards, requiring participants to use working memory to track past outcomes. Experiment 2 went further and also presented rewards as patches of green dots instead of numbers, with denser patches of green corresponding to higher reward. In all conditions, we replicated our previous findings, thus showing that both directed and random exploration are robust across a variety of conditions.

RLDM Conference 2015 Conference Abstract

Humans tradeoff information seeking and randomness in explore-exploit decisions

Robert Wilson
Jonathan Cohen

The explore-exploit dilemma occurs when we must choose between exploring options that yield information (potentially useful for the future) and exploiting options that yield known reward (certain to be useful right now). We have previously shown that humans use two distinct strategies for resolving this dilemma: the optimal-but-complex ‘directed exploration’ in which choices are biased towards information, and the suboptimal-but-simple ‘random exploration’ in which choice variability leads to exploration by chance. Here we ask how these two strategies interact. We find that humans exhibit a tradeoff between these two forms of exploration, with higher levels of directed exploration associated with lower random exploration and vice versa. This directed-random tradeoff is described remarkably well by a parameter-free optimal theory that accurately captures individual differences between participants, as well as adjustments by individuals in response to simple experimental manipulations. These results show that humans combine information seeking and randomness in a rational way to solve the explore-exploit dilemma.

RLDM Conference 2015 Conference Abstract

Strategies for exploration in the domain of losses

Paul Krueger
Robert Wilson
Jonathan Cohen

In everyday life, many decisions involve choosing between familiar options (exploiting) and unfamiliar options (exploring). On average, exploiting yields good results but tells you nothing new, while exploring yields information but at a cost of uncertain and often worse outcomes. In previous work we have shown that a key factor in these ‘explore-exploit’ choices is the number of future decisions that people will make, the ‘time horizon’. As this horizon gets longer, people are more likely to explore, because acquiring information is useful for making future choices. Moreover, we found that this exploration is effected with two distinct strategies: directed exploration, in which an ‘information bonus’ that grows with horizon explicitly biases subjects to explore, and random exploration, in which increasing ‘decision noise’ drives exploration by chance. However, this, as well as most other previous work on the explore-exploit dilemma, focused on decisions in the domain of gains where the goal was to maximize reward. In many real-world decisions, however, the primary objective is to minimize losses and it is well known that humans can behave differently in this domain. In this study, we compared explore-exploit behavior of human participants under conditions of gain and loss. We found that people use both directed and random exploration regardless of whether they are exploring in response to gains or losses and that there is quantitative agreement between the exploration parameters across domains. Our model also revealed an overall bias towards exploration in the domain of losses that did not change with horizon. This seems to reflect an overall bias towards the uncertain outcomes in the domain of losses. Taken together, our results show that explore-exploit decisions in the domain of losses are driven by three independent processes: a baseline bias toward the uncertain option, and directed and random exploration.

RLDM Conference 2013 Conference Abstract

Exploration strategies in human decision making

Robert Wilson
Andra Geana
John White
Elliot Ludvig
Jonathan Cohen

The tradeoff between pursuing a known reward (exploitation) and sampling unknown, potentially better opportunities (exploration) is a fundamental challenge faced by all adaptive organisms. Theories formalize the value of exploration (gathering information) as an information bonus. However, this may be difficult to compute; a simpler alternative is to increase decision noise, driving random exploration. Relatively few studies have characterized human exploratory behavior, and most have failed to find an information bonus, suggesting it relies entirely on random exploration. However, these previous studies have either confounded reward and information or failed to account for baseline levels of ambiguity aversion and decision noise. To overcome these limitations, we conducted a sequential choice task that independently manipulated reward, information, and number of choices. Contrary to previous work, we found that humans do show an information bonus when given the opportunity to explore. In addition, we found adaptive changes in decision noise consistent with a type of random exploration that is subject to cognitive control.

RLDM Conference 2013 Conference Abstract

Is model fitting necessary for model-based fMRI?

Robert Wilson
Yael Niv

Model-based analysis of functional magnetic resonance imaging (fMRI) data is an important tool for investigating the computational role of different brain regions. With this method, theoretical models of behavior can be leveraged to find the brain structures underlying latent variables that are key to specific algorithms, such as prediction errors in temporal difference learning. A key step in this type of analysis is model fitting. Most commonly, a model is first fit to behavioral data to establish ‘good’ parameters. These are then used to generate model-based regressors of the quantity of interest, for regressing against brain activations acquired using fMRI. While such model fitting may intuitively seem like good practice, in this work we ask whether it is really necessary. We focus on the classic reinforcement learning regressors for value and prediction error and examine their sensitivity to perturbations of the learning rate parameter both in theory and in a previously published dataset. Surprisingly, in many cases, we find that fitting the learning rate is not necessary to generate good regressors and in some situations, even use of the worst possible parameter settings affects the model-based analysis only marginally. Our results suggest that precise model fitting is not necessary for model-based fMRI, thereby freeing experimental design from the constraint of allowing precise fits. They also highlight the limited use of fMRI data for arbitrating between different (correlated) models or model parameters.

RLDM Conference 2013 Conference Abstract

Reward, Risk and Ambiguity in Human Exploration: A Wheel of Fortune Task

Andra Geana
Robert Wilson
Jonathan Cohen

In realistic environments, organisms are frequently faced with multiple resource alternatives, and must balance the tradeoff between pursuing the known options (exploitation), and searching the environment for unknown opportunities (exploration). Exploration can be most beneficial in the presence of environmental uncertainty: when the range and benefits of all reward options are not fully known, exploration can lead to the discovery of new, better resources and an ultimately higher overall reward. However, uncertainty can take many forms, and it is unclear how different types of uncertainty impact people’s exploratory behaviour. We used a ‘wheel of fortune’ task to separate two well-established sources of uncertainty: risk (when outcomes are stochastic, but the probabilities of outcomes are known) and ambiguity (when the probabilities and/or the outcomes are unknown), and examine how they impact exploration. The results suggest that the presence of ambiguity in the environment drives people to explore in order to acquire more information and reduce the ambiguity. Conversely, a higher risk level in the environment increases exploration by increasing decision noise and making people less sensitive to the reward values of the available options. We examined these effects under two different decision horizons, and found that ambiguity-related, and not risk-related, exploration increases with decision horizon. These findings imply that different sources of uncertainty impact exploration differently, and may shed light on the mechanisms behind two distinguishable types of exploration that have been previously identified: random (characterized by an increase in decision noise) and directed (information-seeking) exploration.

IROS Conference 2010 Conference Paper

Multi-DOF equalization of haptic devices for accurate rendering at high frequencies

Robert Wilson
Sonny Chan
J. Kenneth Salisbury
Günter Niemeyer

Previous work has shown that high frequency content is important for realistic haptic feedback, while stability considerations limit the ability of closed-loop control to effectively generate high frequencies. Open-loop playback of high frequencies offers a promising way to generate rich contact transients and textures, but complex high frequency dynamics cause distortion. This paper explores the equalization and dynamic decoupling of multi-DOF haptic devices for accurate open-loop playback. Toward this end, a user study is performed that places the frequency limit of human force direction sensitivity at 35 Hz. This information together with experimental system identification techniques is used to develop a strategy for equalization in different frequency bands. Finally, MIMO equalization is accomplished through online simulation of the system model under the control of an LQR tracking controller.

NeurIPS Conference 2010 Conference Paper

The Neural Costs of Optimal Control

Samuel Gershman
Robert Wilson

Optimal control entails combining probabilities and utilities. However, for most practical problems probability densities can be represented only approximately. Choosing an approximation requires balancing the benefits of an accurate approximation against the costs of computing it. We propose a variational framework for achieving this balance and apply it to the problem of how a population code should optimally represent a distribution under resource constraints. The essence of our analysis is the conjecture that population codes are organized to maximize a lower bound on the log expected utility. This theory can account for a plethora of experimental data, including the reward-modulation of sensory receptive fields.

NeurIPS Conference 2009 Conference Paper

A Neural Implementation of the Kalman Filter

Robert Wilson
Leif Finkel

There is a growing body of experimental evidence to suggest that the brain is capable of approximating optimal Bayesian inference in the face of noisy input stimuli. Despite this progress, the neural underpinnings of this computation are still poorly understood. In this paper we focus on the problem of Bayesian filtering of stochastic time series. In particular we introduce a novel neural network, derived from a line attractor architecture, whose dynamics map directly onto those of the Kalman Filter in the limit where the prediction error is small. When the prediction error is large we show that the network responds robustly to change-points in a way that is qualitatively compatible with the optimal Bayesian model. The model suggests ways in which probability distributions are encoded in the brain and makes a number of testable experimental predictions.

ICRA Conference 2009 Conference Paper

Motion control of impedance-type haptic devices

Robert Wilson
Günter Niemeyer

Impedance type devices are frequently used in haptics due to their excellent rendition of free-space, low cost, and convenience. The spring drive, a recently proposed alternative to the traditional current motor driver for such devices, has been shown to improve their stiffness in rigid contact by moving the haptic coupling from the digital domain to analog circuitry. Unlike the current drive, which operates the motor as a force source, the spring drive converts the motor to a motion source. In this paper we construct a motion-based virtual environment to fully leverage the benefits of operating an impedance device with motion control. This quasi-static environment is connected to a mid-level controller which interfaces with the analog motor drive. A full system analysis and experimental demonstration verify the anticipated performance.