Arrow Research

Author name cluster

Marcus K. Benna

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact-name matches and is not a full identity-disambiguation profile.

2 papers
1 author row

Possible papers (2)

NeurIPS Conference 2025 · Conference Paper

Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations

  • Ji-An Li
  • Huadong Xiong
  • Robert Wilson
  • Marcelo G Mattar
  • Marcus K. Benna

Large language models (LLMs) can sometimes report the strategies they actually use to solve tasks, yet at other times seem unable to recognize the strategies that govern their behavior. This suggests a limited degree of metacognition: the capacity to monitor one's own cognitive processes for subsequent reporting and self-control. Metacognition enhances LLMs' capabilities in solving complex tasks but also raises safety concerns, as models may obfuscate their internal processes to evade neural-activation-based oversight (e.g., safety detectors). Given society's increased reliance on these models, it is critical that we understand their metacognitive abilities. To address this, we introduce a neuroscience-inspired "neurofeedback" paradigm that uses in-context learning to quantify the metacognitive abilities of LLMs to report and control their activation patterns. We demonstrate that these abilities depend on several factors: the number of in-context examples provided, the semantic interpretability of the neural activation direction (to be reported/controlled), and the variance explained by that direction. These directions span a "metacognitive space" with dimensionality much lower than the model's neural space, suggesting LLMs can monitor only a small subset of their neural activations. Our paradigm provides empirical evidence to quantify metacognition in LLMs, with significant implications for AI safety (e.g., adversarial attacks and defenses).
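For a concrete picture of the abstract's setup, here is a minimal, self-contained Python sketch of the general idea: pick a direction in activation space (e.g., a principal component) and ask how well its sign can be "reported" as the number of in-context examples grows. The random data, the simulated reporter, and all names are illustrative assumptions standing in for a real LLM's activations; this is not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for residual-stream activations: (n_examples, d_model).
# In the paper's setting these would come from a real LLM; here they are random.
n_examples, d_model = 200, 64
hidden = rng.normal(size=(n_examples, d_model))

# A candidate activation direction to be reported/controlled, e.g. the top
# principal component (a high-variance direction) of the centered activations.
centered = hidden - hidden.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)
direction = vt[0]                                  # unit vector in neural space
variance_explained = s[0] ** 2 / (s ** 2).sum()

# Ground-truth label per example: which side of the direction it falls on.
proj = centered @ direction
labels = (proj > 0).astype(int)

def simulated_report(k_context):
    """Illustrative stand-in for the LLM's in-context 'report' of its own
    activation: report noise shrinks as in-context examples accumulate
    (a modeling assumption, not the paper's mechanism)."""
    noise = rng.normal(scale=1.0 / np.sqrt(k_context), size=n_examples)
    return (proj + noise > 0).astype(int)

for k in (2, 8, 32):
    acc = (simulated_report(k) == labels).mean()
    print(f"in-context examples={k:2d}  report accuracy={acc:.2f}  "
          f"(direction explains {variance_explained:.1%} of variance)")
```

Under these assumptions, report accuracy improves with more in-context examples, loosely mirroring the dependence the abstract describes; the paper additionally varies the direction's interpretability and variance explained.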

NeurIPS Conference 2024 · Conference Paper

Linking In-context Learning in Transformers to Human Episodic Memory

  • Li Ji-An
  • Corey Y. Zhou
  • Marcus K. Benna
  • Marcelo G. Mattar

Understanding connections between artificial and biological intelligent systems can reveal fundamental principles of general intelligence. While many artificial intelligence models have a neuroscience counterpart, such connections are largely missing in Transformer models and the self-attention mechanism. Here, we examine the relationship between interacting attention heads and human episodic memory. We focus on induction heads, which contribute to in-context learning in Transformer-based large language models (LLMs). We demonstrate that induction heads are behaviorally, functionally, and mechanistically similar to the contextual maintenance and retrieval (CMR) model of human episodic memory. Our analyses of LLMs pre-trained on extensive text data show that CMR-like heads often emerge in the intermediate and late layers, qualitatively mirroring human memory biases. The ablation of CMR-like heads suggests their causal role in in-context learning. Our findings uncover a parallel between the computational mechanisms of LLMs and human memory, offering valuable insights into both research fields.
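As a toy illustration of what an "induction head" does (not the paper's analysis code), the sketch below scores an attention pattern by how much mass it places on the token immediately following a previous occurrence of the current token; the sequence, the scoring rule, and all names are simplified assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy repeated sequence "A B C D A B C D": an induction head attends from each
# token to the token immediately after that token's previous occurrence.
tokens = [0, 1, 2, 3, 0, 1, 2, 3]
T = len(tokens)

def induction_score(attn):
    """Average attention mass on 'previous occurrence + 1' targets.
    attn: (T, T) row-stochastic, causally masked attention pattern."""
    score, count = 0.0, 0
    for q in range(T):
        for k in range(1, q):
            if tokens[k - 1] == tokens[q]:
                score += attn[q, k]
                count += 1
    return score / max(count, 1)

# A random causal head for comparison.
random_attn = np.tril(rng.random((T, T)))
random_attn /= random_attn.sum(axis=1, keepdims=True)

# An idealized induction head: all mass on the induction targets when present.
ideal = np.zeros((T, T))
for q in range(T):
    targets = [k for k in range(1, q) if tokens[k - 1] == tokens[q]]
    if targets:
        ideal[q, targets] = 1.0 / len(targets)
    else:
        ideal[q, q] = 1.0                      # no target: attend to self

print("random head score:", round(induction_score(random_attn), 3))
print("ideal head score :", round(induction_score(ideal), 3))
```

A score near 1 flags an induction-like head, while a random head scores low; the paper goes further, relating such heads in pre-trained LLMs to the CMR model of human episodic memory.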