Arrow Research search

Author name cluster

Henryk Michalewski

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

18 papers
2 author rows

Possible papers

18

AAAI Conference 2025 Conference Paper

Structured Packing in LLM Training Improves Long Context Utilization

  • Konrad Staniszewski
  • Szymon Tworkowski
  • Sebastian Jaszczur
  • Yu Zhao
  • Henryk Michalewski
  • Łukasz Kuciński
  • Piotr Miłoś

Recent advancements in long-context language modeling have attracted significant attention, yet their practical applications often suffer from suboptimal context utilization. To address this issue efficiently, we introduce Structured Packing for Long Context (SPLiCe), a method that uses retrieval to collate mutually relevant documents into long training samples. We demonstrate that SPLiCe improves performance on long-context tasks, particularly by achieving perfect accuracy on the synthetic Needle in the Haystack benchmark and effectively mitigating the 'lost-in-the-middle' phenomenon often observed in large language models. Notably, these long-context capabilities also extend to realistic downstream tasks, such as Qasper, across multiple model sizes (3B, 7B, and 13B) and are achieved with only brief fine-tuning on 2-6 billion tokens. We supplement these results with a detailed analysis of SPLiCe, examining the impact of hyperparameter choices, the different mixtures and proportions of SPLiCe-generated training data, and the choice of the retriever. We also study the transfer of long-context utilization skills between modalities. An intriguing finding from our analysis is that training on a corpus of code can enhance performance on natural language tasks.
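The core idea, retrieving mutually relevant documents and packing them into one long training sample, can be illustrated with a toy sketch. The bag-of-words similarity and the greedy chaining below are illustrative stand-ins; the paper's actual retriever and packing procedure differ.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def pack_related(docs: list[str]) -> list[str]:
    """Greedily order documents so that each next document is the one
    most similar to the previously chosen one -- a stand-in for the
    retrieval step (the paper uses a real retriever)."""
    bows = [Counter(d.lower().split()) for d in docs]
    order = [0]
    remaining = set(range(1, len(docs)))
    while remaining:
        last = order[-1]
        nxt = max(remaining, key=lambda i: cosine(bows[last], bows[i]))
        order.append(nxt)
        remaining.remove(nxt)
    return [docs[i] for i in order]

docs = [
    "def add(a, b): return a + b",
    "the cat sat on the mat",
    "def mul(a, b): return a * b",
    "a cat and a dog on a mat",
]
packed = pack_related(docs)  # related code snippets end up adjacent
```

Concatenating `packed` would yield a training sample in which adjacent segments are topically related, which is the property the paper ties to better long-context utilization.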

ICML Conference 2024 Conference Paper

Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution

  • Chrisantha Fernando
  • Dylan Banarse
  • Henryk Michalewski
  • Simon Osindero
  • Tim Rocktäschel

Popular prompt strategies like Chain-of-Thought Prompting can dramatically improve the reasoning abilities of Large Language Models (LLMs) in various domains. However, such hand-crafted prompt strategies are often sub-optimal. In this paper, we present Promptbreeder, a general-purpose self-referential self-improvement mechanism that evolves and adapts prompts for a given domain. Driven by an LLM, Promptbreeder mutates a population of task-prompts, evaluates them for fitness on a training set, and repeats this process over multiple generations to evolve task-prompts. Crucially, the mutation of these task-prompts is governed by mutation-prompts that the LLM generates and improves throughout evolution in a self-referential way. That is, Promptbreeder is not just improving task-prompts, but it is also improving the mutation-prompts that improve these task-prompts. Promptbreeder outperforms state-of-the-art prompt strategies such as Chain-of-Thought and Plan-and-Solve Prompting on commonly used arithmetic and commonsense reasoning benchmarks. Furthermore, Promptbreeder is able to evolve intricate task-prompts for the challenging problem of hate speech classification.
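The self-referential loop described above can be sketched as a minimal evolutionary algorithm. The `fitness` and `mutate` functions here are deterministic toy stand-ins for the LLM-driven evaluation and mutation used in the paper; only the overall shape (a population of task-prompt/mutation-prompt pairs, with both members evolving) follows the abstract.

```python
import random

def evolve(population, fitness, mutate, generations=20, seed=0):
    """Minimal tournament-style evolutionary loop in the spirit of
    Promptbreeder: each individual is a (task_prompt, mutation_prompt)
    pair, and mutation may rewrite both members."""
    rng = random.Random(seed)
    for _ in range(generations):
        # Binary tournament: the loser is replaced by a mutant of the winner.
        i, j = rng.sample(range(len(population)), 2)
        if fitness(population[i][0]) < fitness(population[j][0]):
            i, j = j, i
        population[j] = mutate(population[i])
    return max(population, key=lambda ind: fitness(ind[0]))

def fitness(task_prompt):
    # Toy objective: reward prompts that ask for step-by-step reasoning.
    return task_prompt.count("step-by-step")

def mutate(ind):
    task, mut = ind
    new_task = task + " " + mut      # apply the mutation-prompt to the task-prompt
    new_mut = mut + " Be concise."   # self-referentially edit the mutation-prompt too
    return (new_task, new_mut)

pop = [("Solve the problem.", "Let's think step-by-step."),
       ("Answer the question.", "Work step-by-step.")]
best_task, best_mut = evolve(pop, fitness, mutate)
```

In the paper both the mutation step and the fitness evaluation are performed by an LLM on real task data; the loop structure above is what "self-referential" refers to.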

NeurIPS Conference 2023 Conference Paper

Focused Transformer: Contrastive Training for Context Scaling

  • Szymon Tworkowski
  • Konrad Staniszewski
  • Mikołaj Pacek
  • Yuhuai Wu
  • Henryk Michalewski
  • Piotr Miłoś

Large language models have an exceptional capability to incorporate new information in a contextual manner. However, the full potential of such an approach is often restrained due to a limitation in the effective context length. One solution to this issue is to endow an attention layer with access to an additional context, which comprises (key, value) pairs. Yet, as the number of documents increases, the proportion of relevant keys to irrelevant ones decreases, leading the model to focus more on the irrelevant keys. We identify a significant challenge, dubbed the distraction issue, where keys linked to different semantic values might overlap, making them hard to distinguish. To tackle this problem, we introduce the Focused Transformer (FoT), a technique that employs a training process inspired by contrastive learning. This novel approach enhances the structure of the (key, value) space, enabling an extension of the context length. Our method allows for fine-tuning pre-existing, large-scale models to lengthen their effective context. This is demonstrated by our fine-tuning of 3B and 7B OpenLLaMA checkpoints. The resulting models, which we name LongLLaMA, exhibit advancements in tasks requiring a long context. We further illustrate that our LongLLaMA models adeptly manage a 256k context length for passkey retrieval.
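The contrastive idea, making a query's matching key stand out against keys drawn from other documents, can be illustrated with a generic InfoNCE-style loss. This is a toy sketch of the general technique the abstract alludes to, not the paper's exact objective or negative-sampling scheme.

```python
import math

def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss: pull `query` toward `positive`
    and push it away from `negatives`. FoT applies this kind of
    pressure to the (key, value) space by exposing keys from other
    documents as negatives during training (details differ)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    logits = [dot(query, positive) / temperature] + \
             [dot(query, n) / temperature for n in negatives]
    # Numerically stable -log softmax of the positive's logit.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]
```

When the query aligns with its positive key and not with the negatives, the loss is near zero; when a negative key is more similar than the positive, the loss is large, which is exactly the "distraction" the training process penalizes.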

JMLR Journal 2023 Journal Article

PaLM: Scaling Language Modeling with Pathways

  • Aakanksha Chowdhery
  • Sharan Narang
  • Jacob Devlin
  • Maarten Bosma
  • Gaurav Mishra
  • Adam Roberts
  • Paul Barham
  • Hyung Won Chung

Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.

ICLR Conference 2022 Conference Paper

Measuring CLEVRness: Black-box Testing of Visual Reasoning Models

  • Spyridon Mouselinos
  • Henryk Michalewski
  • Mateusz Malinowski

How can we measure the reasoning capabilities of intelligent systems? Visual question answering provides a convenient framework for testing a model's abilities by interrogating it through questions about the scene. However, despite scores of visual QA datasets and architectures, which sometimes even achieve super-human performance, the question of whether those architectures can actually reason remains open to debate. To answer this, we extend the visual question answering framework and propose the following behavioral test in the form of a two-player game. We consider black-box neural models of CLEVR. These models are trained on a diagnostic dataset benchmarking reasoning. Next, we train an adversarial player that re-configures the scene to fool the CLEVR model. We show that CLEVR models, which otherwise could perform at a "human level", can easily be fooled by our agent. Our results put in doubt whether data-driven approaches can do reasoning without exploiting the numerous biases that are often present in those datasets. Finally, we also propose a controlled experiment measuring the efficiency of such models to learn and perform reasoning.

NeurIPS Conference 2022 Conference Paper

Multi-Game Decision Transformers

  • Kuang-Huei Lee
  • Ofir Nachum
  • Mengjiao (Sherry) Yang
  • Lisa Lee
  • Daniel Freeman
  • Sergio Guadarrama
  • Ian Fischer
  • Winnie Xu

A longstanding goal of the field of AI is a method for learning a highly capable, generalist agent from diverse experience. In the subfields of vision and language, this was largely achieved by scaling up transformer-based models and training them on large, diverse datasets. Motivated by this progress, we investigate whether the same strategy can be used to produce generalist reinforcement learning agents. Specifically, we show that a single transformer-based model – with a single set of weights – trained purely offline can play a suite of up to 46 Atari games simultaneously at close-to-human performance. When trained and evaluated appropriately, we find that the same trends observed in language and vision hold, including scaling of performance with model size and rapid adaptation to new games via fine-tuning. We compare several approaches in this multi-game setting, such as online and offline RL methods and behavioral cloning, and find that our Multi-Game Decision Transformer models offer the best scalability and performance. We release the pre-trained models and code to encourage further research in this direction.

AAMAS Conference 2022 Conference Paper

Off-Policy Correction For Multi-Agent Reinforcement Learning

  • Michał Zawalski
  • Błażej Osiński
  • Henryk Michalewski
  • Piotr Miłoś

Multi-agent reinforcement learning (MARL) provides a framework for problems involving multiple interacting agents. Despite similarity to the single-agent case, multi-agent problems are often harder to train and analyze theoretically. In this work, we propose MA-Trace, a new on-policy actor-critic algorithm, which extends V-Trace to the MARL setting. The key advantage of our algorithm is its high scalability in a multi-worker setting. To this end, MA-Trace utilizes importance sampling as an off-policy correction method, which allows distributing the computations with negligible impact on the quality of training. Furthermore, our algorithm is theoretically grounded: we provide a fixed-point theorem that guarantees convergence. We evaluate the algorithm extensively on the StarCraft Multi-Agent Challenge, a standard benchmark for multi-agent algorithms. MA-Trace achieves high performance on all its tasks and exceeds state-of-the-art results on some of them.
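The off-policy correction at the heart of MA-Trace is the V-trace estimator, which truncates importance weights so that data gathered under a behaviour policy can evaluate the target policy. Below is a minimal single-agent sketch of the V-trace value targets (following the standard IMPALA formulation; the multi-agent extension in the paper adds details not shown here).

```python
def vtrace_targets(rewards, values, bootstrap, ratios, gamma=0.99,
                   rho_bar=1.0, c_bar=1.0):
    """Compute V-trace value targets for one trajectory.
    `ratios[t]` is pi(a_t|x_t) / mu(a_t|x_t), the target-policy to
    behaviour-policy probability ratio; `bootstrap` is the value
    estimate for the state after the last step."""
    T = len(rewards)
    vs = [0.0] * T
    next_v = bootstrap      # v_{t+1}, the target for the next step
    next_value = bootstrap  # V(x_{t+1}), the critic's estimate
    for t in reversed(range(T)):
        rho = min(rho_bar, ratios[t])  # truncated importance weight
        c = min(c_bar, ratios[t])      # truncated trace coefficient
        delta = rho * (rewards[t] + gamma * next_value - values[t])
        vs[t] = values[t] + delta + gamma * c * (next_v - next_value)
        next_v, next_value = vs[t], values[t]
    return vs

# Sanity check: on-policy (all ratios 1) the targets reduce to
# bootstrapped discounted returns r_t + gamma * v_{t+1}.
targets = vtrace_targets([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0,
                         [1.0, 1.0, 1.0], gamma=0.5)
```

Truncating `rho` and `c` at `rho_bar` and `c_bar` is what keeps the variance of the correction bounded as workers' behaviour policies drift from the learner's target policy.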

NeurIPS Conference 2022 Conference Paper

Solving Quantitative Reasoning Problems with Language Models

  • Aitor Lewkowycz
  • Anders Andreassen
  • David Dohan
  • Ethan Dyer
  • Henryk Michalewski
  • Vinay Ramasesh
  • Ambrose Slone
  • Cem Anil

Language models have achieved remarkable performance on a wide range of tasks that require natural language understanding. Nevertheless, state-of-the-art models have generally struggled with tasks that require quantitative reasoning, such as solving mathematics, science, and engineering questions at the college level. To help close this gap, we introduce Minerva, a large language model pretrained on general natural language data and further trained on technical content. The model achieves strong performance in a variety of evaluations, including state-of-the-art performance on the MATH dataset. We also evaluate our model on over two hundred undergraduate-level problems in physics, biology, chemistry, economics, and other sciences that require quantitative reasoning, and find that the model can correctly answer nearly a quarter of them.

NeurIPS Conference 2021 Conference Paper

Sparse is Enough in Scaling Transformers

  • Sebastian Jaszczur
  • Aakanksha Chowdhery
  • Afroz Mohiuddin
  • Lukasz Kaiser
  • Wojciech Gajewski
  • Henryk Michalewski
  • Jonni Kanerva

Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study becomes out of reach. We address this problem by leveraging sparsity. We study sparse variants for all layers in the Transformer and propose Scaling Transformers, a family of next generation Transformer models that use sparse layers to scale efficiently and perform unbatched decoding much faster than the standard Transformer as we scale up the model size. Surprisingly, the sparse layers are enough to obtain the same perplexity as the standard Transformer with the same number of parameters. We also integrate with prior sparsity approaches to attention and enable fast inference on long sequences even with limited memory. This results in performance competitive to the state-of-the-art on long text summarization.

ICLR Conference 2020 Conference Paper

Model Based Reinforcement Learning for Atari

  • Lukasz Kaiser
  • Mohammad Babaeizadeh
  • Piotr Milos
  • Blazej Osinski
  • Roy H. Campbell
  • Konrad Czechowski
  • Dumitru Erhan
  • Chelsea Finn

Model-free reinforcement learning (RL) can be used to learn effective policies for complex tasks, such as Atari games, even from image observations. However, this typically requires very large amounts of interaction -- substantially more, in fact, than a human would need to learn the same games. How can people learn so quickly? Part of the answer may be that people can learn how the game works and predict which actions will lead to desirable outcomes. In this paper, we explore how video prediction models can similarly enable agents to solve Atari games with fewer interactions than model-free methods. We describe Simulated Policy Learning (SimPLe), a complete model-based deep RL algorithm based on video prediction models and present a comparison of several model architectures, including a novel architecture that yields the best results in our setting. Our experiments evaluate SimPLe on a range of Atari games in the low-data regime of 100k interactions between the agent and the environment, which corresponds to two hours of real-time play. In most games SimPLe outperforms state-of-the-art model-free algorithms, in some games by over an order of magnitude.

ICRA Conference 2020 Conference Paper

Simulation-Based Reinforcement Learning for Real-World Autonomous Driving

  • Blazej Osinski
  • Adam Jakubowski
  • Pawel Ziecina
  • Piotr Milos
  • Christopher Galias
  • Silviu Homoceanu
  • Henryk Michalewski

We use reinforcement learning in simulation to obtain a driving system controlling a full-size real-world vehicle. The driving policy takes RGB images from a single camera and their semantic segmentation as input. We use mostly synthetic data, with labelled real-world data appearing only in the training of the segmentation network. Using reinforcement learning in simulation and synthetic data is motivated by lowering costs and engineering effort. In real-world experiments we confirm that we achieved successful sim-to-real policy transfer. Based on the extensive evaluation, we analyze how design decisions about perception, control, and training impact the real-world performance.

NeurIPS Conference 2018 Conference Paper

Reinforcement Learning of Theorem Proving

  • Cezary Kaliszyk
  • Josef Urban
  • Henryk Michalewski
  • Miroslav Olšák

We introduce a theorem proving algorithm that uses practically no domain heuristics for guiding its connection-style proof search. Instead, it runs many Monte-Carlo simulations guided by reinforcement learning from previous proof attempts. We produce several versions of the prover, parameterized by different learning and guiding algorithms. The strongest version of the system is trained on a large corpus of mathematical problems and evaluated on previously unseen problems. The trained system solves within the same number of inferences over 40% more problems than a baseline prover, which is an unusually high improvement in this hard AI domain. To our knowledge this is the first time reinforcement learning has been convincingly applied to solving general mathematical problems on a large scale.
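The Monte-Carlo guidance described above typically relies on a bandit-style selection rule such as UCB1 to balance exploring untried inference steps against exploiting promising ones. The sketch below is generic: the action names and the exact selection formula are illustrative, not the paper's specific guidance.

```python
import math

def ucb1_select(stats, exploration=math.sqrt(2)):
    """Pick the action maximizing the UCB1 score
    mean + c * sqrt(ln(N) / n), the bandit rule underlying
    Monte-Carlo tree search. `stats` maps each candidate action
    to (sum of simulation rewards, visit count)."""
    total = sum(n for _, n in stats.values())
    def score(item):
        value_sum, visits = item[1]
        if visits == 0:
            return float("inf")  # always try unexplored actions first
        mean = value_sum / visits
        return mean + exploration * math.sqrt(math.log(total) / visits)
    return max(stats.items(), key=score)[0]

# Hypothetical proof-search choice point: two clause extensions and a
# backtrack option, with rewards from previous Monte-Carlo playouts.
stats = {"extend_clause_1": (3.0, 10),
         "extend_clause_2": (1.0, 2),
         "backtrack": (0.0, 0)}
choice = ucb1_select(stats)
```

In the paper, the playout rewards come from learned policy and value estimates trained on earlier proof attempts, rather than from fixed statistics as in this toy example.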

MFCS Conference 2017 Conference Paper

A Characterisation of Pi^0_2 Regular Tree Languages

  • Filippo Cavallari
  • Henryk Michalewski
  • Michal Skrzypczak

We show an algorithm that, for a given regular tree language L, decides if L is in Pi^0_2, that is, if L belongs to the second level of the Borel hierarchy. Moreover, if L is in Pi^0_2, then we construct a weak alternating automaton of index (0, 2) which recognises L. We also prove that for a given language L, L is recognisable by a weak alternating (1, 3)-automaton if and only if it is recognisable by a weak non-deterministic (1, 3)-automaton.

ECAI Conference 2016 Conference Paper

Non-Utilitarian Coalition Structure Generation

  • Oskar Skibski
  • Henryk Michalewski
  • Andrzej Nagórko
  • Tomasz Pawel Michalak
  • Andrew James Dowell
  • Talal Rahwan
  • Michael J. Wooldridge

The coalition structure generation problem is one of the key challenges in multi-agent coalition formation. It involves partitioning a set of agents into coalitions so that system performance is optimized. To date, the multi-agent systems literature has focused exclusively on the utilitarian version of this problem which seeks to maximize the sum of the values of the coalitions involved. However, there are many examples of situations in which other performance metrics are of interest. In particular, in games with non-transferable utility, we may be more interested in an egalitarian optimal coalition structure, or in minimizing the difference between the utilities of the most affluent and poorest agents. In this paper, we present a number of exact algorithms to solve such non-utilitarian formulations of the coalition structure generation problem.

CSL Conference 2016 Conference Paper

The Logical Strength of Büchi's Decidability Theorem

  • Leszek Aleksander Kolodziejczyk
  • Henryk Michalewski
  • Cécilia Pradic
  • Michal Skrzypczak

We study the strength of axioms needed to prove various results related to automata on infinite words and Büchi's theorem on the decidability of the MSO theory of (N, less_or_equal). We prove that the following are equivalent over the weak second-order arithmetic theory RCA_0: 1. Büchi's complementation theorem for nondeterministic automata on infinite words, 2. the decidability of the depth-n fragment of the MSO theory of (N, less_or_equal), for each n greater than 5, 3. the induction scheme for Sigma^0_2 formulae of arithmetic. Moreover, each of (1)-(3) is equivalent to the additive version of Ramsey's Theorem for pairs, often used in proofs of (1); each of (1)-(3) implies McNaughton's determinisation theorem for automata on infinite words; and each of (1)-(3) implies the "bounded-width" version of König's Lemma, often used in proofs of McNaughton's theorem.

Highlights Conference 2015 Conference Abstract

How unprovable is Rabin's theorem?

  • Henryk Michalewski

We study the strength of set-theoretic axioms needed to prove Rabin's theorem on the decidability of the MSO theory of the infinite binary tree. We first show that the complementation theorem for tree automata, which forms the technical core of typical proofs of Rabin's theorem, is equivalent over the modestly strong second-order arithmetic theory ACA_0 to the statement Det asserting the determinacy of all Gale-Stewart games with winning conditions given by boolean combinations of F_\sigma sets. From work of Medsalem and Tanaka it follows that the statement Det is provable from \Pi^1_3- but not \Delta^1_3-comprehension. We then prove that the decidability of MSO on the infinite binary tree is itself equivalent to the statement Det over \Pi^1_2-comprehension. Hence, \Delta^1_3-comprehension is not enough to prove Rabin's theorem. Moreover, relying on the work of Moellerfeld, we show that Rabin's theorem is equivalent over \Pi^1_2-comprehension to a purely logical reflection principle: "every \Pi^1_3 sentence provable from \Pi^1_2 comprehension is true".

Highlights Conference 2013 Conference Abstract

Complexity collapse for unambiguous languages

  • Henryk Michalewski
  • Michał Skrzypczak

A nondeterministic automaton is unambiguous if it has at most one accepting run on every input. The class of languages recognisable by unambiguous tree automata is still not well understood. In this paper we show the following complexity collapse: if A is an unambiguous Büchi tree automaton, then L(A) is recognisable by a weak alternating automaton. We also present some extensions of this result. To the best of our knowledge, this is the first parity index collapse that exploits the unambiguity of an automaton.

Highlights Conference 2013 Conference Abstract

Regular languages of infinite trees of low Borel complexity

  • Henryk Michalewski

M. Bojanczyk and T. Place proved that for a regular language of infinite trees it is decidable whether it is a boolean combination of open sets. The class of Delta^0_2 sets consists of sets which are simultaneously F_sigma and G_delta. Every boolean combination of open sets belongs to the class Delta^0_2. I will present a small modification of the result of M. Bojanczyk and T. Place. Namely, I provide an algorithm to decide whether a given regular language of infinite trees belongs to the class Delta^0_2.