Author name cluster

Konrad Staniszewski

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

4 papers

2 author rows

NeurIPS Conference 2025 Conference Paper

Inference-Time Hyper-Scaling with KV Cache Compression

Adrian Łańcucki
Konrad Staniszewski
Piotr Nawrot
Edoardo Maria Ponti

Inference-time scaling trades efficiency for increased reasoning accuracy by generating longer or more parallel sequences. However, in Transformer LLMs, generation cost is bottlenecked by the size of the key–value (KV) cache, rather than the number of generated tokens. Hence, we explore inference-time hyper-scaling: by compressing the KV cache, we can generate more tokens within the same compute budget and further improve the accuracy of scaled inference. The success of this approach, however, hinges on the ability of compression methods to preserve accuracy even at high compression ratios. To make hyper-scaling practical, we introduce Dynamic Memory Sparsification (DMS), a novel method for sparsifying KV caches that only requires 1K training steps to achieve 8× compression, while maintaining better accuracy than training-free sparse attention. Instead of prematurely discarding cached tokens, DMS delays token eviction, implicitly merging representations and preserving critical information. We demonstrate the effectiveness of inference-time hyper-scaling with DMS on multiple families of LLMs, showing that it boosts accuracy for comparable inference latency and memory load. For instance, we enhance Qwen-R1 32B by 9. 1 points on AIME 24, 7. 6 on GPQA, and 9. 6 on LiveCodeBench on average for an equivalent number of memory reads.

PDF Details

AAAI Conference 2025 Conference Paper

Structured Packing in LLM Training Improves Long Context Utilization

Konrad Staniszewski
Szymon Tworkowski
Sebastian Jaszczur
Yu Zhao
Henryk Michalewski
Łukasz Kuciński
Piotr Miłoś

Recent advancements in long-context language modeling have attracted significant attention, yet their practical applications often suffer from suboptimal context utilization. To efficiently address this issue, we introduce the Structured Packing for Long Context, SPLiCe, a method that uses retrieval to collate mutually relevant documents into long training samples. We demonstrate that SPLiCe improves performance on long-context tasks, particularly by achieving perfect accuracy on the synthetic Needle in the Haystack benchmark, and effectively mitigating the ‘lost-in-the-middle’ phenomenon often observed in large language models. Notably, these long-context capabilities also extend to realistic downstream tasks, such as Qasper, across multiple model sizes—3B, 7B, and 13B—and are achieved with only brief fine-tuning on 2-6 billion tokens. We supplement these results with a detailed analysis of SPLiCe, examining the impact of hyperparameter choices, the different mixtures and proportions of SPLiCe-generated training data, and the choice of the retriever. We also study the transfer of long-context utilization skills between the modalities. An intriguing finding from our analysis is that training on a corpus of code can enhance performance on natural language tasks.

PDF Details DOI

NeurIPS Conference 2023 Conference Paper

Focused Transformer: Contrastive Training for Context Scaling

Szymon Tworkowski
Konrad Staniszewski
Mikołaj Pacek
Yuhuai Wu
Henryk Michalewski
Piotr Miłoś

Large language models have an exceptional capability to incorporate new information in a contextual manner. However, the full potential of such an approach is often restrained due to a limitation in the effective context length. One solution to this issue is to endow an attention layer with access to an additional context, which comprises of (key, value) pairs. Yet, as the number of documents increases, the proportion of relevant keys to irrelevant ones decreases, leading the model to focus more on the irrelevant keys. We identify a significant challenge, dubbed the distraction issue, where keys linked to different semantic values might overlap, making them hard to distinguish. To tackle this problem, we introduce the Focused Transformer (FoT), a technique that employs a training process inspired by contrastive learning. This novel approach enhances the structure of the (key, value) space, enabling an extension of the context length. Our method allows for fine-tuning pre-existing, large-scale models to lengthen their effective context. This is demonstrated by our fine-tuning of $3 B$ and $7 B$ OpenLLaMA checkpoints. The resulting models, which we name LongLLaMA, exhibit advancements in tasks requiring a long context. We further illustrate that our LongLLaMA models adeptly manage a $256 k$ context length for passkey retrieval.

PDF Details

CSL Conference 2023 Conference Paper

Parity Games of Bounded Tree-Depth

Konrad Staniszewski

The exact complexity of solving parity games is a major open problem. Several authors have searched for efficient algorithms over specific classes of graphs. In particular, Obdržálek showed that for graphs of bounded tree-width or clique-width, the problem is in P, which was later improved by Ganardi, who showed that it is even in LOGCFL (with an additional assumption for clique-width case). Here we extend this line of research by showing that for graphs of bounded tree-depth the problem of solving parity games is in logspace uniform AC⁰. We achieve this by first considering a parameter that we obtain from a modification of clique-width, which we call shallow clique-width. We subsequently provide a suitable reduction.

Details