Arrow Research

Author name cluster

Jakub Tarnawski

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

17 papers
2 author rows

Possible papers (17)

NeurIPS Conference 2025 Conference Paper

Improved Algorithms for Fair Matroid Submodular Maximization

  • Sepideh Mahabadi
  • Sherry Sarkar
  • Jakub Tarnawski

Submodular maximization subject to matroid constraints is a central problem with many applications in machine learning. As algorithms are increasingly used in decision-making over datapoints with sensitive attributes such as gender or race, it is becoming crucial to enforce fairness to avoid bias and discrimination. Recent work has addressed the challenge of developing efficient approximation algorithms for fair matroid submodular maximization. However, the best algorithms known so far are only guaranteed to satisfy a relaxed version of the fairness constraints that loses a factor 2, i.e., the problem may ask for $\ell$ elements with a given attribute, but the algorithm is only guaranteed to find $\lfloor \ell/2 \rfloor$. In particular, there is no provable guarantee when $\ell=1$, which corresponds to a key special case of perfect matching constraints. In this work, we achieve a new trade-off via an algorithm that gets arbitrarily close to full fairness. Namely, for any constant $\varepsilon>0$, we give a constant-factor approximation to fair monotone matroid submodular maximization that in expectation loses only a factor $(1-\varepsilon)$ in the lower-bound fairness constraint. Our empirical evaluation on a standard suite of real-world datasets -- including clustering, recommendation, and coverage tasks -- demonstrates the practical effectiveness of our methods.
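For readers unfamiliar with the constraint class, the classic greedy algorithm for monotone submodular maximization over a matroid (the 1/2-approximation of Fisher, Nemhauser, and Wolsey) can be sketched as below. This is the standard unconstrained-by-fairness baseline, not the paper's algorithm; the coverage function and rank-2 uniform matroid in the usage example are illustrative assumptions.

```python
def greedy_matroid(elements, f, is_independent):
    """Classic greedy for monotone submodular maximization over a matroid.

    f: monotone submodular set function on sets of elements.
    is_independent: matroid independence oracle.
    Gives a 1/2-approximation for monotone submodular f.
    """
    S = set()
    remaining = set(elements)
    while remaining:
        # pick the feasible element with the largest marginal gain
        best, best_gain = None, 0.0
        for e in remaining:
            if not is_independent(S | {e}):
                continue
            gain = f(S | {e}) - f(S)
            if gain > best_gain:
                best, best_gain = e, gain
        if best is None:
            break
        S.add(best)
        remaining.discard(best)
    return S


# Usage: coverage function under a rank-2 uniform matroid (hypothetical data).
cov = {'a': {1, 2, 3}, 'b': {3, 4}, 'c': {1}}
f = lambda S: len(set().union(*[cov[e] for e in S])) if S else 0
S = greedy_matroid(cov.keys(), f, lambda S: len(S) <= 2)
```

The fair variant studied in the paper additionally imposes lower and upper bounds on how many selected elements carry each sensitive attribute, which is what breaks this simple greedy argument.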

ICML Conference 2024 Conference Paper

DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving

  • Foteini Strati
  • Sara McAllister
  • Amar Phanishayee
  • Jakub Tarnawski
  • Ana Klimovic

Distributed LLM serving is costly and often underutilizes hardware accelerators due to three key challenges: bubbles in pipeline-parallel deployments caused by the bimodal latency of prompt and token processing, GPU memory overprovisioning, and long recovery times in case of failures. DéjàVu addresses all these challenges using a versatile and efficient KV cache streaming library (DéjàVuLib). Using DéjàVuLib, we propose and implement efficient prompt-token disaggregation to reduce pipeline bubbles, microbatch swapping for efficient GPU memory management, and state replication for fault-tolerance. We highlight the efficacy of these solutions on a range of large models across cloud deployments.

ICLR Conference 2024 Conference Paper

Efficiently Computing Similarities to Private Datasets

  • Arturs Backurs
  • Zinan Lin 0001
  • Sepideh Mahabadi
  • Sandeep Silwal
  • Jakub Tarnawski

Many methods in differentially private model training rely on computing the similarity between a query point (such as public or synthetic data) and private data. We abstract out this common subroutine and study the following fundamental algorithmic problem: Given a similarity function $f$ and a large high-dimensional private dataset $X \subset \mathbb{R}^d$, output a differentially private (DP) data-structure which approximates $\sum_{x \in X} f(x,y)$ for any query $y$. We consider the cases where $f$ is a kernel function, such as $f(x,y) = e^{-\|x-y\|_2^2/\sigma^2}$ (also known as DP kernel density estimation), or a distance function such as $f(x,y) = \|x-y\|_2$, among others. Our theoretical results improve upon prior work and give better privacy-utility trade-offs as well as faster query times for a wide range of kernels and distance functions. The unifying approach behind our results is leveraging "low-dimensional structures" present in the specific functions $f$ that we study, using tools such as provable dimensionality reduction, approximation theory, and one-dimensional decomposition of the functions. Our algorithms empirically exhibit improved query times and accuracy over prior state of the art. We also present an application to DP classification. Our experiments demonstrate that the simple methodology of classifying based on average similarity is orders of magnitude faster than prior DP-SGD based approaches for comparable accuracy.
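A minimal baseline for the kernel case, which the paper's data structures improve upon, is the Laplace mechanism: since each Gaussian-kernel term lies in [0, 1], one datapoint changes the sum by at most 1, so adding Laplace noise of scale 1/ε yields an ε-DP answer for a single query. This sketch is that textbook baseline, not the paper's method; the dataset and query are hypothetical.

```python
import math
import random


def kernel_sum(X, y, sigma):
    """Exact Gaussian-kernel similarity sum: sum_x exp(-||x - y||^2 / sigma^2)."""
    return sum(
        math.exp(-sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / sigma ** 2)
        for x in X
    )


def dp_kernel_sum(X, y, sigma, eps, rng=random):
    """Baseline eps-DP release via the Laplace mechanism.

    Each point contributes at most 1 to the sum (the kernel is in [0, 1]),
    so the sensitivity is 1 and Laplace noise of scale 1/eps suffices.
    """
    u = rng.random() - 0.5
    noise = -(1.0 / eps) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return kernel_sum(X, y, sigma) + noise
```

Answering a fresh query this way spends privacy budget each time; the point of the paper's data structures is to privatize once and then answer arbitrarily many queries.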

ICML Conference 2024 Conference Paper

Integrated Hardware Architecture and Device Placement Search

  • Irene Wang
  • Jakub Tarnawski
  • Amar Phanishayee
  • Divya Mahajan 0001

Distributed execution of deep learning training involves a dynamic interplay between hardware accelerator architecture and device placement strategy. This is the first work to explore the co-optimization of determining the optimal architecture and device placement strategy through novel algorithms, improving the balance of computational resources, memory usage, and data distribution. Our architecture search leverages tensor and vector units, determining their quantity and dimensionality, and on-chip and off-chip memory configurations. It also determines the microbatch size and decides whether to recompute or stash activations, balancing the memory footprint of training and storage size. For each explored architecture configuration, we use an Integer Linear Program (ILP) to find the optimal schedule for executing operators on the accelerator. The ILP results then integrate with a dynamic programming solution to identify the most effective device placement strategy, combining data, pipeline, and tensor model parallelism across multiple accelerators. Our approach achieves higher throughput on large language models compared to the state-of-the-art TPUv4 and the Spotlight accelerator search framework. The entire source code of PHAZE is available at https://github.com/msr-fiddle/phaze.
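One axis of device placement, pipeline parallelism, reduces to splitting a layer sequence into contiguous stages so that the slowest stage is as fast as possible; this admits a simple dynamic program. The sketch below is a toy illustration of that idea under assumed per-layer costs, not PHAZE's ILP-plus-DP formulation, which additionally models memory, recomputation, and tensor parallelism.

```python
import math


def best_pipeline_split(costs, stages):
    """Toy DP for contiguous pipeline partitioning: split the layer cost
    sequence into `stages` contiguous groups minimizing the bottleneck
    (maximum) stage time.  dp[s][i] = best bottleneck achievable using
    s stages for the first i layers."""
    n = len(costs)
    prefix = [0]
    for c in costs:
        prefix.append(prefix[-1] + c)
    dp = [[math.inf] * (n + 1) for _ in range(stages + 1)]
    dp[0][0] = 0
    for s in range(1, stages + 1):
        for i in range(1, n + 1):
            for j in range(s - 1, i):
                # last stage covers layers j..i-1, cost prefix[i] - prefix[j]
                dp[s][i] = min(dp[s][i], max(dp[s - 1][j], prefix[i] - prefix[j]))
    return dp[stages][n]
```

For example, layer costs [1, 2, 3, 4] split into two stages are best partitioned as [1, 2, 3] | [4], giving a bottleneck of 6.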

ICML Conference 2023 Conference Paper

Fairness in Streaming Submodular Maximization over a Matroid Constraint

  • Marwa El Halabi
  • Federico Fusco 0001
  • Ashkan Norouzi-Fard
  • Jakab Tardos
  • Jakub Tarnawski

Streaming submodular maximization is a natural model for the task of selecting a representative subset from a large-scale dataset. If datapoints have sensitive attributes such as gender or race, it becomes important to enforce fairness to avoid bias and discrimination. This has spurred significant interest in developing fair machine learning algorithms. Recently, such algorithms have been developed for monotone submodular maximization under a cardinality constraint. In this paper, we study the natural generalization of this problem to a matroid constraint. We give streaming algorithms as well as impossibility results that provide trade-offs between efficiency, quality and fairness. We validate our findings empirically on a range of well-known real-world applications: exemplar-based clustering, movie recommendation, and maximum coverage in social networks.

SODA Conference 2022 Conference Paper

On the Hardness of Scheduling With Non-Uniform Communication Delays

  • Sami Davies
  • Janardhan Kulkarni
  • Thomas Rothvoss
  • Sai Sandeep
  • Jakub Tarnawski
  • Yihao Zhang

In the problem of scheduling with non-uniform communication delays, the input is a set of jobs with precedence constraints. Associated with every precedence constraint between a pair of jobs is a communication delay, the time duration the scheduler has to wait between the two jobs if they are scheduled on different machines. The objective is to assign the jobs to machines to minimize the makespan of the schedule. Despite being a fundamental problem in theory and a consequential problem in practice, the approximability of scheduling problems with communication delays is not very well understood. One of the top ten open problems in scheduling theory, in the influential list by Schuurman and Woeginger and its latest update by Bansal, asks if the problem admits a constant-factor approximation algorithm. In this paper, we answer this question in the negative by proving a logarithmic hardness for the problem under the standard complexity theory assumption that NP-complete problems do not admit quasi-polynomial-time algorithms. Our hardness result is obtained using a surprisingly simple reduction from a problem that we call Unique Machine Precedence constraints Scheduling (UMPS). We believe that this problem is of central importance in understanding the hardness of many scheduling problems and we conjecture that it is very hard to approximate. Among other things, our conjecture implies a logarithmic hardness of related machine scheduling with precedences, a long-standing open problem in scheduling theory and approximation algorithms.

ICML Conference 2021 Conference Paper

Correlation Clustering in Constant Many Parallel Rounds

  • Vincent Cohen-Addad
  • Silvio Lattanzi
  • Slobodan Mitrovic
  • Ashkan Norouzi-Fard
  • Nikos Parotsidis
  • Jakub Tarnawski

Correlation clustering is a central topic in unsupervised learning, with many applications in ML and data mining. In correlation clustering, one receives as input a signed graph and the goal is to partition it to minimize the number of disagreements. In this work we propose a massively parallel computation (MPC) algorithm for this problem that is considerably faster than prior work. In particular, our algorithm uses machines with memory sublinear in the number of nodes in the graph and returns a constant approximation while running only for a constant number of rounds. To the best of our knowledge, our algorithm is the first that can provably approximate a clustering problem using only a constant number of MPC rounds in the sublinear memory regime. We complement our analysis with an experimental scalability evaluation of our techniques.
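The classic sequential baseline for this problem is the Pivot algorithm of Ailon, Charikar, and Newman, an expected 3-approximation; the paper's contribution is achieving a comparable guarantee in constantly many MPC rounds, which this sequential sketch does not capture. The graph in the usage example is hypothetical.

```python
import random


def pivot_correlation_clustering(nodes, positive_edges, rng=random):
    """Sequential Pivot algorithm: repeatedly pick a random unclustered node
    and cluster it together with its unclustered '+' neighbours.  Expected
    3-approximation for minimizing disagreements (Ailon et al., 2008)."""
    plus = {v: set() for v in nodes}
    for u, v in positive_edges:
        plus[u].add(v)
        plus[v].add(u)
    unclustered = set(nodes)
    clusters = []
    order = list(nodes)
    rng.shuffle(order)  # random pivot order drives the approximation guarantee
    for pivot in order:
        if pivot not in unclustered:
            continue
        cluster = {pivot} | (plus[pivot] & unclustered)
        clusters.append(cluster)
        unclustered -= cluster
    return clusters
```

The difficulty in the MPC setting is that these pivot choices are inherently sequential; simulating them with sublinear memory per machine is what the paper's algorithm addresses.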

SODA Conference 2021 Conference Paper

Scheduling with Communication Delays via LP Hierarchies and Clustering II: Weighted Completion Times on Related Machines

  • Sami Davies
  • Janardhan Kulkarni
  • Thomas Rothvoss
  • Jakub Tarnawski
  • Yihao Zhang

We consider the problem of scheduling jobs with precedence constraints on related machines to minimize the weighted sum of completion times, in the presence of communication delays. In this setting, denoted by Q | prec, c | Σ_j w_j C_j, if two dependent jobs are scheduled on different machines, then at least c units of communication delay time must pass between their executions. Our main result is an O(log^4 n)-approximation algorithm for the problem. As a byproduct of our result, we also obtain an O(log^3 n)-approximation algorithm for the problem of minimizing makespan Q | prec, c | C_max, which improves upon the O(log^5 n / log log n)-approximation algorithm due to a recent work of Maiti et al. [MRS+20].

SODA Conference 2020 Conference Paper

Hierarchy-Based Algorithms for Minimizing Makespan under Precedence and Communication Constraints

  • Janardhan Kulkarni
  • Shi Li 0001
  • Jakub Tarnawski
  • Minwei Ye

We consider the classic problem of scheduling jobs with precedence constraints on a set of identical machines to minimize the makespan objective function. Understanding the exact approximability of the problem when the number of machines is a constant is a well-known question in scheduling theory. Indeed, an outstanding open problem from the classic book of Garey and Johnson [9] asks whether this problem is NP-hard even in the case of 3 machines and unit-length jobs. In a recent breakthrough, Levey and Rothvoss [24] gave a (1 + ε)-approximation algorithm, which runs in nearly quasi-polynomial time, for the case when jobs have unit lengths. However, a substantially more difficult case where jobs have arbitrary processing lengths has remained open. We make progress on this more general problem. We show that there exists a (1 + ε)-approximation algorithm (with similar running time as that of [24]) for the nonmigratory setting: when every job has to be scheduled entirely on a single machine, but within a machine the job need not be scheduled during consecutive time steps. Further, we also show that our algorithmic framework generalizes to another classic scenario where, along with the precedence constraints, the jobs also have communication delay constraints. Both of these fundamental problems are highly relevant to the practice of datacenter scheduling.

FOCS Conference 2020 Conference Paper

Scheduling with Communication Delays via LP Hierarchies and Clustering

  • Sami Davies
  • Janardhan Kulkarni
  • Thomas Rothvoss
  • Jakub Tarnawski
  • Yihao Zhang

We consider the classic problem of scheduling jobs with precedence constraints on identical machines to minimize makespan, in the presence of communication delays. In this setting, denoted by P | prec, c | C_max, if two dependent jobs are scheduled on different machines, then at least c units of time must pass between their executions. Despite its relevance to many applications, this model remains one of the most poorly understood in scheduling theory. Even for a special case where an unlimited number of machines is available, the best known approximation ratio is 2/3·(c+1), whereas Graham's greedy list scheduling algorithm already gives a (c+1)-approximation in that setting. An outstanding open problem in the top-10 list by Schuurman and Woeginger and its recent update by Bansal asks whether there exists a constant-factor approximation algorithm. In this work we give a polynomial-time O(log c · log m)-approximation algorithm for this problem, where m is the number of machines and c is the communication delay. Our approach is based on a Sherali-Adams lift of a linear programming relaxation and a randomized clustering of the semimetric space induced by this lift. The full version of this paper is available on arXiv.
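To see concretely how communication delays stretch the critical path, consider the simplest baseline in the unlimited-machine setting: place every job on its own machine and start it c time units after each predecessor finishes. The sketch below is that naive baseline for unit-length jobs given in topological order (both assumptions), not the paper's Sherali-Adams-based algorithm or Graham's list scheduling.

```python
def naive_delay_schedule(jobs, preds, c):
    """Naive schedule for precedence-constrained unit jobs with communication
    delay c on unlimited machines: every job runs on its own machine and
    starts c time units after each of its predecessors completes.

    `jobs` must be in topological order; `preds` maps a job to its
    predecessor list.  Returns start times and the resulting makespan.
    """
    start = {}
    for j in jobs:
        # earliest start: every predecessor's completion time plus the delay
        start[j] = max((start[p] + 1 + c for p in preds.get(j, [])), default=0)
    makespan = max(start[j] + 1 for j in jobs)
    return start, makespan
```

On a chain a → b → c with c = 2, each hop costs 1 + 2 time units, giving a makespan of 7; a scheduler that keeps the chain on one machine would finish at time 3, which is exactly the tension that makes the problem hard.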

ICML Conference 2018 Conference Paper

Beyond 1/2-Approximation for Submodular Maximization on Massive Data Streams

  • Ashkan Norouzi-Fard
  • Jakub Tarnawski
  • Slobodan Mitrovic
  • Amir Zandieh
  • Aida Mousavifar
  • Ola Svensson

Many tasks in machine learning and data mining, such as data diversification, non-parametric learning, kernel machines, clustering, etc., require extracting a small but representative summary from a massive dataset. Often, such problems can be posed as maximizing a submodular set function subject to a cardinality constraint. We consider this question in the streaming setting, where elements arrive over time at a fast pace and thus we need to design an efficient, low-memory algorithm. One such method, proposed by Badanidiyuru et al. (2014), always finds a 0.5-approximate solution. Can this approximation factor be improved? We answer this question affirmatively by designing a new algorithm Salsa for streaming submodular maximization. It is the first low-memory, single-pass algorithm that improves the factor 0.5, under the natural assumption that elements arrive in a random order. We also show that this assumption is necessary, i.e., that there is no such algorithm with better than 0.5-approximation when elements arrive in arbitrary order. Our experiments demonstrate that Salsa significantly outperforms the state of the art in applications related to exemplar-based clustering, social graph analysis, and recommender systems.
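The core subroutine of the Badanidiyuru et al. (2014) sieve-streaming baseline referenced above is a one-pass threshold rule: accept an arriving element iff its marginal gain clears a threshold τ and the summary still has room. The sketch below shows that single-threshold subroutine only; the full algorithm runs many thresholds in parallel, and Salsa's improvement under random arrival order is more involved still. The coverage instance in the usage is a hypothetical example.

```python
def threshold_stream(stream, f, k, tau):
    """One-pass thresholding for streaming submodular maximization: keep an
    arriving element iff its marginal gain is at least tau and fewer than
    k elements are currently held."""
    S = []
    for e in stream:
        if len(S) >= k:
            break  # summary is full; ignore the rest of the stream
        if f(S + [e]) - f(S) >= tau:
            S.append(e)
    return S


# Usage: coverage function over a small hypothetical stream.
cov = {'a': {1}, 'b': {2, 3}, 'c': {4, 5, 6}}
f = lambda S: len(set().union(*[cov[e] for e in S])) if S else 0
summary = threshold_stream(['a', 'b', 'c'], f, k=2, tau=2)
```

Running with a geometric grid of thresholds and returning the best of the resulting candidate sets is what yields the 0.5 guarantee in the worst case.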

NeurIPS Conference 2017 Conference Paper

Streaming Robust Submodular Maximization: A Partitioned Thresholding Approach

  • Slobodan Mitrovic
  • Ilija Bogunovic
  • Ashkan Norouzi-Fard
  • Jakub Tarnawski
  • Volkan Cevher

We study the classical problem of maximizing a monotone submodular function subject to a cardinality constraint k, with two additional twists: (i) elements arrive in a streaming fashion, and (ii) m items from the algorithm’s memory are removed after the stream is finished. We develop a robust submodular algorithm STAR-T. It is based on a novel partitioning structure and an exponentially decreasing thresholding rule. STAR-T makes one pass over the data and retains a short but robust summary. We show that after the removal of any m elements from the obtained summary, a simple greedy algorithm STAR-T-GREEDY that runs on the remaining elements achieves a constant-factor approximation guarantee. In two different data summarization tasks, we demonstrate that it matches or outperforms existing greedy and streaming methods, even if they are allowed the benefit of knowing the removed subset in advance.

FOCS Conference 2017 Conference Paper

The Matching Problem in General Graphs Is in Quasi-NC

  • Ola Svensson
  • Jakub Tarnawski

We show that the perfect matching problem in general graphs is in Quasi-NC. That is, we give a deterministic parallel algorithm which runs in O(log^3 n) time on n^{O(log^2 n)} processors. The result is obtained by a derandomization of the Isolation Lemma for perfect matchings, which was introduced in the classic paper by Mulmuley, Vazirani and Vazirani [1987] to obtain a Randomized NC algorithm. Our proof extends the framework of Fenner, Gurjar and Thierauf [2016], who proved the analogous result in the special case of bipartite graphs. Compared to that setting, several new ingredients are needed due to the significantly more complex structure of perfect matchings in general graphs. In particular, our proof heavily relies on the laminar structure of the faces of the perfect matching polytope.
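The randomized algorithm being derandomized here rests on Lovász's algebraic test: a graph has a perfect matching iff its Tutte matrix (skew-symmetric, with an indeterminate per edge) has nonzero determinant as a polynomial, and substituting random values detects this with high probability. The sketch below is that one-sided-error randomized test, not the paper's deterministic parallel algorithm; the example graphs in the test are hypothetical.

```python
import random
from fractions import Fraction


def has_perfect_matching(n, edges, rng=random, trials=3):
    """Lovász's randomized perfect-matching test: substitute random values
    into the Tutte matrix and check det != 0.  May err only by answering
    'no' on a graph that has a perfect matching; repeating lowers that risk."""

    def det(M):  # exact Gaussian elimination over the rationals
        M = [row[:] for row in M]
        d = Fraction(1)
        for i in range(n):
            piv = next((r for r in range(i, n) if M[r][i] != 0), None)
            if piv is None:
                return Fraction(0)
            if piv != i:
                M[i], M[piv] = M[piv], M[i]
                d = -d
            d *= M[i][i]
            inv = Fraction(1) / M[i][i]
            for r in range(i + 1, n):
                factor = M[r][i] * inv
                for col in range(i, n):
                    M[r][col] -= factor * M[i][col]
        return d

    for _ in range(trials):
        # Tutte matrix with random substitutions: T[u][v] = x_e, T[v][u] = -x_e
        T = [[Fraction(0)] * n for _ in range(n)]
        for (u, v) in edges:
            x = Fraction(rng.randint(1, 10 ** 6))
            T[u][v], T[v][u] = x, -x
        if det(T) != 0:
            return True
    return False
```

The Isolation Lemma enters when one wants to actually construct a matching in parallel; derandomizing it deterministically is the paper's contribution.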

SODA Conference 2017 Conference Paper

Unrelated Machine Scheduling of Jobs with Uniform Smith Ratios

  • Christos Kalaitzis
  • Ola Svensson
  • Jakub Tarnawski

We consider the classic problem of scheduling jobs on unrelated machines so as to minimize the weighted sum of completion times. Recently, for a small constant ε > 0, Bansal et al. gave a (3/2 − ε)-approximation algorithm improving upon the "natural" barrier of 3/2 which follows from independent randomized rounding. In simplified terms, their result is obtained by an enhancement of independent randomized rounding via strong negative correlation properties. In this work, we take a different approach and propose to use the same elegant rounding scheme for the weighted completion time objective as devised by Shmoys and Tardos for optimizing a linear function subject to makespan constraints. Our main result is a 1.21-approximation algorithm for the natural special case where the weight of a job is proportional to its processing time (specifically, all jobs have the same Smith ratio), which expresses the notion that each unit of work has the same weight. In addition, as a direct consequence of the rounding, our algorithm also achieves a bi-criteria 2-approximation for the makespan objective. Our technical contribution is a tight analysis of the expected cost of the solution compared to the one given by the Configuration-LP relaxation: we reduce this task to that of understanding certain worst-case instances which are simple to analyze.
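The Smith ratio w_j/p_j referenced above comes from Smith's rule: on a single machine, sequencing jobs in non-increasing order of w_j/p_j minimizes the weighted sum of completion times. The sketch below is that single-machine rule, shown for intuition about the "uniform Smith ratio" special case; it is not the paper's unrelated-machines rounding, and the job data are hypothetical.

```python
def wspt_schedule(jobs):
    """Smith's rule (weighted shortest processing time first): on one machine,
    sort jobs by non-increasing weight/processing-time ratio.  Returns the
    order and the resulting weighted sum of completion times.

    `jobs` is a list of (weight, processing_time) pairs.
    """
    order = sorted(jobs, key=lambda j: j[1] / j[0])  # ascending p/w = descending w/p
    t, total = 0, 0
    for w, p in order:
        t += p          # completion time of this job
        total += w * t  # accumulate weighted completion time
    return order, total
```

When all jobs share the same Smith ratio (w_j proportional to p_j), every sequencing of a fixed job set on one machine has the same objective, which is what makes the machine-assignment decision the entire difficulty in the paper's setting.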