Arrow Research search

Author name cluster

Sudipto Guha

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

35 papers
1 author row

Possible papers

35

ICML Conference 2018 Conference Paper

Semi-Supervised Learning on Data Streams via Temporal Label Propagation

  • Tal Wagner
  • Sudipto Guha
  • Shiva Prasad Kasiviswanathan
  • Nina Mishra

We consider the problem of labeling points on a fast-moving data stream when only a small number of labeled examples are available. In our setting, incoming points must be processed efficiently and the stream is too large to store in its entirety. We present a semi-supervised learning algorithm for this task. The algorithm maintains a small synopsis of the stream which can be quickly updated as new points arrive, and labels every incoming point by provably learning from the full history of the stream. Experiments on real datasets validate that the algorithm can quickly and accurately classify points on a stream with a small quantity of labeled examples.

ICML Conference 2016 Conference Paper

Robust Random Cut Forest Based Anomaly Detection on Streams

  • Sudipto Guha
  • Nina Mishra
  • Gourav Roy
  • Okke Schrijvers

In this paper we focus on the anomaly detection problem for dynamic data streams through the lens of random cut forests. We investigate a robust random cut data structure that can be used as a sketch or synopsis of the input stream. We provide a plausible definition of non-parametric anomalies based on the influence of an unseen point on the remainder of the data, i. e. , the externality imposed by that point. We show how the sketch can be efficiently updated in a dynamic data stream. We demonstrate the viability of the algorithm on publicly available real data.

ICML Conference 2015 Conference Paper

Correlation Clustering in Data Streams

  • Kook Jin Ahn
  • Graham Cormode
  • Sudipto Guha
  • Andrew McGregor 0001
  • Anthony Wirth

In this paper, we address the problem of \emphcorrelation clustering in the dynamic data stream model. The stream consists of updates to the edge weights of a graph on n nodes and the goal is to find a node-partition such that the end-points of negative-weight edges are typically in different clusters whereas the end-points of positive-weight edges are typically in the same cluster. We present polynomial-time, O(n⋅\textpolylog n)-space approximation algorithms for natural problems that arise. We first develop data structures based on linear sketches that allow the “quality” of a given node-partition to be measured. We then combine these data structures with convex programming and sampling techniques to solve the relevant approximation problem. However the standard LP and SDP formulations are not obviously solvable in O(n⋅\textpolylog n)-space. Our work presents space-efficient algorithms for the convex programming required, as well as approaches to reduce the adaptivity of the sampling. Note that the improved space and running-time bounds achieved from streaming algorithms are also useful for offline settings such as MapReduce models.

SODA Conference 2014 Conference Paper

Near Linear Time Approximation Schemes for Uncapacitated and Capacitated b-Matching Problems in Nonbipartite Graphs

  • Kook Jin Ahn
  • Sudipto Guha

We present the first fully polynomial approximation schemes for the maximum weighted (uncapacitated or capacitated) b –Matching problem for nonbipartite graphs that run in time (near) linear in the number of edges, that is, given any δ > 0 the algorithm produces a (1 – δ ) approximation in O ( m poly( δ −1, log n )) time. We provide fractional solutions for the standard linear programming formulations for these problems and subsequently also provide fully polynomial (near) linear time approximation schemes for rounding the fractional solutions. Through these problems as a vehicle, we also present several ideas in the context of solving linear programs approximately using fast primal-dual algorithms. First, we show that approximation algorithms can be used to reduce the width of the formulation, and as a consequence we induce faster convergence. Second, even though the dual of these problems have exponentially many variables and an efficient exact computation of dual weights is infeasible, we can efficiently compute and use a sparse approximation of the dual weights using a combination of (i) adding perturbation to the constraints of the polytope and (ii) amplification followed by thresholding of the dual weights. These algorithms also have the advantage that they use O ( n poly( δ −1, log n )) storage space and only make O ( δ −4 log (1/ δ )log n ) (or better) passes over a read only list of edges. These algorithms therefore can be run in the semi-streaming model and serve as exemplars where algorithms and ideas developed for the streaming model gives us algorithms for combinatorial optimization problems that were not known in absence of the streaming constraints.

SODA Conference 2012 Conference Paper

Analyzing graph structure via linear measurements

  • Kook Jin Ahn
  • Sudipto Guha
  • Andrew McGregor 0001

We initiate the study of graph sketching, i. e. , algorithms that use a limited number of linear measurements of a graph to determine the properties of the graph. While a graph on n nodes is essentially O ( n 2 )-dimensional, we show the existence of a distribution over random projections into d -dimensional “sketch” space ( d ≪ n 2 ) such that the relevant properties of the original graph can be inferred from the sketch with high probability. Specifically, we show that: 1. d = O ( n · polylog n ) suffices to evaluate properties including connectivity, k -connectivity, bipartiteness, and to return any constant approximation of the weight of the minimum spanning tree. 2. d = O ( n 1+γ ) suffices to compute graph sparsifiers, the exact MST, and approximate the maximum weighted matchings if we permit O (1/γ)-round adaptive sketches, i. e. , a sequence of projections where each projection may be chosen dependent on the outcome of earlier sketches.

FOCS Conference 2007 Conference Paper

Approximation Algorithms for Partial-Information Based Stochastic Control with Markovian Rewards

  • Sudipto Guha
  • Kamesh Munagala

We consider a variant of the classic multi-armed bandit problem (MAB), which we call feedback MAB, where the reward obtained by playing each of n independent arms varies according to an underlying on/off Markov process with known parameters. The evolution of the Markov chain happens irrespective of whether the arm is played, and furthermore, the exact state of the Markov chain is only revealed to the player when the arm is played and the reward observed. At most one arm (or in general, M arms) can be played any time step. The goal is to design a policy for playing the arms in order to maximize the infinite horizon time average expected reward. This problem is an instance of a partially observable Markov decision process (POMDP), and a special case of the notoriously intractable "restless bandit" problem. Unlike the stochastic MAB problem, the feedback MAB problem does not admit to greedy index-based optimal policies. Vie state of the system at any time step encodes the beliefs about the states of different arms, and the policy decisions change these beliefs - this aspect complicates the design and analysis of simple algorithms. We design a constant factor approximation to the feedback MAB problem by solving and rounding a natural LP relaxation to this problem. As far as we are aware, this is the first approximation algorithm for a POMDP problem.

FOCS Conference 2004 Conference Paper

Machine Minimization for Scheduling Jobs with Interval Constraints

  • Julia Chuzhoy
  • Sudipto Guha
  • Sanjeev Khanna
  • Joseph Naor

The problem of scheduling jobs with interval constraints is a well-studied classical scheduling problem. The input to the problem is a collection of n jobs where each job has a set of intervals on which it can be scheduled. The goal is to minimize the total number of machines needed to schedule all jobs subject to these interval constraints. In the continuous version, the allowed intervals associated with a job form a continuous time segment, described by a release date and a deadline. In the discrete version of the problem, the set of allowed intervals for a job is given explicitly. So far, only an O(log n/( log log n))-approximation is known for either version of the problem, obtained by a randomized rounding of a natural linear programming relaxation of the problem. In fact, we show here that this analysis is tight for both versions of the problem by providing a matching lower bound on the integrality gap of the linear program. Moreover, even when all jobs can be scheduled on a single machine, the discrete case has recently been shown to be /spl Omega/(log log n)-hard to approximate. In this paper, we provide improved approximation factors for the number of machines needed to schedule all jobs in the continuous version of the problem. Our main result is an O(1)-approximation algorithm when the optimal number of machines needed is bounded by a fixed constant. Thus, our results separate the approximability of the continuous and the discrete cases of the problem. For general instances, we strengthen the natural linear programming relaxation in a recursive manner by forbidding certain configurations which cannot arise in an integral feasible solution. This yields an O(OPT)-approximation, where OPT denotes the number of machines needed by an optimal solution. Combined with earlier results, our work implies an O(/spl radic/log n/(log log n))-approximation for any value of OPT.

FOCS Conference 2000 Conference Paper

Clustering Data Streams

  • Sudipto Guha
  • Nina Mishra
  • Rajeev Motwani 0001
  • Liadan O'Callaghan

We study clustering under the data stream model of computation where: given a sequence of points, the objective is to maintain a consistently good clustering of the sequence observed so far, using a small amount of memory and time. The data stream model is relevant to new classes of applications involving massive data sets, such as Web click stream analysis and multimedia data analysis. We give constant-factor approximation algorithms for the k-median problem in the data stream model of computation in a single pass. We also show negative results implying that our algorithms cannot be improved in a certain sense.

FOCS Conference 2000 Conference Paper

Hierarchical Placement and Network Design Problems

  • Sudipto Guha
  • Adam Meyerson
  • Kamesh Munagala

Gives constant approximations for a number of layered network design problems. We begin by modeling hierarchical caching, where the caches are placed in layers and each layer satisfies a fixed percentage of the demand (bounded miss rates). We present a constant approximation to the minimum total cost of placing the caches and to the routing demand through the layers. We extend this model to cover more general layered caching scenarios, giving a constant combinatorial approximation to the well-studied multi-level facility location problem. We consider a facility location variant, the load-balanced facility location problem, in which every demand is served by a unique facility and each open facility must serve at least a certain amount of demand. By combining load-balanced facility location with our results on hierarchical caching, we give a constant approximation for the access network design problem.

FOCS Conference 2000 Conference Paper

Nested Graph Dissection and Approximation Algorithms

  • Sudipto Guha

This paper considers approximation algorithms for graph completion problems using the nested dissection paradigm. Given a super-additive function of interest (the smallest planar or chordal extension for example) and a test that relates it to an upper bound of the smallest separator, we provide a framework how to dissect the graph recursively such that no subgraph has more than half the value of its parent, (or is indistinguishable via separator tests) in polynomial time. Interestingly we cannot bound such a function till we have constructed the entire nested dissection. We achieve a partition of the graph with respect to a constant number of such unknown estimator functions simultaneously. Using the framework the paper presents improvements in approximating the chordal completion size (by a factor of log n), operation count (by a factor of log/sup 2/ n and the polynomial term depending on degree) and elimination height. We show that there exists a nested dissection ordering that simultaneously minimizes the elimination height, chordal completion, operation count to within O(log n) factors of the best possible (which may be obtained by three independent orderings) improving the previous existence theorem by factors of log n and d/sup 1/3/ log/sup 3/ n for the latter two. We also show that graphs with small crossing number or fill-in have better approximations of the elimination height, completion and operation count. As a consequence we can approximate the pathwidth, cutwidth, vertex ranking problems better for such graphs. The paper also improves, in some cases, the approximation results of minimum drawing size (number of vertices plus the crossing number) of a planar embedding of a graph, and its layout area on a grid.

FOCS Conference 1999 Conference Paper

Improved Combinatorial Algorithms for the Facility Location and k-Median Problems

  • Moses Charikar
  • Sudipto Guha

We present improved combinatorial approximation algorithms for the uncapacitated facility location and k-median problems. Two central ideas in most of our results are cost scaling and greedy improvement. We present a simple greedy local search algorithm which achieves an approximation ratio of 2. 414+/spl epsiv/ in O/spl tilde/(n/sup 2///spl epsiv/) time. This also yields a bicriteria approximation tradeoff of (1+/spl gamma/, 1+2//spl gamma/) for facility cost versus service cost which is better than previously known tradeoffs and close to the best possible. Combining greedy improvement and cost scaling with a recent primal dual algorithm for facility location due to K. Jain and V. Vazirani (1999), we get an approximation ratio of 1. 853 in O/spl tilde/(n/sup 3/) time. This is already very close to the approximation guarantee of the best known algorithm which is LP-based. Further combined with the best known LP-based algorithm for facility location, we get a very slight improvement in the approximation factor for facility location, achieving 1. 728. We present improved approximation algorithms for capacitated facility location and a variant. We also present a 4-approximation for the k-median problem, using similar ideas, building on the 6-approximation of Jain and Vazirani. The algorithm runs in O/spl tilde/(n/sup 3/) time.

FOCS Conference 1998 Conference Paper

Approximating a Finite Metric by a Small Number of Tree Metrics

  • Moses Charikar
  • Chandra Chekuri
  • Ashish Goel
  • Sudipto Guha
  • Serge A. Plotkin

Y. Bartal (1996, 1998) gave a randomized polynomial time algorithm that given any n point metric G, constructs a tree T such that the expected stretch (distortion) of any edge is at most O (log n log log n). His result has found several applications and in particular has resulted in approximation algorithms for many graph optimization problems. However approximation algorithms based on his result are inherently randomized. In this paper we derandomize the use of Bartal's algorithm in the design of approximation algorithms. We give an efficient polynomial time algorithm that given a finite n point metric G, constructs O(n log n) trees and a probability distribution /spl mu/ on them such that the expected stretch of any edge of G in a tree chosen according to /spl mu/ is at most O(log n log log n). Our result establishes that finite metrics can be probabilistically approximated by a small number of tree metrics. We obtain the first deterministic approximation algorithms for buy-at-bulk network design and vehicle routing; in addition we subsume results from our earlier work on derandomization. Our main result is obtained by a novel view of probabilistic approximation of metric spaces as a deterministic optimization problem via linear programming.