How many Clusters? - An algorithmic answer

Chiranjib Bhattacharyya; Ravindran Kannan; Amit Kumar 0001

Back to SODA

SODA 2022

How many Clusters? - An algorithmic answer

Conference Paper Accepted Paper Algorithms and Complexity · Theoretical Computer Science

Details

Abstract

Many algorithms for clustering high dimensional data assume that k, the number of clusters, is given. However, there has been little work on provably inferring k from the data. This paper gives polynomial time algorithms for finding k from the data assuming it satisfies certain natural deterministic conditions. Informally, we assume that there is a Ground Truth (GT) clustering of the data with the following properties: (i) Each cluster has a certain minimum size, (ii) the inter-mean separation of any two distinct clusters in the GT is large enough (although still weaker than what is typically assumed in the literature), and (iii) we define a novel “no large sub-cluster” (NLSC) property that characterizes the notion of a cluster by stipulating that there be no subsets of low “directional variance”. NLSC is indeed satisfied by large class of distributions including log-concave densities. The first major contribution is an algorithm for finding k where m, the minimum GT cluster size, is assumed to be known. This algorithm uses a novel rounding procedure which finds subsets of size m with low Directional Variance by rounding a SDP relaxation using Cheeger's inequality and it is shown that k is precisely the number of such sets whose means are well-separated. The harder problem of finding k when m not given is addressed by running the previous algorithm for each value of m to find candidate values of k and the corresponding k -clustering. The second major contribution of this paper is a test which certifies the correct candidate thereby yielding a polynomial time algorithm which finds k.

How many Clusters? - An algorithmic answer

Abstract

Authors

Keywords

Context