SODA 2022
How many Clusters? - An algorithmic answer
Abstract
Many algorithms for clustering high dimensional data assume that k, the number of clusters, is given. However, there has been little work on provably inferring k from the data. This paper gives polynomial time algorithms for finding k from the data assuming it satisfies certain natural deterministic conditions. Informally, we assume that there is a Ground Truth (GT) clustering of the data with the following properties: (i) Each cluster has a certain minimum size, (ii) the inter-mean separation of any two distinct clusters in the GT is large enough (although still weaker than what is typically assumed in the literature), and (iii) we define a novel “no large sub-cluster” (NLSC) property that characterizes the notion of a cluster by stipulating that there be no subsets of low “directional variance”. NLSC is indeed satisfied by large class of distributions including log-concave densities. The first major contribution is an algorithm for finding k where m, the minimum GT cluster size, is assumed to be known. This algorithm uses a novel rounding procedure which finds subsets of size m with low Directional Variance by rounding a SDP relaxation using Cheeger's inequality and it is shown that k is precisely the number of such sets whose means are well-separated. The harder problem of finding k when m not given is addressed by running the previous algorithm for each value of m to find candidate values of k and the corresponding k -clustering. The second major contribution of this paper is a test which certifies the correct candidate thereby yielding a polynomial time algorithm which finds k.
Authors
Keywords
No keywords are indexed for this paper.
Context
- Venue
- ACM-SIAM Symposium on Discrete Algorithms
- Archive span
- 1990-2025
- Indexed papers
- 4674
- Paper id
- 334338797571363037