Arrow Research

Author name cluster

Jinhui Xu 0001

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

21 papers

Possible papers

UAI Conference 2025 Conference Paper

Nearly Optimal Differentially Private ReLU Regression

  • Meng Ding
  • Mingxi Lei
  • Shaowei Wang 0003
  • Tianhang Zheng
  • Di Wang 0015
  • Jinhui Xu 0001

In this paper, we investigate one of the most fundamental non-convex learning problems, ReLU regression, in the Differential Privacy (DP) model. Previous studies on private ReLU regression heavily rely on stringent assumptions, such as constant-bounded norms for feature vectors and labels. We relax these assumptions to a more standard setting, where data can be i.i.d. sampled from $O(1)$-sub-Gaussian distributions. We first show that when $\varepsilon = \tilde{O}(\sqrt{\frac{1}{N}})$ and there is some public data, it is possible to achieve an upper bound of $\tilde{O}(\frac{d^2}{N^2 \varepsilon^2})$ for the excess population risk in $(\varepsilon, \delta)$-DP, where $d$ is the dimension and $N$ is the number of data samples. Moreover, we relax the requirement of $\varepsilon$ and public data by proposing and analyzing a one-pass mini-batch Generalized Linear Model Perceptron algorithm (DP-MBGLMtron). Additionally, using the tracing attack argument technique, we demonstrate that the minimax rate of the estimation error for $(\varepsilon, \delta)$-DP algorithms is lower bounded by $\Omega(\frac{d^2}{N^2 \varepsilon^2})$. This shows that DP-MBGLMtron achieves the optimal utility bound up to logarithmic factors. Experiments further support our theoretical results.
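
To make the flavor of the algorithm concrete, here is a minimal sketch of a one-pass, mini-batch GLMtron-style update for ReLU regression with per-sample gradient clipping and Gaussian noise, in the spirit of DP-MBGLMtron. The clipping threshold, learning rate, and per-batch noise calibration below are illustrative assumptions, not the paper's tuned parameters or its exact privacy accounting.

```python
import numpy as np

def dp_minibatch_glmtron(X, y, epsilon=1.0, delta=1e-5, batch_size=64,
                         clip=1.0, lr=0.1):
    """One pass of GLMtron-style updates for ReLU regression.

    Illustrative sketch: per-sample gradients are clipped to norm `clip`
    and Gaussian noise is added to each mini-batch gradient.
    """
    n, d = X.shape
    w = np.zeros(d)
    # Gaussian-mechanism scale for a per-batch sensitivity of 2*clip/batch_size.
    sigma = 2 * clip * np.sqrt(2 * np.log(1.25 / delta)) / (epsilon * batch_size)
    for start in range(0, n - batch_size + 1, batch_size):
        Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        residual = np.maximum(Xb @ w, 0.0) - yb        # ReLU prediction error
        grads = residual[:, None] * Xb                 # per-sample gradients
        norms = np.linalg.norm(grads, axis=1, keepdims=True)
        grads *= np.minimum(1.0, clip / np.maximum(norms, 1e-12))  # clip
        noisy_grad = grads.mean(axis=0) + np.random.normal(0.0, sigma, d)
        w -= lr * noisy_grad
    return w

# Toy usage: recover a planted parameter from noisy ReLU responses.
rng = np.random.default_rng(0)
n, d = 5000, 10
X = rng.normal(size=(n, d)) / np.sqrt(d)
w_star = rng.normal(size=d)
y = np.maximum(X @ w_star, 0.0) + 0.01 * rng.normal(size=n)
w_hat = dp_minibatch_glmtron(X, y)
print("estimation error:", np.linalg.norm(w_hat - w_star))
```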

ICLR Conference 2025 Conference Paper

New Algorithms for the Learning-Augmented k-means Problem

  • Junyu Huang
  • Qilong Feng
  • Ziyun Huang 0001
  • Zhen Zhang 0025
  • Jinhui Xu 0001
  • Jianxin Wang 0001

In this paper, we study clustering problems in the learning-augmented setting, where predicted labels for a d-dimensional dataset of size m are given by an oracle to serve as auxiliary information for improving clustering performance. Following prior work, the given oracle is parameterized by some error rate α, which captures its accuracy: at most an α fraction of the points in each predicted cluster are false positives or false negatives. In this setting, the goal is to design fast and practical algorithms that can break the computational barriers of inapproximability. The current state-of-the-art learning-augmented k-means algorithm relies on sorting strategies to find good coordinate approximations, achieving a (1+O(α))-approximation in near-linear running time in the data size. However, the computational demands of sorting may limit the scalability of the algorithm on large-scale datasets. To address this issue, we propose new algorithms that identify good coordinate approximations using sampling-based strategies, achieving a (1+O(α))-approximation in linear running time in the data size. To obtain a more practical algorithm with better clustering quality and running time, we also propose a sampling-based heuristic that directly finds center approximations. Empirical experiments show that our proposed methods are faster than state-of-the-art learning-augmented k-means algorithms while achieving comparable clustering quality.
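
As a rough illustration of the sampling idea, the sketch below estimates each center by averaging a small uniform sample of the points sharing an oracle-predicted label, rather than sorting coordinates, so the work stays linear in the data size. The sample size, the toy oracle, and the cost evaluation are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def sampled_centers(X, predicted_labels, k, sample_size=100, seed=0):
    """Estimate k centers from oracle-predicted labels via uniform sampling.

    For each predicted cluster, average a small random sample instead of
    sorting all coordinates; total work is linear in the data size.
    """
    rng = np.random.default_rng(seed)
    centers = np.zeros((k, X.shape[1]))
    for c in range(k):
        idx = np.flatnonzero(predicted_labels == c)
        take = rng.choice(idx, size=min(sample_size, idx.size), replace=False)
        centers[c] = X[take].mean(axis=0)
    return centers

# Toy usage with a noisy label oracle (10% of labels flipped).
rng = np.random.default_rng(1)
k, per = 3, 500
X = np.concatenate([rng.normal(loc=5 * c, size=(per, 2)) for c in range(k)])
truth = np.repeat(np.arange(k), per)
noisy = truth.copy()
flip = rng.random(noisy.size) < 0.1
noisy[flip] = rng.integers(0, k, size=flip.sum())
centers = sampled_centers(X, noisy, k)
cost = ((X - centers[truth]) ** 2).sum()
print("k-means cost with sampled centers:", round(cost, 1))
```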

ICLR Conference 2025 Conference Paper

TTVD: Towards a Geometric Framework for Test-Time Adaptation Based on Voronoi Diagram

  • Mingxi Lei
  • Chunwei Ma
  • Meng Ding
  • Yufan Zhou 0001
  • Ziyun Huang 0001
  • Jinhui Xu 0001

Deep learning models often struggle to generalize when deployed on real-world data, due to distributional shift from the training data. Test-time adaptation (TTA) is an emerging scheme used at inference time to address this issue. In TTA, models are adapted online while making predictions on test data. Neighbor-based approaches have gained attention recently, where prototype embeddings provide location information to alleviate the feature shift between training and testing data. However, due to their inherent simplicity, they often struggle to learn useful patterns and encounter performance degradation. To confront this challenge, we study the TTA problem from a geometric point of view. We first reveal that the underlying structure of neighbor-based methods aligns with the Voronoi Diagram, a classical computational geometry model for space partitioning. Building on this observation, we propose Test-Time adjustment by Voronoi Diagram guidance (TTVD), a novel framework that leverages the benefits of this geometric property. Specifically, we explore two key structures: 1) Cluster-induced Voronoi Diagram (CIVD): this integrates the joint contribution of self-supervision and entropy-based methods to provide richer information. 2) Power Diagram (PD): a generalized version of the Voronoi Diagram that refines partitions by assigning weights to each Voronoi cell. Our experiments under rigorous settings on CIFAR-10-C, CIFAR-100-C, ImageNet-C, and ImageNet-R show that TTVD achieves remarkable improvements compared to state-of-the-art methods. Moreover, extensive experimental results also explore the effects of batch size and class imbalance, two scenarios commonly encountered in real-world applications. These analyses further validate the robustness and adaptability of our proposed framework.
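
The geometric backbone is easy to state in code: nearest-prototype assignment is exactly a Voronoi partition of the feature space, and subtracting a per-cell weight turns it into a Power Diagram. A minimal sketch follows; the prototypes and weights are random placeholders, not quantities prescribed by the paper.

```python
import numpy as np

def power_diagram_assign(features, prototypes, weights):
    """Assign each feature to the cell argmin_c ||x - p_c||^2 - w_c.

    With weights = 0 this reduces to the plain Voronoi (nearest-prototype)
    rule; nonzero weights grow or shrink individual cells.
    """
    # Squared distances, shape (n_samples, n_prototypes).
    d2 = ((features[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    return np.argmin(d2 - weights[None, :], axis=1)

rng = np.random.default_rng(0)
protos = rng.normal(size=(5, 16))    # one prototype per class
x = rng.normal(size=(8, 16))         # batch of test features
voronoi = power_diagram_assign(x, protos, np.zeros(5))
weighted = power_diagram_assign(x, protos, rng.normal(size=5))
print("Voronoi cells:", voronoi)
print("Power cells:  ", weighted)
```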

ICLR Conference 2024 Conference Paper

Improved Analysis of Sparse Linear Regression in Local Differential Privacy Model

  • Liyang Zhu
  • Meng Ding
  • Vaneet Aggarwal
  • Jinhui Xu 0001
  • Di Wang 0015

In this paper, we revisit the problem of sparse linear regression in the local differential privacy (LDP) model. Existing research in the non-interactive and sequentially interactive local models has focused on obtaining lower bounds for the case where the underlying parameter is $1$-sparse, and extending such bounds to the more general $k$-sparse case has proven challenging. Moreover, it is unclear whether efficient non-interactive LDP (NLDP) algorithms exist. To address these issues, we first consider the problem in the $\epsilon$ non-interactive LDP model and provide a lower bound of $\Omega(\frac{\sqrt{dk\log d}}{\sqrt{n}\epsilon})$ on the $\ell_2$-norm estimation error for sub-Gaussian data, where $n$ is the sample size and $d$ is the dimension of the space. We propose an innovative NLDP algorithm, the very first of its kind for this problem. As a remarkable outcome, this algorithm also yields a novel and highly efficient estimator as a valuable by-product. Our algorithm achieves an upper bound of $\tilde{O}(\frac{d\sqrt{k}}{\sqrt{n}\epsilon})$ for the estimation error when the data is sub-Gaussian, which can be further improved by a factor of $O(\sqrt{d})$ if the server has additional public but unlabeled data. For the sequentially interactive LDP model, we show a similar lower bound of $\Omega(\frac{\sqrt{dk}}{\sqrt{n}\epsilon})$. As for the upper bound, we rectify a previous method and show that it is possible to achieve a bound of $\tilde{O}(\frac{k\sqrt{d}}{\sqrt{n}\epsilon})$. Our findings reveal fundamental differences between the non-private case, the central DP model, and the local DP model in the sparse linear regression problem.

ICML Conference 2024 Conference Paper

Near-Linear Time Approximation Algorithms for k-means with Outliers

  • Junyu Huang
  • Qilong Feng
  • Ziyun Huang 0001
  • Jinhui Xu 0001
  • Jianxin Wang 0001

The k-means with outliers problem is one of the most extensively studied clustering problems in the field of machine learning, where the goal is to discard up to z outliers and identify a minimum k-means clustering on the remaining data points. Most previous results for this problem have running time dependent on the aspect ratio Δ (the ratio between the maximum and the minimum pairwise distances) to achieve fast approximations. To remove this dependency of the running time on the aspect ratio, we propose sampling-based algorithms with almost linear running time in the data size, where a crucial component of our approach is an algorithm called Fast-Sampling. The Fast-Sampling algorithm can find inliers that well approximate the optimal clustering centers without relying on a guess for the optimal clustering cost, and obtains a 4-approximate solution in time $O(\frac{ndk\log\log n}{\epsilon^2})$ with $O(k/\epsilon)$ centers opened and $(1+\epsilon)z$ outliers discarded. To reduce the number of centers opened, we propose a center reduction algorithm, where an $O(1/\epsilon)$-approximate solution can be obtained in time $O(\frac{ndk\log \log n}{\epsilon^2} + d\,\mathrm{poly}(k, \frac{1}{\epsilon})\log(n\Delta))$ with $(1+\epsilon)z$ outliers discarded and exactly k centers opened. Empirical experiments suggest that our proposed sampling-based algorithms outperform state-of-the-art algorithms for the k-means with outliers problem.
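
For intuition, the sketch below shows the bicriteria pattern the abstract describes: open somewhat more than k centers from a uniform sample and evaluate the k-means cost after discarding slightly more than z of the farthest points as outliers. Uniform sampling here is an illustrative stand-in for the paper's Fast-Sampling routine, and the oversampling factor is an assumption.

```python
import numpy as np

def kmeans_cost_with_outliers(X, centers, z):
    """k-means cost after discarding the z farthest points as outliers."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(axis=1)
    return np.sort(d2)[: len(X) - z].sum()   # drop the z largest distances

def sample_centers(X, k, oversample=8, seed=0):
    """Open k * oversample uniformly sampled points as centers.

    Illustrative stand-in for Fast-Sampling: opening more than k centers
    mirrors the bicriteria guarantee (O(k/eps) centers opened), and uniform
    samples are mostly inliers when outliers are a small fraction of X.
    """
    rng = np.random.default_rng(seed)
    return X[rng.choice(len(X), size=k * oversample, replace=False)]

rng = np.random.default_rng(2)
inliers = np.concatenate([rng.normal(loc=8 * c, size=(400, 2)) for c in range(3)])
outliers = rng.uniform(-50, 50, size=(30, 2))
X = np.concatenate([inliers, outliers])
centers = sample_centers(X, k=3)
# Discard (1 + eps) * z points with eps = 0.1, i.e. 33 for 30 planted outliers.
print("clustering cost:", round(kmeans_cost_with_outliers(X, centers, z=33), 1))
```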

ICML Conference 2024 Conference Paper

Understanding Forgetting in Continual Learning with Linear Regression

  • Meng Ding
  • Kaiyi Ji
  • Di Wang 0015
  • Jinhui Xu 0001

Continual learning, focused on sequentially learning multiple tasks, has gained significant attention recently. Despite the tremendous progress made in the past, the theoretical understanding, especially of factors contributing to $\textit{catastrophic forgetting}$, remains relatively unexplored. In this paper, we provide a general theoretical analysis of forgetting in the linear regression model via Stochastic Gradient Descent (SGD), applicable to both underparameterized and overparameterized regimes. Our theoretical framework reveals some interesting insights into the intricate relationship between task sequence and algorithmic parameters, an aspect not fully captured in previous studies due to their restrictive assumptions. Specifically, we demonstrate that, given a sufficiently large data size, arranging tasks in a sequence so that tasks with larger eigenvalues in their population data covariance matrices are trained later tends to result in increased forgetting. Additionally, our findings highlight that an appropriate choice of step size helps mitigate forgetting in both underparameterized and overparameterized settings. To validate our theoretical analysis, we conducted simulation experiments on both linear regression models and Deep Neural Networks (DNNs). Results from these simulations substantiate our theoretical findings.
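
The forgetting quantity in this analysis is easy to reproduce in simulation: train two linear-regression tasks sequentially with SGD and measure how much the first task's risk grows after the second task is learned, comparing the two task orders. The dimensions, step size, and task construction below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def sgd_linreg(w, X, y, lr=0.02, epochs=2):
    """Plain per-sample SGD on the squared loss."""
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            w = w - lr * (x_i @ w - y_i) * x_i
    return w

def make_task(scale, w_star, n=2000, d=20, seed=0):
    """Linear-regression task whose covariance eigenvalues scale with `scale`."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d)) * scale
    return X, X @ w_star

rng = np.random.default_rng(3)
d = 20
small = make_task(0.3, rng.normal(size=d), seed=1)   # small-eigenvalue task
large = make_task(1.0, rng.normal(size=d), seed=2)   # large-eigenvalue task

for order, (first, second) in [("small then large", (small, large)),
                               ("large then small", (large, small))]:
    w = sgd_linreg(np.zeros(d), *first)
    risk_before = np.mean((first[0] @ w - first[1]) ** 2)
    w = sgd_linreg(w, *second)                        # train the second task
    risk_after = np.mean((first[0] @ w - first[1]) ** 2)
    print(order, "-> forgetting on first task:", round(risk_after - risk_before, 4))
```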

ICML Conference 2023 Conference Paper

Fast Algorithms for Distributed k-Clustering with Outliers

  • Junyu Huang
  • Qilong Feng
  • Ziyun Huang 0001
  • Jinhui Xu 0001
  • Jianxin Wang 0001

In this paper, we study $k$-clustering problems with outliers in the distributed setting. The current best results for the distributed $k$-center problem with outliers have quadratic local running time, with communication cost dependent on the aspect ratio $\Delta$ of the given instance, which may constrain the scalability of the algorithms for handling large-scale datasets. To achieve better communication cost with faster local running time, we propose an inliers-recalling sampling method, which avoids guessing the optimal radius of the given instance and achieves a 4-round bi-criteria $(14(1+\epsilon), 1+\epsilon)$-approximation with local running time linear in the data size and communication cost independent of the aspect ratio. To obtain a more practical algorithm for the problem, we propose another space-narrowing sampling method, which automatically adjusts the sample size to adapt to different outlier distributions on each machine, and achieves a 2-round bi-criteria $(14(1+\epsilon), 1+\epsilon)$-approximation with communication cost independent of the number of outliers. We show that, if the data points are randomly partitioned across machines, our proposed sampling-based methods can be extended to the $k$-median/means problems with outliers, achieving an $(O(\frac{1}{\epsilon^2}), 1+\epsilon)$-approximation with communication cost independent of the number of outliers. Empirical experiments suggest that the proposed 2-round distributed algorithms outperform other state-of-the-art algorithms.

ECAI Conference 2023 Conference Paper

Finite Sample Guarantees of Differentially Private Expectation Maximization Algorithm

  • Di Wang 0015
  • Jiahao Ding
  • Lijie Hu
  • Zejun Xie
  • Miao Pan
  • Jinhui Xu 0001

(Gradient) Expectation Maximization (EM) is a widely used algorithm for maximum likelihood estimation in mixture models and incomplete-data problems. A major challenge facing this popular technique is how to effectively preserve the privacy of sensitive data. Previous research on this problem has already led to the discovery of some Differentially Private (DP) algorithms for (Gradient) EM. However, unlike in the non-private case, existing techniques are not yet able to provide finite-sample statistical guarantees. To address this issue, we propose in this paper the first DP version of the Gradient EM algorithm with such guarantees. Specifically, we first propose a new mechanism for privately estimating the mean of a heavy-tailed distribution, which significantly improves a previous result in [25] and can be extended to the local DP model, which has not been studied before. Next, we apply our general framework to three canonical models: Gaussian Mixture Model (GMM), Mixture of Regressions Model (MRM), and Linear Regression with Missing Covariates (RMC). For GMM in the DP model, our estimation error is near optimal in some cases. For the other two models, we provide the first results on finite-sample statistical guarantees. Our theory is supported by thorough numerical experiments on both real-world and synthetic data.
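
The key primitive highlighted in the abstract, privately estimating the mean of a heavy-tailed distribution, can be sketched as truncate-then-noise: clamp each sample into a bounded range so the sensitivity is finite, then add noise calibrated to that range. The Laplace mechanism and the truncation radii below are illustrative assumptions, not the paper's mechanism.

```python
import numpy as np

def private_truncated_mean(x, epsilon, radius):
    """(epsilon, 0)-DP mean estimate of a 1-D heavy-tailed sample.

    Truncation bounds each sample's influence on the mean by 2*radius/n,
    so Laplace noise with scale 2*radius/(n*epsilon) suffices. Choosing
    `radius` trades truncation bias against noise variance.
    """
    n = len(x)
    clipped = np.clip(x, -radius, radius)
    noise = np.random.laplace(scale=2 * radius / (n * epsilon))
    return clipped.mean() + noise

rng = np.random.default_rng(4)
# Heavy-tailed sample: Student-t with 3 degrees of freedom, true mean 2.0.
x = 2.0 + rng.standard_t(df=3, size=10000)
for radius in (2.0, 8.0, 64.0):
    est = private_truncated_mean(x, epsilon=1.0, radius=radius)
    print(f"radius={radius:5.1f} -> estimate {est:.3f}")
```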

ICLR Conference 2023 Conference Paper

Progressive Voronoi Diagram Subdivision Enables Accurate Data-free Class-Incremental Learning

  • Chunwei Ma
  • Zhanghexuan Ji
  • Ziyun Huang 0001
  • Yan Shen 0002
  • Mingchen Gao
  • Jinhui Xu 0001

Data-free Class-incremental Learning (CIL) is a challenging problem because rehearsing data from previous phases is strictly prohibited, causing catastrophic forgetting in Deep Neural Networks (DNNs). In this paper, we present \emph{iVoro}, a novel framework derived from computational geometry. We find that the Voronoi Diagram (VD), a classical model for space subdivision, is especially powerful for solving the CIL problem, because a VD can be constructed favorably in an incremental manner -- newly added sites (classes) affect only the proximate classes, making the non-contiguous classes hardly forgettable. Furthermore, we bridge DNN and VD using Power Diagram Reduction, and show that the VD structure can be progressively refined across phases using a divide-and-conquer algorithm. Moreover, our VD construction is not restricted to the deep feature space, but is also applicable to multiple intermediate feature spaces, promoting the VD to a multilayer VD that efficiently captures multi-grained features from the DNN. Importantly, \emph{iVoro} is also capable of uncertainty-aware test-time Voronoi cell assignment and exhibits high correlations between geometric uncertainty and predictive accuracy (up to ${\sim}0.9$). Putting everything together, \emph{iVoro} achieves up to $25.26\%$, $37.09\%$, and $33.21\%$ improvements on CIFAR-100, TinyImageNet, and ImageNet-Subset, respectively, compared to state-of-the-art non-exemplar CIL approaches. In conclusion, \emph{iVoro} enables highly accurate, privacy-preserving, and geometrically interpretable CIL that is particularly useful when cross-phase data sharing is forbidden, e.g., in medical applications.

ICLR Conference 2022 Conference Paper

Few-shot Learning via Dirichlet Tessellation Ensemble

  • Chunwei Ma
  • Ziyun Huang 0001
  • Mingchen Gao
  • Jinhui Xu 0001

Few-shot learning (FSL) is the process of rapid generalization from abundant base samples to scarce novel samples. Despite extensive research in recent years, FSL is still unable to produce satisfactory solutions for a wide range of real-world applications. To confront this challenge, we study the FSL problem from a geometric point of view in this paper. One observation is that the widely embraced ProtoNet model is essentially a Voronoi Diagram (VD) in the feature space. We retrofit it by making use of a recent advance in computational geometry called the Cluster-induced Voronoi Diagram (CIVD). Starting from the simplest nearest-neighbor model, CIVD gradually incorporates cluster-to-point and then cluster-to-cluster relationships for space subdivision, which is used to improve the accuracy and robustness at multiple stages of FSL. Specifically, we use CIVD (1) to integrate parametric and nonparametric few-shot classifiers; (2) to combine feature representation and surrogate representation; and (3) to leverage feature-level, transformation-level, and geometry-level heterogeneities for a better ensemble. Our CIVD-based workflow enables us to achieve new state-of-the-art results on the mini-ImageNet, CUB, and tiered-ImageNet datasets, with ${\sim}2\%{-}5\%$ improvements upon the next best. To summarize, CIVD provides a mathematically elegant and geometrically interpretable framework that compensates for extreme data insufficiency, prevents overfitting, and allows for fast geometric ensembling of thousands of individual VDs. Together, these make FSL stronger.
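
A minimal sketch of the cluster-to-point idea: instead of ProtoNet's single prototype per class, every support point of a class contributes to an influence score, and a query is assigned to the class with maximum total influence. The inverse-power influence function below is one natural instantiation assumed for illustration, not necessarily the paper's exact choice.

```python
import numpy as np

def civd_influence(cluster, q, alpha=2.0):
    """Cluster-to-point influence F(C, q): joint pull of all points in C on q.

    One natural instantiation: a sum of inverse-power distance contributions,
    so every support point of a class votes instead of a single prototype.
    """
    d = np.linalg.norm(cluster - q, axis=1)
    return np.sum(1.0 / (d ** alpha + 1e-12))

def civd_classify(q, class_clusters, alpha=2.0):
    """Assign q to the class whose cluster exerts maximum influence."""
    scores = [civd_influence(c, q, alpha) for c in class_clusters]
    return int(np.argmax(scores))

rng = np.random.default_rng(5)
# 5-way 5-shot episode: five support points per class in a 64-d feature space.
clusters = [rng.normal(loc=c, size=(5, 64)) for c in range(5)]
query = rng.normal(loc=3, size=64)   # drawn near class 3
print("predicted class:", civd_classify(query, clusters))
```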

UAI Conference 2021 Conference Paper

Improving uncertainty calibration of deep neural networks via truth discovery and geometric optimization

  • Chunwei Ma
  • Ziyun Huang 0001
  • Jiayi Xian
  • Mingchen Gao
  • Jinhui Xu 0001

Deep Neural Networks (DNNs), despite their tremendous success in recent years, could still cast doubts on their predictions due to the intrinsic uncertainty associated with their learning process. Ensemble techniques and post-hoc calibrations are two types of approaches that have individually shown promise in improving the uncertainty calibration of DNNs. However, the synergistic effect of the two types of methods has not been well explored. In this paper, we propose a truth discovery framework to integrate ensemble-based and post-hoc calibration methods. Using the geometric variance of the ensemble candidates as a good indicator for sample uncertainty, we design an accuracy-preserving truth estimator with provably no accuracy drop. Furthermore, we show that post-hoc calibration can also be enhanced by truth discovery-regularized optimization. On large-scale datasets including CIFAR and ImageNet, our method shows consistent improvement against state-of-the-art calibration approaches on both histogram-based and kernel density-based evaluation metrics. Our code is available at https: //github. com/horsepurve/truly-uncertain.

ICLR Conference 2021 Conference Paper

Meta-Learning with Neural Tangent Kernels

  • Yufan Zhou 0001
  • Zhenyi Wang 0001
  • Jiayi Xian
  • Changyou Chen
  • Jinhui Xu 0001

Model-Agnostic Meta-Learning (MAML) has emerged as a standard framework for meta-learning, where a meta-model is learned with the ability to adapt quickly to new tasks. However, as a double-loop optimization problem, MAML needs to differentiate through the whole inner-loop optimization path for every outer-loop training step, which may lead to both computational inefficiency and sub-optimal solutions. In this paper, we generalize MAML to allow meta-learning to be defined in function spaces, and propose the first meta-learning paradigm in the Reproducing Kernel Hilbert Space (RKHS) induced by the meta-model's Neural Tangent Kernel (NTK). Within this paradigm, we introduce two meta-learning algorithms in the RKHS which no longer need the sub-optimal iterative inner-loop adaptation of the MAML framework. We achieve this goal by 1) replacing the adaptation with a fast-adaptive regularizer in the RKHS, and 2) solving the adaptation analytically based on NTK theory. Extensive experimental studies demonstrate the advantages of our paradigm in both efficiency and quality of solutions compared to related meta-learning algorithms. Another interesting feature of our proposed methods is that they are more robust to adversarial attacks and out-of-distribution adaptation than popular baselines, as demonstrated in our experiments.
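
Solving the adaptation analytically amounts to kernel ridge regression in the NTK-induced RKHS: given support data $(X, y)$, the adapted predictor is $f(x) = k(x, X)(K + \lambda I)^{-1} y$, with no inner-loop gradient steps. The sketch below uses the closed-form infinite-width NTK of a two-layer ReLU network as a stand-in for a learned meta-model's kernel; that choice and the ridge parameter are illustrative assumptions.

```python
import numpy as np

def ntk_two_layer_relu(A, B):
    """Exact infinite-width NTK of a two-layer ReLU net (both layers trained).

    Uses the arc-cosine kernel formulas; an illustrative stand-in for the
    meta-model's NTK.
    """
    na = np.linalg.norm(A, axis=1)[:, None]
    nb = np.linalg.norm(B, axis=1)[None, :]
    u = np.clip((A @ B.T) / (na * nb), -1.0, 1.0)
    k0 = (np.pi - np.arccos(u)) / np.pi
    k1 = (u * (np.pi - np.arccos(u)) + np.sqrt(1 - u ** 2)) / np.pi
    return na * nb * k1 + (A @ B.T) * k0

def analytic_adapt(X_support, y_support, X_query, lam=1e-3):
    """Closed-form task adaptation: kernel ridge regression with the NTK."""
    K = ntk_two_layer_relu(X_support, X_support)
    alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_support)
    return ntk_two_layer_relu(X_query, X_support) @ alpha

rng = np.random.default_rng(6)
X = rng.normal(size=(20, 8))     # small support set for one task
y = np.sin(X[:, 0])              # toy task targets
Xq = rng.normal(size=(5, 8))
print("adapted predictions:", np.round(analytic_adapt(X, y, Xq), 3))
```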

SODA Conference 2021 Conference Paper

PTAS for Minimum Cost Multi-covering with Disks

  • Ziyun Huang 0001
  • Qilong Feng
  • Jianxin Wang 0001
  • Jinhui Xu 0001

In this paper, we study the following Minimum Cost Multi-Covering (MCMC) problem: given a set of n client points C and a set of m server points S in a fixed-dimensional space ℝ^d, determine a set of disks centered at these server points so that each client point c is covered by at least k(c) disks and the total cost of these disks is minimized, where k(·) is a function that maps every client point to some non-negative integer no more than m, and the cost of each disk is measured by the α-th power of its radius for some constant α > 0. MCMC is a fundamental optimization problem with applications in many areas such as wireless/sensor networking. Despite extensive research on this problem in the past two decades, only constant approximations were known for general k. It has long been an open problem to determine whether a PTAS is possible. In this paper, we give an affirmative answer to this question by presenting the first PTAS for it. Our approach is based on a number of novel techniques, such as Balanced Recursive Realization and Bubble Charging, and new insights into the problem which are somewhat counter-intuitive. In particular, we show that instead of optimizing each disk as a whole, it is possible to further approximate each disk with a set of sub-boxes and optimize them at the sub-disk level. This allows us to first compute an approximate disk cover with minimum cost through dynamic programming, and then obtain the desired disk cover through a balanced recursive realization procedure. Our techniques have the potential to be applied to other geometric (covering) problems.

ICML Conference 2020 Conference Paper

On Differentially Private Stochastic Convex Optimization with Heavy-tailed Data

  • Di Wang 0015
  • Hanshen Xiao
  • Srinivas Devadas
  • Jinhui Xu 0001

In this paper, we consider the problem of designing Differentially Private (DP) algorithms for Stochastic Convex Optimization (SCO) on heavy-tailed data. The irregularity of such data violates some key assumptions used in almost all existing DP-SCO and DP-ERM methods, causing them to fail to provide DP guarantees. To better understand this type of challenge, we provide in this paper a comprehensive study of DP-SCO under various settings. First, we consider the case where the loss function is strongly convex and smooth. For this case, we propose a method based on the sample-and-aggregate framework, which has an excess population risk of $\tilde{O}(\frac{d^3}{n\epsilon^4})$ (after omitting other factors), where $n$ is the sample size and $d$ is the dimensionality of the data. Then, we show that with some additional assumptions on the loss functions, it is possible to reduce the \emph{expected} excess population risk to $\tilde{O}(\frac{d^2}{n\epsilon^2})$. To lift these additional conditions, we also provide a gradient smoothing and trimming based scheme that achieves, \emph{with high probability}, excess population risks of $\tilde{O}(\frac{d^2}{n\epsilon^2})$ and $\tilde{O}(\frac{d^{\frac{2}{3}}}{(n\epsilon^2)^{\frac{1}{3}}})$ for strongly convex and general convex loss functions, respectively. Experiments on both synthetic and real-world datasets suggest that our algorithms can effectively deal with the challenges caused by data irregularity.
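
The smoothing-and-trimming idea can be sketched coordinate-wise: replace the empirical mean of the per-sample gradients with a trimmed mean so that heavy-tailed samples have bounded influence, then add noise scaled to the surviving range. The trimming fraction and the noise calibration below are illustrative assumptions (a faithful DP version would also privatize the quantile estimates).

```python
import numpy as np

def trimmed_private_gradient(per_sample_grads, epsilon, trim=0.1):
    """Coordinate-wise trimmed mean of gradients plus Laplace noise.

    Clipping each coordinate to its empirical [trim, 1-trim] quantile range
    bounds the influence of heavy-tailed samples; that range is then used,
    purely as an illustration, to set the per-coordinate noise scale.
    """
    n, d = per_sample_grads.shape
    lo = np.quantile(per_sample_grads, trim, axis=0)
    hi = np.quantile(per_sample_grads, 1 - trim, axis=0)
    clipped = np.clip(per_sample_grads, lo, hi)
    scale = (hi - lo) / (n * epsilon)   # per-coordinate sensitivity proxy
    return clipped.mean(axis=0) + np.random.laplace(scale=scale)

rng = np.random.default_rng(7)
# Heavy-tailed per-sample gradients: symmetric Pareto noise around the truth.
true_grad = np.array([1.0, -2.0, 0.5])
noise = rng.pareto(1.5, size=(2000, 3)) - rng.pareto(1.5, size=(2000, 3))
grads = true_grad + noise
print("naive mean:  ", np.round(grads.mean(axis=0), 3))
print("trimmed + DP:", np.round(trimmed_private_gradient(grads, epsilon=1.0), 3))
```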

ICML Conference 2019 Conference Paper

Differentially Private Empirical Risk Minimization with Non-convex Loss Functions

  • Di Wang 0015
  • Changyou Chen
  • Jinhui Xu 0001

We study the problem of Empirical Risk Minimization (ERM) with (smooth) non-convex loss functions under the differential privacy (DP) model. Existing approaches for this problem mainly adopt gradient norms to measure the error, which in general cannot guarantee the quality of the solution. To address this issue, we first study the expected excess empirical (or population) risk, which was previously used as the utility measure for convex loss functions. Specifically, we show that the excess empirical (or population) risk can be upper bounded by $\tilde{O}(\frac{d\log (1/\delta)}{\log n \cdot \epsilon^2})$ in the $(\epsilon, \delta)$-DP setting, where $n$ is the data size and $d$ is the dimensionality of the space. The $\frac{1}{\log n}$ term in the empirical risk bound can be further improved to $\frac{1}{n^{\Omega(1)}}$ (when $d$ is a constant) by a highly non-trivial analysis of the time-average error. To obtain more efficient solutions, we also consider the connection between achieving differential privacy and finding an approximate local minimum. In particular, we show that when the sample size $n$ is large enough, there are $(\epsilon, \delta)$-DP algorithms which can find an approximate local minimum of the empirical risk with high probability, in both the constrained and unconstrained settings. These results indicate that one can escape saddle points privately.

ICML Conference 2019 Conference Paper

On Sparse Linear Regression in the Local Differential Privacy Model

  • Di Wang 0015
  • Jinhui Xu 0001

In this paper, we study the sparse linear regression problem under the Local Differential Privacy (LDP) model. We first show that polynomial dependency on the dimensionality $p$ of the space is unavoidable for the estimation error in both the non-interactive and sequentially interactive local models, if the privacy of the whole dataset needs to be preserved. Similar limitations also exist for other types of error measurements and in the relaxed local models. This indicates that differential privacy in high-dimensional space is unlikely achievable for this problem. With this limitation understood, we then present two algorithmic results. The first is a sequentially interactive LDP algorithm for the low-dimensional sparse case, called Locally Differentially Private Iterative Hard Thresholding (LDP-IHT), which achieves a near-optimal upper bound. This algorithm is actually rather general and can be used to solve quite a few other problems, such as (Local) DP-ERM with sparsity constraints and sparse regression with non-linear measurements. The second is for the restricted (high-dimensional) case where only the privacy of the responses (labels) needs to be preserved. For this case, we show that the optimal rate of the estimation error can depend only logarithmically on $p$ (i.e., $\log p$) in the local model, where an upper bound is obtained by a label-privacy version of LDP-IHT. Experiments on real-world and synthetic datasets confirm our theoretical analysis.
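
The LDP-IHT loop is short enough to sketch: each round, every user releases a privatized local gradient, the server averages the noisy reports, takes a gradient step, and hard-thresholds back to the top-$k$ coordinates. The Gaussian local randomizer, step size, and round count below are illustrative assumptions rather than the paper's calibrated choices.

```python
import numpy as np

def hard_threshold(w, k):
    """Keep the k largest-magnitude coordinates of w, zero out the rest."""
    out = np.zeros_like(w)
    top = np.argsort(np.abs(w))[-k:]
    out[top] = w[top]
    return out

def ldp_iht(X, y, k, rounds=20, lr=0.5, noise=0.5, seed=0):
    """Sketch of locally private iterative hard thresholding.

    Each round, every user perturbs their own gradient with Gaussian noise
    (the local randomizer), so the server only ever sees noisy reports.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(rounds):
        per_user = (X @ w - y)[:, None] * X            # one gradient per user
        reports = per_user + rng.normal(0.0, noise, size=per_user.shape)
        w = hard_threshold(w - lr * reports.mean(axis=0), k)
    return w

rng = np.random.default_rng(8)
n, d, k = 4000, 50, 3
w_star = np.zeros(d)
w_star[:k] = [3.0, -2.0, 1.5]                          # k-sparse ground truth
X = rng.normal(size=(n, d))
y = X @ w_star + 0.01 * rng.normal(size=n)
w_hat = ldp_iht(X, y, k)
print("recovered support:", np.flatnonzero(w_hat))
print("estimation error: ", round(np.linalg.norm(w_hat - w_star), 3))
```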

SODA Conference 2015 Conference Paper

A Unified Framework for Clustering Constrained Data without Locality Property

  • Hu Ding 0003
  • Jinhui Xu 0001

In this paper, we consider a class of constrained clustering problems of points in ℝ^d, where d could be rather high. A common feature of these problems is that their optimal clusterings no longer have the locality property (due to the additional constraints), which is a key property required by many algorithms for their unconstrained counterparts. To overcome the difficulty caused by the loss of locality, we present in this paper a unified framework, called Peeling-and-Enclosing, to iteratively solve two variants of the constrained clustering problems: constrained k-means clustering (k-CMeans) and constrained k-median clustering (k-CMedian). Our framework generalizes Kumar et al.'s elegant k-means clustering approach [35] from unconstrained data to constrained data, and is based on two standalone geometric techniques, called the Simplex Lemma and the Weaker Simplex Lemma, for k-CMeans and k-CMedian, respectively. The Simplex Lemma (or Weaker Simplex Lemma) enables us to efficiently approximate the mean (or median) point of an unknown set of points by searching a small-size grid, independent of the dimensionality of the space, in a simplex (or the surrounding region of a simplex), and thus can be used to handle high-dimensional data. With these techniques, our framework generates, in nearly linear time (i.e., $O(n(\log n)^{k+1} d)$), $O((\log n)^k)$ $k$-tuple candidates for the $k$ mean or median points, one of which induces a $(1+\epsilon)$-approximation for k-CMeans or k-CMedian, where $n$ is the number of points. Combining this unified framework with a problem-specific selection algorithm (which determines the best $k$-tuple candidate), we obtain a $(1+\epsilon)$-approximation for each of the constrained clustering problems. Our framework considerably improves the best known results for these problems. We expect that our technique will be applicable to other constrained clustering problems without locality.

FOCS Conference 2013 Conference Paper

On Clustering Induced Voronoi Diagrams

  • Danny Z. Chen
  • Ziyun Huang 0001
  • Yangwei Liu
  • Jinhui Xu 0001

In this paper, we study a generalization of the classical Voronoi diagram, called clustering induced Voronoi diagram (CIVD). Different from the traditional model, CIVD takes as its sites the power set U of an input set P of objects. For each subset C of P, CIVD uses an influence function F(C, q) to measure the total (or joint) influence of all objects in C on an arbitrary point q in the space ℝ d, and determines the influence-based Voronoi cell in ℝ d for C. This generalized model offers a number of new features (e. g. , simultaneous clustering and space partition) to Voronoi diagram which are useful in various new applications. We investigate the general conditions for the influence function which ensure the existence of a small-size (e. g. , nearly linear) approximate CIVD for a set P of n points in ℝ d for some fixed d. To construct CIVD, we first present a standalone new technique, called approximate influence (AI) decomposition, for the general CIVD problem. With only O(n log n) time, the AI decomposition partitions the space ℝ d into a nearly linear number of cells so that all points in each cell receive their approximate maximum influence from the same (possibly unknown) site (i. e. , a subset of P). Based on this technique, we develop assignment algorithms to determine a proper site for each cell in the decomposition and form various (1-ε)-approximate CIVDs for some small fixed € > 0. Particularly, we consider two representative CIVD problems, vector CIVD and density-based CIVD, and show that both of them admit fast assignment algorithms; consequently, their (1 - €)-approximate CIVDs can be built in O(n log d+1 n) and O(n log 2 n) time, respectively.