K-Center Clustering
Definition
K-center Clustering is a type of clustering based on combinatorial optimization. It clusters a set of points so as to minimize the maximum intercluster distance.
How it works
An example K-center Clustering problem is to minimize the number of points in a set needed so that every other point in the set is within some fixed distance of one of them. For instance, given n cities with specified pairwise distances, one wants to build k warehouses in different cities so as to minimize the maximum distance from any city to its nearest warehouse.
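To make the objective concrete, here is a minimal Python sketch of the warehouse formulation. The coordinates and function names are illustrative assumptions, and the exhaustive search over every size-k subset is only viable for small n, but it states the objective exactly.

```python
from itertools import combinations
import math

def k_center_cost(cities, warehouses):
    # Maximum distance from any city to its nearest warehouse
    return max(min(math.dist(c, w) for w in warehouses) for c in cities)

def brute_force_k_center(cities, k):
    # Exhaustively try every size-k subset of cities as warehouse sites (exponential; small n only)
    best = min(combinations(cities, k), key=lambda subset: k_center_cost(cities, subset))
    return best, k_center_cost(cities, best)

# Hypothetical city coordinates
cities = [(0, 0), (1, 1), (5, 5), (6, 5), (10, 0)]
warehouses, worst_distance = brute_force_k_center(cities, k=2)
print(warehouses, worst_distance)
```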
Considerations
- Scalability: Computing an exact solution is NP-hard. However, greedy approximation algorithms that guarantee a maximum point-to-center distance of no more than twice the optimum run in O(kn) time, where k is the number of clusters and n is the number of data points being clustered.
Key Test Considerations
Unsupervised Learning:
- Number of Clusters: The Gonzalez (Gon) algorithm picks k centers greedily and guarantees that the maximum distance from any point to its nearest center is no more than twice the best achievable with k clusters [1]; a sketch of the greedy procedure appears below.
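A minimal sketch, in plain Python with hypothetical point data, of the farthest-first traversal described in [1]: each round adds the point farthest from the centers chosen so far, and each of the k rounds scans all n points once, which is where the O(kn) running time noted above comes from.

```python
import math

def gonzalez_k_centers(points, k):
    # Greedy farthest-first traversal: 2-approximation for the k-center objective
    centers = [points[0]]  # start from an arbitrary point (here, the first)
    # Track each point's distance to its nearest chosen center
    dist_to_center = [math.dist(p, centers[0]) for p in points]
    for _ in range(k - 1):
        # Pick the point farthest from all current centers
        next_idx = max(range(len(points)), key=lambda i: dist_to_center[i])
        centers.append(points[next_idx])
        # Update nearest-center distances in a single pass over the points
        for i, p in enumerate(points):
            dist_to_center[i] = min(dist_to_center[i], math.dist(p, centers[-1]))
    # The largest remaining distance is the clustering radius (at most twice the optimum)
    return centers, max(dist_to_center)

# Hypothetical example
points = [(0, 0), (1, 1), (5, 5), (6, 5), (10, 0)]
centers, radius = gonzalez_k_centers(points, k=2)
print(centers, radius)
```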
Cluster Analysis:
Rand Index and Adjusted Rand Index: Given a ground-truth set of class labels for the data, the Rand Index measures the similarity between two data clusterings; it can be read as the accuracy of deciding, for each pair of points, whether the pair does or does not belong to the same cluster. A form of the Rand Index adjusted for the chance grouping of elements is the Adjusted Rand Index [5].
Adjusted Mutual Information: Given a ground-truth set of class labels for the data, Adjusted Mutual Information corrects for agreement that is due solely to chance between clusterings, much as the Adjusted Rand Index corrects the Rand Index [6]. A short usage sketch for both families of scores follows below.
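A minimal sketch, assuming a recent scikit-learn is installed; the label vectors are made up for illustration.

```python
from sklearn.metrics import rand_score, adjusted_rand_score, adjusted_mutual_info_score

# Hypothetical ground-truth class labels and predicted cluster assignments
labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 2, 2, 2]

print("Rand Index:", rand_score(labels_true, labels_pred))
print("Adjusted Rand Index:", adjusted_rand_score(labels_true, labels_pred))
print("Adjusted Mutual Information:", adjusted_mutual_info_score(labels_true, labels_pred))
```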
Connection-based Clustering:
Choice of Distance Metric: The outcome can vary significantly depending on the chosen distance metric (e.g., Euclidean, Manhattan); see the sketch after this list.
Sensitivity: Connection-based methods can be sensitive to outliers, which might affect the quality of the clusters formed.
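A minimal sketch, assuming SciPy is available and using made-up points, of how the distance metric is swapped in a single-linkage (connection-based) clustering; comparing the resulting labels under each metric is one way to gauge how sensitive the outcome is to that choice.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D points
X = np.array([[0.0, 0.0], [0.0, 3.0], [4.0, 0.0], [4.0, 3.0], [10.0, 1.5]])

# Single-linkage clustering under two different distance metrics
for metric in ("euclidean", "cityblock"):  # "cityblock" is the Manhattan distance
    Z = linkage(X, method="single", metric=metric)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
    print(metric, labels)
```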
K-center Clustering:
Silhouette Score: The silhouette score is a scoring method that helps validate the consistency within clusters of data. The technique also produces a concise graphical representation of how well each object appears to have been classified. It is suited to K-center Clustering in that it also works for different metric spaces; a usage sketch follows this list.
Distance Metric: The distance measure must be a true metric (see [2]). The choice of metric (e.g., Euclidean, Manhattan) may affect results significantly.
Sensitivity: Greedy implementations may be sensitive to outliers.
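A minimal sketch, assuming scikit-learn is available; the points and cluster labels are hypothetical stand-ins for the output of a k-center run, and the two calls show the score computed under different distance metrics.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Hypothetical 2-D points and the cluster labels assigned to them
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8], [9.0, 0.2], [9.3, 0.0]])
labels = np.array([0, 0, 1, 1, 2, 2])

# Silhouette works with any supported distance metric, not just Euclidean
print("Euclidean:", silhouette_score(X, labels, metric="euclidean"))
print("Manhattan:", silhouette_score(X, labels, metric="manhattan"))
```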
Platforms, Tools, or Libraries
N/A. The algorithm is relatively simple, so it is usually implemented from scratch by those incorporating it into a system.
References
[1] Gonzalez, T. F. (1985). Clustering to Minimize the Maximum Intercluster Distance. Theoretical Computer Science, 38, 293-306.
[2] Weisstein, Eric W. (n.d.). "Metric." From MathWorld--A Wolfram Web Resource.
[3] Wikipedia. (8 Aug 2023). Metric k-center.
[4] Wikipedia. (14 Aug 2023). Vertex k-center problem.
[5] Wikipedia. (n.d.). Rand Index.
[6] Wikipedia. (n.d.). Adjusted Mutual Information.
[7] Wikipedia. (1 Aug 2023). Silhouette (clustering).