
Stability-based methods

A major requirement for clustering algorithms is the reproducibility of their solutions on other data sets drawn from the same source. In this context, several methods based on the concept of stability have been proposed to estimate the "optimal" number of clusters in the data [18,16]: multiple clusterings are obtained by introducing perturbations into the original data, and a clustering is considered reliable if it is approximately preserved across the perturbations.

A general stability-based algorithmic scheme for assessing the reliability of clustering solutions may be summarized in the following way:

  1. Randomly perturb the data many times according to a given perturbation procedure.
  2. Apply a given clustering algorithm to each perturbed data set.
  3. Apply a given clustering similarity measure to multiple pairs of k-clusterings obtained in steps 1 and 2.
  4. Use appropriate similarity indices to assess the stability of a given clustering.
  5. Repeat steps 1 to 4 for multiple values of k and select the most stable clustering(s) as the most reliable.
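The five steps above can be sketched in Python. This is a minimal illustration, not the procedure of any specific cited method: it assumes one-dimensional data, uses noise injection as the perturbation, a small Lloyd's k-means as the clustering algorithm, and the mean pairwise Rand index as the stability index.

```python
import random
from itertools import combinations

def kmeans_1d(points, k, rng, iters=20):
    # Step 2: a minimal Lloyd's k-means on 1-D data, standing in for any
    # clustering algorithm chosen from prior knowledge of the data.
    centers = rng.sample(points, k)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: (p - centers[c]) ** 2)
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

def rand_index(la, lb):
    # Step 3: fraction of point pairs on which the two clusterings agree
    # (together in both, or separated in both).
    agree = total = 0
    for i, j in combinations(range(len(la)), 2):
        agree += (la[i] == la[j]) == (lb[i] == lb[j])
        total += 1
    return agree / total

def stability(points, k, rng, trials=8, sigma=0.05):
    # Steps 1-4: perturb by Gaussian noise injection, recluster, and use the
    # mean pairwise Rand index as a (very simple) stability index.
    labelings = [kmeans_1d([p + rng.gauss(0.0, sigma) for p in points], k, rng)
                 for _ in range(trials)]
    sims = [rand_index(a, b) for a, b in combinations(labelings, 2)]
    return sum(sims) / len(sims)

# Step 5: sweep over several values of k and keep the most stable one.
rng = random.Random(0)
data = ([rng.gauss(0.0, 0.3) for _ in range(20)] +
        [rng.gauss(5.0, 0.3) for _ in range(20)])
best_k = max(range(2, 6), key=lambda k: stability(data, k, rng))
```

On this toy data set, with two well-separated groups, stability is expected to peak at the true number of clusters; the perturbation, algorithm, and similarity measure are all interchangeable components of the scheme.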

Several approaches have been proposed to implement the first step: a random "perturbation" of the data may be obtained through bootstrap samples drawn from the available data [15,18], through random noise injection into the data [17], or through random projections into lower-dimensional subspaces [20,7].
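The three perturbation procedures can be sketched as follows. These are simplified illustrations: in particular, the random projections of [20,7] use specific scaling factors that are omitted here.

```python
import random

def bootstrap_sample(data, rng):
    # Bootstrap perturbation [15,18]: draw n points with replacement
    # from the n original points.
    return [rng.choice(data) for _ in data]

def noise_injection(data, sigma, rng):
    # Noise-injection perturbation [17]: add zero-mean Gaussian noise
    # to every coordinate of every point.
    return [tuple(x + rng.gauss(0.0, sigma) for x in p) for p in data]

def random_projection(data, out_dim, rng):
    # Random-projection perturbation [20,7]: multiply each point by a
    # random Gaussian matrix, mapping it into a lower-dimensional
    # subspace (scaling factors omitted for brevity).
    in_dim = len(data[0])
    R = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)]
         for _ in range(out_dim)]
    return [tuple(sum(row[j] * p[j] for j in range(in_dim)) for row in R)
            for p in data]
```

Each procedure returns a data set of the same cardinality as the input, so the same clustering algorithm can be applied unchanged in step 2.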

The choice of a given algorithm (step 2) is based on "a priori" knowledge or assumptions about the characteristics of the data. To estimate the similarity between clusterings (step 3), classical measures, such as the Rand index [19], the Jaccard coefficient, or the Fowlkes and Mallows coefficient [14], or their equivalent dot-product representations [4], may be applied.
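All three pair-counting measures mentioned above reduce to the same four counts over point pairs, so they can be sketched compactly (standard textbook formulas, with conventional values for the degenerate zero-denominator cases):

```python
from itertools import combinations
from math import sqrt

def pair_counts(la, lb):
    # For every pair of points, record whether the two clusterings place
    # them together in both (a), only in the first (b), only in the
    # second (c), or in neither (d).
    a = b = c = d = 0
    for i, j in combinations(range(len(la)), 2):
        same_a, same_b = la[i] == la[j], lb[i] == lb[j]
        if same_a and same_b:
            a += 1
        elif same_a:
            b += 1
        elif same_b:
            c += 1
        else:
            d += 1
    return a, b, c, d

def rand_index(la, lb):
    # Rand index [19]: fraction of pairs treated consistently.
    a, b, c, d = pair_counts(la, lb)
    return (a + d) / (a + b + c + d)

def jaccard(la, lb):
    # Jaccard coefficient: like the Rand index, but ignoring the
    # (typically dominant) d count.
    a, b, c, _ = pair_counts(la, lb)
    return a / (a + b + c) if a + b + c else 1.0

def fowlkes_mallows(la, lb):
    # Fowlkes and Mallows coefficient [14]: geometric mean of the two
    # pairwise precision/recall ratios.
    a, b, c, _ = pair_counts(la, lb)
    return a / sqrt((a + b) * (a + c)) if a else 0.0
```

Note that all three measures are invariant to a relabeling of the clusters, since they depend only on whether pairs of points are grouped together.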

Several stability indices for model order selection have been proposed in the literature (see, e.g., [17,20,18,16]). Very schematically, they can be divided into indices based on summary statistics of the similarity measures [20,7] and indices based on their overall empirical distribution [4].
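The distinction between the two families can be illustrated with two toy indices over the collected pairwise similarity values; these are deliberately simplified stand-ins, not the exact indices of the cited papers.

```python
def mean_similarity(sims):
    # Summary-statistic index: a single statistic (here the mean) of the
    # pairwise similarities; higher means the clustering is better
    # reproduced under perturbation.
    return sum(sims) / len(sims)

def concentration_near_one(sims, threshold=0.9):
    # Distribution-based index: the mass of the empirical similarity
    # distribution above a threshold; a stable clustering concentrates
    # its similarity values near 1.
    return sum(s >= threshold for s in sims) / len(sims)
```

A distribution-based index can distinguish, for example, a clustering whose similarities are uniformly moderate from one whose similarities are mostly 1 with a few outliers, even when both have the same mean.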

The last step, that is, the selection of the most stable/reliable clustering given a set of similarity measures and the related stability indices, is usually approached by choosing the best-scoring clustering according to the chosen stability index. A major open problem in this step is estimating the statistical significance of the discovered solutions.

