   Next: Statistical tests to assess Up: Background on stability based Previous: Stability based methods

## Stability indices

Let be a clustering algorithm, a given random perturbation procedure applied to a data set and a suitable similarity measure between two clusterings (e.g. the Fowlkes and Mallows similarity ). For instance may be a random projection from a high dimensional to a low dimensional subspace , or a bootstrap procedure to sample a random subset of data from the original data set .

We define as the random variable given by the similarity between two k-clusterings obtained by applying a clustering algorithm to pairs and of random independently perturbed data. The intuitive idea is that if is concentrated close to , the corresponding clustering is stable with respect to a given controlled perturbation and hence it is reliable.

Let be its density function, and (1)

its cumulative distribution function.

We define as the integral of the cumulative distribution function: (2)

Intuitively represents the "concentration" of the similarity values close to 1; that is, if then the distribution of the values of is concentrated near 1, or, in other words, the k-clustering is stable. On the other hand, if then the clusterings are totally unstable, while if the distribution is close to the uniform distribution, we have .

We may directly estimate eq. 2 by numerical integration, or we may more easily obtain from the estimate of the expectation :     (3)

Hence from eq. 3 we may easily compute : (4)

Eq. 4 shows also more clearly that we have a very stable and reliable clustering ( close to 1), if and only if is close to 0.   Next: Statistical tests to assess Up: Background on stability based Previous: Stability based methods