Next: Statistical tests to assess
Up: Background on stability based
Previous: Stability based methods
Stability indices
Let be a clustering algorithm, a given random perturbation procedure applied to a data set and a
suitable similarity measure between two clusterings (e.g. the Fowlkes and Mallows similarity [13]).
For instance may be a random projection from a high dimensional to a low dimensional subspace [1], or a bootstrap
procedure to sample a random subset of data from the original data set [4].
We define
as the random variable given by the similarity between two kclusterings obtained by applying a
clustering algorithm to pairs and of random independently perturbed data.
The intuitive idea is that if is concentrated close to , the corresponding clustering is stable with respect to a given
controlled perturbation and hence it is reliable.
Let be its density function, and

(1) 
its cumulative distribution function.
We define as the integral of the cumulative distribution function:

(2) 
Intuitively represents the "concentration" of the similarity values close to 1; that is, if then
the distribution of the values of is concentrated near 1, or, in other words, the kclustering is stable. On the
other hand, if then the clusterings are totally unstable, while if the distribution is close to the
uniform distribution, we have
.
We may directly estimate eq. 2 by numerical integration, or we may more easily obtain from the estimate of
the expectation :
Hence from eq. 3 we may easily compute :

(4) 
Eq. 4 shows also more clearly that we have a very stable and reliable clustering ( close to 1),
if and only if is close to 0.
Next: Statistical tests to assess
Up: Background on stability based
Previous: Stability based methods