   Next: Introduction to the functionalities Up: Background on stability based Previous: Stability indices

## Statistical tests to assess the significance of clustering solutions

Consider a set of k-clusterings , where is a set of numbers of clusters and the corresponding set of integrals computed according to eq. 2. Then we obtain a set of values . We can sort G obtaining with values in ascending order. For each k-clustering we consider two groups of pairwise clustering similarities values separated by a threshold (a reasonable threshold could be .). Thus we may obtain: , where is computed according to eq. 1. If represents the number of trials for estimating the value of then is the number of times for which the similarity values are larger than . The may be interpreted as the successes from binomial populations with parameters . If the number of trials is sufficiently large, and setting as a random variable that counts how many times , we have that the following random variable, for sufficiently large values of is distributed according to a normal distribution: (5)

A sum of i.i.d. squared normal variables is distributed according to a distribution: (6)

Considering the null hypothesis : all the are equal to , where the unknown is estimated through its pooled estimate , then the null hypothesis may be evaluated against the alternative hypothesis that the are not all equal using the statistic (7)

If we may reject the null hypothesis at significance level, that is we may conclude that with probability the considered proportions are different, and hence that at least one k-clustering significantly differ from the others. Using the above test we start considering all the k-clustering. If a significant difference is registered according to the statistical test we exclude the last clustering (according to the sorting of ). This is repeated until no significant difference is detected (or until only 1 clustering is left off): the set of the remaining (top sorted) k-clusterings represents the set of the estimate stable number of clusters discovered (at significance level).

It is worth noting that the above -based procedure be also applied to automatically find the optimal number of clusters independently of the applied perturbation method.

Anyway, note that with the previous -based statistical test we implicitly assume that some probability distributions are normal. Using the classical Bernstein inequality  we may apply a partially "distribution independent" approach to assess the significance of the discovered clustering. In this way we can apply a statistical test without any assumption about the distribution of the random variables. For more details see , downloadable from:
http://homes.dsi.unimi.it/valenti/papers/bertoni-vale-cisi06.pdf.   Next: Introduction to the functionalities Up: Background on stability based Previous: Stability indices