

Statistical tests to assess the significance of clustering solutions

Consider a set of $k$-clusterings, $k \in \mathcal{K}$, where $\mathcal{K}$ is a set of numbers of clusters, and the corresponding set of integrals computed according to eq. 2. We thus obtain a set of values $\mathcal{G} = \{ g_k \vert k \in \mathcal{K} \}$; sorting $\mathcal{G}$ in ascending order yields $\hat{\mathcal{G}}$ with values $\hat{g}_i$. For each $k$-clustering we consider two groups of pairwise clustering similarity values, separated by a threshold $t^o$ (a reasonable choice is $t^o = 0.9$). We may then compute $P(S_k > t^o) = 1 - F_k(s=t^o)$, where $F_k(s=t^o)$ is computed according to eq. 1. If $n$ is the number of trials used to estimate $S_k$, then $x_k = P(S_k > t^o)\, n$ is the number of trials for which the similarity values are larger than $t^o$. The $x_k$ may be interpreted as successes from $\vert\mathcal{K}\vert$ binomial populations with parameters $\theta_k$. Let $X_k$ be the random variable that counts how many times $S_k > t^o$; for sufficiently large $n$, the following standardized variable is approximately distributed according to a standard normal distribution:
\begin{displaymath}
\frac{X_k - n \theta_k}{\sqrt{n \theta_k (1- \theta_k)}} \sim N(0,1)
\end{displaymath} (5)
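As an illustration (all variable names here are hypothetical, not part of the described method), $x_k$ can be estimated by simply counting how many of the $n$ similarity values exceed the threshold $t^o$:

```python
import random

random.seed(0)
t0 = 0.9   # similarity threshold t^o
n = 100    # number of perturbation trials

# hypothetical pairwise-similarity samples for one k-clustering
similarities = [random.uniform(0.7, 1.0) for _ in range(n)]

# x_k: number of trials whose similarity exceeds the threshold,
# i.e. an empirical estimate of P(S_k > t^o) * n
x_k = sum(s > t0 for s in similarities)
print(x_k, x_k / n)
```

The ratio `x_k / n` is the empirical estimate of the binomial parameter $\theta_k$.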

A sum of squared independent standard normal variables is distributed according to a $\chi^2$ distribution:
\begin{displaymath}
\sum_{k \in \mathcal{K}}\frac{(X_k - n \theta_k)^2}{n \theta_k (1- \theta_k)} \sim \chi^2_{\vert\mathcal{K}\vert}
\end{displaymath} (6)

Consider the null hypothesis $H_0$: all the $\theta_k$ are equal to a common value $\theta$, where the unknown $\theta$ is estimated through its pooled estimate $\hat{\theta} = \frac{\sum_{k \in \mathcal{K}} x_k}{\vert\mathcal{K}\vert \cdot n}$. The null hypothesis may then be tested against the alternative hypothesis that the $\theta_k$ are not all equal using the statistic
\begin{displaymath}
Y = \sum_{k \in \mathcal{K}}\frac{(x_k - n \hat{\theta})^2}{n \hat{\theta} (1- \hat{\theta})} \sim \chi^2_{\vert\mathcal{K}\vert-1}
\end{displaymath} (7)

If $Y \geq \chi^2_{\alpha, \vert\mathcal{K}\vert - 1}$ we may reject the null hypothesis at significance level $\alpha$; that is, we may conclude with probability $1 - \alpha$ that the considered proportions are not all equal, and hence that at least one $k$-clustering differs significantly from the others. Using the above test, we start by considering all the $k$-clusterings. If a significant difference is detected by the statistical test, we exclude the last clustering (according to the sorting of $\mathcal{G}$). This step is repeated until no significant difference is detected (or until only one clustering is left): the set of the remaining (top-sorted) $k$-clusterings represents the set of estimated stable numbers of clusters discovered (at significance level $\alpha$).
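A minimal sketch of this backward-elimination procedure, assuming the counts $x_k$ are already ordered from the most to the least stable clustering (the function name and the input counts are hypothetical):

```python
import numpy as np
from scipy.stats import chi2


def stable_clusterings(x, n, alpha=0.01):
    """Iteratively drop the worst-ranked clustering until the chi-square
    homogeneity test no longer rejects H0 (all theta_k equal).

    x : counts x_k of similarities above t^o, ordered from the most to
        the least stable clustering (ascending sort of the g_k values)
    n : number of trials per clustering
    Returns the number of top-ranked clusterings judged equally stable.
    """
    x = np.asarray(x, dtype=float)
    while len(x) > 1:
        theta = x.sum() / (len(x) * n)        # pooled estimate of theta
        if theta in (0.0, 1.0):               # degenerate case: all counts equal
            break
        y = np.sum((x - n * theta) ** 2 / (n * theta * (1 - theta)))
        if y < chi2.ppf(1 - alpha, df=len(x) - 1):
            break                             # no significant difference left
        x = x[:-1]                            # exclude the last (least stable)
    return len(x)


# hypothetical counts for |K| = 5 clusterings, n = 100 trials each
print(stable_clusterings([95, 93, 92, 40, 10], n=100))
```

With these illustrative counts the three nearly identical proportions survive the elimination, while the two clearly lower ones are discarded one at a time.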

It is worth noting that the above $\chi^2$-based procedure may also be applied to automatically find the optimal number of clusters, independently of the perturbation method applied.

Note that the previous $\chi^2$-based statistical test implicitly assumes that some of the underlying probability distributions are normal. Using the classical Bernstein inequality [12], we may instead apply a partially "distribution-independent" approach to assess the significance of the discovered clusterings; in this way we obtain a statistical test that makes no assumption about the distribution of the random variables. For more details see [5], downloadable from:
http://homes.dsi.unimi.it/valenti/papers/bertoni-vale-cisi06.pdf.
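For background, the classical Bernstein inequality bounds the tail probability of the mean of bounded i.i.d. variables without any normality assumption; a minimal sketch of the bound itself (the function and its parameters are illustrative, not the exact test described in [5]):

```python
import math


def bernstein_bound(n, t, var, m):
    """Classical Bernstein bound: for n i.i.d. variables with variance
    `var` and deviations bounded by |X - E[X]| <= m, the probability
    that the sample mean exceeds E[X] by at least t is at most
    exp(-n t^2 / (2 var + 2 m t / 3))."""
    return math.exp(-(n * t * t) / (2 * var + 2 * m * t / 3))


# e.g. 200 trials, deviation t = 0.1, variance 0.05, range bound m = 1
print(bernstein_bound(n=200, t=0.1, var=0.05, m=1.0))
```

The bound tightens exponentially as the number of trials grows, which is what makes a distribution-free significance assessment feasible.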

