
Overview of the mosclust R package

The mosclust R package (the name stands for model order selection for clustering problems) implements a set of functions to discover significant structures in bio-molecular data. One of the main problems in unsupervised cluster analysis is assessing the "natural" number of clusters. Several methods and software tools have been proposed to tackle this problem (see [11] for a recent review).

Recently, several methods based on the concept of stability have been proposed to estimate the "optimal" number of clusters in complex bio-molecular data [8,18,16,9,21]. In this conceptual framework, multiple clusterings are obtained by introducing perturbations into the original data, and a clustering is considered reliable if it is approximately maintained across multiple perturbations.

Several perturbation techniques have been proposed, ranging from bootstrap techniques [15,4,18], to random projections to lower-dimensional subspaces [20,7], to noise injection procedures [17]. All these perturbation techniques are implemented in mosclust.
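The three families of perturbation can be illustrated with a minimal Python sketch. This is not the mosclust API: the function names, the parameters and the toy data are all hypothetical, and random subsampling without replacement is used here only as a simple stand-in for the resampling techniques cited above.

```python
import math
import random


def subsample(data, fraction, rng):
    """Draw a random subset of the rows, without replacement (assumed variant)."""
    size = max(1, int(round(fraction * len(data))))
    indices = sorted(rng.sample(range(len(data)), size))
    return indices, [data[i] for i in indices]


def random_projection(data, target_dim, rng):
    """Project each row onto target_dim random Gaussian directions."""
    source_dim = len(data[0])
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(source_dim)]
              for _ in range(target_dim)]
    return [[sum(w * x for w, x in zip(direction, row)) for direction in matrix]
            for row in data]


def inject_noise(data, sigma, rng):
    """Add independent zero-mean Gaussian noise to every value."""
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in data]


# Toy data: 20 examples with 10 features each (purely illustrative).
rng = random.Random(0)
data = [[rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(20)]

kept, sub = subsample(data, 0.8, rng)        # 16 of the 20 examples
projected = random_projection(data, 3, rng)  # 20 examples, 3 features
noisy = inject_noise(data, 0.1, rng)         # same shape as the input
```

Each function returns a perturbed copy of the data on which the clustering algorithm can be run again; `subsample` also returns the kept indices, so that clusterings of different subsets can later be compared on their shared examples.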

The library implements indices of the stability (reliability) of a clustering. These indices are based on the distribution of similarity measures between multiple clusterings, each computed on a different instance of the data obtained through a given random perturbation of the original data.
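As a rough illustration of the idea, the sketch below computes one possible similarity between two clusterings (the Jaccard coefficient over co-clustered pairs of examples) and summarizes a collection of such similarities by their mean. Both the choice of similarity and the use of the mean as a score are assumptions made for this example, as are the function names; the indices actually implemented in mosclust are described in [6,5].

```python
from itertools import combinations
from statistics import mean


def jaccard_similarity(labels_a, labels_b):
    """Jaccard coefficient over pairs of examples co-clustered in a or in b.

    Both label lists must refer to the same examples, in the same order.
    """
    both = either = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        both += same_a and same_b
        either += same_a or same_b
    # If no pair is co-clustered in either labeling, treat them as identical.
    return both / either if either else 1.0


def stability_score(clustering_pairs):
    """Mean similarity over pairs of clusterings of perturbed data."""
    return mean([jaccard_similarity(a, b) for a, b in clustering_pairs])


# Two hypothetical pairs of clusterings of the same four examples.
pairs = [
    ([0, 0, 1, 1], [1, 1, 0, 0]),  # same partition, labels swapped
    ([0, 0, 1, 1], [0, 1, 1, 1]),  # partly different partition
]
print(stability_score(pairs))  # prints 0.625
```

The similarity depends only on which examples are grouped together, not on the label values, so relabeling the clusters leaves the score unchanged.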

These indices provide a "score" that can be used to compare the reliability of different clusterings. Moreover, statistical tests based on $\chi^2$ and on the classical Bernstein inequality [12] are implemented in order to assess the statistical significance of the discovered clustering solutions. With this approach we can also find multiple structures simultaneously present in the data. For instance, the data may exhibit a hierarchical structure, with subclusters inside other clusters; using the indices and the statistical tests implemented in mosclust, we may detect such structures at a given significance level.
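For reference, one common statement of the classical Bernstein inequality is recalled below. The symbols $X_i$, $n$, $M$ and $t$ are introduced only for this statement; how the inequality is turned into a test on the stability indices is not covered in this overview (see [6,5]). Let $X_1, \ldots, X_n$ be independent random variables with $E[X_i] = 0$ and $|X_i| \le M$ almost surely. Then, for every $t > 0$,

$$
P\left( \sum_{i=1}^{n} X_i \ge t \right) \le \exp\left( - \frac{t^2}{2 \left( \sum_{i=1}^{n} E[X_i^2] + M t / 3 \right)} \right)
$$

The bound requires no assumption about the shape of the distribution of the $X_i$ beyond boundedness, which is what distinguishes it from a test that relies on a specific distributional form.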

Summarizing, this package may be used for:

  1. Assessment of the reliability of a given clustering solution
  2. Clustering model order selection: what is the "natural" number of clusters in the data?
  3. Assessment of the statistical significance of a given clustering solution
  4. Discovery of multiple structures underlying the data: are there multiple reliable clustering solutions at a given significance level?

Note that this package cannot be used to assess the reliability of an individual cluster inside a given clustering (to this end you may use the clusterv R package).

The next section provides a background on stability methods, with a brief description of the stability indices and the statistical tests implemented in the package. For more details, please see [6,5].

Then a brief introduction to the functionalities and the usage of mosclust is given.

To download the R software and documentation (comprising the tutorial and the reference manual in PDF format), go to the section Download software and documentation.

The statistical tests implemented in the package have been designed with the theoretical and methodological contribution of Alberto Bertoni (DSI, Università degli Studi di Milano).

