Introduction

Gene selection is an important problem in bioinformatics [9,10,11].

In particular, evaluating and validating gene selection methods on real gene expression data is difficult and in many cases infeasible. From a biological point of view, the subsets of modulated genes associated with a specific phenotype are usually unknown. From a machine learning standpoint, gene selection is an instance of the feature selection problem, which is NP-hard [4].

We propose a model and a procedure for generating biologically plausible artificial data, under the assumption that a gene selection method that achieves good results on these data can reasonably be expected to work well on real data too.

Furthermore, we suggest a procedure for evaluating and comparing the performance of gene selection methods, exploiting the fact that the structure of the data generated with the proposed model is completely known.
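To illustrate why a completely known data structure makes such a comparison possible, the following minimal sketch scores the gene lists returned by two selection methods against the set of genes known to be relevant. It is an assumption-laden illustration, not the evaluation procedure of this work: the function name `selection_scores`, the gene identifiers, and the choice of precision, recall and F1 as scores are all hypothetical.

```python
def selection_scores(selected, relevant):
    """Precision, recall and F1 of a selected gene list against the known relevant genes.

    Hypothetical scoring for illustration only; not the procedure defined in the paper.
    """
    selected, relevant = set(selected), set(relevant)
    hits = len(selected & relevant)  # relevant genes that were actually selected
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# With artificial data the relevant genes are known by construction (made-up names).
relevant_genes = {"g1", "g2", "g3", "g4"}

# Hypothetical outputs of two gene selection methods on the same artificial data set.
method_a = ["g1", "g2", "g7"]
method_b = ["g1", "g2", "g3", "g9", "g10", "g11"]

scores_a = selection_scores(method_a, relevant_genes)
scores_b = selection_scores(method_b, relevant_genes)
```

In this toy case the first list is more precise while the second recovers more of the relevant genes, a trade-off that cannot be measured on real data, where the relevant set is unknown.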

In this supplementary information we discuss in more detail some mathematical aspects of the model and present examples and experimental results obtained with data generated through the model.

In the section Irredundant PDNF for positive boolean functions we give the mathematical definition of the irredundant Positive Disjunctive Normal Form (PDNF) for positive Boolean functions, which constitutes the mathematical basis of the model.

In the section Relevance and irredundance we show, through an example, the relationship between the irredundant PDNF and the relevance of a variable representing a gene.

In the section Compactness and stability of EP form we discuss the compactness and the stability of the EP (Expression Profile) form, that is, the mathematical representation of biological expression profiles [1].

In the section Procedure for the generation of artificial gene expression data, the input parameters of the algorithmic procedure derived from our proposed mathematical model are described in detail.

Then a procedure to compare the effectiveness of gene selection methods is described and a simple example is provided.

The section Experimental results reports the results obtained by applying SVM-RFE [8] and Golub's method [7] to the colon-cancer-like artificial data sets generated with the proposed model. This section also provides the parameters used to simulate four real gene expression data sets through our proposed model, as well as the parameters of the Gaussian distributions used to generate "raw" gene expression data. Finally, it gives a detailed comparison of the results obtained on the real data and on the corresponding artificial data using five different gene selection methods.