A procedure to compare the effectiveness of gene selection methods

We propose here a simple procedure for evaluating and comparing feature selection methods; other evaluation approaches can also be applied to the synthetic data generated by our mathematical model.

Suppose we want to compare and evaluate $ H$ feature selection methods $ F_i$, $ i=1,\dots,H$, on a single artificial dataset $ A$ created by using the proposed model.

Let $ R$ denote the set of true relevant genes, that is, all the genes included in the two distinct expression profiles. If $ r$ is the total number of genes, we have $ \vert R\vert = k \leq r$. The result produced by each feature selection method $ F_i$ is a list $ L_{ik}$ containing the $ k$ most relevant genes identified by $ F_i$. We can evaluate the performance of the method $ F_i$ as the fraction $ p_i$ of overlap between the set $ R$ of the true relevant genes and the set $ L_{ik}$ of the genes identified by $ F_i$:

$\displaystyle p_i=\frac{\vert R \cap L_{ik} \vert}{k}, \qquad i=1,\dots,H \qquad (8)$
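As an illustration of Eq. (8), the following minimal Python sketch computes the overlap fraction for one method. The function name `overlap_fraction`, the gene labels, and the toy ranking are hypothetical and serve only to make the formula concrete; the ranked list is assumed to be sorted from most to least relevant.

```python
def overlap_fraction(relevant, ranked):
    """Overlap fraction of Eq. (8): |R ∩ L_ik| / k, with k = |R|."""
    relevant = set(relevant)
    k = len(relevant)
    if k == 0:
        raise ValueError("the set of relevant genes must not be empty")
    top_k = set(ranked[:k])  # L_ik: the k top-ranked genes of the method
    return len(relevant & top_k) / k


# Hypothetical toy example: 4 true relevant genes, a ranking of 6 genes.
R = {"g1", "g2", "g3", "g4"}
ranking = ["g2", "g7", "g1", "g4", "g3", "g9"]
p = overlap_fraction(R, ranking)
print(p)  # 0.75: three of the four top-ranked genes are in R
```

Because the list is truncated at $ k = \vert R\vert$, a method reaches $ p_i = 1$ only if its $ k$ top-ranked genes are exactly the relevant ones.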

We can compare the performances of the $ H$ methods by ordering the $ p_i$ values, for $ i=1,\dots,H$. In particular, the higher the value of $ p_i$, the more effective the corresponding method $ F_i$.
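The ordering step can be sketched in a few lines of Python. The method labels and the $ p_i$ values below are invented for illustration and are not results from any experiment.

```python
# Hypothetical overlap fractions p_i for H = 3 methods on a single data set.
p = {"F_1": 0.60, "F_2": 0.85, "F_3": 0.72}

# Order the methods from most to least effective (highest p_i first).
ranked_methods = sorted(p, key=p.get, reverse=True)
print(ranked_methods)  # ['F_2', 'F_3', 'F_1']
```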

Since a comparison based on a single data set may lead to unreliable conclusions, the procedure can be repeated for a fixed number $ T$ of different instances $ A_j$, with $ j=1,\dots,T$, obtained from the same mathematical model. Thus, for each method $ F_i$, we obtain a vector $ \boldsymbol{p}_i$ whose $ j$-th component $ p_{ij}$, $ j=1,\dots,T$, is the fraction of overlap between the set $ R_j$ of relevant genes of the $ j$-th data set $ A_j$ and the set $ L_{ij}$ of the first $ k_j = \vert R_j\vert$ genes obtained by applying the method $ F_i$ to $ A_j$.
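The repeated evaluation can be sketched as follows. This is only an illustration under strong assumptions: `make_toy_dataset` is a hypothetical stand-in for the proposed mathematical model (it merely shifts the mean of the relevant genes in one class), and the two ranking functions are simple placeholder feature selection methods, not the methods considered in this work.

```python
import random
from statistics import mean, pvariance


def make_toy_dataset(rng, n_genes=50, n_relevant=5, n_samples=20):
    """Hypothetical stand-in for the generator: relevant genes are shifted in class 1."""
    relevant = set(rng.sample(range(n_genes), n_relevant))
    labels = [s % 2 for s in range(n_samples)]
    data = [
        [rng.gauss(2.0 if (g in relevant and y == 1) else 0.0, 1.0)
         for g in range(n_genes)]
        for y in labels
    ]
    return data, labels, relevant


def rank_by_mean_difference(data, labels):
    """Placeholder method: rank genes by |mean(class 0) - mean(class 1)|."""
    def score(g):
        a = [row[g] for row, y in zip(data, labels) if y == 0]
        b = [row[g] for row, y in zip(data, labels) if y == 1]
        return abs(mean(a) - mean(b))
    return sorted(range(len(data[0])), key=score, reverse=True)


def rank_by_variance(data, labels):
    """Placeholder method: rank genes by overall variance (labels are ignored)."""
    def score(g):
        return pvariance([row[g] for row in data])
    return sorted(range(len(data[0])), key=score, reverse=True)


def overlap_fraction(relevant, ranked):
    """p_ij = |R_j ∩ L_ij| / k_j, with k_j = |R_j|."""
    k = len(relevant)
    return len(relevant & set(ranked[:k])) / k


def evaluate(methods, datasets):
    """Return the H x T matrix of overlap fractions (one row per method)."""
    return [
        [overlap_fraction(rel, method(data, labels)) for data, labels, rel in datasets]
        for method in methods
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
T = 4
datasets = [make_toy_dataset(rng) for _ in range(T)]
methods = [rank_by_mean_difference, rank_by_variance]
P = evaluate(methods, datasets)
for method, row in zip(methods, P):
    print(method.__name__, row)
```

Each row of `P` plays the role of the vector $ \boldsymbol{p}_i$; all methods are applied to the same $ T$ instances so that their components can be compared data set by data set.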

For the sake of simplicity, the procedure is schematized in Tab. 3, where each row represents a different feature selection method and each column corresponds to a different data set. Two methods $ F_i$ and $ F_k$ can be compared simply by counting how many components of the $ i$-th row vector are higher than the corresponding components of the $ k$-th one, or by comparing the means and the standard deviations of the two row vectors.
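Both comparison criteria can be sketched in Python as follows. The two row vectors are invented values, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not specify which estimator is intended.

```python
from statistics import mean, stdev


def count_wins(row_a, row_b):
    """Number of data sets on which the first method strictly beats the second."""
    return sum(a > b for a, b in zip(row_a, row_b))


# Hypothetical row vectors for two methods over T = 4 data sets.
p_i = [0.80, 0.60, 0.90, 0.70]
p_k = [0.70, 0.60, 0.50, 0.90]

wins_i = count_wins(p_i, p_k)  # 2 (first and third data sets)
wins_k = count_wins(p_k, p_i)  # 1 (fourth data set); the tie counts for neither
print(wins_i, wins_k)

# Summary statistics of each row vector (sample standard deviation assumed).
print(mean(p_i), stdev(p_i))
print(mean(p_k), stdev(p_k))
```

Ties are counted for neither method here; a different tie-handling rule would be an equally plausible reading of the text.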


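Table 3: Procedure for evaluating $ H$ feature selection methods through $ T$ data sets. Each row corresponds to a feature selection method and each column to a data set.

| Feature selection method | $ A_1$ | $ \dots$ | $ A_j$ | $ \dots$ | $ A_T$ |
| --- | --- | --- | --- | --- | --- |
| $ F_1$ | $ p_{11}$ | | | | $ p_{1T}$ |
| $ \dots$ | | | | | |
| $ F_i$ | | | $ p_{ij}=\frac{\vert R_j \cap L_{ij}\vert}{k_j}$ | | |
| $ \dots$ | | | | | |
| $ F_H$ | $ p_{H1}$ | | | | $ p_{HT}$ |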