A procedure to compare the effectiveness of gene selection methods

We propose here a simple procedure for evaluating and comparing feature selection methods; other approaches can also be applied to the synthetic data obtained from our mathematical model.

Suppose we want to evaluate and compare $H$ feature selection methods, indexed by $i=1,\dots,H$, on a single artificial dataset created by using the proposed model.

Denote by $R$ the set containing all the genes included in two distinct expression profiles, that is, the set of the true relevant genes. If $r$ is the total number of genes, we have $\vert R\vert = k \leq r$. The result produced by the $i$-th feature selection method consists of a list $L_{ik}$ containing the $k$ most relevant genes identified by that method. We can evaluate the performance of the $i$-th method as the fraction $p_i$ of overlap between the set $R$ of the true relevant genes and the set $L_{ik}$ of the genes identified by the method:

$\displaystyle p_i=\frac{\vert R \cap L_{ik} \vert}{k}$ , $\displaystyle i=1,\dots,H$

(8)
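
As an illustration, Eq. (8) can be computed with a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `overlap_fraction`, the gene identifiers, and the assumption that each method returns a duplicate-free list of genes ordered from most to least relevant are all hypothetical.

```python
def overlap_fraction(relevant, ranked_genes):
    """Fraction of truly relevant genes found among the top-k ranked genes, k = |R|."""
    relevant = set(relevant)
    k = len(relevant)
    if k == 0:
        raise ValueError("the set of relevant genes must not be empty")
    # L_ik: the k genes ranked highest by the method (assumed to contain no duplicates)
    top_k = set(ranked_genes[:k])
    return len(relevant & top_k) / k


# Toy example with hypothetical gene identifiers.
R = {"g1", "g2", "g3", "g4"}
ranking = ["g2", "g9", "g1", "g4", "g3", "g7"]
p = overlap_fraction(R, ranking)
print(p)
```

In this toy case three of the four relevant genes appear among the first four ranked genes, so the score is 0.75.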

We can compare the performances of the $H$ methods by ordering the $p_i$ values, for $i=1,\dots,H$. In particular, the higher the value $p_i$, the more effective the corresponding method.

Since this comparison, based on a single data set, may lead to unreliable conclusions, the procedure can be repeated for a fixed number $T$ of different dataset instances, indexed by $j=1,\dots,T$, obtained from the same mathematical model. Thus, for the $i$-th method, we obtain a vector $\boldsymbol{p}_i$ whose $j$-th component $p_{ij}$, $j=1,\dots,T$, is the fraction of overlap between the set $R_j$ of relevant genes of the $j$-th dataset and the set of the first $k_j = \vert R_j\vert$ genes obtained by applying the method to that dataset.
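
The repetition over $T$ datasets can be sketched as a nested loop that fills an $H \times T$ matrix of scores $p_{ij}$. The sketch below rests on assumptions that are not part of the original procedure: each method is modelled as a hypothetical callable that maps a dataset to a ranked gene list, each dataset is a toy dictionary of per-gene scores paired with its set of relevant genes, and the names `performance_matrix`, `by_score`, and `by_name` are illustrative only.

```python
def performance_matrix(methods, datasets):
    """Return P with P[i][j] = |R_j ∩ L_ij| / k_j.

    methods:  list of callables, each mapping a dataset to a ranked list of gene ids
    datasets: list of (data, relevant_genes) pairs, one per generated instance
    """
    matrix = []
    for method in methods:
        row = []
        for data, relevant in datasets:
            relevant = set(relevant)
            k_j = len(relevant)               # k_j = |R_j|
            top = set(method(data)[:k_j])     # first k_j genes selected by the method
            row.append(len(relevant & top) / k_j)
        matrix.append(row)
    return matrix


# Toy datasets: a hypothetical per-gene score plus the set of truly relevant genes.
datasets = [
    ({"g1": 0.9, "g2": 0.1, "g3": 0.8, "g4": 0.2}, {"g1", "g3"}),
    ({"g1": 0.3, "g2": 0.7, "g3": 0.6, "g4": 0.1}, {"g2", "g4"}),
]


def by_score(data):
    # Hypothetical method: rank genes by decreasing score.
    return sorted(data, key=data.get, reverse=True)


def by_name(data):
    # Hypothetical baseline: rank genes alphabetically, ignoring the data.
    return sorted(data)


P = performance_matrix([by_score, by_name], datasets)
print(P)
```

Each row of `P` plays the role of the vector $\boldsymbol{p}_i$ for one method, and each column corresponds to one generated dataset.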

For the sake of simplicity, the procedure is schematized in Tab. 3, where each row represents a different feature selection method and each column corresponds to a different dataset. Two methods can be compared simply by counting how many components of the row vector of the first method are higher than the corresponding components of the row vector of the second, or by calculating the mean and the standard deviation of each of the two row vectors.

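**Table 3:** Table describing the procedure for evaluating feature selection methods through training sets. Rows correspond to feature selection methods, columns to data sets.

| Feature selection method | Data set $1$ | $\dots$ | Data set $j$ | $\dots$ | Data set $T$ |
|---|---|---|---|---|---|
| $1$ | $p_{11}$ | $\dots$ | | $\dots$ | $p_{1T}$ |
| $\dots$ | | | | | |
| $i$ | | | $p_{ij}=\frac{\vert R_j \cap L_{ij}\vert}{k_j}$ | | |
| $\dots$ | | | | | |
| $H$ | $p_{H1}$ | $\dots$ | | $\dots$ | $p_{HT}$ |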
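
The pairwise comparison of two rows of Tab. 3 can be sketched as follows. The score vectors are made-up numbers used only for illustration, the function names are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say whether the sample or the population version is intended.

```python
from statistics import mean, stdev


def count_wins(row_a, row_b):
    """Count the datasets on which each method scores strictly higher than the other."""
    wins_a = sum(a > b for a, b in zip(row_a, row_b))
    wins_b = sum(b > a for a, b in zip(row_a, row_b))
    return wins_a, wins_b


def summarize(row):
    """Mean and sample standard deviation of one row (needs at least two datasets)."""
    return mean(row), stdev(row)


# Made-up score vectors for two methods over T = 4 datasets.
row_a = [0.75, 0.5, 1.0, 0.5]
row_b = [0.5, 0.5, 0.75, 0.75]

wins = count_wins(row_a, row_b)
mean_a, std_a = summarize(row_a)
mean_b, std_b = summarize(row_b)
print(wins, mean_a, mean_b)
```

Ties are counted for neither method here; a different convention for ties would change the win counts but not the means.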