Comparison of gene selection results obtained with real and synthetic data

These figures show that many of the represented points lie close to the bisector; this reveals that the percentage of commonly selected genes is nearly the same with real and synthetic gene expression data, regardless of the considered pair of gene selection methods. In other words, the results achieved by gene selection methods on real and synthetic data are comparable, independently of the applied gene selection method.
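As an illustration of the quantity plotted on each axis, the following minimal sketch computes the percentage of genes shared by the top-k lists of two gene selection methods. The function names and the toy relevance scores are hypothetical, and the sketch assumes that each method returns one relevance score per gene, with higher scores meaning more relevant genes; it is not the code used for the experiments.

```python
def top_k_genes(relevance, k):
    """Return the indices of the k genes with the highest relevance score."""
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    return set(ranked[:k])


def percent_common(relevance_a, relevance_b, k):
    """Percentage of genes shared by the top-k lists of two selection methods."""
    common = top_k_genes(relevance_a, k) & top_k_genes(relevance_b, k)
    return 100.0 * len(common) / k


# Toy relevance scores for 6 genes (made-up values, for illustration only).
method_a = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2]
method_b = [0.8, 0.7, 0.1, 0.2, 0.9, 0.3]

# Top-3 genes are {0, 2, 4} for method_a and {0, 1, 4} for method_b.
overlap = percent_common(method_a, method_b, k=3)
print(f"{overlap:.1f}% of the top-3 genes are shared")
```

Computing this value once on real data and once on synthetic data for the same pair of methods would give the ordinate and the abscissa of one point in the plots.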

Figure 3: Comparison of the percentage of common genes obtained with synthetic (abscissa) and real (ordinate) data for all the pairwise combinations of gene selection methods. Colon: a) 200 and b) 400 top ranked genes. Leukemia: c) 200 and d) 400 top ranked genes.

$\includegraphics[width = 7.2cm]{Colon200.eps}$	$\includegraphics[width = 7.2cm]{Colon400.eps}$
(a)	(b)
$\includegraphics[width = 7.2cm]{Leukemia200.eps}$	$\includegraphics[width = 7.2cm]{Leukemia400.eps}$
(c)	(d)

Figure 4: Comparison of the percentage of common genes obtained with synthetic (abscissa) and real (ordinate) data for all the pairwise combinations of gene selection methods. DLBCL-FL: a) 200 and b) 400 top ranked genes. DLBCL-Outcome: c) 200 and d) 400 top ranked genes.

$\includegraphics[width = 7.2cm]{DLBCL-FL200.eps}$	$\includegraphics[width = 7.2cm]{DLBCL-FL400.eps}$
(a)	(b)
$\includegraphics[width = 7.2cm]{DLBCL-Outcome200.eps}$	$\includegraphics[width = 7.2cm]{DLBCL-Outcome400.eps}$
(c)	(d)

Selecting the most relevant genes is the natural approach to evaluating the usefulness of simulated data sets in assessing the performance of gene selection methods. Nevertheless, it is also useful to compare real and synthetic data by considering the whole gene relevance distribution calculated by each gene selection method.

To this end, we considered the whole unsorted, normalized relevance vector generated by each gene selection method, and we computed, on both real and synthetic data, the $L_1$-distance between the relevance vectors associated with each pair of gene selection methods. We used the distance induced by the $L_1$ norm because of its robustness to outliers, whereas several other distance measures, such as the Euclidean one, are sensitive to them. The value of $\|x - y\|_1 = \sum_{i=1}^{n} |x_i - y_i|$, where $x$ and $y$ are two normalized $n$-dimensional vectors and $n$ is the number of genes, lies in a bounded interval whose lower end is 0. Values close to 0 indicate that the two vectors are very similar; conversely, the greater the value, the more the two vectors differ. In Fig. 5 we represent the $L_1$-distance results for all four considered data sets: each point represents the $L_1$-distance between the gene relevance vectors generated by a specific pair of gene selection methods with synthetic (abscissa) and real (ordinate) gene expression data. Each pairwise comparison has been repeated 5 times for each data set, so each graph contains five points per pair of methods.
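The distance computation can be sketched as follows. This is only an illustrative example: the text does not state how the relevance vectors are normalized, so the sketch assumes that each vector is divided by its own $L_1$ norm, and the function names and toy scores are hypothetical. Under that assumption, and for non-negative scores, the distance cannot exceed 2.

```python
def l1_normalize(relevance):
    """Scale a relevance vector so that its absolute values sum to 1.

    Assumption: normalization by the L1 norm; the text only says 'normalized'.
    """
    total = sum(abs(v) for v in relevance)
    if total == 0:
        raise ValueError("cannot normalize an all-zero relevance vector")
    return [v / total for v in relevance]


def l1_distance(x, y):
    """L1 (Manhattan) distance between two vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("relevance vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y))


# Toy relevance scores for 4 genes from three hypothetical methods.
scores_a = l1_normalize([3.0, 1.0, 0.0, 0.0])  # relevance on genes 0 and 1
scores_b = l1_normalize([0.0, 0.0, 1.0, 3.0])  # relevance on genes 2 and 3
scores_c = l1_normalize([1.0, 1.0, 1.0, 1.0])  # uniform relevance

d_same = l1_distance(scores_a, scores_a)      # identical vectors
d_disjoint = l1_distance(scores_a, scores_b)  # no relevance in common
d_partial = l1_distance(scores_a, scores_c)   # partial agreement
print(d_same, d_disjoint, d_partial)
```

Because the vectors are compared gene by gene, they must keep the original gene order rather than being sorted by relevance.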

The proximity of the points to the bisector for all four data sets reveals that, in this case too, the considered gene selection methods behave similarly on real and synthetic data. This is further evidence of the usefulness of simulated data for evaluating the performance of gene selection methods.

Figure 5: Comparison of the $L_1$ norm based measures obtained with synthetic (abscissa) and real (ordinate) data for all the pairwise combinations of gene selection methods. a) Colon b) Leukemia c) DLBCL-FL d) DLBCL-Outcome.

$\includegraphics[width = 7.2cm]{ColonL1.eps}$	$\includegraphics[width = 7.2cm]{LeukemiaL1.eps}$
(a)	(b)
$\includegraphics[width = 7.2cm]{DLBCL-FL-L1.eps}$	$\includegraphics[width = 7.2cm]{DLBCL-Outcome-L1.eps}$
(c)	(d)