Comparison of gene selection results obtained with real and synthetic data

In this section we present more detailed results on the percentage of genes commonly selected by each pair of the considered gene selection methods ($ 10$ pairs), on both real and synthetic data. We repeated these experiments $ 5$ times for each data set, for a total of $ 50$ pairwise comparisons of gene selection methods per data set. The results for the Colon and Leukemia data sets are reported in Fig. 3; Fig. 4 shows the results for the DLBCL-FL and DLBCL-Outcome data sets.

These figures show that many of the plotted points lie close to the bisector: the percentage of commonly selected genes is nearly the same with real and synthetic gene expression data, regardless of the considered pair of gene selection methods. In other words, the results achieved by gene selection methods on real and synthetic data are comparable, independently of the applied gene selection method.
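As an illustration of the quantity plotted on each axis, the following minimal sketch computes the percentage of common top-ranked genes for every pair of methods. It is not the authors' code: the function names, the method labels and the convention that a higher score means a more relevant gene are assumptions made for the example, and ties are broken arbitrarily by the sort.

```python
from itertools import combinations

import numpy as np


def top_k_genes(relevance, k):
    """Return the indices of the k genes with the highest relevance score."""
    order = np.argsort(relevance)[::-1]  # assumed: higher score = more relevant
    return set(order[:k].tolist())


def common_gene_percentages(relevances, k):
    """Percentage of shared top-k genes for every pair of methods.

    `relevances` maps a (hypothetical) method name to its relevance vector.
    """
    tops = {name: top_k_genes(scores, k) for name, scores in relevances.items()}
    return {
        (a, b): 100.0 * len(tops[a] & tops[b]) / k
        for a, b in combinations(sorted(tops), 2)
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Five placeholder methods scoring 2000 genes with random relevances.
    scores = {f"method_{i}": rng.random(2000) for i in range(5)}
    overlaps = common_gene_percentages(scores, k=200)
    print(len(overlaps), "pairwise comparisons")
```

Running the same function once on the real data and once on the synthetic data gives, for each pair of methods, the ordinate and the abscissa of one point in the figures; with five methods there are $ 10$ pairs.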

Figure 3: Comparison of the percentage of common genes obtained with synthetic (abscissa) and real (ordinate) data for all the pairwise combinations of gene selection methods. Colon: a) 200 and b) 400 top ranked genes. Leukemia: c) 200 and d) 400 top ranked genes.
\includegraphics[width = 7.2cm]{Colon200.eps} \includegraphics[width = 7.2cm]{Colon400.eps}
(a) (b)
\includegraphics[width = 7.2cm]{Leukemia200.eps} \includegraphics[width = 7.2cm]{Leukemia400.eps}
(c) (d)

Figure 4: Comparison of the percentage of common genes obtained with synthetic (abscissa) and real (ordinate) data for all the pairwise combinations of gene selection methods. DLBCL-FL: a) 200 and b) 400 top ranked genes. DLBCL-Outcome: c) 200 and d) 400 top ranked genes.
\includegraphics[width = 7.2cm]{DLBCL-FL200.eps} \includegraphics[width = 7.2cm]{DLBCL-FL400.eps}
(a) (b)
\includegraphics[width = 7.2cm]{DLBCL-Outcome200.eps} \includegraphics[width = 7.2cm]{DLBCL-Outcome400.eps}
(c) (d)

Comparing the most relevant selected genes is the natural approach to evaluating the usefulness of simulated data sets in assessing the performance of gene selection methods. Nevertheless, it is also useful to compare real and synthetic data by considering the whole gene relevance distribution computed by each gene selection method.

To this end, we considered the entire unordered normalized relevance vector generated by each gene selection method, and we computed, on both real and synthetic data, the $ L_1$-distance between the relevance vectors associated with each pair of gene selection methods. We used the distance induced by the $ L_1$ norm because of its robustness to outliers, whereas several other distance measures, such as the Euclidean distance, are sensitive to them. The value of $ L_1(x, y)$, where $ x$ and $ y$ are two normalized $ n$-dimensional vectors and $ n$ is the number of genes, lies in the $ [0, 2]$ interval. Values close to $ 0$ indicate that the two vectors are very similar; conversely, the greater the value, the more the two vectors differ. In Fig. 5 we report the $ L_1$-distance results for all four considered data sets: each point represents the $ L_1$-distance between the gene relevance vectors generated by a specific pair of gene selection methods with synthetic (abscissa) and real (ordinate) gene expression data. Each of the $ 10$ different pairwise comparisons has been repeated $ 5$ times for each data set, thus resulting in $ 50$ points for each graph.
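The distance described above can be sketched as follows. The text does not state how the relevance vectors are normalized; this example assumes each vector is scaled to unit $ L_1$ norm, which is consistent with the stated $ [0, 2]$ range. The function names are hypothetical.

```python
import numpy as np


def l1_normalize(relevance):
    """Scale a relevance vector to unit L1 norm (assumed normalization)."""
    v = np.asarray(relevance, dtype=float)
    total = np.abs(v).sum()
    if total == 0:
        raise ValueError("relevance vector must not be all zeros")
    return v / total


def l1_distance(x, y):
    """L1 distance between two relevance vectors after normalization.

    Genes are compared position by position, so both vectors must list
    the genes in the same (unsorted) order.
    """
    x = l1_normalize(x)
    y = l1_normalize(y)
    if x.shape != y.shape:
        raise ValueError("vectors must have the same number of genes")
    return float(np.abs(x - y).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.random(2000)  # placeholder relevance scores from one method
    b = rng.random(2000)  # placeholder relevance scores from another method
    print(l1_distance(a, b))
```

Under this normalization, identical vectors give a distance of $ 0$, and two vectors with disjoint support reach the upper bound of $ 2$.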

The proximity of the points to the bisector for all four data sets reveals that the considered gene selection methods behave similarly on real and synthetic data in this case as well, which further supports the usefulness of simulated data for evaluating the performance of gene selection methods.

Figure 5: Comparison of the $ L_1$-distances obtained with synthetic (abscissa) and real (ordinate) data for all the pairwise combinations of gene selection methods. a) Colon, b) Leukemia, c) DLBCL-FL, d) DLBCL-Outcome.
\includegraphics[width = 7.2cm]{ColonL1.eps} \includegraphics[width = 7.2cm]{LeukemiaL1.eps}
(a) (b)
\includegraphics[width = 7.2cm]{DLBCL-FL-L1.eps} \includegraphics[width = 7.2cm]{DLBCL-Outcome-L1.eps}
(c) (d)