We propose here a simple procedure for evaluating and comparing feature selection methods; other approaches can also be applied to the synthetic data generated by our mathematical model.
Suppose we want to compare and evaluate $m$ feature selection methods $F_1, F_2, \dots, F_m$ on a single artificial dataset $D$ created by using the proposed model.
Denote with $G$ the set containing all the genes included in two distinct expression profiles. If $N$ is the total number of genes, we have $|G| \le N$. The result produced by each feature selection method $F_i$ consists of a list $L_i$ with the $k$ most relevant genes identified by the method $F_i$. We can evaluate the performance of the method $F_i$ as the fraction of overlap $p_i$ between the set $G$ of the true relevant genes and the set $L_i$ of the genes identified by the method $F_i$.
We can compare the performances of the methods by ordering the overlap values $p_i$, for $i = 1, \dots, m$. In particular, the higher the value $p_i$, the more effective the corresponding method $F_i$.
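As a minimal sketch of the single-dataset comparison, the Python snippet below scores each method by the overlap between its selected genes and the true relevant genes, and then orders the methods by that score. The function names `overlap_fraction` and `rank_methods`, the toy gene identifiers, and the choice to normalize the overlap by the number of true relevant genes are assumptions made for illustration; the text does not fix them.

```python
def overlap_fraction(true_genes, selected_genes):
    """Fraction of the true relevant genes recovered by a method.

    Assumption: the overlap is normalized by the number of true relevant genes.
    """
    true_set = set(true_genes)
    if not true_set:
        raise ValueError("the set of true relevant genes must not be empty")
    return len(true_set & set(selected_genes)) / len(true_set)


def rank_methods(true_genes, selections):
    """Return (method name, score) pairs sorted from best to worst."""
    scores = {name: overlap_fraction(true_genes, genes)
              for name, genes in selections.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: gene identifiers are plain integers.
true_genes = {1, 2, 3, 4}
selections = {
    "method_A": [1, 2, 3, 9],  # recovers 3 of the 4 relevant genes
    "method_B": [1, 7, 8, 9],  # recovers 1 of the 4 relevant genes
    "method_C": [1, 2, 3, 4],  # recovers all 4 relevant genes
}
ranking = rank_methods(true_genes, selections)
print(ranking)
```

With this toy input, `method_C` ranks first, followed by `method_A` and `method_B`.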
Since this comparison, based on a single dataset, may lead to unreliable conclusions, the procedure can be repeated for a fixed number $n$ of different instances $D_j$, with $j = 1, \dots, n$, obtained from the same mathematical model. Thus, for each method $F_i$, we obtain a vector $\mathbf{p}_i$ whose $j$-th component $p_{ij}$, $j = 1, \dots, n$, is the overlap between the relevant genes $G_j$ of the $j$-th dataset $D_j$ and the set of the first $k$ genes obtained by applying the method $F_i$ to $D_j$.
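The repetition over several dataset instances can be sketched as follows. This is an illustrative outline only: the representation of a dataset as a dictionary mapping gene names to a relevance statistic, the two toy selectors `top_k_by_value` and `top_k_by_abs`, and the helper `score_table` are hypothetical stand-ins for real feature selection methods and for data generated by the model. The overlap is again normalized by the number of true relevant genes, which is an assumption.

```python
def top_k_by_value(dataset, k):
    """Toy selector: the k genes with the largest statistic."""
    return sorted(dataset, key=dataset.get, reverse=True)[:k]


def top_k_by_abs(dataset, k):
    """Toy selector: the k genes with the largest absolute statistic."""
    return sorted(dataset, key=lambda g: abs(dataset[g]), reverse=True)[:k]


def score_table(methods, instances, k):
    """Build one row of overlap scores per method, one column per dataset."""
    table = {}
    for name, select in methods.items():
        row = []
        for dataset, true_genes in instances:
            chosen = set(select(dataset, k))
            row.append(len(chosen & true_genes) / len(true_genes))
        table[name] = row
    return table


# Hypothetical instances: (gene -> statistic, set of true relevant genes).
instances = [
    ({"g1": 0.9, "g2": 0.8, "g3": -0.95, "g4": 0.1}, {"g1", "g3"}),
    ({"g1": 0.2, "g2": 0.7, "g3": 0.6, "g4": -0.1}, {"g2", "g3"}),
]
methods = {"by_value": top_k_by_value, "by_abs": top_k_by_abs}
table = score_table(methods, instances, k=2)
for name, row in table.items():
    print(name, row)
```

Each row of `table` plays the role of the score vector of one method, with one entry per dataset instance.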
For the sake of simplicity, the procedure is schematized in Tab. 3, where each row represents a different feature selection method and each column corresponds to a different dataset. The comparison between two methods $F_i$ and $F_h$ is simply given by counting how many components of the $i$-th row vector are higher than the corresponding components of the $h$-th one, or by calculating the mean and the standard deviation of each of the two row vectors.
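| Feature selection method | Data set $D_1$ | Data set $D_2$ | $\cdots$ | Data set $D_n$ |
|---|---|---|---|---|
| $F_1$ | $p_{11}$ | $p_{12}$ | $\cdots$ | $p_{1n}$ |
| $F_2$ | $p_{21}$ | $p_{22}$ | $\cdots$ | $p_{2n}$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ |
| $F_m$ | $p_{m1}$ | $p_{m2}$ | $\cdots$ | $p_{mn}$ |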
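The pairwise comparison of two rows of the table can be sketched in Python as below. The helper names `count_wins` and `summarize` and the score values are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) rather than the population one is an assumption, since the text does not say which is intended.

```python
from statistics import mean, stdev


def count_wins(row_a, row_b):
    """Number of datasets on which the first method scores strictly higher."""
    if len(row_a) != len(row_b):
        raise ValueError("both rows must cover the same datasets")
    return sum(a > b for a, b in zip(row_a, row_b))


def summarize(row):
    """Mean and sample standard deviation of one method's scores."""
    return mean(row), stdev(row)


# Hypothetical overlap scores of two methods on four dataset instances.
row_a = [0.8, 0.6, 0.9, 0.7]
row_b = [0.7, 0.6, 0.5, 0.9]

wins_a = count_wins(row_a, row_b)  # datasets where the first method is better
wins_b = count_wins(row_b, row_a)  # datasets where the second method is better
mean_a, std_a = summarize(row_a)
mean_b, std_b = summarize(row_b)
print(wins_a, wins_b)
print(mean_a, std_a)
print(mean_b, std_b)
```

Ties are counted for neither method here; counting wins and comparing means can disagree, so reporting both gives a fuller picture.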