In this paper we propose a new method to measure the contribution of discretized features to supervised learning and discuss its application to biological data analysis. We restrict the description and the experiments to the most representative case: discretization into two intervals and samples belonging to two classes. To test the validity of the method, we measure the abundance of different explanatory models that can be derived from a given set of binary features. We compare the performance of our algorithm with that of popular feature selection methods on three publicly available gene expression data sets. The comparison favours the proposed method.
DOI : 10.1051/ro/2015045
Keywords: Feature selection, discretization, data mining
@article{RO_2016__50_2_437_0,
  author = {Santoni, Daniele and Weitschek, Emanuel and Felici, Giovanni},
  title = {Optimal discretization and selection of features by association rates of joint distributions},
  journal = {RAIRO - Operations Research - Recherche Op\'erationnelle},
  pages = {437--449},
  publisher = {EDP-Sciences},
  volume = {50},
  number = {2},
  year = {2016},
  doi = {10.1051/ro/2015045},
  zbl = {1341.62188},
  mrnumber = {3479881},
  language = {en},
  url = {http://www.numdam.org/articles/10.1051/ro/2015045/}
}
Santoni, Daniele; Weitschek, Emanuel; Felici, Giovanni. Optimal discretization and selection of features by association rates of joint distributions. RAIRO - Operations Research - Recherche Opérationnelle, Volume 50 (2016) no. 2, pp. 437-449. doi: 10.1051/ro/2015045. http://www.numdam.org/articles/10.1051/ro/2015045/