Specific Gaussian mixtures are considered in order to solve variable selection and clustering problems simultaneously. A non-asymptotic penalized criterion is proposed to choose the number of mixture components and the relevant variable subset. Because of the nonlinearity of the associated Kullback-Leibler contrast on Gaussian mixtures, a general model selection theorem for maximum likelihood estimation proposed by Massart [Concentration inequalities and model selection. Springer, Berlin (2007). Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6-23, 2003] is used to obtain the form of the penalty function. This theorem requires control of the bracketing entropy of Gaussian mixture families. Both the ordered and non-ordered variable selection cases are addressed in this paper.
Keywords: model-based clustering, variable selection, penalized likelihood criterion, bracketing entropy
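As a reading aid, the generic shape of a penalized maximum likelihood criterion of the kind the abstract refers to can be sketched as follows. The notation is an assumption made here for illustration and is not taken from the abstract: a model $m = (K, v)$ pairs a number of mixture components $K$ with a variable subset $v$, $S_m$ is the corresponding family of Gaussian mixture densities, $\mathcal{M}$ is the model collection, and $X_1, \dots, X_n$ is the sample. The block assumes the `amsmath` package for `\operatorname*`.

```latex
% Generic penalized maximum likelihood selection (assumed notation, illustrative only).
% gamma_n is the empirical log-likelihood contrast; \hat{s}_m is the MLE in model S_m.
\[
\gamma_n(t) = -\frac{1}{n}\sum_{i=1}^{n} \ln t(X_i), \qquad
\hat{s}_m = \operatorname*{argmin}_{t \in S_m} \gamma_n(t),
\]
% The selected model minimizes the penalized contrast over the collection.
\[
\operatorname{crit}(m) = \gamma_n(\hat{s}_m) + \operatorname{pen}(m), \qquad
\hat{m} = \operatorname*{argmin}_{m \in \mathcal{M}} \operatorname{crit}(m).
\]
```

The paper's contribution concerns the form that $\operatorname{pen}(m)$ should take; that form is not stated in the abstract and is deliberately left unspecified in this sketch.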
@article{PS_2011__15__41_0,
  author = {Maugis, Cathy and Michel, Bertrand},
  title = {A non asymptotic penalized criterion for {G}aussian mixture model selection},
  journal = {ESAIM: Probability and Statistics},
  pages = {41--68},
  publisher = {EDP-Sciences},
  volume = {15},
  year = {2011},
  doi = {10.1051/ps/2009004},
  mrnumber = {2870505},
  language = {en},
  url = {http://www.numdam.org/articles/10.1051/ps/2009004/}
}
Maugis, Cathy; Michel, Bertrand. A non asymptotic penalized criterion for Gaussian mixture model selection. ESAIM: Probability and Statistics, Volume 15 (2011), pp. 41-68. doi: 10.1051/ps/2009004. http://www.numdam.org/articles/10.1051/ps/2009004/
[1] Information theory and an extension of the maximum likelihood principle, in Second International Symposium on Information Theory (Tsahkadsor, 1971), Akadémiai Kiadó, Budapest (1973) 267-281.
[2] Data-driven calibration of penalties for least-squares regression. J. Mach. Learn. Res. (2008) (to appear).
[3] Model-based Gaussian and non-Gaussian clustering. Biometrics 49 (1993) 803-821.
[4] Risk bounds for model selection via penalization. Prob. Th. Rel. Fields 113 (1999) 301-413.
[5] Clustering through model selection criteria. Poster session at One Day Statistical Workshop in Lisieux. http://www.math.u-psud.fr/ baudry, June (2007).
[6] Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000) 719-725.
[7] Model-based cluster and discriminant analysis with the mixmod software. Comput. Stat. Data Anal. 51 (2006) 587-600.
[8] Gaussian model selection. J. Eur. Math. Soc. 3 (2001) 203-268.
[9] A generalized Cp criterion for Gaussian model selection. Preprint no. 647, Universités de Paris 6 et Paris 7 (2001).
[10] Minimal penalties for Gaussian model selection. Prob. Th. Rel. Fields 138 (2007) 33-73.
[11] From model selection to adaptive estimation, in Festschrift for Lucien Le Cam. Springer, New York (1997) 55-87.
[12] High-Dimensional Data Clustering. Comput. Stat. Data Anal. 52 (2007) 502-519.
[13] Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer-Verlag, New York, 2nd edition (2002).
[14] Modified Akaike's criterion for histogram density estimation. Technical report, Université Paris-Sud 11 (1999).
[15] Density estimation via exponential model selection. IEEE Trans. Inf. Theory 49 (2003) 2052-2060.
[16] Gaussian parsimonious clustering models. Pattern Recogn. 28 (1995) 781-793.
[17] Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soc., Ser. B 39 (1977) 1-38.
[18] Rates of convergence for the Gaussian mixture sieve. Ann. Stat. 28 (2000) 1105-1127.
[19] Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. Ann. Stat. 29 (2001) 1233-1263.
[20] Consistent estimation of the order of mixture models. Sankhyā. The Indian Journal of Statistics. Series A 62 (2000) 49-66.
[21] Simultaneous feature selection and clustering using mixture models. IEEE Trans. Pattern Anal. Mach. Intell. 26 (2004) 1154-1166.
[22] Detecting multiple change-points in the mean of Gaussian process by model selection. Signal Proc. 85 (2005) 717-736.
[23] Potentiel de réserves d'un bassin pétrolier: modélisation et estimation. Ph.D. thesis, Université Paris-Sud 11 (2002).
[24] Concentration inequalities and model selection. Springer, Berlin (2007). Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6-23 (2003).
[25] Sélection de variables pour la classification non supervisée par mélanges gaussiens. Applications à l'étude de données transcriptomes. Ph.D. thesis, Université Paris-Sud 11 (2008).
[26] Variable Selection for Clustering with Gaussian Mixture Models. Biometrics (2008) (to appear).
[27] Slope heuristics for variable selection and clustering via Gaussian mixtures. Technical Report 6550, INRIA (2008).
[28] Variable Selection for Model-Based Clustering. J. Am. Stat. Assoc. 101 (2006) 168-178.
[29] Estimating the dimension of a model. Ann. Stat. 6 (1978) 461-464.
[30] Matrices. Springer-Verlag, New York (2002).
[31] Concentration of measure and isoperimetric inequalities in product spaces. Publ. Math., Inst. Hautes Étud. Sci. 81 (1995) 73-205.
[32] New concentration inequalities in product spaces. Invent. Math. 126 (1996) 505-563.
[33] Tests et sélection de modèles pour l'analyse de données protéomiques et transcriptomiques. Ph.D. thesis, Université Paris-Sud 11 (2007).