In the companion paper [C. Maugis and B. Michel, A non asymptotic penalized criterion for Gaussian mixture model selection. ESAIM: P&S 15 (2011) 41-68], a penalized likelihood criterion is proposed to select a Gaussian mixture model from a specific model collection. This criterion depends on unknown constants that must be calibrated in practice. A "slope heuristics" method is described and tested experimentally to address this calibration problem. In a model-based clustering context, the specific form of the Gaussian mixtures considered allows us to detect noisy variables, which improves both the clustering of the data and its interpretation. The behavior of our data-driven criterion is illustrated on simulated datasets, a curve clustering example and a genomics application.
Keywords: slope heuristics, penalized likelihood criterion, model-based clustering, noisy variable detection
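The calibration idea named in the abstract can be illustrated with a minimal sketch. It assumes a penalty proportional to the model dimension and the usual slope-heuristics rule: for the most complex models the negative maximized log-likelihood is roughly linear in the dimension, the magnitude of that slope estimates a minimal penalty constant, and the penalty actually used is twice the minimal one. The function name `slope_heuristic_select`, the parameter `frac_largest` and the synthetic values are hypothetical and are not taken from the paper.

```python
import numpy as np


def slope_heuristic_select(dims, neg_loglik, frac_largest=0.5):
    """Pick a model dimension with a slope-heuristics penalty (illustrative sketch).

    dims         : model dimensions (number of free parameters), one per model
    neg_loglik   : negative maximized log-likelihood of each model
    frac_largest : fraction of the most complex models used for the linear fit
                   (a hypothetical tuning choice, not prescribed by the paper)
    """
    dims = np.asarray(dims, dtype=float)
    neg_loglik = np.asarray(neg_loglik, dtype=float)

    # Sort the models by increasing dimension.
    order = np.argsort(dims)
    dims, neg_loglik = dims[order], neg_loglik[order]

    # Fit a straight line to the most complex models only.
    n_fit = max(2, int(np.ceil(frac_largest * len(dims))))
    slope, _intercept = np.polyfit(dims[-n_fit:], neg_loglik[-n_fit:], 1)

    # The contrast decreases with dimension, so the minimal constant is -slope.
    kappa_min = -slope

    # Penalized criterion with twice the minimal penalty.
    criterion = neg_loglik + 2.0 * kappa_min * dims
    best = int(np.argmin(criterion))
    return dims[best], kappa_min


# Synthetic example: the bias term vanishes from dimension 10 onward,
# after which the contrast decreases linearly with slope -2.
dims = np.arange(1, 31)
neg_loglik = 100.0 * np.maximum(0, 10 - dims) - 2.0 * dims + 500.0
selected, kappa = slope_heuristic_select(dims, neg_loglik)
```

In this synthetic setting the fit uses only models in the linear regime, so the estimated constant matches the slope built into the data; on real data the choice of which models enter the fit is itself a judgment call.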
@article{PS_2011__15__320_0,
  author = {Maugis, Cathy and Michel, Bertrand},
  title = {Data-driven penalty calibration: {A} case study for gaussian mixture model selection},
  journal = {ESAIM: Probability and Statistics},
  pages = {320--339},
  publisher = {EDP-Sciences},
  volume = {15},
  year = {2011},
  doi = {10.1051/ps/2010002},
  mrnumber = {2870518},
  language = {en},
  url = {http://www.numdam.org/articles/10.1051/ps/2010002/}
}
Maugis, Cathy; Michel, Bertrand. Data-driven penalty calibration: A case study for gaussian mixture model selection. ESAIM: Probability and Statistics, Volume 15 (2011), pp. 320-339. doi: 10.1051/ps/2010002. http://www.numdam.org/articles/10.1051/ps/2010002/