Special issue: mixture analysis
Bootstrap Validation of the Estimated Parameters in Mixture Models Used for Clustering
[Validation par bootstrap de l’estimation des paramètres d’un modèle de mélange utilisé en classification]
Journal de la société française de statistique, Tome 160 (2019) no. 1, pp. 114-129.

When a mixture model is used to perform clustering, the uncertainty is related both to the choice of an optimal model (including the number of clusters) and to the estimation of the parameters. We discuss here the computation of confidence intervals using different bootstrap approaches, which either mix or separate the two kinds of uncertainty. In particular, we suggest two new approaches that rely to some degree on the model specification considered as optimal by the researcher, and that address specifically the uncertainty related to parameter estimation. These methods are especially useful for poorly separated data or complex models, where the selected solution is difficult to recreate in each bootstrap sample, and they present the advantage of reducing the well-known label-switching issue. Two simulation experiments based on the Hidden Mixture Transition Distribution model for the clustering of longitudinal data illustrate our proposed bootstrap approaches.
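The general idea of bootstrapping mixture parameters while keeping the selected solution as a reference can be sketched as follows. This is a generic illustration, not the two approaches proposed in the article: it assumes a two-component univariate Gaussian mixture in place of the HMTD model, a plain nonparametric bootstrap with percentile intervals, and a hand-written EM routine. The names `fit_two_gaussians` and `bootstrap_percentile_ci` are hypothetical. Starting EM in every bootstrap sample from the full-sample estimate is one simple way to keep component labels aligned across replicates.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(data, start, n_iter=30):
    """EM for a two-component univariate Gaussian mixture (illustrative).

    `start` and the return value share the layout
    (weight_1, mean_1, sd_1, mean_2, sd_2).
    """
    w, m1, s1, m2, s2 = start
    n = len(data)
    for _ in range(n_iter):
        # E-step: posterior probability that each point comes from component 1.
        resp = []
        for x in data:
            a = w * normal_pdf(x, m1, s1)
            b = (1.0 - w) * normal_pdf(x, m2, s2)
            resp.append(a / (a + b) if a + b > 0.0 else 0.5)
        n1 = sum(resp)
        n2 = n - n1
        if n1 < 1e-8 or n2 < 1e-8:
            break  # degenerate component: keep the last valid estimate
        # M-step: weighted updates of the weight, means and standard deviations.
        w = n1 / n
        m1 = sum(r * x for r, x in zip(resp, data)) / n1
        m2 = sum((1.0 - r) * x for r, x in zip(resp, data)) / n2
        v1 = sum(r * (x - m1) ** 2 for r, x in zip(resp, data)) / n1
        v2 = sum((1.0 - r) * (x - m2) ** 2 for r, x in zip(resp, data)) / n2
        s1 = max(math.sqrt(v1), 1e-3)  # floor avoids collapsing variances
        s2 = max(math.sqrt(v2), 1e-3)
    return (w, m1, s1, m2, s2)


def bootstrap_percentile_ci(data, estimate, n_boot=100, level=0.95, seed=0):
    """Percentile intervals from a nonparametric bootstrap.

    Each replicate is fitted with EM started at `estimate`, so component 1
    in a replicate corresponds to component 1 of the full-sample solution.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in data]  # resample with replacement
        draws.append(fit_two_gaussians(sample, start=estimate))
    alpha = (1.0 - level) / 2.0
    lo_idx = int(math.floor(alpha * (n_boot - 1)))
    hi_idx = int(math.ceil((1.0 - alpha) * (n_boot - 1)))
    intervals = []
    for k in range(len(estimate)):
        values = sorted(d[k] for d in draws)
        intervals.append((values[lo_idx], values[hi_idx]))
    return intervals


# Simulated, well-separated data (assumed for the example only).
rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(100)]
data += [rng.gauss(4.0, 1.0) for _ in range(100)]

estimate = fit_two_gaussians(data, start=(0.5, -1.0, 1.0, 5.0, 1.0))
intervals = bootstrap_percentile_ci(data, estimate)

for name, est, (low, high) in zip(
    ("weight_1", "mean_1", "sd_1", "mean_2", "sd_2"), estimate, intervals
):
    print(f"{name}: estimate={est:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

With poorly separated data the replicates may still drift away from the reference solution, which is the situation the abstract identifies as difficult; this sketch does not attempt to handle it.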

Keywords: clustering, mixture model, bootstrap, uncertainty, label-switching, confidence interval, frequentist estimation, HMTD model
@article{JSFS_2019__160_1_114_0,
     author = {Taushanov, Zhivko and Berchtold, Andr\'e},
     title = {Bootstrap {Validation} of the {Estimated} {Parameters} in {Mixture} {Models} {Used} for {Clustering}},
     journal = {Journal de la soci\'et\'e fran\c{c}aise de statistique},
     pages = {114--129},
     publisher = {Soci\'et\'e fran\c{c}aise de statistique},
     volume = {160},
     number = {1},
     year = {2019},
     mrnumber = {3928542},
     zbl = {1432.62191},
     language = {en},
     url = {http://www.numdam.org/item/JSFS_2019__160_1_114_0/}
}
Taushanov, Zhivko; Berchtold, André. Bootstrap Validation of the Estimated Parameters in Mixture Models Used for Clustering. Journal de la société française de statistique, Tome 160 (2019) no. 1, pp. 114-129. http://www.numdam.org/item/JSFS_2019__160_1_114_0/
