Cross classification aims to identify an underlying structure linking the rows and columns of a data table. This literature review presents the different viewpoints adopted over the past fifty years to define this structure and, for each, offers a non-exhaustive range of associated algorithms and applications. Finally, open questions are addressed, and a methodology for analyzing real data is proposed in the discussion section.
Co-clustering aims to identify block patterns in a data table through a joint clustering of its rows and columns. The problem has been studied since 1965, with recent interest in fields ranging from graph analysis and machine learning to data mining and genomics. Several variants have been proposed under various names: bi-clustering, block clustering, cross-clustering, or simultaneous clustering. We review these methods here in order to describe, compare, and discuss the different ways of performing co-clustering according to the user's aim.
Keywords: Cross classification, co-clustering, block clustering, biclustering, selection criterion
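As a minimal illustration of what a "block pattern" means, the following Python sketch takes a toy binary table together with hand-chosen row and column partitions (both hypothetical, not taken from the article), reorders the table so that members of the same cluster are adjacent, and averages each block. It does not search for the partitions, which is the task the reviewed algorithms address; the function names `reorder` and `block_means` are illustrative only.

```python
from collections import defaultdict


def reorder(table, row_labels, col_labels):
    """Permute rows and columns so that members of the same cluster are adjacent."""
    row_order = sorted(range(len(table)), key=lambda i: row_labels[i])
    col_order = sorted(range(len(table[0])), key=lambda j: col_labels[j])
    return [[table[i][j] for j in col_order] for i in row_order]


def block_means(table, row_labels, col_labels):
    """Average the entries within each (row cluster, column cluster) block."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, row in enumerate(table):
        for j, value in enumerate(row):
            key = (row_labels[i], col_labels[j])
            sums[key] += value
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Toy binary table (hypothetical data): 4 rows, 4 columns.
table = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
# Hand-chosen partitions (assumed known here; a co-clustering
# algorithm would have to estimate them from the table).
row_labels = [0, 1, 0, 1]
col_labels = [0, 1, 0, 1]

reordered = reorder(table, row_labels, col_labels)
means = block_means(table, row_labels, col_labels)

for row in reordered:
    print(row)
print(means)
```

On this toy table the reordered rows read `1 1 0 0`, `1 1 0 0`, `0 0 1 1`, `0 0 1 1`, and the block means are 1.0 on the two diagonal blocks and 0.0 elsewhere, which is the kind of homogeneous block structure a joint clustering of rows and columns is meant to reveal.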
@article{JSFS_2015__156_3_27_0,
  author = {Brault, Vincent and Lomet, Aurore},
  title = {Revue des m\'ethodes pour la classification jointe des lignes et des colonnes d{\textquoteright}un tableau},
  journal = {Journal de la soci\'et\'e fran\c{c}aise de statistique},
  pages = {27--51},
  publisher = {Soci\'et\'e fran\c{c}aise de statistique},
  volume = {156},
  number = {3},
  year = {2015},
  zbl = {1335.62092},
  language = {fr},
  url = {http://www.numdam.org/item/JSFS_2015__156_3_27_0/}
}
TY - JOUR
AU - Brault, Vincent
AU - Lomet, Aurore
TI - Revue des méthodes pour la classification jointe des lignes et des colonnes d’un tableau
JO - Journal de la société française de statistique
PY - 2015
SP - 27
EP - 51
VL - 156
IS - 3
PB - Société française de statistique
UR - http://www.numdam.org/item/JSFS_2015__156_3_27_0/
LA - fr
ID - JSFS_2015__156_3_27_0
ER -
%0 Journal Article
%A Brault, Vincent
%A Lomet, Aurore
%T Revue des méthodes pour la classification jointe des lignes et des colonnes d’un tableau
%J Journal de la société française de statistique
%D 2015
%P 27-51
%V 156
%N 3
%I Société française de statistique
%U http://www.numdam.org/item/JSFS_2015__156_3_27_0/
%G fr
%F JSFS_2015__156_3_27_0
Brault, Vincent; Lomet, Aurore. Revue des méthodes pour la classification jointe des lignes et des colonnes d’un tableau. Journal de la société française de statistique, Tome 156 (2015) no. 3, pp. 27-51. http://www.numdam.org/item/JSFS_2015__156_3_27_0/