Lors de la classification non supervisée des nœuds d’un graphe, une partition unique des nœuds est généralement construite, que le graphe soit orienté ou non. Bien que ce choix soit pertinent pour les graphes non orientés, il devrait être discuté pour les graphes orientés car il implique qu’aucune différence n’est faite entre les clusters de nœuds source et cible. Nous examinons cette question dans le contexte des modèles de clustering probabilistes à variables latentes et comparons l’utilisation du modèle de blocs stochastiques (SBM) et du modèle de blocs latents (LBM). Nous analysons et discutons cette comparaison à travers des jeux de données simulées et réelles.
When clustering the nodes of a graph, a unique partition of the nodes is usually built, either the graph is undirected or directed. While this choice is pertinent for undirected graphs, it should be discussed for directed graphs because it implies that no difference is made between the clusters of source and target nodes. We examine this question in the context of probabilistic models with latent variables and compare the use of the Stochastic Block Model (SBM) and of the Latent Block Model (LBM). We analyze and discuss this comparison through simulated and real data sets and suggest some recommendation.
Keywords: Clustering for directed graphs, Genes networks, Penalized log-likehood, Co-clustering, SBM, LBM
Mot clés : Classification non supervisée de graphes orientés, Réseaux de gènes, Vraisemblance pénalisée, Co-clustering, SBM, LBM
@article{JSFS_2021__162_1_46_0, author = {Keribin, Christine}, title = {Cluster or co-cluster the nodes of oriented graphs?}, journal = {Journal de la soci\'et\'e fran\c{c}aise de statistique}, pages = {46--69}, publisher = {Soci\'et\'e fran\c{c}aise de statistique}, volume = {162}, number = {1}, year = {2021}, mrnumber = {4286849}, zbl = {1469.62287}, language = {en}, url = {http://www.numdam.org/item/JSFS_2021__162_1_46_0/} }
TY - JOUR AU - Keribin, Christine TI - Cluster or co-cluster the nodes of oriented graphs? JO - Journal de la société française de statistique PY - 2021 SP - 46 EP - 69 VL - 162 IS - 1 PB - Société française de statistique UR - http://www.numdam.org/item/JSFS_2021__162_1_46_0/ LA - en ID - JSFS_2021__162_1_46_0 ER -
Keribin, Christine. Cluster or co-cluster the nodes of oriented graphs?. Journal de la société française de statistique, Tome 162 (2021) no. 1, pp. 46-69. http://www.numdam.org/item/JSFS_2021__162_1_46_0/
[1] Community detection and stochastic block models: recent developments, The Journal of Machine Learning Research, Volume 18 (2017) no. 1, pp. 6446-6531 | MR
[2] A latent mixed membership model for relational data, Proceedings of the 3rd international workshop on Link discovery (2005), pp. 82-89 | DOI
[3] Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels, The Annals of Statistics, Volume 41 (2013) no. 4, pp. 1922-1943 | MR | Zbl
[4] Assessing a mixture model for clustering with the integrated completed likelihood, Pattern Analysis and Machine Intelligence, IEEE Transactions on, Volume 22 (2000) no. 7, pp. 719-725 | DOI
[5] Consistency and asymptotic normality of Latent Block Model estimators, Electronic journal of statistics, Volume 14 (2020) no. 1, pp. 1234-1268 | MR | Zbl
[6] The netflix prize, Proceedings of KDD cup and workshop, Volume 2007 (2007), 35 pages
[7] Revue des méthodes pour la classification jointe des lignes et des colonnes dâun tableau, Journal de la Société Française de Statistique, Volume 156 (2015) no. 3, pp. 27-51 | Numdam | MR | Zbl
[8] Co-clustering through Latent Bloc Model: a Review, Journal de la Société Française de Statistique, Volume 156 (2015) no. 3, pp. 120-139 | Numdam | MR | Zbl
[9] Modern graph theory, 184, Springer Science & Business Media, 2013 | MR
[10] Random graphs, Modern graph theory, Springer, 1998, pp. 215-252 | DOI | MR
[11] Speeding cis-trans regulation discovery by phylogenomic analyses coupled with screenings of an arrayed library of Arabidopsis transcription factors, PloS one, Volume 6 (2011) no. 6, p. e21524 | DOI
[12] Classification non supervisée de graphes orientés: faut-il distinguer les nœuds origines des nœuds terminaux?, Actes des Neuvièmes Journées Francophones sur les Réseaux Bayésiens et les Modèles Graphiques Probabilistes 2018 (JFRB, ed.) (2018), pp. 68-75
[13] Co-clustering documents and words using bipartite spectral graph partitioning, Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (2001), pp. 269-274 | DOI
[14] Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society. Series B (methodological) (1977), pp. 1-38 | DOI | MR | Zbl
[15] A mixture model for random graphs, Statistics and computing, Volume 18 (2008) no. 2, pp. 173-183 | DOI | MR
[16] Model-based count series clustering for bike sharing system usage mining: a case study with the Vélibâsystem of Paris, ACM Transactions on Intelligent Systems and Technology (TIST), Volume 5 (2014) no. 3, pp. 1-21 | DOI
[17] On the evolution of random graphs, Publ. Math. Inst. Hung. Acad. Sci, Volume 5 (1960) no. 1, pp. 17-60 | MR | Zbl
[18] Solutio problematis ad geometriam situs pertinentis, Commentarii academiae scientiarum Petropolitanae (1741), pp. 128-140
[19] Cluster inference by using transitivity indices in empirical graphs, Journal of the American Statistical Association, Volume 77 (1982) no. 380, pp. 835-840 | DOI | MR | Zbl
[20] Learning from missing data with the Latent Block Model, arXiv preprint arXiv:2010.12222 (2020) | MR | Zbl
[21] Self-organization and identification of web communities, Computer, Volume 35 (2002) no. 3, pp. 66-70 | DOI
[22] Community detection in graphs, Physics reports, Volume 486 (2010) no. 3-5, pp. 75-174 | DOI | MR
[23] Community structure in social and biological networks, Proceedings of the national academy of sciences, Volume 99 (2002) no. 12, pp. 7821-7826 | DOI | MR | Zbl
[24] Clustering of contingency table and mixture model, European Journal of Operational Research, Volume 183 (2007), pp. 1055-1066 | DOI | MR | Zbl
[25] Co-Clustering, ISTE Ltd and John Wiley & Sons, Inc, 2013 | DOI
[26] A survey of statistical network models, Found. Trends Mach. Learn., Volume 2 (2010) no. 2, pp. 129-233 | DOI | Zbl
[27] Network bipartivity, Physical Review E, Volume 68 (2003) no. 5, p. 056107 | DOI
[28] Stochastic block models: First steps, Social networks, Volume 5 (1983) no. 2, pp. 109-137 | DOI | MR
[29] Estimation and selection for the latent block model on categorical data, Statistics and Computing, Volume 25 (2015) no. 6, pp. 1201-1216 | DOI | MR | Zbl
[30] Statistical Analysis of Network Data Methods and Models, Springer, 2009 | DOI | MR
[31] blockmodels: Latent and Stochastic Block Model Estimation by a ’V-EM’ Algorithm (2015) https://CRAN.R-project.org/package=blockmodels (R package version 1.1.1)
[32] Blockmodels: A R-package for estimating in Latent Block Model and Stochastic Block Model, with various probability functions, with or without covariates, arXiv preprint arXiv:1602.07587 (2016)
[33] Detection of structurally homogeneous subsets in graphs, Statistics and Computing, Volume 24 (2014) no. 5, pp. 675-692 | DOI | MR | Zbl
[34] Biclustering algorithms for biological data analysis: a survey, Computational Biology and Bioinformatics, IEEE/ACM Transactions on, Volume 1 (2004) no. 1, pp. 24-45 | DOI
[35] Modeling heterogeneity in random graphs through latent space models: a selective review, ESAIM: Proceedings and Surveys, Volume 47 (2014), pp. 55-74 | DOI | MR | Zbl
[36] Uncovering latent structure in valued graphs: a variational approach, The Annals of Applied Statistics, Volume 4 (2010) no. 2, pp. 715-742 | DOI | MR | Zbl
[37] Modularity and community structure in networks, Proceedings of the national academy of sciences, Volume 103 (2006) no. 23, pp. 8577-8582 | DOI
[38] Bayesian estimation of the latent dimension and communities in stochastic blockmodels, Statistics and Computing (2020), pp. 1-17 | MR | Zbl
[39] On a two-truths phenomenon in spectral graph clustering, Proceedings of the National Academy of Sciences, Volume 116 (2019) no. 13, pp. 5995-6000 | DOI | MR | Zbl
[40] Spectral clustering and the high-dimensional stochastic blockmodel, The Annals of Statistics, Volume 39 (2011) no. 4, pp. 1878-1915 | MR | Zbl
[41] A graph based approach to extract a neighborhood customer community for collaborative filtering, International Workshop on Databases in Networked Information Systems, Springer (2002), pp. 188-200 | DOI | Zbl
[42] bikm1: Co-Clustering Adjusted Rand Index and Bikm1 Procedure for Contingency and Binary Data-Sets (2020) https://CRAN.R-project.org/package=bikm1 (R package version 1.0.0)
[43] Comparing high-dimensional partitions with the Co-clustering Adjusted Rand Index, Journal of Classification (2020), pp. 1-29 | MR | Zbl
[44] Estimating the dimension of a model, The annals of statistics, Volume 6 (1978) no. 2, pp. 461-464 | MR | Zbl
[45] Bayesian co-clustering, Eighth IEEE International Conference on Data Mining, 2008. ICDM’08 (2008), pp. 530-539 | DOI
[46] blockcluster: An R Package for Model-Based Co-Clustering, Journal of Statistical Software, Volume 76 (2017) no. 9, pp. 1-24 | DOI
[47] The Arabidopsis Information Resource (TAIR): gene structure and function annotation, Nucleic acids research, Volume 36 (2007) no. suppl_1, p. D1009-D1014 | DOI
[48] Stability of ecological communities and the architecture of mutualistic and trophic networks, Science, Volume 329 (2010) no. 5993, pp. 853-856 | DOI
[49] Inférence de réseaux de régulation orientés pour les facteurs de transcription d’Arabidopsis thaliana et création de groupes de co-régulation, Ph. D. Thesis , Paris Saclay (2017)
[50] A tutorial on spectral clustering, Statistics and computing, Volume 17 (2007) no. 4, pp. 395-416 | DOI | MR
[51] Architecture of an antagonistic tree/fungus network: the asymmetric influence of past evolutionary history, PloS one, Volume 3 (2008) no. 3, p. e1740 | DOI
[52] Likelihood-based model selection for stochastic block models, The Annals of Statistics, Volume 45 (2017) no. 2, pp. 500-528 | MR | Zbl
[53] Inferring structure in bipartite networks using the latent block model and exact ICL ., Network Science, Volume 5 (2018), pp. 45-69 | DOI
[54] Positional information and pattern formation, Current topics in developmental biology, Volume 6 (1971), pp. 183-224 | DOI