Digital interactions in which users share textual content published by others are naturally represented by a network whose nodes are the individuals and whose edges carry the exchanged texts. To understand these heterogeneous and complex data structures, it is essential both to cluster the nodes into homogeneous groups and to render a comprehensible visualisation of the data. To address both issues, we introduce Deep-LPTM, a model-based clustering strategy relying on a variational graph auto-encoder approach as well as a probabilistic model to characterise the topics of discussion. Deep-LPTM builds a joint representation of the nodes and of the edges in two embedding spaces. The parameters are inferred using a variational inference algorithm. We also introduce IC2L, a model selection criterion specifically designed to choose models with relevant clustering and visualisation properties. An extensive benchmark study on synthetic data is provided. In particular, we find that Deep-LPTM recovers the partitions of the nodes better than the state-of-the-art ETSBM and STBM. Finally, the emails of the Enron company are analysed and visualisations of the results are presented, with meaningful highlights of the graph structure.