Most modern language models infer representations that, albeit powerful, lack both compositionality and semantic interpretability. Starting from the assumption that a large proportion of semantic content is necessarily relational, we introduce a neural language model that discovers networks of symbols (schemata) from text datasets. Using a variational autoencoder (VAE) framework, our model encodes sentences into sequences of symbols (composed representation), which correspond to the nodes visited by biased random walkers on a global latent graph. Sentences are then generated back, conditioned on the selected symbol sequences. We first demonstrate that the model is able to uncover ground-truth graphs from artificially generated datasets of random token sequences. Next, we leverage pretrained BERT and GPT-2 language models as encoder and decoder, respectively, to train our model on language modeling tasks. Qualitatively, our results show that the model is able to infer schema networks encoding different aspects of natural language. Quantitatively, the model achieves state-of-the-art scores on VAE language modeling benchmarks. Source code to reproduce our experiments is available at https://github.com/ramsesjsf/HiddenSchemaNetworks
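To make the latent mechanism concrete, the following is a minimal sketch of a biased random walk over a symbol graph, producing a node sequence of the kind the abstract calls a schema. It is illustrative only: the function `biased_random_walk`, the per-node `bias_logits` parameterization, and the toy ring graph are assumptions for this sketch, not the model's actual learned graph or walk parameterization.

```python
import numpy as np

def biased_random_walk(adjacency, bias_logits, walk_length, rng=None):
    """Sample a node sequence from a biased random walk on a graph.

    adjacency:   (K, K) binary adjacency matrix of the latent symbol graph
    bias_logits: (K,) per-node bias scores; higher values make a node more
                 likely to be visited from any of its neighbours
    walk_length: number of symbols (nodes) to emit
    """
    rng = np.random.default_rng() if rng is None else rng
    K = adjacency.shape[0]
    bias = np.exp(bias_logits - bias_logits.max())  # unnormalised node weights

    # Start from a node drawn according to the normalised bias.
    start_probs = bias / bias.sum()
    node = rng.choice(K, p=start_probs)
    walk = [node]

    for _ in range(walk_length - 1):
        # Transition probabilities: restrict the bias to current neighbours.
        neighbour_mask = adjacency[node].astype(float)
        weights = neighbour_mask * bias
        if weights.sum() == 0:          # isolated node: restart the walk
            weights = start_probs
        probs = weights / weights.sum()
        node = rng.choice(K, p=probs)
        walk.append(node)

    return np.array(walk)  # sequence of symbol indices (a "schema")

# Example: a toy 5-node ring graph and a walk of length 8.
A = np.eye(5, k=1, dtype=int) + np.eye(5, k=-1, dtype=int)
A[0, -1] = A[-1, 0] = 1
print(biased_random_walk(A, bias_logits=np.zeros(5), walk_length=8))
```

In the model described above, such symbol sequences would be inferred from sentences by the encoder and consumed by the decoder to reconstruct the text; this sketch only shows how a biased walk on a graph yields a discrete, compositional latent sequence.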