Disentangling the encodings of neural models is fundamental to improving interpretability, semantic control, and downstream task performance in Natural Language Processing. Currently, most disentanglement methods are unsupervised or rely on synthetic datasets with known generative factors. We argue that recurrent syntactic and semantic regularities in textual data can provide models with both structural biases and generative factors. We leverage the semantic structures present in a representative and semantically dense category of sentence types, definitional sentences, to train a Variational Autoencoder that learns disentangled representations. Our experimental results show that the proposed model outperforms unsupervised baselines on several qualitative and quantitative disentanglement benchmarks, and that it also improves results on the downstream task of definition modeling.
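For reference, disentanglement-oriented Variational Autoencoders are commonly trained with a weighted evidence lower bound in which the KL term is scaled to encourage factorized latent codes; the objective below is a generic sketch of this family (with weight $\beta$, encoder $q_\phi$, decoder $p_\theta$, and prior $p(z)$), not necessarily the exact objective used in this work:
\[
\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - \beta \, D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)
\]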