Syntax is a fundamental component of language, yet few metrics have been employed to capture syntactic similarity or coherence at the utterance and document levels. The existing standard document-level syntactic similarity metric is computationally expensive and performs inconsistently when faced with syntactically dissimilar documents. To address these challenges, we present FastKASSIM, a metric for utterance- and document-level syntactic similarity that uses tree kernels to pair the most similar constituency parse trees between two documents and averages the resulting similarities. FastKASSIM is more robust to syntactic dissimilarities and runs up to 5.32 times faster than its predecessor over documents in the r/ChangeMyView corpus. These improvements allow us to examine hypotheses in two settings with large documents. We find that syntactically similar arguments on r/ChangeMyView tend to be more persuasive, and that syntax is predictive of authorship in the Australian High Court Judgment corpus.
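The pair-and-average idea described in the abstract can be illustrated with a minimal sketch. This is not the FastKASSIM implementation: the kernel below is a simplified stand-in that counts shared complete subtrees, the parse trees are hand-written nested tuples rather than parser output, and averaging the two directions to get a symmetric score is an assumption. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt

# A parse tree is a nested tuple: (label, child, child, ...).
# Leaves are one-element tuples such as ("NN",). Words are omitted,
# so only syntactic structure is compared.

def subtrees(tree):
    """Yield the complete subtree rooted at every node of the tree."""
    yield tree
    for child in tree[1:]:
        yield from subtrees(child)

def kernel(a, b):
    """Simplified kernel (assumption): count matching pairs of complete subtrees."""
    count_a, count_b = Counter(subtrees(a)), Counter(subtrees(b))
    return sum(n * count_b[s] for s, n in count_a.items() if s in count_b)

def tree_similarity(a, b):
    """Normalize the kernel to [0, 1] so that identical trees score 1."""
    return kernel(a, b) / sqrt(kernel(a, a) * kernel(b, b))

def document_similarity(doc_a, doc_b):
    """Pair each tree with its most similar tree in the other document,
    average those best-match scores, and symmetrize (assumption)."""
    def directed(src, tgt):
        return sum(max(tree_similarity(s, t) for t in tgt) for s in src) / len(src)
    return (directed(doc_a, doc_b) + directed(doc_b, doc_a)) / 2

# Two toy constituency trees that differ only in the noun phrase.
t1 = ("S", ("NP", ("DT",), ("NN",)), ("VP", ("VBD",)))
t2 = ("S", ("NP", ("NN",)), ("VP", ("VBD",)))

if __name__ == "__main__":
    print(tree_similarity(t1, t2))
    print(document_similarity([t1, t2], [t2]))
```

Because every tree is compared against every tree in the other document, the cost of this sketch grows with the product of the two documents' sentence counts, which is one reason a fast kernel matters for large documents.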