Syntax is a fundamental component of language, yet few metrics have been employed to capture syntactic similarity or coherence at the utterance and document level. The existing standard document-level syntactic similarity metric is computationally expensive and performs inconsistently when faced with syntactically dissimilar documents. To address these challenges, we present FastKASSIM, a metric for utterance- and document-level syntactic similarity which pairs and averages the most similar dependency parse trees between a pair of documents based on tree kernels. FastKASSIM is more robust to syntactic dissimilarities and runs up to 5.32 times faster than the baseline method over the documents in the r/ChangeMyView corpus. These improvements allow us to examine hypotheses in two settings with large documents: persuasive online arguments on r/ChangeMyView, and authorship attribution in the Australian High Court Judgment corpus. With FastKASSIM, we are able to show that more syntactically similar arguments tend to be more persuasive, and that syntax provides a key indicator of writing style.
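The pair-and-average idea described above can be illustrated with a minimal sketch. This is not the FastKASSIM implementation: the kernel below is a simple bag-of-productions kernel swapped in for illustration, the nested-tuple tree encoding is an assumption, and the names `productions`, `kernel`, `normalized_kernel`, and `document_similarity` are hypothetical. The one-directional matching (each tree in the first document paired with its best match in the second) is also an assumption made for brevity.

```python
from collections import Counter
from math import sqrt

# Assumed encoding: a tree is a tuple (label, child, child, ...);
# a leaf is a plain string such as a part-of-speech tag.


def productions(tree):
    """Count productions (parent label -> child labels) in a tree."""
    counts = Counter()
    if isinstance(tree, str):  # a leaf has no production of its own
        return counts
    label, *children = tree
    child_labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, child_labels)] += 1
    for child in children:
        counts.update(productions(child))
    return counts


def kernel(a, b):
    """Illustrative kernel: dot product of production-count vectors."""
    pa, pb = productions(a), productions(b)
    return sum(n * pb[rule] for rule, n in pa.items())


def normalized_kernel(a, b):
    """Scale the kernel to [0, 1] so trees of different sizes are comparable."""
    denom = sqrt(kernel(a, a) * kernel(b, b))
    return kernel(a, b) / denom if denom else 0.0


def document_similarity(doc_a, doc_b):
    """Pair each tree in doc_a with its most similar tree in doc_b, then average."""
    if not doc_a or not doc_b:
        return 0.0
    best = [max(normalized_kernel(t, u) for u in doc_b) for t in doc_a]
    return sum(best) / len(best)


# Two toy parse trees with words omitted (leaves are POS tags).
t1 = ("S", ("NP", "DT", "NN"), ("VP", "VBZ"))
t2 = ("S", ("NP", "DT", "NN"), ("VP", "VBD", ("NP", "DT", "NN")))

score = document_similarity([t1], [t2])
```

Normalizing each kernel value before taking the maximum keeps long sentences from dominating the pairing, which is one plausible reason to average normalized scores rather than raw kernel counts.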