This work introduces a natural language inference (NLI) dataset that focuses on the validity of statements in legal wills. This dataset is unique because: (a) each entailment decision requires three inputs: the statement from the will, the law, and the conditions that hold at the time of the testator's death; and (b) the included texts are longer than those in current NLI datasets. We trained eight neural NLI models on this dataset. All the models achieve more than 80% macro F1 and accuracy, which indicates that neural approaches can handle this task reasonably well. However, group accuracy, a stricter evaluation measure that treats the group of positive and negative examples generated from the same statement as a single unit, is in the mid-80s at best, which suggests that the models' understanding of the task remains superficial. Further ablative analyses and explanation experiments indicate that all three text segments are used for prediction, but some decisions rely on semantically irrelevant tokens. This indicates that overfitting to these longer texts is likely, and that additional research is required before this task can be considered solved.
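To make the evaluation setup concrete, the following is a minimal sketch of the group accuracy measure described above: predictions for the positive and negative examples derived from the same will statement are grouped, and a group counts as correct only if every example in it is classified correctly. The function and variable names (group_accuracy, group_ids, y_true, y_pred) are illustrative assumptions, not identifiers from the paper's released code.

```python
from collections import defaultdict

def group_accuracy(group_ids, y_true, y_pred):
    """Fraction of statement groups whose examples are all predicted correctly."""
    per_group = defaultdict(list)
    for gid, gold, pred in zip(group_ids, y_true, y_pred):
        per_group[gid].append(gold == pred)
    return sum(all(flags) for flags in per_group.values()) / len(per_group)

# Example: two statement groups; the second has one wrong prediction,
# so example-level accuracy is 3/4 but group accuracy is only 1/2.
print(group_accuracy(
    ["s1", "s1", "s2", "s2"],
    ["support", "refute", "support", "refute"],
    ["support", "refute", "support", "support"],
))
```

Because a single error anywhere in a group invalidates the whole group, this measure is strictly harder than per-example accuracy, which is why it better exposes superficial understanding.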