Natural language inference (NLI) is critical for complex decision-making in the biomedical domain. One key question, for example, is whether a given biomedical mechanism is supported by experimental evidence. This can be framed as an NLI problem, but no directly usable datasets exist to address it. The main challenge is that manually creating informative negative examples for this task is difficult and expensive. We introduce a novel semi-supervised procedure that bootstraps an NLI dataset from an existing biomedical dataset that pairs mechanisms with experimental evidence in abstracts. We generate a range of negative examples using nine strategies that manipulate the structure of the underlying mechanisms both with rules, e.g., flipping the roles of the entities in the interaction, and, more importantly, as perturbations via logical constraints in a neuro-logical decoding system. We use this procedure to create a novel dataset for NLI in the biomedical domain, called BioNLI, and benchmark two state-of-the-art biomedical classifiers on it. The best result we obtain is an F1 in the mid-70s, which indicates the difficulty of the task. Critically, performance varies widely across the different classes of negative examples, from 97% F1 on the simple role-change negative examples to barely better than chance on the negative examples generated using neuro-logical decoding.
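To make the rule-based strategies concrete, the role-flip example mentioned above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the procedure used to build BioNLI: the `Mechanism` triple, the functions `flip_roles` and `make_nli_pairs`, the label strings, and the sample sentences are all hypothetical, and a real mechanism in the dataset is a free-text sentence rather than a clean triple.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mechanism:
    """Hypothetical simplification: a mechanism as a (regulator, relation, target) triple."""
    regulator: str
    relation: str
    target: str

    def to_text(self) -> str:
        # Render the triple as a short hypothesis sentence.
        return f"{self.regulator} {self.relation} {self.target}."


def flip_roles(m: Mechanism) -> Mechanism:
    """Swap the two entities while keeping the relation unchanged."""
    return Mechanism(regulator=m.target, relation=m.relation, target=m.regulator)


def make_nli_pairs(premise: str, m: Mechanism) -> list[tuple[str, str, str]]:
    """Return (premise, hypothesis, label) triples: one positive, one role-flipped negative.

    The label names are assumptions chosen for this sketch.
    """
    return [
        (premise, m.to_text(), "supported"),
        (premise, flip_roles(m).to_text(), "not_supported"),
    ]


# Invented example data, used only to exercise the functions.
premise = "Treatment with protein A reduced the expression of gene B in cultured cells."
mechanism = Mechanism(regulator="Protein A", relation="inhibits", target="gene B")

for p, hypothesis, label in make_nli_pairs(premise, mechanism):
    print(label, "|", hypothesis)
```

Negatives of this kind differ from the positive hypothesis only in entity order, which is one plausible reason a classifier can separate them so easily compared with the negatives produced by constrained decoding.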