We study the problem of building entity tagging systems using a small set of rules as weak supervision. Previous methods mostly focus on disambiguating entity types based on contexts and expert-provided rules, while assuming that entity spans are given. In this work, we propose a novel method, TALLOR, that bootstraps high-quality logical rules to train a neural tagger in a fully automated manner. Specifically, we introduce compound rules, composed from simple rules, to increase the precision of boundary detection and generate more diverse pseudo labels. We further design a dynamic label selection strategy to ensure pseudo-label quality and thereby avoid overfitting the neural tagger. Experiments on three datasets demonstrate that, starting from only 20 simple rules, our method outperforms other weakly supervised methods and even rivals a state-of-the-art distantly supervised tagger that relies on a lexicon of over 2,000 terms. Our method can serve as a tool for rapidly building taggers in emerging domains and tasks. Case studies show that the learned rules can potentially explain the predicted entities.
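To make the idea of compound rules concrete, the following is a minimal sketch of how a conjunction of simple rules can tighten span boundaries. It is an illustration only, not TALLOR's implementation: the rule types (`preceded_by`, `ends_with_suffix`), the combinator `all_of`, the span enumerator, and the example sentence are all hypothetical names and data introduced here, and the abstract does not specify which simple rule types the method actually uses.

```python
from typing import Callable, Iterator, List, Tuple

Span = Tuple[int, int]                       # [start, end) token offsets
Rule = Callable[[List[str], Span], bool]     # votes on one candidate span


def preceded_by(word: str) -> Rule:
    """Hypothetical simple rule: the token just before the span equals `word`."""
    target = word.lower()
    return lambda tokens, span: span[0] > 0 and tokens[span[0] - 1].lower() == target


def ends_with_suffix(suffix: str) -> Rule:
    """Hypothetical simple rule: the last token of the span ends with `suffix`."""
    return lambda tokens, span: tokens[span[1] - 1].lower().endswith(suffix)


def all_of(*rules: Rule) -> Rule:
    """Compound rule (assumed here to be a conjunction): every component must fire."""
    return lambda tokens, span: all(rule(tokens, span) for rule in rules)


def candidate_spans(tokens: List[str], max_len: int = 3) -> Iterator[Span]:
    """Enumerate every span of up to `max_len` tokens."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            yield (start, end)


def pseudo_label(tokens: List[str],
                 labeled_rules: List[Tuple[Rule, str]],
                 max_len: int = 3) -> List[Tuple[Span, str]]:
    """Assign the label of the first matching rule to each candidate span."""
    labels = []
    for span in candidate_spans(tokens, max_len):
        for rule, label in labeled_rules:
            if rule(tokens, span):
                labels.append((span, label))
                break
    return labels


# Illustrative sentence (made up for this sketch).
tokens = "Patients diagnosed with chronic hepatitis received aspirin".split()

suffix_only = ends_with_suffix("itis")
compound = all_of(preceded_by("with"), ends_with_suffix("itis"))

loose = pseudo_label(tokens, [(suffix_only, "Disease")])
tight = pseudo_label(tokens, [(compound, "Disease")])
```

In this toy setting the suffix rule alone fires on three overlapping spans that end in "hepatitis", whereas the compound rule keeps only the span that starts right after "with", namely "chronic hepatitis". This is one way the conjunction of simple rules could raise boundary precision; the actual rule composition and selection in the method may differ.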