Understanding human language often necessitates understanding entities and their place in a taxonomy of knowledge -- their types. Previous methods to learn entity types rely on training classifiers on datasets with coarse, noisy, and incomplete labels. We introduce a method to instill fine-grained type knowledge in language models with text-to-text pre-training on type-centric questions leveraging knowledge base documents and knowledge graphs. We create the WikiWiki dataset: entities and passages from 10M Wikipedia articles linked to the Wikidata knowledge graph with 41K types. Models trained on WikiWiki achieve state-of-the-art performance in zero-shot dialog state tracking benchmarks, accurately infer entity types in Wikipedia articles, and can discover new types deemed useful by human judges.