Text classifiers are at the core of many NLP applications and rely on a variety of algorithmic approaches and software. This paper introduces infrastructure and methodologies for text classifiers based on large-scale regular expressions. In particular, we describe how Facebook determines whether a given piece of text, anything from a hashtag to a post, belongs to a narrow topic such as COVID-19. To fully define a topic and evaluate classifier performance, we employ human-guided iterations of keyword discovery but do not require labeled data. For COVID-19, we build two sets of regular expressions: (1) for 66 languages, with 99% precision and recall >50%; (2) for the 11 most common languages, with precision >90% and recall >90%. Regular expressions enable low-latency queries from multiple platforms. Responses to challenges like COVID-19 are fast, and so are revisions. Comparisons to a DNN classifier show explainable results, higher precision and recall, and less overfitting. The lessons we learned can be applied to other narrow-topic classifiers.
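As a rough illustration of the general idea of per-language regular-expression topic matching, the following is a minimal sketch and not the system described in this paper. The pattern table `TOPIC_PATTERNS`, the keywords in it, and the function `match_topic` are all hypothetical and chosen only for illustration; production expressions would be far larger and produced by the human-guided keyword discovery described above.

```python
import re
from typing import Dict, Optional

# Hypothetical per-language keyword patterns (assumption: illustrative only,
# not the expressions used in the paper).
TOPIC_PATTERNS: Dict[str, str] = {
    "en": r"\b(?:covid(?:[-\s]?19)?|coronavirus|sars[-\s]?cov[-\s]?2)\b",
    "es": r"\b(?:covid(?:[-\s]?19)?|coronavirus|cuarentena)\b",
}

# Compile once so that each query is a single regex search.
COMPILED = {
    lang: re.compile(pattern, re.IGNORECASE)
    for lang, pattern in TOPIC_PATTERNS.items()
}


def match_topic(text: str, lang: str) -> Optional[str]:
    """Return the matched keyword if the text hits the topic, else None.

    Returning the matched span (rather than a bare boolean) is what makes
    the decision explainable: the keyword itself is the evidence.
    """
    pattern = COMPILED.get(lang)
    if pattern is None:
        # No expressions available for this language.
        return None
    match = pattern.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    for sample, lang in [
        ("#COVID19", "en"),
        ("New coronavirus cases reported", "en"),
        ("#StayHome", "en"),
        ("Empieza la cuarentena", "es"),
    ]:
        print(repr(sample), lang, "->", match_topic(sample, lang))
```

Because a match returns the keyword that triggered it, a reviewer can see why a text was classified as on-topic, and revising the classifier amounts to editing the pattern table.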