We present skweak, a versatile, Python-based software toolkit enabling NLP developers to apply weak supervision to a wide range of NLP tasks. Weak supervision is an emerging machine learning paradigm based on a simple idea: instead of labelling data points by hand, we use labelling functions derived from domain knowledge to automatically obtain annotations for a given dataset. The resulting labels are then aggregated with a generative model that estimates the accuracy (and possible confusions) of each labelling function. The skweak toolkit makes it easy to implement a large spectrum of labelling functions (such as heuristics, gazetteers, neural models or linguistic constraints) on text data, apply them to a corpus, and aggregate their results in a fully unsupervised fashion. skweak is designed specifically to facilitate the use of weak supervision for NLP tasks such as text classification and sequence labelling. We illustrate the use of skweak for NER and sentiment analysis. skweak is released under an open-source license and is available at: https://github.com/NorskRegnesentral/skweak
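To make the idea of labelling functions and aggregation concrete, the following is a minimal sketch in plain Python for a toy sentiment task. It does not use the skweak API: all function names, word lists and example texts are hypothetical, and the aggregation step is a simple majority vote standing in for the generative model described above, which would additionally estimate the accuracy of each labelling function.

```python
from collections import Counter
from typing import Callable, List, Optional

# A labelling function returns a label, or None to abstain.
Label = Optional[str]
LabellingFunction = Callable[[str], Label]


def lf_positive_words(text: str) -> Label:
    """Heuristic: vote POS if a known positive word occurs (hypothetical word list)."""
    positive = {"great", "excellent", "love"}
    return "POS" if any(w in positive for w in text.lower().split()) else None


def lf_negative_words(text: str) -> Label:
    """Heuristic: vote NEG if a known negative word occurs (hypothetical word list)."""
    negative = {"terrible", "awful", "hate"}
    return "NEG" if any(w in negative for w in text.lower().split()) else None


def lf_exclamation(text: str) -> Label:
    """Noisy heuristic: treat a trailing exclamation mark as a weak positive cue."""
    return "POS" if text.endswith("!") else None


def apply_lfs(texts: List[str], lfs: List[LabellingFunction]) -> List[List[Label]]:
    """Apply every labelling function to every text, one row of votes per text."""
    return [[lf(text) for lf in lfs] for text in texts]


def majority_vote(votes: List[Label]) -> Label:
    """Aggregate the votes for one text; abstain on ties or when nothing fired."""
    counts = Counter(v for v in votes if v is not None)
    if not counts:
        return None
    top = counts.most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None  # tie between the two most frequent labels
    return top[0][0]


LFS = [lf_positive_words, lf_negative_words, lf_exclamation]

texts = [
    "I love this phone, great battery!",
    "Terrible service, I hate it.",
    "The package arrived on Tuesday.",
]

votes = apply_lfs(texts, LFS)
labels = [majority_vote(row) for row in votes]

for text, label in zip(texts, labels):
    print(f"{label!s:>5}  {text}")
```

Majority voting weights every labelling function equally. A generative aggregation model of the kind the abstract describes instead learns how reliable each function is from the agreement patterns across the corpus, without any hand-labelled data.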