Online data streams make training machine learning models difficult, as the data distribution shifts and new patterns emerge over time. For natural language processing (NLP) tasks that rely on a collection of features based on lexicons and rules, it is important to adapt these features to the changing data. To address this challenge, we introduce PyTAIL, a Python library that enables a human-in-the-loop approach to actively training NLP models. PyTAIL extends generic active learning, which suggests only new instances to label, by also suggesting new features such as rules and lexicons for labelling. Furthermore, PyTAIL is flexible enough for users to accept, reject, or update rules and lexicons as the model is being trained. We simulate the performance of PyTAIL on existing social media benchmark datasets for text classification and compare various active learning strategies on these benchmarks. The model closes the performance gap with as little as 10% of the training data. Finally, we highlight the importance of tracking evaluation metrics on the remaining data (which is not yet merged through active learning) alongside the test dataset. This demonstrates the model's effectiveness in accurately annotating the remaining dataset, which is especially useful for batch processing of large unlabelled corpora. PyTAIL will be available at https://github.com/socialmediaie/pytail.
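The human-in-the-loop workflow described above can be illustrated with a minimal sketch: rank unlabelled texts by model uncertainty, and propose frequent unseen terms from the pool as candidate lexicon entries for the annotator to accept, reject, or edit. The function names, the least-confidence criterion, and the frequency-based term suggestion here are illustrative assumptions, not PyTAIL's actual API.

```python
# Illustrative sketch (not PyTAIL's API): uncertainty sampling plus
# lexicon-candidate suggestion for a human-in-the-loop annotator.
from collections import Counter


def select_uncertain(pool, predict_proba, k=2):
    """Return the k pool texts whose positive-class probability is
    closest to 0.5 (least-confidence uncertainty sampling)."""
    return sorted(pool, key=lambda text: abs(predict_proba(text) - 0.5))[:k]


def suggest_lexicon_terms(texts, known_lexicon, k=3):
    """Propose the k most frequent tokens not already in the lexicon
    as candidates for the annotator to accept, reject, or update."""
    counts = Counter(
        tok for text in texts for tok in text.lower().split()
        if tok not in known_lexicon
    )
    return [tok for tok, _ in counts.most_common(k)]
```

For example, with a toy scoring function, `select_uncertain` surfaces the text the model is least sure about, and `suggest_lexicon_terms` proposes new lexicon entries from the most frequent unseen tokens; in PyTAIL the accepted entries would then feed back into the feature set on the next training round.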