Active learning is a state-of-the-art machine learning approach for dealing with an abundance of unlabeled data. In Natural Language Processing, annotating all of the data is typically costly and time-consuming. This inefficiency motivates our application of active learning to text classification. In this research, traditional unsupervised k-means clustering is first modified into a semi-supervised version. The algorithm is then extended to the active learning scenario with a novel Penalized Min-Max-selection step, which makes a limited number of queries that yield more stable initial centroids. The method uses both the interactive query results from users and the underlying distance representation. Tested on a Chinese news dataset, it shows a consistent increase in accuracy while lowering the cost of training.
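The abstract does not spell out the Penalized Min-Max-selection rule, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a farthest-first (max-min distance) query order, a hypothetical `oracle` callable standing in for the user, and a hypothetical multiplicative `penalty` that down-weights candidates near a query whose label was already known. The function name, parameters, and starting-point choice are all assumptions made for illustration.

```python
import numpy as np

def penalized_min_max_seeds(X, oracle, n_classes, max_queries, penalty=0.5):
    """Query labels in farthest-first order to find one seed point per class.

    X           : (n, d) array of document vectors.
    oracle      : hypothetical callable, oracle(i) -> label of point i
                  (stands in for the interactive user query).
    penalty     : hypothetical factor in (0, 1]; the source abstract does
                  not define the actual penalty.
    Returns (seeds, queried): a dict label -> point index, and the list of
    queried indices in order.
    """
    n = X.shape[0]
    # Assumption: start from the point closest to the data mean.
    idx = int(np.argmin(np.linalg.norm(X - X.mean(axis=0), axis=1)))
    seeds, queried = {}, []
    min_dist = np.full(n, np.inf)   # distance to the nearest queried point
    weight = np.ones(n)             # penalty weights on candidates

    for _ in range(max_queries):
        label = oracle(idx)
        queried.append(idx)
        d = np.linalg.norm(X - X[idx], axis=1)
        if label in seeds:
            # Redundant query: penalize candidates whose nearest queried
            # point is this one, since they likely share the known label.
            weight[d < min_dist] *= penalty
        else:
            seeds[label] = idx
        min_dist = np.minimum(min_dist, d)
        if len(seeds) == n_classes:
            break
        # Max-min step: pick the candidate farthest from all queried points,
        # after applying the penalty weights.
        score = min_dist * weight
        score[queried] = -np.inf    # never query the same point twice
        idx = int(np.argmax(score))
    return seeds, queried
```

Under these assumptions, the returned seed points would serve as the initial centroids of the semi-supervised k-means step, for example via `X[list(seeds.values())]`.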