This paper proposes several strategies for sampling text data for automatic sentence classification aimed at detecting missing bibliographic links. We construct samples from sentences, treated as the semantic units of the text, and add their immediate context, which consists of several neighboring sentences. We examine a number of sampling strategies that differ in context size and position. The experiments are carried out on a collection of STEM scientific papers. Including sentence context in the samples improves classification results. We automatically determine the optimal sampling strategy for a given text collection by applying ensemble voting over classifications of the same data sampled in different ways. A context-aware sampling strategy combined with hard voting achieves an F1-score of 98%. This method of detecting missing bibliographic links can be used in the recommendation engines of applied intelligent information systems.
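The sampling and voting steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `STRATEGIES` list of (sentences before, sentences after) pairs, and the keyword-based `toy_classifier` are all hypothetical, chosen only to show how one target sentence can be sampled with different context windows and how the per-strategy labels can be combined by hard (majority) voting.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Hypothetical context windows: (sentences before, sentences after) the target.
STRATEGIES: List[Tuple[int, int]] = [(0, 0), (1, 0), (0, 1), (1, 1)]


def build_sample(sentences: Sequence[str], index: int, before: int, after: int) -> str:
    """Join the target sentence with up to `before` preceding and
    `after` following sentences, clipped at the document boundaries."""
    start = max(0, index - before)
    end = min(len(sentences), index + after + 1)
    return " ".join(sentences[start:end])


def hard_vote(predictions: Sequence[int]) -> int:
    """Return the most frequent label; on a tie, the label seen first wins."""
    return Counter(predictions).most_common(1)[0][0]


def classify_with_voting(
    sentences: Sequence[str],
    index: int,
    classifiers: Sequence[Callable[[str], int]],
) -> int:
    """Classify one sentence under every sampling strategy and combine
    the resulting labels by hard voting (one classifier per strategy)."""
    votes = [
        clf(build_sample(sentences, index, before, after))
        for (before, after), clf in zip(STRATEGIES, classifiers)
    ]
    return hard_vote(votes)


def toy_classifier(text: str) -> int:
    """Stand-in for a trained model (assumption): label 1 means the sample
    looks like it needs a bibliographic link, 0 means it does not."""
    return 1 if "proposed" in text.lower() else 0
```

In practice each entry in `classifiers` would be a model trained on data sampled with the matching strategy; the toy classifier only keeps the sketch self-contained.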