Text classification finds interesting applications in the pickup and delivery services industry, where customers require one or more items to be picked up from a location and delivered to a certain destination. Classifying these customer transactions into multiple categories helps in understanding the market needs of different customer segments. Each transaction is accompanied by a text description, provided by the customer, of the products being picked up and delivered, which can be used to classify the transaction. BERT-based models have proven to perform well in Natural Language Understanding. However, the product descriptions provided by customers tend to be short, incoherent, code-mixed (Hindi-English) text, which demands fine-tuning of such models with manually labelled data to achieve high accuracy. Collecting this labelled data can prove to be expensive. In this paper, we explore Active Learning strategies to label transaction descriptions cost-effectively while using BERT to train a transaction classification model. On TREC-6, AG's News Corpus and an internal dataset, we benchmark the performance of BERT across different Active Learning strategies in Multi-Class Text Classification.
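To make the pool-based Active Learning setup concrete, the sketch below illustrates one common query strategy, least-confidence (uncertainty) sampling, which is a typical candidate in such benchmarks. It is a minimal, hypothetical example: the fine-tuned BERT classifier is abstracted away as a matrix of softmax probabilities over the unlabeled pool, and the function names, pool size and batch size `k` are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def least_confidence_query(probs: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k unlabeled examples the model is least confident about.

    probs: (n_unlabeled, n_classes) predicted class probabilities from the
    current classifier (e.g., softmax outputs of a fine-tuned BERT model).
    """
    confidence = probs.max(axis=1)     # probability assigned to the top class
    return np.argsort(confidence)[:k]  # lowest-confidence examples first

# One round of a pool-based active learning loop (illustrative only):
#   1. score the unlabeled pool with the current model,
#   2. pick the k least-confident transaction descriptions,
#   3. send them for manual labelling and move them to the labelled set,
#   4. re-train BERT on the enlarged labelled set and repeat.
rng = np.random.default_rng(0)
pool_probs = rng.dirichlet(np.ones(6), size=1000)  # stand-in for BERT softmax over 6 classes (as in TREC-6)
to_label = least_confidence_query(pool_probs, k=32)
print(to_label[:10])
```

Other strategies benchmarked in this style of study (e.g., entropy or margin sampling, or random selection as a baseline) differ only in how the per-example score in `least_confidence_query` is computed.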