Obtaining labelled data in a particular context can be expensive and time-consuming. Although various approaches, including unsupervised learning, semi-supervised learning, and self-training, have been adopted, the performance of text classification varies with context. Given the lack of a labelled dataset, we propose a novel and simple unsupervised text classification model that classifies cargo content in the international shipping industry using the Standard International Trade Classification (SITC) codes. Our method represents words using pretrained GloVe word embeddings and finds the most likely label using cosine similarity. To compare the unsupervised text classification model with supervised classification, we also applied several Transformer models to classify cargo content. Because of the lack of training data, the SITC numerical codes and their corresponding textual descriptions were used as training data. A small set of manually labelled cargo content data was used to evaluate the classification performance of the unsupervised classification and the Transformer-based supervised classification. The comparison reveals that unsupervised classification significantly outperforms Transformer-based supervised classification, even after the size of the training dataset is increased by 30%. The lack of training data is a key bottleneck that prevents deep learning models (such as Transformers) from succeeding in practical applications. Unsupervised classification offers an efficient and effective alternative for classifying text when training data is scarce.
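The core idea (pretrained word embeddings plus cosine similarity against label descriptions) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding values and the SITC-style label descriptions below are toy stand-ins, where a real run would load pretrained GloVe vectors and the actual SITC textual descriptions.

```python
import math

# Toy vectors standing in for pretrained GloVe embeddings
# (hypothetical values, for illustration only).
EMBEDDINGS = {
    "wheat":  [0.9, 0.1, 0.0],
    "grain":  [0.8, 0.2, 0.1],
    "cereal": [0.7, 0.3, 0.1],
    "steel":  [0.1, 0.9, 0.2],
    "iron":   [0.2, 0.8, 0.3],
    "metal":  [0.1, 0.7, 0.4],
}

def mean_vector(words):
    """Average the embeddings of the in-vocabulary words in a text."""
    vecs = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    if not vecs:
        return None
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def classify(text, label_descriptions):
    """Assign the label whose description is closest in embedding space."""
    doc = mean_vector(text.lower().split())
    scores = {label: cosine(doc, mean_vector(desc.lower().split()))
              for label, desc in label_descriptions.items()}
    return max(scores, key=scores.get)

# Hypothetical SITC-style codes with short textual descriptions.
labels = {"041": "wheat grain cereal", "67": "iron steel metal"}
print(classify("wheat cereal", labels))  # → 041
```

Because the label descriptions themselves serve as the only "training" signal, no labelled cargo records are needed at classification time, which is what makes the approach attractive when annotation is costly.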