With the ever-growing amount of data on the internet, finding a highly informative, low-dimensional representation of text is one of the main challenges for efficient natural language processing tasks, including text classification. Such a representation should capture the semantic information of the text while retaining its relevance for document classification, so that documents with similar topics are mapped to nearby points in the vector space. To obtain representations for long texts, we propose the use of deep Siamese neural networks. To embed the topical relevance of documents into the distributed representation, we use a Siamese neural network to jointly learn document representations. Our Siamese network consists of two multi-layer perceptron sub-networks. We evaluate our representation on the text categorization task using the BBC News dataset. The results show that the proposed representations outperform conventional and state-of-the-art representations on the text classification task for this dataset.
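The following is a minimal sketch of the kind of Siamese architecture described above, assuming PyTorch. The input dimension (e.g., bag-of-words or TF-IDF document vectors), hidden and embedding sizes, and the use of a contrastive loss are illustrative assumptions; the abstract does not specify these details.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseMLP(nn.Module):
    def __init__(self, input_dim=10000, hidden_dim=512, embed_dim=128):
        super().__init__()
        # Shared multi-layer perceptron used by both branches of the Siamese network.
        self.mlp = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, doc_a, doc_b):
        # Both documents are embedded with the same weight-shared sub-network.
        return self.mlp(doc_a), self.mlp(doc_b)

def contrastive_loss(emb_a, emb_b, same_topic, margin=1.0):
    # Pull same-topic document pairs together, push different-topic pairs apart.
    dist = F.pairwise_distance(emb_a, emb_b)
    loss_same = same_topic * dist.pow(2)
    loss_diff = (1 - same_topic) * F.relu(margin - dist).pow(2)
    return (loss_same + loss_diff).mean()

Training such a network on pairs of documents labeled as same-topic or different-topic encourages the learned embeddings to cluster documents by topic, which is the property the proposed representation targets for classification.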