Subword Semantic Hashing for Intent Classification on Small Datasets

In this paper, we introduce the use of Semantic Hashing as an embedding for the task of Intent Classification and achieve state-of-the-art performance on three frequently used benchmarks. Intent Classification on a small dataset is a challenging task for data-hungry, state-of-the-art Deep Learning based systems. Semantic Hashing is an attempt to overcome this challenge and learn robust text classification. Current word-embedding-based methods depend on vocabularies. One of their major drawbacks is out-of-vocabulary terms, especially when the training dataset is small and the vocabulary in use is wide. This is the case in Intent Classification for chatbots, where small datasets are typically extracted from internet communication. Two problems arise from the use of internet communication. First, such datasets lack many of the vocabulary terms needed to use word embeddings effectively. Second, users frequently make spelling errors. Models for intent classification are typically not trained with spelling errors, and it is difficult to anticipate the ways in which users will make mistakes. Models that depend on a word vocabulary will always face such issues. An ideal classifier should handle spelling errors inherently. With Semantic Hashing, we overcome these challenges and achieve state-of-the-art results on three datasets: AskUbuntu, Chatbot, and Web Application. Our benchmarks are available online: https://github.com/kumar-shridhar/Know-Your-Intent
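
The abstract does not spell out how the subword hashing is computed. As a rough illustration of why subword features tolerate spelling errors, the sketch below assumes each word is wrapped in `#` boundary markers and split into character trigrams, which is one common form of word hashing and may differ from the authors' exact procedure. The function names `subword_trigrams` and `jaccard` are hypothetical and do not come from the paper or its repository.

```python
def subword_trigrams(text, n=3):
    """Return character n-grams for each word, assuming '#' boundary markers.

    Assumption: this padding-and-trigram scheme is illustrative only and is
    not confirmed by the abstract as the authors' exact method.
    """
    grams = []
    for word in text.lower().split():
        padded = f"#{word}#"
        # Slide a window of size n over the padded word.
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


def jaccard(a, b):
    """Jaccard overlap between two token lists, treated as sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


correct = "install"
typo = "instal"  # misspelling with one letter dropped

# Word-level features: the misspelling shares nothing with the correct word.
word_overlap = jaccard([correct], [typo])

# Subword-level features: most trigrams survive the misspelling.
subword_overlap = jaccard(subword_trigrams(correct), subword_trigrams(typo))

print(subword_trigrams(correct))
print(word_overlap, subword_overlap)
```

Under these assumptions, a misspelled word is out-of-vocabulary for a word-level model, while its trigram set still overlaps substantially with that of the correct spelling, so a classifier built on such features sees a similar input.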
