We modeled the Quora Question Pairs dataset to identify similar questions. The dataset is provided by Quora, and the task is binary classification: deciding whether two questions are similar. We tried several methods and algorithms and took a different approach from previous work. For feature extraction, we used Bag of Words representations, including Count Vectorizer and Term Frequency-Inverse Document Frequency (TF-IDF) with unigrams, as input to XGBoost and CatBoost. Furthermore, we experimented with the WordPiece tokenizer, which improved model performance significantly. We achieved up to 97 percent accuracy. Code and Dataset.
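To make the feature extraction step concrete, the following is a minimal sketch of unigram count and TF-IDF features, written with the Python standard library only. It is not the authors' code: the example questions, the function names (`tokenize`, `count_vector`, `tfidf_vectors`, `cosine`), and the smoothed IDF formula with L2 normalization are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into unigram word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def count_vector(tokens, vocabulary):
    """Bag-of-words count vector over a fixed, ordered vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def tfidf_vectors(tokenized_docs, vocabulary):
    """L2-normalized TF-IDF vectors with a smoothed IDF (an assumed variant)."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocabulary}
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        vec = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors


def cosine(u, v):
    """Cosine similarity of two vectors that are already L2-normalized."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical example questions, not taken from the dataset.
questions = [
    "How can I learn Python quickly?",
    "What is the quickest way to learn Python?",
    "Which city is the capital of France?",
]

tokenized = [tokenize(q) for q in questions]
vocabulary = sorted({t for doc in tokenized for t in doc})
counts = [count_vector(doc, vocabulary) for doc in tokenized]
tfidf = tfidf_vectors(tokenized, vocabulary)

sim_related = cosine(tfidf[0], tfidf[1])    # shares "learn" and "python"
sim_unrelated = cosine(tfidf[0], tfidf[2])  # shares no unigrams
print(sim_related > sim_unrelated)
```

In a pipeline like the one described, vectors of this kind (or pairwise features derived from them, such as the cosine similarity) would be passed to a gradient-boosted classifier; that step is omitted here because the abstract does not specify how the pair features were combined.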