Identifying Semantically Duplicate Questions Using Data Science Approach: A Quora Case Study

Identifying semantically identical questions on question-and-answer social media platforms such as Quora is important for ensuring that the quality and quantity of content presented to users match the intent of the question, which in turn enriches the overall user experience. Detecting duplicate questions is challenging because natural language is highly expressive, and a single intent can be conveyed using different words, phrases, and sentence structures. Machine learning and deep learning methods are known to achieve better results than traditional natural language processing techniques in identifying similar texts. In this paper, taking Quora as our case study, we explored and applied different machine learning and deep learning techniques to the task of identifying duplicate questions in Quora's dataset. Using feature engineering, feature importance techniques, and experiments with seven selected machine learning classifiers, we demonstrated that our models outperformed previous studies on this task. An XGBoost model with character-level term frequency-inverse document frequency (TF-IDF) features is our best machine learning model, and it also outperformed a few of the deep learning baseline models. We applied deep learning techniques to build four different multi-layer deep neural networks composed of GloVe embeddings, Long Short-Term Memory (LSTM), convolution, max pooling, dense, batch normalization, and activation layers, together with model merging. Our deep learning models achieved better accuracy than our machine learning models. Three of the four proposed architectures exceeded the accuracy reported in previous machine learning and deep learning research, two of the four exceeded the accuracy of a previous deep learning study on Quora's question pair dataset, and our best model achieved an accuracy of 85.82%, which is close to Quora's state-of-the-art accuracy.
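To make the character-level TF-IDF idea concrete, the following is a minimal sketch of how a similarity feature for a question pair could be computed. It is not the pipeline used in the paper: the trigram size, the smoothed IDF formula, the helper names (`char_ngrams`, `fit_idf`, `tfidf_vector`, `pair_similarity`), and the toy questions are all assumptions chosen for illustration, and the sketch uses only the Python standard library.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, whitespace-normalized string."""
    s = " ".join(text.lower().split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def fit_idf(corpus, n=3):
    """Compute a smoothed inverse document frequency for every n-gram in the corpus."""
    doc_freq = Counter()
    for text in corpus:
        # Count each n-gram at most once per document.
        doc_freq.update(set(char_ngrams(text, n)))
    total = len(corpus)
    # Assumed smoothing: idf = ln((1 + N) / (1 + df)) + 1
    return {g: math.log((1 + total) / (1 + df)) + 1.0 for g, df in doc_freq.items()}


def tfidf_vector(text, idf, n=3):
    """Map a string to a sparse {n-gram: tf * idf} vector; unseen n-grams are dropped."""
    counts = Counter(char_ngrams(text, n))
    return {g: c * idf[g] for g, c in counts.items() if g in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pair_similarity(q1, q2, idf, n=3):
    """Character-level TF-IDF cosine similarity for one question pair."""
    return cosine(tfidf_vector(q1, idf, n), tfidf_vector(q2, idf, n))


# Toy questions, invented for illustration only (not taken from Quora's dataset).
questions = [
    "How do I learn Python quickly?",
    "What is the fastest way to learn Python?",
    "Why is the sky blue?",
]
idf = fit_idf(questions)
sim_related = pair_similarity(questions[0], questions[1], idf)
sim_unrelated = pair_similarity(questions[0], questions[2], idf)
print(sim_related > sim_unrelated)
```

In a fuller setup, a score like this would be only one input among many: the abstract describes feeding character-level TF-IDF features to a classifier such as XGBoost, whereas this sketch stops at a single similarity value per pair.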
