We predict restaurant ratings from Yelp reviews using the Yelp Open Dataset. We present the data distribution and build a balanced training dataset. We experiment with two vectorizers for feature engineering. We implement four machine learning models: Naive Bayes, Logistic Regression, Random Forest, and Linear Support Vector Machine. We also apply four transformer-based models: BERT, DistilBERT, RoBERTa, and XLNet. Models are evaluated with accuracy, weighted F1 score, and the confusion matrix. XLNet achieves 70% accuracy on 5-star classification, compared with 64% for Logistic Regression.
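The three evaluation measures named above (accuracy, weighted F1 score, and the confusion matrix) can be illustrated with a minimal, dependency-free sketch. The star labels and predictions below are made-up toy values, not results from the Yelp data, and the function names are hypothetical helpers chosen for illustration. Weighted F1 is computed here as the per-class F1 averaged with weights proportional to each class's true count, which is the usual definition and is assumed to be the one intended.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label exactly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def weighted_f1(y_true, y_pred, labels):
    """Per-class F1, averaged with weights equal to each class's support."""
    matrix = confusion_matrix(y_true, y_pred, labels)
    total = len(y_true)
    score = 0.0
    for i in range(len(labels)):
        tp = matrix[i][i]
        support = sum(matrix[i])                   # true examples of class i
        predicted = sum(row[i] for row in matrix)  # predictions of class i
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * support / total
    return score


# Toy 5-star example (illustrative values only).
labels = [1, 2, 3, 4, 5]
y_true = [1, 2, 3, 4, 5, 5, 4, 3]
y_pred = [1, 2, 3, 5, 5, 4, 4, 2]

cm = confusion_matrix(y_true, y_pred, labels)
acc = accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred, labels)

for label, row in zip(labels, cm):
    print(f"true {label}-star: {row}")
print(f"accuracy    = {acc:.3f}")
print(f"weighted F1 = {wf1:.3f}")
```

In a 5-star setting the confusion matrix is often the most informative of the three, since it shows whether errors fall on adjacent ratings (for example, 4 stars predicted as 5) or on distant ones.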