Abstract
Task Overview
In e-commerce, understanding users' purchasing power for the various products across different categories is extremely important! It helps retailers create personalized offers for the customers of different products. In this article, I walk step by step through predicting users' purchase behavior for various products across different categories, on a real dataset.
If you would like to follow along and build the project yourself, first install PySpark as described in the previous article, then download the data from:
https://datahack.analyticsvidhya.com/contest/black-friday/
Dataset Overview
A retail company wants to understand customer purchase behavior (specifically, the purchase amount) across various products in different categories. The company has shared purchase summaries of various customers for a selection of high-volume products from last month. The dataset also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category), and the total purchase_amount from last month. The company now wants to build a model to predict customers' purchase amounts for various products, which will help it create personalized offers for the customers of different products.
Hands-On Walkthrough
1. Importing the Data
Here we read the data with PySpark's read.csv interface, which is strikingly similar to the pandas reading interface.
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("test") \
.config("spark.some.config.option", "setting") \
.getOrCreate()
train = spark.read.csv('./BlackFriday/train.csv', header=True, inferSchema=True)
test = spark.read.csv('./BlackFriday/test.csv', header=True, inferSchema=True)
2. Inspecting the Column Types
To see the types of the columns in a DataFrame, we can use the printSchema() method. Let's apply printSchema() to train; it prints the schema in a tree format.
train.printSchema()
"""
root
|-- User_ID: integer (nullable = true)
|-- Product_ID: string (nullable = true)
|-- Gender: string (nullable = true)
|-- Age: string (nullable = true)
|-- Occupation: integer (nullable = true)
|-- City_Category: string (nullable = true)
|-- Stay_In_Current_City_Years: string (nullable = true)
|-- Marital_Status: integer (nullable = true)
|-- Product_Category_1: integer (nullable = true)
|-- Product_Category_2: integer (nullable = true)
|-- Product_Category_3: integer (nullable = true)
|-- Purchase: integer (nullable = true)
"""
3. Previewing the Dataset
train.head(5)
"""
[Row(User_ID=1000001, Product_ID='P00069042', Gender='F', Age='0-17', Occupation=10, City_Category='A', Stay_In_Current_City_Years='2', Marital_Status=0, Product_Category_1=3, Product_Category_2=None, Product_Category_3=None, Purchase=8370),
Row(User_ID=1000001, Product_ID='P00248942', Gender='F', Age='0-17', Occupation=10, City_Category='A', Stay_In_Current_City_Years='2', Marital_Status=0, Product_Category_1=1, Product_Category_2=6, Product_Category_3=14, Purchase=15200),
Row(User_ID=1000001, Product_ID='P00087842', Gender='F', Age='0-17', Occupation=10, City_Category='A', Stay_In_Current_City_Years='2', Marital_Status=0, Product_Category_1=12, Product_Category_2=None, Product_Category_3=None, Purchase=1422),
Row(User_ID=1000001, Product_ID='P00085442', Gender='F', Age='0-17', Occupation=10, City_Category='A', Stay_In_Current_City_Years='2', Marital_Status=0, Product_Category_1=12, Product_Category_2=14, Product_Category_3=None, Purchase=1057),
Row(User_ID=1000002, Product_ID='P00285442', Gender='M', Age='55+', Occupation=16, City_Category='C', Stay_In_Current_City_Years='4+', Marital_Status=0, Product_Category_1=8, Product_Category_2=None, Product_Category_3=None, Purchase=7969)]
"""
4. Imputing Missing Values
First we count how many rows would survive dropping every row that contains a null, then fill the nulls with -1 as a sentinel value.
train.na.drop('any').count(), test.na.drop('any').count()
"""
(166821, 71037)
"""
train = train.fillna(-1)
test = test.fillna(-1)
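If you want to see where the nulls actually are before imputing them, you can count the nulls per column with pyspark.sql.functions; a small sketch (run it before the fillna above):
from pyspark.sql import functions as F
# count the null values in every column of train
train.select([F.count(F.when(F.col(c).isNull(), c)).alias(c)
              for c in train.columns]).show()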
5. Analyzing the Numeric Features
The describe() method computes count, mean, standard deviation, min, and max for every column (the mean and stddev are null for string columns).
train.describe().show()
"""
+-------+------------------+----------+------+------+------------------+-------------+--------------------------+-------------------+------------------+------------------+------------------+-----------------+
|summary| User_ID|Product_ID|Gender| Age| Occupation|City_Category|Stay_In_Current_City_Years| Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3| Purchase|
+-------+------------------+----------+------+------+------------------+-------------+--------------------------+-------------------+------------------+------------------+------------------+-----------------+
| count| 550068| 550068|550068|550068| 550068| 550068| 550068| 550068| 550068| 550068| 550068| 550068|
| mean|1003028.8424013031| null| null| null| 8.076706879876669| null| 1.468494139793958|0.40965298835780306| 5.404270017525106| 6.419769919355425| 3.145214773446192|9263.968712959126|
| stddev| 1727.591585530871| null| null| null|6.5226604873418115| null| 0.989086680757309| 0.4917701263173259| 3.936211369201324| 6.565109781181374| 6.681038828257864|5023.065393820593|
| min| 1000001| P00000142| F| 0-17| 0| A| 0| 0| 1| -1| -1| 12|
| max| 1006040| P0099942| M| 55+| 20| C| 4+| 1| 20| 18| 18| 23961|
+-------+------------------+----------+------+------+------------------+-------------+--------------------------+-------------------+------------------+------------------+------------------+-----------------+
"""
train.select('User_ID','Age').show(5)
"""
+-------+----+
|User_ID| Age|
+-------+----+
|1000001|0-17|
|1000001|0-17|
|1000001|0-17|
|1000001|0-17|
|1000002| 55+|
+-------+----+
only showing top 5 rows
"""
6. Analyzing the Categorical Features
First we compare how many distinct Product_ID values appear in train and test, and then check which Product_IDs occur in test but not in train.
train.select('Product_ID').distinct().count(), test.select('Product_ID').distinct().count()
"""
(3631, 3491)
"""
diff_cat_in_train_test = test.select('Product_ID').subtract(train.select('Product_ID'))
diff_cat_in_train_test.distinct().count()
"""
46
"""
diff_cat_in_train_test.distinct().show(5)
"""
+----------+
|Product_ID|
+----------+
| P00322642|
| P00300142|
| P00077642|
| P00249942|
| P00294942|
+----------+
only showing top 5 rows
"""
7. Converting Categorical Variables to Labels
StringIndexer maps each distinct Product_ID string to a numeric index (by default ordered by frequency, so the most common value gets 0.0). We fit it on train and apply the fitted labeller to both train and test.
from pyspark.ml.feature import StringIndexer
plan_indexer = StringIndexer(inputCol='Product_ID', outputCol='product_id_trans')
labeller = plan_indexer.fit(train)
Train1 = labeller.transform(train)
Test1 = labeller.transform(test)
Train1.show(2)
"""
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+----------------+
|User_ID|Product_ID|Gender| Age|Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|product_id_trans|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+----------------+
|1000001| P00069042| F|0-17| 10| A| 2| 0| 3| -1| -1| 8370| 766.0|
|1000001| P00248942| F|0-17| 10| A| 2| 0| 1| 6| 14| 15200| 183.0|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+----------------+
only showing top 2 rows
"""
Train1.select('product_id_trans').show(2)
"""
+----------------+
|product_id_trans|
+----------------+
| 766.0|
| 183.0|
+----------------+
only showing top 2 rows
"""
8. Selecting Features to Build the Machine Learning Model
from pyspark.ml.feature import RFormula
formula = RFormula(
    formula="Purchase ~ Age + Occupation + City_Category + Stay_In_Current_City_Years + Product_Category_1 + Product_Category_2 + Gender",
    featuresCol="features",
    labelCol="label")
t1 = formula.fit(Train1)
train1 = t1.transform(Train1)
test1 = t1.transform(Test1)
train1.show(2)
"""
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+----------------+--------------------+-------+
|User_ID|Product_ID|Gender| Age|Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|product_id_trans| features| label|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+----------------+--------------------+-------+
|1000001| P00069042| F|0-17| 10| A| 2| 0| 3| -1| -1| 8370| 766.0|(16,[6,10,13,14],...| 8370.0|
|1000001| P00248942| F|0-17| 10| A| 2| 0| 1| 6| 14| 15200| 183.0|(16,[6,10,13,14],...|15200.0|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+----------------+--------------------+-------+
only showing top 2 rows
"""
After applying this formula, train1 and test1 each have two additional columns, features and label, named according to what we specified in the formula (featuresCol="features" and labelCol="label"). Intuitively, all the categorical variables in the features column of train1 and test1 have been converted to numeric values, while the numeric variables remain the same as before. We can also inspect the features and label columns of train1 and test1.
train1.select('features').show(2)
"""
+--------------------+
| features|
+--------------------+
|(16,[6,10,13,14],...|
|(16,[6,10,13,14],...|
+--------------------+
only showing top 2 rows
"""
train1.select('label').show(2)
"""
+-------+
| label|
+-------+
| 8370.0|
|15200.0|
+-------+
only showing top 2 rows
"""
9. Building the Machine Learning Model
After applying RFormula and transforming the DataFrames, we can now build a machine learning model on this data. For this task I will use random forest regression. Let's import the random forest regressor defined in pyspark.ml and build a model called rf, using the algorithm's default parameters.
from pyspark.ml.regression import RandomForestRegressor
rf = RandomForestRegressor()
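The defaults work here because RandomForestRegressor expects featuresCol='features' and labelCol='label', exactly the columns RFormula produced. If you later want to tune the model, the main knobs look like this (rf_tuned and the values below are purely illustrative, not tuned):
# illustrative, untuned hyperparameters
rf_tuned = RandomForestRegressor(featuresCol='features', labelCol='label',
                                 numTrees=20,  # number of trees in the forest
                                 maxDepth=5,   # maximum depth of each tree
                                 seed=42)      # fixed seed for reproducibility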
After creating the model rf, we need to split the train1 data into train_cv and test_cv to validate the model (a simple 70/30 holdout split rather than full cross-validation): 70% of train1 goes to train_cv and 30% to test_cv.
(train_cv, test_cv) = train1.randomSplit([0.7, 0.3])
model1 = rf.fit(train_cv)
predictions = model1.transform(test_cv)
10. Evaluating the Model
Let's evaluate the predictions on test_cv and see what the rmse and mse are.
To evaluate the model, we need to import RegressionEvaluator from pyspark.ml.evaluation and create an evaluator object. The evaluator has an evaluate method that scores the predictions; we need to pass it the metric to use.
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator()
mse = evaluator.evaluate(predictions, {evaluator.metricName: "mse"})
import numpy as np
np.sqrt(mse), mse
"""
(3832.4796474051345, 14687900.247774584)
"""
From the computation we can see that our rmse is 3832.4796474051345 (since randomSplit picks rows at random, your exact numbers may differ from run to run).
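Instead of taking the square root by hand, RegressionEvaluator can also return rmse (or other metrics such as mae and r2) directly:
# same evaluator, asking for rmse instead of mse
rmse = evaluator.evaluate(predictions, {evaluator.metricName: "rmse"})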
Now let's train a model again on the full train1 dataset.
model = rf.fit(train1)
predictions1 = model.transform(test1)
df = predictions1.selectExpr("User_ID as User_ID", "Product_ID as Product_ID", 'prediction as Purchase')
df.toPandas().to_csv('./BlackFriday/submission.csv')
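Note that pandas' to_csv writes the row index as an extra unnamed first column by default; for a cleaner submission file you may want to disable it:
# drop the pandas index column from the submission file
df.toPandas().to_csv('./BlackFriday/submission.csv', index=False)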
After writing the csv file (submission.csv), we can upload our first solution to see its score; I got a score of 3844.20920145983.
Summary