[Introduction] This article introduces an open-source project, sk-dist. Hyperparameter tuning that takes 7.2 minutes on a single machine with no parallelization takes only 3.4 seconds with sk-dist on a Spark cluster of more than one hundred cores, speeding up scikit-learn training by roughly 100x.
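For reference, the single-machine baseline quoted above corresponds to running the same search with scikit-learn's built-in GridSearchCV, with no Spark and no joblib parallelism. The snippet below is a minimal sketch of that baseline, reusing the same dataset and parameter grid as the distributed example that follows; actual timings will depend on hardware.

```python
import time
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

# the digits dataset
digits = datasets.load_digits()
X = digits["data"]
y = digits["target"]

# same parameter grid as the distributed example below
param_grid = {
    "C": [0.01, 0.01, 0.1, 1.0, 10.0, 20.0, 50.0],
    "gamma": ["scale", "auto", 0.001, 0.01, 0.1],
    "kernel": ["rbf", "poly", "sigmoid"]
}

# single-machine grid search (default n_jobs=1, i.e. no parallelism)
start = time.time()
model = GridSearchCV(svm.SVC(), param_grid, cv=10, scoring="f1_weighted")
model.fit(X, y)
print("Train time: {0}".format(time.time() - start))
```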
```python
import time
from sklearn import datasets, svm
from skdist.distribute.search import DistGridSearchCV
from pyspark.sql import SparkSession

# instantiate spark session
spark = (
    SparkSession
    .builder
    .getOrCreate()
)
sc = spark.sparkContext

# the digits dataset
digits = datasets.load_digits()
X = digits["data"]
y = digits["target"]

# create a classifier: a support vector classifier
classifier = svm.SVC()
param_grid = {
    "C": [0.01, 0.01, 0.1, 1.0, 10.0, 20.0, 50.0],
    "gamma": ["scale", "auto", 0.001, 0.01, 0.1],
    "kernel": ["rbf", "poly", "sigmoid"]
}
scoring = "f1_weighted"
cv = 10

# hyperparameter optimization
start = time.time()
model = DistGridSearchCV(
    classifier, param_grid,
    sc=sc, cv=cv, scoring=scoring, verbose=True
)
model.fit(X, y)
print("Train time: {0}".format(time.time() - start))
print("Best score: {0}".format(model.best_score_))
```

```
------------------------------
Spark context found; running with spark
Fitting 10 folds for each of 105 candidates, totalling 1050 fits
Train time: 3.380601406097412
Best score: 0.981450024203508
```
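DistGridSearchCV is designed as a drop-in counterpart to scikit-learn's GridSearchCV, so the fitted object can presumably be inspected and used like any other scikit-learn search object. The lines below are a sketch under that assumption; the source output only demonstrates `best_score_`.

```python
# A minimal sketch, assuming the fitted DistGridSearchCV exposes the usual
# GridSearchCV attributes (best_params_, predict) after refitting.
print(model.best_params_)      # winning hyperparameter combination
preds = model.predict(X[:10])  # predictions from the refitted best estimator
```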