Performance and Energy Consumption of Parallel Machine Learning Algorithms

Xidong Wu, Preston Brazzle, Stephen Cahoon

Machine learning models have achieved remarkable success in real-world applications such as data science, computer vision, and natural language processing. Training these models, however, requires large-scale data sets and many iterations before the model works properly. Parallelizing the training algorithm is a common strategy for speeding up this process. Many studies of model training and inference focus only on performance, yet power consumption is also an important metric for any type of computation, especially high-performance applications. Machine learning algorithms for low-power platforms such as sensors and mobile devices have been researched, but less work has addressed power optimization for algorithms designed for high-performance computing. In this paper, we present a C++ implementation of logistic regression and a genetic algorithm, and a Python implementation of neural networks trained with stochastic gradient descent (SGD), all applied to classification tasks. We show the impact that model complexity and training data size have on the parallel efficiency of each algorithm in terms of both power and performance. We also test these implementations using shared-memory parallelism, distributed-memory parallelism, and GPU acceleration to speed up model training.
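The abstract speaks of parallel efficiency in terms of both performance and power without stating the metrics. The following is a minimal sketch, assuming the conventional definitions (speedup as serial time over parallel time, efficiency as speedup per worker, and energy as average power times run time). The function names are hypothetical and are not taken from the paper, and this is not necessarily how the authors measured their results.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Conventional speedup: serial run time divided by parallel run time."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_serial / t_parallel


def parallel_efficiency(t_serial: float, t_parallel: float, workers: int) -> float:
    """Speedup per worker; 1.0 would correspond to ideal linear scaling."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return speedup(t_serial, t_parallel) / workers


def energy_joules(avg_power_watts: float, seconds: float) -> float:
    """Energy consumed, assuming a constant average power draw over the run."""
    if avg_power_watts < 0 or seconds < 0:
        raise ValueError("power and time must be non-negative")
    return avg_power_watts * seconds


def energy_ratio(p_serial: float, t_serial: float,
                 p_parallel: float, t_parallel: float) -> float:
    """Serial energy over parallel energy (assumed metric, not from the paper).

    A value above 1.0 means the parallel run used less energy overall,
    even if it drew more power while running.
    """
    parallel_energy = energy_joules(p_parallel, t_parallel)
    if parallel_energy == 0:
        raise ValueError("parallel energy must be positive")
    return energy_joules(p_serial, t_serial) / parallel_energy
```

Under these definitions, a parallel run can be faster and still cost more energy: if doubling the workers doubles the average power draw but cuts the run time by less than half, `energy_ratio` falls below 1.0 even though `speedup` is above 1.0.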
