Training machine learning models requires large datasets. However, collecting, curating, and handling large, complex sets of real-world data raises problems of cost, ethical and legal constraints, and data availability. Here we propose a novel algorithm that generates large artificial datasets for training machine learning models when real-world data are extremely scarce. The method is built on a genetic algorithm that mutates randomly generated datasets, each of which is then used to train a neural network. After training, the performance of the neural network on a batch of real-world data serves as a surrogate for the fitness of the generated dataset it was trained on. As selection pressure is applied to the population of generated datasets, unfit individuals are discarded, and the fitness of the fittest individuals increases over generations. The performance of the data generation algorithm was measured on the Iris dataset and on the Breast Cancer Wisconsin diagnostic dataset. When real-world data were abundant, the mean accuracy of machine learning models trained on generated data was comparable to that of models trained on real-world data (0.956 in both cases on the Iris dataset, p = 0.6996, and 0.9377 versus 0.9472 on the Breast Cancer dataset, p = 0.1189). Under simulated extreme scarcity of real-world data, the mean accuracy of models trained on generated data was significantly higher than that of comparable models trained on the scarce real-world data (0.9533 versus 0.9067 on the Iris dataset, p < 0.0001, and 0.8692 versus 0.7701 on the Breast Cancer dataset, p = 0.0091). In conclusion, this novel algorithm can generate large artificial datasets for training machine learning models when real-world data are extremely scarce, or when cost or data sensitivity prevents the collection of large real-world datasets.
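The generate, train, score, and select loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: a nearest-centroid classifier stands in for the neural network, a toy two-class 2-D sample stands in for the real-world batch, and every function name, population size, and mutation rate below is an assumption chosen for illustration.

```python
import random

def make_real_batch(rng, n_per_class=5):
    """Toy stand-in for the scarce real-world batch: two 2-D Gaussian classes."""
    centers = {0: (0.0, 0.0), 1: (3.0, 3.0)}
    return [([rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)], label)
            for label, (cx, cy) in centers.items()
            for _ in range(n_per_class)]

def random_dataset(rng, size=20):
    """One individual: randomly generated points with random labels."""
    return [([rng.uniform(-3, 6), rng.uniform(-3, 6)], rng.randint(0, 1))
            for _ in range(size)]

def train_centroids(dataset):
    """Stand-in model (assumption): per-class centroids instead of a neural network."""
    sums = {}
    for x, y in dataset:
        s = sums.setdefault(y, [0.0, 0.0, 0])
        s[0] += x[0]
        s[1] += x[1]
        s[2] += 1
    return {y: (s[0] / s[2], s[1] / s[2]) for y, s in sums.items()}

def predict(model, x):
    """Return the label of the closest centroid."""
    return min(model, key=lambda y: (x[0] - model[y][0]) ** 2 + (x[1] - model[y][1]) ** 2)

def fitness(dataset, real_batch):
    """Surrogate fitness: accuracy on real data of a model trained on the generated dataset."""
    model = train_centroids(dataset)
    return sum(predict(model, x) == y for x, y in real_batch) / len(real_batch)

def mutate(dataset, rng, rate=0.2, scale=0.5):
    """Copy a dataset, jittering some points and occasionally flipping their label."""
    child = []
    for x, y in dataset:
        if rng.random() < rate:
            x = [v + rng.gauss(0, scale) for v in x]
            if rng.random() < 0.1:
                y = 1 - y
        child.append((list(x), y))
    return child

def evolve(real_batch, rng, pop_size=20, generations=30, keep=5):
    """Keep the fittest datasets each generation and refill with mutated copies."""
    population = [random_dataset(rng) for _ in range(pop_size)]
    history = []  # best fitness observed at each generation
    for _ in range(generations):
        population.sort(key=lambda d: fitness(d, real_batch), reverse=True)
        survivors = population[:keep]  # unfit individuals are discarded
        history.append(fitness(survivors[0], real_batch))
        offspring = [mutate(rng.choice(survivors), rng) for _ in range(pop_size - keep)]
        population = survivors + offspring
    best = max(population, key=lambda d: fitness(d, real_batch))
    return best, history

if __name__ == "__main__":
    rng = random.Random(0)
    real = make_real_batch(rng)
    best, history = evolve(real, rng)
    print("best fitness per generation:", history)
```

Because the survivors are carried over unchanged, the best fitness in this sketch cannot decrease from one generation to the next. A realistic evaluation would also score the final model on real data that was not used to compute fitness.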