We propose a series of data-centric heuristics for improving the performance of machine learning systems applied to problems in quantum information science. In particular, we consider how systematic engineering of training sets can significantly enhance the accuracy of pre-trained neural networks used for quantum state reconstruction without altering the underlying architecture. We find that it is not always optimal to engineer training sets to exactly match the expected distribution of a target scenario; instead, performance can be further improved by biasing the training set to be slightly more mixed than the target. This is due to the heterogeneity in the number of free variables required to describe states of different purity, and as a result, the overall accuracy of the network improves when training sets of a fixed size focus on states with the least constrained free variables. For further clarity, we also include a "toy model" demonstration of how spurious correlations can inadvertently enter synthetic data sets used for training, how the performance of systems trained with these correlations can degrade dramatically, and how the inclusion of even relatively few counterexamples can effectively remedy such problems.
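As a minimal sketch of the purity-biasing idea (not the paper's actual procedure), one common way to generate training ensembles of density matrices with a tunable degree of mixedness is the Ginibre construction: drawing $\rho = GG^\dagger / \mathrm{Tr}(GG^\dagger)$ with $G$ a $d \times k$ complex Gaussian matrix, where a larger ancilla dimension $k$ biases the ensemble toward more mixed states. The function names and parameter choices below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_density_matrix(d, k, rng):
    """Draw rho = G G^dag / Tr(G G^dag), with G a d x k complex Ginibre matrix.

    k = 1 yields pure states; increasing k (illustrative knob here) biases
    the ensemble toward more mixed states.
    """
    G = rng.normal(size=(d, k)) + 1j * rng.normal(size=(d, k))
    rho = G @ G.conj().T
    return rho / np.trace(rho).real

def purity(rho):
    """Tr(rho^2); equals 1 for pure states, 1/d for maximally mixed."""
    return np.trace(rho @ rho).real

# Compare average purity of a "pure-biased" vs. a "mixed-biased" ensemble
d = 2
pure_like = np.mean([purity(random_density_matrix(d, 1, rng)) for _ in range(500)])
mixed = np.mean([purity(random_density_matrix(d, 4, rng)) for _ in range(500)])
print(f"mean purity, k=1: {pure_like:.3f}; k=4: {mixed:.3f}")
```

Under this sketch, a training set "slightly more mixed than the target" could be produced simply by sampling with a somewhat larger $k$ than the one matching the target ensemble.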