In this paper, we address some of the key limitations to realizing a generic heterogeneous parallel programming model for quantum-classical platforms. We discuss our experience in enabling user-level multi-threading in QCOR, as well as challenges that need to be addressed for programming future quantum-classical systems. Specifically, we describe the design and implementation of C++-based parallel constructs that enable 1) parallel execution of a quantum kernel with std::thread and 2) asynchronous execution with std::async. To do so, we provide a detailed overview of the current implementation of the QCOR programming model and runtime, and discuss how we 1) add thread safety to some of its user-facing API routines and 2) increase parallelism in QCOR by removing data races that inhibit multi-threading, so as to better utilize available computing resources. We also present preliminary performance results with the Quantum++ back end on a single-node Ryzen 9 3900X machine with 12 physical cores (24 hardware threads) and 128 GB of RAM. The results show that running two Bell kernels in parallel with 12 threads per kernel outperforms running the kernels one after the other with 24 threads each (a 1.63x improvement). We observe the same trend when running two Shor's algorithm kernels in parallel (1.22x faster than executing the kernels one after the other). The trends remain the same even when we use only physical cores instead of hardware threads. We believe that our design, implementation, and results will open up opportunities not only for 1) quicker prototyping of parallelism- and asynchrony-aware quantum-classical algorithms on quantum circuit simulators in the short term, but also for 2) realizing a generic heterogeneous parallel programming model for quantum-classical platforms in the long term.
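The two launch patterns named above, parallel execution with std::thread and asynchronous execution with std::async, can be sketched in standard C++ as follows. This is only an illustration under stated assumptions: `run_bell_kernel` is a hypothetical classical stand-in that samples an ideal Bell state, and `run_with_threads` and `run_with_async` are illustrative names. None of them are part of the QCOR API, and the sketch does not reproduce the thread-safety changes made inside the QCOR runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <map>
#include <random>
#include <string>
#include <thread>
#include <vector>

using Counts = std::map<std::string, std::size_t>;

// Hypothetical stand-in for a quantum kernel launch (NOT a QCOR call):
// an ideal Bell state yields "00" or "11" with equal probability.
Counts run_bell_kernel(std::size_t shots, unsigned seed) {
    std::mt19937 rng(seed);  // one generator per call, so no shared state
    std::bernoulli_distribution coin(0.5);
    Counts counts{{"00", 0}, {"11", 0}};
    for (std::size_t i = 0; i < shots; ++i) {
        ++counts[coin(rng) ? "11" : "00"];
    }
    assert(counts["00"] + counts["11"] == shots);
    return counts;
}

// Pattern 1: one std::thread per kernel instance.
// Each thread writes only to its own slot of `results`, so there is no data race.
std::vector<Counts> run_with_threads(std::size_t n_kernels, std::size_t shots) {
    std::vector<Counts> results(n_kernels);
    std::vector<std::thread> workers;
    workers.reserve(n_kernels);
    for (std::size_t k = 0; k < n_kernels; ++k) {
        workers.emplace_back([&results, k, shots] {
            results[k] = run_bell_kernel(shots, static_cast<unsigned>(k));
        });
    }
    for (auto& worker : workers) {
        worker.join();  // wait for every kernel before reading results
    }
    return results;
}

// Pattern 2: std::async returns a future per kernel; the caller is free to do
// classical work before collecting the results with get().
std::vector<Counts> run_with_async(std::size_t n_kernels, std::size_t shots) {
    std::vector<std::future<Counts>> futures;
    futures.reserve(n_kernels);
    for (std::size_t k = 0; k < n_kernels; ++k) {
        futures.push_back(std::async(std::launch::async, run_bell_kernel,
                                     shots, static_cast<unsigned>(k)));
    }
    std::vector<Counts> results;
    results.reserve(n_kernels);
    for (auto& future : futures) {
        results.push_back(future.get());  // blocks until that kernel finishes
    }
    return results;
}
```

Calling `run_with_threads(2, 1024)` mirrors the two-kernel setup described above in structure only. The stand-in kernel is single-threaded, so the sketch says nothing about how simulator threads are divided between kernels or about the reported speedups.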