In this paper, we address some of the key limitations to realizing a generic heterogeneous parallel programming model for quantum-classical platforms. We discuss our experience in enabling user-level multi-threading in QCOR, as well as challenges that remain for programming future quantum-classical systems. Specifically, we describe the design and implementation of C++-based parallel constructs that enable 1) parallel execution of a quantum kernel with std::thread and 2) asynchronous execution with std::async. To do so, we provide a detailed overview of the current implementation of the QCOR programming model and runtime, and discuss how we 1) add thread safety to some of its user-facing API routines and 2) increase parallelism in QCOR by removing data races that inhibit multi-threading, so as to better utilize available computing resources. We also present preliminary performance results with the Quantum++ back end on a single-node Ryzen 9 3900X machine that has 12 physical cores (24 hardware threads) and 128 GB of RAM. The results show that running two Bell kernels in parallel with 12 threads per kernel outperforms running the kernels one after the other with 24 threads each (1.63x improvement). We observe the same trend when running two Shor's algorithm kernels in parallel (1.22x faster than executing the kernels one after the other). Furthermore, the parallel version is better in terms of strong scalability. We believe that our design, implementation, and results will open up an opportunity not only for 1) enabling quicker prototyping of parallel/asynchrony-aware quantum-classical algorithms on quantum circuit simulators in the short term, but also for 2) realizing a generic heterogeneous parallel programming model for quantum-classical platforms in the long term.
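The two launch patterns named above can be illustrated with a minimal sketch in standard C++. It does not use the QCOR API: `sample_bell`, `run_with_threads`, and `run_with_async` are hypothetical names, and the "kernel" is a classical stand-in that samples the two outcomes of an ideal Bell state. The point is only the shape of the host code: each task owns its own state (here, its own random number generator), which is the kind of data-race freedom a runtime needs before kernels can be launched concurrently.

```cpp
#include <cstddef>
#include <future>
#include <map>
#include <random>
#include <string>
#include <thread>
#include <utility>

// Histogram of measured bitstrings, e.g. {"00": 512, "11": 488}.
using Counts = std::map<std::string, std::size_t>;

// Hypothetical stand-in for a quantum kernel launch (NOT a QCOR call):
// an ideal Bell state yields "00" or "11" with equal probability.
Counts sample_bell(std::size_t shots, unsigned seed) {
    std::mt19937 rng(seed);  // per-call generator, so tasks share no state
    std::bernoulli_distribution coin(0.5);
    Counts counts{{"00", 0}, {"11", 0}};
    for (std::size_t i = 0; i < shots; ++i) {
        ++counts[coin(rng) ? "11" : "00"];
    }
    return counts;
}

// Pattern 1: explicit threads. Each thread writes only to its own result
// object, and join() orders those writes before the results are read.
std::pair<Counts, Counts> run_with_threads(std::size_t shots) {
    Counts a, b;
    std::thread t0([&a, shots] { a = sample_bell(shots, 1u); });
    std::thread t1([&b, shots] { b = sample_bell(shots, 2u); });
    t0.join();
    t1.join();
    return {a, b};
}

// Pattern 2: asynchronous tasks. std::launch::async requests a separate
// thread per task; get() blocks until the corresponding result is ready.
std::pair<Counts, Counts> run_with_async(std::size_t shots) {
    std::future<Counts> f0 = std::async(std::launch::async, sample_bell, shots, 1u);
    std::future<Counts> f1 = std::async(std::launch::async, sample_bell, shots, 2u);
    Counts a = f0.get();
    Counts b = f1.get();
    return {a, b};
}
```

Calling `run_with_threads(1024)` or `run_with_async(1024)` returns one histogram per task. In this sketch the seeds are fixed, so both patterns produce the same pair of histograms. A real back end would replace `sample_bell` with an actual kernel invocation.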