Flaky tests, i.e., tests that pass or fail nondeterministically without changes to code or environment, pose a serious threat to software reliability. While classical software engineering has developed a rich body of dynamic and static techniques to study flakiness, corresponding evidence for quantum software remains limited. Prior work relies primarily on static analysis or small sets of manually reported incidents, leaving open questions about the prevalence, characteristics, and detectability of flaky tests. This paper presents the first large-scale dynamic characterization of flaky tests in quantum software. We executed the Qiskit Terra test suite 10,000 times across 23 releases in controlled environments. For each release, we measured test-outcome variability, identified flaky tests, estimated empirical failure probabilities, analyzed recurrence across versions, and used Wilson confidence intervals to quantify the rerun budgets required for reliable detection. We further mapped flaky tests to Terra subcomponents to assess component-level susceptibility. Across 27,026 test cases, we identified 290 distinct flaky tests. Although overall flakiness rates were low (0–0.4%), flakiness was highly episodic: nearly two-thirds of flaky tests appeared in only one release, while a small subset recurred intermittently or persistently. Many flaky tests failed with very small empirical probabilities ($\hat{p} \approx 10^{-4}$), implying that tens of thousands of executions may be required for confident detection. Flakiness was unevenly distributed across subcomponents, with 'transpiler' and 'quantum_info' accounting for the largest share. These results show that quantum test flakiness is rare but difficult to detect under typical continuous integration budgets. To support future research, we release a public dataset of per-test execution outcomes.
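The abstract's claim that tests failing with $\hat{p} \approx 10^{-4}$ need tens of thousands of reruns can be reproduced with a short calculation. The sketch below is illustrative, not the paper's actual analysis code: `wilson_interval` implements the standard Wilson score interval for a binomial proportion, and `reruns_for_detection` assumes independent, identically distributed runs, so that the probability of seeing at least one failure in $n$ runs is $1 - (1 - p)^n$.

```python
import math

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k failures in n runs (default z for ~95%)."""
    phat = k / n
    denom = 1 + z**2 / n
    center = (phat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2))
    return center - half, center + half

def reruns_for_detection(p, confidence=0.95):
    """Minimum n such that P(at least one failure in n runs) >= confidence,
    assuming independent runs with per-run failure probability p."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))

# A test with per-run failure probability ~1e-4 requires roughly 30,000
# executions before a single failure is observed with 95% probability,
# consistent with the abstract's "tens of thousands of executions" claim.
budget = reruns_for_detection(1e-4)
```

For example, observing 3 failures in 10,000 runs yields a Wilson interval of roughly $[1 \times 10^{-4}, 9 \times 10^{-4}]$, illustrating how wide the uncertainty on $\hat{p}$ remains even at this rerun budget.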