Choosing hardware for theorem proving is no simple task: automated provers are highly complex and optimized programs, often utilizing a parallel computation model, and there is little prior research on the hardware impact on prover performance. To alleviate the problem for Isabelle, we initiated a community benchmark where the build time of HOL-Analysis is measured. On $54$ distinct CPUs, a total of $669$ runs with different Isabelle configurations were reported by Isabelle users. Results range from $107$s to over $11$h. We found that current consumer CPUs performed best, with an optimal number of $8$ to $16$ threads, largely independent of heap memory. As for hardware parameters, CPU base clock affected multi-threaded execution most with a linear correlation of $0.37$, whereas boost frequency was the most influential parameter for single-threaded runs (correlation coefficient $0.55$); cache size played no significant role. When comparing our benchmark scores with popular high-performance computing benchmarks, we found a strong linear relationship with Dolfyn ($R^2 = 0.79$) in the single-threaded scenario. Using data from the 3DMark CPU Profile consumer benchmark, we created a linear model for optimal (multi-threaded) Isabelle performance. When validating, the model has an average $R^2$-score of $0.87$; the mean absolute error in the final model corresponds to a wall-clock time of $46.6$s. With a dataset of true median values for the 3DMark, the error improves to $37.1$s.
翻译:为了缓解伊莎贝尔的问题,我们启动了一个社区基准,用于测量HOL- Anacular的构建时间。在54美元不同的CPU上,伊莎贝尔用户报告了与伊莎贝尔不同配置的669美元运行量。结果从107美元到11美元以上不等。我们发现,目前消费的CPU表现最佳,最佳数量为8美元至16美元,基本上不依赖高密度存储。关于硬件参数,CPU基准对多读执行影响最大,线性关系为0.37美元,而增强频率是单读运行最有影响力的参数(摄氏系数0.55美元);缓存大小由伊莎贝尔用户报告。在将我们的基准模型与流行的高性能计算基准进行比较时,我们发现与Dolfyn (R%2=0.9美元) 的绝对直线性关系有所改善,在单读值参数中,C.