Although the full Bitcoin blockchain is publicly available, collecting and processing its data is not trivial. Its sheer size, history, and other features raise quite specific challenges, which we address in this paper. The strengths of our approach are the following: it relies on very basic and standard tools, which makes the procedure reliable and easily reproducible; it is a purely lossless procedure, ensuring that we capture and preserve all existing data; and it provides additional indexing that makes it easy to further process the whole dataset and select appropriate subsets of it. We present our procedure in detail and illustrate its added value on large-scale use cases, such as address clustering. We provide an implementation online, as well as the resulting dataset.