The WARC file format is widely used by web archives to preserve collected web content for future use. With the rapid growth of web archives and the increasing interest in reusing these archives as big data sources for statistical and analytical research, the speed at which these data can be turned into insights becomes critical. In this paper we show that the WARC format carries significant performance penalties for batch processing workloads. We trace the root cause of these penalties to its data structure, encoding, and addressing method. We then run controlled experiments to illustrate how severe these problems can be. Indeed, performance gains of one to two orders of magnitude can be achieved simply by reformatting WARC files into the Parquet or Avro format. While these results do not necessarily constitute an endorsement of Avro or Parquet, the time has come for the web archiving community to consider replacing WARC with more efficient web archival formats.