Web archives preserve unique and historically valuable information. They hold a record of past events and memories published by all kinds of people, such as journalists, politicians and ordinary citizens who have shared their testimony and opinions on multiple subjects. As a result, since the early days of the World Wide Web, researchers such as historians and sociologists have used web archives as a source of information to understand the recent past. The typical way to extract knowledge from a web archive is to use its search functionalities to find and analyse historical content. This can be a slow and superficial process when analysing complex topics, due to the huge amount of data that web archives have preserved over time. Big data science tools can cope with data at this scale, enabling researchers to automatically extract meaningful knowledge from the archived data. This knowledge helps not only to explain the past but also to predict the future through the computational modelling of events and behaviours. Currently, there is an immense landscape of big data tools, machine learning frameworks and deep learning algorithms that significantly increase the scalability and performance of several computational tasks, especially over text, image and audio. Web archives have been taking advantage of this panoply of technologies to provide their users with more powerful tools to explore and exploit historical data. This chapter presents several examples of these tools and gives an overview of their application to support longitudinal studies over web archive collections.