To identify robots and humans and analyze their respective access patterns, we used the Internet Archive's (IA) Wayback Machine access logs from 2012 and 2019, as well as Arquivo.pt's (Portuguese Web Archive) access logs from 2019. We identified user sessions in the access logs and classified each session as human or robot based on its browsing behavior. To better understand how users navigate through web archives, we then analyzed these sessions to discover user access patterns. We compare the detected robots and humans, their access patterns, and their temporal preferences across the two archives and across the two years of IA access logs (2012 vs. 2019). More robots were detected in IA 2012 than in IA 2019 (21% more in requests and 18% more in sessions). Robots account for 98% of requests (97% of sessions) in Arquivo.pt (2019). We found that robots are almost entirely limited to the "Dip" and "Skim" access patterns in IA 2012, but exhibit all the patterns and their combinations in IA 2019. Both humans and robots show a preference for web pages archived in the near past.