Upon replay, JavaScript on archived web pages can generate recurring HTTP requests that lead to unnecessary traffic to the web archive. In one example, an archived page averaged more than 1000 requests per minute. These requests are not visible to the user, so a user who leaves such an archived page open in a browser tab is unaware that the browser continues to generate traffic to the web archive. We found that web pages that require regular updates (e.g., radio playlists, sports score updates, image carousels) are more likely to make such recurring requests. If the resources requested by the web page are not archived, some web archives may attempt to patch the archive by requesting the resources from the live web. If the requested resources are also unavailable on the live web, they cannot be archived, and the responses remain HTTP 404. Some archived pages continue to poll the server as frequently as they did on the live web, while others poll even more frequently when their requests return HTTP 404 responses, creating a large amount of unnecessary traffic. At scale, such web pages are effectively a denial-of-service attack on the web archive. Web archives need significant computational, network, and storage resources to archive pages and then replay them as they were on the live web, and these resources should not be spent on unnecessary HTTP traffic. Our proposed solution is to optimize archival replay using Cache-Control HTTP response headers. We implemented this approach in a test environment, where caching HTTP 404 responses prevented the browser's repeated requests from reaching the web archive server.
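As a rough illustration of the proposed approach, the sketch below attaches a Cache-Control header to HTTP 404 responses so that a browser can answer repeated polls for the same missing resource from its own cache. It is a minimal sketch under stated assumptions, not the implementation used in the test environment: the WSGI middleware form, the names `add_cache_control` and `CacheNotFoundMiddleware`, and the one-hour `max-age` are all hypothetical choices made for illustration.

```python
# Hypothetical lifetime (in seconds) for which a browser may reuse a cached 404.
# The abstract does not state a value; one hour is an arbitrary choice.
NOT_FOUND_MAX_AGE = 3600


def add_cache_control(status, headers, max_age=NOT_FOUND_MAX_AGE):
    """Return a copy of `headers` with Cache-Control added to 404 responses.

    `status` is a WSGI status string such as "404 Not Found" and `headers`
    is a list of (name, value) tuples. Other statuses, and responses that
    already carry a Cache-Control header, are returned unchanged.
    """
    has_cache_control = any(name.lower() == "cache-control" for name, _ in headers)
    if status.startswith("404") and not has_cache_control:
        return list(headers) + [("Cache-Control", f"public, max-age={max_age}")]
    return list(headers)


class CacheNotFoundMiddleware:
    """WSGI middleware that wraps a replay application (assumed to be WSGI)."""

    def __init__(self, app, max_age=NOT_FOUND_MAX_AGE):
        self.app = app
        self.max_age = max_age

    def __call__(self, environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Rewrite the headers before they are handed to the server.
            new_headers = add_cache_control(status, headers, self.max_age)
            return start_response(status, new_headers, exc_info)

        return self.app(environ, patched_start_response)
```

Wrapping a replay application as `CacheNotFoundMiddleware(replay_app)` would leave successful responses untouched and mark only the 404s as cacheable. The choice of `max-age` is a trade-off: a longer lifetime suppresses more polling traffic, but also delays the moment at which a later-patched resource becomes visible to a browser that already cached the 404.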