A core problem in the development and maintenance of crowd-sourced filter lists is that their maintainers cannot confidently predict whether (and where) a new filter list rule will break websites. This is a result of the enormity of the Web, which prevents filter list authors from broadly understanding the impact of a new blocking rule before they ship it to millions of users. The inability of filter list authors to evaluate the Web compatibility impact of a new rule before shipping it severely reduces the benefits of filter-list-based content blocking: filter lists are both overly conservative (i.e., rules are tailored narrowly to reduce the risk of breaking things) and error-prone (i.e., blocking tools still break large numbers of sites). To scale to the size and scope of the Web, filter list authors need an automated system that detects when a new filter rule breaks websites, before that breakage reaches end users. In this work, we design and implement the first automated system for predicting when a filter list rule breaks a website. We build a classifier, trained on a dataset generated by combining compatibility data from the EasyList project with novel browser instrumentation, and find that it is accurate to practical levels (AUC 0.88). Our open-source system requires no human interaction when assessing the compatibility risk of a proposed privacy intervention. We also present the 40 page behaviors that most predict breakage on observed websites.
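To make the classification setup concrete, the sketch below illustrates (it is not the authors' pipeline) how a breakage classifier over instrumented page-behavior features could be trained and scored with ROC AUC, the metric reported above. The feature count, placeholder data, and choice of a gradient-boosted model are assumptions for illustration only.

    # A minimal sketch, assuming per-(site, rule) rows of page-behavior features
    # (e.g. counts of DOM mutations, failed requests, console errors) and a
    # binary label indicating whether the rule broke the site. Placeholder data.
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    X = np.random.rand(1000, 40)         # hypothetical: 40 behavior features
    y = np.random.randint(0, 2, 1000)    # hypothetical: 1 = site broke

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )

    clf = GradientBoostingClassifier(random_state=0)
    clf.fit(X_train, y_train)

    # Evaluate with ROC AUC (the paper reports 0.88 on its real dataset).
    scores = clf.predict_proba(X_test)[:, 1]
    print("AUC:", roc_auc_score(y_test, scores))

Feature importances from such a model are one way the most breakage-predictive page behaviors could be ranked, though the abstract does not specify the method used.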