Internet censorship is a phenomenon of societal importance and attracts investigation from multiple disciplines. Several research groups, such as Censored Planet, have deployed large-scale Internet measurement platforms to collect network reachability data. However, existing studies generally rely on manually designed rules (i.e., censorship fingerprints) to detect network-based Internet censorship from the data. While this rule-based approach yields a high true-positive rate, it has several drawbacks: it requires human expertise, is laborious, and cannot detect any censorship not captured by the rules. Seeking to overcome these drawbacks, we design and evaluate two models for detecting network-based Internet censorship: a classification model based on latent feature representation learning and an image-based classification model. To infer latent feature representations from network reachability data, we propose a sequence-to-sequence autoencoder that captures the structure and order of the data elements. To estimate the probability of censorship events from the inferred latent features, we rely on a densely connected multi-layer neural network. Our image-based classification model encodes a network reachability data record as a gray-scale image and classifies the image as censored or not using a dense convolutional neural network. We compare and evaluate both approaches on data sets from Censored Planet via a hold-out evaluation. Both models are capable of detecting network-based Internet censorship: with each, we identified instances of censorship that the known fingerprints did not detect. Latent feature representations likely encode more nuances in the data, since the latent feature learning approach discovers more, and more diverse, new censorship instances.
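As a rough illustration of the image-based idea, the sketch below maps a text record to a gray-scale pixel grid by treating each UTF-8 byte as one pixel intensity. The abstract does not specify how records are actually encoded, so the byte-per-pixel mapping, the zero padding, the function name `record_to_grayscale`, the `width` parameter, and the sample record are all assumptions made for illustration only.

```python
from typing import List


def record_to_grayscale(record: str, width: int = 16) -> List[List[int]]:
    """Map a text record to rows of pixel intensities in the range 0..255.

    Assumption: one UTF-8 byte becomes one gray-scale pixel. This is an
    illustrative encoding, not necessarily the one used by the authors.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    pixels = list(record.encode("utf-8"))  # each byte is already 0..255
    # Pad with zeros (black pixels) so that the last row is complete.
    remainder = len(pixels) % width
    if remainder:
        pixels.extend([0] * (width - remainder))
    # Cut the flat pixel list into rows of equal width.
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


# Hypothetical reachability record; the field names are illustrative only.
sample = '{"domain": "example.com", "status": 200}'
image = record_to_grayscale(sample, width=8)
for row in image:
    print(row)
```

A grid like this could then be passed to a convolutional classifier. A fixed `width` keeps every row the same length, but records of different sizes still produce different numbers of rows, so a real pipeline would also need to pad or crop to a fixed height.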