This paper provides an experimentally validated, probabilistic model of file behavior when consumed by a set of pre-existing parsers. File behavior is measured by way of a standardized set of Boolean "messages" produced as the files are read. By thresholding the posterior probability that a file exhibiting a particular set of messages is from a particular dialect, our model yields a practical classification algorithm for two dialects. We demonstrate that this thresholding algorithm for two dialects can be bootstrapped from a training set consisting primarily of one dialect. Both the (parametric) theoretical and the (non-parametric) empirical distributions of file behaviors for one dialect yield good classification performance, and outperform classification based on simply counting messages. Our theoretical framework relies on statistical independence of messages within each dialect. Violations of this assumption are detectable and allow a format analyst to identify "boundaries" between dialects. A format analyst can therefore greatly reduce the number of files they need to consider when crafting new criteria for dialect detection, since they need only consider the files that exhibit ambiguous message patterns.
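To illustrate the thresholding step, the following is a minimal sketch and not the authors' implementation. It assumes that each Boolean message fires independently within a dialect with a fixed per-message probability, which is the independence assumption stated above, and it applies Bayes' rule to two dialects. The function names, the prior, the threshold, and the per-message probabilities are all hypothetical values chosen for illustration.

```python
import math


def log_likelihood(messages, probs):
    """Log-probability of a Boolean message vector, assuming the messages
    are independent Bernoulli variables within a dialect.
    Each entry of `probs` must lie strictly between 0 and 1."""
    total = 0.0
    for fired, p in zip(messages, probs):
        total += math.log(p if fired else 1.0 - p)
    return total


def posterior_a(messages, probs_a, probs_b, prior_a=0.5):
    """Posterior probability that a file is from dialect A (two dialects)."""
    log_a = math.log(prior_a) + log_likelihood(messages, probs_a)
    log_b = math.log(1.0 - prior_a) + log_likelihood(messages, probs_b)
    # Subtract the larger term before exponentiating for numerical stability.
    top = max(log_a, log_b)
    weight_a = math.exp(log_a - top)
    weight_b = math.exp(log_b - top)
    return weight_a / (weight_a + weight_b)


def classify(messages, probs_a, probs_b, prior_a=0.5, threshold=0.5):
    """Label a file 'A' when the posterior for dialect A reaches the threshold."""
    if posterior_a(messages, probs_a, probs_b, prior_a) >= threshold:
        return "A"
    return "B"


# Hypothetical per-message firing probabilities for three messages.
PROBS_A = [0.9, 0.1, 0.5]
PROBS_B = [0.2, 0.8, 0.5]

if __name__ == "__main__":
    for observed in ([True, False, True], [False, True, False]):
        print(observed, classify(observed, PROBS_A, PROBS_B))
```

In this sketch the third message has the same probability in both dialects, so it does not shift the posterior; only messages whose probabilities differ between dialects carry evidence. Raising `threshold` trades missed dialect-A files for fewer false assignments to A.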