Programming language detection is a common need in the analysis of large source code bases. It is supported by a number of existing tools that rely on several features, most notably file extensions, to determine file types. We consider the problem of accurately detecting the type of files commonly found in software code bases, based solely on textual file content. Doing so is helpful to classify source code that lacks file extensions (e.g., code snippets posted on the Web or executable scripts), to avoid misclassifying source code that has been recorded with wrong or uncommon file extensions, and also to shed some light on the intrinsic recognizability of source code files. We propose a simple model that (a) uses a language-agnostic word tokenizer for textual files, (b) groups tokens into 1-/2-grams, (c) builds feature vectors based on N-gram frequencies, and (d) uses a simple fully connected neural network as classifier. As training set we use textual files extracted from GitHub repositories with at least 1000 stars, using existing file extensions as ground truth. Despite its simplicity, the proposed model reaches 85% accuracy in our experiments, for a relatively high number of recognized classes (more than 130 file types).
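The feature-extraction steps (a)-(c) described above can be sketched as follows. This is a minimal illustration with hypothetical helper names, not the authors' implementation; the tokenization rule (splitting on alphanumeric runs) and the frequency normalization are assumptions, and the fully connected classifier of step (d) is omitted.

```python
import re
from collections import Counter

def tokenize(text):
    # Language-agnostic word tokenizer: extract runs of
    # alphanumeric characters, ignoring punctuation and whitespace.
    # (Assumed tokenization rule, for illustration only.)
    return re.findall(r"\w+", text)

def ngrams(tokens, n_values=(1, 2)):
    # Group tokens into 1-grams and 2-grams, as in steps (a)-(b).
    grams = []
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def feature_vector(text, vocabulary):
    # Step (c): one relative frequency per vocabulary N-gram.
    # The vocabulary would be fixed beforehand, e.g. the most
    # frequent N-grams observed in the training corpus.
    counts = Counter(ngrams(tokenize(text)))
    total = sum(counts.values()) or 1
    return [counts[g] / total for g in vocabulary]
```

The resulting fixed-length vectors would then be fed to the fully connected neural network classifier, trained with file extensions as ground-truth labels.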