Although information theory has found success in many disciplines, the literature on its applications to software evolution is limited. We are still missing artifacts that leverage the available data and tooling to measure how the information content of a project can serve as a proxy for its complexity. In this work, we explore two definitions of entropy, one structural and one textual, and apply them to the commit history of 25 open source projects. We produce evidence that the two measures are generally highly correlated. We also observe that they display weak and unstable correlations with other complexity metrics. Our preliminary investigation of outliers shows an unexpectedly high frequency of events in which the information content of the project changes considerably, suggesting that such outliers may inform a definition of surprisal.
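To make the notion of a textual entropy concrete, the following is a minimal sketch, not the definition used in this work: it assumes that textual entropy can be approximated by the Shannon entropy of the token frequency distribution of a source snapshot, and that the change between consecutive commits is the difference of these values. The function names `textual_entropy` and `entropy_deltas`, and the word-level tokenization, are hypothetical choices made for illustration only.

```python
import math
import re
from collections import Counter


def textual_entropy(text: str) -> float:
    """Shannon entropy, in bits per token, of the token frequency distribution.

    Assumption: tokens are runs of word characters; the tokenization
    actually used for a textual entropy may differ.
    """
    tokens = re.findall(r"\w+", text)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    # H = -sum(p * log2(p)) over the observed token probabilities.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_deltas(snapshots: list[str]) -> list[float]:
    """Entropy change between consecutive snapshots (e.g. successive commits)."""
    values = [textual_entropy(s) for s in snapshots]
    return [later - earlier for earlier, later in zip(values, values[1:])]
```

Under these assumptions, unusually large values in the output of `entropy_deltas` would correspond to the kind of outlier events described above, where the information content of a project changes considerably between commits.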