In many domains, unlabeled examples are plentiful but labels are scarce; e.g., we may have access to millions of lines of source code but only a handful of warnings about that code. In those domains, semi-supervised learners (SSL) can extrapolate labels from a small number of examples to the rest of the data. Standard SSL algorithms use ``weak'' knowledge (i.e., knowledge not based on specific SE knowledge); for example, they co-train two learners and use good labels from one to train the other. Another approach to SSL in software analytics is to use ``strong'' knowledge, i.e., knowledge specific to SE. For example, an often-used heuristic in SE is that unusually large artifacts contain undesired properties (e.g., more bugs). This paper argues that such ``strong'' algorithms perform better than standard, weaker SSL algorithms. We show this by learning models from labels generated using either weak SSL or our ``stronger'' FRUGAL algorithm. In four domains (distinguishing security-related bug reports; mitigating bias in decision-making; predicting issue close time; and reducing false alarms in static code warnings), FRUGAL required only 2.5% of the data to be labeled yet out-performed standard semi-supervised learners that relied on (e.g.) domain-independent graph theory concepts. Hence, for future work, we strongly recommend the use of strong heuristics for semi-supervised learning in SE applications. To better support other researchers, our scripts and data are on-line at https://github.com/HuyTu7/FRUGAL.
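As a minimal illustration of the ``unusually large artifacts'' heuristic mentioned above, the sketch below assigns pseudo-labels to unlabeled rows by checking how many of their numeric features exceed the per-feature median. This is not the FRUGAL algorithm itself; the function name `pseudo_label_by_size`, the `min_fraction` parameter, and the toy data are hypothetical and chosen only for illustration.

```python
from statistics import median

def pseudo_label_by_size(rows, min_fraction=0.5):
    """Hypothetical sketch: flag a row (label 1) when more than
    `min_fraction` of its features are above that feature's median."""
    n_features = len(rows[0])
    # Per-feature cut-off: the median of each column.
    cutoffs = [median(row[j] for row in rows) for j in range(n_features)]
    labels = []
    for row in rows:
        # Count how many features of this row are "unusually large".
        n_large = sum(row[j] > cutoffs[j] for j in range(n_features))
        labels.append(1 if n_large > min_fraction * n_features else 0)
    return labels

# Toy data (assumed): each row is an artifact described by three size-like metrics.
rows = [
    [10, 1, 2],
    [12, 2, 3],
    [300, 40, 25],
    [15, 1, 2],
    [280, 35, 30],
]
labels = pseudo_label_by_size(rows)
print(labels)
```

Pseudo-labels produced this way could then be used to train an ordinary supervised learner; how the cut-off is chosen, and how a small labeled sample refines it, is left open in this sketch.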