Flaw-finding static analysis tools typically generate large volumes of code flaw alerts, including many false positives. To reduce the human effort required to triage these alerts, a significant body of work attempts to use machine learning to classify and prioritize alerts. Identifying a useful set of training data, however, remains a fundamental challenge in developing such classifiers in many contexts. We propose using static analysis test suites (i.e., repositories of "benchmark" programs that are purpose-built to test the coverage and precision of static analysis tools) as a novel source of training data. In a case study, we generated a large quantity of alerts by executing various static analyzers on the Juliet C/C++ test suite, and we automatically derived ground-truth labels for these alerts by referencing the Juliet test suite metadata. Finally, we used this data to train classifiers to predict whether an alert is a false positive. Our classifiers obtained high precision (90.2%) and recall (88.2%) for a large number of code flaw types on a hold-out test set. This preliminary result suggests that pre-training classifiers on test suite data could help jumpstart static analysis alert classification in data-limited contexts.
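The labeling step described above can be illustrated with a minimal sketch. This is not the paper's actual pipeline: the `Alert` and `JulietRegion` types and the matching rule are hypothetical, assuming only the Juliet convention that flawed code lives in "bad" functions while fixed variants live in "good" functions, so an alert of the matching CWE that falls inside a "bad" region is labeled a true positive and everything else a false positive.

```python
# Hypothetical sketch of deriving ground-truth labels for static
# analysis alerts from Juliet test suite metadata. All type names and
# the matching rule are illustrative assumptions, not the paper's code.
from dataclasses import dataclass

@dataclass
class Alert:
    file: str   # source file the analyzer flagged
    line: int   # line number of the alert
    cwe: int    # CWE id reported by the analyzer

@dataclass
class JulietRegion:
    """A function body in a Juliet test case, per its metadata."""
    file: str
    start: int
    end: int
    cwe: int
    is_bad: bool  # True for a flawed ("bad") function, False for a "good" one

def label_alert(alert: Alert, regions: list[JulietRegion]) -> bool:
    """Return True (true positive) iff the alert lands in a flawed
    region of the matching CWE; otherwise label it a false positive."""
    for r in regions:
        if (r.file == alert.file
                and r.start <= alert.line <= r.end
                and r.cwe == alert.cwe):
            return r.is_bad
    # Alerts outside any known flawed region count as false positives.
    return False
```

Labels produced this way could then serve directly as the target variable when training a false-positive classifier on features of each alert.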