Towards a Dataset of Programming Contest Plagiarism in Java - 专知 (Zhuanzhi) Papers

March 19, 2023

Towards a Dataset of Programming Contest Plagiarism in Java

Evgeniy Slobodkin, Alexander Sadovnikov

From arXiv: 5 pages, 1 figure, 1 table

In this paper, we describe and present the first dataset of source code plagiarism specifically aimed at contest plagiarism. The dataset contains 251 pairs of plagiarized solutions to competitive programming tasks in Java, as well as 660 non-plagiarized ones; the described approach can also be used to extend the dataset in the future. Importantly, each pair comes in two versions: (a) "raw" and (b) with participants' repeated template code removed, which allows tools to be evaluated in different settings. We used the collected dataset to compare available source code plagiarism detection tools, including state-of-the-art ones, specifically on their ability to detect contest plagiarism. Our results indicate that the tools perform significantly worse on contest plagiarism because of the template code and the presence of other misleadingly similar code. Of the tested tools, token-based ones demonstrated the best performance on both variants of the dataset.
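To make the idea of a token-based comparison concrete, here is a minimal sketch in Java. It is not one of the tools evaluated in the paper, and it does not reproduce the authors' pipeline. The class name `TokenSimilaritySketch`, the small keyword list, the regex tokenizer, and the choice of Jaccard similarity over token trigrams are all assumptions made for illustration. The sketch also does not strip comments or handle string literals, which a real detector would need to do.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of token-based similarity; not a tool from the paper.
final class TokenSimilaritySketch {
    // An identifier or keyword, a run of digits, or any single non-space symbol.
    private static final Pattern TOKEN =
        Pattern.compile("[A-Za-z_$][A-Za-z0-9_$]*|\\d+|\\S");

    // A deliberately small keyword subset (assumption); keywords are kept verbatim.
    private static final Set<String> KEYWORDS = Set.of(
        "int", "long", "for", "while", "if", "else", "return",
        "class", "static", "void", "new", "public", "private");

    // Normalize source text: identifiers become ID, numbers become NUM,
    // so that renaming variables does not change the token sequence.
    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String t = m.group();
            char first = t.charAt(0);
            if (Character.isJavaIdentifierStart(first) && !KEYWORDS.contains(t)) {
                tokens.add("ID");
            } else if (Character.isDigit(first)) {
                tokens.add("NUM");
            } else {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Collect the set of all contiguous token n-grams ("shingles").
    static Set<String> shingles(List<String> tokens, int n) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            result.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return result;
    }

    // Jaccard similarity of the two shingle sets, in [0, 1].
    static double similarity(String a, String b, int n) {
        Set<String> sa = shingles(tokenize(a), n);
        Set<String> sb = shingles(tokenize(b), n);
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        if (union.isEmpty()) {
            return 0.0; // nothing to compare, so report no evidence of overlap
        }
        Set<String> intersection = new HashSet<>(sa);
        intersection.retainAll(sb);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        String original = "int sum = 0; for (int i = 0; i < n; i++) sum += a[i];";
        String renamed = "int total = 0; for (int k = 0; k < m; k++) total += arr[k];";
        System.out.println(similarity(original, renamed, 3));
    }
}
```

Because identifiers are normalized, a copy that only renames variables yields the same token sequence as the original. The same property hints at why shared template code is a problem in the contest setting: boilerplate that every participant reuses produces many identical n-grams between unrelated solutions.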
