A Comparison of Document Similarity Algorithms - 专知 Papers


Similarity · Algorithms · Language Processing · Natural Language Processing · Text Summarization

April 3, 2023

A Comparison of Document Similarity Algorithms


Nicholas Gahman, Vinayak Elangovan

Document similarity is an important part of Natural Language Processing and is most commonly used for plagiarism detection and text summarization. Finding the most effective document similarity algorithm overall could therefore have a major positive impact on the field. This report examines the numerous document similarity algorithms and determines which are the most useful. It groups them into three categories: statistical algorithms, neural networks, and corpus/knowledge-based algorithms. The most effective algorithms in each category are then compared using a series of benchmark datasets and evaluations that test every possible area in which each algorithm could be used.

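The abstract does not name the specific algorithms that were compared. As a minimal illustration of what an algorithm in the statistical category might look like, the sketch below computes TF-IDF-weighted cosine similarity between documents. The `tokenize` helper, the smoothed IDF formula, and the sample documents are assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"


def tokenize(text):
    # Hypothetical tokenizer: split on whitespace, strip punctuation, lowercase.
    tokens = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [t for t in tokens if t]


def tfidf_vectors(docs):
    # Build one sparse TF-IDF vector (a dict of term -> weight) per document.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Smoothed IDF (an assumed variant): log((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: count * idf[t] for t, count in tf.items()})
    return vectors


def cosine_similarity(a, b):
    # Cosine of the angle between two sparse vectors; 0.0 if either is empty.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Illustrative documents (made up for this sketch).
docs = [
    "The cat sat on the mat.",
    "A cat sat on a mat.",
    "Stock prices fell sharply today.",
]
vecs = tfidf_vectors(docs)
sim_close = cosine_similarity(vecs[0], vecs[1])  # documents sharing most terms
sim_far = cosine_similarity(vecs[0], vecs[2])    # documents sharing no terms
```

A measure like this only rewards exact term overlap, which is one reason a comparison against neural and corpus/knowledge-based approaches, as the abstract describes, is of interest.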
