Plagiarism Detection in the Bengali Language: A Text Similarity-Based Approach

Plagiarism means taking another person's work without giving them credit for it. It is one of the most serious problems in academia and among researchers. Although multiple tools are available to detect plagiarism in a document, most of them are domain-specific and designed to work on English texts, whereas plagiarism is not limited to a single language. Bengali is the most widely spoken language of Bangladesh and the second most spoken language in India, with 300 million native speakers and 37 million second-language speakers. Plagiarism detection requires a large corpus for comparison. Bengali literature has a history of 1300 years, yet most Bengali literature books have not been properly digitized. As no such corpus existed for our purpose, we collected Bengali literature books from the National Digital Library of India, extracted their text with a comprehensive methodology, and constructed our corpus. Our experimental results show an average accuracy between 72.10% and 79.89% in text extraction using OCR. The Levenshtein distance algorithm is used to determine plagiarism. We have built a web application for end users and successfully tested it for plagiarism detection in Bengali texts. In the future, we aim to construct a corpus with more books for more accurate detection.
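
To make the similarity step concrete, the following is a minimal sketch of Levenshtein distance and one way it could be turned into a similarity score. It is not the authors' implementation: the function names `levenshtein` and `similarity`, and the normalization by the longer string's length, are assumptions made for illustration. The sketch compares Unicode code points, so for Bengali text one would probably want to normalize the strings, and perhaps compare grapheme clusters, before calling it.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    # previous[j] holds the distance between the first i-1 characters
    # of `a` and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalized score in [0, 1]; 1.0 means identical.

    Assumption: normalize by the longer string's length. The source does
    not say how distances are converted into a plagiarism decision.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(round(similarity("kitten", "sitting"), 3))
```

In a detector along these lines, each sentence of a submitted document could be scored against sentences in the corpus and flagged when the score exceeds a chosen threshold; any such threshold would be a tuning decision that the abstract does not specify.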
