确定哈萨克语文本文件相似性的方法, " 考虑同义词:扩展至TF-IDF " (Method for Determining the Similarity of Text Documents for the Kazakh language, Taking Into Account Synonyms: Extension to TF-IDF) - 专知论文

会员服务 ·

0

Extensibility · TF-IDF · 相似度 · Weight · INFORMS ·

2022 年 11 月 22 日

Method for Determining the Similarity of Text Documents for the Kazakh language, Taking Into Account Synonyms: Extension to TF-IDF

翻译：确定哈萨克语文本文件相似性的方法, " 考虑同义词:扩展至TF-IDF "

from arxiv, 2022 International Conference on Smart Information Systems and Technologies (SIST)

The task of determining the similarity of text documents has received considerable attention in many areas such as Information Retrieval, Text Mining, Natural Language Processing (NLP) and Computational Linguistics. Transferring data to numeric vectors is a complex task where algorithms such as tokenization, stopword filtering, stemming, and weighting of terms are used. The term frequency - inverse document frequency (TF-IDF) is the most widely used term weighting method to facilitate the search for relevant documents. To improve the weighting of terms, a large number of TF-IDF extensions are made. In this paper, another extension of the TF-IDF method is proposed where synonyms are taken into account. The effectiveness of the method is confirmed by experiments on functions such as Cosine, Dice and Jaccard to measure the similarity of text documents for the Kazakh language.

翻译：确定文本文件相似性的任务在许多领域,如信息检索、文本采矿、自然语言处理(NLP)和计算语言学,都受到相当重视。将数据转移到数字矢量是一项复杂的任务,在其中使用了代号化、断字过滤、断字和术语加权等算法。频度-反向文档频率(TF-IDF)是便利搜索相关文件的最广泛使用的用词权重方法。为了改进术语的权重,大量TF-IDF扩展。本文建议了TF-IDF方法的另一个延伸,在考虑同义词的情况下。该方法的有效性通过对Cosine、Dice和Jacard等功能的实验得到确认,以测量哈萨克语文本文件的相似性。

0

相关内容

Extensibility

iOS 8 提供的应用间和应用跟系统的功能交互特性。

Today (iOS and OS X): widgets for the Today view of Notification Center
Share (iOS and OS X): post content to web services or share content with others
Actions (iOS and OS X): app extensions to view or manipulate inside another app
Photo Editing (iOS): edit a photo or video in Apple's Photos app with extensions from a third-party apps
Finder Sync (OS X): remote file storage in the Finder with support for Finder content annotation
Storage Provider (iOS): an interface between files inside an app and other apps on a user's device
Custom Keyboard (iOS): system-wide alternative keyboards

Source: iOS 8 Extensions: Apple’s Plan for a Powerful App Ecosystem

MIT经典《线性代数》，584页pdf，Introduction to Linear Algebra, Fifth Edition, Gilbert Strang, 2016.

MIT经典《线性代数》，584页pdf，Introduction to Linear Algebra, Fifth Edition, Gilbert Strang, 2016.

专知会员服务

432+阅读 · 2021年1月11日

Linux导论，Introduction to Linux，96页ppt

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

史上最全！358篇机器学习&自然语言处理综述论文！都这儿了

专知会员服务

129+阅读 · 2020年7月18日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

ACM MM 2022 Call for Papers

ACM MM 2022 Call for Papers

CCF多媒体专委会

5+阅读 · 2022年3月29日

【ICIG2021】Latest News & Announcements of the Workshop

【ICIG2021】Latest News & Announcements of the Workshop

中国图象图形学学会CSIG

0+阅读 · 2021年12月20日

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium5

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium5

中国图象图形学学会CSIG

1+阅读 · 2021年11月11日

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium3

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium3

中国图象图形学学会CSIG

0+阅读 · 2021年11月9日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

面向汉语-泰语跨语言新闻事件检索方法研究

国家自然科学基金

2+阅读 · 2014年12月31日

Shh信号通路对CCL2调控在哮喘发生中的作用和机制研究

国家自然科学基金

0+阅读 · 2014年12月31日

Kronheimer-Nakajima quiver 模空间与有理曲面

国家自然科学基金

1+阅读 · 2013年12月31日

图像语义自动文本描述技术研究

国家自然科学基金

2+阅读 · 2012年12月31日

Nanostrands－膨胀石墨电磁屏蔽材料制备及性能研究

国家自然科学基金

0+阅读 · 2009年12月31日

Predicting the Channel Access of Bluetooth Low Energy

Arxiv

0+阅读 · 2023年1月19日

Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing

Arxiv

30+阅读 · 2021年7月28日

Beyond Lexical: A Semantic Retrieval Framework for Textual SearchEngine

Beyond Lexical: A Semantic Retrieval Framework for Textual SearchEngine

Arxiv

16+阅读 · 2020年8月10日

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

Arxiv

12+阅读 · 2020年2月19日

Text Classification Algorithms: A Survey

Arxiv

15+阅读 · 2019年6月25日

VIP会员

文章信息

相关主题

相关VIP内容

MIT经典《线性代数》，584页pdf，Introduction to Linear Algebra, Fifth Edition, Gilbert Strang, 2016.

MIT经典《线性代数》，584页pdf，Introduction to Linear Algebra, Fifth Edition, Gilbert Strang, 2016.

专知会员服务

432+阅读 · 2021年1月11日

Linux导论，Introduction to Linux，96页ppt

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

史上最全！358篇机器学习&自然语言处理综述论文！都这儿了

专知会员服务

129+阅读 · 2020年7月18日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

热门VIP内容

开通专知VIP会员享更多权益服务

大语言模型中的事件抽取：方法、模态与未来展望的全面综述

美海军作战管理系统：变革战场空间的二十年

【MIT博士论文】以语言为中心的医学影像理解

俄罗斯“沙希德”/“天竺葵”攻击无人机

相关资讯

ACM MM 2022 Call for Papers

ACM MM 2022 Call for Papers

CCF多媒体专委会

5+阅读 · 2022年3月29日

【ICIG2021】Latest News & Announcements of the Workshop

【ICIG2021】Latest News & Announcements of the Workshop

中国图象图形学学会CSIG

0+阅读 · 2021年12月20日

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium5

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium5

中国图象图形学学会CSIG

1+阅读 · 2021年11月11日

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium3

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium3

中国图象图形学学会CSIG

0+阅读 · 2021年11月9日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

相关论文

Predicting the Channel Access of Bluetooth Low Energy

Arxiv

0+阅读 · 2023年1月19日

Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing

Arxiv

30+阅读 · 2021年7月28日

Beyond Lexical: A Semantic Retrieval Framework for Textual SearchEngine

Beyond Lexical: A Semantic Retrieval Framework for Textual SearchEngine

Arxiv

16+阅读 · 2020年8月10日

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

Arxiv

12+阅读 · 2020年2月19日

Text Classification Algorithms: A Survey

Arxiv

15+阅读 · 2019年6月25日

相关基金

面向汉语-泰语跨语言新闻事件检索方法研究

国家自然科学基金

2+阅读 · 2014年12月31日

Shh信号通路对CCL2调控在哮喘发生中的作用和机制研究

国家自然科学基金

0+阅读 · 2014年12月31日

Kronheimer-Nakajima quiver 模空间与有理曲面

国家自然科学基金

1+阅读 · 2013年12月31日

图像语义自动文本描述技术研究

国家自然科学基金

2+阅读 · 2012年12月31日

Nanostrands－膨胀石墨电磁屏蔽材料制备及性能研究

国家自然科学基金

0+阅读 · 2009年12月31日

微信扫码咨询专知VIP会员