Customized determination of stop words using Random Matrix Theory approach

We study the distances between words, measured in word units, and compare them with the distributions of Random Matrix Theory (RMT). We find that the distribution of distances between occurrences of the same word is well described by the single-parameter Brody distribution. Using the Brody distribution fit, we find that the distances between given words in a set of texts can show mixed dynamics, with coexisting regular and chaotic regimes. Words whose distance distributions are fitted by the Brody distribution to within a certain goodness-of-fit threshold can be identified as stop words, usually considered the uninformative part of a text. By applying various threshold values for the goodness of fit, we can extract uninformative words from the texts under analysis to the desired extent. On this basis we formulate a fully agnostic recipe that can be used to create a customized set of stop words for texts in any word-based language.
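The recipe can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: gaps are normalized to unit mean, the Brody parameter is chosen by a grid search over [0, 1], goodness of fit is measured by the Kolmogorov-Smirnov distance (the abstract does not say which measure the authors use), and the function names, `ks_threshold` and `min_gaps` are hypothetical. For unit-mean spacings s the Brody distribution is P(s) = (β+1) b s^β exp(-b s^(β+1)) with b = Γ((β+2)/(β+1))^(β+1); β = 0 gives the Poisson (regular) case and β = 1 the Wigner (chaotic) case.

```python
import math
from collections import defaultdict


def word_gaps(tokens):
    """Map each word to the distances (in words) between its consecutive occurrences."""
    last_seen = {}
    gaps = defaultdict(list)
    for position, word in enumerate(tokens):
        if word in last_seen:
            gaps[word].append(position - last_seen[word])
        last_seen[word] = position
    return gaps


def brody_cdf(s, beta):
    """CDF of the Brody distribution for unit-mean spacings: 1 - exp(-b * s**(beta + 1))."""
    b = math.gamma((beta + 2.0) / (beta + 1.0)) ** (beta + 1.0)
    return 1.0 - math.exp(-b * s ** (beta + 1.0))


def ks_distance(spacings, beta):
    """Kolmogorov-Smirnov distance between the empirical CDF and the Brody CDF."""
    ordered = sorted(spacings)
    n = len(ordered)
    worst = 0.0
    for i, s in enumerate(ordered):
        model = brody_cdf(s, beta)
        # Compare the model with the empirical step just below and just above s.
        worst = max(worst, abs(model - i / n), abs((i + 1) / n - model))
    return worst


def fit_brody(gap_list, grid_steps=100):
    """Grid-search beta in [0, 1] minimizing the KS distance; returns (beta, ks)."""
    mean_gap = sum(gap_list) / len(gap_list)
    spacings = [g / mean_gap for g in gap_list]  # assumption: rescale to unit mean
    candidates = (k / grid_steps for k in range(grid_steps + 1))
    return min(
        ((beta, ks_distance(spacings, beta)) for beta in candidates),
        key=lambda pair: pair[1],
    )


def candidate_stop_words(tokens, ks_threshold=0.1, min_gaps=20):
    """Return {word: (beta, ks)} for words whose gaps fit a Brody law within the threshold.

    ks_threshold and min_gaps are illustrative values, not taken from the paper.
    """
    selected = {}
    for word, gap_list in word_gaps(tokens).items():
        if len(gap_list) < min_gaps:
            continue  # too few occurrences for a meaningful fit
        beta, ks = fit_brody(gap_list)
        if ks <= ks_threshold:
            selected[word] = (beta, ks)
    return selected
```

Calling `candidate_stop_words(tokens)` on a tokenized corpus returns the candidate words with their fitted β and fit quality; lowering `ks_threshold` makes the selection stricter, in the spirit of the adjustable threshold described above. Gaps are integers, so for very frequent words the continuous Brody law is only an approximation.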

