Latent semantic analysis (LSA) and correspondence analysis (CA) are two techniques that use a singular value decomposition (SVD) for dimensionality reduction. LSA has been used extensively to obtain low-dimensional representations that capture relationships among documents and terms. In this article, we present a theoretical analysis and comparison of the two techniques in the context of document-term matrices. We show that CA has some attractive properties compared to LSA, for instance that the effects of margins arising from differing document lengths and term frequencies are effectively eliminated, so that the CA solution is optimally suited to focus on relationships among documents and terms. A unifying framework is proposed that includes both CA and LSA as special cases. We empirically compare CA to various LSA-based methods on text categorization in English and authorship attribution on historical Dutch texts, and find that CA performs significantly better. We also apply CA to a long-standing question regarding the authorship of the Dutch national anthem Wilhelmus and provide further support for attributing it to Datheen, one of several contenders.
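To make the margin argument concrete, the following is a minimal sketch rather than the article's own implementation. It assumes the standard textbook formulation of CA as an SVD of the standardized residuals of the correspondence matrix, and it uses the simplest LSA variant, an SVD of the raw count matrix with no weighting. The function names `lsa_row_coords` and `ca_row_coords` and the toy matrix are illustrative assumptions, not taken from the article.

```python
import numpy as np


def lsa_row_coords(N, k):
    """Document coordinates from a rank-k SVD of the raw count matrix."""
    U, s, _ = np.linalg.svd(N, full_matrices=False)
    return U[:, :k] * s[:k]


def ca_row_coords(N, k):
    """Document principal coordinates from correspondence analysis.

    Assumes every row and column of N has a positive sum.
    """
    P = N / N.sum()                  # correspondence matrix
    r = P.sum(axis=1)                # row masses (document margins)
    c = P.sum(axis=0)                # column masses (term margins)
    expected = np.outer(r, c)        # independence model r c^T
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, s, _ = np.linalg.svd(S, full_matrices=False)
    return (U[:, :k] * s[:k]) / np.sqrt(r)[:, None]


# Toy document-term counts (hypothetical data). Document 1 is document 0
# repeated twice: same term profile, double the length.
N = np.array(
    [
        [2, 1, 0, 0],
        [4, 2, 0, 0],
        [0, 0, 3, 1],
        [0, 1, 1, 2],
    ],
    dtype=float,
)

lsa = lsa_row_coords(N, k=2)
ca = ca_row_coords(N, k=2)

print("LSA coordinates:\n", lsa)
print("CA coordinates:\n", ca)
```

In this sketch the two documents with the same term profile receive identical CA coordinates, while their LSA coordinates differ by the factor of two in document length. This is one simple way to see the margin effect the abstract refers to.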