Latent semantic analysis (LSA) and correspondence analysis (CA) are two techniques that use a singular value decomposition (SVD) for dimensionality reduction. LSA has been used extensively to obtain low-dimensional representations that capture relationships among documents and terms. In this article, we present a theoretical analysis and comparison of the two techniques in the context of document-term matrices. We show that CA has some attractive properties compared to LSA. For instance, the effects of margins (the sums of row elements and of column elements), which arise from differing document lengths and term frequencies, are effectively eliminated, so that the CA solution is optimally suited to focus on relationships among documents and terms. We propose a unifying framework that includes both CA and LSA as special cases. We empirically compare CA to various LSA-based methods on text categorization in English and authorship attribution on historical Dutch texts, and find that CA performs significantly better. We also apply CA to a long-standing question regarding the authorship of the Dutch national anthem, the Wilhelmus, and provide further support for attributing it, among several contenders, to the author Datheen.
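To make the claim about margins concrete, the following is a minimal sketch rather than the authors' implementation. It assumes LSA is applied to the raw count matrix (the LSA variants compared in the article may weight the matrix first) and uses the textbook formulation of CA as an SVD of the standardized residuals from independence. The function names `lsa_doc_coords` and `ca_doc_coords` and the toy matrix `F` are hypothetical, introduced only for illustration.

```python
import numpy as np


def lsa_doc_coords(F, k):
    """Document coordinates from a rank-k SVD of the raw matrix (assumed unweighted LSA)."""
    U, s, _ = np.linalg.svd(F, full_matrices=False)
    return U[:, :k] * s[:k]


def ca_doc_coords(F, k):
    """Document (row) principal coordinates from textbook CA.

    Assumes every row sum and column sum of F is strictly positive.
    """
    P = F / F.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row margins (document masses)
    c = P.sum(axis=0)                    # column margins (term masses)
    expected = np.outer(r, c)            # values expected under independence
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, s, _ = np.linalg.svd(S, full_matrices=False)
    return (U[:, :k] * s[:k]) / np.sqrt(r)[:, None]


# Toy document-term counts (made up for illustration).
# Document 1 has the same term profile as document 0 but is twice as long.
F = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [4.0, 2.0, 0.0, 0.0],
    [0.0, 1.0, 3.0, 1.0],
    [1.0, 0.0, 2.0, 4.0],
])

lsa = lsa_doc_coords(F, k=2)
ca = ca_doc_coords(F, k=2)

# In LSA the longer document lies twice as far from the origin;
# in CA the two documents coincide because only their profiles matter.
print("LSA:", lsa[0], lsa[1])
print("CA: ", ca[0], ca[1])
```

In this sketch the CA coordinates of a document depend only on its row profile (its counts divided by its length), so document length drops out, whereas the LSA coordinates scale linearly with length.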