Both latent semantic analysis (LSA) and correspondence analysis (CA) use a singular value decomposition (SVD) for dimensionality reduction. In this article, LSA and CA are compared from a theoretical point of view and applied to both a toy example and an authorship attribution example. In text mining, interest lies in the relationships among documents and terms: for example, which terms are used more often in which documents. The LSA solution, however, displays a mix of marginal effects and these relationships. CA appears to have more attractive properties than LSA. One such property is that, in CA, the effect of the margins is eliminated, so that the CA solution is well suited to focusing on the relationships among documents and terms. Three mechanisms for weighting documents and terms are distinguished, and a unifying framework is proposed that includes these three mechanisms and has both CA and LSA as special cases. The application of the discussed methods is illustrated in the authorship attribution example, which concerns the national anthem of the Netherlands.
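The contrast between the two decompositions can be sketched in a few lines of NumPy. This is a minimal illustration and not the article's own code: it assumes LSA is applied to the raw count matrix without any weighting, and it uses the standard CA formulation based on the SVD of the standardized residuals. The count matrix `F` and the function names `lsa_document_coords` and `ca_document_coords` are hypothetical, chosen only for this example.

```python
import numpy as np

# Hypothetical document-term counts (rows: documents, columns: terms).
# Document 2 is document 1 written "twice as long"; likewise documents 3 and 4.
F = np.array([
    [2, 1, 0, 0],
    [4, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 6, 2],
], dtype=float)


def lsa_document_coords(F, k):
    """Document coordinates from the SVD of the raw counts (assumed unweighted LSA)."""
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    return U[:, :k] * s[:k]


def ca_document_coords(F, k):
    """Row principal coordinates of CA via the SVD of standardized residuals."""
    P = F / F.sum()                  # correspondence matrix (proportions)
    r = P.sum(axis=1)                # row margins (document masses)
    c = P.sum(axis=0)                # column margins (term masses)
    E = np.outer(r, c)               # expected proportions under independence
    S = (P - E) / np.sqrt(E)         # standardized residuals: margins removed
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    return (U[:, :k] * s[:k]) / np.sqrt(r)[:, None]


lsa_docs = lsa_document_coords(F, k=2)
ca_docs = ca_document_coords(F, k=2)

print("LSA document coordinates:\n", np.round(lsa_docs, 3))
print("CA document coordinates:\n", np.round(ca_docs, 3))
```

In this toy matrix, documents 1 and 2 use terms in the same proportions and differ only in length. Under the assumptions above, their CA coordinates coincide, whereas their LSA coordinates differ by a factor of two, which is the kind of marginal effect the abstract refers to.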