We describe an algorithm for automatically classifying expressions as idiomatic or literal. Our starting point is that words in a given text segment, such as a paragraph, that are high-ranking representatives of a common topic of discussion are less likely to be part of an idiomatic expression. Our additional hypothesis is that the contexts in which idioms occur are typically more affective, so we also incorporate a simple analysis of the intensity of the emotions expressed by those contexts. We investigate the bag-of-words topic representation of one to three paragraphs containing an expression that should be classified as idiomatic or literal (a target phrase). We extract topics from paragraphs containing idioms and from paragraphs containing literals using an unsupervised clustering method, Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Since idiomatic expressions exhibit the property of non-compositionality, we assume that their semantics usually differ from those of the words used in the local topic. We treat idioms as semantic outliers, and the identification of a semantic shift as outlier detection. This topic representation thus allows us to differentiate idioms from literals using local semantic contexts. Our results are encouraging.
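The core decision rule can be illustrated with a minimal sketch. It assumes the topic of the surrounding paragraphs is already available as a word-to-weight mapping (for example, from an LDA model) and labels a target phrase as literal when any of its words is among the top-ranked topic words, and as idiomatic (a semantic outlier) otherwise. The function names, the cutoff `k`, and the toy weights are hypothetical and chosen only for illustration; the emotion-intensity analysis is not modeled here, and the actual procedure may differ.

```python
def top_topic_words(topic_weights, k):
    """Return the k highest-weighted words of a topic as a set."""
    ranked = sorted(topic_weights, key=topic_weights.get, reverse=True)
    return set(ranked[:k])


def classify_target(target_words, topic_weights, k=5):
    """Label a target phrase using its overlap with the local topic.

    Assumption (illustrative only): a phrase whose words rank highly in
    the local topic is treated as literal; a phrase with no such overlap
    is treated as a semantic outlier, i.e. idiomatic.
    """
    top_words = top_topic_words(topic_weights, k)
    overlap = [w for w in target_words if w in top_words]
    return "literal" if overlap else "idiomatic"


# Toy topic weights, standing in for the output of a topic model.
farm_topic = {"farm": 0.30, "bucket": 0.25, "water": 0.20, "cow": 0.15, "kick": 0.10}
funeral_topic = {"funeral": 0.30, "family": 0.25, "grief": 0.20, "death": 0.15, "illness": 0.10}

target = ["kick", "bucket"]
print(classify_target(target, farm_topic, k=3))     # "bucket" is a top topic word
print(classify_target(target, funeral_topic, k=3))  # no overlap with the topic
```

In this toy setting the same phrase receives different labels depending on the local topic, which is the intuition behind treating idioms as outliers relative to their context.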