Fringe groups and organizations commonly use euphemisms (ordinary-sounding, innocent-looking words with a secret meaning) to conceal what they are discussing. For instance, drug dealers often use "pot" for marijuana and "avocado" for heroin. For social media content moderation, recent advances in NLP have enabled the automatic detection of such single-word euphemisms, but no existing work can automatically detect multi-word euphemisms such as "blue dream" (marijuana) and "black tar" (heroin). To the best of our knowledge, this paper is the first to tackle euphemistic phrase detection without human effort. We first perform phrase mining on a raw text corpus (e.g., social media posts) to extract quality phrases. We then use word embedding similarities to select a set of euphemistic phrase candidates. Finally, we rank those candidates with a masked language model, SpanBERT. Compared to strong baselines, our algorithm achieves 20-50% higher accuracy in detecting euphemistic phrases.
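As an illustration of the candidate-selection step only, the following minimal sketch scores each mined phrase by its highest cosine similarity to a set of seed terms and keeps the top few. Everything in it is an assumption made for illustration: the three-dimensional vectors are invented toy values rather than learned embeddings, and the names `select_candidates`, `SEED_VECS`, and `PHRASE_VECS` are hypothetical. Phrase mining and the final SpanBERT ranking are not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Guard against zero vectors, which have no direction.
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_candidates(phrase_vecs, seed_vecs, top_k):
    """Return the top_k phrases ranked by best similarity to any seed term."""
    scored = []
    for phrase, vec in phrase_vecs.items():
        best = max(cosine(vec, seed) for seed in seed_vecs.values())
        scored.append((phrase, best))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy vectors, invented for illustration (not learned from any corpus).
SEED_VECS = {
    "marijuana": [0.9, 0.1, 0.0],
    "heroin": [0.1, 0.9, 0.0],
}
PHRASE_VECS = {
    "blue dream": [0.8, 0.2, 0.1],
    "black tar": [0.2, 0.8, 0.1],
    "customer service": [0.0, 0.1, 0.9],
}

candidates = select_candidates(PHRASE_VECS, SEED_VECS, top_k=2)
for phrase, score in candidates:
    print(f"{phrase}: {score:.3f}")
```

In this toy setup the two drug-related phrases score far higher than the unrelated one, so they are the candidates that a later masked-language-model ranking step would receive.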