The concept of diversity has received increased consideration in Natural Language Processing (NLP) in recent years. This attention stems from several motivations, such as promoting inclusion, approximating human linguistic behavior, and improving system performance. Diversity has, however, often been addressed in an ad hoc manner in NLP, with few explicit links to other domains where the notion is better theorized. We survey articles in the ACL Anthology from the past 6 years with "diversity" or "diverse" in their title. We find a wide range of settings in which diversity is quantified, often highly specialized and described with inconsistent terminology. We put forward a unified taxonomy of why, on what, where, and how diversity is measured in NLP. Diversity measures are cast into a unified framework from ecology and economics (Stirling, 2007) with 3 dimensions of diversity: variety, balance, and disparity. We discuss the trends that emerge from this systematized approach. We believe that this study paves the way towards a better formalization of diversity in NLP, which should bring a better understanding of this notion and better comparability between approaches.