The paper discusses the potential of large vision-language models as objects of study for empirical cultural research. Focusing on a comparative analysis of outputs from two popular text-to-image synthesis models, DALL-E 2 and Stable Diffusion, the paper examines the advantages and drawbacks of striving toward culturally agnostic versus culturally specific AI models. It presents several examples of memorization and bias in generated outputs that illustrate the trade-off between risk mitigation and cultural specificity, as well as the overall impossibility of developing culturally agnostic models.