This paper employs two major natural language processing techniques, topic modeling and clustering, to find patterns in folktales and reveal cultural relationships between regions. In particular, we use Latent Dirichlet Allocation and BERTopic to extract recurring elements, and K-means clustering to group folktales. We address two questions: what are the similarities and differences between folktales, and what do they reveal about culture? We show that the themes common across folktales are family, food, traditional gender roles, mythological figures, and animals. Folktale topics also vary with geographical location: tales from different regions feature different animals and environments. We were not surprised to find that religious figures and animals are among the common topics in all cultures. However, we were surprised that European and Asian folktales were often paired together. Our results demonstrate the prevalence of certain elements in cultures across the world. We anticipate that our work will serve as a resource for future research on folktales and as an example of using natural language processing to analyze documents in specific domains. Furthermore, since we analyzed the documents only by topic, more work could be done on the structure, sentiment, and characters of these folktales.
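To make the grouping step concrete, the following is a minimal sketch of K-means clustering over bag-of-words vectors. It is not the pipeline used in this paper: the abstract does not specify the features, the number of clusters, or the library, so the toy snippets, the stopword list, the choice of k = 2, and the helper names `vectorize`, `sq_dist`, and `kmeans` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy snippets, invented for illustration; not from the paper's corpus.
TALES = [
    "the fox tricked the crow and stole the cheese",
    "the king gave his daughter to the brave prince",
    "the crow and the fox shared the cheese",
    "the prince rescued the daughter of the king",
]
# Assumed, hand-picked stopword list for this toy example only.
STOPWORDS = {"the", "and", "his", "to", "of"}


def vectorize(texts):
    """Turn each text into a bag-of-words count vector over a shared vocabulary."""
    tokenized = [[w for w in t.split() if w not in STOPWORDS] for t in texts]
    vocab = sorted({w for toks in tokenized for w in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([float(counts[w]) for w in vocab])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's algorithm; centroids start at the first k vectors,
    which keeps this sketch deterministic."""
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


labels = kmeans(vectorize(TALES), k=2)
print(labels)
```

In this toy setting the two animal snippets share one label and the two royal-family snippets share the other. A realistic analysis would more likely cluster TF-IDF or embedding vectors with an established library and would choose k empirically.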