The rise of distributional models and deep learning has improved performance in NLP, but at a corresponding cost to interpretability. This has spurred a focus on what neural networks learn about natural language, with less attention to how they learn it. Some work has examined the data used to develop data-driven models, but that line of research typically aims to highlight issues with the data, e.g. identifying and offsetting harmful biases. This work contributes to the relatively untrodden path of investigating what data must contain for models to capture meaningful representations of natural language. Specifically, we evaluate how well English and Spanish semantic spaces capture a particular type of relational knowledge, namely the traits associated with concepts (e.g. bananas-yellow), and explore the role of co-occurrences in this context.
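The kind of evaluation described above can be illustrated with a toy sketch (this is not the paper's actual setup, corpus, or embedding model): build co-occurrence count vectors from a small corpus, then check whether a concept sits closer in the space to its associated trait (banana-yellow) than to an unrelated term (banana-blue) under cosine similarity. The corpus, window size, and word choices below are illustrative assumptions only.

```python
# Toy illustration (assumed setup, not the paper's method): do co-occurrence
# vectors associate a concept with its trait (banana-yellow) more strongly
# than with an unrelated colour term (banana-blue)?
from collections import Counter
import math

# Hypothetical miniature corpus, chosen only for illustration.
corpus = [
    "the ripe banana is yellow and sweet",
    "a yellow banana sat on the table",
    "the sky is blue today",
    "blue water under a blue sky",
    "the banana was sweet and soft",
]

window = 2
cooc = Counter()  # (word, context_word) -> co-occurrence count
for sentence in corpus:
    tokens = sentence.split()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                cooc[(w, tokens[j])] += 1

# Fixed vocabulary so every word vector has the same dimensions.
vocab = sorted({w for pair in cooc for w in pair})

def vector(word):
    """Raw co-occurrence count vector for a word over the vocabulary."""
    return [cooc.get((word, c), 0) for c in vocab]

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

assoc_yellow = cosine(vector("banana"), vector("yellow"))
assoc_blue = cosine(vector("banana"), vector("blue"))
print(f"banana-yellow: {assoc_yellow:.3f}  banana-blue: {assoc_blue:.3f}")
```

In this toy space the concept-trait pair scores higher than the unrelated pair, because "banana" and "yellow" share sentence contexts. Real evaluations of this kind would use trained embeddings (e.g. word2vec or contextual models) and human-elicited concept-trait norms rather than a hand-written corpus.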