Exploring the Scope of Using News Articles to Understand Development Patterns of Districts in India

Efforts to understand which factors bring about socio-economic development often suffer from the streetlight effect: only variables that have already been measured, and are therefore available for analysis, get analyzed. How can we check whether all worthwhile variables have been instrumented and considered when building an econometric development model? We attempt to address this question by building unsupervised learning methods that identify and rank news articles about diverse events occurring in different districts of India, which can provide insights into what may have transpired in those districts. This can help determine whether variables related to these events are in fact available for modeling the development of these districts. We also describe several other applications that emerge from this approach. One uses news articles to understand why pairs of districts that may have had similar socio-economic indicators approximately ten years ago are now at different levels of development. Another generates a newsfeed of unusual news articles, those that do not conform to the news typically reported about districts with a similar socio-economic profile. These applications highlight the need for qualitative data to augment models based on quantitative data, and are meant to open up research on new ways to mine information from unstructured qualitative data to understand development.
