We conduct an exploratory, large-scale, longitudinal study of 50 years of commits to publicly available version control system repositories, in order to characterize the geographic diversity of contributors to public code and its evolution over time. In total, we analyze 2.2 billion commits collected by Software Heritage from 160 million projects and authored by 43 million authors during the 1971-2021 period. We geolocate developers to 12 world regions derived from the United Nations geoscheme, using as signals email top-level domains, author names compared with name distributions around the world, and UTC offsets mined from commit metadata. We find evidence of the early dominance of North America in open source software, later joined by Europe. After that period, the geographic diversity in public code has been constantly increasing. We also identify relevant historical shifts related to the UNIX wars, the increase of coding literacy in Central and South Asia, and broader phenomena like colonialism and the movement of people across countries (immigration/emigration).