Statistics Netherlands (CBS) hosts a large amount of data, not only at the statistical level but also at the individual level. With the development of data science technologies, more and more researchers request to conduct their research using high-quality individual-level data from CBS (called CBS Microdata), alone or combined with other data sources. Making good use of these data for research and scientific purposes can greatly benefit society as a whole. However, CBS Microdata has been collected and maintained in different ways by different departments inside and outside CBS, and the representation, quality, and metadata of the datasets are not sufficiently harmonized. This project converts the descriptions of all CBS Microdata sets into one knowledge graph with comprehensive metadata in Dutch and English, using text mining and semantic web technologies. Researchers can easily query the metadata, explore the relations among multiple datasets, and find the variables they need. For example, if a researcher searches for the "Age at Death" dataset in the Health and Well-being category, all information related to this dataset appears, including its keywords and variable names. The "Age at Death" dataset has the keyword "Death", which leads to other datasets such as "Date of Death", "Cause of Death", and "Production statistics Health and welfare" from the Population, Business, and Health and Well-being categories. This saves considerable time and cost for both data requesters and data maintainers.
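The keyword-based navigation in the example can be illustrated with a minimal sketch. It is not the project's implementation: the abstract does not describe the graph schema, so the triple layout and the predicate name `hasKeyword` are assumptions made for illustration, and plain Python tuples stand in for a real triple store. Only the dataset names and the keyword "Death" are taken from the example.

```python
from collections import defaultdict

# Toy (subject, predicate, object) triples. The dataset names and the keyword
# "Death" come from the abstract's example; the predicate name "hasKeyword"
# is hypothetical and does not reflect the project's actual schema.
TRIPLES = [
    ("Age at Death", "hasKeyword", "Death"),
    ("Date of Death", "hasKeyword", "Death"),
    ("Cause of Death", "hasKeyword", "Death"),
    ("Production statistics Health and welfare", "hasKeyword", "Death"),
]


def build_keyword_index(triples):
    """Map each keyword to the set of datasets tagged with it."""
    index = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "hasKeyword":
            index[obj].add(subject)
    return index


def related_datasets(dataset, triples):
    """Return the datasets that share at least one keyword with `dataset`."""
    index = build_keyword_index(triples)
    # Collect the keywords attached to the dataset being looked up.
    keywords = {o for s, p, o in triples if s == dataset and p == "hasKeyword"}
    related = set()
    for keyword in keywords:
        related |= index[keyword]
    related.discard(dataset)  # a dataset is not "related" to itself
    return sorted(related)


if __name__ == "__main__":
    for name in related_datasets("Age at Death", TRIPLES):
        print(name)
```

In a semantic web setting the same lookup would more likely be a graph query over the metadata; the dictionary index here only mirrors the idea that a shared keyword node connects otherwise separate datasets.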