The increasing volumes of data produced by high-throughput instruments, coupled with advanced computational infrastructures for scientific computing, have enabled what is often called a {\em Fourth Paradigm} for scientific research, based on the exploration of large datasets. Current scientific research is often interdisciplinary, making data integration a critical technique for combining data from different scientific domains. Research data management is a critical part of this paradigm: it proposes and develops methods, techniques, and practices for managing scientific data throughout their life cycle. Research on microbial communities follows the same pattern, producing large amounts of data obtained, for instance, from sequencing organisms present in environmental samples. Data on microbial communities can come from a multitude of sources and can be stored in different formats. For example, data from metagenomics, metatranscriptomics, metabolomics, and biological imaging are often combined in studies. In this article, we describe the design and current state of implementation of an integrative research data management framework for the Cluster of Excellence Balance of the Microverse, which aims to make data on microbial communities easier to discover, access, combine, and reuse. This framework is based on research data repositories and on best practices for managing the workflows used in the analysis of microbial communities, which include recording provenance information for tracking data derivation.