Due to recent improvements in image resolution and acquisition speed, materials microscopy is experiencing an explosion of published imaging data. The standard publication format, while sufficient for traditional data ingestion scenarios where a select number of images can be critically examined and curated manually, is not conducive to large-scale data aggregation or analysis, hindering data sharing and reuse. Most images in publications are presented as components of a larger figure with their explicit context buried in the main body or caption text, so even if aggregated, collections of images with weak or no digitized contextual labels have limited value. To solve the problem of curating labeled microscopy data from literature, this work introduces the EXSCLAIM! Python toolkit for the automatic EXtraction, Separation, and Caption-based natural Language Annotation of IMages from scientific literature. We highlight the methodology behind the construction of EXSCLAIM! and demonstrate its ability to extract and label open-source scientific images at high volume.