The development of deep segmentation models for computational pathology (CPath) can help foster the investigation of interpretable morphological biomarkers. Yet, a major bottleneck to the success of such approaches is that supervised deep learning models require an abundance of accurately labelled data. This issue is exacerbated in CPath because generating detailed annotations usually demands the input of a pathologist, who is able to distinguish between different tissue constructs and nuclei. Manually labelling nuclei may not be a feasible approach for collecting large-scale annotated datasets, especially when a single image region can contain thousands of different cells. However, relying solely on the automatic generation of annotations will limit the accuracy and reliability of the ground truth. Therefore, to help overcome these challenges, we propose a multi-stage annotation pipeline that enables the collection of large-scale datasets for histology image analysis, with pathologist-in-the-loop refinement steps. Using this pipeline, we generate the largest known nuclear instance segmentation and classification dataset, containing nearly half a million labelled nuclei in H&E stained colon tissue. We have released the dataset and encourage the research community to utilise it to drive forward the development of downstream cell-based models in CPath.