Polyps in the colon are widely recognised as cancer precursors. They are identified by colonoscopy performed as part of a diagnostic work-up for symptoms, colorectal cancer screening, or systematic surveillance of certain diseases. Whilst most polyps are benign, the number, size and surface structure of polyps are tightly linked to the risk of colon cancer. Colon polyps are frequently missed or incompletely removed because of their variable nature, the difficulty of delineating the abnormality, high recurrence rates and the anatomical topography of the colon. Several methods have been developed to automate polyp detection and segmentation. However, the key issue with most of them is that they have not been tested rigorously on a large, multi-center, purpose-built dataset. These methods may therefore overfit to a specific population and endoscopic surveillance, and fail to generalise to datasets from different populations. To this end, we have curated a dataset from 6 different centers incorporating more than 300 patients. The dataset includes both single-frame and sequence data, with 3446 annotated polyp labels whose precise boundary delineations were verified by six senior gastroenterologists. To our knowledge, this is the most comprehensive detection and pixel-level segmentation dataset curated by a team of computational scientists and expert gastroenterologists. The dataset originated as part of the EndoCV2021 challenge, which aimed to address generalisability in polyp detection and segmentation. In this paper, we provide comprehensive insight into data construction and annotation strategies, annotation quality assurance and technical validation for our extended EndoCV2021 dataset, which we refer to as PolypGen.