In digital pathology, Whole Slide Image (WSI) analysis is usually formulated as a Multiple Instance Learning (MIL) problem. Although transformer-based architectures have been used for WSI classification, these methods require modifications to adapt them to the specific challenges of this type of image data. Despite their power across domains, reference transformer models from classical Computer Vision (CV) and Natural Language Processing (NLP) tasks are not used for pathology slide analysis. In this work we demonstrate the use of standard, frozen, text-pretrained transformer language models for WSI classification. We propose SeqShort, a multi-head attention-based sequence-reduction input layer that summarizes each WSI into a short, fixed-length sequence of instances. This allows us to reduce the computational cost of self-attention over long sequences and to include positional information that is unavailable in other MIL approaches. We demonstrate the effectiveness of our method on the task of cancer subtype classification, without the need to design a WSI-specific transformer or perform in-domain self-supervised pretraining, while keeping the compute budget and the number of trainable parameters low.
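To make the sequence-reduction idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it assumes the reduction layer uses a fixed number of learned query tokens that cross-attend to the full bag of WSI patch embeddings, yielding a short, fixed-length sequence that a frozen language model could then process. The class name `SeqShortLayer` and all dimensions (`embed_dim`, `num_heads`, `num_queries`) are illustrative assumptions.

```python
# Hypothetical sketch of a SeqShort-style sequence-reduction layer.
# A small set of learned query tokens cross-attends over the (potentially
# very long) bag of WSI patch embeddings, producing a short summary sequence.
import torch
import torch.nn as nn

class SeqShortLayer(nn.Module):
    def __init__(self, embed_dim=768, num_heads=8, num_queries=128):
        super().__init__()
        # Learned queries fix the length of the summarized sequence.
        self.queries = nn.Parameter(torch.randn(num_queries, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, instances):
        # instances: (batch, n_instances, embed_dim); n_instances may be tens of thousands.
        q = self.queries.unsqueeze(0).expand(instances.size(0), -1, -1)
        # Cross-attention: each query aggregates information from all instances.
        summary, _ = self.attn(q, instances, instances)
        return summary  # (batch, num_queries, embed_dim)

# Usage: reduce a bag of 20,000 patch features to 128 tokens before the
# frozen, text-pretrained transformer.
bag = torch.randn(1, 20000, 768)
short = SeqShortLayer()(bag)
print(short.shape)  # torch.Size([1, 128, 768])
```

Because the output length is fixed and small, standard positional encodings can be added to the summarized sequence, and the quadratic cost of self-attention in the downstream transformer is bounded by the number of query tokens rather than the number of patches.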