The environment surrounding a cancerous tumor affects how it grows and develops in humans. New data from early breast cancer patients contain information on the collagen fibers surrounding the tumorous tissue, offering hope of finding additional biomarkers for diagnosis and prognosis, but pose two challenges for typical analysis. Each image section contains information on hundreds of fibers, and each tissue sample has multiple image sections contributing to a single prediction of tumor vs. non-tumor. This nested relationship of fibers within image sections within tissue samples requires a specialized analysis approach. We devise a novel support vector machine (SVM)-based predictive algorithm for this data structure. By treating each collection of fibers as a probability distribution, we can measure similarities between collections through a flexible kernel approach. Under the assumed relationship between the tumor status of image sections and that of tissue samples, the resulting SVM problem is non-convex, and traditional algorithms cannot be applied. We propose two algorithms that trade off accuracy against computational efficiency to manage data of all sizes. The predictive performance of both algorithms is evaluated on the collagen fiber data set and in additional simulation scenarios. We offer reproducible implementations of both algorithms in the R package mildsvm.
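To make the idea of comparing collections of fibers as probability distributions concrete, the following minimal Python sketch computes one possible similarity: the average pairwise Gaussian (RBF) kernel between two samples, often called a kernel mean embedding. This is an illustration under stated assumptions, not the method or API of mildsvm: the abstract does not name the kernel used, and the function names, the bandwidth `sigma`, and the two-dimensional fiber features are all hypothetical.

```python
import math

def rbf(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mean_embedding_kernel(sample_p, sample_q, sigma=1.0):
    """Similarity between two collections: the average RBF kernel over
    all pairs (x, y) with x drawn from sample_p and y from sample_q."""
    total = sum(rbf(x, y, sigma) for x in sample_p for y in sample_q)
    return total / (len(sample_p) * len(sample_q))

# Hypothetical fiber features per image section (values are made up
# for illustration, e.g. a length-like and an orientation-like feature).
section_a = [(1.0, 0.20), (1.1, 0.25), (0.9, 0.15)]
section_b = [(1.0, 0.22), (1.05, 0.20)]
section_c = [(3.0, 1.50), (3.2, 1.40), (2.9, 1.60)]

k_ab = mean_embedding_kernel(section_a, section_b)  # similar collections
k_ac = mean_embedding_kernel(section_a, section_c)  # dissimilar collections
```

In this toy example the collections `section_a` and `section_b` have nearby features, so `k_ab` is larger than `k_ac`. A matrix of such values across all image sections could then be passed to an SVM as a precomputed kernel; how the nested, non-convex problem is actually solved is beyond what this sketch covers.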