An abundance of information about cancer exists online, but categorizing it and extracting useful information from it is difficult. Almost all research in healthcare data processing is concerned with formal clinical data, but non-clinical data contains valuable information too. The present study combines methods from distributed computing, text retrieval, clustering, and classification into a coherent and computationally efficient system that can clarify cancer patient trajectories based on non-clinical, freely available information. We produce a fully functional prototype that can retrieve, cluster, and present information about cancer trajectories from non-clinical forum posts. We evaluate three clustering algorithms (MR-DBSCAN, DBSCAN, and HDBSCAN) and compare them in terms of Adjusted Rand Index and total run time as functions of the number of posts retrieved and the neighborhood radius. The clustering results show that the neighborhood radius has the most significant impact on clustering performance. For small values, the data set is split accordingly, but high values produce a large number of possible partitions, and searching for the best partition is therefore time-consuming. With a properly estimated radius, MR-DBSCAN can cluster 50,000 forum posts in 46.1 seconds, compared to 143.4 seconds for DBSCAN and 282.3 seconds for HDBSCAN. We conduct an interview with the Danish Cancer Society and present our software prototype. The organization sees potential in software that can democratize online information about cancer and foresees that such systems will be required in the future.
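The kind of comparison described above, Adjusted Rand Index and run time as functions of the neighborhood radius, can be illustrated with a minimal sketch. This is not the study's pipeline: it assumes a plain quadratic-time DBSCAN written from the textbook definition, a hand-made two-group toy data set in place of forum posts, and hypothetical helper names (`dbscan`, `adjusted_rand_index`, `sweep`). It only shows how a radius that is too small or too large degrades agreement with the reference labels.

```python
import math
import time
from collections import Counter


def dbscan(points, eps, min_pts):
    """Plain O(n^2) DBSCAN; returns one label per point, -1 meaning noise."""
    n = len(points)
    labels = [None] * n

    def neighbors(i):
        # Every point within distance eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point, keep expanding
                queue.extend(reachable)
    return labels


def adjusted_rand_index(truth, pred):
    """Adjusted Rand Index computed from the contingency counts."""
    def pairs(count):
        return count * (count - 1) // 2

    sum_cells = sum(pairs(c) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(pairs(c) for c in Counter(truth).values())
    sum_cols = sum(pairs(c) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / pairs(len(truth))
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def sweep(points, truth, radii, min_pts=3):
    """Return (radius, ARI, seconds) for each neighborhood radius."""
    results = []
    for eps in radii:
        start = time.perf_counter()
        labels = dbscan(points, eps, min_pts)
        elapsed = time.perf_counter() - start
        results.append((eps, adjusted_rand_index(truth, labels), elapsed))
    return results


# Toy data (an assumption for illustration): two 3x3 grids, 10 units apart.
group_a = [(float(x), float(y)) for x in range(3) for y in range(3)]
group_b = [(x + 10.0, y) for x, y in group_a]
points = group_a + group_b
truth = [0] * len(group_a) + [1] * len(group_b)

results = sweep(points, truth, radii=[0.5, 1.5, 20.0])
for eps, ari, seconds in results:
    print(f"eps={eps:<5} ARI={ari:.2f} time={seconds:.4f}s")
```

On this toy data, the smallest radius marks every point as noise and the largest merges both groups into one cluster, so only the intermediate radius recovers the reference partition. The timings on 18 points are not meaningful and say nothing about the reported run times.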