The success of deep learning is largely due to the availability of large amounts of training data that cover a wide range of examples of a particular concept or meaning. In medicine, a diverse set of training data on a particular disease can lead to a model that accurately predicts that disease. However, despite these potential benefits, image-based diagnosis has not advanced significantly, owing to a lack of high-quality annotated data. This article highlights the importance of a data-centric approach to improving the quality of data representations, particularly when the available data is limited. To address this "small-data" issue, we discuss four methods for generating and aggregating training data: data augmentation, transfer learning, federated learning, and generative adversarial networks (GANs). We also propose knowledge-guided GANs, which incorporate domain knowledge into the training data generation process. Given recent progress in large pre-trained language models, we believe it is possible to acquire high-quality knowledge that can improve the effectiveness of knowledge-guided generative methods.