Data augmentation plays a key role in modern machine learning pipelines. While numerous augmentation strategies have been studied in the context of computer vision and natural language processing, less is known about other data modalities. Our work extends the success of data augmentation to compositional data, i.e., simplex-valued data, which is of particular interest in the context of the human microbiome. Drawing on key principles from compositional data analysis, such as the Aitchison geometry of the simplex and subcompositions, we define novel augmentation strategies for this data modality. Incorporating our data augmentations into standard supervised learning pipelines results in consistent performance gains across a wide range of standard benchmark datasets. In particular, we set a new state of the art for key disease prediction tasks including colorectal cancer, type 2 diabetes, and Crohn's disease. In addition, our data augmentations enable us to define a novel contrastive learning model, which improves on previous representation learning approaches for microbiome compositional data. Our code is available at https://github.com/cunningham-lab/AugCoDa.
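To make the two principles named above concrete, the following is a minimal sketch of what augmentations built on the Aitchison geometry and on subcompositions could look like. It is an illustration under stated assumptions, not the implementation in the AugCoDa repository: the function names `closure`, `aitchison_mix`, and `random_subcomposition` are hypothetical, and the inputs are assumed to be 1-D NumPy arrays of non-negative parts summing to one.

```python
import numpy as np


def closure(x):
    """Rescale non-negative parts so they sum to one (project onto the simplex)."""
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=-1, keepdims=True)


def aitchison_mix(x1, x2, lam):
    """Hypothetical helper: convex combination of two compositions in Aitchison geometry.

    In Aitchison geometry, scalar multiplication is powering and addition is
    perturbation, so the combination is a closed weighted geometric mean.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return closure(x1 ** lam * x2 ** (1.0 - lam))


def random_subcomposition(x, keep_prob, rng):
    """Hypothetical helper: keep a random subset of parts, then re-close.

    Dropped parts are set to zero. If nothing with positive mass is kept,
    the input composition is returned unchanged (an assumption of this sketch).
    """
    x = np.asarray(x, dtype=float)
    mask = rng.random(x.shape[-1]) < keep_prob
    kept = np.where(mask, x, 0.0)
    if kept.sum() == 0.0:
        return closure(x)
    return closure(kept)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = np.array([0.2, 0.3, 0.5])
    b = np.array([0.6, 0.3, 0.1])
    print(aitchison_mix(a, b, lam=0.5))
    print(random_subcomposition(a, keep_prob=0.7, rng=rng))
```

Both helpers return points on the simplex, so their outputs can be passed to a downstream classifier in place of the original samples. How labels are combined for mixed samples is a separate design choice that this sketch leaves open.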