Deep autoencoders are often extended with a supervised or adversarial loss to learn latent representations with desirable properties, such as greater predictivity of labels and outcomes or fairness with respect to a sensitive variable. Although supervised and adversarial deep latent factor models are ubiquitous, they should demonstrate improvement over simpler linear approaches before being preferred in practice. Such a comparison requires a reproducible linear analog that still incorporates an augmenting supervised or adversarial objective. We address this methodological gap by presenting methods that augment the principal component analysis (PCA) objective with either a supervised or an adversarial objective, and we provide analytic and reproducible solutions. We implement these methods in an open-source Python package, AugmentedPCA, that can produce strong real-world baselines. We demonstrate the utility of these factor models on an open-source RNA-seq cancer gene expression dataset, showing that augmenting with a supervised objective improves downstream classification performance, produces principal components with greater class fidelity, and facilitates the identification of genes that are aligned with the principal axes of data variance and have implications for the development of specific types of cancer.
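As a rough illustration of how a PCA objective can be augmented while keeping an analytic solution, the sketch below adds a weighted term to the usual variance-maximization problem and solves it with a single eigendecomposition. It is not the AugmentedPCA package's API, and it is not necessarily the formulation used by the authors. The function name `augmented_pca_sketch`, the weight `mu`, and the specific objective (component variance plus `mu` times the squared covariance between components and the augmenting variable) are all assumptions made for illustration. A positive `mu` plays the role of a supervised objective, and a negative `mu` plays the role of an adversarial one.

```python
import numpy as np


def augmented_pca_sketch(X, Y, n_components, mu):
    """Illustrative linear factor model with an augmenting objective.

    Hypothetical formulation (an assumption, not the AugmentedPCA algorithm):
    find orthonormal loadings W maximizing
        trace(W^T C W) + mu * ||W^T B||_F^2,
    where C is the covariance of X and B is the cross-covariance of X and Y.
    mu > 0 rewards components that covary with Y (supervised flavor);
    mu < 0 penalizes them (adversarial flavor); mu = 0 is ordinary PCA.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if Y.ndim == 1:
        Y = Y[:, None]  # treat a 1-D target as a single column

    n = X.shape[0]
    Xc = X - X.mean(axis=0)  # center the primary data
    Yc = Y - Y.mean(axis=0)  # center the augmenting data

    C = Xc.T @ Xc / n  # (p, p) covariance of X
    B = Xc.T @ Yc / n  # (p, q) cross-covariance of X and Y

    # The objective is a trace of W^T M W with symmetric M, so the optimum
    # is given in closed form by the leading eigenvectors of M.
    M = C + mu * (B @ B.T)
    eigvals, eigvecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]

    W = eigvecs[:, order]  # (p, n_components) orthonormal loadings
    Z = Xc @ W             # (n, n_components) component scores
    return W, Z
```

Because the solution comes from one symmetric eigendecomposition, repeated runs on the same data give the same components up to sign, which is the kind of reproducibility a linear baseline is meant to offer.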