Variational autoencoders are among the most popular methods for distilling low-dimensional structure from high-dimensional data, making them increasingly valuable as tools for data exploration and scientific discovery. However, unlike typical machine learning problems in which a single model is trained once on a single large dataset, scientific workflows privilege learned features that are reproducible, portable across labs, and capable of being incrementally updated with new data. Ideally, methods used by different research groups should produce comparable results, even without sharing fully trained models or entire datasets. Here, we address this challenge by introducing the Rosetta VAE (R-VAE), a method of distilling previously learned representations and retraining new models to reproduce and build on prior results. The R-VAE uses post hoc clustering over the latent space of a fully trained model to identify a small number of Rosetta Points (input, latent pairs) to serve as anchors for training future models. An adjustable hyperparameter, $\rho$, balances fidelity to the previously learned latent space against accommodation of new data. We demonstrate that the R-VAE reconstructs data as well as the VAE and $\beta$-VAE, outperforms both methods in recovery of a target latent space in a sequential training setting, and dramatically increases consistency of the learned representation across training runs.
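The two ingredients described above (selecting Rosetta Points by clustering a trained model's latent codes, and penalizing a new model's deviation from those points with weight $\rho$) can be sketched as follows. This is a minimal illustration, not the authors' reference implementation: the helper names (`select_rosetta_points`, `rosetta_loss`), the use of k-means, and the assumption that the encoder returns latent means as a single tensor are all ours.

```python
# Hypothetical sketch of Rosetta-point selection and the anchoring penalty;
# function names and the choice of k-means are illustrative assumptions.
import torch
from sklearn.cluster import KMeans

def select_rosetta_points(encoder, data, n_points=32):
    """Cluster the latent codes of a fully trained model and return the
    (input, latent) pairs whose codes lie nearest each cluster centroid."""
    with torch.no_grad():
        z = encoder(data)                          # latent means, shape (N, d)
    km = KMeans(n_clusters=n_points, n_init=10).fit(z.cpu().numpy())
    centers = torch.as_tensor(km.cluster_centers_, dtype=z.dtype)
    idx = torch.cdist(centers, z).argmin(dim=1)    # nearest real sample per centroid
    return data[idx], z[idx]                       # Rosetta Points: (x_r, z_r)

def rosetta_loss(new_encoder, x_r, z_r, rho=1.0):
    """Anchoring term added to the new model's training objective: rho trades
    off fidelity to the old latent space against accommodating new data."""
    z_hat = new_encoder(x_r)
    return rho * torch.mean((z_hat - z_r) ** 2)
```

In this reading, a downstream lab would train its own VAE on its own data while adding `rosetta_loss` to the usual ELBO, so that the shared Rosetta Points pin the new latent space to the previously learned one; larger $\rho$ favors reproducing the prior representation, smaller $\rho$ favors fitting the new data.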