Autoregressive models have achieved impressive results across a wide range of domains in terms of generation quality and downstream task performance. In the continuous domain, a key factor behind this success is the use of quantized latent spaces (e.g., obtained via VQ-VAE autoencoders), which allow for dimensionality reduction and faster inference times. However, using existing pre-trained models to perform new non-trivial tasks is difficult, since it requires additional fine-tuning or extensive training to elicit prompting. This paper introduces LASS as a way to perform vector-quantized Latent Autoregressive Source Separation (i.e., de-mixing an input signal into its constituent sources) without requiring additional gradient-based optimization or modifications of existing models. Our separation method relies on a Bayesian formulation in which the autoregressive models act as priors, and a discrete (non-parametric) likelihood function is constructed by performing frequency counts over latent sums of addend tokens. We test our method on images and audio with several sampling strategies (e.g., ancestral sampling, beam search), showing results competitive with existing approaches in terms of separation quality, while offering significant speedups in inference time and scalability to higher-dimensional data.
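The discrete non-parametric likelihood mentioned above can be illustrated with a minimal sketch: given aligned latent tokens of two addend signals and the token assigned to their mixture, co-occurrences are counted to estimate a conditional distribution over mixture tokens. The codebook size, toy data, and the stand-in mixing rule below are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

K = 8  # assumed codebook size (illustrative)
rng = np.random.default_rng(0)

# Toy "dataset" of aligned latent tokens: two addends and their observed mixture.
z1 = rng.integers(0, K, size=1000)
z2 = rng.integers(0, K, size=1000)
zm = (z1 + z2) % K  # stand-in for the token the encoder assigns to the mix

# Frequency counts over latent sums of addend tokens:
# counts[a, b, m] = number of times addend tokens (a, b) produced mixture token m.
counts = np.zeros((K, K, K))
np.add.at(counts, (z1, z2, zm), 1)

# Normalise into a discrete likelihood P(zm | z1, z2),
# falling back to a uniform distribution for unseen (z1, z2) pairs.
totals = counts.sum(axis=-1, keepdims=True)
likelihood = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / K)
```

At separation time, such a table can score candidate addend-token pairs against the observed mixture token, combining with the autoregressive priors without any gradient-based training.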