The widespread use of generative models has created a feedback loop, in which each generation of models is trained on data partially produced by its predecessors. This process has raised concerns about model collapse: a critical degradation in performance caused by repeated training on synthetic data. However, different analyses in the literature have reached different conclusions as to the severity of model collapse. As such, it remains unclear how concerning this phenomenon is, and under which assumptions it can be avoided. To address this, we theoretically study model collapse for maximum likelihood estimation (MLE), in a natural setting where synthetic data is gradually added to the original dataset. Under standard assumptions (similar to those long used for proving asymptotic consistency and normality of MLE), we establish non-asymptotic bounds showing that collapse can be avoided even as the fraction of real data vanishes. On the other hand, we prove that some assumptions (beyond MLE consistency) are indeed necessary: without them, model collapse can occur arbitrarily quickly, even when the original data is still present in the training set. To the best of our knowledge, these are the first rigorous examples of iterative generative modeling with accumulating data that rapidly leads to model collapse.
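The accumulating-data setting described above can be illustrated with a toy simulation (not taken from the paper): a Gaussian is fit by MLE, the fitted model generates synthetic samples, and those samples are added to (rather than replacing) the training pool before refitting. All sample sizes and the number of generations here are arbitrary choices for illustration.

```python
import random

def mle_gaussian(xs):
    # MLE for a univariate Gaussian: sample mean and biased sample variance.
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, var

random.seed(0)
real = [random.gauss(0.0, 1.0) for _ in range(1000)]  # original (real) data
data = list(real)

history = []
for t in range(10):  # generations of the feedback loop
    mu, var = mle_gaussian(data)
    history.append((mu, var))
    # Each generation contributes synthetic samples that accumulate
    # alongside the real data, so the fraction of real data shrinks.
    data.extend(random.gauss(mu, var ** 0.5) for _ in range(1000))

print("first generation:", history[0])
print("last generation: ", history[-1])
```

In this accumulate regime the fitted variance drifts only slowly across generations, in contrast to the replace regime (training each generation only on the previous generation's samples), where the variance contracts toward zero; this is the qualitative distinction the abstract's bounds make precise.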