This study proposes a method for removing mislabeled instances from contaminated training datasets by combining surrogate model-based black-box optimization (BBO) with postprocessing and quantum annealing. Mislabeled training instances are common in real-world datasets and often degrade model generalization, which calls for robust and efficient noise-removal strategies. The proposed method scores each filtered training subset by its validation loss, iteratively refines the loss estimates through surrogate model-based BBO with postprocessing, and uses quantum annealing to efficiently sample diverse training subsets with low validation error. Experiments on a noisy majority bit task show that the method prioritizes the removal of high-risk mislabeled instances. Running D-Wave's clique sampler on a physical quantum annealer yields faster optimization and higher-quality training subsets than OpenJij's simulated quantum annealing sampler or Neal's simulated annealing sampler, offering a scalable framework for improving dataset quality. These results highlight the effectiveness of the method for supervised learning tasks; future directions include applications to unsupervised learning, real-world datasets, and large-scale implementations.
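The loop described in the abstract (score a training subset by validation loss, fit a surrogate to the observed losses, sample a new low-energy subset, repeat) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the function names are hypothetical, the classifier is a ridge-regularized linear model, the surrogate is a quadratic (QUBO-style) model fitted by ridge regression, and a small hand-written simulated annealer stands in for the D-Wave, OpenJij, and Neal samplers. The postprocessing step is not specified in the abstract and is omitted.

```python
import itertools
import numpy as np

def make_majority_data(n, n_bits, flip_frac, rng):
    """Random bit strings labeled by majority vote; a fraction of labels is flipped."""
    X = rng.integers(0, 2, size=(n, n_bits))
    y = (X.sum(axis=1) > n_bits / 2).astype(int)
    flipped = rng.random(n) < flip_frac
    return X, np.where(flipped, 1 - y, y), flipped

def validation_loss(mask, X_tr, y_tr, X_val, y_val, ridge=1e-2):
    """Train a ridge linear model on the kept instances; return validation MSE."""
    keep = mask.astype(bool)
    A = np.hstack([X_tr[keep], np.ones((keep.sum(), 1))])
    t = 2.0 * y_tr[keep] - 1.0  # labels mapped to -1 / +1
    w = np.linalg.solve(A.T @ A + ridge * np.eye(A.shape[1]), A.T @ t)
    A_val = np.hstack([X_val, np.ones((len(X_val), 1))])
    return float(np.mean((A_val @ w - (2.0 * y_val - 1.0)) ** 2))

def fit_qubo(masks, losses, ridge=1e-1):
    """Fit an upper-triangular Q so that m^T Q m approximates the centered losses."""
    M = np.asarray(masks, dtype=float)
    n = M.shape[1]
    pairs = list(itertools.combinations_with_replacement(range(n), 2))
    F = np.stack([M[:, i] * M[:, j] for i, j in pairs], axis=1)
    y = np.asarray(losses, dtype=float)
    y = y - y.mean()  # a constant offset does not change the minimizer
    coef = np.linalg.solve(F.T @ F + ridge * np.eye(F.shape[1]), F.T @ y)
    Q = np.zeros((n, n))
    for c, (i, j) in zip(coef, pairs):
        Q[i, j] = c
    return Q

def anneal(Q, rng, sweeps=200, t_hot=0.1, t_cold=1e-3):
    """Single-bit-flip simulated annealing on E(m) = m^T Q m (stand-in sampler)."""
    n = Q.shape[0]
    S = Q + Q.T
    m = rng.integers(0, 2, size=n)
    for temp in np.geomspace(t_hot, t_cold, sweeps):
        for i in rng.permutation(n):
            d = 1 - 2 * m[i]  # +1 when turning bit i on, -1 when turning it off
            dE = d * (Q[i, i] + S[i] @ m - 2 * Q[i, i] * m[i])
            if dE <= 0 or rng.random() < np.exp(-dE / temp):
                m[i] = 1 - m[i]
    return m

def clean_dataset(X_tr, y_tr, X_val, y_val, rng, n_init=30, n_iter=30):
    """Surrogate-based BBO loop over keep/remove masks (1 = keep the instance)."""
    n = len(X_tr)
    masks = [rng.integers(0, 2, size=n) for _ in range(n_init)]
    losses = [validation_loss(m, X_tr, y_tr, X_val, y_val) for m in masks]
    for _ in range(n_iter):
        Q = fit_qubo(masks, losses)   # refine the loss estimate
        m = anneal(Q, rng)            # sample a promising subset
        masks.append(m)
        losses.append(validation_loss(m, X_tr, y_tr, X_val, y_val))
    best = int(np.argmin(losses))
    return masks[best], losses[best], losses[:n_init]

# Toy run: 16 training instances with about 25% flipped labels, clean validation set.
rng = np.random.default_rng(0)
X_tr, y_tr, flipped = make_majority_data(16, 5, 0.25, rng)
X_val, y_val, _ = make_majority_data(200, 5, 0.0, rng)
best_mask, best_loss, init_losses = clean_dataset(X_tr, y_tr, X_val, y_val, rng)
print("kept:      ", best_mask)
print("mislabeled:", flipped.astype(int))
print("best validation loss:", best_loss)
```

In this sketch a mask bit of 1 keeps the instance and 0 removes it, so a good mask would tend to have zeros where `flipped` is true. The quadratic form is chosen because annealing samplers of the kind named in the abstract take QUBO or Ising problems as input; in principle `anneal` could be swapped for such a sampler, with the problem sizes and temperatures here being arbitrary toy values.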