The problem of missing data, usually absent in curated and competition-standard datasets, is an unfortunate reality for most machine learning models used in industry applications. Recent work has focused on understanding the nature and negative effects of missingness, and on devising solutions for optimal imputation of the missing data using both discriminative and generative approaches. We propose a novel mechanism based on multi-head attention that can be added to any model with little effort and achieves better downstream performance without introducing the full dataset into any part of the modeling pipeline. Our method inductively models patterns of missingness in the input data in order to improve performance on the downstream task. Finally, evaluating our method against baselines on a number of datasets, we find performance gains that tend to be larger in scenarios of high missingness.
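To make the idea of attending over missingness patterns concrete, the following is a minimal NumPy sketch and not the architecture proposed in this work, whose details the abstract does not specify. It assumes that each input feature becomes one token built from its zero-filled value, a binary missing indicator, and a per-feature embedding, and that standard scaled dot-product multi-head self-attention is applied across those tokens. The names `init_params` and `missingness_attention`, the parameter layout, and the mean pooling at the end are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(n_features, d_model, seed=0):
    # Hypothetical parameter layout; randomly initialised, untrained.
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(d_model)
    return {
        "w_val": rng.normal(size=d_model),                 # projects the feature value
        "w_mask": rng.normal(size=d_model),                # projects the missing indicator
        "feat_emb": rng.normal(size=(n_features, d_model)),  # identifies each feature
        "w_q": rng.normal(size=(d_model, d_model)) * scale,
        "w_k": rng.normal(size=(d_model, d_model)) * scale,
        "w_v": rng.normal(size=(d_model, d_model)) * scale,
        "w_o": rng.normal(size=(d_model, d_model)) * scale,
    }


def missingness_attention(x, params, n_heads):
    """x: (batch, n_features) float array with NaN marking missing entries.

    Returns a pooled representation (batch, d_model) and the attention
    weights (batch, n_heads, n_features, n_features).
    """
    mask = np.isnan(x).astype(float)            # 1.0 where the value is missing
    x_filled = np.where(np.isnan(x), 0.0, x)    # zero-fill; no dataset statistics needed

    # One token per feature: value term + missingness term + feature embedding.
    tokens = (x_filled[..., None] * params["w_val"]
              + mask[..., None] * params["w_mask"]
              + params["feat_emb"])             # (B, F, D)

    batch, n_feat, d_model = tokens.shape
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_head = d_model // n_heads

    def split_heads(t):
        # (B, F, D) -> (B, H, F, d_head)
        return t.reshape(batch, n_feat, n_heads, d_head).transpose(0, 2, 1, 3)

    q = split_heads(tokens @ params["w_q"])
    k = split_heads(tokens @ params["w_k"])
    v = split_heads(tokens @ params["w_v"])

    # Scaled dot-product attention across feature tokens.
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)   # (B, H, F, F)
    attn = softmax(scores, axis=-1)

    # Merge heads back to (B, F, D), apply the output projection, then mean-pool.
    merged = (attn @ v).transpose(0, 2, 1, 3).reshape(batch, n_feat, d_model)
    out = merged @ params["w_o"]
    return out.mean(axis=1), attn


# Usage: two rows with different missingness patterns.
x = np.array([[1.0, np.nan, 3.0],
              [np.nan, np.nan, 0.5]])
params = init_params(n_features=3, d_model=8, seed=0)
rep, attn = missingness_attention(x, params, n_heads=2)
```

In this sketch the pooled vector `rep` could be concatenated with, or fed into, any downstream model. Because the missing indicator enters every token, a missing entry and an observed zero yield different representations even though both are zero-filled.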