Many machine learning models assume that training and test data are drawn from the same distribution (the IID assumption). In practice, however, this condition does not always hold: data collected in different time periods or regions, for example, may follow different distributions, so the training and test distributions diverge. Worse, recent work has pointed out that such model bias can introduce larger generalization error. To address the mismatch between training and test distributions, methods such as transfer learning have been proposed, but they require the test distribution to be known in advance, whereas in reality the test data are unobservable. More recently, some studies have considered the model-bias problem and attempted to learn models with stability guarantees by decorrelating variables through sample reweighting: they learn a new set of sample weights under which the correlations among all variables are removed. Such an aggressive objective, however, can drastically reduce the effective sample size, which in turn degrades the performance of machine learning models.
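To make the reweighting idea concrete, here is a minimal sketch, not the method of any particular paper, of learning sample weights that suppress the correlations among all variable pairs by gradient descent on the squared off-diagonal entries of the weighted covariance matrix. The function name and hyperparameters are illustrative, and the gradient is approximate (it ignores how the weighted mean itself depends on the weights).

```python
# A minimal sketch of global decorrelation via sample reweighting.
# All names and hyperparameters here are illustrative assumptions.
import numpy as np

def decorrelation_weights(X, n_iter=500, lr=0.05):
    """Learn nonnegative sample weights (rescaled to sum to n) that
    reduce the squared weighted covariances between all variable pairs."""
    n, p = X.shape
    w = np.ones(n)
    for _ in range(n_iter):
        ws = w / w.sum()                      # normalized weights
        mu = ws @ X                           # weighted means, shape (p,)
        Xc = X - mu                           # centered features
        cov = Xc.T @ (ws[:, None] * Xc)       # weighted covariance, (p, p)
        off = cov - np.diag(np.diag(cov))     # keep off-diagonal terms only
        # approximate gradient of the sum of squared off-diagonal
        # covariances w.r.t. w (mean treated as fixed for this step)
        grad = ((Xc @ off) * Xc).sum(axis=1) / w.sum()
        w = np.clip(w - lr * grad, 1e-6, None)
    return w * n / w.sum()
```

Note that because every pair of variables is penalized, the learned weights can concentrate on a few samples, which is exactly the effective-sample-size problem described above.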

Unlike prior work that removes all correlations among variables, this paper argues that not every correlation needs to be removed. For example, when recognizing dogs in an image classification task, a dog's nose, ears, and mouth may be represented by different variables, yet the correlation that binds them together as a whole is stable across environments. Similarly, another group of variables may represent the background (e.g., grass). Due to selection bias, we may observe strong correlations between these two kinds of variables in the biased training data, but such "spurious" correlations do not generalize to new environments. In this case, therefore, we only need to remove the spurious correlations between the salient variables and the background variables to obtain an accurate dog classifier.
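In this spirit, a hypothetical variant of the sketch above penalizes only the correlations between two variable groups (say, object variables and background variables), leaving within-group correlations such as nose-ears-mouth intact. The group memberships are assumed to be given here, whereas in practice they would have to be identified from data.

```python
# Hypothetical differentiated variant: decorrelate only cross-group pairs.
# group_a / group_b are assumed known index lists (an assumption).
import numpy as np

def cross_group_weights(X, group_a, group_b, n_iter=500, lr=0.05):
    n, p = X.shape
    w = np.ones(n)
    mask = np.zeros((p, p))
    mask[np.ix_(group_a, group_b)] = 1.0      # penalize only cross-group
    mask += mask.T                            # symmetric penalty mask
    for _ in range(n_iter):
        ws = w / w.sum()
        Xc = X - ws @ X                       # weighted centering
        cov = Xc.T @ (ws[:, None] * Xc)       # weighted covariance
        grad = ((Xc @ (mask * cov)) * Xc).sum(axis=1) / w.sum()
        w = np.clip(w - lr * grad, 1e-6, None)
    return w * n / w.sum()
```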


Latest Papers

Fairness has emerged as an important consideration in algorithmic decision-making. Unfairness occurs when an agent with higher merit obtains a worse outcome than an agent with lower merit. Our central point is that a primary cause of unfairness is uncertainty. A principal or algorithm making decisions never has access to the agents' true merit, and instead uses proxy features that only imperfectly predict merit (e.g., GPA, star ratings, recommendation letters). None of these ever fully capture an agent's merit; yet existing approaches have mostly defined fairness notions directly based on observed features and outcomes. Our primary point is that it is more principled to acknowledge and model the uncertainty explicitly. The role of observed features is to give rise to a posterior distribution of the agents' merits. We use this viewpoint to define a notion of approximate fairness in ranking. We call an algorithm $\phi$-fair (for $\phi \in [0,1]$) if it has the following property for all agents $x$ and all $k$: if agent $x$ is among the top $k$ agents with respect to merit with probability at least $\rho$ (according to the posterior merit distribution), then the algorithm places the agent among the top $k$ agents in its ranking with probability at least $\phi \rho$. We show how to compute rankings that optimally trade off approximate fairness against utility to the principal. In addition to the theoretical characterization, we present an empirical analysis of the potential impact of the approach in simulation studies. For real-world validation, we applied the approach in the context of a paper recommendation system that we built and fielded at the KDD 2020 conference.
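To make the $\phi$-fairness definition concrete, below is a minimal Monte-Carlo sketch, not the authors' code, that checks the condition empirically. `merit_samples` are assumed draws from the posterior over agent merits, and `ranking_samples` are assumed rankings drawn from the algorithm's (possibly randomized) output distribution; both names are illustrative.

```python
# Monte-Carlo check of phi-fairness: for every agent x and cutoff k,
# P[algorithm ranks x in its top k] >= phi * P[x is in the merit top k].
import numpy as np

def is_phi_fair(merit_samples, ranking_samples, phi):
    """merit_samples: (S, n) array of posterior merit draws.
    ranking_samples: (R, n) array, each row a permutation of agent
    indices with the best agent first. Returns True if phi-fair."""
    S, n = merit_samples.shape
    # rho[x, k] = posterior prob. that agent x is among the top (k+1) by merit
    order = np.argsort(-merit_samples, axis=1)   # (S, n), best first
    rho = np.zeros((n, n))
    for row in order:
        for k, x in enumerate(row):
            rho[x, k:] += 1.0 / S                # top-k membership is nested
    # pi[x, k] = prob. the algorithm places agent x among its top (k+1)
    pi = np.zeros((n, n))
    for row in ranking_samples:
        for k, x in enumerate(row):
            pi[x, k:] += 1.0 / len(ranking_samples)
    # small tolerance absorbs Monte-Carlo noise
    return bool(np.all(pi >= phi * rho - 1e-9))
```

Because the condition must hold for every agent and every cutoff $k$ simultaneously, checking the full $n \times n$ grid of probabilities, as above, is the direct way to test it on sampled data.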
