A wide range of machine learning applications, including privacy-preserving learning, algorithmic fairness, and domain adaptation/generalization, involve learning invariant representations of the data that aim to achieve two competing goals: (a) maximize information or accuracy with respect to a target response, and (b) maximize invariance or independence with respect to a set of protected features (e.g., for fairness or privacy). Despite their wide applicability, theoretical understanding of the optimal tradeoffs between accuracy and invariance achievable by invariant representations is still severely lacking. In this paper, we provide an information-theoretic analysis of such tradeoffs under both classification and regression settings. More precisely, we provide a geometric characterization of the accuracy and invariance achievable by any representation of the data; we term this feasible region the information plane. We provide an inner bound for this feasible region for the classification case, and an exact characterization for the regression case, which allows us to either bound or exactly characterize the Pareto-optimal frontier between accuracy and invariance. Although our contributions are mainly theoretical, a key practical application of our results is in certifying the potential sub-optimality of any given representation learning algorithm for either classification or regression tasks. Our results shed new light on the fundamental interplay between accuracy and invariance, and may be useful in guiding the design of future representation learning algorithms.