Leveraging Structure for Improved Classification of Grouped Biased Data

We consider semi-supervised binary classification for applications in which data points are naturally grouped (e.g., survey responses grouped by state) and the labeled data is biased (e.g., survey respondents are not representative of the population). The groups overlap in the feature space and consequently the input-output patterns are related across the groups. To model the inherent structure in such data, we assume the partition-projected class-conditional invariance across groups, defined in terms of the group-agnostic feature space. We demonstrate that under this assumption, the group carries additional information about the class, over the group-agnostic features, with provably improved area under the ROC curve. Further assuming invariance of partition-projected class-conditional distributions across both labeled and unlabeled data, we derive a semi-supervised algorithm that explicitly leverages the structure to learn an optimal, group-aware, probability-calibrated classifier, despite the bias in the labeled data. Experiments on synthetic and real data demonstrate the efficacy of our algorithm over suitable baselines and ablative models, spanning standard supervised and semi-supervised learning approaches, with and without incorporating the group directly as a feature.
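The claim that the group carries information about the class beyond the group-agnostic features can be illustrated with a small simulation. The sketch below is not the paper's algorithm. It assumes a stronger, simplified form of invariance: the class-conditional distribution of a single feature `x` is identical in both groups, and only the class prior differs by group. The priors, the Gaussian means, and all function names are hypothetical choices made for illustration. It scores points with the true posterior rather than a learned one, once ignoring the group and once using it, and compares the resulting area under the ROC curve.

```python
import math
import random


def gaussian_pdf(x, mean, std=1.0):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) statistic."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # average 1-based rank within the tie
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def posterior(x, prior):
    """P(y = 1 | x) for class-conditionals N(1, 1) and N(0, 1) (assumed)."""
    p1 = prior * gaussian_pdf(x, 1.0)
    p0 = (1.0 - prior) * gaussian_pdf(x, 0.0)
    return p1 / (p1 + p0)


def simulate(n=20000, seed=0):
    """Return (group-agnostic AUC, group-aware AUC) on synthetic data."""
    rng = random.Random(seed)
    priors = {0: 0.2, 1: 0.8}  # hypothetical per-group class priors
    xs, gs, ys = [], [], []
    for _ in range(n):
        g = rng.randint(0, 1)                       # group, equally likely
        y = 1 if rng.random() < priors[g] else 0    # class depends on group
        x = rng.gauss(1.0 if y == 1 else 0.0, 1.0)  # feature depends on class only
        xs.append(x)
        gs.append(g)
        ys.append(y)
    pooled_prior = 0.5  # mean of the two group priors for equal-sized groups
    agnostic = [posterior(x, pooled_prior) for x in xs]
    aware = [posterior(x, priors[g]) for x, g in zip(xs, gs)]
    return auc(agnostic, ys), auc(aware, ys)


if __name__ == "__main__":
    auc_agnostic, auc_aware = simulate()
    print(f"group-agnostic AUC: {auc_agnostic:.3f}")
    print(f"group-aware AUC:    {auc_aware:.3f}")
```

In this toy setting the group-aware score ranks points at least as well as the group-agnostic one because the group shifts the prior that multiplies an otherwise shared likelihood. The setting described in the abstract is more general, with invariance defined on a partition of the group-agnostic feature space and with biased labeled data, neither of which this sketch models.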

