We consider semi-supervised binary classification for applications in which data points are naturally grouped (e.g., survey responses grouped by state) and the labeled data is biased (e.g., survey respondents are not representative of the population). The groups overlap in the feature space, and consequently the input-output patterns are related across the groups. To model the inherent structure in such data, we assume partition-projected class-conditional invariance across groups, defined in terms of the group-agnostic feature space. We demonstrate that, under this assumption, the group carries information about the class beyond that in the group-agnostic features, and that using it provably improves the area under the ROC curve. Further assuming invariance of the partition-projected class-conditional distributions across labeled and unlabeled data, we derive a semi-supervised algorithm that explicitly leverages this structure to learn an optimal, group-aware, probability-calibrated classifier despite the bias in the labeled data. Experiments on synthetic and real data demonstrate the efficacy of our algorithm over suitable baselines and ablative models, spanning standard supervised and semi-supervised learning approaches, with and without incorporating the group directly as a feature.
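The claim that the group adds class information beyond the group-agnostic features can be illustrated with a small simulation. The sketch below is not the algorithm derived in this work: it assumes a deliberately simplified setting in which the class-conditional distributions of a single feature are identical across groups (a stronger condition than partition-projected invariance), the groups differ only in their class priors, and the true posteriors are known rather than learned from biased labeled data. The Gaussian means, the group priors, and all function names are hypothetical choices made for illustration.

```python
import math
import random


def gaussian_pdf(x, mean, std=1.0):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior(x, prior, mean0=0.0, mean1=1.5):
    """Bayes posterior P(y=1 | x) for a given class prior (assumed known)."""
    p1 = prior * gaussian_pdf(x, mean1)
    p0 = (1.0 - prior) * gaussian_pdf(x, mean0)
    return p1 / (p0 + p1)


def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) statistic.

    Assumes no tied scores, which holds almost surely for continuous data.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum_pos = sum(rank + 1 for rank, i in enumerate(order) if labels[i] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def simulate(n=6000, group_priors=(0.1, 0.5, 0.9), seed=0):
    """Compare a group-agnostic score with a group-aware score on synthetic data."""
    rng = random.Random(seed)
    xs, gs, ys = [], [], []
    for _ in range(n):
        g = rng.randrange(len(group_priors))             # uniformly chosen group
        y = 1 if rng.random() < group_priors[g] else 0   # group-specific class prior
        x = rng.gauss(1.5 if y == 1 else 0.0, 1.0)       # class-conditional shared by all groups
        xs.append(x)
        gs.append(g)
        ys.append(y)
    # Groups are equally likely, so the pooled prior is the mean of the group priors.
    pooled_prior = sum(group_priors) / len(group_priors)
    agnostic = [posterior(x, pooled_prior) for x in xs]               # ignores the group
    aware = [posterior(x, group_priors[g]) for x, g in zip(xs, gs)]   # uses the group's prior
    return auc(agnostic, ys), auc(aware, ys)


if __name__ == "__main__":
    auc_agnostic, auc_aware = simulate()
    print(f"group-agnostic AUC: {auc_agnostic:.3f}")
    print(f"group-aware AUC:    {auc_aware:.3f}")
```

In this toy setting the group-aware score ranks points using both the feature and the group's class prior, so its AUC exceeds that of the group-agnostic score. The semi-supervised setting studied here is harder, because the group-specific quantities must be estimated from biased labeled data together with unlabeled data.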