Unsupervised domain adaptation (UDA) typically aligns unlabeled target-domain data to the distribution of the source domain to mitigate the distribution shift problem. Standard UDA requires sharing the source data with the target, which poses a risk of leaking private data. To protect the privacy of the source data, we first propose sharing the source feature distribution instead of the source data itself. However, sharing only the source feature distribution may still be vulnerable to membership inference attacks, in which an adversary with black-box access to the source model infers whether an individual's record was part of the training data. To resolve this privacy issue, we further study the under-explored problem of privacy-preserving domain adaptation and propose a method with a novel differential privacy training strategy to protect the source data privacy. We model the source feature distribution with Gaussian Mixture Models (GMMs) under the differential privacy setting and send it to the target client for adaptation. The target client resamples differentially private source features from the GMMs and adapts to the target data with several state-of-the-art UDA backbones. With the proposed method, the source data provider can avoid leaking source data privacy during domain adaptation while preserving utility. To evaluate the method's utility and privacy loss, we apply it to a medical-report disease-label classification task on two noisy, challenging clinical text datasets. The results show that the proposed method preserves the privacy of the source data with only a minor impact on text classification performance.
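To make the pipeline concrete, the sketch below illustrates the two sides of the exchange: the source client fits a GMM to its feature vectors and releases only noised parameters, and the target client resamples synthetic source features from the shared GMM for adaptation. This is a minimal illustration, not the paper's exact procedure; the function names (`fit_private_gmm`, `resample_source_features`) and the clipping and noise scales are hypothetical, and a real deployment would calibrate the noise to a formal (epsilon, delta) differential privacy budget.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_private_gmm(source_features, n_components=8, clip_norm=1.0, sigma=0.1, seed=0):
    """Source-side sketch: fit a GMM to clipped features and noise its parameters.

    Clipping bounds each sample's contribution (sensitivity); Gaussian noise on
    the released GMM parameters stands in for the paper's differential privacy
    training strategy. The noise scale here is illustrative only.
    """
    rng = np.random.default_rng(seed)
    # Clip each feature vector to bound per-sample sensitivity.
    norms = np.linalg.norm(source_features, axis=1, keepdims=True)
    clipped = source_features * np.minimum(1.0, clip_norm / (norms + 1e-12))
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=seed).fit(clipped)
    # Perturb the parameters that leave the source client; keep variances positive.
    gmm.means_ += rng.normal(0.0, sigma, size=gmm.means_.shape)
    gmm.covariances_ = np.abs(gmm.covariances_ +
                              rng.normal(0.0, sigma, size=gmm.covariances_.shape))
    return gmm  # only these noised parameters are shared with the target

def resample_source_features(gmm, n_samples=1024):
    """Target-side sketch: draw synthetic source features from the shared GMM
    to substitute for the raw source data during UDA training."""
    synthetic_features, _ = gmm.sample(n_samples)
    return synthetic_features
```

The resampled features would then replace the raw source features in the alignment loss of whichever UDA backbone the target client trains.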