Readily available proxies for the time of disease onset, such as the time of the first diagnostic code, can lead to substantial risk prediction error when analyses rely on poor proxies. Because detailed documentation is lacking and manual annotation is labor intensive, it is often feasible to ascertain, for only a small subset, the current status of the disease at a follow-up time rather than the exact onset time. In this paper, we aim to develop risk prediction models for the onset time that efficiently leverage both a small number of labels on current status and a large number of unlabeled observations on imperfect proxies. Under a semiparametric transformation model for the onset time and a highly flexible measurement error model for the proxy onset time, we propose a semi-supervised risk prediction method that efficiently combines information from the proxies and the limited labels. Starting from an initial estimator based solely on the labeled subset, we perform a one-step correction with the full data, augmenting against a mean-zero rank correlation score derived from the proxies. We establish the consistency and asymptotic normality of the proposed semi-supervised estimator and provide a resampling procedure for interval estimation. Simulation studies demonstrate that the proposed estimator performs well in finite samples. We illustrate the proposed estimator by developing a genetic risk prediction model for obesity using data from the Partners Biobank Electronic Health Records (EHR).
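The augmentation idea can be illustrated with a deliberately simplified toy sketch. It is not the proposed estimator: it estimates a plain mean rather than the parameters of a transformation model, uses fully observed outcomes rather than current status labels, and replaces the rank correlation score with a simple difference of proxy means. The names `augmented_mean` and `simulate`, the Gaussian data-generating process, and the least-squares weight are all assumptions made for illustration. The sketch shows only the general mechanism: a labeled-only estimator is corrected by a term that has mean zero and is computed from proxies on the full data.

```python
import random
import statistics


def augmented_mean(y_lab, x_lab, x_all):
    """Labeled-only mean of y, corrected by a mean-zero proxy contrast.

    y_lab, x_lab: outcomes and proxies on the labeled subset.
    x_all: proxies on the full data (labeled and unlabeled).
    """
    y_bar = statistics.fmean(y_lab)
    x_bar_lab = statistics.fmean(x_lab)
    x_bar_all = statistics.fmean(x_all)
    # Augmentation term: its expectation is zero when the labeled subset
    # is a random sample of the full data (an assumption of this toy).
    contrast = x_bar_lab - x_bar_all
    # Weight estimated by least squares on the labeled subset
    # (a control-variate style choice, assumed here for illustration).
    sxy = sum((x - x_bar_lab) * (y - y_bar) for x, y in zip(x_lab, y_lab))
    sxx = sum((x - x_bar_lab) ** 2 for x in x_lab)
    weight = sxy / sxx if sxx > 0 else 0.0
    return y_bar - weight * contrast


def simulate(n_rep=200, n_all=1000, n_lab=100, seed=0):
    """Compare sampling variance of the naive and augmented estimators."""
    rng = random.Random(seed)
    naive, augmented = [], []
    for _ in range(n_rep):
        y_all = [rng.gauss(0.0, 1.0) for _ in range(n_all)]
        # Hypothetical noisy proxy of the outcome.
        x_all = [y + rng.gauss(0.0, 0.5) for y in y_all]
        # Draws are i.i.d., so the first n_lab records act as a random subset.
        y_lab, x_lab = y_all[:n_lab], x_all[:n_lab]
        naive.append(statistics.fmean(y_lab))
        augmented.append(augmented_mean(y_lab, x_lab, x_all))
    return statistics.pvariance(naive), statistics.pvariance(augmented)


if __name__ == "__main__":
    var_naive, var_aug = simulate()
    print(f"naive variance: {var_naive:.5f}, augmented variance: {var_aug:.5f}")
```

In this toy setting the gain comes entirely from the correlation between the proxy and the outcome: the more informative the proxy, the more of the labeled-sample noise the mean-zero contrast can remove.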