Typically, electronic health record data are not collected to address a specific research question. Instead, they comprise numerous individuals recruited at different ages, whose medical, environmental and often also genetic data are collected. Some phenotypes, such as disease-onset ages, may be reported retrospectively if the event preceded recruitment; such observations are termed ``prevalent''. The standard method for accommodating this ``delayed entry'' conditions on the entire history up to recruitment, so the retrospective prevalent failure times are conditioned upon and cannot contribute to estimating the disease-onset age distribution. An alternative approach conditions only on survival up to the recruitment age and on the recruitment age itself. This approach allows the prevalent information to be incorporated, but it brings about numerical and computational difficulties. In this work we develop consistent estimators of the coefficients in a regression model for the age at onset that utilize the prevalent data. Asymptotic results are provided, and simulations showcase the substantial efficiency gain that may be obtained with the proposed approach. In particular, the method is highly useful for leveraging large-scale repositories in replicability analysis of genetic variants. Indeed, an analysis of urinary bladder cancer data reveals that the proposed approach yields about twice as many replicated discoveries as the popular approach.
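To make concrete why the standard delayed-entry treatment discards prevalent observations, the following minimal sketch uses a constant-hazard (exponential) onset model. That model is purely an illustrative assumption and is not the regression model developed in this work. The names `Subject`, `is_prevalent`, `at_risk` and `truncated_exponential_rate`, as well as the toy cohort, are hypothetical. Under delayed entry, a subject joins the risk set only after recruitment, so a subject whose onset preceded recruitment is never at risk and contributes neither an event nor follow-up time.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Subject:
    recruit_age: float  # age at recruitment (the delayed-entry time)
    end_age: float      # age at disease onset, or at censoring if no onset
    onset: bool         # True if disease onset was observed or reported


def is_prevalent(s: Subject) -> bool:
    """Onset occurred at or before recruitment, so it is reported retrospectively."""
    return s.onset and s.end_age <= s.recruit_age


def at_risk(s: Subject, age: float) -> bool:
    """Risk-set membership under delayed entry: only after recruitment."""
    return s.recruit_age < age <= s.end_age


def truncated_exponential_rate(subjects: List[Subject]) -> float:
    """MLE of a constant hazard when conditioning on the history up to recruitment.

    Assumption (illustration only): exponential onset ages. The estimate is
    events / person-years, both counted from recruitment onward, so prevalent
    subjects drop out of numerator and denominator alike.
    """
    followed = [s for s in subjects if not is_prevalent(s)]
    events = sum(1 for s in followed if s.onset)
    exposure = sum(s.end_age - s.recruit_age for s in followed)
    return events / exposure


# Hypothetical toy cohort: (recruitment age, onset or censoring age, onset flag).
cohort = [
    Subject(50.0, 60.0, True),   # incident case: onset after recruitment
    Subject(40.0, 70.0, False),  # censored without onset
    Subject(65.0, 55.0, True),   # prevalent case: onset before recruitment
    Subject(55.0, 75.0, True),   # incident case
]

n_prevalent = sum(is_prevalent(s) for s in cohort)
rate = truncated_exponential_rate(cohort)
print(n_prevalent, rate)
```

In this toy cohort the prevalent subject's onset age of 55 never enters the estimate, which is the information the alternative conditioning aims to recover.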