Quantified Sleep: Machine learning techniques for observational n-of-1 studies

This paper applies statistical learning techniques to an observational Quantified-Self (QS) study to build a descriptive model of sleep quality. A total of 472 days of my sleep data were collected with an Oura ring and combined with lifestyle, environmental, and psychological data. Such n-of-1 QS projects pose a number of challenges: heterogeneous data sources, missing values, high dimensionality, dynamic feedback loops, and human biases. This paper addresses these challenges directly with an end-to-end QS pipeline that produces robust descriptive models. Sleep quality is one of the most difficult modelling targets in QS research, owing to high noise and a large number of weakly contributing factors. It was selected as the target so that the approaches in this paper would generalise to most other n-of-1 QS projects. Techniques are presented for combining and engineering features across the different classes of data types, sample frequencies, and schemas, including event logs, weather, and geospatial data. Statistical analyses for outliers, normality, (auto)correlation, stationarity, and missing data are detailed, along with a proposed method for hierarchical clustering to identify correlated groups of features. Missing data were handled using a combination of knowledge-based and statistical techniques, including several multivariate imputation algorithms. "Markov unfolding" is presented for collapsing the time series into a collection of independent observations whilst incorporating historical information. The final model was interpreted in two ways: by inspecting the internal $\beta$-parameters, and by using the SHAP framework. These two interpretation techniques were combined to produce a list of the 16 most predictive features, demonstrating that an observational study can greatly narrow down the number of features that need to be considered when designing interventional QS studies.
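
As a rough illustration of the "Markov unfolding" idea mentioned above, the sketch below flattens a daily time series into standalone rows by appending the previous days' values to each day as extra columns. The abstract does not give the exact procedure, so this is only one plausible reading: the function name `markov_unfold`, the `_lagK` suffix, the `order` parameter, and the sample feature names are all assumptions made for illustration, not the paper's implementation.

```python
from typing import Dict, List

Row = Dict[str, float]


def markov_unfold(rows: List[Row], order: int) -> List[Row]:
    """Append the previous `order` days' values to each day's row.

    `rows` is assumed to hold one row per day, sorted by date.
    The first `order` rows are dropped because they lack a full history.
    (Hypothetical helper; the naming scheme is an assumption.)
    """
    if order < 0:
        raise ValueError("order must be non-negative")
    unfolded: List[Row] = []
    for t in range(order, len(rows)):
        flat = dict(rows[t])  # today's features keep their original names
        for lag in range(1, order + 1):
            # Copy the features from `lag` days ago under suffixed names.
            for name, value in rows[t - lag].items():
                flat[f"{name}_lag{lag}"] = value
        unfolded.append(flat)
    return unfolded


# Made-up sample values, purely for demonstration.
days = [
    {"sleep_score": 70.0, "caffeine_mg": 200.0},
    {"sleep_score": 82.0, "caffeine_mg": 0.0},
    {"sleep_score": 75.0, "caffeine_mg": 100.0},
]
unfolded = markov_unfold(days, order=1)
```

Each resulting row carries its own short history, so a standard tabular model can treat the rows as separate observations. A larger `order` keeps more history at the cost of more columns and fewer usable rows.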
