According to the Global Burden of Disease list provided by the World Health Organization (WHO), mental disorders are among the most debilitating disorders. To improve diagnosis and therapy effectiveness, researchers have in recent years tried to identify individual biomarkers. Gathering neurobiological data, however, is costly and time-consuming. Another potential source of information, one that is already part of the clinical routine, is therapist-patient dialogue. While some pioneering works investigate the role of language as a predictor of various therapeutic parameters, for example patient-therapist alliance, there are no large-scale studies. A major obstacle to conducting such studies is the limited availability of sizeable datasets, which are needed to train machine learning models. While these conversations are part of the daily routine of clinicians, gathering them is usually hindered by various ethical (purpose of data usage), legal (data privacy) and technical (data formatting) limitations. Some of these limitations are particular to the domain of therapy dialogues, such as the increased difficulty of anonymisation or the transcription of the recordings. In this paper, we elaborate on the challenges we faced in starting our collection of therapist-patient dialogues in a psychiatry clinic under the General Data Protection Regulation (GDPR) of the European Union, with the goal of using the data for Natural Language Processing (NLP) research. We give an overview of each step in our procedure and point out the potential pitfalls to motivate further research in this field.