We present a syntactic dependency treebank for naturalistic child and child-directed speech in English (MacWhinney, 2000). Our annotations largely follow the guidelines of the Universal Dependencies project (UD; Zeman et al., 2022), with detailed extensions for lexical and syntactic structures unique to conversational speech (as opposed to written text). Compared to existing UD-style spoken treebanks, and to other dependency corpora of child-parent interactions specifically, our dataset is much larger (44,744 utterances; 233,907 words) and contains speech from 10 children covering a wide age range (18-66 months). With this dataset, we ask: (1) How well do state-of-the-art dependency parsers, tailored for the written domain, perform on the speech of different interlocutors in spontaneous conversations? (2) What is the relationship between parser performance and the developmental stage of the child? To address these questions, in ongoing work, we are conducting thorough dependency parser evaluations using both graph-based and transition-based parsers with different hyperparameter settings, trained on three different types of out-of-domain written text: news, tweets, and learner data.