Automatic pronunciation assessment is a major component of computer-assisted pronunciation training systems. To provide in-depth feedback, it is essential to score pronunciation at several levels of granularity (phoneme, word, and utterance) and on several aspects (such as accuracy, fluency, and completeness). However, existing multi-aspect, multi-granularity methods predict all aspects at all granularity levels simultaneously, and therefore have difficulty capturing the linguistic hierarchy of phoneme, word, and utterance. This limitation also leads them to neglect the close relations among aspects of the same linguistic unit. In this paper, we propose Hierarchical Pronunciation Assessment with Multi-aspect Attention (HiPAMA), a model that represents the granularity levels hierarchically to capture their linguistic structure directly, and introduces multi-aspect attention that reflects associations across aspects at the same level to create more connotative representations. By obtaining relational information from both the granularity side and the aspect side, HiPAMA can take full advantage of multi-task learning. Remarkable improvements in the experimental results on the speechocean762 dataset demonstrate the robustness of HiPAMA, particularly on the difficult-to-assess aspects.
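As a rough illustration of the two ideas named in the abstract, the sketch below builds word vectors from phoneme vectors and an utterance vector from word vectors, and at each level lets the per-aspect representations attend to one another before scoring. It is a toy under stated assumptions, not the HiPAMA implementation: mean pooling, the random untrained weights, the function names (`make_level`, `score_level`, `assess`), and the per-level aspect lists (including `stress`) are all hypothetical choices made for the example.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_pool(vectors):
    # Average a non-empty list of equal-length vectors (stand-in for a learned encoder).
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def linear(vec, matrix):
    return [dot(row, vec) for row in matrix]

def aspect_attention(aspects):
    # Scaled dot-product attention across the aspect vectors of ONE unit:
    # each aspect becomes a weighted mix of all aspects at the same level.
    names = list(aspects)
    dim = len(aspects[names[0]])
    mixed = {}
    for q in names:
        logits = [dot(aspects[q], aspects[k]) / math.sqrt(dim) for k in names]
        weights = softmax(logits)
        mixed[q] = [sum(w * aspects[k][i] for w, k in zip(weights, names))
                    for i in range(dim)]
    return mixed

def make_level(rng, aspect_names, dim):
    # Hypothetical parameters: one projection and one scoring head per aspect.
    # Weights are random and untrained; they only make the sketch runnable.
    return {
        name: {
            "proj": [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)],
            "head": [rng.uniform(-0.5, 0.5) for _ in range(dim)],
        }
        for name in aspect_names
    }

def score_level(level, unit_vec):
    aspects = {name: linear(unit_vec, p["proj"]) for name, p in level.items()}
    mixed = aspect_attention(aspects)
    return {name: dot(level[name]["head"], mixed[name]) for name in level}

def assess(phone_vecs, word_of_phone, levels):
    # Assumes word indices are contiguous from 0 and every word has a phoneme.
    phone_scores = [score_level(levels["phoneme"], v) for v in phone_vecs]
    n_words = max(word_of_phone) + 1
    # Word vectors are built from their phonemes, the utterance from its words.
    word_vecs = [
        mean_pool([v for v, w in zip(phone_vecs, word_of_phone) if w == i])
        for i in range(n_words)
    ]
    word_scores = [score_level(levels["word"], v) for v in word_vecs]
    utt_scores = score_level(levels["utterance"], mean_pool(word_vecs))
    return phone_scores, word_scores, utt_scores

rng = random.Random(0)
dim = 4
levels = {
    "phoneme": make_level(rng, ["accuracy"], dim),
    "word": make_level(rng, ["accuracy", "stress"], dim),
    "utterance": make_level(rng, ["accuracy", "fluency", "completeness"], dim),
}
phone_vecs = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]
word_of_phone = [0, 0, 1, 1, 1]  # phonemes 0-1 form word 0, phonemes 2-4 form word 1
phone_scores, word_scores, utt_scores = assess(phone_vecs, word_of_phone, levels)
print(len(phone_scores), len(word_scores), sorted(utt_scores))
# prints: 5 2 ['accuracy', 'completeness', 'fluency']
```

The point of the structure is that each higher level sees only representations derived from the level below it, and aspect scores at a level share information through attention rather than being predicted independently. A trained model would replace the pooling and random weights with learned layers and fit all scores jointly.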