Self-attention mechanisms have enabled transformers to achieve superhuman performance on many speech-to-text (STT) tasks, yet automatic prosodic segmentation remains unsolved. In this paper we finetune Whisper, a pretrained STT model, to annotate intonation unit (IU) boundaries by repurposing low-frequency tokens. Our approach achieves 95.8% accuracy, outperforming previous methods without requiring large-scale labeled data or enterprise-grade compute. We also degrade the input signal with a series of filters, finding that a low-pass filter with a 3.2 kHz cutoff improves segmentation performance in out-of-sample and out-of-distribution settings. We release our model as both a transcription tool and a baseline for further work on prosodic segmentation.
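The filtering experiment can be illustrated with a minimal sketch: a zero-phase Butterworth low-pass filter with a 3.2 kHz cutoff applied to a 16 kHz waveform. The filter order and the use of `scipy.signal` are assumptions for illustration, not the paper's actual preprocessing pipeline.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def low_pass(audio: np.ndarray, sr: int = 16000,
             cutoff_hz: float = 3200.0, order: int = 5) -> np.ndarray:
    """Zero-phase Butterworth low-pass filter at cutoff_hz (illustrative).

    Assumed parameters: 16 kHz sample rate (Whisper's input rate) and a
    5th-order filter; the paper does not specify the filter design.
    """
    nyquist = sr / 2.0
    b, a = butter(order, cutoff_hz / nyquist, btype="low")
    # filtfilt runs the filter forward and backward, so no phase shift
    # is introduced into the prosodic timing cues.
    return filtfilt(b, a, audio)

# Example: attenuate high-frequency content in one second of white noise.
rng = np.random.default_rng(0)
noise = rng.standard_normal(16000)
filtered = low_pass(noise)
```

Zero-phase filtering is a natural choice here because prosodic boundaries are timing-sensitive, and a causal filter would shift events in time.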