This paper rethinks some aspects of speech processing with speech encoders, specifically the extraction of entities directly from speech without an intermediate textual representation. In human-computer conversations, extracting entities such as names, street addresses, and email addresses from speech is a challenging task. We study the impact of fine-tuning pre-trained speech encoders to extract spoken entities in human-readable form directly from speech, without the need for text transcription. We show that this direct approach optimizes the encoder to transcribe only the entity-relevant portions of speech, ignoring superfluous portions such as carrier phrases or spelled-out name entities. In the context of dialogs from an enterprise virtual agent, we demonstrate that this 1-step approach outperforms the typical 2-step approach, which first generates lexical transcriptions and then applies text-based entity extraction to identify spoken entities.