Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly. Therefore, we propose SpeechCLIP, a novel framework bridging speech and text through images to enhance speech models without transcriptions. We leverage state-of-the-art pre-trained HuBERT and CLIP, aligning them via paired images and spoken captions with minimal fine-tuning. SpeechCLIP outperforms prior state-of-the-art on image-speech retrieval and performs zero-shot speech-text retrieval without direct supervision from transcriptions. Moreover, SpeechCLIP can directly retrieve semantically related keywords from speech.
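To make the alignment idea concrete, below is a minimal sketch (not the authors' code) of CLIP-style contrastive alignment between a frozen speech encoder and a frozen image encoder. The SpeechProjector head, the embedding dimensions, mean-pooling, and the symmetric InfoNCE loss are illustrative assumptions consistent with "minimal fine-tuning": both pre-trained encoders stay frozen and only a small projection head on the speech side is trained.

    # Minimal sketch: contrastive alignment of speech and image embeddings.
    # Assumptions (hypothetical, for illustration): HuBERT and CLIP are frozen;
    # only a linear projection head is trained; the loss is CLIP's symmetric
    # InfoNCE objective over paired (spoken caption, image) batches.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpeechProjector(nn.Module):
        """Trainable head that pools HuBERT-like frame features and maps
        them into the CLIP embedding space (dimensions are assumptions)."""
        def __init__(self, speech_dim: int = 768, clip_dim: int = 512):
            super().__init__()
            self.proj = nn.Linear(speech_dim, clip_dim)
            # Learnable temperature, initialized as in CLIP (ln(1/0.07)).
            self.logit_scale = nn.Parameter(torch.tensor(2.659))

        def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
            # speech_feats: (batch, frames, speech_dim); mean-pool the frames.
            pooled = speech_feats.mean(dim=1)
            return F.normalize(self.proj(pooled), dim=-1)

    def clip_contrastive_loss(speech_emb, image_emb, logit_scale):
        """Symmetric cross-entropy over the speech-image similarity matrix;
        matched pairs lie on the diagonal."""
        logits = logit_scale.exp() * speech_emb @ image_emb.t()
        targets = torch.arange(logits.size(0), device=logits.device)
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2

    # Toy batch: random tensors stand in for frozen HuBERT frame features
    # and frozen CLIP embeddings of the paired images.
    batch, frames = 8, 250
    speech_feats = torch.randn(batch, frames, 768)            # from HuBERT
    image_emb = F.normalize(torch.randn(batch, 512), dim=-1)  # from CLIP

    model = SpeechProjector()
    loss = clip_contrastive_loss(model(speech_feats), image_emb,
                                 model.logit_scale)
    loss.backward()  # gradients flow only into the projection head
    print(f"contrastive loss: {loss.item():.3f}")

Because the image side is CLIP's shared image-text space, a speech encoder aligned to it this way can be compared against CLIP text embeddings directly, which is what enables the zero-shot speech-text retrieval described above without any transcription supervision.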