United States courts make audio recordings of oral arguments available as public records, but these recordings rarely include speaker annotations. This paper addresses the speech audio diarization problem, answering the question "Who spoke when?" in the domain of judicial oral argument proceedings. We present a workflow for diarizing the speech of judges in audio recordings of oral arguments, a process we call Reference-Dependent Speaker Verification. We utilize a speech embedding network trained with the Generalized End-to-End loss to encode speech into d-vectors, together with a pre-defined reference audio library built from annotated data. We find that by encoding reference audio for speakers and full arguments and computing similarity scores, we achieve a 13.8% Diarization Error Rate on a held-out test set for speakers covered by the reference audio library. We evaluate our method on oral arguments of the Supreme Court of the United States, accessed through the Oyez Project, and outline future work for diarizing legal proceedings. A code repository for this research is available at github.com/JeffT13/rd-diarization
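The core scoring step described above can be illustrated with a minimal sketch: each speech segment is encoded as a d-vector and compared against a library of per-speaker reference d-vectors, with cosine similarity as the comparison metric commonly used for d-vectors. The function name, library layout, and threshold below are hypothetical, not the paper's exact implementation.

```python
import numpy as np

def diarize_segment(segment_dvec, reference_library, threshold=0.7):
    """Assign a speaker label to one segment's d-vector by cosine
    similarity against per-speaker reference d-vectors.
    (Illustrative sketch; names and threshold are hypothetical.)"""
    best_speaker, best_score = None, -1.0
    for speaker, ref_dvec in reference_library.items():
        # Cosine similarity between the segment and this reference
        score = np.dot(segment_dvec, ref_dvec) / (
            np.linalg.norm(segment_dvec) * np.linalg.norm(ref_dvec))
        if score > best_score:
            best_speaker, best_score = speaker, score
    # Segments matching no reference well enough fall outside the library
    return best_speaker if best_score >= threshold else "non-reference"

# Toy usage with 2-D "embeddings" standing in for real d-vectors
refs = {"judge_a": np.array([1.0, 0.0]), "judge_b": np.array([0.0, 1.0])}
print(diarize_segment(np.array([0.9, 0.1]), refs))    # close to judge_a
print(diarize_segment(np.array([-1.0, -1.0]), refs))  # matches no reference
```

Segments that fail the threshold would correspond to speakers not covered by the reference audio library, consistent with the error rate being reported only for covered speakers.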