Sounds reach one microphone in a stereo pair sooner than the other, resulting in an interaural time delay that conveys their directions. Estimating a sound's time delay requires finding correspondences between the signals recorded by each microphone. We propose to learn these correspondences through self-supervision, drawing on recent techniques from visual tracking. We adapt the contrastive random walk of Jabri et al. to learn a cycle-consistent representation from unlabeled stereo sounds, resulting in a model that performs on par with supervised methods on "in the wild" internet recordings. We also propose a multimodal contrastive learning model that solves a visually-guided localization task: estimating the time delay for a particular person in a multi-speaker mixture, given a visual representation of their face. Project site: https://ificl.github.io/stereocrw/
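To make the contrastive-random-walk idea concrete, here is a minimal PyTorch sketch under our own assumptions: the encoder producing per-time-step embeddings `feats_left`/`feats_right`, the temperature value, and the helper names are illustrative, not the authors' released code. A soft walk from left-channel embeddings to right-channel embeddings and back is trained toward the identity (cycle consistency), and the learned soft correspondences can then be read out as an expected interaural delay.

```python
import torch
import torch.nn.functional as F

def stereo_cycle_walk_loss(feats_left, feats_right, temperature=0.07):
    """Cycle-consistency loss between per-time-step embeddings of the two
    channels. feats_left, feats_right: (B, T, D) tensors, one embedding per
    waveform position in each channel (the encoder is assumed, not shown)."""
    feats_left = F.normalize(feats_left, dim=-1)
    feats_right = F.normalize(feats_right, dim=-1)

    # Pairwise affinities between left and right time steps.
    sim = torch.einsum('btd,bsd->bts', feats_left, feats_right) / temperature
    p_lr = sim.softmax(dim=-1)                  # walk left -> right
    p_rl = sim.transpose(1, 2).softmax(dim=-1)  # walk right -> left

    # Round-trip left -> right -> left; cycle consistency asks each step
    # to return to where it started, i.e. the product should be identity.
    p_cycle = p_lr @ p_rl                       # (B, T, T)
    target = torch.arange(p_cycle.size(1), device=p_cycle.device)
    target = target.expand(p_cycle.size(0), -1)
    return F.nll_loss(torch.log(p_cycle + 1e-8).flatten(0, 1), target.flatten())

def expected_delay(feats_left, feats_right, temperature=0.07):
    """Read out a delay (in samples) from the soft correspondences: the
    expected right-channel index matched to each left-channel step, minus
    that step's own index, averaged over time."""
    feats_left = F.normalize(feats_left, dim=-1)
    feats_right = F.normalize(feats_right, dim=-1)
    p_lr = (torch.einsum('btd,bsd->bts', feats_left, feats_right)
            / temperature).softmax(dim=-1)
    t = torch.arange(p_lr.size(1), device=p_lr.device, dtype=p_lr.dtype)
    return ((p_lr * t).sum(dim=-1) - t).mean(dim=-1)
```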