Joint time-frequency scattering (JTFS) is a convolutional operator in the time-frequency domain that extracts spectrotemporal modulations at various rates and scales. It offers an idealized model of spectrotemporal receptive fields (STRF) in the primary auditory cortex, and thus may serve as a biologically plausible surrogate for human perceptual judgments at the scale of isolated audio events. Yet, prior implementations of JTFS and STRF have remained outside of the standard toolkit of perceptual similarity measures and evaluation methods for audio generation. We trace this issue down to three limitations of existing implementations: differentiability, speed, and flexibility. In this paper, we present an implementation of time-frequency scattering in Python. Unlike prior implementations, ours accommodates NumPy, PyTorch, and TensorFlow as backends and is thus portable to both CPU and GPU. We demonstrate the usefulness of JTFS via three applications: unsupervised manifold learning of spectrotemporal modulations, supervised classification of musical instruments, and texture resynthesis of bioacoustic sounds.
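As an illustration of the differentiable, multi-backend design claimed above, the sketch below shows how a JTFS transform might be applied to a batch of audio with the PyTorch backend and back-propagated through. This is a minimal sketch under assumptions: the frontend name `TimeFrequencyScattering` and its parameters (`J`, `shape`, `Q`, `J_fr`, `Q_fr`) are assumed here for illustration and should be checked against the released package documentation.

```python
# Minimal sketch (assumed API): differentiable JTFS on GPU with the PyTorch backend.
import torch
from kymatio.torch import TimeFrequencyScattering  # assumed frontend name

N = 2 ** 16  # length of each audio excerpt, in samples
x = torch.randn(4, N, device="cuda", requires_grad=True)  # batch of 4 test signals

# Assumed parameters: J (temporal octaves), Q (wavelets per octave),
# J_fr / Q_fr (octaves and resolution of the frequential filterbank).
jtfs = TimeFrequencyScattering(J=12, shape=N, Q=8, J_fr=5, Q_fr=2).cuda()

Sx = jtfs(x)                 # spectrotemporal modulation coefficients
loss = Sx.pow(2).mean()      # any differentiable objective over the coefficients
loss.backward()              # gradients flow back to the input signal x
print(Sx.shape, x.grad.shape)
```

Because the frontend is an ordinary differentiable module, the same coefficients can serve as a perceptual distance for audio generation or be plugged into a gradient-based resynthesis loop, which is the use case motivating the differentiability requirement.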