Abusive content detection in spoken text can be addressed by performing Automatic Speech Recognition (ASR) and leveraging advancements in natural language processing. However, ASR models introduce latency and often perform sub-optimally on profane words, as these are underrepresented in training corpora and are frequently not spoken clearly or completely. Exploration of this problem entirely in the audio domain has largely been limited by the lack of audio datasets. Motivated by these challenges, we propose ADIMA, a novel, linguistically diverse, ethically sourced, expert-annotated, and well-balanced multilingual profanity detection audio dataset comprising 11,775 audio samples in 10 Indic languages, spanning 65 hours and spoken by 6,446 unique users. Through quantitative experiments across monolingual and cross-lingual zero-shot settings, we take the first step towards democratizing audio-based content moderation in Indic languages and release our dataset to pave the way for future work.