Audio-visual video parsing (AVVP) aims to detect event categories and their temporal boundaries in videos, typically under weak supervision. Existing methods mainly focus on (i) improving temporal modeling with attention-based architectures or (ii) generating richer pseudo-labels to compensate for the absence of frame-level annotations. However, attention-based models often overfit noisy pseudo-labels, accumulating training errors, while pseudo-label generation approaches tend to spread attention uniformly across frames, weakening temporal localization. To address these challenges, we propose TEn-CATG, a text-enriched AVVP framework that combines semantic calibration with category-aware temporal reasoning. Specifically, we design a bi-directional text fusion (BiT) module that, departing from conventional text-to-feature alignment, uses audio-visual features as semantic anchors to refine text embeddings, thereby mitigating noise and enhancing cross-modal consistency. We further introduce a category-aware temporal graph (CATG) module that models temporal relationships by selecting multi-scale temporal neighbors and learning category-specific temporal decay factors, enabling effective event-dependent temporal reasoning. Extensive experiments show that TEn-CATG achieves state-of-the-art results across multiple evaluation metrics on the LLP and UnAV-100 benchmarks, highlighting its robustness and its ability to capture the complex temporal and semantic dependencies in weakly supervised AVVP.
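The abstract does not specify how category-aware temporal decay over multi-scale neighbors is implemented. Below is a minimal, hypothetical PyTorch sketch of one plausible reading: each event category gets a learnable decay rate, edges connect segments at a few fixed temporal offsets, and edge weights fall off with temporal distance at a rate set by each segment's category distribution. All names (CategoryAwareTemporalGraph, class_probs, scales) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn


class CategoryAwareTemporalGraph(nn.Module):
    """Sketch of category-aware temporal message passing (hypothetical).

    Each category c has a learnable decay rate; a segment's effective decay
    is the expectation of these rates under its predicted class distribution.
    Graph edges link segments at multi-scale temporal offsets (e.g. 1, 2, 4).
    """

    def __init__(self, dim: int, num_classes: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        # One learnable decay parameter per event category (softplus keeps it positive).
        self.decay = nn.Parameter(torch.zeros(num_classes))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, class_probs: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) segment features; class_probs: (B, T, C) per-segment scores.
        B, T, D = x.shape
        # Expected decay rate per segment under its category distribution: (B, T).
        rate = class_probs @ nn.functional.softplus(self.decay)
        idx = torch.arange(T, device=x.device)
        dist = (idx[None, :] - idx[:, None]).abs().float()          # (T, T) temporal distances
        # Restrict edges to multi-scale neighbors at the chosen offsets.
        mask = torch.zeros(T, T, dtype=torch.bool, device=x.device)
        for s in self.scales:
            mask |= dist == s
        # Edge weight decays with temporal distance at a category-specific rate.
        w = torch.exp(-rate.unsqueeze(-1) * dist) * mask            # (B, T, T)
        w = w / w.sum(-1, keepdim=True).clamp_min(1e-6)             # row-normalize
        return x + self.proj(w @ x)                                 # residual message passing
```

Under this reading, a fast-changing event (e.g. a dog bark) would learn a large decay so that distant segments contribute little, while a sustained event (e.g. background music) would learn a small decay and aggregate over longer ranges; the multi-scale offsets keep the graph sparse while still reaching beyond adjacent segments.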