This paper presents the Multimodal Laryngoscopic Video Analyzing System (MLVAS), a novel system that leverages both audio and video data to automatically extract key segments and metrics from raw laryngeal videostroboscopic videos for assisted clinical assessment. The system integrates video-based glottis detection with an audio keyword spotting method to analyze both video and audio data, identifying patient vocalizations and refining video highlights to ensure optimal inspection of vocal fold movements. Additionally, MLVAS features an advanced strobing video extraction module that specifically identifies strobing frames from laryngeal videostroboscopy by analyzing hue, saturation, and value fluctuations. Beyond key segment extraction, MLVAS provides effective metrics for Vocal Fold Paralysis (VFP) detection. It employs a novel two-stage glottis segmentation process using a U-Net for initial segmentation, followed by a diffusion-based refinement to reduce false positives, providing better segmentation masks for downstream tasks. MLVAS estimates the vibration dynamics for both left and right vocal folds from the segmented glottis masks to detect unilateral VFP by measuring the angle deviation with the estimated glottal midline. Comparing the variance between left's and right's dynamics, the system effectively distinguishes between left and right VFP. We conducted several ablation studies to demonstrate the effectiveness of each module in the proposed MLVAS. The experimental results on a public segmentation dataset show the effectiveness of our proposed segmentation module. In addition, VFP classification results on a real-world clinic dataset demonstrate MLVAS's ability of providing reliable and objective metrics as well as visualization for assisted clinical diagnosis.
翻译:暂无翻译