Are extralinguistic signals such as image pixels crucial for inducing constituency grammars? While past work has shown substantial gains from multimodal cues, we investigate whether such gains persist in the presence of rich information from large language models (LLMs). We find that our approach, LLM-based C-PCFG (LC-PCFG), outperforms previous multimodal methods on unsupervised constituency parsing, achieving state-of-the-art performance across a variety of datasets. Moreover, LC-PCFG reduces parameter count by over 50% and speeds up training by 1.7x relative to image-aided models and by more than 5x relative to video-aided models. These results challenge the notion that extralinguistic signals such as image pixels are needed for unsupervised grammar induction, and point to the need for stronger text-only baselines when evaluating the necessity of multimodality for this task.
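To make the idea concrete, the sketch below illustrates one way a compound-PCFG-style parameterization could be conditioned on a frozen LLM sentence embedding instead of a per-sentence latent variable. This is a hypothetical illustration, not the paper's implementation; all module names, dimensions, and hyperparameters (`n_nt`, `n_pt`, `vocab_size`, `llm_dim`, `hidden`) are assumptions chosen for readability.

```python
import torch
import torch.nn as nn


class LLMConditionedRules(nn.Module):
    """Hypothetical sketch: derive PCFG rule log-probabilities (in CNF)
    from a frozen LLM sentence embedding, in the spirit of a compound
    PCFG whose per-sentence latent is replaced by LLM features."""

    def __init__(self, n_nt: int = 30, n_pt: int = 60,
                 vocab_size: int = 10_000, llm_dim: int = 768,
                 hidden: int = 256):
        super().__init__()
        self.n_nt, self.n_pt = n_nt, n_pt
        n_sym = n_nt + n_pt  # children of binary rules may be NT or PT
        self.proj = nn.Sequential(nn.Linear(llm_dim, hidden), nn.ReLU())
        # One score head per rule type of a PCFG in Chomsky normal form.
        self.root_head = nn.Linear(hidden, n_nt)               # S -> A
        self.binary_head = nn.Linear(hidden, n_nt * n_sym**2)  # A -> B C
        self.term_head = nn.Linear(hidden, n_pt * vocab_size)  # T -> w

    def forward(self, llm_emb: torch.Tensor):
        """llm_emb: (batch, llm_dim) frozen LLM sentence embedding."""
        h = self.proj(llm_emb)
        b = h.size(0)
        n_sym = self.n_nt + self.n_pt
        # Normalize each rule family into a proper distribution.
        root = self.root_head(h).log_softmax(-1)
        binary = (self.binary_head(h)
                  .view(b, self.n_nt, -1).log_softmax(-1)
                  .view(b, self.n_nt, n_sym, n_sym))
        terms = self.term_head(h).view(b, self.n_pt, -1).log_softmax(-1)
        # These tensors would feed an inside-algorithm marginal likelihood.
        return root, binary, terms
```

Under this framing, the only per-sentence signal is the text-derived LLM embedding, which is what makes the comparison against image- and video-aided grammar induction meaningful.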