Dialogue topic segmentation supports summarization, retrieval, memory management, and conversational continuity. Despite decades of work, evaluation practice remains dominated by strict boundary matching and F1-based metrics. Modern conversational systems built on large language models (LLMs) increasingly rely on segmentation to manage conversation history beyond fixed context windows; in such systems, unstructured context accumulation degrades efficiency and coherence. This paper introduces an evaluation framework that reports boundary density and segment alignment diagnostics (purity and coverage) alongside window-tolerant F1 (W-F1). By separating boundary scoring from boundary selection, we evaluate segmentation quality across density regimes rather than at a single operating point. Cross-dataset evaluation shows that reported performance differences often reflect annotation granularity mismatch rather than boundary placement quality alone. We evaluate structurally distinct segmentation strategies across eight dialogue datasets spanning task-oriented, open-domain, meeting-style, and synthetic interactions. Boundary-based metrics are strongly coupled to boundary density: threshold sweeps produce larger W-F1 changes than switching between methods. These findings support viewing topic segmentation as a granularity selection problem rather than as the prediction of a single correct boundary set, and they motivate decoupling boundary scoring from boundary selection when analyzing and tuning segmentation under varying annotation granularities.
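To make the evaluation quantities concrete, the following is a minimal sketch of window-tolerant boundary F1 and boundary density. It assumes boundaries are represented as utterance indices, that a predicted boundary matches a reference boundary if it lies within a fixed tolerance window, and that matching is greedy and one-to-one; the window size and the exact matching rule used in the paper are assumptions here, not specifications from the abstract.

```python
def window_tolerant_f1(reference, predicted, window=2):
    """Boundary F1 with a +/- `window` tolerance (illustrative sketch)."""
    ref = sorted(reference)
    hyp = sorted(predicted)
    matched_ref = set()
    tp = 0
    for b in hyp:
        # Closest unmatched reference boundary within the tolerance window.
        candidates = [r for r in ref if abs(r - b) <= window and r not in matched_ref]
        if candidates:
            matched_ref.add(min(candidates, key=lambda r: abs(r - b)))
            tp += 1
    precision = tp / len(hyp) if hyp else 0.0
    recall = tp / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def boundary_density(boundaries, num_utterances):
    """Boundaries per utterance; used to compare annotation granularity across datasets."""
    return len(boundaries) / num_utterances


# Example: reference boundaries at utterances 5 and 12, predictions at 6 and 20.
# The prediction at 6 matches 5 within the window; 20 matches nothing.
print(window_tolerant_f1({5, 12}, {6, 20}, window=2))   # 0.5 (precision 0.5, recall 0.5)
print(boundary_density({5, 12}, num_utterances=30))      # ~0.067 boundaries per utterance
```

Reporting boundary density alongside W-F1 makes visible when a method's apparent gain comes from predicting more (or fewer) boundaries rather than from placing them better, which is the granularity mismatch the abstract highlights.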