Recent efforts such as TripCraft and TravelPlanner have advanced the use of Large Language Models (LLMs) for personalized, constraint-aware travel itinerary generation. Yet real travel is often disrupted. To address this, we present TripTide, the first benchmark that evaluates the ability of LLMs to revise itineraries under realistic disruptions. TripTide models key dimensions such as disruption severity and traveler tolerance, enabling nuanced assessment of LLM adaptability to events such as flight cancellations, weather closures, and overbooked attractions. We conduct a threefold evaluation. First, we introduce automatic metrics: Preservation of Intent (how well the revised plan maintains feasibility and the traveler's goals), Responsiveness (the promptness and appropriateness of disruption handling), and Adaptability (the semantic, spatial, and sequential divergence between the original and revised plans). Second, we apply an LLM-as-a-judge approach to automatically assess revision quality. Third, we perform manual expert evaluation to verify whether revisions preserve semantic, spatial, sequential, and responsive aspects. Our experiments show that LLMs maintain strong sequential consistency and semantic stability, whereas spatial deviation is larger for shorter trips and decreases for longer ones, indicating that extended plans encourage better geographic coherence. However, disruption-handling ability declines as plan length increases, highlighting limits in LLM robustness. TripTide establishes a benchmark for evaluating adaptability, personalization, and resilience in LLM-based travel planning under real-world uncertainty.
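To make the Adaptability metric concrete, the sketch below shows one plausible way to score the spatial, sequential, and semantic divergence between an original and a revised itinerary. This is a minimal illustration, not the paper's actual formulation: the itinerary representation, the function names, and the use of positional alignment, sequence-matcher ratio, and Jaccard distance (as a cheap stand-in for embedding-based semantic similarity) are all assumptions for exposition.

```python
import math
from difflib import SequenceMatcher

# Hypothetical itinerary representation (not from the paper):
# an ordered list of stops, each (name, lat, lon).

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def spatial_divergence(orig, revised):
    """Mean distance (km) between positionally aligned stops.
    A crude proxy; a real evaluator might use optimal matching."""
    pairs = list(zip(orig, revised))
    if not pairs:
        return 0.0
    return sum(haversine_km(o[1:], r[1:]) for o, r in pairs) / len(pairs)

def sequential_divergence(orig, revised):
    """1 - similarity ratio over ordered activity labels,
    so 0.0 means the visit order is unchanged."""
    a = [s[0] for s in orig]
    b = [s[0] for s in revised]
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def semantic_divergence(orig, revised):
    """Jaccard distance over activity sets; a stand-in for the
    embedding-based similarity a real evaluator would likely use."""
    a, b = {s[0] for s in orig}, {s[0] for s in revised}
    return 1.0 - len(a & b) / len(a | b) if (a | b) else 0.0

if __name__ == "__main__":
    original = [("Louvre", 48.8606, 2.3376),
                ("Eiffel Tower", 48.8584, 2.2945)]
    # Disruption: the Louvre closes, so the revised plan swaps in
    # the nearby Musée d'Orsay while keeping the rest of the order.
    revised = [("Musée d'Orsay", 48.8600, 2.3266),
               ("Eiffel Tower", 48.8584, 2.2945)]
    print(f"spatial:    {spatial_divergence(original, revised):.3f} km")
    print(f"sequential: {sequential_divergence(original, revised):.3f}")
    print(f"semantic:   {semantic_divergence(original, revised):.3f}")
```

Under this toy scoring, a good revision keeps sequential and semantic divergence low while incurring only a small spatial shift, mirroring the abstract's finding that strong revisions preserve order and intent while staying geographically coherent.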