Recent efforts such as TripCraft and TravelPlanner have advanced the use of Large Language Models (LLMs) for personalized, constraint-aware travel itinerary generation. Yet real-world travel is often disrupted. To address this, we present TripTide, the first benchmark that evaluates LLMs' ability to revise itineraries under realistic disruptions. TripTide models key dimensions such as disruption severity and traveler tolerance, enabling nuanced assessment of how well LLMs adapt to events like flight cancellations, weather closures, or overbooked attractions. We conduct a threefold evaluation. First, we introduce automatic metrics: Preservation of Intent (how well the revised plan maintains feasibility and goals), Responsiveness (promptness and appropriateness of disruption handling), and Adaptability (semantic, spatial, and sequential divergence between the original and revised plans). Second, we apply an LLM-as-a-judge approach to automatically assess revision quality. Third, we perform manual expert evaluation to verify whether revisions preserve semantic, spatial, sequential, and responsive aspects. Our experiments show that LLMs maintain strong sequential consistency and semantic stability. Spatial deviations are larger for shorter trips and decrease for longer ones, indicating that extended plans encourage better geographic coherence. However, disruption-handling ability declines as plan length increases, highlighting limits in LLM robustness. TripTide establishes a benchmark for evaluating adaptability, personalization, and resilience in LLM-based travel planning under real-world uncertainty.
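The abstract names sequential and spatial divergence as components of Adaptability but does not define them. As a purely illustrative sketch, and not the definitions used by TripTide, sequential divergence could be approximated by a normalized edit distance over the ordered stops, and spatial divergence by the mean great-circle distance between stops at the same position in the two plans. The function names, the position-wise pairing, and the sample stop names in the usage note are all assumptions made for this example.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def sequential_divergence(original, revised):
    """Normalized Levenshtein distance between two ordered lists of stop names.

    Returns 0.0 for identical sequences and 1.0 when every position differs.
    (Hypothetical proxy; the benchmark's own definition may differ.)
    """
    m, n = len(original), len(revised)
    if max(m, n) == 0:
        return 0.0
    prev = list(range(n + 1))  # distances from the empty prefix of `original`
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if original[i - 1] == revised[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[n] / max(m, n)


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def spatial_divergence_km(original_coords, revised_coords):
    """Mean distance between stops paired by position.

    Assumption: stops are compared position by position, and any extra
    stops in the longer plan are ignored.
    """
    pairs = list(zip(original_coords, revised_coords))
    if not pairs:
        return 0.0
    return sum(haversine_km(a, b) for a, b in pairs) / len(pairs)
```

For instance, `sequential_divergence(["museum", "park", "harbor"], ["museum", "gallery", "harbor"])` counts one substitution out of three stops. A real evaluation would more likely align stops by day or time slot than by raw list position.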