We present ExAct, a new video-language benchmark for expert-level understanding of skilled physical human activities. The benchmark contains 3,521 expert-curated video question-answer pairs spanning 11 physical activities across 6 domains: Sports, Bike Repair, Cooking, Health, Music, and Dance. ExAct requires the correct answer to be selected from five carefully designed candidate options, thus demanding a nuanced, fine-grained, expert-level understanding of physical human skills. Evaluating recent state-of-the-art VLMs on ExAct reveals a substantial gap relative to human expert performance. Specifically, the best-performing model, GPT-4o, achieves only 44.70% accuracy, well below the 82.02% attained by trained human specialists/experts. We believe that ExAct will be beneficial for developing and evaluating VLMs capable of precise understanding of human skills in various physical and procedural domains. Dataset and code are available at https://texaser.github.io/exact_project_page/