Data privacy is a central problem for embodied agents that perceive the environment, communicate with humans, and act in the real world. While helping humans complete tasks, an agent may observe and process users' sensitive information, such as their home environments and activities. In this work, we introduce privacy-preserving embodied agent learning for the task of Vision-and-Language Navigation (VLN), in which an embodied agent navigates house environments by following natural language instructions. We treat each house environment as a local client that shares nothing other than local updates with the cloud server and other clients, and we propose a novel federated vision-and-language navigation (FedVLN) framework to protect data privacy during both training and pre-exploration. In particular, we propose a decentralized training strategy that confines each client's data to its own local model training, and a federated pre-exploration method that performs partial model aggregation to improve generalization to unseen environments. Extensive results on the R2R and RxR datasets show that, under the FedVLN framework, decentralized VLN models achieve results comparable to centralized training while protecting the privacy of seen environments, and federated pre-exploration significantly outperforms centralized pre-exploration while preserving the privacy of unseen environments.
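To make the idea of partial model aggregation concrete, the following is a minimal sketch, not the FedVLN implementation. It assumes a plain sample-weighted federated average over a chosen subset of parameter groups, with all other groups staying on the client. The abstract does not specify which parts of the model are aggregated or how clients are weighted, so the group names (`shared_module`, `local_module`), the weighting by client size, and the helper functions are all hypothetical.

```python
from typing import Dict, List, Set

# A model is represented as named parameter groups holding flat lists of floats.
Params = Dict[str, List[float]]


def federated_average(
    client_params: List[Params],
    client_sizes: List[int],
    shared_keys: Set[str],
) -> Params:
    """Sample-weighted average of the shared parameter groups only (assumed scheme)."""
    total = sum(client_sizes)
    aggregated: Params = {}
    for key in shared_keys:
        length = len(client_params[0][key])
        aggregated[key] = [
            sum(p[key][i] * n for p, n in zip(client_params, client_sizes)) / total
            for i in range(length)
        ]
    return aggregated


def apply_global(local: Params, global_params: Params) -> Params:
    """Overwrite shared groups with the server's values; keep the rest local."""
    merged = {k: list(v) for k, v in local.items()}
    for key, value in global_params.items():
        merged[key] = list(value)
    return merged


# Two hypothetical clients (house environments) after local training.
client_a = {"shared_module": [1.0, 2.0], "local_module": [10.0]}
client_b = {"shared_module": [3.0, 6.0], "local_module": [20.0]}

# The server only ever sees parameters, never the clients' raw observations.
global_params = federated_average(
    [client_a, client_b], client_sizes=[1, 3], shared_keys={"shared_module"}
)
new_a = apply_global(client_a, global_params)

print(global_params)  # {'shared_module': [2.5, 5.0]}
print(new_a)          # {'shared_module': [2.5, 5.0], 'local_module': [10.0]}
```

In this sketch only `shared_module` leaves the client, so anything learned in `local_module` stays specific to that environment. Which groups to share is a design choice that the abstract leaves open.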