Cross-device federated learning is an emerging machine learning (ML) paradigm in which a large population of devices collectively trains an ML model while the data remains on the devices. This research field has a unique set of practical challenges, and systematic progress requires new datasets curated to be compatible with this paradigm. Existing federated learning benchmarks in the image domain do not accurately capture the scale and heterogeneity of many real-world use cases. We introduce FLAIR, a challenging large-scale annotated image dataset for multi-label classification suitable for federated learning. FLAIR has 429,078 images from 51,414 Flickr users and captures many of the intricacies typically encountered in federated learning, such as heterogeneous user data and a long-tailed label distribution. We implement multiple baselines in different learning setups for different tasks on this dataset. We believe FLAIR can serve as a challenging benchmark for advancing the state of the art in federated learning. Dataset access and the code for the benchmark are available at https://github.com/apple/ml-flair.