Most deep-learning frameworks for understanding biological swarms are designed to fit perceptive models of group behavior to individual-level data (e.g., spatial coordinates of identified features of individuals) that have been separately gathered from video observations. Despite considerable advances in automated tracking, these methods remain expensive or unreliable when tracking large numbers of animals simultaneously. Moreover, this approach assumes that the human-chosen features are sufficient to explain important patterns in collective behavior. To address these issues, we propose training deep network models to predict system-level states directly from generic graphical features of the entire view, which can be gathered relatively inexpensively and in a completely automated fashion. Because the resulting predictive models are not based on human-understood predictors, we use explanatory modules (e.g., Grad-CAM) that combine information hidden in the latent variables of the deep-network model with the video data itself to communicate to a human observer which aspects of observed individual behaviors are most informative in predicting group behavior. This is an example of augmented intelligence in behavioral ecology: knowledge co-creation in a human-AI team. As proof of concept, we use a 20-day video recording of a colony of over 50 Harpegnathos saltator ants to show that, without any individual annotations, a trained model can generate an "importance map" across the video frames that highlights regions of important behaviors, such as dueling (of which the AI has no a priori knowledge), that play a role in the resolution of reproductive-hierarchy re-formation. Based on the empirical results, we also discuss the potential uses and current challenges of this approach.
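To make the idea of an "importance map" concrete, the following is a minimal sketch of the standard Grad-CAM weighting step, written in plain Python. It is not the pipeline used in this work. It assumes that the activations of some convolutional layer and the gradients of the predicted system-level state with respect to those activations have already been extracted. The function name `grad_cam` and the toy numbers are hypothetical, and the upsampling of the map to frame resolution is omitted.

```python
# Minimal Grad-CAM weighting sketch (pure Python, no deep-learning framework).
# Assumption: `activations` and `gradients` were already extracted from some
# convolutional layer; both are lists of K channels, each an H x W grid.

def grad_cam(activations, gradients):
    """Return an H x W importance map scaled to [0, 1]."""
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weight = spatial average of the gradient of the target score
    # with respect to that channel's activations.
    weights = [
        sum(sum(row) for row in grad) / (height * width)
        for grad in gradients
    ]

    # Weighted sum of channels, followed by ReLU to keep positive evidence only.
    cam = [
        [
            max(0.0, sum(w * act[i][j] for w, act in zip(weights, activations)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Normalize so the map can be overlaid on a video frame as a heatmap.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy example: two channels on a 2 x 2 grid (made-up numbers).
toy_activations = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
toy_gradients = [
    [[0.5, 0.5], [0.5, 0.5]],      # spatial mean 0.5
    [[-1.0, -1.0], [-1.0, -1.0]],  # spatial mean -1.0
]
importance = grad_cam(toy_activations, toy_gradients)
```

In the toy example, the second channel receives a negative weight, so locations where only that channel is active are suppressed by the ReLU. The highest importance falls where the positively weighted channel is strongest.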