How can we build AI systems that are aligned with human values and objectives, so that they avoid causing harm or violating societal standards of acceptable behavior? Making AI systems learn human-like representations of the world has many known benefits, including improved generalization, robustness to domain shifts, and few-shot learning performance. We propose that this kind of representational alignment between machine learning (ML) models and humans is also a necessary condition for value alignment, where ML systems conform to human values and societal norms. We focus on ethics as one aspect of value alignment and train multiple ML agents (support vector regression and kernel regression) in a multi-armed bandit setting, where rewards are sampled from a distribution that reflects the morality of the chosen action. We then study the relationship between each agent's degree of representational alignment with humans and its performance when learning to take the most ethical actions.
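As a rough illustration of the setup described above, the sketch below pairs an epsilon-greedy multi-armed bandit with a kernel-regression reward model: each arm's reward is drawn from a distribution centered on its morality score, and the agent's feature representation is interpolated between "human" features and noise to vary representational alignment. This is a hedged sketch, not the paper's implementation; all names (agent_features, run_bandit), dimensions, and hyperparameters are assumptions introduced for illustration.

```python
# Illustrative sketch only: a morality-based bandit with a kernel-regression agent.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
n_arms, n_features = 10, 5

# Stand-in "human" representations of the actions, and a morality score that is
# a smooth function of those features (an assumption), so that an agent whose
# representation aligns with the human one can generalize across actions.
human_features = rng.normal(size=(n_arms, n_features))
weights = rng.normal(size=n_features)
morality = np.tanh(human_features @ weights)

def agent_features(alignment: float) -> np.ndarray:
    """Interpolate between human-aligned features and random noise."""
    noise = rng.normal(size=human_features.shape)
    return alignment * human_features + (1.0 - alignment) * noise

def run_bandit(features: np.ndarray, n_rounds: int = 500, eps: float = 0.1) -> float:
    """Epsilon-greedy bandit with a kernel-regression reward model; returns mean reward."""
    X, y, total = [], [], 0.0
    model = KernelRidge(kernel="rbf", alpha=1.0)
    for t in range(n_rounds):
        if t < n_arms or rng.random() < eps:
            arm = int(rng.integers(n_arms))                    # explore
        else:
            arm = int(np.argmax(model.predict(features)))      # exploit predicted reward
        reward = rng.normal(loc=morality[arm], scale=0.5)      # reward reflects morality
        X.append(features[arm]); y.append(reward); total += reward
        model.fit(np.array(X), np.array(y))                    # refit reward model each round
    return total / n_rounds

for alignment in (0.0, 0.5, 1.0):
    print(f"alignment={alignment:.1f}  mean reward={run_bandit(agent_features(alignment)):.3f}")
```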