This paper introduces semantic features as a candidate conceptual framework for building inherently interpretable neural networks. A proof of concept model for informative subproblem of MNIST consists of 4 such layers with the total of 5K learnable parameters. The model is well-motivated, inherently interpretable, requires little hyperparameter tuning and achieves human-level adversarial test accuracy - with no form of adversarial training! These results and the general nature of the approach warrant further research on semantic features. The code is available at https://github.com/314-Foundation/white-box-nn
翻译:暂无翻译