Hangul Fonts Dataset: a Hierarchical and Compositional Dataset for Investigating Learned Representations

Hierarchy and compositionality are common latent properties in many natural and scientific datasets. Determining when a deep network's hidden activations represent hierarchy and compositionality is important both for understanding deep representation learning and for applying deep networks in domains where interpretability is crucial. However, current benchmark machine learning datasets either have little hierarchical or compositional structure, or the structure is not known. This gap impedes precise analysis of a network's representations and thus hinders development of new methods that can learn such properties. To address this gap, we developed a new benchmark dataset with known hierarchical and compositional structure. The Hangul Fonts Dataset (HFD) comprises 35 fonts from the Korean writing system (Hangul), each with 11,172 blocks (syllables) composed from the product of initial consonant, medial vowel, and final consonant glyphs. All blocks can be grouped into a few geometric types, which induces a hierarchy across blocks. In addition, each block is composed of individual glyphs with rotations, translations, scalings, and naturalistic style variation across fonts. We find that both shallow and deep unsupervised methods show only modest evidence of hierarchy and compositionality in their representations of the HFD compared to supervised deep networks. Supervised deep network representations contain structure related to the geometrical hierarchy of the characters, but the compositional structure of the data is not evident. Thus, HFD enables the identification of shortcomings in existing methods, a critical first step toward developing new machine learning algorithms to extract hierarchical and compositional structure in the context of naturalistic variability.
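The compositional structure of the blocks can be illustrated with a minimal sketch. It assumes the standard Unicode encoding of precomposed Hangul syllables (U+AC00 to U+D7A3), in which 19 initial consonants, 21 medial vowels, and 28 final slots (27 consonants plus "no final") give 19 × 21 × 28 = 11,172 blocks. The abstract states only the total, so these per-position counts come from Unicode rather than from the text above. The function names `decompose` and `compose` are hypothetical and are not part of the HFD itself, which consists of rendered font images rather than code points.

```python
from typing import Tuple

# Assumption: blocks are identified by their Unicode code points in the
# precomposed Hangul Syllables range U+AC00..U+D7A3.
HANGUL_BASE = 0xAC00
N_INITIAL = 19  # initial consonants
N_MEDIAL = 21   # medial vowels
N_FINAL = 28    # final consonants, index 0 meaning "no final consonant"
N_BLOCKS = N_INITIAL * N_MEDIAL * N_FINAL  # 11,172


def decompose(block: str) -> Tuple[int, int, int]:
    """Return (initial, medial, final) glyph indices for one Hangul block."""
    offset = ord(block) - HANGUL_BASE
    if not 0 <= offset < N_BLOCKS:
        raise ValueError(f"{block!r} is not a precomposed Hangul syllable")
    initial, rest = divmod(offset, N_MEDIAL * N_FINAL)
    medial, final = divmod(rest, N_FINAL)
    return initial, medial, final


def compose(initial: int, medial: int, final: int) -> str:
    """Inverse of decompose: build the block from its three glyph indices."""
    if not (0 <= initial < N_INITIAL and 0 <= medial < N_MEDIAL
            and 0 <= final < N_FINAL):
        raise ValueError("glyph index out of range")
    return chr(HANGUL_BASE + (initial * N_MEDIAL + medial) * N_FINAL + final)


if __name__ == "__main__":
    print(N_BLOCKS)
    print(decompose("한"))
    print(compose(*decompose("글")))
```

The three indices returned by `decompose` could serve as the compositional labels against which a learned representation is compared; the geometric block types mentioned above are a coarser grouping that this sketch does not compute.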
