Differentiable architecture search (DARTS) has been successfully applied to many vision tasks. However, directly applying DARTS to Transformers is memory-intensive, which renders the search process infeasible. To this end, we propose a multi-split reversible network and combine it with DARTS. Specifically, we devise a backpropagation-with-reconstruction algorithm so that we only need to store the last layer's outputs; each layer's inputs are reconstructed on the fly during the backward pass. Relieving the memory burden of DARTS allows us to search with a larger hidden size and more candidate operations. We evaluate the searched architecture on three sequence-to-sequence datasets, i.e., WMT'14 English-German, WMT'14 English-French, and WMT'14 English-Czech. Experimental results show that our network consistently outperforms standard Transformers across these tasks. Moreover, our method compares favorably with the big Evolved Transformer while reducing search computation by an order of magnitude.
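To make the memory saving concrete, below is a minimal two-split reversible block in the style of RevNet (Gomez et al., 2017), not the paper's multi-split design: the sub-networks `F` and `G`, their toy tanh parameterization, and all dimensions are illustrative assumptions. The point it demonstrates is that the forward pass may discard intermediate activations and keep only the final outputs, since each layer's inputs can be reconstructed exactly during the backward pass.

```python
import numpy as np

# Two-split reversible block (RevNet-style sketch; the paper uses a
# multi-split generalization, so F/G and shapes here are assumptions).
# Forward:  y1 = x1 + F(x2),  y2 = x2 + G(y1)
# Inverse:  x2 = y2 - G(y1),  x1 = y1 - F(x2)

def make_fn(d, seed):
    # Stand-in for an arbitrary sub-network (attention, FFN, etc.).
    W = np.random.default_rng(seed).standard_normal((d, d)) * 0.1
    return lambda h: np.tanh(h @ W)

class ReversibleBlock:
    def __init__(self, d, seed):
        self.F = make_fn(d, seed)
        self.G = make_fn(d, seed + 1)

    def forward(self, x1, x2):
        y1 = x1 + self.F(x2)
        y2 = x2 + self.G(y1)
        return y1, y2          # inputs need not be stored

    def inverse(self, y1, y2):
        x2 = y2 - self.G(y1)   # reconstruct inputs exactly
        x1 = y1 - self.F(x2)
        return x1, x2

d, depth = 8, 4
blocks = [ReversibleBlock(d, 10 * i) for i in range(depth)]

x1 = np.random.default_rng(0).standard_normal((2, d))
x2 = np.random.default_rng(1).standard_normal((2, d))
h1, h2 = x1, x2
for b in blocks:               # forward: keep only the last outputs
    h1, h2 = b.forward(h1, h2)
for b in reversed(blocks):     # backward would reconstruct each
    h1, h2 = b.inverse(h1, h2) # layer's inputs from its outputs
assert np.allclose(h1, x1) and np.allclose(h2, x2)
```

Because activation memory no longer grows with depth, the architecture-search supergraph (with its many candidate operations per layer) fits in memory at a larger hidden size, which is the property the abstract relies on.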