For deploying a deep learning model into production, it needs to be both accurate and compact to meet the latency and memory constraints. This usually results in a network that is deep (to ensure performance) and yet thin (to improve computational efficiency). In this paper, we propose an efficient method to train a deep thin network with a theoretic guarantee. Our method is motivated by model compression. It consists of three stages. In the first stage, we sufficiently widen the deep thin network and train it until convergence. In the second stage, we use this well-trained deep wide network to warm up (or initialize) the original deep thin network. This is achieved by letting the thin network imitate the immediate outputs of the wide network from layer to layer. In the last stage, we further fine tune this well initialized deep thin network. The theoretical guarantee is established by using mean field analysis, which shows the advantage of layerwise imitation over traditional training deep thin networks from scratch by backpropagation. We also conduct large-scale empirical experiments to validate our approach. By training with our method, ResNet50 can outperform ResNet101, and BERT_BASE can be comparable with BERT_LARGE, where both the latter models are trained via the standard training procedures as in the literature.

14
下载
关闭预览

相关内容

Networking:IFIP International Conferences on Networking。 Explanation:国际网络会议。 Publisher:IFIP。 SIT: http://dblp.uni-trier.de/db/conf/networking/index.html

Since hardware resources are limited, the objective of training deep learning models is typically to maximize accuracy subject to the time and memory constraints of training and inference. We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute: self-supervised pretraining and high-resource machine translation. We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps. Moreover, this acceleration in convergence typically outpaces the additional computational overhead of using larger models. Therefore, the most compute-efficient training strategy is to counterintuitively train extremely large models but stop after a small number of iterations. This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models. However, we show that large models are more robust to compression techniques such as quantization and pruning than small models. Consequently, one can get the best of both worlds: heavily compressed, large models achieve higher accuracy than lightly compressed, small models.

0
10
下载
预览

The Transformer is widely used in natural language processing tasks. To train a Transformer however, one usually needs a carefully designed learning rate warm-up stage, which is shown to be crucial to the final performance but will slow down the optimization and bring more hyper-parameter tunings. In this paper, we first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters. Specifically, we prove with mean field theory that at initialization, for the original-designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large. Therefore, using a large learning rate on those gradients makes the training unstable. The warm-up stage is practically helpful for avoiding this problem. On the other hand, our theory also shows that if the layer normalization is put inside the residual blocks (recently proposed as Pre-LN Transformer), the gradients are well-behaved at initialization. This motivates us to remove the warm-up stage for the training of Pre-LN Transformers. We show in our experiments that Pre-LN Transformers without the warm-up stage can reach comparable results with baselines while requiring significantly less training time and hyper-parameter tuning on a wide range of applications.

0
4
下载
预览

Model efficiency has become increasingly important in computer vision. In this paper, we systematically study various neural network architecture design choices for object detection and propose several key optimizations to improve efficiency. First, we propose a weighted bi-directional feature pyramid network (BiFPN), which allows easy and fast multi-scale feature fusion; Second, we propose a compound scaling method that uniformly scales the resolution, depth, and width for all backbone, feature network, and box/class prediction networks at the same time. Based on these optimizations, we have developed a new family of object detectors, called EfficientDet, which consistently achieve an order-of-magnitude better efficiency than prior art across a wide spectrum of resource constraints. In particular, without bells and whistles, our EfficientDet-D7 achieves stateof-the-art 51.0 mAP on COCO dataset with 52M parameters and 326B FLOPS1 , being 4x smaller and using 9.3x fewer FLOPS yet still more accurate (+0.3% mAP) than the best previous detector.

0
5
下载
预览

Deep Learning is applied to energy markets to predict extreme loads observed in energy grids. Forecasting energy loads and prices is challenging due to sharp peaks and troughs that arise due to supply and demand fluctuations from intraday system constraints. We propose deep spatio-temporal models and extreme value theory (EVT) to capture theses effects and in particular the tail behavior of load spikes. Deep LSTM architectures with ReLU and $\tanh$ activation functions can model trends and temporal dependencies while EVT captures highly volatile load spikes above a pre-specified threshold. To illustrate our methodology, we use hourly price and demand data from 4719 nodes of the PJM interconnection, and we construct a deep predictor. We show that DL-EVT outperforms traditional Fourier time series methods, both in-and out-of-sample, by capturing the observed nonlinearities in prices. Finally, we conclude with directions for future research.

0
4
下载
预览

Why deep neural networks (DNNs) capable of overfitting often generalize well in practice is a mystery in deep learning. Existing works indicate that this observation holds for both complicated real datasets and simple datasets of one-dimensional (1-d) functions. In this work, for natural images and low-frequency dominant 1-d functions, we empirically found that a DNN with common settings first quickly captures the dominant low-frequency components, and then relatively slowly captures high-frequency ones. We call this phenomenon Frequency Principle (F-Principle). F-Principle can be observed over various DNN setups of different activation functions, layer structures and training algorithms in our experiments. F-Principle can be used to understand (i) the behavior of DNN training in the information plane and (ii) why DNNs often generalize well albeit its ability of overfitting. This F-Principle potentially can provide insights into understanding the general principle underlying DNN optimization and generalization for real datasets.

0
4
下载
预览

This paper addresses the problem of formally verifying desirable properties of neural networks, i.e., obtaining provable guarantees that neural networks satisfy specifications relating their inputs and outputs (robustness to bounded norm adversarial perturbations, for example). Most previous work on this topic was limited in its applicability by the size of the network, network architecture and the complexity of properties to be verified. In contrast, our framework applies to a general class of activation functions and specifications on neural network inputs and outputs. We formulate verification as an optimization problem (seeking to find the largest violation of the specification) and solve a Lagrangian relaxation of the optimization problem to obtain an upper bound on the worst case violation of the specification being verified. Our approach is anytime i.e. it can be stopped at any time and a valid bound on the maximum violation can be obtained. We develop specialized verification algorithms with provable tightness guarantees under special assumptions and demonstrate the practical significance of our general verification approach on a variety of verification tasks.

0
3
下载
预览

In this paper we discuss policy iteration methods for approximate solution of a finite-state discounted Markov decision problem, with a focus on feature-based aggregation methods and their connection with deep reinforcement learning schemes. We introduce features of the states of the original problem, and we formulate a smaller "aggregate" Markov decision problem, whose states relate to the features. The optimal cost function of the aggregate problem, a nonlinear function of the features, serves as an architecture for approximation in value space of the optimal cost function or the cost functions of policies of the original problem. We discuss properties and possible implementations of this type of aggregation, including a new approach to approximate policy iteration. In this approach the policy improvement operation combines feature-based aggregation with reinforcement learning based on deep neural networks, which is used to obtain the needed features. We argue that the cost function of a policy may be approximated much more accurately by the nonlinear function of the features provided by aggregation, than by the linear function of the features provided by deep reinforcement learning, thereby potentially leading to more effective policy improvement.

0
8
下载
预览

Recent years have witnessed significant progresses in deep Reinforcement Learning (RL). Empowered with large scale neural networks, carefully designed architectures, novel training algorithms and massively parallel computing devices, researchers are able to attack many challenging RL problems. However, in machine learning, more training power comes with a potential risk of more overfitting. As deep RL techniques are being applied to critical problems such as healthcare and finance, it is important to understand the generalization behaviors of the trained agents. In this paper, we conduct a systematic study of standard RL agents and find that they could overfit in various ways. Moreover, overfitting could happen "robustly": commonly used techniques in RL that add stochasticity do not necessarily prevent or detect overfitting. In particular, the same agents and learning algorithms could have drastically different test performance, even when all of them achieve optimal rewards during training. The observations call for more principled and careful evaluation protocols in RL. We conclude with a general discussion on overfitting in RL and a study of the generalization behaviors from the perspective of inductive bias.

0
7
下载
预览

Object tracking is an essential problem in computer vision that has been researched for several decades. One of the main challenges in tracking is to adapt to object appearance changes over time, in order to avoid drifting to background clutter. We address this challenge by proposing a deep neural network architecture composed of different parts, which functions as a society of tracking parts. The parts work in conjunction according to a certain policy and learn from each other in a robust manner, using co-occurrence constraints that ensure robust inference and learning. From a structural point of view, our network is composed of two main pathways. One pathway is more conservative. It carefully monitors a large set of simple tracker parts learned as linear filters over deep feature activation maps. It assigns the parts different roles. It promotes the reliable ones and removes the inconsistent ones. We learn these filters simultaneously in an efficient way, with a single closed-form formulation for which we propose novel theoretical properties. The second pathway is more progressive. It is learned completely online and thus it is able to better model object appearance changes. In order to adapt in a robust manner, it is learned only on highly confident frames, which are decided using co-occurrences with the first pathway. Thus, our system has the full benefit of two main approaches in tracking. The larger set of simpler filter parts offers robustness, while the full deep network learned online provides adaptability to change. As shown in the experimental section, our approach achieves state of the art performance on the challenging VOT17 benchmark, outperforming the existing published methods both on the general EAO metric as well as in the number of fails by a significant margin.

0
4
下载
预览

We develop an approach to risk minimization and stochastic optimization that provides a convex surrogate for variance, allowing near-optimal and computationally efficient trading between approximation and estimation error. Our approach builds off of techniques for distributionally robust optimization and Owen's empirical likelihood, and we provide a number of finite-sample and asymptotic results characterizing the theoretical performance of the estimator. In particular, we show that our procedure comes with certificates of optimality, achieving (in some scenarios) faster rates of convergence than empirical risk minimization by virtue of automatically balancing bias and variance. We give corroborating empirical evidence showing that in practice, the estimator indeed trades between variance and absolute performance on a training sample, improving out-of-sample (test) performance over standard empirical risk minimization for a number of classification problems.

1
5
下载
预览
小贴士
相关论文
Zhuohan Li,Eric Wallace,Sheng Shen,Kevin Lin,Kurt Keutzer,Dan Klein,Joseph E. Gonzalez
10+阅读 · 2020年6月23日
Ruibin Xiong,Yunchang Yang,Di He,Kai Zheng,Shuxin Zheng,Chen Xing,Huishuai Zhang,Yanyan Lan,Liwei Wang,Tie-Yan Liu
4+阅读 · 2020年2月12日
EfficientDet: Scalable and Efficient Object Detection
Mingxing Tan,Ruoming Pang,Quoc V. Le
5+阅读 · 2019年11月20日
Deep Learning for Energy Markets
Michael Polson,Vadim Sokolov
4+阅读 · 2019年4月10日
Training behavior of deep neural network in frequency domain
Zhi-Qin J. Xu,Yaoyu Zhang,Yanyang Xiao
4+阅读 · 2018年8月21日
A Dual Approach to Scalable Verification of Deep Networks
Krishnamurthy, Dvijotham,Robert Stanforth,Sven Gowal,Timothy Mann,Pushmeet Kohli
3+阅读 · 2018年8月3日
Chiyuan Zhang,Oriol Vinyals,Remi Munos,Samy Bengio
7+阅读 · 2018年4月20日
Elena Burceanu,Marius Leordeanu
4+阅读 · 2018年4月5日
John Duchi,Hongseok Namkoong
5+阅读 · 2017年12月14日
相关VIP内容
专知会员服务
113+阅读 · 2020年2月1日
Stabilizing Transformers for Reinforcement Learning
专知会员服务
30+阅读 · 2019年10月17日
《DeepGCNs: Making GCNs Go as Deep as CNNs》
专知会员服务
23+阅读 · 2019年10月17日
Keras François Chollet 《Deep Learning with Python 》, 386页pdf
专知会员服务
85+阅读 · 2019年10月12日
强化学习最新教程,17页pdf
专知会员服务
75+阅读 · 2019年10月11日
【哈佛大学商学院课程Fall 2019】机器学习可解释性
专知会员服务
56+阅读 · 2019年10月9日
相关资讯
Hierarchically Structured Meta-learning
CreateAMind
13+阅读 · 2019年5月22日
强化学习的Unsupervised Meta-Learning
CreateAMind
7+阅读 · 2019年1月7日
meta learning 17年:MAML SNAIL
CreateAMind
9+阅读 · 2019年1月2日
RL 真经
CreateAMind
4+阅读 · 2018年12月28日
AI/ML/DNN硬件加速设计怎么入门?
StarryHeavensAbove
8+阅读 · 2018年12月4日
Hierarchical Imitation - Reinforcement Learning
CreateAMind
16+阅读 · 2018年5月25日
【推荐】直接未来预测:增强学习监督学习
机器学习研究会
6+阅读 · 2017年11月24日
Auto-Encoding GAN
CreateAMind
5+阅读 · 2017年8月4日
强化学习 cartpole_a3c
CreateAMind
9+阅读 · 2017年7月21日
Top
微信扫码咨询专知VIP会员