[Learning] Hierarchical Softmax

August 6, 2017 · 机器学习研究会




Abstract

Reposted from: 爱可可-爱生活

[These are the notes from a talk I gave at the seminar]

Hierarchical softmax is an alternative to the softmax in which the probability of any one outcome depends on a number of model parameters that is only logarithmic in the total number of outcomes. In “vanilla” softmax, on the other hand, the number of such parameters is linear in the total number of outcomes. When there are many outcomes (e.g. in language modelling), this can be a huge difference. The consequence is that models using hierarchical softmax are significantly faster to train with stochastic gradient descent: only the parameters on which the current training example depends need to be updated, and fewer updates mean we can move on to the next training example sooner. At evaluation time, hierarchical softmax models allow faster calculation of individual outcomes, again because each outcome depends on fewer parameters (and because the calculation using those parameters is just as straightforward as in the softmax case). So hierarchical softmax is very interesting from a computational point of view. By explaining it here, I hope to convince you that it is also interesting conceptually. To keep things concrete, I’ll illustrate using the CBOW learning task from word2vec (and fasttext, and others).
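To make the parameter counts above concrete, here is a minimal sketch in plain Python. It is not taken from the original talk: the balanced, heap-style binary tree, the toy sizes, and all names (`path_to_leaf`, `hs_probability`, `sgd_step`, `node_vectors`) are assumptions made for illustration, and `h` simply stands in for the context representation that a CBOW-style model would compute. Each outcome's probability is a product of sigmoid decisions along its root-to-leaf path, so it reads (and an SGD step updates) only about log2(V) node vectors, whereas the softmax normaliser reads all V output vectors.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def path_to_leaf(word, vocab_size):
    """Return [(internal_node, go_right), ...] from the root down to `word`.

    Assumed layout: internal nodes are 1..vocab_size-1, the leaf for `word`
    is vocab_size + word, and the children of node n are 2n and 2n+1.
    """
    node = vocab_size + word
    path = []
    while node > 1:
        parent = node // 2
        path.append((parent, node % 2 == 1))
        node = parent
    path.reverse()
    return path


def hs_probability(word, h, node_vectors, vocab_size):
    """P(word | h) as a product of binary decisions along the tree path."""
    p = 1.0
    for node, go_right in path_to_leaf(word, vocab_size):
        score = dot(node_vectors[node], h)
        p *= sigmoid(score) if go_right else sigmoid(-score)
    return p


def softmax_probability(word, h, output_vectors):
    """Vanilla softmax: every output vector enters the normaliser."""
    scores = [dot(v, h) for v in output_vectors]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    return exps[word] / sum(exps)


def sgd_step(word, h, node_vectors, vocab_size, lr=0.1):
    """One SGD step on -log P(word | h) with respect to the node vectors only.

    Returns the ids of the node vectors that were updated. (A full model
    would also update the parameters that produced h; omitted here.)
    """
    updated = []
    for node, go_right in path_to_leaf(word, vocab_size):
        vec = node_vectors[node]
        err = sigmoid(dot(vec, h)) - (1.0 if go_right else 0.0)
        node_vectors[node] = [v - lr * err * x for v, x in zip(vec, h)]
        updated.append(node)
    return updated


random.seed(0)
VOCAB_SIZE, DIM = 8, 4  # toy sizes; VOCAB_SIZE is a power of two in this sketch


def rand_vec():
    return [random.uniform(-1.0, 1.0) for _ in range(DIM)]


h = rand_vec()
node_vectors = {n: rand_vec() for n in range(1, VOCAB_SIZE)}
output_vectors = [rand_vec() for _ in range(VOCAB_SIZE)]

hs_probs = [hs_probability(w, h, node_vectors, VOCAB_SIZE) for w in range(VOCAB_SIZE)]
sm_probs = [softmax_probability(w, h, output_vectors) for w in range(VOCAB_SIZE)]

target = 3
p_before = hs_probability(target, h, node_vectors, VOCAB_SIZE)
updated_nodes = sgd_step(target, h, node_vectors, VOCAB_SIZE)
p_after = hs_probability(target, h, node_vectors, VOCAB_SIZE)
```

In this toy setup both `hs_probs` and `sm_probs` sum to one, but the step for `target` touches only the three node vectors on its path (log2 of 8), while a softmax step would touch all eight output vectors. Real implementations such as word2vec typically build a Huffman tree from word frequencies rather than a balanced one, so path lengths vary by word.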

Link:

http://building-babylon.net/2017/08/01/hierarchical-softmax/


Original post link:

https://m.weibo.cn/1402400261/4136923293170695

