             [1] Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
 
             [2] Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
 
             [3] Silver, David, et al. "Mastering the game of Go without human knowledge." Nature 550.7676 (2017): 354-359.
 
             [4] Levine, Sergey, et al. "End-to-end training of deep visuomotor policies." arXiv preprint arXiv:1504.00702 (2015).
 
             [5] Mao, Hongzi, et al. "Resource management with deep reinforcement learning." Proceedings of the 15th ACM Workshop on Hot Topics in Networks. ACM, 2016.
 
             [6] deepmind.com/blog/deepm
 
             [7] Jaques, Natasha, et al. "Tuning recurrent neural networks with reinforcement learning." (2017).
 
             [8] Henderson, Peter, et al. "Deep reinforcement learning that matters." arXiv preprint arXiv:1709.06560 (2017).
 
             [9] Islam, Riashat, et al. "Reproducibility of benchmarked deep reinforcement learning tasks for continuous control." arXiv preprint arXiv:1708.04133 (2017).
 
             [10] riashatislam.files.wordpress.com
 
             [11] sites.google.com/view/d
 
             [12] reddit.com/r/MachineLea
 
             [13] alexirpan.com/2018/02/1
 
             [14] himanshusahni.github.io
 
             [15] amid.fish/reproducing-d
 
             [16] rodeo.ai/2018/05/06/rep
 
             [17] Dayan, Peter, and Yael Niv. "Reinforcement learning: the good, the bad and the ugly." Current opinion in neurobiology 18.2 (2008): 185-196.
 
             [18] argmin.net/2018/05/11/o
 
             [19] Mania, Horia, Aurelia Guy, and Benjamin Recht. "Simple random search provides a competitive approach to reinforcement learning." arXiv preprint arXiv:1803.07055 (2018).
 
             [20] Justesen, Niels, et al. "Deep Learning for Video Game Playing." arXiv preprint arXiv:1708.07902 (2017).
 
             [21] youtube.com/watch?
 
             [22] sites.google.com/view/i
 
             [23] Silver, David, et al. "The predictron: End-to-end learning and planning." arXiv preprint arXiv:1612.08810 (2016).
 
             [24] Deisenroth, Marc, and Carl E. Rasmussen. "PILCO: A model-based and data-efficient approach to policy search." Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011.
 
             [25] Levine, Sergey, and Vladlen Koltun. "Guided policy search." International Conference on Machine Learning. 2013.
 
             [26] Weber, Théophane, et al. "Imagination-augmented agents for deep reinforcement learning." arXiv preprint arXiv:1707.06203 (2017).
 
             [27] Sutton, Richard S. "Dyna, an integrated architecture for learning, planning, and reacting." ACM SIGART Bulletin 2.4 (1991): 160-163.
 
             [28] Silver, David, Richard S. Sutton, and Martin Müller. "Sample-based learning and search with permanent and transient memories." Proceedings of the 25th international conference on Machine learning. ACM, 2008.
 
             [29] Rajeswaran, Aravind, et al. "Towards generalization and simplicity in continuous control." Advances in Neural Information Processing Systems. 2017.
 
             [30] andreykurenkov.com/writ
 
             [31] UCL Course on RL: www0.cs.ucl.ac.uk/staff
 
             [32] Powell, Warren B. "AI, OR and control theory: A rosetta stone for stochastic optimization." Princeton University (2012).
 
             [33] Pomerleau, Dean A. "ALVINN: An autonomous land vehicle in a neural network." Advances in neural information processing systems. 1989.
 
             [34] Abbeel, Pieter, and Andrew Y. Ng. "Apprenticeship learning via inverse reinforcement learning." Proceedings of the twenty-first international conference on Machine learning. ACM, 2004.
 
             [35] Osa, Takayuki, et al. "An algorithmic perspective on imitation learning." Foundations and Trends® in Robotics 7.1-2 (2018): 1-179.
 
             [36] fermatslibrary.com/arxi (https://arxiv.org/pdf/1406.2661.pdf)
 
             [37] Ho, Jonathan, and Stefano Ermon. "Generative adversarial imitation learning." Advances in Neural Information Processing Systems. 2016.
 
             [38] github.com/carla-simula
 
             [39] 36kr.com/p/5129474.html
 
             [40] Vezhnevets, Alexander Sasha, et al. "Feudal networks for hierarchical reinforcement learning." arXiv preprint arXiv:1703.01161 (2017).
 
             [41] Ranzato, Marc'Aurelio, et al. "Sequence level training with recurrent neural networks." arXiv preprint arXiv:1511.06732 (2015).
 
             [42] Bahdanau, Dzmitry, et al. "An actor-critic algorithm for sequence prediction." arXiv preprint arXiv:1607.07086 (2016).
 
             [43] Keneshloo, Yaser, et al. "Deep Reinforcement Learning For Sequence to Sequence Models." arXiv preprint arXiv:1805.09461 (2018).
 
             [44] Watters, Nicholas, et al. "Visual interaction networks." arXiv preprint arXiv:1706.01433 (2017).
 
             [45] Hamrick, Jessica B., et al. "Relational inductive bias for physical construction in humans and machines." arXiv preprint arXiv:1806.01203 (2018).
 
             [46] Zambaldi, Vinicius, et al. "Relational Deep Reinforcement Learning." arXiv preprint arXiv:1806.01830 (2018).
 
             [47] Santoro, Adam, et al. "Relational recurrent neural networks." arXiv preprint arXiv:1806.01822 (2018).
 
             [48] Battaglia, Peter W., et al. "Relational inductive biases, deep learning, and graph networks." arXiv preprint arXiv:1806.01261 (2018).
 
             [49] Eslami, SM Ali, et al. "Neural scene representation and rendering." Science 360.6394 (2018): 1204-1210.
 
             [50] Huang, Sandy, et al. "Adversarial attacks on neural network policies." arXiv preprint arXiv:1702.02284 (2017).
 
             [51] Behzadan, Vahid, and Arslan Munir. "Vulnerability of deep reinforcement learning to policy induction attacks." International Conference on Machine Learning and Data Mining in Pattern Recognition. Springer, Cham, 2017.
 
             [52] Tesauro, Gerald. "Temporal difference learning and TD-Gammon." Communications of the ACM 38.3 (1995): 58-68.
 
             [53] Jones, Rebecca M., et al. "Behavioral and neural properties of social reinforcement learning." Journal of Neuroscience 31.37 (2011): 13039-13045.
 
             [54] github.com/YBIGTA/DeepN
 
             [55] Lewis, Mike, et al. "Deal or no deal? end-to-end learning for negotiation dialogues." arXiv preprint arXiv:1706.05125 (2017).
 
             [56] microsoft.com/zh-cn/tra
 
             [57] Derhami, Vali, et al. "Applying reinforcement learning for web pages ranking algorithms." Applied Soft Computing 13.4 (2013): 1686-1692.
 
             [58] Wu, Jiawei, Lei Li, and William Yang Wang. "Reinforced Co-Training." arXiv preprint arXiv:1804.06035 (2018).
 
             [59] Li, Yuxi. "Deep reinforcement learning: An overview." arXiv preprint arXiv:1701.07274 (2017).
 
             [60] Feinberg, Eugene A., and Adam Shwartz, eds. Handbook of Markov decision processes: methods and applications. Vol. 40. Springer Science & Business Media, 2012.
 
             [61] en.wikipedia.org/wiki/C
 
             [62] Turing, Alan M. "Computing machinery and intelligence." Parsing the Turing Test. Springer, Dordrecht, 2009. 23-65.
 
             [63] en.wikipedia.org/wiki/A
 
             [64] Russell, Stuart J., and Peter Norvig. Artificial intelligence: a modern approach. Pearson Education Limited, 2016.
 
             [65] Noë, Alva. Action in perception. MIT press, 2004.
 
             [66] Garnelo, Marta, Kai Arulkumaran, and Murray Shanahan. "Towards deep symbolic reinforcement learning." arXiv preprint arXiv:1609.05518 (2016).
 
             [67] argmin.net/2018/04/16/e
 
             [68] Bojarski, Mariusz, et al. "End to end learning for self-driving cars." arXiv preprint arXiv:1604.07316 (2016).
 
             [69] Recht, Benjamin. "A Tour of Reinforcement Learning: The View from Continuous Control." arXiv preprint arXiv:1806.09460 (2018).
 
             [70] Kalashnikov, Dmitry, et al. "QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation." arXiv preprint arXiv:1806.10293 (2018).
 
             [71] Tan, Jie, et al. "Sim-to-Real: Learning Agile Locomotion For Quadruped Robots." arXiv preprint arXiv:1804.10332 (2018).
 
             [72] Auer, Peter. "Using confidence bounds for exploitation-exploration trade-offs." Journal of Machine Learning Research 3.Nov (2002): 397-422.
 
             [73] Agrawal, Shipra, and Navin Goyal. "Thompson sampling for contextual bandits with linear payoffs." International Conference on Machine Learning. 2013.
 
             [74] Mohamed, Shakir, and Danilo Jimenez Rezende. "Variational information maximisation for intrinsically motivated reinforcement learning." Advances in neural information processing systems. 2015.
 
             [75] Pathak, Deepak, et al. "Curiosity-driven exploration by self-supervised prediction." International Conference on Machine Learning (ICML). 2017.
 
             [76] Tang, Haoran, et al. "#Exploration: A study of count-based exploration for deep reinforcement learning." Advances in Neural Information Processing Systems. 2017.
 
             [77] McFarlane, Roger. "A Survey of Exploration Strategies in Reinforcement Learning." McGill University, www.cs.mcgill.ca/~cs526/roger.pdf, accessed April 2018.
 
             [78] Plappert, Matthias, et al. "Parameter space noise for exploration." arXiv preprint arXiv:1706.01905 (2017).
 
             [79] Fortunato, Meire, et al. "Noisy networks for exploration." arXiv preprint arXiv:1706.10295 (2017).
 
             [80] Kansky, Ken, et al. "Schema networks: Zero-shot transfer with a generative causal model of intuitive physics." arXiv preprint arXiv:1706.04317 (2017).
 
             [81] Li, Da, et al. "Learning to generalize: Meta-learning for domain generalization." arXiv preprint arXiv:1710.03463 (2017).
 
             [82] berkeleyautomation.github.io
 
             [83] contest.openai.com/2018