[20171126] Chess Reinforcement Learning with AlphaGo Zero Methods - 专知 (Zhuanzhi)

专知 (Zhuanzhi) Content Team

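About (Keras/TensorFlow)

Chess reinforcement learning using AlphaGo Zero methods.

This project is based mainly on the following two resources:

DeepMind's paper published in Nature on October 19, 2017: Mastering the Game of Go without Human Knowledge
@mokemokechicken's extension of DeepMind's ideas in his Reversi repo: https://github.com/mokemokechicken/reversi-alpha-zero

Note: this project is still under construction!!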

Environment

Python 3.6.3
tensorflow-gpu: 1.3.0
Keras: 2.0.8

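Modules

Reinforcement Learning

This AlphaGo Zero implementation consists of three workers: self, opt, and eval.

self is Self-Play: it generates training data by playing games against itself with BestModel.
opt is the Trainer: it trains the model and produces next-generation models.
eval is the Evaluator: it checks whether the next-generation model is better than BestModel and, if so, replaces BestModel with it.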
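
The way the three workers feed one another can be sketched as a toy loop. This is only an illustration under loud assumptions: the "model" is a dictionary with a made-up strength number, all function names are hypothetical, and the workers run one after another here for simplicity. None of it is the project's actual code.

```python
import random


def self_play(best_model, n_games, rng):
    """Stand-in for `self`: BestModel generates training data by playing itself."""
    return [{"generation": best_model["generation"], "result": rng.choice([-1, 0, 1])}
            for _ in range(n_games)]


def optimize(best_model, play_data, rng):
    """Stand-in for `opt`: emit a next-generation candidate.

    A real trainer fits a network to play_data; this toy ignores the data
    and lets a hypothetical "strength" value drift randomly.
    """
    return {"generation": best_model["generation"] + 1,
            "strength": best_model["strength"] + rng.uniform(-1.0, 1.0)}


def evaluate(best_model, candidate):
    """Stand-in for `eval`: decide whether the candidate beats BestModel."""
    return candidate["strength"] > best_model["strength"]


def run_cycles(n_cycles, seed=0):
    """Run the self -> opt -> eval cycle and track BestModel's strength."""
    rng = random.Random(seed)
    best = {"generation": 0, "strength": 0.0}
    history = [best["strength"]]
    for _ in range(n_cycles):
        data = self_play(best, n_games=10, rng=rng)
        candidate = optimize(best, data, rng)
        if evaluate(best, candidate):
            best = candidate  # promotion: the candidate replaces BestModel
        history.append(best["strength"])
    return best, history
```

Because a candidate is promoted only when eval judges it better, the strength recorded in `history` never decreases in this toy.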

Evaluation

For evaluation, you can play chess against BestModel.

Data

data/model/model_best_*: BestModel.
data/model/next_generation/*: next-generation models.
data/play_data/play_*.json: generated training data.
logs/main.log: log file.

If you want to train the model from scratch, delete the directories above.
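
Resetting to a clean state could be scripted roughly as follows. This is a sketch, not part of the project: the function name and the dry-run behaviour are assumptions, and only the path patterns come from the list above. It defaults to a dry run so that nothing is deleted by accident.

```python
import shutil
from pathlib import Path

# Patterns taken from the list above, relative to the project root.
ARTIFACT_PATTERNS = [
    "data/model/model_best_*",
    "data/model/next_generation/*",
    "data/play_data/play_*.json",
    "logs/main.log",
]


def remove_training_artifacts(project_root, dry_run=True):
    """Return the matching paths; delete them only when dry_run is False."""
    root = Path(project_root)
    matches = sorted(p for pattern in ARTIFACT_PATTERNS for p in root.glob(pattern))
    if not dry_run:
        for path in matches:
            if path.is_dir():
                shutil.rmtree(path)  # e.g. a saved model stored as a directory
            else:
                path.unlink()
    return matches
```

Calling `remove_training_artifacts(".")` from the project root lists what would be removed; passing `dry_run=False` actually deletes it.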

How to Use

Setup:

Install the libraries: pip install -r requirements.txt

To use a GPU, also run: pip install tensorflow-gpu

Set the environment variable: create a .env file with the following content: KERAS_BACKEND=tensorflow
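
Keras reads the KERAS_BACKEND environment variable when it is imported, so the .env content has to reach the process environment first. How the project itself loads the file is not shown here; the sketch below is a stand-in that uses only the standard library, and both function names are hypothetical.

```python
import os


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def apply_env(values, environ=os.environ):
    """Set only variables that are not already defined, so the shell wins."""
    for key, value in values.items():
        environ.setdefault(key, value)
    return environ


settings = parse_env_text("KERAS_BACKEND=tensorflow\n")
print(settings)  # {'KERAS_BACKEND': 'tensorflow'}
```

In a real run, `apply_env(parse_env_text(open(".env").read()))` would need to happen before `import keras`.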

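Basic Usage

To train a model, run Self-Play, Trainer, and Evaluator.

Self-Play

python src/chess_zero/run.py self

When executed, Self-Play starts using BestModel. If BestModel does not exist, a new random model is created and becomes BestModel.

Options

--new: create a new BestModel
--type mini: use the mini config for testing (see src/chess_zero/configs/mini.py)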

Trainer

python src/chess_zero/run.py opt

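When executed, training starts. The base model is loaded from the latest saved next-generation model; if none exists, BestModel is used. The trained model is saved every 2000 steps (mini-batches).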
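
The fallback rule and the save interval described above can be sketched like this. The helper names are hypothetical, and "latest" is assumed to mean "last when directory entries are sorted by name", which may not match how the project actually orders its models.

```python
from pathlib import Path

SAVE_INTERVAL = 2000  # steps between saves, as stated above


def pick_base_model(model_dir):
    """Prefer the newest next-generation model; fall back to BestModel."""
    next_gen_dir = Path(model_dir) / "next_generation"
    candidates = sorted(next_gen_dir.glob("*")) if next_gen_dir.is_dir() else []
    if candidates:
        # Assumption: entry names sort chronologically.
        return ("next_generation", candidates[-1])
    return ("best", None)


def checkpoint_steps(total_steps, interval=SAVE_INTERVAL):
    """Steps at which a training loop would save the model."""
    return [step for step in range(1, total_steps + 1) if step % interval == 0]


print(checkpoint_steps(5000))  # [2000, 4000]
```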

Trainer options

--type mini: use the mini config for testing (see src/chess_zero/configs/mini.py)
--total-step: specify the total number of steps (mini-batches). The total step count affects the learning rate used in training.

Evaluator

python src/chess_zero/run.py eval

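When executed, evaluation starts. It evaluates BestModel against the latest next-generation model over 200 games. If the next-generation model wins, it becomes the new BestModel.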
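
A match of this kind could be tallied as in the sketch below. The text only says the next-generation model must win; the exact criterion used here (more than half of the decisive games), the alternation of colours, and all names are assumptions made for illustration.

```python
def play_match(play_game, n_games=200):
    """Play n_games and count results from the challenger's point of view.

    play_game is a caller-supplied function that takes challenger_is_white
    and returns "win", "draw", or "loss".
    """
    tally = {"win": 0, "draw": 0, "loss": 0}
    for game_index in range(n_games):
        # Alternate colours so neither model always moves first.
        challenger_is_white = game_index % 2 == 0
        tally[play_game(challenger_is_white)] += 1
    return tally


def should_promote(tally, min_win_rate=0.5):
    """Promote when the challenger wins more than min_win_rate of decisive games."""
    decisive = tally["win"] + tally["loss"]
    if decisive == 0:
        return False  # all draws: keep the current BestModel
    return tally["win"] / decisive > min_win_rate
```

Draws are left out of the ratio here so that a long run of drawn games neither helps nor hurts the challenger; a stricter `min_win_rate` makes promotion harder.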

Play Game

python src/chess_zero/run.py play_gui

When executed, a standard chessboard is displayed in ASCII characters, and you can play against BestModel.
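
The commands and options described above suggest a command line shaped roughly like the following argparse sketch. It is not the project's run.py: the parser layout and the "normal" default for --type are assumptions, and for brevity it accepts every option on every command.

```python
import argparse

# The four commands named in this document.
COMMANDS = ["self", "opt", "eval", "play_gui"]


def build_parser():
    """Build a parser mirroring the commands and options listed above."""
    parser = argparse.ArgumentParser(description="Sketch of the run.py command line")
    parser.add_argument("cmd", choices=COMMANDS, help="worker to run")
    parser.add_argument("--new", action="store_true", help="create a new BestModel")
    # Assumption: "normal" is the default config type.
    parser.add_argument("--type", default="normal", help="config type, e.g. mini")
    parser.add_argument("--total-step", type=int, default=None,
                        help="total number of mini-batch steps")
    return parser


args = build_parser().parse_args(["opt", "--type", "mini", "--total-step", "4000"])
print(args.cmd, args.type, args.total_step)  # opt mini 4000
```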

Tips and Memo

GPU Memory

Usually, a lack of memory causes warnings, not errors. If an error does occur, try changing per_process_gpu_memory_fraction in src/worker/{evaluate.py,optimize.py,self_play.py}, for example:

tf_util.set_session_config(per_process_gpu_memory_fraction=0.2)

A smaller batch_size reduces the memory usage of opt. Try changing TrainerConfig#batch_size in NormalConfig.

Model Performance

The table below records the best models.

Best model generation	Winning percentage against best model	Time spent (hours)	Note
1	-	-
