Bayesian Analysis Makes You an Excellent Parameter Tuner: Automating the Search of a Physical Model's Parameter Space

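December 31, 2020 · PaperWeekly

©PaperWeekly original · Author | 庞龙刚 (Pang Longgang)

Affiliation | Central China Normal University

Research interests | High-energy nuclear physics, artificial intelligence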

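When doing research, you often find that you have somehow turned into a full-time parameter tuner: to fit a physical model to some set of experimental data, you search the model's parameter space by hand, over and over. With luck, you quickly land on a set of parameters that looks decent and roughly describes the data. Without luck, no amount of tuning brings the model into agreement with experiment. You have surely wished that the computer could do the tuning for you and automatically find the set of model parameters that best describes the experimental data.

This section shows how to do exactly that with Bayesian analysis, and how to become an outstanding parameter tuner.

What you will learn

1. Bayes' formula

2. The scientific method and Bayesian analysis

3. How to automate the search of a physical model's parameter space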

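Bayes' formula

The joint probability density of two random variables $x$ and $y$ can be written in two ways:

$$P(x, y) = P(x \mid y)\,P(y) = P(y \mid x)\,P(x)$$

Dividing both sides by $P(y)$ gives:

$$P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)}$$

This is the famous Bayes' formula, which will be used shortly.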
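As a quick sanity check, Bayes' formula can be verified numerically on a small discrete joint distribution. The table below is a made-up example, not data from this article; the sketch only shows that the conditional computed directly from the joint matches the one obtained through Bayes' formula.

```python
import numpy as np

# Hypothetical joint distribution P(x, y): 2 values of x (rows), 3 values of y (columns).
joint = np.array([[0.10, 0.20, 0.10],
                  [0.30, 0.05, 0.25]])

p_x = joint.sum(axis=1)               # marginal P(x)
p_y = joint.sum(axis=0)               # marginal P(y)
p_x_given_y = joint / p_y             # P(x|y): each column divided by P(y)
p_y_given_x = joint / p_x[:, None]    # P(y|x): each row divided by P(x)

# Bayes' formula: P(x|y) = P(y|x) P(x) / P(y)
bayes_rhs = p_y_given_x * p_x[:, None] / p_y
print(np.allclose(p_x_given_y, bayes_rhs))
```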

The scientific method and Bayesian analysis

The following passage describes how Feynman saw scientific research:

First you guess. Don't laugh, this is the most important step. Then you compute the consequences. Compare the consequences to experience. If it disagrees with experience, the guess is wrong. In that simple statement is the key to science. It doesn't matter how beautiful your guess is or how smart you are or what your name is. If it disagrees with experience, it's wrong. That's all there is to it.

——By Richard P. Feynman

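In other words, there are a few steps:

1. First you guess: build a model.

2. Use the model to compute a result.

3. Compare the result with experimental data.

4. If it agrees with experiment, it is right. If it does not, it is wrong.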

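This is why we so often find ourselves turned into parameter tuners: we first have to "build a model". Building a model means writing

$$\hat{y} = f(x; \theta)$$

where $\hat{y}$ is the model output, $f$ is our physical model, $x$ is the given input, and $\theta$ denotes the model parameters.

Finding the set of parameters $\theta$ that minimizes the distance between the model prediction and the experimental data is called maximum likelihood estimation (MLE).

That distance plays the role of the "likelihood": the smaller the distance, the higher the likelihood. For independent Gaussian errors, the squared distance equals the negative log-likelihood up to a scale factor and an additive constant.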
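To make the link between "distance" and likelihood concrete, here is a minimal sketch under stated assumptions: a hypothetical one-parameter model $f(x; \theta) = \theta x$, synthetic data with Gaussian noise, and a brute-force scan over $\theta$. None of these choices come from the article; they only illustrate that minimizing the squared distance and maximizing the Gaussian likelihood pick the same parameter.

```python
import numpy as np

rng = np.random.default_rng(0)

def model(x, theta):
    """Hypothetical one-parameter model f(x; theta) = theta * x."""
    return theta * x

# Synthetic "experimental" data generated with theta_true = 2 plus Gaussian noise.
theta_true, sigma = 2.0, 0.1
x = np.linspace(0.0, 1.0, 50)
y = model(x, theta_true) + rng.normal(0.0, sigma, size=x.size)

def neg_log_likelihood(theta):
    # Independent Gaussian errors: -log L = sum (y - f)^2 / (2 sigma^2) + const
    return np.sum((y - model(x, theta)) ** 2) / (2 * sigma ** 2)

# Scan theta on a fine grid and keep the value with the smallest distance.
thetas = np.linspace(0.0, 4.0, 4001)
nll = np.array([neg_log_likelihood(t) for t in thetas])
theta_mle = thetas[np.argmin(nll)]
print(f"theta_MLE = {theta_mle:.3f}")
```

For this linear model the result can be cross-checked against the closed-form least-squares solution `np.sum(x * y) / np.sum(x * x)`.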

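Applying Bayes' formula to the model parameters $\theta$ and the experimental data $y$:

$$P(\theta \mid y) = \frac{P(y \mid \theta)\,P(\theta)}{P(y)}$$

Expanding the denominator lets us read off each piece of the formula:

$$P(\theta \mid y) = \frac{P(y \mid \theta)\,P(\theta)}{\int P(y \mid \theta')\,P(\theta')\,d\theta'}$$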

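In terms of Feynman's description of research:

1. "First you guess" corresponds to the prior $P(\theta)$: your estimate of the parameters based on past experience, or in other words the distribution you believe the values of $\theta$ should follow;

2. "Compute the consequences, and compare with experience" corresponds to the likelihood function $P(y \mid \theta)$, which measures how closely the model output resembles the experimental data;

3. Feynman did not mention that if many models can describe the same experimental data, the chance that any one of them is the true theory goes down. This corresponds to the normalization factor $P(y)$ in the denominator, the prior-weighted sum of the likelihood over different parameters. It is sometimes called the "evidence";

4. Feynman also did not mention that people's belief in the model parameters changes and updates as more data arrive. This is the posterior $P(\theta \mid y)$.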
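The four ingredients above can be put together on a one-dimensional grid. This is only a sketch under assumptions that are not in the article: a hypothetical model $f(x; \theta) = \theta x$, synthetic data, Gaussian errors with known width, and a flat prior on $[0, 4]$.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.1
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + rng.normal(0.0, sigma, size=x.size)    # synthetic data, theta_true = 2

theta = np.linspace(0.0, 4.0, 2001)                   # grid over the parameter
dtheta = theta[1] - theta[0]

prior = np.full_like(theta, 1.0 / 4.0)                # flat prior P(theta) on [0, 4]

# Gaussian likelihood P(y | theta) for every grid point (rows: theta, columns: data).
resid = y[None, :] - theta[:, None] * x[None, :]
log_like = -0.5 * np.sum(resid ** 2, axis=1) / sigma ** 2
likelihood = np.exp(log_like - log_like.max())        # rescaled to avoid underflow

# Evidence: integral of likelihood * prior over theta (up to the same rescaling,
# which cancels in the posterior).
evidence = np.sum(likelihood * prior) * dtheta
posterior = likelihood * prior / evidence             # normalized P(theta | y)

print("integral of posterior:", np.sum(posterior) * dtheta)
print("posterior mode:", theta[np.argmax(posterior)])
```

The grid makes the evidence trivial to compute in one dimension; the cost of this integral is what becomes prohibitive as the number of parameters grows.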

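About the model

Every physical model is an approximate description of the real world. As the parameters $\theta$ vary, the model $f(x; \theta)$ spans a function space, and the function realized by the real world generally cannot be expanded exactly within that space. Still, as long as the model approximately describes the experimental data for some set of parameters, it can be regarded as an effective model of the real world.

Very often we need to compare with experiment to find the parameters of this effective model.

For example, suppose the model has two parameters, $\theta_1$ and $\theta_2$, which form a 2-dimensional parameter space that can be explored point by point with grid search. For most parameter values, however, the likelihood of the model output given the experimental data is very low. Those regions of parameter space are not worth the time spent exploring them, as the figure below shows: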

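▲ Two-dimensional parameter space. Yellow dots mark low-likelihood regions; red dots mark the high-likelihood region, where the model approximately describes experiment. Bayesian analysis starts from some point in parameter space and uses a random walk to quickly find the high-likelihood region. If the red region is large, many parameter points can describe the experiment, and the posterior probability of any single parameter point goes down.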
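To get a feel for how wasteful grid search can be, the sketch below evaluates a hypothetical likelihood with a narrow Gaussian peak on a 101 × 101 grid and counts how many grid points have a non-negligible likelihood. The peak position, its width, and the 1% threshold are all arbitrary assumptions chosen for illustration.

```python
import numpy as np

def likelihood(t1, t2):
    # Hypothetical likelihood: a narrow Gaussian peak centred at (1.0, -2.0).
    return np.exp(-((t1 - 1.0) ** 2 + (t2 + 2.0) ** 2) / (2 * 0.2 ** 2))

# Regular grid over the 2D parameter space [-5, 5] x [-5, 5].
n = 101
t1, t2 = np.meshgrid(np.linspace(-5, 5, n), np.linspace(-5, 5, n), indexing="ij")
like = likelihood(t1, t2)

# Grid points whose likelihood is at least 1% of the best value found.
useful = like > 0.01 * like.max()
print(f"grid points evaluated: {like.size}")
print(f"fraction with likelihood > 1% of the peak: {useful.mean():.4f}")
```

With more parameters the number of grid points grows exponentially while the useful fraction shrinks, which is the motivation for a guided random walk.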

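Markov chain Monte Carlo sampling (MCMC)

In Bayes' formula the experimental data $y$ are known. When computing the posterior distribution of the parameters $\theta$, the denominator turns out to be hard to evaluate because it requires traversing the entire parameter space. Fortunately, Markov chain Monte Carlo (MCMC) can sample from an unnormalized distribution function, so only the following unnormalized posterior has to be computed:

$$P(\theta \mid y) \propto P(y \mid \theta)\,P(\theta)$$

Sampling from this unnormalized posterior to obtain the distribution of model parameters consistent with the experimental data is the most important goal of Bayesian analysis.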
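The reason the normalization can be skipped is that samplers of this kind only compare the density at two points through a ratio, and any constant factor cancels in a ratio. The sketch below illustrates this with a hypothetical unnormalized density $\exp(-|\theta|)$ on $[-5, 5]$, whose normalization constant happens to be known in closed form.

```python
import numpy as np

def unnormalized_posterior(theta):
    # Hypothetical likelihood x prior, known only up to a constant factor.
    return np.exp(-np.abs(theta))

# Normalization constant of exp(-|theta|) on [-5, 5]: 2 * (1 - exp(-5)).
Z = 2.0 * (1.0 - np.exp(-5.0))

def normalized_posterior(theta):
    return unnormalized_posterior(theta) / Z

# Compare the density ratio between two arbitrary points with and without Z.
a, b = 0.3, 1.7
ratio_unnorm = unnormalized_posterior(b) / unnormalized_posterior(a)
ratio_norm = normalized_posterior(b) / normalized_posterior(a)
print(np.isclose(ratio_unnorm, ratio_norm))
```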

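The simplest MCMC implementation is the Metropolis-Hastings (MH) algorithm. A simple example shows how MH works.

Example: use the Metropolis-Hastings algorithm to sample from a one-dimensional distribution $f(\theta)$.

Algorithm:

1. Initialize an empty list X = [ ] to hold all sampled points, and start from an arbitrary point $\theta_0$ in parameter space;

2. Take a random-walk step from the previous position, $\theta' = \theta_t + \delta$, where the step $\delta$ is drawn from a normal or uniform distribution;

3. Define the acceptance rate $\alpha = \min\left(f(\theta') / f(\theta_t),\ 1\right)$ and draw $r$ from the uniform distribution on $[0, 1)$;

4. If $r \le \alpha$, append $\theta'$ to X and move to it; otherwise, append the previous sample $\theta_t$ to X again;

5. Go to step 2.

The Metropolis-Hastings algorithm eventually produces an array $X = [\theta_0, \theta_1, \ldots, \theta_n]$. Plotting the distribution of these parameter values as a histogram shows that it follows $f(\theta)$.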

Implementation (for teaching purposes only):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm
import os

def mcmc(func, ranges, n=100000, warmup=0.1, thin=1, step=1, initial_guess=None):
    '''Metropolis-Hastings sampling for distribution func
    :func: non-normalized distribution function
    :ranges: the lower and upper bound of params, its shape determines ndim,
             np.array([[xmin, xmax], [ymin, ymax], [zmin, zmax], [..., ...]])
    :n: number of samples
    :warmup: ratio of samples that will be discarded 
    :thin: samples to skip in the output
    :step: effective step length in the proposal function
    :initial_guess: initial position in the parameter space'''
    ran =  np.random.random
    normal = np.random.normal
    ndim = ranges.shape[0]
    samples = np.empty((n, ndim), dtype=np.float32)
    xmin, xmax = ranges[:, 0], ranges[:, 1]
    # start from somewhere in the parameter space
    if initial_guess is None:
        r = ran()
        x0 = xmin * (1 - r) + xmax * r               # initial sample
    else:
        x0 = initial_guess

    step  = step * (xmax - xmin) / (xmax - xmin).max()
    for i in tqdm(range(n)):
        r = normal(size=(ndim))                      # propose the next move
        x1 = x0 + r * step
        alpha = min(func(x1) / func(x0), 1)          # acceptance rate
        r0 = ran()
        if (r0 <= alpha) and (x1>xmin).all()    \
           and (x1<xmax).all():                      # accept the proposal
            x0 = x1

        samples[i] = x0                              # or keep the previous sample
    return samples[int(warmup*n)::thin]              # drop warmup, keep every thin-th sample

Call the MH implementation above to sample from the function $f(\theta) = \exp(-|\theta|)$:

def fun(theta):
    return np.exp(-np.abs(theta))

# get 1,000,000 samples using mcmc
ranges_ = np.array([[-5, 5]])
samples = mcmc(fun, ranges_, n=1000000, warmup=0.1, thin=10)

100%|██████████| 1000000/1000000 [00:11<00:00, 85041.25it/s]

We asked for 1,000,000 samples, but because of the warmup and thin parameters only 90,000 samples remain in the end.

# 'warmup' removes 10% of the data
# 'thin' reduces the size by a factor of 10
print(samples.shape)

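The code above prints the shape of the sample array, (90000, 1). Plotting the distribution as a histogram shows that the distribution sampled by MCMC agrees well with $f(\theta) = \exp(-|\theta|)$.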

# this script is to make an animation for the mcmc sampling
%matplotlib notebook
from matplotlib.animation import FuncAnimation

# compute the distribution of these samples using histogram
hist, edges = np.histogram(samples.flatten(), bins=500)

x = 0.5 * (edges[1:] + edges[:-1])
fig1, ax1 = plt.subplots()

ax1.plot(x, hist/hist.max(), '-', label="MCMC")
ax1.plot(x, fun(x), '-', label = r"$f(\theta) = \exp(-|\theta|)$")

plt.legend(loc='best')
plt.xlabel(r'$\theta$')
plt.ylabel(r'$f(\theta)$')
ax1.set_xlim(x[0], x[-1])

dot, = ax1.plot(0, 1, 'o', ms=10)

def update(i):
    xi = samples[i]
    dot.set_data(xi, fun(xi)) 
    return dot,


anim = FuncAnimation(fig1, update, frames=1000, interval=50, blit=False)

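What the warmup and thin parameters do

Both parameters discard part of the samples. warmup=0.1 discards the first 10% of the sample list, and thin=10 keeps one out of every 10 of the remaining samples. In the figure below, blue marks the samples that are kept and gray the samples that are discarded.

Because MCMC samples are generated sequentially and each sample depends on the position of the previous one, successive samples are autocorrelated. Keeping only one sample out of every few with the thin parameter weakens this autocorrelation. The autocorrelation can be inspected with a plot.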

%matplotlib inline
s = pd.Series(samples[:2000].flatten())
ax = pd.plotting.autocorrelation_plot(s)
plt.ylim(-0.25, 0.25)

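▲ Autocorrelation. Lag is the number of samples separating two samples, and the vertical axis shows the autocorrelation. The autocorrelation is essentially eliminated only at around thin=2000.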
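The effect of thinning can also be checked with a few lines of NumPy. The sketch below does not use the sampler from this article; it assumes a hypothetical AR(1) process with coefficient 0.9 as a stand-in for a correlated MCMC chain, and the helper `autocorr` is an illustrative function written for this example.

```python
import numpy as np

def autocorr(chain, lag):
    """Sample autocorrelation of a 1D chain at a given positive lag."""
    c = np.asarray(chain, dtype=float) - np.mean(chain)
    return np.dot(c[:-lag], c[lag:]) / np.dot(c, c)

# Hypothetical correlated chain: an AR(1) process standing in for MCMC output.
rng = np.random.default_rng(2)
rho, n = 0.9, 200_000
noise = rng.normal(size=n)
chain = np.empty(n)
chain[0] = noise[0]
for i in range(1, n):
    chain[i] = rho * chain[i - 1] + noise[i]

thinned = chain[::10]                      # keep every 10th sample, like thin=10
print("lag-1 autocorrelation, full chain:", autocorr(chain, 1))
print("lag-1 autocorrelation, thinned   :", autocorr(thinned, 1))
```

For an AR(1) process the autocorrelation at lag $k$ is $\rho^k$, so neighbouring samples of the thinned chain should be correlated at roughly $0.9^{10} \approx 0.35$ instead of 0.9. Thinning weakens the correlation but does not remove it.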

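Warmup discards the early samples drawn before the chain has reached equilibrium. When sampling in two or more dimensions, it is easy to see that the early samples do not follow the final distribution.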

def fun2d(theta):
    return np.exp(-np.abs(theta).sum())

# get 1,000 samples in 2d-parameter space using mcmc
ranges_ = np.array([[-5, 5], [-5, 5]])
samples2d = mcmc(fun2d, ranges_, n=1000, warmup=0.0, step=1)

plt.scatter(samples2d[:, 0], samples2d[:, 1], label="all samples")

plt.plot(samples2d[:15, 0], samples2d[:15, 1], 'ro-', label="initial samples")

plt.plot(samples2d[0, 0], samples2d[0, 1], 'r*', ms=15, label="start")
plt.xlabel('param 1')
plt.ylabel('param 2')

plt.legend(loc='best')
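The same effect can be seen numerically in one dimension. The sketch below is a stripped-down random-walk Metropolis loop, separate from the `mcmc` function above, under assumptions chosen only for illustration: an unnormalized standard normal target, a step size of 0.5, and a deliberately poor starting point at $\theta = 20$.

```python
import numpy as np

def log_f(theta):
    # Hypothetical target: unnormalized standard normal, in log form for stability.
    return -0.5 * theta ** 2

rng = np.random.default_rng(3)
n, step = 20_000, 0.5
chain = np.empty(n)
theta = 20.0                                   # deliberately poor starting point
for i in range(n):
    proposal = theta + step * rng.normal()
    # Accept with probability min(f(proposal) / f(theta), 1), evaluated in logs.
    if np.log(rng.random()) <= log_f(proposal) - log_f(theta):
        theta = proposal
    chain[i] = theta

warmup = int(0.1 * n)                          # discard the first 10% of the chain
print("mean of first 50 samples  :", chain[:50].mean())
print("mean after dropping warmup:", chain[warmup:].mean())
```

The first samples are still drifting from the starting point toward the bulk of the distribution, so their mean is far from the target mean of 0. After the warmup portion is dropped, the sample mean is close to 0.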