Sequence Labeling with Structured SVM

February 5, 2018, AI 研习社, seaboat

Author: seaboat. This article was written exclusively for 雷锋网 AI 研习社 and may not be reproduced without permission from 雷锋网 AI 研习社.

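About SVM

SVM, the support vector machine, is most often used as a binary classification model. Its main ideas are:

1. It is the linear classifier with the largest margin in feature space;

2. When the samples are not linearly separable, a nonlinear mapping takes them from the low-dimensional input space into a high-dimensional feature space, where they can be separated linearly.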
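As an illustration of the second idea, here is a minimal C++ sketch. The one-dimensional data, the map `phi(x) = (x, x^2)`, and the function names are all assumptions made up for this example, not something from the original article. Points at x = -2 and x = 2 labeled +1 and a point at x = 0 labeled -1 cannot be split by a single threshold on x, but after mapping they can be split by a line in the 2-D feature space.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec2 = std::array<double, 2>;

// Hypothetical nonlinear map from the 1-D input space to a 2-D feature space.
Vec2 phi(double x) { return Vec2{{x, x * x}}; }

// Linear score in feature space: <w, phi(x)> + b.
double score(const Vec2& w, double b, double x) {
    const Vec2 f = phi(x);
    return w[0] * f[0] + w[1] * f[1] + b;
}

// Signed distance from phi(x) to the hyperplane <w, z> + b = 0.
// It is positive exactly when the sample lies on the side given by `label`.
double geometric_margin(const Vec2& w, double b, double x, int label) {
    const double norm = std::sqrt(w[0] * w[0] + w[1] * w[1]);
    assert(norm > 0.0);  // w must not be the zero vector
    return label * score(w, b, x) / norm;
}
```

With `w = (0, 1)` and `b = -2`, all three points get a geometric margin of 2, so the mapped data is linearly separable even though the raw inputs are not.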

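What Is "Structured"

In machine learning, problems can be grouped by their output space into:

Binary classification
Multiclass classification
Regression
Structured prediction

The first three are familiar and widely used. What sets the fourth apart is the structure: the first three output a single variable, such as a class label or a regression value, whereas structured prediction outputs a structured object. For example, the input may be a sentence and the output a parse tree. The structure can also be a graph, a sequence, and so on.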

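Structured SVM

Combining the SVM above with structured outputs gives the structured SVM. It extends the conventional SVM so that it can handle more complex structured data whose components depend on one another. Using a structured SVM involves two processes, learning and inference. As with most machine learning algorithms, learning determines the model parameters, and inference uses the learned model to predict the output for a given input.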

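Suppose we are given a training set $\{(x_i, y_i)\}_{i=1}^{n} \subset X \times Y$, where $X$ and $Y$ are two sets (the input space and the output space). A structured SVM uses these samples to train a function $F : X \times Y \to \mathbb{R}$ over input-output pairs. At prediction time, for a given input $x$, the prediction is the $y \in Y$ at which $F(x, y)$ attains its maximum.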

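The Learning Process

Learning from structured data means finding the discriminant function described above, so that once it is fixed, for a given input $x$ we output the $y \in Y$ that maximizes it. Assume the prediction function $f$ has the form

$$f(x; w) = \arg\max_{y \in Y} F(x, y; w)$$

where $F : X \times Y \to \mathbb{R}$ is the discriminant function, $w$ is the parameter vector, and $\Psi(x, y)$ is a feature representation of the input-output pair, that is, a feature vector built from the input and the output together. Its form depends on the problem at hand. $F(x, y; w)$ is usually assumed to be linear in $\Psi(x, y)$ and the parameter vector $w$, namely

$$F(x, y; w) = \langle w, \Psi(x, y) \rangle$$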
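To make the roles of Ψ, F and the argmax concrete, here is a minimal C++ sketch for sequence labeling. Everything in it is an assumption chosen for illustration: tokens and labels are small integers, `psi` counts token-label pairs and label-label transitions, and `predict` enumerates every label sequence by brute force, which is only feasible for tiny inputs (practical systems use dynamic programming instead).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Seq = std::vector<int>;

const int V = 2;  // hypothetical vocabulary size (token ids 0..V-1)
const int K = 2;  // hypothetical number of labels (label ids 0..K-1)

// Joint feature map Psi(x, y): V*K token-label counts, then K*K transition counts.
std::vector<double> psi(const Seq& x, const Seq& y) {
    assert(x.size() == y.size());
    std::vector<double> f(V * K + K * K, 0.0);
    for (std::size_t t = 0; t < x.size(); ++t) {
        f[x[t] * K + y[t]] += 1.0;                          // token x[t] carries label y[t]
        if (t > 0) f[V * K + y[t - 1] * K + y[t]] += 1.0;   // label y[t-1] followed by y[t]
    }
    return f;
}

// Discriminant function F(x, y; w) = <w, Psi(x, y)>.
double F(const std::vector<double>& w, const Seq& x, const Seq& y) {
    const std::vector<double> f = psi(x, y);
    assert(w.size() == f.size());
    double s = 0.0;
    for (std::size_t k = 0; k < f.size(); ++k) s += w[k] * f[k];
    return s;
}

// Prediction f(x; w) = argmax_y F(x, y; w), by enumerating all K^T label sequences.
Seq predict(const std::vector<double>& w, const Seq& x) {
    Seq y(x.size(), 0), best = y;
    double best_score = F(w, x, y);
    while (true) {
        // Advance y like an odometer in base K.
        std::size_t t = 0;
        while (t < y.size() && ++y[t] == K) { y[t] = 0; ++t; }
        if (t == y.size()) break;  // every position wrapped: all sequences visited
        const double s = F(w, x, y);
        if (s > best_score) { best_score = s; best = y; }
    }
    return best;
}
```

With a weight vector that rewards "token 0 with label 0" and "token 1 with label 1", `predict` labels the token sequence `{0, 1, 1}` as `{0, 1, 1}`.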

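Next we define a loss function $\Delta : Y \times Y \to \mathbb{R}$, which should satisfy $\Delta(y, y') = 0$ when $y = y'$ and $\Delta(y, y') > 0$ when $y \neq y'$. Over the $n$ training samples $(x_i, y_i)$ the empirical risk is then

$$R_S^{\Delta}(f) = \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, f(x_i; w))$$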
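As one concrete choice of Δ for sequence outputs, the sketch below uses the Hamming loss (the number of positions where two label sequences differ), which is 0 for identical sequences and positive otherwise. The choice of Hamming loss and the function names are assumptions for illustration; the article does not fix a particular Δ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Seq = std::vector<int>;

// Hamming loss Delta(y, y'): number of positions at which the labels differ.
double hamming_loss(const Seq& y_true, const Seq& y_pred) {
    assert(y_true.size() == y_pred.size());
    double loss = 0.0;
    for (std::size_t t = 0; t < y_true.size(); ++t)
        if (y_true[t] != y_pred[t]) loss += 1.0;
    return loss;
}

// Empirical risk: average loss of the predictions over the n training samples.
double empirical_risk(const std::vector<Seq>& truths, const std::vector<Seq>& preds) {
    assert(truths.size() == preds.size() && !truths.empty());
    double total = 0.0;
    for (std::size_t i = 0; i < truths.size(); ++i)
        total += hamming_loss(truths[i], preds[i]);
    return total / truths.size();
}
```

For two samples where the first prediction is exact and the second has one wrong label, the empirical risk is (0 + 1) / 2 = 0.5.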

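Our goal is therefore to find an $f$ that minimizes the empirical risk. The empirical risk may be driven to 0, in which case the following conditions hold:

$$\forall i,\ \forall y \in Y \setminus \{y_i\}:\ \langle w, \delta\Psi_i(y) \rangle > 0$$

where $\delta\Psi_i(y) \equiv \Psi(x_i, y_i) - \Psi(x_i, y)$. We solve for $w$ by margin maximization: among the $w$ that satisfy these conditions, we look for the one with the largest margin. With the constraints normalized to $\geq 1$, the distance between the two margin hyperplanes is $2 / \|w\|$, and maximizing $2 / \|w\|$ is equivalent to minimizing $\frac{1}{2}\|w\|^2$, so the problem can now be written in the familiar SVM form:

$$\min_{w}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad \forall i,\ \forall y \in Y \setminus \{y_i\}:\ \langle w, \delta\Psi_i(y) \rangle \geq 1$$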

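In practice, however, forcing the empirical risk to 0 can lead to overfitting, so we tolerate some misclassified training samples by introducing slack variables $\xi_i$. The optimization problem becomes:

$$\min_{w, \xi}\ \frac{1}{2}\|w\|^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i \quad \text{s.t.}\quad \forall i:\ \xi_i \geq 0;\quad \forall i,\ \forall y \in Y \setminus \{y_i\}:\ \langle w, \Psi(x_i, y_i) - \Psi(x_i, y) \rangle \geq 1 - \xi_i$$

where $C > 0$ controls the trade-off between a large margin and a small training error.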

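Letting the loss function enter the constraints, in one standard form known as margin rescaling, gives

$$\forall i,\ \forall y \in Y \setminus \{y_i\}:\ \langle w, \Psi(x_i, y_i) - \Psi(x_i, y) \rangle \geq \Delta(y_i, y) - \xi_i$$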
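A short worked step may help connect the slack variables to a loss that is being minimized. It assumes the margin-rescaling form of the constraints, $\langle w, \Psi(x_i, y_i) - \Psi(x_i, y) \rangle \geq \Delta(y_i, y) - \xi_i$ together with $\xi_i \geq 0$, and a slack penalty of $\frac{C}{n}\sum_i \xi_i$. For a fixed $w$, the smallest feasible $\xi_i$ is the largest violation over all outputs, so the constrained problem can be rewritten without constraints:

```latex
\xi_i^{*}(w)
  = \max\Bigl(0,\ \max_{y \in Y \setminus \{y_i\}}
      \bigl[\Delta(y_i, y) - \langle w, \Psi(x_i, y_i) - \Psi(x_i, y) \rangle\bigr]\Bigr)
  = \max_{y \in Y}
      \bigl[\Delta(y_i, y) - \langle w, \Psi(x_i, y_i) - \Psi(x_i, y) \rangle\bigr]

\min_{w}\ \frac{1}{2}\|w\|^{2} + \frac{C}{n} \sum_{i=1}^{n} \xi_i^{*}(w)
```

The second equality holds because the term for $y = y_i$ equals $0 - 0 = 0$, which already accounts for the outer $\max(0, \cdot)$. The inner maximization over $y$ is the same kind of search as prediction, with the loss added to the score.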

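So whether or not the empirical risk can reach 0, what remains is to solve the optimization problems above, that is, to find the optimal $w$ subject to the constraints in each formulation. Solving them is the hard part. When there are few samples and $Y$ has few states, a conventional quadratic programming solver can be used.

In practice, however, both the number of samples and the number of states are large, so the number of constraints becomes enormous: $n|Y| - n$ in total, where $n$ is the number of samples and $|Y|$ is the number of possible states of $y$. The problem is therefore converted to its dual form and trained with the cutting-plane method: instead of considering all constraints at once, the optimization starts from the unconstrained problem and adds selected constraints step by step, stopping once the desired precision is reached.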
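The constraint-selection step of such a cutting-plane scheme can be sketched as follows. This is only the separation step, not a full trainer: the re-solve of the quadratic program over the selected constraints is omitted. The sketch assumes margin-rescaling constraints of the form ⟨w, Ψ(x_i, y_i) − Ψ(x_i, y)⟩ ≥ Δ(y_i, y) − ξ_i, and it assumes the candidate outputs for one sample are given as an explicit list; the `Candidate` struct and `most_violated` function are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

double dot(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) s += a[k] * b[k];
    return s;
}

// One candidate output y for a fixed training sample (x_i, y_i).
struct Candidate {
    Vec psi;      // Psi(x_i, y)
    double loss;  // Delta(y_i, y)
};

// Returns the index of the candidate whose constraint
//   <w, Psi(x_i, y_i) - Psi(x_i, y)> >= Delta(y_i, y) - xi
// is violated the most, or -1 if no constraint is violated by more than eps.
int most_violated(const Vec& w, const Vec& psi_true,
                  const std::vector<Candidate>& candidates,
                  double xi, double eps) {
    int best = -1;
    double best_violation = eps;
    for (std::size_t j = 0; j < candidates.size(); ++j) {
        const double margin = dot(w, psi_true) - dot(w, candidates[j].psi);
        const double violation = candidates[j].loss - margin - xi;
        if (violation > best_violation) {
            best_violation = violation;
            best = static_cast<int>(j);
        }
    }
    return best;
}
```

A trainer built on this idea would add the returned constraint to a working set, re-solve the problem restricted to that working set, and repeat until `most_violated` returns -1 for every sample, which corresponds to stopping at precision `eps`.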

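IOB Tagging

A common tagging scheme is IOB: the first token of a chunk is tagged B, the remaining tokens inside the chunk are tagged I, and tokens outside any chunk are tagged O. B stands for Beginning, I for Inside, and O for Outside. For example:

我 明 天 去 北 京 。
O B I O B I O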
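The tagging rule can be written down in a few lines. The sketch below is a hypothetical helper (not part of any library) that turns chunks, given as half-open index ranges [begin, end), into one tag per token.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Converts chunks, given as half-open token ranges [begin, end),
// into an IOB tag string with one character per token.
std::string to_iob(std::size_t length,
                   const std::vector<std::pair<std::size_t, std::size_t>>& chunks) {
    std::string tags(length, 'O');  // everything starts outside any chunk
    for (const auto& c : chunks) {
        assert(c.first < c.second && c.second <= length);
        tags[c.first] = 'B';  // first token of the chunk
        for (std::size_t t = c.first + 1; t < c.second; ++t)
            tags[t] = 'I';    // remaining tokens inside the chunk
    }
    return tags;
}
```

For the six characters 我 明 天 去 北 京 with chunks [1, 3) for 明天 and [4, 6) for 北京, `to_iob(6, {{1, 3}, {4, 6}})` yields `OBIOBI`.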

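Implementation Example

The structured SVM sequence labeling below is implemented with the dlib library, and it is only a simple demonstration. The person names in "我昨天在学校看到小明" ("Yesterday I saw Xiaoming at school") and "小红刚刚才去晚自习" ("Xiaohong has only just gone to evening study") are annotated using the BIO tagging scheme, and after training the model is used to extract the person name from the sentence "我昨天在学校见到大东" ("Yesterday I met Dadong at school").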

The outputs are, respectively:

小 明
小 红
大 东

#include <iostream>
#include <cctype>
#include <dlib/svm_threaded.h>
#include <dlib/string.h>

using namespace std;
using namespace dlib;

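// Feature extractor used by dlib's sequence segmentation tools.
class feature_extractor
{
public:
  // A sequence is a sentence split into tokens (here, single characters).
  typedef std::vector<std::string> sequence_type;

  const static bool use_BIO_model = true;  // tag tokens with the B/I/O scheme
  const static bool use_high_order_features = true;
  const static bool allow_negative_weights = true;
  // Size of the sliding window of tokens examined around each position.
  unsigned long window_size()  const { return 3; }

  // Dimensionality of the per-token feature vector (a single indicator here).
  unsigned long num_features() const { return 1; }

  template <typename feature_setter>
  void get_features(
    feature_setter& set_feature,
    const sequence_type& sentence,
    unsigned long position
    ) const
  {
    // Fire feature 0 when the token at `position` is one of a few hand-picked
    // characters that often appear in names. The token must be indexed by
    // `position`; indexing sentence[0] would give every position the same feature.
    const std::string& token = sentence[position];
    if (token == "小" || token == "明" || token == "红" || token == "张" || token == "陈")
      set_feature(0);
  }
};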


// Builds the training set: each sample is a tokenized sentence, and each
// segment is a half-open range [begin, end) of token indices marking a name.
void make_training_examples(
  std::vector<std::vector<std::string> >& samples,
  std::vector<std::vector<std::pair<unsigned long, unsigned long> > >& segments
  )
{
  std::vector<std::pair<unsigned long, unsigned long> > names;

  samples.push_back(split("我 昨 天 在 学 校 看 到 小 明"));
  names.push_back(make_pair(8, 10));  // tokens 8 and 9: 小 明
  segments.push_back(names); names.clear();

  samples.push_back(split("小 红 刚 刚 才 去 晚 自 习"));
  names.push_back(make_pair(0, 2));   // tokens 0 and 1: 小 红
  segments.push_back(names); names.clear();

}

void print_segment(
  const std::vector<std::string>& sentence,
  const std::pair<unsigned long, unsigned long>& segment
  )
{
  for (unsigned long i = segment.first; i < segment.second; ++i)
    cout << sentence[i] << " ";
  cout << endl;
}

int main()
{
  std::vector<std::vector<std::string> > samples;
  std::vector<std::vector<std::pair<unsigned long, unsigned long> > > segments;
  make_training_examples(samples, segments);

  // Train the structured SVM segmenter on the labeled sentences.
  structural_sequence_segmentation_trainer<feature_extractor> trainer;

  sequence_segmenter<feature_extractor> segmenter = trainer.train(samples, segments);

  // Print the names found in each training sentence.
  for (unsigned long i = 0; i < samples.size(); ++i)
  {
    std::vector<std::pair<unsigned long, unsigned long> > seg = segmenter(samples[i]);
    for (unsigned long j = 0; j < seg.size(); ++j)
    {
      print_segment(samples[i], seg[j]);
    }
  }

  // Extract the name from a sentence that was not part of the training set.
  std::vector<std::string> sentence(split("我 昨 天 在 学 校 见 到 大 东"));
  std::vector<std::pair<unsigned long, unsigned long> > seg = segmenter(sentence);
  for (unsigned long j = 0; j < seg.size(); ++j)
  {
    print_segment(sentence, seg[j]);
  }

}
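
The segments in the listing are half-open index ranges [begin, end), as the loop in `print_segment` shows. To relate those ranges to BIO tags, here is a minimal sketch of a decoder that turns a tag string into such ranges. `bio_to_segments` is a hypothetical helper written for this illustration and is not part of dlib; it simply ignores an I tag that does not follow a B or I.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Segment = std::pair<unsigned long, unsigned long>;

// Decodes a BIO tag string (one character per token) into half-open
// segments [begin, end), the same representation the listing uses.
std::vector<Segment> bio_to_segments(const std::string& tags) {
    std::vector<Segment> segments;
    bool open = false;        // true while we are inside a segment
    unsigned long begin = 0;  // start index of the currently open segment
    for (unsigned long t = 0; t < tags.size(); ++t) {
        assert(tags[t] == 'B' || tags[t] == 'I' || tags[t] == 'O');
        const bool continues = open && tags[t] == 'I';
        if (open && !continues) {  // the open segment ends just before t
            segments.push_back(Segment(begin, t));
            open = false;
        }
        if (tags[t] == 'B') {      // a new segment starts at t
            begin = t;
            open = true;
        }
    }
    if (open)  // a segment that runs to the end of the sentence
        segments.push_back(Segment(begin, static_cast<unsigned long>(tags.size())));
    return segments;
}
```

For example, the tags `OOOOOOOOBI` decode to the single segment [8, 10) and `BIOOOOOOO` to [0, 2), which are the two ranges used as training labels in `make_training_examples`.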