I'm Floored! Recognizing CAPTCHAs with KNN: Another Trick Learned


May 28, 2020, CSDN

Author | Li Qiujian (李秋键)

Editor | Carol

Cover image | Visual China (视觉中国)

Produced by | CSDN (ID: CSDNnews)

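CAPTCHAs are the most common defense we meet in daily life against crawlers and automated bot logins. A typical CAPTCHA consists mainly of digits and letters, which raises a question: can we train a text-recognition model to read CAPTCHAs? We can, and today we will use KNN to do it.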

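Some KNN basics:

The KNN algorithm has three key elements. For a fixed training set, once these three are determined, the way the algorithm predicts is determined too. They are the choice of k, the distance metric, and the classification decision rule.

There is no fixed rule of thumb for choosing k. In general, pick a fairly small value based on how the samples are distributed, and use cross-validation to settle on a suitable k.

A small k means predicting from the training instances in a small neighborhood. The training error decreases, because only training instances close (or similar) to the input instance affect the prediction. The price is a larger generalization error: a smaller k makes the overall model more complex and prone to overfitting.

A large k means predicting from the training instances in a larger neighborhood. The advantage is that it can reduce the generalization error; the drawback is a larger training error. Training instances far from (dissimilar to) the input instance now also influence the prediction and can make it wrong. A larger k makes the overall model simpler.

At one extreme, k equals the number of samples m. Then there is effectively no classification at all: whatever the input instance, the prediction is simply the most frequent class in the training set, and the model is far too simple.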
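The effect of k can be seen in a few lines. The following is a minimal sketch in plain Python, not the OpenCV model used later in this article; the toy points, the Euclidean metric, and the function name `knn_predict` are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Classify query by majority vote among its k nearest training points."""
    distances = []
    for point, label in zip(train_points, train_labels):
        # Distance metric: Euclidean distance (one possible choice)
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(point, query)))
        distances.append((dist, label))
    # Keep the k closest training instances
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Decision rule: majority vote among those neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data: three "a" points near the origin, five "b" points farther away
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (6, 6), (7, 7)]
labels = ["a", "a", "a", "b", "b", "b", "b", "b"]

query = (0.5, 0.5)
small_k = knn_predict(points, labels, query, k=3)            # local neighborhood
full_k = knn_predict(points, labels, query, k=len(points))   # k equals m
```

With k=3 only the three nearby "a" points vote, so the query is labeled "a". With k equal to the number of samples every point votes, and the query gets the majority class "b" no matter where it lies.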

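The result looks like this:

Preparation

We use Python 3.6.5 and the following libraries: cv2 for image processing;

NumPy for matrix operations.

The training dataset looks like this:

Building the training model

1. Getting the contours of the characters to cut:

We define two lists, ws and valid_contours, to hold the widths of the detected contours and the contours that are kept. A wrongly split image has to be split again. Whether a split is wrong is judged mainly by the number of characters: if the cut produces 4 characters, all is well: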

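The code is as follows:

    import cv2
    import numpy as np


    # get_rect_box returns the position and size of each character to cut out
    def get_rect_box(contours):
        print("Extracting character contours...")
        # ws holds the widths of the kept contours, valid_contours the contours
        # themselves; a wrong split is corrected in the branches below
        ws = []
        valid_contours = []
        for contour in contours:
            # Bounding rectangle around one blob: x, y are the coordinates of
            # its top-left corner, w and h its width and height
            x, y, w, h = cv2.boundingRect(contour)
            # Blobs narrower than 7 pixels are skipped
            if w < 7:
                continue
            valid_contours.append(contour)
            ws.append(w)
        # Guard added here: min() and max() raise ValueError on an empty list
        if not ws:
            return []
        # w_min is the smallest width of a white region after binarization,
        # used to decide how to split
        w_min = min(ws)
        # w_max is the largest width
        w_max = max(ws)
        result = []
        # Boxes are built as int32 arrays with integer division; np.int0, used
        # in the original listing, is not available in recent NumPy releases.
        # 4 contours found: nothing to fix
        if len(valid_contours) == 4:
            for contour in valid_contours:
                x, y, w, h = cv2.boundingRect(contour)
                box = np.array([[x, y], [x+w, y], [x+w, y+h], [x, y+h]], dtype=np.int32)
                result.append(box)
        # 3 contours found: following the reference article, split the widest
        # one down the middle
        elif len(valid_contours) == 3:
            for contour in valid_contours:
                x, y, w, h = cv2.boundingRect(contour)
                if w == w_max:
                    box_left = np.array([[x, y], [x+w//2, y], [x+w//2, y+h], [x, y+h]], dtype=np.int32)
                    box_right = np.array([[x+w//2, y], [x+w, y], [x+w, y+h], [x+w//2, y+h]], dtype=np.int32)
                    result.append(box_left)
                    result.append(box_right)
                else:
                    box = np.array([[x, y], [x+w, y], [x+w, y+h], [x, y+h]], dtype=np.int32)
                    result.append(box)
        # 2 contours found: following the reference article, a contour that
        # holds 3 characters is cut into three equal parts horizontally
        elif len(valid_contours) == 2:
            for contour in valid_contours:
                x, y, w, h = cv2.boundingRect(contour)
                if w == w_max and w_max >= w_min * 2:
                    box_left = np.array([[x, y], [x+w//3, y], [x+w//3, y+h], [x, y+h]], dtype=np.int32)
                    box_mid = np.array([[x+w//3, y], [x+w*2//3, y], [x+w*2//3, y+h], [x+w//3, y+h]], dtype=np.int32)
                    box_right = np.array([[x+w*2//3, y], [x+w, y], [x+w, y+h], [x+w*2//3, y+h]], dtype=np.int32)
                    result.append(box_left)
                    result.append(box_mid)
                    result.append(box_right)
                elif w_max < w_min * 2:
                    # Both contours have similar width: halve each of them
                    box_left = np.array([[x, y], [x+w//2, y], [x+w//2, y+h], [x, y+h]], dtype=np.int32)
                    box_right = np.array([[x+w//2, y], [x+w, y], [x+w, y+h], [x+w//2, y+h]], dtype=np.int32)
                    result.append(box_left)
                    result.append(box_right)
                else:
                    box = np.array([[x, y], [x+w, y], [x+w, y+h], [x, y+h]], dtype=np.int32)
                    result.append(box)
        # 1 contour found: following the reference article, cut it into four
        # equal parts horizontally
        elif len(valid_contours) == 1:
            contour = valid_contours[0]
            x, y, w, h = cv2.boundingRect(contour)
            box0 = np.array([[x, y], [x+w//4, y], [x+w//4, y+h], [x, y+h]], dtype=np.int32)
            box1 = np.array([[x+w//4, y], [x+w*2//4, y], [x+w*2//4, y+h], [x+w//4, y+h]], dtype=np.int32)
            box2 = np.array([[x+w*2//4, y], [x+w*3//4, y], [x+w*3//4, y+h], [x+w*2//4, y+h]], dtype=np.int32)
            box3 = np.array([[x+w*3//4, y], [x+w, y], [x+w, y+h], [x+w*3//4, y+h]], dtype=np.int32)
            result.extend([box0, box1, box2, box3])
        # More than 4 contours: keep every box as it is
        elif len(valid_contours) > 4:
            for contour in valid_contours:
                x, y, w, h = cv2.boundingRect(contour)
                box = np.array([[x, y], [x+w, y], [x+w, y+h], [x, y+h]], dtype=np.int32)
                result.append(box)
        # Sort the boxes left to right by the x coordinate of the top-left corner
        result = sorted(result, key=lambda x: x[0][0])
        return result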
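The halving, three-way and four-way cuts in get_rect_box all follow one pattern, so the arithmetic can be checked in isolation. A minimal sketch, assuming a hypothetical helper named `split_box` (it is not part of the original script) and integer pixel coordinates:

```python
def split_box(x, y, w, h, parts):
    """Split the rectangle (x, y, w, h) into `parts` equal-width boxes.

    Each box lists its four corners in the same order as get_rect_box:
    top-left, top-right, bottom-right, bottom-left.
    """
    boxes = []
    for i in range(parts):
        # Integer division keeps every coordinate on a whole pixel
        left = x + w * i // parts
        right = x + w * (i + 1) // parts
        boxes.append([[left, y], [right, y], [right, y + h], [left, y + h]])
    return boxes


# A 40-pixel-wide blob that holds two touching characters
halves = split_box(10, 5, 40, 20, parts=2)
# A single 50-pixel-wide blob that holds all four characters
quarters = split_box(0, 0, 50, 20, parts=4)
```

Neighboring boxes share an edge, so no column of pixels is lost between characters, and the last box always ends at x + w.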

2. Processing the dataset images:

After reading the dataset, we binarize and denoise the images to get training data better suited to the task.

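The code is as follows:

    import cv2
    import numpy as np


    def process_im(im):
        rows, cols, ch = im.shape
        # Convert to grayscale
        im_gray = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
        # Binarize into a black-and-white image: characters become white,
        # the background black
        ret, im_inv = cv2.threshold(im_gray, 127, 255, cv2.THRESH_BINARY_INV)
        # Denoise with a Gaussian blur, which is a convolution of the image with
        # a Gaussian kernel. It removes small specks: binarization is never
        # perfect, and denoising improves its result
        kernel = 1/16*np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]])
        im_blur = cv2.filter2D(im_inv, -1, kernel)
        # Binarize once more
        ret, im_res = cv2.threshold(im_blur, 127, 255, cv2.THRESH_BINARY)
        return im_res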
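To see why blurring and then binarizing again removes specks, the same three steps can be traced on a tiny image. This is a sketch in plain Python rather than OpenCV. The helper names are hypothetical, pixels outside the image are treated as 0 (cv2.filter2D uses a reflected border by default, so edge pixels can differ), and no rounding is applied after the convolution.

```python
KERNEL = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]  # Gaussian weights, divided by 16 below


def threshold(image, thresh, invert=False):
    """Binarize: pixels above thresh become 255 (or 0 when invert is True)."""
    high, low = (0, 255) if invert else (255, 0)
    return [[high if px > thresh else low for px in row] for row in image]


def blur(image):
    """Convolve with the 3x3 Gaussian kernel, treating outside pixels as 0."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        total += KERNEL[dr + 1][dc + 1] * image[rr][cc]
            out[r][c] = total / 16
    return out


# Grayscale input: dark (0) strokes on a light (255) background.
# Rows 0-2, columns 0-2 form a solid 3x3 stroke; (4, 4) is a one-pixel speck.
gray = [[255] * 6 for _ in range(6)]
for r in range(3):
    for c in range(3):
        gray[r][c] = 0
gray[4][4] = 0

inverted = threshold(gray, 127, invert=True)   # strokes become white
cleaned = threshold(blur(inverted), 127)       # blur, then binarize again
```

The isolated speck keeps only its center weight, 4/16 of 255, which falls below the threshold of 127 and disappears. Pixels of the solid stroke are surrounded by white neighbors, stay above the threshold, and survive.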

3. Cutting out the characters:

Once we have the character positions, we cut the image and save the pieces.

Part of the code is as follows:

    import glob
    import os
    import time

    import cv2


    # With the positions and sizes from the first function, the image can be cut
    def split_code(filepath):
        # Get the image file name
        filename = filepath.split("/")[-1]
        # The file name (without extension) is the label
        filename_ts = filename.split(".")[0]
        im = cv2.imread(filepath)
        im_res = process_im(im)
        # findContours returns 3 values in OpenCV 3.x but 2 in 4.x; the
        # contours are the second-to-last item in both
        contours = cv2.findContours(im_res, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]

        # Use the first function to get the positions and sizes to cut
        boxes = get_rect_box(contours)

        # If four characters were not found, report the file and do not cut it
        if len(boxes) != 4:
            print(filepath)
            return

        # Four characters were found, so the split is taken as correct and the
        # image is cut. The pieces are saved in the char folder
        for box in boxes:
            cv2.drawContours(im, [box], 0, (0, 0, 255), 2)
            roi = im_res[box[0][1]:box[3][1], box[0][0]:box[1][0]]
            roistd = cv2.resize(roi, (30, 30))
            timestamp = int(time.time() * 1e6)
            filename = "{}.jpg".format(timestamp)
            filepath = os.path.join("char", filename)
            cv2.imwrite(filepath, roistd)
        #cv2.imshow("image", im)
        #cv2.waitKey(0)
        #cv2.destroyAllWindows()


    # Split all CAPTCHA images in the training set by calling split_code
    def split_all():
        # TRAIN_DIR, the folder of training images, is not defined in this excerpt
        files = os.listdir(TRAIN_DIR)
        for filename in files:
            filename_ts = filename.split(".")[0]
            patt = "label/{}_*".format(filename_ts)
            saved_chars = glob.glob(patt)
            # Build the path first; the original printed it before assigning it
            filepath = os.path.join(TRAIN_DIR, filename)
            if len(saved_chars) == 4:
                print("{} done".format(filepath))
                continue
            split_code(filepath)
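The slice that cuts each character out of im_res is easy to misread, because image arrays are indexed rows first (y), then columns (x), while the box corners are stored as [x, y]. A minimal sketch of the same indexing on a nested list, assuming a hypothetical helper named `crop`:

```python
def crop(image, box):
    """Cut the region described by box out of a row-major image.

    box lists four [x, y] corners: top-left, top-right, bottom-right,
    bottom-left, as produced by get_rect_box.
    """
    top = box[0][1]      # y of the top-left corner
    bottom = box[3][1]   # y of the bottom-left corner
    left = box[0][0]     # x of the top-left corner
    right = box[1][0]    # x of the top-right corner
    # Rows are selected first, then the columns inside each row
    return [row[left:right] for row in image[top:bottom]]


# A 4x6 test image whose pixel value encodes its position: row * 10 + column
image = [[r * 10 + c for c in range(6)] for r in range(4)]
box = [[1, 1], [4, 1], [4, 3], [1, 3]]
roi = crop(image, box)
```

The box spans x from 1 to 4 and y from 1 to 3, so the crop keeps rows 1 and 2 and columns 1 through 3. As with NumPy slicing, the end indices are exclusive.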

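4. Labeling the characters:

We read the labels from the character dataset that has already been labeled, then store them so that each label matches its image. The character dataset looks like this: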

The code is as follows:

    import glob
    import os
    import sys

    import cv2


    # Label the single-character images. In the label folder the part after "_"
    # is the label: for an image showing the digit 6, "_" is followed by 6
    def label_data():
        files = os.listdir("char")
        for filename in files:
            filename_ts = filename.split(".")[0]
            patt = "label/{}_*".format(filename_ts)
            saved_num = len(glob.glob(patt))
            # Already labeled: skip
            if saved_num == 1:
                print("{} done".format(patt))
                continue
            filepath = os.path.join("char", filename)
            im = cv2.imread(filepath)
            cv2.imshow("image", im)
            key = cv2.waitKey(0)
            # Esc (27) quits
            if key == 27:
                sys.exit()
            # Enter (13) skips this image
            if key == 13:
                continue
            # Any other key is taken as the character shown in the image
            char = chr(key)
            filename_ts = filename.split(".")[0]
            outfile = "{}_{}.jpg".format(filename_ts, char)
            outpath = os.path.join("label", outfile)
            cv2.imwrite(outpath, im)


    # The reverse of labeling: the program has to learn each character's name,
    # which is the part of the file name after "_"
    def analyze_label():
        print("Reading data labels...")
        files = os.listdir("label")
        label_count = {}
        for filename in files:
            label = filename.split(".")[0].split("_")[1]
            label_count.setdefault(label, 0)
            label_count[label] += 1
        print(label_count)

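5. Training the KNN model:

For the KNN algorithm we simply use the KNN implementation that ships with OpenCV. We read the dataset and its labels, then train the model. The code is as follows:

    import cv2
    import numpy as np


    # Train the model with the k-nearest-neighbors algorithm, then recognize im
    def get_code(im):
        # Read the images and labels
        print("Reading dataset and labels...")
        # load_data() is not listed in this article; see the linked source code
        [samples, label_ids, id_label_map] = load_data()
        # k-nearest-neighbors model
        print("Initializing...")
        model = cv2.ml.KNearest_create()
        # Start training
        print("Training the model, please wait!")
        model.train(samples, cv2.ml.ROW_SAMPLE, label_ids)
        # Process the image: binarize and denoise
        im_res = process_im(im)
        # Extract the contours; findContours returns 3 values in OpenCV 3.x but
        # 2 in 4.x, and the contours are the second-to-last item in both
        contours = cv2.findContours(im_res, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
        # Get the position and size of each region to cut
        boxes = get_rect_box(contours)
        # If 4 characters were not found, stop here
        if len(boxes) != 4:
            print("cannot get code")
            return []
        result = []
        # 4 characters were split correctly, so run the trained model on each
        for box in boxes:
            # Crop the character region
            roi = im_res[box[0][1]:box[3][1], box[0][0]:box[1][0]]
            # Resize it to 30x30
            roistd = cv2.resize(roi, (30, 30))
            # Flatten the image into a 1x900 float32 row of pixels
            sample = roistd.reshape((1, 900)).astype(np.float32)
            # Ask the trained model for the 3 nearest neighbors
            ret, results, neighbours, distances = model.findNearest(sample, k=3)
            # Get the predicted label id
            label_id = int(results[0, 0])
            # Map the id back to the recognized character
            label = id_label_map[label_id]
            # Store the result
            result.append(label)
        return result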
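get_code relies on load_data(), which this article does not list. As a rough idea of what the label half of such a function might do, the sketch below maps file names of the form `<timestamp>_<label>.jpg` to integer ids and back. The function name `build_label_maps` and the choice of sorted, zero-based ids are assumptions, not the author's implementation, and loading the pixel data into a float32 sample matrix is left out.

```python
def build_label_maps(filenames):
    """Return (label_ids, id_label_map) for files named <timestamp>_<label>.jpg."""
    # Same parsing as analyze_label: the label is the part after "_"
    labels = [name.split(".")[0].split("_")[1] for name in filenames]
    # Assign ids in sorted order so the mapping is reproducible (an assumption)
    id_label_map = dict(enumerate(sorted(set(labels))))
    label_id_map = {label: idx for idx, label in id_label_map.items()}
    # One integer id per file, in the same order as filenames
    label_ids = [label_id_map[label] for label in labels]
    return label_ids, id_label_map


# Hypothetical file names, for illustration only
files = ["1590000000000001_6.jpg", "1590000000000002_a.jpg", "1590000000000003_6.jpg"]
label_ids, id_label_map = build_label_maps(files)
```

The KNN model only deals with numeric responses, which is why get_code converts the predicted id back to a character through id_label_map. Before training with OpenCV, the ids would still need to be turned into a NumPy array in a numeric type that the model accepts.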

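Calling the model

    import os

    import cv2
    import numpy as np
    from PIL import Image, ImageDraw, ImageFont

    if __name__ == "__main__":
        # Pick one image from the test folder
        file = os.listdir("test")
        filepath = "test/" + file[4]
        im = cv2.imread(filepath)
        preds = get_code(im)
        # Join the first four recognized characters into one string
        preds = "Recognition result: " + "".join(preds[:4])
        print(preds)
        canny0 = im
        # Convert to a PIL image so the text can be drawn with a TrueType font
        img_PIL = Image.fromarray(cv2.cvtColor(canny0, cv2.COLOR_BGR2RGB))
        myfont = ImageFont.truetype(r'simfang.ttf', 18)
        draw = ImageDraw.Draw(img_PIL)
        draw.text((20, 5), str(preds), font=myfont, fill=(255, 23, 140))
        # Convert back to an OpenCV image and show it
        img_OpenCV = cv2.cvtColor(np.asarray(img_PIL), cv2.COLOR_RGB2BGR)
        cv2.imshow("frame", img_OpenCV)
        key = cv2.waitKey(0)
        print(filepath)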

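At this point the whole program is in place. Its output is shown below:

Source code:

Link: https://pan.baidu.com/s/1Ir5QNjUZaeTW26T8Gb3txQ

Extraction code: 9eqa

About the author:

Li Qiujian (李秋键) is a CSDN blog expert and the author of a CSDN Master Class (达人课). He is pursuing a master's degree at China University of Mining and Technology, and his development work includes award-winning entries in TapTap competitions, among others.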

【END】
