Hands-on: How to Analyze and Predict Urban Air Quality with Python and the KNN Algorithm

July 6, 2020, CSDN

Author | 李秋键 (Li Qiujian)

Editor | 伍杏玲 (Wu Xingling)

Cover image | Paid download by CSDN from 东方 IC

Produced by | CSDN (ID: CSDNnews)

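As China's industry and technology have developed, air quality in some of its developed cities has become an increasingly serious problem, and the most serious part of it is the pollution caused by PM2.5.

This article scrapes publicly available air quality data from the web, focusing on cities with comparatively poor air such as Tianjin, Beijing, and Guangzhou, and compares and analyzes the Tianjin data in particular. From that analysis we describe how air quality changed over the year and offer some observations. We then use a machine learning algorithm to predict the air quality grade from the data, following the typical big-data pattern of analyzing first and predicting second.

The overall analysis workflow is shown in the flowchart below: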

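Preparation before the experiment

1.1 Data acquisition

The data comes from publicly available air quality data on the "天气后报" (Tianqihoubao) website, at http://www.tianqihoubao.com/aqi/tianjin.html. The page content is shown in the figure below:

Figure 1-1 Data shown on the website

All of the data is scraped with Python. The steps are as follows: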

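(1) Import the libraries the scraper needs:

This code is in air_tianjin_2019.py.

Requests is an HTTP library written in Python, built on urllib3 and released under the Apache 2.0 open-source license. It is more convenient than urllib, saves a great deal of work, and fully covers HTTP testing needs.

BeautifulSoup is a flexible and convenient HTML parsing library that is efficient and supports multiple parsers. With it you can extract information from web pages without writing regular expressions.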

The corresponding code is as follows:

    import time

    import requests
    from bs4 import BeautifulSoup

(2) To avoid being blocked by the site's anti-scraping measures, we send a browser-like User-Agent header when requesting the data:

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    }

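(3) Then fetch the air quality data for all of 2019, one monthly page per request:

    for i in range(1, 13):
        time.sleep(5)  # pause between requests so the site is not hammered
        # Monthly pages are named like tianjin-201901.html
        url = 'http://www.tianqihoubao.com/aqi/tianjin-2019' + str("%02d" % i) + '.html'
        response = requests.get(url=url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        tr = soup.find_all('tr')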
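
The URL pattern used in the loop can be factored into a small helper. The sketch below is only an illustration: the function name monthly_urls and its city and year parameters are hypothetical and not part of the original script, and it builds the URLs without sending any request.

```python
def monthly_urls(city, year):
    """Build the 12 monthly AQI page URLs for one city and year.

    Hypothetical helper: it follows the URL pattern used in the scraping
    loop (e.g. .../aqi/tianjin-201901.html) and does no network access.
    """
    base = 'http://www.tianqihoubao.com/aqi/'
    return ['{}{}-{}{:02d}.html'.format(base, city, year, month)
            for month in range(1, 13)]


urls = monthly_urls('tianjin', 2019)
print(urls[0])   # URL of the January page
print(urls[-1])  # URL of the December page
```

Keeping the pattern in one place makes it easier to reuse the same loop for other cities, assuming their pages follow the same naming scheme.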

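1.2 Data preprocessing

The raw page content contains HTML tags and other noise, so we strip the tags and keep only the text of each table cell:

        # Still inside the "for i in range(1, 13)" loop from step (3)
        for j in tr[1:]:  # skip the header row
            td = j.find_all('td')
            Date = td[0].get_text().strip()
            Quality_grade = td[1].get_text().strip()
            AQI = td[2].get_text().strip()
            AQI_rank = td[3].get_text().strip()
            PM = td[4].get_text().strip()
            # Append one comma-separated row per day
            with open('air_tianjin_2019.csv', 'a+', encoding='utf-8-sig') as f:
                f.write(Date + ',' + Quality_grade + ',' + AQI + ',' + AQI_rank + ',' + PM + '\n')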
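
The cell-extraction step can be tried offline. The sketch below is a stand-in, not the article's code: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without third-party packages, and the HTML snippet and its values are made up for illustration. It also writes the rows with csv.writer rather than string concatenation.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the stripped text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being parsed
        self._cell = None  # text fragments of the cell being parsed

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            if self._row:  # rows without <td> cells (the header) are skipped
                self.rows.append(self._row)
            self._row = None


# Made-up sample markup; the real page layout may differ
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>Grade</th><th>AQI</th><th>Rank</th><th>PM2.5</th></tr>
  <tr><td> 2019-01-01 </td><td> 良 </td><td> 75 </td><td> 180 </td><td> 50 </td></tr>
  <tr><td> 2019-01-02 </td><td> 轻度污染 </td><td> 120 </td><td> 250 </td><td> 90 </td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows

# Write the rows as CSV into an in-memory buffer instead of a file
buffer = io.StringIO()
csv.writer(buffer, lineterminator='\n').writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Using csv.writer has the side benefit that a cell containing a comma would be quoted correctly, which plain string concatenation does not handle.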

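Part of the scraped data is shown below:

Table 1-1 Part of the scraped Tianjin data

Each row holds the date and the quality grade, followed by the AQI index, that day's AQI ranking, and the PM2.5 value.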

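Data analysis

The analysis here is done mainly through visualization: we plot the data and read the trends from the charts.

(1) Tianjin AQI trend over the full year

The code is in air_tianjin_2019_AQI.py.

The trend chart is drawn with the pyecharts library.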

First, read the data scraped earlier:

    import pandas as pd

    df = pd.read_csv('air_tianjin_2019.csv', header=None, names=["Date", "Quality_grade", "AQI", "AQI_rank", "PM"])

Then take the date and AQI columns and store them in variables for plotting:

    attr = df['Date']
    v1 = df['AQI']

Next, set the title, draw the curve, and save it as a web page:

    from pyecharts import Line  # this call style is the pyecharts 0.5.x API

    line = Line("2019年天津AQI全年走势图", title_pos='center', title_top='18', width=800, height=400)
    line.add("", attr, v1, mark_line=['average'], is_fill=True, area_color="#000", area_opacity=0.3,
             mark_point=["max", "min"], mark_point_symbol="circle", mark_point_symbolsize=25)
    line.render("2019年天津AQI全年走势图.html")

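The resulting chart is shown below:

Figure 2-2 Tianjin AQI trend over the full year, 2019

Figure 2-2 shows that in 2019 Tianjin's AQI peaked (that is, the air was at its worst) in January, February, November, and December, so poor air quality was concentrated in winter and spring. Possible reasons are that air circulation is poorer in these seasons and that they contain many holidays, when fireworks and heavy road and passenger traffic make the air worse.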

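(2) Tianjin monthly mean AQI trend

The code is in air_tianjin_2019_AQI_month.py.

To show how average air quality changed from month to month, we plot a monthly mean trend chart.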

First, read the data in the same way:

    import pandas as pd

    df = pd.read_csv('air_tianjin_2019.csv', header=None, names=["Date", "Quality_grade", "AQI", "AQI_rank", "PM"])

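Next, take the date and AQI columns and extract the month from each date by splitting it on "-":

    dom = df[['Date', 'AQI']]
    list1 = []
    for j in dom['Date']:
        month = j.split('-')[1]  # the second field of the date is the month, e.g. "01"
        list1.append(month)
    df['month'] = list1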

Then compute the mean AQI for each month:

    import numpy as np

    month_message = df.groupby(['month'])
    month_com = month_message['AQI'].agg(['mean'])
    month_com.reset_index(inplace=True)
    month_com_last = month_com.sort_index()
    attr = ["{}".format(str(i) + '月') for i in range(1, 13)]  # x-axis labels "1月" to "12月"
    v1 = np.array(month_com_last['mean'])
    v1 = ["{}".format(int(i)) for i in v1]  # truncate the means to integers for display

Then draw the trend chart:

    line = Line("2019年天津月均AQI走势图", title_pos='center', title_top='18', width=800, height=400)
    line.add("", attr, v1, mark_point=["max", "min"])
    line.render("2019年天津月均AQI走势图.html")

The resulting chart is shown below:

Figure 2-3 Tianjin monthly mean AQI trend, 2019
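
The monthly averaging step can also be reproduced on a handful of values to check the logic by hand. The sketch below uses only the standard library instead of pandas, and the (date, AQI) pairs are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Made-up (date, AQI) pairs used only to illustrate the grouping step
samples = [
    ('2019-01-01', 80), ('2019-01-02', 120),
    ('2019-02-01', 60), ('2019-02-02', 90), ('2019-02-03', 120),
]

# Group the AQI values by the month field of the date string
by_month = defaultdict(list)
for date_text, aqi in samples:
    month = date_text.split('-')[1]
    by_month[month].append(aqi)

# Average each group, in month order
monthly_mean = {month: mean(values) for month, values in sorted(by_month.items())}
print(monthly_mean)
```

This mirrors what groupby followed by a mean aggregation does: collect the values that share a key, then reduce each group to one number.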

(3) Tianjin quarterly AQI box plot

The code is in air_tianjin_2019_AQI_season.py.

The steps for drawing the quarterly air quality box plot for Tianjin are as follows.

Read the scraped data:

    import pandas as pd

    df = pd.read_csv('air_tianjin_2019.csv', header=None, names=["Date", "Quality_grade", "AQI", "AQI_rank", "PM"])

Next, group the AQI values by month into four quarters:

    dom = df[['Date', 'AQI']]
    data = [[], [], [], []]
    dom1, dom2, dom3, dom4 = data
    for i, j in zip(dom['Date'], dom['AQI']):
        month = i.split('-')[1]
        if month in ['01', '02', '03']:
            dom1.append(j)
        elif month in ['04', '05', '06']:
            dom2.append(j)
        elif month in ['07', '08', '09']:
            dom3.append(j)
        else:
            dom4.append(j)
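
The if/elif chain can be replaced by integer arithmetic on the month number. The following is a minimal sketch of that alternative; the helper name quarter_of is hypothetical and the sample values are made up.

```python
def quarter_of(date_text):
    """Return the quarter (1 to 4) for a date string such as '2019-05-17'."""
    month = int(date_text.split('-')[1])
    return (month - 1) // 3 + 1  # months 1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4


# Made-up (date, AQI) pairs
samples = [
    ('2019-01-15', 150), ('2019-05-17', 70), ('2019-08-02', 55),
    ('2019-12-30', 180), ('2019-03-31', 95),
]

quarters = [[], [], [], []]
for date_text, aqi in samples:
    quarters[quarter_of(date_text) - 1].append(aqi)
print(quarters)
```

The arithmetic version avoids listing month strings by hand, which removes one place where a typo could silently send a month to the wrong quarter.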

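Then set the title and the x- and y-axis data, and draw the box plot:

    from pyecharts import Boxplot  # this call style is the pyecharts 0.5.x API

    boxplot = Boxplot("2019年天津季度AQI箱形图", title_pos='center', title_top='18', width=800, height=400)
    x_axis = ['第一季度', '第二季度', '第三季度', '第四季度']
    y_axis = [dom1, dom2, dom3, dom4]
    _yaxis = boxplot.prepare_data(y_axis)  # convert the raw values into box-plot statistics
    boxplot.add("", x_axis, _yaxis)
    boxplot.render("2019年天津季度AQI箱形图.html")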

The resulting box plot is shown below:

Figure 2-4 Tianjin quarterly AQI box plot, 2019
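
Each box in a box plot summarizes one group by five numbers: minimum, first quartile, median, third quartile, and maximum. The sketch below computes such a summary with the standard library. It is an illustration under assumptions: the helper name is hypothetical, the input values are made up, and pyecharts may interpolate quartiles differently, so its numbers can differ slightly.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(values)
    # 'inclusive' treats the data as the whole population, so the
    # quartiles always lie between the smallest and largest value
    q1, median, q3 = quantiles(ordered, n=4, method='inclusive')
    return ordered[0], q1, median, q3, ordered[-1]


summary = five_number_summary([9, 1, 5, 3, 7, 2, 8, 4, 6])
print(summary)
```

Reading the quarterly boxes side by side then comes down to comparing these five numbers across the four groups.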

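KNN prediction

The code falls into two parts: the test.py program, which converts the CSV file into a TXT file in the format the classifier expects, and the KNN (k-nearest neighbors) classifier, which classifies the data.

(1) Generating the TXT data

The code is in test.py.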

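First read the data and store it in the lists X and Y. Because the label values are Chinese grade names, they have to be converted to numbers:

    import numpy as np
    import pandas as pd

    # File name
    FILENAME1 = "air_tianjin_2019.csv"

    # Disable scientific notation
    pd.set_option('display.float_format', lambda x: '%.3f' % x)
    np.set_printoptions(threshold=np.inf)

    # Read the data (the CSV written by the scraper has no header row)
    data = pd.read_csv(FILENAME1, header=None)
    rows, cols = data.shape

    # Convert the DataFrame to an array
    DataArray = data.values

    # Map the Chinese quality grades to integer labels
    grade_to_label = {"良": 0, "轻度污染": 1, "优": 2, "严重污染": 3, "重度污染": 4}
    X, Y = [], []
    for row in DataArray:
        # Rows whose grade is not in the mapping are skipped so X and Y stay aligned
        if row[1] in grade_to_label:
            X.append(row[2:5])  # features: AQI, AQI rank, PM2.5
            Y.append(grade_to_label[row[1]])
    print(Y)
    print(len(Y))
    print(X[1])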
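
The grade-to-number conversion is a plain label encoding. The sketch below shows it as a dictionary lookup that also reports grades missing from the mapping. The numeric codes are the ones used in the article; the function name encode_grades and the sample input are hypothetical. Note that 中度污染 (moderate pollution) is a standard AQI grade that the article's mapping does not list.

```python
# Codes taken from the article's conversion step
GRADE_TO_LABEL = {"良": 0, "轻度污染": 1, "优": 2, "严重污染": 3, "重度污染": 4}


def encode_grades(grades):
    """Return (labels, unknown): integer codes plus any grades not in the mapping."""
    labels, unknown = [], []
    for grade in grades:
        if grade in GRADE_TO_LABEL:
            labels.append(GRADE_TO_LABEL[grade])
        else:
            unknown.append(grade)
    return labels, unknown


# Made-up input containing one grade the mapping does not cover
labels, unknown = encode_grades(["优", "良", "中度污染", "重度污染"])
print(labels)
print(unknown)
```

Reporting the unmapped grades makes it visible when labels and feature rows would otherwise fall out of step.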

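Then write the stored data to a TXT file, taking care to add the "," separators and the line breaks:

    # Write one "feature1,feature2,feature3,label" line per sample.
    # Mode "a+" appends, so delete data.txt before re-running to avoid duplicate rows.
    with open("data.txt", "a+") as f:
        for i in range(len(Y)):
            for j in range(3):
                f.write(str(X[i][j]) + ",")
            f.write(str(Y[i]) + "\n")
    print("data.txt数据生成")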

(2) KNN classification

The code is in KNearestNeighbor.py.

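First, read the data:

    import csv
    import random

    # This and the following functions are methods of the classifier class in KNearestNeighbor.py
    def loadDataset(self, filename, split, trainingSet, testSet):  # split is the probability that a row goes to the training set
        with open(filename, 'r') as csvfile:
            lines = csv.reader(csvfile)  # read all rows
            dataset = list(lines)        # convert to a list
            for x in range(len(dataset)):
                for y in range(3):
                    dataset[x][y] = float(dataset[x][y])  # the three features become floats; the label stays a string
                if random.random() < split:  # randomly assign each row to train or test
                    trainingSet.append(dataset[x])
                else:
                    testSet.append(dataset[x])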

Define the function that computes the distance:

    import math

    def calculateDistance(self, testdata, traindata, length):  # Euclidean distance
        distance = 0  # length is the number of feature dimensions
        for x in range(length):
            distance += pow(testdata[x] - traindata[x], 2)
        return math.sqrt(distance)
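
The distance used here is the Euclidean distance: the square root of the sum of squared differences between corresponding features. A small worked example, using made-up points and a hypothetical function name, makes the formula easy to check by hand, since the point (3, 4, 0) lies at distance 5 from the origin.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# 3-4-5 right triangle: sqrt(3^2 + 4^2 + 0^2) = 5
d = euclidean_distance([3.0, 4.0, 0.0], [0.0, 0.0, 0.0])
print(d)
```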

For each test sample, compute its distance to every training sample and keep the k nearest training samples as its neighbors:

    import operator

    def getNeighbors(self, trainingSet, testInstance, k):  # return the k nearest neighbors
        distances = []
        length = len(testInstance) - 1  # the last element is the label, not a feature
        for x in range(len(trainingSet)):  # distance from the test sample to each training sample
            dist = self.calculateDistance(testInstance, trainingSet[x], length)
            print('训练集:{}-距离:{}'.format(trainingSet[x], dist))
            distances.append((trainingSet[x], dist))
        distances.sort(key=operator.itemgetter(1))  # sort by distance, ascending
        print(distances)
        neighbors = []
        for x in range(k):  # after sorting, take the first k entries
            neighbors.append(distances[x][0])
            print(neighbors)
        return neighbors
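
Sorting every distance works, but only the k smallest are needed. The sketch below is an alternative formulation rather than the article's code: it uses heapq.nsmallest and math.dist from the standard library (Python 3.8 or later), with a hypothetical function name and made-up rows in the same "three features plus label" layout.

```python
import heapq
import math


def k_nearest(training_set, test_instance, k):
    """Return the k training rows closest to test_instance.

    Each row is [feature, feature, feature, label]; the label is ignored
    when measuring distance.
    """
    n_features = len(test_instance) - 1

    def distance_to_test(row):
        return math.dist(row[:n_features], test_instance[:n_features])

    return heapq.nsmallest(k, training_set, key=distance_to_test)


# Made-up training rows: two clusters with labels '0' and '1'
train = [
    [1.0, 1.0, 1.0, '0'],
    [2.0, 2.0, 2.0, '0'],
    [9.0, 9.0, 9.0, '1'],
    [10.0, 10.0, 10.0, '1'],
]
neighbors = k_nearest(train, [1.5, 1.0, 1.0, '0'], k=2)
print(neighbors)
```

For a data set of a few hundred rows the difference in speed is negligible, so this is mainly a matter of brevity.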

The decision function uses a majority vote among the neighbors to decide which class the sample belongs to:

    def getResponse(self, neighbors):  # majority vote among the neighbors
        classVotes = {}
        for x in range(len(neighbors)):
            response = neighbors[x][-1]  # the label is the last element; count the votes per class
            if response in classVotes:
                classVotes[response] += 1
            else:
                classVotes[response] = 1
        print(classVotes.items())
        sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)  # sort by vote count, descending
        return sortedVotes[0][0]
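
The same majority vote can be written more compactly with collections.Counter. This is a sketch of an equivalent formulation, with a hypothetical function name and made-up neighbor rows; when two labels tie, the label encountered first wins.

```python
from collections import Counter


def majority_label(neighbors):
    """Return the most frequent label among the neighbor rows."""
    votes = Counter(row[-1] for row in neighbors)  # the label is the last element
    return votes.most_common(1)[0][0]


# Made-up neighbors: two votes for '0', one vote for '1'
neighbors = [
    [1.0, 1.0, 1.0, '0'],
    [2.0, 2.0, 2.0, '1'],
    [1.5, 1.0, 1.0, '0'],
]
winner = majority_label(neighbors)
print(winner)
```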

Compute the model's accuracy:

    def getAccuracy(self, testSet, predictions):  # accuracy as a percentage
        correct = 0
        for x in range(len(testSet)):
            if testSet[x][-1] == predictions[x]:  # compare each prediction with the actual label
                correct += 1
        print('共有{}个预测正确，共有{}个测试数据'.format(correct, len(testSet)))
        return (correct / float(len(testSet))) * 100.0
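
Accuracy here is simply the share of test samples whose predicted label equals the actual label, expressed as a percentage. The sketch below shows the same calculation on made-up labels; the function name is hypothetical, and the input checks are an addition that the article's method does not have.

```python
def accuracy_percent(actual, predicted):
    """Percentage of positions where the predicted label equals the actual one."""
    if len(actual) != len(predicted):
        raise ValueError('actual and predicted must have the same length')
    if not actual:
        raise ValueError('cannot compute accuracy of an empty test set')
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * correct / len(actual)


# Made-up labels: three of the four predictions match
score = accuracy_percent(['0', '1', '0', '2'], ['0', '1', '1', '2'])
print(score)
```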

Finally, the Run method ties everything together: it splits the data, sets k, classifies every test sample, and reports the accuracy:

    def Run(self):
        trainingSet = []
        testSet = []
        split = 0.75  # each row goes to the training set with probability 0.75
        self.loadDataset(r'data.txt', split, trainingSet, testSet)  # split the data
        print('Train set: ' + str(len(trainingSet)))
        print('Test set: ' + str(len(testSet)))
        # generate predictions
        predictions = []
        k = 5  # use the 5 nearest neighbors
        for x in range(len(testSet)):  # classify every test sample
            neighbors = self.getNeighbors(trainingSet, testSet[x], k)  # find the 5 nearest neighbors
            result = self.getResponse(neighbors)  # majority class among those neighbors
            predictions.append(result)
            # print('predictions: ' + repr(predictions))
            # print('>predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
        accuracy = self.getAccuracy(testSet, predictions)
        print('Accuracy: ' + repr(accuracy) + '%')
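
Because loadDataset assigns rows with random.random(), the train/test split, and with it the measured accuracy, changes from run to run unless the random generator is seeded. The sketch below shows one way to make the split repeatable. It is an assumption-laden illustration: the function split_rows and its seed parameter are hypothetical, and the rows are stand-in integers.

```python
import random


def split_rows(rows, split, seed):
    """Randomly split rows into (training, test) in a repeatable way.

    Each row goes to the training list with probability `split`. A private
    random.Random(seed) instance keeps the result independent of any other
    use of the random module.
    """
    rng = random.Random(seed)
    training, test = [], []
    for row in rows:
        if rng.random() < split:
            training.append(row)
        else:
            test.append(row)
    return training, test


rows = list(range(100))  # stand-in for the rows of data.txt
train_a, test_a = split_rows(rows, 0.75, seed=42)
train_b, test_b = split_rows(rows, 0.75, seed=42)
print(len(train_a), len(test_a))
```

With the same seed the two calls return identical splits, so an accuracy figure can be reproduced exactly.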

The model reaches a final accuracy of 90%.

Figure 2-10 Model run results

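Source code: https://pan.baidu.com/s/1Vcc_bHQMHmQpe-F6A-mFdQ

Extraction code: qvy7

About the author: 李秋键 (Li Qiujian) is a CSDN blog expert and the author of a CSDN 达人课 course. The author is currently a master's student at China University of Mining and Technology, and their development work includes an award-winning entry in a TapTap competition.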
