This section harvests free proxy servers from the web: Scrapy collects the candidate proxies, each one is then verified to see whether it actually works, and the working proxies are saved to a file.
First, define the fields to collect in items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class GetproxyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    ip = scrapy.Field()
    port = scrapy.Field()
    type = scrapy.Field()
    location = scrapy.Field()
    protocol = scrapy.Field()
    source = scrapy.Field()    # which site the proxy came from; the spider sets this
Figure 1: scrapy shell
Figure 2: response data
Figure 3: page source
Figure 4: proxy360Spider selector test
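Figures 1 through 4 show the XPath being worked out interactively before any spider code is written. A minimal sketch of such a scrapy shell session (the region URL matches the spider below; the span positions are assumptions about proxy360's markup):

# started from the command line with, e.g.:
#   scrapy shell http://www.proxy360.cn/Region/China
# inside the shell, `response` is already bound to the fetched page

# one <div> per proxy entry -- the same XPath the spider below uses
subSelector = response.xpath(
    '//div[@class="proxylistitem" and @name="list_proxy_ip"]')
len(subSelector)                               # number of proxies on the page

sub = subSelector[0]
sub.xpath('.//span[1]/text()').extract()[0]    # IP address
sub.xpath('.//span[2]/text()').extract()[0]    # port
sub.xpath('.//span[3]/text()').extract()[0]    # anonymity type
sub.xpath('.//span[4]/text()').extract()[0]    # location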
With the selectors verified, the spider itself is short:

# -*- coding: utf-8 -*-
import scrapy
from getProxy.items import GetproxyItem


class Proxy360Spider(scrapy.Spider):
    name = "proxy360Spider"
    # must match the domain of start_urls, or Scrapy drops the requests as offsite
    allowed_domains = ["proxy360.cn"]
    nations = ['Brazil', 'China', 'America', 'Taiwan',
               'Japan', 'Thailand', 'Vietnam', 'bahrein']
    # one start URL per region page
    start_urls = ['http://www.proxy360.cn/Region/' + nation for nation in nations]

    def parse(self, response):
        # each proxy entry sits in its own <div class="proxylistitem">
        subSelector = response.xpath(
            '//div[@class="proxylistitem" and @name="list_proxy_ip"]')
        items = []
        for sub in subSelector:
            item = GetproxyItem()
            item['ip'] = sub.xpath('.//span[1]/text()').extract()[0]
            item['port'] = sub.xpath('.//span[2]/text()').extract()[0]
            item['type'] = sub.xpath('.//span[3]/text()').extract()[0]
            item['location'] = sub.xpath('.//span[4]/text()').extract()[0]
            item['protocol'] = 'HTTP'
            item['source'] = 'proxy360'
            items.append(item)
        return items
The pipeline appends each item to proxy.txt; the line numbers below are referenced in the explanation that follows:

1 # -*- coding: utf-8 -*-
2
3 # Define your item pipelines here
4 #
5 # Don't forget to add your pipeline to the ITEM_PIPELINES setting
6 # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
7
8 import codecs
9
10 class GetproxyPipeline(object):
11     def process_item(self, item, spider):
12         fileName = 'proxy.txt'
13         with codecs.open(fileName, 'a', 'utf-8') as fp:
14             fp.write("{'%s': '%s://%s:%s'}||\t %s \t %s \t %s \r\n"
15                 % (item['protocol'].lower().strip(), item['protocol'].lower().strip(), item['ip'].strip(), item['port'].strip(), item['type'].strip(), item['location'].strip(), item['source'].strip()))
16         return item
The two vertical bars added in line 14 make it easy to split the proxy dictionary ({'http': 'http://1.2.3.4:8080'}) back out of each saved line later. In line 15, strip() removes stray whitespace around the extracted values before they are written, and lower() normalizes the scheme to lowercase so the saved dictionary matches the form shown above.
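The point of the `||` separator is easy to demonstrate: split on it and feed the left half to ast.literal_eval to turn each saved line back into a usable dict. A minimal sketch (the file name follows the pipeline above):

# -*- coding: utf-8 -*-
import ast
import codecs

proxies = []
with codecs.open('proxy.txt', 'r', 'utf-8') as fp:
    for line in fp:
        # the text before '||' is exactly the dict written in line 14
        dictText = line.split('||')[0]
        proxies.append(ast.literal_eval(dictText))
# proxies is now a list of dicts like {'http': 'http://1.2.3.4:8080'}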
Finally, the pipeline has to be registered in settings.py (the `### user add` block):

# -*- coding: utf-8 -*-

# Scrapy settings for getProxy project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
#

BOT_NAME = 'getProxy'

SPIDER_MODULES = ['getProxy.spiders']
NEWSPIDER_MODULE = 'getProxy.spiders'

# Crawl responsibly by identifying yourself (and your website) on the User-Agent
#USER_AGENT = 'getProxy (+http://www.yourdomain.com)'


### user add
ITEM_PIPELINES = {
    'getProxy.pipelines.GetproxyPipeline': 100
}
Figure 5: scrapy crawl proxy360Spider
Figure 6: scrapy shell

As Figure 6 shows, the next proxy site answers Scrapy's default User-Agent with an error, so settings.py is extended with a browser User-Agent:
# -*- coding: utf-8 -*-

# Scrapy settings for getProxy project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
#

BOT_NAME = 'getProxy'

SPIDER_MODULES = ['getProxy.spiders']
NEWSPIDER_MODULE = 'getProxy.spiders'

# Crawl responsibly by identifying yourself (and your website) on the User-Agent
#USER_AGENT = 'getProxy (+http://www.yourdomain.com)'


### user add
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)'
ITEM_PIPELINES = {
    'getProxy.pipelines.GetproxyPipeline': 100
}
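To confirm that the User-Agent is what makes the difference, the same page can be fetched with and without a browser UA. A minimal sketch using the requests library (the xicidaili.com URL is an assumption about the site xiciSpider targets):

# -*- coding: utf-8 -*-
import requests

# assumption: the proxy list site that xiciSpider crawls
url = 'http://www.xicidaili.com/'

# with the default client UA, sites like this often answer 500/503
print(requests.get(url).status_code)

# with a browser-like UA (same value as USER_AGENT in settings.py)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko)'}
print(requests.get(url, headers=headers).status_code)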
Figure 7: scrapy shell after modifying headers
Figure 8: xiciSpider selector test
Figure 9: scrapy crawl xiciSpider
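The verification step promised at the beginning, testing the collected proxies one by one, can be sketched as follows (assumptions: the proxy.txt format written by the pipeline above, the requests library, baidu.com as an arbitrary test page, and alive.txt as the output file name):

# -*- coding: utf-8 -*-
import ast
import codecs
import requests

testUrl = 'http://www.baidu.com'    # assumption: any fast, stable page will do

with codecs.open('alive.txt', 'w', 'utf-8') as out:    # output name is illustrative
    with codecs.open('proxy.txt', 'r', 'utf-8') as fp:
        for line in fp:
            # split on '||' as shown earlier to recover the proxy dict
            proxy = ast.literal_eval(line.split('||')[0])
            try:
                # a proxy counts as usable if the test page loads through it in 10s
                r = requests.get(testUrl, proxies=proxy, timeout=10)
                if r.status_code == 200:
                    out.write(line)
            except Exception:
                continue    # dead or too slow: skip it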