Scraping Website Data Through Proxies in Python
The proxy IPs come from https://www.kuaidaili.com/free/. I use the HTTP proxies; pick the HTTP or HTTPS listing page depending on what you need.
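The HTTP/HTTPS choice matters on the client side too: requests picks a proxy from the `proxies` mapping by the scheme of the URL being fetched, so an entry under `'http'` alone is not applied to an `https://` URL. A minimal sketch of building that mapping follows; the helper name `build_proxies` and the sample address are hypothetical and not part of the original script.

```python
def build_proxies(address, proxy_scheme="http"):
    """Return a requests-style proxies mapping for one 'ip:port' address.

    `build_proxies` is a hypothetical helper used for illustration only.
    """
    proxy_url = "%s://%s" % (proxy_scheme, address)
    # requests looks up the proxy by the scheme of the target URL,
    # so both keys are set to cover http:// and https:// targets.
    return {"http": proxy_url, "https": proxy_url}


if __name__ == "__main__":
    # 203.0.113.10 is a documentation-only address, not a real proxy.
    print(build_proxies("203.0.113.10:8080"))
```

The result can be passed as `requests.get(url, proxies=build_proxies(proxy_ip))`. Whether an HTTP proxy can actually carry HTTPS traffic depends on the proxy itself, so expect some free proxies to fail on `https://` URLs.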
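The page-view count does go up, but the result is not great; I will look into it further when I have time. Without further ado, here is the code.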
# -*- coding:utf-8 -*-
import random
import re
import time

import requests

# Pool of User-Agent strings; one is picked at random for each request.
user_agent_list = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
]

count = 0  # number of successful requests so far


def Get_proxy_ip():
    """Scrape one page of the free proxy list and return 'ip:port' strings."""
    headers = {
        'Host': "www.kuaidaili.com",
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
        'Accept': r'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    }
    req = requests.get(r'https://www.kuaidaili.com/free/inha/16/', headers=headers, timeout=10)
    html = req.text
    proxy_list = []
    IP_list = re.findall(r'\d+\.\d+\.\d+\.\d+', html)
    port_list = re.findall(r'<td data-title="PORT">\d+</td>', html)
    # zip() stops at the shorter list, so a count mismatch cannot raise IndexError.
    for ip, port_cell in zip(IP_list, port_list):
        port = re.sub(r'<td data-title="PORT">|</td>', '', port_cell)
        proxy_list.append('%s:%s' % (ip, port))
    return proxy_list


def Proxy_read(proxy_list, user_agent_list, i):
    """Fetch the target URL once through the i-th proxy with a random User-Agent."""
    global count
    # Wrap around so that i may exceed the number of proxies scraped.
    proxy_ip = proxy_list[i % len(proxy_list)]
    print('Current proxy IP: %s' % proxy_ip)
    user_agent = random.choice(user_agent_list)
    print('Current User-Agent: %s' % user_agent)
    sleep_time = random.randint(1, 5)
    print('Waiting: %s s' % sleep_time)
    time.sleep(sleep_time)
    print('Starting request')
    headers = {'User-Agent': user_agent}
    # requests selects the proxy by the target URL's scheme, so the 'https'
    # key is required for the https:// URL below to go through the proxy.
    proxies = {'http': 'http://' + proxy_ip, 'https': 'http://' + proxy_ip}
    url = 'https://www.baidu.com'  # blog address
    try:
        # verify=False skips TLS certificate checks, as in the original script.
        req = requests.get(url, headers=headers, proxies=proxies, timeout=6, verify=False)
        html = req.text
        print(html)
    except Exception as e:
        print(e)
        print('****** Request failed! ******')
    else:
        count += 1
        print('OK! %s successful requests in total!' % count)


if __name__ == '__main__':
    proxy_list = Get_proxy_ip()
    if not proxy_list:
        raise SystemExit('No proxies found; the page layout may have changed.')
    for i in range(100):
        Proxy_read(proxy_list, user_agent_list, i)
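The script collects IPs and ports with two separate regex searches and pairs them by position, so any other IP-like string on the page would shift the pairing. One possible way around this is to capture both cells with a single pattern. The sketch below assumes the IP cell is marked `data-title="IP"` in the same way the port cell is marked `data-title="PORT"`; the original only confirms the latter. `parse_proxies` and `SAMPLE_HTML` are hypothetical names, and the sample markup is made up for illustration.

```python
import re

# Assumption: the IP cell carries data-title="IP" and is immediately
# followed by the PORT cell, mirroring the markup the script relies on.
ROW_PATTERN = re.compile(
    r'<td data-title="IP">\s*(\d+\.\d+\.\d+\.\d+)\s*</td>\s*'
    r'<td data-title="PORT">\s*(\d+)\s*</td>'
)


def parse_proxies(html):
    """Return 'ip:port' strings, taking IP and port from the same table row."""
    return ["%s:%s" % (ip, port) for ip, port in ROW_PATTERN.findall(html)]


# Made-up sample markup; the stray address in the <p> tag must be ignored.
SAMPLE_HTML = """
<tr>
  <td data-title="IP">203.0.113.10</td>
  <td data-title="PORT">8080</td>
</tr>
<p>Contact server 198.51.100.1 for support</p>
<tr>
  <td data-title="IP">203.0.113.11</td>
  <td data-title="PORT">3128</td>
</tr>
"""

if __name__ == "__main__":
    print(parse_proxies(SAMPLE_HTML))
```

Because each match yields the IP and port together, an unrelated address elsewhere in the page cannot misalign the list. If the real page uses different markup, the pattern would need adjusting.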