Getting Started with Asynchronous Programming in Python
Reading code these past few days, I kept running into asynchronous programming. Until now I had only cared about getting features working and never thought about how fast the code ran, so I decided to study it properly.
From 0 to 1: understanding how Python's asynchronous programming has evolved.
1. Crawling with urllib and requests
requests optimizes how requests are handled, so it is somewhat faster than urllib.
Requests is an HTTP client library for Python that makes network requests more intuitive and convenient. Its biggest difference from urllib lies in how connections are handled while crawling: urllib closes the connection as soon as a request finishes, while requests keeps the socket alive afterwards so it can be reused for subsequent requests instead of reconnecting.
In Python 2.7, this functionality is split across two modules, urllib and urllib2. Python 3.5 merges the two into a single new urllib package.
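The socket reuse is easy to see with requests.Session, which keeps the underlying TCP connection open across calls. A minimal sketch (httpbin.org is just a stand-in endpoint, not from the original post):
import requests

# A Session reuses the underlying TCP connection (HTTP keep-alive),
# so the connection and TLS handshake are paid only once for both requests.
session = requests.Session()
r1 = session.get('https://httpbin.org/get')
r2 = session.get('https://httpbin.org/get')
print(r1.status_code, r2.status_code)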
urllib:
# -*- coding:utf-8 -*-
import urllib.request
import ssl
from lxml import etree

url = 'https://movie.douban.com/top250'
# A default SSL context verifies certificates and negotiates a modern TLS
# version (the original hard-coded TLSv1.1, which current servers reject).
context = ssl.create_default_context()


def fetch_page(url):
    response = urllib.request.urlopen(url, context=context)
    return response


def parse(url):
    response = fetch_page(url)
    page = response.read()
    html = etree.HTML(page)

    xpath_movie = '//*[@id="content"]/div/div[1]/ol/li'
    xpath_title = './/span[@class="title"]'
    xpath_pages = '//*[@id="content"]/div/div[1]/div[2]/a'

    pages = html.xpath(xpath_pages)
    fetch_list = []
    result = []

    # Movies on the first page.
    for element_movie in html.xpath(xpath_movie):
        result.append(element_movie)

    # Collect the pagination links, then fetch the remaining pages one by one.
    for p in pages:
        fetch_list.append(url + p.get('href'))

    for url in fetch_list:
        response = fetch_page(url)
        page = response.read()
        html = etree.HTML(page)
        for element_movie in html.xpath(xpath_movie):
            result.append(element_movie)

    for i, movie in enumerate(result, 1):
        title = movie.find(xpath_title).text
        print(i, title)


def main():
    parse(url)


if __name__ == '__main__':
    main()
Replacing the standard library urllib with requests:
import requests
from lxml import etree
from time import time

url = 'https://movie.douban.com/top250'


def fetch_page(url):
    response = requests.get(url)
    return response


def parse(url):
    response = fetch_page(url)
    page = response.content
    html = etree.HTML(page)

    xpath_movie = '//*[@id="content"]/div/div[1]/ol/li'
    xpath_title = './/span[@class="title"]'
    xpath_pages = '//*[@id="content"]/div/div[1]/div[2]/a'

    pages = html.xpath(xpath_pages)
    fetch_list = []
    result = []

    for element_movie in html.xpath(xpath_movie):
        result.append(element_movie)

    for p in pages:
        fetch_list.append(url + p.get('href'))

    for url in fetch_list:
        response = fetch_page(url)
        page = response.content
        html = etree.HTML(page)
        for element_movie in html.xpath(xpath_movie):
            result.append(element_movie)

    for i, movie in enumerate(result, 1):
        title = movie.find(xpath_title).text
        # print(i, title)
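The comparisons in this post time parse(url) with a small harness; the original omits it here, but a sketch consistent with the averaged "Cost ... seconds" print in the final asyncio example would be:
def main():
    start = time()
    for i in range(5):       # average over five runs, as in the final example
        parse(url)
    end = time()
    print('Cost {} seconds'.format((end - start) / 5))


if __name__ == '__main__':
    main()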
2. Parsing with lxml versus regular expressions
Parsing with lxml takes a certain amount of time, but a program that relies on regular expressions instead is harder to maintain and less extensible.
A common combination is Requests + BeautifulSoup (a library for parsing web documents); other common parsing tools are regular expressions and XPath.
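For reference, a minimal Requests + BeautifulSoup version of the same title extraction might look like this (a sketch, not from the original post; assumes the beautifulsoup4 package is installed):
import requests
from bs4 import BeautifulSoup

response = requests.get('https://movie.douban.com/top250')
soup = BeautifulSoup(response.content, 'html.parser')
# Douban marks both the Chinese title and the original title with
# class="title"; the slash-prefixed original titles are filtered out here.
titles = [s.get_text() for s in soup.select('span.title')
          if not s.get_text().startswith('\xa0/')]
for i, title in enumerate(titles, 1):
    print(i, title)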
Swapping the lxml library for the standard re library:
# -*- coding:utf-8 -*-
import requests
from time import time
import re

url = 'https://movie.douban.com/top250'


def fetch_page(url):
    response = requests.get(url)
    return response


def parse(url):
    response = fetch_page(url)
    page = response.content
    fetch_list = set()
    result = []

    # The original patterns were lost when the HTML in this post was stripped;
    # these are reconstructions matching Douban's markup: titles sit in
    # <span class="title">...</span> (the [^&] guard skips the
    # "&nbsp;/&nbsp;original title" variants), and pagination links look
    # like <a href="?start=25&filter=">.
    for title in re.findall(rb'<span class="title">([^&]*?)</span>', page):
        result.append(title)

    for postfix in re.findall(rb'<a href="(\?start=.*?)"', page):
        fetch_list.add(url + postfix.decode())

    for url in fetch_list:
        response = fetch_page(url)
        page = response.content
        for title in re.findall(rb'<span class="title">([^&]*?)</span>', page):
            result.append(title)

    for i, title in enumerate(result, 1):
        title = title.decode()
        # print(i, title)
3. Going further: multiprocessing and multithreading
In network programming (as in the crawler above), the bottleneck is usually I/O, so removing time spent waiting on reads and writes pays off far more than making text parsing faster.
Program switching, i.e. allocating CPU time: the operating system automatically grants each program a slice of time on resources such as the CPU, memory, disk, keyboard and display, then switches to the next program when the slice expires. If the program being switched out has not finished, its state is saved so that it can continue the next time it is scheduled.
1) Process: a process is the first form of this "program switching". A process is a computer program in execution; whenever code runs, it is first of all a process. A process can be in states such as ready, running, blocked, zombie and terminated (the exact set varies by operating system).
2) Thread: a thread is another form of "program switching". A thread is code executing inside a process. One process can run multiple threads, and those threads share the operating-system resources the parent process has acquired. Threads in a process take turns executing; modern operating systems also support preemption, meaning a waiting thread can suspend the running one through priorities or signals and run first. A thread must be started inside an existing process, and it uses the resources that the process obtained rather than requesting CPU and other resources itself.
3) The difference between threads and processes: threads generally run concurrently, and it is precisely this concurrency plus shared data that makes cooperation between tasks possible. Processes generally run in parallel, which lets a program execute on several CPUs at once.
4) Coroutine: a coroutine is yet another form of "program switching". Put simply, a coroutine is like a thread, except that it is not scheduled by the operating system but cooperatively, by the program itself; a coroutine is "a thread that the OS does not schedule", also called a micro-thread. Because coroutines schedule one another cooperatively, their performance far exceeds threads once concurrency reaches the tens of thousands. Note that this is still "concurrency", not "parallelism". A toy sketch of cooperative scheduling follows this list.
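To make "cooperative scheduling" concrete, here is a toy sketch built on plain generators, roughly how Python coroutines worked before async/await (the names are made up for illustration):
def worker(name, steps):
    # Each yield voluntarily hands control back to the scheduler;
    # nothing preempts this "coroutine" in between.
    for i in range(steps):
        print(name, 'step', i)
        yield


def round_robin(coros):
    # A minimal scheduler: keep cycling through the live generators.
    while coros:
        coro = coros.pop(0)
        try:
            next(coro)
            coros.append(coro)  # still running, requeue it
        except StopIteration:
            pass


round_robin([worker('a', 2), worker('b', 3)])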
Multithreading effectively solves the problem of blocking on I/O waits.
# -*- coding:utf-8 -*-
import requests
from lxml import etree
from time import time
from threading import Thread

url = 'https://movie.douban.com/top250'


def fetch_page(url):
    response = requests.get(url)
    return response


def parse(url):
    response = fetch_page(url)
    page = response.content
    html = etree.HTML(page)

    xpath_movie = '//*[@id="content"]/div/div[1]/ol/li'
    xpath_title = './/span[@class="title"]'
    xpath_pages = '//*[@id="content"]/div/div[1]/div[2]/a'

    pages = html.xpath(xpath_pages)
    fetch_list = []
    result = []

    for element_movie in html.xpath(xpath_movie):
        result.append(element_movie)

    for p in pages:
        fetch_list.append(url + p.get('href'))

    def fetch_content(url):
        response = fetch_page(url)
        page = response.content
        html = etree.HTML(page)
        for element_movie in html.xpath(xpath_movie):
            result.append(element_movie)

    threads = []
    for url in fetch_list:
        t = Thread(target=fetch_content, args=[url])
        t.start()
        threads.append(t)

    for t in threads:
        t.join()

    for i, movie in enumerate(result, 1):
        title = movie.find(xpath_title).text
        # print(i, title)
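The manual Thread start/join bookkeeping above can also be written with concurrent.futures.ThreadPoolExecutor, which caps the number of threads and joins them for us. A sketch of a drop-in replacement for that loop:
from concurrent.futures import ThreadPoolExecutor

# The pool runs fetch_content over fetch_list on at most four worker
# threads; leaving the with-block waits for all submitted calls to finish.
with ThreadPoolExecutor(max_workers=4) as executor:
    executor.map(fetch_content, fetch_list)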
Multiprocessing: use a pool of four worker processes to fetch the pages in parallel.
# -*- coding:utf-8 -*-
import requests
from lxml import etree
from time import time
from concurrent.futures import ProcessPoolExecutor

url = 'https://movie.douban.com/top250'


def fetch_page(url):
    response = requests.get(url)
    return response


def fetch_content(url):
    response = fetch_page(url)
    page = response.content
    return page


def parse(url):
    page = fetch_content(url)
    html = etree.HTML(page)

    xpath_movie = '//*[@id="content"]/div/div[1]/ol/li'
    xpath_title = './/span[@class="title"]'
    xpath_pages = '//*[@id="content"]/div/div[1]/div[2]/a'

    pages = html.xpath(xpath_pages)
    fetch_list = []
    result = []

    for element_movie in html.xpath(xpath_movie):
        result.append(element_movie)

    for p in pages:
        fetch_list.append(url + p.get('href'))

    with ProcessPoolExecutor(max_workers=4) as executor:
        for page in executor.map(fetch_content, fetch_list):
            html = etree.HTML(page)
            for element_movie in html.xpath(xpath_movie):
                result.append(element_movie)

    for i, movie in enumerate(result, 1):
        title = movie.find(xpath_title).text
        # print(i, title)
Here the real strength of multiprocessing (CPU-bound work) never comes into play; instead, the cost of creating and scheduling processes far outweighs the benefit and drags performance down. Even so, the multiprocess version is still much better than the original single-process, single-threaded model.
Beyond their creation overhead, multiprocessing and multithreading share a flaw that is hard to cure: coordinating work between processes or between threads. Without locks, programs that depend on multiple processes or threads are usually not deterministic, as the sketch below shows, whereas coroutines solve the coordination problem elegantly, because the user decides how coroutines hand control to one another.
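A self-contained sketch of the problem: four threads doing an unsynchronized read-modify-write on a shared counter would drop updates, while the lock makes the result deterministic (toy example, not from the original post):
from threading import Lock, Thread

counter = 0
counter_lock = Lock()


def bump(times):
    global counter
    for _ in range(times):
        with counter_lock:      # the read-modify-write must be atomic
            counter += 1


threads = [Thread(target=bump, args=[100000]) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # always 400000 with the lock; unpredictable without it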
An asynchronous program based on gevent:
# -*- coding:utf-8 -*-
import requests
from lxml import etree
from time import time
import gevent
from gevent import monkey

monkey.patch_all()

url = 'https://movie.douban.com/top250'


def fetch_page(url):
    response = requests.get(url)
    return response


def fetch_content(url):
    response = fetch_page(url)
    page = response.content
    return page


def parse(url):
    page = fetch_content(url)
    html = etree.HTML(page)

    xpath_movie = '//*[@id="content"]/div/div[1]/ol/li'
    xpath_title = './/span[@class="title"]'
    xpath_pages = '//*[@id="content"]/div/div[1]/div[2]/a'

    pages = html.xpath(xpath_pages)
    fetch_list = []
    result = []

    for element_movie in html.xpath(xpath_movie):
        result.append(element_movie)

    for p in pages:
        fetch_list.append(url + p.get('href'))

    jobs = [gevent.spawn(fetch_content, url) for url in fetch_list]
    gevent.joinall(jobs)

    for page in [job.value for job in jobs]:
        html = etree.HTML(page)
        for element_movie in html.xpath(xpath_movie):
            result.append(element_movie)

    for i, movie in enumerate(result, 1):
        title = movie.find(xpath_title).text
        # print(i, title)
gevent gives us the ability to write asynchronous programs with synchronous-looking logic. Look at the monkey.patch_all() line: it is the piece of magic that makes the whole program asynchronous. Once the monkey patch is applied, Python dynamically replaces a number of blocking standard-library modules (such as socket and thread) with asynchronous, cooperative versions at runtime, so every network operation in the program works asynchronously and throughput naturally improves.
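The effect of the patch is easy to observe: after patch_all(), the standard socket class has been swapped for gevent's cooperative implementation. A quick sketch:
from gevent import monkey
monkey.patch_all()

import socket
# Prints gevent's socket class rather than the standard library's, showing
# that blocking calls on it now yield to gevent's event loop instead.
print(socket.socket)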
4. Python async/await
Python needed first-class standard-library support for coroutines, and that is what asyncio later provided.
The version below swaps the synchronous requests library for the asyncio-compatible aiohttp library and writes the coroutines with the async/await syntax introduced in Python 3.5.
# -*- coding:utf-8 -*-
from lxml import etree
from time import time
import asyncio
import aiohttp

url = 'https://movie.douban.com/top250'


async def fetch_content(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()


async def parse(url):
    page = await fetch_content(url)
    html = etree.HTML(page)

    xpath_movie = '//*[@id="content"]/div/div[1]/ol/li'
    xpath_title = './/span[@class="title"]'
    xpath_pages = '//*[@id="content"]/div/div[1]/div[2]/a'

    pages = html.xpath(xpath_pages)
    fetch_list = []
    result = []

    for element_movie in html.xpath(xpath_movie):
        result.append(element_movie)

    for p in pages:
        fetch_list.append(url + p.get('href'))

    tasks = [fetch_content(url) for url in fetch_list]
    pages = await asyncio.gather(*tasks)

    for page in pages:
        html = etree.HTML(page)
        for element_movie in html.xpath(xpath_movie):
            result.append(element_movie)

    for i, movie in enumerate(result, 1):
        title = movie.find(xpath_title).text
        # print(i, title)


def main():
    loop = asyncio.get_event_loop()
    start = time()
    for i in range(5):
        loop.run_until_complete(parse(url))
    end = time()
    print('Cost {} seconds'.format((end - start) / 5))
    loop.close()
It is fast, and it improves the program's readability as well.
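As an aside, on Python 3.7 and later the manual event-loop management in main() can be replaced by asyncio.run, which creates, runs and closes the loop in a single call:
import asyncio

# Equivalent to get_event_loop() / run_until_complete() / close().
asyncio.run(parse(url))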
A full Python async/await beginner's guide: to be continued...