Single-threaded, multi-task asynchronous coroutines for crawlers: asyncio 3.6
- Event loop
  - An object that loops indefinitely. Special functions (those defined with the async keyword) are ultimately registered with this object.
- Coroutine
  - Essentially an object. A coroutine object (what a special function returns when called) can be registered with the event loop.
- Task object
  - A further wrapper around a coroutine object.
  - Binding a callback: task.add_done_callback(func)
    - func(task): the task parameter is the task object the callback is bound to
    - task.result(): returns the return value of the special function wrapped by the task object
    - In crawlers, callbacks are mostly used as parsing methods
- await
  - Inside the special function wrapped by a task object, any blocking operation must be prefixed with await.
- How asynchrony shows up
  - Once several coroutine objects (special functions) are registered with the event loop and the loop is started, it cycles through the coroutine objects it holds.
  - If the event loop hits a blocking operation while running one coroutine object, it does not wait for the block to finish. It moves on to the next coroutine object instead.
- aiohttp: a network request module that supports async
  - Chinese documentation
    - https://www.cntofu.com/book/127/aiohttp%E6%96%87%E6%A1%A3/ClientUsage.md
  - Installation: use pip or install it directly from PyCharm
  - Spoofing the User-Agent:
    - session.get(url, headers=headers)
  - Passing parameters:
    - session.get(url, headers=headers, params=params) for GET, session.post(url, headers=headers, data=data) for POST
  - Using a proxy IP:
    - session.get(url, proxy="http://ip:port")
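The aiohttp notes above can be pulled together in one small sketch. It assumes aiohttp is installed; the names build_request_kwargs and fetch are hypothetical helpers introduced only for illustration, and the User-Agent string, query parameters and proxy address are placeholders. Note that headers, params and proxy are passed to session.get() as keyword arguments.

```python
def build_request_kwargs(user_agent=None, params=None, proxy=None):
    """Collect the optional keyword arguments for session.get() (hypothetical helper)."""
    kwargs = {}
    if user_agent:
        # UA spoofing: headers must be passed by keyword
        kwargs['headers'] = {'User-Agent': user_agent}
    if params:
        # query-string parameters for a GET request
        kwargs['params'] = params
    if proxy:
        # proxy address in the form "http://ip:port"
        kwargs['proxy'] = proxy
    return kwargs


async def fetch(url, **kwargs):
    """Send one GET request with aiohttp and return the page source as text."""
    # Third-party module; imported here so the helper above works without it.
    import aiohttp
    async with aiohttp.ClientSession() as session:
        async with session.get(url, **kwargs) as response:
            return await response.text()
```

A call such as fetch(url, **build_request_kwargs(user_agent='Mozilla/5.0', params={'page': 1})) returns a coroutine object, which is then registered with the event loop in the same way as in the examples that follow.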
Simple example

    import asyncio

    # Special function: calling it does not execute the statements in its body.
    # The call returns a coroutine object instead.
    async def test():
        print('i am test()')
        print('i am test()')
        print('i am test()')

    # Call the special function to get a coroutine object
    c = test()
    # Create an event loop object
    loop = asyncio.get_event_loop()
    # Register the coroutine object with the event loop and start the loop
    loop.run_until_complete(c)
    print(c)

Using task objects

    import asyncio

    # Special function: calling it does not execute the statements in its body.
    # The call returns a coroutine object instead.
    async def test():
        print('i am test()')

    # Call the special function to get a coroutine object
    c = test()
    # Wrap the coroutine object in a task object
    task = asyncio.ensure_future(c)
    # Create an event loop object
    loop = asyncio.get_event_loop()
    # Register the task object with the event loop and start the loop
    loop.run_until_complete(task)

Binding a callback function to a task object

    import asyncio

    # Special function: calling it does not execute the statements in its body.
    # The call returns a coroutine object instead.
    async def test():
        print('i am test()')
        return 'hello bobo'

    # Callback for the task object; the task parameter is the task object itself
    def func(task):
        # print('i am task callback!')
        # result() gives the return value of the special function wrapped by the task
        print(task.result())

    # Call the special function to get a coroutine object
    c = test()
    # Wrap the coroutine object in a task object
    task = asyncio.ensure_future(c)
    # Bind a callback function to the task object
    task.add_done_callback(func)
    # Create an event loop object
    loop = asyncio.get_event_loop()
    # Register the task object with the event loop and start the loop
    loop.run_until_complete(task)

Multi-task asynchronous coroutines

    import asyncio
    import time

    # Code from modules that do not support async must not appear inside this function.
    # Every asynchronous operation inside it must be prefixed with await.
    async def request(url):
        print('Downloading:', url)
        # time.sleep(2)  # the time module does not support async
        await asyncio.sleep(2)  # an async-aware sleep provided by asyncio
        print(url, 'download finished!')
        return url

    # Create a callback function
    def callback(task):
        # result() gives the return value of the special function wrapped by the task
        print(task.result())

    urls = [
        'www.1.com',
        'www.2.com',
        'www.3.com',
        'www.4.com',
    ]

    # Record the start time
    start = time.time()
    # Task list
    tasks = []
    for url in urls:
        # Call the special function to get a coroutine object
        c = request(url)
        # Wrap the coroutine object in a task object
        task = asyncio.ensure_future(c)
        # Bind the callback to the task object
        task.add_done_callback(callback)
        # Add the task object to the list
        tasks.append(task)

    # Create an event loop object
    loop = asyncio.get_event_loop()
    # Register the task list with the event loop and start the loop
    loop.run_until_complete(asyncio.wait(tasks))
    # Print the elapsed time
    print(time.time() - start)

A single-threaded, multi-task asynchronous coroutine crawler

    import asyncio
    import requests
    import time
    import aiohttp
    from lxml import etree

    urls = [
        'http://localhost:5000/bobo',
        'http://localhost:5000/jay',
        'http://localhost:5000/tom',
        'http://localhost:5000/bobo',
        'http://localhost:5000/jay',
        'http://localhost:5000/tom',
    ]

    # async def get_page(url):
    #     # requests does not support async; the fix is to send the
    #     # request with a module that does
    #     page_text = requests.get(url=url).text
    #     return page_text

    async def get_page(url):
        # Send the request with aiohttp.
        # Instantiate an object that sends network requests.
        async with aiohttp.ClientSession() as session:
            # session.get() is entered with async with, which awaits the request
            async with session.get(url) as response:
                # Read the response data (page source); this must be awaited
                page_text = await response.text()
                # print(page_text)
                return page_text

    # Data parsing is done in the callback function
    def parse(task):
        page_text = task.result()
        tree = etree.HTML(page_text)
        parse_data = tree.xpath('//body/text()')[0]
        print(parse_data)

    start = time.time()
    tasks = []
    for url in urls:
        # Call the special function to get a coroutine object
        c = get_page(url)
        # Wrap the coroutine object in a task object
        task = asyncio.ensure_future(c)
        # Bind the callback to the task object
        task.add_done_callback(parse)
        # Add the task object to the list
        tasks.append(task)

    # Create an event loop object
    loop = asyncio.get_event_loop()
    # Register the task list with the event loop and start the loop
    loop.run_until_complete(asyncio.wait(tasks))
    print(time.time() - start)

Applying single-threaded, multi-task asynchronous coroutines

    # Crawl crosstalk audio from Ximalaya
    import requests
    import aiohttp
    import asyncio

    # Generic URL template
    url = 'https://www.ximalaya.com/revision/play/album?albumId=19366477&pageNum=%d&sort=1&pageSize=2'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    }

    # Collect the links of all audio files to be downloaded
    urls = []
    for page in range(1, 3):
        new_url = url % page
        dic_obj = requests.get(url=new_url, headers=headers).json()
        for dic in dic_obj['data']['tracksAudioPlay']:
            audio_url = dic['src']
            urls.append(audio_url)

    # Special function: calling it does not execute the statements in its body.
    # The call returns a coroutine object instead.
    async def get_audio_data(url):
        # Send the request with aiohttp.
        # Instantiate an object that sends network requests.
        async with aiohttp.ClientSession() as s:
            # s.get() is entered with async with, which awaits the request
            async with s.get(url=url, headers=headers) as response:
                # read() returns the response data as bytes
                audio_data = await response.read()
                return {'data': audio_data, 'url': url}

    # Callback for the task object: persists the data to disk
    def saveData(task):
        dic_obj = task.result()
        name = dic_obj['url'].split('/')[-1]
        data = dic_obj['data']
        with open(name, 'wb') as fp:
            fp.write(data)
        print(name + ' download finished!')

    tasks = []
    for url in urls:
        # Call the special function to get a coroutine object
        c = get_audio_data(url)
        # Wrap the coroutine object in a task object
        task = asyncio.ensure_future(c)
        # Bind the callback function to the task object
        task.add_done_callback(saveData)
        # Add the task object to the list
        tasks.append(task)

    # Create an event loop object
    loop = asyncio.get_event_loop()
    # Register the task list with the event loop and start the loop
    loop.run_until_complete(asyncio.wait(tasks))
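As a variation on the callback pattern used in the examples above, the return values of several coroutines can also be collected directly with asyncio.gather, which returns them in the order the coroutines were passed in. The following is only a minimal sketch under stated assumptions: fake_download, main and run_all are hypothetical names, and asyncio.sleep stands in for a real non-blocking request so the block runs without a network.

```python
import asyncio
import time


async def fake_download(url, delay=0.2):
    """Simulated download (hypothetical): asyncio.sleep replaces a real async request."""
    await asyncio.sleep(delay)
    return url + ' done'


async def main(urls):
    # gather schedules all coroutines on the running loop and
    # returns their results as a list, in the same order as urls
    return await asyncio.gather(*(fake_download(u) for u in urls))


def run_all(urls):
    """Run every simulated download on a fresh event loop and return the results."""
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(main(urls))
    finally:
        loop.close()
```

Because the three simulated downloads wait concurrently, run_all(['a', 'b', 'c']) should take roughly one delay rather than three, which is the same effect the time.time() measurements in the earlier examples are meant to show.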
Reposted from: https://www.cnblogs.com/Godisgirl/p/11025195.html