Scraping Doutula data with multithreaded Python

Published: 2024/4/15

Scraping meme images from the Doutula site with multithreaded Python

Techniques used

requests: HTTP request library
re: regular expressions
pyquery: parsing library, a Python implementation of jQuery
threading: threads
queue: queues
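
The threading and queue items above combine into a producer/consumer pipeline. Here is a minimal sketch of that pattern using only the standard library, with no network access. The names `produce`, `consume`, `run_pipeline` and the `STOP` marker are hypothetical and are not taken from the original script; the "work" is just upper-casing strings.

```python
import threading
from queue import Queue

STOP = None  # hypothetical stop marker, one is sent per consumer


def produce(work_queue, items):
    """Producer: put every item on the shared queue."""
    for item in items:
        work_queue.put(item)


def consume(work_queue, results, lock):
    """Consumer: process items until the stop marker arrives."""
    while True:
        item = work_queue.get()  # blocks until something is available
        if item is STOP:
            break
        with lock:  # protect the shared result list
            results.append(item.upper())


def run_pipeline(items, n_consumers=3):
    work_queue = Queue()
    results = []
    lock = threading.Lock()
    consumers = [
        threading.Thread(target=consume, args=(work_queue, results, lock))
        for _ in range(n_consumers)
    ]
    for t in consumers:
        t.start()
    producer = threading.Thread(target=produce, args=(work_queue, items))
    producer.start()
    producer.join()  # all work has been queued
    for _ in consumers:
        work_queue.put(STOP)  # tell each consumer to exit
    for t in consumers:
        t.join()
    return sorted(results)


if __name__ == "__main__":
    print(run_pipeline(["a", "b", "c"]))
```

Sending one stop marker per consumer after the producer has finished is one way to avoid consumers blocking forever on an empty queue. Checking `empty()` before `get()` is not safe when several threads share the queue, because another thread can take the last item between the two calls.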

'''Doutula scraper, multithreaded version'''
import os
import re
# Thread class
import threading
# Queue class
from queue import Queue, Empty

import requests
from pyquery import PyQuery as jq
from requests.exceptions import RequestException

# The header name is "User-Agent" with a hyphen, not "User_Agent"
head = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
}

# Create the project folder next to this script
pt = os.path.dirname(os.path.abspath(__file__))
path = os.path.join(pt, "斗图啦")
os.makedirs(path, exist_ok=True)


'''
The producer class inherits from threading.Thread
and overrides the __init__ and run methods
'''
class Producer(threading.Thread):
    def __init__(self, img_queue, url_queue, *args, **kwargs):
        # Keyword arguments must be forwarded with **kwargs, not *kwargs
        super(Producer, self).__init__(*args, **kwargs)
        self.img_queue = img_queue
        self.url_queue = url_queue

    def run(self):
        while True:
            try:
                # Non-blocking get: calling empty() and then get() can hang
                # if another thread takes the last URL in between
                url = self.url_queue.get_nowait()
            except Empty:
                # No URLs left, exit the loop
                break
            try:
                self.parse_page(url)
            except RequestException as e:
                print('Failed to fetch %s: %s' % (url, e))

    # Parse one list page
    def parse_page(self, url):
        res = requests.get(url, headers=head, timeout=10)
        doc = jq(res.text)
        # Find all <a> tags
        items = doc.find(".page-content a").items()
        for a in items:
            title = a.find("p").text()
            src = a.find("img.img-responsive").attr("data-original")
            if not src:
                # Skip links that carry no image URL
                continue
            # Split the path to get the file extension
            pathtype = os.path.splitext(src)[1]
            # Strip special characters with a regular expression
            patitle = re.sub(r'[\.。，\?？\*!！\/~]', "", title)
            filename = patitle + pathtype
            filepath = os.path.join(path, filename)
            # Hand the image over to the consumer queue for download
            self.img_queue.put((filepath, src))


'''
The consumer works the same way as the producer
'''
class Customer(threading.Thread):
    def __init__(self, img_queue, url_queue, *args, **kwargs):
        super(Customer, self).__init__(*args, **kwargs)
        self.img_queue = img_queue
        self.url_queue = url_queue

    def run(self):
        while True:
            # Blocks until an item arrives; main() sends None as a stop
            # signal once all producers have finished
            item = self.img_queue.get()
            if item is None:
                break
            # Unpack the target path and the image link
            filepath, src = item
            print('%s download started, link %s' % (filepath, src))
            try:
                # Request the image
                img = requests.get(src, headers=head, timeout=10)
                # Write to disk; content is binary data, text is decoded text
                with open(filepath, "wb") as f:
                    f.write(img.content)
            except (RequestException, OSError) as e:
                print('%s download failed: %s' % (filepath, e))
                continue
            print('%s download finished' % filepath)


def main():
    # Build the URL queue and the image queue
    url_queue = Queue(100000)
    img_queue = Queue(100000)
    # Build the URLs for pages 1 to 100
    for i in range(1, 101):
        url = "https://www.doutula.com/photo/list/?page=" + str(i)
        url_queue.put(url)
    # 5 producer threads and 3 consumer threads
    producers = [Producer(img_queue, url_queue) for i in range(5)]
    customers = [Customer(img_queue, url_queue) for i in range(3)]
    for t in producers + customers:
        t.start()
    # Wait for the producers, then send one stop signal per consumer
    for t in producers:
        t.join()
    for t in customers:
        img_queue.put(None)
    for t in customers:
        t.join()


if __name__ == '__main__':
    print("Crawler scheduler started---------")
    main()
    print("Crawler scheduler finished---------")
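
The filename step inside `parse_page` (take the extension from the image link, strip special characters from the title) can be tried on its own without touching the network. The sketch below reuses the regular expression from the script; the helper name `build_filename` and the example URL are hypothetical and only serve as illustration.

```python
import os
import re


def build_filename(title, src):
    """Return a file name made of the cleaned title plus the extension of src."""
    # Split the link to get the extension, e.g. ".jpg"
    ext = os.path.splitext(src)[1]
    # Same character class as in the script: dots, commas, question marks,
    # asterisks, exclamation marks, slashes and tildes are removed
    clean = re.sub(r'[\.。，\?？\*!！\/~]', "", title)
    return clean + ext


if __name__ == "__main__":
    # Hypothetical title and URL, for illustration only
    print(build_filename("what?!.", "http://example.com/x.gif"))
```

Two different titles can map to the same cleaned name, in which case the later download overwrites the earlier file. If that matters, appending a counter or a hash of the link to the name would be one option.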

Reposted from: https://www.cnblogs.com/HiLzd/p/11246116.html
