Python crawler: multithreading and thread-pool simulation (urllib, requests, UserAgent, timeouts, etc.)...

Published: 2024/1/23 | python | 豆豆


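Following on from my earlier post, MonkeyLei: Python - Scraping page content (urllib, requests, UserAgent, JSON, etc.), this time I practise multithreading and simulating a thread pool.

Here is what I want to do:

1. Create a thread pool with an initial size of 16 (if no thread is available, allocate another 16 threads and add them to the pool; thread numbers currently repeat)

2. Then load the URL list into a Queue

3. Next, loop once per URL in the list (not to read the URL, only to start one thread per URL); each started thread takes a URL from the queue and crawls it

4 (attention). The main thread then waits for the child threads to finish (with a liveness check along the way: a thread that has already finished does not need a join)

5 (attention). Network requests carry a timeout; the simulated GitHub requests are slow and I can't be bothered to wait

So, on to the code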

thread_pool.py

#!/usr/bin/python3
# -*- coding: UTF-8 -*-
# File: thread_pool.py
import codecs
from threading import Thread
from queue import Queue
import time as Time
from urllib import request

thread_pool_len = 16
threads_pool = []
running_thread = []
url_list = [
    'http://www.baidu.com',
    'https://github.com/FanChael/DocPro',
    'http://www.baidu.com',
    'http://www.baidu.com',
    'http://www.baidu.com',
    'https://github.com/FanChael/DocPro',
    'http://www.baidu.com',
    'http://www.baidu.com',
    'http://www.baidu.com',
    'http://www.baidu.com',
    'http://www.baidu.com',
    'https://github.com/FanChael',
    'https://github.com/FanChael',
]

# Length of the URL list
url_len = len(url_list)
# Create the queue and fill it
queue = Queue(url_len)
for url in url_list:
    queue.put(url)

# Pretend to be a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36',
}


# Custom thread: takes one URL from the queue and downloads it
class my_thread(Thread):
    def __init__(self):
        Thread.__init__(self)

    def run(self):
        if not queue.empty():
            # Thread.name replaces the deprecated getName()
            print(self.name, 'running')
            data = ''
            # Incremental decoder so multi-byte characters split across 1024-byte chunks are not lost
            decoder = codecs.getincrementaldecoder('utf-8')(errors='ignore')
            try:
                req = request.Request(queue.get(), None, headers)
                with request.urlopen(req, timeout=5) as uf:
                    while True:
                        data_temp = uf.read(1024)
                        if not data_temp:
                            break
                        data += decoder.decode(data_temp)
                # print('thread', self.name, 'got data =', data)
            except Exception as err:
                print(self.name, str(err))


# Initialise the thread pool: top it up to `count` threads
def init_thread(count):
    thread_count = len(threads_pool)
    for i in range(thread_count, count):
        thread = my_thread()
        thread.name = 'thread #' + str(i)
        threads_pool.append(thread)


# Get an available thread.
# Optimisation idea: scanning the whole list every time is inefficient. Threads could be
# wrapped in objects carrying a flag that is flipped when they finish, but that still needs
# a loop; after taking a certain number, or when near the end, scan again from the start.
def get_available():
    for c_thread in threads_pool:
        # is_alive() replaces isAlive(), which was removed in Python 3.9
        if not c_thread.is_alive():
            threads_pool.remove(c_thread)
            return c_thread
    # Grow the pool
    init_thread(thread_pool_len)
    return get_available()


if __name__ == '__main__':
    # Initialise the thread pool
    init_thread(thread_pool_len)
    # Start time
    start_time = Time.time()
    # Start one thread per URL; each thread takes a URL from the queue and requests it
    for i in range(url_len):
        a_thread = get_available()
        if a_thread:
            running_thread.append(a_thread)
            a_thread.start()
    # The main thread waits for all child threads to finish
    for t in running_thread:
        if t.is_alive():
            t.join()
    # End time
    end_time = Time.time()
    print(len(running_thread), 'threads,', 'elapsed:', end_time - start_time, 's')
    print('idle threads:', len(threads_pool))
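The optimisation note in the listing (avoid rescanning the pool for a free thread) can also be sidestepped altogether. What follows is a minimal sketch of a different arrangement, not the approach used in this post: a fixed number of threads each loop on the queue until it is empty, so no thread ever has to be looked up or replaced. The names `run_workers` and `fake_fetch` are hypothetical, and `fake_fetch` stands in for the real `urlopen` call so the example runs without a network.

```python
import queue
import threading


def run_workers(items, handle, worker_count=4):
    """Process every item with a fixed number of threads.

    Returns a list of (item, outcome) pairs in completion order.
    """
    tasks = queue.Queue()
    for item in items:
        tasks.put(item)

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                # get_nowait() avoids the gap between empty() and get()
                item = tasks.get_nowait()
            except queue.Empty:
                return  # nothing left, so this worker exits
            try:
                outcome = handle(item)
            except Exception as err:
                outcome = err  # keep the error instead of killing the worker
            with results_lock:
                results.append((item, outcome))

    threads = [threading.Thread(target=worker, name=f'worker-{i}')
               for i in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def fake_fetch(url):
    # Hypothetical stand-in for the real download: returns the URL length
    return len(url)
```

Calling `run_workers(url_list, fake_fetch, worker_count=16)` would process all 13 URLs with at most 16 threads, and thread numbers never repeat because no thread is created after start-up.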

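Result (the repeated thread numbers and the 7 idle threads match what a pool size of 10 would produce, so this run appears to have used 10 rather than the 16 in the listing):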

D:\PycharmProjects\python_study\venv3.x\Scripts\python.exe D:/PycharmProjects/python_study/protest/thread_pool.py
thread #0 running
thread #1 running
thread #2 running
thread #3 running
thread #4 running
thread #5 running
thread #6 running
thread #7 running
thread #8 running
thread #9 running
thread #0 running
thread #1 running
thread #2 running
thread #1 <urlopen error timed out>
thread #2 The read operation timed out
13 threads, elapsed: 20.04409170150757 s
idle threads: 7

Process finished with exit code 0
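The two timeout messages in this output come from different exception types. As far as I understand `urllib`, a timeout while connecting or sending the request is wrapped in `URLError` (hence the `<urlopen error ...>` text), while a timeout while reading the response surfaces as a bare `socket.timeout`. The sketch below tells them apart; `describe_failure` is a hypothetical helper name, and the wording of the returned strings is my own.

```python
import socket
from urllib import error


def describe_failure(err):
    """Turn an exception raised by urlopen()/read() into a short label."""
    # HTTPError is a subclass of URLError, so it has to be checked first
    if isinstance(err, error.HTTPError):
        return f'http status {err.code}'
    if isinstance(err, error.URLError):
        if isinstance(err.reason, socket.timeout):
            return 'timed out while connecting or sending the request'
        return f'connection failed: {err.reason}'
    # On Python 3.10+ socket.timeout is an alias of TimeoutError
    if isinstance(err, socket.timeout):
        return 'timed out while reading the response'
    return f'other error: {err}'
```

Inside the `except Exception as err:` branch of the thread's `run` method, `print(self.name, describe_failure(err))` would then show which phase timed out.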

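Practice project: https://gitee.com/heyclock/doc/tree/master/Python/python_study

Addendum: I will also look at the mainstream thread-pool crawler approaches (for how to use the official thread pool, see the article "Python thread pool ThreadPoolExecutor: usage and practice"), study them, and add more here.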

threadpoolexecutor_practice.py

#!/usr/bin/python3
# -*- coding: UTF-8 -*-
# File: threadpoolexecutor_practice.py
import codecs
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED, ALL_COMPLETED, as_completed
from urllib import request

# Pretend to be a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36',
}

url_list = [
    'http://www.baidu.com',
    'https://github.com/FanChael/DocPro',
    'http://www.baidu.com',
    'http://www.baidu.com',
    'http://www.baidu.com',
    'https://github.com/FanChael/DocPro',
    'http://www.baidu.com',
    'http://www.baidu.com',
    'http://www.baidu.com',
    'http://www.baidu.com',
    'http://www.baidu.com',
    'https://github.com/FanChael',
    'https://github.com/FanChael',
]


def spider(url_path):
    data_html = ''
    # Incremental decoder so multi-byte characters split across 1024-byte chunks are not lost
    decoder = codecs.getincrementaldecoder('utf-8')(errors='ignore')
    try:
        req = request.Request(url_path, None, headers)
        # If the scraped content is wrong, selenium or similar is needed to get the dynamic JS content
        with request.urlopen(req, timeout=5) as uf:
            while True:
                data_temp = uf.read(1024)
                if not data_temp:
                    break
                data_html += decoder.decode(data_temp)
        # The scraped data could be stored locally or in a database; either way, post-process it here
        print(url_path, 'done')
    except Exception as err:
        print(str(err))
    return data_html


# Create a thread pool with at most 16 worker threads
executor = ThreadPoolExecutor(max_workers=16)

if __name__ == '__main__':
    tasks = []
    # Run the spider and add each task to the task list
    for url in url_list:
        # Submit the function together with its argument
        task = executor.submit(spider, url)
        tasks.append(task)
    # Wait option 1: block until everything has finished
    # wait(tasks, return_when=ALL_COMPLETED)
    # Wait option 2: handle each task as it finishes
    for future in as_completed(tasks):
        # If spider returned nothing, the result would be None
        data = future.result()
        print(f"main:{data[0:10]}")
    # Wait option 3: replaces submit and waits as it iterates
    # for data in executor.map(spider, url_list):
    #     print(data)
    print('All done')
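One thing the `as_completed` loop loses is which URL each result belongs to, because futures come back in completion order. A common pattern, sketched here under my own names (`fetch_all` and `fake_spider` are hypothetical, and `fake_spider` replaces the real network call), is to keep a dict from future to URL and to catch exceptions per future:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch, max_workers=16):
    """Run fetch(url) for every URL; return (url, result_or_exception) pairs
    in completion order. Duplicate URLs are kept as separate entries."""
    outcomes = []
    # The with-block waits for all workers and shuts the pool down on exit
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        future_to_url = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                outcomes.append((url, future.result()))
            except Exception as err:
                # result() re-raises whatever the worker raised
                outcomes.append((url, err))
    return outcomes


def fake_spider(url):
    # Hypothetical stand-in for spider(): fails for URLs containing 'bad'
    if 'bad' in url:
        raise ValueError(f'cannot fetch {url}')
    return f'<html>{url}</html>'
```

With the real `spider` function in place of `fake_spider`, `fetch_all(url_list, spider)` would return 13 pairs, one per entry in `url_list`.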

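The official thread pool is simpler to use, because the thread management has already been done for you. Click through to the source and the outline is recognisable: it grows the pool in a similar way, wraps the calls, and likewise puts the tasks on a queue. See, for example, the `ThreadPoolExecutor` source in the standard library's `concurrent/futures/thread.py`.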
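To make those three ideas concrete (tasks on a queue, wrapped calls, threads added on demand), here is a deliberately simplified sketch. It is not the CPython source: `TinyExecutor` is a hypothetical class, it starts a new thread on every submit until the limit is reached, and it leaves out the idle-thread bookkeeping and the shutdown safety that the real executor has.

```python
import queue
import threading
from concurrent.futures import Future


class TinyExecutor:
    """Toy executor: a work queue plus threads that are started lazily."""

    def __init__(self, max_workers):
        self._max_workers = max_workers
        self._work_queue = queue.SimpleQueue()
        self._threads = []
        self._lock = threading.Lock()

    def submit(self, fn, *args):
        future = Future()
        # The call is wrapped as (future, fn, args) and queued, not run here
        self._work_queue.put((future, fn, args))
        self._adjust_thread_count()
        return future

    def _adjust_thread_count(self):
        # Grow the pool by one thread per submit until the limit is reached
        with self._lock:
            if len(self._threads) < self._max_workers:
                t = threading.Thread(target=self._worker, daemon=True)
                t.start()
                self._threads.append(t)

    def _worker(self):
        while True:
            item = self._work_queue.get()
            if item is None:  # sentinel placed by shutdown()
                return
            future, fn, args = item
            try:
                future.set_result(fn(*args))
            except Exception as err:
                future.set_exception(err)

    def shutdown(self):
        # Assumes nobody calls submit() while shutdown() is running
        with self._lock:
            threads = list(self._threads)
        for _ in threads:
            self._work_queue.put(None)  # one sentinel per worker
        for t in threads:
            t.join()
```

Usage mirrors the real API: `future = TinyExecutor(16).submit(spider, url)` followed by `future.result()`.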

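As a thread-pool exercise, the code could be packaged better: for example, start with your own rough implementation, then wrap it up as a standalone module that runs with parameters supplied from outside. https://blog.csdn.net/Key_book/article/details/80258022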
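One possible shape for such a module is sketched below, assuming the caller wants results in the same order as the input URLs. The names `fetch` and `crawl`, the shortened `DEFAULT_HEADERS`, and the `opener` parameter are all my own choices; `opener` exists only so the download call can be swapped out, for instance in tests.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib import request

# Shortened header for the sketch; pass the full browser string in practice
DEFAULT_HEADERS = {'User-Agent': 'Mozilla/5.0'}


def fetch(url, timeout=5, opener=request.urlopen):
    """Download one page and return its text."""
    req = request.Request(url, None, DEFAULT_HEADERS)
    with opener(req, timeout=timeout) as resp:
        return resp.read().decode('utf-8', 'ignore')


def crawl(urls, max_workers=16, timeout=5, opener=request.urlopen):
    """Fetch all URLs in a thread pool.

    Returns one entry per URL, in input order: the page text on success,
    or the exception object on failure.
    """
    def safe_fetch(url):
        try:
            return fetch(url, timeout, opener)
        except Exception as err:
            return err

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the order of the inputs
        return list(pool.map(safe_fetch, urls))
```

Another script could then `import` this module and call `crawl(url_list, max_workers=16, timeout=5)` without touching any global state.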

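OK, that's it for now... Next up: database connections and regular-expression matching. After that I should be about ready to look at the company project... everything else can be explored in more depth later...

Appendix: https://blog.csdn.net/Key_book/article/details/80258022 - Python crawling with urllib: disguising the client, timeout settings, exception handling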
