Python Selenium crawler: a detailed guide to crawling with Python + Selenium

Published: 2023/12/19

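I. Background

1. Selenium

Selenium is a tool for automated testing of web applications. It runs directly in the browser and supports mainstream browsers such as Chrome and Firefox. From code you can interact with elements on a page (click, type, and so on) and read the content of a given element.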

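2. Pros and cons

Disadvantages:

Compared with a crawler that captures traffic, constructs requests, and parses the responses, Selenium has to start a browser environment, and every operation (interacting with elements, reading element content, and so on) has to wait until the page has finished loading. It is therefore much slower than constructing requests directly.
For content that has been specially processed as an anti-crawling measure, such as font obfuscation (see Maoyan) or digits replaced by images (see Ziroom), you may not be able to read the data you want.

[Screenshot: Ziroom, which replaces digits with images]

[Screenshot: Maoyan, which uses font obfuscation]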

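Advantages:

No complicated packet capture, request construction, or response parsing is needed, so development is comparatively easier.
Its request parameters are identical to those of a normal user's browser, and its access pattern looks more like a normal user's, so it is less likely to be caught by anti-crawling strategies.
The browser environment runs JS files automatically, so you do not have to reverse-engineer obfuscated JS that generates the parameters used for human verification, such as the _sn parameter for Mafengwo hotel reviews or the params and encSecKey parameters for NetEase Cloud Music comments. You can capture the traffic yourself to see them.
If you need to scrape information on a single front-end page that comes from different back-end APIs, such as the basic hotel information, prices, and reviews on an OTA hotel detail page, a single page load in Selenium triggers all three API calls, which is convenient.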

II. Implementation

1. Environment

Python 3.6 + macOS

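2. Dependencies

Selenium

The project name is written with a capital S, but the module is imported with a lowercase s. pip treats package names case-insensitively, so either spelling works when installing.

    pip install selenium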

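3. Browser driver (webdriver)

Loading a browser environment requires downloading the matching browser driver; Chrome is used here.

Download: http://npm.taobao.org/mirrors/chromedriver/ . Choose the version that matches your installed Chrome, download and unzip it, and put it anywhere you like.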

4. Hello world
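
A minimal sketch of this step, assuming the list-page URL from the complete example in section IV. The helper names (`selenium_major`, `open_browser`, `hello_world`) are hypothetical and not part of Selenium. The version check is there because Selenium 3 takes the driver path as `executable_path`, while Selenium 4 takes it through a `Service` object.

```python
import time

# URL taken from the complete example in section IV.
LIST_URL = "http://hotel.qunar.com/city/beijing_city/"


def selenium_major(version):
    """Return the major version number from a string such as '3.141.0'."""
    return int(version.split(".")[0])


def open_browser(driver_path):
    """Start Chrome through the chromedriver binary at driver_path (hypothetical helper)."""
    # Imported here so this module still loads where Selenium is not installed.
    import selenium
    from selenium import webdriver

    if selenium_major(selenium.__version__) >= 4:
        # Selenium 4 takes the driver path through a Service object.
        from selenium.webdriver.chrome.service import Service
        return webdriver.Chrome(service=Service(executable_path=driver_path))
    # Selenium 3 takes the driver path directly.
    return webdriver.Chrome(executable_path=driver_path)


def hello_world(driver_path, url=LIST_URL, wait_seconds=6):
    """Open url, wait a fixed time for it to load, and return the driver."""
    driver = open_browser(driver_path)
    driver.get(url)
    time.sleep(wait_seconds)  # crude fixed wait for the page to finish loading
    return driver
```

Calling `driver = hello_world('/path/to/chromedriver')` with your own driver path would leave a `driver` object that the snippets in the rest of this section can use.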

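At this point you can use webdriver's built-in methods to read element content or interact with elements.

    from selenium.webdriver.common.by import By

    # Return the element whose ID is js_block_beijing_city_7810
    hotel_info = driver.find_element(By.ID, 'js_block_beijing_city_7810')
    print(hotel_info.text)  # the hotel information shown on the list page
    # Likewise, By.CLASS_NAME, By.NAME, and so on can be used for lookups.
    # Selenium 3 also offers find_element_by_id and friends; they were removed in Selenium 4.3.

You can also use find_elements to look up a group of elements matching a condition; it returns them as a list.

    from selenium.webdriver.common.by import By

    # When the class value contains spaces (several class names), look it up with a
    # CSS selector; a class-name lookup raises an error for such a compound value.
    hotel_list = driver.find_elements(By.CSS_SELECTOR, "[class='b_result_box js_list_block b_result_commentbox']")
    print(hotel_list)  # a list of WebElement objects, one per hotel; use .text on each for its content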
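
The attribute selector `[class='...']` only matches when the class attribute is exactly that string, with the same order and spacing. A chained class selector such as `.a.b.c` matches any element that carries all of those classes, in any order. The helper below is a hypothetical sketch that builds the chained form; whether the looser match is what you want on this particular page is an assumption.

```python
def classes_to_css(class_attr):
    """Turn a class attribute such as 'a b c' into the selector '.a.b.c'."""
    names = class_attr.split()
    if not names:
        raise ValueError("class attribute is empty")
    return "".join("." + name for name in names)


selector = classes_to_css("b_result_box js_list_block b_result_commentbox")
print(selector)
# With a live driver (not created here) the lookup would be:
# hotel_list = driver.find_elements(By.CSS_SELECTOR, selector)
```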

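5. Disabling image loading

When you do not need to scrape images, you can turn off image loading to save time. This only changes a local browser setting, so the parameters sent to the site are not affected.

    from selenium import webdriver

    chrome_opt = webdriver.ChromeOptions()
    # 2 = block images
    prefs = {"profile.managed_default_content_settings.images": 2}
    chrome_opt.add_experimental_option("prefs", prefs)
    path = ''  # path to the driver
    # Selenium 3 style; Selenium 4.10+ expects service=Service(path) instead of executable_path
    browser_noPic = webdriver.Chrome(executable_path=path, options=chrome_opt)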

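III. Interacting with elements through webdriver

1. Simulating a mouse click

    from selenium.webdriver.common.by import By

    hotel_info = driver.find_element(By.ID, "js_plugin_tag_beijing_city_7810")
    hotel_info.click()  # opens the hotel detail page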

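2. Simulating keyboard input

    from selenium.webdriver.common.by import By

    hotel_search = driver.find_element(By.ID, "jxQ")
    hotel_search.send_keys("如")
    hotel_search.send_keys("如家")
    # The first character typed into the search box gets selected, so the full text
    # only goes in on the second call. Alternatively, simulate pressing the
    # right arrow key to deselect and then continue typing.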
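
Rather than relying on a fixed second call, one option is to check what the box actually contains and retry. The sketch below uses a hypothetical helper, `type_text`, and only calls `clear`, `send_keys`, and `get_attribute` on the element. It assumes the search box exposes its typed text through the `value` attribute, which has not been verified against the Qunar page.

```python
def type_text(element, text, max_attempts=3):
    """Type text into element and retry until its value matches.

    element is expected to behave like a Selenium WebElement
    (clear, send_keys, get_attribute). Returns True on success.
    """
    for _ in range(max_attempts):
        element.clear()  # start from an empty box
        element.send_keys(text)
        if element.get_attribute("value") == text:
            return True
    return False
```

With a live driver this would be called as `type_text(driver.find_element(By.ID, "jxQ"), "如家")`.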

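3. Simulating scrolling down

webdriver wraps mouse operations in the ActionChains class, which has to be imported before use:

    from selenium.webdriver.common.action_chains import ActionChains
    from selenium.webdriver.common.by import By

    # Pick one element at the top of the page and one at the bottom,
    # and simulate dragging the mouse from top to bottom
    start = driver.find_element(By.CLASS_NAME, 'e_above_header')
    target = driver.find_element(By.CLASS_NAME, 'qn_footer')
    ActionChains(driver).drag_and_drop(start, target).perform()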
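
An alternative that does not depend on specific header and footer elements is to scroll with JavaScript through `driver.execute_script`. The sketch below is a swapped-in technique, not the author's method: the hypothetical helper `scroll_to_bottom` keeps scrolling until the page height stops growing. The pause length and the round limit are assumptions to tune per site.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=10):
    """Scroll down until the page height stops growing; return the rounds used."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_no  # nothing new was loaded, so this is the bottom
        last_height = new_height
    return max_rounds
```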

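In addition, webdriver offers many other interactions, such as hovering, double-clicking, and holding down the left mouse button, which are not covered here.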

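IV. A complete browser-simulating crawler

    from selenium import webdriver
    from selenium.webdriver.common.action_chains import ActionChains
    from selenium.webdriver.common.by import By
    import time

    # Path to the driver downloaded earlier
    path = '/Users/./Desktop/chromedriver'
    # Selenium 3 style; Selenium 4.10+ expects service=Service(path) instead of executable_path
    driver = webdriver.Chrome(executable_path=path)

    url = 'http://hotel.qunar.com/city/beijing_city/'
    driver.get(url)
    time.sleep(6)  # wait for the page to load before continuing

    # Pick one element at the top of the page and one at the bottom,
    # and simulate dragging the mouse from top to bottom
    start = driver.find_element(By.CLASS_NAME, 'e_above_header')
    target = driver.find_element(By.CLASS_NAME, 'qn_footer')
    ActionChains(driver).drag_and_drop(start, target).perform()
    time.sleep(5)  # wait for the page to load before continuing

    hotel_link_list = driver.find_elements(By.CSS_SELECTOR, "[class='item_price js_hasprice']")
    print("Hotels on this page:", len(hotel_link_list))
    list_window = driver.current_window_handle  # handle of the list page

    # Any element on the detail page can be scraped here
    list_hotel_info = []

    def hotel_info_crawler():
        list_hotel_info.append([
            driver.find_element(By.CLASS_NAME, "info").text,
            driver.find_element(By.CLASS_NAME, "js-room-table").text,
            driver.find_element(By.CLASS_NAME, "dt-module").text,
        ])

    for i in range(len(hotel_link_list)):
        hotel_link_list[i].click()
        # Read the handles after the click so the newly opened detail page is included
        driver.switch_to.window(driver.window_handles[-1])
        hotel_info_crawler()
        driver.close()  # close the detail page that has been scraped
        driver.switch_to.window(list_window)  # go back to the list page
        print("Hotels scraped so far:", i + 1)

    # Pagination can be added here to keep crawling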
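
The fixed `time.sleep` calls above either waste time or turn out too short on a slow connection. A common alternative is to poll for a condition. Selenium ships `WebDriverWait` in `selenium.webdriver.support.ui` for this; the sketch below is a hand-rolled stand-in (the hypothetical helper `wait_until`) that shows the idea without needing a browser. The timeout and interval defaults are assumptions.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll condition() until it returns a truthy value or the timeout elapses.

    Returns the truthy value; raises TimeoutError otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


# With a live driver (not created here), the list of price links could be awaited
# like this, because find_elements returns an empty list until they appear:
# links = wait_until(lambda: driver.find_elements(
#     By.CSS_SELECTOR, "[class='item_price js_hasprice']"))
```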

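V. Scraping key data with screenshots + OCR

For information that has been specially processed, such as the Maoyan box-office figures and Ziroom prices mentioned above, reading the text of the target element directly does not work. Such data can be scraped by taking a screenshot and running OCR on it.

Taking Ziroom's rent as an example:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Path to the driver downloaded earlier
    path = '/Applications/Google Chrome.app/Contents/chromedriver'
    # Selenium 3 style; Selenium 4.10+ expects service=Service(path) instead of executable_path
    driver = webdriver.Chrome(executable_path=path)

    url = 'http://www.ziroom.com/z/vr/61715463.html'
    driver.get(url)
    price = driver.find_element(By.CLASS_NAME, 'room_price')
    print(price.text)
    # Ziroom replaces the price digits with images, so this does not return the
    # actual price; the image has to be captured and run through OCR.

    # Take a screenshot of just this element and save it
    price.screenshot('/Users/./Desktop/price.png')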

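Installing an OCR tool:

Tesseract is an open-source OCR engine that can recognize more than 100 languages (Chinese, English, Korean, Japanese, German, French, and so on). However, Tesseract is poor at recognizing handwriting and is only suitable for printed text.

    # Installs only tesseract, without the training tools or other language packs;
    # recognizing Chinese requires an extra download
    # Download: https://github.com/tesseract-ocr/tessdata
    brew install tesseract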

Using Tesseract:

    # Recognize the image and write the text to result.txt
    tesseract ~/price.png result

Using Tesseract from Python:

First install the dependency: pip install pytesseract

    import pytesseract
    from PIL import Image

    # Open the image
    image = Image.open('price.png')
    code = pytesseract.image_to_string(image)
    print(code)
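
OCR output usually carries stray characters such as a currency sign, spaces, or a trailing newline, so the recognized string still has to be turned into a number. The sketch below uses two hypothetical helpers. `parse_price` is plain Python and assumes the image holds a single integer price. `read_price` passes `--psm 7` so Tesseract treats the image as one line of text; whether that setting helps on Ziroom's images is an assumption.

```python
import re


def parse_price(ocr_text):
    """Return the integer found in OCR output, or None if there are no digits."""
    # Drop spaces and thousands separators that OCR may insert between digits.
    # This assumes the text holds only one number.
    cleaned = re.sub(r"[ ,]", "", ocr_text)
    match = re.search(r"\d+", cleaned)
    return int(match.group()) if match else None


def read_price(image_path):
    """OCR a single-line price image and parse it (needs pytesseract and Pillow)."""
    import pytesseract
    from PIL import Image

    # --psm 7 tells Tesseract to treat the image as a single line of text.
    text = pytesseract.image_to_string(Image.open(image_path), config="--psm 7")
    return parse_price(text)


print(parse_price("¥ 2,890 /月\n"))
```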

VI. To do

Dragging a slider with the mouse
