
[Python Crawler] Scraping a Partially Refreshing Website with Selenium (the URL Never Changes)

Published: 2024/6/1


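When scraping websites you often run into partial dynamic refresh: when you click "next page" or a page number, the data refreshes but the URL in the address bar stays the same. How do you scrape data from a site like this? On the site shown in the figure below, the URL does not change when you click "page 5", so the traditional approach of building a link for each page does not work. This article addresses that problem.

The approach here uses Selenium: locate the "next page" button, click it automatically, and scrape each page in turn. I hope it helps, especially those facing the same problem. If there are mistakes or shortcomings in the article, please bear with me.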

1. Scraping the First Page with Selenium

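First, let's scrape the first page with Selenium. Right-click in the browser and choose "Inspect element" to see the HTML source. As the figure below shows, each row of project information sits in a <tr>...</tr> under the <table class="table table-hover"> node.

Next, expand one of the <tr>...</tr> nodes to see its details, as shown in the figure below: announcement title, publication date, and project location. To grab the announcement title, locate the <div class="div_title text_view"> node, then read its title text and hyperlink.

The complete code is as follows: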

    # coding=utf-8
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    print("start")

    # Open the Firefox browser
    driver = webdriver.Firefox()

    # Load the list page (placeholder address)
    url = 'http://www.xxxx.com/'
    print(url)
    driver.get(url)

    # Locate the title nodes and print each title
    content = driver.find_elements(By.XPATH, "//div[@class='div_title text_view']")
    for u in content:
        print(u.text)

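The output is shown in the figure below:

PS: For the site's security I do not give its address directly; the point is the core idea of the crawler. The code below also omits the address, but the approach is the same, so substitute your own partially refreshing site to test.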

2. Scraping a Partially Refreshing Site with Selenium

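Next we want to scrape page 2 of the site. The steps are:
    1. Create the driver: driver = webdriver.Firefox()
    2. Open the address: driver.get(url)
    3. Locate the nodes and scrape the first page: driver.find_elements(By.XPATH, ...)
    4. Find the "next page" button and call click() to jump
    5. Scrape the second page: driver.find_elements(By.XPATH, ...)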

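The key step is to find the "next page" button and have Selenium click it automatically to trigger the jump, then scrape the second page. The source of the "next page" button is shown in the figure below: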

On this site the "next page" button is always the ninth <li>...</li>, so the core code is:
    nextPage = driver.find_element(By.XPATH, "//ul[@class='pagination']/li[9]/a")
    nextPage.click()
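Because the URL never changes, the script has no built-in signal that the new rows have arrived after the click. The following is a minimal sketch, not the author's method, of polling until the first title on the page differs from the one seen before the click. The helper names `first_title` and `click_next_and_wait` are hypothetical, the two XPaths are the ones used in this article, and the sketch assumes that consecutive pages start with different titles. It passes the locator strategy as the plain string "xpath", which is the value of Selenium's `By.XPATH`.

```python
import time

TITLE_XPATH = "//div[@class='div_title text_view']"
NEXT_XPATH = "//ul[@class='pagination']/li[9]/a"


def first_title(driver):
    """Return the text of the first title on the current page, or None."""
    found = driver.find_elements("xpath", TITLE_XPATH)
    return found[0].text if found else None


def click_next_and_wait(driver, timeout=10.0, poll=0.5):
    """Click the next-page link, then poll until the first title changes.

    Returns True if the list refreshed within `timeout` seconds, else False.
    Assumption: the first title differs between consecutive pages.
    """
    before = first_title(driver)
    driver.find_element("xpath", NEXT_XPATH).click()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            current = first_title(driver)
            # Ignore the moment when the table is empty mid-refresh.
            if current is not None and current != before:
                return True
        except Exception:
            # The node may have been replaced while we read it; retry.
            pass
        time.sleep(poll)
    return False
```

A call such as `click_next_and_wait(driver)` could stand in for a click followed by a fixed pause. Selenium also ships `WebDriverWait` together with `expected_conditions.staleness_of`, which serves a similar purpose by waiting for an old element to be detached from the page.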

The complete code is as follows:

    # coding=utf-8
    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    print("start")
    driver = webdriver.Firefox()
    url = 'http://www.xxxx.com/'
    print(url)
    driver.get(url)

    # Project titles
    titles = driver.find_elements(By.XPATH, "//div[@class='div_title text_view']")
    for u in titles:
        print(u.text)

    # Hyperlinks
    urls = driver.find_elements(By.XPATH, "//div[@class='div_title text_view']/a")
    for u in urls:
        print(u.get_attribute("href"))

    # Publication dates
    times = driver.find_elements(By.XPATH, "//table[@class='table table-hover']/tbody/tr/td[2]")
    for u in times:
        print(u.text)

    # Locations
    places = driver.find_elements(By.XPATH, "//table[@class='table table-hover']/tbody/tr/td[3]")
    for u in places:
        print(u.text)

    # Click "next page"
    nextPage = driver.find_element(By.XPATH, "//ul[@class='pagination']/li[9]/a")
    print(nextPage.text)
    nextPage.click()
    time.sleep(5)

    # Scrape the data on page 2
    content = driver.find_elements(By.XPATH, "//div[@class='div_title text_view']")
    for u in content:
        print(u.text)

The output is shown below. The second page was scraped successfully as well, and the script collected the announcement title, hyperlink, publication date, and location.

    >>>
    start
    http://www.xxxx.com/
    觀山湖區依法治國普法教育基地（施工）中標候選人公示
    興義市2017年農村公路生命防護工程一期招標公告
    安龍縣市政廣場地下停車場10kV線路遷改、10kV臨時用電、10kV電纜敷設及400V電纜敷設工程施工公開競爭性談判公告
    劍河縣小香雞種苗孵化場建設項目（場坪工程）中標公示
    安龍縣棲鳳生態濕地走廊建設項目（原冰河步道A、B段）10kV線路、400V線路、220V線路及變壓器遷改工程施工招標
    鎮寧自治縣2017年簡嘎鄉農村飲水安全鞏固提升工程(施工)招標廢標公示
    S313線安龍縣城至普坪段道路改擴建工程勘察招標公告
    S313線安龍縣城至普坪段道路改擴建工程勘察招標公告
    貴州中煙工業有限責任公司2018物資公開招標-卷煙紙中標候選人公示
    冊亨縣者樓河納福新區河段生態治理項目（上游一標段）初步設計招標公告
    http://www.gzzbw.cn/historydata/view/?id=116163
    http://www.gzzbw.cn/historydata/view/?id=114995
    http://www.gzzbw.cn/historydata/view/?id=115720
    http://www.gzzbw.cn/historydata/view/?id=116006
    http://www.gzzbw.cn/historydata/view/?id=115719
    http://www.gzzbw.cn/historydata/view/?id=115643
    http://www.gzzbw.cn/historydata/view/?id=114966
    http://www.gzzbw.cn/historydata/view/?id=114965
    http://www.gzzbw.cn/historydata/view/?id=115400
    http://www.gzzbw.cn/historydata/view/?id=116031
    2017-12-22
    2017-12-22
    2017-12-22
    2017-12-22
    2017-12-22
    2017-12-22
    2017-12-22
    2017-12-22
    2017-12-22
    2017-12-22
    未知
    興義市
    安龍縣
    未知
    安龍縣
    未知
    安龍縣
    安龍縣
    未知
    冊亨縣
    下一頁
    冊亨縣丫他鎮板其村埃近1～2組村莊綜合整治項目
    冊亨縣者樓河納福新區河段生態治理項目（上游一標段）勘察招標公告
    惠水縣撤并建制村通硬化路施工總承包中標候選人公示
    冊亨縣丫他鎮板街村村莊綜合整治項目施工招標招標公告
    鎮寧自治縣農村環境整治工程項目（環翠街道辦事處）施工（三標段）(二次）（項目名稱）交易結果公示
    丫他鎮生態移民附屬設施建設項目
    劍河縣城市管理辦公室的劍河縣仰阿莎主題文化廣場護坡綠化工程中標公示
    冊亨縣者樓河納福新區河段生態治理項目（上游一標段）施工圖設計
    冊亨縣2017年巖架城市棚戶區改造項目配套基礎設施建設項目中標公示
    數字甕安地理空間框架建設項目
    >>>

Firefox jumped to page 2 successfully. Add a loop and you can walk through many pages and scrape each one; see Part 3 for details.
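The loop mentioned above could look roughly like the sketch below. It is an illustration under stated assumptions rather than the article's final code: `crawl_titles`, `max_pages`, and `delay` are hypothetical names, and treating a missing next-page link as the last page is a guess about how a site might behave. It relies on the fact that `find_elements` returns an empty list when nothing matches, whereas `find_element` raises an exception.

```python
import time

TITLE_XPATH = "//div[@class='div_title text_view']"
NEXT_XPATH = "//ul[@class='pagination']/li[9]/a"


def crawl_titles(driver, max_pages=5, delay=5.0):
    """Collect the title texts from up to `max_pages` pages of the list."""
    pages = []
    for page_no in range(1, max_pages + 1):
        # Read the texts right away; the elements are replaced on refresh.
        found = driver.find_elements("xpath", TITLE_XPATH)
        pages.append([element.text for element in found])
        if page_no == max_pages:
            break
        # An empty list here is taken to mean there is no further page
        # (an assumption about the site, not something the post states).
        next_links = driver.find_elements("xpath", NEXT_XPATH)
        if not next_links:
            break
        next_links[0].click()
        time.sleep(delay)  # fixed pause, as in the code above
    return pages
```

The function returns one list of titles per page, so `len(crawl_titles(driver))` also tells you how many pages were actually visited.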

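3. Scraping Detail Pages with Selenium

The code above collects the detail-page hyperlink (URL) of every announcement row. I originally planned to scrape the detail pages with BeautifulSoup, but the requests were blocked. A detail page is shown in the figure below: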

So I define a second Selenium Firefox driver to scrape the detail pages. The complete code is as follows:

    # coding=utf-8
    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    print("start")

    # Open two Firefox browsers: driver for the list, driver2 for detail pages
    driver = webdriver.Firefox()
    driver2 = webdriver.Firefox()

    url = 'http://www.xxxx.com/'
    print(url)
    driver.get(url)


    def crawl_page():
        # Project titles
        titles = driver.find_elements(By.XPATH, "//div[@class='div_title text_view']")
        # Hyperlinks
        urls = driver.find_elements(By.XPATH, "//div[@class='div_title text_view']/a")
        # Detail text, fetched with the second browser
        num = []
        for u in urls:
            href = u.get_attribute("href")
            driver2.get(href)
            con = driver2.find_element(By.XPATH, "//div[@class='col-xs-9']")
            num.append(con.text)
        # Publication dates
        times = driver.find_elements(By.XPATH, "//table[@class='table table-hover']/tbody/tr/td[2]")
        # Locations
        places = driver.find_elements(By.XPATH, "//table[@class='table table-hover']/tbody/tr/td[3]")

        # Print all results, one complete record at a time
        print(len(num))
        i = 0
        while i < len(num):
            print(titles[i].text)
            print(urls[i].get_attribute("href"))
            print(times[i].text)
            print(places[i].text)
            print(num[i])
            print("")
            i = i + 1


    # Page 1
    crawl_page()

    # Click "next page" five times and scrape each page
    j = 0
    while j < 5:
        nextPage = driver.find_element(By.XPATH, "//ul[@class='pagination']/li[9]/a")
        print(nextPage.text)
        nextPage.click()
        time.sleep(5)
        crawl_page()
        j = j + 1

Note the while loop, which prints one complete tender record at a time:

    print(len(num))
    i = 0
    while i < len(num):
        print(titles[i].text)
        print(urls[i].get_attribute("href"))
        print(times[i].text)
        print(places[i].text)
        print(num[i])
        print("")
        i = i + 1
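The loop walks five parallel lists by index, so if one XPath matches fewer nodes than the others, the loop fails partway through a page with an IndexError. As a hedged alternative, the sketch below checks the lengths first and then zips the columns into one dict per announcement. The function name `build_records` and the key names are hypothetical; the inputs are plain strings, so the caller would pass `[t.text for t in titles]` and so on.

```python
def build_records(titles, urls, times, places, details):
    """Combine five parallel lists of strings into one dict per announcement."""
    columns = (titles, urls, times, places, details)
    lengths = [len(column) for column in columns]
    if len(set(lengths)) != 1:
        # Fail before printing or storing anything misaligned.
        raise ValueError("columns have different lengths: %s" % lengths)
    keys = ("title", "url", "time", "place", "content")
    return [dict(zip(keys, row)) for row in zip(*columns)]


rows = build_records(
    ["觀山湖區依法治國普法教育基地（施工）中標候選人公示"],
    ["http://www.gzzbw.cn/historydata/view/?id=116163"],
    ["2017-12-22"],
    ["未知"],
    ["(detail text)"],
)
print(rows[0]["url"])  # http://www.gzzbw.cn/historydata/view/?id=116163
```

Keeping each record in one dict also makes later steps, such as writing to a database, easier to express than passing five indexed values around.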

The output is shown in the figure below:

One complete record looks like this:

    觀山湖區依法治國普法教育基地（施工）中標候選人公示
    http://www.gzzbw.cn/historydata/view/?id=116163
    2017-12-22
    未知
    觀山湖區依法治國普法教育基地（施工）中標候選人公示
    來源: 貴州百利工程建設咨詢有限公司
    發布時間: 2017-12-22
    根據法律、法規、規章和招標文件的規定，觀山湖區司法局、貴陽觀山湖投資（集團）旅游文化產業發展有限公司（代建）的觀山湖區依法治國普法教育基地（施工）（項目編號：BLZB01201744）已于2017年 12月22日進行談判，根據談判小組出具的競爭性談判報告，現公示下列內容：
    第一中標候選人：貴州鴻友誠建筑安裝有限公司
    中標價：1930000.00（元）
    工期：57日歷天
    第二中標候選人：貴州隆瑞建設有限公司
    中標價：1940000.00（元）
    工期：60日歷天
    第三中標候選人：鳳岡縣建筑工程有限責任總公司
    中標價：1953285.00（元）
    工期：60日歷天
    中標結果公示至2017年12月25日。
    招標人：貴陽觀山湖投資（集團）有限公司
    招標代理人：貴州百利工程建設咨詢有限公司
    2017年12月22日

Finally, readers can use the MySQLdb library to store the scraped content locally. If you need to restrict the data by time, for example to 2015, add a time.sleep(5) pause before the crawl starts, use it to click 2015 or type a keyword in the browser by hand, and then let the crawl proceed. I also suggest using Selenium for the automatic page jumps and BeautifulSoup for the detail pages.

    # coding=utf-8
    import time

    import MySQLdb
    from selenium import webdriver
    from selenium.webdriver.common.by import By


    # Store one record in the database
    # Parameters: title, URL, publication time, location, content
    def DatabaseInfo(title, url, fbtime, fbplace, content):
        conn = None
        try:
            conn = MySQLdb.connect(host='localhost', user='root',
                                   passwd='123456', port=3306,
                                   db='20180426ztb')
            cur = conn.cursor()  # database cursor
            # Avoids: UnicodeEncodeError: 'latin-1' codec can't encode character
            conn.set_character_set('utf8')
            cur.execute('SET NAMES utf8;')
            cur.execute('SET CHARACTER SET utf8;')
            cur.execute('SET character_set_connection=utf8;')
            # Parameterized insert into the ztb table
            sql = '''insert into ztb
                        (title, url, fbtime, fbplace, content)
                        values(%s, %s, %s, %s, %s);'''
            cur.execute(sql, (title, url, fbtime, fbplace, content))
            conn.commit()
            cur.close()
            print('Database insert succeeded')
        # Error handling
        except MySQLdb.Error as e:
            print("Mysql Error %d: %s" % (e.args[0], e.args[1]))
        finally:
            # Close only if the connection was actually opened
            if conn is not None:
                conn.close()


    print("start")

    # Open two Firefox browsers: driver for the list, driver2 for detail pages
    driver = webdriver.Firefox()
    driver2 = webdriver.Firefox()

    url = 'http://www.gzzbw.cn/historydata/'
    print(url)
    driver.get(url)


    def crawl_page():
        # Project titles
        titles = driver.find_elements(By.XPATH, "//div[@class='div_title text_view']")
        # Hyperlinks
        urls = driver.find_elements(By.XPATH, "//div[@class='div_title text_view']/a")
        # Detail text, fetched with the second browser
        num = []
        for u in urls:
            href = u.get_attribute("href")
            driver2.get(href)
            con = driver2.find_element(By.XPATH, "//div[@class='col-xs-9']")
            num.append(con.text)
        # Publication dates
        times = driver.find_elements(By.XPATH, "//table[@class='table table-hover']/tbody/tr/td[2]")
        # Locations
        places = driver.find_elements(By.XPATH, "//table[@class='table table-hover']/tbody/tr/td[3]")

        # Print all results and store each record
        print(len(num))
        i = 0
        while i < len(num):
            print(titles[i].text)
            print(urls[i].get_attribute("href"))
            print(times[i].text)
            print(places[i].text)
            print(num[i])
            print("")
            DatabaseInfo(titles[i].text, urls[i].get_attribute("href"),
                         times[i].text, places[i].text, num[i])
            i = i + 1


    # Page 1
    crawl_page()

    # Click "next page" 100 times and scrape each page
    j = 0
    while j < 100:
        nextPage = driver.find_element(By.XPATH, "//ul[@class='pagination']/li[9]/a")
        print(nextPage.text)
        nextPage.click()
        time.sleep(5)
        crawl_page()
        print("Pages crawled:", (j + 2))
        j = j + 1

The records stored in the database:
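To try the storage step without a MySQL server, the sketch below uses Python's built-in sqlite3 module as a stand-in for MySQLdb. It is not the author's setup. The table and column names (ztb, title, url, fbtime, fbplace, content) follow the article, while `open_store`, `save_record`, and the choice of `url` as primary key are assumptions made here so that re-running the crawler does not store the same announcement twice. Note that sqlite3 uses `?` placeholders where MySQLdb uses `%s`.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the store; ':memory:' keeps it in RAM for testing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS ztb (
               url TEXT PRIMARY KEY,
               title TEXT,
               fbtime TEXT,
               fbplace TEXT,
               content TEXT)"""
    )
    return conn


def save_record(conn, title, url, fbtime, fbplace, content):
    """Insert one record; return False if this URL was already stored."""
    before = conn.total_changes
    # Parameterized query: values are passed separately, never formatted
    # into the SQL string.
    conn.execute(
        "INSERT OR IGNORE INTO ztb (title, url, fbtime, fbplace, content) "
        "VALUES (?, ?, ?, ?, ?)",
        (title, url, fbtime, fbplace, content),
    )
    conn.commit()
    return conn.total_changes > before
```

With a file path such as `open_store("ztb.db")` the data persists between runs, and `save_record` could be called at the point where the crawler prints each record.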

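Finally, I hope this article helps, especially if you need to scrape a partially refreshing site.
If there are mistakes or shortcomings in the article, please bear with me.

(By: Eastmount, 2018-04-26, 11:30 a.m., http://blog.csdn.net/eastmount/)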
