Simple Web Scraping with Python

Published: 2023/12/29 · python · 豆豆

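Scraping simple data with Python
I have been learning Python in my spare time. Whether or not I end up using it, it at least broadens my knowledge of programming languages.
Because I already have some Java background, I went through Python's basic syntax quickly. The explanations here may be a little disorganized; I will improve them over time.
1. Import the packages needed to scrape a web page.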

from bs4 import BeautifulSoup          # HTML parsing
import xlwt                            # writing Excel (.xls) files
import re                              # regular expressions
import urllib.request, urllib.error    # request a URL and fetch the page data

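2. Python is a scripting language and has no mandatory main entry point like Java's main method. I do not understand this part deeply yet, but the idea is to give the script an entry point: the check below calls main() only when the file is run directly, not when it is imported.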

if __name__ == '__main__':
    main()
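To make the effect of this check concrete, here is a minimal sketch. The helper run_if_script and the placeholder main() are hypothetical names used only for illustration; they imitate what Python does with the module's __name__ value.

```python
def main():
    # Stand-in for the real scraping workflow.
    return "main() was called"


def run_if_script(module_name):
    # Mirrors the check `if __name__ == '__main__'`:
    # Python sets __name__ to "__main__" only for the file that is run directly.
    if module_name == "__main__":
        return main()
    return None


if __name__ == "__main__":
    print(run_if_script(__name__))
```

When the file is imported from another module, __name__ holds the module's own name instead, so main() is not triggered.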

3. Next, define the main function main(). It should cover:

the address of the page to scrape

fetching the page data, then parsing and filtering it

saving the resulting data to an Excel file

def main():
    # URL of the page to scrape
    basePath = "https://www.duquanben.com/"
    # fetch and parse the data
    dataList = getData(basePath)
    # save the data
    saveData(dataList)

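4. Collect the data from the target page

Because the request is sent from a script (run here in PyCharm) rather than from a browser, it first needs a disguise: copy the User-Agent string that your browser sends, and attach it to the request when fetching the page.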
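As a rough sketch of this disguise, the snippet below only builds the request and inspects it, without sending anything over the network. The helper name build_request is hypothetical, and the User-Agent value is just an example; use the string your own browser reports (visible in the request headers of the browser's developer tools).

```python
import urllib.request

# Assumption: an example User-Agent string; replace it with your browser's own.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:70.0) "
    "Gecko/20100101 Firefox/70.0"
)


def build_request(url):
    """Return a GET request that carries a browser-like User-Agent header."""
    return urllib.request.Request(
        url, headers={"User-Agent": USER_AGENT}, method="GET"
    )


req = build_request("https://www.duquanben.com/")
# Nothing is sent here; the request object is only constructed.
print(req.get_method())
# urllib stores header names capitalized, hence "User-agent".
print(req.get_header("User-agent"))
```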

5. Define the function that fetches the data

(1) Before any data can be read, the site has to accept the request, so the first step is to fetch the page.

def getData(basePath):
    # fetch the raw page (the parsing is added further down)
    html = uskURL(basePath)

(2) The uskURL function is essentially boilerplate that rarely changes; you only need to adjust the User-Agent string to match your browser.

def uskURL(basePath):
    heard = {
        # pose as a browser
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0"
    }
    req = urllib.request.Request(basePath, headers=heard, method="GET")
    html = ""
    try:
        response = urllib.request.urlopen(req)
        html = response.read()
    except urllib.error.URLError as e:
        # an HTTP error carries a status code; other URL errors only carry a reason
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html
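The two hasattr checks exist because urllib raises different exception types: an HTTPError (a subclass of URLError) has both code and reason, while a plain URLError only has reason. The sketch below builds both exceptions by hand, so no network access is needed; describe_error is a hypothetical helper and the URL is a placeholder.

```python
import urllib.error


def describe_error(e):
    """Collect whichever of `code` and `reason` the exception exposes."""
    parts = []
    if hasattr(e, "code"):
        parts.append(f"code={e.code}")
    if hasattr(e, "reason"):
        parts.append(f"reason={e.reason}")
    return ", ".join(parts)


# Constructed by hand for illustration; urlopen would normally raise these.
http_error = urllib.error.HTTPError(
    "https://example.invalid/", 404, "Not Found", None, None
)
url_error = urllib.error.URLError("connection refused")

print(describe_error(http_error))  # code=404, reason=Not Found
print(describe_error(url_error))   # reason=connection refused
```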

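(3) Prepare a list to hold the data, parse the page, and match the fields with regular expressions.
Looking at the page source, each record is wrapped in a <div class="hot-img"> tag, so it is enough to loop over those tags.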

# regular expressions, defined as module-level globals
link = re.compile(r'<h5><a href="(.*)" target="_blank">')
author = re.compile(r'作者：(.*)')
# re.S lets "." match newlines as well
content = re.compile(r'<p><a href="(.*)" target="_blank">(.*?)</a></p>', re.S)

def getData(basePath):
    # fetch the page
    html = uskURL(basePath)
    # parse the page
    bs = BeautifulSoup(html, "html.parser")
    # class is a reserved word in Python, so the argument is spelled class_;
    # without the underscore the call is a syntax error. It matches <div class="hot-img">.
    # t_list = bs.find_all("div", class_="hot-img")
    # print(t_list)
    datalist = []  # collects one entry per record
    for item in bs.find_all("div", class_="hot-img"):
        data = []  # fields of the current record
        item = str(item)  # convert the tag to a string
        # findall(pattern, string): first the regular expression, then the text to search
        linklist = re.findall(link, item)
        data.append(linklist)
        authorlist = re.findall(author, item)
        data.append(authorlist)
        # each match is a (href, text) tuple; only the second element is needed
        matches = re.findall(content, item)
        contentlist = matches[0][1] if matches else ""  # avoids IndexError when nothing matches
        if contentlist == "":
            # placeholder for an empty summary ("no summary yet"); choose whatever you like
            data.append("暫無簡介")
        else:
            data.append(contentlist)
        datalist.append(data)
    return datalist
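To show how the three patterns behave, here is a small sketch that runs them on a hand-written HTML fragment using only the standard library. The fragment is an assumption made up to fit the patterns; it is not copied from the real site, whose markup may differ.

```python
import re

link = re.compile(r'<h5><a href="(.*)" target="_blank">')
author = re.compile(r'作者：(.*)')
content = re.compile(r'<p><a href="(.*)" target="_blank">(.*?)</a></p>', re.S)
# Same pattern without re.S, to show what the flag changes.
content_no_flag = re.compile(r'<p><a href="(.*)" target="_blank">(.*?)</a></p>')

# Hypothetical fragment shaped like one <div class="hot-img"> record.
sample = '''<div class="hot-img">
<h5><a href="/book/1/" target="_blank">Title</a></h5>
作者：someone
<p><a href="/book/1/" target="_blank">first line
second line</a></p>
</div>'''

print(link.findall(sample))             # one captured href per match
print(author.findall(sample))           # "." stops at the end of the line
print(content.findall(sample)[0][1])    # second group of the first match
print(content_no_flag.findall(sample))  # empty: the text spans a newline
```

Without re.S the "." does not match a newline, so a summary that wraps across lines is not found at all; with re.S the same pattern captures both lines.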

6. Save the resulting data to an Excel file

def saveData(dataList):
    # style_compression: whether to compress styles; rarely needed
    Book = xlwt.Workbook(encoding="utf-8", style_compression=0)
    # cell_overwrite_ok: whether a cell may be written more than once
    sheet = Book.add_sheet("小說.xls", cell_overwrite_ok=True)
    # column titles: detail link, pen name, summary
    line = ("詳情鏈接", "筆名", "簡介")
    # line must be a tuple of titles; a single bare string would be indexed character by character
    for item in range(len(line)):
        sheet.write(0, item, line[item])  # write(row, col, value)
    # outer loop: one row per record
    for i in range(len(dataList)):
        data = dataList[i]
        # inner loop: one column per field of that record
        for j in range(len(line)):
            sheet.write(i + 1, j, data[j])
    Book.save("測試.xls")
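The row and column arithmetic in saveData can be checked without xlwt. The sketch below is an illustration only: to_cells is a hypothetical helper that produces the same (row, col, value) triples that would be passed to sheet.write, and the sample record is made up.

```python
def to_cells(header, rows):
    """Yield (row, col, value) triples: header on row 0, records from row 1 on."""
    for col, title in enumerate(header):
        yield (0, col, title)
    for i, record in enumerate(rows):
        for j in range(len(header)):
            yield (i + 1, j, record[j])


header = ("link", "author", "summary")
rows = [["/book/1/", "someone", "No summary"]]  # made-up sample record
cells = list(to_cells(header, rows))
for cell in cells:
    print(cell)

# Pitfall with a one-column header: ("link") is just a string,
# so its length counts characters; ("link",) is a one-element tuple.
print(len(("link")), len(("link",)))
```

The header occupies row 0, which is why each record is written at row i + 1.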
