
Java crawler tool XPath extraction, 2020-07-16: crawler data extraction with XPath

Published: 2023/12/4


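XPath

XPath, short for XML Path Language, is a language for locating information in XML documents. It was originally designed for searching XML documents, but it works equally well for searching HTML documents.

XPath's selection capabilities are powerful, and its path expressions are concise. XPath 2.0 and later additionally provide more than 100 built-in functions covering strings, numbers, dates and times, and node and sequence processing. lxml, the library used below, implements XPath 1.0, whose core function library is much smaller.

XPath became a W3C standard on 16 November 1999 and was designed for use by XSLT, XPointer and other XML-processing software.

A commonly used node-selection tool is the Chrome extension XPath Helper (installed by downloading the crx extension package).

XPath works by navigating the tags and attributes of the HTML document tree to locate the data you want, so using it effectively requires a working knowledge of HTML.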

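Common rules

nodename    selects the child nodes of the current node that are named nodename

/    selects direct children of the current node (at the start of an expression, the path starts from the document root)

//    selects descendants of the current node at any depth (at the start of an expression, the search covers the whole document)

.    selects the current node

..    selects the parent of the current node

@    selects attributes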
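The rules above can be tried out on a small document. The following is a minimal sketch, assuming lxml is installed; the markup and variable names are made up for illustration.

```python
from lxml import etree

# Hypothetical sample markup, used only to illustrate the rules.
doc = etree.HTML(
    '<div><ul>'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a></li>'
    '</ul></div>'
)

ul = doc.xpath('//ul')[0]                         # // : search the whole document
children = [e.tag for e in ul.xpath('li')]        # nodename : children named li
links = [e.tag for e in ul.xpath('./li/a')]       # . and / : step down from the current node
parent_tag = ul.xpath('..')[0].tag                # .. : parent of the current node
classes = ul.xpath('li/@class')                   # @ : attribute values
absolute = doc.xpath('/html/body/div/ul/li')      # leading / : path from the document root

print(children, links, parent_tag, classes, len(absolute))
```

Calling xpath() on an element such as ul makes that element the current node, which is why the relative expressions 'li' and '..' are evaluated from ul rather than from the root.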

Installing lxml

In a terminal, run:

pip install lxml==4.5.0

Example

from lxml import etree  # import the etree module from lxml

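# The markup inside this string was lost when the page was extracted. The version below is a
# reconstruction consistent with the queries and results later in the article; the class
# values of the first, third and fifth items are assumptions.
text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''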

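html = etree.HTML(text)  # parse the HTML string and return the root Element
print(html)  # prints an Element object such as <Element html at 0x...>
result = etree.tostring(html)  # serialize the tree to bytes
print(result.decode('utf-8'))  # decode the bytes and print the string
print('*' * 50)

# A local file can also be parsed
html1 = etree.parse('a.html', parser=etree.HTMLParser())
print(html1)  # prints an ElementTree object
result1 = etree.tostring(html1)
print(result1.decode('utf-8'))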

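a.html (the markup of this file was lost when the page was extracted; the version below is a reconstruction consistent with the queries and results in the following sections, and the class values of the first, third and fifth items are assumptions. The p elements queried in sections 1.6 and 1.10 could not be recovered):

<html>
<head><title>Title</title></head>
<body>
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</body>
</html>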

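1.1 All nodes

XPath expressions that start with // are typically used to select every node in the document that matches.

For example:

from lxml import etree
html = etree.parse('a.html', etree.HTMLParser())
result1 = html.xpath('//*')  # find all nodes
print(result1)

* matches every element. The result is a list in which each item is an Element object; the text after "Element" in its printed form is the tag name.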
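To see what the returned list contains, it can help to print the tag name of each Element. This is a small sketch on a made-up one-item document rather than on a.html.

```python
from lxml import etree

# Hypothetical minimal document; etree.HTML wraps it in html and body elements.
doc = etree.HTML('<div><ul><li><a href="link1.html">first item</a></li></ul></div>')

all_nodes = doc.xpath('//*')               # every element, in document order
tags = [node.tag for node in all_nodes]    # the tag name of each Element

print(all_nodes)
print(tags)
```

The list includes html and body even though the input string did not contain them, because the HTML parser adds the missing wrapper elements.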

1.2 Specific nodes

result2 = html.xpath('//li')  # find all li elements

This finds every li tag in the HTML document.

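1.3 Child nodes

Use / to find an element's direct children and // to find its descendants at any depth.

For example, find all a nodes that are direct children of li nodes:

result3 = html.xpath('//li/a')  # a nodes that are direct children of li
print(result3)

Find all a nodes that are descendants of li nodes:

result4 = html.xpath('//li//a')  # all a nodes anywhere under li
print(result4)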
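The difference between / and // only shows up when a node is nested more than one level deep. The sketch below uses made-up markup in which one link sits inside a span.

```python
from lxml import etree

# Hypothetical markup: the second link is a grandchild of its li, not a direct child.
doc = etree.HTML(
    '<ul>'
    '<li><a href="direct.html">direct child</a></li>'
    '<li><span><a href="nested.html">nested</a></span></li>'
    '</ul>'
)

direct = doc.xpath('//li/a/@href')       # only a nodes that are direct children of li
any_depth = doc.xpath('//li//a/@href')   # a nodes at any depth below li

print(direct)
print(any_depth)
```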

1.4 Parent nodes

Find the class value of the parent of the a tag whose href is "link4.html":

result5 = html.xpath('//a[@href="link4.html"]/../@class')
print(result5)  # ['item-1']

The parent can also be reached with the parent:: axis:

# An element has exactly one parent. parent::* matches it whatever its tag; * can be replaced with a specific tag name.
result5 = html.xpath('//a[@href="link4.html"]/parent::*/@class')
print(result5)  # ['item-1']
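The two XPath forms can be compared side by side, together with lxml's getparent() method, which reaches the same node from Python. This is a sketch on made-up markup.

```python
from lxml import etree

# Hypothetical sample markup.
doc = etree.HTML(
    '<ul>'
    '<li class="item-0"><a href="link3.html">third item</a></li>'
    '<li class="item-1"><a href="link4.html">fourth item</a></li>'
    '</ul>'
)

via_dots = doc.xpath('//a[@href="link4.html"]/../@class')          # .. shorthand
via_axis = doc.xpath('//a[@href="link4.html"]/parent::li/@class')  # parent axis with a tag name
link = doc.xpath('//a[@href="link4.html"]')[0]
via_api = link.getparent().get('class')                            # the same step through the lxml API

print(via_dots, via_axis, via_api)
```

Writing parent::li instead of parent::* returns an empty list when the parent is not an li, which can serve as a cheap structural check.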

1.5 Attribute filtering

Select the li nodes whose class is item-1:

result6 = html.xpath('//li[@class="item-1"]')
print(result6)

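1.6 Getting text

Use the text() node test in XPath to get the text inside a node.

Getting the text inside li nodes

Get the text of the a tags under the specified li tags:

e1 = html.xpath('//li[@class="item-1"]/a/text()')  # returns a list: ['second item', 'fourth item']
print(e1)

Get all text under the specified li tags:

e2 = html.xpath('//li[@class="item-0"]//text()')
print(e2)  # returns three results

Get the text of the p tags:

e3 = html.xpath('//p/text()')
print(e3)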
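How /text() and //text() differ is easiest to see on a node that mixes its own text with child elements. The sketch below uses made-up markup and also shows the XPath string() function, which joins all descendant text into one string.

```python
from lxml import etree

# Hypothetical markup: the li has text of its own before and after the a element.
doc = etree.HTML(
    '<ul>'
    '<li class="item-0">intro <a href="link1.html">first item</a> tail</li>'
    '</ul>'
)

own_text = doc.xpath('//li/text()')    # text nodes that are direct children of li
all_text = doc.xpath('//li//text()')   # text nodes at any depth below li
joined = doc.xpath('string(//li)')     # one string built from all descendant text

print(own_text)
print(all_text)
print(joined)
```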

1.7 Getting attributes

Get the href value of every a tag under the specified li tags:

e4 = html.xpath('//li[@class="item-1"]/a/@href')
print(e4)
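Attribute filtering and attribute retrieval use the same @ syntax in different positions: inside square brackets it filters elements, and as the last step of the path it returns values. A minimal sketch on made-up markup:

```python
from lxml import etree

# Hypothetical sample markup.
doc = etree.HTML(
    '<ul>'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a></li>'
    '<li class="item-1"><a href="link4.html">fourth item</a></li>'
    '</ul>'
)

matched = doc.xpath('//li[@class="item-1"]')        # predicate: returns Element objects
hrefs = doc.xpath('//li[@class="item-1"]/a/@href')  # final step: returns attribute strings
# Element.get() and Element.text read the same data once the elements are selected.
pairs = [(a.get('href'), a.text) for a in doc.xpath('//li[@class="item-1"]/a')]

print(len(matched), hrefs, pairs)
```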

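1.8 Attributes with multiple values

How do you match a tag whose attribute, such as class, holds several values?

Use the contains() function: the node matches if the attribute value contains the given string.

# The markup inside this string was lost in extraction; the li below is a reconstruction
# that fits the query, and its exact class and href values are assumptions.
text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class,"li")]/a/text()')
print(result)  # returns ['first item']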
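Because contains() is a plain substring test, it also matches class values that merely include the letters, such as "list-item". The sketch below, on made-up markup, contrasts it with an exact comparison and with a common XPath 1.0 idiom that matches a whole class token.

```python
from lxml import etree

# Hypothetical markup: only the first li has the class token "li".
doc = etree.HTML(
    '<ul>'
    '<li class="li li-first"><a href="link.html">first item</a></li>'
    '<li class="list-item"><a href="other.html">other item</a></li>'
    '</ul>'
)

exact = doc.xpath('//li[@class="li"]/a/text()')               # whole attribute must equal "li"
substring = doc.xpath('//li[contains(@class, "li")]/a/text()')  # any value containing "li"
# Pad the normalized class list with spaces, then look for the padded token.
token = doc.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)

print(exact, substring, token)
```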

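1.9 Multiple attributes

To match on several attributes of a tag at once, join the conditions with an operator such as and.

# The markup inside this string was lost in extraction; the two li elements below are a
# reconstruction that fits the query and its result, and their attribute values are assumptions.
text = '''
<li class="li li-first" name="item"><a href="link1.html">first item</a></li>
<li class="li li-second"><a href="link2.html">second item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class,"li") and @name="item"]/a/text()')
print(result)  # returns ['first item']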
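Conditions can be combined with and, with or, or by chaining several predicates. The following sketch uses made-up markup and attribute values.

```python
from lxml import etree

# Hypothetical markup with two attributes per li.
doc = etree.HTML(
    '<ul>'
    '<li class="li li-first" name="item"><a href="link1.html">first item</a></li>'
    '<li class="li li-second" name="other"><a href="link2.html">second item</a></li>'
    '</ul>'
)

both = doc.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
either = doc.xpath('//li[contains(@class, "li-second") or @name="item"]/a/text()')
chained = doc.xpath('//li[contains(@class, "li")][@name="item"]/a/text()')  # same effect as and here

print(both, either, chained)
```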

1.10 Common operators

Selecting several kinds of element at once

Find both p tags and li tags:

e4 = html.xpath('//p|//li')
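Besides the union operator |, XPath 1.0 has comparison operators, arithmetic operators such as mod, and the boolean operators and and or. The sketch below exercises a few of them; the markup and the data-price attribute are invented for illustration.

```python
from lxml import etree

# Hypothetical markup with a numeric attribute to compare against.
doc = etree.HTML(
    '<div><p>intro</p><ul>'
    '<li data-price="5">a</li><li data-price="12">b</li><li data-price="20">c</li>'
    '</ul></div>'
)

union = [e.tag for e in doc.xpath('//p | //li')]                              # | merges two node sets
cheap = doc.xpath('//li[@data-price < 10]/text()')                            # numeric comparison
middle = doc.xpath('//li[@data-price >= 10 and @data-price != 20]/text()')    # and, >=, !=
even = doc.xpath('//li[position() mod 2 = 0]/text()')                         # mod on the position
total = doc.xpath('sum(//li/@data-price)')                                    # returns a float

print(union, cheap, middle, even, total)
```

The union result comes back in document order, not in the order the two paths were written.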

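1.11 Selecting by position

Sometimes a condition matches several nodes when only one of them is wanted, such as the first or the last.

# Selecting by position (XPath positions start at 1)

# the first li element
e7 = html.xpath('//li[1]/a/@href')
print(e7)  # ['link1.html']

# the last li element
e8 = html.xpath('//li[last()]/a/@href')
print(e8)  # ['link5.html']

# the first two li elements
e9 = html.xpath('//li[position()<3]/a/@href')
print(e9)  # ['link1.html', 'link2.html']

# the third li element from the end
e10 = html.xpath('//li[last()-2]/a/@href')
print(e10)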
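A position predicate is counted among the siblings of each parent, not across the whole result, which matters once a page has more than one list. The sketch below uses made-up markup with two lists to show the difference between //li[1] and (//li)[1].

```python
from lxml import etree

# Hypothetical markup with two separate lists.
doc = etree.HTML(
    '<div>'
    '<ul><li>a1</li><li>a2</li><li>a3</li></ul>'
    '<ul><li>b1</li><li>b2</li></ul>'
    '</div>'
)

first_each = doc.xpath('//li[1]/text()')                    # first li of every parent
first_overall = doc.xpath('(//li)[1]/text()')               # first li in the whole document
last_each = doc.xpath('//li[last()]/text()')                # last li of every parent
first_two = doc.xpath('//ul[1]/li[position() < 3]/text()')  # first two li of the first ul

print(first_each, first_overall, last_each, first_two)
```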

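1.12 Selecting with node axes

XPath provides many axes for selecting nodes relative to the current one, including children, siblings, parents and ancestors.

# Selecting with node axes
from lxml import etree

# The markup inside this string was lost in extraction. The version below is a reconstruction
# consistent with the queries that follow; the class values are assumptions.
text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html"><span>first item</span></a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
html = etree.HTML(text)

result = html.xpath('//li[1]/ancestor::*')
print(result)  # all ancestors of the first li: html, body, div, ul

result = html.xpath('//li[1]/ancestor::div')
print(result)  # ancestors restricted to div

result = html.xpath('//li[1]/attribute::*')
print(result)  # all attribute values of the first li node

result = html.xpath('//li[1]/child::a[@href="link1.html"]')
print(result)  # direct children, restricted to a nodes whose href is link1.html

result = html.xpath('//li[1]/descendant::span')
print(result)  # descendants restricted to span nodes, so the a node is not included

result = html.xpath('//li[1]/following::*[2]/text()')
print(result)  # following::* matches every node after the current one, but the index [2] keeps only the second of them

result = html.xpath('//li[1]/following-sibling::*')
print(result)  # all sibling nodes after the current node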
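The axes are easier to compare when each query returns tag names or strings instead of Element objects. The following sketch runs several axes on a shortened, made-up three-item list and also adds preceding-sibling, which the examples above do not use.

```python
from lxml import etree

# Hypothetical three-item list; the first link wraps its text in a span.
doc = etree.HTML(
    '<div><ul>'
    '<li class="item-0"><a href="link1.html"><span>first item</span></a></li>'
    '<li class="item-1"><a href="link2.html">second item</a></li>'
    '<li class="item-0"><a href="link3.html">third item</a></li>'
    '</ul></div>'
)

ancestors = [e.tag for e in doc.xpath('//li[1]/ancestor::*')]
attrs = doc.xpath('//li[1]/attribute::*')
spans = doc.xpath('//li[1]/descendant::span/text()')
following = [e.tag for e in doc.xpath('//li[1]/following::*')]
second_following = doc.xpath('//li[1]/following::*[2]/text()')
later_siblings = doc.xpath('//li[1]/following-sibling::*/@class')
earlier_siblings = doc.xpath('//li[3]/preceding-sibling::li/a/@href')

print(ancestors, attrs, spans)
print(following, second_following)
print(later_siblings, earlier_siblings)
```

following:: walks every later element in document order, including the children of later siblings, while following-sibling:: stays on the same level.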

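Hands-on: crawling the Baidu Tieba forum 校花吧

Analysis:

Examine the forum's URLs and page structure.

The analysis shows that the content we want to crawl is spread across multiple pages.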
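A paged crawl usually splits into two parts: generating one URL per page, and running the same XPath over each page's HTML. The sketch below is only an outline under stated assumptions: the base URL, the query parameters kw and pn, the page size of 50 and the thread-title class are all hypothetical placeholders and not the real forum's values, and a hard-coded string stands in for a downloaded page so the code runs without network access.

```python
from urllib.parse import urlencode

from lxml import etree

BASE_URL = 'https://example.com/forum'  # hypothetical endpoint, not the real forum URL
PAGE_SIZE = 50                          # assumed number of threads per page


def page_urls(keyword, pages):
    """Build one URL per page by advancing an assumed offset parameter."""
    return [
        BASE_URL + '?' + urlencode({'kw': keyword, 'pn': index * PAGE_SIZE})
        for index in range(pages)
    ]


def extract_links(page_html):
    """Return the href of every link marked with the assumed thread-title class."""
    doc = etree.HTML(page_html)
    return doc.xpath('//a[@class="thread-title"]/@href')


# Stand-in for the HTML of one downloaded page.
sample_page = (
    '<ul>'
    '<li><a class="thread-title" href="/p/1">thread one</a></li>'
    '<li><a class="thread-title" href="/p/2">thread two</a></li>'
    '</ul>'
)

urls = page_urls('xpath', 3)
links = extract_links(sample_page)
print(urls)
print(links)
```

In a real crawl, each URL from page_urls would be downloaded and its HTML passed to extract_links, with the XPath adjusted to whatever structure the page analysis reveals.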
