HTML content search methods based on the bs4 library

Published: 2023/12/18

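I. An information extraction example

Extract all URL links from an HTML page

Approach: 1) Find all <a> tags

2) Parse each <a> tag and extract the link given in its href attribute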

>>> import requests
>>> r= requests.get("https://python123.io/ws/demo.html")
>>> demo=r.text
>>> demo
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,'html.parser')

>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

>>> for link in soup.find_all('a'):
...     print(link.get("href"))
...
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
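The two steps above can be sketched as a small reusable function. This is only an illustration: the name extract_links and the inline sample string are hypothetical, and the sample is parsed from a string instead of being fetched over the network. Passing href=True asks bs4 to return only <a> tags that actually carry an href attribute.

```python
from bs4 import BeautifulSoup


def extract_links(html):
    """Return the href value of every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only <a> tags on which the attribute is present
    return [a.get("href") for a in soup.find_all("a", href=True)]


# Hypothetical sample: one real link and one anchor without an href
sample = (
    '<p><a href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>'
    ' and <a name="top">anchor without href</a></p>'
)
links = extract_links(sample)
print(links)
```

Without the href=True filter, link.get("href") would simply return None for the second anchor.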

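II. HTML content search methods based on the bs4 library

<>.find_all(name, attrs, recursive, string, **kwargs) searches a soup object, or any tag inside it, for matching content.

It returns a list-type object that holds the search results.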
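As a quick check of the return type, the sketch below (using a short hypothetical HTML string, not the demo page) shows that the result behaves like an ordinary list: it has a length, supports indexing, and is empty when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical two-link snippet used only for this illustration
soup = BeautifulSoup(
    '<p><a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a></p>',
    "html.parser",
)

found = soup.find_all("a")        # two matches
missing = soup.find_all("table")  # no <table> in the snippet

print(isinstance(found, list), len(found))
print(found[0].string)
print(missing)
```

Because an empty result is just an empty list, looping over the return value is safe even when nothing was found.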

1. name: a search string for tag names

>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all(['a','b'])
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> for tag in soup.find_all(True):  # if the tag name given is True, every tag in the current soup is returned
...     print(tag.name)
...
html
head
title
body
p
b
p
a
a
>>> import re

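>>> for tag in soup.find_all(re.compile('b')):  # the regular expression is applied with search(), so every tag whose name contains b matches; use '^b' to require names that start with b
...     print(tag.name)
...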
body
b
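The different kinds of name filter can be compared side by side. The sketch below uses a small hypothetical document and a helper called names (also hypothetical) that lists the matched tag names. It also shows the difference between a pattern that only needs to occur somewhere in the tag name and one anchored with ^.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical document; tags in order: html, head, title, body, p, b, a
html = (
    "<html><head><title>demo</title></head>"
    '<body><p><b>bold</b> <a href="#">link</a></p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")


def names(name_filter):
    """Return the names of all tags matched by the given name filter."""
    return [tag.name for tag in soup.find_all(name_filter)]


print(names("a"))               # a single tag name
print(names(["a", "b"]))        # any name in the list, in document order
print(names(True))              # every tag
print(names(re.compile("b")))   # names containing "b"
print(names(re.compile("t")))   # names containing "t": html and title
print(names(re.compile("^t")))  # names starting with "t": title only
```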

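2. attrs: a search string for tag attribute values; a specific attribute can also be named in the search. A bare string passed as the second argument is matched against the class attribute.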

>>> soup.find_all('p','course')
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]

>>> soup.find_all(id='link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> soup.find_all(id='link')
[]
>>> import re
>>> soup.find_all(id=re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
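The attribute searches above can be written in a few equivalent ways. The following sketch, on a trimmed and hypothetical copy of the demo markup, compares the positional form with the class_ keyword and the attrs dictionary, and repeats the exact versus regular-expression match on id.

```python
import re

from bs4 import BeautifulSoup

# Trimmed, hypothetical copy of the demo markup
html = (
    '<p class="title"><b>Courses</b></p>'
    '<p class="course">'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.'
    "</p>"
)
soup = BeautifulSoup(html, "html.parser")

# Three ways to ask for <p> tags whose class is "course"
by_position = soup.find_all("p", "course")
by_keyword = soup.find_all("p", class_="course")  # class_ because class is a Python keyword
by_dict = soup.find_all("p", attrs={"class": "course"})
print(len(by_position), by_position == by_keyword == by_dict)

# A plain string must equal the whole attribute value; a regex only has to match part of it
exact_ids = [tag["id"] for tag in soup.find_all(id="link")]
regex_ids = [tag["id"] for tag in soup.find_all(id=re.compile("link"))]
print(exact_ids, regex_ids)
```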

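3. recursive: whether to search all descendants; the default is True

With recursive=False only the direct children of the node are searched. Called on the soup root, soup.find_all('a', recursive=False) returns an empty list, which shows that the root has no a tags among its direct children; the a tags sit further down among its descendants.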
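A minimal sketch of the recursive parameter, on a hypothetical one-link document: the same search is run from the soup root with and without recursion, and then from the <p> tag that directly contains the link.

```python
from bs4 import BeautifulSoup

# Hypothetical document: the <a> tag is three levels below the root
html = '<html><body><p class="course"><a id="link1">Basic Python</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

all_levels = soup.find_all("a")                     # default: search every descendant
children_only = soup.find_all("a", recursive=False)  # only direct children of the root
root_children = [tag.name for tag in soup.find_all(True, recursive=False)]
from_p = soup.p.find_all("a", recursive=False)       # <a> is a direct child of <p>

print(len(all_levels), children_only, root_children, len(from_p))
```

The only direct child of the soup root here is the html tag, which is why the non-recursive search for a tags comes back empty.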

4. string: a search string for the text between <>...</>

>>> soup
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.find_all(string = "Basic Python")
['Basic Python']
>>> import re
>>> soup.find_all(string=re.compile("python"))
['This is a python demo page', 'The demo python introduces several python courses.']
>>>
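The string searches above depend on exact and case-sensitive matching, which is easy to overlook. The sketch below, on a hypothetical snippet, contrasts an exact string, a partial string, a case-sensitive pattern, and a case-insensitive pattern, and then combines string with a tag name so that tags are returned instead of strings.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical snippet with one lowercase "python" and two capitalised ones
html = (
    "<title>This is a python demo page</title>"
    '<p><a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="Basic Python")  # the whole string must match
partial = soup.find_all(string="Python")      # no string equals "Python" exactly
lower_only = soup.find_all(string=re.compile("python"))
any_case = soup.find_all(string=re.compile("python", re.IGNORECASE))

# With a tag name as well, the matching tags are returned, not the strings
tags = soup.find_all("a", string=re.compile("Python"))

print(exact, partial, lower_only, len(any_case), [t["id"] for t in tags])
```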

<tag>(..) is equivalent to <tag>.find_all(..)

soup(..) is equivalent to soup.find_all(..)
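The equivalence can be checked directly. In this sketch (hypothetical snippet), calling the soup object or a tag like a function gives the same result as calling find_all with the same arguments.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two links inside one paragraph
soup = BeautifulSoup(
    '<body><p class="course"><a id="link1">Basic Python</a> '
    '<a id="link2">Advanced Python</a></p></body>',
    "html.parser",
)

same_on_soup = soup("a") == soup.find_all("a")
same_on_tag = soup.p("a", id="link2") == soup.p.find_all("a", id="link2")
print(same_on_soup, same_on_tag, len(soup("a")))
```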

Seven extension methods

<>.find()

<>.find_parents()

<>.find_parent()

<>.find_next_siblings()

<>.find_next_sibling()

<>.find_previous_siblings()

<>.find_previous_sibling()
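A short sketch of how the seven methods relate to find_all, on a hypothetical snippet in which the tags have no text between them. The singular forms return the first match (or None), the plural forms return a list; the parent methods search upward through ancestors, and the sibling methods search forward or backward among nodes at the same level.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: <a>, <b>, <a> are siblings inside one <p>
html = (
    '<html><body><p class="course">'
    '<a id="link1">Basic Python</a><b>and</b><a id="link2">Advanced Python</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")               # first match only
nothing = soup.find("table")         # None when there is no match
last = soup.find("a", id="link2")

parent = first.find_parent("p")                                       # nearest matching ancestor
ancestors = [t.name for t in first.find_parents(["p", "body"])]       # nearest first

next_one = first.find_next_sibling("a")                               # first following sibling <a>
next_all = [t.name for t in first.find_next_siblings(True)]           # all following sibling tags

prev_one = last.find_previous_sibling("a")                            # first preceding sibling <a>
prev_all = [t.name for t in last.find_previous_siblings(True)]        # nearest first

print(first["id"], nothing, parent["class"], ancestors)
print(next_one["id"], next_all, prev_one["id"], prev_all)
```

All seven accept the same filter arguments as find_all, so the name, attrs, and string filters shown earlier carry over.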

Reposted from: https://www.cnblogs.com/suitcases/p/11232139.html
