Python crawler with BeautifulSoup: how to parse HTML

Published: 2025/3/15 python 17 豆豆

Parsing HTML and XML strings with BeautifulSoup

Example:

#!/usr/bin/python
# -*- coding: UTF-8 -*-

from bs4 import BeautifulSoup
import re

# String to be parsed

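# Markup restored to match the standard example in the Beautiful Soup documentation
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""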

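# Create a BeautifulSoup object from the HTML string.
# from_encoding only applies to bytes input; html_doc is a str under Python 3, so it is omitted.
soup = BeautifulSoup(html_doc, 'html.parser')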

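# Print the first title tag
print(soup.title)

# Print the tag name of the first title tag
print(soup.title.name)

# Print the text contained in the first title tag
print(soup.title.string)

# Print the tag name of the first title tag's parent
print(soup.title.parent.name)

# Print the first p tag
print(soup.p)

# Print the class attribute of the first p tag (a list, because class can hold several values)
print(soup.p['class'])

# Print the href attribute of the first a tag
print(soup.a['href'])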

'''
A tag's attributes can be added, deleted, or modified. Once again, attributes are handled exactly like dictionary entries.
'''

# Change the href attribute of the first a tag to http://www.baidu.com/
soup.a['href'] = 'http://www.baidu.com/'

# Add a name attribute to the first a tag
soup.a['name'] = u'百度'

# Delete the class attribute of the first a tag
del soup.a['class']

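# Print all child nodes of the first p tag
print(soup.p.contents)

# Print the first a tag
print(soup.a)

# Print all a tags as a list
print(soup.find_all('a'))

# Print the first tag whose id attribute equals link3
print(soup.find(id="link3"))

# Get all of the text content
print(soup.get_text())

# Print all attributes of the first a tag
print(soup.a.attrs)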

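# Loop over every a tag
for link in soup.find_all('a'):
    # Get the href attribute of the link
    print(link.get('href'))

# Loop over the child nodes of soup.p and print each one
for child in soup.p.children:
    print(child)

# Regular-expression match: tags whose name contains "b"
for tag in soup.find_all(re.compile("b")):
    print(tag.name)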
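
The script above reads attributes by indexing, which raises KeyError when an attribute is missing. Here is a minimal Python 3 sketch of the same dictionary-style access using safer lookups. The markup in `snippet` and the variable names are made up for illustration and are not part of the original example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
snippet = (
    '<p class="story intro">'
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a id="link2">Lacie</a>'
    '</p>'
)

soup = BeautifulSoup(snippet, 'html.parser')

# class is a multi-valued attribute, so Beautiful Soup returns a list.
p_classes = soup.p['class']

# Tag.get() returns None instead of raising KeyError for a missing attribute.
hrefs = [a.get('href') for a in soup.find_all('a')]

# find_all(href=True) keeps only the tags that actually carry an href.
with_href = [a['id'] for a in soup.find_all('a', href=True)]

# Attributes can be set and deleted like dictionary entries.
first_link = soup.a
first_link['rel'] = 'nofollow'
del first_link['id']

print(p_classes)
print(hrefs)
print(with_href)
print(first_link.attrs)
```

Using `get()` or the `href=True` filter is usually the safer choice when scraping real pages, where not every tag has the attribute you expect.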

Crawler design approach:
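
One common way to organize a crawler, offered here as an assumption and not as this article's own design, is a queue of URLs to visit, a set of URLs already seen, a downloader, and a parser. The sketch below shows that loop with BeautifulSoup as the parser. The names `crawl`, `fetch`, and `pages` are hypothetical, and `fetch` stands in for a real HTTP client so the example runs without network access.

```python
from collections import deque
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; fetch(url) must return HTML text or None."""
    queue = deque([start_url])   # URLs waiting to be downloaded
    seen = {start_url}           # URLs already queued, to avoid repeats
    titles = {}                  # url -> page title (the extracted data)

    while queue and len(titles) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # download failed or page unknown; skip it
        soup = BeautifulSoup(html, 'html.parser')
        titles[url] = soup.title.get_text() if soup.title else None

        # Queue every link on the page that has not been seen yet.
        for a in soup.find_all('a', href=True):
            link = urljoin(url, a['href'])
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return titles


# Hypothetical in-memory "site" standing in for real HTTP responses.
pages = {
    'http://example.com/': '<title>Home</title><a href="/a">A</a><a href="/b">B</a>',
    'http://example.com/a': '<title>Page A</title><a href="/">Home</a>',
    'http://example.com/b': '<title>Page B</title><a href="/a">A</a>',
}

result = crawl('http://example.com/', pages.get)
print(result)
```

Passing the downloader in as a function keeps the parsing logic testable. In real use, `fetch` would wrap an HTTP library and handle errors, timeouts, and politeness delays.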

Detailed manual:

https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
