Python crawler: downloading Daomu Biji (盗墓笔记) by parsing HTML with bs4

Published: 2024/9/19 python 33 豆豆


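Preface:

I recently needed a crawler for an assignment. The site I scraped was Lagou (拉勾網), which returns JSON, so I read the data as dictionaries.

This time I also wanted to get familiar with using bs4 to parse responses that come back as HTML.

I crawled a simple site: http://www.seputu.com

I studied https://www.cnblogs.com/insane-Mr-Li/p/9117005.html and then wrote my own version; the basic principle is much the same.

Notes on the main usage:

Inspecting the page elements shows that the link and title of every chapter are stored in <li></li> elements.

So the first step is to find those <li></li> elements with bs4: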

    import requests
    from bs4 import BeautifulSoup

    url = 'http://www.seputu.com'
    response = requests.get(url)
    req_parser = BeautifulSoup(response.text, features="html.parser")  # <class 'bs4.BeautifulSoup'>
    li = req_parser.find_all('li')  # <class 'bs4.element.ResultSet'>
    # li = req_parser.findAll('li')  # equivalent to the previous line
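To see what find_all returns without touching the network, here is a minimal sketch that runs against an inline HTML string. The SAMPLE_HTML markup and the example.com links are made up for illustration and are not the real structure of the site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real index page.
SAMPLE_HTML = """
<ul>
  <li><a href="http://example.com/ch1.html">Chapter 1</a></li>
  <li><a href="http://example.com/ch2.html">Chapter 2</a></li>
  <li>No link here</li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, features="html.parser")
items = soup.find_all("li")

# Names of the three types involved: the parsed document,
# the list-like result of find_all, and one element of that result.
type_names = (
    type(soup).__name__,
    type(items).__name__,
    type(items[0]).__name__,
)
print(type_names)
print(len(items))
```

The result of find_all behaves like a list, so it can be indexed, measured with len, and iterated over.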

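Next, get the links and names. There are two ways to do it, and they differ only slightly:

1. Use the find method. li is a <class 'bs4.element.ResultSet'>, which is a list and has no find or find_all method of its own, but each element i is a <class 'bs4.element.Tag'>, so find can be called on each element in turn: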

    name_list = []
    href_list = []
    for i in li:
        try:
            href = i.find('a')['href']
            name = i.find('a').text
        except (TypeError, KeyError):
            # skip any <li> that has no <a>, or whose <a> has no href
            continue
        name_list.append(name)
        href_list.append(href)
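The same loop can be written with explicit checks instead of exception handling. The sketch below is one possible variant, not the original author's code; extract_links and the sample markup are hypothetical names introduced for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the third <li> has no link and the fourth has no href.
SAMPLE_HTML = """
<ul>
  <li><a href="http://example.com/ch1.html">Chapter 1</a></li>
  <li><a href="http://example.com/ch2.html">Chapter 2</a></li>
  <li>No link here</li>
  <li><a name="anchor-only">No href</a></li>
</ul>
"""


def extract_links(soup):
    """Return (names, hrefs) for every <li> that contains an <a href=...>."""
    names, hrefs = [], []
    for item in soup.find_all("li"):
        anchor = item.find("a")
        # find returns None when the <li> has no <a> at all
        if anchor is None or not anchor.has_attr("href"):
            continue
        names.append(anchor.get_text().strip())
        hrefs.append(anchor["href"])
    return names, hrefs


names, hrefs = extract_links(BeautifulSoup(SAMPLE_HTML, features="html.parser"))
print(names)
print(hrefs)
```

Checking for None and for the href attribute up front keeps unrelated errors from being swallowed silently.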

2. Convert li back into a <class 'bs4.BeautifulSoup'> object and keep using find_all to search within the li results:

    # re-parse the stringified ResultSet so that find_all can be called on it
    temp = BeautifulSoup(str(li), features="html.parser")
    a = temp.find_all('a')
    name_list = []
    href_list = []
    for i in a:
        name = i.string
        href = i['href']
        name_list.append(name)
        href_list.append(href)
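Re-parsing works, but it is not the only option: each element of the ResultSet is a Tag, and Tag has its own find_all. The sketch below compares the two routes on made-up markup (the SAMPLE_HTML string and example.com URLs are assumptions for illustration).

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real index page.
SAMPLE_HTML = """
<ul>
  <li><a href="http://example.com/ch1.html">Chapter 1</a></li>
  <li><a href="http://example.com/ch2.html">Chapter 2</a></li>
  <li>No link here</li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, features="html.parser")
items = soup.find_all("li")

# Route 1 (as in the text): stringify the ResultSet and parse it again.
reparsed = BeautifulSoup(str(items), features="html.parser")
hrefs_reparsed = [a["href"] for a in reparsed.find_all("a")]

# Route 2: search inside each Tag directly; href=True keeps only
# anchors that actually carry an href attribute.
hrefs_direct = [
    a["href"]
    for item in items
    for a in item.find_all("a", href=True)
]

print(hrefs_reparsed == hrefs_direct)
```

Route 2 avoids a second parse of the same markup, which matters little on a small page but keeps the code shorter.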

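Here the content between <a> and </a> is read through the text or string attribute. Note that string returns None when the tag has more than one child, whereas text concatenates all of the nested text.

It can also be obtained with the findChildren method:

    i.find('a').findChildren(text=True)[0]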
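The difference between these accessors only shows up when the tag contains nested markup. A small sketch on two made-up anchors (the HTML string is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Two hypothetical anchors: one with plain text, one with a nested <b>.
soup = BeautifulSoup(
    '<a href="/x">Plain title</a><a href="/y">Bold <b>title</b></a>',
    features="html.parser",
)
simple, nested = soup.find_all("a")

# .string works when the tag has exactly one child...
simple_string = simple.string
# ...and is None when the tag has several children.
nested_string = nested.string

# .text / get_text() concatenates every piece of nested text.
nested_text = nested.get_text()

# find_all(string=True) returns the individual text nodes in document order.
nested_pieces = [str(piece) for piece in nested.find_all(string=True)]

print(simple_string, nested_string, nested_text, nested_pieces)
```

For chapter titles that may contain inline markup, get_text() is the safer choice because it never returns None.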

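With the names and links in hand, the next step is to pull the text out of each link:

Inspecting the text elements in the same way shows that the novel's text sits in the <p></p> elements inside <div class="content-body">.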

    # page is the index of a chapter in href_list
    response = requests.get(href_list[page])
    req_parser = BeautifulSoup(response.content.decode('utf-8'), features="html.parser")
    div = req_parser.find_all('div', class_="content-body")
    # div = req_parser.find_all('div', {"class": "content-body"})  # equivalent to the previous line

After that, find the p elements inside the div; the idea is the same as before, so I won't repeat it.
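For completeness, that step can be sketched as follows. This is an illustrative variant that searches the div directly rather than re-parsing it; extract_paragraphs and SAMPLE_CHAPTER are hypothetical names, and the markup is invented rather than copied from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical chapter page: only the first div holds the novel text.
SAMPLE_CHAPTER = """
<div class="content-body">
  <p>First paragraph.</p>
  <p>Second <em>paragraph</em>.</p>
  <p></p>
</div>
<div class="sidebar"><p>Not part of the chapter.</p></div>
"""


def extract_paragraphs(html):
    """Return the non-empty paragraph texts inside div.content-body."""
    soup = BeautifulSoup(html, features="html.parser")
    body = soup.find("div", class_="content-body")
    if body is None:
        return []
    paragraphs = []
    for p in body.find_all("p"):
        # get_text() also handles paragraphs with nested tags such as <em>
        text = p.get_text().strip()
        if text:
            paragraphs.append(text)
    return paragraphs


paragraphs = extract_paragraphs(SAMPLE_CHAPTER)
print(paragraphs)
```

Using get_text() here instead of string means paragraphs that contain inline tags are kept rather than dropped.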

Full code:

    # -*- coding: utf-8 -*-
    import requests
    from bs4 import BeautifulSoup

    url = 'http://www.seputu.com'
    response = requests.get(url)
    req_parser = BeautifulSoup(response.content.decode('utf-8'), features="html.parser")
    li = req_parser.find_all('li')
    # re-parse the stringified ResultSet so that find_all can be called on it
    temp = BeautifulSoup(str(li), features="html.parser")
    a = temp.find_all('a')
    name_list = []
    href_list = []
    for i in a:
        name = i.string
        href = i['href']
        name_list.append(name)
        href_list.append(href)


    def download(page):
        """Fetch chapter number `page` and append its text to novel.txt."""
        response = requests.get(href_list[page])
        req_parser = BeautifulSoup(response.content.decode('utf-8'), features="html.parser")
        div = req_parser.find_all('div', class_="content-body")
        temp = BeautifulSoup(str(div), features="html.parser")
        paragraphs = temp.find_all('p')
        text = []
        for p in paragraphs:
            line = p.string
            if line is not None:
                # drop characters a GBK console cannot display before printing
                print(line.encode('gbk', 'ignore').decode('gbk', 'ignore'))
                text.append(line)
        with open('novel.txt', 'a+', encoding='utf-8') as f:
            f.write(name_list[page])
            f.write('\n')
            for line in text:
                f.write(line)
                f.write('\n')


    for i in range(len(href_list)):
        try:
            download(i)
        except Exception as exc:
            # keep going, but report which chapter failed and why
            print('%d failed: %s' % (i, exc))
        print('%d is over' % i)
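One detail worth getting right when appending chapters to the output file is the line separator, which should be the escape sequence '\n' rather than the letter 'n'. The sketch below isolates the file-writing step so it can be checked without any network access; append_chapter is a hypothetical helper, and the chapter titles and paragraphs are made-up sample data.

```python
import os
import tempfile


def append_chapter(path, title, paragraphs):
    """Append one chapter to path: the title line, then one line per paragraph."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(title + "\n")
        for paragraph in paragraphs:
            f.write(paragraph + "\n")


# Write two sample chapters into a temporary file and read them back.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "novel.txt")
    append_chapter(out_path, "Chapter 1", ["First paragraph.", "Second paragraph."])
    append_chapter(out_path, "Chapter 2", ["Third paragraph."])
    with open(out_path, encoding="utf-8") as f:
        lines = f.read().splitlines()

print(lines)
```

Mode "a" is enough here because the file is only written to; "a+" additionally allows reading, which this step does not need.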

The txt file I ended up with has more than 9,000 lines.

Python crawler: downloading Daomu Biji (parsing HTML with bs4), fff_zrx's blog, CSDN blog (blog.csdn.net)
