A Beginner's Python Crawler (Part 1)

Preface
A beginner-level Python crawler needs only the following four techniques:

- the string method `find`
- list slicing, `list[-x:-y]`
- file read/write operations
- the `while` loop
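As a taste, here is a tiny end-to-end demo of those four building blocks. The tag string and the file name `demo_urls.txt` are made-up examples, not taken from the real blog page:

```python
# A minimal demo of the four building blocks: find, slicing, file I/O, while.
# The anchor string below is a hypothetical example.
tag = '<a href="http://example.com/post1.html">Post One</a>'

start = tag.find('href="') + 6        # find: locate where the link begins
end = tag.find('.html') + 5           # find: locate where it ends
url = tag[start:end]                  # slicing: cut the URL out of the string

urls = [url]
i = 0
while i < len(urls):                  # while: walk over the collected URLs
    with open("demo_urls.txt", "w") as f:   # file I/O: write them to disk
        f.write(urls[i] + "\n")
    i += 1

print(url)
```

Everything in this article is a variation on this pattern, just applied to real page source.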
Principle:
Everything you see on a web page corresponds to source code, so a crawler boils down to two parts: fetching a page's source code, and extracting what you need from it.
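Those two halves can be sketched as one small function each. `fetch` is only defined here, not called, so no network access is needed; the `<title>` markers below are stand-ins, not the blog's real markup:

```python
# The two halves of the principle: fetch the source, then extract from it.
def fetch(url):
    """Download a page's raw HTML source."""
    try:
        from urllib.request import urlopen   # Python 3
    except ImportError:
        from urllib import urlopen           # Python 2
    return urlopen(url).read()

def extract(html, start_mark, end_mark):
    """Cut out the text between two markers in the source."""
    a = html.find(start_mark) + len(start_mark)
    b = html.find(end_mark, a)
    return html[a:b]

print(extract("<title>Hello</title>", "<title>", "</title>"))
```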
Step 1: locate the code fragment for the content you want to crawl. For a single article,
this is the code fragment corresponding to the article, and from it we need to slice out the URL:
http://blog.sina.com.cn/s/blog_4701280b0102wrup.html
Step 2: fetch that URL and save it to disk.

Import the built-in urllib library:

```python
import urllib

content = urllib.urlopen(url).read()
filename = "xxx"

# Save as HTML
with open(filename, "w") as f:
    f.write(content)

# Or save as TXT
with open("…../" + filename + ".txt", "w") as f:
    f.write(content)
```

For all the articles on the index page, however, we read the entire page with `urllib.urlopen(url).read()`, cut each article's URL out of that content, and save each page to disk.
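For the index-page case just described, the extraction step can be sketched as a loop of repeated `find` calls. The two-anchor `page` string below is a hypothetical stand-in for the result of `urllib.urlopen(url).read()`:

```python
# Repeatedly slice article URLs out of page source with find().
# 'page' is a made-up stand-in for the real index page's HTML.
page = ('<a href="http://example.com/a.html">A</a>'
        '<a href="http://example.com/b.html">B</a>')

urls = []
pos = page.find('href="')
while pos != -1:                      # find() returns -1 when nothing is left
    end = page.find('.html', pos) + 5
    urls.append(page[pos + 6:end])
    pos = page.find('href="', end)

print(urls)
```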
Data source: Han Han's blog
http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html
Part 1: download a single blog post and save it locally

Step 1: analyze the HTML source

Right-click and choose "Inspect element" to view the page source (Chrome shortcut: F12), then look for the article title 寫給那個茶水妹的《乘風破浪》誕生… in the source, searching within the body with Ctrl+F.

The fragment containing the article title follows the pattern `<a title="" target="_blank" href="URL">title</a>`, so we search this string for the parts we need.
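The Ctrl+F lookup described above is exactly what `str.find` does in code. The source string and title below are stand-ins, not the real page:

```python
# Locate a known article title inside saved page source, the way the
# Ctrl+F inspection step does by hand. Source and title are stand-ins.
source = '<div><a href="http://example.com/p.html">Some Title</a></div>'
pos = source.find("Some Title")
if pos != -1:
    print("title found at index", pos)   # find() gives the character offset
```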
Step 2: process the code and extract the URL

As in the overview, import the built-in urllib library:

```python
import urllib

content = urllib.urlopen(url).read()
filename = "xxx"

# Save as HTML
with open(filename, "w") as f:
    f.write(content)

# Or save as TXT
with open("…../" + filename + ".txt", "w") as f:
    f.write(content)
```
Code implementation:

```python
# -*- coding: utf-8 -*-
import urllib

# Use escape characters for the quotes inside the string
str0 = "<a title=\"\" target=\"_blank\" href=\"http://blog.sina.com.cn/s/blog_4701280b0102wrup.html\">寫給那個茶水妹的《乘風破浪》誕生…</a>"
# The pattern is: href="link" ... ">title</a>"

# Indices for the title
title_1 = str0.find(r">")
title_2 = str0.find(r"</a>")
title = str0[title_1 + 1:title_2]
print title

# Indices for the http link
href = str0.find(r"href=")
html = str0.find(r".html")
# Cut out the URL
url = str0[href + 6:html + 5]

# read() returns the raw HTML as a str
content = urllib.urlopen(url).read()

m = url.find("blog_")
filename = url[m:]
filename_0 = "F://python/PyCharmWorkpalce/Crawler/pacong_data/"
filename_1 = filename_0 + filename

# Save as HTML
with open(filename_1, "w+") as f:
    f.write(content)
# Save as a txt file
with open(filename_1 + ".txt", "w+") as f:
    f.write(content)
# Save as a txt file named after the title; the source file is UTF-8
# encoded, so the path must be decoded to unicode first
with open(unicode(filename_0 + title + ".txt", "utf-8"), "w+") as f:
    f.write(content)
```

Output: (screenshot not included in the source)
Part 2: crawl every article on the index page and save them locally

This works much like crawling a single article: read the full source of the index page, then cut each article's URL out of it. Analyzing the article anchors:

```html
<a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102wrup.html">寫給那個茶水妹的《乘風破浪》誕生…</a></span>
<a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102eo83.html">《論電影的七個元素》——關于我對電…</a></span>
```

we need to find the pattern shared by every article in the full page content. The distinctive field is `<a title="" target="_blank"`.
```python
# -*- coding: utf-8 -*-
# author : santi
# function : crawl every article on the blog's index page
# time :
import urllib
import time

# Read the full content of the index page
str0 = "http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html"
con = urllib.urlopen(str0).read()

# Print con to observe the article-title pattern:
# <a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102elmo.html">2013年09月27日</a></span>
# <a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102wruo.html">寫給那個茶水妹的《乘風破浪》誕生…</a></span>
with open(r"F:\python\PyCharmWorkpalce\Crawler\pacong_data\context.txt", "w") as f:
    f.write(con)

url_all = [""] * 60
url_name = [""] * 60
index = con.find("<a title=\"\" target=\"_blank")
href = con.find("href=\"http:", index)
html = con.find(".html\">", href)
title = con.find("</a></span>", html)

i = 0
# find() returns -1 when there is no further match, which means every
# article has been collected, so the loop exits.
while index != -1 and href != -1 and html != -1 and title != -1 and i < 50:
    url_all[i] = con[href + 6:html + 5]
    url_name[i] = con[html + 7:title]
    print "finding... " + url_all[i]
    index = con.find("<a title=\"\" target=\"_blank", title)
    href = con.find("href=\"http:", index)
    html = con.find(".html\">", href)
    title = con.find("</a></span>", html)
    i += 1
else:
    print "Find End!"

# Local storage
# Example URL: http://blog.sina.com.cn/s/blog_4701280b0102wrup.html
m_0 = url_all[0].find("blog_")
m_1 = url_all[0].find(".html") + 5
filename_0 = "F://python/PyCharmWorkpalce/Crawler/pacong_data/"

j = 0
while j < i:
    filename_1 = url_all[j][m_0:m_1]
    content = urllib.urlopen(url_all[j]).read()
    print "downloading.... " + filename_1
    with open(filename_0 + filename_1, "w+") as f:
        f.write(content)
    with open(filename_0 + filename_1 + ".txt", "w+") as f:
        f.write(content)
    with open(unicode(filename_0 + url_name[j] + ".txt", "utf-8"), "w+") as f:
        f.write(content)
    time.sleep(15)   # pause between requests to go easy on the server
    j += 1
print "Download article finished! "
```

Output:
```
finding... http://blog.sina.com.cn/s/blog_4701280b0102wrup.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102wruo.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102eohi.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102eo83.html
...
finding... http://blog.sina.com.cn/s/blog_4701280b0102dyao.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102dxmp.html
Find End!
downloading.... blog_4701280b0102wrup.html
downloading.... blog_4701280b0102wruo.html
downloading.... blog_4701280b0102eohi.html
...
downloading.... blog_4701280b0102dyao.html
downloading.... blog_4701280b0102dxmp.html
Download article finished!
```

Screenshot of the downloaded files: (image not included in the source)
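The scripts in this article are Python 2 (`urllib.urlopen`, `print` statements). As a rough sketch, on Python 3 the same anchor-parsing loop could look like the following; `find_articles` is a helper name invented here, the sample string reuses the anchor format shown earlier, and in real use `urllib.request.urlopen(url).read()` returns bytes that would need decoding first:

```python
# Python 3 sketch of the crawl loop's core. Parsing is kept in a
# separate function so it can be exercised offline on a sample string.
def find_articles(con):
    """Return (url, title) pairs for every article anchor in the source."""
    results = []
    title = 0
    while True:
        index = con.find('<a title="" target="_blank', title)
        if index == -1:               # find() returns -1: nothing left
            break
        href = con.find('href="http:', index)
        html = con.find('.html">', href)
        title = con.find('</a></span>', html)
        results.append((con[href + 6:html + 5], con[html + 7:title]))
    return results

sample = ('<a title="" target="_blank" '
          'href="http://blog.sina.com.cn/s/blog_4701280b0102wrup.html">'
          'Post title</a></span>')
print(find_articles(sample))
```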
Summary

With nothing more than `find`, list slicing, file read/write, and a `while` loop, we downloaded a single blog post and then every post on the blog's index page.
The above is the complete content of "A Beginner's Python Crawler (Part 1)" as collected and organized by 生活随笔; we hope it helps you solve the problems you have run into.