A Beginner's Python Crawler (Part 1)

Preface
A beginner-level Python crawler needs only the following four techniques:

- the string method `find`
- list slicing, `list[-x:-y]`
- file read/write operations
- the `while` loop
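As a taste, here is a tiny end-to-end demo of those four building blocks. The tag string and the file name `demo_urls.txt` are made-up examples, not taken from the real blog page:

```python
# A minimal demo of the four building blocks: find, slicing, file I/O, while.
# The anchor string below is a hypothetical example.
tag = '<a href="http://example.com/post1.html">Post One</a>'

start = tag.find('href="') + 6        # find: locate where the link begins
end = tag.find('.html') + 5           # find: locate where it ends
url = tag[start:end]                  # slicing: cut the URL out of the string

urls = [url]
i = 0
while i < len(urls):                  # while: walk over the collected URLs
    with open("demo_urls.txt", "w") as f:   # file I/O: write them to disk
        f.write(urls[i] + "\n")
    i += 1

print(url)
```

Everything in this article is a variation on this pattern, just applied to real page source.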
Principle:
Everything you see on a web page corresponds to source code, so a crawler boils down to two parts: fetching a page's source code, and extracting what you need from it.
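Those two halves can be sketched as one small function each. `fetch` is only defined here, not called, so no network access is needed; the `<title>` markers below are stand-ins, not the blog's real markup:

```python
# The two halves of the principle: fetch the source, then extract from it.
def fetch(url):
    """Download a page's raw HTML source."""
    try:
        from urllib.request import urlopen   # Python 3
    except ImportError:
        from urllib import urlopen           # Python 2
    return urlopen(url).read()

def extract(html, start_mark, end_mark):
    """Cut out the text between two markers in the source."""
    a = html.find(start_mark) + len(start_mark)
    b = html.find(end_mark, a)
    return html[a:b]

print(extract("<title>Hello</title>", "<title>", "</title>"))
```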
Step 1: locate the code fragment for the content you want to crawl. For a single article,
this is the code fragment corresponding to the article, and from it we need to slice out the URL:
http://blog.sina.com.cn/s/blog_4701280b0102wrup.html
Step 2: fetch that URL and save it to disk.

Import the built-in urllib library:

```python
import urllib

content = urllib.urlopen(url).read()
filename = "xxx"

# Save as HTML
with open(filename, "w") as f:
    f.write(content)

# Or save as TXT
with open("…../" + filename + ".txt", "w") as f:
    f.write(content)
```

For all the articles on the index page, however, we read the entire page with `urllib.urlopen(url).read()`, cut each article's URL out of that content, and save each page to disk.
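For the index-page case just described, the extraction step can be sketched as a loop of repeated `find` calls. The two-anchor `page` string below is a hypothetical stand-in for the result of `urllib.urlopen(url).read()`:

```python
# Repeatedly slice article URLs out of page source with find().
# 'page' is a made-up stand-in for the real index page's HTML.
page = ('<a href="http://example.com/a.html">A</a>'
        '<a href="http://example.com/b.html">B</a>')

urls = []
pos = page.find('href="')
while pos != -1:                      # find() returns -1 when nothing is left
    end = page.find('.html', pos) + 5
    urls.append(page[pos + 6:end])
    pos = page.find('href="', end)

print(urls)
```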
Data source: Han Han's blog
http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html
Part 1: download a single blog post and save it locally

Step 1: analyze the HTML source

Right-click and choose "Inspect element" to view the page source (Chrome shortcut: F12), then look for the article title 寫給那個茶水妹的《乘風破浪》誕生… in the source, searching within the body with Ctrl+F.

The fragment containing the article title follows the pattern `<a title="" target="_blank" href="URL">title</a>`, so we search this string for the parts we need.
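The Ctrl+F lookup described above is exactly what `str.find` does in code. The source string and title below are stand-ins, not the real page:

```python
# Locate a known article title inside saved page source, the way the
# Ctrl+F inspection step does by hand. Source and title are stand-ins.
source = '<div><a href="http://example.com/p.html">Some Title</a></div>'
pos = source.find("Some Title")
if pos != -1:
    print("title found at index", pos)   # find() gives the character offset
```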
Step 2: process the code and extract the URL

As in the overview, import the built-in urllib library:

```python
import urllib

content = urllib.urlopen(url).read()
filename = "xxx"

# Save as HTML
with open(filename, "w") as f:
    f.write(content)

# Or save as TXT
with open("…../" + filename + ".txt", "w") as f:
    f.write(content)
```
Code implementation:

```python
# -*- coding: utf-8 -*-
import urllib

# Use escape characters for the quotes inside the string
str0 = "<a title=\"\" target=\"_blank\" href=\"http://blog.sina.com.cn/s/blog_4701280b0102wrup.html\">寫給那個茶水妹的《乘風破浪》誕生…</a>"
# The pattern is: href="link" ... ">title</a>"

# Indices for the title
title_1 = str0.find(r">")
title_2 = str0.find(r"</a>")
title = str0[title_1 + 1:title_2]
print title

# Indices for the http link
href = str0.find(r"href=")
html = str0.find(r".html")
# Cut out the URL
url = str0[href + 6:html + 5]

# read() returns the raw HTML as a str
content = urllib.urlopen(url).read()

m = url.find("blog_")
filename = url[m:]
filename_0 = "F://python/PyCharmWorkpalce/Crawler/pacong_data/"
filename_1 = filename_0 + filename

# Save as HTML
with open(filename_1, "w+") as f:
    f.write(content)
# Save as a txt file
with open(filename_1 + ".txt", "w+") as f:
    f.write(content)
# Save as a txt file named after the title; the source file is UTF-8
# encoded, so the path must be decoded to unicode first
with open(unicode(filename_0 + title + ".txt", "utf-8"), "w+") as f:
    f.write(content)
```

Output: (screenshot not included in the source)
Part 2: crawl every article on the index page and save them locally

This works much like crawling a single article: read the full source of the index page, then cut each article's URL out of it. Analyzing the article anchors:

```html
<a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102wrup.html">寫給那個茶水妹的《乘風破浪》誕生…</a></span>
<a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102eo83.html">《論電影的七個元素》——關于我對電…</a></span>
```

we need to find the pattern shared by every article in the full page content. The distinctive field is `<a title="" target="_blank"`.
```python
# -*- coding: utf-8 -*-
# author : santi
# function : crawl every article on the blog's index page
# time :
import urllib
import time

# Read the full content of the index page
str0 = "http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html"
con = urllib.urlopen(str0).read()

# Print con to observe the article-title pattern:
# <a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102elmo.html">2013年09月27日</a></span>
# <a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102wruo.html">寫給那個茶水妹的《乘風破浪》誕生…</a></span>
with open(r"F:\python\PyCharmWorkpalce\Crawler\pacong_data\context.txt", "w") as f:
    f.write(con)

url_all = [""] * 60
url_name = [""] * 60
index = con.find("<a title=\"\" target=\"_blank")
href = con.find("href=\"http:", index)
html = con.find(".html\">", href)
title = con.find("</a></span>", html)

i = 0
# find() returns -1 when there is no further match, which means every
# article has been collected, so the loop exits.
while index != -1 and href != -1 and html != -1 and title != -1 and i < 50:
    url_all[i] = con[href + 6:html + 5]
    url_name[i] = con[html + 7:title]
    print "finding... " + url_all[i]
    index = con.find("<a title=\"\" target=\"_blank", title)
    href = con.find("href=\"http:", index)
    html = con.find(".html\">", href)
    title = con.find("</a></span>", html)
    i += 1
else:
    print "Find End!"

# Local storage
# Example URL: http://blog.sina.com.cn/s/blog_4701280b0102wrup.html
m_0 = url_all[0].find("blog_")
m_1 = url_all[0].find(".html") + 5
filename_0 = "F://python/PyCharmWorkpalce/Crawler/pacong_data/"

j = 0
while j < i:
    filename_1 = url_all[j][m_0:m_1]
    content = urllib.urlopen(url_all[j]).read()
    print "downloading.... " + filename_1
    with open(filename_0 + filename_1, "w+") as f:
        f.write(content)
    with open(filename_0 + filename_1 + ".txt", "w+") as f:
        f.write(content)
    with open(unicode(filename_0 + url_name[j] + ".txt", "utf-8"), "w+") as f:
        f.write(content)
    time.sleep(15)   # pause between requests to go easy on the server
    j += 1
print "Download article finished! "
```

Output:
```
finding... http://blog.sina.com.cn/s/blog_4701280b0102wrup.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102wruo.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102eohi.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102eo83.html
...
finding... http://blog.sina.com.cn/s/blog_4701280b0102dyao.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102dxmp.html
Find End!
downloading.... blog_4701280b0102wrup.html
downloading.... blog_4701280b0102wruo.html
downloading.... blog_4701280b0102eohi.html
...
downloading.... blog_4701280b0102dyao.html
downloading.... blog_4701280b0102dxmp.html
Download article finished!
```

Screenshot of the downloaded files: (image not included in the source)
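The scripts in this article are Python 2 (`urllib.urlopen`, `print` statements). As a rough sketch, on Python 3 the same anchor-parsing loop could look like the following; `find_articles` is a helper name invented here, the sample string reuses the anchor format shown earlier, and in real use `urllib.request.urlopen(url).read()` returns bytes that would need decoding first:

```python
# Python 3 sketch of the crawl loop's core. Parsing is kept in a
# separate function so it can be exercised offline on a sample string.
def find_articles(con):
    """Return (url, title) pairs for every article anchor in the source."""
    results = []
    title = 0
    while True:
        index = con.find('<a title="" target="_blank', title)
        if index == -1:               # find() returns -1: nothing left
            break
        href = con.find('href="http:', index)
        html = con.find('.html">', href)
        title = con.find('</a></span>', html)
        results.append((con[href + 6:html + 5], con[html + 7:title]))
    return results

sample = ('<a title="" target="_blank" '
          'href="http://blog.sina.com.cn/s/blog_4701280b0102wrup.html">'
          'Post title</a></span>')
print(find_articles(sample))
```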
Summary

With nothing more than `find`, list slicing, file read/write, and a `while` loop, we downloaded a single blog post and then every post on the blog's index page.
The above is the complete content of "A Beginner's Python Crawler (Part 1)" as collected and organized by 生活随笔; we hope it helps you solve the problems you have run into.