
Python Beginner Crawler (Part 1)


Preface

A beginner-level Python crawler only requires mastering the following four techniques (a short sketch follows the list below):

  • the find string method
  • list slicing, e.g. list[-x:-y]
  • file read/write operations
  • the while loop
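
As a warm-up, here is a minimal sketch (Python 3; the anchor snippet and file name are made up for illustration, not taken from the article) that exercises all four primitives:

# minimal warm-up sketch (Python 3, illustrative data only)
snippet = '<a href="http://example.com/post_1.html">Post 1</a>'

start = snippet.find('href="') + len('href="')     # find: locate a substring
end = snippet.find('.html', start) + len('.html')
urls = [snippet[start:end]]                        # slicing: cut the URL out of the string

i = 0
while i < len(urls):                               # while: loop over the collected URLs
    with open("urls.txt", "a") as f:               # file I/O: append each URL to a text file
        f.write(urls[i] + "\n")
    i += 1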

Principle:
Everything you see on a web page corresponds to a piece of its source code, so a crawler boils down to two parts: fetching the page source and extracting the pieces you need from it.
Step 1: cut out the source code of the item to be crawled. For a single article, the relevant snippet is:

<a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102wrup.html">寫給那個茶水妹的《乘風破浪》誕生…</a>

This is the code corresponding to the article; from it we need to cut out the URL we want:
http://blog.sina.com.cn/s/blog_4701280b0102wrup.html
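
To make the extraction concrete, here is a minimal sketch (assuming Python 3; the variable names are mine) that pulls the URL and the title out of that anchor tag with find and slicing:

# sketch: extract the URL and title from the anchor tag with find + slicing (Python 3)
str0 = '<a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102wrup.html">寫給那個茶水妹的《乘風破浪》誕生…</a>'

href_start = str0.find('href="') + len('href="')     # position just after href="
href_end = str0.find('.html', href_start) + len('.html')
url = str0[href_start:href_end]                      # slice out the URL

title_start = str0.find('">', href_end) + len('">')  # just after the "> that closes the opening tag
title_end = str0.find('</a>')
title = str0[title_start:title_end]                  # slice out the link text

print(url)    # http://blog.sina.com.cn/s/blog_4701280b0102wrup.html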
Step 2: fetch that URL and save the result to disk.
Import the built-in urllib library (the article uses Python 2):
content = urllib.urlopen(url).read()
filename = "xxx"
Write it out as HTML:
with open(filename, "w") as f:
    f.write(content)
Or write it out as a TXT file:
with open("…../" + filename + ".txt", "w") as f:
    f.write(content)
For all the articles on the homepage, the idea is the same: read the entire homepage with urllib.urlopen(url).read(), then cut each article's URL out of what was read and save the articles to disk.
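
The snippet above is Python 2. On Python 3, urllib.urlopen no longer exists; a roughly equivalent sketch (my adaptation, not part of the original article) uses urllib.request and writes bytes:

# Python 3 adaptation (sketch): urllib.urlopen was removed, use urllib.request instead
import urllib.request

url = "http://blog.sina.com.cn/s/blog_4701280b0102wrup.html"
content = urllib.request.urlopen(url).read()   # returns bytes, not str

filename = "article.html"                      # example file name
with open(filename, "wb") as f:                # open in binary mode to write the bytes
    f.write(content)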

Data source: Han Han's blog
http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html

Part 1: Download a single blog post and save it locally

Step 1: analyze the HTML source
Right-click and choose Inspect Element to see the page source (Chrome shortcut: F12), then search the source for the article title 寫給那個茶水妹的《乘風破浪》誕生… by pressing Ctrl+F within the body:

The pattern surrounding the article title turns out to be:

<a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102wrup.html">寫給那個茶水妹的《乘風破浪》誕生…</a>

We then search this string for the parts we need.

Step 2: process the code and extract the URL
Import the built-in urllib library:
content = urllib.urlopen(url).read()
filename = "xxx"
Write it out as HTML:
with open(filename, "w") as f:
    f.write(content)
Or write it out as a TXT file:
with open("…../" + filename + ".txt", "w") as f:
    f.write(content)

Code implementation

# -*- coding: utf-8 -*-
import urllib

# escape characters are used for the quotes inside the string
str0 = "<a title=\"\" target=\"_blank\" href=\"http://blog.sina.com.cn/s/blog_4701280b0102wrup.html\">寫給那個茶水妹的《乘風破浪》誕生…</a>"
# the pattern is href="link" ">title</a>"

# indices for cutting out the title
title_1 = str0.find(r">")
title_2 = str0.find(r"</a>")
title = str0[title_1+1:title_2]
print title

# indices for cutting out the http link
href = str0.find(r"href=")
html = str0.find(r".html")
# cut out the url
url = str0[href+6:html+5]

# what we read back is the HTML source; its type is str
content = urllib.urlopen(url).read()

m = url.find("blog_")
filename = url[m:]
filename_0 = "F://python/PyCharmWorkpalce/Crawler/pacong_data/"
filename_1 = filename_0 + filename

# save as html
with open(filename_1, "w+") as f:
    f.write(content)
# save as a txt file
with open(filename_1 + ".txt", "w+") as f:
    f.write(content)
# save a txt file named after the article title
# the source file is utf-8 encoded, so the path is decoded to unicode first
with open(unicode(filename_0 + title + ".txt", "utf-8"), "w+") as f:
    f.write(content)
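
Since the code above targets Python 2 (urllib.urlopen, the print statement, unicode()), here is a rough Python 3 equivalent I sketched with the same find/slice logic; the output directory is only an example:

# rough Python 3 equivalent (my sketch, not the original article's code)
import os
import urllib.request

str0 = '<a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102wrup.html">寫給那個茶水妹的《乘風破浪》誕生…</a>'

# cut out the title and the url with find + slicing, as above
title = str0[str0.find(">") + 1 : str0.find("</a>")]
url = str0[str0.find('href="') + 6 : str0.find(".html") + 5]

content = urllib.request.urlopen(url).read()        # bytes in Python 3

out_dir = "./pacong_data/"                           # example directory
os.makedirs(out_dir, exist_ok=True)
filename = url[url.find("blog_"):]

with open(os.path.join(out_dir, filename), "wb") as f:        # raw HTML
    f.write(content)
with open(os.path.join(out_dir, title + ".txt"), "wb") as f:  # txt named after the article title
    f.write(content)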

Output:

Part 2: Crawl all articles on the homepage and save them locally

Crawling every article on the homepage works much like crawling a single one: read the homepage's full source, then cut each article's URL out of it. Looking at the article links:

<a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102wrup.html">寫給那個茶水妹的《乘風破浪》誕生…</a></span>
<a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102eo83.html">《論電影的七個元素》——關于我對電…</a></span>

So we need to find, in the full content we read, the pattern that marks the start of each article link. The distinctive field is <a title="" target="_blank", which the code below searches for repeatedly.

# -*- coding: utf-8 -*-
# author : santi
# function :
# time :

import urllib
import time

# read the full content of the homepage directly
str0 = "http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html"
con = urllib.urlopen(str0).read()

# print con to observe the pattern around the article titles
# <a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102elmo.html">2013年09月27日</a></span>
# <a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102wruo.html">寫給那個茶水妹的《乘風破浪》誕生…</a></span>

with open(r"F:\python\PyCharmWorkpalce\Crawler\pacong_data\context.txt", "w") as f:
    f.write(con)

url_all = [""] * 60
url_name = [""] * 60
index = con.find("<a title=\"\" target=\"_blank")
href = con.find("href=\"http:", index)
html = con.find(".html\">", href)
title = con.find("</a></span>", html)

i = 0
# find returns -1 when nothing more is found, which means everything has been crawled and the while loop ends
while index != -1 and href != -1 and html != -1 and title != -1 and i < 50:
    url_all[i] = con[href+6:html+5]
    url_name[i] = con[html+7:title]
    print "finding... " + url_all[i]
    index = con.find("<a title=\"\" target=\"_blank", title)
    href = con.find("href=\"http:", index)
    html = con.find(".html\">", href)
    title = con.find("</a></span>", html)
    i += 1
else:
    print "Find End!"

# local storage
# http://blog.sina.com.cn/s/blog_4701280b0102wrup.html

m_0 = url_all[0].find("blog_")
m_1 = url_all[0].find(".html") + 5
filename_0 = "F://python/PyCharmWorkpalce/Crawler/pacong_data/"

j = 0
while j < i:
    filename_1 = url_all[j][m_0:m_1]
    content = urllib.urlopen(url_all[j]).read()
    print "downloading.... " + filename_1
    with open(filename_0 + filename_1, "w+") as f:
        f.write(content)
    with open(filename_0 + filename_1 + ".txt", "w+") as f:
        f.write(content)
    with open(unicode(filename_0 + url_name[j] + ".txt", "utf-8"), "w+") as f:
        f.write(content)
    time.sleep(15)
    j += 1

print "Download article finished! "
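
For reference, here is a condensed Python 3 sketch of the same homepage crawl (my own adaptation: it keeps the find-based parsing but drops the Python 2 specifics such as urllib.urlopen, print statements and unicode(); the output directory and the utf-8 decode are assumptions):

# condensed Python 3 sketch of the homepage crawl (adaptation, not the original code)
import os
import time
import urllib.request

index_url = "http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html"
con = urllib.request.urlopen(index_url).read().decode("utf-8", errors="ignore")

urls, names = [], []
pos = con.find('<a title="" target="_blank')
while pos != -1:
    href = con.find('href="http:', pos)
    html = con.find('.html">', href)
    end = con.find('</a></span>', html)
    if -1 in (href, html, end):
        break
    urls.append(con[href + 6 : html + 5])       # the article URL
    names.append(con[html + 7 : end])           # the article title
    pos = con.find('<a title="" target="_blank', end)

out_dir = "./pacong_data/"                      # example directory
os.makedirs(out_dir, exist_ok=True)

for url, name in zip(urls, names):
    content = urllib.request.urlopen(url).read()
    filename = url[url.find("blog_"):]
    print("downloading....", filename, name)
    with open(os.path.join(out_dir, filename), "wb") as f:
        f.write(content)
    time.sleep(15)                              # pause between requests to be polite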

Output:

finding... http://blog.sina.com.cn/s/blog_4701280b0102wrup.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102wruo.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102eohi.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102eo83.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102elmo.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102eksm.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102ek51.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102egl0.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102ef4t.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102edcd.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102ecxd.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102eck1.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102ec39.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102eb8d.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102eb6w.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102eau0.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e85j.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e7wj.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e7vx.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e7pk.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e7er.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e63p.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e5np.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e4qq.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e4gf.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e4c3.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e490.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e42a.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e3v6.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e3nr.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e150.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e11n.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e0th.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e0p3.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e0l4.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e0ib.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e0hj.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e0fm.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e0eu.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e0ak.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e07s.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e074.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e06b.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e061.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102e02q.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102dz9f.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102dz84.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102dz5s.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102dyao.html
finding... http://blog.sina.com.cn/s/blog_4701280b0102dxmp.html
Find End!
downloading.... blog_4701280b0102wrup.html
downloading.... blog_4701280b0102wruo.html
downloading.... blog_4701280b0102eohi.html
downloading.... blog_4701280b0102eo83.html
downloading.... blog_4701280b0102elmo.html
downloading.... blog_4701280b0102eksm.html
downloading.... blog_4701280b0102ek51.html
downloading.... blog_4701280b0102egl0.html
downloading.... blog_4701280b0102ef4t.html
downloading.... blog_4701280b0102edcd.html
downloading.... blog_4701280b0102ecxd.html
downloading.... blog_4701280b0102eck1.html
downloading.... blog_4701280b0102ec39.html
downloading.... blog_4701280b0102eb8d.html
downloading.... blog_4701280b0102eb6w.html
downloading.... blog_4701280b0102eau0.html
downloading.... blog_4701280b0102e85j.html
downloading.... blog_4701280b0102e7wj.html
downloading.... blog_4701280b0102e7vx.html
downloading.... blog_4701280b0102e7pk.html
downloading.... blog_4701280b0102e7er.html
downloading.... blog_4701280b0102e63p.html
downloading.... blog_4701280b0102e5np.html
downloading.... blog_4701280b0102e4qq.html
downloading.... blog_4701280b0102e4gf.html
downloading.... blog_4701280b0102e4c3.html
downloading.... blog_4701280b0102e490.html
downloading.... blog_4701280b0102e42a.html
downloading.... blog_4701280b0102e3v6.html
downloading.... blog_4701280b0102e3nr.html
downloading.... blog_4701280b0102e150.html
downloading.... blog_4701280b0102e11n.html
downloading.... blog_4701280b0102e0th.html
downloading.... blog_4701280b0102e0p3.html
downloading.... blog_4701280b0102e0l4.html
downloading.... blog_4701280b0102e0ib.html
downloading.... blog_4701280b0102e0hj.html
downloading.... blog_4701280b0102e0fm.html
downloading.... blog_4701280b0102e0eu.html
downloading.... blog_4701280b0102e0ak.html
downloading.... blog_4701280b0102e07s.html
downloading.... blog_4701280b0102e074.html
downloading.... blog_4701280b0102e06b.html
downloading.... blog_4701280b0102e061.html
downloading.... blog_4701280b0102e02q.html
downloading.... blog_4701280b0102dz9f.html
downloading.... blog_4701280b0102dz84.html
downloading.... blog_4701280b0102dz5s.html
downloading.... blog_4701280b0102dyao.html
downloading.... blog_4701280b0102dxmp.html
Download article finished!

Screenshot of the download:
