Scraping QQ Zone blog posts with Python

Published: 2023/12/16


Author: 華亮

Please credit the source when reposting: http://blog.csdn.net/cedricporter

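A typhoon is blowing outside. I wrote the Renren photo-album crawler this afternoon, and in the evening, with nothing better to do, I wrote a crawler for QQ Zone blogs as well. By default it only scrapes the space of the QQ number you supply: you can fill in the number in main.py, or add a loop to cover many QQ numbers. Images inside the blog posts are ignored for now, although downloading them would be simple. I will polish it when I have time.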

# -*-coding:utf-8-*-
# Filename: main.py
# Author: 華亮
#
from QQ import QQ

if __name__ == '__main__':
    # First argument is the QQ number, second is the output file name
    QQ.DownloadBlog('414112390', 'blog.txt')
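The loop over several QQ numbers that the post mentions could look roughly like the sketch below. It is written for Python 3 and deliberately kept independent of QQ.py: `download_many` and the `blog_<number>.txt` naming scheme are hypothetical, and the download function is passed in as a parameter so that something behaving like `QQ.DownloadBlog(qq, filename)` can be supplied.

```python
def download_many(qq_numbers, download, name_pattern='blog_%s.txt'):
    """Call download(qq, filename) once per QQ number.

    download_many and name_pattern are hypothetical helpers, not part of
    the original script; download is expected to behave like
    QQ.DownloadBlog(qq, filename).
    """
    saved = []
    for qq in qq_numbers:
        filename = name_pattern % qq
        try:
            download(qq, filename)
        except Exception as exc:
            # One failing space should not stop the remaining downloads.
            print('Skipping %s: %s' % (qq, exc))
            continue
        saved.append(filename)
    return saved


if __name__ == '__main__':
    # Stand-in downloader so the sketch runs without network access.
    def fake_download(qq, filename):
        print('would download', qq, 'to', filename)

    # The second number is a made-up placeholder.
    download_many(['414112390', '123456789'], fake_download)
```

Writing each space to its own file keeps one account's posts from overwriting another's, which would happen if every call reused 'blog.txt'.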

# -*-coding:utf-8-*-
# Filename: QQ.py
# Author: 華亮
# Note: Python 2 code (urllib2, HTMLParser module, print statement)
#
import urllib
import urllib2
import re
from HTMLParser import HTMLParser


# Collects the blog list of a QQ Zone
class QQBlogList(HTMLParser):
    in_key_div = False
    in_ul = False
    in_li = False
    in_a = False
    # Class-level list: shared by all instances, so the entries
    # accumulate across the pages parsed in __Download
    blogList = []
    lasturl = ''

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div' and 'class' in attrs and attrs['class'] == 'bloglist':
            self.in_key_div = True
        elif self.in_key_div:
            if tag == 'ul':
                self.in_ul = True
            elif self.in_ul and tag == 'li':
                self.in_li = True
            elif self.in_li and tag == 'a' and 'href' in attrs:
                self.in_a = True
                self.lasturl = attrs['href']

    def handle_data(self, data):
        # Text inside a link is the post title
        if self.in_a:
            self.blogList.append((data, self.lasturl))

    def handle_endtag(self, tag):
        if self.in_key_div and tag == 'div':
            self.in_key_div = False
        elif self.in_ul and tag == 'ul':
            self.in_ul = False
        elif self.in_li and tag == 'li':
            self.in_li = False
        elif self.in_a and tag == 'a':
            self.in_a = False


class QQ:
    '''
    QQ
    Author: 華亮
    Description: automatically downloads QQ Zone blog posts
    '''

    @staticmethod
    def DownloadBlog(qq, filename = None):
        print 'Start'
        blogurl = 'http://qz.qq.com/%s/bloglist?page=0' % qq
        QQ.__Download(blogurl, filename)
        print 'End'

    @staticmethod
    def __Download(starturl, filename):
        url = starturl
        cookieFile = urllib2.HTTPCookieProcessor()
        opener = urllib2.build_opener(cookieFile)

        # Collect the article URLs from every page
        while True:
            req = urllib2.Request(url)
            result = opener.open(req)
            text = result.read()
            qq = QQBlogList()
            qq.feed(text)
            qq.close()
            nextpagePattern = re.compile(r'<a href="(.*?)" title="下一頁" class="bt_next"><span>下一頁</span></a>')
            nextpage = nextpagePattern.search(text)
            if nextpage:
                url = nextpage.group(1)
            else:
                break

        if not filename:
            filename = "blog.txt"
        file = open(filename, 'w')

        # Download the articles
        blogContentPattern = re.compile(r'<div class="entry_content">(.*?)</div>', re.S)
        for title, url in qq.blogList:
            print 'Downloading', title
            req = urllib2.Request(url)
            result = opener.open(req)
            file.write('\n' + title + '\n')
            ret = blogContentPattern.search( result.read() )
            if ret:
                file.write(ret.group(1).replace('<p>', '\n'))
        file.close()
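The script leaves the images in each post untouched. As a rough sketch of how they could be picked up, the following Python 3 snippet collects the `src` of every `<img>` tag in a piece of article HTML, for example the text captured by `blogContentPattern`. `ImageCollector`, `find_image_urls` and the sample HTML are illustrative assumptions; the actual markup of QQ Zone pages has not been verified here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collects the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self, base_url=''):
        super().__init__()
        self.base_url = base_url
        self.urls = []  # instance-level, so each parser starts empty

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                # Resolve relative paths against the page URL.
                self.urls.append(urljoin(self.base_url, src))


def find_image_urls(html, base_url=''):
    """Return the image URLs found in html, in document order."""
    parser = ImageCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls


if __name__ == '__main__':
    # Made-up sample markup; example.com is a placeholder host.
    sample = ('<p>hi<img src="/a.jpg"><img alt="no src">'
              '<img src="http://example.com/b.png"></p>')
    for url in find_image_urls(sample, 'http://example.com/blog/1'):
        print(url)
```

Each returned URL could then be fetched and saved next to the text file, for instance with `urllib.request.urlretrieve`. An HTML parser is used here instead of a regular expression so that attribute order and quoting style do not matter.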
