Python-crawler-citeulike

Published: 2024/4/13

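I had already installed BeautifulSoup; this time lxml is needed as well. I installed it with easy_install: go to the python/scripts directory and run easy_install lxml, which installs it automatically.

-----------divider--------------

Calling urlopen(url) directly was refused with 403 Forbidden.

To imitate a real browsing session, add cookie support and browser-like headers (adapted from http://www.yihaomen.com/article/python/210.htm)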

    # Python 2 code: urllib2 and cookielib do not exist under these names in Python 3.
    import re
    import random
    import socket
    import urllib2
    import cookielib
    from bs4 import BeautifulSoup
    import lxml

    ERROR = {
        '0': 'Can not open the url, check your network',
        '1': 'Create download dir error',
        '2': 'The image links are empty',
        '3': 'Download failed',
        '4': 'Build soup error, the html is empty',
        '5': 'Can not save the image to your disk',
    }

    class BrowserBase(object):

        def __init__(self):
            # Default timeout (seconds) for sockets created from here on
            socket.setdefaulttimeout(20)

        def speak(self, name, content):
            print '[%s]%s' % (name, content)

        def openurl(self, url):
            """Open a web page."""
            # Keep cookies between requests, as a browser would
            cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
            self.opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
            urllib2.install_opener(self.opener)
            user_agents = [
                'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
                'Opera/9.25 (Windows NT 5.1; U; en)',
                'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
                'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
                'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
                'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
                "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 Chrome/16.0.912.77 Safari/535.7",
                "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0 ",
            ]
            # Send a randomly chosen User-Agent with this request
            agent = random.choice(user_agents)
            self.opener.addheaders = [("User-agent", agent), ("Accept", "*/*"), ('Referer', 'http://www.google.com')]
            try:
                res = self.opener.open(url)
                # print res.read()
            except Exception, e:
                # speak() takes a name and a message; the original call passed only one argument
                self.speak('openurl', str(e) + ' ' + url)
                # Re-raise the original exception so its details are not lost
                raise
            else:
                return res
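For readers on Python 3, the same idea (a cookie-aware opener that sends browser-like headers) might be sketched with the standard library alone. This is only a rough equivalent, not the original author's code: the helper names build_opener_with_cookies and open_url are made up for illustration, and the User-Agent list is shortened to two of the entries above.

```python
import random
import urllib.request
from http.cookiejar import CookieJar

# Two User-Agent strings taken from the longer list above (shortened for brevity).
USER_AGENTS = [
    "Opera/9.25 (Windows NT 5.1; U; en)",
    "Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9",
]


def build_opener_with_cookies(user_agents=USER_AGENTS):
    """Hypothetical helper: return (opener, jar) with browser-like headers set."""
    jar = CookieJar()
    # HTTPCookieProcessor stores cookies from responses and resends them later.
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [
        ("User-agent", random.choice(user_agents)),
        ("Accept", "*/*"),
        ("Referer", "http://www.google.com"),
    ]
    return opener, jar


def open_url(opener, url, timeout=20):
    """Hypothetical helper: open url; network errors propagate to the caller."""
    return opener.open(url, timeout=timeout)


opener, jar = build_opener_with_cookies()
print(dict(opener.addheaders)["User-agent"])
```

Nothing is fetched until open_url(opener, some_url) is called, so the opener can be built once and reused for every request in a crawl.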

----------------divider-------------------

Parse the HTML with BeautifulSoup (tutorial: http://beautifulsoup.readthedocs.org/zh_CN/latest/#)

soup = BeautifulSoup(res, "lxml") creates a BeautifulSoup object: a tree whose nodes are the tags in the HTML.

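    # The original shows only the function body; the wrapper name below is hypothetical.
    def get_showtexform_value(res):
        soup = BeautifulSoup(res, "lxml")
        # Find the tag with id="showtexform", i.e. body.form(id="showtexform")
        tag = soup.find(id="showtexform")
        # Take the child at index 1 of the child at index 1 and read its value attribute
        return tag.contents[1].contents[1]['value']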

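BeautifulSoup's search methods, find() and find_all(), accept several kinds of filter:

   1. String: matches tags whose name equals the string exactly; soup.find_all('b') finds the b tags

   2. Regular expression: tag names are matched against the pattern (via match(), per the tutorial); soup.find_all(re.compile('^b')) finds tags whose names start with b

   3. List

   ......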
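The three filter types listed above can be tried on a small document. The sketch below is illustrative only: the HTML snippet is made up, and it uses Python's built-in "html.parser" rather than "lxml" so that it runs without lxml installed (an assumption for convenience, not what the post uses).

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML snippet, for illustration only.
html = (
    "<html><body><p>One <b>bold</b> word</p>"
    "<blockquote>Quote</blockquote><a href='#'>link</a></body></html>"
)

# "html.parser" ships with Python; pass "lxml" instead if it is installed.
soup = BeautifulSoup(html, "html.parser")

# 1. String filter: tags named exactly "b".
exact = [t.name for t in soup.find_all("b")]

# 2. Regular-expression filter: tags whose name starts with "b".
starts_with_b = [t.name for t in soup.find_all(re.compile("^b"))]

# 3. List filter: tags whose name equals any item in the list.
either = [t.name for t in soup.find_all(["a", "b"])]

print(exact)
print(starts_with_b)
print(either)
```

find() takes the same filters but returns only the first match instead of a list.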

A tag's attributes are accessed the same way as dictionary entries: tag['value']

A tag's .contents attribute returns the tag's child nodes as a list.
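A small sketch of both points, dictionary-style attribute access and .contents, on a made-up form. The markup below is an assumption for illustration (the real page's HTML is not shown in this post), and "html.parser" is used so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Made-up form markup; only the id "showtexform" comes from the post.
html = """<form id="showtexform">
<div>
<input name="tex" value="hello"/>
</div>
</form>"""

soup = BeautifulSoup(html, "html.parser")
form = soup.find(id="showtexform")

# .contents is a plain list of child nodes; the line breaks between
# tags are kept as text nodes, so the first real tag sits at index 1.
children = form.contents
div = children[1]
field = div.contents[1]

# Attributes behave like dictionary entries.
print(len(children))
print(field["value"])
print(field.get("missing"))  # .get() returns None instead of raising KeyError
```

Whitespace text nodes like these are a likely reason the code earlier indexes contents[1] rather than contents[0], though that depends on the actual page markup.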

Reposted from: https://www.cnblogs.com/yuchenkit/p/5369763.html
