Scraping the Sina Weibo Hot Search List with Python

Published: 2024/1/8 · python · 豆豆

How can we scrape the 50 entries on the Weibo hot search list? Here is a simple approach for anyone who is interested.

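Libraries used:

requests

json

lxml.etree

bs4.BeautifulSoup

They are imported as follows:

    import json

    import requests
    from bs4 import BeautifulSoup
    from lxml import etree

Install any that are missing (json ships with Python; the other three are third-party packages). The installation steps depend on your Python version and are not covered here.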

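Fetching the page source:

    def get_html(url):
        # Browser-like User-Agent; replace it with the one your browser sends
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
        }
        # Query parameter that selects the real-time hot list
        data = {
            'cate': 'realtimehot'
        }
        html = ""  # stays empty if the request fails or is not a 200
        try:
            r = requests.get(url, params=data, headers=headers)
            print(r.url)
            if r.status_code == 200:
                html = r.text
        except requests.RequestException:
            pass
        return html

Called with 'http://s.weibo.com/top/summary?', this requests the hot search page. The User-Agent can be copied from your own browser's developer console. The function returns the page source as html, or an empty string if the request fails.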
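To see which URL the request targets without touching the network, the query string can be assembled by hand. This is only a sketch using the standard library's urllib.parse; requests builds its own URL from params, and the variable names base_url and full_url are made up for illustration.

```python
from urllib.parse import urlencode

# Same base URL and parameter as in the request above
base_url = 'http://s.weibo.com/top/summary?'
params = {'cate': 'realtimehot'}

# urlencode turns the dict into "key=value" pairs joined by "&"
full_url = base_url + urlencode(params)
print(full_url)
```

Comparing this with the r.url printed by the request is a quick way to confirm that the parameter was sent as intended.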

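Parsing with lxml: this mainly uses the etree module from lxml, whose xpath method can reach nodes such as script. Looking at the source of the hot search page, we find that the detailed list is not written into the static markup but is embedded in a script.

[Figure: page source with the list embedded in a script]

In other words, we need to get at the code inside that script and extract the useful information from it.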
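One way to find the right script is to search each script's text for a marker string rather than rely on its position. The sketch below is an alternative illustration, not the method used in this article: it uses only the standard library's html.parser, and ScriptCollector, find_script, and the sample page are all made up.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collects the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")  # empty scripts are kept as ""

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.scripts[-1] += data


def find_script(html, marker):
    """Return (index, text) of the first script containing marker, else None."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    for index, text in enumerate(collector.scripts):
        if marker in text:
            return index, text
    return None


# Invented sample page; the real Weibo page is much larger and may differ.
sample_html = (
    "<html><head><script></script>"
    "<script>var a = 1;</script></head>"
    '<body><script>view({"pid": "pl_top_realtimehot", "html": "rows"})</script>'
    "</body></html>"
)

found = find_script(sample_html, "realtimehot")
print(found[0])
```

Searching by marker keeps working if the page adds or removes unrelated scripts, whereas a fixed index has to be recounted.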

Two parsers are used here: BeautifulSoup and lxml. The code is as follows:

    def parseMethod(id, html):
        tr = []  # returned as-is when id matches neither parser
        if id == 'bs':
            soup = BeautifulSoup(html, 'lxml')
            # script that carries the hot list; the index depends on the page
            sc = soup.find_all('script')[14].string
            start = sc.find("(")
            substr = sc[start+1:-1]  # text between the outer parentheses
            text = json.loads(substr)  # str -> dict
            rxml = text["html"]  # the dict has the keys pid, js, css, html
            soupnew = BeautifulSoup(rxml, 'lxml')
            tr = soupnew.find_all('tr', attrs={'action-type': 'hover'})
        elif id == 'lxml':
            selector = etree.HTML(html)
            tt = selector.xpath('//script/text()')
            htm = tt[8]  # same script; text() skips empty script tags
            start = htm.find("(")
            substr = htm[start+1:-1]  # text between the outer parentheses
            text = json.loads(substr)  # str -> dict
            rxml = text["html"]  # the dict has the keys pid, js, css, html
            et = etree.HTML(rxml)
            tr = et.xpath('//tr[@action-type="hover"]')
        return tr

The parser is chosen by the id argument. Neither branch is hard to follow. In brief:

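1. Get the script that contains realtimehot. Count through the page source to find its position: it was the 16th script in bs and the 10th in lxml, because the XPath selects text() and so filters out empty tags. Note that the code indexes from zero and uses [14] and [8], so recount against the page you actually fetch.

2. Search the script string for "(" and take the substring inside the parentheses.

3. Use json.loads() to parse that string into a dict with four keys: pid, js, css, html.

4. Take the value stored under the html key and parse it again with bs or lxml.

5. Extract the tr tags and store them in a list.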
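The extraction steps above can be tried offline on a made-up payload. The sketch below assumes a wrapper of the form name({...}) around the JSON, as the steps describe; the function name view, the sample rows, and extract_payload are invented for illustration. To stay dependency-free it parses the inner markup with the standard library's xml.etree.ElementTree instead of bs or lxml, which only works because the sample is well-formed.

```python
import json
import xml.etree.ElementTree as ET

# Invented inner markup shaped like the rows the article extracts
inner_html = (
    "<table>"
    '<tr action-type="hover"><td class="td_01"><em>1</em></td>'
    '<td><p class="star_name"><a>topic A</a></p>'
    '<p class="star_num"><span>1200</span></p></td></tr>'
    '<tr action-type="hover"><td class="td_01"><em>2</em></td>'
    '<td><p class="star_name"><a>topic B</a></p>'
    '<p class="star_num"><span>900</span></p></td></tr>'
    "<tr><td>a row without the attribute</td></tr>"
    "</table>"
)

# Invented script text: a function call wrapping a JSON object
payload = {"pid": "pl_demo", "js": [], "css": [], "html": inner_html}
script_text = "view(" + json.dumps(payload) + ")"


def extract_payload(script_text):
    """Take the text between the first '(' and the final ')' and parse it."""
    start = script_text.find("(")
    return json.loads(script_text[start + 1:-1])


data = extract_payload(script_text)
print(sorted(data))  # the four keys of the payload

# Parse the embedded markup and keep only rows marked action-type="hover"
root = ET.fromstring(data["html"])
rows = root.findall(".//tr[@action-type='hover']")
print(len(rows))
```

The slice [start + 1:-1] assumes the script text ends exactly with the closing parenthesis; trailing whitespace or a semicolon would need to be stripped first.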

Writing to a txt file:

    def lxmldata(tr):
        for t in tr:
            # int() replaces eval(): it is safe on scraped text
            index = int(t.find(".//td[@class='td_01']").find(".//em").text)
            title = t.find(".//p[@class='star_name']").find(".//a").text
            num = int(t.find(".//p[@class='star_num']").find(".//span").text)
            yield {
                'index': index,
                'title': title,
                'num': num
            }

    def bsdata(tr):
        for t in tr:
            index = int(t.find('em').string)
            title = t.find(class_='star_name').find('a').string
            num = int(t.find(class_='star_num').string)
            yield {
                'index': index,
                'title': title,
                'num': num
            }

    def output(id, tr):
        with open("weibohotnews.txt", "w", encoding='utf-8') as f:
            if id == 'bs':
                for i in bsdata(tr):
                    f.write(str(i) + '\n')  # one dict per line
            elif id == 'lxml':
                for i in lxmldata(tr):
                    f.write(str(i) + '\n')  # one dict per line

Again the id argument selects which extractor runs. Both are straightforward: each one reads the rank, title, and count from the matching position in a row.
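Because each line of the output file is the str() of a dict, the file can be read back with ast.literal_eval. The sketch below writes made-up records in the same one-dict-per-line format to a temporary directory and loads them again; write_records, read_records, and the sample data are hypothetical names and values, not part of the article's code.

```python
import ast
import os
import tempfile

# Made-up records shaped like the dicts yielded by the extractors
records = [
    {'index': 1, 'title': 'topic A', 'num': 1200},
    {'index': 2, 'title': 'topic B', 'num': 900},
]


def write_records(path, records):
    """Write one dict per line, each terminated by a newline."""
    with open(path, 'w', encoding='utf-8') as f:
        for record in records:
            f.write(str(record) + '\n')


def read_records(path):
    """Parse each non-empty line back into a dict without using eval()."""
    with open(path, encoding='utf-8') as f:
        return [ast.literal_eval(line) for line in f if line.strip()]


path = os.path.join(tempfile.mkdtemp(), 'weibohotnews.txt')
write_records(path, records)
loaded = read_records(path)
print(loaded == records)
```

If the file is meant for other tools, writing each record with json.dumps instead of str would be a reasonable alternative, since JSON lines are easier to consume outside Python.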

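Finally, a main function ties the steps together. Here get_html is the request code from the "Fetching the page source" step, wrapped as a function that returns the page text:

    def main():
        url = 'http://s.weibo.com/top/summary?'
        method = 'lxml'  # or 'bs'
        html = get_html(url)
        tr = parseMethod(method, html)
        output(method, tr)

    if __name__ == '__main__':
        main()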

Result: the records are written to weibohotnews.txt, one dict per line.
