Hands-on project: scraping yearly movie data from the China Box Office site (中国票房网, cbooo.cn) and saving it to CSV

Published: 2023/12/16


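One sticking point was the data type pandas works with: the scraped lists have to be turned into a DataFrame with the DataFrame() function before they can be written to a CSV file.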
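The DataFrame step can be tried in isolation before touching the network. The sketch below is a minimal illustration, not the script itself: the two lists hold made-up sample values standing in for the scraped columns, and the CSV is written to an in-memory buffer instead of a file on disk.

```python
import io

import pandas as pd

# Made-up sample columns standing in for the scraped lists (assumption).
names = ["Movie A", "Movie B"]
box_offices = [1200, 800]

# Parallel lists of equal length become the columns of a DataFrame.
df = pd.DataFrame({"name": names, "box_office": box_offices})

# to_csv accepts a path or any file-like object.
# index=False leaves out the row-number column pandas would otherwise add.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file path instead of the buffer writes the same text to disk.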

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    url = "http://www.cbooo.cn/year?year=2018"
    datas = requests.get(url).text

    # Parse the HTML
    soup = BeautifulSoup(datas, 'lxml')
    # find_all returns a list-like ResultSet, so take [0]:
    # the <table> whose id attribute is tbContent
    moives_tables = soup.find_all('table', {'id': 'tbContent'})[0]
    # Get every <tr> row inside the table
    moives = moives_tables.find_all('tr')

    # Movie title: it sits in the first <td> of each <tr>; there are many rows, so loop over them
    names = [tr.find_all('td')[0].a.get('title') for tr in moives[1:]]
    # Detail-page URL of each movie; used below to fetch the director,
    # because the director is not shown on the main page
    hrefs = [tr.find_all('td')[0].a.get('href') for tr in moives[1:]]
    # Genre
    types = [tr.find_all('td')[1].string for tr in moives[1:]]
    # Box office
    box_offices = [int(tr.find_all('td')[2].string) for tr in moives[1:]]
    # Average ticket price
    Average_fare = [tr.find_all('td')[3].string for tr in moives[1:]]
    # Release date
    show_time = [tr.find_all('td')[6].string for tr in moives[1:]]


    # Fetch the director from a movie's detail page
    def getInfo(url):
        # Request the detail page of a movie on the chart
        datas = requests.get(url).text
        soup = BeautifulSoup(datas, 'lxml')
        # The text contains line breaks, so strip them with replace("\n", "")
        daoyan = soup.select('dl.dltext dd')[0].get_text().replace("\n", "")
        return daoyan


    directors = [getInfo(url) for url in hrefs]

    try:
        # Create (or overwrite) the CSV file at this path; a bare file name
        # would be saved in the current working directory instead
        with open("D://box_office_01.csv", 'w', newline="") as f:
            # Collect the lists into a pandas DataFrame, column by column,
            # so that they can be written out as CSV
            result = pd.DataFrame()
            result['name'] = names
            result['href'] = hrefs
            result['type'] = types
            result['box_office'] = box_offices
            result['Average_fare'] = Average_fare
            result['show_time'] = show_time
            result['directors'] = directors
            # Write to the file opened above; the with block closes it
            result.to_csv(f)
        print('finish')
    except Exception as e:
        print("error" + str(e))
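The row-extraction logic can be checked offline against a small HTML string. The snippet below is a sketch under stated assumptions: the HTML is hypothetical, shaped only like what the script expects (a table with id tbContent whose first cell holds a link with title and href attributes), and is not copied from the real site. It uses Python's built-in "html.parser" so that lxml is not required, and it walks each row once instead of once per column.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (assumption): mirrors the layout the script relies on.
html = """
<table id="tbContent">
  <tr><th>Title</th><th>Type</th><th>Box office</th></tr>
  <tr><td><a href="/m/1" title="Movie A">Movie A</a></td><td>Drama</td><td>1200</td></tr>
  <tr><td><a href="/m/2" title="Movie B">Movie B</a></td><td>Comedy</td><td>800</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "tbContent"})

rows = []
# Skip the first <tr>, which is the header row in this sample.
for tr in table.find_all("tr")[1:]:
    cells = tr.find_all("td")
    link = cells[0].a
    rows.append({
        "name": link.get("title"),
        "href": link.get("href"),
        "type": cells[1].get_text(strip=True),
        "box_office": int(cells[2].get_text(strip=True)),
    })

for row in rows:
    print(row)
```

A list of dicts like this can be passed straight to pd.DataFrame(rows), which gives the same table as building one list per column.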

I tidied up the code a little, mainly to make the CSV write more efficient and to record how long the program takes to run.

    import time

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    # Record the start time
    start_time = time.time()

    url = "http://www.cbooo.cn/year?year=2018"
    datas = requests.get(url).text

    # Parse the HTML
    soup = BeautifulSoup(datas, 'lxml')
    # find_all returns a list-like ResultSet, so take [0]:
    # the <table> whose id attribute is tbContent
    moives_tables = soup.find_all('table', {'id': 'tbContent'})[0]
    # Get every <tr> row inside the table
    moives = moives_tables.find_all('tr')

    # Movie title: it sits in the first <td> of each <tr>; there are many rows, so loop over them
    names = [tr.find_all('td')[0].a.get('title') for tr in moives[1:]]
    # Detail-page URL of each movie; used below to fetch the director,
    # because the director is not shown on the main page
    hrefs = [tr.find_all('td')[0].a.get('href') for tr in moives[1:]]
    # Genre
    types = [tr.find_all('td')[1].string for tr in moives[1:]]
    # Box office
    box_offices = [int(tr.find_all('td')[2].string) for tr in moives[1:]]
    # Average ticket price
    Average_fare = [tr.find_all('td')[3].string for tr in moives[1:]]
    # Release date
    show_time = [tr.find_all('td')[6].string for tr in moives[1:]]


    # Fetch the director from a movie's detail page
    def getInfo(url):
        # Request the detail page of a movie on the chart
        datas = requests.get(url).text
        soup = BeautifulSoup(datas, 'lxml')
        # The text contains line breaks, so strip them with replace("\n", "")
        daoyan = soup.select('dl.dltext dd')[0].get_text().replace("\n", "")
        return daoyan


    directors = [getInfo(url) for url in hrefs]

    # Combine the lists into a pandas DataFrame so they can be written out as CSV
    df = pd.DataFrame({
        'name': names,
        'href': hrefs,
        'type': types,
        'box_office': box_offices,
        'Average_fare': Average_fare,
        'show_time': show_time,
        'directors': directors,
    })

    try:
        # to_csv opens and closes the file itself
        df.to_csv('D://box_office_02.csv')
        print("done")
    except Exception as e:
        print("error" + str(e))

    print("finish, elapsed time: %f s" % (time.time() - start_time))
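The run-time measurement can also be wrapped in a small helper so that individual steps are timed separately. This is only a sketch: timed is a hypothetical helper name, and it uses time.perf_counter() rather than the time.time() call in the script above, because perf_counter is a monotonic clock intended for measuring intervals. The built-in sum stands in for a real step such as the scraping or the CSV write.

```python
import time


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload (assumption): any function can be passed here.
total, elapsed = timed(sum, range(1_000_000))
print("finish, elapsed time: %f s" % elapsed)
```

Timing the page requests and the CSV write separately would show which of the two dominates the total.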

Still optimizing.

Reposted from: https://www.cnblogs.com/chen-jun552/p/11310187.html
