Python Scraper for the Douban Movie Top 250

Published: 2023/12/8

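See also my other blog post, a Python scraper for the Douban Books list of titles rated 9 or higher.

Building on that earlier work, this time I did a simple scrape of the Douban Movie TOP250 list (https://movie.douban.com/top250).

Open the link, view the page source, and look for the tags that hold the fields we need. The targets this time are the title, summary, rating, and image; each one is extracted, processed, and saved. (The basic prerequisite, of course, is still fetching the page source through its URL.)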

The complete code for this example is on GitHub:

https://github.com/selfcon/douban_movie_scraper_python

A recommended online regular expression tester:

http://tool.oschina.net/regex/

1. Page source (html)

# Fetch the page source for a URL (returns raw bytes)
import urllib.request

def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    return html
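Douban may reject requests that carry urllib's default User-Agent, so in practice the fetch may need a browser-like header. Here is a minimal sketch of that idea. The helper names `build_request` and `get_html_with_headers` and the User-Agent string are my own illustrative assumptions, not part of the original code:

```python
import urllib.request

# Illustrative browser-like User-Agent (an assumption); any realistic value could be used.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

def build_request(url):
    """Hypothetical helper: wrap the URL in a Request that carries a User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

def get_html_with_headers(url, timeout=10):
    """Variant of getHtml that sends the header and returns the raw bytes."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as page:
        return page.read()
```

Calling `get_html_with_headers("https://movie.douban.com/top250?start=0").decode("UTF-8")` would then stand in for `getHtml(url).decode("UTF-8")`.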

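2. Title (title)

The page source shows that each movie has several titles, and we only want the primary one. The regex results therefore need some post-processing.

# Get the title of every movie on the page with a regular expression
import re

topnum = 1  # running rank across pages (assumed start value); must exist before getName is first called

def getName(html):
    nameList = re.findall(r'<span.*?class="title">(.*?)</span>', html, re.S)
    global topnum
    newNameList = []
    for index, item in enumerate(nameList):
        # Alternate titles contain HTML entities such as &nbsp; skip them and keep only the primary title
        if item.find("&nbsp") == -1:
            newNameList.append("Top " + str(topnum) + " " + item)
            topnum += 1
    return newNameList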
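To see what the filter does, it helps to run the same regex on a tiny fragment. The sketch below uses a made-up HTML string shaped like what the regex expects (it is not copied from Douban), with placeholder movie names:

```python
import re

# Made-up fragment: the second span is an alternate title, marked by &nbsp;
sample = (
    '<span class="title">Movie A</span>'
    '<span class="title">&nbsp;/&nbsp;Movie A (alternate)</span>'
    '<span class="title">Movie B</span>'
)

titles = re.findall(r'<span.*?class="title">(.*?)</span>', sample, re.S)
# Keep only titles without the &nbsp; entity, i.e. the primary ones
primary = [t for t in titles if "&nbsp" not in t]
print(primary)
```

The regex returns all three spans, and the filter leaves only the two primary titles.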

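3. Summary (introduction)

The corresponding tag can be found in the page source, and a regex written for it. (PS: some movies have no summary, so this attribute is not stored in the final data.)

# Get the introduction of every movie on the page with a regular expression
def getInfo(html):
    infoList = re.findall(r'<span.*?class="inq">(.*?)</span>', html, re.S)
    return infoList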
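Because some movies lack a summary, infoList can end up shorter than the other lists, and its indices stop lining up with the titles. One possible way around this is to split the page into per-movie blocks first and search inside each block. The sketch below assumes every movie sits inside a `<div class="item">` wrapper; the sample HTML and the `parse_items` helper are illustrative assumptions, not the original approach:

```python
import re

# Made-up fragment: the second movie has no summary span.
sample = (
    '<div class="item"><span class="title">Movie A</span>'
    '<span class="inq">A short tagline</span></div>'
    '<div class="item"><span class="title">Movie B</span></div>'
)

def parse_items(html):
    """Hypothetical helper: return (title, summary) pairs, with '' for a missing summary."""
    rows = []
    # Split on the assumed per-movie wrapper; the chunk before the first item is dropped.
    for block in html.split('<div class="item">')[1:]:
        title = re.search(r'<span.*?class="title">(.*?)</span>', block, re.S)
        inq = re.search(r'<span.*?class="inq">(.*?)</span>', block, re.S)
        rows.append((title.group(1) if title else "", inq.group(1) if inq else ""))
    return rows

print(parse_items(sample))
```

With this shape, a missing summary becomes an empty string instead of shifting every later entry by one.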

4. Rating (rating)

# Get the rating_num of every movie on the page with a regular expression
def getScore(html):
    scoreList = re.findall(r'<span.*?class="rating_num".*?property="v:average">(.*?)</span>', html, re.S)
    return scoreList

5. Image (img)

# Get the img URL of every movie on the page with a regular expression
def getImg(html):
    imgList = re.findall(r'<img.*?alt=.*?src="(https.*?)".*?class.*?>', html, re.S)
    return imgList
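Note that this image regex depends on the attribute order alt, src, class inside the tag. A small check on made-up tags (the example.com URLs are placeholders, not real Douban image links) shows both the matching case and how a different attribute order slips through:

```python
import re

IMG_PATTERN = r'<img.*?alt=.*?src="(https.*?)".*?class.*?>'

# Attributes in the order the regex expects: alt, then src, then class
ordered = '<img width="100" alt="Movie A" src="https://example.com/a.jpg" class="">'
# Same information, but src comes before alt
reordered = '<img src="https://example.com/b.jpg" alt="Movie B" class="">'

matches = re.findall(IMG_PATTERN, ordered, re.S)
misses = re.findall(IMG_PATTERN, reordered, re.S)
print(matches)
print(misses)
```

So if the page markup ever changes its attribute order, the pattern would need to be adjusted.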

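6. Pagination (page)

There are 250 records in total, 25 per page, so 10 pages.

# Walk through the pages, 25 movies per page
namesUrl, scoresUrl, infosUrl, imgsUrl = [], [], [], []  # result lists, filled page by page
for page in range(0, 250, 25):
    url = "https://movie.douban.com/top250?start={}".format(page)
    html = getHtml(url).decode("UTF-8")
    namesUrl.extend(getName(html))
    scoresUrl.extend(getScore(html))
    infosUrl.extend(getInfo(html))
    imgsUrl.extend(getImg(html))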
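The paging arithmetic can be checked without touching the network. A small sketch that only builds the URLs the loop would visit (the constant names are mine, chosen for illustration):

```python
BASE_URL = "https://movie.douban.com/top250?start={}"
PAGE_SIZE = 25   # movies per page, matching the step used in the loop
TOTAL = 250      # size of the whole list

# One URL per page: start = 0, 25, 50, ..., 225
page_urls = [BASE_URL.format(start) for start in range(0, TOTAL, PAGE_SIZE)]

print(len(page_urls))
print(page_urls[0])
print(page_urls[-1])
```

The `start` parameter is the offset of the first movie on each page, so the last page begins at 225.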

7. Print (print)

# Print the collected data and store it in the list allInfo so it is easy to save
allInfo = []
if len(namesUrl) == len(scoresUrl) == len(imgsUrl):
    length = len(namesUrl)
    for i in range(0, length):
        print(namesUrl[i] + " , score = " + scoresUrl[i] + " ,\n imgUrl=" + imgsUrl[i])
        tmp = []
        tmp.append(namesUrl[i])
        tmp.append(scoresUrl[i])
        tmp.append(imgsUrl[i])
        allInfo.append(tmp)
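The index-based loop could also be written with `zip`, which pairs the three lists element by element. A sketch with made-up sample data; `build_rows` is a hypothetical helper, and raising an error on a length mismatch is my own choice (the code above simply skips the loop in that case):

```python
# Made-up sample data standing in for the scraped lists
names = ["Top 1 Movie A", "Top 2 Movie B"]
scores = ["9.7", "9.6"]
imgs = ["https://example.com/a.jpg", "https://example.com/b.jpg"]

def build_rows(names, scores, imgs):
    """Hypothetical helper: combine the field lists into [name, score, img] rows."""
    if not (len(names) == len(scores) == len(imgs)):
        raise ValueError("field lists have different lengths")
    return [list(row) for row in zip(names, scores, imgs)]

all_info = build_rows(names, scores, imgs)
for name, score, img in all_info:
    print(f"{name} , score = {score} , imgUrl = {img}")
```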

8. Store (store)

# Save the collected data to a CSV file
import csv

def save_to_csv(list_tmp):
    with open('D:/movie.csv', 'w+', newline='') as fp:
        a = csv.writer(fp, delimiter=',')
        a.writerow(['name', 'score', 'imgurl'])
        a.writerows(list_tmp)
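Since the titles contain Chinese characters, it may be worth choosing the file encoding explicitly rather than relying on the platform default. A sketch under that assumption: the `save_rows` helper, the `utf-8-sig` choice, and the temporary output path are all illustrative, not part of the original code.

```python
import csv
import os
import tempfile

def save_rows(path, rows):
    """Hypothetical variant of save_to_csv with an explicit path and encoding."""
    # utf-8-sig (an assumption) writes a BOM, which helps Excel detect UTF-8.
    with open(path, "w", newline="", encoding="utf-8-sig") as fp:
        writer = csv.writer(fp)
        writer.writerow(["name", "score", "imgurl"])
        writer.writerows(rows)

# Made-up row, written to a temporary directory instead of D:/
rows = [["Top 1 Movie A", "9.7", "https://example.com/a.jpg"]]
out_path = os.path.join(tempfile.mkdtemp(), "movie.csv")
save_rows(out_path, rows)

# Read the file back to confirm the header and the row round-trip
with open(out_path, newline="", encoding="utf-8-sig") as fp:
    read_back = list(csv.reader(fp))
print(read_back)
```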

9. Result (result)

- Happy to share and improve together; additions are welcome
- Treat Warnings As Errors
- Any comments greatly appreciated
- Talking is cheap, show me the code
- Discussion is sincerely welcome! QQ:1138517609
- CSDN: https://blog.csdn.net/u011489043
- Jianshu: https://www.jianshu.com/u/4968682d58d1
- GitHub: https://github.com/selfconzrr
