Python Crawler in Action (3): Scraping Data on a Schedule into SQL Server
Contents
- 🌹Preface
- Scraping target (result preview)
- Preparation
- Code analysis
- Step 1
- Step 2
- Step 3
- Step 4
- Complete code
- Launch
🌹Preface
- 🏆🏆 About the author: quality creator in the Python space, Huawei Cloud Expert, Alibaba Cloud Expert Blogger, and Top 6 in CSDN's 2021 New Blogger rankings
- 🔥🔥 This article is part of my practical crawler column 《Python爬蟲實戰100例》 (100 Practical Python Crawler Examples)
- 📝📝 The column covers hands-on Python crawler cases from basic to advanced; you're welcome to subscribe for free
Scraping target (result preview)
What we scrape from each hot-list entry: the title, its rank on the list, the heat value, the news category, a timestamp, the url address, and so on.
Preparation
Create the table:

```sql
CREATE TABLE "WB_HotList" (
    "id" INT IDENTITY(1,1) PRIMARY KEY,
    "batch" NVARCHAR(MAX),
    "daydate" SMALLDATETIME,
    "star_word" NVARCHAR(MAX),
    "title" NVARCHAR(MAX),
    "category" NVARCHAR(MAX),
    "num" NVARCHAR(MAX),
    "subject_querys" NVARCHAR(MAX),
    "flag" NVARCHAR(MAX),
    "icon_desc" NVARCHAR(MAX),
    "raw_hot" NVARCHAR(MAX),
    "mid" NVARCHAR(MAX),
    "emoticon" NVARCHAR(MAX),
    "icon_desc_color" NVARCHAR(MAX),
    "realpos" NVARCHAR(MAX),
    "onboard_time" SMALLDATETIME,
    "topic_flag" NVARCHAR(MAX),
    "ad_info" NVARCHAR(MAX),
    "fun_word" NVARCHAR(MAX),
    "note" NVARCHAR(MAX),
    "rank" NVARCHAR(MAX),
    "url" NVARCHAR(MAX)
)
```

To keep any field from running out of room, every text column simply gets NVARCHAR(MAX)!
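Before writing any crawler code, it can be worth a quick check that pymssql can reach the instance and see the new table. A minimal sketch, using the same server, account, password, and database name as the crawler code below:

```python
import pymssql

# Quick connectivity check against the freshly created table,
# using the same connection parameters as the crawler below
conn = pymssql.connect('.', 'sa', 'yuan427', 'test')  # server, user, password, database
cursor = conn.cursor()
cursor.execute("select count(*) from WB_HotList")  # should print 0 on a fresh table
print(cursor.fetchone()[0])
conn.close()
```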
Code analysis
Step 1
Send the request and fetch the page data.
Weibo provides a data API, so we can simply call the endpoint directly; the response is JSON, as below:
```python
# API endpoint: https://weibo.com/ajax/statuses/hot_band
def __init__(self):
    self.url = "https://weibo.com/ajax/statuses/hot_band"
    self.headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"
    }

# Send the request and get the response
def parse_url(self):
    response = requests.get(self.url, headers=self.headers)
    time.sleep(2)  # rest for two seconds
    return response.content.decode()
```

Step 2
Parse the data and extract just the fields we need.
Pretty-printed, the API payload looks like the sketch below (we only pull out what we need):
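Roughly, the payload is shaped like this (the field names are the ones the parsing code below reads; the values here are invented for illustration):

```json
{
    "data": {
        "band_list": [
            {
                "word": "topic title",
                "category": "社会",
                "num": 4832917,
                "raw_hot": 4832917,
                "rank": 0,
                "onboard_time": 1636513000,
                "mid": "46987654321",
                "note": "topic note",
                "mblog": { "text": "<a href=\"https://weibo.com/...\">...</a>" }
            }
        ]
    }
}
```

Fields such as star_word, flag, icon_desc, emoticon, and ad_info appear only on some entries, which is why the parsing code wraps each lookup in its own try/except.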
Step 3
The batch column in the database tells us which batch each row was inserted in (50 entries per batch). If the crawler dies, a small helper can read the last batch number back from the table, so the next run carries on from where it stopped, as in the excerpt below.
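Both pieces are excerpted from the complete code further down: batch() reads every stored batch label, and the start of run() parses the last label to pick the next batch number.

```python
# Read the batch column into a list (used to work out the next batch number)
def batch(self):
    conn = pymssql.connect('.', 'sa', 'yuan427', 'test')
    cursor = conn.cursor()
    cursor.execute("select batch from WB_HotList")  # send the SQL query to the database
    rows = cursor.fetchall()
    batchlist = []
    for row in rows:
        batchlist.append(row[0])
    return batchlist

# In run(): if the table already has rows, resume from the last stored batch label
batchlist = self.batch()
if len(batchlist) != 0:
    batch = batchlist[len(batchlist) - 1]  # e.g. '第3批' ("batch 3")
    a = re.findall('第(.*?)批', batch)      # pull the number out of the label
    a = int(a[0]) + 1                       # continue with the next batch
```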
Step 4
Store the data in the database.
```python
# Connect to the database server and create a cursor object
db = pymssql.connect('.', 'sa', 'yuan427', 'test')  # server, user, password, database
if db:
    print("Connected!")
cursor = db.cursor()

try:
    # The INSERT statement
    sql = "insert into WB_HotList(batch,daydate,star_word,title,category,num,subject_querys,flag,icon_desc,raw_hot, \
        mid,emoticon,icon_desc_color,realpos,onboard_time,topic_flag,ad_info,fun_word,note,rank,url) \
        values (%s,getdate(),%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
    # Execute the insert
    cursor.execute(sql, (batch, star_word, title, category, num, subject_querys, flag, icon_desc, raw_hot,
                         mid, emoticon, icon_desc_color, realpos, onboard_time, topic_flag, ad_info,
                         fun_word, note, rank, url))
    db.commit()
    print('Inserted successfully......')
except Exception as e:
    db.rollback()
    print(str(e))

# Close the cursor and disconnect from the database
cursor.close()
db.close()
```

Complete code
```python
import requests, pymssql, time, json, re, datetime
from threading import Timer

class Spider:
    def __init__(self):
        self.url = "https://weibo.com/ajax/statuses/hot_band"
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"
        }

    # Send the request and get the response
    def parse_url(self):
        response = requests.get(self.url, headers=self.headers)
        time.sleep(2)
        return response.content.decode()

    # Parse the data and write it to the database
    def parse_data(self, data, a):
        json_data = json.loads(data)
        # Connect to the database server and create a cursor object
        db = pymssql.connect('.', 'sa', 'yuan427', 'test')  # server, user, password, database
        cursor = db.cursor()

        for i in range(50):
            ban_list = json_data['data']['band_list'][i]
            batch = f'第{a}批'  # batch label; run() parses this exact format to resume
            try:
                star_word = ban_list['star_word']
            except Exception as e:
                print(e)
            try:
                title = ban_list['word']
            except Exception as e:
                print(e)
            try:
                category = ban_list['category']
            except Exception as e:
                print(e)
            try:
                num = ban_list['num']
            except Exception as e:
                print(e)
            try:
                subject_querys = ban_list['subject_querys']
            except Exception as e:
                print(e)
            try:
                flag = ban_list['flag']
            except Exception as e:
                print(e)
            try:
                icon_desc = ban_list['icon_desc']
            except Exception as e:
                print(e)
            try:
                raw_hot = ban_list['raw_hot']
            except Exception as e:
                print(e)
            try:
                mid = ban_list['mid']
            except Exception as e:
                print(e)
            try:
                emoticon = ban_list['emoticon']
            except Exception as e:
                print(e)
            try:
                icon_desc_color = ban_list['icon_desc_color']
            except Exception as e:
                print(e)
            try:
                realpos = ban_list['realpos']
            except Exception as e:
                print(e)
            try:
                onboard_time = ban_list['onboard_time']
                onboard_time = datetime.datetime.fromtimestamp(onboard_time)  # Unix timestamp -> datetime
            except Exception as e:
                print(e)
            try:
                topic_flag = ban_list['topic_flag']
            except Exception as e:
                print(e)
            try:
                ad_info = ban_list['ad_info']
            except Exception as e:
                print(e)
            try:
                fun_word = ban_list['fun_word']
            except Exception as e:
                print(e)
            try:
                note = ban_list['note']
            except Exception as e:
                print(e)
            try:
                rank = ban_list['rank'] + 1  # the API rank is 0-based
            except Exception as e:
                print(e)
            try:
                url = json_data['data']['band_list'][i]['mblog']['text']
                url = re.findall('href="(.*?)"', url)[0]  # pull the link out of the mblog html
            except Exception as e:
                print(e)

            try:
                # The INSERT statement
                sql = "insert into WB_HotList(batch,daydate,star_word,title,category,num,subject_querys,flag,icon_desc,raw_hot, \
                    mid,emoticon,icon_desc_color,realpos,onboard_time,topic_flag,ad_info,fun_word,note,rank,url) \
                    values (%s,getdate(),%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
                # Execute the insert
                cursor.execute(sql, (batch, star_word, title, category, num, subject_querys, flag, icon_desc, raw_hot,
                                     mid, emoticon, icon_desc_color, realpos, onboard_time, topic_flag, ad_info,
                                     fun_word, note, rank, url))
                db.commit()
                print('Inserted successfully......')
            except Exception as e:
                db.rollback()
                print(str(e))

        # Close the cursor and disconnect from the database
        cursor.close()
        db.close()

    # Read the batch column into a list (used to work out the batch number)
    def batch(self):
        conn = pymssql.connect('.', 'sa', 'yuan427', 'test')
        cursor = conn.cursor()
        cursor.execute("select batch from WB_HotList")  # send the SQL query to the database
        rows = cursor.fetchall()
        batchlist = []
        for row in rows:
            batchlist.append(row[0])
        return batchlist

    # Main logic
    def run(self, a):
        # Set a from the batch labels already in the database, so restarts resume correctly
        batchlist = self.batch()
        if len(batchlist) != 0:
            batch = batchlist[len(batchlist) - 1]
            a = re.findall('第(.*?)批', batch)
            a = int(a[0]) + 1
        data = self.parse_url()
        self.parse_data(data, a)
        a += 1
        # Schedule the next run
        t = Timer(1800, self.run, (a,))  # 1800 seconds: run again in half an hour
        t.start()

if __name__ == "__main__":
    spider = Spider()
    spider.run(1)
```

Launch
Since the crawler needs to keep running, just leave it going in a cmd window.
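For example (spider.py is a stand-in for whatever you name the script):

```
python spider.py
```

Leaving the window open keeps the Timer alive, so the crawler fires again every half hour.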
Once it has run successfully, go check the database:
Done and done!!!
If I've gotten anything wrong, I hope you experts will set me straight!!! If anything is unclear, leave a comment and I'll reply! Drop a like, folks, and I'll keep posting these crawler walkthroughs whenever I have time!!!