
Python Crawler in Practice (3): Scraping Data on a Schedule and Storing It in SQL Server


Contents

  • 🌹Preface
  • Scraping target (demo)
  • Preparation
  • Code walkthrough
    • Step 1
    • Step 2
    • Step 3
    • Step 4
    • Complete code
    • Running it

🌹Preface

  • 🏆🏆About the author: top-rated creator in the Python category, Huawei Cloud expert, Alibaba Cloud expert blogger, Top 6 in CSDN's 2021 New Blogger Stars

  • 🔥🔥This article is part of my column "Python爬虫实战100例" (100 Practical Python Crawler Examples)
  • 📝📝The column covers hands-on crawler cases from basic to advanced; feel free to subscribe for free

Scraping target (demo)

Demo screenshot:

What we scrape from the Weibo hot search list: title, rank, heat value, news category, timestamp, URL, and so on.

Preparation

Create the table:

CREATE TABLE "WB_HotList" (
    "id" INT IDENTITY(1,1) PRIMARY KEY,
    "batch" NVARCHAR(MAX),
    "daydate" SMALLDATETIME,
    "star_word" NVARCHAR(MAX),
    "title" NVARCHAR(MAX),
    "category" NVARCHAR(MAX),
    "num" NVARCHAR(MAX),
    "subject_querys" NVARCHAR(MAX),
    "flag" NVARCHAR(MAX),
    "icon_desc" NVARCHAR(MAX),
    "raw_hot" NVARCHAR(MAX),
    "mid" NVARCHAR(MAX),
    "emoticon" NVARCHAR(MAX),
    "icon_desc_color" NVARCHAR(MAX),
    "realpos" NVARCHAR(MAX),
    "onboard_time" SMALLDATETIME,
    "topic_flag" NVARCHAR(MAX),
    "ad_info" NVARCHAR(MAX),
    "fun_word" NVARCHAR(MAX),
    "note" NVARCHAR(MAX),
    "rank" NVARCHAR(MAX),
    "url" NVARCHAR(MAX)
)

To avoid any column being too small, I simply used NVARCHAR(MAX) everywhere!
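If you want to confirm the table was created before wiring up the crawler, a quick check from Python might look like this. This is only a sketch; it assumes the same local SQL Server instance, login, password and database ('.', 'sa', 'yuan427', 'test') that the rest of the article uses:

import pymssql

# connect with the same settings used later in the article: server, user, password, database
conn = pymssql.connect('.', 'sa', 'yuan427', 'test')
cursor = conn.cursor()
# count the columns of WB_HotList; 22 means the CREATE TABLE above ran as expected
cursor.execute(
    "SELECT COUNT(*) FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'WB_HotList'"
)
print(cursor.fetchone()[0])
conn.close()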

Code walkthrough

Step 1

Send the request and fetch the data.

Weibo exposes a data API for the hot list, so we can call the endpoint directly; the response is JSON:

# API endpoint: https://weibo.com/ajax/statuses/hot_band

def __init__(self):
    self.url = "https://weibo.com/ajax/statuses/hot_band"
    self.headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"
    }

# Send the request and return the response body
def parse_url(self):
    response = requests.get(self.url, headers=self.headers)
    time.sleep(2)  # pause for two seconds
    return response.content.decode()
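If you want to sanity-check the endpoint before building the class, a minimal standalone request reusing the same URL and User-Agent could look like this (just a throwaway sketch, not part of the final script):

import requests

url = "https://weibo.com/ajax/statuses/hot_band"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"}

response = requests.get(url, headers=headers)
print(response.status_code)            # expect 200
data = response.json()
print(len(data['data']['band_list']))  # the parsing code below assumes 50 entries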

Step 2

Parse the response and extract the fields we need.

Formatted, the API data looks roughly like the sketch below (we only pull out what we need):
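This is an illustrative sketch only, reconstructed from the keys the extraction code reads: the field values are made up, the real response contains more keys, and some keys (star_word, fun_word, icon_desc, ...) only appear on certain entries:

# rough shape of one entry in data['data']['band_list'] (values are illustrative)
band_entry = {
    "word": "some hot search title",      # used as the title
    "category": "社会",                    # news category
    "num": 1234567,                        # heat value
    "rank": 0,                             # 0-based position; the code adds 1
    "onboard_time": 1640995200,            # Unix timestamp -> datetime.fromtimestamp
    "raw_hot": 1180000,
    "mid": "4723456789012345",
    "realpos": 1,
    "flag": 0,
    "topic_flag": 1,
    "icon_desc": "热",                     # badge text, only on some entries
    "icon_desc_color": "#ff3852",
    "star_word": "",                       # only on some entries
    "fun_word": "",                        # only on some entries
    "subject_querys": "",
    "ad_info": "",
    "emoticon": "",
    "note": "some hot search title",
    "mblog": {
        "text": '... <a href="https://weibo.com/...">...</a> ...'  # the url is pulled out of this with a regex
    },
}

Because of those optional keys, the extraction wraps every lookup in try/except so one missing field doesn't abort the whole batch: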

for i in range(50):
    ban_list = json_data['data']['band_list'][i]
    batch = f'第{a}批'  # batch label, e.g. '第1批' ("batch 1"); run() parses this string later
    try:
        star_word = ban_list['star_word']
    except Exception as e:
        print(e)
    try:
        title = ban_list['word']
    except Exception as e:
        print(e)
    try:
        category = ban_list['category']
    except Exception as e:
        print(e)
    try:
        num = ban_list['num']
    except Exception as e:
        print(e)
    try:
        subject_querys = ban_list['subject_querys']
    except Exception as e:
        print(e)
    try:
        flag = ban_list['flag']
    except Exception as e:
        print(e)
    try:
        icon_desc = ban_list['icon_desc']
    except Exception as e:
        print(e)
    try:
        raw_hot = ban_list['raw_hot']
    except Exception as e:
        print(e)
    try:
        mid = ban_list['mid']
    except Exception as e:
        print(e)
    try:
        emoticon = ban_list['emoticon']
    except Exception as e:
        print(e)
    try:
        icon_desc_color = ban_list['icon_desc_color']
    except Exception as e:
        print(e)
    try:
        realpos = ban_list['realpos']
    except Exception as e:
        print(e)
    try:
        onboard_time = ban_list['onboard_time']
        onboard_time = datetime.datetime.fromtimestamp(onboard_time)
    except Exception as e:
        print(e)
    try:
        topic_flag = ban_list['topic_flag']
    except Exception as e:
        print(e)
    try:
        ad_info = ban_list['ad_info']
    except Exception as e:
        print(e)
    try:
        fun_word = ban_list['fun_word']
    except Exception as e:
        print(e)
    try:
        note = ban_list['note']
    except Exception as e:
        print(e)
    try:
        rank = ban_list['rank'] + 1
    except Exception as e:
        print(e)
    try:
        url = json_data['data']['band_list'][i]['mblog']['text']
        url = re.findall('href="(.*?)"', url)[0]
    except Exception as e:
        print(e)
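As a side note, the same extraction can be written more compactly with dict.get(). This is only a sketch of the idea, not the author's code; with .get() a missing key becomes None instead of silently keeping the value from the previous loop iteration:

# sketch: extraction with .get(); missing keys become None
star_word = ban_list.get('star_word')
title = ban_list.get('word')
category = ban_list.get('category')
num = ban_list.get('num')
onboard_time = ban_list.get('onboard_time')
if onboard_time is not None:
    onboard_time = datetime.datetime.fromtimestamp(onboard_time)
rank = ban_list.get('rank')
if rank is not None:
    rank += 1
# ...and so on for the remaining fields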

Step 3

The batch column in the database records which batch each row was inserted in (50 rows per batch). If the crawler stops, a small helper method lets the next run pick up from the last batch number.

The helper:

# Read the batch column into a list and return it (used to work out the batch number)
def batch(self):
    conn = pymssql.connect('.', 'sa', 'yuan427', 'test')
    cursor = conn.cursor()
    cursor.execute("select batch from WB_HotList")  # send the SQL command to the database
    rows = cursor.fetchall()
    batchlist = []
    for row in rows:
        batchlist.append(row[0])
    return batchlist
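The resume logic that consumes this list lives in run() (see the complete code below); in essence it takes the last label, pulls the number out of the '第N批' string with a regex, and continues from N + 1. Roughly:

batchlist = self.batch()                              # e.g. ['第1批', '第1批', ..., '第2批']
if len(batchlist) != 0:
    last = batchlist[-1]                              # the most recently inserted batch label
    a = int(re.findall('第(.*?)批', last)[0]) + 1      # resume from the next batch number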

Step 4

Write the data to the database.

# Connect to the database server and create a cursor
db = pymssql.connect('.', 'sa', 'yuan427', 'test')  # server, user, password, database
if db:
    print("連接成功!")  # connected successfully
cursor = db.cursor()
try:
    # the INSERT statement; daydate is filled in by SQL Server's getdate()
    sql = ("insert into WB_HotList(batch,daydate,star_word,title,category,num,subject_querys,flag,icon_desc,raw_hot,mid,"
           "emoticon,icon_desc_color,realpos,onboard_time,topic_flag,ad_info,fun_word,note,rank,url) "
           "values (%s,getdate(),%s,%s,%s,%s,%s,"
           "%s,%s,%s,%s,%s,%s,%s,"
           "%s,%s,%s,%s,%s,%s,%s)")
    # run the insert
    cursor.execute(sql, (batch, star_word, title, category, num, subject_querys, flag, icon_desc, raw_hot, mid,
                         emoticon, icon_desc_color, realpos, onboard_time, topic_flag, ad_info,
                         fun_word, note, rank, url))
    db.commit()
    print('成功載入......')  # loaded successfully
except Exception as e:
    db.rollback()
    print(str(e))

# Close the cursor and disconnect
cursor.close()
db.close()

Complete code

import requests, pymssql, time, json, re, datetime
from threading import Timer


class Spider:
    def __init__(self):
        self.url = "https://weibo.com/ajax/statuses/hot_band"
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"
        }

    # Send the request and return the response body
    def parse_url(self):
        response = requests.get(self.url, headers=self.headers)
        time.sleep(2)
        return response.content.decode()

    # Parse the data and write it to the database
    def parse_data(self, data, a):
        json_data = json.loads(data)
        # Connect to the database server and create a cursor
        db = pymssql.connect('.', 'sa', 'yuan427', 'test')  # server, user, password, database
        cursor = db.cursor()
        for i in range(50):
            ban_list = json_data['data']['band_list'][i]
            batch = f'第{a}批'  # batch label, e.g. '第1批'
            try:
                star_word = ban_list['star_word']
            except Exception as e:
                print(e)
            try:
                title = ban_list['word']
            except Exception as e:
                print(e)
            try:
                category = ban_list['category']
            except Exception as e:
                print(e)
            try:
                num = ban_list['num']
            except Exception as e:
                print(e)
            try:
                subject_querys = ban_list['subject_querys']
            except Exception as e:
                print(e)
            try:
                flag = ban_list['flag']
            except Exception as e:
                print(e)
            try:
                icon_desc = ban_list['icon_desc']
            except Exception as e:
                print(e)
            try:
                raw_hot = ban_list['raw_hot']
            except Exception as e:
                print(e)
            try:
                mid = ban_list['mid']
            except Exception as e:
                print(e)
            try:
                emoticon = ban_list['emoticon']
            except Exception as e:
                print(e)
            try:
                icon_desc_color = ban_list['icon_desc_color']
            except Exception as e:
                print(e)
            try:
                realpos = ban_list['realpos']
            except Exception as e:
                print(e)
            try:
                onboard_time = ban_list['onboard_time']
                onboard_time = datetime.datetime.fromtimestamp(onboard_time)
            except Exception as e:
                print(e)
            try:
                topic_flag = ban_list['topic_flag']
            except Exception as e:
                print(e)
            try:
                ad_info = ban_list['ad_info']
            except Exception as e:
                print(e)
            try:
                fun_word = ban_list['fun_word']
            except Exception as e:
                print(e)
            try:
                note = ban_list['note']
            except Exception as e:
                print(e)
            try:
                rank = ban_list['rank'] + 1
            except Exception as e:
                print(e)
            try:
                url = json_data['data']['band_list'][i]['mblog']['text']
                url = re.findall('href="(.*?)"', url)[0]
            except Exception as e:
                print(e)
            try:
                # the INSERT statement; daydate is filled in by getdate()
                sql = ("insert into WB_HotList(batch,daydate,star_word,title,category,num,subject_querys,flag,icon_desc,raw_hot,mid,"
                       "emoticon,icon_desc_color,realpos,onboard_time,topic_flag,ad_info,fun_word,note,rank,url) "
                       "values (%s,getdate(),%s,%s,%s,%s,%s,"
                       "%s,%s,%s,%s,%s,%s,%s,"
                       "%s,%s,%s,%s,%s,%s,%s)")
                # run the insert
                cursor.execute(sql, (batch, star_word, title, category, num, subject_querys, flag, icon_desc, raw_hot, mid,
                                     emoticon, icon_desc_color, realpos, onboard_time, topic_flag, ad_info,
                                     fun_word, note, rank, url))
                db.commit()
                print('成功載入......')  # loaded successfully
            except Exception as e:
                db.rollback()
                print(str(e))
        # Close the cursor and disconnect
        cursor.close()
        db.close()

    # Read the batch column into a list and return it (used to work out the batch number)
    def batch(self):
        conn = pymssql.connect('.', 'sa', 'yuan427', 'test')
        cursor = conn.cursor()
        cursor.execute("select batch from WB_HotList")  # send the SQL command to the database
        rows = cursor.fetchall()
        batchlist = []
        for row in rows:
            batchlist.append(row[0])
        return batchlist

    # Main logic
    def run(self, a):
        # Derive a (the batch number) from what is already in the database
        batchlist = self.batch()
        if len(batchlist) != 0:
            batch = batchlist[len(batchlist) - 1]
            a = re.findall('第(.*?)批', batch)
            a = int(a[0]) + 1
        data = self.parse_url()
        self.parse_data(data, a)
        a += 1
        # Schedule the next run: 1800 seconds = every half hour
        t = Timer(1800, self.run, (a,))
        t.start()


if __name__ == "__main__":
    spider = Spider()
    spider.run(1)

Running it

Because the script needs to keep running, I simply leave it running in a cmd window.

After it has run successfully, go and check the database:
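A quick way to eyeball the latest rows from Python (a small sketch, using the same connection settings as above):

import pymssql

conn = pymssql.connect('.', 'sa', 'yuan427', 'test')
cursor = conn.cursor()
# most recent 5 rows: batch label, title and onboard time
cursor.execute("SELECT TOP 5 batch, title, onboard_time FROM WB_HotList ORDER BY id DESC")
for row in cursor.fetchall():
    print(row)
conn.close()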

All done!!!

If I've got anything wrong, corrections are very welcome!!! If anything is unclear, leave a comment and I'll reply. Drop a like, and I'll keep updating the practical crawler series whenever I have time!!!
