
Using Python + Selenium to Scrape Videos and Related Data

Published: 2024/3/13 | python | 豆豆


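This program scrapes video links, like counts, comment counts, and similar information from a user's home page. It uses Selenium to simulate scrolling and to grab the page's HTML source, then uses regular expressions to extract the required information from that HTML. Finally, it stores the extracted information in a Pandas DataFrame and saves it as a CSV file.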
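The final step, turning the collected fields into a CSV file, can be sketched on its own. The program itself uses a Pandas DataFrame; the sketch below swaps in the standard-library csv module so that it runs without third-party packages. The field names, the `rows_to_csv` helper, and the sample rows are all hypothetical and chosen only for illustration. Missing values are written as empty strings, which mirrors how the program falls back to an empty string when a pattern finds nothing.

```python
import csv
import io

# Hypothetical column names, used only for this illustration.
FIELDS = ["account", "video_url", "title", "likes", "comments"]


def rows_to_csv(rows, fields=FIELDS):
    """Serialize a list of dicts to CSV text; missing keys become ''."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fields, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample data: the second row has no like or comment count.
sample_rows = [
    {"account": "demo_user", "video_url": "https://www.douyin.com/video/111",
     "title": "first", "likes": "12", "comments": "3"},
    {"account": "demo_user", "video_url": "https://www.douyin.com/video/222",
     "title": "second"},
]

print(rows_to_csv(sample_rows))
```

To write a file instead of a string, the same writer can be pointed at `open(path, "w", newline="", encoding="utf-8-sig")`. The utf-8-sig encoding, which the program also uses, prepends a byte order mark so that spreadsheet applications such as Excel recognize the file as UTF-8.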

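The program first waits 20 seconds so that the user can log in to their account. It then simulates scrolling by executing JavaScript commands and waits for the page to finish loading. Next, it extracts the HTML source and closes ChromeDriver. Finally, it extracts the user's name and creates a directory named after that user if the directory does not already exist.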
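The program scrolls a fixed 30 times. A possible variation, sketched below, stops early once the page height stops growing. This rests on the assumption that a page whose height no longer changes after a scroll has no more content to load, which may not hold for every site. The `scroll_until_stable` function and the `FakePage` class are hypothetical names; `FakePage` only stands in for a Selenium WebDriver so the sketch runs without a browser.

```python
import time


def scroll_until_stable(driver, max_rounds=30, pause=1.0, sleep=time.sleep):
    """Scroll down until the page height stops growing or max_rounds is hit.

    `driver` only needs an execute_script(js) method, as Selenium's
    WebDriver has. Returns the number of scroll rounds performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    for _ in range(max_rounds):
        driver.execute_script("window.scrollBy(0, 10000)")
        rounds += 1
        sleep(pause)  # give lazily loaded items time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded; assume the end was reached
        last_height = new_height
    return rounds


class FakePage:
    """Stand-in for a WebDriver: each scroll reveals the next page height."""

    def __init__(self, heights):
        self.heights = heights
        self.pos = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self.heights[self.pos]
        # Any other script is treated as a scroll that may load more content.
        self.pos = min(self.pos + 1, len(self.heights) - 1)
        return None


page = FakePage([1000, 2000, 3000])
print(scroll_until_stable(page, pause=0))
```

With a real browser, the `driver` object created by `webdriver.Chrome(...)` would be passed in place of `FakePage`, and `pause` would be set to a second or more so the site has time to load new items.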

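The program then extracts the video links from the HTML source and iterates over them. For each link, it fetches the video page with requests and uses regular expressions to extract the video's title, like count, comment count, and other information (favorite count and publish time). This information is stored in a Pandas DataFrame and saved as a CSV file.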
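The link-extraction step can be illustrated in isolation. The sketch below uses the same pattern and URL prefix as the program, but de-duplicates with `dict.fromkeys` rather than `set`, so the links keep the order in which they appear on the page. The `extract_video_urls` helper and the sample HTML are made up for this example; real page markup will differ.

```python
import re

# Same pattern the program uses to find video IDs in the page source.
VIDEO_LINK_RE = re.compile(r'<a href="/video/(\d+)')


def extract_video_urls(html):
    """Return unique video URLs in the order they first appear in html."""
    ids = VIDEO_LINK_RE.findall(html)
    unique_ids = list(dict.fromkeys(ids))  # de-duplicate, keep first-seen order
    return ["https://www.douyin.com/video/" + vid for vid in unique_ids]


# Made-up HTML fragment in which one video is linked twice.
sample_html = (
    '<li><a href="/video/111">a</a></li>'
    '<li><a href="/video/222">b</a></li>'
    '<li><a href="/video/111">a again</a></li>'
)

print(extract_video_urls(sample_html))
```

Keeping the page order makes the resulting CSV rows line up with what the user sees on the profile page, whereas a `set` returns the IDs in arbitrary order.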

Full code:

```python
import json
import os
import random
import re
import time
from pprint import pprint

import pandas as pd
import requests  # HTTP request module
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

# Address of the user's home page
url = input("Enter the user's home page URL: ")

# Headers that make the requests look like they come from a browser
headers = {
    'cookie': "********omitted*********",
    'user-agent': "********omitted************",
}

# Initialize ChromeDriver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

# Open the page
driver.get(url)

# Maximize the browser window
driver.maximize_window()

time.sleep(20)  # wait 20 seconds so the user can log in

# Simulate scrolling
for i in range(30):
    driver.execute_script('window.scrollBy(0,10000)')
    time.sleep(random.randint(1, 3))

# Wait for the page to finish loading
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.TAG_NAME, 'body')))

# Get the page source
response = driver.execute_script('return document.documentElement.outerHTML')
# print(response)

# Close ChromeDriver
driver.quit()

# response = requests.get(url=url, headers=headers)

# Account name.
# NOTE: the HTML tags that surrounded (.+?) are missing from this listing.
# Supply the markup of the element that holds the account name before running;
# as written, the pattern matches a single character.
zhang_hao = re.findall('(.+?)', response)[0]

zhang_hao_list = []
Dian_zan_list = []
Ping_lun_list = []
Shou_cang_list = []
Fabu_shijian_list = []
Tittle_list = []
lis_list = []

# Create a directory named after the account if it does not exist yet
if not os.path.exists(zhang_hao):
    os.makedirs(zhang_hao)

# Collect the unique video IDs found on the home page
lis = re.findall(r'<a href="/video/(\d+)', response)
lis = list(set(lis))
for li in lis:
    print('https://www.douyin.com/video/' + li)
print(f"This user has {len(lis)} videos")

for li in lis:
    url2 = 'https://www.douyin.com/video/' + li
    lis_list.append(url2)
    response_video = requests.get(url=url2, headers=headers)

    # Use regular expressions to extract publish time, likes, comments, favorites.
    # The literal below is the page's Chinese "publish time:" label; the tag that
    # followed (.+?) is missing from this listing.
    try:
        Fabu_shijian = re.findall('</div>发布时间：(.+?)', response_video.text)[0]
    except IndexError:
        Fabu_shijian = ''
    try:
        title = re.findall('<title data-react-helmet="true">(.*)?</title>',
                           response_video.text)[0]
    except IndexError:
        title = li  # fall back to the video ID
    print(title)

    # NOTE: as with the account name, the tags around (.+?) are missing here.
    match = re.findall('(.+?)', response_video.text)
    try:
        Dian_zan = match[0]
    except IndexError:
        Dian_zan = ''
    try:
        Ping_lun = match[1]
    except IndexError:
        Ping_lun = ''
    try:
        Shou_cang = match[2]
    except IndexError:
        Shou_cang = ''

    zhang_hao_list.append(zhang_hao)
    Fabu_shijian_list.append(Fabu_shijian)
    Tittle_list.append(title)
    Dian_zan_list.append(Dian_zan)
    Ping_lun_list.append(Ping_lun)
    Shou_cang_list.append(Shou_cang)

Douyin_df = pd.DataFrame(columns=['Account', 'Video link', 'Title', 'Likes',
                                  'Comments', 'Favorites', 'Publish time'])
Douyin_df['Account'] = zhang_hao_list
Douyin_df['Video link'] = lis_list
Douyin_df['Title'] = Tittle_list
Douyin_df['Likes'] = Dian_zan_list
Douyin_df['Comments'] = Ping_lun_list
Douyin_df['Favorites'] = Shou_cang_list
Douyin_df['Publish time'] = Fabu_shijian_list

# Save the CSV in the current directory, named after the account
current_path = os.getcwd()
Douyin_df.to_csv(os.path.join(current_path, str(zhang_hao) + '.csv'),
                 index=False, encoding='utf-8-sig')

# Video download module (fairly slow), disabled by default
'''
i = 0
for li in lis:
    # Fetch the video page
    url_video = 'https://www.douyin.com/video/' + li
    response_video = requests.get(url=url_video, headers=headers)
    # print(response_video.text)
    try:
        title = re.findall('<title data-react-helmet="true">(.*)?</title>',
                           response_video.text)[0]
    except IndexError:
        title = li
    htmldata = re.findall('<script id="RENDER_DATA" type="application/json">(.*?)</script',
                          response_video.text)
    # Decode the URL-encoded JSON
    if len(htmldata) > 0:
        htmldata = requests.utils.unquote(htmldata[0])
    else:
        # Handle the case when htmldata is empty:
        # print an error message and skip the current iteration
        print("Error: htmldata is empty")
        continue
    json_data = json.loads(htmldata)
    video_url1 = "https:" + json_data['44']['aweme']['detail']['video']['bitRateList'][0]['playAddr'][0]['src']
    video_content = requests.get(url=video_url1, headers=headers).content
    with open(zhang_hao + '/' + str(title) + '.mp4', 'wb') as f:
        f.write(video_content)
    i += 1
    print(f"Finished video {i}")
'''
```

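To run this program, Google Chrome must be installed on your computer. You also need a ChromeDriver executable whose version matches your installed version of Chrome. The code above obtains one through webdriver_manager (ChromeDriverManager().install()); if you prefer to manage the driver yourself, install ChromeDriver and add it to your system's PATH environment variable.

You can download the matching version of ChromeDriver from the following link: [https://sites.google.com/a/chromium.org/chromedriver/downloads]

After downloading it, follow the instructions for your operating system to add it to the PATH environment variable.

For example, on Windows you can add the directory containing the ChromeDriver executable to your PATH by running the following command at the Command Prompt: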

setx PATH "%PATH%;C:\path\to\chromedriver\directory"

Alternatively, place the driver in the Scripts folder of your Python installation directory.
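After changing PATH, it can help to confirm that the driver is actually discoverable. The sketch below is one way to check, using the standard library's `shutil.which`. The `find_driver` helper is a hypothetical name, and the executable name `chromedriver` is an assumption about how the file is named on your system.

```python
import shutil


def find_driver(name="chromedriver", search_path=None):
    """Return the full path of `name` if it is found on PATH, else None.

    `search_path` overrides the PATH environment variable when given.
    """
    return shutil.which(name, path=search_path)


location = find_driver()
if location is None:
    print("chromedriver was not found on PATH")
else:
    print(f"chromedriver found at {location}")
```

On Windows, a PATH change made with setx only applies to newly opened Command Prompt windows, so run the check from a fresh one.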


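Disclaimer

This program is for educational and research purposes only. Using it for any other purpose is prohibited. The author is not responsible for any damage or legal issues arising from its use. Users assume all responsibility and risk associated with using the program. By using this program, you agree to these terms and conditions.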
