Python + selenium: Scraping Taobao Product Listings and Product Reviews (2021-08-26)

Published: 2024/1/1 python 27 豆豆

Contents
- Logging in to Taobao
- Getting the product list
- Getting review information
- Saving to the database
- Reminders

Overview

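This post uses Python 3.8 and selenium to drive Chrome and scrape Taobao product listings and product reviews.
The following problems remain:
- A person has to scan a QR code to log in, which is how the script gets past the anti-scraping checks (to be improved later).
- Scraping reviews is slow, because the reviews can only be read once the page has finished loading. Product detail pages carry different numbers of images, so load times vary, and some pages take 1-2 min (possibly because the company network throttles shopping sites).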

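Overall approach:
- Log in to Taobao by scanning a QR code, which bypasses the anti-scraping checks
- Search by keyword and collect the product list information
- Visit each product detail page in turn and collect its reviews
- Convert to a DataFrame and save to the database (for reviews, save once every 10 products)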
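
The "save once every 10 products" step can be sketched as a small batching helper. This is a minimal sketch under stated assumptions, not the code used in this post: flush_in_batches and the save callback are hypothetical names, and save stands in for whatever writes one batch to the database. The sketch also writes out the final partial batch, which is easy to forget.

```python
def flush_in_batches(items, save, batch_size=10):
    """Call save(batch) for every full batch and once more for the remainder.

    `items` is any iterable; `save` is a caller-supplied (hypothetical)
    function that persists one list of items. Returns the number of flushes.
    """
    batch = []
    flushes = 0
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            save(batch)
            flushes += 1
            batch = []  # start a new list so saved batches are not mutated
    if batch:  # leftover items that did not fill a whole batch
        save(batch)
        flushes += 1
    return flushes


if __name__ == "__main__":
    saved = []
    count = flush_in_batches(range(23), saved.append)
    print(count, [len(b) for b in saved])
```

Keeping the flush logic separate from the scraping loop makes it possible to check the batching on plain lists before pointing it at a real database.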

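Logging in to Taobao

There are two main ways to log in to Taobao through selenium. One is to put the account name and password in the code and simulate the slider captcha to get past the anti-scraping check. I find that unreliable: I started with that approach and it got my account locked. So the approach I use now is to open the Alipay login page and scan the QR code, which logs in to Taobao indirectly. This avoids the slider verification, and so far it has held up.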

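from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# `browser` is the Chrome driver created with the options shown in the last section.

def loginTB(item):
    # item is the keyword to search for on Taobao
    browser.get('https://auth.alipay.com/login/index.htm?loginScene=7&goto=https%3A%2F%2Fauth.alipay.com%2Flogin%2Ftaobao_trust_login.htm%3Ftarget%3Dhttps%253A%252F%252Flogin.taobao.com%252Fmember%252Falipay_sign_dispatcher.jhtml%253Ftg%253Dhttps%25253A%25252F%25252Fwww.taobao.com%25252F&params=VFBMX3JlZGlyZWN0X3VybD1odHRwcyUzQSUyRiUyRnd3dy50YW9iYW8uY29tJTJG')
    # Explicit wait (up to 180 s) for the search box; this is the time available to scan the QR code
    wait = WebDriverWait(browser, 180)
    wait.until(EC.presence_of_element_located((By.ID, 'q')))
    # Find the search box, type the keyword, and click the search button
    text_input = browser.find_element(By.ID, 'q')
    text_input.send_keys(item)
    btn = browser.find_element(By.XPATH, '//*[@id="J_TSearchForm"]/div[1]/button')
    btn.click()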
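
The long login URL is easier to read once its nested redirect parameters are unpacked. The sketch below uses only the standard library to decode the goto, target, and tg parameters one level at a time. redirect_chain is a hypothetical helper name, and the trailing params argument of the original URL is left out for brevity. It only shows what the URL encodes; it makes no claim about how the site processes the redirects.

```python
from urllib.parse import urlsplit, parse_qs

# Login URL from loginTB, without the trailing `params` argument.
LOGIN_URL = (
    "https://auth.alipay.com/login/index.htm?loginScene=7"
    "&goto=https%3A%2F%2Fauth.alipay.com%2Flogin%2Ftaobao_trust_login.htm"
    "%3Ftarget%3Dhttps%253A%252F%252Flogin.taobao.com%252Fmember"
    "%252Falipay_sign_dispatcher.jhtml%253Ftg%253Dhttps%25253A%25252F"
    "%25252Fwww.taobao.com%25252F"
)


def redirect_chain(url, keys=("goto", "target", "tg")):
    """Follow nested redirect parameters and return each URL in order."""
    chain = [url]
    for key in keys:
        query = parse_qs(urlsplit(chain[-1]).query)
        if key not in query:
            break
        chain.append(query[key][0])  # parse_qs percent-decodes one level
    return chain


if __name__ == "__main__":
    for hop in redirect_chain(LOGIN_URL):
        parts = urlsplit(hop)
        print(parts.netloc, parts.path)
```

Each hop is percent-encoded one more time than the one that wraps it, which is why the innermost address shows up as %25253A in the full URL.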

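Getting the product list

Two functions are used together: one turns the pages and calls the other, which extracts the product information from each result page.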

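import time
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.common.action_chains import ActionChains

def get_TB_data():
    page_index = 1
    data_list = []
    while page_index > 0:
        print("=================== Scraping page {} ===================".format(page_index))
        print("Current page URL: " + browser.current_url)
        # Parse the current result page
        data_list += get_item_list(browser.page_source)
        # Explicit wait for the pagination links
        wait = WebDriverWait(browser, 60)
        try:
            wait.until(EC.presence_of_element_located((By.XPATH, '//a[@class="J_Ajax num icon-tag"]')))
        except TimeoutException:
            # The original listing shows no handler for this wait; stopping here is an assumption
            print("Pagination did not load, stopping.")
            break
        time.sleep(1)
        try:
            # Use an action chain to scroll to the "next page" button
            write = browser.find_element(By.XPATH, '//li[@class="item next"]')
            ActionChains(browser).move_to_element(write).perform()
        except NoSuchElementException:
            print("Scraping finished!")
            page_index = 0
            break
        time.sleep(2)
        ActionChains(browser).move_to_element(write).click(write).perform()
        page_index += 1
    return data_list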

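This returns a list of dicts, one per product; the list is converted to a DataFrame at the end.
Note that shop_info = {} must be created inside the loop. A Python list stores references to objects, so reusing a single dict would leave every entry in the list pointing at the same object, holding only the last product.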

import json
from lxml import etree

def get_item_list(data):
    xml = etree.HTML(data)
    product_names = xml.xpath('//img[@class="J_ItemPic img"]/@alt')
    prices = xml.xpath('//div[@class="price g_price g_price-highlight"]/strong/text()')
    shop_names = xml.xpath('//div[@class="shop"]/a/span[last()]/text()')
    detail_urls = xml.xpath('//div[@class="pic"]/a/@href')
    sales_volumes = xml.xpath('//div[@class="deal-cnt"]/text()')
    addresses = xml.xpath('//div[@class="location"]/text()')
    data_list = []
    for i in range(len(product_names)):
        shop_info = {}  # a new dict for every product
        shop_info['item_name'] = product_names[i]
        shop_info['price'] = prices[i]
        shop_info['shop_name'] = shop_names[i]
        shop_info['salse_volume'] = sales_volumes[i]  # key spelling kept from the original
        shop_info['address'] = addresses[i]
        shop_info['item_url'] = detail_urls[i]
        # Also append each product to a JSON-lines file
        with open('shop_data.json', 'a', encoding='utf-8') as f:
            f.write(json.dumps(shop_info, ensure_ascii=False) + '\n')
        data_list.append(shop_info)
        print('Scraping product %s' % (i + 1))
        print('Product name: %s' % product_names[i])
        print('Unit price: %s' % prices[i])
        print('Shop name: %s' % shop_names[i])
        print('Total sold: %s' % sales_volumes[i])
        print("-" * 30)
    return data_list
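
The warning about creating shop_info inside the loop comes down to list.append storing a reference, not a copy. The two functions below are a minimal illustration with hypothetical names (build_shared, build_fresh), not part of the scraper: the first reuses one dict, the second creates a new dict per iteration as get_item_list does.

```python
def build_shared(names):
    """Buggy pattern: one dict created outside the loop and reused."""
    info = {}
    rows = []
    for name in names:
        info['item_name'] = name
        rows.append(info)  # appends another reference to the same dict
    return rows


def build_fresh(names):
    """Pattern used in get_item_list: a new dict on every iteration."""
    rows = []
    for name in names:
        info = {}
        info['item_name'] = name
        rows.append(info)
    return rows


if __name__ == "__main__":
    names = ['a', 'b', 'c']
    print(build_shared(names))  # every entry shows the last name
    print(build_fresh(names))   # one distinct entry per name
```

With the shared dict, all three list entries are the same object, so the last assignment overwrites what the earlier entries appear to hold.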

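Getting review information

Again there are two functions: one extracts the review information, and one acts as the controller (visiting each product detail page in turn and paging through its reviews).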

import random
import datetime as dt
import pandas as pd

def get_comment(data_list):
    comment_dic = {}
    for i in range(len(data_list)):
        comment_list = []
        time.sleep(1)
        print('About to scrape reviews for product %s' % (i + 1))
        z = 1
        while z == 1:
            try:
                # Protocol-relative links ("//...") need a scheme
                if data_list[i]['item_url'][0] == '/':
                    browser.get('https:' + data_list[i]['item_url'])
                else:
                    browser.get(data_list[i]['item_url'])
                time.sleep(3)
                # Scroll a little, by a slightly random amount
                browser.execute_script('window.scrollTo(0,' + str(100 + random.random() * 30) + ')')
                # Open the reviews tab
                browser.find_element(By.XPATH, '//div[@id="J_TabBarBox"]/ul/li[2]/a').click()
                comment_list = get_comment_info(browser.page_source)
                time.sleep(1)
                # Page through the reviews until there is no "next page" link
                while True:
                    try:
                        # "下一页>>" is the link text on the site ("next page")
                        next_page = browser.find_element(By.XPATH, '//div[@class="rate-page"]/div[@class="rate-paginator"]//a[contains(text(),"下一页>>")]')
                        browser.execute_script("arguments[0].click();", next_page)
                        time.sleep(1)  # give the next page of reviews time to render
                        comment_list += get_comment_info(browser.page_source)
                    except NoSuchElementException:
                        z = 0
                        break
            except Exception:
                # Any other failure: give up on this product and move on
                break
        comment_dic[data_list[i]['item_name']] = comment_list
        # Save after every 10 products, and once more after the last product
        if (i + 1) % 10 == 0 or i == len(data_list) - 1:
            comment_df = pd.DataFrame(columns=('user_name', 'comment', 'com_time', 'com_add', 'item_name', 'insert_time'))
            for item_name, comments in comment_dic.items():
                comment_tmp = pd.DataFrame(comments)
                comment_tmp['item_name'] = item_name
                comment_tmp['insert_time'] = dt.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
                comment_df = pd.concat([comment_df, comment_tmp])
            data2mysql(comment_df, 'comment_list')
            comment_dic = {}
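
get_comment prepends https: when item_url starts with a slash, that is, when the scraped link is protocol-relative. The same normalization can be sketched with urllib.parse.urljoin from the standard library, which resolves such links against a base URL and leaves absolute URLs unchanged. absolute_item_url is a hypothetical helper name, and the item URLs in the example are made up for illustration.

```python
from urllib.parse import urljoin


def absolute_item_url(item_url, base="https://www.taobao.com/"):
    """Return an absolute URL for a link scraped from a search result page.

    Protocol-relative links ("//host/path") take the scheme of `base`;
    links that are already absolute are returned unchanged.
    """
    return urljoin(base, item_url)


if __name__ == "__main__":
    # Made-up example links, for illustration only
    print(absolute_item_url("//detail.tmall.com/item.htm?id=1"))
    print(absolute_item_url("https://item.taobao.com/item.htm?id=2"))
```

Compared with checking the first character by hand, this also copes with an empty string, which would raise an IndexError in the [0] test.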

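This function extracts the reviews from a page; a product with no reviews is skipped.
Whether a review has a follow-up review changes the XPath, so watch out for that.
Also note that emoji in the review text cause problems when the data is written to the database, so they have to be stripped out.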

import re

def get_comment_info(text):
    source = etree.HTML(text)
    user_name = re.findall('<div class="rate-user-info">(.*?)</div>', text)
    com_list = []  # stays empty when the product has no reviews
    if len(user_name) > 0:
        info_list = source.xpath('//div[@class="rate-grid"]/table/tbody/tr')
        for i in range(len(info_list)):
            item = {}
            item['user_name'] = user_name[i].replace('<span>', '').replace('</span>', '')
            if info_list[i].xpath('./td[1]/div[@class="tm-rate-premiere"]'):
                # Review with a follow-up: first review in tm-rate-premiere, follow-up in tm-rate-append
                item['comment'] = info_list[i].xpath('./td[1]/div[@class="tm-rate-premiere"]//div[@class="tm-rate-content"]/div[@class="tm-rate-fulltxt"]/text()')[0]
                item['com_time'] = info_list[i].xpath('./td[1]/div[@class="tm-rate-premiere"]/div[@class="tm-rate-tag"]//div[@class="tm-rate-date"]/text()')[0]
                item['com_add'] = info_list[i].xpath('./td[1]/div[@class="tm-rate-append"]//div[@class="tm-rate-content"]/div[@class="tm-rate-fulltxt"]/text()')[0]
            else:
                # Review without a follow-up
                item['comment'] = info_list[i].xpath('./td[1]/div[@class="tm-rate-content"]/div[@class="tm-rate-fulltxt"]/text()')[0]
                item['com_time'] = info_list[i].xpath('./td[1]/div[@class="tm-rate-date"]/text()')[0]
                item['com_add'] = ''
            # Drop characters that GBK cannot encode, which removes emoji
            item['comment'] = str(item['comment'].encode('gbk', 'ignore').decode('gbk'))
            item['comment'] = item['comment'].replace(' ', '')
            print('Scraped a review')
            print('User name: %s' % item['user_name'])
            print('Review time: %s' % item['com_time'])
            print('Review text: %s' % item['comment'])
            print('Follow-up review: %s' % item['com_add'])
            print("-" * 30)
            com_list.append(item)
    else:
        print('This product has no reviews')
    return com_list
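
The emoji problem most likely comes from MySQL's utf8 character set, which stores at most three bytes per character, while many emoji need four bytes in UTF-8. The sketch below shows two ways to strip them; both function names are hypothetical. gbk_round_trip is the technique used in get_comment_info, and it drops every character GBK cannot encode, not only emoji. strip_non_bmp is a narrower alternative that removes only characters outside the Basic Multilingual Plane, which are exactly the four-byte ones. Switching the table and the connection to utf8mb4 would be a third option, assuming the database can be changed.

```python
import re

# Characters outside the Basic Multilingual Plane take 4 bytes in UTF-8.
_NON_BMP = re.compile(r'[\U00010000-\U0010FFFF]')


def strip_non_bmp(text):
    """Remove only the 4-byte characters (most emoji live here)."""
    return _NON_BMP.sub('', text)


def gbk_round_trip(text):
    """Technique from get_comment_info: drop anything GBK cannot encode."""
    return text.encode('gbk', 'ignore').decode('gbk')


if __name__ == "__main__":
    sample = "good \U0001F44D item"
    print(repr(strip_non_bmp(sample)))
    print(repr(gbk_round_trip(sample)))
```

The regex version keeps text in other scripts intact, which matters if reviews can contain characters that GBK lacks.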

Saving to the database

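from sqlalchemy import create_engine

def data2mysql(df, table_name):
    # xxxxx is the placeholder for the database password
    engine = create_engine('mysql+pymysql://root:xxxxx@localhost:3306/selenium_taobao_pachong?charset=utf8')
    # Store every column as text
    df = df.astype(str)
    # Append to the table, creating it if it does not exist yet
    df.to_sql(name=table_name, con=engine, if_exists='append', index=False)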
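
To try the append-only save without a MySQL server, the same idea can be sketched with the standard library's sqlite3 module. This deliberately swaps in a different technique: it uses neither pandas nor MySQL, and append_rows is a hypothetical helper name. Like data2mysql, it converts every value to a string and appends to the table, creating the table on first use. Table and column names are interpolated into the SQL, so they must come from the script itself and never from scraped data.

```python
import sqlite3


def append_rows(conn, table_name, rows):
    """Append a list of dicts to table_name; returns the number of rows written.

    The table is created from the keys of the first row if it is missing.
    Identifiers are interpolated, so pass only trusted names.
    """
    if not rows:
        return 0
    columns = list(rows[0])
    col_sql = ", ".join('"%s"' % c for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute('CREATE TABLE IF NOT EXISTS "%s" (%s)' % (table_name, col_sql))
    conn.executemany(
        'INSERT INTO "%s" (%s) VALUES (%s)' % (table_name, col_sql, placeholders),
        [tuple(str(row[c]) for c in columns) for row in rows],
    )
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    append_rows(conn, "comment_list", [{"user_name": "u1", "comment": "ok"}])
    append_rows(conn, "comment_list", [{"user_name": "u2", "comment": "fine"}])
    print(conn.execute('SELECT COUNT(*) FROM "comment_list"').fetchone()[0])
```

Values go through ? placeholders, so review text containing quotes cannot break the statement.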

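Reminders

If the anti-scraping system has locked you out, you can try turning off Chrome's automation switch and its automation-detection feature, as in the options below. If that still does not work, you would have to modify the Chrome driver program itself, which seems hard to do on Windows. This is also why I chose to log in by scanning a QR code and use sleep liberally to slow things down.
To be fair, Taobao's engineering is pretty solid.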

chrome_options = webdriver.ChromeOptions()
# Remove the "enable-automation" switch
chrome_options.add_experimental_option("excludeSwitches", ['enable-automation'])
# Disable the Blink feature that marks the browser as automation-controlled
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
browser = webdriver.Chrome(options=chrome_options)
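
The advice to use sleep liberally can be combined with a little randomness so that requests are not evenly spaced. This is a minimal sketch with hypothetical helper names (jittered_delay, polite_sleep); the default values are illustrative and not tuned against any site.

```python
import random
import time


def jittered_delay(base, jitter, rng=random):
    """Return a delay in seconds drawn uniformly from [base, base + jitter]."""
    return base + rng.uniform(0, jitter)


def polite_sleep(base=1.0, jitter=2.0):
    """Sleep for a randomized interval and return how long the sleep was."""
    delay = jittered_delay(base, jitter)
    time.sleep(delay)
    return delay


if __name__ == "__main__":
    print("slept %.2f s" % polite_sleep(0.1, 0.2))
```

A call such as polite_sleep(1, 2) could take the place of a fixed time.sleep between page loads.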
