
Image Crawling, Data Parsing, and Data Persistence

Published: 2025/3/21 · 豆豆


Contents

1. Image download
2. JS dynamic rendering
3. Data parsing
4. Persistent storage

1. Image download

Baidu Images: http://image.baidu.com/
Sogou Images: https://pic.sogou.com/

Image-crawling workflow:

1) Find the image download URL: inspect the Elements panel and capture requests in the Network panel.
2) Open the URL in the browser to verify it.
3) Write code to collect the URLs.
4) Request each URL and get the binary stream.
5) Write the binary stream to a file.

Baidu Images:

    import time

    import requests
    from lxml import etree
    from selenium import webdriver

    # Instantiate the browser object. This is the Selenium 3 style of call;
    # Selenium 4 replaced find_element_by_* with find_element(By.ID, ...) etc.
    browser = webdriver.Chrome('./chromedriver.exe')

    # Open the page and drive its elements to get the search results
    browser.get('http://image.baidu.com/')
    input_tag = browser.find_element_by_id('kw')
    input_tag.send_keys('熊二')
    search_button = browser.find_element_by_class_name('s_search')
    search_button.click()

    # Scroll down with JS so that more of the page source is loaded
    js = 'window.scrollTo(0, document.body.scrollHeight)'
    for times in range(3):
        browser.execute_script(js)
        time.sleep(3)
    html = browser.page_source

    # Parse the page to get the image links
    tree = etree.HTML(html)
    url_list = tree.xpath('//div[@id="imgid"]/div/ul/li/@data-objurl')
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}
    for img_url in url_list:
        # Skip links that carry a token
        if 'token' in img_url:
            continue
        content = requests.get(url=img_url, headers=headers).content
        # The ./baidupics directory must already exist
        with open('./baidupics/%s' % img_url.split('/')[-1], 'wb') as f:
            f.write(content)

Sogou Images:

    import re

    import requests

    url = 'http://pic.sogou.com/pics?'
    params = {'query': '熊二'}
    res = requests.get(url=url, params=params).text
    url_list = re.findall(r',"(https://i\d+piccdn\.sogoucdn\.com/.*?)"]', res)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}
    for img_url in url_list:
        print(img_url)
        content = requests.get(url=img_url, headers=headers).content
        # Use the last URL segment as the file name and append .jpg once;
        # the ./sougoupics directory must already exist
        name = img_url.split('/')[-1]
        with open('./sougoupics/%s.jpg' % name, 'wb') as f:
            f.write(content)
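Steps 4 and 5 of the workflow (turn a URL into a file name, then write the binary stream) can be isolated into two small helpers. This is a minimal sketch using only the standard library; the names `filename_from_url` and `save_binary` and the default `.jpg` extension are illustrative assumptions, not part of the original code.

```python
import os
from urllib.parse import unquote, urlsplit


def filename_from_url(img_url, default_ext=".jpg"):
    """Derive a file name from the last path segment of a URL.

    The query string is ignored, and default_ext (an assumption) is added
    when the segment has no extension of its own.
    """
    path = unquote(urlsplit(img_url).path)
    name = os.path.basename(path) or "image"
    _, ext = os.path.splitext(name)
    if not ext:
        name += default_ext
    return name


def save_binary(content, directory, name):
    """Write a binary stream to directory/name, creating the directory if needed."""
    os.makedirs(directory, exist_ok=True)
    target = os.path.join(directory, name)
    with open(target, "wb") as f:
        f.write(content)
    return target
```

With requests, the call would look like `save_binary(requests.get(img_url, headers=headers).content, './baidupics', filename_from_url(img_url))`. Creating the directory up front avoids the `FileNotFoundError` that `open` raises when the target folder is missing.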

2. JS dynamic rendering

1) Crawling with selenium: selenium is a testing framework that fully simulates a person operating the browser. The attribute to remember is page_source.

2) Basic syntax:

    from selenium import webdriver

    # Instantiate the browser object
    browser = webdriver.Chrome('browser driver path')  # in the current directory: './chromedriver.exe'

    # Visit the target URL
    browser.get(url)

    # Locate page elements:
    #   find_element_by_id: by the id attribute
    #   find_element_by_name: by the tag's name attribute
    #   find_element_by_class_name: by the class attribute
    #   find_element_by_xpath: by an XPath expression
    #   find_element_by_css_selector: by a CSS selector

    # Example: get the input box whose id is kw
    input_tag = browser.find_element_by_id('kw')

    # Type text into it
    input_tag.clear()
    input_tag.send_keys('喬碧蘿殿下')

    # Click a button
    button.click()

    # Execute JS code
    js = 'window.scrollTo(0, document.body.scrollHeight)'
    for i in range(3):
        browser.execute_script(js)

    # Get the HTML source. page_source is an attribute: no parentheses!
    html = browser.page_source  # str

Data parsing options:

1) XPath.
2) Regular expressions: writing the pattern plus using the re module.
3) BeautifulSoup: node selectors, method selectors, and CSS selectors.

Media files (video, images, archives, software installers):

1) Get the download link.
2) Request it. With requests, response.content is the binary stream; in the Scrapy framework, response.body is the binary stream.
3) Write the file:

    with open('./jdkfj/name', 'wb') as f:
        f.write(res.content)  # in Scrapy: response.body
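The `find_element_by_*` helpers used above belong to Selenium 3 and were removed in Selenium 4. A hedged sketch of the same open, type, click, scroll, read `page_source` flow with the Selenium 4 `By` locators follows. The function name `fetch_rendered_html` and its parameters are illustrative assumptions; the element id `kw` and class `s_search` come from the Baidu example above and may not match the live page.

```python
import time

# Same scroll snippet as in the notes above
SCROLL_JS = "window.scrollTo(0, document.body.scrollHeight)"


def fetch_rendered_html(url, keyword, scrolls=3, pause=3.0):
    """Search for keyword on the page at url and return the rendered HTML."""
    # Imported inside the function so this module loads without selenium installed
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Selenium 4.6+ can locate a matching driver by itself
    browser = webdriver.Chrome()
    try:
        browser.get(url)
        input_tag = browser.find_element(By.ID, "kw")
        input_tag.clear()
        input_tag.send_keys(keyword)
        browser.find_element(By.CLASS_NAME, "s_search").click()
        # Scroll to the bottom a few times so lazily loaded content appears
        for _ in range(scrolls):
            browser.execute_script(SCROLL_JS)
            time.sleep(pause)
        return browser.page_source  # attribute, not a method
    finally:
        browser.quit()
```

A call such as `fetch_rendered_html('http://image.baidu.com/', '熊二')` would return a `str` that can be passed to `etree.HTML`. The `try`/`finally` ensures the browser is closed even when an element is not found.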

3. Data parsing

1. XPath

Coding flow:

    from lxml import etree

    # Instantiate the etree object
    tree = etree.HTML(res.text)

    # Call an XPath expression to extract data; the result is a list
    li_list = tree.xpath('xpath expression')

    # Nested queries: start the inner expression with ./ or .//
    for li in li_list:
        li.xpath('xpath expression')

Basic syntax:

    ./ : match downward from the current node
    .// : match at any position beneath the current node
    nodename : locate by node name
    nodename[@attributename="value"] : locate by attribute
    single attribute with several values: contains --> div[contains(@class, "item")]
    several attributes: and --> div[@class="item" and @name="divtag"]
    @attributename : extract the attribute value
    text() : extract the text

Selecting by position:

1) Index: indexes start at 1. tree.xpath('//div/ul/li[1]/text()') locates the first li tag.
2) last(): the last one; the second to last is last()-1.
   tree.xpath('//div/ul/li[last()]') locates the last one.
   tree.xpath('//div/ul/li[last()-1]') locates the second to last.
3) position(): tree.xpath('//div/ul/li[position()<4]') selects the first three.

The response object returned by the requests module:

    res.text --> text
    res.json() --> Python basic data types (a dict)
    res.content --> binary stream

2. BS4 basic syntax

Coding flow:

    from bs4 import BeautifulSoup

    # Instantiate the soup object
    soup = BeautifulSoup(res.text, 'lxml')

    # Locate nodes
    soup.select('CSS selector')

CSS selector syntax:

    id: #
    class: .
    soup.select('div > ul > li')  # single-level (child) selector
    soup.select('div li')         # multi-level (descendant) selector

Getting a node's attributes or text:

    tag.string : the tag's direct text; returns None when the tag contains other tags besides the text
    tag.get_text() : all the text inside the tag
    tag['attributename'] : the attribute value (a multi-valued attribute such as class is returned as a list)

3. Regular expressions and the re module

Groups and non-greedy matching: () --> 'dfkjd(kdf.*?dfdf)dfdf'

    <a href="https://www.baidu.com/kdjfkdjf.jpg">This is an a tag</a> --> '<a href="(https://www.baidu.com/.*?\.jpg)">'

Quantifiers:

    +     : match 1 or more times
    *     : match 0 or more times
    {m}   : match exactly m times
    {m,n} : match m to n times
    {m,}  : match at least m times
    {,n}  : match at most n times

re module: re.findall('regular expression', res.text) --> list
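The positional predicates and the non-greedy group described above can be tried without any network access. The sketch below swaps lxml for the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath (`[1]`, `[last()]`, `[last()-1]`, and exact attribute matches, but not `contains()`, `text()`, or `position()<4`). The sample markup and the helper name `texts` are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup
SNIPPET = """<div>
  <ul>
    <li class="item">first</li>
    <li class="item">second</li>
    <li class="item last">third</li>
  </ul>
</div>"""

root = ET.fromstring(SNIPPET)


def texts(path):
    """Return the text of every element matched by an ElementTree path."""
    return [node.text for node in root.findall(path)]


first = texts("./ul/li[1]")               # indexes start at 1
last = texts("./ul/li[last()]")           # the last li
second_last = texts("./ul/li[last()-1]")  # the second to last li
exact = texts(".//li[@class='item']")     # exact match on the attribute value

# Two links on one line, to show what non-greedy matching changes
LINKS = ('<a href="https://www.baidu.com/a.jpg">A</a>'
         '<a href="https://www.baidu.com/b.jpg">B</a>')

# .*? stops at the first ".jpg", so each link is captured separately
non_greedy = re.findall(r'<a href="(https://www\.baidu\.com/.*?\.jpg)">', LINKS)

# .* runs to the last ".jpg", so both links collapse into one capture
greedy = re.findall(r'<a href="(https://www\.baidu\.com/.*\.jpg)">', LINKS)
```

Note that `@class='item'` is an exact comparison, so the `li` whose class is `item last` is not matched; in lxml that case is what `contains(@class, "item")` is for.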

4. Persistent storage

In the snippets below, title, joke and comment are lists produced by an earlier parsing step.

1. txt

    # Write to a txt file
    if title and joke and comment:
        with open('qbtxt.txt', 'a', encoding='utf-8') as txtfile:
            txtfile.write('&'.join([title[0], joke[0], comment[0]]))
            txtfile.write('\n')
            txtfile.write('********************************************\n')

2. json

    import json

    # Write to a json file
    dic = {'title': title[0], 'joke': joke[0], 'comment': comment[0]}
    with open('jsnfile.json', 'a', encoding='utf-8') as jsonfile:
        jsonfile.write(json.dumps(dic, indent=4, ensure_ascii=False))
        jsonfile.write(',' + '\n')

3. csv

    import csv

    # Write to a CSV file; newline='' keeps the csv module from emitting blank rows on Windows
    with open('csvfile.csv', 'a', encoding='utf-8', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=' ')
        writer.writerow([title[0], joke[0], comment[0]])

    # Scrapy framework (settings.py)
    FEED_URI = 'file:///home/eli/Desktop/qtw.csv'
    FEED_FORMAT = 'CSV'
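Appending `json.dumps(...)` followed by a comma, as in the json snippet above, produces a file that `json.load` cannot read back as a whole. One common alternative, sketched here as an assumption rather than the author's method, is JSON Lines: one object per line, parsed line by line. The helper names and file names are illustrative.

```python
import csv
import json


def append_jsonl(path, record):
    """Append one record as a single JSON line (JSON Lines format)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read every record back from a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def append_csv(path, row):
    """Append one row; newline='' lets the csv module control line endings."""
    with open(path, "a", encoding="utf-8", newline="") as f:
        csv.writer(f).writerow(row)
```

For example, `append_jsonl('jokes.jsonl', {'title': title[0], 'joke': joke[0], 'comment': comment[0]})` can be called once per scraped item, and `read_jsonl('jokes.jsonl')` returns the items as a list of dicts. The default comma delimiter is used in `append_csv` so that fields containing spaces stay unambiguous.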
