Web Crawler Fundamentals and an Analysis of Request and Response

Published: 2023/12/31


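I. Web Crawlers

The internet is made up of network equipment (cables, routers, switches, firewalls, and so on) and individual computers, all linked together like a web.

The core value of the internet lies in sharing and transferring data. Data is stored on individual computers, and the point of connecting those computers is to make it easy for them to exchange data with one another; otherwise, data could only be copied around on USB drives.

Data is the most valuable thing on the internet. A crawler simulates a browser sending a request -> downloads the page source -> extracts only the useful data -> stores it in a database or file, which gives us the data we need.

A crawler is a program that sends requests to a website, then parses the resources it gets back and extracts the useful data.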

II. The Crawling Workflow

1. Overview of the basic workflow

Send request -----> Get response content -----> Parse content -----> Save data

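#1. Send the request
Use an HTTP library to send a request (a Request) to the target site.
A Request contains: request headers, request body, etc.

#2. Get the response content
If the server responds normally, you get a Response.
A Response may contain: HTML, JSON, images, video, etc.

#3. Parse the content
HTML data: regular expressions, or third-party parsing libraries such as BeautifulSoup and pyquery
JSON data: the json module
Binary data: write it to a file opened in binary ("b") mode

#4. Save the data
Database
File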
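The parse and save steps above can be sketched without touching the network. This is only a minimal illustration using the standard library: `SAMPLE_HTML`, `parse_links`, and `save_records` are hypothetical names, and the hard-coded HTML stands in for the text a real crawler would get from an HTTP response.

```python
import json
import os
import re
import tempfile

# Stand-in for a downloaded page (assumption: a real crawler would obtain
# this text from an HTTP response instead of a hard-coded string).
SAMPLE_HTML = """
<ul>
  <li class="item"><a href="/post/1">First post</a></li>
  <li class="item"><a href="/post/2">Second post</a></li>
</ul>
"""


def parse_links(html):
    """Step 3: extract (href, title) pairs with a regular expression."""
    pattern = r'class="item"><a href="(.*?)">(.*?)</a>'
    return [{"url": href, "title": title}
            for href, title in re.findall(pattern, html, re.S)]


def save_records(records, path):
    """Step 4: persist the extracted records as a JSON file."""
    with open(path, "w", encoding="utf8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


records = parse_links(SAMPLE_HTML)
output_path = os.path.join(tempfile.gettempdir(), "links.json")
save_records(records, output_path)
print(records)
```

A regular expression is enough for markup this regular; for real pages, a parsing library such as BeautifulSoup is usually less brittle.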

2. Request

Common request methods: GET, POST

Other request methods: HEAD, PUT, DELETE, OPTIONS

>>> import requests
>>> r = requests.get('https://api.github.com/events')
>>> r = requests.post('http://httpbin.org/post', data={'key': 'value'})
>>> r = requests.put('http://httpbin.org/put', data={'key': 'value'})
>>> r = requests.delete('http://httpbin.org/delete')
>>> r = requests.head('http://httpbin.org/get')
>>> r = requests.options('http://httpbin.org/get')

Fetching a Baidu search results page:

import requests

# Simulate searching Baidu for "美女" and fetching the first results page;
# the "wd" parameter carries the search term.
# A custom User-Agent header gets past the site's basic anti-crawler check.
response = requests.get(
    "https://www.baidu.com/s",
    params={"wd": "美女", "a": 1},
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36"
    },
)
print(response.status_code)
print(response.text)

# Write the downloaded page to bd.html; opened in a browser, it looks the same as the Baidu page.
with open("bd.html", "w", encoding="utf8") as f:
    f.write(response.text)
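To see roughly what happens to the `params` dictionary, the sketch below builds the final URL by hand with the standard library. `build_url` is a hypothetical helper, not part of requests, and the search term is just an example; it only illustrates how query parameters are percent-encoded and appended to the URL.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_url(base, params):
    """Append URL-encoded query parameters to a base URL."""
    return base + "?" + urlencode(params)


url = build_url("https://www.baidu.com/s", {"wd": "爬虫", "a": 1})
print(url)

# Decoding the query string again recovers the original values.
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```

Non-ASCII characters are encoded as UTF-8 bytes and then percent-escaped, which is why the search term does not appear literally in the URL.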

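3. Response

# 1. Response status

200: success

301: redirect (moved permanently)

404: the requested resource does not exist

403: forbidden (no permission)

502: bad gateway (a server-side error)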
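The codes listed above fall into classes by their first digit. As a small sketch, the standard library's `http.HTTPStatus` can supply the official reason phrase; `describe_status` is a hypothetical helper written for this illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label for an HTTP status code based on its class."""
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "informational"
    return f"{code} {phrase} ({kind})"


for code in (200, 301, 403, 404, 502):
    print(describe_status(code))
```

A crawler typically branches on the class: retry later on 5xx, give up on 4xx, and follow the `Location` header on 3xx.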

# 2. Response headers

Location: the redirect target

Set-Cookie: may appear multiple times; it tells the browser to store the cookie
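Since `Set-Cookie` can appear several times in one response, a client has to collect every occurrence. The sketch below parses a few made-up header values with the standard library's `http.cookies.SimpleCookie`; the header strings and the `parse_set_cookie` helper are assumptions for illustration only.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header values, one per header line in the response.
raw_headers = [
    "sessionid=abc123; Path=/; HttpOnly",
    "theme=dark; Max-Age=3600",
]


def parse_set_cookie(header_values):
    """Collect cookie names and values from several Set-Cookie headers."""
    jar = {}
    for value in header_values:
        cookie = SimpleCookie()
        cookie.load(value)
        for name, morsel in cookie.items():
            jar[name] = morsel.value
    return jar


cookies = parse_set_cookie(raw_headers)
print(cookies)
```

With requests this bookkeeping is done for you: the parsed cookies are available on `response.cookies`.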

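# 3. Preview (the response body, i.e. the page source)

This is the most important part: it contains the content of the requested resource,

such as the page HTML, images, or other binary data.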

# 4. Response attributes

import requests

response = requests.get('http://www.jianshu.com')

# Response attributes
print(response.text)                 # response body decoded as text
print(response.content)              # raw response body as bytes (use this for images, video, etc.)
print(response.status_code)          # response status code
print(response.headers)              # response headers
print(response.cookies)              # cookies
print(response.cookies.get_dict())
print(response.cookies.items())
print(response.url)
print(response.history)              # redirect history (the page was not returned in a single hop)
print(response.encoding)             # character encoding of the response

# Closing the connection: response.close()
from contextlib import closing

with closing(requests.get('xxx', stream=True)) as response:
    for line in response.iter_content():
        pass

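# 5. Downloading large files

# stream=True: fetch the body piece by piece. For very large resources,
# loading everything into memory and writing it out in one go is unreasonable.
import requests

response = requests.get('https://gss3.baidu.com/6LZ0ej3k1Qd3ote6lo7D0j9wehsv/tieba-smallvideo-transcode/1767502_56ec685f9c7ec542eeaf6eac93a65dc7_6fe25cd1347c_3.mp4', stream=True)
with open('b.mp4', 'wb') as f:
    # iter_content yields the binary stream in chunks; the default chunk size is 1 byte,
    # so pass a larger chunk_size to avoid writing one byte at a time.
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)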
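The idea behind streaming is simply to move a fixed-size chunk at a time so that memory use stays bounded. Here is a minimal offline sketch of that loop; `copy_in_chunks` is a hypothetical helper and the in-memory `io.BytesIO` objects stand in for the network response and the output file.

```python
import io


def copy_in_chunks(source, dest, chunk_size=8192):
    """Copy from a readable binary stream to a writable one, chunk by chunk."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:          # an empty read means the stream is exhausted
            break
        dest.write(chunk)
        total += len(chunk)
    return total


source = io.BytesIO(b"x" * 20000)   # stand-in for a large download
dest = io.BytesIO()                 # stand-in for the output file
written = copy_in_chunks(source, dest)
print(written)
```

At no point does the loop hold more than `chunk_size` bytes of payload, regardless of how large the source is.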

III. Crawling Videos from xiaohuar.com (with Concurrency)

import requests         # install the module with: pip3 install requests
import re
import os
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(50)
movie_path = r'C:\mp4'


def get_page(url):
    # Download a page and return its text; returns None on failure.
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
    except Exception:
        pass


def parse_index(index_page):
    index_page = index_page.result()
    if not index_page:          # the download failed
        return
    # Find the link of every tag whose class is "items"; re.S lets "." match newlines too.
    urls = re.findall('class="items".*?href="(.*?)"', index_page, re.S)
    for detail_url in urls:
        if not detail_url.startswith('http'):
            detail_url = 'http://www.xiaohuar.com' + detail_url
        pool.submit(get_page, detail_url).add_done_callback(parse_detail)


def parse_detail(detail_page):
    detail_page = detail_page.result()
    if not detail_page:
        return
    # Find the video source address on the detail page.
    l = re.findall('id="media".*?src="(.*?)"', detail_page, re.S)
    if l:
        movie_url = l[0]
        if movie_url.endswith('mp4'):
            pool.submit(get_movie, movie_url)


def get_movie(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            # Derive a unique file name from the current time and the URL.
            m = hashlib.md5()
            m.update(str(time.time()).encode('utf-8'))
            m.update(url.encode('utf-8'))
            filepath = os.path.join(movie_path, '%s.mp4' % m.hexdigest())
            with open(filepath, 'wb') as f:     # video file, so write in binary mode
                f.write(response.content)
            print('%s downloaded successfully' % url)
    except Exception:
        pass


def main():
    base_url = 'http://www.xiaohuar.com/list-3-{page_num}.html'
    for i in range(5):
        url = base_url.format(page_num=i)
        pool.submit(get_page, url).add_done_callback(parse_index)


if __name__ == '__main__':
    main()
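The crawler above chains its stages with `submit(...).add_done_callback(...)`. The offline sketch below isolates that pattern; `fake_fetch` and `on_done` are hypothetical stand-ins for the download and parse functions, so nothing here touches the network.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Hypothetical stand-in for a download; returns fake page text."""
    return "<html>%s</html>" % url


results = []


def on_done(future):
    # The callback receives the Future object, not the return value,
    # so the result has to be unwrapped with .result().
    results.append(future.result())


with ThreadPoolExecutor(max_workers=4) as pool:
    for i in range(3):
        pool.submit(fake_fetch, "page-%d" % i).add_done_callback(on_done)

# Leaving the "with" block waits for all submitted tasks to finish.
print(sorted(results))
```

This is why `parse_index` and `parse_detail` start by calling `.result()` on their argument: the pool hands them a Future rather than the page text. Completion order is not guaranteed, hence the `sorted` call.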

IV. Simulating a Login to GitHub with a Crawler

import requests
import re

# Request 1: GET the login page to obtain the authenticity_token and the initial cookies
response1 = requests.get(
    "https://github.com/login",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36",
    },
)
authenticity_token = re.findall('name="authenticity_token".*?value="(.*?)"', response1.text, re.S)
r1_cookies = response1.cookies.get_dict()       # cookies obtained from the first request

# Request 2: POST the login form
response2 = requests.post(
    "https://github.com/session",
    headers={
        "Referer": "https://github.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36",
    },
    data={
        "commit": "Sign in",
        "utf8": "✓",
        "authenticity_token": authenticity_token,
        "login": "zzzzzzzz",
        "password": "xxxx",
    },
    cookies=r1_cookies,
)
print(response2.status_code)
print(response2.history)    # status codes of the redirects along the way

# Request 3: after a successful login, visit another page
# - Log in manually in a browser beforehand and check whether a Referer header is sent
# - Request the new URL and carry on with other operations
# - Check whether the username appears in the page
r2_cookies = response2.cookies.get_dict()       # carry the cookies, which represent the logged-in state
response3 = requests.get(
    "https://github.com/settings/emails",
    headers={
        "Referer": "https://github.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36",
    },
    cookies=r2_cookies,
)
print(response3.text)
print("zzzzzzzz" in response3.text)     # True means the login succeeded
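The first request exists mainly to pull a hidden form token out of the login page. The sketch below shows that extraction step on its own; `SAMPLE_LOGIN_HTML` is a made-up fragment (not GitHub's actual markup) and `extract_token` is a hypothetical helper that also handles the case where no token is found.

```python
import re

# Made-up login form for illustration; the real page's markup may differ.
SAMPLE_LOGIN_HTML = '''
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="tok-123" />
  <input type="text" name="login" />
</form>
'''


def extract_token(html):
    """Return the hidden authenticity_token value, or None if it is absent."""
    match = re.search(r'name="authenticity_token".*?value="(.*?)"', html, re.S)
    return match.group(1) if match else None


token = extract_token(SAMPLE_LOGIN_HTML)
print(token)
```

Checking for `None` before posting the form gives a clear failure point if the site changes its markup, instead of silently submitting an empty token.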

V. Advanced Usage

1、SSL Cert Verification

import requests

# cert=(path to the client certificate file, path to its private key file)
response = requests.get('https://www.12306.cn', cert=('/path/server.crt', '/path/key'))
print(response.status_code)

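2. Using proxies

# Official docs: http://docs.python-requests.org/en/master/user/advanced/#proxies
# Proxy setup: the request is sent to the proxy, which forwards it on your behalf
# (having your IP banned is a common occurrence).
import requests

proxies = {
    # Proxy with credentials; the part before "@" is username:password.
    # A dict can hold only one 'http' entry, so use either this line or the one below.
    # 'http': 'http://egon:123@localhost:9743',
    'http': 'http://localhost:9743',
    'https': 'https://localhost:9743',
}
response = requests.get('https://www.12306.cn', proxies=proxies)
print(response.status_code)

# SOCKS proxies are also supported; install with: pip install requests[socks]
import requests

proxies = {
    'http': 'socks5://user:pass@host:port',
    'https': 'socks5://user:pass@host:port',
}
response = requests.get('https://www.12306.cn', proxies=proxies)
print(response.status_code)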

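3. Timeouts

# Two forms of timeout: float or tuple
# timeout=0.1           # a single value applies to both the connect timeout and the read timeout
# timeout=(0.1, 0.2)    # 0.1 is the connect timeout, 0.2 is the read timeout
import requests

response = requests.get('https://www.baidu.com', timeout=0.0001)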

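4. Authentication

# Official docs: http://docs.python-requests.org/en/master/user/authentication/
# Authentication: when you log in to some sites, a dialog pops up asking for a username
# and password (much like an alert box); until you authenticate, the HTML cannot be retrieved.
# Under the hood, the credentials are simply assembled into a request header:
#         r.headers['Authorization'] = _basic_auth_str(self.username, self.password)
# Take a look at the default encoding scheme; most sites do not rely on this default.
import requests
from requests.auth import HTTPBasicAuth

r = requests.get('xxx', auth=HTTPBasicAuth('user', 'password'))
print(r.status_code)

# HTTPBasicAuth can be abbreviated as follows
import requests

r = requests.get('xxx', auth=('user', 'password'))
print(r.status_code)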
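To make the "assembled into a request header" remark concrete, here is a small sketch of how an HTTP Basic `Authorization` value is formed. `basic_auth_header` is a hypothetical helper written for this illustration, not the library's internal function.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic Authorization header."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


header = basic_auth_header("user", "password")
print(header)
```

Base64 is an encoding, not encryption: anyone who sees the header can decode the credentials, so Basic authentication is only reasonable over HTTPS.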

5. Exception handling

import requests
from requests.exceptions import *   # see requests.exceptions for the available exception types

try:
    r = requests.get('http://www.baidu.com', timeout=0.00001)
except ReadTimeout:
    print('===:')
# except ConnectionError:
#     print('-----network unreachable')
# except Timeout:
#     print('aaaaa')
except RequestException:
    print('Error')

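6. Uploading files

import requests

# Open the file in binary mode and close it once the upload finishes.
with open('a.jpg', 'rb') as f:
    files = {'file': f}
    response = requests.post('http://httpbin.org/post', files=files)
print(response.status_code)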

Reposted from: https://blog.51cto.com/qidian510/2134735
