

Common Crawler Libraries (1): Basic Usage of the Urllib Library in Python 3


Original article: https://www.cnblogs.com/0bug/p/8893677.html


What is Urllib?

Urllib is Python's built-in HTTP request library. It consists of four modules:

urllib.request — sends requests

urllib.error — handles exceptions

urllib.parse — parses URLs

urllib.robotparser — parses robots.txt files (see the sketch below)
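
The original post lists urllib.robotparser but never demonstrates it, so here is a minimal sketch; the robots.txt URL is only an assumed example:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://www.cnblogs.com/robots.txt')  # example URL, assumed for illustration
rp.read()  # download and parse robots.txt
# True if a crawler with user agent '*' is allowed to fetch the given URL
print(rp.can_fetch('*', 'http://www.cnblogs.com/0bug'))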

Changes compared with Python 2

In Python 3, Python 2's urllib2 was merged into urllib.request.

Python 2:

import urllib2

response = urllib2.urlopen('http://www.cnblogs.com/0bug')

Python 3:

import urllib.request

response = urllib.request.urlopen('http://www.cnblogs.com/0bug/')

urlopen()

Without the data argument the request is sent as a GET; with data it is sent as a POST.

import urllib.request

response = urllib.request.urlopen('http://www.cnblogs.com/0bug')
html = response.read().decode('utf-8')
print(html)  # prints the page's HTML

Passing data sends a POST request:

import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'hello': '0bug'}), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

httpbin.org echoes the request back as JSON, so the output includes the submitted form data.

The timeout parameter

import urllib.request

# 0.01 s is far too short for a real request, so this raises urllib.error.URLError
response = urllib.request.urlopen('http://www.cnblogs.com/0bug', timeout=0.01)
print(response.read())

The timeout can be handled explicitly:

import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://www.cnblogs.com/0bug', timeout=0.01)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('request timed out')

Responses

1. Response type

import urllib.request

response = urllib.request.urlopen('http://www.cnblogs.com/0bug')
print(type(response))  # <class 'http.client.HTTPResponse'>

2. Status code and response headers

import urllib.request

response = urllib.request.urlopen('http://www.cnblogs.com/0bug')
print(response.status)                      # status code, e.g. 200
print(response.getheaders())                # all headers as a list of (name, value) tuples
print(response.getheader('Content-Type'))   # a single header value

3. Response body

The response body is a byte stream, so it has to be decoded, e.g. with decode('utf-8'):

import urllib.request

response = urllib.request.urlopen('http://www.cnblogs.com/0bug')
html = response.read().decode('utf-8')
print(html)

Request

urlopen() also accepts a Request object, which is the hook for attaching headers and other metadata:

import urllib.request

request = urllib.request.Request('http://www.cnblogs.com/0bug')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

Adding request headers

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
    'Host': 'httpbin.org'
}
dic = {'name': '0bug'}
data = bytes(parse.urlencode(dic), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

add_header

Headers can also be attached one at a time with add_header():

from urllib import request, parse

url = 'http://httpbin.org/post'
dic = {'name': '0bug'}
data = bytes(parse.urlencode(dic), encoding='utf-8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent',
               'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Handler

Proxies:

import urllib.request

# replace the placeholder values with real proxy addresses, e.g. 'http://127.0.0.1:9743'
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://your-http-proxy:port',
    'https': 'http://your-https-proxy:port'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://www.cnblogs.com/0bug')
print(response.read())
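
As an aside not in the original post: if the opener should apply to every later urlopen() call instead of being used via opener.open(), it can be installed globally with install_opener() (the proxy address is again a placeholder):

import urllib.request

proxy_handler = urllib.request.ProxyHandler({'http': 'http://your-http-proxy:port'})  # placeholder
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)  # all subsequent urlopen() calls go through this opener
response = urllib.request.urlopen('http://www.cnblogs.com/0bug')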

Cookie

import http.cookiejar
import urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)

This prints every cookie the server set, one name=value pair per line.

Saving cookies to a file

import http.cookiejar
import urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

This writes cookie.txt in the Mozilla/Netscape cookie-file format.

Another storage format

import http.cookiejar
import urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

This writes cookie.txt in the LWP (libwww-perl) format instead.

Cookies must be loaded with the same jar class that saved them:

import http.cookiejar
import urllib.request

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

Exception handling

from urllib import request, error

try:
    response = request.urlopen('http://www.cnblogs.com/0bug/xxxx')
except error.URLError as e:
    print(e.reason)

The page does not exist, so this prints the failure reason (Not Found for a 404).
HTTPError is a subclass of URLError, so the more specific exception is caught first:

from urllib import request, error

try:
    response = request.urlopen('http://www.cnblogs.com/0bug/xxxx')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')
e.reason is not always a string; for a timeout it is a socket.timeout instance:

import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://www.cnblogs.com/0bug/xxxx', timeout=0.001)
except urllib.error.URLError as e:
    print(type(e.reason))  # <class 'socket.timeout'>
    if isinstance(e.reason, socket.timeout):
        print('request timed out')

URL parsing

from urllib.parse import urlparse

result = urlparse('www.baidu.com/index.html;user?id=5#comment')
print(type(result))  # <class 'urllib.parse.ParseResult'>
print(result)
# ParseResult(scheme='', netloc='', path='www.baidu.com/index.html',
#             params='user', query='id=5', fragment='comment')

The scheme argument supplies a default, used only when the URL itself has none:

from urllib.parse import urlparse

result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)  # scheme='https', everything else as above

A scheme in the URL takes precedence over the argument:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)  # scheme='http'

With allow_fragments=False the fragment is not split off; it stays attached to the query:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
print(result)  # query='id=5#comment', fragment=''

If there is no query either, the fragment stays attached to the path:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(result)  # path='/index.html#comment', fragment=''
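
A small addition to the original: ParseResult is a named tuple, so its parts can also be read as attributes or unpacked:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(result.scheme, result.netloc, result.path)  # http www.baidu.com /index.html
scheme, netloc, path, params, query, fragment = result  # plain tuple unpacking also works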

urlunparse

from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'id=6', 'comment']
print(urlunparse(data))  # http://www.baidu.com/index.html;user?id=6#comment

urljoin

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'ABC.html'))                        # http://www.baidu.com/ABC.html
print(urljoin('http://www.baidu.com', 'https://www.cnblogs.com/0bug'))    # https://www.cnblogs.com/0bug
print(urljoin('http://www.baidu.com/0bug', 'https://www.cnblogs.com/0bug'))
print(urljoin('http://www.baidu.com/0bug', 'https://www.cnblogs.com/0bug?q=2'))
print(urljoin('http://www.baidu.com/0bug?q=2', 'https://www.cnblogs.com/0bug'))
print(urljoin('http://www.baidu.com', '?q=2#comment'))    # http://www.baidu.com?q=2#comment
print(urljoin('www.baidu.com', '?q=2#comment'))           # www.baidu.com?q=2#comment
print(urljoin('www.baidu.com#comment', '?q=2'))           # www.baidu.com?q=2

The rule: when the second URL carries its own scheme and location, it wins outright; otherwise the missing components are filled in from the base URL.

urlencode

from urllib.parse import urlencode

params = {
    'name': '0bug',
    'age': 25
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)  # http://www.baidu.com?name=0bug&age=25
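
A complementary aside not in the original post: urllib.parse can also decode a query string back into Python values, with parse_qs (dict of lists) or parse_qsl (list of pairs):

from urllib.parse import parse_qs, parse_qsl

query = 'name=0bug&age=25'
print(parse_qs(query))   # {'name': ['0bug'], 'age': ['25']}
print(parse_qsl(query))  # [('name', '0bug'), ('age', '25')]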

Reposted from: https://www.cnblogs.com/yunlongaimeng/p/9802052.html
