N Ways to Fetch Web Resources with Python 3


I've spent the past couple of days learning how to fetch web resources with Python 3 and came across quite a few different approaches, so here is a short set of notes.

1. The simplest approach

import urllib.request

response = urllib.request.urlopen('http://python.org/')
html = response.read()
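Note that read() returns bytes, not str. A minimal sketch of decoding the body, assuming the server declares a charset in its Content-Type header (falling back to UTF-8 when it doesn't):

import urllib.request

response = urllib.request.urlopen('http://python.org/')
# HTTPResponse.headers is an email.message.Message, so we can ask it
# for the charset declared in the Content-Type header.
charset = response.headers.get_content_charset() or 'utf-8'
html = response.read().decode(charset)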

2. Using Request

import urllib.request

req = urllib.request.Request('http://python.org/')
response = urllib.request.urlopen(req)
the_page = response.read()

3. Sending data (POST)

#! /usr/bin/env python3

import urllib.parse
import urllib.request

url = 'http://localhost/login.php'
values = {'act': 'login',
          'login[email]': 'yzhang@i9i8.com',
          'login[password]': '123456'}

# urlencode() returns a str; in Python 3, POST data must be bytes.
data = urllib.parse.urlencode(values).encode('utf-8')
req = urllib.request.Request(url, data)
req.add_header('Referer', 'http://www.python.org/')
response = urllib.request.urlopen(req)
the_page = response.read()

print(the_page.decode('utf8'))
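Passing a data argument makes urlopen() send a POST. If the endpoint expects a GET instead, a minimal sketch (reusing the same hypothetical login endpoint) encodes the parameters into the URL's query string:

#! /usr/bin/env python3

import urllib.parse
import urllib.request

url = 'http://localhost/login.php'
values = {'act': 'login',
          'login[email]': 'yzhang@i9i8.com',
          'login[password]': '123456'}

# GET variant: the parameters go in the query string and no body is
# passed, so urlopen() issues a GET request.
query = urllib.parse.urlencode(values)
response = urllib.request.urlopen(url + '?' + query)
print(response.read().decode('utf8'))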

4. Sending data and headers

#! /usr/bin/env python3

import urllib.parse
import urllib.request

url = 'http://localhost/login.php'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'act': 'login',
          'login[email]': 'yzhang@i9i8.com',
          'login[password]': '123456'}
headers = {'User-Agent': user_agent}

# As above, POST data must be bytes in Python 3.
data = urllib.parse.urlencode(values).encode('utf-8')
req = urllib.request.Request(url, data, headers)
response = urllib.request.urlopen(req)
the_page = response.read()

print(the_page.decode('utf8'))

5. HTTP errors

#! /usr/bin/env python3

import urllib.request
import urllib.error  # needed for urllib.error.HTTPError

req = urllib.request.Request('http://www.python.org/fish.html')
try:
    urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
    print(e.code)
    print(e.read().decode('utf8'))
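An HTTPError also behaves like a response object, so besides the status code and body you can inspect the error page's headers. A minimal sketch:

#! /usr/bin/env python3

import urllib.request
import urllib.error

req = urllib.request.Request('http://www.python.org/fish.html')
try:
    urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
    # HTTPError doubles as a response: code, headers and body are all there.
    print(e.code)
    print(e.headers.get('Content-Type'))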

6. Exception handling, approach 1

#! /usr/bin/env python3

from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

req = Request('http://twitter.com/')
try:
    response = urlopen(req)
except HTTPError as e:
    # HTTPError is a subclass of URLError, so it must be caught first.
    print("The server couldn't fulfill the request.")
    print('Error code:', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason:', e.reason)
else:
    print('good!')
    print(response.read().decode('utf8'))

7. Exception handling, approach 2

#! /usr/bin/env python3

from urllib.request import Request, urlopen
from urllib.error import URLError

req = Request('http://twitter.com/')
try:
    response = urlopen(req)
except URLError as e:
    # Check for 'code' first: an HTTPError carries both 'code' and
    # 'reason', so testing 'reason' first would shadow the HTTP case.
    if hasattr(e, 'code'):
        print("The server couldn't fulfill the request.")
        print('Error code:', e.code)
    elif hasattr(e, 'reason'):
        print('We failed to reach a server.')
        print('Reason:', e.reason)
else:
    print('good!')
    print(response.read().decode('utf8'))

8. HTTP authentication

#! /usr/bin/env python3

import urllib.request

# create a password manager
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

# Add the username and password.
# If we knew the realm, we could use it instead of None.
top_level_url = 'https://cms.tetx.com/'
password_mgr.add_password(None, top_level_url, 'yzhang', 'cccddd')

handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

# create "opener" (OpenerDirector instance)
opener = urllib.request.build_opener(handler)

# use the opener to fetch a URL
a_url = 'https://cms.tetx.com/'
x = opener.open(a_url)
print(x.read())

# Install the opener.
# Now all calls to urllib.request.urlopen use our opener.
urllib.request.install_opener(opener)

a = urllib.request.urlopen(a_url).read().decode('utf8')
print(a)

9. Using a proxy

#! /usr/bin/env python3

import urllib.request

# ProxyHandler understands http/https proxies; the original 'sock5' key
# is not a scheme urllib recognizes. For a SOCKS5 proxy you would need
# third-party support (e.g. PySocks); shown here with an HTTP proxy.
proxy_support = urllib.request.ProxyHandler({'http': 'http://localhost:1080'})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)

a = urllib.request.urlopen('http://g.cn').read().decode('utf8')
print(a)
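install_opener() makes the proxy global for the whole process. A minimal sketch of keeping it local (assuming the same hypothetical proxy address): call the opener directly and leave the default opener untouched:

#! /usr/bin/env python3

import urllib.request

proxy_support = urllib.request.ProxyHandler({'http': 'http://localhost:1080'})
opener = urllib.request.build_opener(proxy_support)

# Only requests made through this opener go via the proxy;
# plain urllib.request.urlopen() elsewhere is unaffected.
a = opener.open('http://g.cn').read().decode('utf8')
print(a)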

10. Timeouts

#! /usr/bin/env python3

import socket
import urllib.request

# timeout in seconds
timeout = 2
socket.setdefaulttimeout(timeout)

# this call to urllib.request.urlopen now uses the default timeout
# we have set in the socket module
req = urllib.request.Request('http://twitter.com/')
a = urllib.request.urlopen(req).read()
print(a)
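Setting the socket default affects every connection in the process. A narrower alternative is the timeout parameter that urlopen() accepts directly, which scopes the limit to a single call; a minimal sketch:

#! /usr/bin/env python3

import urllib.request
from urllib.error import URLError

req = urllib.request.Request('http://twitter.com/')
try:
    # The timeout (in seconds) applies only to this call.
    a = urllib.request.urlopen(req, timeout=2).read()
    print(a)
except URLError as e:
    print('Request failed or timed out:', e.reason)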


Reposted from: https://juejin.im/post/5b7686415188253345137109
