How to transcode HTML fetched with GET: encoding logic for web crawlers
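The earliest encoding was ASCII, which was defined in the United States. It stores each character in one byte (8 bits), although only 7 of those bits are actually used: English text needs no more than 128 characters, while a byte can hold 256 distinct values, so at the time this scheme was perfectly adequate.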
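Later, as computers spread around the world, every language faced the problem of how to be represented in a computer. Chinese alone has several thousand commonly used characters, so one-byte ASCII was clearly no longer enough. This is where Unicode came in. Strictly speaking, Unicode is only a mapping from characters to numbers (code points); it does not correspond to a concrete storage format. The prefix "Uni-" means "unified": Unicode tries to represent all of the world's languages in one scheme. But because Unicode only assigns each character a number and does not say how many bytes that number occupies in memory, several competing byte-level encodings of it appeared, such as UTF-16 and UTF-32, and for a while the Unicode ideal was not realized. That changed with the spread of the Internet and the arrival of UTF-8, which finally brought real unification. UTF-8 implements the Unicode standard with a variable-length rule of its own: each character is encoded in 1 to 4 bytes (the original design allowed up to 6), and ASCII characters keep their one-byte form.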
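The variable-length nature of UTF-8 is easy to check in Python 3. The snippet below is only an illustration; the four sample characters are my own choice, picked from different Unicode ranges.

```python
# Compare how many bytes each encoding needs for characters from
# different Unicode ranges. The sample characters are illustrative.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]  # A, e-acute, a CJK character, an emoji

utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
utf16_lengths = {ch: len(ch.encode("utf-16-le")) for ch in samples}
utf32_lengths = {ch: len(ch.encode("utf-32-le")) for ch in samples}

for ch in samples:
    print("U+%04X utf-8=%d utf-16=%d utf-32=%d" % (
        ord(ch), utf8_lengths[ch], utf16_lengths[ch], utf32_lengths[ch]))
```

UTF-8 uses 1, 2, 3 and 4 bytes for these four characters, while UTF-32 always uses 4.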
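Many people get confused at this point and mix up bytes with the other data types of a programming language. Bytes are the form in which data really exists inside a computer, and they are the only form in which data travels over the network. Strings in formats such as JSON or XML must also be converted to bytes before they can be sent through a socket. Converting between bytes and strings is exactly what encoding and decoding mean, and UTF-8 is the format you specify when doing so.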
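A brief word on serialization and deserialization. Serialization can be local or over the network. Local serialization usually means persisting an in-memory object to the local disk: the object and its related information are serialized into a string, the string is encoded in some format (for example UTF-8) into bytes, and the bytes are written to disk. Deserialization reads the bytes from disk into memory, decodes them into a string, and then parses that string to rebuild the object.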
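The round trip described above can be sketched with the standard library. This is a minimal illustration that assumes JSON as the string format and a temporary file as the storage; the `record` object is hypothetical.

```python
import json
import os
import tempfile

record = {"name": "spider", "pages": 3}  # hypothetical object to persist

# Serialize: object -> str -> bytes -> disk
text = json.dumps(record, ensure_ascii=False)
data = text.encode("utf-8")
path = os.path.join(tempfile.mkdtemp(), "record.json")
with open(path, "wb") as f:
    f.write(data)

# Deserialize: disk -> bytes -> str -> object
with open(path, "rb") as f:
    raw = f.read()
restored = json.loads(raw.decode("utf-8"))
```

The file is opened in binary mode on purpose, so the encode and decode steps stay explicit instead of being hidden inside `open()`.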
Encoding detection in Requests:
bytes, str, and unicode
1. str/bytes
>>> s = '123'
>>> type(s)
str
>>> s = b'123'
>>> type(s)
bytes
2. Converting between str and bytes
str and bytes convert to each other as follows:
str → bytes: bytes(s, encoding='utf8')
bytes → str: str(b, encoding='utf-8')
The two can also be converted by encoding and decoding:
encode a str into bytes: str.encode(s)
decode bytes into a str: bytes.decode(b)
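The four conversions listed above are equivalent in pairs, which a short Python 3 check makes visible. The sample string is my own; the last step shows what happens when the codec does not match the data.

```python
s = "编码"  # sample text, two CJK characters

# str -> bytes: the constructor form and the method form give the same result
b1 = bytes(s, encoding="utf-8")
b2 = s.encode("utf-8")

# bytes -> str: likewise
s1 = str(b1, encoding="utf-8")
s2 = b1.decode("utf-8")

# Decoding with a codec that cannot represent the data raises an error
try:
    b1.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```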
3. Strings in Python 2 and Python 3
Python 2 treats strings as native bytes rather than unicode.
In Python 3, all strings are unicode.
1, BeautifulSoup transcoding logic
Code location: python2.7/site-packages/bs4/dammit.py
@property
def encodings(self):
    """Yield a number of encodings that might work for this markup."""
    tried = set()
    for e in self.override_encodings:
        if self._usable(e, tried):
            yield e

    # Did the document originally start with a byte-order mark
    # that indicated its encoding?
    if self._usable(self.sniffed_encoding, tried):
        yield self.sniffed_encoding

    # Look within the document for an XML or HTML encoding
    # declaration.
    if self.declared_encoding is None:
        self.declared_encoding = self.find_declared_encoding(
            self.markup, self.is_html)
    if self._usable(self.declared_encoding, tried):
        yield self.declared_encoding

    # Use third-party character set detection to guess at the
    # encoding.
    if self.chardet_encoding is None:
        self.chardet_encoding = chardet_dammit(self.markup)
    if self._usable(self.chardet_encoding, tried):
        yield self.chardet_encoding

    # As a last-ditch effort, try utf-8 and windows-1252.
    for e in ('utf-8', 'windows-1252'):
        if self._usable(e, tried):
            yield e
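Explanation: this code walks through several encoding checks, in the following order of priority:
1, self.override_encodings: encodings specified by the user
2, self.sniffed_encoding
self.markup, self.sniffed_encoding = self.strip_byte_order_mark(markup)
This function determines the page encoding by checking for a byte-order mark (BOM) at the very start of the document.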
@classmethod
def strip_byte_order_mark(cls, data):
    """If a byte-order mark is present, strip it and return the encoding it implies."""
    encoding = None
    if isinstance(data, unicode):
        # Unicode data cannot have a byte-order mark.
        return data, encoding
    if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \
           and (data[2:4] != '\x00\x00'):
        encoding = 'utf-16be'
        data = data[2:]
    elif (len(data) >= 4) and (data[:2] == b'\xff\xfe') \
             and (data[2:4] != '\x00\x00'):
        encoding = 'utf-16le'
        data = data[2:]
    elif data[:3] == b'\xef\xbb\xbf':
        encoding = 'utf-8'
        data = data[3:]
    elif data[:4] == b'\x00\x00\xfe\xff':
        encoding = 'utf-32be'
        data = data[4:]
    elif data[:4] == b'\xff\xfe\x00\x00':
        encoding = 'utf-32le'
        data = data[4:]
    return data, encoding
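For Python 3 code outside bs4, the same BOM check can be sketched with the constants in the standard `codecs` module. This is not the bs4 implementation; `sniff_bom` is a hypothetical helper, and it orders the table so that longer BOMs are tested first, because the UTF-32LE mark begins with the same two bytes as the UTF-16LE mark.

```python
import codecs

# Longer BOMs come first: BOM_UTF32_LE starts with the bytes of BOM_UTF16_LE.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data):
    """Return (data without BOM, implied encoding), or (data, None) if no BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return data[len(bom):], name
    return data, None


body, encoding = sniff_bom(codecs.BOM_UTF8 + b"<html></html>")
print(encoding, body)
```

Like the bs4 code, this treats a leading `ff fe 00 00` as UTF-32LE, so a UTF-16LE document whose first character is U+0000 would be misread. Real HTML pages do not start that way.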
3, self.declared_encoding
self.declared_encoding = self.find_declared_encoding(
    self.markup, self.is_html)
This function uses regular expressions to find the encoding declared near the top of the document.
The regular expressions:
xml_encoding_re = re.compile(
    '^<\?.*encoding=[\'"](.*?)[\'"].*\?>'.encode(), re.I)
html_meta_re = re.compile(
    '<\s*meta[^>]+charset\s*=\s*["\']?([^>]*?)[ /;\'">]'.encode(), re.I)
@classmethod
def find_declared_encoding(cls, markup, is_html=False, search_entire_document=False):
    """Given a document, tries to find its declared encoding.

    An XML encoding is declared at the beginning of the document.

    An HTML encoding is declared in a <meta> tag, hopefully near the
    beginning of the document.
    """
    if search_entire_document:
        xml_endpos = html_endpos = len(markup)
    else:
        xml_endpos = 1024
        html_endpos = max(2048, int(len(markup) * 0.05))

    declared_encoding = None
    declared_encoding_match = xml_encoding_re.search(markup, endpos=xml_endpos)
    if not declared_encoding_match and is_html:
        declared_encoding_match = html_meta_re.search(markup, endpos=html_endpos)
    if declared_encoding_match is not None:
        declared_encoding = declared_encoding_match.groups()[0].decode(
            'ascii', 'replace')
    if declared_encoding:
        return declared_encoding.lower()
    return None
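To see the declaration lookup in isolation, here is a simplified Python 3 sketch. The two patterns and the `declared_encoding` helper are my own reduced versions, not the ones bs4 ships, and the fixed 2048-byte scan window is an assumption borrowed from the lower bound in the code above.

```python
import re

# Simplified, hypothetical patterns that operate on raw bytes.
XML_DECL_RE = re.compile(rb'^\s*<\?xml[^>]*encoding=["\']([^"\']+)["\']', re.I)
META_CHARSET_RE = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', re.I)


def declared_encoding(markup, scan_limit=2048):
    """Return the lowercased encoding declared in the first bytes, or None."""
    head = markup[:scan_limit]
    match = XML_DECL_RE.search(head) or META_CHARSET_RE.search(head)
    if match is None:
        return None
    # Encoding names are ASCII; anything else is replaced rather than raising.
    return match.group(1).decode("ascii", "replace").lower()


print(declared_encoding(b'<html><head><meta charset="GBK"></head></html>'))
```

Searching bytes rather than text is the point: the declaration has to be found before the document can be decoded.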
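4, self.chardet_encoding
self.chardet_encoding = chardet_dammit(self.markup)
This one clearly relies on the chardet package. chardet makes a statistical guess from the byte patterns of the body text and reports a confidence value alongside the guess.
import chardet
def chardet_dammit(s):
    return chardet.detect(s)['encoding']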
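`chardet_dammit` throws the confidence value away. If you call chardet yourself, you can use it to reject weak guesses. The sketch below assumes a threshold of 0.5 and a UTF-8 default, both arbitrary choices of mine; `pick_encoding` is a hypothetical helper, and the chardet import is guarded because chardet is a third-party package.

```python
def pick_encoding(detection, min_confidence=0.5, default="utf-8"):
    """Choose an encoding from the dict returned by chardet.detect().

    min_confidence and default are illustrative assumptions.
    """
    encoding = detection.get("encoding")
    confidence = detection.get("confidence") or 0.0
    if encoding and confidence >= min_confidence:
        return encoding.lower()
    return default


try:
    import chardet  # third-party: pip install chardet
except ImportError:
    chardet = None

if chardet is not None:
    # Detection is statistical, so a longer sample gives a more reliable guess.
    sample = "编码检测需要足够长的正文才能比较可靠。".encode("gbk") * 5
    print(pick_encoding(chardet.detect(sample)))
```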
2, Requests transcoding logic
response = requests.get(url, verify=False, headers=configSpider.get_head())
Requests exposes two encoding results:
response.encoding
Location: python2.7/site-packages/requests/adapters.py
```
response.encoding = get_encoding_from_headers(response.headers)
```
Location: python2.7/site-packages/requests/utils.py
```
def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    :rtype: str
    """
    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = cgi.parse_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'
```
The cgi.parse_header() function:
```
def parse_header(line):
    """Parse a Content-type like header.

    Return the main content-type and a dictionary of options.
    """
    parts = _parseparam(';' + line)
    key = parts.next()
    pdict = {}
    for p in parts:
        i = p.find('=')
        if i >= 0:
            name = p[:i].strip().lower()
            value = p[i+1:].strip()
            if len(value) >= 2 and value[0] == value[-1] == '"':
                value = value[1:-1]
                value = value.replace('\\\\', '\\').replace('\\"', '"')
            pdict[name] = value
    return key, pdict
```
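This simply reads the encoding declared in the response's Content-Type header. If a concrete charset is given, it is returned; if the type is a text type such as text/html and no charset is given, 'ISO-8859-1' is returned.
Many pages send only content-type: text/html in their response headers, and decoding such a page as 'ISO-8859-1' obviously produces garbled text whenever the page is not actually Latin-1.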
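The header rule is easy to reproduce on its own. The sketch below mirrors the behaviour of the quoted `get_encoding_from_headers`, but it is my own Python 3 rewrite: it parses the header with `email.message.Message` instead of `cgi.parse_header`, since the `cgi` module was removed from the standard library in Python 3.13. `encoding_from_content_type` is a hypothetical name.

```python
from email.message import Message


def encoding_from_content_type(content_type):
    """Mimic the header rule: explicit charset, else ISO-8859-1 for text types."""
    if not content_type:
        return None
    msg = Message()
    msg["content-type"] = content_type
    charset = msg.get_param("charset")
    if isinstance(charset, tuple):
        # RFC 2231 encoded parameters come back as (charset, language, value).
        charset = charset[2]
    if charset:
        return charset.strip("'\"")
    if "text" in msg.get_content_type():
        return "ISO-8859-1"
    return None


for header in ("text/html; charset=UTF-8", "text/html", "application/json"):
    print(header, "->", encoding_from_content_type(header))
```

The middle case is the one that causes trouble for crawlers: a bare `text/html` yields `ISO-8859-1` regardless of what the page body actually contains.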
response.apparent_encoding
The response also has an apparent_encoding. It is simple: it comes from running chardet on the body, so it is not guaranteed to be fully accurate either.
3, content and text in Requests
```
@property
def content(self):
    """Content of the response, in bytes."""

    if self._content is False:
        # Read the contents.
        if self._content_consumed:
            raise RuntimeError(
                'The content for this response was already consumed')

        if self.status_code == 0 or self.raw is None:
            self._content = None
        else:
            self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()

    self._content_consumed = True
    # don't need to release the connection; that's been handled by urllib3
    # since we exhausted the data.
    return self._content

@property
def text(self):
    """Content of the response, in unicode.

    If Response.encoding is None, encoding will be guessed using
    ``chardet``.

    The encoding of the response content is determined based solely on HTTP
    headers, following RFC 2616 to the letter. If you can take advantage of
    non-HTTP knowledge to make a better guess at the encoding, you should
    set ``r.encoding`` appropriately before accessing this property.
    """

    # Try charset from content-type
    content = None
    encoding = self.encoding

    if not self.content:
        return str('')

    # Fallback to auto-detected encoding.
    if self.encoding is None:
        encoding = self.apparent_encoding

    # Decode unicode from given encoding.
    try:
        content = str(self.content, encoding, errors='replace')
    except (LookupError, TypeError):
        # A LookupError is raised if the encoding was not found which could
        # indicate a misspelling or similar mistake.
        #
        # A TypeError can be raised if encoding is None
        #
        # So we try blindly encoding.
        content = str(self.content, errors='replace')

    return content
```
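content is the raw bytes stream, while text is the result of converting it to a str:
content = str(self.content, encoding, errors='replace')
If the page really is UTF-8 and it is decoded as UTF-8, the result is directly usable; otherwise the garbled-text problem remains. Note that a source-file declaration such as # -*- coding: utf-8 -*- only sets the encoding of the script itself and does not affect how the response bytes are decoded.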
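The effect of decoding with the wrong encoding can be reproduced without any network access. In this small illustration the byte string stands in for `response.content`; the sample text is my own.

```python
page_bytes = "中文网页".encode("utf-8")  # stands in for response.content

wrong = page_bytes.decode("iso-8859-1")  # no error, but every byte becomes one character
right = page_bytes.decode("utf-8")

# ISO-8859-1 maps each byte to exactly one code point, so the damage is
# reversible: re-encode with the wrong codec, then decode with the right one.
recovered = wrong.encode("iso-8859-1").decode("utf-8")

print(repr(wrong))
print(right, recovered)
```

This is why an ISO-8859-1 misdecode is recoverable, whereas `errors='replace'` with a codec that rejects some bytes loses information for good.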
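In summary, the best approach is to combine what the source code does with your own requirements and build a scheme of your own:
Encoding declared in the response headers
BOM detection at the start of the page
Encoding declared in the document body
Encoding detected by the chardet module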
When calling the Requests package, a simple fix is:
if response.encoding == 'ISO-8859-1':
    response.encoding = response.apparent_encoding
response.text
Or borrow the detector from bs4:
from bs4.dammit import EncodingDetector
detector = EncodingDetector(response.content, is_html=True)
# `encodings` yields candidates in priority order; take the first one.
print(next(detector.encodings, None))
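The four-step priority listed above (headers, BOM, in-document declaration, detection) can be combined into one function. What follows is a minimal standard-library sketch under several assumptions of mine: `detect_encoding` is a hypothetical helper, the `ISO-8859-1` default for bare text types is deliberately skipped, the meta pattern is simplified, and the chardet step is replaced by a "try UTF-8, else windows-1252" fallback so the example has no third-party dependency.

```python
import codecs
import re
from email.message import Message

_BOMS = [  # longer marks first, since UTF-32LE starts like UTF-16LE
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]
_META_RE = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', re.I)


def detect_encoding(body, content_type=None):
    """Guess the encoding of an HTML body given as bytes."""
    # 1. An explicit charset in the Content-Type header.
    if content_type:
        msg = Message()
        msg["content-type"] = content_type
        charset = msg.get_param("charset")
        if isinstance(charset, str) and charset:
            return charset.strip("'\"").lower()
    # 2. A byte-order mark at the very start of the body.
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    # 3. A charset declared in a <meta> tag near the top.
    match = _META_RE.search(body[:2048])
    if match:
        return match.group(1).decode("ascii", "replace").lower()
    # 4. Fallback (a chardet call could go here instead).
    try:
        body.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "windows-1252"


print(detect_encoding(b'<meta charset="gb2312">', "text/html"))
```

With Requests, one possible use is `response.encoding = detect_encoding(response.content, response.headers.get("content-type"))` before reading `response.text`.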