
How to transcode HTML fetched with GET: encoding-detection logic for crawled web pages

Published: 2023/12/20


Encoding-conversion logic for crawled web pages

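The earliest character encoding was ASCII, defined in the United States. It stores each character in one byte (8 bits). English text needs fewer than 128 distinct characters, and ASCII itself defines exactly 128 code points (7 bits), while one byte can represent 256 values, so this scheme was sufficient at the time.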

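As computers spread worldwide, every language had to be represented somehow. Chinese alone has several thousand commonly used characters, so single-byte ASCII was clearly not enough. Unicode emerged to solve this. Strictly speaking, Unicode is a character set rather than a concrete storage format: the prefix "Uni-" means "unified", and the standard assigns every character in every language a unique number (a code point), but it does not by itself fix how many bytes that number occupies in memory or on the wire. That job falls to encoding forms such as UTF-16 and UTF-32, and for a long time the coexistence of several forms meant that Unicode's goal of unification was not fully realized. That changed with the spread of the internet and of UTF-8, which became the de facto standard. UTF-8 is a variable-length encoding: a character takes 1 to 4 bytes (the original design allowed sequences of up to 6 bytes, but RFC 3629 restricts it to 4), and ASCII characters keep their single-byte values.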
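The variable width of UTF-8 is easy to check by encoding a few characters and counting the bytes. The sketch below is only an illustration: the sample characters and the helper name `encoded_lengths` are choices made here, not part of the original article.

```python
# Compare how many bytes one character needs in three Unicode encoding forms.
# The "-le" codec variants are used so that no byte-order mark is prepended.

def encoded_lengths(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )

samples = [
    "A",           # U+0041, plain ASCII
    "\u00e9",      # U+00E9, Latin small letter e with acute
    "\u4e2d",      # U+4E2D, a common Chinese character
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]

for ch in samples:
    utf8, utf16, utf32 = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: utf-8={utf8} utf-16={utf16} utf-32={utf32}")
```

UTF-32 always uses 4 bytes, UTF-16 uses 2 or 4, and UTF-8 ranges from 1 byte for ASCII up to 4 bytes for characters such as the emoji.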

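A common misunderstanding is to treat bytes as just another data type in a programming language. In fact bytes are what the computer really stores, and they are the only form in which data travels over a network. Strings in formats such as JSON or XML must be converted to bytes before they can be sent through a socket. Converting between bytes and strings is exactly what encoding and decoding mean, and UTF-8 is the format specified for that conversion.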

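A brief note on serialization and deserialization. Serialization can target local storage or the network. Local serialization usually means persisting an in-memory object to disk: the object and its related information are serialized into a string, the string is encoded in some format (for example UTF-8) into bytes, and the bytes are written to disk. Deserialization reverses this: the bytes are read from disk into memory, decoded into a string, and the string is parsed to rebuild the object.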
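The round trip described above can be sketched with the standard library. JSON is assumed here as the string format, and the function names `save_object` and `load_object` and the sample record are hypothetical; the article does not prescribe a particular serializer.

```python
import json
import os
import tempfile

def save_object(obj, path):
    """object -> str (JSON) -> bytes (UTF-8) -> file on disk."""
    text = json.dumps(obj, ensure_ascii=False)  # serialize to a string
    data = text.encode("utf-8")                 # encode the string to bytes
    with open(path, "wb") as fh:                # binary mode: we write raw bytes
        fh.write(data)

def load_object(path):
    """file on disk -> bytes -> str (UTF-8) -> object."""
    with open(path, "rb") as fh:
        data = fh.read()                        # raw bytes from disk
    text = data.decode("utf-8")                 # decode bytes to a string
    return json.loads(text)                     # parse the string into an object

record = {"title": "\u722c\u866b", "pages": 3}  # sample data with non-ASCII text
fd, path = tempfile.mkstemp(suffix=".json")
os.close(fd)
try:
    save_object(record, path)
    restored = load_object(path)
finally:
    os.remove(path)

print(restored == record)
```

Opening the file in binary mode keeps the encoding step explicit; opening it in text mode with `encoding="utf-8"` would perform the same conversion implicitly.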

Encoding detection in Requests:

bytes, str and unicode

1. str/bytes

>>> s = '123'
>>> type(s)
<class 'str'>
>>> s = b'123'
>>> type(s)
<class 'bytes'>

2. Converting between str and bytes

str and bytes can be converted with the type constructors:

str → bytes: bytes(s, encoding='utf8')

bytes → str: str(b, encoding='utf-8')

They can also be converted by encoding and decoding:

encode a str into bytes: str.encode(s)

decode bytes into a str: bytes.decode(b)
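As a small illustration, the snippet below checks that both conversion styles agree and shows what happens when the wrong codec is used for decoding. The helper `try_decode` and the sample string are introduced here only for the example.

```python
text = "\u7f16\u7801"  # a two-character Chinese string

# Both spellings of str -> bytes produce the same result.
as_bytes_1 = bytes(text, encoding="utf8")
as_bytes_2 = text.encode("utf-8")

# Both spellings of bytes -> str produce the same result.
back_1 = str(as_bytes_1, encoding="utf-8")
back_2 = as_bytes_2.decode("utf-8")

def try_decode(data, codec):
    """Return the decoded string, or None if the bytes are invalid for codec."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None

print(as_bytes_1 == as_bytes_2)           # same bytes either way
print(back_1 == back_2 == text)           # the round trip restores the string
print(try_decode(as_bytes_1, "ascii"))    # non-ASCII bytes cannot be decoded as ASCII
```

Decoding must use the same codec that was used for encoding; the rest of this article is about finding out which codec a web page was encoded with.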

3. Strings in Python 2 and Python 3

What is tensorflow.compat.as_str()?

Python 2 treats strings as native bytes rather than unicode,

Python 3 treats all strings as unicode.

1. BeautifulSoup encoding-detection logic

Code location: python2.7/site-packages/bs4/dammit.py

@property
def encodings(self):
    """Yield a number of encodings that might work for this markup."""
    tried = set()
    for e in self.override_encodings:
        if self._usable(e, tried):
            yield e

    # Did the document originally start with a byte-order mark
    # that indicated its encoding?
    if self._usable(self.sniffed_encoding, tried):
        yield self.sniffed_encoding

    # Look within the document for an XML or HTML encoding
    # declaration.
    if self.declared_encoding is None:
        self.declared_encoding = self.find_declared_encoding(
            self.markup, self.is_html)
    if self._usable(self.declared_encoding, tried):
        yield self.declared_encoding

    # Use third-party character set detection to guess at the
    # encoding.
    if self.chardet_encoding is None:
        self.chardet_encoding = chardet_dammit(self.markup)
    if self._usable(self.chardet_encoding, tried):
        yield self.chardet_encoding

    # As a last-ditch effort, try utf-8 and windows-1252.
    for e in ('utf-8', 'windows-1252'):
        if self._usable(e, tried):
            yield e

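Explanation: this code tries several sources of encoding information, in the following priority order:

1. self.override_encodings: encodings specified by the user

2. self.sniffed_encoding

self.markup, self.sniffed_encoding = self.strip_byte_order_mark(markup)

This function determines the encoding by checking for a byte-order mark (BOM) at the very start of the page.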

@classmethod
def strip_byte_order_mark(cls, data):
    """If a byte-order mark is present, strip it and return the encoding it implies."""
    encoding = None
    if isinstance(data, unicode):
        # Unicode data cannot have a byte-order mark.
        return data, encoding
    if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \
           and (data[2:4] != '\x00\x00'):
        encoding = 'utf-16be'
        data = data[2:]
    elif (len(data) >= 4) and (data[:2] == b'\xff\xfe') \
             and (data[2:4] != '\x00\x00'):
        encoding = 'utf-16le'
        data = data[2:]
    elif data[:3] == b'\xef\xbb\xbf':
        encoding = 'utf-8'
        data = data[3:]
    elif data[:4] == b'\x00\x00\xfe\xff':
        encoding = 'utf-32be'
        data = data[4:]
    elif data[:4] == b'\xff\xfe\x00\x00':
        encoding = 'utf-32le'
        data = data[4:]
    return data, encoding
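The function above is Python 2 code (it relies on the `unicode` type). A minimal Python 3 sketch of the same idea, using the BOM constants from the standard `codecs` module, might look like this. The name `sniff_bom` and the table `_BOMS` are made up for the example, and this is not the bs4 implementation.

```python
import codecs

# Longer marks are listed first because the UTF-32-LE mark (FF FE 00 00)
# begins with the UTF-16-LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
]

def sniff_bom(data):
    """Return (data_without_bom, encoding), or (data, None) if there is no BOM."""
    if isinstance(data, str):
        # Already-decoded text cannot carry a byte-order mark.
        return data, None
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):], encoding
    return data, None

print(sniff_bom(codecs.BOM_UTF8 + b"<html>"))
print(sniff_bom(b"<html>"))
```

Checking the 4-byte marks before the 2-byte ones plays the same role as the `data[2:4] != '\x00\x00'` guard in the bs4 code.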

3. self.declared_encoding

self.declared_encoding = self.find_declared_encoding(
    self.markup, self.is_html)

This function uses regular expressions to find the encoding declared near the start of the document.

The regular expressions:

xml_encoding_re = re.compile(
    '^<\?.*encoding=[\'"](.*?)[\'"].*\?>'.encode(), re.I)
html_meta_re = re.compile(
    '<\s*meta[^>]+charset\s*=\s*["\']?([^>]*?)[ /;\'">]'.encode(), re.I)

@classmethod
def find_declared_encoding(cls, markup, is_html=False, search_entire_document=False):
    """Given a document, tries to find its declared encoding.

    An XML encoding is declared at the beginning of the document.

    An HTML encoding is declared in a <meta> tag, hopefully near the
    beginning of the document.
    """
    if search_entire_document:
        xml_endpos = html_endpos = len(markup)
    else:
        xml_endpos = 1024
        html_endpos = max(2048, int(len(markup) * 0.05))

    declared_encoding = None
    declared_encoding_match = xml_encoding_re.search(markup, endpos=xml_endpos)
    if not declared_encoding_match and is_html:
        declared_encoding_match = html_meta_re.search(markup, endpos=html_endpos)
    if declared_encoding_match is not None:
        declared_encoding = declared_encoding_match.groups()[0].decode(
            'ascii', 'replace')
    if declared_encoding:
        return declared_encoding.lower()
    return None
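To make the declaration lookup concrete, here is a self-contained sketch that searches the first bytes of a document for an XML declaration or a `<meta>` charset. The two patterns are simplified ones written for this example (they are not the bs4 originals), and the 2048-byte `limit` is an assumption borrowed loosely from the code above.

```python
import re

# Simplified patterns for illustration only.
XML_DECL_RE = re.compile(
    rb'^\s*<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', re.I)
META_CHARSET_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.I)

def find_declared_encoding(markup, is_html=True, limit=2048):
    """Return the lower-cased declared encoding of a bytes document, or None."""
    head = markup[:limit]  # declarations are expected near the start
    match = XML_DECL_RE.search(head)
    if match is None and is_html:
        match = META_CHARSET_RE.search(head)
    if match is None:
        return None
    return match.group(1).decode("ascii", "replace").lower()

print(find_declared_encoding(b'<html><head><meta charset="GBK"></head>'))
print(find_declared_encoding(
    b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'))
```

The search works on raw bytes because the declaration has to be read before the document can be decoded; this is safe in practice since the declaration itself is written in ASCII-compatible characters.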

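4. self.chardet_encoding

self.chardet_encoding = chardet_dammit(self.markup)

This step relies on the chardet package. chardet guesses the encoding from byte statistics of the document body and also reports a confidence value for its guess.

import chardet

def chardet_dammit(s):
    return chardet.detect(s)['encoding']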

2. Requests encoding logic

response = requests.get(url, verify=False, headers=configSpider.get_head())

requests exposes two encoding results

response.encoding

Location: python2.7/site-packages/requests/adapters.py

```
response.encoding = get_encoding_from_headers(response.headers)
```

Location: python2.7/site-packages/requests/utils.py

```
def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    :rtype: str
    """

    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = cgi.parse_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'
```
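The code above depends on the `cgi` module, which was removed from the standard library in Python 3.13. As a rough Python 3 equivalent, the sketch below parses the header with `email.message.Message` instead. The function name `encoding_from_content_type` is hypothetical, and this is not how Requests itself implements the lookup.

```python
from email.message import Message

def encoding_from_content_type(content_type):
    """Mimic the header logic above for a single Content-Type value."""
    if not content_type:
        return None
    msg = Message()
    msg["content-type"] = content_type
    charset = msg.get_param("charset")  # None when no charset parameter exists
    if isinstance(charset, str) and charset:
        return charset.strip("'\"")
    if "text" in msg.get_content_type():
        # Same fallback as the quoted code: text/* without charset.
        return "ISO-8859-1"
    return None

print(encoding_from_content_type("text/html; charset=UTF-8"))
print(encoding_from_content_type("text/html"))
print(encoding_from_content_type("application/json"))
```

The second call shows the case that causes trouble for crawlers: a bare `text/html` header yields ISO-8859-1 regardless of how the page is really encoded.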

The cgi.parse_header() function

```
def parse_header(line):
    """Parse a Content-type like header.

    Return the main content-type and a dictionary of options.

    """
    parts = _parseparam(';' + line)
    key = parts.next()
    pdict = {}
    for p in parts:
        i = p.find('=')
        if i >= 0:
            name = p[:i].strip().lower()
            value = p[i+1:].strip()
            if len(value) >= 2 and value[0] == value[-1] == '"':
                value = value[1:-1]
                value = value.replace('\\\\', '\\').replace('\\"', '"')
            pdict[name] = value
    return key, pdict
```

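This simply reads the encoding declared in the response's Content-Type header. If a charset is given, it is returned; if the type is text/html with no charset, 'ISO-8859-1' is returned.

Many pages send only content-type: text/html in their response headers, and decoding such pages as 'ISO-8859-1' produces garbled text whenever the page is actually in another encoding.

response.apparent_encoding

Requests also offers apparent_encoding. It is simply chardet's guess based on the body, so it is not guaranteed to be accurate either.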
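The garbling caused by the ISO-8859-1 fallback can be reproduced without any network access. The sample string below is arbitrary; the point is only that UTF-8 bytes decoded as ISO-8859-1 never raise an error, they silently turn into the wrong characters.

```python
original = "\u4e2d\u6587\u7f51\u9875"   # a four-character Chinese string
payload = original.encode("utf-8")       # what a UTF-8 page sends on the wire

# ISO-8859-1 maps every possible byte to a character, so decoding always
# "succeeds", but each byte becomes its own (wrong) character.
garbled = payload.decode("iso-8859-1")

# Because that mapping is one-to-one, the damage is reversible as long as
# the garbled string has not been modified.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(len(original), len(payload), len(garbled))
print(repaired == original)
```

This is why overriding `response.encoding` before reading `response.text` works: the raw bytes are still intact and only the choice of decoder needs to change.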

3. content and text in Requests

```
@property
def content(self):
    """Content of the response, in bytes."""

    if self._content is False:
        # Read the contents.
        if self._content_consumed:
            raise RuntimeError(
                'The content for this response was already consumed')

        if self.status_code == 0 or self.raw is None:
            self._content = None
        else:
            self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()

    self._content_consumed = True
    # don't need to release the connection; that's been handled by urllib3
    # since we exhausted the data.
    return self._content

@property
def text(self):
    """Content of the response, in unicode.

    If Response.encoding is None, encoding will be guessed using
    ``chardet``.

    The encoding of the response content is determined based solely on HTTP
    headers, following RFC 2616 to the letter. If you can take advantage of
    non-HTTP knowledge to make a better guess at the encoding, you should
    set ``r.encoding`` appropriately before accessing this property.
    """

    # Try charset from content-type
    content = None
    encoding = self.encoding

    if not self.content:
        return str('')

    # Fallback to auto-detected encoding.
    if self.encoding is None:
        encoding = self.apparent_encoding

    # Decode unicode from given encoding.
    try:
        content = str(self.content, encoding, errors='replace')
    except (LookupError, TypeError):
        # A LookupError is raised if the encoding was not found which could
        # indicate a misspelling or similar mistake.
        #
        # A TypeError can be raised if encoding is None
        #
        # So we try blindly encoding.
        content = str(self.content, errors='replace')

    return content
```
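The decoding step at the end of `text` can be tried in isolation. The sketch below is a reduced, hypothetical version (`decode_body` is not a Requests function) that leaves out the chardet fallback and keeps only the `errors='replace'` behaviour and the two exception cases.

```python
def decode_body(body, encoding):
    """Decode bytes in the spirit of the text property, minus chardet."""
    if not body:
        return ""
    try:
        # Invalid byte sequences become U+FFFD instead of raising.
        return str(body, encoding, errors="replace")
    except (LookupError, TypeError):
        # LookupError: unknown codec name. TypeError: encoding is None.
        # Fall back to str()'s default codec, which is UTF-8.
        return str(body, errors="replace")

print(decode_body(b"caf\xc3\xa9", "utf-8"))          # valid UTF-8
print(repr(decode_body(b"caf\xe9", "utf-8")))        # invalid byte is replaced
print(decode_body(b"caf\xc3\xa9", "no-such-codec"))  # falls back to UTF-8
```

Because of `errors='replace'`, `text` never raises on bad input; a wrong encoding shows up as replacement characters or garbled text rather than as an exception.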

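content is the raw body as bytes, while text is that body decoded into a str:

content = str(self.content, encoding, errors='replace')

If the page happens to be UTF-8, the bytes in content can be used directly in a Python 2 script whose own source encoding is UTF-8 (# -*- coding: utf-8 -*-), because the byte sequences match. For any other page encoding the bytes must be decoded explicitly, otherwise the output is still garbled.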

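In summary, the best approach is to build your own scheme from the source-code logic above and your own requirements, checking in order:

Encoding declared in the response headers

Byte-order mark (BOM) at the start of the page

Encoding declared in the document body

Encoding detected by the chardet module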
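The four checks listed above could be chained roughly as follows. This is a sketch under several assumptions: every function name is invented for the example, the `<meta>` pattern is a simplified one, and the statistical step is a pluggable `guess` callable (for instance `lambda b: chardet.detect(b)['encoding']`) so that the code runs without third-party packages. Unlike Requests, it does not map a bare text/html header to ISO-8859-1.

```python
import codecs
import re
from email.message import Message

_BOMS = [  # 4-byte marks first, see the UTF-32-LE / UTF-16-LE overlap
    (codecs.BOM_UTF32_BE, "utf-32-be"), (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"), (codecs.BOM_UTF16_LE, "utf-16-le"),
]
_META_RE = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.I)

def _usable(name):
    """True if name is a codec that Python can look up."""
    if not name:
        return False
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

def header_charset(content_type):
    """Step 1: charset parameter of the Content-Type header, if any."""
    if not content_type:
        return None
    msg = Message()
    msg["content-type"] = content_type
    charset = msg.get_param("charset")
    return charset if isinstance(charset, str) else None

def bom_encoding(body):
    """Step 2: encoding implied by a byte-order mark, if any."""
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    return None

def meta_charset(body, limit=2048):
    """Step 3: charset declared in a <meta> tag near the start."""
    match = _META_RE.search(body[:limit])
    return match.group(1).decode("ascii", "replace") if match else None

def detect_encoding(body, content_type=None, guess=None, default="utf-8"):
    """Run the steps in priority order and return the first usable codec."""
    steps = [
        lambda: header_charset(content_type),
        lambda: bom_encoding(body),
        lambda: meta_charset(body),
    ]
    if guess is not None:
        steps.append(lambda: guess(body))  # step 4: statistical detection
    for step in steps:
        name = step()
        if _usable(name):
            return name.lower()
    return default

page = b"<html><head><meta charset='gbk'></head></html>"
print(detect_encoding(page, "text/html"))
print(detect_encoding(page, "text/html; charset=UTF-8"))
```

Each step runs only if the earlier ones found nothing usable, so the comparatively expensive statistical guess is reached last.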

When calling the Requests package, a simple fix is:

if response.encoding == 'ISO-8859-1':
    response.encoding = response.apparent_encoding
response.text

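Or borrow the approach used by bs4:

from bs4.dammit import EncodingDetector

# response is the requests response from the snippet above
detector = EncodingDetector(response.content, is_html=True)
# encodings yields candidates in priority order; take the first one
print(next(detector.encodings, None))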
