

Crawler Pitfall Series: etree.HTML Parsing Exception

Published: 2023/12/20 · HTML · 豆豆
This article, collected by 生活随笔, walks through a parsing exception you may hit when using etree.HTML; hopefully it serves as a useful reference.

All sorts of problems come up while crawling. Here is one involving an exception raised by etree.HTML.
1. Problem description
A typical crawl fetches a page's HTML with requests.get(), parses it with etree.HTML from the lxml library, and then extracts the desired content with XPath.

My crawler code can be abstracted roughly as follows:

res = requests.get(url)
html = etree.HTML(res.text)
contents = html.xpath('//div/xxxx')

It then failed with the following error:

Traceback (most recent call last):
  File "xxxxxxxx.py", line 157, in <module>
    get_website_title_content(url)
  File "xxxxxxxx.py", line 141, in get_website_title_content
    html = etree.HTML(html_text)
  File "src\lxml\etree.pyx", line 3170, in lxml.etree.HTML
  File "src\lxml\parser.pxi", line 1872, in lxml.etree._parseMemoryDocument
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

The key line is: ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
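The error can be reproduced without any network access: any str passed to etree.HTML that begins with an encoding declaration triggers it, while the same document as bytes parses fine. The markup below is a hypothetical stand-in for a real page:

```python
from lxml import etree

# Hypothetical document with an encoding declaration, like many real pages.
doc = ('<?xml version="1.0" encoding="utf-8"?>'
       '<html><body><div>hi</div></body></html>')

try:
    etree.HTML(doc)          # str input with an encoding declaration
    raised = False
except ValueError:
    raised = True            # the error described in this article

root = etree.HTML(doc.encode('utf-8'))  # bytes input parses fine
```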

2. Solution
After some digging, the cause turns out to be the difference between res.text and res.content in requests. As the definitions of text and content in the requests source (shown below) make clear, res.text returns Unicode (str) data, while res.content returns bytes.

@property
def content(self):
    """Content of the response, in bytes."""

    if self._content is False:
        # Read the contents.
        if self._content_consumed:
            raise RuntimeError(
                'The content for this response was already consumed')

        if self.status_code == 0 or self.raw is None:
            self._content = None
        else:
            self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''

    self._content_consumed = True
    # don't need to release the connection; that's been handled by urllib3
    # since we exhausted the data.
    return self._content

@property
def text(self):
    """Content of the response, in unicode.

    If Response.encoding is None, encoding will be guessed using
    ``chardet``.

    The encoding of the response content is determined based solely on HTTP
    headers, following RFC 2616 to the letter. If you can take advantage of
    non-HTTP knowledge to make a better guess at the encoding, you should
    set ``r.encoding`` appropriately before accessing this property.
    """

    # Try charset from content-type
    content = None
    encoding = self.encoding

    if not self.content:
        return str('')

    # Fallback to auto-detected encoding.
    if self.encoding is None:
        encoding = self.apparent_encoding

    # Decode unicode from given encoding.
    try:
        content = str(self.content, encoding, errors='replace')
    except (LookupError, TypeError):
        # A LookupError is raised if the encoding was not found which could
        # indicate a misspelling or similar mistake.
        #
        # A TypeError can be raised if encoding is None
        #
        # So we try blindly encoding.
        content = str(self.content, errors='replace')

    return content

The error occurs because etree parsing does not accept Unicode strings that carry an encoding declaration: the declaration promises a byte encoding, which a string that has already been decoded cannot honor.
The fix is therefore simple. The first option is to pass res.content directly:

res = requests.get(url)
html = etree.HTML(res.content)
contents = html.xpath('//div/xxxx')

The second option is to convert the Unicode string to bytes yourself:

res = requests.get(url)
html_text = bytes(bytearray(res.text, encoding='utf-8'))
html = etree.HTML(html_text)
contents = html.xpath('//div/xxxx')
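As a shorter variant of the second fix, str.encode() produces the same bytes object as the bytes(bytearray(...)) round trip. The snippet below uses a hypothetical local string in place of res.text so it runs without a network request:

```python
from lxml import etree

# html_text stands in for res.text (a hypothetical response body).
html_text = ('<?xml version="1.0" encoding="utf-8"?>'
             '<html><body><div class="t">ok</div></body></html>')

# str.encode() builds the same bytes as bytes(bytearray(..., encoding='utf-8')).
html = etree.HTML(html_text.encode('utf-8'))
contents = html.xpath('//div[@class="t"]/text()')
```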

Summary

That is the whole of this etree.HTML pitfall write-up, collected by 生活随笔; hopefully it helps you solve the problem you ran into.
