

Crawler Pitfall Series: etree.HTML Parsing Exception

Published: 2023/12/20 · HTML · 豆豆
This article, collected by 生活随笔, walks through a parsing exception you may hit when using etree.HTML; hopefully it serves as a useful reference.

All sorts of problems come up while crawling. Here is one involving an exception raised by etree.HTML.
1. Problem description
A typical crawl fetches a page's HTML with requests.get(), parses it with etree.HTML from the lxml library, and then extracts the desired content with XPath.

My crawler code can be abstracted roughly as follows:

res = requests.get(url)
html = etree.HTML(res.text)
contents = html.xpath('//div/xxxx')

It then failed with the following error:

Traceback (most recent call last):
  File "xxxxxxxx.py", line 157, in <module>
    get_website_title_content(url)
  File "xxxxxxxx.py", line 141, in get_website_title_content
    html = etree.HTML(html_text)
  File "src\lxml\etree.pyx", line 3170, in lxml.etree.HTML
  File "src\lxml\parser.pxi", line 1872, in lxml.etree._parseMemoryDocument
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

The key line is: ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
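The error can be reproduced without any network access: any str passed to etree.HTML that begins with an encoding declaration triggers it, while the same document as bytes parses fine. The markup below is a hypothetical stand-in for a real page:

```python
from lxml import etree

# Hypothetical document with an encoding declaration, like many real pages.
doc = ('<?xml version="1.0" encoding="utf-8"?>'
       '<html><body><div>hi</div></body></html>')

try:
    etree.HTML(doc)          # str input with an encoding declaration
    raised = False
except ValueError:
    raised = True            # the error described in this article

root = etree.HTML(doc.encode('utf-8'))  # bytes input parses fine
```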

2. Solution
After some digging, the cause turns out to be the difference between res.text and res.content in requests. As the definitions of text and content in the requests source (shown below) make clear, res.text returns Unicode (str) data, while res.content returns bytes.

@property
def content(self):
    """Content of the response, in bytes."""

    if self._content is False:
        # Read the contents.
        if self._content_consumed:
            raise RuntimeError(
                'The content for this response was already consumed')

        if self.status_code == 0 or self.raw is None:
            self._content = None
        else:
            self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''

    self._content_consumed = True
    # don't need to release the connection; that's been handled by urllib3
    # since we exhausted the data.
    return self._content

@property
def text(self):
    """Content of the response, in unicode.

    If Response.encoding is None, encoding will be guessed using
    ``chardet``.

    The encoding of the response content is determined based solely on HTTP
    headers, following RFC 2616 to the letter. If you can take advantage of
    non-HTTP knowledge to make a better guess at the encoding, you should
    set ``r.encoding`` appropriately before accessing this property.
    """

    # Try charset from content-type
    content = None
    encoding = self.encoding

    if not self.content:
        return str('')

    # Fallback to auto-detected encoding.
    if self.encoding is None:
        encoding = self.apparent_encoding

    # Decode unicode from given encoding.
    try:
        content = str(self.content, encoding, errors='replace')
    except (LookupError, TypeError):
        # A LookupError is raised if the encoding was not found which could
        # indicate a misspelling or similar mistake.
        #
        # A TypeError can be raised if encoding is None
        #
        # So we try blindly encoding.
        content = str(self.content, errors='replace')

    return content

The error occurs because etree parsing does not accept Unicode strings that carry an encoding declaration: the declaration promises a byte encoding, which a string that has already been decoded cannot honor.
The fix is therefore simple. The first option is to pass res.content directly:

res = requests.get(url)
html = etree.HTML(res.content)
contents = html.xpath('//div/xxxx')

The second option is to convert the Unicode string to bytes yourself:

res = requests.get(url)
html_text = bytes(bytearray(res.text, encoding='utf-8'))
html = etree.HTML(html_text)
contents = html.xpath('//div/xxxx')
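As a shorter variant of the second fix, str.encode() produces the same bytes object as the bytes(bytearray(...)) round trip. The snippet below uses a hypothetical local string in place of res.text so it runs without a network request:

```python
from lxml import etree

# html_text stands in for res.text (a hypothetical response body).
html_text = ('<?xml version="1.0" encoding="utf-8"?>'
             '<html><body><div class="t">ok</div></body></html>')

# str.encode() builds the same bytes as bytes(bytearray(..., encoding='utf-8')).
html = etree.HTML(html_text.encode('utf-8'))
contents = html.xpath('//div[@class="t"]/text()')
```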

Summary

That is the whole of this etree.HTML pitfall write-up, collected by 生活随笔; hopefully it helps you solve the problem you ran into.
