[Python in Practice] Automated Device Model Labeling (Implemented with a Sogou Crawler)

Published: 2024/4/14


1. Introduction

Device model strings collected from Android phones mostly look like this:

mi|5
mi|4c
mi 4c
2014022
kiw-al10
nem-tl00h

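The collected model strings are messy and inconsistent, which makes statistical analysis inconvenient. Labeling them is therefore especially important.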
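As a small illustration of how messy the raw strings are, the sketch below collapses separator variants so that duplicates such as mi|4c and mi 4c are looked up only once. It is not part of the original crawler: the helper name normalize_model and the assumption that "|" and whitespace are interchangeable separators are mine.

```python
import re


def normalize_model(raw):
    """Collapse separators so that variants of one model share a key."""
    # Assumption (not from the original post): "|" and whitespace are
    # equivalent separators, and case does not matter.
    return re.sub(r"[|\s]+", " ", raw.strip().lower())


raw_models = ["mi|5", "mi|4c", "mi 4c", "2014022", "kiw-al10", "nem-tl00h"]

# Keep the first occurrence of each normalized key, preserving order.
unique_models = list(dict.fromkeys(normalize_model(m) for m in raw_models))
print(unique_models)
# ['mi 5', 'mi 4c', '2014022', 'kiw-al10', 'nem-tl00h']
```

Whether the normalized string or the raw one gives better search results is something that would need to be checked against real queries.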

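ZOL (Zhongguancun Online) has descriptions of most phones sold in China, including the device model nem-tl00h and its corresponding common name 榮耀暢玩5C (Honor 5C). The automated labeling strategy is therefore designed as follows:

First, search for the device model on Sogou. To make sure the first result comes from the ZOL site, add the qualifier site:detail.zol.com.cn to the query.

Then, follow the link of the first result to the corresponding ZOL page and parse it to get the label name and the phone's alias.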

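2. Implementation

Following the crawling strategy above, I implemented a simple crawler in Python. It uses PyQuery to parse the HTML pages; PyQuery manipulates HTML elements with a jQuery-like syntax, so anyone familiar with jQuery can pick it up right away.

The code of the Sogou crawler (based on Python 3.5.2) is as follows: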

# -*- coding: utf-8 -*-
# @Time    : 2016/8/8
# @Author  : rain
import codecs
import csv
import logging
import re
import time
import urllib.error
import urllib.parse
import urllib.request

from pyquery import PyQuery as pq


def quote_url(model_name):
    """Build the Sogou search URL, restricted to detail.zol.com.cn."""
    base_url = "https://www.sogou.com/web?query=%s"
    site_zol = "site:detail.zol.com.cn "
    return base_url % (urllib.parse.quote(site_zol + model_name))


def parse_sogou(model_name):
    """Return the ZOL URL behind the first Sogou result, or None."""
    search_url = quote_url(model_name)
    request = urllib.request.Request(url=search_url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/45.0.2454.101 Safari/537.36'})
    sogou_html = urllib.request.urlopen(request).read()
    sogou_dom = pq(sogou_html)
    goto_url = sogou_dom("div.results>.vrwrap>.vrTitle>a[target='_blank']").eq(0).attr("href")
    logging.warning("goto url: %s", goto_url)
    if goto_url is None:
        return None
    # The result link is a redirect page; the real URL sits inside its script.
    goto_dom = pq(url=goto_url)
    script_text = goto_dom("script").text()
    zol_urls = re.findall(r'\("(.*)"\)', script_text)
    # Guard against a redirect page without the expected script.
    return zol_urls[0] if zol_urls else None


def parse_zol(model_name):
    """Return (name, alias) parsed from the ZOL page, or (None, None)."""
    zol_url = parse_sogou(model_name)
    if zol_url is None:
        return None, None
    try:
        zol_html = urllib.request.urlopen(zol_url).read()
    except urllib.error.HTTPError as e:
        logging.exception(e)
        return None, None
    zol_dom = pq(zol_html)
    title = zol_dom(".page-title.clearfix")
    name = title("h1").text()
    alias = title("h2").text()
    # A title like "Name（Alias）" carries an extra alias in full-width parentheses.
    match_result = re.match(u'(.*)（(.*)）', name)
    if match_result:
        name = match_result.group(1)
        alias = match_result.group(2) + " " + alias
    return name, alias


if __name__ == "__main__":
    with codecs.open("./resources/data.txt", 'r', 'utf-8') as fr:
        with open("./resources/result.csv", 'w', newline='') as fw:
            writer = csv.writer(fw, delimiter=',')
            for model in fr.readlines():
                model = model.rstrip()
                label_name, label_alias = parse_zol(model)
                writer.writerow([model, label_name, label_alias])
                logging.warning("model: %s, name: %s, alias: %s", model, label_name, label_alias)
                time.sleep(10)  # pause between requests to avoid being blocked
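The title-splitting step in parse_zol can be tried without any network access. The sketch below isolates that regular expression in a helper I call split_name_alias (a hypothetical name); the input strings are made up for illustration and are not real ZOL page titles. Unlike the original code, it strips the trailing space when the h2 text is empty.

```python
import re


def split_name_alias(h1_text, h2_text):
    """Split 'Name（Alias）' from the h1 text and merge it with the h2 alias."""
    # The parentheses are the full-width characters （ and ）, as in parse_zol.
    match = re.match(u"(.*)（(.*)）", h1_text)
    if match is None:
        # No parenthesised part: keep both texts unchanged.
        return h1_text, h2_text
    return match.group(1), (match.group(2) + " " + h2_text).strip()


# Made-up titles, only to show the behaviour of the pattern.
print(split_name_alias(u"Example Phone 5C（Dual 4G）", u"EX-5C"))
# ('Example Phone 5C', 'Dual 4G EX-5C')
print(split_name_alias(u"Example Phone 5C", u"EX-5C"))
# ('Example Phone 5C', 'EX-5C')
```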

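To avoid being blocked by Sogou, the crawler sleeps for 10 s after each lookup. Crawling at this rate is of course very slow and needs some optimization.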
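One possible optimization, which the original post does not describe, is to avoid repeated lookups: if the same model string appears many times in the input, only the first occurrence needs a request and a pause. The sketch below assumes a hypothetical helper label_models that takes the lookup function as a parameter (parse_zol could be passed in); the sleep function is injectable so the logic can be exercised without waiting.

```python
import time


def label_models(models, fetch, delay=10, sleep=time.sleep):
    """Label each model, calling fetch only once per distinct model string."""
    cache = {}
    rows = []
    for model in models:
        if model not in cache:
            if cache:
                # Pause only between real requests, not for cached models.
                sleep(delay)
            cache[model] = fetch(model)
        name, alias = cache[model]
        rows.append((model, name, alias))
    return rows


def stub_fetch(model):
    """Stand-in for parse_zol so the sketch runs offline."""
    return model.upper(), None


# delay=0 keeps the demo instantaneous; the crawler above uses 10 seconds.
for row in label_models(["mi 4c", "kiw-al10", "mi 4c"], stub_fetch, delay=0):
    print(row)
```

Note that failed lookups, which return (None, None), are cached as well; retrying them later would need separate handling.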

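3. Optimization

Downloading CAPTCHAs

Sogou blocks clients based on request frequency. When there are too many requests, it returns a page asking for a CAPTCHA, with a message saying that the visits are too frequent and that verification is needed to confirm a normal user: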

<div class="content-box">
  <p class="ip-time-p">IP:61...<br/>訪問時間：2016.08.09 15:40:04</p>
  <p class="p2">用戶您好，您的訪問過于頻繁，為確認本次訪問為正常用戶行為，需要您協助驗證。</p>
  ...
  <form name="authform" method="POST" id="seccodeForm" action="/">
    <p class="p4">
      ...
      <input type="hidden" name="m" value="0"/>
      <span class="s1"><a onclick="changeImg2();" href="javascript:void(0)"><img id="seccodeImage" onload="setImgCode(1)" onerror="setImgCode(0)" src="util/seccode.php?tc=1470728404" width="100" height="40" alt="請輸入圖中的驗證碼" title="請輸入圖中的驗證碼"/></a></span>
      <a href="javascript:void(0);" id="change-img" onclick="changeImg2();" style="padding-left:50px;">換一張</a>
      <span class="s2" id="error-tips" style="display: none;"/>
    </p>
  </form>
  ...
</div>
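A crawler can tell this CAPTCHA page apart from a normal result page by looking for the img element with id seccodeImage. The sketch below does this with the standard library's html.parser instead of PyQuery, purely to keep the example dependency-free; the class and function names are hypothetical, and the sample markup is a shortened copy of the page above.

```python
from html.parser import HTMLParser


class SeccodeFinder(HTMLParser):
    """Record the src attribute of <img id="seccodeImage">, if present."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        attr_map = dict(attrs)
        if tag == "img" and attr_map.get("id") == "seccodeImage":
            self.src = attr_map.get("src")


def find_seccode_src(html_text):
    """Return the CAPTCHA image src, or None for a normal page."""
    finder = SeccodeFinder()
    finder.feed(html_text)
    return finder.src


sample = ('<form id="seccodeForm"><img id="seccodeImage" '
          'src="util/seccode.php?tc=1470728404" width="100" height="40"/></form>')
print(find_seccode_src(sample))        # util/seccode.php?tc=1470728404
print(find_seccode_src("<p>ok</p>"))   # None
```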

Analysis of the HTML shows that the real CAPTCHA image URL has to be assembled as follows:

http://weixin.sogou.com/antispider/util/seccode.php?tc=1470728404
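The URL above is the antispider base path joined with the relative src of the image. As a minimal sketch, the same result can be obtained with urllib.parse.urljoin, and the tc value can be read from the query string with parse_qs; the constant and function names here are my own.

```python
from urllib.parse import parse_qs, urljoin, urlparse

# Base path taken from the URL shown above.
ANTISPIDER_BASE = "http://weixin.sogou.com/antispider/"


def captcha_url(img_src):
    """Resolve the relative src of the CAPTCHA image against the base path."""
    return urljoin(ANTISPIDER_BASE, img_src)


def captcha_token(img_src):
    """Extract the tc parameter, usable as a local file name."""
    return parse_qs(urlparse(img_src).query)["tc"][0]


src = "util/seccode.php?tc=1470728404"
print(captcha_url(src))
# http://weixin.sogou.com/antispider/util/seccode.php?tc=1470728404
print(captcha_token(src))
# 1470728404
```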

Download the CAPTCHA images to a local directory:

import re
import urllib.request

from pyquery import PyQuery as pq

# Request the search page repeatedly; once Sogou starts blocking,
# the response contains the CAPTCHA image.
for i in range(100):
    html = urllib.request.urlopen("https://www.sogou.com/web?query=treant").read()
    dom = pq(html)
    img_src = dom("#seccodeImage").attr("src")
    if img_src is not None:
        # Use the tc value as the file name; ./images/ must already exist.
        img_name = re.search("tc=(.*)", img_src).group(1)
        anti_img_url = "http://weixin.sogou.com/antispider/" + img_src
        urllib.request.urlretrieve(anti_img_url, "./images/" + img_name + ".jpg")

Recognizing the CAPTCHAs with tesseract gives only mediocre results; I will look into other recognition methods when I have time.

Reposted from: https://www.cnblogs.com/en-heng/p/5754112.html
