Baidu Migration (百度迁徙): Crawlers for Move-in Population and Migration Scale
I recently worked on a COVID-19 course project that needed data on inter-provincial population migration. Starting from the inter-city flow crawler at https://blog.csdn.net/qq_44315987/article/details/104118498, I adapted and improved the code to scrape province-level data from Baidu Migration and save everything into a single workbook.
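Before the full scripts, here is a minimal sketch of a single request against the provincerank.jsonp endpoint they use, showing how the cb(...) JSONP wrapper is stripped and the payload parsed. Beijing's code 110000 and the date 20200115 are example values taken from the scripts below, and the response layout is assumed to match what the crawlers expect:

# Minimal single-request sketch (assumes the endpoint still responds as the
# crawlers below expect). 110000 = Beijing, 20200115 = an example date.
import json
import urllib.request

url = ('http://huiyan.baidu.com/migration/provincerank.jsonp?'
       'dt=province&id=110000&type=move_in&date=20200115')
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read().decode('unicode_escape')
if html.startswith('cb'):  # response looks like cb({...}); strip the wrapper
    html = html[3:-1]
data = json.loads(html)['data']
for item in data['list'][:5]:  # top five source provinces and their shares
    print(item['province_name'], item['value'])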
Move-in population
# coding:utf-8
import json
import urllib.request

import pandas as pd


def get_code_city():
    # Province names mapped to administrative-division codes.
    # Hong Kong, Macao and Taiwan are excluded.
    code_str = """北京|110000,天津|120000,广西壮族自治区|450000,内蒙古自治区|150000,宁夏回族自治区|640000,新疆维吾尔自治区|650000,西藏自治区|540000,上海|310000,浙江|330000,重庆|500000,安徽|340000,福建|350000,甘肃|620000,广东|440000,贵州|520000,海南|460000,河北|130000,黑龙江|230000,河南|410000,湖北|420000,湖南|430000,江苏|320000,江西|360000,吉林|220000,辽宁|210000,青海|630000,山东|370000,山西|140000,陕西|610000,四川|510000,云南|530000"""
    code_dict = {}
    for mapping in code_str.split(","):
        name, number = mapping.split("|")
        code_dict[name] = number
    return code_dict


def conserve(data, time, work):
    # Helper that dumps one day's ranking into its own sheet. Kept for
    # reference; the main loop below writes one sheet per province instead.
    province = [item['province_name'] for item in data['list']]
    value = [item['value'] for item in data['list']]
    res = pd.DataFrame({'省份': province, '比例': value})  # province / ratio
    res.to_excel(excel_writer=work, sheet_name=time)


data_type = "move_in"
work = pd.ExcelWriter('{}.xlsx'.format(data_type))  # output file name
# Dates used: Jan 15 - Mar 15, 2020, as yyyymmdd integers.
time_slots = (list(range(20200115, 20200132))
              + list(range(20200201, 20200230))
              + list(range(20200301, 20200316)))
provinces = ["上海市", "北京市", "重庆市", "天津市", "内蒙古自治区", "广西壮族自治区",
             "西藏自治区", "新疆维吾尔自治区", "宁夏回族自治区", "河北省", "山西省",
             "辽宁省", "吉林省", "黑龙江省", "江苏省", "浙江省", "安徽省", "福建省",
             "江西省", "山东省", "河南省", "湖北省", "湖南省", "广东省", "海南省",
             "四川省", "贵州省", "云南省", "陕西省", "甘肃省", "青海省"]
code = get_code_city()  # province names and codes (HK/Macao/Taiwan excluded)
for name, num in code.items():
    province_ratio_dict = {'省份': provinces}
    for t in time_slots:
        print(name, t)
        # The key URL. To crawl departures instead, set type=move_out.
        url = ('http://huiyan.baidu.com/migration/provincerank.jsonp?'
               'dt=province&id={}&type=move_in&date={}'.format(num, t))
        head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                              'AppleWebKit/537.36 (KHTML, like Gecko) '
                              'Chrome/74.0.3729.169 Safari/537.36'}
        req = urllib.request.Request(url, headers=head)
        html = urllib.request.urlopen(req).read().decode('unicode_escape')
        if html.startswith("cb"):  # strip the JSONP wrapper cb(...)
            html = html[3:-1]
        data = json.loads(html)["data"]
        province_ratio_temp = {item['province_name']: item['value']
                               for item in data['list']}
        # Fill provinces missing from a day's ranking with 0 so every date
        # column has the same rows and all dates fit in one sheet.
        ratios = [province_ratio_temp.get(p, 0) for p in provinces]
        province_ratio_dict[str(t)] = ratios
    res = pd.DataFrame(province_ratio_dict)
    res.to_excel(excel_writer=work, sheet_name=name)
work.close()
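When the crawler finishes, move_in.xlsx holds one sheet per origin province: a 省份 column of destination provinces plus one column per date. A quick sketch for loading everything back with pandas (sheet_name=None returns a dict of DataFrames keyed by sheet name):

import pandas as pd

# Load every per-province sheet from move_in.xlsx in one call.
sheets = pd.read_excel('move_in.xlsx', sheet_name=None, index_col=0)
print(sheets['北京'].head())  # sheet names are the keys of get_code_city()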
Migration scale crawler
# coding:utf-8
import json
import urllib.request

import pandas as pd


def get_code_city():
    # Same province-to-code mapping as in the move-in crawler.
    code_str = """北京|110000,天津|120000,广西壮族自治区|450000,内蒙古自治区|150000,宁夏回族自治区|640000,新疆维吾尔自治区|650000,西藏自治区|540000,上海|310000,浙江|330000,重庆|500000,安徽|340000,福建|350000,甘肃|620000,广东|440000,贵州|520000,海南|460000,河北|130000,黑龙江|230000,河南|410000,湖北|420000,湖南|430000,江苏|320000,江西|360000,吉林|220000,辽宁|210000,青海|630000,山东|370000,山西|140000,陕西|610000,四川|510000,云南|530000"""
    code_dict = {}
    for mapping in code_str.split(","):
        name, number = mapping.split("|")
        code_dict[name] = number
    return code_dict


data_type = "scale"
work = pd.ExcelWriter('{}.xlsx'.format(data_type))
# historycurve.jsonp returns the whole curve in one response, so the date
# parameter can stay fixed and a single request per province is enough.
t = 20200115
time_slots = (list(range(20200115, 20200132))
              + list(range(20200201, 20200230))
              + list(range(20200301, 20200316)))
code = get_code_city()
province_scale_dict = {'时间': time_slots}  # date column
for name, num in code.items():
    url = ('http://huiyan.baidu.com/migration/historycurve.jsonp?'
           'dt=province&id={}&type=move_in&date={}'.format(num, t))
    head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/74.0.3729.169 Safari/537.36'}
    req = urllib.request.Request(url, headers=head)
    html = urllib.request.urlopen(req).read().decode('unicode_escape')
    if html.startswith("cb"):  # strip the JSONP wrapper cb(...)
        html = html[3:-1]
    data = json.loads(html)["data"]
    # data["list"] maps date strings to scale values; keep only the dates in
    # time_slots so every province column lines up with the 时间 column.
    scales = [value for date, value in data["list"].items()
              if int(date) in time_slots]
    province_scale_dict[name] = scales
res = pd.DataFrame(province_scale_dict)
res.to_excel(excel_writer=work)
work.close()
The scale crawler's results are saved in scale.xlsx.
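The scale workbook is a single sheet: a 时间 column of dates plus one column per province, keyed by the names in get_code_city. A small sketch for pulling one province's curve back out, assuming the crawl completed without gaps:

import pandas as pd

# Read the single-sheet workbook and index it by the 时间 (date) column.
df = pd.read_excel('scale.xlsx', index_col=0).set_index('时间')
print(df['湖北'])  # migration-scale index for Hubei over the crawl window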
Summary
That covers the full crawlers for Baidu Migration move-in population and migration scale; hopefully they help you solve the problems you run into.