[Daily] A scraper for the data tables of the China Statistical Yearbook (《中国统计年鉴》) and the China Financial Yearbook (《中国金融年鉴》) (with all Excel files for 1985-2020)
Preface
Finals have been keeping me busy lately, so I'm just posting a resource that may be useful: all Excel tables from the China Financial Yearbook (1986-2019) and the China Statistical Yearbook (1981-2020). The data were scraped from CNKI (the specific source URLs are given in the body below). So far nobody online offers the complete run of yearbook data from 1986 to the present; what is available usually covers only particular years, and is paid at that.
Link: https://pan.baidu.com/s/13fjrInmjjxaNQRgS_Jv91w  Extraction code: k5ir
If all you need are the files, grab them above; feel free to skip the rambling in the postscript.
Table of Contents
- Preface
- 1 Scraping the Excel data of the China Statistical Yearbook and the China Financial Yearbook
- 2 Notes on the scraper and how to process the downloaded Excel tables with a script
- Postscript (some recent thoughts)
1 Scraping the Excel data of the China Statistical Yearbook and the China Financial Yearbook
- The fish are already provided above; what follows is the fishing. Fair warning: the fishing is not easy to learn, and I'd suggest crawling the site once yourself to find out where the pitfalls are.
- That said, Part 2 briefly walks through the details of the scraper.
I'm in the middle of the hardest finals ever, so I'll simply drop the scraper script that grabs every Excel table of the China Statistical Yearbook and the China Financial Yearbook across the years:
# -*- coding: utf-8 -*-
# @author: caoyang
# @email: caoyang@163.sufe.edu.cn

import os
import re
import time
import requests

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.action_chains import ActionChains

from bs4 import BeautifulSoup


def get_cookie(url):
    # Drive a headless Firefox to the given page and serialize its cookies
    # into a single "name=value; ..." string suitable for a request header.
    options = webdriver.FirefoxOptions()
    options.add_argument("--headless")
    driver = webdriver.Firefox(options=options)
    driver.get(url)
    cookies = driver.get_cookies()
    driver.quit()

    def _cookie_to_string(cookies):
        string = ''
        for cookie in cookies:
            string += '{}={}; '.format(cookie['name'], cookie['value'])
        return string.strip()

    return _cookie_to_string(cookies)


def download_chinese_statistical_yearbook(ybcode='N2020100004', year='2020', save_root='csyb', is_initial=True, ignore_caj=True):
    with open('system_csyb.log', 'w') as f:
        pass
    headers = {'User-Agent': 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0'}
    query_url = 'https://data.cnki.net/Yearbook/PartialGetCatalogResult'
    excel_url = 'https://data.cnki.net/{}'.format
    caj_url = 'https://data.cnki.net/download/GetCajUrl'
    regex = r'<[^>]+>'
    cookies = get_cookie('https://data.cnki.net/trade/Yearbook/Single/{}?z=Z016'.format(ybcode))
    compiler = re.compile(regex, re.S)
    regular_interval = 15    # pause between two downloads
    reset_interval = 300     # pause after a failure before retrying

    if not os.path.exists(save_root):
        os.mkdir(save_root)
    # year = ybcode[1:5]
    target_path = os.path.join(save_root, year)
    if not os.path.exists(target_path):
        os.mkdir(target_path)
    with open(os.path.join(target_path, 'log.txt'), 'w') as f:
        pass

    # First request: find how many catalog pages this volume has.
    formdata = {'ybcode': ybcode, 'entrycode': '', 'page': '1', 'pagerow': '20'}
    response = requests.post(query_url, data=formdata, headers=headers)
    html = response.text
    soup = BeautifulSoup(html, 'lxml')
    span = soup.find('span', class_='s_p_listl')
    for link in span.find_all('a'):
        onclick = link.attrs.get('onclick')
        if onclick is not None:
            lindex = onclick.find('\'')
            rindex = onclick.find('\'', lindex + 1)
            n_pages = int(onclick[lindex + 1:rindex])
            break

    with open('system_csyb.log', 'a') as f:
        f.write('Processing year {}...\t{}\n'.format(year, time.strftime('%Y-%m-%d %H:%M:%S')))
    print('Processing year {}...'.format(year))
    with open('system_csyb.log', 'a') as f:
        f.write('{} pages in total\t{}\n'.format(n_pages, time.strftime('%Y-%m-%d %H:%M:%S')))
    print('{} pages in total'.format(n_pages))

    for page in range(1, n_pages + 1):
        with open('system_csyb.log', 'a') as f:
            f.write('  - page {}..\t{}\n'.format(page, time.strftime('%Y-%m-%d %H:%M:%S')))
        print('  - page {}..'.format(page))
        if page != 1:
            formdata = {'ybcode': ybcode, 'entrycode': '', 'page': str(page), 'pagerow': '20'}
            while True:
                try:
                    response = requests.post(query_url, data=formdata, headers=headers)
                    break
                except:
                    with open('system_csyb.log', 'a') as f:
                        f.write('    failed to fetch page...\t{}\n'.format(time.strftime('%Y-%m-%d %H:%M:%S')))
                    print('    failed to fetch page...')
                    time.sleep(reset_interval)
        html = response.text
        soup = BeautifulSoup(html, 'lxml')
        table = soup.find('table')
        for tr in table.find_all('tr'):
            tds = tr.find_all('td')
            assert len(tds) == 3
            title = compiler.sub('', str(tds[0])).replace('\n', '').replace('\t', '').replace(' ', '').replace('\r', '')
            page_range = compiler.sub('', str(tds[1])).replace('\n', '').replace('\t', '').replace(' ', '')
            for _link in tds[2].find_all('a'):
                href = _link.attrs['href']
                if href.startswith('/download/excel'):  # excel
                    filecode = href[href.find('=') + 1:]
                    while True:
                        _headers = headers.copy()
                        _headers['Cookie'] = cookies
                        try:
                            with open('system_csyb.log', 'a') as f:
                                f.write('    + downloading {}...\t{}\n'.format(title, time.strftime('%Y-%m-%d %H:%M:%S')))
                            print('    + downloading {}...'.format(title))
                            response = requests.get(excel_url(href), headers=_headers)
                            print('      ' + str(response.status_code))
                            try:
                                html = response.text
                                soup = BeautifulSoup(html, 'lxml')
                                # If the "download" is actually the platform homepage, the cookie has expired.
                                if str(soup.find('title').string) == '中國經濟社會大數據研究平臺':
                                    with open('system_csyb.log', 'a') as f:
                                        f.write('      resetting cookie...\t{}\n'.format(time.strftime('%Y-%m-%d %H:%M:%S')))
                                    print('      resetting cookie...')
                                    cookies = get_cookie('https://data.cnki.net/trade/Yearbook/Single/{}?z=Z016'.format(ybcode))
                                else:
                                    break
                            except:
                                break
                        except:
                            with open('system_csyb.log', 'a') as f:
                                f.write('      failed...\t{}\n'.format(time.strftime('%Y-%m-%d %H:%M:%S')))
                            print('      failed...')
                            time.sleep(reset_interval)
                            cookies = get_cookie('https://data.cnki.net/trade/Yearbook/Single/{}?z=Z016'.format(ybcode))
                    time.sleep(regular_interval)
                    with open(os.path.join(target_path, '{}.xls'.format(filecode)), 'wb') as f:
                        f.write(response.content)
                    with open(os.path.join(target_path, 'log.txt'), 'a') as f:
                        f.write('{}\t{}\t{}.xls\n'.format(title, page_range, filecode))
                else:  # caj
                    if ignore_caj:
                        continue
                    filecode = _link.attrs['fn']
                    pagerange = _link.attrs['pg']
                    disk = _link.attrs['disk']
                    _formdata = {'filecode': filecode, 'pagerange': pagerange, 'disk': disk}
                    while True:
                        _headers = headers.copy()
                        _headers['Cookie'] = cookies
                        try:
                            with open('system_csyb.log', 'a') as f:
                                f.write('    + fetching resource link for {}...\t{}\n'.format(title, time.strftime('%Y-%m-%d %H:%M:%S')))
                            print('    + fetching resource link for {}...'.format(title))
                            response = requests.post(caj_url, headers=_headers, data=_formdata)
                            break
                        except:
                            with open('system_csyb.log', 'a') as f:
                                f.write('      failed...\t{}\n'.format(time.strftime('%Y-%m-%d %H:%M:%S')))
                            print('      failed...')
                            time.sleep(reset_interval)
                            cookies = get_cookie('https://data.cnki.net/trade/Yearbook/Single/{}?z=Z016'.format(ybcode))
                    resource_url = response.json()['url']
                    while True:
                        try:
                            with open('system_csyb.log', 'a') as f:
                                f.write('    + downloading {}...\t{}\n'.format(title, time.strftime('%Y-%m-%d %H:%M:%S')))
                            print('    + downloading {}...'.format(title))
                            response = requests.get(resource_url, headers=headers)
                            if str(response.status_code) == '200':
                                break
                            else:
                                with open('system_csyb.log', 'a') as f:
                                    f.write('      resetting cookie...\t{}\n'.format(time.strftime('%Y-%m-%d %H:%M:%S')))
                                print('      resetting cookie...')
                                time.sleep(reset_interval)
                                cookies = get_cookie('https://data.cnki.net/trade/Yearbook/Single/{}?z=Z016'.format(ybcode))
                        except:
                            with open('system_csyb.log', 'a') as f:
                                f.write('      failed...\t{}\n'.format(time.strftime('%Y-%m-%d %H:%M:%S')))
                            print('      failed...')
                            time.sleep(regular_interval)
                            cookies = get_cookie('https://data.cnki.net/trade/Yearbook/Single/{}?z=Z016'.format(ybcode))
                    time.sleep(regular_interval)
                    with open(os.path.join(target_path, '{}.caj'.format(filecode)), 'wb') as f:
                        f.write(response.content)
                    with open(os.path.join(target_path, 'log.txt'), 'a') as f:
                        f.write('{}\t{}\t{}.caj\n'.format(title, page_range, filecode))

    # Find urls of year: collect the ybcodes of all other volumes of this yearbook.
    if is_initial:
        url = 'https://data.cnki.net/trade/Yearbook/Single/{}?z=Z016'.format(ybcode)
        response = requests.get(url, headers=headers)
        html = response.text
        soup = BeautifulSoup(html, 'lxml')
        div = soup.find('div', class_='s_year clearfix')
        links = []
        ybcodes = []
        for link in div.find_all('a'):
            class_ = link.attrs.get('class')
            if class_ is None:  # not current
                href = link.attrs.get('href')
                ybcode = href.split('/')[-1].split('?')[0]
                links.append(href)
                ybcodes.append(ybcode)
        with open('ybcode_csyb.txt', 'w') as f:
            for ybcode in ybcodes:
                f.write(f'{ybcode}\n')
        # for ybcode in ybcodes:
        #     download_chinese_statistical_yearbook(ybcode=ybcode, is_initial=False)


def download_chinese_financial_yearbook(ybcode='N2020070552', year='2019', save_root='cfyb', is_initial=True, ignore_caj=True):
    with open('system_cfyb.log', 'w') as f:
        pass
    headers = {'User-Agent': 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0'}
    query_url = 'https://data.cnki.net/Yearbook/PartialGetCatalogResult'
    excel_url = 'https://data.cnki.net/{}'.format
    caj_url = 'https://data.cnki.net/download/GetCajUrl'
    regex = r'<[^>]+>'
    cookies = get_cookie('https://data.cnki.net/trade/Yearbook/Single/{}?z=Z016'.format(ybcode))
    compiler = re.compile(regex, re.S)
    regular_interval = 15    # pause between two downloads
    reset_interval = 300     # pause after a failure before retrying

    if not os.path.exists(save_root):
        os.mkdir(save_root)
    # year = ybcode[1:5]
    target_path = os.path.join(save_root, year)
    if not os.path.exists(target_path):
        os.mkdir(target_path)
    with open(os.path.join(target_path, 'log.txt'), 'w') as f:
        pass

    # First request: find how many catalog pages this volume has.
    formdata = {'ybcode': ybcode, 'entrycode': '', 'page': '1', 'pagerow': '20'}
    response = requests.post(query_url, data=formdata, headers=headers)
    html = response.text
    soup = BeautifulSoup(html, 'lxml')
    span = soup.find('span', class_='s_p_listl')
    for link in span.find_all('a'):
        onclick = link.attrs.get('onclick')
        if onclick is not None:
            lindex = onclick.find('\'')
            rindex = onclick.find('\'', lindex + 1)
            n_pages = int(onclick[lindex + 1:rindex])
            break

    with open('system_cfyb.log', 'a') as f:
        f.write('Processing year {}...\t{}\n'.format(year, time.strftime('%Y-%m-%d %H:%M:%S')))
    print('Processing year {}...'.format(year))
    with open('system_cfyb.log', 'a') as f:
        f.write('{} pages in total\t{}\n'.format(n_pages, time.strftime('%Y-%m-%d %H:%M:%S')))
    print('{} pages in total'.format(n_pages))

    for page in range(1, n_pages + 1):
        with open('system_cfyb.log', 'a') as f:
            f.write('  - page {}..\t{}\n'.format(page, time.strftime('%Y-%m-%d %H:%M:%S')))
        print('  - page {}..'.format(page))
        if page != 1:
            formdata = {'ybcode': ybcode, 'entrycode': '', 'page': str(page), 'pagerow': '20'}
            while True:
                try:
                    response = requests.post(query_url, data=formdata, headers=headers)
                    break
                except:
                    with open('system_cfyb.log', 'a') as f:
                        f.write('    failed to fetch page...\t{}\n'.format(time.strftime('%Y-%m-%d %H:%M:%S')))
                    print('    failed to fetch page...')
                    time.sleep(reset_interval)
        html = response.text
        soup = BeautifulSoup(html, 'lxml')
        table = soup.find('table')
        for tr in table.find_all('tr'):
            tds = tr.find_all('td')
            assert len(tds) == 3
            title = compiler.sub('', str(tds[0])).replace('\n', '').replace('\t', '').replace(' ', '').replace('\r', '')
            page_range = compiler.sub('', str(tds[1])).replace('\n', '').replace('\t', '').replace(' ', '')
            for _link in tds[2].find_all('a'):
                href = _link.attrs['href']
                if href.startswith('/download/excel'):  # excel
                    filecode = href[href.find('=') + 1:]
                    while True:
                        _headers = headers.copy()
                        _headers['Cookie'] = cookies
                        try:
                            with open('system_cfyb.log', 'a') as f:
                                f.write('    + downloading {}...\t{}\n'.format(title, time.strftime('%Y-%m-%d %H:%M:%S')))
                            print('    + downloading {}...'.format(title))
                            response = requests.get(excel_url(href), headers=_headers)
                            print('      ' + str(response.status_code))
                            try:
                                html = response.text
                                soup = BeautifulSoup(html, 'lxml')
                                # If the "download" is actually the platform homepage, the cookie has expired.
                                if str(soup.find('title').string) == '中國經濟社會大數據研究平臺':
                                    with open('system_cfyb.log', 'a') as f:
                                        f.write('      resetting cookie...\t{}\n'.format(time.strftime('%Y-%m-%d %H:%M:%S')))
                                    print('      resetting cookie...')
                                    cookies = get_cookie('https://data.cnki.net/trade/Yearbook/Single/{}?z=Z016'.format(ybcode))
                                else:
                                    break
                            except:
                                break
                        except:
                            with open('system_cfyb.log', 'a') as f:
                                f.write('      failed...\t{}\n'.format(time.strftime('%Y-%m-%d %H:%M:%S')))
                            print('      failed...')
                            time.sleep(reset_interval)
                            cookies = get_cookie('https://data.cnki.net/trade/Yearbook/Single/{}?z=Z016'.format(ybcode))
                    time.sleep(regular_interval)
                    with open(os.path.join(target_path, '{}.xls'.format(filecode)), 'wb') as f:
                        f.write(response.content)
                    with open(os.path.join(target_path, 'log.txt'), 'a') as f:
                        f.write('{}\t{}\t{}.xls\n'.format(title, page_range, filecode))
                else:  # caj
                    if ignore_caj:
                        continue
                    filecode = _link.attrs['fn']
                    pagerange = _link.attrs['pg']
                    disk = _link.attrs['disk']
                    _formdata = {'filecode': filecode, 'pagerange': pagerange, 'disk': disk}
                    while True:
                        _headers = headers.copy()
                        _headers['Cookie'] = cookies
                        try:
                            with open('system_cfyb.log', 'a') as f:
                                f.write('    + fetching resource link for {}...\t{}\n'.format(title, time.strftime('%Y-%m-%d %H:%M:%S')))
                            print('    + fetching resource link for {}...'.format(title))
                            response = requests.post(caj_url, headers=_headers, data=_formdata)
                            break
                        except:
                            with open('system_cfyb.log', 'a') as f:
                                f.write('      failed...\t{}\n'.format(time.strftime('%Y-%m-%d %H:%M:%S')))
                            print('      failed...')
                            time.sleep(reset_interval)
                            cookies = get_cookie('https://data.cnki.net/trade/Yearbook/Single/{}?z=Z016'.format(ybcode))
                    resource_url = response.json()['url']
                    while True:
                        try:
                            with open('system_cfyb.log', 'a') as f:
                                f.write('    + downloading {}...\t{}\n'.format(title, time.strftime('%Y-%m-%d %H:%M:%S')))
                            print('    + downloading {}...'.format(title))
                            response = requests.get(resource_url, headers=headers)
                            if str(response.status_code) == '200':
                                break
                            else:
                                with open('system_cfyb.log', 'a') as f:
                                    f.write('      resetting cookie...\t{}\n'.format(time.strftime('%Y-%m-%d %H:%M:%S')))
                                print('      resetting cookie...')
                                time.sleep(reset_interval)
                                cookies = get_cookie('https://data.cnki.net/trade/Yearbook/Single/{}?z=Z016'.format(ybcode))
                        except:
                            with open('system_cfyb.log', 'a') as f:
                                f.write('      failed...\t{}\n'.format(time.strftime('%Y-%m-%d %H:%M:%S')))
                            print('      failed...')
                            time.sleep(regular_interval)
                            cookies = get_cookie('https://data.cnki.net/trade/Yearbook/Single/{}?z=Z016'.format(ybcode))
                    time.sleep(regular_interval)
                    with open(os.path.join(target_path, '{}.caj'.format(filecode)), 'wb') as f:
                        f.write(response.content)
                    with open(os.path.join(target_path, 'log.txt'), 'a') as f:
                        f.write('{}\t{}\t{}.caj\n'.format(title, page_range, filecode))

    # Find urls of year: collect the ybcodes of all other volumes of this yearbook.
    if is_initial:
        url = 'https://data.cnki.net/trade/Yearbook/Single/{}?z=Z016'.format(ybcode)
        response = requests.get(url, headers=headers)
        html = response.text
        soup = BeautifulSoup(html, 'lxml')
        div = soup.find('div', class_='s_year clearfix')
        links = []
        ybcodes = []
        for link in div.find_all('a'):
            class_ = link.attrs.get('class')
            if class_ is None:  # not current
                href = link.attrs.get('href')
                ybcode = href.split('/')[-1].split('?')[0]
                links.append(href)
                ybcodes.append(ybcode)
        with open('ybcode_cfyb.txt', 'w') as f:
            for ybcode in ybcodes:
                f.write(f'{ybcode}\n')
        for ybcode in ybcodes:
            download_chinese_financial_yearbook(ybcode=ybcode, is_initial=False)


if __name__ == '__main__':
    '''
    with open('ybcode_csyb.txt', 'r') as f:
        lines = f.read().splitlines()
    for line in lines:
        ybcode, year = line.split()
        # if int(year) > 1999:
        #     continue
        download_chinese_statistical_yearbook(ybcode=ybcode, year=year, is_initial=False)
    '''
    with open('ybcode_cfyb.txt', 'r') as f:
        lines = f.read().splitlines()
    for line in lines:
        ybcode, year = line.split()  # each line is expected to hold "<ybcode> <year>"
        # if int(year) > 1994:
        #     continue
        download_chinese_financial_yearbook(ybcode=ybcode, year=year, save_root='cfyb', is_initial=False, ignore_caj=True)

The data come from CNKI; the sources this scraper pulls from are:
- China Financial Yearbook (中國金融年鑒) @ CNKI
- China Statistical Yearbook (中國統計年鑒) @ CNKI
Because my campus network has CNKI download privileges, this scraper is of little use to anyone without such access. Note that before each crawl, Selenium drives a browser to fetch cookies, which is what grants download permission. CNKI cookies expire quickly, and once they do, every "download" returns the HTML of the CNKI homepage instead of a file, so the script has to monitor cookie validity and call Selenium again to obtain fresh cookies whenever they go stale. Also, crawl as slowly as you can: the code waits 15 seconds between downloads. Taking the China Statistical Yearbook as an example, there are 39 volumes as of this writing and each volume has roughly 600 Excel tables, so a full download takes about two to three days.
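For readers who only want the gist of that cookie check, here is a minimal sketch isolated from the script above; the function and parameter names are my own illustrative choices, and the refresh callback stands in for the script's get_cookie helper:

# Minimal sketch of the cookie-validity loop: if the "download" response turns
# out to be the platform homepage, the cookie has expired and a caller-supplied
# refresh function is asked for a new one before retrying.
import time
import requests
from bs4 import BeautifulSoup

def download_with_cookie_refresh(url, headers, cookies, refresh_cookie, retry_interval=15):
    while True:
        _headers = dict(headers, Cookie=cookies)
        response = requests.get(url, headers=_headers)
        soup = BeautifulSoup(response.text, 'lxml')
        title = soup.find('title')
        # A stale cookie yields the CNKI data platform homepage instead of a file,
        # so the <title> tag is enough to detect it.
        if title is None or str(title.string) != '中國經濟社會大數據研究平臺':
            return response  # looks like real file content, not an HTML page
        cookies = refresh_cookie()  # e.g. lambda: get_cookie(single_page_url)
        time.sleep(retry_interval)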
2 Notes on the scraper and how to process the downloaded Excel tables with a script
Since my time is limited I won't go into the overall design of the scraper; here are just a few details (pitfalls):
With ignore_caj=True, the code automatically skips the CAJ download links. Downloading a CAJ is actually a bit more involved: you first have to send a POST request to obtain the resource URL, whereas the Excel download link is written directly into the page source. I chose not to download the CAJ files for two reasons. First, they are not very useful, since all the useful data are already in the Excel files. More importantly, the POST request for the CAJ resource URL easily invalidates the cookie, forcing frequent cookie refreshes and wasting a lot of time, whereas the Excel download link is very stable: a single cookie usually lasts a whole day.
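For reference, this is the two-step CAJ flow condensed out of the full script above (the parameters follow the fn / pg / disk attributes read from the catalog page); it is a sketch, not a standalone tool:

# A POST to the GetCajUrl endpoint returns JSON whose 'url' field points at the
# actual file, which is then fetched with an ordinary GET.
import requests

def download_caj(filecode, pagerange, disk, headers, cookies):
    caj_url = 'https://data.cnki.net/download/GetCajUrl'
    _headers = dict(headers, Cookie=cookies)
    formdata = {'filecode': filecode, 'pagerange': pagerange, 'disk': disk}
    resource_url = requests.post(caj_url, headers=_headers, data=formdata).json()['url']
    return requests.get(resource_url, headers=headers).content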
The scraper itself uses only requests; the Selenium browser driver is used solely to refresh cookies, so resource usage stays modest.
Note that the page source at the links above contains an <iframe> tag, so the download links cannot be found directly in the page source; you need to capture the network traffic to obtain the URL of the content loaded inside the <iframe>.
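I used packet capture for this, but as an alternative sketch, and assuming the frame exposes its address via a src attribute, a headless browser can list candidate frame URLs directly:

# List the src of every <iframe> on the yearbook page (ybcode here is just an example).
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.FirefoxOptions()
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)
driver.get('https://data.cnki.net/trade/Yearbook/Single/N2020100004?z=Z016')
for frame in driver.find_elements(By.TAG_NAME, 'iframe'):
    print(frame.get_attribute('src'))  # candidate URLs for the embedded catalog
driver.quit()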
One issue: every Excel table obtained from the sources above is protected. In practice this only means the files cannot be edited; they still open fine (read-only) in Office or WPS. To remove the protection you would have to enter a password under the Review menu.
This doesn't really matter for normal use, since there is no need to modify these tables, but both pandas.read_excel and xlrd.open_workbook raise errors on them.
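A quick way to see the failure for yourself on one of the downloaded files; the path below is only an example (the file code is taken from the snippet further down, the folder is whatever you used as save_root):

# Try the two readers on a protected .xls and print whatever exception they raise.
import pandas
import xlrd

path = 'cfyb/2019/N2018070031000595.xls'
try:
    pandas.read_excel(path)
except Exception as exc:
    print('pandas.read_excel failed:', exc)
try:
    xlrd.open_workbook(path)
except Exception as exc:
    print('xlrd.open_workbook failed:', exc)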
I haven't tested how openpyxl.load_workbook behaves on the protected files, simply because openpyxl only reads .xlsx files. With this many yearbook Excel tables, copying out a whole class of data by hand rather than with a script would be foolish, and the workable approach I eventually found is the DispatchEx method from the win32com.client module:
from win32com.client import DispatchEx

excel = DispatchEx('Excel.Application')
demo = excel.Workbooks.Open('N2018070031000595.xls')
sheet = demo.WorkSheets(1)
print(sheet.Cells(1, 1).Value)
demo.Close(True)

Under the hood this drives the Office application to do the reading, and an Excel process shows up in Task Manager, so closing it with demo.Close is essential; otherwise the machine can easily run out of memory from too many open Excel instances. The drawback is that the module offers no ready-made methods for structured data, so values have to be read cell by cell through sheet.Cells, which means rather more code.
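Because a crash between Open() and Close() leaves an Excel process behind, a try/finally wrapper is a somewhat safer way to read individual cells. This is a small convenience sketch layered on the snippet above, not part of the original script:

# Read one cell from the first worksheet, always closing the workbook and
# quitting the Excel application even if something goes wrong in between.
from win32com.client import DispatchEx

def read_cell(path, row, column):
    excel = DispatchEx('Excel.Application')
    try:
        workbook = excel.Workbooks.Open(path)
        try:
            return workbook.WorkSheets(1).Cells(row, column).Value
        finally:
            workbook.Close(True)
    finally:
        excel.Quit()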
I mainly needed the per-province financial and economic tables, pulling out loan balances, deposit balances, GDP, price indices, and so on by province. The script is as follows:
# -*- coding: utf-8 -*-
# @author: caoyang
# @email: caoyang@163.sufe.edu.cn

import re
import os
import sys
import time

import xlrd
import numpy
import pandas
import openpyxl

from win32com.client import DispatchEx

# Global constants
CURRENT_DIR = os.getcwd()
CFYB_DIR = 'cfyb'
CSYB_DIR = 'csyb'
TEMP_DIR = 'temp'
LOG_FILE = 'log.txt'
PROVINCE = ['北京', '天津', '上海', '重慶', '河北', '山西',
            '遼寧', '吉林', '黑龍江', '江蘇', '浙江', '安徽',
            '福建', '江西', '山東', '河南', '湖北', '湖南',
            '廣東', '海南', '四川', '貴州', '云南', '陜西',
            '甘肅', '青海', '臺灣', '內蒙古', '廣西', '西藏',
            '寧夏', '新疆', '香港', '澳門']
INT_COMPILER = re.compile(r'[^\d]')        # keep digits only
FLOAT_COMPILER = re.compile(r'[^\d | .]')  # keep digits and decimal points only


# Extract per-province financial and economic indicators, mainly loan balances.
def get_loan_by_province():

    def _get_province_by_title(_title):
        # Infer the province from the table title.
        for _province in PROVINCE:
            if _province in _title:
                return _province

    def _format_cell_value(_cell_value, dtype=str):
        # Normalize a cell value: None stays None; otherwise strip spaces and
        # unwanted characters according to the requested dtype.
        if str(_cell_value) == 'None':
            return None
        if dtype == int:
            return INT_COMPILER.sub('', str(_cell_value).replace(' ', ''))
        if dtype == float:
            return FLOAT_COMPILER.sub('', str(_cell_value).replace(' ', ''))
        return str(_cell_value).replace(' ', '')

    def _get_dataframe_from_sheet(_sheet):
        _data_dict = {
            'year': [],                # year
            'gdp': [],                 # gross regional product
            'cpi': [],                 # price index
            'deposit': [],             # total deposit balance
            'individual_deposit': [],  # individual deposit balance
            'unit_deposit': [],        # unit (corporate) deposit balance
            'finance_deposit': [],     # fiscal deposit balance
            'loan': [],                # total loan balance
            'short_loan': [],          # short-term loan balance
            'long_loan': [],           # long-term loan balance
        }
        _flags = {  # one flag per field: True means the field has not been found yet
            'year': True,
            'gdp': True,
            'cpi': True,
            'deposit': True,
            'individual_deposit': True,
            'unit_deposit': True,
            'finance_deposit': True,
            'loan': True,
            'short_loan': True,
            'long_loan': True,
        }
        _row = 0
        _MAX_ROW = 100
        while _row < _MAX_ROW:  # scan the sheet row by row
            _row += 1
            _cell_value = _format_cell_value(_sheet.Cells(_row, 1))  # first column of the current row
            if _cell_value is None:  # skip empty cells
                continue
            if _flags['year'] and '項目' in _cell_value:  # year: the header row gives the years
                print('year: ' + _cell_value)
                _flags['year'] = False
                _column = 1
                while True:  # walk the header row to collect the years
                    _column += 1
                    _year_string = _format_cell_value(_sheet.Cells(_row, _column), dtype=int)
                    if _year_string is None:
                        break
                    _data_dict['year'].append(_year_string)
                _num_year = len(_data_dict['year'])
                continue
            if _flags['gdp'] and '生產總值' in _cell_value:  # gdp
                if _flags['year']:
                    # Some tables lack a header row; fall back to the row above the GDP row for the years.
                    _flags['year'] = False
                    _column = 1
                    while True:
                        _column += 1
                        _year_string = _format_cell_value(_sheet.Cells(_row - 1, _column), dtype=float)
                        if _year_string is None:
                            break
                        _data_dict['year'].append(_year_string)
                    _num_year = len(_data_dict['year'])
                print('gdp: ' + _cell_value)
                _flags['gdp'] = False
                for _column in range(2, 2 + _num_year):
                    _gdp_string = _format_cell_value(_sheet.Cells(_row, _column), dtype=float)
                    _data_dict['gdp'].append(_gdp_string)
                continue
            if _flags['cpi'] and '價格指數' in _cell_value:  # cpi
                print('cpi: ' + _cell_value)
                _flags['cpi'] = False
                for _column in range(2, 2 + _num_year):
                    _cpi_string = _format_cell_value(_sheet.Cells(_row, _column), dtype=float)
                    _data_dict['cpi'].append(_cpi_string)
                continue
            if _flags['deposit'] and ('存款' in _cell_value and '銀行' in _cell_value):  # total deposits
                print('deposit: ' + _cell_value)
                _flags['deposit'] = False
                for _column in range(2, 2 + _num_year):
                    _deposit_string = _format_cell_value(_sheet.Cells(_row, _column), dtype=float)
                    _data_dict['deposit'].append(_deposit_string)
                continue
            if _flags['individual_deposit'] and ('存款' in _cell_value and ('城鄉' in _cell_value or '儲蓄' in _cell_value)):  # individual deposits
                print('individual deposit: ' + _cell_value)
                _flags['individual_deposit'] = False
                for _column in range(2, 2 + _num_year):
                    _individual_deposit_string = _format_cell_value(_sheet.Cells(_row, _column), dtype=float)
                    _data_dict['individual_deposit'].append(_individual_deposit_string)
                continue
            if _flags['unit_deposit'] and ('存款' in _cell_value and ('企' in _cell_value or '單位' in _cell_value)):  # unit deposits
                print('unit deposit: ' + _cell_value)
                _flags['unit_deposit'] = False
                for _column in range(2, 2 + _num_year):
                    _unit_deposit_string = _format_cell_value(_sheet.Cells(_row, _column), dtype=float)
                    _data_dict['unit_deposit'].append(_unit_deposit_string)
                continue
            if _flags['finance_deposit'] and ('存款' in _cell_value and ('財政' in _cell_value)):  # fiscal deposits
                print('finance deposit: ' + _cell_value)
                _flags['finance_deposit'] = False
                for _column in range(2, 2 + _num_year):
                    _finance_deposit_string = _format_cell_value(_sheet.Cells(_row, _column), dtype=float)
                    _data_dict['finance_deposit'].append(_finance_deposit_string)
                continue
            if _flags['loan'] and ('貸款' in _cell_value and '銀行' in _cell_value):  # total loans
                print('loan: ' + _cell_value)
                _flags['loan'] = False
                for _column in range(2, 2 + _num_year):
                    _loan_string = _format_cell_value(_sheet.Cells(_row, _column), dtype=float)
                    _data_dict['loan'].append(_loan_string)
                continue
            if _flags['short_loan'] and ('貸款' in _cell_value and ('短期' in _cell_value or '流動' in _cell_value)):  # short-term loans
                print('short loan: ' + _cell_value)
                _flags['short_loan'] = False
                for _column in range(2, 2 + _num_year):
                    _short_loan_string = _format_cell_value(_sheet.Cells(_row, _column), dtype=float)
                    _data_dict['short_loan'].append(_short_loan_string)
                continue
            if _flags['long_loan'] and ('貸款' in _cell_value and ('長期' in _cell_value or '固定' in _cell_value)):  # long-term loans
                print('long loan: ' + _cell_value)
                _flags['long_loan'] = False
                for _column in range(2, 2 + _num_year):
                    _long_loan_string = _format_cell_value(_sheet.Cells(_row, _column), dtype=float)
                    _data_dict['long_loan'].append(_long_loan_string)
                continue
        for key in _flags:  # fields that were never found are filled with None
            if _flags[key]:
                for _ in range(_num_year):
                    _data_dict[key].append(None)
        _dataframe = pandas.DataFrame(_data_dict, columns=list(_data_dict.keys()))
        print(_flags)
        # print(_data_dict)
        # print(_dataframe)
        return _dataframe

    application = DispatchEx('Excel.Application')  # start the Excel application
    for year in os.listdir(os.path.join(CURRENT_DIR, CFYB_DIR)):  # every downloaded volume of the financial yearbook
        print(f'======={year}=======')
        # if year < 2017:
        #     continue
        log_df = pandas.read_csv(os.path.join(CFYB_DIR, year, LOG_FILE), header=None, sep='\t', encoding='gbk')  # the crawl-time log
        log_df.columns = ['title', 'pagerange', 'filename']  # add a header to the log table
        result_df = log_df[log_df['title'].map(lambda x: sum([province in x for province in PROVINCE]) > 0 and ('主要金融經濟統計' in x or '主要經濟金融統計' in x))].reset_index(drop=True)  # keep only the per-province tables
        dataframes = []
        if result_df.shape[0] == 0:
            continue
        for i in range(result_df.shape[0]):  # iterate over the selected tables
            title = result_df.loc[i, 'title']
            filename = result_df.loc[i, 'filename']
            province = _get_province_by_title(title)
            print(title, filename)
            excel = application.Workbooks.Open(os.path.join(CURRENT_DIR, CFYB_DIR, year, filename))
            sheet = excel.WorkSheets(1)  # first worksheet
            dataframe = _get_dataframe_from_sheet(sheet)
            dataframe['province'] = province
            excel.Close(True)
            dataframes.append(dataframe)
        df_concat = pandas.concat(dataframes, ignore_index=True)
        df_concat.to_csv(f'loan_by_area_{year}.csv', index=False, header=True, sep='\t')


if __name__ == '__main__':
    get_loan_by_province()

This script is very rigid; much of the logic is hard-coded, because the table layout differs from year to year and even across provinces within the same year, so the data are genuinely hard to locate. Take it as a reference only.
Postscript (some recent thoughts)
Counting up my faults, I have done quite a few things wrong lately.
After talking it over a bit with my mom, I felt that since we had run into each other yesterday, I should have just said hello, openly and simply. But face to face it still felt awkward, and the greeting wouldn't quite come out; I've never been good at greetings, least of all in a situation like that.
Honestly, SSS has changed a lot: longer hair, dyed a bit. If I hadn't seen her face straight on, I probably wouldn't have recognized her at a glance.
Two years from now, everyone at this school who is going to leave will have left, and by then there will be nothing left for me to dwell on. The road through a PhD is a long night; maybe it's better to walk it somewhat alone, or it only gets longer. Who knows?
Keeping a diary and running are the two most important faiths in my life: the former lets me remember myself, the latter lets me forget myself. Drifting between memory and forgetting is its own way of living.
This is my tenth year of keeping a diary. All of it together comes to roughly a million characters, a dozen or so volumes lined up neatly on my desk.
When I put pen to paper for the first entry on December 31, 2011, I could hardly have imagined keeping it up this long. But for me, once I choose to stick with something, I will probably stick with it until I no longer can, because I feel nothing is worth giving up as long as it can still be held on to.
Or perhaps I simply no longer want to hold on to anything; or, more precisely, I no longer dare to.