Published: 2023/12/29

University News Scraping and Analysis, Baidu News Edition: Data Cleaning and Parsing

Tips:

  • The code in this article is written in Python 3
  • Code repository
  • Uses re to parse the scraped data

Preface

In the previous article, we successfully built the request URLs and fetched the university news data.

Now we will clean and parse the returned data, extracting four fields: the news title, the news source, the news time, and the "more news" link.

Recall the HTML fragment containing the news information:

<div class="result title" id="1">
  &#8226;&nbsp;
  <h3 class="c-title"><a href="http://www.thepaper.cn/newsDetail_forward_8438609" data-click="{'f0':'77A717EA','f1':'9F63F1E4','f2':'4CA6DE6E','f3':'54E5243F','t':'1595752997'}" target="_blank">中國科協“老科學家學術成長資料采集工程”<em>北京大學</em>聯合采集啟動...</a> </h3>
  <div class="c-title-author">澎湃新聞&nbsp;&nbsp; 7小時前&nbsp;&nbsp;<a href="/ns?word=title%3A%28%E5%8C%97%E4%BA%AC%E5%A4%A7%E5%AD%A6%29+cont:1782421700&amp;same=2&amp;cl=1&amp;tn=newstitle&amp;rn=30&amp;fm=sd" class="c-more_link" data-click="{'fm':'sd'}">查看更多相關新聞&gt;&gt;</a> </div>
</div>
<div class="result title" id="2">
  &#8226;&nbsp;
  <h3 class="c-title"><a href="http://sc.people.com.cn/n2/2020/0726/c345509-34183157.html" data-click="{'f0':'77A717EA','f1':'9F63F1E4','f2':'4CA6DD6E','f3':'54E5243F','t':'1595752997'}" target="_blank"><em>北京大學</em>、清華大學等17所高校在川招生行程安排和咨詢地址、電話...</a> </h3>
  <div class="c-title-author">人民網四川站&nbsp;&nbsp; 9小時前</div>
</div>

Implementing data cleaning

In the previous article we saved the fetched response to an HTML file for further processing and analysis. When extracting useful information from HTML, the styles and JavaScript are generally not needed, so we can delete them up front to reduce noise, and compress the file by removing redundant whitespace and line breaks.

Removing script, style, and redundant whitespace

import re

with open('test.html', 'r', encoding='utf-8') as f:
    data = f.read()

# Remove <script> blocks
data = re.sub(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', '', data, flags=re.I)
# Remove <style> blocks
data = re.sub(r'<\s*style[^>]*>[^<]*<\s*/\s*style\s*>', '', data, flags=re.I)
# Remove redundant whitespace and line breaks
data = re.sub(r'\s{2,}', '', data)

The two steps above remove most of the styles and JavaScript, but some script blocks survive: the [^<]* in the pattern fails whenever the script body itself contains a '<'. Those can be removed with the loop below, which finds a remaining <script> opening tag, finds the matching closing tag, and cuts out that slice, repeating until none are left.

while data.find('<script>') != -1:
    start_index = data.find('<script>')
    end_index = data.find('</script>')
    if end_index == -1:
        break
    data = data[:start_index] + data[end_index + len('</script>'):]
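To sanity-check the two-stage cleanup, the sketch below packages both passes into a helper and runs them on a small hand-made fragment; the clean_html wrapper and the sample HTML are illustrative, not part of the original script:

```python
import re

def clean_html(data: str) -> str:
    """Apply the regex pass, whitespace compression, and the fallback loop."""
    # Regex pass: only matches script/style bodies containing no '<'
    data = re.sub(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', '', data, flags=re.I)
    data = re.sub(r'<\s*style[^>]*>[^<]*<\s*/\s*style\s*>', '', data, flags=re.I)
    # Collapse runs of two or more whitespace characters
    data = re.sub(r'\s{2,}', '', data)
    # Fallback loop: cuts out <script> blocks whose body contains '<'
    while data.find('<script>') != -1:
        start = data.find('<script>')
        end = data.find('</script>')
        if end == -1:
            break
        data = data[:start] + data[end + len('</script>'):]
    return data

sample = ('<div>hi</div>  <script>if (a < b) { f(); }</script>'
          '<style>p{color:red}</style><p>ok</p>')
print(clean_html(sample))  # → <div>hi</div><p>ok</p>
```

Note that the style block and the double space fall to the regex pass, while the script block survives it (its body contains '<') and is only removed by the fallback loop.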

刪除樣式及js后壓縮的html片段:

<div><div class="result title" id="1">
  &#8226;&nbsp;
  <h3 class="c-title"><a href="https://3g.163.com/news/article/FIG1TC630514B1BP.html"data-click="{'f0':'77A717EA','f1':'9F63F1E4','f2':'4CA6DD6E','f3':'54E5243F','t':'1595773708'}"target="_blank">在宣漢揭牌!<em>北京大學</em>文化產業博士后創新實踐基地成立</a> </h3>
  <div class="c-title-author">網易&nbsp;&nbsp;3小時前&nbsp;&nbsp;<a href="/ns?word=title%3A%28%E5%8C%97%E4%BA%AC%E5%A4%A7%E5%AD%A6%29+cont:2675916080&same=2&cl=1&tn=newstitle&rn=30&fm=sd" class="c-more_link" data-click="{'fm':'sd'}">查看更多相關新聞>></a> </div>
</div>
<div class="result title" id="2">
  &#8226;&nbsp;
  <h3 class="c-title"><a href="https://3g.163.com/news/article/FIFOP7BP05501E5I.html"data-click="{'f0':'77A717EA','f1':'9F63F1E4','f2':'4CA6DD6E','f3':'54E5243F','t':'1595773708'}"target="_blank">衡陽市“人大講壇”再次開講!聽<em>北京大學</em>法學博士說“貪”</a> </h3>
  <div class="c-title-author">網易&nbsp;&nbsp;5小時前</div>
</div>

Implementing data parsing

1. Extracting the news HTML fragments

After removing the styles and JS and compressing, the HTML is fairly regular. What remains to be done:

  • Use a regex to extract the string containing all the news items
  • Remove the surplus trailing string
  • Split out each news item's HTML fragment with string operations

The code is as follows:

# Extract the string to be parsed
wait_parse_string = re.search(r'<div id="content_left">([\s\S]*)<div id="gotoPage">', data)
if wait_parse_string:
    wait_parse_string = wait_parse_string.group()
    # Remove the surplus trailing string
    wait_parse_string = re.sub(r'</div></div><div id="gotoPage">', '', wait_parse_string)
    flag = '<div class="result title"'
    flag_length = len(flag)
    # Walk the string and split out each individual news item
    news_list = []
    while wait_parse_string.find(flag) != -1:
        start_index = wait_parse_string.find(flag)
        end_index = wait_parse_string[start_index + flag_length:].find(flag)
        if end_index > 0:
            end_index = start_index + end_index + flag_length
        else:
            end_index = len(wait_parse_string)
        news_list.append(wait_parse_string[start_index:end_index])
        wait_parse_string = wait_parse_string[end_index:]
    print(news_list)
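The flag-based splitting loop can be packaged as a small function and checked against a toy fragment; the two-item demo string below is made up purely for illustration:

```python
def split_news(fragment: str, flag: str = '<div class="result title"') -> list:
    """Split a cleaned HTML fragment into per-item chunks at each flag occurrence."""
    items = []
    flag_len = len(flag)
    while fragment.find(flag) != -1:
        start = fragment.find(flag)
        # Look for the next flag strictly after the current one
        nxt = fragment[start + flag_len:].find(flag)
        if nxt > 0:
            end = start + flag_len + nxt
        else:
            # Last item: take everything to the end of the string
            end = len(fragment)
        items.append(fragment[start:end])
        fragment = fragment[end:]
    return items

demo = '<div class="result title" id="1">A</div><div class="result title" id="2">B</div>'
print(split_news(demo))  # two chunks, one per news item
```

Searching for the next flag only after skipping the current one is what prevents the loop from matching the same opening tag twice.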

Partial result:

['<div class="result title" id="1">\n&#8226;&nbsp;\n<h3 class="c-title"><a href="https://3g.163.com/news/article/FIG1TC630514B1BP.html"data-click="{\'f0\':\'77A717EA\',\'f1\':\'9F63F1E4\',\'f2\':\'4CA6DD6E\',\'f3\':\'54E5243F\',\'t\':\'1595773708\'}"target="_blank">在宣漢揭牌!<em>北京大學</em>文化產業博士后創新實踐基地成立</a>\n</h3>\n<div class="c-title-author">網易&nbsp;&nbsp;3小時前&nbsp;&nbsp;<a href="/ns?word=title%3A%28%E5%8C%97%E4%BA%AC%E5%A4%A7%E5%AD%A6%29+cont:2675916080&same=2&cl=1&tn=newstitle&rn=30&fm=sd" class="c-more_link" data-click="{\'fm\':\'sd\'}">查看更多相關新聞>></a>\n</div>\n</div>']

2. Extracting the fields

With each item split out in the previous step, we parse further and extract the needed fields with regexes. The code is as follows:

from datetime import datetime, timedelta

def time_convert(time_string: str):
    """Normalize a date string.
    :param time_string:
    :return:
    """
    if not time_string:
        return ''
    if '分鐘前' in time_string:
        minute = re.search(r'\d+', time_string)
        if minute:
            minute = minute.group()
            now = datetime.now() - timedelta(minutes=int(minute))
        else:
            now = datetime.now()
        return now.strftime('%Y-%m-%d')
    elif '小時前' in time_string:
        hour = re.search(r'\d+', time_string)
        if hour:
            hour = hour.group()
            now = datetime.now() - timedelta(hours=int(hour))
        else:
            now = datetime.now()
        return now.strftime('%Y-%m-%d')
    else:
        try:
            parse_time = datetime.strptime(time_string, '%Y年%m月%d日 %H:%M')
            return parse_time.strftime('%Y-%m-%d')
        except Exception:
            now = datetime.now()
            return now.strftime('%Y-%m-%d')

news_data = []
for news in news_list:
    temp = {
        "news_key": '北京大學',
        "news_title": '',
        'news_link': '',
        'news_author': '',
        'news_time': '',
        'more_link': '',
    }
    # Parse the link
    news_link = re.search(r'<a\s*href="(\S+)"\s*data-click', news, re.I)
    if news_link:
        temp["news_link"] = news_link.group(1)
    # Parse the title
    news_title = re.search(r'target="_blank">([\d\D]*)(</a>\s*</h3>)', news, re.I)
    if news_title:
        temp["news_title"] = news_title.group(1)
    # Parse the publisher and time
    author_time = re.search(r'<div class="c-title-author">(\S+)&nbsp;&nbsp;((\d+分鐘前)|(\d+小時前)|(\d+年\d+月\d+日 \d+:\d+))', news, re.I)
    if author_time:
        temp["news_author"] = author_time.group(1)
        temp["news_time"] = time_convert(author_time.group(2))
    # Parse the "more related news" link
    more_link = re.search(r'<a\s*href="(\S+)"\s*class="c-more_link"', news, re.I)
    if more_link:
        temp["more_link"] = more_link.group(1)
    news_data.append(temp)
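As a quick check of the three field regexes, here they are run against a single hand-built fragment; the URL and titles are invented for the demo, and only the patterns themselves come from the code above:

```python
import re

# Hand-built single news fragment; the URL and text are hypothetical
sample = ('<div class="result title" id="1"><h3 class="c-title">'
          '<a href="https://example.com/news/1" data-click="{}" '
          'target="_blank">Demo <em>北京大學</em> headline</a> </h3>'
          '<div class="c-title-author">網易&nbsp;&nbsp;3小時前</div></div>')

news_link = re.search(r'<a\s*href="(\S+)"\s*data-click', sample, re.I)
news_title = re.search(r'target="_blank">([\d\D]*)(</a>\s*</h3>)', sample, re.I)
author_time = re.search(
    r'<div class="c-title-author">(\S+)&nbsp;&nbsp;'
    r'((\d+分鐘前)|(\d+小時前)|(\d+年\d+月\d+日 \d+:\d+))',
    sample, re.I)

print(news_link.group(1))    # https://example.com/news/1
print(news_title.group(1))   # Demo <em>北京大學</em> headline (em tags kept)
print(author_time.group(1), author_time.group(2))  # 網易 3小時前
```

Note that the greedy [\d\D]* in the title pattern is only safe because each fragment holds a single item, which is why the per-item split in step 1 must come first; the <em> tags around the keyword also survive and would need stripping if plain text is wanted.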

Partial result:

[{'news_key': '北京大學', 'news_title': '在宣漢揭牌!<em>北京大學</em>文化產業博士后創新實踐基地成立', 'news_link': 'https://3g.163.com/news/article/FIG1TC630514B1BP.html', 'news_author': '網易', 'news_time': '2020-07-26', 'more_link': '/ns?word=title%3A%28%E5%8C%97%E4%BA%AC%E5%A4%A7%E5%AD%A6%29+cont:2675916080&same=2&cl=1&tn=newstitle&rn=30&fm=sd'},]

3. Storing the data

After scraping and parsing, we now have fairly regular data. We just need to write it to a database or a file with a fixed set of fields. Here we store it in the non-relational database MongoDB, which requires the pymongo package (install it with pip install pymongo).

import pymongo

# Build the database connection
mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % ("root", '123456', "127.0.0.1", "27017", "school_news_analysis")
conn = pymongo.MongoClient(mongo_uri)
db = conn.school_news_analysis
school_news = db.school_news
insert_data = list()
for item in news_data:
    # Deduplicate by link; replace the existing document on conflict
    insert_data.append(pymongo.ReplaceOne({'news_link': item['news_link']}, item, upsert=True))
school_news.bulk_write(insert_data)
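The ReplaceOne upsert keyed on news_link deduplicates by replacing an existing document with the same link and inserting otherwise. The same semantics can be illustrated in memory without a running MongoDB; upsert_by_link below is a stand-in for the bulk write, not part of pymongo:

```python
def upsert_by_link(store: dict, records: list) -> dict:
    """Insert-or-replace each record keyed on its news_link, mirroring
    ReplaceOne({'news_link': ...}, item, upsert=True)."""
    for item in records:
        store[item['news_link']] = item
    return store

store = {}
upsert_by_link(store, [{'news_link': 'a', 'news_title': 'old'}])
upsert_by_link(store, [{'news_link': 'a', 'news_title': 'new'},
                       {'news_link': 'b', 'news_title': 'other'}])
print(len(store), store['a']['news_title'])  # 2 new
```

Re-running the scraper therefore refreshes existing rows instead of accumulating duplicates.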

4. Complete code

import re
from datetime import datetime, timedelta

import pymongo

with open('test.html', 'r', encoding='utf-8') as f:
    data = f.read()

# Remove <script> blocks
data = re.sub(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', '', data, flags=re.I)
# Remove <style> blocks
data = re.sub(r'<\s*style[^>]*>[^<]*<\s*/\s*style\s*>', '', data, flags=re.I)
# Remove redundant whitespace and line breaks
data = re.sub(r'\s{2,}', '', data)

while data.find('<script>') != -1:
    start_index = data.find('<script>')
    end_index = data.find('</script>')
    if end_index == -1:
        break
    data = data[:start_index] + data[end_index + len('</script>'):]

with open('test1.html', 'w', encoding='utf-8') as f:
    f.write(data)

# Extract the string to be parsed
wait_parse_string = re.search(r'<div id="content_left">([\s\S]*)<div id="gotoPage">', data)
news_list = []
if wait_parse_string:
    wait_parse_string = wait_parse_string.group()
    # Remove the surplus trailing string
    wait_parse_string = re.sub(r'</div></div><div id="gotoPage">', '', wait_parse_string)
    flag = '<div class="result title"'
    flag_length = len(flag)
    # Walk the string and split out each individual news item
    while wait_parse_string.find(flag) != -1:
        start_index = wait_parse_string.find(flag)
        end_index = wait_parse_string[start_index + flag_length:].find(flag)
        if end_index > 0:
            end_index = start_index + end_index + flag_length
        else:
            end_index = len(wait_parse_string)
        news_list.append(wait_parse_string[start_index:end_index])
        wait_parse_string = wait_parse_string[end_index:]
print(news_list)

def time_convert(time_string: str):
    """Normalize a date string.
    :param time_string:
    :return:
    """
    if not time_string:
        return ''
    if '分鐘前' in time_string:
        minute = re.search(r'\d+', time_string)
        if minute:
            minute = minute.group()
            now = datetime.now() - timedelta(minutes=int(minute))
        else:
            now = datetime.now()
        return now.strftime('%Y-%m-%d')
    elif '小時前' in time_string:
        hour = re.search(r'\d+', time_string)
        if hour:
            hour = hour.group()
            now = datetime.now() - timedelta(hours=int(hour))
        else:
            now = datetime.now()
        return now.strftime('%Y-%m-%d')
    else:
        try:
            parse_time = datetime.strptime(time_string, '%Y年%m月%d日 %H:%M')
            return parse_time.strftime('%Y-%m-%d')
        except Exception:
            now = datetime.now()
            return now.strftime('%Y-%m-%d')

news_data = []
for news in news_list:
    temp = {
        "news_key": '北京大學',
        "news_title": '',
        'news_link': '',
        'news_author': '',
        'news_time': '',
        'more_link': '',
    }
    # Parse the link
    news_link = re.search(r'<a\s*href="(\S+)"\s*data-click', news, re.I)
    if news_link:
        temp["news_link"] = news_link.group(1)
    # Parse the title
    news_title = re.search(r'target="_blank">([\d\D]*)(</a>\s*</h3>)', news, re.I)
    if news_title:
        temp["news_title"] = news_title.group(1)
    # Parse the publisher and time
    author_time = re.search(r'<div class="c-title-author">(\S+)&nbsp;&nbsp;((\d+分鐘前)|(\d+小時前)|(\d+年\d+月\d+日 \d+:\d+))', news, re.I)
    if author_time:
        temp["news_author"] = author_time.group(1)
        temp["news_time"] = time_convert(author_time.group(2))
    # Parse the "more related news" link
    more_link = re.search(r'<a\s*href="(\S+)"\s*class="c-more_link"', news, re.I)
    if more_link:
        temp["more_link"] = more_link.group(1)
    news_data.append(temp)
print(news_data)

# Build the database connection
mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % ("root", '123456', "127.0.0.1", "27017", "school_news_analysis")
conn = pymongo.MongoClient(mongo_uri)
db = conn.school_news_analysis
school_news = db.school_news
insert_data = list()
for item in news_data:
    # Deduplicate by link; replace the existing document on conflict
    insert_data.append(pymongo.ReplaceOne({'news_link': item['news_link']}, item, upsert=True))
school_news.bulk_write(insert_data)

Summary

  • We have basically completed scraping and storing the Baidu News results for universities
  • The code still needs to be wrapped into functions for easier reuse; you can follow the code repository for updates
  • Next, we will scrape the universities' basic information
