
[Python crawler] Scheduled Job-Posting System (1): Scraping Postings with BeautifulSoup and Storing Them in MySQL

Published: 2024/5/28


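This series describes how to crawl job postings with Python, keeping only postings dated today, save them to a database, set up a scheduled task that runs the crawler every day, and finally call a library from Python to send the results to a phone by SMS.

I recently looked into scheduled database backups and combined that idea with a crawler for this small experiment. The approach is single-machine and fairly dated, but it works, and the idea is reasonably novel. The series has five parts, one article each:

1. Crawl job postings with Python and store them in a MySQL database;
2. Use the pyinstaller package to bundle the .py file into an .exe executable;
3. Configure a Windows scheduled task that runs the crawler .exe every morning;
4. Build a simple front-end page for the postings in PHP (the graduation projects I supervise are PHP systems);
5. Supplementary: call an SMS modem from Python to send job postings to a client's phone (still researching).

The article is basic but fun. I hope it helps; if it contains errors or shortcomings, please bear with me.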

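I. Results

The crawl target is the Zhaopin job site: http://sou.zhaopin.com/

The crawled results are stored in the MySQL database as shown in the figure below. Note that only postings dated April 22 appear.

The program output and the saved TXT file are shown below: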

II. The BeautifulSoup Crawler in Detail

The complete code is as follows:

# -*- coding: utf-8 -*-
"""
Created on 2017-04-22 15:10

@author: Eastmount
"""
import urllib2
import re
from bs4 import BeautifulSoup
import codecs
import MySQLdb
import os
import datetime


# Store one record in the database.
# Parameters: job title, company name, monthly salary, work location,
#             posting date, job link
def DatabaseInfo(zwmc, gsmc, zwyx, gzdd, gxsj, zwlj):
    conn = None
    cur = None
    try:
        conn = MySQLdb.connect(host='localhost', user='root',
                               passwd='123456', port=3306, db='eastmount')
        cur = conn.cursor()  # database cursor

        # Avoids: UnicodeEncodeError: 'latin-1' codec can't encode character
        conn.set_character_set('utf8')
        cur.execute('SET NAMES utf8;')
        cur.execute('SET CHARACTER SET utf8;')
        cur.execute('SET character_set_connection=utf8;')

        # SQL statement for the Zhaopin (zlzp) table
        sql = '''insert into eastmount_zlzp
                    (zwmc,gsmc,zwyx,gzdd,gxsj,zwlj)
                values(%s, %s, %s, %s, %s, %s)'''
        cur.execute(sql, (zwmc, gsmc, zwyx, gzdd, gxsj, zwlj))
        conn.commit()
        print 'Database insert succeeded'

    # Exception handling
    except MySQLdb.Error, e:
        print "Mysql Error %d: %s" % (e.args[0], e.args[1])
    finally:
        # Close only what was actually opened, so a failed connect
        # does not raise a second error here
        if cur is not None:
            cur.close()
        if conn is not None:
            conn.close()


# Crawler function
def crawl(url):
    page = urllib2.urlopen(url)
    contents = page.read()
    soup = BeautifulSoup(contents, "html.parser")
    print u'Guiyang Java job postings: job title \t company name \t monthly salary \t work location \t posting date \n'
    infofile.write(u"Guiyang Java job postings: job title \t company name \t monthly salary \t work location \t posting date \r\n")
    print u'Crawled information:\n'

    i = 0
    for tag in soup.find_all(attrs={"class": "newlist"}):
        # print tag.get_text()
        i = i + 1

        # Job title
        zwmc = tag.find(attrs={"class": "zwmc"}).get_text()
        zwmc = zwmc.replace('\n', '')
        print zwmc

        # Job link
        url_info = tag.find(attrs={"class": "zwmc"}).find_all("a")
        # print url_info
        # url_info.get(href) fails with:
        # AttributeError: 'ResultSet' object has no attribute 'get'
        zwlj = ''  # stays empty if the cell contains no <a>
        for u in url_info:
            zwlj = u.get('href')
            print zwlj

        # Company name
        gsmc = tag.find(attrs={"class": "gsmc"}).get_text()
        gsmc = gsmc.replace('\n', '')
        print gsmc

        # Another way to locate a node with find_all:
        # <td class="zwyx">8000-16000</td>
        zz = tag.find_all('td', {"class": "zwyx"})
        print zz

        # Monthly salary
        zwyx = tag.find(attrs={"class": "zwyx"}).get_text()
        zwyx = zwyx.replace('\n', '')
        print zwyx

        # Work location
        gzdd = tag.find(attrs={"class": "gzdd"}).get_text()
        gzdd = gzdd.replace('\n', '')
        print gzdd

        # Posting date
        gxsj = tag.find(attrs={"class": "gxsj"}).get_text()
        gxsj = gxsj.replace('\n', '')
        print gxsj

        # Get today's date and decide whether to write to the file
        now_time = datetime.datetime.now().strftime('%m-%d')  # %Y-%m-%d
        # print now_time
        if now_time == gxsj:
            print u'Writing to file'
            infofile.write(u"[Job title]" + zwmc + "\r\n")
            infofile.write(u"[Company name]" + gsmc + "\r\n")
            infofile.write(u"[Monthly salary]" + zwyx + "\r\n")
            infofile.write(u"[Work location]" + gzdd + "\r\n")
            infofile.write(u"[Posting date]" + gxsj + "\r\n")
            infofile.write(u"[Job link]" + zwlj + "\r\n\r\n")
        else:
            print u'Date does not match; current date: ', now_time

        #####################################
        # Key step: write to the MySQL database
        #####################################
        if now_time == gxsj:
            print u'Writing to database'
            DatabaseInfo(zwmc, gsmc, zwyx, gzdd, gxsj, zwlj)

        print '\n\n'

    else:
        # for-else: runs once after the loop finishes
        print u'Total postings crawled', i


# Main entry point
if __name__ == '__main__':
    infofile = codecs.open("Result_ZP.txt", 'a', 'utf-8')

    # Page through the results, calling crawl(url) for each page
    i = 1
    while i <= 2:
        print u'Page', i
        url = 'http://sou.zhaopin.com/jobs/searchresult.ashx?in=160400&jl=%E8%B4%B5%E9%98%B3&kw=java&p=' + str(i) + '&isadv=0'
        crawl(url)
        infofile.write("###########################\r\n\r\n\r\n")
        i = i + 1

    infofile.close()

Installing BeautifulSoup is shown in the figure below; running pip install bs4 is all it takes.

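The key step is analyzing the DOM tree of the Zhaopin pages.

1. Analyze the URL

The URL is: http://sou.zhaopin.com/jobs/searchresult.ashx?in=160400&jl=%E8%B4%B5%E9%98%B3&kw=java&p=2&isadv=0

Here "in=160400" selects the industry category "computer software" (several categories can be selected); "jl=%E8%B4%B5%E9%98%B3" is the URL-encoded work location, Guiyang; "kw=java" selects Java-related positions; and "p=2" is the page number, which the main block steps through in a loop.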
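The query string described above can also be assembled from its parts instead of by string concatenation. The following is a minimal Python 3 sketch (the original script targets Python 2); the helper name build_search_url and its default arguments are hypothetical, and the parameter meanings are simply the ones given in the text.

```python
from urllib.parse import urlencode

BASE_URL = "http://sou.zhaopin.com/jobs/searchresult.ashx"


def build_search_url(page, industry="160400", city="\u8d35\u9633", keyword="java"):
    """Build the search URL for one results page (hypothetical helper)."""
    params = [
        ("in", industry),  # industry category
        ("jl", city),      # work location (Guiyang); percent-encoded as UTF-8
        ("kw", keyword),   # job keyword
        ("p", page),       # page number
        ("isadv", 0),
    ]
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Same two pages the original main block iterates over
    for page in (1, 2):
        print(build_search_url(page))
```

Building the URL this way keeps the page number as the only thing that changes between requests and leaves the percent-encoding of the city name to the standard library.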

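2. Analyze the DOM tree nodes

Right-click in the browser and choose Inspect Element. Each job row in the HTML is its own <table></table> whose class is newlist.

Core code: for tag in soup.find_all(attrs={"class":"newlist"}):

Once that node is located, the individual fields are extracted, assigned to variables, and stored in the MySQL database.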

3. Analyze the individual fields

The job title is obtained as follows:

zwmc = tag.find(attrs={"class":"zwmc"}).get_text()
print zwmc

Another snippet, using find_all, prints the node itself rather than its text:

zz = tag.find_all('td', {"class":"zwyx"})
print zz
#<td class="zwyx">8000-16000</td>

The corresponding HTML DOM tree is shown in the figure below.
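To try the node lookups without hitting the live site, the sketch below runs the same find and find_all calls on a small inline HTML sample. It is written for Python 3 with bs4 installed; the sample markup, the example.com link, and the helper name parse_rows are assumptions made for illustration, and the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample row; the real page markup may differ.
SAMPLE_HTML = """
<table class="newlist">
  <tr>
    <td class="zwmc"><a href="http://example.com/job/1">Java Engineer</a></td>
    <td class="gsmc">Example Co.</td>
    <td class="zwyx">8000-16000</td>
    <td class="gzdd">Guiyang</td>
    <td class="gxsj">04-22</td>
  </tr>
</table>
"""


def parse_rows(html):
    """Return one dict per element whose class is newlist."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tag in soup.find_all(attrs={"class": "newlist"}):
        title_cell = tag.find(attrs={"class": "zwmc"})
        link = title_cell.find("a")  # find returns one Tag or None
        rows.append({
            "zwmc": title_cell.get_text().strip(),
            "zwlj": link.get("href") if link else "",
            "zwyx": tag.find("td", {"class": "zwyx"}).get_text().strip(),
            "gxsj": tag.find(attrs={"class": "gxsj"}).get_text().strip(),
        })
    return rows


if __name__ == "__main__":
    for row in parse_rows(SAMPLE_HTML):
        print(row)
```

Note the difference the original comments point out: find returns a single Tag (so .get and .get_text work on it), while find_all returns a ResultSet that has to be iterated.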

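4. Only records dated today are saved to the TXT file and to MySQL. This makes the later steps easier: crawl the latest postings on a daily schedule, then send them to a phone by SMS. I have to admire my own brain for this one, haha~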
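The date check in step 4 can be isolated into a small function, which also makes it testable with a fixed date. This is a Python 3 sketch under the assumption, taken from the script, that the posting date is a string in MM-DD form; the names is_posted_today and filter_today are hypothetical.

```python
import datetime


def is_posted_today(gxsj, today=None):
    """Return True when a MM-DD posting date equals today's date."""
    if today is None:
        today = datetime.date.today()
    return gxsj.strip() == today.strftime("%m-%d")


def filter_today(records, today=None):
    """Keep only the records whose gxsj field matches today's date."""
    return [r for r in records if is_posted_today(r["gxsj"], today)]


if __name__ == "__main__":
    sample = [
        {"zwmc": "Java Engineer", "gxsj": "04-22"},
        {"zwmc": "Java Intern", "gxsj": "04-21"},
    ]
    # Pin the date so the result does not depend on when this runs
    print(filter_today(sample, datetime.date(2017, 4, 22)))
```

Because the comparison is a plain string match, a posting whose date is displayed in any other form would be treated as not posted today.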

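See my earlier posts below; the official BeautifulSoup documentation is also recommended.

Python crawler column: Selenium + BeautifulSoup + PhantomJS
[python knowledge] Crawler basics: installing the BeautifulSoup library and a brief introduction
[python crawler] Scraping Douban Top 250 movie information: BeautifulSoup vs. Selenium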

III. Database Operations

The SQL statement that creates the table is as follows:

CREATE TABLE `eastmount_zlzp` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `zwmc` varchar(100) COLLATE utf8_bin DEFAULT NULL COMMENT 'job title',
  `gsmc` varchar(50) COLLATE utf8_bin DEFAULT NULL COMMENT 'company name',
  `zwyx` varchar(50) COLLATE utf8_bin DEFAULT NULL COMMENT 'monthly salary',
  `gzdd` varchar(50) COLLATE utf8_bin DEFAULT NULL COMMENT 'work location',
  `gxsj` varchar(50) COLLATE utf8_bin DEFAULT NULL COMMENT 'posting date',
  `zwlj` varchar(50) COLLATE utf8_bin DEFAULT NULL COMMENT 'job link',
  `info` varchar(200) COLLATE utf8_bin DEFAULT NULL COMMENT 'details',
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

The resulting table is shown in the figure below:

For calling MySQL from Python, I recommend this earlier post:

[python] Topic 9: MySQL database programming basics

The core code is as follows:

# coding:utf-8
import MySQLdb

try:
    conn = MySQLdb.connect(host='localhost', user='root', passwd='123456', port=3306, db='test01')
    cur = conn.cursor()

    # Insert data
    sql = '''insert into student values(%s, %s, %s)'''
    cur.execute(sql, ('yxz', '111111', '10'))

    # Read the data back
    print u'\nInserted data:'
    cur.execute('select * from student')
    for data in cur.fetchall():
        print '%s %s %s' % data

    cur.close()
    conn.commit()
    conn.close()
except MySQLdb.Error, e:
    print "Mysql Error %d: %s" % (e.args[0], e.args[1])
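The insert-then-query flow can be tried without a running MySQL server by swapping in Python's built-in sqlite3 module. To be clear, this is a stand-in, not the MySQLdb setup used in this article: the schema is simplified, sqlite3 uses ? placeholders where MySQLdb uses %s, and the helper name save_record and the sample values are hypothetical. The sketch is written for Python 3.

```python
import sqlite3

# Simplified version of the eastmount_zlzp table (SQLite types, no collation)
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS eastmount_zlzp (
    ID   INTEGER PRIMARY KEY AUTOINCREMENT,
    zwmc TEXT, gsmc TEXT, zwyx TEXT, gzdd TEXT, gxsj TEXT, zwlj TEXT
)
"""

# Parameterized insert; values are passed separately, never formatted in
INSERT_SQL = """
INSERT INTO eastmount_zlzp (zwmc, gsmc, zwyx, gzdd, gxsj, zwlj)
VALUES (?, ?, ?, ?, ?, ?)
"""


def save_record(conn, record):
    """Insert one record tuple; commit on success, roll back on error."""
    try:
        conn.execute(INSERT_SQL, record)
        conn.commit()
    except sqlite3.Error:
        conn.rollback()
        raise


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute(CREATE_SQL)
    save_record(conn, ("Java Engineer", "Example Co.", "8000-16000",
                       "Guiyang", "04-22", "http://example.com/job/1"))
    for row in conn.execute("SELECT zwmc, gxsj FROM eastmount_zlzp"):
        print(row)
    conn.close()
```

Committing inside the try block, right after the insert, means a failed statement is rolled back rather than committed during cleanup.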
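I will keep exploring and keep writing: once this single-machine scheduled-sending feature is finished, I plan to look into Python server-side features. Finally, I hope this article helps you; if it contains errors or shortcomings, please bear with me~

I've been very busy, but being busy is good for a young person: more experience, more tempering, more reflection. I do my studying after work and while keeping my goddess company; I really do have a wonderful partner. Huzi is in town for the provincial exam, so I'm having dinner with them tonight. Life really is a marvelous thing. Yesterday, after working overtime, I walked a long way to give my goddess a 91 gift and a jigsaw puzzle, and it made me quite happy. Life, teaching, programming, love: to close, here is a poem I wrote recently, each line of which is a recent story.

Wind and snow together, the rain swirling,
Zither and lute in harmony, tears in mottled streaks.
Under stars and moon, tossing through dreams,
Na and Zhang, until our hair turns white, love unbroken.

I'm also planning to write a Python book for my goddess. Nothing is settled yet; the only requirements are her name on it and her support.

(By: Eastmount, 2017-04-22, 4 p.m., http://blog.csdn.net/eastmount/)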
