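Preface
Once you reach a certain age, your parents may start pressing you to find a girlfriend and get married.
Most parents push for marriage because they are getting older and want someone to look after you: someone to make sure you have a hot meal, to care for you when you are sick, and to keep you from being pushed around out in the world.
Of course, some of it comes from peer pressure. When nearly everyone your age is already married, your parents feel it too. It is the same mindset that makes parents sign their kids up for after-school tutoring.
Often, in your parents' eyes, seeing you settled with a family of your own means one of their major duties is done. Until then, an unmarried son or daughter without a home stays a knot in their hearts, and that feeling is especially strong in small towns.
Once they have seen you married, they no longer need to work themselves to the bone earning money the way they used to.
So pressure to marry is, first, your parents' concern for you.
Second, it is your parents' self-interest (though sometimes that self-interest is forced on them).
Third, it marks the end of their child-rearing duty, after which they can start to enjoy life.
So today the author is going to scrape a dating site and see what young women look for in a partner.
Given this blogger's age,
the filter is:
Chongqing, aged 21 to 27, never married.
What the older women look for does not concern me.
If the technical part does not interest you, scroll down to the conclusions.
Technical part
Choosing a site
The information returned by 世紀佳緣 (Jiayuan) is shown in the figure. It says almost nothing about partner requirements, so I dropped that site.
Jiayuan scraping code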
```python
import json
import threading

import pandas as pd
import requests
from requests.exceptions import RequestException

# Search URL; %s is replaced by the page number.
SEARCH_URL = 'http://search.jiayuan.com/v2/search_v2.php?key=&sex=f&stc=2:18.24,3:155.170,23:1&sn=default&sv=1&p=%s&f=select'

HEADERS = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36',
    # Cookie copied from a logged-in browser session.
    'Cookie': 'guider_quick_search=on; SESSION_HASH=0ea6881596be6958acab86601f12c97fdc211b1d; jy_refer=www.baidu.com; accessID=20210215105936945652; user_access=1; Qs_lvt_336351=1613358012; _gscu_1380850711=133580687kjmry14; _gscbrs_1380850711=1; COMMON_HASH=03ccf3f907328da89142987423a9215b; stadate1=271766541; myloc=50%7C5001; myage=25; mysex=m; myuid=271766541; myincome=40; Qs_pv_336351=1107463009737048200%2C2408036345389375500%2C1494836557490850800%2C3408124253653422600%2C1396723355418865400; PHPSESSID=3699194bbb0a1fb7c7f3c46c813f162c; pop_avatar=1; PROFILE=272766541%3A%25E6%2580%25BB%25E8%25A3%2581%25E4%25BD%2599%3Am%3Aimages1.jyimg.com%2Fw4%2Fglobal%2Fi%3A0%3A%3A1%3Azwzp_m.jpg%3A1%3A1%3A50%3A10%3A3; main_search:272766541=%7C%7C%7C00; RAW_HASH=PBalPtMnGoSGsXuyDvb3BTznuvyG8MajCm%2AWrcDW%2Av1YkfseTjLUbLLCCHeQJ0B25bjAa%2Ak4IbveQI5X4uzQhvvD3qbP6ajy90MEyOpZDZzznTM.; is_searchv2=1; pop_1557218166=1613364302492; pop_time=1613363558012',
}

# Fields copied from each entry of the response, in CSV column order.
FIELDS = ['uid', 'nickname', 'sex', 'age', 'work_location', 'height',
          'education', 'matchCondition', 'marriage', 'income', 'shortnote', 'image']

# Serializes CSV appends so rows from different threads do not interleave.
CSV_LOCK = threading.Lock()


def get_page(url):
    """
    Parameter:
        url: URL of the target page
    Returns:
        the body of the target page as text, or None on failure
    """
    try:
        response = requests.get(url, headers=HEADERS, timeout=10)
        response.encoding = 'unicode_escape'
        if response.status_code == 200:
            return response.text
    except RequestException:
        # RequestException also covers ReadTimeout and ConnectionError.
        pass
    return None


def pase_page(url, page):
    """Parse one page of results and append the wanted fields to the CSV."""
    html = get_page(url)
    if html is None:
        print('Parsing failed')
        return
    s = json.loads(html, strict=False)
    usrinfolist = [[key[field] for field in FIELDS] for key in s['userInfo']]
    dataframe = pd.DataFrame(usrinfolist)
    with CSV_LOCK:
        dataframe.to_csv('世紀佳緣小姐姐信息.csv', mode='a+', index=False, header=False)
    # The page number is passed in explicitly; the original read a global
    # that other threads could already have changed.
    print('Current page: {0}'.format(page))


if __name__ == '__main__':
    for page in range(1, 5000, 3):
        # Fetch three consecutive pages in parallel. The original appended
        # str(page) to the URL and never filled in the %s placeholder.
        threads = [
            threading.Thread(target=pase_page,
                             kwargs={'url': SEARCH_URL % p, 'page': p})
            for p in (page, page + 1, page + 2)
        ]
        for t in threads:
            t.start()
        # Wait for this batch before starting the next one.
        for t in threads:
            t.join()
```
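So I turned to a different site:
http://www.lovewzly.com/jiaoyou.html
Screenshot of the site's home page
Opening one profile shows that it contains partner-requirement information that can be extracted.
The profile's URL is http://www.lovewzly.com/user/4270839.html
Looking at the source of the home page, we can see that each profile URL can be derived from the data-uid attribute of the corresponding entry on the home page.
So the plan is: scrape the home page to collect every data-uid, loop over them, build each profile URL from its data-uid, and then analyze that profile's partner requirements.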
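The plan above can be sketched roughly as follows. This is only an illustration, not the code used in this post: the sample HTML and the names `UidCollector`, `extract_uids` and `profile_url` are made up, and the real home-page markup may differ. Only the `data-uid` attribute and the profile URL pattern come from the inspection described above. The standard-library HTML parser is used here to keep the sketch dependency-free.

```python
from html.parser import HTMLParser

# URL pattern observed for profile pages.
PROFILE_URL = 'http://www.lovewzly.com/user/{}.html'


class UidCollector(HTMLParser):
    """Collects the value of every data-uid attribute, in document order."""

    def __init__(self):
        super().__init__()
        self.uids = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == 'data-uid' and value:
                self.uids.append(value)


def extract_uids(html):
    """Return all data-uid values found in the given HTML text."""
    collector = UidCollector()
    collector.feed(html)
    return collector.uids


def profile_url(uid):
    """Build a profile page URL from a data-uid value."""
    return PROFILE_URL.format(uid)


# Hypothetical fragment standing in for the real home page.
sample_html = '''
<ul>
  <li class="user" data-uid="4270839">A</li>
  <li class="user" data-uid="4276242">B</li>
</ul>
'''

urls = [profile_url(uid) for uid in extract_uids(sample_html)]
print(urls)
```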
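Page analysis
Under Sources in the browser's developer tools, we can find the numeric codes for city, age, zodiac sign and so on. We will use these to write our own filter functions.
Screenshot of the request URL
Scroll down to see the parameters. I ticked only a few filters, so the URL carries only a few parameters.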
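As a rough illustration of how such a parameterized URL can be assembled, here is a small sketch. The endpoint and the parameter names (`startage`, `endage`, `gender`, `cityid`, `marry`, `page`) are the ones that appear in the scraping code in this post; the helper `build_search_url` is a hypothetical name, and dropping unset filters is an assumption about how the site treats missing parameters.

```python
from urllib.parse import urlencode

SEARCH_API = 'http://www.lovewzly.com/api/user/pc/list/search'


def build_search_url(page, **filters):
    """Build a search URL from the filters that are set (not None)."""
    params = {name: value for name, value in filters.items() if value is not None}
    params['page'] = page
    return SEARCH_API + '?' + urlencode(params)


# Codes used in this post: gender 2 = female, cityid 394 = Chongqing, marry 1 = never married.
url = build_search_url(1, startage=21, endage=30, gender=2, cityid=394, marry=1)
print(url)
```

Leaving a filter out (or passing `None`) simply yields a shorter URL, which matches the observation that fewer ticked filters mean fewer parameters.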
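Screenshot of the returned data
The data does not include the partner requirements I am after.
So from this endpoint I take only the userid, build the profile URL from it (for example http://www.lovewzly.com/user/4276242.html), and then extract the profile details and partner requirements from that page.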
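A minimal sketch of that first step, collecting userids page by page, might look like the following. It assumes the response shape used by the scraping code in this post (`payload['data']['list']`, each item carrying a `userid`). The `fetch` callable, the `collect_userids` name and the fake pages are hypothetical, introduced so the example runs without touching the network; stopping at the first empty page is a design choice of this sketch.

```python
def collect_userids(fetch, max_pages=100):
    """Call fetch(page) for pages 1..max_pages and gather unique userids.

    Stops early at the first page that is missing or has an empty list.
    """
    seen = []
    for page in range(1, max_pages + 1):
        payload = fetch(page)
        items = ((payload or {}).get('data') or {}).get('list') or []
        if not items:
            break
        for item in items:
            uid = item.get('userid')
            if uid is not None and uid not in seen:
                seen.append(uid)
    return seen


# Fake pages standing in for real API responses.
fake_pages = {
    1: {'data': {'list': [{'userid': 4276242}, {'userid': 4270839}]}},
    2: {'data': {'list': [{'userid': 4270839}]}},
}

userids = collect_userids(fake_pages.get)
profile_urls = ['http://www.lovewzly.com/user/{}.html'.format(uid) for uid in userids]
print(profile_urls)
```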
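Code 1: collecting the userids that match the filters
Language used this time: Python.
I am learning other languages too, but none of them has grown into my main language yet; they are not battle-ready.
In this code I set only four filters: age, gender, city and marital status.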
```python
import pandas as pd
import requests
from requests.exceptions import ConnectionError, ReadTimeout, RequestException

SEARCH_API = ('http://www.lovewzly.com/api/user/pc/list/search'
              '?startage={}&endage={}&gender={}&cityid={}&marry={}&page={}')

# City name -> numeric city code used by the site.
CITY_IDS = {
    'Beijing': 52, 'Shenzhen': 77, 'Guangzhou': 76, 'Fuzhou': 53, 'Xiamen': 60,
    'Hangzhou': 383, 'Qingdao': 284, 'Changsha': 197, 'Jinan': 283, 'Nanjing': 220,
    'Hong Kong': 395, 'Shanghai': 321, 'Chengdu': 322, 'Wuhan': 180, 'Suzhou': 221,
    'Chongqing': 394, 'Nanchang': 233, 'Nanning': 97, 'Hefei': 3401, 'Zhengzhou': 149,
    'Foshan': 80, 'Zhuhai': 96, 'Kunming': 397, 'Shijiazhuang': 138, 'Tianjin': 143,
}
GENDER_IDS = {'male': 1, 'female': 2}
MARRY_IDS = {'single': 1, 'widowed': 2, 'divorced': 3}


def set_age():
    """Ask for the desired age and return the (startage, endage) bucket."""
    age = int(input('Enter the desired age of your match: '))
    if 21 <= age <= 30:
        return 21, 30
    if 31 <= age <= 40:
        return 31, 40
    # The original left startage/endage undefined for any other age.
    raise ValueError('Only ages 21 to 40 are supported')


def set_sex():
    sex = input('Enter the desired gender (male/female): ')
    return GENDER_IDS[sex]


def set_city():
    city = input('Enter the desired city: ')
    return CITY_IDS[city]


def marry():
    print('Enter the marital status, one of: single, divorced, widowed')
    status = input('Marital status: ')
    return MARRY_IDS[status]


def get_info(page, startage, endage, gender, cityid, marryid):
    """Fetch one page of search results as parsed JSON, or None on failure."""
    url = SEARCH_API.format(startage, endage, gender, cityid, marryid, page)
    try:
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            return response.json()
    except ReadTimeout:
        print('Timeout')
    except ConnectionError:
        print('Connection error')
    except RequestException:
        print('Error')
    return None


def main():
    print('Enter your filter criteria to begin:')
    startage, endage = set_age()
    gender = set_sex()
    cityid = set_city()
    marryid = marry()
    for i in range(1, 100):
        result = get_info(i, startage, endage, gender, cityid, marryid)
        if not result:
            continue  # skip pages that failed to load
        items = (result.get('data') or {}).get('list') or []
        userids = [item['userid'] for item in items]
        if userids:
            # Append this page's userids, one per line.
            pd.Series(userids).to_csv('小姐姐信息userid.csv', mode='a+',
                                      index=False, header=False)


if __name__ == '__main__':
    main()
    print('Finished reading')
```
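Screenshot of the program running
Screenshot of the results. With these filters, the site returned only 454 women in Chongqing.
I ran the program again with Chengdu and Shanghai as the target cities, then merged all the results into a single file.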
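The post does not show how the per-city results were merged, so the following is only one possible way to do it. It assumes each run produced a file with one userid per line (as the code above writes). The file names, the ids and the `merge_userid_files` helper are made up for the demonstration.

```python
import tempfile
from pathlib import Path


def merge_userid_files(paths, out_path):
    """Merge one-userid-per-line files, dropping blanks and duplicates.

    The first occurrence of each userid wins, so input order is preserved.
    """
    merged = []
    seen = set()
    for path in paths:
        for line in Path(path).read_text(encoding='utf-8').splitlines():
            uid = line.strip()
            if uid and uid not in seen:
                seen.add(uid)
                merged.append(uid)
    Path(out_path).write_text('\n'.join(merged) + '\n', encoding='utf-8')
    return merged


# Demonstration on throwaway files with made-up ids.
with tempfile.TemporaryDirectory() as tmp:
    chongqing = Path(tmp) / 'chongqing.csv'
    chengdu = Path(tmp) / 'chengdu.csv'
    chongqing.write_text('1001\n1002\n', encoding='utf-8')
    chengdu.write_text('1002\n1003\n', encoding='utf-8')
    out_file = Path(tmp) / 'all_userids.csv'
    merged_ids = merge_userid_files([chongqing, chengdu], out_file)
    written = out_file.read_text(encoding='utf-8')

print(merged_ids)
```

Dropping duplicates matters here because the same profile could, in principle, appear in more than one search.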
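Code 2: extracting each profile and its partner requirements by userid
Click any profile and inspect the elements: the information sits directly in the page source, with no client-side rendering involved. Extracting it is therefore straightforward, and I will not go into detail.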
```python
import pandas as pd
import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException

HEADERS = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
OUTPUT_CSV = '相親網站小姐姐數據.csv'

# Shared prefixes of the CSS selectors on a profile page.
BASE = 'div.cm-wrapin.user-warpin > div.clearfix > div.users-left'
PROFILE = BASE + ' > div:nth-child(1) > div.view.fl.c6 > ul'
HABITS = BASE + ' > div.clearfix.user-detail > div:nth-child(2) > div.body > ul'
PARTNER = BASE + ' > div.clearfix.user-detail > div:nth-child(3) > div.body > ul'

# (column name, CSS selector) pairs, in output column order.
SELECTORS = [
    # ---------------- Her own profile ----------------
    ('nickname', '.view.fl.c6 .nick.c3e'),
    ('age', '.f18.c3e.p2 .age.s1'),
    ('height', '.f18.c3e.p2 .height'),
    ('education', '.f18.c3e.p2 .education'),
    ('present_address', PROFILE + ' > li:nth-child(1) > span'),  # current residence
    ('professional', PROFILE + ' > li:nth-child(7) > span'),     # occupation
    ('income', PROFILE + ' > li:nth-child(8) > span'),
    ('photo', BASE + ' > div:nth-child(1) > div.photo.fl > div.imgwrap > ul > li:nth-child(1) > img'),
    # ---------------- Partner requirements ----------------
    ('smoking', HABITS + ' > li:nth-child(2)'),    # minds a partner who smokes?
    ('drinking', HABITS + ' > li:nth-child(4)'),   # minds a partner who drinks?
    ('children', HABITS + ' > li:nth-child(3)'),   # minds a partner with children?
    ('age_man', '#userid > ' + PARTNER + ' > li:nth-child(1)'),
    ('height_man', '#userid > ' + PARTNER + ' > li:nth-child(2)'),
    ('money_man', '#userid > ' + PARTNER + ' > li:nth-child(3)'),   # monthly salary
    ('study_man', PARTNER + ' > li:nth-child(4)'),                  # education
    ('professional_man', '#userid > ' + PARTNER + ' > li:nth-child(8)'),
    ('present_addressman', '#userid > ' + PARTNER + ' > li:nth-child(6)'),
]


def get_page(url):
    """Fetch url and return its HTML, or None on failure."""
    try:
        response = requests.get(url, headers=HEADERS, timeout=10)
        response.encoding = 'utf-8'
        if response.status_code == 200:
            return response.text
    except RequestException:
        # The original caught the built-in ConnectionError, which requests
        # does not raise; RequestException covers all requests errors.
        print('Request failed')
    return None


def pase_page(url, i):
    """Extract one profile and append it as a row to the CSV."""
    html = get_page(url)
    # Check before converting: str(None) would never be None.
    if html is None:
        print('Parsing error')
        return
    soup = BeautifulSoup(html, 'lxml')
    row = {}
    for name, selector in SELECTORS:
        tags = soup.select(selector)
        if name == 'photo':
            # Read the image link from the src attribute of the first match.
            row[name] = tags[0].get('src', '') if tags else ''
        else:
            row[name] = ''.join(tag.get_text() for tag in tags)
    # ---------------- Write everything to the table ----------------
    information = pd.DataFrame([row], columns=[name for name, _ in SELECTORS])
    # Write the header once, with the first record only.
    information.to_csv(OUTPUT_CSV, mode='a+', index=False, header=(i == 0))


def main():
    # One userid per line.
    with open('小姐姐信息.txt', encoding='gbk') as f:
        txt = [line.strip() for line in f]
    for i, userid in enumerate(txt):
        print(i)
        base_url = 'http://www.lovewzly.com/user/' + str(userid) + '.html'
        pase_page(base_url, i)


if __name__ == '__main__':
    main()
```
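The resulting data is shown below: a little over 2,500 records.
Data analysis
For the analysis I could not be bothered to write code; I was getting tired.
So I kept it simple and used spreadsheet pivot tables for a quick look.
Pivot table tutorial
Required partner salary
Among the records where the salary field is filled in, 58.78% of the women require a partner earning more than 10,000 yuan a month.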
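The analysis in this post was done with spreadsheet pivot tables, not code. For readers who prefer Python, a percentage like the one above could be computed along these lines. The field name `money_man`, the category labels and the four sample rows are placeholders, not the real data, and `share_by_value` is a hypothetical helper.

```python
from collections import Counter


def share_by_value(rows, field):
    """Return {value: percentage} over the rows whose `field` is non-empty."""
    values = [(row.get(field) or '').strip() for row in rows]
    values = [value for value in values if value]
    total = len(values)
    if total == 0:
        return {}
    counts = Counter(values)
    return {value: round(100 * n / total, 2) for value, n in counts.items()}


# Made-up rows; the empty one is excluded from the denominator.
sample_rows = [
    {'money_man': 'over 10000'},
    {'money_man': 'over 10000'},
    {'money_man': 'over 10000'},
    {'money_man': '5000-10000'},
    {'money_man': ''},
]

shares = share_by_value(sample_rows, 'money_man')
print(shares)
```

Excluding blank values from the denominator mirrors the "records where the salary field is filled in" condition.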
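(Money really can do anything. A while back I heard that a successful businessman who lives near me had an affair with a young woman only slightly older than his own daughter. What a beast.)
Education levels of the single women
Bachelor's and junior-college graduates make up the majority. Sure enough, the lower the education level, the less likely one is to be single.
Partner salary requirements by education level
The numbers of bachelor's and junior-college graduates requiring a partner who earns 10,000 yuan a month are 294 and 188 respectively.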
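A two-way count like this, which is what a pivot table with one field on rows and another on columns produces, can be sketched in Python as follows. The field names `education` and `money_man`, the labels and the sample rows are placeholders for illustration only, and `cross_count` is a hypothetical helper.

```python
from collections import Counter


def cross_count(rows, row_field, col_field):
    """Count rows per (row_field, col_field) pair, skipping blank values."""
    counts = Counter()
    for row in rows:
        r = (row.get(row_field) or '').strip()
        c = (row.get(col_field) or '').strip()
        if r and c:
            counts[(r, c)] += 1
    return counts


# Made-up rows, not the scraped data.
sample_rows = [
    {'education': 'bachelor', 'money_man': 'over 10000'},
    {'education': 'bachelor', 'money_man': 'over 10000'},
    {'education': 'junior college', 'money_man': 'over 10000'},
    {'education': 'junior college', 'money_man': '5000-10000'},
    {'education': 'bachelor', 'money_man': ''},
]

table = cross_count(sample_rows, 'education', 'money_man')
for (education, salary), n in sorted(table.items()):
    print(education, salary, n)
```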
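Her salary versus the salary she requires
The x-axis is the woman's own salary; the y-axis is the count for each required partner salary. Women who earn 5,000 to 10,000 yuan a month almost all require a partner earning more than 10,000.
Required partner education
The x-axis is the woman's education; the y-axis is the partner's education.
Here the y-axis is the woman's education and the x-axis is the partner's. Quite a few women with a bachelor's degree still list junior high, high school or junior college as the education they require of a partner.
Required partner height
There are too many categories, so here is just a quick screenshot; the largest counts fall in the 170 to 180 cm range.
Through this post I have found out what these young women look for in a partner.
Right now I am feeling rather full of myself: I think none of them is good enough for me.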