Scraping Lianjia housing listings for every district in Beijing
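Requirements
The link is as follows:
The site makes almost no JSON data requests; the listing data sits directly in the page HTML, so scraping it is very simple.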
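Because the data is in the HTML itself, an XPath query on the parsed page is enough to pull it out. The following is a minimal sketch of that idea, not the exact page structure: `SAMPLE_HTML` is a made-up fragment and `district_links` is a hypothetical helper introduced only for illustration. The XPath expression is the one the script uses for the district navigation.

```python
from urllib.parse import urljoin

from lxml import etree

BASE = "https://bj.lianjia.com"

# Hypothetical sample markup (an assumption), shaped like the district
# navigation block that the script's XPath targets.
SAMPLE_HTML = """
<div class="sub_nav section_sub_nav">
  <a href="/ershoufang/dongcheng/" title="Dongcheng">Dongcheng</a>
  <a href="https://bj.lianjia.com/ershoufang/xicheng/" title="Xicheng">Xicheng</a>
</div>
"""


def district_links(html, base=BASE):
    """Return (absolute_url, title) pairs for each district link."""
    tree = etree.HTML(html)
    anchors = tree.xpath('//div[@class="sub_nav section_sub_nav"]/a')
    # urljoin resolves relative hrefs and leaves absolute ones unchanged.
    return [(urljoin(base, a.get("href")), a.get("title")) for a in anchors]


links = district_links(SAMPLE_HTML)
for url, title in links:
    print(url, title)
```

Using `urljoin` here avoids having to check by hand whether each `href` is already an absolute URL.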
import random
import time
from math import ceil
from urllib.parse import urljoin

import requests
from lxml import etree

BASE = 'https://bj.lianjia.com'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}


def get_tree(url):
    # Download a page and parse it into an lxml element tree.
    response = requests.get(url, headers=header).content.decode('utf-8')
    return etree.HTML(response)


# Collect the link and title of every district from the second-hand housing index.
tree = get_tree(BASE + '/ershoufang')
list_a_href = tree.xpath('//div[@class="sub_nav section_sub_nav"]/a/@href')
list_title = tree.xpath('//div[@class="sub_nav section_sub_nav"]/a/@title')

for href, title in zip(list_a_href, list_title):
    # District links may be relative; urljoin leaves absolute URLs unchanged.
    lianjie = urljoin(BASE, href)
    print(lianjie, title)

    # Scrape the listing pages of this district.
    print('Scraping ' + title)
    tree = get_tree(lianjie)
    num = tree.xpath('//h2//span/text()')
    print('This district has', num[0], 'listings')
    page = ceil(int(num[0]) / 30)  # 30 listings per page
    print('Total pages:', page)

    for i in range(1, page + 1):
        print('Scraping page', i)
        # Build the page URL from the current district, not a hard-coded one.
        base_url = lianjie.rstrip('/') + '/pg{}/'.format(i)
        tree = get_tree(base_url)
        house = tree.xpath('//div[@class="title"]/a/@href')
        for j in range(len(house)):
            try:
                print('Scraping listing', j + 1, 'on page', i)
                tree = get_tree(house[j])
                print('Extracting listing details')
                time.sleep(random.choice([.2, .3, .4, .5, 1]))
                fjxx = tree.xpath('//div[@class="title"]/h1/text()')[0]
                # wsgs is indexed by character: positions 0, 2 and 6 are read
                # as the bedroom, living-room and bathroom counts.
                wsgs = tree.xpath('//div[@class="content"]/ul/li/text()')[0]
                jiage = tree.xpath('//div[@class="price "]/span/text()')[0]
                # Join up to four text fragments of the phone block; empty if absent.
                telphone = ''.join(tree.xpath('//div[@class="phone"]//text()')[:4])
                print(telphone)
                print(fjxx)
                a = ('Page ' + str(i) + ', listing ' + str(j + 1) + '\n\n'
                     + 'Listing: ' + fjxx + '\n'
                     + 'Bedrooms: ' + wsgs[0] + '\n'
                     + 'Living rooms: ' + wsgs[2] + '\n'
                     + 'Bathrooms: ' + wsgs[6] + '\n'
                     + 'Price: ' + jiage + ' wan CNY\n'
                     + telphone + '\n\n')
                with open('lianjia.txt', 'a', encoding='utf-8') as fp:
                    fp.write(a)
            except Exception as exc:
                # Skip this listing and continue with the next one.
                print('Skipped listing:', exc)

Scraping may fail partway through. The main causes are network connection problems and requesting pages too frequently, so we can try making the script sleep between requests.
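I use a try/except structure here: whenever an error occurs, the script abandons the current listing and moves on to the next one. Lianjia's anti-scraping measures are very simple, practically nonexistent, so the scrape ran into almost no difficulties.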
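If skipping a listing on the first error loses too much data, one possible refinement is to retry a few times with a growing, randomized pause before giving up. This is a sketch under stated assumptions, not part of the original script: `fetch_with_retry` and its parameters are hypothetical names, and `fetch` stands for any callable that downloads a URL (for example a small wrapper around `requests.get`).

```python
import random
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch(url), retrying on any exception up to `retries` attempts.

    The pause grows with each failed attempt and includes random jitter,
    so repeated requests are spread out instead of sent back to back.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:
            last_error = exc
            if attempt < retries:
                sleep(base_delay * attempt + random.uniform(0, 0.5))
    # Every attempt failed: re-raise the last error so the caller can skip.
    raise last_error


# Usage with a stand-in fetch function that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_fetch(url):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary network failure")
    return "page for " + url


recorded_delays = []
result = fetch_with_retry(flaky_fetch, "https://bj.lianjia.com/ershoufang",
                          sleep=recorded_delays.append)
print(result, "after", attempts["count"], "attempts")
```

The `sleep` parameter is injectable so the helper can be exercised without actually waiting. In the scraper it would be left at its default, and the surrounding try/except would still skip a listing once every retry has failed.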