Web Scraping Notes: pyquery in Detail

Published: 2024/9/30
pyquery

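A powerful and flexible HTML parsing library. If regular expressions feel too tedious to write, if BeautifulSoup's syntax is hard to remember, and if you already know jQuery's syntax, PyQuery is a natural choice.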

Initialization

1. Initializing from a string

from pyquery import PyQuery as pq

html = '''
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''

doc = pq(html)    # create a PyQuery object from the string
print(doc('li'))  # pass in a selector

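The argument in doc('li') is a selector. To select by tag, use the tag name on its own; to select by id, prefix the name with #; to select by class, prefix the name with a dot (.).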

2. Initializing from a URL

from pyquery import PyQuery as pq

doc = pq(url='https://www.2345.com/?38001')  # pass in a URL
print(doc('head'))

3. Initializing from a file

from pyquery import PyQuery as pq

doc = pq(filename='demo.html')
print(doc('li'))

Basic CSS selectors

from pyquery import PyQuery as pq

html = '''
<div id="container">
    <ul class="list">
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''

doc = pq(html)
print(doc('#container .list li'))  # id, class, tag name

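In doc('#container .list li'), .list does not have to be a direct child of #container; any ancestor-descendant relationship is enough, and the levels are separated by spaces. Without a space, the parts are combined and must all hold for the same element. For example, .item-0.active matches a single element that carries both classes, with no hierarchy between the two.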

Finding child elements

# Child elements
from pyquery import PyQuery as pq

html = '''
<div id="container">
    <ul class="list">
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''

doc = pq(html)
items = doc('.list')
print(type(items))
print(items)

print('All direct children')
lis = items.children()
print(type(lis))
print(lis)

print('Direct children matching .active')
lis = items.children('.active')
print(lis)

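After items = doc('.list'), items is itself a PyQuery object, so you can call lookup methods on it, such as find (searches all descendants) and children (direct children only).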

Finding parent elements

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''

doc = pq(html)
items = doc('.list')
parents = items.parents()  # every ancestor
parent = items.parent()    # direct parent only

print('Parent and all ancestors')
print(parents)
print('Direct parent')
print(parent)

Finding sibling elements

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''

doc = pq(html)
li = doc('.list .item-0.active')

print('All siblings')
print(li.siblings())
print('Siblings matching .active')
print(li.siblings('.active'))

Iterating over individual elements

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''

doc = pq(html)
lis = doc('li').items()  # a generator of PyQuery objects, one per match
print(type(lis))
for li in lis:
    print(li)

Getting information

Getting attributes

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''

doc = pq(html)
a = doc('.item-0.active a')
print(a)
print(a.attr('href'))  # read the link target
print(a.attr.href)     # equivalent attribute-style access

Getting text

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''

doc = pq(html)
# No space in .item-0.active: the element's class must contain both item-0 and active.
# A space expresses hierarchy, as in ".active a".
a = doc('.item-0.active a')
print(a)
print(a.text())  # get the text content

Getting HTML

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''

doc = pq(html)
li = doc('.item-0.active')
print(li)         # the element itself, including its own tag
print(li.html())  # only the HTML inside the element

DOM manipulation

addClass, removeClass

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''

doc = pq(html)
li = doc('.item-0.active')  # no space between the classes, so both must match
print(li)
li.removeClass('active')
print(li)
li.addClass('active')
print(li)

attr, css

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''

doc = pq(html)
li = doc('.item-0.active')
print(li)
li.attr('name', 'link')       # set an attribute
print(li)
li.css('font-size', '14px')   # set an inline style
print(li)

remove

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    Hello, World
    <p>This is a paragraph.</p>
</div>
'''

doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())
wrap.find('p').remove()
print(wrap.text())

Other DOM methods

http://pyquery.readthedocs.io/en/latest/api.html

Pseudo-class selectors

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''

doc = pq(html)
li = doc('li:first-child')       # the first li
print(li)
li = doc('li:last-child')        # the last li
print(li)
li = doc('li:nth-child(2)')      # the second li
print(li)
li = doc('li:gt(2)')             # li elements whose zero-based index is greater than 2
print(li)
li = doc('li:nth-child(2n)')     # li elements at even positions
print(li)
li = doc('li:contains(second)')  # li elements whose text contains "second"
print(li)

Author: 電氣-余登武. Writing takes real effort; if you found this article useful, please leave a like before you go.
