goquery

Published: 2023/12/18

Using goquery

If you know jQuery, you can pick up goquery in about a minute. The goquery documentation is here:

http://godoc.org/github.com/PuerkitoBio/goquery

1. Creating a document

d, e := goquery.NewDocumentFromReader(reader) // reader is an io.Reader

d, e := goquery.NewDocument(url) // url is a string

2. Finding content

ele.Find("#title") // find by id

ele.Find(".title") // find by class

ele.Find("h2").Find("a") // chained calls

3. Getting content

ele.Html()

ele.Text()

4. Getting attributes

ele.Attr("href")

ele.AttrOr("href", "")

5. Iterating

ele.Find(".item").Each(func(index int, ele *goquery.Selection) {
    // ...
})

For the rest of the API, see the official documentation.

http://liyangliang.me/posts/2016/03/zhihu-go-insight-parsing-html-with-goquery/

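zhihu-go source code walkthrough: parsing HTML with goquery

The previous post briefly covered how the zhihu-go project came about. This post looks at some details of how it handles HTML.

Zhihu does not offer a public API, so the only way to get data is to imitate what a browser does. The data arrives in two formats: ordinary HTML documents, and JSON returned by certain Ajax endpoints (where the payload is in fact also HTML). In other words, this is a crawler: fetch pages, then extract data. Common ways of extracting data from HTML include regular expressions, XPath, and CSS selectors. For me, regular expressions are complicated to write, hard to read, and painful to maintain. I have not looked into XPath in detail, though it should not be hard to use, and Chrome can copy an element's XPath directly. zhihu-go takes the selector approach, using goquery.

goquery is "a little like that j-thing, only in Go", that is, it manipulates the DOM the jQuery way. Most people know jQuery well, and its API is simple and clear. This post does not cover goquery in detail; instead it picks a few scenarios (APIs) and shows how zhihu-go uses them.

Creating a Document object

goquery exposes two structs: Document and Selection. Document represents an HTML document; Selection is used for jQuery-style operations and supports chaining. goquery needs an HTML document before anything else can be done, and offers the following constructors:

NewDocumentFromNode(root *html.Node) *Document: takes an *html.Node, that is, the root node.
NewDocument(url string) (*Document, error): takes a URL and fetches the page internally with http.Get.
NewDocumentFromReader(r io.Reader) (*Document, error): takes an io.Reader, then reads and parses its content.
NewDocumentFromResponse(res *http.Response) (*Document, error): takes an HTTP response, gets res.Body (which implements io.Reader), and then handles it much like NewDocumentFromReader.

Zhihu pages require login (and forged request headers), so NewDocument is not an option, and we do not want to parse the HTML by hand to obtain an *html.Node. That leaves the other two constructors. Roughly, they are used as follows:

When requesting an HTML page (such as a question page), call NewDocumentFromResponse.
When requesting an Ajax endpoint whose JSON response carries HTML fragments, use NewDocumentFromReader with r = strings.NewReader(html).

For the examples below, assume this declaration: var doc *goquery.Document.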

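Finding specific nodes

Selection has a series of jQuery-like methods, and the Document struct embeds *Selection, so these methods can be called on it directly. The main one is Selection.Find(selector string): it takes a selector and returns a new *Selection holding the matches, which is what makes chaining possible.

For example, on a user's profile page (such as 黃繼新's), suppose we want the user's bio. First use Chrome to locate the corresponding HTML element, whose text is:

和知乎在一起 ("Together with Zhihu")

The corresponding Go code is:

doc.Find("span.bio")

If a selector matches several results, First(), Last(), Eq(index int), and Slice(start, end int) narrow the selection further.

Still on the profile page: below the profile panel, from left to right, are the counts of questions asked, answers, articles, collections, and public edits. The HTML source shows that these items all share the same class, so they can only be told apart by index.

To locate the answer count, the corresponding Go code is: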

doc.Find("div.profile-navbar").Find("span.num").Eq(1)

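Working with attributes

You often need the content of a tag and some of its attribute values, which goquery makes easy.

Continuing the answer-count example, the Text() string method returns the text inside the tag, including the text of all child tags.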

text := doc.Find("div.profile-navbar").Find("span.num").Eq(1).Text() // "785"

Note that the string returned by Text() may have a lot of leading and trailing whitespace, which you can strip as needed.

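Getting attribute values is just as easy. There are two methods:

Attr(attrName string) (val string, exists bool): returns the attribute value and whether the attribute exists, much like reading a value from a map
AttrOr(attrName, defaultValue string) string: like the previous method, except that it returns the given default value when the attribute does not exist

A common use is getting the link from an a tag. Continuing the answers example, to get the URL of the user's answers page: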

href, _ := doc.Find("div.profile-navbar").Find("a.item").Eq(1).Attr("href")

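There are also methods for setting attributes and manipulating classes, which are not covered here.

Iteration

Many scenarios return list data, such as a question's followers, all of its answers, or the users who upvoted an answer. These usually call for iteration: walking over all nodes of the same kind and doing something with each.

goquery provides three iteration methods, each taking a function as its argument:

Each(f func(int, *Selection)) *Selection: the first parameter of f is the current index and the second is the current node
EachWithBreak(f func(int, *Selection) bool) *Selection: like Each, but able to stop early; iteration ends when f returns false
Map(f func(int, *Selection) string) (result []string): f takes the same parameters as above and returns a string; the overall result is a []string.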

For example, to get all the questions in a collection (such as 黃繼新's collection 關于知乎的思考), you can do this (see zhihu-go/collections.go):

func getQuestionsFromDoc(doc *goquery.Document) []*Question {
    questions := make([]*Question, 0, pageSize)
    items := doc.Find("div#zh-list-answer-wrap").Find("h2.zm-item-title")
    items.Each(func(index int, sel *goquery.Selection) {
        a := sel.Find("a")
        qTitle := strip(a.Text())
        qHref, _ := a.Attr("href")
        thisQuestion := NewQuestion(makeZhihuLink(qHref), qTitle)
        questions = append(questions, thisQuestion)
    })
    return questions
}

EachWithBreak is also used in zhihu-go; see the Answer.GetVotersN method in zhihu-go/answer.go.

Removing nodes, inserting HTML, exporting HTML

One requirement is to export an answer's content as HTML, which really amounts to repairing and cleaning HTML. For the details, see the answerSelectionToHtml function in answer.go. It uses several operations that modify the document.

For example, calling Remove() deletes a node:

sel.Find("noscript").Each(func(_ int, tag *goquery.Selection) {
    tag.Remove() // drop the useless noscript tags
})

Inserting a piece of HTML after a node:

sel.Find("img").Each(func(_ int, tag *goquery.Selection) {
    var src string
    if tag.HasClass("origin_image") {
        src, _ = tag.Attr("data-original")
    } else {
        src, _ = tag.Attr("data-actualsrc")
    }
    tag.SetAttr("src", src)
    if tag.Next().Size() == 0 {
        tag.AfterHtml("<br>") // insert a line break after the img tag
    }
})

Appending content at the end of a tag:

wrapper := `<html><head><meta charset="utf-8"></head><body></body></html>`
doc, _ := goquery.NewDocumentFromReader(strings.NewReader(wrapper))
doc.Find("body").AppendSelection(sel)

Finally, exporting the result as an HTML document:

html, err := doc.Html()

Summary

The examples above cover most of the HTML handling in zhihu-go. Thanks to goquery's jQuery-style API, it was all quite simple to implement.

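The string passed to goquery is a CSS selector. Its syntax is described at http://www.w3school.com.cn/cssref/css_selectors.asp

CSS3 selectors

In CSS, selectors are patterns used to select the elements you want to style.

The "CSS" column indicates the CSS version in which the selector was defined (CSS1, CSS2, or CSS3).

Selector	Example	Description	CSS

.class	.intro	Selects all elements with class="intro".	1
#id	#firstname	Selects the element with id="firstname".	1
*	*	Selects all elements.	2
element	p	Selects all <p> elements.	1
element,element	div,p	Selects all <div> elements and all <p> elements.	1
element element	div p	Selects all <p> elements inside <div> elements.	1
element>element	div>p	Selects all <p> elements whose parent is a <div> element.	2
element+element	div+p	Selects all <p> elements placed immediately after a <div> element.	2
[attribute]	[target]	Selects all elements with a target attribute.	2
[attribute=value]	[target=_blank]	Selects all elements with target="_blank".	2
[attribute~=value]	[title~=flower]	Selects all elements whose title attribute contains the word "flower".	2
[attribute|=value]	[lang|=en]	Selects all elements whose lang attribute value starts with "en".	2
:link	a:link	Selects all unvisited links.	1
:visited	a:visited	Selects all visited links.	1
:active	a:active	Selects the active link.	1
:hover	a:hover	Selects the link under the mouse pointer.	1
:focus	input:focus	Selects the input element that has focus.	2
:first-letter	p:first-letter	Selects the first letter of every <p> element.	1
:first-line	p:first-line	Selects the first line of every <p> element.	1
:first-child	p:first-child	Selects every <p> element that is the first child of its parent.	2
:before	p:before	Inserts content before the content of every <p> element.	2
:after	p:after	Inserts content after the content of every <p> element.	2
:lang(language)	p:lang(it)	Selects every <p> element whose lang attribute value starts with "it".	2
element1~element2	p~ul	Selects every <ul> element that is preceded by a <p> element.	3
[attribute^=value]	a[src^="https"]	Selects every <a> element whose src attribute value begins with "https".	3
[attribute$=value]	a[src$=".pdf"]	Selects every <a> element whose src attribute value ends with ".pdf".	3
[attribute*=value]	a[src*="abc"]	Selects every <a> element whose src attribute value contains the substring "abc".	3
:first-of-type	p:first-of-type	Selects every <p> element that is the first <p> element of its parent.	3
:last-of-type	p:last-of-type	Selects every <p> element that is the last <p> element of its parent.	3
:only-of-type	p:only-of-type	Selects every <p> element that is the only <p> element of its parent.	3
:only-child	p:only-child	Selects every <p> element that is the only child of its parent.	3
:nth-child(n)	p:nth-child(2)	Selects every <p> element that is the second child of its parent.	3
:nth-last-child(n)	p:nth-last-child(2)	Same as p:nth-child(2), but counting from the last child.	3
:nth-of-type(n)	p:nth-of-type(2)	Selects every <p> element that is the second <p> element of its parent.	3
:nth-last-of-type(n)	p:nth-last-of-type(2)	Same as p:nth-of-type(2), but counting from the last child.	3
:last-child	p:last-child	Selects every <p> element that is the last child of its parent.	3
:root	:root	Selects the document's root element.	3
:empty	p:empty	Selects every <p> element that has no children (including text nodes).	3
:target	#news:target	Selects the currently active #news element.	3
:enabled	input:enabled	Selects every enabled <input> element.	3
:disabled	input:disabled	Selects every disabled <input> element.	3
:checked	input:checked	Selects every checked <input> element.	3
:not(selector)	:not(p)	Selects every element that is not a <p> element.	3
::selection	::selection	Selects the portion of an element that is selected by the user.	3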

package main

import (
    "fmt"
    "log"

    "github.com/PuerkitoBio/goquery"
)

// ExampleScrape prints the class attribute of the main content column.
func ExampleScrape() {
    doc, err := goquery.NewDocument("http://studygolang.com/topics")
    if err != nil {
        log.Fatal(err)
    }

    /*
        dhead := doc.Find("head")
        dTitle := dhead.Find("title")
        fmt.Printf("title text:%s\n", dTitle.Text())
        html, _ := dTitle.Html()
        fmt.Printf("title html:%s\n", html)
        metaArr := dhead.Find("meta")
        for i := 0; i < metaArr.Length(); i++ {
            d, _ := metaArr.Eq(i).Attr("name")
            fmt.Println(d)
        }
    */

    doc.Find("div.wrapper .container .col-lg-9").Each(func(i int, cs *goquery.Selection) {
        d, _ := cs.Attr("class")
        fmt.Println(d)
    })
}

func main() {
    ExampleScrape()
    return // everything below is kept for reference and never runs

    doc, err := goquery.NewDocument("http://studygolang.com/topics")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(doc.Html())            // .Html() returns the HTML content
    pTitle := doc.Find("title").Text() // read the title text directly
    class := doc.Find("h2").Text()
    fmt.Printf("class:%v\n", class)
    fmt.Printf("title:%v\n", pTitle)

    doc.Find(".topics .topic").Each(func(i int, contentSelection *goquery.Selection) {
        title := contentSelection.Find(".title a").Text()
        t := contentSelection.Find(".title a")
        log.Printf("the length: %d", t.Length())
        log.Println("Post", i+1, "title:", title)
    })

    /*
        t := doc.Find(".topics .topic")
        log.Printf("%+v", t)
        t = doc.Find(".topics")
        log.Printf("%+v", t)
        t = doc.Find(".topic")
        log.Printf("%+v", t)
        t = doc.Find("div.topic")
        log.Printf("div.topic:%+v", t)
    */

    t := doc.Find("div.topic").Find(".title a")
    log.Printf("div.topic.title a:%+v", t)
    for i := 0; i < t.Length(); i++ {
        d, _ := t.Eq(i).Attr("href")
        title, _ := t.Eq(i).Attr("title")
        fmt.Println(d)
        fmt.Println(title)
    }
}

Output:

col-lg-9 col-md-8 col-sm-7

Reference links

http://liyangliang.me/posts/2016/03/zhihu-go-insight-parsing-html-with-goquery/

http://www.tiege.me/?p=501

Reposted from: https://www.cnblogs.com/diegodu/p/5761961.html
