AI Studio Learning: Scraping Douban Movies with Go
Analysis
1. First, fetch the content of each list page.
Douban Movie Top 250 URL: https://movie.douban.com/top250
Page 1 URL: https://movie.douban.com/top250?start=0&filter=
Page 2 URL: https://movie.douban.com/top250?start=25&filter=
Page 3 URL: https://movie.douban.com/top250?start=50&filter=
Page 4 URL: https://movie.douban.com/top250?start=75&filter=
……
Page X URL: https://movie.douban.com/top250?start=(X-1)*25&filter=
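The page-number formula above can be sketched as a small helper. The function name pageURL is made up for this illustration; only the URL pattern comes from the list above.

```go
package main

import (
	"fmt"
	"strconv"
)

// pageURL returns the Top 250 list URL for a 1-based page number.
// The name is hypothetical; the pattern is start=(page-1)*25.
func pageURL(page int) string {
	return "https://movie.douban.com/top250?start=" + strconv.Itoa((page-1)*25) + "&filter="
}

func main() {
	// 250 films at 25 per page gives 10 list pages.
	for page := 1; page <= 10; page++ {
		fmt.Println(pageURL(page))
	}
}
```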
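2. Extract the useful information from each list page. Douban serves Simplified Chinese, so any literal Chinese text inside a pattern must be Simplified as well.
Film title: <img width="100" alt="(.*?)"
Rating: <span class="rating_num" property="v:average">(.*?)</span>
Number of ratings: <span>(.*?)人评价</span>
One-line summary: <span class="inq">(.*?)</span>
Detail page URL: <a href="(.*?)/" class="">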
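To see how these list-page patterns behave, here is a minimal sketch that runs them against a hand-written HTML fragment. The fragment is an assumption shaped to fit the patterns (with the rating count in Simplified Chinese), not markup copied from Douban, and firstGroup is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// sampleItem is a made-up fragment for illustration; the real markup may differ.
const sampleItem = `
<a href="https://movie.douban.com/subject/0000000/" class="">
<img width="100" alt="Example Film" src="poster.jpg">
<span class="rating_num" property="v:average">9.0</span>
<span>12345人评价</span>
<span class="inq">A one-line summary.</span>
`

// firstGroup returns capture group 1 of the first match, or "" if nothing matches.
func firstGroup(pattern, text string) string {
	m := regexp.MustCompile(pattern).FindStringSubmatch(text)
	if len(m) < 2 {
		return ""
	}
	return m[1]
}

func main() {
	fmt.Println("title: ", firstGroup(`<img width="100" alt="(.*?)"`, sampleItem))
	fmt.Println("rating:", firstGroup(`<span class="rating_num" property="v:average">(.*?)</span>`, sampleItem))
	fmt.Println("votes: ", firstGroup(`<span>(.*?)人评价</span>`, sampleItem))
	fmt.Println("quote: ", firstGroup(`<span class="inq">(.*?)</span>`, sampleItem))
	fmt.Println("url:   ", firstGroup(`<a href="(.*?)/" class="">`, sampleItem))
}
```

The lazy quantifier (.*?) stops at the first place where the rest of the pattern can match, which is why the URL capture ends before the trailing slash.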
3. Open each film's detail page and extract the useful information.
Director: rel="v:directedBy">(.*?)</a>
Genre: <span property="v:genre">(.*?)</span>
Country/region of production: 地区:</span>(.*?)<br/>
Language: 语言:</span>(.*?)<br/>
Release date: <span property="v:initialReleaseDate" content="(.*?)"
Runtime: <span property="v:runtime" content="(.*?)"
Synopsis: <span property="v:summary" class="">(?s:(.*?))</span>
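The synopsis pattern is the only one that uses (?s:...). In Go's regexp syntax, "." does not match a newline unless the s flag is set, and a synopsis usually spans several lines. The sketch below shows the difference on a made-up fragment; the sample text and the helper names are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sampleDetail is a hand-written multi-line synopsis, not real Douban markup.
const sampleDetail = `<span property="v:summary" class="">
    First line of the synopsis.
    <br />
    Second line.
</span>`

var (
	// With (?s:...) the dot also matches "\n" inside the group.
	summaryDotAll = regexp.MustCompile(`<span property="v:summary" class="">(?s:(.*?))</span>`)
	// Without the flag the dot stops at the first newline.
	summaryPlain = regexp.MustCompile(`<span property="v:summary" class="">(.*?)</span>`)
)

// summary extracts the synopsis and collapses line breaks and runs of whitespace.
func summary(html string) string {
	m := summaryDotAll.FindStringSubmatch(html)
	if len(m) < 2 {
		return ""
	}
	text := strings.ReplaceAll(m[1], "<br />", " ")
	return strings.Join(strings.Fields(text), " ")
}

// matchesWithoutS reports whether the flag-less pattern matches at all.
func matchesWithoutS(html string) bool {
	return summaryPlain.MatchString(html)
}

func main() {
	fmt.Println(matchesWithoutS(sampleDetail)) // false
	fmt.Printf("%q\n", summary(sampleDetail))
}
```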
package main

import (
	"fmt"
	"io"
	"net/http"
	"regexp"
	"strconv"
	"strings"

	// GetRows returning ([][]string, error) is the v2 API; under Go modules
	// the import path needs the /v2 suffix.
	"github.com/360EntSecGroup-Skylar/excelize/v2"
)

// HttpGet fetches url and returns the whole response body as a string.
func HttpGet(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	// http.Get does not treat a non-2xx status as an error, so check it here.
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("GET %s: unexpected status %s", url, resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// EditingRegularExpressions returns every match of expression in text,
// each as [full match, group 1, ...].
func EditingRegularExpressions(expression, text string) [][]string {
	return regexp.MustCompile(expression).FindAllStringSubmatch(text, -1)
}

// group returns capture group 1 of the i-th match, or "" when that match is
// missing, so an empty result cannot cause an index-out-of-range panic.
func group(matches [][]string, i int) string {
	if i >= 0 && i < len(matches) && len(matches[i]) > 1 {
		return matches[i][1]
	}
	return ""
}

// SpiderDetail scrapes one detail page and returns, in column order:
// director, genre, country, language, release date, runtime, synopsis.
func SpiderDetail(url string) ([]string, error) {
	result, err := HttpGet(url)
	if err != nil {
		return nil, err
	}
	first := func(expression string) string {
		return group(EditingRegularExpressions(expression, result), 0)
	}

	// The synopsis pattern also captures the tail of the opening tag, so
	// strip that together with whitespace and <br /> remnants. The order
	// matters: ">" and spaces go first, which turns "<br />" into "<br/".
	intro := EditingRegularExpressions(`<span property="v:summary"(?s:(.*?))</span>`, result)
	content := group(intro, len(intro)-1)
	for _, junk := range []string{">", " ", "\n", "\t", "<br/", `class=""`} {
		content = strings.ReplaceAll(content, junk, "")
	}

	// Douban serves Simplified Chinese, so the literal text in these
	// patterns must be Simplified.
	return []string{
		first(`rel="v:directedBy">(.*?)</a>`),
		first(`<span property="v:genre">(.*?)</span>`),
		first(`地区:</span>(.*?)<br/>`),
		first(`语言:</span>(.*?)<br/>`),
		first(`<span property="v:initialReleaseDate" content="(.*?)"`),
		first(`<span property="v:runtime" content="(.*?)"`),
		content,
	}, nil
}

// SpiderPage scrapes list page index (1-based) and appends one row per film
// to Film.xlsx.
func SpiderPage(index int) {
	fmt.Println("Crawling page", index, "...")
	f, err := excelize.OpenFile("./Film.xlsx")
	if err != nil {
		fmt.Println("open file err:", err)
		return
	}
	rows, err := f.GetRows("Sheet1")
	if err != nil {
		fmt.Println("read rows err:", err)
		return
	}
	lens := len(rows)

	url := "https://movie.douban.com/top250?start=" + strconv.Itoa((index-1)*25) + "&filter="
	result, err := HttpGet(url)
	if err != nil {
		fmt.Println("HttpGet err:", err)
		return
	}

	MovieTitle := EditingRegularExpressions(`<img width="100" alt="(.*?)"`, result)
	FilmRating := EditingRegularExpressions(`<span class="rating_num" property="v:average">(.*?)</span>`, result)
	NumberOfPeopleAssessed := EditingRegularExpressions(`<span>(.*?)人评价</span>`, result)
	// A film without a one-line summary shifts this slice relative to the
	// others; group() prevents a panic but not the misalignment.
	FilmSummary := EditingRegularExpressions(`<span class="inq">(.*?)</span>`, result)
	DetailsPageURL := EditingRegularExpressions(`<a href="(.*?)/" class="">`, result)

	for i := range DetailsPageURL {
		detail, err := SpiderDetail(group(DetailsPageURL, i))
		if err != nil {
			fmt.Println("detail page err:", err)
			detail = make([]string, 7) // keep the row, leave the detail columns empty
		}
		content := []string{
			group(MovieTitle, i),
			group(FilmRating, i),
			group(NumberOfPeopleAssessed, i),
			group(FilmSummary, i),
			group(DetailsPageURL, i),
		}
		content = append(content, detail...)

		for j, value := range content {
			// Twelve columns, A through L, so a single letter is enough.
			cell := fmt.Sprintf("%c%d", 'A'+j, lens+i+1)
			if err := f.SetCellValue("Sheet1", cell, value); err != nil {
				fmt.Println("set cell err:", err)
			}
		}
	}
	if err := f.SaveAs("./Film.xlsx"); err != nil {
		fmt.Println(err)
	}
}

// Spider creates Film.xlsx with a header row, then crawls pages start..end.
func Spider(start, end int) {
	// Column headers: title, rating, number of ratings, one-line summary,
	// detail page URL, director, genre, country, language, release date,
	// runtime, synopsis.
	cont := []string{"電影名", "電影評分", "評分人數", "電影總結", "電影詳情網頁", "導演", "類型", "制片國家", "語言", "上映時間", "電影時長", "簡介"}
	file := excelize.NewFile()
	for i, header := range cont {
		if err := file.SetCellValue("Sheet1", fmt.Sprintf("%c1", 'A'+i), header); err != nil {
			fmt.Println("set header err:", err)
		}
	}
	if err := file.SaveAs("./Film.xlsx"); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("Crawl from page %d to page %d\n", start, end)
	for i := start; i <= end; i++ {
		SpiderPage(i)
	}
}

func main() {
	var start, end int
	fmt.Print("Please enter the crawl start page (>=1): ")
	fmt.Scan(&start)
	fmt.Print("Please enter the crawl end page (>=start): ")
	fmt.Scan(&end)
	if start < 1 || end < start {
		fmt.Println("invalid page range")
		return
	}
	Spider(start, end)
}
Defining a function that fetches the entire content of a page
func (c *Client) Get(url string) (resp *Response, err error)
The crawler calls http.Get, which is a wrapper around DefaultClient.Get. From the net/http documentation:
Get issues a GET to the specified URL. If the response is one of the following redirect codes, Get follows the redirect after calling the Client's CheckRedirect function:
301 (Moved Permanently)
302 (Found)
303 (See Other)
307 (Temporary Redirect)
308 (Permanent Redirect)
An error is returned if the Client's CheckRedirect function fails or if there was an HTTP protocol error. A non-2xx response doesn't cause an error. Any returned error will be of type *url.Error. The url.Error value's Timeout method will report true if the request timed out.
When err is nil, resp always contains a non-nil resp.Body. Caller should close resp.Body when done reading from it.
To make a request with custom headers, use NewRequest and Client.Do.
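As a minimal sketch of that last point, the following builds a request with a custom User-Agent and sends it through a client with a timeout. The helper names newRequest and fetch, the User-Agent string, and the 10-second timeout are all assumptions made for this example; whether any particular header changes how Douban responds is not something the post establishes.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// newRequest builds a GET request carrying a custom User-Agent header.
func newRequest(url, userAgent string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", userAgent)
	return req, nil
}

// fetch sends req with client and returns the body as a string.
func fetch(client *http.Client, req *http.Request) (string, error) {
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	// A non-2xx response is not reported as an error, so check it explicitly.
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("GET %s: unexpected status %s", req.URL, resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	req, err := newRequest("https://movie.douban.com/top250?start=0&filter=", "Mozilla/5.0 (compatible; example-crawler)")
	if err != nil {
		fmt.Println("bad request:", err)
		return
	}
	body, err := fetch(client, req)
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Println("fetched", len(body), "bytes")
}
```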
func (re *Regexp) FindAllStringSubmatch(s string, n int) [][]string
FindAllStringSubmatch is the 'All' version of FindStringSubmatch; it returns a slice of all successive matches of the expression, as defined by the 'All' description in the package comment. A return value of nil indicates no match.
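A short sketch of the shape of that return value: each element is one match, with the full match at index 0 and the capture groups after it. The sample HTML and the helper name groups are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// groups returns capture group 1 of every match of pattern in text.
// It returns nil when nothing matches.
func groups(pattern, text string) []string {
	var out []string
	for _, m := range regexp.MustCompile(pattern).FindAllStringSubmatch(text, -1) {
		if len(m) > 1 { // m[0] is the full match, m[1] the first group
			out = append(out, m[1])
		}
	}
	return out
}

func main() {
	html := `<span property="v:genre">Crime</span> / <span property="v:genre">Drama</span>`
	pattern := `<span property="v:genre">(.*?)</span>`

	fmt.Println(groups(pattern, html))                   // [Crime Drama]
	fmt.Println(groups(pattern, "no spans here") == nil) // true
}
```

Passing -1 as the second argument asks for all matches. Indexing a result directly, as in matches[0][1], panics when the page did not contain the expected markup, so checking the length first is the safer habit.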
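Problems encountered
1. The regular expressions returned empty results. Investigation showed that the fetched page itself was the problem. It contained the message: 检测到有异常请求从你的 IP 发出 (abnormal requests detected from your IP).
It looks like the IP was blocked, probably because of too many requests.
The crawler was running on a server, so the only option left was to download the files and run them on Windows instead.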
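One way to fail early instead of ending up with empty regex results is to check each body for the block message and to pause between requests. This is only a sketch under assumptions: the marker string is taken from the message quoted above (written in Simplified Chinese), the delay value is arbitrary, and nothing here is known to avoid a block. The names looksBlocked, crawl, and errBlocked are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// errBlocked signals that the site returned its abnormal-requests page.
var errBlocked = errors.New("blocked: abnormal-requests page returned")

// looksBlocked reports whether body contains the block message.
// The marker text is an assumption based on the message quoted in the post.
func looksBlocked(body string) bool {
	return strings.Contains(body, "检测到有异常请求")
}

// crawl fetches each URL in turn, sleeping delay between requests, and stops
// at the first error or block page. It returns the bodies collected so far.
func crawl(urls []string, delay time.Duration, fetch func(string) (string, error)) ([]string, error) {
	var pages []string
	for i, u := range urls {
		if i > 0 {
			time.Sleep(delay)
		}
		body, err := fetch(u)
		if err != nil {
			return pages, err
		}
		if looksBlocked(body) {
			return pages, errBlocked
		}
		pages = append(pages, body)
	}
	return pages, nil
}

func main() {
	// Stub fetcher so the sketch runs offline; swap in a real HTTP fetch.
	stub := func(u string) (string, error) { return "<html>" + u + "</html>", nil }
	pages, err := crawl([]string{"page1", "page2"}, 100*time.Millisecond, stub)
	fmt.Println(len(pages), err)
}
```

Taking the fetch function as a parameter keeps the throttling logic separate from the HTTP code, so it can be exercised without touching the network.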