Scraping Baidu Tieba with Java + Selenium
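I have wanted to scrape Tieba for years but kept putting it off. A few days ago, on a whim, I finally decided to do it.
The rough approach is as follows:
Start the driver > open the first page of the forum > use the getPageUrls method to collect the URL and title of every thread on that page > find the next-page button and click it. Repeat this loop until the URLs of all threads in the forum have been collected.
That is the ideal case, anyway.
Unfortunately, after roughly ten thousand threads, paging further only shows threads from the current day; Baidu blocks access to anything older.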
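One way to notice this cutoff while crawling is to remember which thread ids have already been collected and to stop once several consecutive pages contribute nothing new. The class below is a minimal sketch of that idea, not part of the original crawler; the name SeenThreads and the "consecutive stale pages" threshold are assumptions made for illustration.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: tracks collected thread ids so the crawl loop can detect repeated pages. */
final class SeenThreads {
    private final Set<String> seen = new HashSet<>();
    private int stalePages = 0;

    /** Records the ids found on one list page and returns how many of them were new. */
    int addPage(Collection<String> threadIds) {
        int added = 0;
        for (String id : threadIds) {
            if (seen.add(id)) {
                added++;
            }
        }
        // A page with no new ids extends the stale streak; any new id resets it.
        stalePages = (added == 0) ? stalePages + 1 : 0;
        return added;
    }

    /** True once maxStalePages consecutive pages have contributed nothing new. */
    boolean shouldStop(int maxStalePages) {
        return stalePages >= maxStalePages;
    }

    /** Number of distinct thread ids collected so far. */
    int size() {
        return seen.size();
    }
}
```

In the crawl loop, this could be called as seen.addPage(urls.keySet()) right after getPageUrls, leaving the loop when seen.shouldStop(3) returns true. The threshold of 3 is arbitrary.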
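// Requires Selenium and chromedriver. SearchUtil and StringUtil are the
// author's own helper classes and are not shown in this post.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.ExpectedCondition;
import org.openqa.selenium.support.ui.WebDriverWait;

// The class name is a placeholder; the original post shows only the methods.
public class TiebaSpider {

    public static void searchBa() {
        // URL-encoded (UTF-8) forum name: 李毅
        String tieba = "%E6%9D%8E%E6%AF%85";
        String driverPath = "D:/DevSoft/AllUtils/src/com/framework/libInteresting/spider/chromedriver.exe";
        System.setProperty("webdriver.chrome.driver", driverPath);

        // Disable image loading so pages load faster
        ChromeOptions chromeOptions = new ChromeOptions();
        Map<String, Object> prefs = new HashMap<String, Object>();
        prefs.put("profile.managed_default_content_settings.images", 2);
        chromeOptions.setExperimentalOption("prefs", prefs);

        WebDriver driver = new ChromeDriver(chromeOptions);
        driver.get("https://tieba.baidu.com/f?kw=" + tieba + "&ie=utf-8");

        boolean flag = true;
        while (flag) {
            try {
                // Keys are thread ids (the href with "/p/" removed), values are titles
                Map<String, String> urls = getPageUrls(driver, driver.getPageSource());
                for (String url : urls.keySet()) {
                    String title = urls.get(url);
                    // Store or process (url, title) here
                }
                Thread.sleep(300);
            } catch (Exception e) {
                e.printStackTrace();
            }
            // Returns false when no next-page link can be clicked, which ends the loop
            flag = goNextPage(driver);
            try {
                Thread.sleep(500);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        // quit() closes all windows and ends the session; the original also
        // called close() afterwards, which fails because the session is gone.
        driver.quit();
    }

    public static Map<String, String> getPageUrls(WebDriver driver, String content) {
        HashMap<String, String> re = new HashMap<String, String>();
        // Collect every <a ... class="j_th_tit ...>...</a> fragment (the thread title links)
        List<String> htmls = SearchUtil.getListWithFeature(content,
                StringUtil.str2List("<a `class=\"j_th_tit`</a>", "`"));
        for (String html : htmls) {
            String htmlurl = SearchUtil.subString(html, "href=\"", "\"").get(0);
            // Presumably strips everything between "<" and ">", leaving only the title text
            html = StringUtil.kill(html, "<", ">");
            re.put(htmlurl.replace("/p/", ""), html);
        }
        return re;
    }

    public static boolean goNextPage(WebDriver driver) {
        try {
            // Selenium 3 constructor (timeout in seconds); Selenium 4 takes Duration.ofSeconds(10)
            WebDriverWait wait = new WebDriverWait(driver, 10);
            wait.until(new ExpectedCondition<WebElement>() {
                @Override
                public WebElement apply(WebDriver d) {
                    // First link after the <span> in the pager (presumably the current page marker)
                    return d.findElement(By.xpath("//div[@id='frs_list_pager']/span/following-sibling::a[1]"));
                }
            }).click();
        } catch (Exception e) {
            e.printStackTrace();
            return false;
        }
        return true;
    }
}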
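With the thread URLs in hand, you can start crawling the threads themselves, floor by floor.
Some floors have many replies, which are hidden behind a "click to view" (点击查看) link, and clicking it only works when you are logged in. So before crawling any thread, log in first: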
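public static void login(WebDriver driver, String username, String password) {
    try {
        // Open the login dialog; the link text on the page is in Simplified Chinese
        driver.findElement(By.linkText("登录")).click();

        // Wait for the "log in with username" button in the dialog footer, then click it.
        // Selenium 3 constructor (timeout in seconds); Selenium 4 takes Duration.ofSeconds(5)
        WebDriverWait wait = new WebDriverWait(driver, 5);
        wait.until(new ExpectedCondition<WebElement>() {
            @Override
            public WebElement apply(WebDriver d) {
                return d.findElement(By.cssSelector("p[id=TANGRAM__PSP_12__footerULoginBtn]"));
            }
        }).click();

        driver.findElement(By.name("userName")).sendKeys(username);
        driver.findElement(By.name("password")).sendKeys(password);
        driver.findElement(By.id("TANGRAM__PSP_12__submit")).click();
    } catch (Exception e) {
        // The original swallowed the exception silently; log it so a failed login is visible
        e.printStackTrace();
    }
}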
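After logging in, the thread pages are fetched one at a time in much the same way. My own needs are fairly simple, so I will spare you the rest of the code.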