A demo of the open-source web crawler WebCollector
1、Environment: JDK 7 + Eclipse Mars
2、WebCollector is open-sourced at https://github.com/CrawlScript/WebCollector
Download webcollector-2.26-bin.zip, unzip it, and add all the jar files in the extracted folder to your project's build path.
3、Demo source code:
```java
/**
 * Demo of crawling the web with WebCollector.
 * @author fjs
 */
package com;

import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.berkeley.BreadthCrawler;
import org.jsoup.nodes.Document;

public class Demo extends BreadthCrawler {

    /**
     * @param crawlPath path of the directory that maintains this crawler's state
     * @param autoParse if true, BreadthCrawler automatically extracts links
     *                  that match the regex rules from each fetched page
     */
    public Demo(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        /* start page */
        this.addSeed("http://guangzhou.qfang.com");
        /* fetch URLs that match this regex rule */
        this.addRegex(".*");
        /* do not fetch jpg|png|gif */
        this.addRegex("-.*\\.(jpg|png|gif).*");
        /* do not fetch URLs containing # */
        this.addRegex("-.*#.*");
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.getUrl();
        Document doc = page.getDoc();
        System.out.println(url);
        System.out.println(doc.title());
        /* To crawl additional URLs, add them to next. */
        /* WebCollector automatically filters links that have already been fetched. */
        /* If autoParse is true and a link added to next does not match the regex
           rules, that link will also be filtered. */
        // next.add("http://gz.house.163.com/");
    }

    public static void main(String[] args) throws Exception {
        Demo crawler = new Demo("path", true);
        crawler.setThreads(50);
        crawler.setTopN(100);
        // crawler.setResumable(true);
        /* start crawling with depth 3 */
        crawler.start(3);
    }
}
```

4、In a real application, you would parse the page inside visit() to extract the content you need, rather than just printing the URL and title.
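To clarify the `addRegex` calls above: rules beginning with `-` act as exclusions, while the others act as inclusions. The following is a minimal, dependency-free sketch of that filtering behavior using only `java.util.regex`; it is an illustrative re-implementation, not WebCollector's actual rule class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sketch of WebCollector-style regex rules: a rule starting with "-" excludes
// matching URLs; any other rule includes them. Illustrative only.
public class RegexRuleSketch {
    private final List<Pattern> positive = new ArrayList<>();
    private final List<Pattern> negative = new ArrayList<>();

    public void addRule(String rule) {
        if (rule.startsWith("-")) {
            negative.add(Pattern.compile(rule.substring(1)));
        } else {
            positive.add(Pattern.compile(rule));
        }
    }

    public boolean accepts(String url) {
        // A URL is crawled if it matches at least one positive rule
        // and no negative rule.
        boolean matched = positive.stream().anyMatch(p -> p.matcher(url).matches());
        boolean excluded = negative.stream().anyMatch(p -> p.matcher(url).matches());
        return matched && !excluded;
    }

    public static void main(String[] args) {
        RegexRuleSketch rules = new RegexRuleSketch();
        rules.addRule(".*");
        rules.addRule("-.*\\.(jpg|png|gif).*");
        rules.addRule("-.*#.*");
        System.out.println(rules.accepts("http://guangzhou.qfang.com/sale"));  // true
        System.out.println(rules.accepts("http://guangzhou.qfang.com/a.jpg")); // false
        System.out.println(rules.accepts("http://guangzhou.qfang.com/p#top")); // false
    }
}
```

With this convention, the demo's three rules together mean: fetch everything, except image files and URLs containing a fragment (`#`).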
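In real use, `page.getDoc()` returns a Jsoup `Document`, so you would typically extract content with CSS selectors (e.g. `doc.select("title")`). As a stand-alone, dependency-free illustration of pulling one field out of raw HTML, here is a sketch that extracts the `<title>` text with `java.util.regex` (regex parsing of HTML is fragile; prefer Jsoup in production):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Dependency-free sketch: extract the <title> text from an HTML string.
// In a real visit() method you would use the Jsoup Document instead.
public class TitleExtractor {
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>",
                    Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Demo Page</title></head><body></body></html>";
        System.out.println(extractTitle(html)); // prints "Demo Page"
    }
}
```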