A demo of the open-source web crawler WebCollector
1、Environment: JDK 7 + Eclipse Mars
2、WebCollector is open-sourced at https://github.com/CrawlScript/WebCollector
Download webcollector-2.26-bin.zip, unzip it, and add all the jar files in the extracted folder to your project's build path.
3、Demo source code:
```java
/**
 * Demo of crawling the web with WebCollector.
 * @author fjs
 */
package com;

import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.berkeley.BreadthCrawler;
import org.jsoup.nodes.Document;

public class Demo extends BreadthCrawler {

    /**
     * @param crawlPath path of the directory that maintains this crawler's state
     * @param autoParse if true, BreadthCrawler automatically extracts links
     *                  that match the regex rules from each fetched page
     */
    public Demo(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        /* start page */
        this.addSeed("http://guangzhou.qfang.com");
        /* fetch URLs that match this regex rule */
        this.addRegex(".*");
        /* do not fetch jpg|png|gif */
        this.addRegex("-.*\\.(jpg|png|gif).*");
        /* do not fetch URLs containing # */
        this.addRegex("-.*#.*");
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.getUrl();
        Document doc = page.getDoc();
        System.out.println(url);
        System.out.println(doc.title());
        /* To crawl additional URLs, add them to next. */
        /* WebCollector automatically filters links that have already been fetched. */
        /* If autoParse is true and a link added to next does not match the regex
           rules, that link will also be filtered. */
        // next.add("http://gz.house.163.com/");
    }

    public static void main(String[] args) throws Exception {
        Demo crawler = new Demo("path", true);
        crawler.setThreads(50);
        crawler.setTopN(100);
        // crawler.setResumable(true);
        /* start crawling with depth 3 */
        crawler.start(3);
    }
}
```

4、In a real application, you would parse the page inside visit() to extract the content you need, rather than just printing the URL and title.
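To clarify the `addRegex` calls above: rules beginning with `-` act as exclusions, while the others act as inclusions. The following is a minimal, dependency-free sketch of that filtering behavior using only `java.util.regex`; it is an illustrative re-implementation, not WebCollector's actual rule class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sketch of WebCollector-style regex rules: a rule starting with "-" excludes
// matching URLs; any other rule includes them. Illustrative only.
public class RegexRuleSketch {
    private final List<Pattern> positive = new ArrayList<>();
    private final List<Pattern> negative = new ArrayList<>();

    public void addRule(String rule) {
        if (rule.startsWith("-")) {
            negative.add(Pattern.compile(rule.substring(1)));
        } else {
            positive.add(Pattern.compile(rule));
        }
    }

    public boolean accepts(String url) {
        // A URL is crawled if it matches at least one positive rule
        // and no negative rule.
        boolean matched = positive.stream().anyMatch(p -> p.matcher(url).matches());
        boolean excluded = negative.stream().anyMatch(p -> p.matcher(url).matches());
        return matched && !excluded;
    }

    public static void main(String[] args) {
        RegexRuleSketch rules = new RegexRuleSketch();
        rules.addRule(".*");
        rules.addRule("-.*\\.(jpg|png|gif).*");
        rules.addRule("-.*#.*");
        System.out.println(rules.accepts("http://guangzhou.qfang.com/sale"));  // true
        System.out.println(rules.accepts("http://guangzhou.qfang.com/a.jpg")); // false
        System.out.println(rules.accepts("http://guangzhou.qfang.com/p#top")); // false
    }
}
```

With this convention, the demo's three rules together mean: fetch everything, except image files and URLs containing a fragment (`#`).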
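In real use, `page.getDoc()` returns a Jsoup `Document`, so you would typically extract content with CSS selectors (e.g. `doc.select("title")`). As a stand-alone, dependency-free illustration of pulling one field out of raw HTML, here is a sketch that extracts the `<title>` text with `java.util.regex` (regex parsing of HTML is fragile; prefer Jsoup in production):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Dependency-free sketch: extract the <title> text from an HTML string.
// In a real visit() method you would use the Jsoup Document instead.
public class TitleExtractor {
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>",
                    Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Demo Page</title></head><body></body></html>";
        System.out.println(extractTitle(html)); // prints "Demo Page"
    }
}
```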