

A demo of the open-source web crawler WebCollector

Published: 2025/4/16

This article, collected by 生活随笔, introduces a demo of the open-source web crawler WebCollector and is shared here for reference.

1. Environment: JDK 7 + Eclipse Mars.

2. WebCollector source repository: https://github.com/CrawlScript/WebCollector

Download webcollector-2.26-bin.zip, extract it, and add all the jar files it contains to your project's build path.
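If you use Maven instead of adding jars by hand, a dependency along these lines should work; the coordinates below are an assumption based on how WebCollector is published, so verify them against the repository's README before relying on them:

```xml
<!-- Hypothetical coordinates: confirm groupId/artifactId/version in the WebCollector README -->
<dependency>
    <groupId>cn.edu.hfut.dmic.webcollector</groupId>
    <artifactId>WebCollector</artifactId>
    <version>2.26</version>
</dependency>
```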

3. Demo source code:


    package com;

    import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
    import cn.edu.hfut.dmic.webcollector.model.Page;
    import cn.edu.hfut.dmic.webcollector.plugin.berkeley.BreadthCrawler;
    import org.jsoup.nodes.Document;

    /**
     * Demo of crawling the web with WebCollector.
     * @author fjs
     */
    public class demo extends BreadthCrawler {

        /**
         * @param crawlPath path of the directory that keeps this crawler's state
         * @param autoParse if true, BreadthCrawler automatically extracts links
         *                  that match the regex rules from each fetched page
         */
        public demo(String crawlPath, boolean autoParse) {
            super(crawlPath, autoParse);
            /* start page */
            this.addSeed("http://guangzhou.qfang.com");
            /* follow every URL that matches this regex */
            this.addRegex(".*");
            /* a leading "-" marks an exclusion rule: do not fetch jpg/png/gif */
            this.addRegex("-.*\\.(jpg|png|gif).*");
            /* do not fetch URLs containing # */
            this.addRegex("-.*#.*");
        }

        @Override
        public void visit(Page page, CrawlDatums next) {
            String url = page.getUrl();
            Document doc = page.getDoc();
            System.out.println(url);
            System.out.println(doc.title());
            /* To crawl additional URLs, add them to next.                     */
            /* WebCollector automatically filters links fetched before.        */
            /* If autoParse is true, a link added to next that does not match  */
            /* the regex rules is filtered as well.                            */
            // next.add("http://gz.house.163.com/");
        }

        public static void main(String[] args) throws Exception {
            demo crawler = new demo("path", true);
            crawler.setThreads(50);
            crawler.setTopN(100);
            // crawler.setResumable(true);
            /* start crawling with depth 3 */
            crawler.start(3);
        }
    }
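The regex rules above combine a whitelist (plain patterns) with a blacklist (patterns prefixed with "-"). As a minimal sketch of that filtering logic (not WebCollector's actual implementation, just an illustration using `java.util.regex`), a URL is followed when it matches at least one positive rule and no negative rule:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sketch of whitelist/blacklist URL filtering in the style of addRegex:
// a leading "-" turns a pattern into an exclusion rule.
public class RegexRuleSketch {
    private final List<Pattern> positive = new ArrayList<>();
    private final List<Pattern> negative = new ArrayList<>();

    public void addRegex(String rule) {
        if (rule.startsWith("-")) {
            negative.add(Pattern.compile(rule.substring(1)));
        } else {
            positive.add(Pattern.compile(rule));
        }
    }

    // A URL is accepted if it matches some positive rule and no negative rule.
    public boolean accepts(String url) {
        boolean matchedPositive = false;
        for (Pattern p : positive) {
            if (p.matcher(url).matches()) { matchedPositive = true; break; }
        }
        if (!matchedPositive) return false;
        for (Pattern p : negative) {
            if (p.matcher(url).matches()) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        RegexRuleSketch rules = new RegexRuleSketch();
        rules.addRegex(".*");
        rules.addRegex("-.*\\.(jpg|png|gif).*");
        rules.addRegex("-.*#.*");
        System.out.println(rules.accepts("http://guangzhou.qfang.com/sale/f1"));  // true
        System.out.println(rules.accepts("http://guangzhou.qfang.com/logo.png")); // false
        System.out.println(rules.accepts("http://guangzhou.qfang.com/page#top")); // false
    }
}
```

This mirrors why the demo can safely use ".*" as its only positive rule: the image and fragment URLs are cut out afterwards by the exclusion patterns.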

4. In a real application, parse the Page inside visit() to extract the content you need from each crawled page.
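In the demo, page.getDoc() returns a jsoup Document, so extraction would normally use jsoup selectors. As a self-contained, standard-library-only illustration of the idea (a hypothetical helper, not part of WebCollector), here is a sketch that pulls the title out of raw HTML:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stdlib-only sketch of extracting one field from fetched HTML.
// In the demo above you would instead call methods on the jsoup
// Document returned by page.getDoc().
public class TitleExtractor {
    private static final Pattern TITLE =
            Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the <title> text, or an empty string if none is present.
    public static String title(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Demo Page</title></head><body>...</body></html>";
        System.out.println(title(html)); // Demo Page
    }
}
```

For anything beyond a single well-known tag, prefer the jsoup API (e.g. doc.select with a CSS selector) over regexes, since HTML in the wild is rarely regular enough for pattern matching.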
