Heritrix 3.1.0 Source Code Analysis (Part 11)

Published: 2023/12/9

The previous article analyzed how Heritrix 3.1.0 adds CrawlURI curi objects. So how are the seed CrawlURI objects loaded when the system initializes?

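Recall from the earlier articles that when we issue the launch command for a crawl job, what actually gets called is the void requestCrawlStart() method of the CrawlController object: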

    /**
     * Operator requested crawl begin
     */
    public void requestCrawlStart() {
        hasStarted = true;
        sendCrawlStateChangeEvent(State.PREPARING, CrawlStatus.PREPARING);

        if (recoveryCheckpoint == null) {
            // only announce (trigger scheduling of) seeds
            // when doing a cold (non-recovery) start
            getSeeds().announceSeeds();
        }

        setupToePool();

        // A proper exit will change this value.
        this.sExit = CrawlStatus.FINISHED_ABNORMAL;

        if (getPauseAtStart()) {
            // frontier is already paused unless started, so just
            // 'complete'/ack pause
            completePause();
        } else {
            getFrontier().run();
        }
    }

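On a cold (non-recovery) start this goes on to call getSeeds().announceSeeds(). The object actually returned by getSeeds() here is a TextSeedModule (injected automatically by Spring), and its void announceSeeds() method is what runs: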

    /**
     * Announce all seeds from configured source to SeedListeners
     * (including nonseed lines mixed in).
     * @see org.archive.modules.seeds.SeedModule#announceSeeds()
     */
    public void announceSeeds() {
        if (getBlockAwaitingSeedLines() > -1) {
            final CountDownLatch latch = new CountDownLatch(getBlockAwaitingSeedLines());
            new Thread() {
                @Override
                public void run() {
                    announceSeeds(latch);
                    while (latch.getCount() > 0) {
                        latch.countDown();
                    }
                }
            }.start();
            try {
                latch.await();
            } catch (InterruptedException e) {
                // do nothing
            }
        } else {
            announceSeeds(null);
        }
    }
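The latch handshake in announceSeeds() can be tried out in isolation. The following is a minimal sketch, not Heritrix code: LatchSketch, announce and the processed counter are hypothetical names introduced for illustration, and the seed handling is reduced to incrementing a counter. It shows the caller being released once the configured number of lines has been handled, or once the input runs out and the worker drains the latch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for TextSeedModule: only the latch handshake is modelled.
class LatchSketch {

    /**
     * Processes lines on a background thread and blocks the caller until
     * blockAwaiting lines are handled or the input runs out.
     * Returns how many lines had been processed when the caller was released.
     */
    static int announce(List<String> lines, int blockAwaiting, AtomicInteger processed) {
        final CountDownLatch latch = new CountDownLatch(blockAwaiting);
        Thread worker = new Thread(() -> {
            for (String line : lines) {
                processed.incrementAndGet(); // stands in for handling one seed line
                latch.countDown();           // one tick per processed line
            }
            // Fewer lines than the latch count: drain it so the caller is released.
            while (latch.getCount() > 0) {
                latch.countDown();
            }
        });
        worker.start();
        try {
            latch.await();                   // returns once the count reaches zero
        } catch (InterruptedException e) {
            // The original ignores the interrupt; restoring the flag is this sketch's choice.
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a.example", "b.example", "c.example");
        int seen = announce(lines, 2, new AtomicInteger());
        System.out.println("released after " + seen + " line(s)");
    }
}
```

With a latch count larger than the number of lines, the caller is only released by the drain loop, that is, after every line has been processed.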

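In announceSeeds(), the CountDownLatch latch created in the if branch is a countdown counter that lets the calling thread block until the configured number of seed lines has been processed; the else branch passes null instead. Either way, execution continues into void announceSeeds(CountDownLatch latchOrNull):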

    protected void announceSeeds(CountDownLatch latchOrNull) {
        BufferedReader reader = new BufferedReader(textSource.obtainReader());
        try {
            announceSeedsFromReader(reader, latchOrNull);
        } finally {
            IOUtils.closeQuietly(reader);
        }
    }

announceSeeds(CountDownLatch) first obtains a Reader (a StringReader) from the ReadSource textSource (an org.archive.spring.ConfigString), then calls void announceSeedsFromReader(BufferedReader reader, CountDownLatch latchOrNull):

    /**
     * Announce all seeds (and nonseed possible-directive lines) from
     * the given Reader
     * @param reader source of seed/directive lines
     * @param latchOrNull if non-null, sent countDown after each line, allowing
     * another thread to proceed after a configurable number of lines processed
     */
    protected void announceSeedsFromReader(BufferedReader reader, CountDownLatch latchOrNull) {
        String s;
        Iterator<String> iter =
            new RegexLineIterator(
                    new LineReadingIterator(reader),
                    RegexLineIterator.COMMENT_LINE,
                    RegexLineIterator.NONWHITESPACE_ENTRY_TRAILING_COMMENT,
                    RegexLineIterator.ENTRY);

        int count = 0;
        while (iter.hasNext()) {
            s = (String) iter.next();
            if (Character.isLetterOrDigit(s.charAt(0))) {
                // consider a likely URI
                seedLine(s);
                count++;
                if (count % 20000 == 0) {
                    System.runFinalization();
                }
            } else {
                // report just in case it's a useful directive
                nonseedLine(s);
            }
            if (latchOrNull != null) {
                latchOrNull.countDown();
            }
        }
        publishConcludedSeedBatch();
    }
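To make the seed versus nonseed split concrete, here is a rough sketch of the classification step. It does not use RegexLineIterator; the exact patterns behind COMMENT_LINE and NONWHITESPACE_ENTRY_TRAILING_COMMENT are not shown in this article, so the comment handling below is an assumption (a line starting with '#' is a comment, and whitespace followed by '#' starts a trailing comment). SeedLineClassifierSketch, Result and classify are hypothetical names. Only the Character.isLetterOrDigit test on the first character is taken from the source above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the line classification in announceSeedsFromReader().
class SeedLineClassifierSketch {

    /** Holds the two groups of lines after classification. */
    static class Result {
        final List<String> seeds = new ArrayList<>();
        final List<String> nonseeds = new ArrayList<>();
    }

    static Result classify(String text) {
        Result result = new Result();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                // Assumption: a leading '#' marks a whole-line comment.
                if (line.isEmpty() || line.startsWith("#")) {
                    continue;
                }
                // Assumption: whitespace followed by '#' starts a trailing comment.
                line = line.replaceFirst("\\s+#.*$", "");
                if (Character.isLetterOrDigit(line.charAt(0))) {
                    result.seeds.add(line);     // likely a URI, as in the source
                } else {
                    result.nonseeds.add(line);  // possibly a directive
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        Result r = classify("# my seeds\nexample.com  # main site\n\n+http://example.org\n");
        System.out.println("seeds: " + r.seeds);
        System.out.println("nonseeds: " + r.nonseeds);
    }
}
```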

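announceSeedsFromReader() iterates over the lines and, for each line that starts with a letter or digit (a likely URL), calls void seedLine(String uri):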

    /**
     * Handle a read line that is probably a seed.
     *
     * @param uri String seed-containing line
     */
    protected void seedLine(String uri) {
        if (!uri.matches("[a-zA-Z][\\w+\\-]+:.*")) { // Rfc2396 s3.1 scheme,
                                                     // minus '.'
            // Does not begin with scheme, so try http://
            uri = "http://" + uri;
        }
        try {
            UURI uuri = UURIFactory.getInstance(uri);
            CrawlURI curi = new CrawlURI(uuri);
            curi.setSeed(true);
            curi.setSchedulingDirective(SchedulingConstants.MEDIUM);
            if (getSourceTagSeeds()) {
                curi.setSourceTag(curi.toString());
            }
            publishAddedSeed(curi);
        } catch (URIException e) {
            // try as nonseed line as fallback
            nonseedLine(uri);
        }
    }
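The scheme check at the top of seedLine() is easy to experiment with on its own. The sketch below reuses the regular expression from the source verbatim; the class SchemeCheckSketch and the method withDefaultScheme are hypothetical names, and the UURI and CrawlURI construction is left out.

```java
// Hypothetical sketch isolating the scheme check from seedLine().
class SchemeCheckSketch {

    // Same pattern as in seedLine(): a letter, one or more of [\w+-], then ':'.
    static final String HAS_SCHEME = "[a-zA-Z][\\w+\\-]+:.*";

    /** Prepends "http://" when the line does not appear to begin with a scheme. */
    static String withDefaultScheme(String line) {
        return line.matches(HAS_SCHEME) ? line : "http://" + line;
    }

    public static void main(String[] args) {
        String[] samples = {
            "www.example.com",
            "https://example.com/",
            "example.com:8080",
            "localhost:8080/"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + withDefaultScheme(s));
        }
    }
}
```

Because '.' is excluded from the scheme characters, a host such as example.com:8080 is not mistaken for a scheme and gets the http:// prefix. A dot-free host followed by a port, such as localhost:8080/, does match the pattern, so it is passed through unchanged.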

Finally, seedLine() calls void publishAddedSeed(CrawlURI curi) on the parent class SeedModule (the observer pattern):

    protected void publishAddedSeed(CrawlURI curi) {
        for (SeedListener l : seedListeners) {
            l.addedSeed(curi);
        }
    }

The BdbFrontier class implements the SeedListener interface indirectly, through the void addedSeed(CrawlURI puri) method of the abstract class AbstractFrontier:

    /**
     * When notified of a seed via the SeedListener interface,
     * schedule it.
     *
     * @see org.archive.modules.seeds.SeedListener#addedSeed(org.archive.modules.CrawlURI)
     */
    public void addedSeed(CrawlURI puri) {
        schedule(puri);
    }
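The observer relationship between the seed module and the frontier can be summarized in a few lines. This is a simplified sketch with hypothetical stand-in types: URIs are plain strings instead of CrawlURI objects, listeners are registered by hand through an addListener method that is this sketch's invention, and schedule() just appends to a list. How seedListeners is populated in Heritrix is not shown in this article.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of SeedModule -> SeedListener -> frontier.
class SeedObserverSketch {

    /** Observer interface: notified once per seed. */
    interface SeedListener {
        void addedSeed(String uri);
    }

    /** Subject: publishes each seed to all registered listeners. */
    static class SeedModule {
        private final List<SeedListener> seedListeners = new ArrayList<>();

        void addListener(SeedListener listener) {
            seedListeners.add(listener);
        }

        void publishAddedSeed(String uri) {
            for (SeedListener l : seedListeners) {
                l.addedSeed(uri);
            }
        }
    }

    /** Observer: schedules every seed it is told about. */
    static class Frontier implements SeedListener {
        final List<String> scheduled = new ArrayList<>();

        @Override
        public void addedSeed(String uri) {
            schedule(uri);
        }

        void schedule(String uri) {
            scheduled.add(uri); // a real frontier would enqueue the URI for crawling
        }
    }

    public static void main(String[] args) {
        SeedModule seeds = new SeedModule();
        Frontier frontier = new Frontier();
        seeds.addListener(frontier);
        seeds.publishAddedSeed("http://example.com/");
        System.out.println("scheduled: " + frontier.scheduled);
    }
}
```

The seed module never references the frontier type directly, so other listeners can receive the same seeds without any change to the publishing code.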

---------------------------------------------------------------------------

This Heritrix 3.1.0 source code analysis series is my own original work.

If you repost it, please credit the source: cnblogs (博客園), 刺猬的溫馴

Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/20/3031924.html
