【Nutch 2.2.1 Basic Tutorial, Part 6】The Nutch 2.2.1 Crawl Workflow
2. Crawl log

When crawling with the `crawl` command, the console log looks like this:

```
InjectorJob: starting at 2014-07-08 10:41:27
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 2
Injector: finished at 2014-07-08 10:41:32, elapsed: 00:00:05
Tue Jul 8 10:41:33 CST 2014 : Iteration 1 of 5
Generating batchId
Generating a new fetchlist
GeneratorJob: starting at 2014-07-08 10:41:34
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2014-07-08 10:41:39, time elapsed: 00:00:05
GeneratorJob: generated batch id: 1404787293-26339
Fetching :
FetcherJob: starting
FetcherJob: batchId: 1404787293-26339
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1404798101129
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 2 records.
Hit by time limit :0
fetching http://www.csdn.net/ (queue crawl delay=5000ms)
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching http://www.itpub.net/ (queue crawl delay=5000ms)
-finishing thread FetcherThread47, activeThreads=48
-finishing thread FetcherThread46, activeThreads=47
(... similar "-finishing thread" messages for the remaining threads ...)
-finishing thread FetcherThread0, activeThreads=1
fetch of http://www.itpub.net/ failed with: java.io.IOException: unzipBestEffort returned null
-finishing thread FetcherThread1, activeThreads=0
0/0 spinwaiting/active, 2 pages, 1 errors, 0.4 0 pages/s, 93 93 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done
Parsing :
ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: batchId: 1404787293-26339
Parsing http://www.csdn.net/
http://www.csdn.net/ skipped. Content of size 92777 was truncated to 59561
Parsing http://www.itpub.net/
ParserJob: success
CrawlDB update for csdnitpub
DbUpdaterJob: starting
DbUpdaterJob: done
Indexing csdnitpub on SOLR index -> http://ip:8983/solr/
SolrIndexerJob: starting
SolrIndexerJob: done.
SOLR dedup -> http://ip:8983/solr/
Tue Jul 8 10:42:18 CST 2014 : Iteration 2 of 5
Generating batchId
Generating a new fetchlist
GeneratorJob: starting at 2014-07-08 10:42:19
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2014-07-08 10:42:25, time elapsed: 00:00:05
GeneratorJob: generated batch id: 1404787338-30453
Fetching :
FetcherJob: starting
FetcherJob: batchId: 1404787338-30453
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1404798146676
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records.
Hit by time limit :0
```
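One line worth watching in the output above is the Fetcher's periodic status report (`0/0 spinwaiting/active, 2 pages, 1 errors, ...`). As an illustrative aside (this helper is not part of Nutch), the counters can be pulled out with a regular expression when tailing crawl logs:

```python
import re

# Matches the counter prefix of the Fetcher's periodic status line, e.g.
# "0/0 spinwaiting/active, 2 pages, 1 errors, 0.4 0 pages/s, 93 93 kb/s, ..."
STATUS_RE = re.compile(
    r"(?P<spinwaiting>\d+)/(?P<active>\d+) spinwaiting/active, "
    r"(?P<pages>\d+) pages, (?P<errors>\d+) errors"
)

def parse_fetcher_status(line):
    """Return the status counters as ints, or None if the line doesn't match."""
    m = STATUS_RE.search(line)
    return {k: int(v) for k, v in m.groupdict().items()} if m else None

line = "0/0 spinwaiting/active, 2 pages, 1 errors, 0.4 0 pages/s, 93 93 kb/s, 0 URLs in 0 queues"
print(parse_fetcher_status(line))  # {'spinwaiting': 0, 'active': 0, 'pages': 2, 'errors': 1}
```

In the run above, 2 pages with 1 error reflects the failed fetch of http://www.itpub.net/.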
II. Step-by-step crawling with individual commands

1. InjectorJob

This step injects the URLs in seed.txt into the crawl queue to initialize the crawl.

(1) Basic command

```
$ bin/nutch inject
Usage: InjectorJob <url_dir> [-crawlId <id>]
$ bin/nutch inject urls
InjectorJob: starting at 2014-12-20 22:32:01
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2014-12-20 22:32:15, elapsed: 00:00:14
```
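As the HBase scan below shows, Nutch 2.x keys each row by the reversed URL: http://stackoverflow.com/ is stored as com.stackoverflow:http/. The authoritative implementation is `TableUtil.reverseUrl` in Nutch itself; the following is only a minimal Python sketch of the idea, ignoring non-default ports, query strings, and other edge cases that the real code handles:

```python
from urllib.parse import urlparse

def reverse_url(url):
    """Sketch of Nutch's reversed-URL row key: reversed host + ':' + scheme + path.

    Simplified: ignores non-default ports and query strings.
    """
    parts = urlparse(url)
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    return "{}:{}{}".format(reversed_host, parts.scheme, parts.path or "/")

print(reverse_url("http://stackoverflow.com/"))  # com.stackoverflow:http/
```

Keying by reversed host groups all pages of a domain into adjacent HBase rows, which makes per-site scans cheap.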
The content of urls/seed.txt is:

```
http://stackoverflow.com/
```

(2) Viewing the injected URL

The step above creates a new table in HBase named <crawlId>_webpage (here 334_webpage, since the crawl id used is 334), and the URL's data is written into that table:

```
hbase(main):002:0> scan '334_webpage'
ROW                          COLUMN+CELL
 com.stackoverflow:http/     column=f:fi, timestamp=1408953100271, value=\x00'\x8D\x00
 com.stackoverflow:http/     column=f:ts, timestamp=1408953100271, value=\x00\x00\x01H\x0C&\x11\x8D
 com.stackoverflow:http/     column=mk:_injmrk_, timestamp=1408953100271, value=y
 com.stackoverflow:http/     column=mk:dist, timestamp=1408953100271, value=0
 com.stackoverflow:http/     column=mtdt:_csh_, timestamp=1408953100271, value=?\x80\x00\x00
 com.stackoverflow:http/     column=s:s, timestamp=1408953100271, value=?\x80\x00\x00
1 row(s) in 0.3020 seconds
```

(3) About the *_webpage table

For every crawl job, a table named <crawlId>_webpage is created, and information about every URL, fetched or not, is stored in it. A URL that has not yet been fetched has only a few columns in its row; once it has been fetched, the fetched data (the page content and so on) is stored in the same row.

2. GeneratorJob

(1) Basic command

```
[jediael@jediael local]$ bin/nutch generate -crawlId 334
GeneratorJob: starting at 2014-08-25 15:57:12
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: finished at 2014-08-25 15:57:18, time elapsed: 00:00:06
GeneratorJob: generated batch id: 1408953432-1171377744
```

(2) Command options

```
[root@jediael local]# bin/nutch generate
Usage: GeneratorJob [-topN N] [-crawlId id] [-noFilter] [-noNorm] [-adddays numDays]
    -topN <N>      - number of top URLs to be selected, default is Long.MAX_VALUE
    -crawlId <id>  - the id to prefix the schemas to operate on, (default: storage.crawl.id)
    -noFilter      - do not activate the filter plugin to filter the url, default is true
    -noNorm        - do not activate the normalizer plugin to normalize the url, default is true
    -adddays       - Adds numDays to the current time to facilitate crawling urls already fetched sooner then db.fetch.interval.default. Default value is 0.
    -batchId       - the batch id
----------------------
Please set the params.
```

(3) Viewing the table

```
hbase(main):003:0> scan '334_webpage'
ROW                          COLUMN+CELL
 com.stackoverflow:http/     column=f:bid, timestamp=1408953437910, value=1408953432-1171377744
 com.stackoverflow:http/     column=f:fi, timestamp=1408953100271, value=\x00'\x8D\x00
 com.stackoverflow:http/     column=f:ts, timestamp=1408953100271, value=\x00\x00\x01H\x0C&\x11\x8D
 com.stackoverflow:http/     column=mk:_gnmrk_, timestamp=1408953437910, value=1408953432-1171377744
 com.stackoverflow:http/     column=mk:_injmrk_, timestamp=1408953100271, value=y
 com.stackoverflow:http/     column=mk:dist, timestamp=1408953100271, value=0
 com.stackoverflow:http/     column=mtdt:_csh_, timestamp=1408953100271, value=?\x80\x00\x00
 com.stackoverflow:http/     column=s:s, timestamp=1408953100271, value=?\x80\x00\x00
1 row(s) in 0.0490 seconds
```

This step added two columns: f:bid and mk:_gnmrk_.

3. FetcherJob

(1) Basic command

```
[jediael@jediael local]$ bin/nutch generate -crawlId 334
GeneratorJob: starting at 2014-08-25 15:57:12
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: finished at 2014-08-25 15:57:18, time elapsed: 00:00:06
GeneratorJob: generated batch id: 1408953432-1171377744
[jediael@jediael local]$ bin/nutch fetch -all -crawlId 334
FetcherJob: starting
FetcherJob: fetching all
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 1 records.
Hit by time limit :0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching http://stackoverflow.com/ (queue crawl delay=5000ms)
-finishing thread FetcherThread1, activeThreads=8
-finishing thread FetcherThread7, activeThreads=7
-finishing thread FetcherThread6, activeThreads=6
-finishing thread FetcherThread5, activeThreads=5
-finishing thread FetcherThread4, activeThreads=4
-finishing thread FetcherThread3, activeThreads=3
-finishing thread FetcherThread2, activeThreads=2
-finishing thread FetcherThread8, activeThreads=1
-finishing thread FetcherThread9, activeThreads=1
-finishing thread FetcherThread0, activeThreads=0
0/0 spinwaiting/active, 1 pages, 0 errors, 0.2 0 pages/s, 102 102 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done
```

(2) Viewing the table

See db1.txt. This step added the columns f:bas, f:cnt, f:prot, f:pts, f:st, f:ts, f:typ, h:Cache-Control, h:Connection, h:Content-Encoding, h:Content-Length, h:Content-Type, h:Date, h:Expires, h:Last-Modified, h:Set-Cookie, h:Vary, h:X-Frame-Options, mk:_ftcmrk_, among others.

4. ParserJob

(1) Basic command

```
[jediael@jediael local]$ bin/nutch parse -all -crawlId 334
ParserJob: starting
ParserJob: resuming:    false
ParserJob: forced reparse:      false
ParserJob: parsing all
Parsing http://stackoverflow.com/
ParserJob: success
```

(2) Command options

```
[root@jediael local]# bin/nutch parse
Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]
    <batchId>     - symbolic batch ID created by Generator
    -crawlId <id> - the id to prefix the schemas to operate on, (default: storage.crawl.id)
    -all          - consider pages from all crawl jobs
    -resume       - resume a previous incomplete job
    -force        - force re-parsing even if a page is already parsed
```

(3) Viewing the table

See db_parse.txt. Many columns of the form column=ol:http://stackoverflow.com/help were added, one per outlink; in this example there are 115 of them.

5. DbUpdaterJob

(1) Basic command

```
[jediael@jediael local]$ bin/nutch updatedb -crawlId 334
DbUpdaterJob: starting
DbUpdaterJob: done
```

(2) Viewing the table

See db_updatedb.txt. The 115 ol: outlink columns above were resolved, producing 115 new rows. Two examples (the HBase shell wraps long row keys across lines in the raw output; they are rejoined here):

```
 com.stackoverflow:http/users/3944974/silviu-oncioiu  column=f:fi, timestamp=1408954979355, value=\x00'\x8D\x00
 com.stackoverflow:http/users/3944974/silviu-oncioiu  column=f:st, timestamp=1408954979355, value=\x00\x00\x00\x01
 com.stackoverflow:http/users/3944974/silviu-oncioiu  column=f:ts, timestamp=1408954979355, value=\x00\x00\x01H\x0CB\xD4\x09
 com.stackoverflow:http/users/3944974/silviu-oncioiu  column=mk:dist, timestamp=1408954979355, value=1
 com.stackoverflow:http/users/3944974/silviu-oncioiu  column=mtdt:_csh_, timestamp=1408954979355, value=<\x0Ex5
 com.stackoverflow:http/users/3944974/silviu-oncioiu  column=s:s, timestamp=1408954979355, value=<\x0Ex5
 com.stackoverflow:http/users/3974525/laosi           column=f:fi, timestamp=1408954979355, value=\x00'\x8D\x00
 com.stackoverflow:http/users/3974525/laosi           column=f:st, timestamp=1408954979355, value=\x00\x00\x00\x01
 com.stackoverflow:http/users/3974525/laosi           column=f:ts, timestamp=1408954979355, value=\x00\x00\x01H\x0CB\xD4\x09
 com.stackoverflow:http/users/3974525/laosi           column=mk:dist, timestamp=1408954979355, value=1
 com.stackoverflow:http/users/3974525/laosi           column=mtdt:_csh_, timestamp=1408954979355, value=<\x0Ex5
 com.stackoverflow:http/users/3974525/laosi           column=s:s, timestamp=1408954979355, value=<\x0Ex5
```

At this point the data is ready for the next round of crawling.

6. SolrIndexerJob

(1) Basic command

```
[jediael@jediael local]$ bin/nutch solrindex http://****/solr/ -all -crawlId 334
SolrIndexerJob: starting
Adding 1 documents
SolrIndexerJob: done.
```

(2) Command options

```
[root@jediael local]# bin/nutch solrindex
Usage: SolrIndexerJob <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]
```

(3) Viewing the table

No changes.
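A closing observation on the batch ids seen throughout the logs (1404787293-26339, 1408953432-1171377744, and so on): the prefix looks like a Unix timestamp in seconds. For instance, 1404787293 corresponds to 10:41:33 CST, matching the "Tue Jul 8 10:41:33 CST 2014" iteration line in the crawl log above. The following is a small illustrative decoder (not part of Nutch; the suffix is treated as opaque):

```python
from datetime import datetime, timezone

def batch_id_time(batch_id):
    """Decode the epoch-seconds prefix of a Nutch batch id such as '1404787293-26339'."""
    seconds = int(batch_id.split("-", 1)[0])
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

print(batch_id_time("1404787293-26339"))  # 2014-07-08 02:41:33+00:00 (10:41:33 CST)
```

This can help correlate a batch id found in the *_webpage table (for example in the f:bid column) with the generate run that produced it.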