

[Nutch 2.2.1 Basic Tutorial 2.2] Building a Search Engine with Nutch/HBase/Solr, Part 2: Content Analysis


Please first read "Building a Search Engine with Nutch/HBase/Solr, Part 1: Installation and Operation" and set up the test environment:

http://blog.csdn.net/jediael_lu/article/details/37329731


I. The indexed fields: schema.xml

1. Basic contents of the file

When Solr is used to index the pages crawled by Nutch, schema.xml is replaced with the following content.

The file specifies which fields are indexed, stored, and so on.

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="nutch" version="1.5">
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="date" class="solr.TrieDateField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
    <fieldType name="url" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <field name="id" type="string" stored="true" indexed="true"/>
    <field name="_version_" type="long" indexed="true" stored="true"/>
    <!-- core fields -->
    <field name="batchId" type="string" stored="true" indexed="false"/>
    <field name="digest" type="string" stored="true" indexed="false"/>
    <field name="boost" type="float" stored="true" indexed="false"/>
    <!-- fields for index-basic plugin -->
    <field name="host" type="url" stored="false" indexed="true"/>
    <field name="url" type="url" stored="true" indexed="true" required="true"/>
    <field name="content" type="text" stored="false" indexed="true"/>
    <field name="title" type="text" stored="true" indexed="true"/>
    <field name="cache" type="string" stored="true" indexed="false"/>
    <field name="tstamp" type="date" stored="true" indexed="false"/>
    <!-- fields for index-anchor plugin -->
    <field name="anchor" type="string" stored="true" indexed="true" multiValued="true"/>
    <!-- fields for index-more plugin -->
    <field name="type" type="string" stored="true" indexed="true" multiValued="true"/>
    <field name="contentLength" type="long" stored="true" indexed="false"/>
    <field name="lastModified" type="date" stored="true" indexed="false"/>
    <field name="date" type="date" stored="true" indexed="true"/>
    <!-- fields for languageidentifier plugin -->
    <field name="lang" type="string" stored="true" indexed="true"/>
    <!-- fields for subcollection plugin -->
    <field name="subcollection" type="string" stored="true" indexed="true" multiValued="true"/>
    <!-- fields for feed plugin (tag is also used by microformats-reltag) -->
    <field name="author" type="string" stored="true" indexed="true"/>
    <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
    <field name="feed" type="string" stored="true" indexed="true"/>
    <field name="publishedDate" type="date" stored="true" indexed="true"/>
    <field name="updatedDate" type="date" stored="true" indexed="true"/>
    <!-- fields for creativecommons plugin -->
    <field name="cc" type="string" stored="true" indexed="true" multiValued="true"/>
    <!-- fields for tld plugin -->
    <field name="tld" type="string" stored="false" indexed="false"/>
  </fields>
  <uniqueKey>id</uniqueKey>
  <defaultSearchField>content</defaultSearchField>
  <solrQueryParser defaultOperator="OR"/>
</schema>

Analyzing the file above, it mainly specifies the following:

(1) fieldType: the types of the fields.

(2) field: which fields are indexed, stored, and so on, and the type of each field.

(3) uniqueKey: which field serves as the id, i.e. the unique identifier of a document.

(4) defaultSearchField: the default search field.

(5) solrQueryParser: OR, i.e. queries are built with the OR operator by default.
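
As a quick illustration of the last two settings, a bare multi-term query against the default request handler searches the content field and combines the terms with OR. A minimal sketch, assuming a single-core Solr at localhost:8983 (host, port, and query terms are placeholders):

curl "http://localhost:8983/solr/select?q=hadoop+nutch&fl=url,title&wt=json"
# With defaultSearchField=content and defaultOperator=OR, this matches any
# document whose content contains "hadoop" or "nutch", returning only the
# stored fields url and title.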


2. Analysis of the field attributes

(1) Two of the fields are required:

  • the field named by uniqueKey: id
  • the field declared with required: url

(2) After indexing, the schema's basic fields have the following status:

field     indexed  stored
batchId   no       yes
digest    no       yes
boost     no       yes
host      yes      no
url       yes      yes
content   yes      no
title     yes      yes
cache     no       yes
tstamp    no       yes
As the table shows, boost and tstamp are stored but not indexed: they cannot be searched on, but they can be shown in results.

content is the opposite, indexed but not stored: it can be searched, but it cannot be shown in the results.
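
This is easy to verify against Solr directly (a sketch; host, port, and query term are placeholders for your setup):

# A match inside content does return the document ...
curl "http://localhost:8983/solr/select?q=content:hadoop&fl=url,title,tstamp"
# ... but content itself can never appear in the response, because only
# stored fields can be returned in the field list (fl).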


II. The basic flow of a Nutch crawl

For a more detailed description of the crawl flow, see "Nutch 2.2.1 Crawl Flow".

./bin/crawl urls csdnitpub http://ip:8983/solr/ 5

Usage: crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
  <seedDir>: the directory containing the seed file(s)
  <crawlID>: the ID of this crawl job
  <solrURL>: the address of the Solr instance used for indexing and search
  <numberOfRounds>: the number of iterations, i.e. the crawl depth

A Nutch crawl proceeds as follows:
(1) InjectorJob
First iteration:
(2) GeneratorJob
(3) FetcherJob
(4) ParserJob
(5) DbUpdaterJob
(6) SolrIndexerJob
Second iteration:
(2) GeneratorJob (3) FetcherJob (4) ParserJob (5) DbUpdaterJob (6) SolrIndexerJob
Third iteration:
...
A typical job log looks like this:

InjectorJob: starting at 2014-07-08 10:41:27
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 2
Injector: finished at 2014-07-08 10:41:32, elapsed: 00:00:05
Tue Jul 8 10:41:33 CST 2014 : Iteration 1 of 5
Generating batchId
Generating a new fetchlist
GeneratorJob: starting at 2014-07-08 10:41:34
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2014-07-08 10:41:39, time elapsed: 00:00:05
GeneratorJob: generated batch id: 1404787293-26339
Fetching :
FetcherJob: starting
FetcherJob: batchId: 1404787293-26339
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1404798101129
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 2 records. Hit by time limit :0
fetching http://www.csdn.net/ (queue crawl delay=5000ms)
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching http://www.itpub.net/ (queue crawl delay=5000ms)
-finishing thread FetcherThread47, activeThreads=48
-finishing thread FetcherThread46, activeThreads=47
-finishing thread FetcherThread45, activeThreads=46
-finishing thread FetcherThread44, activeThreads=45
-finishing thread FetcherThread43, activeThreads=44
-finishing thread FetcherThread42, activeThreads=43
-finishing thread FetcherThread41, activeThreads=42
-finishing thread FetcherThread40, activeThreads=41
-finishing thread FetcherThread39, activeThreads=40
-finishing thread FetcherThread38, activeThreads=39
-finishing thread FetcherThread37, activeThreads=38
-finishing thread FetcherThread36, activeThreads=37
-finishing thread FetcherThread35, activeThreads=36
-finishing thread FetcherThread34, activeThreads=35
-finishing thread FetcherThread33, activeThreads=34
-finishing thread FetcherThread32, activeThreads=33
-finishing thread FetcherThread31, activeThreads=32
-finishing thread FetcherThread30, activeThreads=31
-finishing thread FetcherThread29, activeThreads=30
-finishing thread FetcherThread48, activeThreads=29
-finishing thread FetcherThread27, activeThreads=29
-finishing thread FetcherThread26, activeThreads=28
-finishing thread FetcherThread25, activeThreads=27
-finishing thread FetcherThread24, activeThreads=26
-finishing thread FetcherThread23, activeThreads=25
-finishing thread FetcherThread22, activeThreads=24
-finishing thread FetcherThread21, activeThreads=23
-finishing thread FetcherThread20, activeThreads=22
-finishing thread FetcherThread19, activeThreads=21
-finishing thread FetcherThread18, activeThreads=20
-finishing thread FetcherThread17, activeThreads=19
-finishing thread FetcherThread16, activeThreads=18
-finishing thread FetcherThread15, activeThreads=17
-finishing thread FetcherThread14, activeThreads=16
-finishing thread FetcherThread13, activeThreads=15
-finishing thread FetcherThread12, activeThreads=14
-finishing thread FetcherThread11, activeThreads=13
-finishing thread FetcherThread10, activeThreads=12
-finishing thread FetcherThread9, activeThreads=11
-finishing thread FetcherThread8, activeThreads=10
-finishing thread FetcherThread7, activeThreads=9
-finishing thread FetcherThread5, activeThreads=8
-finishing thread FetcherThread4, activeThreads=7
-finishing thread FetcherThread3, activeThreads=6
-finishing thread FetcherThread2, activeThreads=5
-finishing thread FetcherThread49, activeThreads=4
-finishing thread FetcherThread6, activeThreads=3
-finishing thread FetcherThread28, activeThreads=2
-finishing thread FetcherThread0, activeThreads=1
fetch of http://www.itpub.net/ failed with: java.io.IOException: unzipBestEffort returned null
-finishing thread FetcherThread1, activeThreads=0
0/0 spinwaiting/active, 2 pages, 1 errors, 0.4 0 pages/s, 93 93 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done
Parsing :
ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: batchId: 1404787293-26339
Parsing http://www.csdn.net/
http://www.csdn.net/ skipped. Content of size 92777 was truncated to 59561
Parsing http://www.itpub.net/
ParserJob: success
CrawlDB update for csdnitpub
DbUpdaterJob: starting
DbUpdaterJob: done
Indexing csdnitpub on SOLR index -> http://ip:8983/solr/
SolrIndexerJob: starting
SolrIndexerJob: done.
SOLR dedup -> http://ip:8983/solr/
Tue Jul 8 10:42:18 CST 2014 : Iteration 2 of 5
Generating batchId
Generating a new fetchlist
GeneratorJob: starting at 2014-07-08 10:42:19
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2014-07-08 10:42:25, time elapsed: 00:00:05
GeneratorJob: generated batch id: 1404787338-30453
Fetching :
FetcherJob: starting
FetcherJob: batchId: 1404787338-30453
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1404798146676
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0


Some additional notes:

Crawling the web was already explained above. You can add more URLs to the seed.txt file and crawl in the same way.
When a user invokes a crawling command in Apache Nutch 1.x, Apache Nutch generates CrawlDB, which is nothing but a directory containing details about crawling. In Apache Nutch 2.x, CrawlDB is not present; instead, Apache Nutch keeps all the crawling data directly in the database. In our case we have used Apache HBase, so all crawling data goes into Apache HBase. The following are details of how each crawling function works.


A crawling cycle has four steps, each of which is implemented as a Hadoop MapReduce job:
• GeneratorJob
• FetcherJob
• ParserJob (optionally done while fetching, using 'fetch.parse')
• DbUpdaterJob

Additionally, the following processes need to be understood:
• InjectorJob
• Invertlinks
• Indexing with Apache Solr

First of all, the job of the Injector is to populate initial rows for the web table. The InjectorJob initializes crawldb with the URLs that we have provided. We need to run the InjectorJob with certain URLs, which are then inserted into crawlDB.


Then the GeneratorJob uses these injected URLs and performs its operation. The table used for input and output by these jobs is called webpage, in which every row is a URL (web page). The row key is stored as a URL with reversed host components, so that URLs from the same TLD and domain are kept together and form a group. In most NoSQL stores row keys are sorted, which gives an advantage: with specific row-key filtering, scanning a subset is faster than scanning the entire table. The following are examples of row-key listing (a lookup sketch follows the examples):
• org.apache.www:http/
• org.apache.gora:http/
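
Because of the reversed form, assembling row keys by hand is error-prone; the readdb command (WebTableReader, per the command-to-class mapping in section IV) can look a row up by its plain URL. A minimal sketch, assuming the crawl ID from the earlier example; verify the exact flags against the usage printed by bin/nutch readdb:

bin/nutch readdb -url http://www.csdn.net/ -crawlId csdnitpub   # dump the row for one URL
bin/nutch readdb -stats -crawlId csdnitpub                      # summary statistics for the webpage table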
Let's define each step in depth so that we can understand crawling step-by-step.
Apache Nutch contains three main directories: crawlDB, linkdb, and a set of segments. crawlDB is the directory that holds information about every URL known to Apache Nutch; if a URL has been fetched, crawlDB records when it was fetched. The link database, or linkdb, contains all the links to each URL, including the source URL and the anchor text of each link. A set of segments is a set of URLs that is fetched as a unit. This directory contains the following subdirectories:
• crawl_generate names a set of URLs to be fetched
• crawl_fetch contains the status of fetching each URL
• content contains the content of the rows retrieved from every URL
Now let's understand each job of crawling in detail.
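
Before looking at each job, note that the whole cycle can also be driven by hand, since every job is exposed as a bin/nutch command (per the command-to-class mapping in section IV). A minimal sketch, reusing the crawl ID from the earlier example; the exact flags (-topN, -all, -crawlId) should be verified against each command's usage output:

bin/nutch inject urls -crawlId csdnitpub                  # InjectorJob: seed the webpage table
bin/nutch generate -topN 50000 -crawlId csdnitpub         # GeneratorJob: mark a batch of URLs for fetching
bin/nutch fetch -all -crawlId csdnitpub                   # FetcherJob: download the marked batch
bin/nutch parse -all -crawlId csdnitpub                   # ParserJob: extract text and outlinks
bin/nutch updatedb -crawlId csdnitpub                     # DbUpdaterJob: merge results back into the table
bin/nutch solrindex http://ip:8983/solr/ -all -crawlId csdnitpub   # SolrIndexerJob: push documents to Solr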


III. Some HBase operations related to crawling

(1) Enter the HBase shell

[root@jediael44 hbase-0.90.4]# ./bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.90.4, r1150278, Sun Jul 24 15:53:29 PDT 2011

(2) List all tables
hbase(main):001:0> list
TABLE
20140710_webpage
TestCrawl_webpage
csdnitpub_webpage
stackoverflow_webpage
4 row(s) in 0.7620 seconds

(3) View the structure of a table
hbase(main):002:0> describe '20140710_webpage'
DESCRIPTION                                                                    ENABLED
 {NAME => '20140710_webpage', FAMILIES => [                                    true
  {NAME => 'f', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'h', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'il', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'mk', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'mtdt', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'ol', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'p', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 's', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
1 row(s) in 0.0880 seconds

(4) View the fetched HTML of a row (the column families group the WebPage fields as laid out in conf/gora-hbase-mapping.xml; the raw page content lives in column f:cnt)

hbase(main):010:0> get '20140710_webpage', 'ar.com.admival.www:http/', 'f:cnt'
COLUMN        CELL
 f:cnt        timestamp=1404983104070, value=\x0D\x0A<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns="http://www.w3.org/TR/REC-html40">\x0D\x0A\x0D\x0A<head>\x0D\x0A<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">\x0D\x0A<meta http-equiv="Content-Language" content="es">\x0D\x0A\x0D\x0A<title>CAMBIO DE CHEQUES | cambio de cheque | cheque | cheques | compra de cheques de terceros | cambio cheques | cheques no a la orden </title>\x0D\x0A<link rel="shortcut icon" ref="favivon.ico" />\x0D\x0A<meta name="GENERATOR" content="Microsoft FrontPage 5.0">\x0D\x0A<meta name="ProgId" content="FrontPage.Editor.Document">\x0D\x0A<meta name=keywords content="cambio de cheques, compro cheques, canje de cheques, Compro cheques de terceros, cambio cheques">\x0D\x0A<meta name="description" http-equiv="description" content="Admival.com.ar cambia por efectivo sus cheques de terceros, posdatados, al d\xECa, de Sociedades, personales, cheques no a la orden, cheques de indemnizaci\xF3n por despido, etc. " /> \x0D\x0A<meta name="robots" content="index,follow">\

(the output continues ...)

The row keys can be obtained with scan '20140710_webpage'.
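
Two more shell one-liners are handy here (standard HBase shell syntax, with the table name from the examples above):

scan '20140710_webpage', {COLUMNS => 'f:cnt', LIMIT => 3}   # peek at the first few fetched pages only
count '20140710_webpage'                                    # number of rows (i.e. URLs) in the table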


IV. What ant runtime builds

1. Building Nutch

tar -zxvf apache-nutch-2.2.1-src.tar.gz
cd apache-nutch-2.2.1
ant runtime


2. The ant build generates a runtime folder containing deploy and local directories, which correspond to the two ways of running Nutch:

deploy: the data must live in Hadoop HDFS.

local: runs against the local filesystem.

(1) Their directory structures are as follows:

[jediael@jediael44 runtime]$ ls deploy/ local/
deploy/:
apache-nutch-2.2.1.job  bin

local/:
bin  conf  lib  logs  plugins  test

In deploy mode, the code and its dependencies are packaged into a single job file and run as a Hadoop job.

(2) Both directories contain a bin directory holding the same two executables, crawl and nutch.

Let's look at the last few lines of the nutch script:

if $local; then
  # fix for the external Xerces lib issue with SAXParserFactory
  NUTCH_OPTS="-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl $NUTCH_OPTS"
  EXEC_CALL="$JAVA $JAVA_HEAP_MAX $NUTCH_OPTS -classpath $CLASSPATH"
else
  # check that hadoop can be found on the path
  if [ $(which hadoop | wc -l ) -eq 0 ]; then
    echo "Can't find Hadoop executable. Add HADOOP_HOME/bin to the path or run in local mode."
    exit -1;
  fi
  # distributed mode
  EXEC_CALL="hadoop jar $NUTCH_JOB"
fi

# run it
exec $EXEC_CALL $CLASS "$@"

That is, by default EXEC_CALL="hadoop jar $NUTCH_JOB"; in local mode it becomes EXEC_CALL="$JAVA $JAVA_HEAP_MAX $NUTCH_OPTS -classpath $CLASSPATH"; and if the mode is not local and Hadoop cannot be found, the script exits with an error.
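
In other words, the same command runs in either mode depending on which copy of the script is invoked; the script decides based on the layout it finds around itself. For example (paths follow the runtime layout shown above):

runtime/local/bin/nutch inject urls    # local mode: plain JVM with -classpath
runtime/deploy/bin/nutch inject urls   # deploy mode: submitted via hadoop jar apache-nutch-2.2.1.job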

(3) Choosing the class according to the command argument

if [ "$COMMAND" = "crawl" ] ; then
  CLASS=org.apache.nutch.crawl.Crawler
elif [ "$COMMAND" = "inject" ] ; then
  CLASS=org.apache.nutch.crawl.InjectorJob
elif [ "$COMMAND" = "hostinject" ] ; then
  CLASS=org.apache.nutch.host.HostInjectorJob
elif [ "$COMMAND" = "generate" ] ; then
  CLASS=org.apache.nutch.crawl.GeneratorJob
elif [ "$COMMAND" = "fetch" ] ; then
  CLASS=org.apache.nutch.fetcher.FetcherJob
elif [ "$COMMAND" = "parse" ] ; then
  CLASS=org.apache.nutch.parse.ParserJob
elif [ "$COMMAND" = "updatedb" ] ; then
  CLASS=org.apache.nutch.crawl.DbUpdaterJob
elif [ "$COMMAND" = "updatehostdb" ] ; then
  CLASS=org.apache.nutch.host.HostDbUpdateJob
elif [ "$COMMAND" = "readdb" ] ; then
  CLASS=org.apache.nutch.crawl.WebTableReader
elif [ "$COMMAND" = "readhostdb" ] ; then
  CLASS=org.apache.nutch.host.HostDbReader
elif [ "$COMMAND" = "elasticindex" ] ; then
  CLASS=org.apache.nutch.indexer.elastic.ElasticIndexerJob
elif [ "$COMMAND" = "solrindex" ] ; then
  CLASS=org.apache.nutch.indexer.solr.SolrIndexerJob
elif [ "$COMMAND" = "solrdedup" ] ; then
  CLASS=org.apache.nutch.indexer.solr.SolrDeleteDuplicates
elif [ "$COMMAND" = "parsechecker" ] ; then
  CLASS=org.apache.nutch.parse.ParserChecker
elif [ "$COMMAND" = "indexchecker" ] ; then
  CLASS=org.apache.nutch.indexer.IndexingFiltersChecker
elif [ "$COMMAND" = "plugin" ] ; then
  CLASS=org.apache.nutch.plugin.PluginRepository

For example, the nutch fetch command corresponds to the class org.apache.nutch.fetcher.FetcherJob:

[jediael@jediael44 java]$ cat org/apache/nutch/fetcher/FetcherJob.java

This displays the class source. The same approach locates the source file behind any of the shell commands.
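
To jump straight from a command name to its class without reading the whole script, grep the mapping (a convenience one-liner, not part of the distribution):

grep -B1 "CLASS=org" runtime/local/bin/nutch   # prints each elif test together with the class it selects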


V. Modifying what Solr's browse page returns


1. Searching the content field

In the solrconfig.xml taken from the Solr example, qf is defined as follows:

<str name="qf">
   text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
   title^10.0 description^5.0 keywords^5.0 author^2.0 resourcename^1.0
</str>

Since content carries no weight here, a document that contains the keyword only in content will not be returned in the search results. For an index produced by Nutch, therefore, content must be given weight, and url as well if needed:

<str name="qf">
   content^1.0 text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
   title^10.0 description^5.0 keywords^5.0 author^2.0 resourcename^1.0
</str>
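
The effect of the new weights can be tried out before editing solrconfig.xml, since dismax parameters may be overridden per request (a sketch; host, core, and query term are placeholders, and %5E is the URL-encoded ^):

curl "http://localhost:8983/solr/select?defType=edismax&q=hadoop&qf=content%5E1.0+title%5E10.0&fl=url,title"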
2. Storing the page content

In schema.xml, change

<field name="content" type="text" stored="false" indexed="true"/>

to

<field name="content" type="text" stored="true" indexed="true"/>

(restart Solr and rebuild the index for the change to take effect).
3. Showing web pages and plain text together

In velocity/results_list.vm, uncomment the line ##parse("hit_plain.vm").

4. Adjusting what each hit in the results displays

vi richtext_doc.vm

Change

<div>Id: #field('id') </div>

to:

<div>time: #field('tstamp') </div>
<div>score: #field('score') </div>

The same method can change other fields; see http://blog.csdn.net/jediael_lu/article/details/38039267 for details.