[Nutch 2.2.1 Basic Tutorial, Part 3] Nutch 2.2.1 Configuration Files
nutch-site.xml
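Nutch 2.2.1 has two configuration files: nutch-default.xml and nutch-site.xml.
The former holds the default properties shipped with Nutch and should generally not be modified.
To change a default, add a property with the same name to nutch-site.xml and set its value there. Values in nutch-site.xml override those in nutch-default.xml.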
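As an illustration of the override mechanism, a minimal nutch-site.xml might look like the sketch below. It assumes the usual Hadoop-style layout with a configuration root element, and it uses db.ignore.external.links purely as an example property; the value shown is an illustrative choice, not a recommendation.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Same name as the property in nutch-default.xml; this value takes precedence. -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
</configuration>
```

Only the properties that differ from the defaults need to appear in this file; everything else continues to come from nutch-default.xml.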
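1. db.ignore.external.links
If true, only pages within the injected hosts are crawled and external links are ignored.
The same effect can be achieved by adding filters to regex-urlfilter.txt, but a large number of filters, say several thousand, will severely degrade Nutch's performance.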
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>
2. fetcher.parse
Can pages be parsed while they are being fetched? Yes, but this is not recommended.
<property>
  <name>fetcher.parse</name>
  <value>false</value>
  <description>If true, fetcher will parse content. NOTE: previous releases would
  default to true. Since 2.0 this is set to false as a safer default.
  </description>
</property>
Official explanation:
N.B. In a parsing fetcher, outlinks are processed in the reduce phase (at least when outlinks are followed). If a fetcher's reducer stalls you may run out of memory or disk space, usually after a very long reduce job. Behaviour typical to this is usually observed in this situation.
In summary, if it is possible, users are advised not to use a parsing fetcher as it is heavy on IO and often leads to the above outcome.
3. db.max.outlinks.per.page
By default, Nutch processes only 100 outlinks per page, so some links are never crawled. To change this behaviour, modify this property.
<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>
The official explanation (http://wiki.apache.org/nutch/FAQ/) is as follows:
Nutch doesn't crawl relative URLs? Some pages are not indexed but my regex file and everything else is okay - what is going on?
The crawl tool has a default limitation of 100 outlinks of one page that are being fetched. To overcome this limitation change the db.max.outlinks.per.page property to a higher value or simply -1 (unlimited).
file: conf/nutch-default.xml
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>
see also: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg08665.html
4. file.content.limit, http.content.limit, ftp.content.limit
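By default, Nutch fetches only the first 65536 bytes of a page and discards the rest.
On some large sites, however, the home page is far larger than 65536 bytes, and the first 65536 bytes may even consist entirely of layout markup without a single hyperlink.
The default values should therefore be raised in nutch-site.xml.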
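The original values are not reproduced here, so the following is only a sketch of what such an override in nutch-site.xml could look like. It assumes that, as described for these properties in nutch-default.xml, a negative value disables truncation; -1 is an illustrative choice, and a larger positive byte count would serve equally well where unbounded downloads are a concern.

```xml
<!-- Place these inside the <configuration> element of nutch-site.xml. -->
<!-- Values are illustrative: -1 is assumed to mean "do not truncate". -->
<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
<property>
  <name>ftp.content.limit</name>
  <value>-1</value>
</property>
```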