
[Nutch 2.2.1 Basics Tutorial, Part 3] Nutch 2.2.1 Configuration Files

Published: 2024/1/23


nutch-site.xml

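Nutch 2.2.1 has two configuration files: nutch-default.xml and nutch-site.xml.

The former holds the default properties that ship with Nutch and should generally be left unmodified.

To change a default, add a property with the same name to nutch-site.xml and set the value you want. Values in nutch-site.xml override those in nutch-default.xml.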
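The override rule can be illustrated with a small sketch. This is not Nutch's own loading code; it only mimics the "same name in nutch-site.xml wins" behaviour in Python. The two XML strings are trimmed-down, hypothetical stand-ins for the real files, and `read_properties` and `effective_config` are names made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-ins for the two configuration files.
NUTCH_DEFAULT_XML = """<configuration>
  <property><name>db.ignore.external.links</name><value>false</value></property>
  <property><name>fetcher.parse</name><value>false</value></property>
</configuration>"""

NUTCH_SITE_XML = """<configuration>
  <property><name>db.ignore.external.links</name><value>true</value></property>
</configuration>"""


def read_properties(xml_text):
    """Return {name: value} for every <property> in a configuration document."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name").strip(): (prop.findtext("value") or "").strip()
        for prop in root.iter("property")
    }


def effective_config(default_xml, site_xml):
    """Merge the two files; entries from the site file win on a name clash."""
    merged = read_properties(default_xml)
    merged.update(read_properties(site_xml))
    return merged


config = effective_config(NUTCH_DEFAULT_XML, NUTCH_SITE_XML)
print(config["db.ignore.external.links"])  # value taken from the site file
print(config["fetcher.parse"])             # value taken from the default file
```

Only the properties you actually want to change need to appear in nutch-site.xml; everything else falls back to nutch-default.xml.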

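1. db.ignore.external.links

If true, Nutch crawls only pages on the initially injected hosts and ignores outlinks that lead to other hosts.

You can get the same effect by adding filters to regex-urlfilter.txt, but a large number of filters (several thousand, say) will severely degrade Nutch's performance.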

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>
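As a rough illustration of what "ignoring external links" means, the sketch below drops every outlink whose host differs from the host of the page it was found on. It assumes an exact host comparison, which is what the property description suggests; it is not Nutch's actual code, and `filter_outlinks` and the example URLs are hypothetical.

```python
from urllib.parse import urlparse


def filter_outlinks(page_url, outlinks, ignore_external_links):
    """Drop outlinks whose host differs from the page's host when the flag is set."""
    if not ignore_external_links:
        return list(outlinks)
    page_host = urlparse(page_url).hostname or ""
    return [
        link for link in outlinks
        if (urlparse(link).hostname or "") == page_host
    ]


links = [
    "http://example.com/about",
    "http://news.example.com/today",
    "http://other.org/",
]
kept = filter_outlinks("http://example.com/index.html", links, True)
print(kept)
```

Under this assumption a subdomain such as news.example.com counts as a different host, so a single boolean replaces what would otherwise be one URL filter rule per site.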

2. fetcher.parse

Can Nutch parse pages while it fetches them? Yes, but this is not recommended.

<property>
  <name>fetcher.parse</name>
  <value>false</value>
  <description>If true, fetcher will parse content. NOTE: previous releases would
  default to true. Since 2.0 this is set to false as a safer default.
  </description>
</property>

The official explanation:

N.B. In a parsing fetcher, outlinks are processed in the reduce phase (at least when outlinks are followed). If a fetcher's reducer stalls you may run out of memory or disk space, usually after a very long reduce job. Behaviour typical to this is usually observed in this situation.

In summary, if it is possible, users are advised not to use a parsing fetcher as it is heavy on IO and often leads to the above outcome.

3. db.max.outlinks.per.page

By default, Nutch processes at most 100 outlinks per page, so some links on link-heavy pages are never crawled. To change this, adjust the following property.

<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>

The official FAQ (http://wiki.apache.org/nutch/FAQ/) explains:

Nutch doesn't crawl relative URLs? Some pages are not indexed but my regex file and everything else is okay - what is going on?

The crawl tool has a default limitation of 100 outlinks of one page that are being fetched. To overcome this limitation change the db.max.outlinks.per.page property to a higher value or simply -1 (unlimited).

file: conf/nutch-default.xml

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>

see also: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg08665.html
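The rule in the property description (a nonnegative value caps the number of outlinks, a negative value means no cap) can be sketched as follows. This is a minimal illustration, not Nutch's implementation; `limit_outlinks` and the generated URLs are hypothetical.

```python
def limit_outlinks(outlinks, max_outlinks_per_page):
    """Keep at most max_outlinks_per_page links; a negative limit keeps them all."""
    if max_outlinks_per_page >= 0:
        return outlinks[:max_outlinks_per_page]
    return list(outlinks)


# A hypothetical page carrying 250 outlinks.
page_outlinks = [f"http://example.com/item/{i}" for i in range(250)]

print(len(limit_outlinks(page_outlinks, 100)))  # capped at the default limit
print(len(limit_outlinks(page_outlinks, -1)))   # -1 disables the cap
```

With the default of 100, the remaining 150 links on this hypothetical page would never be processed, which matches the symptom described in the FAQ.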

4. file.content.limit, http.content.limit, ftp.content.limit

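By default, Nutch downloads only the first 65536 bytes of a page and discards the rest.
On some large sites, however, the home page is far larger than 65536 bytes, and the first 65536 bytes may contain nothing but layout markup without a single hyperlink.
The defaults are therefore changed as follows: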

<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the http.content.limit setting.
  </description>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>

<property>
  <name>ftp.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  Caution: classical ftp RFCs never defines partial transfer and, in fact,
  some ftp servers out there do not handle client side forced close-down very
  well. Our implementation tries its best to handle such situations smoothly.
  </description>
</property>
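To see why truncation can hide links, here is a minimal sketch of the rule stated in the descriptions: a nonnegative limit cuts the content at that many bytes, a negative one leaves it whole. The function `apply_content_limit` and the sample page are hypothetical and only model the documented behaviour.

```python
def apply_content_limit(content: bytes, content_limit: int) -> bytes:
    """Truncate content to content_limit bytes; a negative limit means no truncation."""
    if content_limit >= 0 and len(content) > content_limit:
        return content[:content_limit]
    return content


# Hypothetical page: 75000 bytes of layout markup followed by the only hyperlink.
page = b"<!-- layout -->" * 5000 + b'<a href="/products">Products</a>'

truncated = apply_content_limit(page, 65536)
full = apply_content_limit(page, -1)

print(len(truncated), b"<a href" in truncated)  # the link falls past the cut
print(len(full), b"<a href" in full)            # the link survives
```

Removing the limit trades memory and bandwidth for completeness, so on crawls that include very large files a generous positive limit may be a safer choice than -1.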
