

[Nutch 2.2.1 Basic Tutorial, Part 3] Nutch 2.2.1 Configuration Files

Published: 2024/1/23


nutch-site.xml

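Nutch 2.2.1 reads its properties from two configuration files: nutch-default.xml and nutch-site.xml.

The former holds the default properties that ship with Nutch and should generally be left unmodified.

To change a default, add a property with the same name to nutch-site.xml and set the new value there. Values in nutch-site.xml override those in nutch-default.xml.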
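The override rule can be illustrated with a minimal sketch. Nutch itself loads these files through Hadoop's configuration mechanism; the Python below only mimics the "site value wins" behaviour. The two XML strings are abbreviated samples, and the function names `read_properties` and `effective_config` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-ins for conf/nutch-default.xml and conf/nutch-site.xml.
DEFAULT_XML = """
<configuration>
  <property><name>db.ignore.external.links</name><value>false</value></property>
  <property><name>http.content.limit</name><value>65536</value></property>
</configuration>
"""

SITE_XML = """
<configuration>
  <property><name>http.content.limit</name><value>-1</value></property>
</configuration>
"""


def read_properties(xml_text):
    """Return a {name: value} dict for every <property> element."""
    root = ET.fromstring(xml_text.strip())
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


def effective_config(default_xml, site_xml):
    """Start from the defaults, then let same-named site properties replace them."""
    merged = read_properties(default_xml)
    merged.update(read_properties(site_xml))
    return merged


config = effective_config(DEFAULT_XML, SITE_XML)
for name in sorted(config):
    print(name, "=", config[name])
```

Properties that appear only in the defaults keep their default value; only the names repeated in the site file change.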


1. db.ignore.external.links

If true, Nutch crawls only pages on the same hosts as the injected seed URLs and ignores links to external hosts.

The same effect can be achieved by adding filters to regex-urlfilter.txt, but a large number of filters (several thousand, say) severely degrades Nutch's performance.

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>
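As a rough sketch of what "external host" means here, the snippet below keeps only the outlinks whose host equals the host of the page they were found on. It assumes a plain host-name comparison, so a subdomain such as www.example.com counts as a different host from example.com; the function name `filter_outlinks` and the sample URLs are hypothetical, and this is not Nutch's own code.

```python
from urllib.parse import urlsplit


def filter_outlinks(page_url, outlinks, ignore_external_links):
    """Drop outlinks to other hosts when ignore_external_links is true."""
    if not ignore_external_links:
        return list(outlinks)
    page_host = urlsplit(page_url).hostname or ""
    # Keep a link only if its host matches the host of the source page.
    return [link for link in outlinks
            if (urlsplit(link).hostname or "") == page_host]


links = ["http://example.com/b", "http://other.org/c", "http://www.example.com/d"]
kept = filter_outlinks("http://example.com/a", links, True)
print(kept)
```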

2. fetcher.parse

Can content be parsed while it is being fetched? Yes, but this is not recommended.

<property>
  <name>fetcher.parse</name>
  <value>false</value>
  <description>If true, fetcher will parse content. NOTE: previous releases would
  default to true. Since 2.0 this is set to false as a safer default.
  </description>
</property>

Official explanation:

N.B. In a parsing fetcher, outlinks are processed in the reduce phase (at least when outlinks are followed). If a fetcher's reducer stalls you may run out of memory or disk space, usually after a very long reduce job. Behaviour typical to this is usually observed in this situation.

In summary, if it is possible, users are advised not to use a parsing fetcher as it is heavy on IO and often leads to the above outcome.


3. db.max.outlinks.per.page

By default, Nutch processes at most 100 outlinks per page, so some links are never fetched. To change this, adjust this property.

<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>

The official FAQ (http://wiki.apache.org/nutch/FAQ/) explains:

Nutch doesn't crawl relative URLs? Some pages are not indexed but my regex file and everything else is okay - what is going on?

The crawl tool has a default limitation of 100 outlinks of one page that are being fetched. To overcome this limitation change the db.max.outlinks.per.page property to a higher value or simply -1 (unlimited).

file: conf/nutch-default.xml

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>

see also: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg08665.html
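The rule stated in the property description (nonnegative means "at most that many", anything else means "all") can be sketched as follows. This is an illustration of the documented semantics only, not Nutch's implementation; `limit_outlinks` and the sample URLs are hypothetical.

```python
def limit_outlinks(outlinks, max_outlinks_per_page):
    """Apply the db.max.outlinks.per.page rule to a page's outlinks."""
    if max_outlinks_per_page >= 0:
        # Nonnegative: process at most this many outlinks.
        return outlinks[:max_outlinks_per_page]
    # Negative (for example -1): process every outlink.
    return list(outlinks)


page_links = ["http://example.com/page%d" % i for i in range(150)]
print(len(limit_outlinks(page_links, 100)))
print(len(limit_outlinks(page_links, -1)))
```

With the default of 100, the last 50 links of this sample page would never be processed; a value of 0 would drop all of them.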


4. file.content.limit, http.content.limit, ftp.content.limit

By default, Nutch keeps only the first 65536 bytes of a page and discards the rest.
On some large sites, however, the home page is far larger than 65536 bytes, and the first 65536 bytes may contain nothing but layout markup, with no hyperlinks at all.
The defaults are therefore changed as follows:

<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the http.content.limit setting.
  </description>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>

<property>
  <name>ftp.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  Caution: classical ftp RFCs never defines partial transfer and, in fact,
  some ftp servers out there do not handle client side forced close-down very
  well. Our implementation tries its best to handle such situations smoothly.
  </description>
</property>
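To see why truncation can hide links, here is a small sketch of the documented rule applied to a made-up page whose first 70000 bytes are layout padding. The function name `apply_content_limit` and the sample page are assumptions for illustration; this is not how Nutch's protocol plugins are actually written.

```python
def apply_content_limit(content: bytes, content_limit: int) -> bytes:
    """Truncate content to content_limit bytes when the limit is nonnegative."""
    if content_limit >= 0 and len(content) > content_limit:
        return content[:content_limit]
    # Negative limit (for example -1) or short content: keep everything.
    return content


# Hypothetical page: 70000 bytes of layout, then the only hyperlink.
link = b'<a href="http://example.com/next">next</a>'
page = b" " * 70000 + link

truncated = apply_content_limit(page, 65536)
unlimited = apply_content_limit(page, -1)
print(link in truncated, link in unlimited)
```

With the default limit the link falls past the cut-off and is lost; with -1 the whole page, including the link, is kept.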
