heritrix3.x--SURT / Restricting Heritrix's crawl scope

Published: 2024/1/8

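The surt property shows up frequently in Heritrix 3.x CXML files. What exactly is it? Because it is an abbreviation and a fairly obscure one, its meaning is not obvious from the name, so here is the full official explanation: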

Sort-friendly URI Reordering Transform. Converts URIs of the form: scheme://userinfo@domain.tld:port/path?query#fragment ...into... scheme://(tld,domain,:port@userinfo)/path?query#fragment The '(' ')' characters serve as an unambiguous notice that the so-called 'authority' portion of the URI ([userinfo@]host[:port] in http URIs) has been transformed; the commas prevent confusion with regular hostnames. This remedies the 'problem' with standard URIs that the host portion of a regular URI, with its dotted-domains, is actually in reverse order from the natural hierarchy that's usually helpful for grouping and sorting. The value of respecting URI case variance is considered negligible: it is vanishingly rare for case-variance to be meaningful, while URI case-variance often arises from people's confusion or sloppiness, and they only correct it insofar as necessary to avoid blatant problems. Thus the usual SURT form is considered to be flattened to all lowercase, and not completely reversible.

Source: http://crawler.archive.org/apidocs/org/archive/util/SURT.html

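Put simply, SURT converts a conventional dotted host name into a different, unambiguous form in which the domain components appear in reverse order (top-level domain first). It should come up in the configuration file.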
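To make the transformation concrete, here is a minimal sketch of it in Java. It is not Heritrix's own org.archive.util.SURT implementation: the class name SurtSketch and the method toSurt are made up for illustration, it relies only on java.net.URI, it assumes the input is a well-formed URI with a host, and it simply lowercases the whole result.

```java
import java.net.URI;

// Hypothetical helper for illustration only; not part of Heritrix.
class SurtSketch {

    // Rewrites scheme://userinfo@domain.tld:port/path?query#fragment
    // as      scheme://(tld,domain,:port@userinfo)/path?query#fragment
    static String toSurt(String url) {
        URI u = URI.create(url);
        String[] labels = u.getHost().split("\\.");

        StringBuilder sb = new StringBuilder();
        sb.append(u.getScheme()).append("://(");
        // Host labels in reverse order, each followed by a comma.
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]).append(',');
        }
        if (u.getPort() != -1) {
            sb.append(':').append(u.getPort());
        }
        if (u.getRawUserInfo() != null) {
            sb.append('@').append(u.getRawUserInfo());
        }
        sb.append(')');
        sb.append(u.getRawPath());
        if (u.getRawQuery() != null) {
            sb.append('?').append(u.getRawQuery());
        }
        if (u.getRawFragment() != null) {
            sb.append('#').append(u.getRawFragment());
        }
        // The quoted description says the usual SURT form is all lowercase.
        return sb.toString().toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(toSurt("http://test.blogs.com/between_the_lines/page"));
    }
}
```

With this sketch, http://test.blogs.com/between_the_lines/page becomes http://(com,blogs,test,)/between_the_lines/page, which is the form used in the surtsSource value of the configuration example.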

Configuration example: http://tech.groups.yahoo.com/group/archive-crawler/message/7375

<bean class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
    <list>
      <bean class="org.archive.modules.deciderules.RejectDecideRule" />
      <bean
          class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
        <property name="seedsAsSurtPrefixes" value="false" />
        <property name="surtsSource">
          <bean class="org.archive.spring.ConfigString">
            <property name="value">
              <value>
                +http://(com,blogs,test,)/between_the_lines/page
                +http://(com,blogs,test,)/between_the_lines/archive
              </value>
            </property>
          </bean>
        </property>
      </bean>
      <bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
        <property name="regexList">
          <list>
            <value>^http://test\.blogs\.com/between_the_lines/$</value>
            <value>^.*index.html*$</value>
          </list>
        </property>
      </bean>
    </list>
  </property>
</bean>

The intended effect of the configuration above is to crawl pages containing index.html under the following locations:

http://test.blogs.com/between_the_lines/
http://test.blogs.com/between_the_lines/page*
http://test.blogs.com/between_the_lines/archives*
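The SURT entries in the configuration act as prefixes: a URI is in scope when its SURT form starts with one of the listed strings. The following is a rough sketch of that idea only. SurtPrefixSketch and accepts are hypothetical names, the candidates are assumed to be already converted to SURT form, and Heritrix's real SurtPrefixedDecideRule does more than a plain startsWith check.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of prefix matching on SURT strings; not Heritrix code.
class SurtPrefixSketch {

    // The two prefixes from the configuration example, without the leading '+'.
    static final List<String> PREFIXES = Arrays.asList(
            "http://(com,blogs,test,)/between_the_lines/page",
            "http://(com,blogs,test,)/between_the_lines/archive");

    // True when the candidate (already in SURT form) begins with any prefix.
    static boolean accepts(List<String> prefixes, String candidateSurt) {
        for (String prefix : prefixes) {
            if (candidateSurt.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] candidates = {
            "http://(com,blogs,test,)/between_the_lines/page2",
            "http://(com,blogs,test,)/between_the_lines/archives/2010",
            "http://(com,blogs,other,)/between_the_lines/page"
        };
        for (String c : candidates) {
            System.out.println(c + " -> " + accepts(PREFIXES, c));
        }
    }
}
```

Because the match is a prefix match, the entry ending in /archive also covers URIs under /archives, which is consistent with the archives* pattern listed above.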

————————————————————————————————————————

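In testing, the crawl scope restricted under surtsSource does parse the current page, but the crawler still follows external links (this needs further investigation).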
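One thing that can be checked without Heritrix is what the two regular expressions in regexList match on their own. The sketch below uses only java.util.regex; RegexListSketch and matchesAny are made-up names, and whether this accounts for the external links observed in the test is an assumption, not something confirmed here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical check of the two regexes from the configuration; not Heritrix code.
class RegexListSketch {

    static final List<Pattern> REGEXES = Arrays.asList(
            Pattern.compile("^http://test\\.blogs\\.com/between_the_lines/$"),
            Pattern.compile("^.*index.html*$"));

    // True when the whole URL matches at least one regex in the list.
    static boolean matchesAny(String url) {
        for (Pattern p : REGEXES) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://test.blogs.com/between_the_lines/",
            "http://test.blogs.com/between_the_lines/page1",
            "http://example.org/some/dir/index.html"
        };
        for (String u : urls) {
            System.out.println(u + " -> " + matchesAny(u));
        }
    }
}
```

The second pattern is not tied to any host, so it matches any URL that ends in index.htm or index.html, including URLs on other sites. If the rule accepts whatever its list matches, that may be one source of the out-of-scope fetches, but that remains to be verified against Heritrix itself.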

The specific steps are as follows:

1. Under org.archive.crawler.frontier, create a new class named ELFHashQueueAssignmentPolicy. Note that this class must extend QueueAssignmentPolicy.

2. Write the following code in that class:

package org.archive.crawler.frontier;

import java.util.logging.Logger;

// Assumption: these package locations match Heritrix 1.x, whose
// getClassKey(CrawlController, CandidateURI) signature the original code uses.
import org.archive.crawler.datamodel.CandidateURI;
import org.archive.crawler.framework.CrawlController;

public class ELFHashQueueAssignmentPolicy extends QueueAssignmentPolicy {
    private static final Logger logger = Logger
            .getLogger(ELFHashQueueAssignmentPolicy.class.getName());

    // Map each URI to one of 100 queue keys ("0" to "99") by hashing the full URI string.
    public String getClassKey(CrawlController controller,
            CandidateURI cauri) {
        String uri = cauri.getUURI().toString();
        long hash = ELFHash(uri);
        String a = Long.toString(hash % 100);
        return a;
    }

    // ELF hash of a string; the result is always non-negative.
    public long ELFHash(String str) {
        long hash = 0;
        long x = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            // Fold the bits selected by the 0xF0000000 mask back in, then clear them.
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return (hash & 0x7FFFFFFF);
    }
}
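The hashing part can be tried outside Heritrix. The sketch below copies the same ELF hash and modulo-100 step into a standalone class. ElfHashSketch and queueKey are hypothetical names, and no Heritrix types are involved.

```java
// Standalone illustration of the hashing used above; not Heritrix code.
class ElfHashSketch {

    // Same ELF hash as in the policy class.
    static long elfHash(String str) {
        long hash = 0;
        long x;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return hash & 0x7FFFFFFF;
    }

    // Queue key in the range "0".."99", as produced by getClassKey above.
    static String queueKey(String uri) {
        return Long.toString(elfHash(uri) % 100);
    }

    public static void main(String[] args) {
        String[] uris = {
            "http://test.blogs.com/between_the_lines/page1",
            "http://test.blogs.com/between_the_lines/page2",
            "http://test.blogs.com/between_the_lines/archives/2010"
        };
        for (String u : uris) {
            System.out.println(queueKey(u) + "  " + u);
        }
    }
}
```

Since the key depends on the whole URI string rather than on the host, URIs from the same site can end up with different keys, and the same URI always gets the same key.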
