
Mini Search Engine – Just the Basics, Using Neo4j, Crawler4j, Graphstream and Encog


Continuing with chapter 4 of Programming Collective Intelligence (PCI), which implements a search engine.

I probably bit off more than I should have for a single exercise. Rather than using the conventional relational database structure used in the book, I had been wanting to have a look at Neo4j for a while, so now was the time. Just to say upfront, this is not necessarily an ideal use case for a graph database, but how hard could it be to kill three birds with one stone?

Resetting my SQL Server / Oracle way of thinking took a bit longer than expected, but fortunately there are some great resources around Neo4j.

Just a couple:

  • neo4j – learn
  • Graph theory for the busy developer
  • Graph databases

Since I just wanted to run this as a small exercise, I decided on the in-memory implementation rather than running Neo4j as a service on my machine. In hindsight that was probably a mistake: the tools and web interface would have helped me visualise my data graph much earlier.

Because you can only have one writable instance of the in-memory implementation, I made a double-checked-locking singleton factory to create and clear the database.

package net.briandupreez.pci.chapter4;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.neo4j.kernel.impl.util.FileUtils;

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CreateDBFactory {

    private static GraphDatabaseService graphDb = null;
    public static final String RESOURCES_CRAWL_DB = "resources/crawl/db";

    public static GraphDatabaseService createInMemoryDB() {
        if (null == graphDb) {
            synchronized (GraphDatabaseService.class) {
                if (null == graphDb) {
                    final Map<String, String> config = new HashMap<>();
                    config.put("neostore.nodestore.db.mapped_memory", "50M");
                    config.put("string_block_size", "60");
                    config.put("array_block_size", "300");
                    graphDb = new GraphDatabaseFactory()
                            .newEmbeddedDatabaseBuilder(RESOURCES_CRAWL_DB)
                            .setConfig(config)
                            .newGraphDatabase();
                    registerShutdownHook(graphDb);
                }
            }
        }
        return graphDb;
    }

    private static void registerShutdownHook(final GraphDatabaseService graphDb) {
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                graphDb.shutdown();
            }
        });
    }

    public static void clearDb() {
        try {
            if (graphDb != null) {
                graphDb.shutdown();
                graphDb = null;
            }
            FileUtils.deleteRecursively(new File(RESOURCES_CRAWL_DB));
        } catch (final IOException e) {
            throw new RuntimeException(e);
        }
    }
}
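
The factory is then used wherever a handle on the graph is needed. A minimal usage sketch (not from the original post; the property name is illustrative) of clearing the store, creating the database and writing a node inside a Neo4j 1.9 transaction:

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

public class CreateDBFactoryDemo {
    public static void main(final String[] args) {
        CreateDBFactory.clearDb(); // start from an empty store
        final GraphDatabaseService graphDb = CreateDBFactory.createInMemoryDB();
        final Transaction tx = graphDb.beginTx();
        try {
            final Node node = graphDb.createNode();
            node.setProperty("url", "http://www.briandupreez.net"); // illustrative property
            tx.success();
        } finally {
            tx.finish(); // Neo4j 1.9 API; later versions use tx.close()
        }
        // the registered shutdown hook closes the database on JVM exit
    }
}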

I then used Crawler4j to create a graph of all the URLs starting with my blog, their relationships to other URLs, and all the words those URLs contain along with each word's index.

package net.briandupreez.pci.chapter4;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.Index;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Neo4JWebCrawler extends WebCrawler {

    private final GraphDatabaseService graphDb;

    /**
     * Constructor.
     */
    public Neo4JWebCrawler() {
        this.graphDb = CreateDBFactory.createInMemoryDB();
    }

    @Override
    public boolean shouldVisit(final WebURL url) {
        final String href = url.getURL().toLowerCase();
        return !NodeConstants.FILTERS.matcher(href).matches();
    }

    /**
     * This function is called when a page is fetched and ready
     * to be processed by your program.
     */
    @Override
    public void visit(final Page page) {
        final String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);
        final Index<Node> nodeIndex = graphDb.index().forNodes(NodeConstants.PAGE_INDEX);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            //String html = htmlParseData.getHtml();
            List<WebURL> links = htmlParseData.getOutgoingUrls();

            Transaction tx = graphDb.beginTx();
            try {
                final Node pageNode = graphDb.createNode();
                pageNode.setProperty(NodeConstants.URL, url);
                nodeIndex.add(pageNode, NodeConstants.URL, url);

                //get all the words
                final List<String> words = cleanAndSplitString(text);
                int index = 0;
                for (final String word : words) {
                    final Node wordNode = graphDb.createNode();
                    wordNode.setProperty(NodeConstants.WORD, word);
                    wordNode.setProperty(NodeConstants.INDEX, index++);
                    final Relationship relationship = pageNode.createRelationshipTo(
                            wordNode, RelationshipTypes.CONTAINS);
                    relationship.setProperty(NodeConstants.SOURCE, url);
                }

                for (final WebURL webURL : links) {
                    System.out.println("Linking to " + webURL);
                    final Node linkNode = graphDb.createNode();
                    linkNode.setProperty(NodeConstants.URL, webURL.getURL());
                    final Relationship relationship = pageNode.createRelationshipTo(
                            linkNode, RelationshipTypes.LINK_TO);
                    relationship.setProperty(NodeConstants.SOURCE, url);
                    relationship.setProperty(NodeConstants.DESTINATION, webURL.getURL());
                }

                tx.success();
            } finally {
                tx.finish();
            }
        }
    }

    private static List<String> cleanAndSplitString(final String input) {
        if (input != null) {
            final String[] dic = input.toLowerCase()
                    .replaceAll("\\p{Punct}", "")
                    .replaceAll("\\p{Digit}", "")
                    .split("\\s+");
            return Arrays.asList(dic);
        }
        return new ArrayList<>();
    }
}
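
The post doesn't show the crawl bootstrap itself. A sketch using the standard Crawler4j 3.x setup, with the seed URL, storage folder, depth and crawler count all as assumed placeholders:

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class CrawlerMain {
    public static void main(final String[] args) throws Exception {
        final CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("resources/crawl/root"); // crawler4j's own state, not the Neo4j store
        config.setMaxDepthOfCrawling(2);
        config.setPolitenessDelay(200);

        final PageFetcher pageFetcher = new PageFetcher(config);
        final RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        final CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://www.briandupreez.net/"); // the blog used as the crawl root
        controller.start(Neo4JWebCrawler.class, 2); // blocks until the crawl completes
    }
}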

Once the data was collected, I could query it and perform the functions of a search engine. For this I decided to use Java futures, as it was another thing I had only read about and not yet implemented. In my day-to-day work environment we use the Weblogic / CommonJ work manager in the application server for the same task.

final ExecutorService executorService = Executors.newFixedThreadPool(4);
final String[] searchTerms = {"java", "spring"};
List<Callable<TaskResponse>> tasks = new ArrayList<>();
tasks.add(new WordFrequencyTask(searchTerms));
tasks.add(new DocumentLocationTask(searchTerms));
tasks.add(new PageRankTask(searchTerms));
tasks.add(new NeuralNetworkTask(searchTerms));
final List<Future<TaskResponse>> results = executorService.invokeAll(tasks);
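
The snippet stops at invokeAll. A plausible continuation of the same fragment (a sketch; the public resultMap field is taken from the PageRankTask listing further down) sums each task's normalised score per URL:

// Collect the results: each task returns a map of URL -> normalised score,
// which is summed into a single combined ranking map.
final Map<String, Double> combined = new HashMap<>();
for (final Future<TaskResponse> future : results) {
    final TaskResponse response = future.get(); // blocks until the task finishes
    for (final Map.Entry<String, Double> entry : response.resultMap.entrySet()) {
        final Double current = combined.get(entry.getKey());
        combined.put(entry.getKey(), current == null ? entry.getValue() : current + entry.getValue());
    }
}
executorService.shutdown();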

I then created a task for each of the following: word frequency, document location, PageRank and a neural network (with fake input / training data), each ranking the pages returned against the search criteria. All the code is in my public GitHub blog repo; a sketch of one of the queries follows below.
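
As an illustration of the shape of these tasks, here is a sketch (mine, not the repo's actual code) of what the word-frequency query could look like against the schema the crawler builds, in the same Cypher 1.9 style as the PageRankTask shown later; the "url" and "word" property names behind NodeConstants are assumptions:

import org.neo4j.cypher.javacompat.ExecutionEngine;
import org.neo4j.cypher.javacompat.ExecutionResult;
import org.neo4j.graphdb.GraphDatabaseService;

public class WordFrequencyQuerySketch {
    public static ExecutionResult countWords(final GraphDatabaseService graphDb) {
        final ExecutionEngine engine = new ExecutionEngine(graphDb);
        // Count CONTAINS relationships per page for the matching words;
        // more matching word nodes means a higher raw frequency score.
        return engine.execute(
                "START page=node(*) "
                        + "MATCH (page)-[:CONTAINS]->word "
                        + "WHERE word.word IN ['java', 'spring'] "
                        + "RETURN page.url AS url, count(word) AS freq");
    }
}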

Disclaimer: the neural network task either doesn't have enough data to be effective, or I didn't implement the data normalisation correctly, so it is not very useful at the moment; I will return to it once I have finished my journey through PCI.

One task worth sharing is the PageRank one. I quickly read some of the theory behind it, decided I'm not that clever, and went looking for a library that implements it. I found Graphstream, a great open source project that does a whole lot more than just PageRank; check out their video.

That made implementing the PageRank task for this exercise easy.

package net.briandupreez.pci.chapter4.tasks;

import net.briandupreez.pci.chapter4.NodeConstants;
import net.briandupreez.pci.chapter4.NormalizationFunctions;
import org.graphstream.algorithm.PageRank;
import org.graphstream.graph.Graph;
import org.graphstream.graph.implementations.SingleGraph;
import org.neo4j.cypher.javacompat.ExecutionEngine;
import org.neo4j.cypher.javacompat.ExecutionResult;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.Callable;

public class PageRankTask extends SearchTask implements Callable<TaskResponse> {

    public PageRankTask(final String... terms) {
        super(terms);
    }

    @Override
    protected ExecutionResult executeQuery(final String... words) {
        final ExecutionEngine engine = new ExecutionEngine(graphDb);
        final StringBuilder bob = new StringBuilder("START page=node(*) MATCH (page)-[:CONTAINS]->words ");
        bob.append(", (page)-[:LINK_TO]->related ");
        bob.append("WHERE words.word in [");
        bob.append(formatArray(words));
        bob.append("] ");
        bob.append("RETURN DISTINCT page, related");
        return engine.execute(bob.toString());
    }

    public TaskResponse call() {
        final ExecutionResult result = executeQuery(searchTerms);
        final Map<String, Double> returnMap = convertToUrlTotalWords(result);

        final TaskResponse response = new TaskResponse();
        response.taskClazz = this.getClass();
        response.resultMap = NormalizationFunctions.normalizeMap(returnMap, true);
        return response;
    }

    private Map<String, Double> convertToUrlTotalWords(final ExecutionResult result) {
        final Map<String, Double> uniqueUrls = new HashMap<>();
        final Graph g = new SingleGraph("rank", false, true);

        final Iterator<Node> pageIterator = result.columnAs("related");
        while (pageIterator.hasNext()) {
            final Node node = pageIterator.next();
            final Iterator<Relationship> relationshipIterator = node.getRelationships().iterator();
            while (relationshipIterator.hasNext()) {
                final Relationship relationship = relationshipIterator.next();
                final String source = relationship.getProperty(NodeConstants.SOURCE).toString();
                uniqueUrls.put(source, 0.0);
                final String destination = relationship.getProperty(NodeConstants.DESTINATION).toString();
                g.addEdge(String.valueOf(node.getId()), source, destination, true);
            }
        }

        computeAndSetPageRankScores(uniqueUrls, g);
        return uniqueUrls;
    }

    /**
     * Compute score
     *
     * @param uniqueUrls urls
     * @param graph      the graph of all links
     */
    private void computeAndSetPageRankScores(final Map<String, Double> uniqueUrls, final Graph graph) {
        final PageRank pr = new PageRank();
        pr.init(graph);
        pr.compute();

        for (final Map.Entry<String, Double> entry : uniqueUrls.entrySet()) {
            final double score = 100 * pr.getRank(graph.getNode(entry.getKey()));
            entry.setValue(score);
        }
    }
}
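
The SearchTask base class is not shown in the post; from the listing above one can infer roughly this shape (a hypothetical reconstruction, not the author's code):

package net.briandupreez.pci.chapter4.tasks;

import net.briandupreez.pci.chapter4.CreateDBFactory;
import org.neo4j.cypher.javacompat.ExecutionResult;
import org.neo4j.graphdb.GraphDatabaseService;

public abstract class SearchTask {

    // Shared by all tasks: the singleton graph and the terms being searched for.
    protected final GraphDatabaseService graphDb = CreateDBFactory.createInMemoryDB();
    protected final String[] searchTerms;

    protected SearchTask(final String... terms) {
        this.searchTerms = terms;
    }

    /** Quote and comma-separate the words for use in a Cypher IN clause. */
    protected String formatArray(final String... words) {
        final StringBuilder bob = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                bob.append(", ");
            }
            bob.append("'").append(words[i]).append("'");
        }
        return bob.toString();
    }

    protected abstract ExecutionResult executeQuery(final String... words);
}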

In between all of this, I found a great implementation on Stack Overflow for sorting a map by its values.

package net.briandupreez.pci.chapter4;

import java.util.*;

public class MapUtil {

    /**
     * Sort a map based on values.
     * The values must be Comparable.
     *
     * @param map       the map to be sorted
     * @param ascending in ascending order, or descending if false
     * @param <K>       key generic
     * @param <V>       value generic
     * @return sorted list
     */
    public static <K, V extends Comparable<? super V>> List<Map.Entry<K, V>> entriesSortedByValues(
            final Map<K, V> map, final boolean ascending) {

        final List<Map.Entry<K, V>> sortedEntries = new ArrayList<>(map.entrySet());
        Collections.sort(sortedEntries,
                new Comparator<Map.Entry<K, V>>() {
                    @Override
                    public int compare(final Map.Entry<K, V> e1, final Map.Entry<K, V> e2) {
                        if (ascending) {
                            return e1.getValue().compareTo(e2.getValue());
                        } else {
                            return e2.getValue().compareTo(e1.getValue());
                        }
                    }
                });
        return sortedEntries;
    }
}
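
Applied to the combined score map from the futures sketch earlier, highest score first:

// Rank URLs by combined score, descending, and print them.
final List<Map.Entry<String, Double>> ranked =
        MapUtil.entriesSortedByValues(combined, false);
for (final Map.Entry<String, Double> entry : ranked) {
    System.out.println(entry.getValue() + "\t" + entry.getKey());
}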

The Maven dependencies used to implement all of this:

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>14.0.1</version>
</dependency>
<dependency>
    <groupId>org.encog</groupId>
    <artifactId>encog-core</artifactId>
    <version>3.2.0-SNAPSHOT</version>
</dependency>
<dependency>
    <groupId>edu.uci.ics</groupId>
    <artifactId>crawler4j</artifactId>
    <version>3.5</version>
    <type>jar</type>
    <scope>compile</scope>
</dependency>
<dependency>
    <groupId>org.neo4j</groupId>
    <artifactId>neo4j</artifactId>
    <version>1.9</version>
</dependency>
<dependency>
    <groupId>org.graphstream</groupId>
    <artifactId>gs-algo</artifactId>
    <version>1.1.2</version>
</dependency>

Now on to chapter 5 of PCI... optimization.

Reference: Mini Search Engine – Just the Basics, Using Neo4j, Crawler4j, Graphstream and Encog from our JCG partner Brian Du Preez at the Zen in the art of IT blog.

翻譯自: https://www.javacodegeeks.com/2013/07/mini-search-engine-just-the-basics-using-neo4j-crawler4j-graphstream-and-encog.html
