Neo4j: TF/IDF (and variants) with Cypher

Published: 2023/12/3

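A couple of weeks ago I wrote a blog post about running TF/IDF over HIMYM transcripts with scikit-learn to find the most important phrases by episode, and afterwards I was curious how hard it would be to do the same thing in Neo4j.

I started by translating one of Wikipedia's TF/IDF examples into Cypher to see what the algorithm would look like: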

WITH 3 AS termFrequency, 2 AS numberOfDocuments, 1 AS numberOfDocumentsWithTerm
WITH termFrequency, log10(numberOfDocuments / numberOfDocumentsWithTerm) AS inverseDocumentFrequency
RETURN termFrequency * inverseDocumentFrequency

0.9030899869919435
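The same arithmetic can be sanity-checked outside Neo4j. The following is a minimal Python sketch, not part of the original post; the function name `tf_idf` is made up for illustration. It uses floor division because Cypher's `/` applied to two integers performs integer division, which matters once the document counts no longer divide evenly.

```python
import math


def tf_idf(term_frequency, number_of_documents, number_of_documents_with_term):
    """TF/IDF as written in the Cypher above: tf * log10(N / n_t)."""
    # `//` mirrors Cypher, where dividing one integer by another
    # drops the fractional part of the result.
    inverse_document_frequency = math.log10(
        number_of_documents // number_of_documents_with_term
    )
    return term_frequency * inverse_document_frequency


score = tf_idf(3, 2, 1)
print(score)  # roughly 0.903, in line with the Cypher result
```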

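Next I needed to go through the HIMYM episode transcripts and extract the phrases in each episode along with their counts. I used scikit-learn's CountVectorizer to do this and wrote the results to a CSV file.

Here's a preview of that file: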

$ head -n 10 data/import/words_scikit.csv
EpisodeId,Phrase,Count
1,2005,1
1,2005 seven,1
1,2005 seven just,1
1,2030,3
1,2030 kids,1
1,2030 kids intently,1
1,2030 narrator,1
1,2030 narrator kids,1
1,2030 son,1
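To make the shape of that file concrete, here is a rough standard-library sketch of how per-episode phrase counts of up to three words (as the preview suggests) could be produced. It is a stand-in for illustration only: the post used scikit-learn's CountVectorizer, whose tokenisation and stop-word list differ from the simplified ones assumed here, and the names `phrase_counts`, `write_csv` and `STOP_WORDS` are hypothetical.

```python
import csv
import io
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "in", "of", "and", "is", "to"}


def phrase_counts(text, max_n=3):
    """Count every 1..max_n word phrase in `text` after dropping stop words."""
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


def write_csv(episodes, out):
    """Write EpisodeId,Phrase,Count rows for a {episode_id: transcript} mapping."""
    writer = csv.writer(out)
    writer.writerow(["EpisodeId", "Phrase", "Count"])
    for episode_id, text in sorted(episodes.items()):
        for phrase, count in sorted(phrase_counts(text).items()):
            writer.writerow([episode_id, phrase, count])


sample = "Narrator: Kids, in the year 2030 the kids listen."
counts = phrase_counts(sample)
buffer = io.StringIO()
write_csv({1: sample}, buffer)
print(buffer.getvalue())
```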

Now let's import that into Neo4j using the LOAD CSV tool:

// phrases
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/words_scikit.csv" AS row
MERGE (phrase:Phrase {value: row.Phrase});

// episode -> phrase
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/words_scikit.csv" AS row
MATCH (phrase:Phrase {value: row.Phrase})
MATCH (episode:Episode {id: TOINT(row.EpisodeId)})
MERGE (episode)-[:CONTAINED_PHRASE {times:TOINT(row.Count)}]->(phrase);

Now that all the data is in, we can convert the TF/IDF query to make use of our graph. We'll start with episode 1:

MATCH (e:Episode)
WITH COUNT(e) AS numberOfDocuments
MATCH (p:Phrase)<-[r:CONTAINED_PHRASE]-(e:Episode {id: 1})
WITH numberOfDocuments, p, r.times AS termFrequency
MATCH (p)<-[:CONTAINED_PHRASE]->(otherEpisode)
WITH p, COUNT(otherEpisode) AS numberOfDocumentsWithTerm, numberOfDocuments, termFrequency
WITH p, numberOfDocumentsWithTerm, log10(numberOfDocuments / numberOfDocumentsWithTerm) AS inverseDocumentFrequency, termFrequency, numberOfDocuments
RETURN p.value, termFrequency, numberOfDocumentsWithTerm, inverseDocumentFrequency, termFrequency * inverseDocumentFrequency AS score
ORDER BY score DESC
LIMIT 10

==> +--------------------------------------------------------------------------------------------------------------------+
==> | p.value                | termFrequency | numberOfDocumentsWithTerm | inverseDocumentFrequency | score              |
==> +--------------------------------------------------------------------------------------------------------------------+
==> | "olives"               | 18            | 2                         | 2.0170333392987803       | 36.306600107378046 |
==> | "yasmine"              | 13            | 1                         | 2.3180633349627615       | 30.1348233545159   |
==> | "signal"               | 11            | 5                         | 1.6127838567197355       | 17.740622423917088 |
==> | "goanna"               | 10            | 4                         | 1.7160033436347992       | 17.16003343634799  |
==> | "flashback date"       | 6             | 1                         | 2.3180633349627615       | 13.908380009776568 |
==> | "scene"                | 17            | 37                        | 0.6989700043360189       | 11.88249007371232  |
==> | "flashback date robin" | 5             | 1                         | 2.3180633349627615       | 11.590316674813808 |
==> | "ted yasmine"          | 5             | 1                         | 2.3180633349627615       | 11.590316674813808 |
==> | "smurf pen1s"          | 5             | 2                         | 2.0170333392987803       | 10.085166696493902 |
==> | "eye patch"            | 5             | 2                         | 2.0170333392987803       | 10.085166696493902 |
==> +--------------------------------------------------------------------------------------------------------------------+
==> 10 rows
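For readers who want to follow the query's logic step by step, here is a minimal in-memory Python sketch of the same calculation. The `{episode_id: {phrase: times}}` layout and the name `episode_scores` are assumptions made for illustration. It also reproduces one detail of the output above: because Cypher divides integers with integer division, the "signal" row uses log10(41) rather than log10(41.6). The document count of 208 is inferred from the inverseDocumentFrequency column and is not stated in the post.

```python
import math


def episode_scores(counts, episode_id):
    """counts: {episode_id: {phrase: times}}; returns [(phrase, score)], best first."""
    number_of_documents = len(counts)
    scores = []
    for phrase, term_frequency in counts[episode_id].items():
        # Number of episodes whose transcript contains the phrase.
        docs_with_term = sum(1 for phrases in counts.values() if phrase in phrases)
        # Floor division mirrors Cypher's integer division.
        idf = math.log10(number_of_documents // docs_with_term)
        scores.append((phrase, term_frequency * idf))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy data, not taken from the transcripts.
toy = {
    1: {"olives": 3, "scene": 2},
    2: {"scene": 1},
    3: {"scene": 4},
    4: {"ted": 1},
}
ranked = episode_scores(toy, 1)
print(ranked)

# Integer division in the "signal" row: 208 // 5 is 41, not 41.6.
signal_idf = math.log10(208 // 5)
print(signal_idf, 11 * signal_idf)
```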

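The scores we've calculated are different from the ones scikit-learn gave us, but the relative ordering looks sensible, so that's fine. The neat thing about calculating this in Neo4j is that we can now vary the "inverse document" part of the equation, for example to work out the most important phrases in a season rather than an episode: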

MATCH (:Season)
WITH COUNT(*) AS numberOfDocuments
MATCH (p:Phrase)<-[r:CONTAINED_PHRASE]-(:Episode)-[:IN_SEASON]->(s:Season {number: "1"})
WITH p, SUM(r.times) AS termFrequency, numberOfDocuments
MATCH (p)<-[:CONTAINED_PHRASE]->(otherEpisode)-[:IN_SEASON]->(s:Season)
WITH p, COUNT(DISTINCT s) AS numberOfDocumentsWithTerm, termFrequency, numberOfDocuments
WITH p, numberOfDocumentsWithTerm, log10(numberOfDocuments / numberOfDocumentsWithTerm) AS inverseDocumentFrequency, termFrequency, numberOfDocuments
RETURN p.value, termFrequency, numberOfDocumentsWithTerm, inverseDocumentFrequency, termFrequency * inverseDocumentFrequency AS score
ORDER BY score DESC
LIMIT 10

==> +-------------------------------------------------------------------------------------------------------------+
==> | p.value         | termFrequency | numberOfDocumentsWithTerm | inverseDocumentFrequency | score              |
==> +-------------------------------------------------------------------------------------------------------------+
==> | "moby"          | 46            | 1                         | 0.9542425094393249       | 43.895155434208945 |
==> | "int"           | 71            | 3                         | 0.47712125471966244      | 33.87560908509603  |
==> | "ellen"         | 53            | 2                         | 0.6020599913279624       | 31.909179540382006 |
==> | "claudia"       | 104           | 4                         | 0.3010299956639812       | 31.307119549054043 |
==> | "ericksen"      | 59            | 3                         | 0.47712125471966244      | 28.150154028460083 |
==> | "party number"  | 29            | 1                         | 0.9542425094393249       | 27.67303277374042  |
==> | "subtitle"      | 27            | 1                         | 0.9542425094393249       | 25.76454775486177  |
==> | "vo"            | 47            | 3                         | 0.47712125471966244      | 22.424698971824135 |
==> | "ted vo"        | 47            | 3                         | 0.47712125471966244      | 22.424698971824135 |
==> | "future ted vo" | 45            | 3                         | 0.47712125471966244      | 21.47045646238481  |
==> +-------------------------------------------------------------------------------------------------------------+
==> 10 rows
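The change from episodes to seasons amounts to regrouping: term frequency becomes the sum of counts across a season's episodes, and the document frequency becomes the number of distinct seasons that mention the phrase. A small Python sketch of that regrouping follows, under assumptions of my own: the `(season, episode_id, phrase, times)` row layout and the name `season_scores` are hypothetical, and the season total is taken from the rows themselves rather than from separate Season nodes as in the query.

```python
import math
from collections import defaultdict


def season_scores(rows, season):
    """rows: iterable of (season, episode_id, phrase, times) tuples."""
    term_frequency = defaultdict(int)     # phrase -> total times within `season`
    seasons_with_term = defaultdict(set)  # phrase -> seasons that mention it
    all_seasons = set()
    for row_season, _episode_id, phrase, times in rows:
        all_seasons.add(row_season)
        seasons_with_term[phrase].add(row_season)
        if row_season == season:
            term_frequency[phrase] += times
    scores = {
        # Floor division mirrors Cypher's integer division.
        phrase: tf * math.log10(len(all_seasons) // len(seasons_with_term[phrase]))
        for phrase, tf in term_frequency.items()
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Toy rows, not taken from the transcripts.
rows = [
    (1, 1, "moby", 4),
    (1, 2, "moby", 2),
    (1, 1, "ted", 5),
    (2, 30, "ted", 3),
    (3, 50, "lily", 1),
]
season_one = season_scores(rows, 1)
print(season_one)
```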

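From this query we learn that "moby" is mentioned in only one season of the whole series, and in fact all of those mentions are in the same episode. The appearance of "int" looks more like a data issue: in some episodes the transcript describes the location, but in many it doesn't: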

$ ack -iw "int" data/import/sentences.csv
2361,8,1,8,"INT. LIVING ROOM, YEAR 2030"
2377,8,1,8,INT. CHINESE RESTAURANT
2395,8,1,8,INT. APARTMENT
2412,8,1,8,INT. APARTMENT
2419,8,1,8,INT. BAR
2472,8,1,8,INT. APARTMENT
2489,8,1,8,INT. BAR
2495,8,1,8,INT. APARTMENT
2506,8,1,8,INT. BAR
2584,8,1,8,INT. APARTMENT
2629,8,1,8,INT. RESTAURANT
2654,8,1,8,INT. APARTMENT
2682,8,1,8,INT. RESTAURANT
2689,8,1,8,(Robin gets up and leaves restaurant) INT. HOSPITAL WAITING AREA

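"vo" stands for voice over and should probably be stripped out as a stop word, since it doesn't add much value. It shows up here because the transcripts aren't consistent in how they represent Future Ted's speech.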
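One simple way to act on that observation is to filter transcript-markup words out of the results. The sketch below is an assumption-laden illustration: the stop-word set and the function name are hypothetical, and it drops whole phrases after the fact, which is not the same as removing the stop words before the phrases are generated (that would turn "future ted vo" into "future ted" instead of discarding it).

```python
# Hypothetical extra stop words suggested by the results above.
TRANSCRIPT_STOP_WORDS = {"vo", "int"}


def drop_transcript_noise(phrases):
    """Remove phrases that contain any transcript-markup word."""
    return [
        phrase for phrase in phrases
        if not TRANSCRIPT_STOP_WORDS & set(phrase.split())
    ]


kept = drop_transcript_noise(["moby", "int", "ted vo", "future ted vo", "claudia"])
print(kept)  # ['moby', 'claudia']
```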

Let's have a look at the final season and see how it fares:

MATCH (:Season)
WITH COUNT(*) AS numberOfDocuments
MATCH (p:Phrase)<-[r:CONTAINED_PHRASE]-(:Episode)-[:IN_SEASON]->(s:Season {number: "9"})
WITH p, SUM(r.times) AS termFrequency, numberOfDocuments
MATCH (p)<-[:CONTAINED_PHRASE]->(otherEpisode:Episode)-[:IN_SEASON]->(s:Season)
WITH p, COUNT(DISTINCT s) AS numberOfDocumentsWithTerm, termFrequency, numberOfDocuments
WITH p, numberOfDocumentsWithTerm, log10(numberOfDocuments / numberOfDocumentsWithTerm) AS inverseDocumentFrequency, termFrequency, numberOfDocuments
RETURN p.value, termFrequency, numberOfDocumentsWithTerm, inverseDocumentFrequency, termFrequency * inverseDocumentFrequency AS score
ORDER BY score DESC
LIMIT 10

==> +------------------------------------------------------------------------------------------------------------------+
==> | p.value              | termFrequency | numberOfDocumentsWithTerm | inverseDocumentFrequency | score              |
==> +------------------------------------------------------------------------------------------------------------------+
==> | "ring bear"          | 28            | 1                         | 0.9542425094393249       | 26.718790264301095 |
==> | "click options"      | 26            | 1                         | 0.9542425094393249       | 24.810305245422448 |
==> | "thank linus"        | 26            | 1                         | 0.9542425094393249       | 24.810305245422448 |
==> | "vow"                | 39            | 2                         | 0.6020599913279624       | 23.480339661790534 |
==> | "just click"         | 24            | 1                         | 0.9542425094393249       | 22.901820226543798 |
==> | "rehearsal dinner"   | 23            | 1                         | 0.9542425094393249       | 21.947577717104473 |
==> | "linus"              | 36            | 2                         | 0.6020599913279624       | 21.674159687806647 |
==> | "just click options" | 22            | 1                         | 0.9542425094393249       | 20.993335207665147 |
==> | "locket"             | 32            | 2                         | 0.6020599913279624       | 19.265919722494797 |
==> | "cassie"             | 19            | 1                         | 0.9542425094393249       | 18.13060767934717  |
==> +------------------------------------------------------------------------------------------------------------------+

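There are several phrases specific to Barney & Robin's wedding ("vow", "ring bear", "rehearsal dinner"), so it makes sense that those come out on top. The "linus" here mostly refers to the server at the bar who interacts with Lily, although a quick search of the transcripts reveals that she also has an Uncle Linus!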

$ ack -iw "linus" data/import/sentences.csv | head -n 5
18649,61,3,17,Lily: Why don't we just call Duluth Mental Hospital and say my Uncle Linus can live with us?
59822,185,9,1,Linus.
59826,185,9,1,"Are you my guy, Linus?"
59832,185,9,1,Thank you Linus.
59985,185,9,1,"Thank you, Linus."
...

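Having done this exercise, I think TF/IDF is an interesting way of exploring unstructured data, but for a phrase to be really interesting to us it needs to appear in multiple episodes/seasons.

One way to achieve that would be to weight those features more heavily, so I'll give that a try next.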
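The post does not say what that weighting would look like. Purely as an illustration of the idea, and not the author's method, the sketch below multiplies the TF/IDF score by a boost that grows with the number of episodes a phrase appears in. The formula `1 + log10(episodes_with_term)` and the name `boosted_score` are assumptions chosen for this example.

```python
import math


def boosted_score(term_frequency, idf, episodes_with_term):
    """TF/IDF multiplied by a boost that grows with episode spread (hypothetical)."""
    # The boost is exactly 1.0 when the phrase appears in a single episode,
    # so such phrases keep their plain TF/IDF score.
    boost = 1 + math.log10(episodes_with_term)
    return term_frequency * idf * boost


# "moby" from the season 1 results: 46 mentions, all in one episode.
single = boosted_score(46, 0.9542425094393249, 1)
# The same counts spread over ten episodes would score twice as high.
spread = boosted_score(46, 0.9542425094393249, 10)
print(single, spread)
```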

All the code from this post is on github if you want to take a look and improve it.

Translated from: https://www.javacodegeeks.com/2015/03/neo4j-tfidf-and-variants-with-cypher.html
