Elasticsearch: Finding similar documents with more_like_this
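The More Like This query finds documents that are "like" a given set of documents. To do so, it selects a set of representative terms from those input documents, forms a query from these terms, executes the query, and returns the results. The user controls the input documents, how the terms are selected, and how the query is formed.
The simplest use case is asking for documents similar to a provided piece of text. Here, we ask for all movies whose "title" and "description" fields contain text similar to "Once upon a time", limiting the number of selected terms to 12.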
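
GET /_search
{
  "query": {
    "more_like_this": {
      "fields": ["title", "description"],
      "like": "Once upon a time",
      "min_term_freq": 1,
      "max_query_terms": 12
    }
  }
}

A more complicated use case mixes text with documents that already exist in the index. In this case, the syntax for specifying a document is similar to the one used in the Multi GET API.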
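
GET /_search
{
  "query": {
    "more_like_this": {
      "fields": ["title", "description"],
      "like": [
        {"_index": "imdb", "_id": "1"},
        {"_index": "imdb", "_id": "2"},
        "and potentially some more text here as well"
      ],
      "min_term_freq": 1,
      "max_query_terms": 12
    }
  }
}

Finally, users can mix some text and a chosen set of documents, and can also provide documents that are not necessarily present in the index. Documents that are not in the index are supplied with the artificial document syntax.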

GET /_search
{
  "query": {
    "more_like_this": {
      "fields": ["name.first", "name.last"],
      "like": [
        {
          "_index": "marvel",
          "doc": {
            "name": {"first": "Ben", "last": "Grimm"},
            "_doc": "You got no idea what I'd... what I'd give to be invisible."
          }
        },
        {"_index": "marvel", "_id": "2"}
      ],
      "min_term_freq": 1,
      "max_query_terms": 12
    }
  }
}

Hands-on
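Below, I use a simple example to show how the more_like_this query finds similar documents. Although this query is a very interesting feature, many developers probably never reach for it: partly because they do not understand it well, and partly because they prefer the traditional queries such as match, term and range. I hope that after reading this article you will consider more_like_this in your future work whenever it fits your use case.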
Preparing the data
For this demonstration, we will use the movies dataset.
Download the dataset using its Download link and save it in the data subdirectory of the project.
Next, download the full source code from https://github.com/liu-xiao-guo/searchflix and copy out the following files:
- pipeline/movies.conf: copy it into the project root directory
- elastic/elasticsearch/mappings/movies.mapping: copy it into the project root directory
- dictionaries/countries_geo.csv: copy it into the dictionaries subdirectory
After these steps, the project files look like this:

$ pwd
/Users/liuxg/data/morelikethis
$ ls
data  dictionaries  movies.conf  movies.mapping
$ tree -L 3
.
├── data
│   └── movies_metadata.csv
├── dictionaries
│   ├── countries_geo.csv
│   └── source.txt
├── movies.conf
└── movies.mapping

Change into the project directory and type the following command in a terminal to create the movies index with its mapping:
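
curl -XPUT -H 'Content-Type: application/json' localhost:9200/movies -d @movies.mapping

(The mapping file was copied into the project root above, so the command points at movies.mapping there.) Next, we import the data: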

sudo <path_to_logstash_unzipped>/bin/logstash -f movies.conf

We have to use sudo here because movies.conf makes use of /dev/null. Once the import has finished, we can check the imported documents in Kibana:

GET movies/_count

{
  "count": 45432,
  "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0}
}

A typical document in the movies index looks like this:

GET movies/_search

{
  "took": 4,
  "timed_out": false,
  "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0},
  "hits": {
    "total": {"value": 10000, "relation": "gte"},
    "max_score": 1.0,
    "hits": [
      {
        "_index": "movies",
        "_type": "_doc",
        "_id": "12110",
        "_score": 1.0,
        "_source": {
          "original_title": "Dracula: Dead and Loving It",
          "adult": false,
          "vote_average": 5.7,
          "genres": [
            {"id": 35, "name": "Comedy"},
            {"id": 27, "name": "Horror"}
          ],
          "tagline": null,
          "production_companies": [
            {"id": 5, "name": "Columbia Pictures"},
            {"id": 97, "name": "Castle Rock Entertainment"},
            {"id": 6368, "name": "Enigma Pictures"}
          ],
          "imdb_id": "tt0112896",
          "spoken_languages": [
            {"iso_639_1": "en", "name": "English"},
            {"iso_639_1": "de", "name": "Deutsch"}
          ],
          "production_countries_name_list": ["France", "United States of America"],
          "@version": "1",
          "title": "Dracula: Dead and Loving It",
          "homepage": null,
          "original_language": "en",
          "belongs_to_collection": null,
          "production_countries_location_list": ["46.227638,2.213749", "37.09024,-95.712891"],
          "popularity": 5.430331,
          "budget": 0.0,
          "revenue": 0.0,
          "production_countries": [
            {"iso_3166_1": "FR", "location": "46.227638,2.213749", "name": "France"},
            {"iso_3166_1": "US", "location": "37.09024,-95.712891", "name": "United States of America"}
          ],
          "release_date": "1995-12-22",
          "poster_path": "/xve4cgfYItnOhtzLYoTwTVy5FGr.jpg",
          "@timestamp": "1995-12-21T16:00:00.000Z",
          "id": 12110,
          "runtime": 88.0,
          "status": "Released",
          "genres_list": ["Comedy", "Horror"],
          "overview": "When a lawyer shows up at the vampire's doorstep, he falls prey to his charms and joins him in his search for fresh blood. Enter Dr. van Helsing, who may be the only one able to vanquish the count.",
          "vote_count": 210,
          "video": "false"
        }
      }
      ...

Note the field called overview in the document above: it holds the movie synopsis, and it is the field used for the similarity searches in the rest of this article.
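The console requests in this article can also be sent from a script. The following is only a minimal sketch using the Python standard library; it assumes an unsecured node at localhost:9200 (the address used in the curl command earlier), and the helper names build_search_request and run_search are made up for illustration. It sends the body with POST, which the _search endpoint accepts as well as GET.

```python
import json
import urllib.request

# Assumption: an unsecured local node, as in the curl command used earlier.
ES_URL = "http://localhost:9200"


def build_search_request(index, body, base_url=ES_URL):
    """Return a urllib Request for POST <base_url>/<index>/_search."""
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_search(index, body, base_url=ES_URL):
    """Send the request and return the decoded JSON response."""
    request = build_search_request(index, body, base_url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only builds and prints the request; nothing is sent to the cluster here.
    req = build_search_request("movies", {"size": 1, "_source": ["title", "overview"]})
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Calling run_search("movies", body) would actually contact the cluster, so it is kept separate from the request construction, which can be inspected without a running node.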
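The More Like This query
The purpose of the more_like_this query is to find indexed documents that are similar to some entries supplied by the user. It does this by selecting the relevant terms from the supplied entries and then building a query from those terms.
The supplied entries can be free text or other indexed documents. In other words, you can easily search for documents similar to any document that is already indexed, in the same index or in another one. With that in mind, I want to demonstrate the query with a concrete use case: offering users the synopses of movies similar to a movie they selected or just watched.
The only required parameter of this query is like. In it you supply either the text for which you want similar documents, or an array of objects that give the index and ID of the documents to use as input. In the second case, you can also mix existing indexed documents with artificial documents, that is, simulate a document from free text. Here is an example: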
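
GET movies/_search
{
  "fields": ["overview"],
  "query": {
    "more_like_this": {
      "fields": ["overview"],
      "like": "Princess Leia is captured and held hostage by the evil Imperial forces in their effort to take over the galactic Empire. Venturesome Luke Skywalker and dashing captain Han Solo team together with the loveable robot duo R2-D2 and C-3PO to rescue the beautiful princess and restore peace and justice in the Empire.",
      "min_term_freq": 1,
      "max_query_terms": 12
    }
  },
  "_source": false
}

Usually, although it is not mandatory, you also supply the fields parameter: an array with the names of the fields in which similarity is checked. Another interesting parameter is unlike. It is used together with like (the two are not mutually exclusive), follows the same syntax, and narrows the results by steering the query away from documents similar to the ones we say we do not want. Basically: (like X) AND (unlike Y). More precisely, terms found in the unlike input are not selected for the generated query.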
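To make the shape of like and unlike easier to see, here is a small sketch that assembles a request body in Python. The helper names (doc_ref, artificial_doc, more_like_this) are hypothetical and exist only for this illustration, and the unlike text in the usage example is an arbitrary value, not something taken from the dataset.

```python
import json


def doc_ref(index, doc_id):
    """Reference to a document that is already indexed."""
    return {"_index": index, "_id": doc_id}


def artificial_doc(index, source):
    """Document that is not indexed; 'source' holds its field content."""
    return {"_index": index, "doc": source}


def more_like_this(fields, like, unlike=None, **options):
    """Build the body of a more_like_this search request."""
    mlt = {"fields": list(fields), "like": like}
    if unlike is not None:
        mlt["unlike"] = unlike
    mlt.update(options)  # e.g. min_term_freq, max_query_terms
    return {"query": {"more_like_this": mlt}}


if __name__ == "__main__":
    body = more_like_this(
        ["overview"],
        like=[
            doc_ref("movies", "1366"),  # an indexed movie
            artificial_doc("movies", {"overview": "a boxer gets a shot at the title"}),
            "free text can be mixed in as well",
        ],
        unlike="documentary",  # arbitrary illustrative value
        min_term_freq=1,
        max_query_terms=12,
    )
    print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after GET movies/_search in the Kibana console.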
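The remaining parameters of this query fall into two groups.
Term selection parameters
- max_query_terms: the maximum number of terms to select. The more terms we use, the better the accuracy, at the expense of performance. Defaults to 25.
- min_term_freq: the minimum term frequency; terms that occur less often than this in the input document/text are ignored. Defaults to 2.
- min_doc_freq: the minimum document frequency; terms from the input that occur in fewer documents than this are ignored. Defaults to 5.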
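- max_doc_freq: the maximum document frequency; terms from the input that occur in more documents than this are ignored. This is useful for ignoring very frequent terms such as stop words. It is unbounded by default, which effectively disables it.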
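- min_word_length: the minimum term length; shorter terms are ignored. Defaults to 0.
- max_word_length: the maximum term length; longer terms are ignored. The old name max_word_len is deprecated. It is disabled by default (0).
- stop_words: an array of stop words, that is, terms to ignore.
- analyzer: the analyzer applied to the input text. By default, it is the analyzer associated with the first field given in the fields parameter.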
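The interplay of these parameters can be sketched in plain Python. This is a deliberately simplified approximation for building intuition, not the actual implementation: the regex tokenizer stands in for the field's analyzer, the in-memory corpus stands in for the index, and the idf formula is a textbook smoothed variant chosen for illustration. The function name select_terms is made up.

```python
import math
import re
from collections import Counter


def select_terms(text, corpus, max_query_terms=25, min_term_freq=2,
                 min_doc_freq=5, max_doc_freq=None,
                 min_word_length=0, max_word_length=0, stop_words=()):
    """Pick the highest tf-idf terms of 'text' (simplified sketch)."""

    def tokenize(s):
        # Assumption: lowercase word tokens stand in for the field's analyzer.
        return re.findall(r"[a-z0-9]+", s.lower())

    # Document frequency: in how many corpus documents each term occurs.
    doc_tokens = [set(tokenize(doc)) for doc in corpus]
    doc_freq = Counter(term for tokens in doc_tokens for term in tokens)
    # Term frequency: how often each term occurs in the input text.
    term_freq = Counter(tokenize(text))
    stop = {w.lower() for w in stop_words}

    scored = []
    for term, tf in term_freq.items():
        df = doc_freq[term]
        if tf < min_term_freq or df < min_doc_freq:
            continue
        if max_doc_freq is not None and df > max_doc_freq:
            continue
        if len(term) < min_word_length:
            continue
        if max_word_length and len(term) > max_word_length:
            continue  # 0 means "no upper limit"
        if term in stop:
            continue
        # Assumption: smoothed idf, not necessarily the formula used internally.
        idf = math.log((len(corpus) + 1) / (df + 1)) + 1
        scored.append((tf * idf, term))

    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_query_terms]]


if __name__ == "__main__":
    corpus = [
        "a shark attacks the island",
        "the shark hunter",
        "the police chief",
        "a quiet island",
    ]
    text = "the shark and the shark hunter"
    print(select_terms(text, corpus, max_query_terms=2, min_term_freq=1,
                       min_doc_freq=1, stop_words=["the", "and", "a"]))
```

With the default min_doc_freq of 5, this tiny corpus yields no terms at all, which shows why the examples in this article lower min_term_freq to 1 when the input is a short synopsis.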
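Query formation parameters
- minimum_should_match: controls how many of the selected terms must match once the query has been formed. It uses the same syntax as the general minimum_should_match parameter. Defaults to "30%".
- fail_on_unsupported_field: controls whether the query should fail if any of the supplied fields is not of a supported type (keyword or text). Defaults to true.
- boost_terms: each term in the formed query can be further boosted by its TF-IDF score. It is disabled by default (0); any positive value activates term boosting with that boost factor.
- include: defines whether the input documents should also be returned in the query results. Defaults to false.
- boost: sets the boost value of the whole query. Defaults to 1.0.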
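As a rough mental model, the formed query behaves like a bool query with one optional clause per selected term, of which minimum_should_match must be satisfied. The sketch below shows that model; it is an approximation for illustration, not the query that is generated internally. The function names are hypothetical, and only the two simplest forms of minimum_should_match are handled (a non-negative percentage, which is rounded down, and a non-negative integer); negative values and combinations, which the real syntax also allows, are left out.

```python
def required_matches(num_terms, minimum_should_match="30%"):
    """Number of optional clauses that must match (simple forms only)."""
    spec = str(minimum_should_match).strip()
    if spec.endswith("%"):
        # Percentages are rounded down to a whole number of clauses.
        return int(num_terms * float(spec[:-1]) / 100)
    return int(spec)


def as_bool_query(field_terms, minimum_should_match="30%"):
    """Approximate the formed query as a bool query of term clauses."""
    should = [{"term": {field: term}} for field, term in field_terms]
    return {"bool": {"should": should,
                     "minimum_should_match": minimum_should_match}}


if __name__ == "__main__":
    # With max_query_terms set to 12 and the default "30%":
    print(required_matches(12))  # 3, because 3.6 is rounded down
    print(as_bool_query([("overview", "shark"), ("overview", "oceanographer")]))
```

Under this model, with 12 selected terms a document has to contain at least 3 of them to be returned, which is why loosely related movies can still show up near the bottom of a result list.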
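In practice
Returning to the use case mentioned earlier (finding similar movies to recommend to a user), let's run a few experiments.
In the example below, I take the synopsis of the movie "Jaws" and try to find similar movies: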

GET movies/_search
{
  "size": 5,
  "_source": ["title", "overview"],
  "query": {
    "more_like_this": {
      "fields": ["overview"],
      "min_term_freq": 1,
      "max_query_terms": 12,
      "like": "An insatiable great white shark terrorizes the townspeople of Amity Island, The police chief, an oceanographer and a grizzled shark hunter seek to destroy the bloodthirsty beast."
    }
  }
}

Here are the top 5 results:

"hits": [
  {
    "_index": "movies", "_type": "_doc", "_id": "578", "_score": 100.41875,
    "_source": {
      "overview": "An insatiable great white shark terrorizes the townspeople of Amity Island, The police chief, an oceanographer and a grizzled shark hunter seek to destroy the bloodthirsty beast.",
      "title": "Jaws"
    }
  },
  {
    "_index": "movies", "_type": "_doc", "_id": "52454", "_score": 19.65117,
    "_source": {
      "overview": "When the prehistoric warm-water beast the Crocosaurus crosses paths with that cold-water monster the Mega Shark, all hell breaks loose in the oceans as the world's top scientists explore every option to halt the aquatic frenzy. Swallowing everything in their paths -- including a submarine or two -- Croc and Mega lead an explorer and an oceanographer on a wild chase. Eventually, the desperate men turn to a volcano for aid.",
      "title": "Mega Shark vs. Crocosaurus"
    }
  },
  {
    "_index": "movies", "_type": "_doc", "_id": "246594", "_score": 18.51667,
    "_source": {
      "overview": "When another Mega Shark returns from the depths of the sea, world militaries go on high alert. Ocean traffic grinds to a standstill as everyone lives in fear of the insatiable beast. Out of options, the US government unleashes the top secret Mecha Shark project -- a mechanical shark built to have the same exact characteristics as Mega. A pair of scientists pilot the mechanical creature as they fight Mega in a pitched battle to save the planet. But when faulty mechanics cause the Mecha to go after humans, the scientists must somehow guide Mega to Mecha in hopes that the two titans will kill each other - or risk untold worldwide destruction.",
      "title": "Mega Shark vs. Mecha Shark"
    }
  },
  {
    "_index": "movies", "_type": "_doc", "_id": "43084", "_score": 14.461939,
    "_source": {
      "overview": "Wealthy big game hunter, Wilson Frields, funds an expedition going deep into the Florida Everglades to search for the Calusa: a lost tribe of Native Americans. When the team discover the gruesome remains of another expedition, Friels admits he is searching for the Calusa's Fountain of Youth and its guardian, a mythical and deadly beast. As they delve deeper into the Everglades, the bloodthirsty beast begins to stalk and kill members of the group and, in one struggle, their leader Brinson Thomas is injured and begins to metamorphose into a creature himself. His only hope: to drink from the waters of the Fountain. The terrible truth behind the Calusa must be discovered if any of them are going to get out of there alive!",
      "title": "Deadly Species"
    }
  },
  {
    "_index": "movies", "_type": "_doc", "_id": "385232", "_score": 11.46097,
    "_source": {
      "overview": "When the powerful wizard, Lord Tensley, is jilted by Princess Ennogard, he vows to rid the land of love. He commands his fire-breathing dragon to destroy any sign of affection seen throughout the kingdom. As the death toll rises, Camilan, a brave but arrogant warrior seeks to marry his true love despite the curse upon the land. In order to fulfill his destiny, he seeks the help of his estranged brother Ramicus, a bounty hunter with no desire to get involved. It takes an enchanted distress message and the promise of great reward from the beautiful Princess Ennogard, to lure Ramicus into the quest to defeat the wizard and his terrible beast.",
      "title": "Dudes & Dragons"
    }
  }
]

Note that the first result is "Jaws" itself. That is because I did not run the query by pointing at the movie's document (had I done so, the document itself would not have been returned, since the include parameter defaults to false). Instead, I passed in the like parameter the synopsis exactly as it appears in the indexed document, and there can hardly be a document in the index more similar than the document itself.
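As for the other results, it is no surprise that the highest-ranked ones are about sharks, since "shark" is certainly a relevant term in the supplied text. The lower-ranked hits share other words with the synopsis instead, such as "beast" and "hunter".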
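Since include only concerns input documents that are referenced by index and ID, free-text input needs a different way to keep the source movie out of the recommendations. One simple option is to post-process the response on the client. The sketch below assumes the response has the shape shown above; the function name drop_exact_matches and the tiny sample response (with placeholder synopses) are made up for illustration.

```python
def drop_exact_matches(response, like_text, field="overview"):
    """Return the hits whose 'field' differs from the supplied text."""
    wanted = like_text.strip()
    kept = []
    for hit in response["hits"]["hits"]:
        value = hit.get("_source", {}).get(field)
        if isinstance(value, str) and value.strip() == wanted:
            continue  # same synopsis as the input: the movie itself
        kept.append(hit)
    return kept


if __name__ == "__main__":
    # Stand-in response with placeholder synopses, not real query output.
    sample = {
        "hits": {
            "hits": [
                {"_id": "578", "_source": {"title": "Jaws", "overview": "synopsis A"}},
                {"_id": "52454", "_source": {"title": "Mega Shark vs. Crocosaurus",
                                             "overview": "synopsis B"}},
            ]
        }
    }
    for hit in drop_exact_matches(sample, "synopsis A"):
        print(hit["_source"]["title"])
```

When requesting N recommendations this way, it makes sense to ask for one extra hit ("size": N + 1) so that N results remain after filtering.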
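Now let's supply, as input, a document that represents a movie the user has just watched. First we look the movie up by title: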

GET movies/_search
{
  "fields": ["overview"],
  "query": {
    "match": {"title": "rocky"}
  },
  "_source": false
}

The search above returns the following results:

"hits": [
  {
    "_index": "movies", "_type": "_doc", "_id": "1366", "_score": 11.072304,
    "fields": {
      "overview": [
        "When world heavyweight boxing champion, Apollo Creed wants to give an unknown fighter a shot at the title as a publicity stunt, his handlers choose palooka Rocky Balboa, an uneducated collector for a Philadelphia loan shark. Rocky teams up with trainer Mickey Goldmill to make the most of this once in a lifetime break."
      ]
    }
  },
  {
    "_index": "movies", "_type": "_doc", "_id": "1371", "_score": 9.325746,
    "fields": {
      "overview": [
        "Now the world champion, Rocky Balboa is living in luxury and only fighting opponents who pose no threat to him in the ring. His lifestyle of wealth and idleness is shaken when a powerful young fighter known as Clubber Lang challenges him to a bout. After taking a pounding from Lang, the humbled champ turns to former bitter rival Apollo Creed to help him regain his form for a rematch with Lang."
      ]
    }
  },
  {
    "_index": "movies", "_type": "_doc", "_id": "41288", "_score": 9.325746,
    "fields": {
      "overview": [
        """Step into the ring with one of America's greatest legends...and stand a couple of rounds with greatness! "Pulling no punches" (LA Daily News), Jon Favreau (Swingers) and Oscar(r) winner* George C. Scott give TKO performances in this outstanding biography of the only undefeated world heavyweight champion in the history of boxing! In the small blue-collar town of Brockton, Massachusetts, young Rocky Marciano (Favreau) turns to the ring as his ticket out. Training twice as hard and twice as long as anyone else, he pounds his way to victory and his reputation quickly spreads as "the guy to beat." But behind the gloves Rocky is unhappy with his gift and he's thinking of retiring. So, with the fate of his career hanging in the balance, he finds a way to unleash his thunder again this time against his biggest hero: Joe Louis!"""
      ]
    }
  }
  ...
]

The first hit is the document with _id 1366. Next, we look for documents similar to that one by referencing it by index and _id in the like parameter:

GET movies/_search
{
  "size": 5,
  "fields": ["overview"],
  "query": {
    "more_like_this": {
      "fields": ["overview"],
      "like": {"_index": "movies", "_id": "1366"},
      "min_term_freq": 1,
      "max_query_terms": 12
    }
  },
  "_source": false
}

The user may well be interested in the other movies of the franchise, and that is largely what we get back:

"hits": [
  {
    "_index": "movies", "_type": "_doc", "_id": "312221", "_score": 57.172073,
    "fields": {
      "overview": [
        "The former World Heavyweight Champion Rocky Balboa serves as a trainer and mentor to Adonis Johnson, the son of his late friend and former rival Apollo Creed."
      ]
    }
  },
  {
    "_index": "movies", "_type": "_doc", "_id": "184741", "_score": 29.255241,
    "fields": {
      "overview": [
        "A chorus girl (Marion Davies) and a heavyweight boxer (Clark Gable) are paired romantically as a publicity stunt."
      ]
    }
  },
  {
    "_index": "movies", "_type": "_doc", "_id": "1371", "_score": 27.638262,
    "fields": {
      "overview": [
        "Now the world champion, Rocky Balboa is living in luxury and only fighting opponents who pose no threat to him in the ring. His lifestyle of wealth and idleness is shaken when a powerful young fighter known as Clubber Lang challenges him to a bout. After taking a pounding from Lang, the humbled champ turns to former bitter rival Apollo Creed to help him regain his form for a rematch with Lang."
      ]
    }
  },
  {
    "_index": "movies", "_type": "_doc", "_id": "1246", "_score": 26.655176,
    "fields": {
      "overview": [
        "When he loses a highly publicized virtual boxing match to ex-champ Rocky Balboa, reigning heavyweight titleholder, Mason Dixon retaliates by challenging Rocky to a nationally televised, 10-round exhibition bout. To the surprise of his son and friends, Rocky agrees to come out of retirement and face an opponent who's faster, stronger and thirty years his junior."
      ]
    }
  },
  {
    "_index": "movies", "_type": "_doc", "_id": "1367", "_score": 25.732662,
    "fields": {
      "overview": [
        """After Rocky goes the distance with champ Apollo Creed, both try to put the fight behind them and move on. Rocky settles down with Adrian but can't put his life together outside the ring, while Creed seeks a rematch to restore his reputation. Soon enough, the "Master of Disaster" and the "Italian Stallion" are set on a collision course for a climactic battle that is brutal and unforgettable."""
      ]
    }
  }
]

Summary
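The more_like_this query has a lot of potential to add extra functionality to our search solutions: on the one hand it is very simple to use, and on the other it offers interesting parameters for tuning how we search for similar documents. The query is also used in NLP contexts, more specifically for text classification.
In any case, I hope this article has sparked some interest in experimenting with this and other query types available in Elasticsearch, even though they are used less often.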