
String-matching data structures: efficient search suggestions / IDE autocompletion with a Trie

Published: 2023/11/27

Contents

- 1. Background
- 2. How a Trie works
  - 2.1 Building a Trie
  - 2.2 Searching a Trie
  - 2.3 Traversing a Trie
  - 2.4 Time/space complexity of a Trie
  - 2.5 Trie vs. hash table / red-black tree
- 3. Application of the Trie: search suggestions

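1. Background

We previously looked at the efficient algorithms for single-pattern matching, BM and KMP. They are hard to understand, but they give us plenty of room to broaden our thinking.
1. BF and RK algorithm implementation
2. BM and KMP algorithms in detail

Single-pattern matching, however, only covers searching for one pattern string in one main string. In real scenarios we need to look a pattern string up among many main strings. With data volumes as large as those of an IDE, a text editor, or even a search engine, single-pattern matching cannot deliver the efficiency that this kind of lookup requires.

Efficient search across many strings is therefore the direction to focus on, and it is what today's data structure, the Trie, provides.

A Trie implements search suggestions quite naturally. Let's look at how it works in detail.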

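2. How a Trie works

The core purpose of a Trie is to store main strings that share a common prefix as a single tree. Just as a tree has a trunk and branches, a Trie turns the shared prefix into the trunk, and the parts that are not shared each form their own branch.

Such a string tree makes looking up a pattern string much easier: follow the trunk while advancing the index into the pattern string, and the non-shared branches can be matched just as quickly. Unlike single-pattern matching, there is no need to reason about prefix/suffix substrings in order to speed up the match.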

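2.1 Building a Trie

A Trie has roughly the following shape.

The independent main strings adas, ada, adf, am, ao, aok, dk, dqe, deq are split into characters, and each character becomes a tree node. Starting from the root and following the branches, we can quickly determine whether a pattern string matches.

The last character of a string is not necessarily a leaf of the Trie; an internal node can also be the end of a string. For example, ada ends at an internal node on the path to adas.

Building a Trie therefore means linking each node to its children, so that a lookup can move accurately from one node to the next.

The structure is built as follows:

Each node keeps an array that covers the whole character set. Take a Trie holding only the string am: the a node keeps an array of 26 node pointers, in which the slot for m is non-null and points to the m node, while the slots for all other characters (b, c, d, ...) stay null.

ps: This also exposes a problem: building a Trie consumes a lot of space. Common prefixes are stored only once, but every stored character needs 26 extra pointers, so the memory cost of a Trie is plain to see.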

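The TrieNode is defined as follows:

// Trie node info
class TrieNode {
public:
    char data_;                 // character stored in this node
    TrieNode *children_[26];    // one slot per letter 'a'..'z'
    bool isEndingChar_;         // true if a stored string ends at this node

    TrieNode(char data = '/') : data_(data), isEndingChar_(false) {
        memset(children_, 0, sizeof(TrieNode *) * 26);
    }
};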

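The main construction steps are:

Hand the main strings one by one to the initialized root node.
Starting from the root, walk through the characters of each input string:
- Determine the character's slot in the current node's children_ array (with 26 letters, index = input[i] - 'a')
- Check whether that slot is empty; if it is not, the character is part of a common prefix, so move on to the next input character
- If it is empty, create a new TrieNode for input[i] and attach it in that slot
After all characters of an input string have been added to the Trie, set the ending mark on the last node (it marks the end of this main string).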

void Trie::insert(string des) {
    if (des.size() <= 0) {
        return;
    }

    TrieNode *tmp = root_;
    // Traverse every character in des (assumed to be lowercase 'a'..'z')
    for (size_t i = 0; i < des.size(); i++) {
        // Slot of des[i] in the current node's children_ array
        int index = des[i] - 'a';
        if (tmp->children_[index] == nullptr) {
            // Not part of an existing prefix: create the node
            TrieNode *newNode = new TrieNode(des[i]);
            tmp->children_[index] = newNode;
        }
        tmp = tmp->children_[index];
    }
    // Mark the end of this string
    tmp->isEndingChar_ = true;
}

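2.2 Searching a Trie

With the Trie built, searching is straightforward.

Walk through the input string character by character and compute each character's index.
If the TrieNode at that index is null, the string does not match, no matter where in the string this happens.
If it is not null, the Trie contains this node, so the current character matches; continue with the next character.
After the last character, the string matches if the ending mark of its TrieNode is true; otherwise it does not match.

The code is as follows: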

// Check whether a string is stored in the Trie
bool Trie::find(string des) {
    if (des.size() == 0) {
        return false;
    }

    TrieNode *tmp = root_;
    for (size_t i = 0; i < des.size(); i++) {
        // Slot of the current character (assumed to be lowercase 'a'..'z')
        int index = des[i] - 'a';
        if (tmp->children_[index] == nullptr) {
            return false;
        }
        // Move tmp down to the next level
        tmp = tmp->children_[index];
    }

    // The input matches only if a stored string ends at this node
    return tmp->isEndingChar_;
}

2.3 Traversing a Trie

Traversing a Trie is a depth-first search: follow one branch all the way down to its last node, then back up and do the same for the next branch.

// Traverse the trie recursively
// para1: current TrieNode
// para2: prefix string accumulated so far
// para3: result vector
void Trie::dfs_traverse(TrieNode *p, string buf, vector<string> &tmp_str) {
    if (p == nullptr) {
        return;
    }

    // A stored string ends here: add it to the result vector
    if (p->isEndingChar_ == true) {
        tmp_str.push_back(buf);
    }

    for (int i = 0; i < 26; i++) {
        if (p->children_[i] != nullptr) {
            // Extend the prefix with the child's character on every step down
            dfs_traverse(p->children_[i], buf + (p->children_[i]->data_), tmp_str);
        }
    }
}

// Print all strings in the trie in dictionary order
void Trie::printTrie() {
    vector<string> tmp_str;

    for (int i = 0; i < 26; i++) {
        string buff = "";
        if (root_->children_[i] != nullptr) {
            // Recursive call with the TrieNode, its prefix character and the result vector
            dfs_traverse(root_->children_[i], buff + root_->children_[i]->data_, tmp_str);
        }
    }

    cout << "Trie string: " << tmp_str.size() << endl;
    for (size_t j = 0; j < tmp_str.size(); j++) {
        cout << tmp_str[j] << endl;
    }
}

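2.4 Time/space complexity of a Trie

Space complexity: the space cost speaks for itself. For main strings totalling n characters over an alphabet of only 26 letters, the implementation above allocates a 26-slot pointer array for every character. Worst-case space on a 64-bit machine is about (26*8 + 1)*n bytes, so space consumption is clearly a major problem for a Trie. When the strings share many common prefixes, the space needed to build the Trie drops to some extent.
Time complexity: building the Trie visits each of the n characters once, which costs O(n); once the Trie is built, matching a pattern string of length k takes a single pass, which costs O(k). The total time complexity is O(n + k).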

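2.5 Trie vs. hash table / red-black tree

	Trie (26 characters)	Hash table / red-black tree
Memory consumption	about (26*8 + 1)*n bytes in the worst case	O(n)
Lookup efficiency	O(k) per lookup, after an O(n) build	O(1) on average (hash table); O(log m) comparisons for m keys (red-black tree)
Off-the-shelf implementation	None; must be hand-written	Available and mature
Typical use	Search suggestions / IDE autocompletion	Exact string lookup

To sum up: if you need exact lookup among many strings, an off-the-shelf red-black tree or hash table is the better fit; if you need something like search suggestions, the structure of a Trie suits the job naturally.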

The complete test code for the Trie implementation above:

#include <cstring>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

// Trie node info
class TrieNode {
public:
    char data_;
    TrieNode *children_[26];
    bool isEndingChar_;

    TrieNode(char data = '/') : data_(data), isEndingChar_(false) {
        memset(children_, 0, sizeof(TrieNode *) * 26);
    }
};

// Trie tree info with a root node
class Trie {
public:
    Trie() { root_ = new TrieNode(); }
    ~Trie() { destory(root_); }

    void insert(string des);
    bool find(string des);
    void printTrie();
    void destory(TrieNode *p);
    void dfs_traverse(TrieNode *p, string buf, vector<string> &tmp_str);

private:
    TrieNode *root_;
};

// Delete the TrieNode and its whole subtree, releasing the space
void Trie::destory(TrieNode *p) {
    if (p == nullptr) {
        return;
    }
    for (int i = 0; i < 26; i++) {
        destory(p->children_[i]);
    }
    delete p;
}

void Trie::insert(string des) {
    if (des.size() <= 0) {
        return;
    }

    TrieNode *tmp = root_;
    for (size_t i = 0; i < des.size(); i++) {
        // Slot of des[i] in the current node (input assumed to be 'a'..'z')
        int index = des[i] - 'a';
        if (tmp->children_[index] == nullptr) {
            TrieNode *newNode = new TrieNode(des[i]);
            tmp->children_[index] = newNode;
        }
        tmp = tmp->children_[index];
    }
    tmp->isEndingChar_ = true;
}

// Traverse the trie recursively
void Trie::dfs_traverse(TrieNode *p, string buf, vector<string> &tmp_str) {
    if (p == nullptr) {
        return;
    }

    // A stored string ends here: add it to the result vector
    if (p->isEndingChar_ == true) {
        tmp_str.push_back(buf);
    }

    for (int i = 0; i < 26; i++) {
        if (p->children_[i] != nullptr) {
            // Extend the prefix with the child's character on every step down
            dfs_traverse(p->children_[i], buf + (p->children_[i]->data_), tmp_str);
        }
    }
}

// Print the trie in dictionary order
void Trie::printTrie() {
    vector<string> tmp_str;

    for (int i = 0; i < 26; i++) {
        string buff = "";
        if (root_->children_[i] != nullptr) {
            // Recursive call
            dfs_traverse(root_->children_[i], buff + root_->children_[i]->data_, tmp_str);
        }
    }

    cout << "Trie string: " << tmp_str.size() << endl;
    for (size_t j = 0; j < tmp_str.size(); j++) {
        cout << tmp_str[j] << endl;
    }
}

// Check whether a string is stored in the Trie
bool Trie::find(string des) {
    if (des.size() == 0) {
        return false;
    }

    TrieNode *tmp = root_;
    for (size_t i = 0; i < des.size(); i++) {
        // Slot of the current character
        int index = des[i] - 'a';
        if (tmp->children_[index] == nullptr) {
            return false;
        }
        // Move tmp down to the next level
        tmp = tmp->children_[index];
    }

    // The input matches only if a stored string ends at this node
    return tmp->isEndingChar_;
}

int main() {
    string s[5] = {"adafs", "dfgh", "amkil", "doikl", "aop"};
    Trie *trie = new Trie();

    for (int i = 0; i < 5; i++) {
        trie->insert(s[i]);
    }
    trie->printTrie();

    string in_str;
    cout << "Inpunt a string :" << endl;
    cin >> in_str;
    if (trie->find(in_str)) {
        cout << "Trie tree has the str: " << in_str << endl;
    } else {
        cout << "Trie tree doesn't have the str : " << in_str << endl;
    }

    delete trie;
    return 0;
}

The output is as follows:

> ./trie_alg
Trie string: 5
adafs
amkil
aop
dfgh
doikl
Inpunt a string :
aoe
Trie tree doesn't have the str : aoe

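3. Application of the Trie: search suggestions

To implement search suggestions, some logic has to be added on top of the Trie implementation. For example, when the user types h, every stored string that starts with h is returned; when the user types he, every stored string that starts with he is returned, and so on.

source code: https://github.com/BaronStack/DATA_STRUCTURE/blob/master/string/trie_alg.cc

The implementation logic is as follows: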

// Traverse the trie recursively
void Trie::dfs_traverse(TrieNode *p, string buf, vector<string> &tmp_str) {
    if (p == nullptr) {
        return;
    }

    // A stored string ends here: add it to the result vector
    if (p->isEndingChar_ == true) {
        tmp_str.push_back(buf);
    }

    for (int i = 0; i < 26; i++) {
        if (p->children_[i] != nullptr) {
            // Extend the prefix with the child's character on every step down
            dfs_traverse(p->children_[i], buf + (p->children_[i]->data_), tmp_str);
        }
    }
}

// Take a prefix and print every stored string that starts with it
void Trie::printTrieWithPrefix(string start) {
    vector<string> tmp_str;
    TrieNode *tmp = root_;

    // Ensure the prefix exists in the trie
    for (size_t i = 0; i < start.size(); i++) {
        int index = start[i] - 'a';
        if (tmp->children_[index] == nullptr) {
            cout << "No prefix with " << start << endl;
            return;
        }
        tmp = tmp->children_[index];
    }

    // The prefix itself is a result only if a stored string ends here
    if (tmp->isEndingChar_) {
        tmp_str.push_back(start);
    }

    for (int i = 0; i < 26; i++) {
        string buff = start;
        if (tmp->children_[i] != nullptr) {
            // Recursive call
            dfs_traverse(tmp->children_[i], buff + tmp->children_[i]->data_, tmp_str);
        }
    }

    cout << "Trie string: " << tmp_str.size() << endl;
    for (size_t j = 0; j < tmp_str.size(); j++) {
        cout << tmp_str[j] << endl;
    }
}