Topic Modeling in R with tidytext and textmineR (Latent Dirichlet Allocation)
In this article, we will learn to do topic modeling using the tidytext and textmineR packages with the Latent Dirichlet Allocation (LDA) algorithm.
Natural Language Processing covers a wide area of knowledge and implementation; one of its applications is the topic model. A topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovering hidden semantic structures in a text body. For example, “dog”, “bone”, and “obedient” will appear more often in documents about dogs, while “cute”, “evil”, and “home owner” will appear in documents about cats. The “topics” produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document’s balance of topics is.
Background
What is topic modeling? Topic modeling is how the machine collects groups of words within a document to build “topics” that contain words with similar dependencies. With topic model methods we can organize, understand, and summarize large collections of textual information. It helps in:
- Discovering hidden topical patterns that are present across the collection
- Annotating documents according to these topics
- Using these annotations to organize, search and summarize texts
In a business context, topic modeling’s power to discover hidden topics can help an organization better understand its customer feedback, so that it can concentrate on the issues customers are facing. It can also summarize text from company meetings; a high-quality meeting document enables users to recall the meeting content efficiently. Topic tracking and detection can also be used to build a recommender system.
There are many techniques that are used to obtain topic models, namely: Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), Correlated Topic Models (CTM), and TextRank. In this study, we will focus on implementing the LDA algorithm to build topic models with the tidytext and textmineR packages. Besides building the models, we will also evaluate their goodness of fit using metrics like R-squared and log-likelihood, and measure the quality of the topics with metrics like coherence and prevalence.
Load these libraries in your working machine:
# data wrangling
library(dplyr)
library(tidyr)
library(lubridate)
# visualization
library(ggplot2)
# dealing with text
library(textclean)
library(tm)
library(SnowballC)
library(stringr)
# topic model
library(tidytext)
library(topicmodels)
library(textmineR)
Topic Model
From the introduction above we know that there are several ways to do topic modeling. In this study, we will use the LDA algorithm. LDA is a mathematical model that is used to find the mixture of words belonging to each topic, and also to determine the mixture of topics that describes each document. LDA answers the following principles of topic modeling:
Every document is a mixture of topics. We imagine that each document may contain words from several topics in particular proportions. For example, in a two-topic model we could say “Document 1 is 90% topic A and 10% topic B, while Document 2 is 30% topic A and 70% topic B.” This is symbolized as θ (theta).
Every topic is a mixture of words. For example, we could imagine a two-topic model of American news, with one topic for “politics” and one for “entertainment.” The most common words in the politics topic might be “President”, “Congress”, and “government”, while the entertainment topic may be made up of words such as “movies”, “television”, and “actor”. Importantly, words can be shared between topics; a word like “budget” might appear in both equally. This is symbolized as φ (phi).
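As a minimal illustration (a sketch only; `lda_model` is a placeholder name for an object fitted later with topicmodels::LDA()), tidytext can extract both of these distributions from a fitted model:
# sketch: extracting theta and phi from a fitted topicmodels LDA object
# `lda_model` is a placeholder name, not an object defined at this point in the article
library(topicmodels)
library(tidytext)
doc_topics  <- tidy(lda_model, matrix = "gamma")  # per-document-per-topic probabilities (theta)
word_topics <- tidy(lda_model, matrix = "beta")   # per-topic-per-word probabilities (phi)
head(doc_topics)   # columns: document, topic, gamma
head(word_topics)  # columns: topic, term, beta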
We will use two approaches: tidytext (together with the topicmodels package) and textmineR. The tidytext approach builds a topic model easily and provides a method for extracting the per-topic-per-word probabilities, called β (“beta”), from the model. But it doesn’t provide metrics to evaluate the goodness of the model the way textmineR does.
Latent Dirichlet Allocation (LDA)
LDA is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s presence is attributable to one of the document’s topics. Plate Notation (picture below) is a concise way of visually representing the dependencies among the model parameters.
[Figure: LDA plate notation]
- The plate M denotes the number of documents
- N is the number of words in a given document
- α is the parameter of the Dirichlet prior on the per-document topic distributions. High α indicates that each document is likely to contain a mixture of most of the topics (not just one or two). Low α indicates each document will likely contain just a few topics
- β is the parameter of the Dirichlet prior on the per-topic word distribution. High β indicates that each topic will contain a mixture of most of the words; low β indicates the topic has a low mixture of words
- θm is the topic distribution for document m
- zmn is the topic for the n-th word in document m
- wmn is the specific word
LDA is a generative process. LDA assumes that new documents are created in the following way:
1. Determine the number of words in the document.
2. Choose a topic mixture for the document over a fixed set of topics (example: 20% topic A, 50% topic B, 30% topic C).
3. Generate the words in the document by:
- picking a topic based on the document’s multinomial distribution (zm,n ~ Multinomial(θm))
- picking a word based on that topic’s multinomial distribution (wm,n ~ Multinomial(φzm,n)), where φzm,n is the word distribution for topic z
4. Repeat the process for n iterations until the distribution of words over topics meets the criteria in step 2.
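To make this concrete, here is a minimal base-R sketch of that generative story; the vocabulary, number of topics, and parameter values below are made up for illustration only:
# sketch of the LDA generative process, with made-up vocabulary and parameters
set.seed(1502)
vocab <- c("guitar", "string", "sound", "cheap", "great", "ship")
k     <- 2      # number of topics
alpha <- 0.1    # document-topic Dirichlet prior
beta  <- 0.05   # topic-word Dirichlet prior
# sample from a symmetric Dirichlet using normalized gamma draws (no extra package needed)
rdirichlet_one <- function(n, a){ g <- rgamma(n, shape = a); g / sum(g) }
phi   <- t(replicate(k, rdirichlet_one(length(vocab), beta)))  # k x V topic-word distributions
theta <- rdirichlet_one(k, alpha)                              # topic mixture for one document
n_words <- 10
doc <- sapply(seq_len(n_words), function(i){
  z <- sample(seq_len(k), 1, prob = theta)   # pick a topic for this word
  sample(vocab, 1, prob = phi[z, ])          # pick a word from that topic
})
doc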
Data Import & Objectives
The data is from this Kaggle dataset. It contains customers’ feedback on Amazon musical instruments. Every row represents one review from one user. There are several columns, but we only need reviewText, which contains the text of the review; overall, the product rating from 1-5 given by the user; and reviewTime, which contains the time the review was given.
# data import and preparation
data <- read.csv("Musical_instruments_reviews.csv")
data <- data %>%
mutate(overall = as.factor(overall),
reviewTime = str_replace_all(reviewTime, pattern = " ",replacement = "-"),
reviewTime = str_replace(reviewTime, pattern = ",",replacement = ""),
reviewTime = mdy(reviewTime)) %>%
select(reviewText, overall,reviewTime)
head(data)
So the objective of this project is to discover what users are talking about for each rating. This will help the organization better understand its customer feedback so that it can concentrate on the issues customers are facing.
Tidytext
Text cleaning process
Before we feed the text to the LDA model, we need to clean it. We will build a textcleaner function using several functions from the tm, textclean, and stringr packages. We also need to convert the text to Document-Term Matrix (DTM) format, because the LDA() function from the topicmodels package requires a dtm as input.
# build textcleaner function
textcleaner <- function(x){
x <- as.character(x)
x <- x %>%
str_to_lower() %>% # convert all the string to low alphabet
replace_contraction() %>% # replace contraction to their multi-word forms
replace_internet_slang() %>% # replace internet slang to normal words
replace_emoji() %>% # replace emoji to words
replace_emoticon() %>% # replace emoticon to words
replace_hash(replacement = "") %>% # remove hashtag
replace_word_elongation() %>% # replace informal writing with known semantic replacements
replace_number(remove = T) %>% # remove number
replace_date(replacement = "") %>% # remove date
replace_time(replacement = "") %>% # remove time
str_remove_all(pattern = "[[:punct:]]") %>% # remove punctuation
str_remove_all(pattern = "[^\\s]*[0-9][^\\s]*") %>% # remove mixed string n number
str_squish() %>% # reduces repeated whitespace inside a string.
str_trim() # removes whitespace from start and end of string
xdtm <- VCorpus(VectorSource(x)) %>%
tm_map(removeWords, stopwords("en"))
# convert corpus to document term matrix
return(DocumentTermMatrix(xdtm))
}
Because we want to know the topics for each rating, we should split/subset the data by rating.
data_1 <- data %>% filter(overall == 1)
data_2 <- data %>% filter(overall == 2)
data_3 <- data %>% filter(overall == 3)
data_4 <- data %>% filter(overall == 4)
data_5 <- data %>% filter(overall == 5)
table(data$overall)
##
## 1 2 3 4 5
## 14 21 77 245 735
From the table above we know that most of the feedback has the highest rating. Because the distributions are different, each rating will receive different treatment, especially in choosing the minimum term frequency. I’ll make sure we use at least 700–1000 words to be analyzed for each rating.
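One way to sanity-check these cut-offs (a small sketch; the threshold values below are just examples) is to count how many terms survive several minimum frequencies before committing to one:
# sketch: vocabulary size at several example minimum-frequency cut-offs
dtm_check <- textcleaner(data_5$reviewText)
sapply(c(5, 10, 20, 50), function(th) length(findFreqTerms(dtm_check, th)))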
Topic Modeling rating 5
# apply textcleaner function for review text
dtm_5 <- textcleaner(data_5$reviewText)
# find most frequent terms. i choose words that at least appear in 50 reviews
freqterm_5 <- findFreqTerms(dtm_5,50)
# we have 981 words. subset the dtm to only choose those selected words
dtm_5 <- dtm_5[,freqterm_5]
# only keep documents (rows) that contain at least one of the selected words
rownum_5 <- apply(dtm_5,1,sum)
dtm_5 <- dtm_5[rownum_5>0,]
# apply the LDA function. set k = 6, meaning we want to build 6 topics
lda_5 <- LDA(dtm_5,k = 6,control = list(seed = 1502))
# apply auto tidy using tidy and use beta as per-topic-per-word probabilities
topic_5 <- tidy(lda_5,matrix = "beta")
# choose 15 words with highest beta from each topic
top_terms_5 <- topic_5 %>%
group_by(topic) %>%
top_n(15,beta) %>%
ungroup() %>%
arrange(topic,-beta)
# plot the topic and words for easy interpretation
plot_topic_5 <- top_terms_5 %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
scale_x_reordered()
plot_topic_5
[Figure: Rating 5 topic modeling using tidytext]
Topic Modeling rating 4
dtm_4 <- textcleaner(data_4$reviewText)
freqterm_4 <- findFreqTerms(dtm_4,20)
dtm_4 <- dtm_4[,freqterm_4]
rownum_4 <- apply(dtm_4,1,sum)
dtm_4 <- dtm_4[rownum_4>0,]
lda_4 <- LDA(dtm_4,k = 6,control = list(seed = 1502))
topic_4 <- tidy(lda_4,matrix = "beta")
top_terms_4 <- topic_4 %>%
group_by(topic) %>%
top_n(15,beta) %>%
ungroup() %>%
arrange(topic,-beta)
plot_topic_4 <- top_terms_4 %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
scale_x_reordered()
plot_topic_4
[Figure: Rating 4 topic modeling using tidytext]
Topic Modeling rating 3
dtm_3 <- textcleaner(data_3$reviewText)
freqterm_3 <- findFreqTerms(dtm_3,10)
dtm_3 <- dtm_3[,freqterm_3]
rownum_3 <- apply(dtm_3,1,sum)
dtm_3 <- dtm_3[rownum_3>0,]
lda_3 <- LDA(dtm_3,k = 6,control = list(seed = 1502))
topic_3 <- tidy(lda_3,matrix = "beta")
top_terms_3 <- topic_3 %>%
group_by(topic) %>%
top_n(15,beta) %>%
ungroup() %>%
arrange(topic,-beta)
plot_topic_3 <- top_terms_3 %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
scale_x_reordered()
plot_topic_3
[Figure: Rating 3 topic modeling using tidytext]
Topic Modeling rating 2
dtm_2 <- textcleaner(data_2$reviewText)
freqterm_2 <- findFreqTerms(dtm_2,5)
dtm_2 <- dtm_2[,freqterm_2]
rownum_2 <- apply(dtm_2,1,sum)
dtm_2 <- dtm_2[rownum_2>0,]
lda_2 <- LDA(dtm_2,k = 6,control = list(seed = 1502))
topic_2 <- tidy(lda_2,matrix = "beta")
top_terms_2 <- topic_2 %>%
group_by(topic) %>%
top_n(15,beta) %>%
ungroup() %>%
arrange(topic,-beta)
plot_topic_2 <- top_terms_2 %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
scale_x_reordered()
plot_topic_2
[Figure: Rating 2 topic modeling using tidytext]
Topic Modeling rating 1
dtm_1 <- textcleaner(data_1$reviewText)
freqterm_1 <- findFreqTerms(dtm_1,5)
dtm_1 <- dtm_1[,freqterm_1]
rownum_1 <- apply(dtm_1,1,sum)
dtm_1 <- dtm_1[rownum_1>0,]
lda_1 <- LDA(dtm_1,k = 6,control = list(seed = 1502))
topic_1 <- tidy(lda_1,matrix = "beta")
top_terms_1 <- topic_1 %>%
group_by(topic) %>%
top_n(15,beta) %>%
ungroup() %>%
arrange(topic,-beta)
plot_topic_1 <- top_terms_1 %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
scale_x_reordered()
plot_topic_1
[Figure: Rating 1 topic modeling using tidytext]
textmineR
Text cleaning process
Just like the previous text cleaning method, we will build a text cleaner function to automate the cleaning process. The difference is that we don’t need to convert the text to dtm format ourselves: the textmineR package has its own dtm converter, CreateDtm(). Fitting an LDA model with textmineR requires a dtm made by the CreateDtm() function. In that step we can also set the n-gram size, remove punctuation and stopwords, and do other simple text cleaning.
textcleaner_2 <- function(x){
x <- as.character(x)
x <- x %>%
str_to_lower() %>% # convert all the string to low alphabet
replace_contraction() %>% # replace contraction to their multi-word forms
replace_internet_slang() %>% # replace internet slang to normal words
replace_emoji() %>% # replace emoji to words
replace_emoticon() %>% # replace emoticon to words
replace_hash(replacement = "") %>% # remove hashtag
replace_word_elongation() %>% # replace informal writing with known semantic replacements
replace_number(remove = T) %>% # remove number
replace_date(replacement = "") %>% # remove date
replace_time(replacement = "") %>% # remove time
str_remove_all(pattern = "[[:punct:]]") %>% # remove punctuation
str_remove_all(pattern = "[^\\s]*[0-9][^\\s]*") %>% # remove mixed string n number
str_squish() %>% # reduces repeated whitespace inside a string.
str_trim() # removes whitespace from start and end of string
return(as.data.frame(x))
}
Topic Modeling rating 5
# apply textcleaner_2 function. note: we only clean the text without converting it to dtm
clean_5 <- textcleaner_2(data_5$reviewText)
clean_5 <- clean_5 %>% mutate(id = rownames(clean_5))
# create dtm
set.seed(1502)
dtm_r_5 <- CreateDtm(doc_vec = clean_5$x,
doc_names = clean_5$id,
ngram_window = c(1,2),
stopword_vec = stopwords("en"),
verbose = F)
dtm_r_5 <- dtm_r_5[,colSums(dtm_r_5)>2]
Create the LDA model using `textmineR`. Here we are going to make 20 topics. The reason we build so many topics is that `textmineR` has metrics to calculate the quality of topics, so we can later choose the topics with the best quality.
set.seed(1502)
mod_lda_5 <- FitLdaModel(dtm = dtm_r_5,
k = 20, # number of topic
iterations = 500,
burnin = 180,
alpha = 0.1,beta = 0.05,
optimize_alpha = T,
calc_likelihood = T,
calc_coherence = T,
calc_r2 = T)
Once we have created a model, we need to evaluate it. For overall goodness of fit, textmineR has R-squared and log-likelihood. R-squared is interpretable as the proportion of variability in the data explained by the model, as with linear regression.
mod_lda_5$r2
## [1] 0.2183867
The primary goodness-of-fit measure in topic modeling is the likelihood. Likelihoods, generally the log-likelihood, are naturally obtained from probabilistic topic models. Here the log_likelihood is P(tokens|topics) at each iteration.
plot(mod_lda_5$log_likelihood,type = "l")
[Figure: log-likelihood at every iteration, rating 5]
Get the 15 top terms with the highest phi. Phi represents the distribution of words over topics; words with high phi have the highest frequency within a topic.
mod_lda_5$top_terms <- GetTopTerms(phi = mod_lda_5$phi,M = 15)
data.frame(mod_lda_5$top_terms)
[Figure: top terms per topic, rating 5]
Let’s see the coherence value for each topic. Topic coherence measures score a single topic by measuring the degree of semantic similarity between high-scoring words in the topic. These measurements help distinguish between topics that are semantically interpretable and topics that are artifacts of statistical inference. For each pair of words {a,b}, probabilistic coherence calculates P(b|a) - P(b), where {a} is more probable than {b} in the topic. In simple words, coherence tells us how associated the words in a topic are.
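As a rough sketch of that calculation (one pairwise term only, computed over documents in dtm_r_5; the two words below are made-up examples, not necessarily in the vocabulary):
# sketch of one pairwise probabilistic-coherence term, P(b|a) - P(b)
# `word_a` and `word_b` are made-up examples; pick any two terms in colnames(dtm_r_5)
word_a <- "guitar"
word_b <- "string"
present <- as.matrix(dtm_r_5[, c(word_a, word_b)]) > 0      # which documents contain each word
p_b         <- mean(present[, word_b])                      # P(b)
p_b_given_a <- mean(present[present[, word_a], word_b])     # P(b | a)
p_b_given_a - p_b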
mod_lda_5$coherence
## t_1 t_2 t_3 t_4 t_5 t_6 t_7
## 0.12140404 0.08349523 0.05510456 0.11607445 0.16397834 0.05472121 0.09739406
## t_8 t_9 t_10 t_11 t_12 t_13 t_14
## 0.14221823 0.24856426 0.79310008 0.28175270 0.10231907 0.58667185 0.05449207
## t_15 t_16 t_17 t_18 t_19 t_20
## 0.09204392 0.10147505 0.07949897 0.04519463 0.13664781 0.21586105
We also want to look at the prevalence value. Prevalence tells us the most frequent topics in the corpus; it is the probability of each topic’s distribution across the whole set of documents.
mod_lda_5$prevalence <- colSums(mod_lda_5$theta)/sum(mod_lda_5$theta)*100
mod_lda_5$prevalence
## t_1 t_2 t_3 t_4 t_5 t_6 t_7 t_8
## 5.514614 5.296280 4.868778 7.484032 9.360072 2.748069 4.269445 4.195638
## t_9 t_10 t_11 t_12 t_13 t_14 t_15 t_16
## 5.380414 3.541380 5.807442 5.305865 3.243890 4.657203 5.488087 2.738993
## t_17 t_18 t_19 t_20
## 4.821128 4.035630 7.385820 3.857221
Now we have the top terms for each topic, the goodness of the model from r2 and log_likelihood, and the quality of the topics from coherence and prevalence. Let’s compile them in a summary.
mod_lda_5$summary <- data.frame(topic = rownames(mod_lda_5$phi),
coherence = round(mod_lda_5$coherence,3),
prevalence = round(mod_lda_5$prevalence,3),
top_terms = apply(mod_lda_5$top_terms,2,function(x){paste(x,collapse = ", ")}))
modsum_5 <- mod_lda_5$summary %>%
`rownames<-`(NULL)
We know that the quality of the topics can be described with the coherence and prevalence values. Let’s build a plot to identify which topics have the best quality.
modsum_5 %>% pivot_longer(cols = c(coherence,prevalence)) %>%
ggplot(aes(x = factor(topic,levels = unique(topic)), y = value, group = 1)) +
geom_point() + geom_line() +
facet_wrap(~name,scales = "free_y",nrow = 2) +
theme_minimal() +
labs(title = "Best topics by coherence and prevalence score",
subtitle = "Text review with 5 rating",
x = "Topics", y = "Value")coherence and prevalence score in rating 5等級5的連貫性和患病率得分
From the graph above we know that topic 10 has the highest quality, which means the words in that topic are strongly associated with each other. But in terms of the probability of topic distribution across the whole set of documents (prevalence), topic 10 has a low score. This means a review is unlikely to use the combination of words in topic 10, even though the words inside that topic support each other.
We can see whether topics can be grouped together using a dendrogram. The dendrogram uses Hellinger distance (the distance between two probability vectors) to decide if the topics are closely related. For instance, the dendrogram below suggests that there is greater similarity between topics 10 and 13.
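For reference, a minimal sketch of that distance applied to two rows of phi (the topic indices are arbitrary examples):
# sketch: Hellinger distance between two topic-word distributions (topic indices are arbitrary examples)
hellinger <- function(p, q) sqrt(sum((sqrt(p) - sqrt(q))^2)) / sqrt(2)
hellinger(mod_lda_5$phi["t_10", ], mod_lda_5$phi["t_13", ])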
mod_lda_5$linguistic <- CalcHellingerDist(mod_lda_5$phi)
mod_lda_5$hclust <- hclust(as.dist(mod_lda_5$linguistic),"ward.D")
# label each topic with its most probable n-gram so the dendrogram leaves are readable
mod_lda_5$labels <- LabelTopics(assignments = mod_lda_5$theta > 0.05, dtm = dtm_r_5, M = 1)
mod_lda_5$hclust$labels <- paste(mod_lda_5$hclust$labels, mod_lda_5$labels[,1])
plot(mod_lda_5$hclust)
[Figure: cluster dendrogram, rating 5]
Now that we have built and interpreted the topic model for rating 5, let’s apply the same steps to every rating and see the difference in what people are talking about.
I won’t copy and paste the process for every rating because it is just the same process and I think it would waste space. But if you really want to look at it, please visit my publications on my RPubs.
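If you want to reproduce it without copy-pasting, a rough sketch (assuming the textcleaner_2 function defined above; the helper name and the min_count parameter are my own additions) is to wrap the textmineR pipeline in a function and apply it to each rating subset:
# rough sketch: wrap the textmineR pipeline so it can be applied per rating
# assumes textcleaner_2() from above; fit_rating_lda and min_count are illustrative additions
fit_rating_lda <- function(reviews, k = 20, min_count = 2){
  set.seed(1502)
  clean <- textcleaner_2(reviews)
  clean <- clean %>% mutate(id = rownames(clean))
  dtm <- CreateDtm(doc_vec = clean$x, doc_names = clean$id,
                   ngram_window = c(1,2), stopword_vec = stopwords("en"),
                   verbose = F)
  dtm <- dtm[, colSums(dtm) > min_count]   # drop very rare terms
  FitLdaModel(dtm = dtm, k = k, iterations = 500, burnin = 180,
              alpha = 0.1, beta = 0.05, optimize_alpha = T,
              calc_likelihood = T, calc_coherence = T, calc_r2 = T)
}
# e.g. mod_lda_4 <- fit_rating_lda(data_4$reviewText)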
Conclusion
We’ve gone through the topic modeling process from text cleaning to interpretation and analysis. Finally, let’s see what people are talking about for each rating. We will choose the 5 topics with the highest quality (coherence). Each topic will show the 15 words with the highest value of phi (the distribution of words over topics).
Rating 5
modsum_5 %>%
arrange(desc(coherence)) %>%
slice(1:5)
[Figure: top terms per topic ordered by highest coherence (rating 5)]
With the highest coherence scores, topic 10 and topic 13 contain lots of ‘sticking’ and ‘tongue’ words. Maybe it’s just a phrase for a specific instrument. They contain similar words that push their coherence scores up, but their low prevalence means those words are rarely used in other reviews, which is why I suspect they come from a ‘specific’ instrument. In topic 11 and others, people are talking about how good the product is; for example, there are words like ‘good’, ‘accurate’, ‘clean’, ‘easy’, ‘recommend’, and ‘great’ that indicate positive sentiment.
Rating 4
modsum_4 %>%
arrange(desc(coherence)) %>%
slice(1:5)
[Figure: top terms per topic ordered by highest coherence (rating 4)]
Same as before, the topic with the highest coherence score is filled with sticking and tongue terms. In this rating, people are still praising the product, but not as much as in rating 5. Keep in mind that the dtm is built using bigrams: two-word terms like solid_state or e_tongue are captured and calculated just like single words. With that information, we know that all the terms shown here have their own phi value and actually represent the reviews.
Rating 3
modsum_3 %>%
arrange(desc(coherence)) %>%
slice(1:5)
[Figure: top terms per topic ordered by highest coherence (rating 3)]
Looks like stick and tongue words are everywhere. `topic 15` has high coherence and prevalence values in rating 3, which means lots of reviews in this rating are talking about them. On the other hand, in this rating positive words are barely seen; most of the topics are filled with guitar- or string-related words.
Rating 2
modsum_2 %>%
arrange(desc(coherence)) %>%
slice(1:5)
[Figure: top terms per topic ordered by highest coherence (rating 2)]
Rating 1
modsum_1 %>%
arrange(desc(coherence)) %>%
slice(1:5)
[Figure: top terms per topic ordered by highest coherence (rating 1)]
In the worst rating, people complain a lot. Words like ‘junk’, ‘cheap’, ‘just’, and ‘back’ are everywhere; it is very different from rating 5.
Overall, let’s keep in mind that this dataset covers a combination of products, so it is not surprising if some topics are filled with nonsense. But for every rating we were able to build topics around different instruments; most of them talk about a particular instrument with its positive or negative reviews. In this project we managed to build topic models that separate by instrument, which shows that LDA is able to build topics from semantically related words. It would be better to do topic modeling on a single specific product and discover the problems to fix or the good points to keep. It surely helps an organization better understand its customer feedback, so that it can concentrate on the issues customers are facing, especially when there are lots of reviews to analyze.
Translated from: https://medium.com/@joenathanchristian/topic-modeling-in-r-with-tidytext-and-textminer-package-latent-dirichlet-allocation-764f4483be73