Reproducing a Classic: Chapter 20 of "Statistical Learning Methods", Latent Dirichlet Allocation

Published: 2025/3/8

Chapter 20: Latent Dirichlet Allocation

This article reproduces, in code, Li Hang's book "Statistical Learning Methods" (《統計學習方法》). Author: Huang Haiguang (黃海廣)

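Note: all of the code can be downloaded from GitHub. I will publish the code step by step on the WeChat public account "機器學習初學者" (Machine Learning Beginners), where it can be read online in this series.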

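1. The probability density function of the Dirichlet distribution is

$$p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} \theta_i^{\alpha_i - 1}$$

where $\sum_{i=1}^{k} \theta_i = 1$, $\theta_i \ge 0$, $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_k)$, and $\alpha_i > 0$ for $i = 1, 2, \ldots, k$. The Dirichlet distribution is the conjugate prior of the multinomial distribution.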
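The density and the conjugacy property can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `dirichlet_log_pdf` and `posterior_alpha` are hypothetical helpers introduced for illustration and are not part of the original code.

```python
import math

def dirichlet_log_pdf(theta, alpha):
    """Log density of Dir(alpha) at a point theta in the interior of the simplex."""
    if abs(sum(theta) - 1.0) > 1e-9 or any(t <= 0 for t in theta):
        raise ValueError("theta must be strictly positive and sum to 1")
    # log of the normalizing constant Gamma(sum alpha) / prod Gamma(alpha_i)
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))

def posterior_alpha(alpha, counts):
    """Conjugacy: a Dir(alpha) prior with multinomial counts n gives Dir(alpha + n)."""
    return [a + n for a, n in zip(alpha, counts)]

alpha = [1.0, 1.0, 1.0]      # the uniform Dirichlet on the 3-dimensional simplex
theta = [0.2, 0.3, 0.5]
print(math.exp(dirichlet_log_pdf(theta, alpha)))  # constant density Gamma(3) = 2
print(posterior_alpha(alpha, [4, 0, 6]))          # parameters of the posterior
```

Working in log space with `math.lgamma` avoids overflow of the gamma function for large hyperparameters.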

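2. Latent Dirichlet allocation (LDA) is a generative probabilistic model of a text collection. The model assumes that each topic is represented by a multinomial distribution over words and each text by a multinomial distribution over topics, and that the prior distributions of both the word distributions and the topic distributions are Dirichlet distributions. LDA is a probabilistic graphical model and can be drawn in plate notation. In the LDA model, the word distribution of each topic, the topic distribution of each text, and the topic at each position of each text are latent variables; the word at each position of each text is an observed variable.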

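3. LDA generates a text collection as follows:

(1) Word distributions of the topics: randomly generate the word distributions of all topics. The word distribution of a topic is a multinomial distribution whose prior is a Dirichlet distribution.

(2) Topic distributions of the texts: randomly generate the topic distributions of all texts. The topic distribution of a text is a multinomial distribution whose prior is a Dirichlet distribution.

(3) Content of the texts: randomly generate the content of all texts. At each position of each text, randomly generate a topic from the topic distribution of that text, then randomly generate a word from the word distribution of that topic.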
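The three steps above can be sketched in a few lines of standard-library Python. This is an illustrative sketch under simplifying assumptions, not the author's code: every text has the same fixed length, the Dirichlet priors are symmetric (one scalar `alpha` and one scalar `beta`), and the names `sample_dirichlet` and `generate_corpus` are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(num_docs, doc_len, num_topics, vocab, alpha, beta, seed=0):
    rng = random.Random(seed)
    # (1) one word distribution per topic: phi_k ~ Dir(beta)
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(num_topics)]
    docs, thetas = [], []
    for _ in range(num_docs):
        # (2) topic distribution of this text: theta_m ~ Dir(alpha)
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_len):
            # (3) draw a topic for this position, then a word from that topic
            z = rng.choices(range(num_topics), weights=theta)[0]
            w = rng.choices(range(len(vocab)), weights=phi[z])[0]
            words.append(vocab[w])
        docs.append(words)
        thetas.append(theta)
    return docs, thetas, phi

vocab = ["graph", "trees", "system", "user", "time"]
docs, thetas, phi = generate_corpus(num_docs=3, doc_len=8, num_topics=2,
                                    vocab=vocab, alpha=0.5, beta=0.5, seed=0)
for words in docs:
    print(words)
```

Small values of `alpha` and `beta` concentrate each text on few topics and each topic on few words, which is the regime in which topics are easiest to tell apart.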

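4. Learning and inference for the LDA model cannot be solved directly. The methods usually adopted are Gibbs sampling and the variational EM algorithm; the former is a Monte Carlo method and the latter is an approximate algorithm.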

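5. The basic idea of the collapsed Gibbs sampling algorithm for LDA is as follows. The goal is to estimate the joint distribution $p(w, z, \theta, \varphi \mid \alpha, \beta)$. Integrating out the latent variables $\theta$ and $\varphi$ gives the marginal distribution $p(w, z \mid \alpha, \beta)$; Gibbs sampling from $p(z \mid w, \alpha, \beta)$ yields random samples from that distribution; the samples are then used to estimate $z$, $\theta$, and $\varphi$, which finally gives the parameter estimates of the LDA model. The concrete algorithm is as follows. Given the word sequence of the texts, randomly assign a topic to every position, so that the assignments together form a topic sequence. Then repeat the following: scan the whole text sequence and, at each position, compute the full conditional distribution of the topic at that position, draw a random sample from it, and assign the resulting new topic to that position.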
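The sampling loop described above can be sketched as follows. This is a minimal illustration, not the author's code: it assumes symmetric hyperparameters `alpha` and `beta`, texts given as lists of integer word ids, and the widely used count-based form of the full conditional, in which the probability of topic k at a position is proportional to (n_kv + beta) / (n_k + V * beta) * (n_mk + alpha) with the current position excluded from the counts. The function name `collapsed_gibbs` is hypothetical.

```python
import random

def collapsed_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
                    iters=200, seed=0):
    rng = random.Random(seed)
    n_mk = [[0] * num_topics for _ in docs]                # text-topic counts
    n_kv = [[0] * vocab_size for _ in range(num_topics)]   # topic-word counts
    n_k = [0] * num_topics                                 # words per topic
    z = []
    # Initialization: assign a random topic to every position.
    for m, doc in enumerate(docs):
        z.append([rng.randrange(num_topics) for _ in doc])
        for w, k in zip(doc, z[m]):
            n_mk[m][k] += 1
            n_kv[k][w] += 1
            n_k[k] += 1
    for _ in range(iters):
        for m, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = z[m][i]
                n_mk[m][k] -= 1
                n_kv[k][w] -= 1
                n_k[k] -= 1
                # Unnormalized full conditional of the topic at this position.
                weights = [(n_kv[t][w] + beta) / (n_k[t] + vocab_size * beta)
                           * (n_mk[m][t] + alpha) for t in range(num_topics)]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Record the new assignment.
                z[m][i] = k
                n_mk[m][k] += 1
                n_kv[k][w] += 1
                n_k[k] += 1
    # Point estimates of theta and phi from the final counts.
    theta = [[(n_mk[m][k] + alpha) / (len(doc) + num_topics * alpha)
              for k in range(num_topics)] for m, doc in enumerate(docs)]
    phi = [[(n_kv[k][v] + beta) / (n_k[k] + vocab_size * beta)
            for v in range(vocab_size)] for k in range(num_topics)]
    return theta, phi, z

toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]  # word ids in a 4-word vocabulary
theta, phi, z = collapsed_gibbs(toy_docs, num_topics=2, vocab_size=4, iters=50)
for row in theta:
    print([round(p, 3) for p in row])
```

The sketch estimates `theta` and `phi` from a single final sample; averaging over several samples taken after a burn-in period is a common refinement.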

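6. The basic idea of variational inference is as follows. Suppose the model is a joint distribution $p(x, z)$, where $x$ is the observed variable (the data) and $z$ is the latent variable. The goal is to learn the posterior distribution $p(z \mid x)$ of the model. Consider approximating the conditional distribution $p(z \mid x)$ with a variational distribution $q(z)$, and measure the similarity of the two with the KL divergence $D(q(z) \,\|\, p(z \mid x))$. Find the $q^*(z)$ that is closest to $p(z \mid x)$ in the sense of KL divergence, and use this distribution to approximate $p(z \mid x)$. Assume that all components of $z$ in $q(z)$ are mutually independent. Using Jensen's inequality, minimizing the KL divergence can be achieved by maximizing the evidence lower bound. Variational inference therefore becomes the problem of maximizing the following evidence lower bound:

$$L(q) = E_q[\log p(x, z)] - E_q[\log q(z)]$$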
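The relation between the evidence lower bound and the KL divergence can be verified on a toy discrete model, using the standard definition of the bound as E_q[log p(x, z)] - E_q[log q(z)] and the identity log p(x) = ELBO(q) + KL(q || p(z|x)). The numbers in `joint` below are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical joint p(x, z) for one fixed observation x and a latent z with three values.
joint = {"z1": 0.10, "z2": 0.25, "z3": 0.05}
evidence = sum(joint.values())                           # p(x)
log_evidence = math.log(evidence)                        # log p(x)
posterior = {z: p / evidence for z, p in joint.items()}  # p(z | x)

def elbo(q):
    """E_q[log p(x, z)] - E_q[log q(z)] for a discrete variational distribution q."""
    return sum(qz * (math.log(joint[z]) - math.log(qz))
               for z, qz in q.items() if qz > 0)

def kl(q, p):
    """KL divergence D(q || p) between two discrete distributions."""
    return sum(qz * (math.log(qz) - math.log(p[z]))
               for z, qz in q.items() if qz > 0)

q = {"z1": 0.5, "z2": 0.3, "z3": 0.2}        # an arbitrary variational distribution
print(elbo(q) + kl(q, posterior), log_evidence)  # the two numbers agree
print(elbo(posterior), log_evidence)             # the bound is tight at q = p(z | x)
```

Because the KL divergence is non-negative, the bound never exceeds log p(x), and raising the bound over q is the same as lowering the KL divergence to the posterior.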

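7. The variational EM algorithm for LDA is as follows. Define a variational distribution for the LDA model and apply the variational EM algorithm. The goal is to maximize the evidence lower bound $L(\gamma, \eta, \alpha, \varphi)$, where $\alpha$ and $\varphi$ are the model parameters and $\gamma$ and $\eta$ are the variational parameters. Alternate the E-step and the M-step until convergence.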

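(1) E-step: fix the model parameters $\alpha$, $\varphi$ and estimate the variational parameters $\gamma$, $\eta$ by maximizing the evidence lower bound with respect to $\gamma$, $\eta$.
(2) M-step: fix the variational parameters $\gamma$, $\eta$ and estimate the model parameters $\alpha$, $\varphi$ by maximizing the evidence lower bound with respect to $\alpha$, $\varphi$.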
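As an illustration of the E-step, the sketch below runs the standard coordinate-ascent updates for a single text. The notation is an assumption made here: `alpha` and `phi` are the model parameters (Dirichlet prior and topic-word probabilities), `gamma` and `eta` are the variational parameters, and the updates are eta[n][k] proportional to phi[k][w_n] * exp(digamma(gamma[k])) and gamma[k] = alpha[k] + sum over n of eta[n][k]. The `digamma` helper is an approximation written for this example because the Python standard library has no digamma function; the M-step is not shown.

```python
import math

def digamma(x):
    """Approximate digamma for x > 0: recurrence up to x >= 6, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))

def e_step(doc, alpha, phi, iters=50):
    """Update the variational parameters (gamma, eta) of one text, model parameters fixed."""
    num_topics = len(alpha)
    gamma = [a + len(doc) / num_topics for a in alpha]
    eta = [[1.0 / num_topics] * num_topics for _ in doc]
    for _ in range(iters):
        # The term -digamma(sum(gamma)) is the same for every topic and
        # cancels when eta[n] is normalized, so it is omitted.
        psi = [digamma(g) for g in gamma]
        for n, w in enumerate(doc):
            unnorm = [phi[k][w] * math.exp(psi[k]) for k in range(num_topics)]
            total = sum(unnorm)
            eta[n] = [u / total for u in unnorm]
        gamma = [alpha[k] + sum(row[k] for row in eta) for k in range(num_topics)]
    return gamma, eta

# Hypothetical model parameters: two topics over a four-word vocabulary.
phi = [[0.45, 0.45, 0.05, 0.05],
       [0.05, 0.05, 0.45, 0.45]]
alpha = [0.5, 0.5]
gamma, eta = e_step([0, 1, 0], alpha, phi)   # a text made of word ids 0 and 1
print([round(g, 3) for g in gamma])
```

After convergence, `gamma` minus `alpha` can be read as the expected number of positions in the text assigned to each topic.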

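Latent Dirichlet allocation (LDA), a topic model based on Bayesian learning, is an extension of latent semantic analysis and probabilistic latent semantic analysis. It was proposed by Blei et al. in 2002. LDA is widely used in text data mining, image processing, bioinformatics, and other fields.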

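The LDA model is a generative probabilistic model of a text collection. It assumes that each text is represented by a multinomial distribution over topics and each topic by a multinomial distribution over words. In particular, it assumes that the prior of the topic distribution of a text is a Dirichlet distribution and that the prior of the word distribution of a topic is also a Dirichlet distribution. Introducing these priors lets LDA cope better with overfitting when learning topic models.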

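LDA generates a text collection as follows. First randomly generate the topic distribution of a text. Then, at each position of that text, randomly generate a topic from the topic distribution of the text, and at that position randomly generate a word from the word distribution of that topic. Continue until the last position of the text, which completes the text. Repeat this process to generate all texts.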

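The LDA model is a probabilistic graphical model with latent variables. In the model, the word distribution of each topic, the topic distribution of each text, and the topic at each position of each text are latent variables; the word at each position of each text is an observed variable. Learning and inference for the LDA model cannot be solved directly. Gibbs sampling and the variational EM algorithm are usually used; the former is a Monte Carlo method and the latter is an approximate algorithm.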

    from gensim import corpora, models, similarities
    from pprint import pprint
    import warnings

    f = open('data/LDA_test.txt')
    stop_list = set('for a of the and to in'.split())
    # texts = [line.strip().split() for line in f]
    # print 'Before'
    # pprint(texts)
    print('After')

Output:

    After

    # Lowercase each line, split on whitespace, and drop stop words.
    texts = [[word for word in line.strip().lower().split() if word not in stop_list]
             for line in f]
    print('Text = ')
    pprint(texts)

Output:

    Text = 
    [['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
     ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
     ['eps', 'user', 'interface', 'management', 'system'],
     ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
     ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
     ['generation', 'random', 'binary', 'unordered', 'trees'],
     ['interp', 'graph', 'paths', 'trees'],
     ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
     ['graph', 'minors', 'survey']]

    dictionary = corpora.Dictionary(texts)
    print(dictionary)

Output:

    Dictionary(35 unique tokens: ['abc', 'applications', 'computer', 'human', 'interface']...)

    V = len(dictionary)
    corpus = [dictionary.doc2bow(text) for text in texts]
    corpus_tfidf = models.TfidfModel(corpus)[corpus]
    # This line replaces the TF-IDF weights with the raw bag-of-words counts,
    # which is why the output below shows integer counts.
    corpus_tfidf = corpus
    print('TF-IDF:')
    for c in corpus_tfidf:
        print(c)

Output:

    TF-IDF:
    [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)]
    [(2, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)]
    [(4, 1), (10, 1), (12, 1), (13, 1), (14, 1)]
    [(3, 1), (10, 2), (13, 1), (15, 1), (16, 1)]
    [(8, 1), (11, 1), (12, 1), (17, 1), (18, 1), (19, 1), (20, 1)]
    [(21, 1), (22, 1), (23, 1), (24, 1), (25, 1)]
    [(24, 1), (26, 1), (27, 1), (28, 1)]
    [(24, 1), (26, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)]
    [(9, 1), (26, 1), (30, 1)]

    print('\nLSI Model:')
    lsi = models.LsiModel(corpus_tfidf, num_topics=2, id2word=dictionary)
    topic_result = [a for a in lsi[corpus_tfidf]]
    pprint(topic_result)

Output:

    LSI Model:
    [[(0, 0.9334981916792652), (1, 0.10508952614086528)],
     [(0, 2.031992374687025), (1, -0.047145314121742235)],
     [(0, 1.5351342836582078), (1, 0.13488784052204628)],
     [(0, 1.9540077194594532), (1, 0.21780498576075008)],
     [(0, 1.2902472956004092), (1, -0.0022521437499372337)],
     [(0, 0.022783081905505403), (1, -0.7778052604326754)],
     [(0, 0.05671567576920905), (1, -1.1827703446704851)],
     [(0, 0.12360003320647955), (1, -2.6343068608236835)],
     [(0, 0.23560627195889133), (1, -0.9407936203668315)]]

    print('LSI Topics:')
    pprint(lsi.print_topics(num_topics=2, num_words=5))

Output:

    LSI Topics:
    [(0,
      '0.579*"system" + 0.376*"user" + 0.270*"eps" + 0.257*"time" + '
      '0.257*"response"'),
     (1,
      '-0.480*"graph" + -0.464*"trees" + -0.361*"minors" + -0.266*"widths" + '
      '-0.266*"ordering"')]

    similarity = similarities.MatrixSimilarity(lsi[corpus_tfidf])  # similarities.Similarity()
    print('Similarity:')
    pprint(list(similarity))

Output:

    Similarity:
    [array([1., 0.9908607, 0.9997008, 0.9999994, 0.9935261, -0.08272626, -0.06414512, -0.06517283, 0.13288835], dtype=float32),
     array([0.9908607, 0.99999994, 0.9938636, 0.99100804, 0.99976987, 0.0524564, 0.07105229, 0.070025, 0.2653665], dtype=float32),
     array([0.9997008, 0.9938636, 0.99999994, 0.999727, 0.99600756, -0.05832579, -0.03971674, -0.04074576, 0.15709123], dtype=float32),
     array([0.9999994, 0.99100804, 0.999727, 1., 0.9936501, -0.08163348, -0.06305084, -0.06407862, 0.13397504], dtype=float32),
     array([0.9935261, 0.99976987, 0.99600756, 0.9936501, 0.99999994, 0.03102366, 0.04963995, 0.04861134, 0.24462426], dtype=float32),
     array([-0.08272626, 0.0524564, -0.05832579, -0.08163348, 0.03102366, 0.99999994, 0.99982643, 0.9998451, 0.97674036], dtype=float32),
     array([-0.06414512, 0.07105229, -0.03971674, -0.06305084, 0.04963995, 0.99982643, 1., 0.9999995, 0.9805657], dtype=float32),
     array([-0.06517283, 0.070025, -0.04074576, -0.06407862, 0.04861134, 0.9998451, 0.9999995, 1., 0.9803632], dtype=float32),
     array([0.13288835, 0.2653665, 0.15709123, 0.13397504, 0.24462426, 0.97674036, 0.9805657, 0.9803632, 1.], dtype=float32)]

    print('\nLDA Model:')
    num_topics = 2
    lda = models.LdaModel(corpus_tfidf, num_topics=num_topics, id2word=dictionary,
                          alpha='auto', eta='auto', minimum_probability=0.001, passes=10)
    doc_topic = [doc_t for doc_t in lda[corpus_tfidf]]
    print('Document-Topic:\n')
    pprint(doc_topic)

Output:

    LDA Model:
    Document-Topic:

    [[(0, 0.02668742), (1, 0.97331256)],
     [(0, 0.9784582), (1, 0.021541778)],
     [(0, 0.9704323), (1, 0.02956772)],
     [(0, 0.97509205), (1, 0.024907947)],
     [(0, 0.9785106), (1, 0.021489413)],
     [(0, 0.9703556), (1, 0.029644381)],
     [(0, 0.04481229), (1, 0.9551877)],
     [(0, 0.023327617), (1, 0.97667235)],
     [(0, 0.058409944), (1, 0.9415901)]]

    for doc_topic in lda.get_document_topics(corpus_tfidf):
        print(doc_topic)

Output:

    [(0, 0.026687337), (1, 0.9733126)]
    [(0, 0.9784589), (1, 0.021541081)]
    [(0, 0.97043234), (1, 0.029567692)]
    [(0, 0.9750935), (1, 0.024906479)]
    [(0, 0.9785101), (1, 0.021489937)]
    [(0, 0.9703557), (1, 0.029644353)]
    [(0, 0.044812497), (1, 0.9551875)]
    [(0, 0.02332762), (1, 0.97667235)]
    [(0, 0.058404233), (1, 0.9415958)]

    for topic_id in range(num_topics):
        print('Topic', topic_id)
        # pprint(lda.get_topic_terms(topicid=topic_id))
        pprint(lda.show_topic(topic_id))
    similarity = similarities.MatrixSimilarity(lda[corpus_tfidf])
    print('Similarity:')
    pprint(list(similarity))

    # HdpModel is gensim's hierarchical Dirichlet process model;
    # the variable name "hda" and the "HDA" labels are kept from the original code.
    hda = models.HdpModel(corpus_tfidf, id2word=dictionary)
    topic_result = [a for a in hda[corpus_tfidf]]
    print('\n\nUSE WITH CARE--\nHDA Model:')
    pprint(topic_result)
    print('HDA Topics:')
    print(hda.print_topics(num_topics=2, num_words=5))

Output:

    Topic 0
    [('system', 0.094599016),
     ('user', 0.073440075),
     ('eps', 0.052545987),
     ('response', 0.052496374),
     ('time', 0.052453455),
     ('survey', 0.031701956),
     ('trees', 0.03162545),
     ('human', 0.03161709),
     ('computer', 0.031570844),
     ('testing', 0.031543963)]
    Topic 1
    [('graph', 0.0883405),
     ('trees', 0.06323685),
     ('minors', 0.06296622),
     ('interface', 0.03810195),
     ('computer', 0.03798469),
     ('human', 0.03792907),
     ('applications', 0.03792245),
     ('abc', 0.037920628),
     ('machine', 0.037917122),
     ('lab', 0.037909806)]
    Similarity:
    [array([1., 0.04940351, 0.05783966, 0.05292428, 0.04934979, 0.05791992, 0.99981046, 0.99999374, 0.99940336], dtype=float32),
     array([0.04940351, 1., 0.99996436, 0.9999938, 1., 0.99996364, 0.06883725, 0.04587576, 0.08387101], dtype=float32),
     array([0.05783966, 0.99996436, 1.0000001, 0.99998796, 0.99996394, 1., 0.07726298, 0.05431345, 0.09228647], dtype=float32),
     array([0.05292428, 0.9999938, 0.99998796, 1., 0.9999936, 0.9999875, 0.07235384, 0.04939714, 0.08738345], dtype=float32),
     array([0.04934979, 1., 0.99996394, 0.9999936, 1., 0.99996316, 0.06878359, 0.04582203, 0.08381741], dtype=float32),
     array([0.05791992, 0.99996364, 1., 0.9999875, 0.99996316, 0.99999994, 0.07734313, 0.05439373, 0.09236652], dtype=float32),
     array([0.99981046, 0.06883725, 0.07726298, 0.07235384, 0.06878359, 0.07734313, 0.99999994, 0.9997355, 0.9998863], dtype=float32),
     array([0.99999374, 0.04587576, 0.05431345, 0.04939714, 0.04582203, 0.05439373, 0.9997355, 0.99999994, 0.9992751], dtype=float32),
     array([0.99940336, 0.08387101, 0.09228647, 0.08738345, 0.08381741, 0.09236652, 0.9998863, 0.9992751, 1.], dtype=float32)]


    USE WITH CARE--
    HDA Model:
    [[(0, 0.18174982193320122), (1, 0.02455260642448283), (2, 0.741340573910992), (3, 0.013544078061059922), (4, 0.010094377639823477)],
     [(0, 0.39419292675663636), (1, 0.2921969355337328), (2, 0.26125786014858376), (3, 0.013539627392486701), (4, 0.01009410883245766)],
     [(0, 0.5182077872999125), (1, 0.3880947736463974), (2, 0.023895609845034207), (3, 0.01805202212531745), (4, 0.013458421673222807)],
     [(0, 0.03621384798236036), (1, 0.5504573172680752), (2, 0.020442846194997377), (3, 0.348529241707211), (4, 0.011535562414627153)],
     [(0, 0.9049762450848856), (1, 0.024748801100993395), (2, 0.017919024335434904), (3, 0.013543460312481508), (4, 0.010093932388992328)],
     [(0, 0.04681359723231631), (1, 0.03233799461088905), (2, 0.8510430252219996), (3, 0.01805587061936895), (4, 0.013458128836093802)],
     [(0, 0.42478083784052273), (1, 0.03858547281122597), (2, 0.4528531768644199), (3, 0.021680841796584305), (4, 0.016150009359845837), (5, 0.011953757612369628)],
     [(0, 0.2466808290730598), (1, 0.6908552821243853), (2, 0.015924569811569197), (3, 0.012039668311419834)],
     [(0, 0.500366457263008), (1, 0.048221177670061226), (2, 0.34671234963274666), (3, 0.02707530995137571), (4, 0.02018763747377598), (5, 0.014942188361070167), (6, 0.010992923111633942)]]
    HDA Topics:
    [(0, '0.122*graph + 0.115*minors + 0.098*management + 0.075*random + 0.063*error'), (1, '0.114*human + 0.106*system + 0.086*user + 0.064*iv + 0.063*measurement')]

Code reference: Zou Bo (鄒博)

Download

https://github.com/fengdu78/lihang-code

References:

[1] "Statistical Learning Methods" (《統計學習方法》): https://baike.baidu.com/item/統計學習方法/10430179

[2] Huang Haiguang (黃海廣): https://github.com/fengdu78

[3] github: https://github.com/fengdu78/lihang-code

[4] wzyonggege: https://github.com/wzyonggege/statistical-learning-method

[5] WenDesi: https://github.com/WenDesi/lihang_book_algorithm

[6] 火燙火燙的: https://blog.csdn.net/tudaodiaozhale

[7] hktxt: https://github.com/hktxt/Learn-Statistical-Learning-Method
