Machine learning project setup experiment: where2go
https://github.com/da248/where2go
This project feels pretty good overall. It doesn't give download links for the individual datasets and has some baffling bugs, but the error messages are fairly complete, so there was always a way to make progress.
(Checked out a similarly named repo, which doesn't seem useful: it's pure HTML calling an API — https://github.com/alex-engelmann/Where2Go)
目錄
1) Gathering Data
wikivoyage_xml_to_json.py
New York Times
2) EDA
Wiki voyage
更改默認(rèn)保存位置
Weather
Nyt
3) Model
Webapp
Final Remarks
啟動(dòng)項(xiàng)目
Anaconda常用命令
常用算法
自然語(yǔ)言處理natural language processing:
推薦系統(tǒng)中常見(jiàn)的文本處理方法:
Word2vec原理
網(wǎng)站細(xì)節(jié)
Html
Flask后臺(tái)接口
Model
模型分析
Eda
Model
基于 H-softmax 模型的梯度計(jì)算
Where2go recommends places based on the places/traits you like or dislike, rather than on which destinations have cheap flights.
There are plenty of websites that tell you the cheapest way to get to a destination and the cheapest hotel once you're there. But they forget to ask a very fundamental question... do you even know where to go? Articles like "Top 25 travel destinations in XX" or "100 places in YY you must visit!"
One motivation for this app was to build an unbiased recommender system that considers a destination's own characteristics, instead of looking at which destinations other people liked. To that end, I decided to gather destination information from travel guides. I found that Wikivoyage offers great travel guides that tell you about a place's history and culture, what to see, how to get around, and so on.
Try it out on www.where2go.help
The next question was which model to use. Traditional NLP recommender systems include models such as TF-IDF + cosine similarity and TF-IDF + SVD + k-means clustering. Those models can do a great job of finding similar destinations, but I wanted a model that would let me add place traits such as 'beach' or 'wine' to my search. So I decided to go with a model recently created by Google called word2vec. Word2vec is an amazing model that turns words into vectors capturing the words' "meaning". The cool feature of this model is that you can add and subtract words, because they are vectors. For example, you can do things like 'king' - 'man' + 'woman', which produces a vector that ~= 'queen'. My word2vec model learned the travel-specific context of the words and places covered in the Wikivoyage articles, allowing vector operations to recommend similar locations.
Using word2vec, I was able to get recommendations for the words/destinations whose semantic meaning is closest to the search query. However, I had to figure out a way to determine which recommendations were actual geographic locations and which were merely nearby words. I was able to check this using Wikivoyage's geolocation data.
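That filtering step amounts to a membership check: keep only the nearest-neighbour words that appear as keys in the geotag dictionary. A sketch with made-up names and similarity scores (the real geotag dict is the one built later from the Wikivoyage SQL dump):

```python
# Hypothetical most_similar output: (word, similarity) pairs from the model
similar = [('barcelona', 0.81), ('sangria', 0.77), ('valencia', 0.74), ('tapas', 0.70)]

# Hypothetical geotag dict keyed by place name
geotags = {'barcelona': {'gt_lat': 41.39, 'gt_lon': 2.17},
           'valencia':  {'gt_lat': 39.47, 'gt_lon': -0.38}}

# Keep only the suggestions that are actual places
destinations = [(word, sim) for word, sim in similar if word in geotags]
print(destinations)  # [('barcelona', 0.81), ('valencia', 0.74)]
```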
Once I had trained the travel-context model, I built a web application to deliver my data science project. I used JavaScript to make AJAX calls that update the results of a user query onto a MapBox map, and Bootstrap to format the pages.

*I also collected New York Times articles from the Travel, World, and Science sections (the latter has many environmental pieces) to enrich my data sources, but decided to leave them out because the results came back too "news-like".
Methodology
The code folder is divided into three sections: 1) data collection, 2) EDA, 3) model.
1) Gathering Data
Wikivoyage
There are three files for the wikivoyage data.
wikivoyage_xml_to_json.py
The purpose of this file is to convert Wikivoyage travel guide articles to JSON format. Wikivoyage provided a data dump of its articles in XML format and I converted it to JSON format to go through exploratory data analysis with pandas.
運(yùn)行:
| ImportError: No module named xmltodict |
圖形化界面安裝
| ImportError: No module named pandas |
圖形化界面安裝
| Traceback (most recent call last): ? File "wikivoyage_xml_to_json.py", line 25, in <module> ??? jdata = convert_xml_to_json('data/wikivoyage/enwikivoyage-latest-pages-articles.xml') ? File "wikivoyage_xml_to_json.py", line 12, in convert_xml_to_json ??? xml_str = open(filename).read() IOError: [Errno 2] No such file or directory: 'data/wikivoyage/enwikivoyage-latest-pages-articles.xml' |
在https://dumps.wikimedia.org/enwikivoyage/latest/找數(shù)據(jù)集
| (py2_flask) D:\anacondaProject\where2go\code\data_collection>python wikivoyage_xml_to_json.py |
成功運(yùn)行該文件后在where2go\code\data_collection\data\wikivoyage獲得wikivoyage.json一份,耶
2.wikivoyage_geotags_sql.py
The purpose of this file is to gather the geolocations of articles (places). Wikivoyage provided the geolocations of articles as a sql file. I created my own MySQL database to load in and query the data. I also did a bit of data cleaning in this file to remove the accents.
維基航行_地理標(biāo)記_sql.py
此文件的目的是收集文章(地點(diǎn))的地理位置。Wikivoyage 提供了文章的地理位置作為 sql 文件。我創(chuàng)建自己的 MySQL 數(shù)據(jù)庫(kù)來(lái)加載和查詢數(shù)據(jù)。我還在這個(gè)文件做了一些數(shù)據(jù)清理,刪除口音。
(py2_flask) D:\anacondaProject\where2go\code\data_collection>python wikivoyage_geotags_sql.py
  File "wikivoyage_geotags_sql.py", line 72
    geotag_dict = create_geotag_dict():
                                      ^
SyntaxError: invalid syntax
Try deleting that stray trailing colon.
| No module named pymysql.cursors |
pip install pymysql
Traceback (most recent call last):
  File "wikivoyage_geotags_sql.py", line 9, in <module>
    cursorclass=pymysql.cursors.DictCursor)
pymysql.err.OperationalError: (1045, u"Access denied for user 'admin'@'localhost' (using password: NO)")
查看源碼:
| # Connect to the database connection = pymysql.connect(user='admin', ???????????????????????????? db='wiki', cursorclass=pymysql.cursors.DictCursor) |
查看連接方法:https://www.cnblogs.com/woider/p/5926744.html
| pymysql.Connect()參數(shù)說(shuō)明 host(str):????? MySQL服務(wù)器地址 port(int):????? MySQL服務(wù)器端口號(hào) user(str):????? 用戶名 passwd(str):??? 密碼 db(str):??????? 數(shù)據(jù)庫(kù)名稱 charset(str):?? 連接編碼 ? connection對(duì)象支持的方法 cursor()??????? 使用該連接創(chuàng)建并返回游標(biāo) commit()??????? 提交當(dāng)前事務(wù) rollback()????? 回滾當(dāng)前事務(wù) close()???????? 關(guān)閉連接 ? cursor對(duì)象支持的方法 execute(op)???? 執(zhí)行一個(gè)數(shù)據(jù)庫(kù)的查詢命令 fetchone()????? 取得結(jié)果集的下一行 fetchmany(size) 獲取結(jié)果集的下幾行 fetchall()????? 獲取結(jié)果集中的所有行 rowcount()????? 返回?cái)?shù)據(jù)條數(shù)或影響行數(shù) close()???????? 關(guān)閉游標(biāo)對(duì)象 |
修改連接時(shí)用戶名密碼,創(chuàng)建數(shù)據(jù)庫(kù)
| pymysql.err.ProgrammingError: (1146, u"Table 'wiki.geo_tags' doesn't exist") |
看項(xiàng)目介紹中Wikivoyage 提供了文章的地理位置作為 sql 文件,繼續(xù)找數(shù)據(jù)集https://github.com/baturin/wikivoyage-listings
還是在這里找到(霧):https://dumps.wikimedia.org/hewikivoyage/latest/
| pymysql.err.ProgrammingError: (1146, u"Table 'wiki.page' doesn't exist") |
還在剛剛的頁(yè)面找到pages.sql
下載的一個(gè)sql貌似不是英文,(?????_?????????)
看到這個(gè)貌似是官網(wǎng)https://www.wikidata.org/wiki/Wikidata:Wikivoyage/Resources
全語(yǔ)言長(zhǎng)這樣:https://www.wikivoyage.org/
英文版的地址長(zhǎng)這樣:https://en.wikivoyage.org/
同理類推:https://dumps.wikimedia.org/enwikivoyage/latest/
成功找到英文版sql
| IOError: [Errno 2] No such file or directory: '../data/geotag_dict.pkl' |
查看源碼為輸出文件,新建
成功運(yùn)行
3.scrap_wikivoyage_banners.py
This file contains code that I used to scrap the banner images of articles from wikivoyage. I also used this to collect the canonical url of the wikivoyage page. I had to search destinations using a special search page on Wikivoyage to overcome minor syntax differences in place names.
此文件包含用于從 wikivoyage 中抓取文章的橫幅圖像的代碼。我也用這個(gè)來(lái)收集wikivoyage page的標(biāo)準(zhǔn)URL。我不得不在Wikivoyage上使用一個(gè)特殊的搜索頁(yè)面搜索目的地,以克服地名中的微小語(yǔ)法差異。
    self.locations = pkl.load(open('../../data/pickles/geotag_dict.pkl', 'rb'))
IOError: [Errno 2] No such file or directory: '../../data/pickles/geotag_dict.pkl'
Copied over the pkl generated earlier.
CONNECTION ERROR!!! RECONNECT TO  page
Traceback (most recent call last):
  File "scrap_wikivoyage_banners.py", line 109, in <module>
    swb.scrap_banners()
  File "scrap_wikivoyage_banners.py", line 95, in scrap_banners
    img_path, wiki_url = self.get_image_and_link(key)
  File "scrap_wikivoyage_banners.py", line 57, in get_image_and_link
    return make_default_img_url(place)
NameError: global name 'make_default_img_url' is not defined
| INDEX ERROR!!!  page did not exist |
查看源碼
def get_image_and_link(self, place):
    '''
    For a given place, get the canonical wikivoyage url and save the banner.
    If the banner is just a default banner, save the img path as the default
    banner to minimize duplicates.

    input: place as string
    output: img_path and wiki_url + (save image in the process)
    '''
    base_url = "https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search="
    full_url = base_url + place.title()

    try:
        response = requests.get(full_url).text
        soup = BeautifulSoup(response, 'html.parser')
        wiki_url = soup.find(rel='canonical')['href']
        img_src = 'https:' + soup.select('div.topbanner a.image')[0].select('img')[0]['src']

    except IndexError:
        print 'INDEX ERROR!!! %s page did not exist' % place
        return make_default_img_url(place)

    except ConnectionError:
        print 'CONNECTION ERROR!!! RECONNECT TO %s page' % place
        return make_default_img_url(place)

    if 'Pagebanner_default' in img_src or 'default_banner' in img_src:
        print '%s has default banner!' % place
        img_path = 'static/banners/default.png'

    else:
        place = place.replace('/', '_')  # REPLACE SLASH BECAUSE IT CREATES A DIRECTORY

        try:
            img_response = requests.get(img_src, stream=True)
            img_path = 'static/banners/%s.png' % place

        except IndexError:
            print 'INDEX ERROR!!! %s page did not exist' % place
            return make_default_img_url(place)

        except ConnectionError:
            print 'CONNECTION ERROR!!! RECONNECT TO %s page' % place
            return make_default_img_url(place)

        # save the img file if it doesn't already exist. if it already exists, dont overwrite.
        if not os.path.exists('../../webapp/static/banners/%s.png' % place):
            with open('../../webapp/static/banners/%s.png' % place, 'wb') as out_file:
                shutil.copyfileobj(img_response.raw, out_file)
            del img_response
            print '%s.png successfully created' % place

        else:
            print '%s.png already exists!' % place

    return img_path, wiki_url
def scrap_banners(self):
    '''
    Go through every key in the locations dictionary and scrape the wiki url and img_path.
    '''
    for key in self.locations.iterkeys():
        # print 'key %s,' % key
        img_path, wiki_url = self.get_image_and_link(key)
        self.locations[key]['wiki_url'] = wiki_url
        self.locations[key]['img_path'] = img_path
def load_location(self):
    '''
    load the geolocation data.
    '''
    self.locations = pkl.load(open('../../data/pickles/geotag_dict.pkl', 'rb'))
| 看來(lái)還是pkl中的location出問(wèn)題了,查看pkl import cPickle as pickle? ??? f = open('path')? ??? info = pickle.load(f)? ??? print info?? #show file? |
| {'': {u'gt_lat': Decimal('56.83330000'), u'page_id': 18192, u'gt_lon': Decimal('60.58330000'), u'page_len': 27110}, '__': {u'gt_lat': Decimal('49.85944444'), u'page_id': 13920, u'gt_lon': Decimal('20.27472222'), u'page_len': 3453}, "_(')": {u'gt_lat': Decimal('-53.32000000'), u'page_id': 14305, u'gt_lon': Decimal('-70.91000000'), u'page_len': 3408}, "__'/": {u'gt_lat': Decimal('-22.92000000'), u'page_id': 13410, u'gt_lon': Decimal('-43.22000000'), u'page_len': 56927}, "/-'_": {u'gt_lat': Decimal('41.94610000'), u'page_id': 14123, u'gt_lon': Decimal('-87.66940000'), u'page_len': 28496},…… |
| 這些key真的詭異極了 |
嘗試打印full url
| CONNECTION ERROR!!! RECONNECT TO https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search= page |
嘗試打印key
保存副本
更改sql語(yǔ)言版本后成功獲得正確key
INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Eastern_Cuba -- did not exist
Traceback (most recent call last):
  File "scrap_wikivoyage_banners.py", line 115, in <module>
    swb.scrap_banners()
  File "scrap_wikivoyage_banners.py", line 101, in scrap_banners
    img_path, wiki_url = self.get_image_and_link(key)
  File "scrap_wikivoyage_banners.py", line 57, in get_image_and_link
    return make_default_img_url(place)
NameError: global name 'make_default_img_url' is not defined
嘗試訪問(wèn)url:
https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Eastern_Cuba
發(fā)get請(qǐng)求的狀態(tài)碼是302
wiki可以正常訪問(wèn),但不是這個(gè)網(wǎng)址,跳轉(zhuǎn)到
| https://en.wikivoyage.org/wiki/Eastern_Cuba 和make_default_img_url中的地址一樣呢 https://en.wikivoyage.org/wiki/ |
但是https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default
這個(gè)搜索頁(yè)面還在
用搜索框搜索查看發(fā)出的請(qǐng)求是
| https://en.wikivoyage.org/w/index.php?=Eastern_Cuba&sort=relevance&search=Eastern_Cuba&title=Special%3ASearch&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1 |
Adding &fulltext=1 (or when there is no exact title match) prevents the redirect:
https://en.wikivoyage.org/w/index.php?search=Eastern_Cuba&title=Special%3ASearch&profile=advanced&fulltext=1
So the URL itself is probably fine.

make_default_img_url is not a global name; it is defined as a method, but the code calls it as if it were a global function. Some mistake is treating it as one.
Try fixing make_default_img_url and its call sites:
def make_default_img_url(self, place):
    '''
    input = place
    output = return the default values for img_path and wiki_url
    '''
    img_path = 'static/banners/default.png'
    wiki_url = 'https://en.wikivoyage.org/wiki/%s' % place
    return img_path, wiki_url

except IndexError:
    # print 'INDEX ERROR!!! %s page did not exist' % place
    print 'INDEX ERROR!!! %s -- did not exist' % full_url
    # return make_default_img_url(place)
    return self.make_default_img_url(place)

After the change some pages still can't be fetched, but the script now runs continuously; the final error:
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\requests\models.py", line 753, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ("Connection broken: error(10053, '')", error(10053, ''))
在特殊搜索界面看到一個(gè)Developers:
https://www.mediawiki.org/wiki/How_to_contribute
網(wǎng)頁(yè)API:https://www.mediawiki.org/wiki/API:Web_APIs_hub
API:Geosearch:https://www.mediawiki.org/wiki/API:Geosearch
GET 請(qǐng)求用地理位置的附近坐標(biāo)或頁(yè)面名稱搜索 wiki 頁(yè)面。
This module is supported through the Extension:GeoData currently not installed on MediaWiki but Wikipedia. So, in this document, we will use the URL en.wikipedia.org in all API endpoints.
此模塊通過(guò)擴(kuò)展支持:地理數(shù)據(jù)當(dāng)前未安裝在 MediaWiki 上,而是維基百科。因此,在本文中,我們將在所有 API 終結(jié)點(diǎn)中使用 URL en.wikipedia.org。
?
GET Request
Search for pages near Wikimedia Foundation headquarters by specifying the geographic coordinates of its location:
api.php?action=query&list=geosearch&gscoord=37.7891838|-122.4033522&gsradius=10000&gslimit=10
通過(guò)指定維基媒體基金會(huì)總部附近的頁(yè)面,指定其位置的地理坐標(biāo)
API documentation:https://en.wikipedia.org/w/api.php?action=help&modules=query+geosearch
https://en.wikivoyage.org/w/api.php?action=help&modules=query
API查閱方法https://www.mediawiki.org/wiki/API:Main_page
Examples:
Fetch site info and revisions of Main Page.
api.php?action=query&prop=revisions&meta=siteinfo&titles=Main%20Page&rvprop=user|comment&continue=
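Calling the geosearch endpoint from Python mostly comes down to building the query string. A sketch (Python 3; the parameter names come from the API:Geosearch docs quoted above, the rest is my own illustration):

```python
from urllib.parse import urlencode

def build_geosearch_url(lat, lon, radius=10000, limit=10,
                        endpoint='https://en.wikipedia.org/w/api.php'):
    """Build a MediaWiki geosearch URL (parameter names per the API:Geosearch docs)."""
    params = {
        'action': 'query',
        'list': 'geosearch',
        'gscoord': '{}|{}'.format(lat, lon),
        'gsradius': radius,
        'gslimit': limit,
        'format': 'json',
    }
    return endpoint + '?' + urlencode(params)

# Coordinates of the Wikimedia Foundation HQ, as in the docs example
url = build_geosearch_url(37.7891838, -122.4033522)
print(url)
# To actually fetch the results: requests.get(url).json()
```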
我之前用過(guò)request.urlopen,源碼為requests.get,查看這兩種區(qū)別https://blog.csdn.net/dead_cicle/article/details/86747593
構(gòu)造一個(gè)Request對(duì)象,然后使用urlopen拿回來(lái)的還是對(duì)象
requests是python實(shí)現(xiàn)的簡(jiǎn)單易用的HTTP庫(kù),返回一個(gè)HTTPresp,該類有屬性:text,content,code等。
直接打印的狀態(tài)碼為200,但還是報(bào)錯(cuò),說(shuō)明請(qǐng)求這一步是沒(méi)有問(wèn)題的
| (py2_flask) D:\anacondaProject\where2go\code\data_collection>python scrap_wikivoyage_banners.py <Response [200]> INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Eastern_Cuba -- did n ot exist |
查看bs:
https://blog.csdn.net/weixin_42231070/article/details/82225529
import urllib.request
from bs4 import BeautifulSoup

douban_path = "https://movie.douban.com"
response = urllib.request.urlopen(douban_path)
soup = BeautifulSoup(response, 'html.parser')                          # accepts a response object
soup = BeautifulSoup(response.read().decode('utf-8'), 'html.parser')   # accepts a string
soup = BeautifulSoup(open('test.html'), 'html.parser')                 # accepts a local file
Printing response.text raised an encoding error earlier, but printing the soup dumps a pile of HTML source.
Extracting wiki_url succeeds:
| soup.find(rel='canonical')['href'] |
| (py2_flask) D:\anacondaProject\where2go\code\data_collection>python scrap_wikivoyage_banners.py https://en.wikivoyage.org/wiki/Eastern_Cuba |
所以可能是取img_src的問(wèn)題
| 'https:'+soup.select('div.topbanner a.image')[0].select('img')[0]['src'] |
soup.select reference: https://blog.csdn.net/geerniya/article/details/77842421
soup.select() retrieves the content you want; the key is pinpointing it precisely via the selector string passed in.
https://blog.csdn.net/weixin_40425640/article/details/79470617
select, like find and find_all, picks out particular tags. Its selection rules are CSS-based, hence "CSS selector"; if you have used jQuery before, select's rules will look familiar.
A bare tag name returns a list (so div here is a tag name);
class names are prefixed with a dot, id names with #.

Combined lookups come in two kinds: several conditions on a single tag, or a tree-style search descending level by level.
| print soup.select('a#link2') |
選擇標(biāo)簽名為a,id為link2的tag。
猜測(cè)可能是最后的'src'下標(biāo)無(wú)效
查找select('img') https://www.jianshu.com/p/ed2f044bd1fa
Tag或BeautifulSoup對(duì)象的.select()方法。
res = soup.select('#wrapperto')     # select by tag id
res = soup.select('img[src]')       # 'img' tags that have a 'src' attribute
res = soup.select('img[src=...]')   # 'img' tags whose 'src' attribute equals ...
Finding an img src with soup.select:
https://www.cnblogs.com/calmzone/p/11139980.html
# soup.a.attrs          # all attributes and values of the a tag, as a dict
# soup.a.attrs['href']  # get the href attribute
# soup.a['href']        # shorthand for the same
# both ways retrieve the href value of the a tag
https://blog.csdn.net/weixin_42231070/article/details/82225529
當(dāng)屬性不存在時(shí),使用 get 返回None,字典形式取值會(huì)報(bào)錯(cuò)
| print soup.select('div.topbanner a.image') |
| (py2_flask) D:\anacondaProject\where2go\code\data_collection>python scrap_wikivoyage_banners.py [] |
難道這返回了一個(gè)空數(shù)組,topbanner類的div中根本就沒(méi)有image類的a
查看https://en.wikivoyage.org/wiki/Eastern_Cuba的源碼
發(fā)現(xiàn)含有topbanner類的div是有的,但是有兩個(gè),而且這個(gè)類名字只是包含,是好幾個(gè)類其中有個(gè)wpb-topbanner
一個(gè)div元素為了能被多個(gè)樣式表匹配到(樣式復(fù)用),通常div的class中由好幾段組成,如<div class="user login">能被.user和.login兩個(gè)選擇器選中。如果這兩個(gè)選擇器中有相同的屬性值,則該屬性值先被改為.user中的值,再被改為.login中的值,即重復(fù)的屬性以最后一個(gè)選擇器中的屬性值為準(zhǔn)。(這個(gè)div就有好幾個(gè)類)
嘗試改select中的類名
| ?????????? a_img_tag=soup.select('div.wpb-topbanner a.image') ?????????? print a_img_tag ?????????? # print soup.select('div.topbanner a.image')[0].select('img')[0] ? ?????????? # img_src = 'https:' + soup.select('div.topbanner a.image')[0].select('img')[0]['src'] ?????????? img_src = 'https:' + soup.select('div.wpb-topbanner a.image')[0].select('img')[0]['src'] |
打印不再是空數(shù)組了
(py2_flask) D:\anacondaProject\where2go\code\data_collection>python scrap_wikivoyage_banners.py
[<a class="image" dir="ltr" href="/wiki/File:WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg" title="Eastern Cuba"><img class="wpb-banner-image" src="https://upload.wikimedia.org/wikipedia/commons/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg" srcset="https://upload.wikimedia.org/wikipedia/commons/thumb/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg/640px-WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg 640w,https://upload.wikimedia.org/wikipedia/commons/thumb/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg/1280px-WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg 1280w,https://upload.wikimedia.org/wikipedia/commons/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg 2560w"/></a>]
Traceback (most recent call last):
  File "scrap_wikivoyage_banners.py", line 123, in <module>
    swb.scrap_banners()
  File "scrap_wikivoyage_banners.py", line 109, in scrap_banners
    img_path, wiki_url = self.get_image_and_link(key)
  File "scrap_wikivoyage_banners.py", line 80, in get_image_and_link
    img_response = requests.get(img_src, stream=True)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\requests\api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\requests\sessions.py", line 519, in request
    prep = self.prepare_request(req)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\requests\sessions.py", line 462, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\requests\models.py", line 313, in prepare
    self.prepare_url(url, params)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\requests\models.py", line 390, in prepare_url
    raise InvalidURL("Invalid URL %r: No host supplied" % url)
requests.exceptions.InvalidURL: Invalid URL u'https:https://upload.wikimedia.org/wikipedia/commons/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg': No host supplied
找到的只有一個(gè)a標(biāo)簽,里面也只有一個(gè)img標(biāo)簽
Src中的https://upload.wikimedia.org/wikipedia/commons/7/7d/WV_banner_Easter可以訪問(wèn),出去額外添加的“https:”,報(bào)錯(cuò)
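The root cause: the scraper assumed protocol-relative src values ('//upload.wikimedia.org/...'), but the page now serves absolute URLs, so unconditionally prepending the scheme yields 'https:https://...'. A guard that prepends only when needed (my own sketch, not code from the repo):

```python
def ensure_scheme(src, scheme='https:'):
    """Prepend the scheme only for protocol-relative URLs like '//host/path'."""
    if src.startswith('//'):
        return scheme + src
    return src

print(ensure_scheme('//upload.wikimedia.org/a.jpg'))        # https://upload.wikimedia.org/a.jpg
print(ensure_scheme('https://upload.wikimedia.org/a.jpg'))  # unchanged
```

This keeps the scraper working whether the wiki serves relative or absolute image URLs.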
(py2_flask) D:\anacondaProject\where2go\code\data_collection>python scrap_wikivoyage_banners.py
[<a class="image" dir="ltr" href="/wiki/File:WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg" title="Eastern Cuba"><img class="wpb-banner-image" src="https://upload.wikimedia.org/wikipedia/commons/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg" srcset="https://upload.wikimedia.org/wikipedia/commons/thumb/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg/640px-WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg 640w,https://upload.wikimedia.org/wikipedia/commons/thumb/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg/1280px-WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg 1280w,https://upload.wikimedia.org/wikipedia/commons/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg 2560w"/></a>]
CONNECTION ERROR!!! RECONNECT TO eastern_cuba page
Traceback (most recent call last):
  File "scrap_wikivoyage_banners.py", line 123, in <module>
    swb.scrap_banners()
  File "scrap_wikivoyage_banners.py", line 109, in scrap_banners
    img_path, wiki_url = self.get_image_and_link(key)
  File "scrap_wikivoyage_banners.py", line 89, in get_image_and_link
    return make_default_img_url(place)
NameError: global name 'make_default_img_url' is not defined
嘗試打印請(qǐng)求的response
| (py2_flask) D:\anacondaProject\where2go\code\data_collection>python scrap_wikivoyage_banners.py https://upload.wikimedia.org/wikipedia/commons/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg <Response [200]> Traceback (most recent call last): ? File "scrap_wikivoyage_banners.py", line 124, in <module> ??? swb.scrap_banners() ? File "scrap_wikivoyage_banners.py", line 110, in scrap_banners ??? img_path, wiki_url = self.get_image_and_link(key) ? File "scrap_wikivoyage_banners.py", line 94, in get_image_and_link ??? with open('../../webapp/static/banners/%s.png' % place, 'wb') as out_file: IOError: [Errno 2] No such file or directory: '../../webapp/static/banners/eastern_cuba.png' |
嘗試創(chuàng)建banners
| (py2_flask) D:\anacondaProject\where2go\code\data_collection>python scrap_wikivoyage_banners.py https://upload.wikimedia.org/wikipedia/commons/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg <Response [200]> eastern_cuba.png successfully created https://upload.wikimedia.org/wikipedia/commons/thumb/5/53/Ardrossan_-_SA_WV_Banner.jpg/2560px-Ardrossan_-_SA_WV_Banner.jpg <Response [200]> ardrossan_(south_australia).png successfully created |
不想斷網(wǎng)的時(shí)候爬信息一直往下滾,漏過(guò)了好多,嘗試在爬圖片網(wǎng)址的時(shí)候加了sleep
| import time ? except ConnectionError: ?????????? # print 'CONNECTION ERROR!!! RECONNECT TO %s page' % place ?????????? print 'CONNECTION ERROR!!! RECONNECT TO -- %s ' % full_url ?????????? time.sleep(20) ?????????? # return make_default_img_url(place) ?????????? return self.make_default_img_url(place) |
這樣斷網(wǎng)的時(shí)候error就不會(huì)一直刷屏了,給我一點(diǎn)時(shí)間,把網(wǎng)重新連上,怎么下到一半anaconda還卡了呢= =
這回爬得順利一點(diǎn),少量index error
INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Fjrland&ns0=1 -- img src did not exist
Fjrland can't be found by searching Wikivoyage directly either ("There were no results matching the query"); the dropdown suggests a name with an æ in it (which I can't type). The real link is https://en.wikivoyage.org/wiki/Fj%C3%A6rland

INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Heisy_Bordel/Prague/East_Bank_Of_Vltava&ns0=1 -- img src did not exist
Heisy_Bordel/Prague/East_Bank_Of_Vltava can't be found either; Heisy Bordel is a contributing user of Prague/East_Bank_Of_Vltava. The real link is https://en.wikivoyage.org/wiki/Prague/East_bank_of_Vltava

INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Berlinichthyosaur_State_Park&ns0=1
The dropdown shows Berlin–Ichthyosaur State Park; the real link is https://en.wikivoyage.org/wiki/Berlin%E2%80%93Ichthyosaur_State_Park and the image is https://en.wikivoyage.org/wiki/File:Berlin%E2%80%93Ichthyosaur_State_Park_banner.JPG

INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Hafnarfjorur&ns0=1 -- img src did not exist
The real link is https://en.wikivoyage.org/wiki/Hafnarfj%C3%B6r%C3%B0ur

INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Drivingukbanner1.Jpg&ns0=1 -- img src did not exist
Drivingukbanner1.Jpg is strange: how did a place name turn into a .jpg? And I don't know how to look up the "Driving uk" part either.

INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Owl_Ad_Wouters.Jpg&ns0=1
Possibly https://en.wikivoyage.org/wiki/Ad%27s_Path

INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Snogebk&ns0=1 -- img src did not exist
The real link is https://en.wikivoyage.org/wiki/Snogeb%C3%A6k

INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Nstved&ns0=1 -- img src did not exist
The real link is https://en.wikivoyage.org/wiki/N%C3%A6stved
因?yàn)榫W(wǎng)老斷,需要重復(fù)多次運(yùn)行,每次都重復(fù)請(qǐng)求url然后判斷圖片存在太慢了,先判斷一波
def get_image_and_link(self, place):
    '''
    For a given place, get the canonical wikivoyage url and save the banner.
    If the banner is just a default banner, save the img path as the default
    banner to minimize duplicates.

    input: place as string
    output: img_path and wiki_url + (save image in the process)
    '''
    if not os.path.exists('../../webapp/static/banners/%s.png' % place):
        # look over before request
        base_url = "https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search="
        full_url = base_url + place.title()
        # print 'place %s,' % place
        # print 'place_title %s' % place.title()

        try:
            response = requests.get(full_url).text
            soup = BeautifulSoup(response, 'html.parser')
            wiki_url = soup.find(rel='canonical')['href']
            a_img_tag = soup.select('div.wpb-topbanner a.image')
            # img_src = 'https:' + soup.select('div.topbanner a.image')[0].select('img')[0]['src']
            img_src = soup.select('div.wpb-topbanner a.image')[0].select('img')[0]['src']  # mark

        except IndexError:
            # print 'INDEX ERROR!!! %s page did not exist' % place
            print 'INDEX ERROR!!! %s -- img src did not exist' % wiki_url
            # return make_default_img_url(place)
            return self.make_default_img_url(place)

        except ConnectionError:
            # print 'CONNECTION ERROR!!! RECONNECT TO %s page' % place
            print 'CONNECTION ERROR!!! RECONNECT TO -- %s ' % full_url
            time.sleep(20)
            # return make_default_img_url(place)
            return self.make_default_img_url(place)

        if 'Pagebanner_default' in img_src or 'default_banner' in img_src:
            print '%s has default banner!' % place
            img_path = 'static/banners/default.png'

        else:
            place = place.replace('/', '_')  # REPLACE '/' with '_' BECAUSE IT CREATES A DIRECTORY

            try:
                img_response = requests.get(img_src, stream=True)
                # print img_response
                img_path = 'static/banners/%s.png' % place

            except IndexError:
                print 'INDEX ERROR!!! %s img did not exist' % place
                return self.make_default_img_url(place)

            except ConnectionError:
                # print 'CONNECTION ERROR!!! RECONNECT TO %s img' % place
                print 'CONNECTION ERROR!!! RECONNECT TO %s img' % img_src
                return self.make_default_img_url(place)

            # save the img file if it doesn't already exist. if it already exists, dont overwrite.
            if not os.path.exists('../../webapp/static/banners/%s.png' % place):
                with open('../../webapp/static/banners/%s.png' % place, 'wb') as out_file:
                    shutil.copyfileobj(img_response.raw, out_file)
                del img_response
                print '%s.png successfully created' % place

            else:
                print '%s.png already exists!' % place
        # look over before request
    else:
        print '%s.png already exists!' % place
        return self.make_default_img_url(place)
    return img_path, wiki_url
(What an ordeal. My machine keeps losing its network connection and I can't get it fixed. I went to the studio to borrow their network for the image downloads, but the teacher said officials were visiting and we couldn't stay.)
Scraping on a flaky network is really too hard; fall back to the model that uses default URLs and leave the remaining scraping to someone else.
Once everything has run, a geotag_imglink_wikibanner.pkl is produced in D:\anacondaProject\where2go\data
New York Times
4.nyt_articles_api.py
This file was used to gather the most recent NYT articles in the World, Science, and Travel sections. MongoDB was used to save the articles fetched with the official NYT API. The data was collected but not incorporated into the model, because the articles contained too much news-like semantics.
| ImportError: No module named pymongo |
去蹭網(wǎng)下叭
| pip install pymongo |
運(yùn)行了一會(huì)兒后報(bào)錯(cuò)
| pymongo.errors.ServerSelectionTimeoutError: localhost:27017: [Errno 10061] |
想起來(lái)這個(gè)是要mongodb的
https://www.jianshu.com/p/c9777b063593
https://blog.csdn.net/huasonl88/article/details/51755621
MongoDB 不同于關(guān)系型結(jié)構(gòu)的三層結(jié)構(gòu)——database--> table --> record,它的層級(jí)為 database -->collection --> document
https://blog.csdn.net/zwq912318834/article/details/77689568
import pymongo

# address and port of the mongodb service
mongo_url = "127.0.0.1:27017"

# connect to mongodb; defaults to "localhost:27017" if no argument is given
client = pymongo.MongoClient(mongo_url)

# connect to the database myDatabase
DATABASE = "myDatabase"
db = client[DATABASE]

# connect to the collection (table) myDatabase.myCollection
COLLECTION = "myCollection"
db_coll = db[COLLECTION]

# find records in myCollection whose date field equals 2017-08-29, sorted by age descending
queryArgs = {'date': '2017-08-29'}
search_res = db_coll.find(queryArgs).sort('age', -1)
for record in search_res:
    print(f"_id = {record['_id']}, name = {record['name']}, age = {record['age']}")
源碼:
# Define the MongoDB database and table
db_cilent = MongoClient()
db = db_cilent['nyt_dump']
table = db['articles']
'''
Get all the links, visit the page and scrape the content
'''
if not section:
    links = table.find({'content_txt': {'$exists': False}}, {'web_url': 1})
else:
    links = table.find({'$and': [{'content_txt': {'$exists': False}},
                                 {'section_name': section}]}, {'web_url': 1})
開(kāi)啟mongodb
| D:\Program Files\Mongo\bin>mongod.exe --dbpath "D:\MongoDB\DBData" |
Mongo還用不了了,卸載重裝https://www.cnblogs.com/6luv-ml/p/9174818.html
看了下,可能因?yàn)樯洗沃匮b系統(tǒng)的問(wèn)題,程序與功能里并沒(méi)有mongodb,直接刪除了安裝
沒(méi)再報(bào)錯(cuò)了(沒(méi)看到寫(xiě)入文件,有沒(méi)有數(shù)據(jù)也不想管了-反正后面可能也用不著)
Service Name:MongoDB
Data Directory:D:\Program Files\Mongo\data\
2) EDA
Exploratory data analysis and data cleaning have been performed with ipython notebook. Wikivoyage and NYT data were loaded, cleaned, pickled out as input format for word2vec, which is a list of sentences where each sentence is represented as a list of words. Also, global NOAA weather data was downloaded but I later determined that it leaves out major parts of the world. Thus, more data has to be collected to incorporate weather to the project.
Ipython notebook已執(zhí)行探索性數(shù)據(jù)分析和數(shù)據(jù)清理。Wikivoyage 和 NYT 數(shù)據(jù)被加載、清理、挑選出來(lái)作為 word2vec 的輸入格式,該格式是句子列表,其中每個(gè)句子都表示為單詞列表。此外,全球NOAA天氣數(shù)據(jù)被下載,但我后來(lái)確定,它忽略了世界的主要部分。因此,要將天氣納入項(xiàng)目必須收集更多的數(shù)據(jù)。
Wiki voyage
| (py2_flask) D:\anacondaProject\where2go\code\data_collection>ipython Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)] Type 'copyright', 'credits' or 'license' for more information IPython 6.5.0 -- An enhanced Interactive Python. Type '?' for help. |
IPython comes preinstalled with Anaconda.
The notebooks were probably made with Jupyter; opened directly, the file looks like a pile of JSON.
Jupyter introduction: http://baijiahao.baidu.com/s?id=1601883438842526311&wfr=spider&for=pc
Jupyter Notebooks shine most while you are still prototyping, because code is written in independent cells that execute separately: you can test a specific block of code in a project without executing from the start.
To run Jupyter Notebooks, just type the following on the command line:
jupyter notebook
Jupyter will then open in your default web browser at:
http://localhost:8888/tree
In some cases it may not open automatically; instead a URL with a token key is printed in the terminal/command line, and you need to copy that entire URL, token included, into your browser before you can open a notebook.
Once a notebook is open, you will see three tabs at the top: Files, Running, and Clusters. Files simply lists all files, Running shows the terminals and notebooks you currently have open, and Clusters is provided by IPython parallel.
To open a new Jupyter notebook, click "New" on the right side of the page. You will see four options to choose from:
Python 3, Text File, Folder, Terminal
Choosing Text File gives you an empty pane where you can add any letters, words and numbers. It is essentially a text editor (similar to Ubuntu's): you can pick a language (there are many options) and write scripts, and also find and replace words in the file.
Choosing Folder creates a new folder in which you can put files; you can rename it, delete it, and so on.
Terminal works exactly like the terminal on a Mac or Linux machine (or cmd on Windows), running a terminal session inside your web browser. Type python in this terminal and you can start writing Python scripts!
?
In the menu above the code you have options for operating on individual cells: add, edit, cut, move cells up or down, run the code in a cell, stop code, save your work, and restart the kernel.
?
The cell-type drop-down menu has four options:
Code — self-explanatory, where you write code. Markdown — where you write text; you can add conclusions, comments and so on after running a piece of code. Raw NBConvert — a command-line tool for converting your notebook into another format (such as HTML). Heading — where you add titles so that different sections are separated and the notebook looks tidier; this has since been folded into the Markdown option itself: type "##" and what follows is treated as a heading.
You have surely seen magic functions like %clear, %autosave, %debug and %mkdir before. Magic commands run in two ways:
line-wise and cell-wise
As the names suggest, line magics execute a single line of commands, while cell magics execute not just one line but the entire code block in a cell.
In line mode every command must start with %; in cell mode every command must start with %%.
?
Shortcuts are one of Jupyter Notebooks' biggest strengths: to run any block of code, just press Ctrl+Enter.
Jupyter Notebooks offer two keyboard input modes, command and edit. Command mode binds keys to notebook-level commands and is indicated by a grey cell border with a blue left margin; edit mode lets you type text (or code) into the active cell and is indicated by a green cell border.
You can jump between command mode and edit mode with Esc and Enter respectively.
?
As mentioned before, Ctrl + Enter runs the current cell.
Alt + Enter not only runs the cell but also adds a new cell below it.
Ctrl + Shift + F opens the command palette.
For the full list of keyboard shortcuts, press "H" in command mode or go to Help > Keyboard Shortcuts.
Saving and sharing your notebook
When I have to write a blog post, my code and comments all sit in a Jupyter file, and I first need to convert them into another format. Remember that these notebooks are JSON, which is not very helpful for sharing; I can hardly paste individual cells into emails and blog posts, right?
Go to the "File" menu and you will see the "Download As" option:
You can save your notebook in any of seven formats. The most commonly used are .ipynb and .html: an .ipynb file lets others copy your code onto their machine, while an .html file opens as a web page (handy when you need to keep the images embedded in a notebook).
You can also convert notebooks manually to formats such as HTML or PDF with the nbconvert tool.
You can also use jupyterhub (https://github.com/jupyterhub/jupyterhub), which hosts notebooks on a server for multi-user sharing; many top research projects collaborate this way.

Sometimes your file holds a very large amount of code. See whether code you consider unimportant can be hidden away and referenced later; this keeps the notebook clean and clear, which is precious. See this matplotlib notebook for how concisely things can be presented: http://nbviewer.jupyter.org/github/jrjohansson/scientific-python-lectures/blob/master/Lecture-4-Matplotlib.ipynb
One more bonus tip! When you want to build a presentation, the first tools that come to mind are probably PowerPoint and Google Slides — but Jupyter Notebooks can create slides too!
Changing the default save location
- Open the Windows cmd and run jupyter notebook --generate-config, as in the screenshot:
The path shown is D:\Users…; go to that path and edit the jupyter_notebook_config.py file.
Open the file and find
## The directory to use for notebooks and kernels.
#c.NotebookApp.notebook_dir = ''
Change it to
## The directory to use for notebooks and kernels.
c.NotebookApp.notebook_dir = 'E:\Jupyter'
where E:\Jupyter is my workspace; change it to your own.
Note:
1. The # in #c.NotebookApp.notebook_dir = '' must be deleted, with no space left in front of the line.
2. The E:\Jupyter folder must be created in advance; otherwise Jupyter Notebook cannot find it and will exit immediately on launch.
My cmd has no Jupyter on the PATH, so jupyter notebook --generate-config cannot run there; I edited the config from within Anaconda, and launch it from Anaconda as well.
(base) C:\Users\Lenovo>jupyter notebook
Backslashes may be interpreted as escape characters:
| c.NotebookApp.notebook_dir = 'D:\\anacondaProject' |
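A standalone illustration of the escaping issue (plain Python, not part of the config file itself): doubling the backslash and using a raw string produce the same path value.

```python
# A bare backslash in a normal Python string can start an escape sequence,
# so either double it or use a raw string; both spell the same path.
escaped = 'D:\\anacondaProject'
raw = r'D:\anacondaProject'
assert escaped == raw

# written unescaped, '\a' would silently become the single BEL control character:
assert len('\a') == 1
```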
?
Tried running directly in the base environment and made a copy of the Wiki voyage EDA notebook.
Error:
| ModuleNotFoundError: No module named 'gensim' |
Ignoring this for now; I will decide later which environment to install it into.
It turns out the py2 environment, although Jupyter was never explicitly installed into it, can also run notebooks, with exactly the same configuration as base (there is even a py3 badge in the top-right corner).
Solution: http://www.360doc.com/content/17/0413/22/1489589_645405947.shtml
Jupyter Notebook environments are tied to kernels. Searching for kernel.json with Everything finds
/jupyter/kernels/python3/kernel.json
The (py27) environment is still missing ipykernel:
conda install ipykernel
Switching kernels:
https://blog.csdn.net/castle_cc/article/details/77476081
| python -m pip install ipykernel python -m ipykernel install --user |
Kernel switched successfully.
Running it fails:
| FileNotFoundError: [Errno 2] No such file or directory: '../data/wikivoyage.json' |
Copied the files generated by the data-collection scripts into D:\anacondaProject\where2go\code\data.
Error:
| ImportError: matplotlib is required for plotting. |
https://www.cnblogs.com/star-zhao/p/9726212.html
Tried restarting the IDE and re-running everything; error:
| LookupError Traceback (most recent call last) <ipython-input-36-3b51d0f0aedc> in <module>() 5 # final_articles_words[key] = convert_article_into_list_of_words(value) 6 #print article ----> 7 final_articles_words[key] = convert_article_into_list_of_words(value) <ipython-input-33-e13c4daff3a0> in convert_article_into_list_of_words(article) 14 text = clean_paragraph(text) 15 #tokenize paragraph to sentences ---> 16 sentences = sent_tokenize(text) 17 18 for sentence in sentences: |
| LookupError: ********************************************************************** Resource punkt not found. Please use the NLTK Downloader to obtain the resource: >>> import nltk >>> nltk.download('punkt') For more information see: https://www.nltk.org/data.html Attempted to load tokenizers/punkt/english.pickle |
https://blog.csdn.net/qq_31747765/article/details/80307450
| # at the command line:
python
>>> import nltk
>>> nltk.download() |
In the downloader window, switch to the Models tab and find punkt.
Pick one of the paths searched in the error message and set it as the Download Directory:
| D:\\ProgramData\\Anaconda3\\envs\\py2_flask\\nltk_data |
This finally produces ../data/wikivoyage_list_of_words.pkl.
Weather
Error:
| No module named haversine |
| pip install haversine |
Error:
| No such file or directory: '../../data/pickles/geotag_imglink_wikiurl.pkl' |
I could not find any code that writes this file as output; both weather_normals_eda-checkpoint and this notebook only read it in.
Tried renaming the pickle generated just now to this name.
Error:
| IOError: [Errno 2] No such file or directory: '../data/weather/ghcnm.tavg.v3.3.0.20150624.qca.dat' |
https://www.jianshu.com/p/3d4b606ec359
The Global Historical Climatology Network monthly (GHCNm) dataset is a set of monthly climate summaries from thousands of weather stations around the world. The earliest station observations in the monthly data date back to the 18th century. Some station records are purely historical and no longer updated, while many other stations are still operating and provide short-time-delay updates useful for climate monitoring.
That page gives the dataset's address: https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-monthly-version-4
Data address: https://www.ncei.noaa.gov/data/global-historical-climatology-network-monthly/
GHCN: https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-ghcn
The v4 version only offers qcf, qcu and qfe variants, with no qca, so I decided to download v3 after all.
The PHA has been extensively evaluated (e.g., Williams et al., 2012), and GHCNm v4 data are provided both homogenized (adjusted) and unhomogenized (unadjusted). Homogenized data are identified by the string "qcf" and unhomogenized data by "qcu". As described in Menne et al. (2018), the PHA is run periodically as an ensemble to quantify the uncertainty of homogenization; other sources of uncertainty are evaluated as well.
Put the files into the expected folder and update the date in the filename being read:
| globaldata = pd.read_fwf('../data/weather/ghcnm.tavg.v3.3.0.20190821.qca.dat',header = None, widths=widths) |
Error:
| IOError: [Errno 2] No such file or directory: '../data/weather/ghcnm.tavg.v3.3.0.20150624.qca.inv' |
Unzip the archive and change the filename in the read call as well.
Warning:
| D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\ipykernel_launcher.py:6: FutureWarning: The current behaviour of 'Series.argmin' is deprecated, use 'idxmin' instead. The behavior of 'argmin' will be corrected to return the positional minimum in the future. For now, use 'series.values.argmin' or 'np.argmin(np.array(values))' to get the position of the minimum row.
D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\ipykernel_launcher.py:7: DeprecationWarning: .ix is deprecated. Please use .loc for label based indexing or .iloc for positional indexing
See the documentation here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
import sys

# ndarray compat
argmin = idxmin
argmax = idxmax
So argmin/argmax are just aliases here.

ix, loc and iloc are all used to fetch a given row or column of data:

      col1  col2  col3
row1     1     2     3
row2     4     5     6
row3     7     8     9

.loc[] is primarily label based, but may also be used with a boolean array. It indexes purely by label position (not by subscript); the label positions are the 'row1', 'row2' defined above. Usage (row1 being a row label): print df.loc['row1']

.iloc[] is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. It indexes purely by row number, i.e. rows 0, 1, 2: print df.iloc[0]

.ix[] supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type. It accepts labels, row numbers, or a combination of the two (this indexer is deprecated; use the two above instead). |
| Source:
closest_station = distance.argmin()
temps = master.ix[closest_station][months] |
| Fix:
closest_station = distance.idxmin()
temps = master.loc[closest_station][months] |
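A small self-contained illustration of the two replacements, with made-up station data (not the project's): idxmin returns the label of the minimum, which is what the deprecated label-based Series.argmin used to do, and .loc is the label-only replacement for the mixed indexer .ix.

```python
# Toy data standing in for the weather notebook's variables (assumptions):
# `distance` maps station labels to distances, `master` holds their temperatures.
import pandas as pd

distance = pd.Series([12.5, 3.2, 48.0], index=['st_a', 'st_b', 'st_c'])
master = pd.DataFrame({'jan': [1.0, 2.0, 3.0], 'feb': [4.0, 5.0, 6.0]},
                      index=['st_a', 'st_b', 'st_c'])

closest_station = distance.idxmin()   # label of the smallest distance
temps = master.loc[closest_station]   # row selected purely by label
```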
Everything now runs to completion.
Nyt
Error:
| ImportError: No module named nyt_articles_api |
This module comes from the earlier data-collection step.
How to import your own Python modules in Jupyter:
https://blog.csdn.net/w371500241/article/details/55809362
https://www.cnblogs.com/master-pokemon/p/6136483.html
Put the module in the same directory, or:
| import sys sys.path.append('e:/workspace/Modules') import Hello Hello.hello() |
Jupyter cannot resolve absolute paths written directly, but relative paths work; add:
| import sys sys.path.append('..\data_collection') |
Error:
| ServerSelectionTimeoutError: localhost:27017: [Errno 10061] |
Presumably MongoDB again.
The version installed this time apparently ships with a server that started on its own; no further errors, and the output file is generated:
../data/nyt_articles_word_list.pkl
?
Launching Jupyter:
| activate py2_flask jupyter notebook |
?
3) Model
Where2go is based on a model created at Google called word2vec. Word2vec is a neural network with one hidden layer, trained with either a continuous bag of words (CBOW) or a skip-gram objective. Where2go uses the skip-gram variant with hierarchical softmax for optimization.
At a high level, word2vec trains the neural network to parametrize a model that can predict the surrounding words for every word in the corpus. The predictions are then used to backpropagate and optimize the parameters so that words with similar contexts move closer together, while moving further away from words with different contexts. The input-to-hidden weight matrix, which is also the vector representation of the words, is then used to gain insight into the meaning/similarity of words.
In my where2go_model.py file, I implemented gensim's word2vec model and wrote functions to vectorize user search queries and functions to filter the recommendations to actual geolocations and output destinations in geojson format.
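Word2vec's additive vector arithmetic (the 'king' - 'man' + 'woman' ≈ 'queen' trick that also powers the place queries) can be illustrated with tiny hand-made vectors. These 3-d toy embeddings are assumptions, not trained word2vec output; the point is only the mechanics: add/subtract vectors, then rank candidates by cosine similarity.

```python
# Toy 3-d "embeddings" (made up for illustration, not trained vectors).
import math

vectors = {
    'king':  [0.9, 0.8, 0.1],
    'queen': [0.9, 0.1, 0.8],
    'man':   [0.1, 0.9, 0.1],
    'woman': [0.1, 0.2, 0.8],
    'apple': [0.5, 0.5, 0.5],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# king - man + woman, then find the nearest remaining word by cosine similarity
query = [k - m + w for k, m, w in
         zip(vectors['king'], vectors['man'], vectors['woman'])]
best = max((w for w in vectors if w not in ('king', 'man', 'woman')),
           key=lambda w: cosine(query, vectors[w]))
```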
| activate py2_flask cd /d D:\anacondaProject\where2go\code\model python where2go_model.py |
Error:
| IOError: [Errno 2] No such file or directory: '../../data/pickles/geo_imglink_wikiurl.pkl' |
The original name was geotag_imglink_wikiurl; back the file up and rename it.
Error:
| IOError: [Errno 2] No such file or directory: '../../data/pickles/wikivoyage_list_of_words.pkl' |
Copy in the file produced by the EDA run above.
Warning:
| D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\gensim\models\base_any2vec.py:743: UserWarning: C extension not loaded, training will be slow. Install a C compiler and reinstall gensim for fast training. ? "C extension not loaded, training will be slow. " |
https://blog.csdn.net/menghuanguaishou/article/details/90546838
| pip uninstall gensim pip install gensim==3.6 |
Warning:
| D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\gensim\utils.py:1212: UserWarning: detected Windows; aliasing chunkize to chunkize_serial ? warnings.warn("detected Windows; aliasing chunkize to chunkize_serial") |
https://blog.csdn.net/qq_41185868/article/details/88344862
Reportedly harmless.
Warning:
| D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\gensim\models\phrases.py:598: UserWarning: For a faster implementation, use the gensim.models.phrases.Phraser class ? warnings.warn("For a faster implementation, use the gensim.models.phrases.Phraser class") |
| The source uses:
bigram = gensim.models.Phrases(self.wikivoyage_list, min_count=10)
trigram = gensim.models.Phrases(bigram[self.wikivoyage_list], min_count=10) |
Oddly, I could not find anything on fixing this warning.
https://blog.csdn.net/lwhsyit/article/details/82750218
This step should generate ../../data/pickles/where2go_model.pkl.
?
Webapp
I was able to launch my own website using python Flask. I used javascript to perform AJAX calls for the search engine so that I could run a user's search query on my model to predict the most similar places and show my recommendations on the map. The Flask file is named 'app.py' and can be found in the folder 'webapp'; the 'index.html' file contains the html and javascript and can be found in the folder 'templates'. I used Bootstrap to design my website.
Final Remarks
This project has been very fun and intellectually challenging. I started this application as a capstone project but there are many things I would like to add to this app. I really want to add more travel guide data to make my results more robust, add historical weather data to help users decide when to go to a destination, and add average flight and hotel costs to help users choose plausible places. If you have any comments and recommendations for this project, please feel free to contact me.
Launching the project
The project is Python 2; I needed to make py2 usable in my Anaconda setup, where py3 and py2 coexist.
Quite a few modules turned out to be required, so I decided to create a fresh py2 virtual environment in Anaconda.
GUI route (the fetching step takes quite a while):
https://www.cnblogs.com/zimo-jing/p/7834808.html?utm_source=debugrun&utm_medium=referral
Command line: https://jingyan.baidu.com/article/455a9950500494a166277808.html
Installing modules:
| Traceback (most recent call last): ? File "app.py", line 10, in <module> ??? from where2go_model import Where2Go_Model ? File "../code/model\where2go_model.py", line 1, in <module> ??? import gensim ImportError: No module named gensim |
pip install gensim (installing through the graphical interface kept erroring out and failing)
Error: no version satisfies the requirement
pip install --upgrade gensim
?
Installation failed (probably a network issue):
| File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\pip\_vendor\urllib3\response.py", line 374, in _error_catcher ??? raise ReadTimeoutError(self._pool, None, 'Read timed out.') ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out. |
Repeat the install command.
?
Missing module:
| Traceback (most recent call last): ? File "app.py", line 1, in <module> ??? from flask import Flask ImportError: No module named flask |
Installed through the graphical interface.
?
| Traceback (most recent call last): ? File "app.py", line 10, in <module> ??? from where2go_model import Where2Go_Model ? File "../code/model\where2go_model.py", line 7, in <module> ??? from bs4 import BeautifulSoup ImportError: No module named bs4 |
Install beautifulsoup4.
The graphical-interface install failed, so:
pip install bs4
Installed successfully.
?
When I run the web app in a python2.7 environment with all the dependencies, I get the following error:
| Traceback (most recent call last): ? File "app.py", line 44, in <module> ??? app.where2go = load_pickle() ? File "app.py", line 19, in load_pickle ??? return pkl.load(open('../data/pickles/where2go_model.pkl', 'rb')) IOError: [Errno 2] No such file or directory: '../data/pickles/where2go_model.pkl' |
?
Check the source:
| webapp/app.py
code/model/where2go_model.py |
Try using the code in the code folder to collect the datasets.
Startup output:
| (py2_flask) D:\anacondaProject\where2go\webapp>python app.py D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\gensim\utils.py:1212: UserWarning: detected Windows; aliasing chunkize to chunkize_serial ? warnings.warn("detected Windows; aliasing chunkize to chunkize_serial") ?* Serving Flask app "app" (lazy loading) ?* Environment: production ?? WARNING: This is a development server. Do not use it in a production deployment. ?? Use a production WSGI server instead. ?* Debug mode: on ?* Restarting with stat D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\gensim\utils.py:1212: UserWarning: detected Windows; aliasing chunkize to chunkize_serial ? warnings.warn("detected Windows; aliasing chunkize to chunkize_serial") ?* Debugger is active! ?* Debugger PIN: 232-882-558 ?* Running on http://0.0.0.0:80/ (Press CTRL+C to quit) |
The page opens successfully.
Usage:
| To search travel destinations, you can
1. Type in destinations and/or characteristics
2. Put a + (add) or - (subtract) sign in front of a word to express a preference
3. Multiply a word by a number to boost (greater than 1.0) or dampen (less than 1.0) its influence

where2go tends to recommend places at the same level of description as the input; this means that when searching for a city, it is more likely to return city names than country names. You are best off entering...
Individual countries/cities: Spain
Adding places of the same description level: hong kong + singapore
Adding places + a characteristic: french polynesia + guam + scuba diving

Search tips:
Try to put in at least one place; word2vec searches for similar words, so it is likely to return places with names related to the search
Play around with the place multipliers: san francisco + 1.5*malaga will yield results more like malaga than san francisco + malaga
When you want cities like A but in country B: City A - Country of Place A + Country of Place B |
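As a sketch, the +/-/multiplier syntax above could be parsed into (weight, term) pairs along these lines. This is a guess at the behavior, not the app's actual parser (which lives in where2go_model.py):

```python
# Hypothetical parser for the query syntax: '+'/'-' set the sign of a term,
# and a leading '1.5*' multiplier scales its weight.
import re

def parse_query(query):
    """'san francisco + 1.5*malaga - spain' -> [(1.0, 'san francisco'), ...]"""
    terms = []
    for sign, chunk in re.findall(r"([+-]?)\s*([^+-]+)", query):
        weight = -1.0 if sign == '-' else 1.0
        m = re.match(r"\s*(\d+(?:\.\d+)?)\s*\*\s*(.+)", chunk)
        if m:  # strip the multiplier prefix and fold it into the weight
            weight *= float(m.group(1))
            chunk = m.group(2)
        terms.append((weight, chunk.strip()))
    return terms
```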
Searching fails with an error:
| 127.0.0.1 - - [02/Sep/2019 13:09:16] "POST /map HTTP/1.1" 500 - Traceback (most recent call last): ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 2463, in __call__ ??? return self.wsgi_app(environ, start_response) ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 2449, in wsgi_app ??? response = self.handle_exception(e) ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1866, in handle_exception ??? reraise(exc_type, exc_value, tb) ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 2446, in wsgi_app ?? ?response = self.full_dispatch_request() ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1951, in full_dispatch_request ??? rv = self.handle_user_exception(e) ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1820, in handle_user_exception ??? reraise(exc_type, exc_value, tb) ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1949, in full_dispatch_request ??? rv = self.dispatch_request() ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1935, in dispatch_request ??? return self.view_functions[rule.endpoint](**req.view_args) ? File "D:\anacondaProject\where2go\webapp\app.py", line 41, in userinput ??? return json.dumps(app.result) ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\json\__init__.py", line 244, in dumps ??? return _default_encoder.encode(obj) ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\json\encoder.py", line 207, in encode ??? chunks = self.iterencode(o, _one_shot=True) ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\json\encoder.py", line 270, in iterencode ??? return _iterencode(o, 0) ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\json\encoder.py", line 184, in default ??? 
raise TypeError(repr(o) + " is not JSON serializable") TypeError: Decimal('113.26700000') is not JSON serializable |
Check the source:
| def userinput(): ??? data = request.data ? ??? ms = app.where2go.most_similar(data) ??? top_places_json = app.where2go.get_top_places_json(ms)? ??? ??? app.result['top_places'] = top_places_json ??? print top_places_json ??? return json.dumps(app.result) |
Error:
| [(1.0, 'beijing')] ? [(u'guangzhou', 0.7928651571273804), (u'seoul', 0.7863544225692749), (u'nanjing', 0.7803971767425537), (u'tianjin', 0.776152491569519), (u'shanghai', 0.7680126428604126), (u'hangzhou', 0.747714638710022), (u'wuhan', 0.7452333569526672), (u'kunming', 0.7269240021705627), (u'fuzhou', 0.720137357711792), (u'xiamen', 0.7125382423400879), (u'beijing_shanghai', 0.7124168872833252), (u'busan', 0.7067762613296509), (u'harbin', 0.7055091857910156), (u'xian', 0.703764796257019), (u'taipei', 0.7032514810562134), (u'moscow', 0.7001821994781494), (u'urumqi', 0.6986857652664185), (u'shenyang', 0.6914734244346619), (u'chengdu', 0.6909835338592529), (u'munich', 0.6862865686416626), (u'vienna', 0.6839408874511719), (u'ulaanbaatar', 0.6831813454627991), (u'budapest', 0.6821123957633972), (u'vladivostok', 0.6806952953338623), (u'zhengzhou', 0.6783512830734253), (u'brussels', 0.6768432259559631), (u'copenhagen', 0.6743952035903931), (u'pyongyang', 0.6742997169494629), (u'bratislava', 0.667781412601471), (u'astana', 0.6674197912216187), (u'ningbo', 0.6667413711547852), (u'chongqing', 0.6655149459838867), (u'shenzhen', 0.6651620864868164), (u'qingdao', 0.6618784070014954), (u'sofia', 0.6600873470306396), (u'frankfurt', 0.6579354405403137), (u'nanning', 0.6576802730560303), (u'berlin', 0.6552646160125732), (u'wuchang', 0.6497694253921509)] 127.0.0.1 - - [02/Sep/2019 14:35:10] "POST /map HTTP/1.1" 500 - Traceback (most recent call last): ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 2463, in __call__ ??? return self.wsgi_app(environ, start_response) ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 2449, in wsgi_app ??? response = self.handle_exception(e) ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1866, in handle_exception ??? reraise(exc_type, exc_value, tb) ? 
File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 2446, in wsgi_app ?? ?response = self.full_dispatch_request() ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1951, in full_dispatch_request ??? rv = self.handle_user_exception(e) ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1820, in handle_user_exception ??? reraise(exc_type, exc_value, tb) ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1949, in full_dispatch_request ??? rv = self.dispatch_request() ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1935, in dispatch_request ??? return self.view_functions[rule.endpoint](**req.view_args) ? File "D:\anacondaProject\where2go\webapp\app.py", line 39, in userinput ??? app.result['top_places'] = top_places_json TypeError: 'NoneType' object does not support item assignment |
top_places_json and app.result are lists.
Looked up how json.dumps is used:
Django ships its own encoder; when serialization fails you can add an extra cls=NpEncoder-style argument.
It may also be that the data contains numpy or similar types that dumps cannot recognize.
Here Decimal('113.26700000') just means a decimal number.
Custom encoder class: https://blog.csdn.net/rt5476238/article/details/91398332
https://stackoverflow.com/questions/1960516/python-json-serialize-a-decimal-object/8274307#8274307
Simplejson 2.1 and higher has native support for the Decimal type: json.dumps(Decimal('3.9'), use_decimal=True)
Note that use_decimal is True by default
Simplejson is a simple, fast, complete, correct and extensible JSON [http://json.org] encoder and decoder for Python 2.5+ and Python 3.3+. It is pure Python code with no dependencies, but includes an optional C extension for a serious speed boost.
The latest simplejson documentation can be read online at: https://simplejson.readthedocs.io/
simplejson is the externally maintained development version of the json library shipped with Python 2.6 and Python 3.0, while retaining backward compatibility with Python 2.5.
Usage docs: https://simplejson.readthedocs.io/en/latest/
Trying it out:
| change import json to import simplejson as json |
| pip install simplejson |
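A standard-library alternative to simplejson is to subclass json.JSONEncoder so that Decimal values (like the longitude in the traceback above) are emitted as floats. The payload below is made up for illustration:

```python
# Custom encoder: json.JSONEncoder calls default() for types it cannot
# serialize, so intercepting Decimal there avoids the TypeError.
import json
from decimal import Decimal

class DecimalEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, Decimal):
            return float(o)
        return super(DecimalEncoder, self).default(o)

payload = {'lng': Decimal('113.26700000'), 'city': 'guangzhou'}
encoded = json.dumps(payload, cls=DecimalEncoder)
```

With this in place, json.dumps(app.result, cls=DecimalEncoder) would avoid the error without swapping the json module.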
Everything is now done (except that the saved banner images are not yet in their folder), but routes and user-interest information are not included.
Common Anaconda commands
Create a virtual environment
(base) C:\Users\zdp>conda create -n django
?
Activate a virtual environment
(base) C:\Users\zdp>activate py2_flask
?
cd into the project folder
(django) C:\Users\laugo>cd /d D:\anacondaProject\where2go\webapp
Run a py file:
python app.py
| When scraping, run:
activate py2_flask
cd /d D:\anacondaProject\where2go\code\data_collection
python scrap_wikivoyage_banners.py |
| When opening the app: |
| activate py2_flask cd /d D:\anacondaProject\where2go\webapp python app.py |
Check the Flask version:
| python import flask flask.__version__ |
Common algorithms
The next problem was figuring out which model to use. Traditional natural-language-processing recommender systems include models such as TF-IDF + cosine similarity and TF-IDF + SVD + k-means clustering.
Natural language processing:
Applications of NLP in recommender systems: https://blog.csdn.net/heyc861221/article/details/80130263
Compared with structured information (such as item attributes), text has some inherent drawbacks in practice: structure carries information, and recommendation strategies, whether algorithmic or rule-based, can be built directly on structured information; the information content of free text is uncertain; and ambiguity is common.
Advantages: large data volume; rich diversity; timely information.
Common text-processing methods in recommender systems:
Some "explicit" ways of using text data:
Bag of Words (BOW) model: its core assumption is that a document is the multiset of the words it contains (a multiset differs from an ordinary set in that it counts how many times each element occurs).
A common yardstick: term weighting and the vector space model
After suitable preprocessing, a simple bag-of-words model can be used to recall candidate items in a recommender system. But when computing the relevance between items and keywords, or between items, ranking purely on raw term frequency is clearly unreasonable. To fix this we can introduce the more expressive TF-IDF weighting, where tf_{t,d} is the number of times t occurs in d and df_t is the number of documents containing t.
All of these methods aim to measure a word's importance within a document more sensibly. On top of them, we can improve the frequency-based approach: for example, ranking items by TF-IDF score instead of by raw term frequency.
Beyond that, we also need a unified way to measure the relevance between keywords and documents, and between documents; that method is the Vector Space Model (VSM).
The core idea of the VSM is to represent a document as a vector, where each dimension can stand for a word; on that basis, similarity between documents can be computed uniformly with vector operations. For text relevance we can fill the vector with TF-IDF values, and also with N-grams, the topic probability distributions introduced later, various word vectors, or other representations.
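A minimal sketch of the TF-IDF + cosine-similarity pipeline described above, on a three-document toy corpus (a real system would use something like scikit-learn's TfidfVectorizer):

```python
# TF-IDF weighting over a toy corpus, then cosine similarity in the VSM.
import math
from collections import Counter

docs = [['beach', 'sun', 'sand'],
        ['beach', 'wine', 'food'],
        ['snow', 'ski', 'mountain']]

def tfidf(doc, docs):
    """Map each term t in doc to tf_{t,d} * log(N / df_t)."""
    n = len(docs)
    tf = Counter(doc)
    return {t: tf[t] * math.log(n / sum(1 for d in docs if t in d))
            for t in tf}

def cos_sim(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vecs = [tfidf(d, docs) for d in docs]
# docs 0 and 1 share 'beach', so they are more similar than docs 0 and 2
```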
?
Latent semantic models:
(Latent Semantic Analysis, LSA)
The model's core assumption is that although a document is composed of many words, the topics behind those words are few. In other words, the words are merely generated by underlying topics, and those topics are the more essential information. This idea of sinking from words down to topics runs through the other models introduced below, and is the shared central idea of the various topic models.
LSA's approach is to apply an SVD decomposition to the original matrix C.
LSA is a big step forward compared with raw keywords, mainly in the amount of information captured, the reduction in dimensionality, and the handling of synonyms and polysemous words. But LSA also has drawbacks, for example high training complexity: LSA is trained via SVD, which is itself very expensive and hard to compute over massive documents and vocabularies; some optimizations reduce the cost, but the problem is not fundamentally solved.
High retrieval (recall) complexity: as noted above, using LSA for recall requires first mapping documents or query keywords into the LSA vector space, which is clearly also a time-consuming operation.
The per-topic word values in LSA have no probabilistic meaning and can even be negative; they only reflect relative magnitude. This makes it hard to interpret the topic-word relationship from a probabilistic angle, limiting richer uses of the results.
Probabilistic latent semantic models:
Treat the relationship between documents and words as a probability distribution, and then try to recover that distribution.
Viewed as matrices, LSA and pLSA look very similar, but their substance differs fundamentally, above all in their optimization objectives: LSA essentially minimizes the squared error between the SVD-reconstructed matrix and the original matrix, whereas pLSA essentially maximizes a likelihood function, the standard machine-learning optimization recipe. It is precisely this essential difference that leads to their different optimization results and explanatory power.
But pLSA still has some problems, mainly:
Because pLSA generates a set of document-level parameters for each document, the number of parameters grows in proportion to the number of documents, so the model overfits easily when there are many documents.
pLSA represents each document d as a mixture of topics, but the specific mixing proportions have no corresponding generative probability model; in other words, pLSA cannot assign a good topic distribution to a new document outside the training set. In short, pLSA is not a fully generative model.
LDA appeared precisely to solve these problems.
Generative probabilistic models:
(Latent Dirichlet Allocation, LDA)
Latent: needless to say, this is still a latent semantic model.
Dirichlet: the main probability distribution involved in the model is the Dirichlet distribution.
Allocation: the model's generative process keeps using Dirichlet distributions to allocate topics and words.
LDA's central idea is to wrap a prior around pLSA, so that both the topic distribution of a document and the word distribution of each topic have generative probabilities. This solves pLSA's "non-generative" problem described above, and incidentally also reduces the number of model parameters, solving pLSA's other problem.
Capturing context: neural probabilistic language models
pLSA/LDA rest on an important assumption: given the topic distributions, the documents in a collection, and the words within a document, are mutually independent and exchangeable. In other words, the models take no account of word order or word-word relations. This assumption carries two implications: while generating words, previously generated words have no influence on the words generated next;
and two documents containing the same words in a different order are, to LDA, exactly the same.
Such assumptions make LDA lose some important information, and the neural probabilistic language models that have attracted growing attention in recent years, with word2vec as their representative, are complementary to LDA on precisely this point: they can capture information that LDA cannot.
The central idea of word2vec in one sentence: a word is characterized by the company it keeps.
Much like the proverb "birds of a feather flock together".
Concretely, word-vector models construct training samples in the form "surrounding words => current word" or "current word => surrounding words", then train the model with a neural network; once training completes, a word's input vector becomes that word's vector representation (Figure 3 in the linked article).
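The "current word => surrounding words" sample construction can be written down as plain skip-gram pair generation; the sentence and window size here are made up for illustration:

```python
# Generate (center, context) training pairs for skip-gram word2vec.
def skipgram_pairs(sentence, window=2):
    """Pair every word with each neighbor within `window` positions."""
    pairs = []
    for i, center in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

pairs = skipgram_pairs(['visit', 'the', 'beach', 'in', 'bali'])
```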
LDA can naturally cluster words and compute similar words, so how do word2vec results differ from LDA's? First, the clustering granularity differs: LDA works at the topic level, a higher level, whereas word vectors capture lower-level syntactic and semantic meaning. Take the words "Apple", "Xiaomi" and "Samsung": under LDA all three would very likely be clustered into one topic, but from the word-vector point of view "Apple" and "Xiaomi" may have higher similarity, just as "Jobs" and "Lei Jun" relate under word vectors, so with word vectors you may get results like vector(Xiaomi) - vector(Apple) + vector(Jobs) = vector(Lei Jun).
Moreover, because word2vec can "predict the current content from its context", a suitably modified version can also predict user behavior preferences. First collect user behavior logs and split them into sessions, producing training data that resembles a text corpus; training word2vec on this data yields a model that "predicts the current behavior from the surrounding behaviors".
Along the same lines, word2vec can be further modified into models more sensitive to temporal order, or one can try pure sequence models such as RNNs and LSTMs for better predictions.
Word2vec principles
Word2vec overview (unsupervised learning): http://www.mamicode.com/info-detail-2150217.html
Summary (fairly specialized and detailed): https://www.jianshu.com/p/bca4e7bfb86d
Applications: sequence data with strong local correlation
Clustering, finding synonyms, part-of-speech analysis
Text sequences: strong correlation between neighbors, so the target word can be predicted from its context (fill-in-the-blank)
Social networks: generate sequences by random walks, then use word2vec to train a vector for each node
Recommender systems and advertising (app-download sequences: word2vec + similarity)
word2vec from theory to implementation: https://zhuanlan.zhihu.com/p/43736169
Huffman trees in word2vec: https://www.jianshu.com/p/f9351532f281
gensim's word2vec documentation: https://radimrehurek.com/gensim/models/word2vec.html
Introductions to word2vec: https://www.zhihu.com/topic/19886836/hot (several of the references there are also worth reading)
Hierarchical softmax and negative sampling optimizations: https://www.cnblogs.com/Determined22/p/5807362.html
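The hierarchical-softmax idea from the last link can be sketched as follows: each word is a leaf of a binary (Huffman) tree, and its probability is a product of sigmoid decisions along the path from the root. The codes and logits below are made up; in real word2vec the inner-node logits come from dot products with the hidden layer.

```python
# Toy hierarchical softmax: P(word) = product of per-node branch probabilities.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def path_probability(code, logits):
    """code: the word's Huffman code ('0' = go left, '1' = go right);
    logits: one inner-node score per step along the path."""
    p = 1.0
    for bit, s in zip(code, logits):
        p *= sigmoid(s) if bit == '0' else 1.0 - sigmoid(s)
    return p

# with a full 2-level tree, the probabilities of the four leaves sum to 1
root, left, right = 0.3, -1.2, 0.7  # assumed inner-node logits
leaves = {'00': [root, left], '01': [root, left],
          '10': [root, right], '11': [root, right]}
total = sum(path_probability(c, s) for c, s in leaves.items())
```

This is why hierarchical softmax is cheap: computing one word's probability touches only O(log V) inner nodes instead of all V output units.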
Website details
I was able to run the website using Python Flask; JavaScript performs AJAX calls for the search engine, so the user's search query can be run against the model to predict the most similar places and show the recommendations on the map.
The Flask file is named "app.py" and lives in the "webapp" folder;
the "index.html" file contains the HTML and JavaScript and lives in the "templates" folder. Bootstrap is used to style the site.
Html
Clicking the search button in the HTML triggers onClick="sendToFlask()":
| function sendToFlask() {
    data = $('#user_input').val();

    $.ajax({
        'url': '/map',
        'data': data,
        'type': 'POST',
        'contentType': 'application/json',
        'success': function (data) {
            model_output = JSON.parse(data)
            var center_location = model_output['center_location'];
            var geojson = model_output['top_places'];

            //Initialize
            $('#error_msg').remove()
            $('#portfolio').empty()

            //if geojson list is empty, display error message.
            if (geojson.length==0) {
                error_message()
            };

            var portfolio_header = '<br/><div class="col-lg-12 text-center"><h3 class="section-heading">Places 2 Go</h2></div>'

            $('#portfolio').append(portfolio_header);

            addtoPortfolio(geojson);

            // Clear the map before adding new markers
            mapSimple.removeLayer(myLayer);

            // Create new layer
            myLayer = L.mapbox.featureLayer();

            // Add custom popups to each using our custom feature properties
            myLayer.on('layeradd', function(e) {
                var marker = e.layer,
                    feature = marker.feature;

                // Create custom popup content
                var popupContent = '<a target="_blank" class="popup" href="' + feature.properties.url + '">' + '<div class=crop><img src="' + feature.properties.image + '" height/></div><div class=text-center style="padding:15px 0 0 0"><font size="5">' + feature.properties.title + '</font></div></a>';

                // http://leafletjs.com/reference.html#popup
                marker.bindPopup(popupContent, {
                    closeButton: true,
                    minWidth: 320
                });
            });

            myLayer.setGeoJSON(geojson).addTo(mapSimple);

            mapSimple.fitBounds(myLayer.getBounds());
            // mapSimple.clearLayers();
        },
        'error': function (request, status, error) {
            $('#error_msg').remove()
            error_message();
            console.log('Oh no!! Something went wrong.');
        }
    });
}; |
網(wǎng)頁(yè)使用mapbox
地圖無(wú)法顯示了,有可能是因?yàn)閠oken失效或者沒(méi)連上網(wǎng)(控制臺(tái)提示'L' is not defined)
Font awesome的icon使用(沒(méi)有CSDN,不打算用了)
雖然不影響使用,貌似有一些沒(méi)下載下來(lái)?
GET /static/font-awesome/css/font-awesome.min.css HTTP/1.1" 304 -
127.0.0.1 - - [29/Dec/2019 21:39:05] "GET /static/css/bootstrap.min.css HTTP/1.1" 304 -
127.0.0.1 - - [29/Dec/2019 21:39:05] "GET /static/css/agency.css HTTP/1.1" 304 -
127.0.0.1 - - [29/Dec/2019 21:39:05] "GET /static/js/jquery.js HTTP/1.1" 304 -
127.0.0.1 - - [29/Dec/2019 21:39:05] "GET /static/js/bootstrap.min.js HTTP/1.1" 304 -
127.0.0.1 - - [29/Dec/2019 21:39:05] "GET /static/js/cbpAnimatedHeader.js HTTP/1.1" 304 -
127.0.0.1 - - [29/Dec/2019 21:39:05] "GET /static/js/agency.js HTTP/1.1" 304 -
127.0.0.1 - - [29/Dec/2019 21:39:05] "GET /static/js/classie.js HTTP/1.1" 304 -
?
Flask后臺(tái)接口
范例:
from flask import Flask, request

app = Flask(__name__)

@app.route('/1', methods=['POST'])
def aa():
    # Echoes back whatever is posted, logging it to a file
    with open('1.txt', 'a') as f:
        print(str(request.data, encoding='utf-8'), file=f)
    return request.data

if __name__ == '__main__':
    app.run(port=3002)  # defaults to port 5000 if not specified
文件為app.py
| app.run(host= '0.0.0.0', port=80, debug=True) |
@app.route('/map', methods=['POST'])
def userinput():
    data = request.data

    ms = app.where2go.most_similar(data)
    top_places_json = app.where2go.get_top_places_json(ms)
    # print top_places_json
    app.result['top_places'] = top_places_json
    return json.dumps(app.result)
Model
?
?
In my where2go_model.py file, I implemented gensim's word2vec model and wrote functions to vectorize user search queries, plus functions that filter the recommendations down to actual geolocations and output the destinations in geojson format.
"""
Use the trained word2vec model to give the most similar recommendations to the input.

input  = search string in the format of place/char + place/char - ...
output = top recommendations in json format
"""
| 使用經(jīng)過(guò)訓(xùn)練的 word2vec 模型為輸入提供最類似的建議 |
terms = self.parse_search_query(input)  # parse the user query into (multiplier, target) pairs
# Set to make sure the output doesn't include one of the input destinations.
check = set()
| 確保輸出中不包含輸入的目的地 |
# For (multiplier, destination), get multiplier * vector of that destination,
# then sum into the master vector.
for i, term in enumerate(terms):
    multiplier, word = term
    check.add(word)
    if i == 0:
        master_vector = multiplier * self.model[word]
    else:
        master_vector += multiplier * self.model[word]
| 對(duì)于(乘數(shù)、目的地),獲取該目標(biāo)的乘數(shù) + 矢量。 然后加到主向量中 |
# Find the most similar vectors to the master vector
ms = self.model.most_similar(positive=[master_vector], topn=topn)
ms_wo_search_terms = [dest for dest in ms if dest[0] not in check]

print ms_wo_search_terms

return ms_wo_search_terms
| 查找與 master 矢量最相似的矢量 |
?
疑問(wèn),ms到底怎么查出來(lái)的,
| ms = self.model.most_similar(positive=[master_vector], topn=topn) |
是自己調(diào)用自己?jiǎn)?#xff0c;還是word2vec自帶方法
有可能是自帶方法,但貌似不建議使用,警告:
DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).
| 方法將在 4.0.0 中刪除,改用self.wv.most_similar() |
類分析
import cPickle as pkl  # serialization
from where2go_model import Where2Go_Model  # the model
def load_pickle():
    return pkl.load(open('../data/pickles/where2go_model.pkl', 'rb'))
| ms = app.where2go.most_similar(data) |
運(yùn)行時(shí)只有導(dǎo)入的where2go_model中有Where2Go_Model類,反序列化model也是它
但Where2Go_Model中也加載了其他pkl(找了一會(huì)在哪生成的,記憶模糊,拎不清了,離生成這些pkl已經(jīng)過(guò)了很久了,猜測(cè)既然where2go_model中沒(méi)有把其他code文件導(dǎo)入,應(yīng)該沒(méi)有其他類了,其他類之后抽空再看)
Pkl
Java has serialization and deserialization, and Python supports the same operations. When serializing (dump) and deserializing (load) objects in Python we do not need to worry about the details, because Python already wraps them in the cPickle module.
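A minimal Python 3 sketch of the same round trip. Note that cPickle is Python 2 only; in Python 3 the plain pickle module automatically uses the C implementation.

```python
import pickle

obj = {'model': 'where2go', 'topn': 20}

data = pickle.dumps(obj)         # serialize (dump) to bytes
restored = pickle.loads(data)    # deserialize (load) back to an object

print(restored == obj)
# -> True
```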
Model analysis
EDA
Exploratory data analysis and data cleaning have been performed with ipython notebook. Wikivoyage and NYT data were loaded, cleaned, pickled out as input format for word2vec, which is a list of sentences where each sentence is represented as a list of words. Also, global NOAA weather data was downloaded but I later determined that it leaves out major parts of the world. Thus, more data has to be collected to incorporate weather to the project.
Ipython notebook已執(zhí)行探索性數(shù)據(jù)分析和數(shù)據(jù)清理。Wikivoyage 和 NYT 數(shù)據(jù)被加載、清理、挑選出來(lái)作為 word2vec 的輸入格式,該格式是句子列表,其中每個(gè)句子都表示為單詞列表。此外,全球NOAA天氣數(shù)據(jù)被下載,但我后來(lái)確定,它忽略了世界的主要部分。因此,要將天氣納入項(xiàng)目必須收集更多的數(shù)據(jù)。
Uses the file /data/wikivoyage.json (369 MB, produced in step 1, Gathering Data, by converting enwikivoyage-latest-pages-articles.xml). Worth reading up on this file on the official site at some point.
Process the data into the input format.
?
Model
Where2go is based on a model created at Google called word2vec. Word2vec is a neural network with one hidden layer, implemented either as continuous bag of words (CBOW) or as skip-gram. Where2go uses the skip-gram version with hierarchical softmax for optimization.
At a high level, word2vec trains the neural network to parametrize a model that can predict the surrounding words for every word in the corpus. The predictions are then used to backpropagate and optimize the parameters so that words with similar contexts end up closer together, and further away from words with different contexts. The input-to-hidden-layer weight matrix, which is also the vector representation of the words, is then used to gain insight into the meaning/similarity of words.
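The "predict the surrounding words" setup can be sketched as generating skip-gram (center, context) training pairs. A minimal illustration (window size is an assumption here):

```python
def skipgram_pairs(sentence, window=1):
    """Generate (center, context) pairs for every word within the window."""
    pairs = []
    for i, center in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

print(skipgram_pairs(['visit', 'sunny', 'beaches']))
# -> [('visit', 'sunny'), ('sunny', 'visit'), ('sunny', 'beaches'), ('beaches', 'sunny')]
```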
In my where2go_model.py file, I implemented gensim's word2vec model and wrote functions to vectorize user search queries and functions to filter the recommendations to actual geolocations and output destinations in geojson format.
Building the model and taking the most similar results (word2vec):
bigram = gensim.models.Phrases(wikivoyage_list, min_count=10)
model_bigrams = gensim.models.Word2Vec(bigram[wikivoyage_list], min_count=10, size=200)
ms = model_bigrams.most_similar(positive=['paris', 'london', 'sevilla'], negative=[], topn=20)
top_places = []
for entry in ms:
    place, sim = entry
Using the model:
terms = self.parse_search_query(input)

# Set to make sure the output doesn't include one of the input destinations.
check = set()

# For (multiplier, destination), get multiplier * vector of that destination,
# then sum into the master vector.
for i, term in enumerate(terms):
    multiplier, word = term
    check.add(word)
    if i == 0:
        master_vector = multiplier * self.model[word]
    else:
        master_vector += multiplier * self.model[word]

# Find the most similar vectors to the master vector
ms = self.model.most_similar(positive=[master_vector], topn=topn)
ms_wo_search_terms = [dest for dest in ms if dest[0] not in check]
Word2vec模型實(shí)現(xiàn)原理與源碼:
word2vec 算法包括skip-gram & CBOW模型,使用hierarchical softmax or negative sampling
我們這用的是skip-gram+hierarchical softmax
很多人以為 word2vec 是一種模型和方法,其實(shí) word2vec 只是一個(gè)工具,背后的模型是 CBOW 或者 Skip-gram,并且使用了 Hierarchical Softmax 或者 Negative Sampling 這些訓(xùn)練的優(yōu)化方法。所以準(zhǔn)確說(shuō)來(lái),word2vec 并不是一個(gè)模型或算法,只不過(guò) Mikolov 恰好在當(dāng)時(shí)把他開(kāi)源的工具包起名叫做 word2vec 而已。
softmax(正則的指數(shù)函數(shù))是輸出層函數(shù),他可以用于計(jì)算至少兩種不同類型的常見(jiàn)詞嵌入:word2vec, FastText。另外,它與sigmoid和tanh函數(shù)都是許多種類型的神經(jīng)網(wǎng)絡(luò)架構(gòu)的激活步驟
這個(gè)算法的復(fù)雜性就直接是我們單詞表的大小O(V)。事實(shí)表明,我們使用二叉樹(shù)的結(jié)構(gòu)可以簡(jiǎn)化這個(gè)復(fù)雜性,即分層(hierarchical) softmax
模型需要學(xué)習(xí)的參數(shù):每個(gè)單詞的詞向量Xw + 霍夫曼樹(shù)每個(gè)內(nèi)部結(jié)點(diǎn)的θ
Gradient computation for the H-softmax model
There are too many formulas to rederive here; the gradient computation follows Liu Jianping's blog post.
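The key results of that derivation, reconstructed here from the standard hierarchical-softmax formulation (σ is the sigmoid; d_j ∈ {0,1} is the Huffman code of the j-th node on the path to the target word; θ_{j-1} the parameters of that internal node; x_w the input word vector). Note that the gradient term (1 - d_j - σ(·)) matches the `g = (1 - code - f) * alpha` line in the Spark source below.

```latex
% Probability of taking branch d_j at internal node j-1 on the Huffman path:
P(d_j \mid x_w, \theta_{j-1})
  = \sigma(x_w^\top \theta_{j-1})^{\,1-d_j}\,
    \bigl(1 - \sigma(x_w^\top \theta_{j-1})\bigr)^{d_j}

% Log-likelihood over the whole path of length l_w:
\mathcal{L}
  = \sum_{j=2}^{l_w} \Bigl[(1-d_j)\log \sigma(x_w^\top \theta_{j-1})
      + d_j \log\bigl(1-\sigma(x_w^\top \theta_{j-1})\bigr)\Bigr]

% Gradients used in the SGD updates:
\frac{\partial \mathcal{L}}{\partial \theta_{j-1}}
  = \bigl(1 - d_j - \sigma(x_w^\top \theta_{j-1})\bigr)\, x_w
\qquad
\frac{\partial \mathcal{L}}{\partial x_w}
  = \sum_{j=2}^{l_w} \bigl(1 - d_j - \sigma(x_w^\top \theta_{j-1})\bigr)\, \theta_{j-1}
```

With learning rate η, each step updates θ_{j-1} ← θ_{j-1} + η g x_w and x_w ← x_w + η Σ_j g_j θ_{j-1}, where g_j = 1 - d_j - σ(x_w^T θ_{j-1}).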
Spark MLlib's word2vec implementation uses exactly this approach; with the gradient formulas above, the Spark word2vec source becomes readable.
// 省略了建樹(shù)的過(guò)程,在建樹(shù)的過(guò)程中會(huì)給每個(gè)內(nèi)部結(jié)點(diǎn)編碼 while (pos < sentence.length) {val word = sentence(pos)val b = random.nextInt(window)// Train Skip-gram,// syn0 是詞向量 x 參數(shù)數(shù)組,長(zhǎng)度為 vocab_size * emb_size// syn1 是霍夫曼樹(shù)內(nèi)部結(jié)點(diǎn) w 參數(shù)數(shù)組,長(zhǎng)度同上var a = bwhile (a < window * 2 + 1 - b) {if (a != window) {val c = pos - window + aif (c >= 0 && c < sentence.length) {val lastWord = sentence(c)val l1 = lastWord * vectorSizeval neu1e = new Array[Float](vectorSize)// Hierarchical softmaxvar d = 0while (d < bcVocab.value(word).codeLen) {val inner = bcVocab.value(word).point(d)val l2 = inner * vectorSize// Propagate hidden -> outputvar f = blas.sdot(vectorSize, syn0, l1, 1, syn1, l2, 1) // 計(jì)算 x^Twif (f > -MAX_EXP && f < MAX_EXP) {val ind = ((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2.0)).toIntf = expTable.value(ind) // 計(jì)算 f = sigmoid(x^Tw)val g = ((1 - bcVocab.value(word).code(d) - f) * alpha).toFloat // 計(jì)算梯度 g = (1-d-f) * alpha, d 是該節(jié)點(diǎn)的編碼(0/1),alpha是學(xué)習(xí)率blas.saxpy(vectorSize, g, syn1, l2, 1, neu1e, 0, 1) // 累加 e = e + gw, e 初始化 0blas.saxpy(vectorSize, g, syn0, l1, 1, syn1, l2, 1) // 更新 w = w + gxsyn1Modify(inner) += 1}d += 1}blas.saxpy(vectorSize, 1.0f, neu1e, 0, 1, syn0, l1, 1) // 更新 x = x + esyn0Modify(lastWord) += 1}}a += 1}pos += 1 }? ? ? ?
?
總結(jié)
以上是生活随笔為你收集整理的机器学习项目搭建试验 where2go的全部?jī)?nèi)容,希望文章能夠幫你解決所遇到的問(wèn)題。
- 上一篇: 芮勇出任联想CTO,阿里巴巴获CIKM
- 下一篇: Typora配置图床