Machine learning project setup experiment: where2go
https://github.com/da248/where2go
This project feels pretty good overall. It doesn't give download links for the individual datasets and has some baffling bugs, but the error messages are fairly complete, so there was always a way to make progress.
(Checked out a similarly named repo, which doesn't seem useful: it's pure HTML calling an API — https://github.com/alex-engelmann/Where2Go)
目錄
1) Gathering Data
wikivoyage_xml_to_json.py
New York Times
2) EDA
Wiki voyage
更改默認(rèn)保存位置
Weather
Nyt
3) Model
Webapp
Final Remarks
啟動(dòng)項(xiàng)目
Anaconda常用命令
常用算法
自然語(yǔ)言處理natural language processing:
推薦系統(tǒng)中常見(jiàn)的文本處理方法:
Word2vec原理
網(wǎng)站細(xì)節(jié)
Html
Flask后臺(tái)接口
Model
模型分析
Eda
Model
基于 H-softmax 模型的梯度計(jì)算
Where2go recommends places based on the places/traits you like or dislike, rather than on which destinations have cheap flights.
There are plenty of websites that tell you the cheapest way to get to a destination and the cheapest hotel once you're there. But they forget to ask a very fundamental question... do you even know where to go? Articles like "Top 25 travel destinations in XX" or "100 places in YY you must visit!"
One motivation for this app was to build an unbiased recommender system that considers a destination's own characteristics, instead of looking at which destinations other people liked. To that end, I decided to gather destination information from travel guides. I found that Wikivoyage offers great travel guides that tell you about a place's history and culture, what to see, how to get around, and so on.
Try it out on www.where2go.help
The next question was which model to use. Traditional NLP recommender systems include models such as TF-IDF + cosine similarity and TF-IDF + SVD + k-means clustering. Those models can do a great job of finding similar destinations, but I wanted a model that would let me add place traits such as 'beach' or 'wine' to my search. So I decided to go with a model recently created by Google called word2vec. Word2vec is an amazing model that turns words into vectors capturing the words' "meaning". The cool feature of this model is that you can add and subtract words, because they are vectors. For example, you can do things like 'king' - 'man' + 'woman', which produces a vector that ~= 'queen'. My word2vec model learned the travel-specific context of the words and places covered in the Wikivoyage articles, allowing vector operations to recommend similar locations.
Using word2vec, I was able to get recommendations for the words/destinations whose semantic meaning is closest to the search query. However, I had to figure out a way to determine which recommendations were actual geographic locations and which were merely nearby words. I was able to check this using Wikivoyage's geolocation data.
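That filtering step amounts to a membership check: keep only the nearest-neighbour words that appear as keys in the geotag dictionary. A sketch with made-up names and similarity scores (the real geotag dict is the one built later from the Wikivoyage SQL dump):

```python
# Hypothetical most_similar output: (word, similarity) pairs from the model
similar = [('barcelona', 0.81), ('sangria', 0.77), ('valencia', 0.74), ('tapas', 0.70)]

# Hypothetical geotag dict keyed by place name
geotags = {'barcelona': {'gt_lat': 41.39, 'gt_lon': 2.17},
           'valencia':  {'gt_lat': 39.47, 'gt_lon': -0.38}}

# Keep only the suggestions that are actual places
destinations = [(word, sim) for word, sim in similar if word in geotags]
print(destinations)  # [('barcelona', 0.81), ('valencia', 0.74)]
```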
Once I had trained the travel-context model, I built a web application to deliver my data science project. I used JavaScript to make AJAX calls that update the results of a user query onto a MapBox map, and Bootstrap to format the pages.

*I also collected New York Times articles from the Travel, World, and Science sections (the latter has many environmental pieces) to enrich my data sources, but decided to leave them out because the results came back too "news-like".
Methodology
The code folder is divided into three sections: 1) data collection, 2) EDA, 3) model.
1) Gathering Data
Wikivoyage
There are three files for the wikivoyage data.
wikivoyage_xml_to_json.py
The purpose of this file is to convert Wikivoyage travel guide articles to JSON format. Wikivoyage provided a data dump of its articles in XML format and I converted it to JSON format to go through exploratory data analysis with pandas.
運(yùn)行:
| ImportError: No module named xmltodict |
圖形化界面安裝
| ImportError: No module named pandas |
圖形化界面安裝
| Traceback (most recent call last): ? File "wikivoyage_xml_to_json.py", line 25, in <module> ??? jdata = convert_xml_to_json('data/wikivoyage/enwikivoyage-latest-pages-articles.xml') ? File "wikivoyage_xml_to_json.py", line 12, in convert_xml_to_json ??? xml_str = open(filename).read() IOError: [Errno 2] No such file or directory: 'data/wikivoyage/enwikivoyage-latest-pages-articles.xml' |
在https://dumps.wikimedia.org/enwikivoyage/latest/找數(shù)據(jù)集
| (py2_flask) D:\anacondaProject\where2go\code\data_collection>python wikivoyage_xml_to_json.py |
成功運(yùn)行該文件后在where2go\code\data_collection\data\wikivoyage獲得wikivoyage.json一份,耶
2.wikivoyage_geotags_sql.py
The purpose of this file is to gather the geolocations of articles (places). Wikivoyage provided the geolocations of articles as a sql file. I created my own MySQL database to load in and query the data. I also did a bit of data cleaning in this file to remove the accents.
維基航行_地理標(biāo)記_sql.py
此文件的目的是收集文章(地點(diǎn))的地理位置。Wikivoyage 提供了文章的地理位置作為 sql 文件。我創(chuàng)建自己的 MySQL 數(shù)據(jù)庫(kù)來(lái)加載和查詢數(shù)據(jù)。我還在這個(gè)文件做了一些數(shù)據(jù)清理,刪除口音。
(py2_flask) D:\anacondaProject\where2go\code\data_collection>python wikivoyage_geotags_sql.py
  File "wikivoyage_geotags_sql.py", line 72
    geotag_dict = create_geotag_dict():
                                      ^
SyntaxError: invalid syntax
Try deleting that stray trailing colon.
| No module named pymysql.cursors |
pip install pymysql
Traceback (most recent call last):
  File "wikivoyage_geotags_sql.py", line 9, in <module>
    cursorclass=pymysql.cursors.DictCursor)
pymysql.err.OperationalError: (1045, u"Access denied for user 'admin'@'localhost' (using password: NO)")
查看源碼:
| # Connect to the database connection = pymysql.connect(user='admin', ???????????????????????????? db='wiki', cursorclass=pymysql.cursors.DictCursor) |
查看連接方法:https://www.cnblogs.com/woider/p/5926744.html
| pymysql.Connect()參數(shù)說(shuō)明 host(str):????? MySQL服務(wù)器地址 port(int):????? MySQL服務(wù)器端口號(hào) user(str):????? 用戶名 passwd(str):??? 密碼 db(str):??????? 數(shù)據(jù)庫(kù)名稱 charset(str):?? 連接編碼 ? connection對(duì)象支持的方法 cursor()??????? 使用該連接創(chuàng)建并返回游標(biāo) commit()??????? 提交當(dāng)前事務(wù) rollback()????? 回滾當(dāng)前事務(wù) close()???????? 關(guān)閉連接 ? cursor對(duì)象支持的方法 execute(op)???? 執(zhí)行一個(gè)數(shù)據(jù)庫(kù)的查詢命令 fetchone()????? 取得結(jié)果集的下一行 fetchmany(size) 獲取結(jié)果集的下幾行 fetchall()????? 獲取結(jié)果集中的所有行 rowcount()????? 返回?cái)?shù)據(jù)條數(shù)或影響行數(shù) close()???????? 關(guān)閉游標(biāo)對(duì)象 |
修改連接時(shí)用戶名密碼,創(chuàng)建數(shù)據(jù)庫(kù)
| pymysql.err.ProgrammingError: (1146, u"Table 'wiki.geo_tags' doesn't exist") |
看項(xiàng)目介紹中Wikivoyage 提供了文章的地理位置作為 sql 文件,繼續(xù)找數(shù)據(jù)集https://github.com/baturin/wikivoyage-listings
還是在這里找到(霧):https://dumps.wikimedia.org/hewikivoyage/latest/
| pymysql.err.ProgrammingError: (1146, u"Table 'wiki.page' doesn't exist") |
還在剛剛的頁(yè)面找到pages.sql
下載的一個(gè)sql貌似不是英文,(?????_?????????)
看到這個(gè)貌似是官網(wǎng)https://www.wikidata.org/wiki/Wikidata:Wikivoyage/Resources
全語(yǔ)言長(zhǎng)這樣:https://www.wikivoyage.org/
英文版的地址長(zhǎng)這樣:https://en.wikivoyage.org/
同理類推:https://dumps.wikimedia.org/enwikivoyage/latest/
成功找到英文版sql
| IOError: [Errno 2] No such file or directory: '../data/geotag_dict.pkl' |
查看源碼為輸出文件,新建
成功運(yùn)行
3.scrap_wikivoyage_banners.py
This file contains code that I used to scrap the banner images of articles from wikivoyage. I also used this to collect the canonical url of the wikivoyage page. I had to search destinations using a special search page on Wikivoyage to overcome minor syntax differences in place names.
此文件包含用于從 wikivoyage 中抓取文章的橫幅圖像的代碼。我也用這個(gè)來(lái)收集wikivoyage page的標(biāo)準(zhǔn)URL。我不得不在Wikivoyage上使用一個(gè)特殊的搜索頁(yè)面搜索目的地,以克服地名中的微小語(yǔ)法差異。
    self.locations = pkl.load(open('../../data/pickles/geotag_dict.pkl', 'rb'))
IOError: [Errno 2] No such file or directory: '../../data/pickles/geotag_dict.pkl'
Copied over the pkl generated earlier.
CONNECTION ERROR!!! RECONNECT TO  page
Traceback (most recent call last):
  File "scrap_wikivoyage_banners.py", line 109, in <module>
    swb.scrap_banners()
  File "scrap_wikivoyage_banners.py", line 95, in scrap_banners
    img_path, wiki_url = self.get_image_and_link(key)
  File "scrap_wikivoyage_banners.py", line 57, in get_image_and_link
    return make_default_img_url(place)
NameError: global name 'make_default_img_url' is not defined
| INDEX ERROR!!!  page did not exist |
查看源碼
def get_image_and_link(self, place):
    '''
    For a given place, get the canonical wikivoyage url and save the banner.
    If the banner is just a default banner, save the img path as the default
    banner to minimize duplicates.

    input: place as string
    output: img_path and wiki_url + (save image in the process)
    '''
    base_url = "https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search="
    full_url = base_url + place.title()

    try:
        response = requests.get(full_url).text
        soup = BeautifulSoup(response, 'html.parser')
        wiki_url = soup.find(rel='canonical')['href']
        img_src = 'https:' + soup.select('div.topbanner a.image')[0].select('img')[0]['src']

    except IndexError:
        print 'INDEX ERROR!!! %s page did not exist' % place
        return make_default_img_url(place)

    except ConnectionError:
        print 'CONNECTION ERROR!!! RECONNECT TO %s page' % place
        return make_default_img_url(place)

    if 'Pagebanner_default' in img_src or 'default_banner' in img_src:
        print '%s has default banner!' % place
        img_path = 'static/banners/default.png'

    else:
        place = place.replace('/', '_')  # REPLACE SLASH BECAUSE IT CREATES A DIRECTORY

        try:
            img_response = requests.get(img_src, stream=True)
            img_path = 'static/banners/%s.png' % place

        except IndexError:
            print 'INDEX ERROR!!! %s page did not exist' % place
            return make_default_img_url(place)

        except ConnectionError:
            print 'CONNECTION ERROR!!! RECONNECT TO %s page' % place
            return make_default_img_url(place)

        # save the img file if it doesn't already exist. if it already exists, dont overwrite.
        if not os.path.exists('../../webapp/static/banners/%s.png' % place):
            with open('../../webapp/static/banners/%s.png' % place, 'wb') as out_file:
                shutil.copyfileobj(img_response.raw, out_file)
            del img_response
            print '%s.png successfully created' % place

        else:
            print '%s.png already exists!' % place

    return img_path, wiki_url
def scrap_banners(self):
    '''
    Go through every key in the locations dictionary and scrape the wiki url and img_path.
    '''
    for key in self.locations.iterkeys():
        # print 'key %s,' % key
        img_path, wiki_url = self.get_image_and_link(key)
        self.locations[key]['wiki_url'] = wiki_url
        self.locations[key]['img_path'] = img_path
def load_location(self):
    '''
    load the geolocation data.
    '''
    self.locations = pkl.load(open('../../data/pickles/geotag_dict.pkl', 'rb'))
| 看來(lái)還是pkl中的location出問(wèn)題了,查看pkl import cPickle as pickle? ??? f = open('path')? ??? info = pickle.load(f)? ??? print info?? #show file? |
| {'': {u'gt_lat': Decimal('56.83330000'), u'page_id': 18192, u'gt_lon': Decimal('60.58330000'), u'page_len': 27110}, '__': {u'gt_lat': Decimal('49.85944444'), u'page_id': 13920, u'gt_lon': Decimal('20.27472222'), u'page_len': 3453}, "_(')": {u'gt_lat': Decimal('-53.32000000'), u'page_id': 14305, u'gt_lon': Decimal('-70.91000000'), u'page_len': 3408}, "__'/": {u'gt_lat': Decimal('-22.92000000'), u'page_id': 13410, u'gt_lon': Decimal('-43.22000000'), u'page_len': 56927}, "/-'_": {u'gt_lat': Decimal('41.94610000'), u'page_id': 14123, u'gt_lon': Decimal('-87.66940000'), u'page_len': 28496},…… |
| 這些key真的詭異極了 |
嘗試打印full url
| CONNECTION ERROR!!! RECONNECT TO https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search= page |
嘗試打印key
保存副本
更改sql語(yǔ)言版本后成功獲得正確key
INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Eastern_Cuba -- did not exist
Traceback (most recent call last):
  File "scrap_wikivoyage_banners.py", line 115, in <module>
    swb.scrap_banners()
  File "scrap_wikivoyage_banners.py", line 101, in scrap_banners
    img_path, wiki_url = self.get_image_and_link(key)
  File "scrap_wikivoyage_banners.py", line 57, in get_image_and_link
    return make_default_img_url(place)
NameError: global name 'make_default_img_url' is not defined
嘗試訪問(wèn)url:
https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Eastern_Cuba
發(fā)get請(qǐng)求的狀態(tài)碼是302
wiki可以正常訪問(wèn),但不是這個(gè)網(wǎng)址,跳轉(zhuǎn)到
| https://en.wikivoyage.org/wiki/Eastern_Cuba 和make_default_img_url中的地址一樣呢 https://en.wikivoyage.org/wiki/ |
但是https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default
這個(gè)搜索頁(yè)面還在
用搜索框搜索查看發(fā)出的請(qǐng)求是
| https://en.wikivoyage.org/w/index.php?=Eastern_Cuba&sort=relevance&search=Eastern_Cuba&title=Special%3ASearch&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1 |
Adding &fulltext=1 (or when there is no exact title match) prevents the redirect:
https://en.wikivoyage.org/w/index.php?search=Eastern_Cuba&title=Special%3ASearch&profile=advanced&fulltext=1
So the URL itself is probably fine.

make_default_img_url is not a global name; it is defined as a method, but the code calls it as if it were a global function. Some mistake is treating it as one.
Try fixing make_default_img_url and its call sites:
def make_default_img_url(self, place):
    '''
    input = place
    output = return the default values for img_path and wiki_url
    '''
    img_path = 'static/banners/default.png'
    wiki_url = 'https://en.wikivoyage.org/wiki/%s' % place
    return img_path, wiki_url

except IndexError:
    # print 'INDEX ERROR!!! %s page did not exist' % place
    print 'INDEX ERROR!!! %s -- did not exist' % full_url
    # return make_default_img_url(place)
    return self.make_default_img_url(place)

After the change some pages still can't be fetched, but the script now runs continuously; the final error:
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\requests\models.py", line 753, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ("Connection broken: error(10053, '')", error(10053, ''))
在特殊搜索界面看到一個(gè)Developers:
https://www.mediawiki.org/wiki/How_to_contribute
網(wǎng)頁(yè)API:https://www.mediawiki.org/wiki/API:Web_APIs_hub
API:Geosearch:https://www.mediawiki.org/wiki/API:Geosearch
GET 請(qǐng)求用地理位置的附近坐標(biāo)或頁(yè)面名稱搜索 wiki 頁(yè)面。
This module is supported through the Extension:GeoData currently not installed on MediaWiki but Wikipedia. So, in this document, we will use the URL en.wikipedia.org in all API endpoints.
此模塊通過(guò)擴(kuò)展支持:地理數(shù)據(jù)當(dāng)前未安裝在 MediaWiki 上,而是維基百科。因此,在本文中,我們將在所有 API 終結(jié)點(diǎn)中使用 URL en.wikipedia.org。
?
GET Request
Search for pages near Wikimedia Foundation headquarters by specifying the geographic coordinates of its location:
api.php?action=query&list=geosearch&gscoord=37.7891838|-122.4033522&gsradius=10000&gslimit=10
通過(guò)指定維基媒體基金會(huì)總部附近的頁(yè)面,指定其位置的地理坐標(biāo)
API documentation:https://en.wikipedia.org/w/api.php?action=help&modules=query+geosearch
https://en.wikivoyage.org/w/api.php?action=help&modules=query
API查閱方法https://www.mediawiki.org/wiki/API:Main_page
Examples:
Fetch site info and revisions of Main Page.
api.php?action=query&prop=revisions&meta=siteinfo&titles=Main%20Page&rvprop=user|comment&continue=
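Calling the geosearch endpoint from Python mostly comes down to building the query string. A sketch (Python 3; the parameter names come from the API:Geosearch docs quoted above, the rest is my own illustration):

```python
from urllib.parse import urlencode

def build_geosearch_url(lat, lon, radius=10000, limit=10,
                        endpoint='https://en.wikipedia.org/w/api.php'):
    """Build a MediaWiki geosearch URL (parameter names per the API:Geosearch docs)."""
    params = {
        'action': 'query',
        'list': 'geosearch',
        'gscoord': '{}|{}'.format(lat, lon),
        'gsradius': radius,
        'gslimit': limit,
        'format': 'json',
    }
    return endpoint + '?' + urlencode(params)

# Coordinates of the Wikimedia Foundation HQ, as in the docs example
url = build_geosearch_url(37.7891838, -122.4033522)
print(url)
# To actually fetch the results: requests.get(url).json()
```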
我之前用過(guò)request.urlopen,源碼為requests.get,查看這兩種區(qū)別https://blog.csdn.net/dead_cicle/article/details/86747593
構(gòu)造一個(gè)Request對(duì)象,然后使用urlopen拿回來(lái)的還是對(duì)象
requests是python實(shí)現(xiàn)的簡(jiǎn)單易用的HTTP庫(kù),返回一個(gè)HTTPresp,該類有屬性:text,content,code等。
直接打印的狀態(tài)碼為200,但還是報(bào)錯(cuò),說(shuō)明請(qǐng)求這一步是沒(méi)有問(wèn)題的
| (py2_flask) D:\anacondaProject\where2go\code\data_collection>python scrap_wikivoyage_banners.py <Response [200]> INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Eastern_Cuba -- did n ot exist |
查看bs:
https://blog.csdn.net/weixin_42231070/article/details/82225529
import urllib.request
from bs4 import BeautifulSoup

douban_path = "https://movie.douban.com"
response = urllib.request.urlopen(douban_path)
soup = BeautifulSoup(response, 'html.parser')                          # accepts a response object
soup = BeautifulSoup(response.read().decode('utf-8'), 'html.parser')   # accepts a string
soup = BeautifulSoup(open('test.html'), 'html.parser')                 # accepts a local file
Printing response.text raised an encoding error earlier, but printing the soup dumps a pile of HTML source.
Extracting wiki_url succeeds:
| soup.find(rel='canonical')['href'] |
| (py2_flask) D:\anacondaProject\where2go\code\data_collection>python scrap_wikivoyage_banners.py https://en.wikivoyage.org/wiki/Eastern_Cuba |
所以可能是取img_src的問(wèn)題
| 'https:'+soup.select('div.topbanner a.image')[0].select('img')[0]['src'] |
soup.select reference: https://blog.csdn.net/geerniya/article/details/77842421
soup.select() retrieves the content you want; the key is pinpointing it precisely via the selector string passed in.
https://blog.csdn.net/weixin_40425640/article/details/79470617
select, like find and find_all, picks out particular tags. Its selection rules are CSS-based, hence "CSS selector"; if you have used jQuery before, select's rules will look familiar.
A bare tag name returns a list (so div here is a tag name);
class names are prefixed with a dot, id names with #.

Combined lookups come in two kinds: several conditions on a single tag, or a tree-style search descending level by level.
| print soup.select('a#link2') |
選擇標(biāo)簽名為a,id為link2的tag。
猜測(cè)可能是最后的'src'下標(biāo)無(wú)效
查找select('img') https://www.jianshu.com/p/ed2f044bd1fa
Tag或BeautifulSoup對(duì)象的.select()方法。
res = soup.select('#wrapperto')     # select by tag id
res = soup.select('img[src]')       # 'img' tags that have a 'src' attribute
res = soup.select('img[src=...]')   # 'img' tags whose 'src' attribute equals ...
Finding an img src with soup.select:
https://www.cnblogs.com/calmzone/p/11139980.html
# soup.a.attrs          # all attributes and values of the a tag, as a dict
# soup.a.attrs['href']  # get the href attribute
# soup.a['href']        # shorthand for the same
# both ways retrieve the href value of the a tag
https://blog.csdn.net/weixin_42231070/article/details/82225529
當(dāng)屬性不存在時(shí),使用 get 返回None,字典形式取值會(huì)報(bào)錯(cuò)
| print soup.select('div.topbanner a.image') |
| (py2_flask) D:\anacondaProject\where2go\code\data_collection>python scrap_wikivoyage_banners.py [] |
難道這返回了一個(gè)空數(shù)組,topbanner類的div中根本就沒(méi)有image類的a
查看https://en.wikivoyage.org/wiki/Eastern_Cuba的源碼
發(fā)現(xiàn)含有topbanner類的div是有的,但是有兩個(gè),而且這個(gè)類名字只是包含,是好幾個(gè)類其中有個(gè)wpb-topbanner
一個(gè)div元素為了能被多個(gè)樣式表匹配到(樣式復(fù)用),通常div的class中由好幾段組成,如<div class="user login">能被.user和.login兩個(gè)選擇器選中。如果這兩個(gè)選擇器中有相同的屬性值,則該屬性值先被改為.user中的值,再被改為.login中的值,即重復(fù)的屬性以最后一個(gè)選擇器中的屬性值為準(zhǔn)。(這個(gè)div就有好幾個(gè)類)
嘗試改select中的類名
| ?????????? a_img_tag=soup.select('div.wpb-topbanner a.image') ?????????? print a_img_tag ?????????? # print soup.select('div.topbanner a.image')[0].select('img')[0] ? ?????????? # img_src = 'https:' + soup.select('div.topbanner a.image')[0].select('img')[0]['src'] ?????????? img_src = 'https:' + soup.select('div.wpb-topbanner a.image')[0].select('img')[0]['src'] |
打印不再是空數(shù)組了
(py2_flask) D:\anacondaProject\where2go\code\data_collection>python scrap_wikivoyage_banners.py
[<a class="image" dir="ltr" href="/wiki/File:WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg" title="Eastern Cuba"><img class="wpb-banner-image" src="https://upload.wikimedia.org/wikipedia/commons/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg" srcset="https://upload.wikimedia.org/wikipedia/commons/thumb/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg/640px-WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg 640w,https://upload.wikimedia.org/wikipedia/commons/thumb/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg/1280px-WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg 1280w,https://upload.wikimedia.org/wikipedia/commons/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg 2560w"/></a>]
Traceback (most recent call last):
  File "scrap_wikivoyage_banners.py", line 123, in <module>
    swb.scrap_banners()
  File "scrap_wikivoyage_banners.py", line 109, in scrap_banners
    img_path, wiki_url = self.get_image_and_link(key)
  File "scrap_wikivoyage_banners.py", line 80, in get_image_and_link
    img_response = requests.get(img_src, stream=True)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\requests\api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\requests\sessions.py", line 519, in request
    prep = self.prepare_request(req)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\requests\sessions.py", line 462, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\requests\models.py", line 313, in prepare
    self.prepare_url(url, params)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\requests\models.py", line 390, in prepare_url
    raise InvalidURL("Invalid URL %r: No host supplied" % url)
requests.exceptions.InvalidURL: Invalid URL u'https:https://upload.wikimedia.org/wikipedia/commons/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg': No host supplied
找到的只有一個(gè)a標(biāo)簽,里面也只有一個(gè)img標(biāo)簽
Src中的https://upload.wikimedia.org/wikipedia/commons/7/7d/WV_banner_Easter可以訪問(wèn),出去額外添加的“https:”,報(bào)錯(cuò)
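The root cause: the scraper assumed protocol-relative src values ('//upload.wikimedia.org/...'), but the page now serves absolute URLs, so unconditionally prepending the scheme yields 'https:https://...'. A guard that prepends only when needed (my own sketch, not code from the repo):

```python
def ensure_scheme(src, scheme='https:'):
    """Prepend the scheme only for protocol-relative URLs like '//host/path'."""
    if src.startswith('//'):
        return scheme + src
    return src

print(ensure_scheme('//upload.wikimedia.org/a.jpg'))        # https://upload.wikimedia.org/a.jpg
print(ensure_scheme('https://upload.wikimedia.org/a.jpg'))  # unchanged
```

This keeps the scraper working whether the wiki serves relative or absolute image URLs.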
(py2_flask) D:\anacondaProject\where2go\code\data_collection>python scrap_wikivoyage_banners.py
[<a class="image" dir="ltr" href="/wiki/File:WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg" title="Eastern Cuba"><img class="wpb-banner-image" src="https://upload.wikimedia.org/wikipedia/commons/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg" srcset="https://upload.wikimedia.org/wikipedia/commons/thumb/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg/640px-WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg 640w,https://upload.wikimedia.org/wikipedia/commons/thumb/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg/1280px-WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg 1280w,https://upload.wikimedia.org/wikipedia/commons/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg 2560w"/></a>]
CONNECTION ERROR!!! RECONNECT TO eastern_cuba page
Traceback (most recent call last):
  File "scrap_wikivoyage_banners.py", line 123, in <module>
    swb.scrap_banners()
  File "scrap_wikivoyage_banners.py", line 109, in scrap_banners
    img_path, wiki_url = self.get_image_and_link(key)
  File "scrap_wikivoyage_banners.py", line 89, in get_image_and_link
    return make_default_img_url(place)
NameError: global name 'make_default_img_url' is not defined
嘗試打印請(qǐng)求的response
| (py2_flask) D:\anacondaProject\where2go\code\data_collection>python scrap_wikivoyage_banners.py https://upload.wikimedia.org/wikipedia/commons/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg <Response [200]> Traceback (most recent call last): ? File "scrap_wikivoyage_banners.py", line 124, in <module> ??? swb.scrap_banners() ? File "scrap_wikivoyage_banners.py", line 110, in scrap_banners ??? img_path, wiki_url = self.get_image_and_link(key) ? File "scrap_wikivoyage_banners.py", line 94, in get_image_and_link ??? with open('../../webapp/static/banners/%s.png' % place, 'wb') as out_file: IOError: [Errno 2] No such file or directory: '../../webapp/static/banners/eastern_cuba.png' |
嘗試創(chuàng)建banners
| (py2_flask) D:\anacondaProject\where2go\code\data_collection>python scrap_wikivoyage_banners.py https://upload.wikimedia.org/wikipedia/commons/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg <Response [200]> eastern_cuba.png successfully created https://upload.wikimedia.org/wikipedia/commons/thumb/5/53/Ardrossan_-_SA_WV_Banner.jpg/2560px-Ardrossan_-_SA_WV_Banner.jpg <Response [200]> ardrossan_(south_australia).png successfully created |
不想斷網(wǎng)的時(shí)候爬信息一直往下滾,漏過(guò)了好多,嘗試在爬圖片網(wǎng)址的時(shí)候加了sleep
| import time ? except ConnectionError: ?????????? # print 'CONNECTION ERROR!!! RECONNECT TO %s page' % place ?????????? print 'CONNECTION ERROR!!! RECONNECT TO -- %s ' % full_url ?????????? time.sleep(20) ?????????? # return make_default_img_url(place) ?????????? return self.make_default_img_url(place) |
這樣斷網(wǎng)的時(shí)候error就不會(huì)一直刷屏了,給我一點(diǎn)時(shí)間,把網(wǎng)重新連上,怎么下到一半anaconda還卡了呢= =
這回爬得順利一點(diǎn),少量index error
INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Fjrland&ns0=1 -- img src did not exist
Fjrland can't be found by searching Wikivoyage directly either ("There were no results matching the query"); the dropdown suggests a name with an æ in it (which I can't type). The real link is https://en.wikivoyage.org/wiki/Fj%C3%A6rland

INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Heisy_Bordel/Prague/East_Bank_Of_Vltava&ns0=1 -- img src did not exist
Heisy_Bordel/Prague/East_Bank_Of_Vltava can't be found either; Heisy Bordel is a contributing user of Prague/East_Bank_Of_Vltava. The real link is https://en.wikivoyage.org/wiki/Prague/East_bank_of_Vltava

INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Berlinichthyosaur_State_Park&ns0=1
The dropdown shows Berlin–Ichthyosaur State Park; the real link is https://en.wikivoyage.org/wiki/Berlin%E2%80%93Ichthyosaur_State_Park and the image is https://en.wikivoyage.org/wiki/File:Berlin%E2%80%93Ichthyosaur_State_Park_banner.JPG

INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Hafnarfjorur&ns0=1 -- img src did not exist
The real link is https://en.wikivoyage.org/wiki/Hafnarfj%C3%B6r%C3%B0ur

INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Drivingukbanner1.Jpg&ns0=1 -- img src did not exist
Drivingukbanner1.Jpg is strange: how did a place name turn into a .jpg? And I don't know how to look up the "Driving uk" part either.

INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Owl_Ad_Wouters.Jpg&ns0=1
Possibly https://en.wikivoyage.org/wiki/Ad%27s_Path

INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Snogebk&ns0=1 -- img src did not exist
The real link is https://en.wikivoyage.org/wiki/Snogeb%C3%A6k

INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Nstved&ns0=1 -- img src did not exist
The real link is https://en.wikivoyage.org/wiki/N%C3%A6stved
因?yàn)榫W(wǎng)老斷,需要重復(fù)多次運(yùn)行,每次都重復(fù)請(qǐng)求url然后判斷圖片存在太慢了,先判斷一波
def get_image_and_link(self, place):
    '''
    For a given place, get the canonical wikivoyage url and save the banner.
    If the banner is just a default banner, save the img path as the default
    banner to minimize duplicates.

    input: place as string
    output: img_path and wiki_url + (save image in the process)
    '''
    if not os.path.exists('../../webapp/static/banners/%s.png' % place):
        # look over before request
        base_url = "https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search="
        full_url = base_url + place.title()
        # print 'place %s,' % place
        # print 'place_title %s' % place.title()

        try:
            response = requests.get(full_url).text
            soup = BeautifulSoup(response, 'html.parser')
            wiki_url = soup.find(rel='canonical')['href']
            a_img_tag = soup.select('div.wpb-topbanner a.image')
            # img_src = 'https:' + soup.select('div.topbanner a.image')[0].select('img')[0]['src']
            img_src = soup.select('div.wpb-topbanner a.image')[0].select('img')[0]['src']  # mark

        except IndexError:
            # print 'INDEX ERROR!!! %s page did not exist' % place
            print 'INDEX ERROR!!! %s -- img src did not exist' % wiki_url
            # return make_default_img_url(place)
            return self.make_default_img_url(place)

        except ConnectionError:
            # print 'CONNECTION ERROR!!! RECONNECT TO %s page' % place
            print 'CONNECTION ERROR!!! RECONNECT TO -- %s ' % full_url
            time.sleep(20)
            # return make_default_img_url(place)
            return self.make_default_img_url(place)

        if 'Pagebanner_default' in img_src or 'default_banner' in img_src:
            print '%s has default banner!' % place
            img_path = 'static/banners/default.png'

        else:
            place = place.replace('/', '_')  # REPLACE '/' with '_' BECAUSE IT CREATES A DIRECTORY

            try:
                img_response = requests.get(img_src, stream=True)
                # print img_response
                img_path = 'static/banners/%s.png' % place

            except IndexError:
                print 'INDEX ERROR!!! %s img did not exist' % place
                return self.make_default_img_url(place)

            except ConnectionError:
                # print 'CONNECTION ERROR!!! RECONNECT TO %s img' % place
                print 'CONNECTION ERROR!!! RECONNECT TO %s img' % img_src
                return self.make_default_img_url(place)

            # save the img file if it doesn't already exist. if it already exists, dont overwrite.
            if not os.path.exists('../../webapp/static/banners/%s.png' % place):
                with open('../../webapp/static/banners/%s.png' % place, 'wb') as out_file:
                    shutil.copyfileobj(img_response.raw, out_file)
                del img_response
                print '%s.png successfully created' % place

            else:
                print '%s.png already exists!' % place
        # look over before request
    else:
        print '%s.png already exists!' % place
        return self.make_default_img_url(place)
    return img_path, wiki_url
(What an ordeal. My machine keeps losing its network connection and I can't get it fixed. I went to the studio to borrow their network for the image downloads, but the teacher said officials were visiting and we couldn't stay.)
Scraping on a flaky network is really too hard; fall back to the model that uses default URLs and leave the remaining scraping to someone else.
Once everything has run, a geotag_imglink_wikibanner.pkl is produced in D:\anacondaProject\where2go\data
New York Times
4.nyt_articles_api.py
This file was used to gather the most recent NYT articles in the World, Science, and Travel sections. MongoDB was used to save the articles fetched with the official NYT API. The data was collected but not incorporated into the model, because the articles contained too much news-like semantics.
| ImportError: No module named pymongo |
去蹭網(wǎng)下叭
| pip install pymongo |
運(yùn)行了一會(huì)兒后報(bào)錯(cuò)
| pymongo.errors.ServerSelectionTimeoutError: localhost:27017: [Errno 10061] |
想起來(lái)這個(gè)是要mongodb的
https://www.jianshu.com/p/c9777b063593
https://blog.csdn.net/huasonl88/article/details/51755621
MongoDB 不同于關(guān)系型結(jié)構(gòu)的三層結(jié)構(gòu)——database--> table --> record,它的層級(jí)為 database -->collection --> document
https://blog.csdn.net/zwq912318834/article/details/77689568
import pymongo

# address and port of the mongodb service
mongo_url = "127.0.0.1:27017"

# connect to mongodb; defaults to "localhost:27017" if no argument is given
client = pymongo.MongoClient(mongo_url)

# connect to the database myDatabase
DATABASE = "myDatabase"
db = client[DATABASE]

# connect to the collection (table) myDatabase.myCollection
COLLECTION = "myCollection"
db_coll = db[COLLECTION]

# find records in myCollection whose date field equals 2017-08-29, sorted by age descending
queryArgs = {'date': '2017-08-29'}
search_res = db_coll.find(queryArgs).sort('age', -1)
for record in search_res:
    print(f"_id = {record['_id']}, name = {record['name']}, age = {record['age']}")
源碼:
# Define the MongoDB database and table
db_cilent = MongoClient()
db = db_cilent['nyt_dump']
table = db['articles']
'''
Get all the links, visit the page and scrape the content
'''
if not section:
    links = table.find({'content_txt': {'$exists': False}}, {'web_url': 1})
else:
    links = table.find({'$and': [{'content_txt': {'$exists': False}},
                                 {'section_name': section}]}, {'web_url': 1})
開(kāi)啟mongodb
| D:\Program Files\Mongo\bin>mongod.exe --dbpath "D:\MongoDB\DBData" |
Mongo還用不了了,卸載重裝https://www.cnblogs.com/6luv-ml/p/9174818.html
看了下,可能因?yàn)樯洗沃匮b系統(tǒng)的問(wèn)題,程序與功能里并沒(méi)有mongodb,直接刪除了安裝
沒(méi)再報(bào)錯(cuò)了(沒(méi)看到寫(xiě)入文件,有沒(méi)有數(shù)據(jù)也不想管了-反正后面可能也用不著)
Service Name:MongoDB
Data Directory:D:\Program Files\Mongo\data\
2) EDA
Exploratory data analysis and data cleaning have been performed with ipython notebook. Wikivoyage and NYT data were loaded, cleaned, pickled out as input format for word2vec, which is a list of sentences where each sentence is represented as a list of words. Also, global NOAA weather data was downloaded but I later determined that it leaves out major parts of the world. Thus, more data has to be collected to incorporate weather to the project.
Ipython notebook已執(zhí)行探索性數(shù)據(jù)分析和數(shù)據(jù)清理。Wikivoyage 和 NYT 數(shù)據(jù)被加載、清理、挑選出來(lái)作為 word2vec 的輸入格式,該格式是句子列表,其中每個(gè)句子都表示為單詞列表。此外,全球NOAA天氣數(shù)據(jù)被下載,但我后來(lái)確定,它忽略了世界的主要部分。因此,要將天氣納入項(xiàng)目必須收集更多的數(shù)據(jù)。
Wiki voyage
| (py2_flask) D:\anacondaProject\where2go\code\data_collection>ipython Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)] Type 'copyright', 'credits' or 'license' for more information IPython 6.5.0 -- An enhanced Interactive Python. Type '?' for help. |
IPython comes preinstalled with Anaconda.
The notebooks were probably made with Jupyter; opened directly, the file looks like a pile of JSON.
Jupyter introduction: http://baijiahao.baidu.com/s?id=1601883438842526311&wfr=spider&for=pc
Jupyter Notebooks shine most while you are still prototyping, because code is written in independent cells that execute separately: you can test a specific block of code in a project without executing from the start.
To run Jupyter Notebooks, just type the following on the command line:
jupyter notebook
Jupyter will then open in your default web browser at:
http://localhost:8888/tree
In some cases it may not open automatically; instead a URL with a token key is printed in the terminal/command line, and you need to copy that entire URL, token included, into your browser before you can open a notebook.
Once a notebook is open, you will see three tabs at the top: Files, Running, and Clusters. Files simply lists all files, Running shows the terminals and notebooks you currently have open, and Clusters is provided by IPython parallel.
To open a new Jupyter notebook, click "New" on the right side of the page. You will see four options to choose from:
Python 3, Text File, Folder, Terminal
Choosing Text File gives you an empty pane where you can add any letters, words and numbers. It is essentially a text editor (similar to Ubuntu's): you can pick a language (there are many options) and write scripts, and also find and replace words in the file.
Choosing Folder creates a new folder in which you can put files; you can rename it, delete it, and so on.
Terminal works exactly like the terminal on a Mac or Linux machine (or cmd on Windows), running a terminal session inside your web browser. Type python in this terminal and you can start writing Python scripts!
?
In the menu above the code you have options for operating on individual cells: add, edit, cut, move cells up or down, run the code in a cell, stop code, save your work, and restart the kernel.
?
The cell-type drop-down menu has four options:
Code — self-explanatory, where you write code. Markdown — where you write text; you can add conclusions, comments and so on after running a piece of code. Raw NBConvert — a command-line tool for converting your notebook into another format (such as HTML). Heading — where you add titles so that different sections are separated and the notebook looks tidier; this has since been folded into the Markdown option itself: type "##" and what follows is treated as a heading.
You have surely seen magic functions like %clear, %autosave, %debug and %mkdir before. Magic commands run in two ways:
line-wise and cell-wise
As the names suggest, line magics execute a single line of commands, while cell magics execute not just one line but the entire code block in a cell.
In line mode every command must start with %; in cell mode every command must start with %%.
?
Shortcuts are one of Jupyter Notebooks' biggest strengths: to run any block of code, just press Ctrl+Enter.
Jupyter Notebooks offer two keyboard input modes, command and edit. Command mode binds keys to notebook-level commands and is indicated by a grey cell border with a blue left margin; edit mode lets you type text (or code) into the active cell and is indicated by a green cell border.
You can jump between command mode and edit mode with Esc and Enter respectively.
?
As mentioned before, Ctrl + Enter runs the current cell.
Alt + Enter not only runs the cell but also adds a new cell below it.
Ctrl + Shift + F opens the command palette.
For the full list of keyboard shortcuts, press "H" in command mode or go to Help > Keyboard Shortcuts.
Saving and sharing your notebook
When I have to write a blog post, my code and comments all sit in a Jupyter file, and I first need to convert them into another format. Remember that these notebooks are JSON, which is not very helpful for sharing; I can hardly paste individual cells into emails and blog posts, right?
Go to the "File" menu and you will see the "Download As" option:
You can save your notebook in any of seven formats. The most commonly used are .ipynb and .html: an .ipynb file lets others copy your code onto their machine, while an .html file opens as a web page (handy when you need to keep the images embedded in a notebook).
You can also convert notebooks manually to formats such as HTML or PDF with the nbconvert tool.
You can also use jupyterhub (https://github.com/jupyterhub/jupyterhub), which hosts notebooks on a server for multi-user sharing; many top research projects collaborate this way.

Sometimes your file holds a very large amount of code. See whether code you consider unimportant can be hidden away and referenced later; this keeps the notebook clean and clear, which is precious. See this matplotlib notebook for how concisely things can be presented: http://nbviewer.jupyter.org/github/jrjohansson/scientific-python-lectures/blob/master/Lecture-4-Matplotlib.ipynb
One more bonus tip! When you want to build a presentation, the first tools that come to mind are probably PowerPoint and Google Slides — but Jupyter Notebooks can create slides too!
Changing the default save location
- Open the Windows cmd and run jupyter notebook --generate-config, as in the screenshot:
The path shown is D:\Users…; go to that path and edit the jupyter_notebook_config.py file.
Open the file and find
## The directory to use for notebooks and kernels.
#c.NotebookApp.notebook_dir = ''
Change it to
## The directory to use for notebooks and kernels.
c.NotebookApp.notebook_dir = 'E:\Jupyter'
where E:\Jupyter is my workspace; change it to your own.
Note:
1. The # in #c.NotebookApp.notebook_dir = '' must be deleted, with no space left in front of the line.
2. The E:\Jupyter folder must be created in advance; otherwise Jupyter Notebook cannot find it and will exit immediately on launch.
My cmd has no Jupyter on the PATH, so jupyter notebook --generate-config cannot run there; I edited the config from within Anaconda, and launch it from Anaconda as well.
(base) C:\Users\Lenovo>jupyter notebook
Backslashes may be interpreted as escape characters:
| c.NotebookApp.notebook_dir = 'D:\\anacondaProject' |
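A standalone illustration of the escaping issue (plain Python, not part of the config file itself): doubling the backslash and using a raw string produce the same path value.

```python
# A bare backslash in a normal Python string can start an escape sequence,
# so either double it or use a raw string; both spell the same path.
escaped = 'D:\\anacondaProject'
raw = r'D:\anacondaProject'
assert escaped == raw

# written unescaped, '\a' would silently become the single BEL control character:
assert len('\a') == 1
```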
?
Tried running directly in the base environment and made a copy of the Wiki voyage EDA notebook.
Error:
| ModuleNotFoundError: No module named 'gensim' |
Ignoring this for now; I will decide later which environment to install it into.
It turns out the py2 environment, although Jupyter was never explicitly installed into it, can also run notebooks, with exactly the same configuration as base (there is even a py3 badge in the top-right corner).
Solution: http://www.360doc.com/content/17/0413/22/1489589_645405947.shtml
Jupyter Notebook environments are tied to kernels. Searching for kernel.json with Everything finds
/jupyter/kernels/python3/kernel.json
The (py27) environment is still missing ipykernel:
conda install ipykernel
Switching kernels:
https://blog.csdn.net/castle_cc/article/details/77476081
| python -m pip install ipykernel python -m ipykernel install --user |
Kernel switched successfully.
Running it fails:
| FileNotFoundError: [Errno 2] No such file or directory: '../data/wikivoyage.json' |
Copied the files generated by the data-collection scripts into D:\anacondaProject\where2go\code\data.
Error:
| ImportError: matplotlib is required for plotting. |
https://www.cnblogs.com/star-zhao/p/9726212.html
Tried restarting the IDE and re-running everything; error:
| LookupError Traceback (most recent call last) <ipython-input-36-3b51d0f0aedc> in <module>() 5 # final_articles_words[key] = convert_article_into_list_of_words(value) 6 #print article ----> 7 final_articles_words[key] = convert_article_into_list_of_words(value) <ipython-input-33-e13c4daff3a0> in convert_article_into_list_of_words(article) 14 text = clean_paragraph(text) 15 #tokenize paragraph to sentences ---> 16 sentences = sent_tokenize(text) 17 18 for sentence in sentences: |
| LookupError: ********************************************************************** Resource punkt not found. Please use the NLTK Downloader to obtain the resource: >>> import nltk >>> nltk.download('punkt') For more information see: https://www.nltk.org/data.html Attempted to load tokenizers/punkt/english.pickle |
https://blog.csdn.net/qq_31747765/article/details/80307450
| # at the command line:
python
>>> import nltk
>>> nltk.download() |
In the downloader window, switch to the Models tab and find punkt.
Pick one of the paths searched in the error message and set it as the Download Directory:
| D:\\ProgramData\\Anaconda3\\envs\\py2_flask\\nltk_data |
This finally produces ../data/wikivoyage_list_of_words.pkl.
Weather
Error:
| No module named haversine |
| pip install haversine |
Error:
| No such file or directory: '../../data/pickles/geotag_imglink_wikiurl.pkl' |
I could not find any code that writes this file as output; both weather_normals_eda-checkpoint and this notebook only read it in.
Tried renaming the pickle generated just now to this name.
Error:
| IOError: [Errno 2] No such file or directory: '../data/weather/ghcnm.tavg.v3.3.0.20150624.qca.dat' |
https://www.jianshu.com/p/3d4b606ec359
The Global Historical Climatology Network monthly (GHCNm) dataset is a set of monthly climate summaries from thousands of weather stations around the world. The earliest station observations in the monthly data date back to the 18th century. Some station records are purely historical and no longer updated, while many other stations are still operating and provide short-time-delay updates useful for climate monitoring.
That page gives the dataset's address: https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-monthly-version-4
Data address: https://www.ncei.noaa.gov/data/global-historical-climatology-network-monthly/
GHCN: https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-ghcn
The v4 version only offers qcf, qcu and qfe variants, with no qca, so I decided to download v3 after all.
The PHA has been extensively evaluated (e.g., Williams et al., 2012), and GHCNm v4 data are provided both homogenized (adjusted) and unhomogenized (unadjusted). Homogenized data are identified by the string "qcf" and unhomogenized data by "qcu". As described in Menne et al. (2018), the PHA is run periodically as an ensemble to quantify the uncertainty of homogenization; other sources of uncertainty are evaluated as well.
Put the files into the expected folder and update the date in the filename being read:
| globaldata = pd.read_fwf('../data/weather/ghcnm.tavg.v3.3.0.20190821.qca.dat',header = None, widths=widths) |
Error:
| IOError: [Errno 2] No such file or directory: '../data/weather/ghcnm.tavg.v3.3.0.20150624.qca.inv' |
Unzip the archive and change the filename in the read call as well.
Warning:
| D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\ipykernel_launcher.py:6: FutureWarning: The current behaviour of 'Series.argmin' is deprecated, use 'idxmin' instead. The behavior of 'argmin' will be corrected to return the positional minimum in the future. For now, use 'series.values.argmin' or 'np.argmin(np.array(values))' to get the position of the minimum row.
D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\ipykernel_launcher.py:7: DeprecationWarning: .ix is deprecated. Please use .loc for label based indexing or .iloc for positional indexing
See the documentation here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
import sys

# ndarray compat
argmin = idxmin
argmax = idxmax
So argmin/argmax are just aliases here.

ix, loc and iloc are all used to fetch a given row or column of data:

      col1  col2  col3
row1     1     2     3
row2     4     5     6
row3     7     8     9

.loc[] is primarily label based, but may also be used with a boolean array. It indexes purely by label position (not by subscript); the label positions are the 'row1', 'row2' defined above. Usage (row1 being a row label): print df.loc['row1']

.iloc[] is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. It indexes purely by row number, i.e. rows 0, 1, 2: print df.iloc[0]

.ix[] supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type. It accepts labels, row numbers, or a combination of the two (this indexer is deprecated; use the two above instead). |
| Source:
closest_station = distance.argmin()
temps = master.ix[closest_station][months] |
| Fix:
closest_station = distance.idxmin()
temps = master.loc[closest_station][months] |
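A small self-contained illustration of the two replacements, with made-up station data (not the project's): idxmin returns the label of the minimum, which is what the deprecated label-based Series.argmin used to do, and .loc is the label-only replacement for the mixed indexer .ix.

```python
# Toy data standing in for the weather notebook's variables (assumptions):
# `distance` maps station labels to distances, `master` holds their temperatures.
import pandas as pd

distance = pd.Series([12.5, 3.2, 48.0], index=['st_a', 'st_b', 'st_c'])
master = pd.DataFrame({'jan': [1.0, 2.0, 3.0], 'feb': [4.0, 5.0, 6.0]},
                      index=['st_a', 'st_b', 'st_c'])

closest_station = distance.idxmin()   # label of the smallest distance
temps = master.loc[closest_station]   # row selected purely by label
```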
Everything now runs to completion.
Nyt
Error:
| ImportError: No module named nyt_articles_api |
This module comes from the earlier data-collection step.
How to import your own Python modules in Jupyter:
https://blog.csdn.net/w371500241/article/details/55809362
https://www.cnblogs.com/master-pokemon/p/6136483.html
Put the module in the same directory, or:
| import sys sys.path.append('e:/workspace/Modules') import Hello Hello.hello() |
Jupyter cannot resolve absolute paths written directly, but relative paths work; add:
| import sys sys.path.append('..\data_collection') |
Error:
| ServerSelectionTimeoutError: localhost:27017: [Errno 10061] |
Presumably MongoDB again.
The version installed this time apparently ships with a server that started on its own; no further errors, and the output file is generated:
../data/nyt_articles_word_list.pkl
?
Launching Jupyter:
| activate py2_flask jupyter notebook |
?
3) Model
Where2go is based on a model created at Google called word2vec. Word2vec is a neural network with one hidden layer, trained with either a continuous bag of words (CBOW) or a skip-gram objective. Where2go uses the skip-gram variant with hierarchical softmax for optimization.
At a high level, word2vec trains the neural network to parametrize a model that can predict the surrounding words for every word in the corpus. The predictions are then used to backpropagate and optimize the parameters so that words with similar contexts move closer together, while moving further away from words with different contexts. The input-to-hidden weight matrix, which is also the vector representation of the words, is then used to gain insight into the meaning/similarity of words.
In my where2go_model.py file, I implemented gensim's word2vec model and wrote functions to vectorize user search queries and functions to filter the recommendations to actual geolocations and output destinations in geojson format.
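Word2vec's additive vector arithmetic (the 'king' - 'man' + 'woman' ≈ 'queen' trick that also powers the place queries) can be illustrated with tiny hand-made vectors. These 3-d toy embeddings are assumptions, not trained word2vec output; the point is only the mechanics: add/subtract vectors, then rank candidates by cosine similarity.

```python
# Toy 3-d "embeddings" (made up for illustration, not trained vectors).
import math

vectors = {
    'king':  [0.9, 0.8, 0.1],
    'queen': [0.9, 0.1, 0.8],
    'man':   [0.1, 0.9, 0.1],
    'woman': [0.1, 0.2, 0.8],
    'apple': [0.5, 0.5, 0.5],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# king - man + woman, then find the nearest remaining word by cosine similarity
query = [k - m + w for k, m, w in
         zip(vectors['king'], vectors['man'], vectors['woman'])]
best = max((w for w in vectors if w not in ('king', 'man', 'woman')),
           key=lambda w: cosine(query, vectors[w]))
```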
| activate py2_flask cd /d D:\anacondaProject\where2go\code\model python where2go_model.py |
Error:
| IOError: [Errno 2] No such file or directory: '../../data/pickles/geo_imglink_wikiurl.pkl' |
The original name was geotag_imglink_wikiurl; back the file up and rename it.
Error:
| IOError: [Errno 2] No such file or directory: '../../data/pickles/wikivoyage_list_of_words.pkl' |
Copy in the file produced by the EDA run above.
Warning:
| D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\gensim\models\base_any2vec.py:743: UserWarning: C extension not loaded, training will be slow. Install a C compiler and reinstall gensim for fast training. ? "C extension not loaded, training will be slow. " |
https://blog.csdn.net/menghuanguaishou/article/details/90546838
| pip uninstall gensim pip install gensim==3.6 |
Warning:
| D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\gensim\utils.py:1212: UserWarning: detected Windows; aliasing chunkize to chunkize_serial ? warnings.warn("detected Windows; aliasing chunkize to chunkize_serial") |
https://blog.csdn.net/qq_41185868/article/details/88344862
Reportedly harmless.
Warning:
| D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\gensim\models\phrases.py:598: UserWarning: For a faster implementation, use the gensim.models.phrases.Phraser class ? warnings.warn("For a faster implementation, use the gensim.models.phrases.Phraser class") |
| The source uses:
bigram = gensim.models.Phrases(self.wikivoyage_list, min_count=10)
trigram = gensim.models.Phrases(bigram[self.wikivoyage_list], min_count=10) |
Oddly, I could not find anything on fixing this warning.
https://blog.csdn.net/lwhsyit/article/details/82750218
This step should generate ../../data/pickles/where2go_model.pkl.
?
Webapp
I was able to launch my own website using python Flask. I used javascript to perform AJAX calls for the search engine so that I could run a user's search query on my model to predict the most similar places and show my recommendations on the map. The Flask file is named 'app.py' and can be found in the folder 'webapp'; the 'index.html' file contains the html and javascript and can be found in the folder 'templates'. I used Bootstrap to design my website.
Final Remarks
This project has been very fun and intellectually challenging. I started this application as a capstone project but there are many things I would like to add to this app. I really want to add more travel guide data to make my results more robust, add historical weather data to help users decide when to go to a destination, and add average flight and hotel costs to help users choose plausible places. If you have any comments and recommendations for this project, please feel free to contact me.
Launching the project
The project is Python 2; I needed to make py2 usable in my Anaconda setup, where py3 and py2 coexist.
Quite a few modules turned out to be required, so I decided to create a fresh py2 virtual environment in Anaconda.
GUI route (the fetching step takes quite a while):
https://www.cnblogs.com/zimo-jing/p/7834808.html?utm_source=debugrun&utm_medium=referral
Command line: https://jingyan.baidu.com/article/455a9950500494a166277808.html
Installing modules:
| Traceback (most recent call last): ? File "app.py", line 10, in <module> ??? from where2go_model import Where2Go_Model ? File "../code/model\where2go_model.py", line 1, in <module> ??? import gensim ImportError: No module named gensim |
pip install gensim (installing through the graphical interface kept erroring out and failing)
Error: no version satisfies the requirement
pip install --upgrade gensim
?
Installation failed (probably a network issue):
| File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\pip\_vendor\urllib3\response.py", line 374, in _error_catcher ??? raise ReadTimeoutError(self._pool, None, 'Read timed out.') ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out. |
Repeat the install command.
?
Missing module:
| Traceback (most recent call last): ? File "app.py", line 1, in <module> ??? from flask import Flask ImportError: No module named flask |
Installed through the graphical interface.
?
| Traceback (most recent call last): ? File "app.py", line 10, in <module> ??? from where2go_model import Where2Go_Model ? File "../code/model\where2go_model.py", line 7, in <module> ??? from bs4 import BeautifulSoup ImportError: No module named bs4 |
Install beautifulsoup4.
The graphical-interface install failed, so:
pip install bs4
Installed successfully.
?
When I run the web app in a python2.7 environment with all the dependencies, I get the following error:
| Traceback (most recent call last): ? File "app.py", line 44, in <module> ??? app.where2go = load_pickle() ? File "app.py", line 19, in load_pickle ??? return pkl.load(open('../data/pickles/where2go_model.pkl', 'rb')) IOError: [Errno 2] No such file or directory: '../data/pickles/where2go_model.pkl' |
?
Check the source:
| webapp/app.py
code/model/where2go_model.py |
Try using the code in the code folder to collect the datasets.
Startup output:
| (py2_flask) D:\anacondaProject\where2go\webapp>python app.py D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\gensim\utils.py:1212: UserWarning: detected Windows; aliasing chunkize to chunkize_serial ? warnings.warn("detected Windows; aliasing chunkize to chunkize_serial") ?* Serving Flask app "app" (lazy loading) ?* Environment: production ?? WARNING: This is a development server. Do not use it in a production deployment. ?? Use a production WSGI server instead. ?* Debug mode: on ?* Restarting with stat D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\gensim\utils.py:1212: UserWarning: detected Windows; aliasing chunkize to chunkize_serial ? warnings.warn("detected Windows; aliasing chunkize to chunkize_serial") ?* Debugger is active! ?* Debugger PIN: 232-882-558 ?* Running on http://0.0.0.0:80/ (Press CTRL+C to quit) |
The page opens successfully.
Usage:
| To search travel destinations, you can
1. Type in destinations and/or characteristics
2. Put a + (add) or - (subtract) sign in front of a word to express a preference
3. Multiply a word by a number to boost (greater than 1.0) or dampen (less than 1.0) its influence

where2go tends to recommend places at the same level of description as the input; this means that when searching for a city, it is more likely to return city names than country names. You are best off entering...
Individual countries/cities: Spain
Adding places of the same description level: hong kong + singapore
Adding places + a characteristic: french polynesia + guam + scuba diving

Search tips:
Try to put in at least one place; word2vec searches for similar words, so it is likely to return places with names related to the search
Play around with the place multipliers: san francisco + 1.5*malaga will yield results more like malaga than san francisco + malaga
When you want cities like A but in country B: City A - Country of Place A + Country of Place B |
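As a sketch, the +/-/multiplier syntax above could be parsed into (weight, term) pairs along these lines. This is a guess at the behavior, not the app's actual parser (which lives in where2go_model.py):

```python
# Hypothetical parser for the query syntax: '+'/'-' set the sign of a term,
# and a leading '1.5*' multiplier scales its weight.
import re

def parse_query(query):
    """'san francisco + 1.5*malaga - spain' -> [(1.0, 'san francisco'), ...]"""
    terms = []
    for sign, chunk in re.findall(r"([+-]?)\s*([^+-]+)", query):
        weight = -1.0 if sign == '-' else 1.0
        m = re.match(r"\s*(\d+(?:\.\d+)?)\s*\*\s*(.+)", chunk)
        if m:  # strip the multiplier prefix and fold it into the weight
            weight *= float(m.group(1))
            chunk = m.group(2)
        terms.append((weight, chunk.strip()))
    return terms
```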
Searching fails with an error:
| 127.0.0.1 - - [02/Sep/2019 13:09:16] "POST /map HTTP/1.1" 500 - Traceback (most recent call last): ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 2463, in __call__ ??? return self.wsgi_app(environ, start_response) ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 2449, in wsgi_app ??? response = self.handle_exception(e) ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1866, in handle_exception ??? reraise(exc_type, exc_value, tb) ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 2446, in wsgi_app ?? ?response = self.full_dispatch_request() ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1951, in full_dispatch_request ??? rv = self.handle_user_exception(e) ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1820, in handle_user_exception ??? reraise(exc_type, exc_value, tb) ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1949, in full_dispatch_request ??? rv = self.dispatch_request() ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1935, in dispatch_request ??? return self.view_functions[rule.endpoint](**req.view_args) ? File "D:\anacondaProject\where2go\webapp\app.py", line 41, in userinput ??? return json.dumps(app.result) ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\json\__init__.py", line 244, in dumps ??? return _default_encoder.encode(obj) ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\json\encoder.py", line 207, in encode ??? chunks = self.iterencode(o, _one_shot=True) ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\json\encoder.py", line 270, in iterencode ??? return _iterencode(o, 0) ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\json\encoder.py", line 184, in default ??? 
raise TypeError(repr(o) + " is not JSON serializable") TypeError: Decimal('113.26700000') is not JSON serializable |
Check the source:
| def userinput(): ??? data = request.data ? ??? ms = app.where2go.most_similar(data) ??? top_places_json = app.where2go.get_top_places_json(ms)? ??? ??? app.result['top_places'] = top_places_json ??? print top_places_json ??? return json.dumps(app.result) |
Error:
| [(1.0, 'beijing')] ? [(u'guangzhou', 0.7928651571273804), (u'seoul', 0.7863544225692749), (u'nanjing', 0.7803971767425537), (u'tianjin', 0.776152491569519), (u'shanghai', 0.7680126428604126), (u'hangzhou', 0.747714638710022), (u'wuhan', 0.7452333569526672), (u'kunming', 0.7269240021705627), (u'fuzhou', 0.720137357711792), (u'xiamen', 0.7125382423400879), (u'beijing_shanghai', 0.7124168872833252), (u'busan', 0.7067762613296509), (u'harbin', 0.7055091857910156), (u'xian', 0.703764796257019), (u'taipei', 0.7032514810562134), (u'moscow', 0.7001821994781494), (u'urumqi', 0.6986857652664185), (u'shenyang', 0.6914734244346619), (u'chengdu', 0.6909835338592529), (u'munich', 0.6862865686416626), (u'vienna', 0.6839408874511719), (u'ulaanbaatar', 0.6831813454627991), (u'budapest', 0.6821123957633972), (u'vladivostok', 0.6806952953338623), (u'zhengzhou', 0.6783512830734253), (u'brussels', 0.6768432259559631), (u'copenhagen', 0.6743952035903931), (u'pyongyang', 0.6742997169494629), (u'bratislava', 0.667781412601471), (u'astana', 0.6674197912216187), (u'ningbo', 0.6667413711547852), (u'chongqing', 0.6655149459838867), (u'shenzhen', 0.6651620864868164), (u'qingdao', 0.6618784070014954), (u'sofia', 0.6600873470306396), (u'frankfurt', 0.6579354405403137), (u'nanning', 0.6576802730560303), (u'berlin', 0.6552646160125732), (u'wuchang', 0.6497694253921509)] 127.0.0.1 - - [02/Sep/2019 14:35:10] "POST /map HTTP/1.1" 500 - Traceback (most recent call last): ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 2463, in __call__ ??? return self.wsgi_app(environ, start_response) ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 2449, in wsgi_app ??? response = self.handle_exception(e) ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1866, in handle_exception ??? reraise(exc_type, exc_value, tb) ? 
File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 2446, in wsgi_app ?? ?response = self.full_dispatch_request() ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1951, in full_dispatch_request ??? rv = self.handle_user_exception(e) ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1820, in handle_user_exception ??? reraise(exc_type, exc_value, tb) ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1949, in full_dispatch_request ??? rv = self.dispatch_request() ? File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1935, in dispatch_request ??? return self.view_functions[rule.endpoint](**req.view_args) ? File "D:\anacondaProject\where2go\webapp\app.py", line 39, in userinput ??? app.result['top_places'] = top_places_json TypeError: 'NoneType' object does not support item assignment |
top_places_json and app.result are lists.
Looked up how json.dumps is used:
Django ships its own encoder; when serialization fails you can add an extra cls=NpEncoder-style argument.
It may also be that the data contains numpy or similar types that dumps cannot recognize.
Here Decimal('113.26700000') just means a decimal number.
Custom encoder class: https://blog.csdn.net/rt5476238/article/details/91398332
https://stackoverflow.com/questions/1960516/python-json-serialize-a-decimal-object/8274307#8274307
Simplejson 2.1 and higher has native support for the Decimal type: json.dumps(Decimal('3.9'), use_decimal=True)
Note that use_decimal is True by default
Simplejson is a simple, fast, complete, correct and extensible JSON [http://json.org] encoder and decoder for Python 2.5+ and Python 3.3+. It is pure Python code with no dependencies, but includes an optional C extension for a serious speed boost.
The latest simplejson documentation can be read online at: https://simplejson.readthedocs.io/
simplejson is the externally maintained development version of the json library shipped with Python 2.6 and Python 3.0, while retaining backward compatibility with Python 2.5.
Usage docs: https://simplejson.readthedocs.io/en/latest/
Trying it out:
| change import json to import simplejson as json |
| pip install simplejson |
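A standard-library alternative to simplejson is to subclass json.JSONEncoder so that Decimal values (like the longitude in the traceback above) are emitted as floats. The payload below is made up for illustration:

```python
# Custom encoder: json.JSONEncoder calls default() for types it cannot
# serialize, so intercepting Decimal there avoids the TypeError.
import json
from decimal import Decimal

class DecimalEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, Decimal):
            return float(o)
        return super(DecimalEncoder, self).default(o)

payload = {'lng': Decimal('113.26700000'), 'city': 'guangzhou'}
encoded = json.dumps(payload, cls=DecimalEncoder)
```

With this in place, json.dumps(app.result, cls=DecimalEncoder) would avoid the error without swapping the json module.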
Everything is now done (except that the saved banner images are not yet in their folder), but routes and user-interest information are not included.
Common Anaconda commands
Create a virtual environment
(base) C:\Users\zdp>conda create -n django
?
Activate a virtual environment
(base) C:\Users\zdp>activate py2_flask
?
cd into the project folder
(django) C:\Users\laugo>cd /d D:\anacondaProject\where2go\webapp
Run a py file:
python app.py
| When scraping, run:
activate py2_flask
cd /d D:\anacondaProject\where2go\code\data_collection
python scrap_wikivoyage_banners.py |
| When opening the app: |
| activate py2_flask cd /d D:\anacondaProject\where2go\webapp python app.py |
Check the Flask version:
| python import flask flask.__version__ |
Common algorithms
The next problem was figuring out which model to use. Traditional natural-language-processing recommender systems include models such as TF-IDF + cosine similarity and TF-IDF + SVD + k-means clustering.
Natural language processing:
Applications of NLP in recommender systems: https://blog.csdn.net/heyc861221/article/details/80130263
Compared with structured information (such as item attributes), text has some inherent drawbacks in practice: structure carries information, and recommendation strategies, whether algorithmic or rule-based, can be built directly on structured information; the information content of free text is uncertain; and ambiguity is common.
Advantages: large data volume; rich diversity; timely information.
Common text-processing methods in recommender systems:
Some "explicit" ways of using text data:
Bag of Words (BOW) model: its core assumption is that a document is the multiset of the words it contains (a multiset differs from an ordinary set in that it counts how many times each element occurs).
A common yardstick: term weighting and the vector space model
After suitable preprocessing, a simple bag-of-words model can be used to recall candidate items in a recommender system. But when computing the relevance between items and keywords, or between items, ranking purely on raw term frequency is clearly unreasonable. To fix this we can introduce the more expressive TF-IDF weighting, where tf_{t,d} is the number of times t occurs in d and df_t is the number of documents containing t.
All of these methods aim to measure a word's importance within a document more sensibly. On top of them, we can improve the frequency-based approach: for example, ranking items by TF-IDF score instead of by raw term frequency.
Beyond that, we also need a unified way to measure the relevance between keywords and documents, and between documents; that method is the Vector Space Model (VSM).
The core idea of the VSM is to represent a document as a vector, where each dimension can stand for a word; on that basis, similarity between documents can be computed uniformly with vector operations. For text relevance we can fill the vector with TF-IDF values, and also with N-grams, the topic probability distributions introduced later, various word vectors, or other representations.
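A minimal sketch of the TF-IDF + cosine-similarity pipeline described above, on a three-document toy corpus (a real system would use something like scikit-learn's TfidfVectorizer):

```python
# TF-IDF weighting over a toy corpus, then cosine similarity in the VSM.
import math
from collections import Counter

docs = [['beach', 'sun', 'sand'],
        ['beach', 'wine', 'food'],
        ['snow', 'ski', 'mountain']]

def tfidf(doc, docs):
    """Map each term t in doc to tf_{t,d} * log(N / df_t)."""
    n = len(docs)
    tf = Counter(doc)
    return {t: tf[t] * math.log(n / sum(1 for d in docs if t in d))
            for t in tf}

def cos_sim(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vecs = [tfidf(d, docs) for d in docs]
# docs 0 and 1 share 'beach', so they are more similar than docs 0 and 2
```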
?
Latent semantic models:
(Latent Semantic Analysis, LSA)
The model's core assumption is that although a document is composed of many words, the topics behind those words are few. In other words, the words are merely generated by underlying topics, and those topics are the more essential information. This idea of sinking from words down to topics runs through the other models introduced below, and is the shared central idea of the various topic models.
LSA's approach is to apply an SVD decomposition to the original matrix C.
LSA is a big step forward compared with raw keywords, mainly in the amount of information captured, the reduction in dimensionality, and the handling of synonyms and polysemous words. But LSA also has drawbacks, for example high training complexity: LSA is trained via SVD, which is itself very expensive and hard to compute over massive documents and vocabularies; some optimizations reduce the cost, but the problem is not fundamentally solved.
High retrieval (recall) complexity: as noted above, using LSA for recall requires first mapping documents or query keywords into the LSA vector space, which is clearly also a time-consuming operation.
The per-topic word values in LSA have no probabilistic meaning and can even be negative; they only reflect relative magnitude. This makes it hard to interpret the topic-word relationship from a probabilistic angle, limiting richer uses of the results.
Probabilistic latent semantic models:
Treat the relationship between documents and words as a probability distribution, and then try to recover that distribution.
Viewed as matrices, LSA and pLSA look very similar, but their substance differs fundamentally, above all in their optimization objectives: LSA essentially minimizes the squared error between the SVD-reconstructed matrix and the original matrix, whereas pLSA essentially maximizes a likelihood function, the standard machine-learning optimization recipe. It is precisely this essential difference that leads to their different optimization results and explanatory power.
But pLSA still has some problems, mainly:
Because pLSA generates a set of document-level parameters for each document, the number of parameters grows in proportion to the number of documents, so the model overfits easily when there are many documents.
pLSA represents each document d as a mixture of topics, but the specific mixing proportions have no corresponding generative probability model; in other words, pLSA cannot assign a good topic distribution to a new document outside the training set. In short, pLSA is not a fully generative model.
LDA appeared precisely to solve these problems.
Generative probabilistic models:
(Latent Dirichlet Allocation, LDA)
Latent: needless to say, this is still a latent semantic model.
Dirichlet: the main probability distribution involved in the model is the Dirichlet distribution.
Allocation: the model's generative process keeps using Dirichlet distributions to allocate topics and words.
LDA's central idea is to wrap a prior around pLSA, so that both the topic distribution of a document and the word distribution of each topic have generative probabilities. This solves pLSA's "non-generative" problem described above, and incidentally also reduces the number of model parameters, solving pLSA's other problem.
Capturing context: neural probabilistic language models
pLSA/LDA rest on an important assumption: given the topic distributions, the documents in a collection, and the words within a document, are mutually independent and exchangeable. In other words, the models take no account of word order or word-word relations. This assumption carries two implications: while generating words, previously generated words have no influence on the words generated next;
and two documents containing the same words in a different order are, to LDA, exactly the same.
Such assumptions make LDA lose some important information, and the neural probabilistic language models that have attracted growing attention in recent years, with word2vec as their representative, are complementary to LDA on precisely this point: they can capture information that LDA cannot.
The central idea of word2vec in one sentence: a word is characterized by the company it keeps.
Much like the proverb "birds of a feather flock together".
Concretely, word-vector models construct training samples in the form "surrounding words => current word" or "current word => surrounding words", then train the model with a neural network; once training completes, a word's input vector becomes that word's vector representation (Figure 3 in the linked article).
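The "current word => surrounding words" sample construction can be written down as plain skip-gram pair generation; the sentence and window size here are made up for illustration:

```python
# Generate (center, context) training pairs for skip-gram word2vec.
def skipgram_pairs(sentence, window=2):
    """Pair every word with each neighbor within `window` positions."""
    pairs = []
    for i, center in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

pairs = skipgram_pairs(['visit', 'the', 'beach', 'in', 'bali'])
```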
LDA can naturally cluster words and compute similar words, so how do word2vec results differ from LDA's? First, the clustering granularity differs: LDA works at the topic level, a higher level, whereas word vectors capture lower-level syntactic and semantic meaning. Take the words "Apple", "Xiaomi" and "Samsung": under LDA all three would very likely be clustered into one topic, but from the word-vector point of view "Apple" and "Xiaomi" may have higher similarity, just as "Jobs" and "Lei Jun" relate under word vectors, so with word vectors you may get results like vector(Xiaomi) - vector(Apple) + vector(Jobs) = vector(Lei Jun).
Moreover, because word2vec can "predict the current content from its context", a suitably modified version can also predict user behavior preferences. First collect user behavior logs and split them into sessions, producing training data that resembles a text corpus; training word2vec on this data yields a model that "predicts the current behavior from the surrounding behaviors".
Along the same lines, word2vec can be further modified into models more sensitive to temporal order, or one can try pure sequence models such as RNNs and LSTMs for better predictions.
Word2vec principles
Word2vec overview (unsupervised learning): http://www.mamicode.com/info-detail-2150217.html
Summary (fairly specialized and detailed): https://www.jianshu.com/p/bca4e7bfb86d
Applications: sequence data with strong local correlation
Clustering, finding synonyms, part-of-speech analysis
Text sequences: strong correlation between neighbors, so the target word can be predicted from its context (fill-in-the-blank)
Social networks: generate sequences by random walks, then use word2vec to train a vector for each node
Recommender systems and advertising (app-download sequences: word2vec + similarity)
word2vec from theory to implementation: https://zhuanlan.zhihu.com/p/43736169
Huffman trees in word2vec: https://www.jianshu.com/p/f9351532f281
gensim's word2vec documentation: https://radimrehurek.com/gensim/models/word2vec.html
Introductions to word2vec: https://www.zhihu.com/topic/19886836/hot (several of the references there are also worth reading)
Hierarchical softmax and negative sampling optimizations: https://www.cnblogs.com/Determined22/p/5807362.html
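The hierarchical-softmax idea from the last link can be sketched as follows: each word is a leaf of a binary (Huffman) tree, and its probability is a product of sigmoid decisions along the path from the root. The codes and logits below are made up; in real word2vec the inner-node logits come from dot products with the hidden layer.

```python
# Toy hierarchical softmax: P(word) = product of per-node branch probabilities.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def path_probability(code, logits):
    """code: the word's Huffman code ('0' = go left, '1' = go right);
    logits: one inner-node score per step along the path."""
    p = 1.0
    for bit, s in zip(code, logits):
        p *= sigmoid(s) if bit == '0' else 1.0 - sigmoid(s)
    return p

# with a full 2-level tree, the probabilities of the four leaves sum to 1
root, left, right = 0.3, -1.2, 0.7  # assumed inner-node logits
leaves = {'00': [root, left], '01': [root, left],
          '10': [root, right], '11': [root, right]}
total = sum(path_probability(c, s) for c, s in leaves.items())
```

This is why hierarchical softmax is cheap: computing one word's probability touches only O(log V) inner nodes instead of all V output units.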
Website details
I was able to run the website using Python Flask; JavaScript performs AJAX calls for the search engine, so the user's search query can be run against the model to predict the most similar places and show the recommendations on the map.
The Flask file is named "app.py" and lives in the "webapp" folder;
the "index.html" file contains the HTML and JavaScript and lives in the "templates" folder. Bootstrap is used to style the site.
Html
Clicking the search button in the HTML triggers onClick="sendToFlask()":
| function sendToFlask() {
    data = $('#user_input').val();

    $.ajax({
        'url': '/map',
        'data': data,
        'type': 'POST',
        'contentType': 'application/json',
        'success': function (data) {
            model_output = JSON.parse(data)
            var center_location = model_output['center_location'];
            var geojson = model_output['top_places'];

            //Initialize
            $('#error_msg').remove()
            $('#portfolio').empty()

            //if geojson list is empty, display error message.
            if (geojson.length==0) {
                error_message()
            };

            var portfolio_header = '<br/><div class="col-lg-12 text-center"><h3 class="section-heading">Places 2 Go</h2></div>'

            $('#portfolio').append(portfolio_header);

            addtoPortfolio(geojson);

            // Clear the map before adding new markers
            mapSimple.removeLayer(myLayer);

            // Create new layer
            myLayer = L.mapbox.featureLayer();

            // Add custom popups to each using our custom feature properties
            myLayer.on('layeradd', function(e) {
                var marker = e.layer,
                    feature = marker.feature;

                // Create custom popup content
                var popupContent = '<a target="_blank" class="popup" href="' + feature.properties.url + '">' + '<div class=crop><img src="' + feature.properties.image + '" height/></div><div class=text-center style="padding:15px 0 0 0"><font size="5">' + feature.properties.title + '</font></div></a>';

                // http://leafletjs.com/reference.html#popup
                marker.bindPopup(popupContent, {
                    closeButton: true,
                    minWidth: 320
                });
            });

            myLayer.setGeoJSON(geojson).addTo(mapSimple);

            mapSimple.fitBounds(myLayer.getBounds());
            // mapSimple.clearLayers();
        },
        'error': function (request, status, error) {
            $('#error_msg').remove()
            error_message();
            console.log('Oh no!! Something went wrong.');
        }
    });
}; |
網(wǎng)頁(yè)使用mapbox
地圖無(wú)法顯示了,有可能是因?yàn)閠oken失效或者沒(méi)連上網(wǎng)(控制臺(tái)提示'L' is not defined)
Font awesome的icon使用(沒(méi)有CSDN,不打算用了)
雖然不影響使用,貌似有一些沒(méi)下載下來(lái)?
GET /static/font-awesome/css/font-awesome.min.css HTTP/1.1" 304 -
127.0.0.1 - - [29/Dec/2019 21:39:05] "GET /static/css/bootstrap.min.css HTTP/1.1" 304 -
127.0.0.1 - - [29/Dec/2019 21:39:05] "GET /static/css/agency.css HTTP/1.1" 304 -
127.0.0.1 - - [29/Dec/2019 21:39:05] "GET /static/js/jquery.js HTTP/1.1" 304 -
127.0.0.1 - - [29/Dec/2019 21:39:05] "GET /static/js/bootstrap.min.js HTTP/1.1" 304 -
127.0.0.1 - - [29/Dec/2019 21:39:05] "GET /static/js/cbpAnimatedHeader.js HTTP/1.1" 304 -
127.0.0.1 - - [29/Dec/2019 21:39:05] "GET /static/js/agency.js HTTP/1.1" 304 -
127.0.0.1 - - [29/Dec/2019 21:39:05] "GET /static/js/classie.js HTTP/1.1" 304 -
?
Flask后臺(tái)接口
范例:
from flask import Flask, request

app = Flask(__name__)

@app.route('/1', methods=['POST'])
def aa():
    # Echoes back whatever is posted, logging it to a file
    with open('1.txt', 'a') as f:
        print(str(request.data, encoding='utf-8'), file=f)
    return request.data

if __name__ == '__main__':
    app.run(port=3002)  # defaults to port 5000 if not specified
文件為app.py
| app.run(host= '0.0.0.0', port=80, debug=True) |
@app.route('/map', methods=['POST'])
def userinput():
    data = request.data

    ms = app.where2go.most_similar(data)
    top_places_json = app.where2go.get_top_places_json(ms)
    # print top_places_json
    app.result['top_places'] = top_places_json
    return json.dumps(app.result)
Model
?
?
In my where2go_model.py file, I implemented gensim's word2vec model and wrote functions to vectorize user search queries, plus functions that filter the recommendations down to actual geolocations and output the destinations in geojson format.
"""
Use the trained word2vec model to give the most similar recommendations to the input.

input  = search string in the format of place/char + place/char - ...
output = top recommendations in json format
"""
| 使用經(jīng)過(guò)訓(xùn)練的 word2vec 模型為輸入提供最類似的建議 |
terms = self.parse_search_query(input)  # parse the user query into (multiplier, target) pairs
# Set to make sure the output doesn't include one of the input destinations.
check = set()
| 確保輸出中不包含輸入的目的地 |
# For (multiplier, destination), get multiplier * vector of that destination,
# then sum into the master vector.
for i, term in enumerate(terms):
    multiplier, word = term
    check.add(word)
    if i == 0:
        master_vector = multiplier * self.model[word]
    else:
        master_vector += multiplier * self.model[word]
| 對(duì)于(乘數(shù)、目的地),獲取該目標(biāo)的乘數(shù) + 矢量。 然后加到主向量中 |
# Find the most similar vectors to the master vector
ms = self.model.most_similar(positive=[master_vector], topn=topn)
ms_wo_search_terms = [dest for dest in ms if dest[0] not in check]

print ms_wo_search_terms

return ms_wo_search_terms
| 查找與 master 矢量最相似的矢量 |
?
疑問(wèn),ms到底怎么查出來(lái)的,
| ms = self.model.most_similar(positive=[master_vector], topn=topn) |
是自己調(diào)用自己?jiǎn)?#xff0c;還是word2vec自帶方法
有可能是自帶方法,但貌似不建議使用,警告:
DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).
| 方法將在 4.0.0 中刪除,改用self.wv.most_similar() |
類分析
import cPickle as pkl  # serialization
from where2go_model import Where2Go_Model  # the model
def load_pickle():
    return pkl.load(open('../data/pickles/where2go_model.pkl', 'rb'))
| ms = app.where2go.most_similar(data) |
運(yùn)行時(shí)只有導(dǎo)入的where2go_model中有Where2Go_Model類,反序列化model也是它
但Where2Go_Model中也加載了其他pkl(找了一會(huì)在哪生成的,記憶模糊,拎不清了,離生成這些pkl已經(jīng)過(guò)了很久了,猜測(cè)既然where2go_model中沒(méi)有把其他code文件導(dǎo)入,應(yīng)該沒(méi)有其他類了,其他類之后抽空再看)
Pkl
Java has serialization and deserialization, and Python supports the same operations. When serializing (dump) and deserializing (load) objects in Python we do not need to worry about the details, because Python already wraps them in the cPickle module.
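A minimal Python 3 sketch of the same round trip. Note that cPickle is Python 2 only; in Python 3 the plain pickle module automatically uses the C implementation.

```python
import pickle

obj = {'model': 'where2go', 'topn': 20}

data = pickle.dumps(obj)         # serialize (dump) to bytes
restored = pickle.loads(data)    # deserialize (load) back to an object

print(restored == obj)
# -> True
```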
Model analysis
EDA
Exploratory data analysis and data cleaning have been performed with ipython notebook. Wikivoyage and NYT data were loaded, cleaned, pickled out as input format for word2vec, which is a list of sentences where each sentence is represented as a list of words. Also, global NOAA weather data was downloaded but I later determined that it leaves out major parts of the world. Thus, more data has to be collected to incorporate weather to the project.
Ipython notebook已執(zhí)行探索性數(shù)據(jù)分析和數(shù)據(jù)清理。Wikivoyage 和 NYT 數(shù)據(jù)被加載、清理、挑選出來(lái)作為 word2vec 的輸入格式,該格式是句子列表,其中每個(gè)句子都表示為單詞列表。此外,全球NOAA天氣數(shù)據(jù)被下載,但我后來(lái)確定,它忽略了世界的主要部分。因此,要將天氣納入項(xiàng)目必須收集更多的數(shù)據(jù)。
Uses the file /data/wikivoyage.json (369 MB, produced in step 1, Gathering Data, by converting enwikivoyage-latest-pages-articles.xml). Worth reading up on this file on the official site at some point.
Process the data into the input format.
?
Model
Where2go is based on a model created at Google called word2vec. Word2vec is a neural network with one hidden layer, implemented either as continuous bag of words (CBOW) or as skip-gram. Where2go uses the skip-gram version with hierarchical softmax for optimization.
At a high level, word2vec trains the neural network to parametrize a model that can predict the surrounding words for every word in the corpus. The predictions are then used to backpropagate and optimize the parameters so that words with similar contexts end up closer together, and further away from words with different contexts. The input-to-hidden-layer weight matrix, which is also the vector representation of the words, is then used to gain insight into the meaning/similarity of words.
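The "predict the surrounding words" setup can be sketched as generating skip-gram (center, context) training pairs. A minimal illustration (window size is an assumption here):

```python
def skipgram_pairs(sentence, window=1):
    """Generate (center, context) pairs for every word within the window."""
    pairs = []
    for i, center in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

print(skipgram_pairs(['visit', 'sunny', 'beaches']))
# -> [('visit', 'sunny'), ('sunny', 'visit'), ('sunny', 'beaches'), ('beaches', 'sunny')]
```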
In my where2go_model.py file, I implemented gensim's word2vec model and wrote functions to vectorize user search queries and functions to filter the recommendations to actual geolocations and output destinations in geojson format.
Building the model and taking the most similar results (word2vec):
bigram = gensim.models.Phrases(wikivoyage_list, min_count=10)
model_bigrams = gensim.models.Word2Vec(bigram[wikivoyage_list], min_count=10, size=200)
ms = model_bigrams.most_similar(positive=['paris', 'london', 'sevilla'], negative=[], topn=20)
top_places = []
for entry in ms:
    place, sim = entry
Using the model:
terms = self.parse_search_query(input)

# Set to make sure the output doesn't include one of the input destinations.
check = set()

# For (multiplier, destination), get multiplier * vector of that destination,
# then sum into the master vector.
for i, term in enumerate(terms):
    multiplier, word = term
    check.add(word)
    if i == 0:
        master_vector = multiplier * self.model[word]
    else:
        master_vector += multiplier * self.model[word]

# Find the most similar vectors to the master vector
ms = self.model.most_similar(positive=[master_vector], topn=topn)
ms_wo_search_terms = [dest for dest in ms if dest[0] not in check]
Word2vec模型實(shí)現(xiàn)原理與源碼:
word2vec 算法包括skip-gram & CBOW模型,使用hierarchical softmax or negative sampling
我們這用的是skip-gram+hierarchical softmax
很多人以為 word2vec 是一種模型和方法,其實(shí) word2vec 只是一個(gè)工具,背后的模型是 CBOW 或者 Skip-gram,并且使用了 Hierarchical Softmax 或者 Negative Sampling 這些訓(xùn)練的優(yōu)化方法。所以準(zhǔn)確說(shuō)來(lái),word2vec 并不是一個(gè)模型或算法,只不過(guò) Mikolov 恰好在當(dāng)時(shí)把他開(kāi)源的工具包起名叫做 word2vec 而已。
softmax(正則的指數(shù)函數(shù))是輸出層函數(shù),他可以用于計(jì)算至少兩種不同類型的常見(jiàn)詞嵌入:word2vec, FastText。另外,它與sigmoid和tanh函數(shù)都是許多種類型的神經(jīng)網(wǎng)絡(luò)架構(gòu)的激活步驟
這個(gè)算法的復(fù)雜性就直接是我們單詞表的大小O(V)。事實(shí)表明,我們使用二叉樹(shù)的結(jié)構(gòu)可以簡(jiǎn)化這個(gè)復(fù)雜性,即分層(hierarchical) softmax
模型需要學(xué)習(xí)的參數(shù):每個(gè)單詞的詞向量Xw + 霍夫曼樹(shù)每個(gè)內(nèi)部結(jié)點(diǎn)的θ
Gradient computation for the H-softmax model
There are too many formulas to rederive here; the gradient computation follows Liu Jianping's blog post.
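The key results of that derivation, reconstructed here from the standard hierarchical-softmax formulation (σ is the sigmoid; d_j ∈ {0,1} is the Huffman code of the j-th node on the path to the target word; θ_{j-1} the parameters of that internal node; x_w the input word vector). Note that the gradient term (1 - d_j - σ(·)) matches the `g = (1 - code - f) * alpha` line in the Spark source below.

```latex
% Probability of taking branch d_j at internal node j-1 on the Huffman path:
P(d_j \mid x_w, \theta_{j-1})
  = \sigma(x_w^\top \theta_{j-1})^{\,1-d_j}\,
    \bigl(1 - \sigma(x_w^\top \theta_{j-1})\bigr)^{d_j}

% Log-likelihood over the whole path of length l_w:
\mathcal{L}
  = \sum_{j=2}^{l_w} \Bigl[(1-d_j)\log \sigma(x_w^\top \theta_{j-1})
      + d_j \log\bigl(1-\sigma(x_w^\top \theta_{j-1})\bigr)\Bigr]

% Gradients used in the SGD updates:
\frac{\partial \mathcal{L}}{\partial \theta_{j-1}}
  = \bigl(1 - d_j - \sigma(x_w^\top \theta_{j-1})\bigr)\, x_w
\qquad
\frac{\partial \mathcal{L}}{\partial x_w}
  = \sum_{j=2}^{l_w} \bigl(1 - d_j - \sigma(x_w^\top \theta_{j-1})\bigr)\, \theta_{j-1}
```

With learning rate η, each step updates θ_{j-1} ← θ_{j-1} + η g x_w and x_w ← x_w + η Σ_j g_j θ_{j-1}, where g_j = 1 - d_j - σ(x_w^T θ_{j-1}).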
Spark MLlib's word2vec implementation uses exactly this approach; with the gradient formulas above, the Spark word2vec source becomes readable.
// 省略了建樹(shù)的過(guò)程,在建樹(shù)的過(guò)程中會(huì)給每個(gè)內(nèi)部結(jié)點(diǎn)編碼 while (pos < sentence.length) {val word = sentence(pos)val b = random.nextInt(window)// Train Skip-gram,// syn0 是詞向量 x 參數(shù)數(shù)組,長(zhǎng)度為 vocab_size * emb_size// syn1 是霍夫曼樹(shù)內(nèi)部結(jié)點(diǎn) w 參數(shù)數(shù)組,長(zhǎng)度同上var a = bwhile (a < window * 2 + 1 - b) {if (a != window) {val c = pos - window + aif (c >= 0 && c < sentence.length) {val lastWord = sentence(c)val l1 = lastWord * vectorSizeval neu1e = new Array[Float](vectorSize)// Hierarchical softmaxvar d = 0while (d < bcVocab.value(word).codeLen) {val inner = bcVocab.value(word).point(d)val l2 = inner * vectorSize// Propagate hidden -> outputvar f = blas.sdot(vectorSize, syn0, l1, 1, syn1, l2, 1) // 計(jì)算 x^Twif (f > -MAX_EXP && f < MAX_EXP) {val ind = ((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2.0)).toIntf = expTable.value(ind) // 計(jì)算 f = sigmoid(x^Tw)val g = ((1 - bcVocab.value(word).code(d) - f) * alpha).toFloat // 計(jì)算梯度 g = (1-d-f) * alpha, d 是該節(jié)點(diǎn)的編碼(0/1),alpha是學(xué)習(xí)率blas.saxpy(vectorSize, g, syn1, l2, 1, neu1e, 0, 1) // 累加 e = e + gw, e 初始化 0blas.saxpy(vectorSize, g, syn0, l1, 1, syn1, l2, 1) // 更新 w = w + gxsyn1Modify(inner) += 1}d += 1}blas.saxpy(vectorSize, 1.0f, neu1e, 0, 1, syn0, l1, 1) // 更新 x = x + esyn0Modify(lastWord) += 1}}a += 1}pos += 1 }? ? ? ?
?
總結(jié)
以上是生活随笔為你收集整理的机器学习项目搭建试验 where2go的全部?jī)?nèi)容,希望文章能夠幫你解決所遇到的問(wèn)題。
- 上一篇: 芮勇出任联想CTO,阿里巴巴获CIKM
- 下一篇: Typora配置图床