A Gentle Introduction to Python for Tableau Developers (Part 4)

Published: 2023/11/29 | python | 豆豆

EXPLORING DATA WITH PYTHON

Between data blends, joins, and wrestling with the resulting levels of detail in Tableau, managing relationships between data can be tricky.

Stepping into your shiny new Python shoes, the world opens up a bit. Instead of trying to squeeze everything into a single data source to rule them all, you can choose your battles.

In our previous articles, we already saw:

Pandas groupby expressions and basic visualizations

Calculations and coloring sales by profitability

Data exploration with advanced visuals

Setting the stage

In this article, we’ll focus on one of the most important aspects of working with data in any ecosystem: joins.

Lucky for us, the Tableau data we have been playing around with comes with batteries included! Within the context of the fictitious store we have been analyzing, we have a bit of data detailing various items that were returned.

Let’s use this data on orders and returns to get comfortable joining data in Python using the Pandas library. Along the way, we’ll see how different types of joins can help us accomplish different goals.

Step 1: taking a quick look at the data we’d like to join

Remember that orders data we’ve been working with? Let’s get a good look at it real quick.

Recall that the head() function fetches the top rows; in this case we asked for the top 2 rows!
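As a minimal sketch of that head() call, the snippet below builds a tiny stand-in for the orders data and asks for the top 2 rows. The rows and values are invented for illustration and are not taken from the Superstore file.

```python
import pandas as pd

# Hypothetical stand-in for the orders data; in the series, store_df
# is loaded from the Excel workbook instead.
store_df = pd.DataFrame({
    "Order ID": ["A-1", "A-2", "A-3", "A-4"],
    "Market": ["US", "EU", "Canada", "US"],
    "Sales": [100.0, 250.0, 75.0, 310.0],
})

# head(n) returns the first n rows as a new DataFrame (n defaults to 5).
top_two = store_df.head(2)
print(top_two)
```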

Ah, that’s familiar. Now let’s see what’s behind door #2. But first, in the spirit of getting a feel for how Pandas works, let’s highlight how we are getting our hands on this data.

In the first article, we used the Pandas read_excel() function to get our orders data from an Excel file. It turns out this file has multiple sheets:

Orders

Returns

People

Let’s ignore the ‘People’ sheet for now (nobody likes them anyway), and simply acknowledge the fact that when we used the read_excel() function without specifying a sheet name, the function grabbed the first sheet available in the Excel workbook.

Here’s how we can specify that we want to fetch data from the ‘Returns’ sheet:

returns_df = pd.read_excel('Global Superstore.xls', sheet_name='Returns')

And here’s what that data looks like:

It doesn’t get much simpler than this data set

Step 2: avoid blind dates for your data

In the data joining world, an arranged marriage is better than a blind date. That is, we want to be the overbearing parents who inspect the columns we are about to join our data on, looking for any inconsistencies that would make for a poor match. Blindly joining our data together just because we assume the columns match up could lead to some undesired results.

To prove that point, let’s take a stroll into a blind date and see what happens.

For starters, let’s observe which columns our two data sets have in common.

These are the columns for our orders… and these are the columns for our returns

It looks to me like we have ‘Order ID’ and ‘Market’ in common. Of course, in many real world scenarios the same information might be stored under different names, but this isn’t the real world. It’s an example!

By the way, here’s a shortcut you could use to find values that exist in both column lists:

common_cols = [col for col in store_df.columns if col in returns_df.columns]

What you just saw is called a list comprehension. If you end up falling in love with Python, these will become second nature to you. They’re very useful. In this case, our list comprehension gives us every value in list X that also appears in list Y.
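To see the list comprehension in isolation, here is a small sketch using two made-up column lists (the column names are illustrative stand-ins, not the full Superstore schema). It also shows Index.intersection(), which is one alternative way to ask Pandas the same question.

```python
import pandas as pd

# Hypothetical column lists standing in for store_df.columns and returns_df.columns.
orders_cols = pd.Index(["Order ID", "Market", "Sales", "Profit"])
returns_cols = pd.Index(["Returned", "Order ID", "Market"])

# Keep every column of the first list that also appears in the second.
common_cols = [col for col in orders_cols if col in returns_cols]
print(common_cols)

# Alternative: let the Index objects compute the overlap themselves.
common_via_index = orders_cols.intersection(returns_cols, sort=False)
print(list(common_via_index))
```

The list comprehension keeps the order of the first list, which is handy when the result is passed straight to the on= argument of a merge.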

Now that we know we have two columns in common, what are we waiting for!? Initiate the blind date!

Step 3: alright, let’s give the blind date a try…

So we have ‘Order ID’ and ‘Market’ in common between our orders and our returns. Let’s join them.

In Pandas, you can use the join() method and you can also use the merge() method. Different strokes for different folks. I prefer the merge() method in pretty much any scenario, so that’s what we’ll use here. Team merge, for life.

blind_date_df = store_df.merge(returns_df, on=['Order ID', 'Market'], how='left')

And let’s see what the results look like:

Results of our left join

First of all, we ran .shape on the resulting dataframe to get values for the number of rows and columns (in that order). So our resulting dataframe has 51,290 rows and 25 columns, where the original orders dataframe has 51,290 rows and 24 columns.

This join has effectively sprinkled in new data for each of our rows, providing one additional column named ‘Returned’, which takes on the value of ‘Yes’ if an order was returned and is left empty (NaN) otherwise.

Note that in our join we specified the columns to join on as well as how to perform the join. What is this ‘left’ join? It simply means that the table that was there first (in this example that is our store_df) will remain as-is, and the new table’s data will be sprinkled onto it wherever relevant.
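Here is a minimal sketch of that left join on toy data, assuming two tiny invented tables in place of the real orders and returns sheets. It shows the left table keeping all of its rows while the ‘Returned’ column is filled in only where a match exists.

```python
import pandas as pd

# Invented stand-ins for the orders and returns data.
store_df = pd.DataFrame({
    "Order ID": ["A-1", "A-2", "A-3"],
    "Market": ["EU", "EU", "APAC"],
    "Sales": [100.0, 250.0, 75.0],
})
returns_df = pd.DataFrame({
    "Returned": ["Yes"],
    "Order ID": ["A-2"],
    "Market": ["EU"],
})

# Left join: every row of store_df survives; unmatched rows get NaN in 'Returned'.
left_df = store_df.merge(returns_df, on=["Order ID", "Market"], how="left")
print(left_df.shape)  # (3, 4)
print(left_df)
```

The row count stays equal to the left table’s only as long as each key combination appears at most once in the right table; duplicate keys on the right would multiply the matching rows.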

Let’s compare this to an inner join:

The inner join results in fewer rows than our left join

This inner join behaves differently from our left join in the sense that it only keeps the intersection between the two tables. This type of join would be useful if we only cared about analyzing orders that were returned, as it filters out any orders that were not returned.
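One way to build intuition for the inner join is to compare it with a left join followed by a filter. The sketch below uses invented toy tables and assumes the right table has unique keys and no missing ‘Returned’ values; under those assumptions both routes keep the same orders.

```python
import pandas as pd

# Invented stand-ins for the orders and returns data.
store_df = pd.DataFrame({
    "Order ID": ["A-1", "A-2", "A-3"],
    "Market": ["EU", "EU", "APAC"],
    "Sales": [100.0, 250.0, 75.0],
})
returns_df = pd.DataFrame({
    "Returned": ["Yes"],
    "Order ID": ["A-2"],
    "Market": ["EU"],
})
keys = ["Order ID", "Market"]

# Inner join: only rows whose keys exist in both tables.
inner_df = store_df.merge(returns_df, on=keys, how="inner")

# Same orders, reached by a left join plus dropping the unmatched rows.
left_df = store_df.merge(returns_df, on=keys, how="left")
filtered_df = left_df.dropna(subset=["Returned"])

print(len(inner_df), len(filtered_df))  # 1 1
```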

Step 4: so what’s wrong with the blind date?

Sometimes you think you know everything, and that’s when it bites you the hardest. In this example, we think we know that we have two matching columns: ‘Order ID’ and ‘Market’. But do our two data sets agree on what a market is?

Let’s stir up some drama. Orders, how do you define your markets?

store_df['Market'].unique()

A respectable list of markets… except you spelled out Canada and used acronyms for the others

This line of code takes a look at the entire ‘Market’ column and outputs the unique values found within it.

Okay. Returns, how do you define your markets?

A respectable list of… wait, ‘United States’ is spelled out?

It looks like our orders and returns teams both need to get on the same page in terms of whether they use acronyms for markets or spell them out.
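Rather than eyeballing the two lists, the comparison can be sketched with Python sets. The market values below are invented to mimic the mismatch described above; they are not the full lists from the workbook.

```python
import pandas as pd

# Invented stand-ins: orders use 'US', returns use 'United States'.
store_df = pd.DataFrame({"Market": ["US", "EU", "Canada", "US", "APAC"]})
returns_df = pd.DataFrame({"Market": ["United States", "EU", "APAC", "Canada"]})

order_markets = set(store_df["Market"].unique())
return_markets = set(returns_df["Market"].unique())

# Labels that one side uses and the other has never heard of.
only_in_returns = return_markets - order_markets
only_in_orders = order_markets - return_markets

print(sorted(only_in_returns))  # ['United States']
print(sorted(only_in_orders))   # ['US']
```

A label that shows up only in orders is not automatically a problem, since a market may simply have no returns. A label that shows up only in returns is a strong hint that the join keys disagree.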

On the orders side, to avoid future issues we should probably switch ‘Canada’ to an acronym value like ‘CA’.

On the returns side, to avoid future issues we should probably switch ‘United States’ to ‘US’.

But wait, that only fixes the future issues… what kinds of problems is this causing right now?

Step 5: cleaning up after a blind join

To see the mess we’re in, let’s look at how many returns we have per market (using the inner join from earlier):

Anything missing?

Hurray, it looks like our US market is perfect and has no returns! Or wait, is it the United States market… ah, oops.

Because the data containing our orders calls the market ‘US’ and the data containing our returns calls the market ‘United States’, the join will never match the two.
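The effect can be reproduced on toy data. In this sketch (invented rows, with the same ‘US’ versus ‘United States’ mismatch), the inner join silently drops the US return, so the per-market count has no US entry at all.

```python
import pandas as pd

# Invented stand-ins for the orders and returns data.
store_df = pd.DataFrame({
    "Order ID": ["A-1", "A-2", "A-3", "A-4"],
    "Market": ["US", "EU", "EU", "APAC"],
})
returns_df = pd.DataFrame({
    "Returned": ["Yes", "Yes", "Yes"],
    "Order ID": ["A-1", "A-2", "A-4"],
    "Market": ["United States", "EU", "APAC"],
})

inner_df = store_df.merge(returns_df, on=["Order ID", "Market"], how="inner")

# Count distinct returned orders per market.
returns_per_market = inner_df.groupby("Market")["Order ID"].nunique()
print(returns_per_market)

# Sanity check: how many returns went missing in the join?
print(len(returns_df) - len(inner_df))  # 1
```

Comparing the row count of the returns table with the row count of the inner join is a cheap way to notice that something failed to match.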

Luckily, it’s really easy to rename our markets. Here’s a quick way to do it in this situation, where there’s really just one mismatch that’s causing a problem. This introduces the concept of a lambda function, which you can simply ignore for now if it makes no sense to you.

returns_df['Market'] = returns_df['Market'].apply(lambda market: market.replace('United States', 'US'))

Basically, this creates a function on the fly that we use once to perform a useful action in a single line of code. After running the line of code above, every occurrence of the ‘United States’ market has been renamed to ‘US’.
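If the lambda feels like too much for now, one alternative is Pandas’ own Series.replace() method with a mapping of old to new values. The sketch below runs both approaches on an invented ‘Market’ column and checks that they agree.

```python
import pandas as pd

# Invented stand-in for the returns data.
returns_df = pd.DataFrame({
    "Order ID": ["A-1", "A-2", "A-3"],
    "Market": ["United States", "EU", "United States"],
})

# Approach from the article: apply a small function to every value.
via_apply = returns_df["Market"].apply(
    lambda market: market.replace("United States", "US")
)

# Alternative: map whole values with Series.replace and a dictionary.
via_replace = returns_df["Market"].replace({"United States": "US"})

print(via_apply.tolist())    # ['US', 'EU', 'US']
print(via_replace.tolist())  # ['US', 'EU', 'US']
```

The two are not identical in general: the string method inside the lambda replaces substrings and would raise an error on a missing (NaN) value, while the dictionary form only swaps values that match exactly and leaves missing values alone.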

Now, if we run that inner join between store_df and returns_df, the results will look a bit different:

And if we check how many returns there are per market, we get this:

Step 6: a little challenge for the data addicts out there

Now that we know how to join our orders and our returns, can you figure out how to stitch together a table like the one shown below?

Looks like the ‘Tables’ sub-category is crying out for attention again! We can’t escape it.

Applying what we’ve learned so far in this series, can you recreate this table on your own? If you’re a real over-achiever, go ahead and build it in Tableau as well and compare the process. How do you handle the market mismatch in Tableau vs Python? There are multiple ways to crack the case — go try it out!

Wrapping it up

Joining data is an absolutely crucial skill if you’re working with data at scale. Keep in mind that understanding what you’re joining is as important as knowing the technical details of how to execute the joins!

If you send your data on a blind date to be joined with another table, be aware of the risks. Scrub your data sets clean before sending them on dates with other data. Dirty data tends to leave a mess, and you’ll be the one troubleshooting it.
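One possible way to chaperone the date is sketched below. The helper name unmatched_keys is hypothetical (it is not part of Pandas), and the tables are invented; the merge options indicator= and validate= are standard Pandas parameters. The helper lists rows in the second table whose keys never appear in the first, and validate= asks Pandas to raise an error if the returns table contains duplicate keys.

```python
import pandas as pd

# Invented stand-ins; the first order has two line items.
store_df = pd.DataFrame({
    "Order ID": ["A-1", "A-1", "A-2"],
    "Market": ["US", "US", "EU"],
    "Sales": [100.0, 40.0, 250.0],
})
returns_df = pd.DataFrame({
    "Returned": ["Yes", "Yes"],
    "Order ID": ["A-1", "A-2"],
    "Market": ["United States", "EU"],
})
keys = ["Order ID", "Market"]


def unmatched_keys(left, right, keys):
    """Hypothetical helper: rows of `right` whose keys never appear in `left`."""
    flagged = right.merge(
        left[keys].drop_duplicates(), on=keys, how="left", indicator=True
    )
    return flagged[flagged["_merge"] == "left_only"].drop(columns="_merge")


# Returns that would be silently lost by the join.
orphans = unmatched_keys(store_df, returns_df, keys)
print(orphans)

# validate= raises a MergeError if keys in returns_df are not unique.
checked = store_df.merge(returns_df, on=keys, how="left", validate="many_to_one")
print(checked.shape)
```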

Hope to see you next time as we dive into crafting reusable code using functions!

Translated from: https://towardsdatascience.com/a-gentle-introduction-to-python-for-tableau-developers-part-4-a6fd6b2f46b1
