
How to reduce your Power BI model size by 90%!


Have you ever wondered what makes Power BI so fast and powerful when it comes to performance? So powerful that it performs complex calculations over millions of rows in the blink of an eye.

In this series of articles, we will dig deep to discover what is "under the hood" of Power BI: how your data is being stored, compressed, queried, and finally brought back to your report. Once you finish reading, I hope you will get a better understanding of the hard work happening in the background and appreciate the importance of creating an optimal data model in order to get maximum performance from the Power BI engine.

After we laid the theoretical groundwork for understanding the architecture behind the VertiPaq storage engine and the types of compression it uses to optimize your Power BI data model, it's the right moment to get our hands dirty and apply our knowledge to a real-life case!

Starting point = 777 MB

Our data model is quite simple, yet memory-intensive. We have a fact table (factChat), which contains data about live support chats, and one dimension table (dimProduct) that relates to it. Our fact table has around 9 million rows, which should not be a big deal for Power BI, but the table was imported as-is, without any additional optimization or transformation.

Now, this pbix file consumes a whopping 777 MB!!! You don't believe it? Just take a look:

Just remember this picture! Of course, I don't need to tell you how long this report takes to load or refresh, or how slow our calculations are because of the file size.

…and it's even worse!

Additionally, it's not just the 777 MB that takes up our memory, since memory consumption is calculated taking into account the following factors:

  • PBIX file
  • Dictionary (you've learned about the dictionary in this article)
  • Column hierarchies
  • User-defined hierarchies
  • Relationships

Now, if I open Task Manager, go to the Details tab, and find the msmdsrv.exe process, I can see that it burns more than 1 GB of memory!

Oh, man, that really hurts! And we haven't even interacted with the report yet! So, let's see what we can do to optimize our model…

Rule #1 — Import only those columns you really need

The first and most important rule is: keep in your data model only those columns you really need for the report!

That being said, do I really need both the chatID column, which is a surrogate key, and the sourceID column, which is a primary key from the source system? Both values are unique, so even if I need to count the total number of chats, I would still be fine with only one of them.

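A quick way to confirm that at the source before dropping anything is to compare distinct counts. This is a minimal T-SQL sketch of my own (the check itself is not part of the original workflow):

SELECT count(*) AS totalRows
      ,count(DISTINCT chatID) AS distinctChatIDs
      ,count(DISTINCT sourceID) AS distinctSourceIDs
FROM factChat
-- if all three numbers match, both keys are unique and either one can safely be dropped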

Let me check how the file looks now:

By removing just one unnecessary column, we saved more than 100 MB!!! Let's examine further what can be removed without taking a deeper look (we will come back to this later, I promise).

Do we really need both the original start time of the chat and the UTC time, one stored as the Date/Time/Timezone type, the other as Date/Time, and both kept to one-second precision?!

Let me get rid of the original start time column and keep only the UTC values.

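If the UTC values ever had to be derived rather than imported ready-made, that derivation also belongs on the source side. A hedged sketch, assuming the original column is named datetmStart and stored as datetimeoffset in SQL Server (both the name and the type are assumptions, since the source schema isn't shown here):

SELECT chatID
      ,CONVERT(datetime2(0), SWITCHOFFSET(datetmStart, '+00:00')) AS datetmStartUTC -- shift to UTC, then drop the offset
FROM factChat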

Another 100 MB of wasted space gone! By removing just two columns we didn't need, we reduced the size of our file by 30%!

Now, that was without even looking into the details of memory consumption. Let's now turn on DAX Studio, my favorite tool for troubleshooting Power BI reports. As I've already stressed a few times, this tool is a MUST if you plan to work seriously with Power BI, and it's completely free!

One of the features in DAX Studio is VertiPaq Analyzer, a very useful tool built by Marco Russo and Alberto Ferrari from sqlbi.com. When I connect to my pbix file with DAX Studio, here are the numbers related to my data model size:

Here I can see which columns are the most expensive in my data model and decide whether I can discard some of them or need to keep them all.

At first glance, I have a few candidates for removal: the sessionReferrer and referrer columns have high cardinality and therefore can't be optimally compressed. Moreover, as these are text columns that need to be encoded using the Hash algorithm, you can see that their dictionary size is extremely high! If you take a closer look, you will notice that these two columns take up almost 40% of my table size!

After checking with my report users whether they need any of these columns, or maybe just one of them, I got confirmation that they don't perform any analysis on them. So, why on Earth should we bloat our data model with them?!

Another strong candidate for removal is the LastEditDate column. It just shows the date and time when the record was last edited in the data warehouse. Again, I checked with the report users; they didn't even know this column existed!

另一個很可能刪除的候選對象是LastEditDate列。 此列僅顯示記錄在數據倉庫中的最后編輯日期和時間。 再次與報表用戶核對,他們甚至都不知道該列存在!

I removed these three columns and the result is:

Oh, god, we halved the size of our data model just by removing a few unnecessary columns.

Truth be told, there are a few more columns that could be dismissed from the data model, but let's now focus on other techniques for data model optimization.

Rule #2 — Reduce the column cardinality!

As you may recall from my previous article, the rule of thumb is: the higher the cardinality of a column, the harder it is for VertiPaq to optimally compress the data, especially if we are not working with integer values.

Let’s take a deeper look into VertiPaq Analyzer results:

As you can see, even though the chatID column has higher cardinality than the datetmStartUTC column, it takes almost 8 times less memory! Since it is a surrogate-key integer value, VertiPaq applies Value encoding, and the size of the dictionary is irrelevant. On the other hand, Hash encoding is applied to the date/time column with its high cardinality, so its dictionary size is enormously higher.

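You can get the same cardinality picture up front with a simple profiling query on the source, before VertiPaq Analyzer ever sees the data. A small T-SQL sketch of my own:

SELECT count(DISTINCT chatID) AS chatID_cardinality
      ,count(DISTINCT datetmStartUTC) AS datetmStartUTC_cardinality
FROM factChat
-- the higher the distinct count (and the wider the data type), the worse the compression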

There are multiple techniques for reducing column cardinality, such as splitting columns. Here are a few examples of this technique.

For integer columns, you can split them into two columns using division and modulo operations. In our case, it would be:

SELECT chatID/1000 AS chatID_div
,chatID % 1000 AS chatID_mod
.......

This optimization technique must be performed on the source side (in this case, by writing a T-SQL statement). If we used calculated columns instead, there would be no benefit at all, since the original column would have to be stored in the data model first.

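Note that the split is lossless. A small sketch of the round trip, purely for illustration (not part of the original query):

SELECT split.chatID_div * 1000 + split.chatID_mod AS chatID_original -- recombines to the original key
FROM (
    SELECT chatID / 1000 AS chatID_div -- integer division
          ,chatID % 1000 AS chatID_mod -- remainder between 0 and 999
    FROM factChat
) AS split

Each of the two new columns has far lower cardinality than chatID itself (chatID_mod never exceeds 999), which is exactly what VertiPaq's Value encoding rewards.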

A similar technique can bring significant savings when you have decimal values in a column: you can simply split the values before and after the decimal point, as explained in this article.

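A hedged sketch of that decimal split, using a hypothetical decimal(18,2) column named amount on a hypothetical table (our model has no decimal columns, so both names are made up for illustration):

SELECT CAST(amount AS int) AS amount_int -- whole part
      ,CAST((amount - CAST(amount AS int)) * 100 AS int) AS amount_frac -- the two decimal digits
FROM someFactTable
-- a measure in the model would then recombine them as amount_int + amount_frac / 100.0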

Since we don’t have any decimal values, let’s focus on our problem — optimizing the datetmStartUTC column. There are multiple valid options to optimize this column. The first is to check if your users need granularity higher than day level (in other words, can you remove hours, minutes, and seconds from your data).

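If day-level grain turns out to be enough, the truncation is a one-liner on the source side (a minimal sketch; casting to the date type also lets the column use the smaller Date data type in Power BI):

SELECT chatID
      ,CAST(datetmStartUTC AS date) AS dateStartUTC -- drops hours, minutes, and seconds
FROM factChat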

Let’s check what savings would this solution bring:

The first thing we notice is that our file is now 255 MB, about a third of what we started with. VertiPaq Analyzer's results show that this column is now almost perfectly optimized, going from over 62% of our data model size to just slightly over 2.5%! That's huuuuge!

However, it turned out that day-level grain was not fine enough, and my users needed to analyze figures at the hour level. OK, so we can at least get rid of minutes and seconds, which will also decrease the cardinality of the column.

So, I’ve imported values rounded per hour:

SELECT chatID
,dateadd(hour, datediff(hour, 0, datetmStartUTC), 0) AS datetmStartUTC -- truncated to the hour
,customerID
,userID
,ipAddressID
,productID
,countryID
,userStatus
,isUnansweredChat
,totalMsgsOp
,totalMsgsUser
,userTimezone
,waitTimeSec
,waitTimeoutSec
,chatDurationSec
,sourceSystem
,subject
,usaccept
,transferUserID
,languageID
,waitFirstClick
FROM factChat

It turned out that my users also didn't need the chatVariables column for analysis, so I removed it from the data model as well.

Finally, after disabling Auto Date/Time in the Options for Data Load, my data model size was around 220 MB! However, one thing still bothered me: the chatID column was still occupying almost 1/3 of my table. And it's just a surrogate key, not used in any of the relationships within my data model.

So, I examined two different solutions. The first was to simply remove this column and aggregate the number of chats, counting them with a GROUP BY clause:

SELECT count(chatID) AS chatID -- number of chats per group
,dateadd(hour, datediff(hour, 0, datetmStartUTC), 0) AS datetmStartUTC
,customerID
,userID
,ipAddressID
,productID
,countryID
,userStatus
,isUnansweredChat
,totalMsgsOp
,totalMsgsUser
,userTimezone
,waitTimeSec
,waitTimeoutSec
,chatDurationSec
,sourceSystem
,subject
,usaccept
,transferUserID
,languageID
,waitFirstClick
FROM factChat
GROUP BY dateadd(hour, datediff(hour, 0, datetmStartUTC), 0)
,customerID
,userID
,ipAddressID
,productID
,countryID
,userStatus
,isUnansweredChat
,totalMsgsOp
,totalMsgsUser
,userTimezone
,waitTimeSec
,waitTimeoutSec
,chatDurationSec
,sourceSystem
,subject
,usaccept
,transferUserID
,languageID
,waitFirstClick

This solution also reduces the number of rows, since it aggregates chats grouped by the defined attributes. But the main advantage is that it drastically reduces the cardinality of the chatID column, as you can see in the next illustration:

So, we went down from a cardinality of "9 million and something" to just 13!!! This column's memory consumption is no longer even worth mentioning. Obviously, this is also reflected in our pbix file size:

The second option: since the chatID column is not used anywhere in our data model, there is no benefit to keeping it at all. Once I removed it from the model, we saved an additional 3 MB but preserved the original granularity of the table!

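For completeness, a sketch of that second option: the same import query as before, just without chatID and without any GROUP BY, so every original row survives (the column list is abbreviated here to the ones shown earlier):

SELECT dateadd(hour, datediff(hour, 0, datetmStartUTC), 0) AS datetmStartUTC
      ,customerID
      ,userID
      -- ...all remaining columns from the earlier SELECT, except chatID
FROM factChat

The total number of chats can then be computed in the model by simply counting rows, since each row still represents exactly one chat.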

And one last time, let’s check the pbix file size:

Recall the number we started at: 777 MB! So, I managed to reduce my data model size by almost 90%, applying some simple techniques that enabled the VertiPaq storage engine to perform more optimal compression of the data.

And this was a real use case that I faced during the last year!

General rules for reducing data model size

To conclude, here is the list of general rules you should keep in mind when trying to reduce the data model size:

  • Keep only those columns your users need in the report! Just sticking to this one single rule will save you an unbelievable amount of space, I assure you…
  • Try to optimize column cardinality whenever possible. The golden rule here is: test, test, test… If there is a significant benefit from, for example, splitting one column into two, or substituting a decimal column with two whole-number columns, then do it! But also keep in mind that your measures will need to be rewritten to handle those structural changes in order to display the expected results. So, if your table is not big, or if you would have to rewrite hundreds of measures, maybe splitting the column is not worth it. As I said, it depends on your specific scenario, and you should carefully evaluate which solution makes more sense
  • Just as with columns, keep only those rows you need: for example, maybe you don't need to import data from the last 10 years, but only 5! That will also reduce your data model size (see the sketch after this list). Talk to your users and ask them what they really need before blindly putting everything inside your data model
  • Aggregate your data whenever possible! That means fewer rows and lower cardinality, all the nice things you are aiming for! If you don't need hour, minute, or second granularity, don't import it! Aggregations in Power BI (and the Tabular model in general) are a very important and wide topic, which is out of the scope of this series, but I strongly recommend you check Phil Seamark's blog and his series of posts on creative aggregation usage
  • Avoid calculated columns whenever possible, since they are not optimally compressed. Instead, try to push all calculations to the data source (a SQL database, for example) or perform them in the Power Query editor
  • Use proper data types (for example, if your data granularity is at the day level, there is no need to use the Date/Time data type; the Date data type will suffice)
  • Disable the Auto Date/Time option for data loading (this will remove a bunch of automatically created date tables in the background)
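
As promised in the list above, here is a minimal sketch for keeping only the rows you need, assuming a rolling 5-year window (the window itself is an example, not a prescription):

SELECT *
FROM factChat
WHERE datetmStartUTC >= dateadd(year, -5, CAST(getdate() AS date)) -- import only the last 5 years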

Conclusion

Now that you've learned the basics of the VertiPaq storage engine and the different techniques it uses for data compression, I wanted to wrap up this series by showing you, through a real-life example, how we can "help" VertiPaq (and consequently Power BI) get the best report performance and optimal resource consumption.

Thanks for reading, hope that you enjoyed the series!

Original article: https://towardsdatascience.com/how-to-reduce-your-power-bi-model-size-by-90-b2f834c9f12e
