When I was in my final year at university, I was collecting datasets for the research paper that served as my final-year project. I was casually scrolling through the internet and voila! It didn’t take me long to gather all the datasets I needed. But just when I thought it was smooth sailing, a Kraken appeared. Not the sea monster, of course, but a problem that required tons of brainstorming sessions. The dataset I had collected was too small to work with: I’m talking 20 to 30 periodic observations, yikes. You may ask, why didn’t I realize it was insufficient just by looking at the number of observations? Well, to be frank, I did feel a little worried when I saw the “handful” of observations. But it only really hit me when I realized it wasn’t enough for the model I was researching.
After quite a few hours, a book, and a cup of coffee, I finally found inspiration on how to work with these small datasets: extrapolate them, appropriately. At first, I genuinely thought my idea would cause quite an error in the model, but thankfully it went well and I finished my paper. So in this article, I want to share the methods I used with a univariate dataset and a new method that I developed for a multivariate dataset.
Let’s start easy, Univariate Dataset with Margin of Error (MOE)
A dataset that comes with an MOE is very useful for this extrapolation method, because the MOE is one of the key factors in how accurate the extrapolated values will be. In this case, I’ll be using the US annual mean income from the United States Census Bureau, Table S1901. With the MOE on board, we can easily get the minimum and maximum values of the mean income for each year. Knowing these bounds, we extrapolate within each year by generating random variates from the Uniform(0,1) distribution to represent standardized values of the mean income. Then we convert the standardized values back to actual values using the minimum and maximum values, like so
Image by Author
Say I want to extrapolate the dataset to recreate monthly mean income: I’ll need 12 random uniform variates to be converted for each year. Here’s a side-by-side plot comparison of the real and the extrapolated datasets.
Extrapolated (Left), Real (Right), Image by Author
As we can see, the increasing trend is still there; it’s just noisier now that it has monthly instead of annual values. And if we check the difference between the statistical properties
Percentage Difference of the Statistical Properties, Image by Author
it doesn’t differ much :)
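To make the MOE-based steps concrete, here is a minimal Python sketch of one way they could be written. It assumes the minimum and maximum for each year are simply the mean minus and plus the MOE; the function names and the income figures are hypothetical placeholders, not values from Table S1901.

```python
import random
import statistics


def extrapolate_with_moe(means, moes, per_period=12, seed=0):
    """Draw `per_period` values per period, uniformly within [mean - MOE, mean + MOE]."""
    rng = random.Random(seed)
    values = []
    for mean, moe in zip(means, moes):
        low, high = mean - moe, mean + moe        # assumed bounds for this period
        for _ in range(per_period):
            u = rng.random()                      # standardized value from Uniform(0, 1)
            values.append(low + u * (high - low))  # convert back to the actual scale
    return values


def pct_diff(real, synthetic, stat):
    """Percentage difference of one statistic between the two series."""
    return 100.0 * (stat(synthetic) - stat(real)) / stat(real)


# Hypothetical annual mean incomes and MOEs, for illustration only.
annual_mean = [70000.0, 72000.0, 75000.0, 77500.0]
annual_moe = [300.0, 320.0, 310.0, 350.0]

monthly = extrapolate_with_moe(annual_mean, annual_moe)
print(len(monthly), "monthly values generated")
print("mean differs by", round(pct_diff(annual_mean, monthly, statistics.mean), 3), "%")
```

Because every generated value stays inside its own year’s interval, the overall mean can only move by at most the largest MOE, which is why the summary statistics barely change.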
Univariate Dataset without MOE
Now, this condition was the problem I mentioned before. I was confused about how I was supposed to get information on the periodic variance of the data I was working on. Luckily, the solution only requires two main ingredients: a time-series model that fits the dataset, and some random standardized values.
In this example, I’m going to use the monthly sunspots dataset which you can acquire here. And yes, it’s already a huge dataset so there’s no need for extrapolation, am I right? But let’s say you’re only given the last 3 years of observations and were told to generate daily values for those 3 years based on that.
Monthly Sunspots from 1981–1984, Image by Author
Now let’s pick the model. From the beginning, we know that this is a monthly dataset. So why don’t we pick something simple? We’re going to fit a linear seasonal regression model to the dataset. Here’s the result:
Image by Author
That’s quite a good fit. Now we’re going to use the estimates and the standard errors from this result to extrapolate the data. In other words, looking back at the previous example, the estimates play the role of the “mean income” and the standard errors play the role of the MOE. Since we’re going to generate daily values, the number of values generated for each month equals the number of days in that month, using that month’s estimate and standard error (I’m using a confidence level of 95% from this point on). Here are the extrapolated daily values:
Image by Author
One thing that immediately feels off is that the extrapolated values lack the decreasing trend of the original dataset. I did this on purpose to show how important it is to pick a model that suits the dataset we’re working on. From this result, we can conclude that the linear seasonal regression model is not the right fit for this dataset. Moreover, by using this regression we implicitly assume the dataset is stationary, which causes the extrapolated values to look like a stationary time series.
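The procedure in this section can be sketched roughly as follows. This is only an illustration under stated assumptions: the seasonal model here is a regression on twelve month indicators with no trend term (the article does not give the exact specification), the interval is estimate ± 1.96 × standard error for the 95% level, and the monthly series is synthetic rather than the real sunspot data. All function names are hypothetical.

```python
import calendar
import math
import random


def fit_monthly_dummies(values, start_month=1):
    """OLS on 12 month indicators (assumed model, no trend term).

    Each estimate is that month's mean; each standard error is s / sqrt(n_m),
    where s is the pooled residual standard deviation.
    Needs more than 12 observations and at least one per calendar month.
    """
    groups = {m: [] for m in range(1, 13)}
    for i, y in enumerate(values):
        groups[(start_month - 1 + i) % 12 + 1].append(y)
    est = {m: sum(g) / len(g) for m, g in groups.items()}
    ssr = sum((y - est[m]) ** 2 for m, g in groups.items() for y in g)
    s = math.sqrt(ssr / (len(values) - 12))
    se = {m: s / math.sqrt(len(g)) for m, g in groups.items()}
    return est, se


def daily_values(est, se, year, month, z=1.96, rng=random):
    """One uniform draw per day of the month inside estimate +/- z * standard error."""
    low, high = est[month] - z * se[month], est[month] + z * se[month]
    n_days = calendar.monthrange(year, month)[1]
    return [low + rng.random() * (high - low) for _ in range(n_days)]


# Synthetic 36-month series standing in for three years of observations.
rng = random.Random(1)
monthly = [100 + 20 * math.sin(2 * math.pi * i / 12) + rng.gauss(0, 5) for i in range(36)]

est, se = fit_monthly_dummies(monthly)
daily = []
for i in range(36):
    year, month = 1981 + i // 12, i % 12 + 1
    daily.extend(daily_values(est, se, year, month, rng=rng))
print(len(daily), "daily values generated")
```

Note that a month-indicator model gives every January the same interval, every February the same interval, and so on, so the generated series repeats one seasonal shape each year. That is consistent with the missing trend discussed above.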
Multivariate Dataset
Down to the last example: it took me quite a while to think of a way to extrapolate a multivariate dataset. Nevertheless, here’s one way of doing it. In this last example, I’m using the New Delhi Climate Training Dataset from Kaggle.
Likewise, let’s investigate the dataset first. Since I was expecting the variables to be correlated, I’ll start with the scatterplots between them.
Image by Author
Now my eyes immediately go to the pressure panels, despite the apparent negative correlation between temperature and humidity. Something feels off with the plot, and I quickly realize there must be some outliers, since a few values differ greatly from the rest. I’m no expert in climate science, so I called on our best friend and jack-of-all-trades, Google, to help me find the normal values for air pressure, and it sent me here. It turns out the values should be around 1013.25 millibars. Hence, based on the dataset and the website, pressure values between 990 and 1024 will be considered normal. The outliers will then be replaced according to the distribution of the dataset.
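As a rough illustration of the outlier step, the sketch below keeps pressure values in the 990 to 1024 range and redraws the rest. The article only says the outliers are replaced “according to the distribution of the dataset”, so the choice here (a normal distribution fitted to the in-range values, redrawn until the replacement is itself in range) is an assumption, and the sample readings are made up.

```python
import random
import statistics


def replace_outliers(values, low=990.0, high=1024.0, seed=0):
    """Replace out-of-range values with draws from a normal fit to the in-range ones."""
    rng = random.Random(seed)
    normal = [v for v in values if low <= v <= high]
    mu, sigma = statistics.mean(normal), statistics.stdev(normal)
    cleaned = []
    for v in values:
        if low <= v <= high:
            cleaned.append(v)          # in-range observations are kept untouched
            continue
        while True:                    # redraw until the replacement is in range
            candidate = rng.gauss(mu, sigma)
            if low <= candidate <= high:
                break
        cleaned.append(candidate)
    return cleaned


# Hypothetical pressure readings with two obviously wrong entries.
pressure = [1015.2, 1008.7, 7679.3, 1011.0, -3.04, 1002.5, 1018.9]
cleaned = replace_outliers(pressure)
print([round(v, 1) for v in cleaned])
```

Sampling from the empirical in-range values instead of a fitted normal would be an equally reasonable reading of the same sentence.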
You might be wondering whether there’s a twist to this example, since there are already a lot of observations. YOU GUESSED IT RIGHT! (really sorry for my corny jokes trying to get your attention back lol)
The twist here is that you’re actually given only the monthly average of each variable, and you need to convert it back to daily values. Now, based on the two examples I gave before, please answer this question:
Is it going to work? Is it possible to do so?
Save your answer until the end of this article, and let’s see.
First, as we did earlier, let’s take a look at the scatterplots between the variables.
Image by Author
Well, it seems the variables in our dataset are correlated with each other. Here’s what I can see from this plot:
- The most definite relation is between temperature and pressure, and it’s a negative correlation.
- The rest appear to be moderately correlated, and they look like they might fit a quadratic model.
With these in mind, I decided to fit a linear and a quadratic regression model for every possible pair of variables, then compare their R-Squared and Adjusted R-Squared values. I’m also going to fit a linear seasonal regression model for each variable, since each one definitely has a seasonal pattern based on the plots below.
Image by Author
Before doing the regressions, it’s best to standardize the values, since the variables vary on different scales. Here’s the result of the model fitting:
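For readers who want to try this kind of comparison themselves, here is a small self-contained sketch of standardizing two variables and comparing a linear against a quadratic fit by R-Squared and Adjusted R-Squared. The pressure and temperature numbers are synthetic, and the helper names (`standardize`, `solve`, `fit_poly`) are hypothetical; it is not the code behind the results table.

```python
import random
import statistics


def standardize(values):
    """Rescale to zero mean and unit sample standard deviation."""
    mu, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mu) / sd for v in values]


def solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_poly(x, y, degree):
    """Least-squares polynomial fit via the normal equations; returns (coef, R2, adjusted R2)."""
    k = degree + 1
    xtx = [[sum(xi ** (i + j) for xi in x) for j in range(k)] for i in range(k)]
    xty = [sum((xi ** i) * yi for xi, yi in zip(x, y)) for i in range(k)]
    coef = solve(xtx, xty)
    fitted = [sum(c * xi ** p for p, c in enumerate(coef)) for xi in x]
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    n = len(y)
    adj = 1 - (1 - r2) * (n - 1) / (n - degree - 1)
    return coef, r2, adj


# Synthetic data: temperature falls with pressure along a gentle curve, plus noise.
rng = random.Random(42)
pressure = [1000 + 0.5 * i for i in range(40)]
temperature = [35 - 0.9 * (p - 1000) + 0.02 * (p - 1000) ** 2 + rng.gauss(0, 1) for p in pressure]

x, y = standardize(pressure), standardize(temperature)
coef_lin, r2_lin, adj_lin = fit_poly(x, y, 1)
coef_quad, r2_quad, adj_quad = fit_poly(x, y, 2)
print(f"linear:    R2={r2_lin:.3f}  adjusted R2={adj_lin:.3f}")
print(f"quadratic: R2={r2_quad:.3f}  adjusted R2={adj_quad:.3f}")
```

The quadratic model can never have a lower R-Squared than the linear one on the same data, because the linear model is nested inside it. That is why the Adjusted R-Squared, which penalizes the extra term, is the more useful number when choosing between the two.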
Image by Author
Let’s focus on the relation between the variables. Excluding the seasonal regression results (rows 1–4), the highest R-Squared value belongs to the quadratic model with pressure as the independent variable and temperature as the dependent one. The other models don’t seem to fit well, even though the scatterplots hinted at a correlation. Fortunately, the seasonal model is a great fit for all variables. With these in mind, here’s my plan:
Image by Author
And now, the moment you’ve been waiting for: the comparison of the real versus the extrapolated values (the blue line is the extrapolated one).
Image by Author
The extrapolated values fit the actual values well for most variables, and not too badly for temperature. But our million-dollar question hasn’t been answered yet. To convert the values back to daily values, we’re going to need a little bit of math here.
Image by Author
Here, n is the number of samples. Then we can obtain the variance of the monthly averages, which is
Image by Author
Here, Yj^s is the standardized version of the monthly averages. Finally, we derive the standard error of the daily values with this set of equations:
Yay! :), Image by Author
Aaaandd without further ado, let’s see how the daily extrapolated values turned out.
Image by Author
My first reaction was “What kind of noisy time-series is this? This is nuts!”. I don’t think we need to explain anything to answer the question: it’s a definite no, at least using this method. The extrapolated values become too noisy and are only effective in the short term, since we use extrapolated data to extrapolate (#extrapo-ception). Moreover, the monthly average values don’t carry the “jumps” that the daily values do, so the extrapolated daily values cannot capture them.
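For reference, one textbook way to connect the variance of monthly averages to the spread of daily values is sketched below. It assumes the n daily values within a month are independent with a common variance σ², and it writes m for the number of months; these assumptions and the symbols σ and m are not taken from the article, whose exact equations may differ.

```latex
% Variance of a monthly average of n independent daily values with variance \sigma^2
\operatorname{Var}(\bar{Y}) = \frac{\sigma^{2}}{n}

% Sample variance of the m standardized monthly averages Y_j^{s}
s_{\bar{Y}}^{2} = \frac{1}{m-1} \sum_{j=1}^{m} \left( Y_j^{s} - \bar{Y}^{s} \right)^{2}

% Implied estimate of the daily standard deviation
\hat{\sigma} = \sqrt{n}\; s_{\bar{Y}}
```

Under these assumptions the daily spread is roughly √n times larger than the spread of the monthly averages, so with about 30 days per month the daily interval is more than five times as wide. That alone goes some way toward explaining how noisy the generated daily series looks.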
Conclusion
- This extrapolation method can only create values based on the dataset used in the calculations, and the generated values will follow that dataset’s characteristics.
- The stationarity assumption might contribute to the inability to detect “jump”(s). Therefore, a more appropriate model might be a way to generate better-fitting extrapolated values.
- Even if the extrapolated values are perfect, that doesn’t mean they are a perfect representation of the population. Nevertheless, it’s still better to have an estimate of what the population might look like.
What’s next?
I might not be the expert in this, but I did learn to work creatively with a time-series dataset. Even so, I would like to hear your suggestions that may improve this method even more. So, below is my GitHub repo of this time-series extrapolation method. I will definitely post more data science or actuarial science projects in the near future, so stay tuned!
Translated from: https://towardsdatascience.com/go-big-by-being-small-618d2da54b49