Introduction to Survival Analysis
In our extremely competitive times, all businesses face the problem of customer churn/retention. For context, churn happens when a customer stops using the services of a company (stops purchasing, cancels the subscription, etc.). Retention refers to keeping the clients of a business active (the definition of active depends heavily on the business model).
Intuitively, companies want to increase retention by preventing churn. This way, their relationship with the customers lasts longer and is thus potentially more profitable. What is more, in most cases the cost of retaining a customer is much lower than that of acquiring a new one, for example, via performance marketing. For businesses, the concept of retention is closely connected to customer lifetime value (CLV), which they want to maximize. But that is a topic for another article.
With this article, I want to start a short series focusing on survival analysis, an often underestimated yet very interesting branch of statistical learning. In this article, I provide a general introduction to survival analysis and its building blocks. First I explain the required concepts and then describe different approaches to analyzing time-to-event data. Let’s start!
Introduction to Survival Analysis
Survival analysis is a field of statistics that focuses on analyzing the expected time until a certain event happens. Originally, this branch of statistics developed around measuring the effects of medical treatment on patients’ survival in clinical trials. For example, imagine a group of cancer patients who are administered a certain new form of treatment. Survival analysis can be used for analyzing the results of that treatment in terms of the patients’ life expectancy.
However, survival analysis is not restricted to investigating deaths. It can just as well be used for determining the time until a machine fails or (what may at first sound a bit counterintuitive) until a user of a certain platform converts to a premium service. That is possible because survival analysis focuses on the time until an event happens, without requiring the event to be a negative one. The conditions that apply to the most popular methods of survival analysis are:
- the event of interest is clearly defined and well-specified, so there is no ambiguity about whether it happened or not,
- the event can occur only once for each subject. This is clear in the case of death, but churn is a more complicated case, as a churned user might be reactivated and churn again.
We have already established that survival analysis is used for modeling time-to-event data, in other words, lifetimes (hence also the name of the Python library which is the go-to tool for this kind of analysis). Generally speaking, we can use survival analysis to try to answer questions like:
- what percentage of the population will survive past a certain time?
- among the survivors, at what rate will they die or fail?
- how do particular characteristics (for example, age, gender, or geographical location) affect the probability of survival?
Now that we have briefly described the general idea of survival analysis, it is time to introduce a few concepts that are crucial for a thorough understanding of the subject.
Censoring
Censoring can be described as the missing data problem in the domain of survival analysis. Observations are censored when the information about their survival time is incomplete. There are different kinds of censoring, such as:
- right-censoring,
- interval-censoring,
- left-censoring.
To keep this section short, we discuss only the kind that is encountered most frequently: right-censoring. Let’s come back to the cancer treatment example. Imagine that the study of the effects of the new medicine lasts 5 years (this is an arbitrary number, not actually based on anything). It can happen that after 5 years, some of the patients are still alive and thus have not experienced the death event. At the same time, the authors of the study may have lost contact with some patients: they might have relocated to another country, or they might have actually died without any confirmation ever being received. Those cases are affected by right-censoring, that is, their true survival time is equal to or greater than the observed survival time (at most the 5 years of the study). The following image illustrates right-censoring.
The existence of censoring is also the reason why we cannot use simple OLS for problems in survival analysis. OLS fits a regression line by minimizing the sum of squared errors, but for censored observations the true survival times, and therefore the error terms, are unknown, so we cannot minimize the MSE. Simple workarounds, such as using the censoring date as the date of the death event or dropping the censored observations, can severely bias the results.
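The bias from these two workarounds can be seen on a toy example. The sketch below is illustrative only: the lifetimes and the 5-year study length are made-up numbers, and in a real study the true lifetimes of censored subjects would of course be unknown.

```python
# Hypothetical true lifetimes in years (unknown in practice for censored subjects).
true_lifetimes = [2, 4, 6, 8, 10]
study_length = 5  # assumed length of the study in years

# What the study actually records: the observed duration and an event flag.
durations = [min(t, study_length) for t in true_lifetimes]
event_observed = [t <= study_length for t in true_lifetimes]

true_mean = sum(true_lifetimes) / len(true_lifetimes)

# Workaround 1: pretend every censored subject died at the censoring time.
naive_as_event = sum(durations) / len(durations)

# Workaround 2: drop the censored subjects entirely.
uncensored = [d for d, e in zip(durations, event_observed) if e]
naive_dropped = sum(uncensored) / len(uncensored)

print("durations:", durations)
print("event observed:", event_observed)
print("true mean lifetime:", true_mean)
print("censoring time treated as death:", naive_as_event)
print("censored subjects dropped:", naive_dropped)
```

Both workarounds underestimate the mean lifetime here, because the subjects who live longest are exactly the ones whose survival time is cut off.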
The Survival Function
The survival function is a function of time (t) and can be represented as

$$S(t) = \Pr(T > t)$$
where Pr() stands for the probability and T for the time of the event of interest for a random observation from the sample. We can interpret the survival function as the probability that the event of interest (for example, the death event) has not occurred by time t.
The survival function takes values between 0 and 1 (inclusive) and is a non-increasing function of t.
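As a minimal sketch of this definition, and assuming for simplicity that every event time is fully observed (no censoring), the survival function can be estimated as the fraction of subjects whose event time exceeds t. The function name and the lifetimes below are hypothetical.

```python
def empirical_survival(event_times, t):
    """Fraction of subjects whose event time is strictly greater than t."""
    return sum(1 for x in event_times if x > t) / len(event_times)

# Made-up, fully observed lifetimes (no censoring).
lifetimes = [2, 4, 6, 8, 10]

for t in [0, 3, 6, 10]:
    print(t, empirical_survival(lifetimes, t))
```

The printed values start at 1 and never increase as t grows, which matches the two properties stated above. With censored data this simple fraction is no longer valid, which is what estimators such as Kaplan-Meier address.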
The Hazard Function
We can think of the hazard function (or hazard rate) as the probability of the subject experiencing the event of interest within a small (or, to be more precise, infinitesimal) interval of time, assuming that the subject has survived up until the beginning of that interval. The hazard function can be represented as:

$$h(t) = \lim_{dt \to 0} \frac{\Pr(t \le T < t + dt \mid T \ge t)}{dt}$$
where the expression in the numerator is the conditional probability of the event of interest occurring in the given time interval, provided it has not happened before. dt in the denominator is the width of the considered interval of time. When we divide the former by the latter, we effectively obtain the rate of the event’s occurrence per unit of time. Lastly, by taking the limit as the width of the interval goes to zero, we end up with the instantaneous rate of occurrence, that is, the risk of the event happening at a particular point in time.
You might wonder why the hazard rate is defined using this small interval of time. The reason is that the probability of a continuous random variable being equal to a particular value is zero. That is why we need to consider the probability of the event happening in a very small interval of time.
Technical note: strictly speaking, the hazard function is not a probability, and the name hazard rate is the more fitting one. Even though the expression in the numerator is a probability, dividing it by dt can yield a hazard rate greater than 1 (it is still bounded below by 0).
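To make the technical note concrete, here is a small worked example. It assumes, purely for illustration, that the event time T follows an exponential distribution with rate λ and density f(t); the identity h(t) = f(t)/S(t) used below follows from the limit definition of the hazard for a continuous T.

$$S(t) = \Pr(T > t) = e^{-\lambda t}, \qquad f(t) = -\frac{dS(t)}{dt} = \lambda e^{-\lambda t}, \qquad h(t) = \frac{f(t)}{S(t)} = \lambda$$

With λ = 2 per year, the hazard rate equals 2 at every t, which is greater than 1. The probability of the event within a short interval of dt = 0.01 years, given survival up to its start, is

$$\Pr(t \le T < t + dt \mid T \ge t) = 1 - e^{-\lambda \, dt} = 1 - e^{-0.02} \approx 0.0198,$$

which is a proper probability. Only after dividing by dt do we get a rate that can exceed 1.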
Lastly, the survival and hazard functions are related to each other as specified by the following formula:

$$S(t) = \exp\left(-\int_0^t h(u)\,du\right)$$
To give the equation a bit of context, the integral in the brackets is called the cumulative hazard and can be interpreted as the sum of the risks the subject faces going from time-point 0 to t.
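The relationship between the hazard and the survival function can be checked numerically. The sketch below assumes a Weibull hazard (chosen only because its survival function has a known closed form), uses arbitrary parameter values, and integrates the hazard with a simple midpoint rule; the function names are hypothetical.

```python
import math

def weibull_hazard(t, shape, scale):
    """Hazard rate of a Weibull distribution at time t."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def cumulative_hazard(hazard, t, steps=10_000):
    """Approximate the integral of the hazard from 0 to t (midpoint rule)."""
    width = t / steps
    return sum(hazard((i + 0.5) * width) for i in range(steps)) * width

# Arbitrary illustrative parameters.
shape, scale, t = 2.0, 2.0, 3.0

H = cumulative_hazard(lambda u: weibull_hazard(u, shape, scale), t)
survival_from_hazard = math.exp(-H)

# Closed-form Weibull survival function for comparison.
survival_closed_form = math.exp(-((t / scale) ** shape))

print("cumulative hazard:", H)
print("exp(-cumulative hazard):", survival_from_hazard)
print("closed-form survival:", survival_closed_form)
```

The two survival values agree up to numerical error, which is what the formula above predicts.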
Different Approaches to Survival Analysis
As survival analysis is an entire domain of statistical methods for working with time-to-event data, there are naturally many different approaches we could follow. On a high level, we could split them into three main groups:
Non-parametric: with these approaches, we make no assumptions about the underlying distribution of the data. Perhaps the most popular example from this group is the Kaplan-Meier curve, which, in short, is a method of estimating and plotting the survival probability as a function of time.
Semi-parametric: as you could have guessed, this group sits between the non-parametric and parametric extremes and makes only a few assumptions. Most importantly, there are no assumptions about the shape of the hazard function/rate. The most popular method from this group is the Cox regression, which we can use to identify the relationship between the hazard function and a set of explanatory variables (predictors).
Parametric: you might have encountered this approach during your studies. The idea is to assume that survival times follow a statistical distribution (popular choices include the exponential, log-normal, Weibull, and Lomax distributions) and use it to estimate how long a subject will survive. Often, we use maximum likelihood estimation (MLE) to fit the distribution (or, more precisely, the distribution’s parameters) to the data.
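To give a feel for the non-parametric and parametric groups, here is a minimal sketch in plain Python. It is not how a dedicated survival analysis library implements these methods; the function names and the data are made up for illustration. The first function computes the Kaplan-Meier estimate for right-censored data, and the second fits an exponential distribution by maximum likelihood, for which the estimated rate under right-censoring is the number of observed events divided by the total observed time.

```python
def kaplan_meier(durations, events):
    """Kaplan-Meier estimate as a list of (event time, survival probability)."""
    event_times = sorted({t for t, e in zip(durations, events) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Subjects still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Subjects who experienced the event exactly at time t.
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve

def exponential_mle_rate(durations, events):
    """MLE of the exponential rate under right-censoring."""
    return sum(events) / sum(durations)

# Made-up data: observed durations and event flags (1 = event, 0 = censored).
durations = [6, 7, 10, 15, 19, 25]
events = [1, 0, 1, 1, 0, 1]

curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"S({t}) = {s:.4f}")

rate = exponential_mle_rate(durations, events)
print("estimated exponential rate:", rate)
```

Note how the censored subjects (durations 7 and 19) never trigger a drop in the Kaplan-Meier curve, yet they still count as being at risk until they leave the study, so their information is used rather than discarded.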
The short list above is by no means exhaustive, and there are many more interesting approaches to analyzing time-to-event data using machine- or deep-learning-based techniques. I will try to cover the most interesting ones in the following posts, so stay tuned :)
Conclusions
In this article, I tried to provide a brief yet thorough introduction to the domain of survival analysis. I believe that this area is often overlooked when talking about different data science solutions. However, by using some simple (or not so simple at all!) methods, we can provide valuable insights for the company or stakeholders and generate real added value.
This article is only the beginning of a short series, and I will keep adding links to the following parts below. In case you have questions or suggestions, please let me know in the comments or reach out on Twitter.
Translated from: https://towardsdatascience.com/introduction-to-survival-analysis-6f7e19c31d96