Why Vector Clock are Easy or Hard?
通過實際例子來闡述vector clock其實是容易理解的, easy
同樣通過實際例子來描述在使用vector clock時會遇到哪些難以解決的問題, hard
Why Vector Clocks are Easy
http://basho.com/blog/technical/2010/01/29/why-vector-clocks-are-easy/
Vector Clocks by Example
通過如下日常的例子來幫助理解vector clock, 你會發現其實這個算法并不復雜.
大家想計劃一下, 下周哪天一起吃飯? 大家如何達成最終的一致性?
之間看下面的圖吧, 過程很清晰
Alice, Ben, Cathy, and Dave are planning to meet next week for dinner.
The planning starts with Alice suggesting they meet on Wednesday.
Later, Dave discuss alternatives with Cathy, and they decide on Thursday instead.
Dave also exchanges email with Ben, and they decide on Tuesday.
When Alice pings everyone again to find out whether they still agree with her Wednesday suggestion, she gets mixed messages:
Cathy claims to have settled on Thursday with Dave, and Ben claims to have settled on Tuesday with Dave.
Dave can't be reached, and so no one is able to determine the order in which these communications happened, and so none of Alice, Ben, and Cathy know whether Tuesday or Thursday is the correct choice.
The story changes, but the end result is always the same: you ask two people for the latest version of a piece of information, and they reply with two different answers, and there's no way to tell which one is really the most recent.
?
Start with Alice's initial message: "Let's meet Wednesday,"
date = Wednesday vclock = Alice:1Ben suggests Tuesday:
date = Tuesday vclock = Alice:1, Ben:1Dave replies, confirming Tuesday:
date = Tuesday vclock = Alice:1, Ben:1, Dave:1
Now Cathy gets into the act, suggesting Thursday:
date = Thursday vclock = Alice:1, Cathy:1Dave has two conflicting objects:
date = Tuesday vclock = Alice:1, Ben:1, Dave:1and
date = Thursday vclock = Alice:1, Cathy:1Dave can tell that these versions are in conflict, because neither vclock "descends" from the other. Luckily, Dave's a reasonable guy, and chooses Thursday. Dave also created a vector clock that is a successor to all previously-seen vector clocks. He emails this value back to Cathy.
date = Thursday vclock = Alice:1, Ben:1, Cathy:1, Dave:2
So now when Alice asks Ben and Cathy for the latest decision, the replies she receives are, from Ben:
date = Tuesday vclock = Alice:1, Ben:1, Dave:1and from Cathy:
date = Thursday vclock = Alice:1, Ben:1, Cathy:1, Dave:2?
From this, she can tell that Dave intended his correspondence with Cathy to override the decision he made with Ben. All Alice has to do is show Ben the vector clock from Cathy's message, and Ben will know that he has been overruled.
That worked out pretty well.
通過上面的圖示, 確實體現出'Why Vector Clocks are Easy’
?
Why Vector Clocks Are Hard
http://basho.com/blog/technical/2010/04/05/why-vector-clocks-are-hard/
但是在實際實現該系統時, 還是有很多問題需要解決的, 從實現的角度體現'Why Vector Clocks Are Hard’
最難的問題是, 以什么作為actor? 以及當vclocks隨著時間不斷變大的時候, 怎樣保存它?
1. What an actor is (i.e. where the incrementing and resolution is, and what parties get their own field in the vector)
2. How to keep vclocks from growing without bound over time.
In this example the parties that actually proposed changes ("clients") were the actors in the vector clocks. This is the model that vector clocks are designed for and work well with, but it has a drawback. The width of the vectors will grow proportionally with the number of clients. In a group of friends deciding when to have dinner this isn't a problem, but in a distributed storage system the number of clients over time can be large. Once the vector clocks get that large, they not only take up more space in disk and RAM but also take longer to compute comparisons over.
這個就談到選取什么作為actor的問題, 上面的例子用client作為actor是很清晰, 但問題是在分布式環境中, client的數量可能是非常多的, 這個就有嚴重的效率問題.
One straightforward idea is to make the servers handling client requests be the "actors", instead of representing the clients directly. Since any given system usually has a known bounded number of servers over time and also usually has less servers than clients, this serves to reduce and cap the size of the vclocks
Let's think through the same example, but with that difference, to see how it goes. We'll assume that a 2-server distributed system is coordinating the communication, with clients evenly distributed among them.
下面的例子就嘗試使用server作為actor來避免client過多的問題
Alice and Dave happen to get server X, and Ben and Cathy get server Y.
We're fine through the first few steps:
The only real difference so far is that each update increments a vector clock field named for the client's chosen server instead of the client itself. This will mean that the number of fields needed won't grow without bound; it will be the same as the number of servers. This is the desired effect of the change.
We run into trouble, though, when Cathy sends her update:
?
In the original example, this is where a conflict was created. Dave sorted out the conflict, and everything was fine.
With our new strategy, though, something else happened. Ben and Cathy were both modifying from the same original object. Since we used their server id instead of their own name to identify the change, Cathy's message has the same vector clock as Ben's! This means that Dave's message (responding to Ben) appears to be a simple successor to Cathy's... and we lose her data silently!
用server作為actor的問題就是, 會丟失一些數據, 因為Cathy和Ben都操作Y, 所以對于系統而已, Cathy和Ben沒有區別都是Y, 所以當Dave confirm Ben后, vector為X2, Y1, 這個時候再收到Cathy的X1, Y1時, 就會認為是過期數據, 自動丟掉. 這樣就無法解決上面的協調這個問題, 所以這個簡單solution是不可行的.
發生這個問題的原因如下,
For vector clocks to have their desired effect without causing accidents such as this, the elements represented by the fields in the vclock must be the real units of concurrency. In a case like this little example or a distributed storage system, that means client identifiers, not server-based ones.
server不是真正的并發單元, 而你把他作為actor, 其實是不合理的, 所以就會有現在這個問題
?
Just Lose a Little Information and Everything Will Be Fine
If we use client identifiers, we're back in the situation where vector clocks will grow and grow as more clients use a system over time.
The solution most people end up with is to "prune" their vector clocks as they grow.
This is done by adding a timestamp to each field, and updating it to the current local time whenever that field is incremented. This timestamp is never used for vclock comparison -- that is purely a matter of logical time -- but is only for pruning purposes.
This way, when a given vclock gets too big, you can remove fields, starting at the one that was updated longest ago, until you hit a size/age threshold that makes sense for your application.
But, you ask, doesn't this lose information?
Yes, it does -- but it won't make you lose your data.
The only case where this kind of pruning will matter at all is when a client holds a very old copy of the unpruned vclock and submits data descended from that. This will create a sibling (conflict) even though you might have been able to resolve it automatically if you had the complete unpruned vclock at the server. That is the tradeoff with pruning: in exchange for keeping growth under control, you run the chance of occasionally having to do a "false merge"... but you never lose data quietly, which makes this approach unequivocally better than moving the field identifiers off of the real client and onto the server.
這個方法思路, 就是既然我們必須用client作為actor, 問題就是vector會越來越大, 那么可不可以解決這個問題...
那就是prune,? 根據什么去prune, timestamp, 這樣定期把比較老的client數據清掉, 就可以比較合理的解決這個問題.
這個方案是否有風險? 有一些, 不過可以接受, 所以是有價值的tradeoff, 至少我們現在不會lose data quietly.
轉載于:https://www.cnblogs.com/fxjwind/archive/2012/06/06/2537963.html
總結
以上是生活随笔為你收集整理的Why Vector Clock are Easy or Hard?的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 单身女性有必要购房吗?
- 下一篇: 用iptables来防止web服务器被C