Wall-clock constraints in distributed systems

On a project I am working on, we had a discussion about how to implement certain constraints. The project in question is (or will be) a heavily distributed system, replacing an old monolithic system. The original system has many constraints like “the date for form X cannot be in the future”. These constraints are very difficult to implement in a distributed system. In this blog post I’d like to discuss some aspects of managing these kinds of wall-clock constraints.

La persistencia de la memoria, Salvador Dalí

The problem

So why are these wall-clock constraints a problem? Essentially, in a distributed system, there is no such thing as a single clock. Each and every server has its own clock. Client applications have their own clocks. So the question “what is the current time?” does not have a single answer in a distributed system. Determining whether a given date is “in the future” means that we have to choose which clock to trust for the current time.

Imagine that we have a client application where our customers can register an event that happened (e.g., an animal is born). To prevent data corruption, we want to make sure that the date and time of the event make sense. Since an event has to have happened before you can register it, we end up with a business rule like “the date cannot be in the future”. Easy enough: the client application can check this against the clock on the client machine (if it is a web-based application, use the clock of the browser’s host).

But, as we all know, client applications cannot be trusted, and validation should (also) happen on the server side. And here we run into a problem: what happens if our server’s clock is slightly behind the client’s clock? (Note that we are not talking about time-zone differences; that is a topic by itself.)

A (too) simple solution

One approach could be to allow for clock skew. We accept that the clocks of the client and the server might differ and allow some margin. We reformulate the constraint to be “the date for form X cannot be more than five seconds in the future”. Now, even if the clocks of both systems differ a bit, the constraint will not trigger. But why five seconds? What is the maximum skew between clocks? (Hint: always more than you think.) If we pick the margin too small, customers might get frustrated when valid input is rejected. If we pick the margin too large, we detect fewer mistakes. It gets worse: the bigger the margin, the bigger the chance that an event being processed at the server is actually in the future from the server’s point of view. All kinds of implicit assumptions that an event is always in the past will crash and burn.
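To make the trade-off concrete, here is a minimal sketch of such a skew-tolerant check. The names `ALLOWED_SKEW` and `is_acceptable` are made up for this illustration, and the five-second margin is just the number from the example above, not a recommendation.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical tolerance: the "five seconds" from the reformulated constraint.
ALLOWED_SKEW = timedelta(seconds=5)


def is_acceptable(event_time: datetime, server_now: datetime,
                  allowed_skew: timedelta = ALLOWED_SKEW) -> bool:
    """Return True if event_time is at most allowed_skew ahead of the server clock."""
    return event_time <= server_now + allowed_skew


# An arbitrary fixed "server time" so the example is reproducible.
server_now = datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

# The client clock is 3 seconds ahead of the server: accepted.
print(is_acceptable(server_now + timedelta(seconds=3), server_now))  # True

# The client clock is 8 seconds ahead: rejected, although the customer did nothing wrong.
print(is_acceptable(server_now + timedelta(seconds=8), server_now))  # False
```

Note that the first event is accepted even though, according to the server, it has not happened yet. Any code further down the line that assumes “events are always in the past” now has to deal with that.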

Asking the right questions

From a distributed-system point of view, we need to ask different questions. Do we need this server-side validation? Is this not simply data that a customer enters? We can let the client application perform the check (validating the date against the clock closest to the customer) and simply trust the date as given, which is pretty much what we do with most data entry form fields. If we want a date that can be checked server-side, why do we even ask the client? Why do we want to mix these frames of reference? If it is simply a timestamp (the event happened roughly now), we might as well use the server’s clock for that. It gets more interesting when we ask what happens on the server side if the date is in the future. What kinds of processes use this date? What goes wrong if the date is in the future? In most cases, if something goes wrong, it is because there is an assumption that some event A happens only after some other event B (e.g., to give an animal a name, it first has to be born).
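One way to avoid mixing the two frames of reference is to store them side by side instead of comparing them. The sketch below is only an illustration of that idea; `Registration`, `reported_at`, `received_at` and `register` are hypothetical names, not part of any existing system.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Registration:
    # Entered by the customer and validated only against the client's clock.
    reported_at: datetime
    # Assigned by the server from its own clock; never compared to reported_at.
    received_at: datetime


def register(reported_at: datetime, now: Optional[datetime] = None) -> Registration:
    """Store the customer's date as plain data and stamp it with the server's time."""
    received_at = now if now is not None else datetime.now(timezone.utc)
    return Registration(reported_at=reported_at, received_at=received_at)


# The customer's date may be "ahead" of the server; we record both and reject nothing.
registration = register(datetime(2020, 1, 1, 12, 0, 8, tzinfo=timezone.utc),
                        now=datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc))
print(registration)
```

The `now` parameter exists only to make the sketch testable; in normal use the server would fill in its own clock value.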

Causal consistency

And here we come to a core observation: we need to rewrite these business rules into causal relations. In this example, it is perfectly fine to have an animal born in the future, as long as it is born before we give it a name. The event being-born happens-before the event giving-a-name. And, as an additional bonus: if you can rewrite all your business rules this way, you have a nice consistency model for your distributed system. Causal consistency is the next big thing: a consistency model that mere mortals can understand and that is actually pretty well defined. Well, at least compared to most of the over-promising and vague eventually consistent models.
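As a rough illustration of what such a causal rule could look like in code, the sketch below lets each event list the events it depends on and checks those dependencies instead of any clock. Everything here (`Event`, `EventLog`, `depends_on`) is a hypothetical design for this post, and rejecting an event whose dependencies are missing is just one option; a real system might buffer it until the dependencies arrive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str
    kind: str               # e.g. "born" or "named"
    subject: str            # e.g. the identifier of the animal
    depends_on: tuple = ()  # ids of events that must have happened before this one


class EventLog:
    """Accepts an event only if everything it causally depends on has been seen."""

    def __init__(self):
        self._seen = {}

    def contains(self, event_id: str) -> bool:
        return event_id in self._seen

    def append(self, event: Event) -> None:
        missing = [dep for dep in event.depends_on if dep not in self._seen]
        if missing:
            raise ValueError(f"{event.event_id} depends on unseen events: {missing}")
        self._seen[event.event_id] = event


log = EventLog()
born = Event("e1", "born", "animal-1")
named = Event("e2", "named", "animal-1", depends_on=("e1",))

log.append(born)   # no wall-clock check at all
log.append(named)  # fine: being-born happens-before giving-a-name

try:
    # Naming an animal whose birth we have never seen violates the causal rule.
    EventLog().append(named)
except ValueError as error:
    print(error)
```

Notice that no timestamp appears anywhere in the rule: whether the birth date is “in the future” according to some clock no longer matters, only the order of the events does.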


Arjan Lamers
