Consistency Models and Leader-Follower Replication

Introduction

This article continues the exploration of the leader-follower replication technique and focuses on how to achieve specific consistency models with it. Although this might seem straightforward at first glance, there are many intricacies and low-level implementation details that can produce unexpected results in some cases.

For each consistency model I give illustrations and an example of a use case for it in a leader-follower setup. These examples are not designs of real systems, because real systems have far more complexity and nuance; their main purpose is to illustrate the point as clearly as possible.

If you have not yet read the articles about the basics of leader-follower replication and the ways of organizing the replication stream, I suggest reading them first; this will make this article easier to follow.

Also, to be comfortable with the concept of consistency and the different consistency models, I recommend watching the videos about data-centric [1] and client-centric [2] consistency models.

None of this is a strict requirement, so feel free to proceed if you already have some knowledge of these topics.

Nuances of Replication Implementation

In this section, I give some background on the intricacies of replication implementations that can lead to non-obvious data inconsistency. For example, even with synchronous followers we might get out-of-order or delayed writes.

This will be important in the next sections, where we discuss how to achieve specific consistency models.


Synchronous Followers: The Hidden Delay in Visibility

Intuitively, it seems that with synchronous followers the data is immediately visible on the followers, but this is not always the case; it depends on some intricacies of the configuration.

We will review the aspects of the PostgreSQL replication implementation related to visibility. Essentially, PostgreSQL uses WAL-based replication, described in more detail in my previous article. This replication can be tuned with the synchronous_commit parameter, which, as documented in [3], can take several different values.

synchronous_commit:
- off
- local
- remote_write
- on
- remote_apply

The synchronous_commit parameter defines what exactly happens synchronously in the replication process.

Below is the conceptual sequence of steps taken during synchronous replication.

Synchronous Commit Options in Postgres
  • synchronous_commit:off
    • First, when a write is issued on the leader, it generates WAL records, which are initially stored in memory in the WAL buffer. With synchronous_commit:off the leader acknowledges the write and responds to the client at this point.
    • Obviously, if the leader crashes at this point, the acknowledged writes are lost forever. (There is a background process which flushes the WAL to disk, but we can still lose writes that were not yet flushed.)
  • synchronous_commit:local
    • Then the leader flushes the generated WAL records to the file system. With synchronous_commit:local the leader acknowledges the write at this point.
    • Now even in case of a power outage the data won’t be lost, but it will be unavailable until the leader is recovered.
  • synchronous_commit:remote_write
    • After the WAL segments are written to disk, the walsender process works with the walreceiver to deliver them to the follower. Upon receiving the WAL records, the walreceiver writes them to the file system (without a flush). With synchronous_commit:remote_write the acknowledgement is sent at this point.
    • Should the postgres process crash, the WAL records on the follower will survive, but if the operating system or the whole follower server crashes, data can be lost.
  • synchronous_commit:on
    • Then the remote follower flushes the WAL from the file system buffer to disk. With synchronous_commit:on the acknowledgement is sent to the client at this point.
    • This mode guarantees durability of the data in case the follower crashes, as the data can be restored.
  • synchronous_commit:remote_apply
    • Here the WAL records are actually applied on the follower and become visible to any client reads.

The default value of synchronous_commit in PostgreSQL is on. It is evident that although the commit is synchronous in some sense, data will not be visible on the followers immediately after acknowledgement.

Such complexity is added to let users of the database choose the trade-off between consistency and durability on one hand and performance on the other. The sooner the acknowledgement is sent to the client, the better the response time and throughput, but we pay with weaker durability and non-immediate visibility of changes.

The point is that by default you might not get immediate visibility with synchronous followers, even though on the surface this seems to contradict the synchronous nature of the replication.
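
To make the trade-off concrete, here is a minimal sketch of picking a synchronous_commit level per transaction, assuming a PostgreSQL leader with a configured synchronous standby and the psycopg2 driver (the connection string and table names are placeholders):

```python
# Minimal sketch: choosing the synchronous_commit trade-off per transaction.
# Assumes a PostgreSQL leader with a configured synchronous standby; the
# connection string and table names are placeholders.
import psycopg2

conn = psycopg2.connect("host=leader.example dbname=app user=app")

def write_low_value_event(payload: str) -> None:
    # Acknowledged as soon as the WAL record is in the in-memory buffer:
    # fastest option, but the write can be lost if the leader crashes.
    with conn.cursor() as cur:
        cur.execute("SET LOCAL synchronous_commit TO 'off'")
        cur.execute("INSERT INTO events (payload) VALUES (%s)", (payload,))
    conn.commit()

def write_critical_record(payload: str) -> None:
    # Acknowledged only after the standby has *applied* the WAL, so the
    # change is immediately visible to reads on that follower.
    with conn.cursor() as cur:
        cur.execute("SET LOCAL synchronous_commit TO 'remote_apply'")
        cur.execute("INSERT INTO records (payload) VALUES (%s)", (payload,))
    conn.commit()
```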

Out-Of-Order Writes

Out-of-order writes are the phenomenon where some writes are applied on the leader in one order but become visible on the followers in a different order.

At first glance it might seem that out-of-order writes cannot occur on replicas in leader-follower replication: after all, all writes are linearized on the leader into some specific order, written into some kind of log, sent to the followers and re-applied there.

What could go wrong here? In fact, several things.

Asynchronous Followers

First, let’s review the problems with asynchronous followers, as they offer weaker consistency by default.

The leader allows parallel writes, and they are linearized in some specific way. If we want all writes to appear in the same order on the followers, we either have to use single-threaded replication, as in older versions of MySQL, or we need some synchronization process. Otherwise we have to accept that some writes will be out of order.

Single-Threaded Replication

Single-threaded replication has a performance issue: since the leader is multi-threaded, it can process a much higher write rate than a single-threaded follower. The single thread becomes a bottleneck. This can result in large follower lag and a pile-up of unprocessed replication changes.

To solve this, some kind of parallel replication is usually used, especially for asynchronous followers. Depending on the exact implementation of this parallel replication, out-of-order writes might be possible on replicas, either as normal documented behavior or as a result of bugs and edge cases.

Multi-Threaded Replication

For example, in MariaDB there are two modes of parallel replication – in-order and out-of-order [4]. With the out-of-order mode it is possible that writes are applied on a replica in a different order compared to the primary. The system designer must make a conscious choice of replication method: obviously, forcing all writes to become visible in the same order decreases system performance because of the need for synchronization.
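
As a rough operational sketch of this choice (Python with the mysql-connector-python driver; host and credentials are placeholders, and the statements require the replica SQL thread to be stopped). Note that the in-order parallel modes still commit in the primary's order; true out-of-order application in MariaDB comes from writes tagged with different gtid_domain_id values on the primary [4]:

```python
# Rough sketch: switching a MariaDB replica's parallel replication settings.
# Uses mysql-connector-python; host and credentials are placeholders.
import mysql.connector

conn = mysql.connector.connect(host="replica.example", user="admin", password="secret")
cur = conn.cursor()

cur.execute("STOP SLAVE SQL_THREAD")            # settings require a stopped SQL thread
cur.execute("SET GLOBAL slave_parallel_threads = 4")
# 'optimistic' applies transactions in parallel but still commits them in the
# primary's order; out-of-order application requires separate replication
# domains (gtid_domain_id) on the primary.
cur.execute("SET GLOBAL slave_parallel_mode = 'optimistic'")
cur.execute("START SLAVE SQL_THREAD")
```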

Synchronous Followers

With synchronous followers, accepting a write on the leader requires an explicit acknowledgement from each synchronous follower. Does this fully solve the problem of out-of-order writes?

The answer is: it depends.

For example, MySQL in semi-synchronous replication mode (where one follower must explicitly acknowledge writes) does not always strictly guarantee that writes are applied on the follower in order, even though each of them has to be acknowledged before the write takes effect. This is a consequence of the replication architecture: depending on the MySQL settings, the synchronization may cover only the delivery of the binlog, while the later application of the binlog is parallelized, and under some combinations of settings we can observe out-of-order commits on the replica.

Out-Of-Order Writes with Synchronous Followers

Thus, despite each write being synchronously acknowledged by the follower, its visibility on the follower depends on how the follower applies the writes, which can be asynchronous and parallelized.

This behavior is not unique to MySQL; other databases can have similar consequences.

So the point is that using synchronous followers does not by itself eliminate out-of-order writes. You need to carefully read the documentation for your specific database, and most likely there will be configuration parameters that need to be set properly to prohibit out-of-order writes. Of course, this will incur a performance cost.
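
For example, on a MySQL 8 multi-threaded replica the commit order is governed by dedicated settings; here is a hedged sketch (variable names differ across versions – older releases call them slave_parallel_type and slave_preserve_commit_order):

```python
# Sketch: forcing a MySQL 8 multi-threaded replica to commit transactions in
# the leader's order. Uses mysql-connector-python; credentials are placeholders.
import mysql.connector

conn = mysql.connector.connect(host="replica.example", user="admin", password="secret")
cur = conn.cursor()

cur.execute("STOP REPLICA SQL_THREAD")
cur.execute("SET GLOBAL replica_parallel_type = 'LOGICAL_CLOCK'")
# Workers still apply changes in parallel, but commits are serialized in
# relay-log order, which prevents out-of-order commits on the replica.
cur.execute("SET GLOBAL replica_preserve_commit_order = ON")
cur.execute("START REPLICA SQL_THREAD")
```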

Eventual Consistency in Leader-Follower Systems

How to achieve

In any configuration of leader-follower replication, eventual consistency is achieved without any specific conditions.

Even if we have fully asynchronous followers and read randomly from the leader/followers, the leader and the followers will eventually converge to the same state (assuming the flow of writes to the leader pauses for some time).

There is some uncertainty in the industry about the definition of eventual consistency, namely: can we consider a system eventually consistent if it violates the durability property?

Here we will consider that eventual consistency can be achieved in a leader-follower schema even when durability is not guaranteed. Here are the reasons for it.

Reasons

For example, with asynchronous followers, if the leader dies right after accepting a write and this write is lost (say, the leader is never recovered), can this system be called eventually consistent?

Let’s look into the definitions:


[5] (Terry, 1995)

While the replicas held by two servers at any time may vary in their contents because they have received and processed different Writes, a fundamental property of the Bayou design is that all servers move towards eventual consistency. That is, the Bayou system guarantees that all servers eventually receive all Writes via the pair-wise anti-entropy process and that two servers holding the same set of Writes will have the same data contents


[6] (Werner Vogels, 2008)

Dynamo is designed to be an eventually consistent data store; that is all updates reach all replicas eventually


[7] (Bailis, 2013)

While eventual consistency is relatively easy to achieve, the current definition leaves some unfortunate holes. First, what is the eventual state of the database? A database always returning the value 42 is eventually consistent, even if 42 was never written. Amazon CTO Werner Vogels’ preferred definition specifies that “eventually all accesses return the last updated value”; accordingly, the database cannot converge to an arbitrary value. Even this new definition has another problem: what values can be returned before the eventual state of the database is reached? If replicas have not yet converged, what guarantees can be made about the data returned?

It looks like all authors agree that all writes must be received by all servers to call a system eventually consistent. But at the same time Amazon Dynamo allows specifying the parameter W = 1, which means that a write has to be accepted by only one server. This makes exactly the same loss of a write and violation of durability possible, and nevertheless Dynamo is called eventually consistent.

Also, even if we have a synchronous follower but it dies together with the leader, the data will be lost. Even if we require W = 5 servers to confirm a write, they can all fail; in that case the durability property is violated, but it does not make Dynamo not eventually consistent.

Eventual consistency can be used when we want to maximize system performance and when the temporary inconsistency can be handled relatively easily by all clients.

Example: Enterprise Configuration Management System

Independent feature flags

Let’s consider a Feature Flag Service where all features are strictly independent from each other.

Feature Flag Service

Let’s make several statements and requirements about this example:

  • There is a relatively low write rate, as feature flags are not updated very often – several times a day at most.
  • There is a very high read rate of feature flags; let’s say we have many services (thousands) that read the values of feature flags very often.
  • It does not matter to us if feature flags are updated in each service with some delay after the admin applies them.
  • It does not matter to us if feature flags are flapping, i.e. a client sees Feature_A on, then off, and then on again.
  • Feature flags are strictly independent from each other:
    • we never rely on client services expecting some specific order of switching those flags, or some specific combination of them.
    • each feature flag can be written/read/propagated independently from the others.
    • it does not matter to us if updates are propagated in the same order, i.e. if the admin switches on Feature_A and then Feature_B, it will not cause a problem if a client first sees Feature_B switched on and only after that Feature_A.
  • It matters to us that all writes the admin makes have a specific order determined at the moment of the write, and that all services eventually have the same view of the feature flags.

Having specified those requirements, let’s see how they can be satisfied with a leader-follower replication schema with asynchronous followers and random selection of the follower to read from.

Writes are done on a single leader, and because of that all writes are linearized into a single order. All reads are done from a random follower, so that the load is distributed among them.

Since different updates can arrive at different followers with different delays, and services connect to the followers absolutely randomly, all kinds of anomalies are possible. But given our application specifics, we don’t care about them, because of the requirements provided above.

Let’s illustrate this with an example where we have a system with a single leader and several followers. Clients in this case are services which connect to a balancer that randomly chooses a follower for each read. The admin connects to the leader to make writes.

In the initial state Feature_A is off. Then the admin switches it on, and the update is asynchronously replicated to Followers 1 and 3.

Feature Flag Service on top of distributed database: Update Is Propagated to Followers 1 and 3, while the client reads it from Follower 2 or 4.

During the read, Service A was routed to Follower 2 or 4, where the feature is off, so it sees the feature state as off.

Then, during the second read, Service A could be routed to Follower 1 or 3 and see the feature state as On.

Feature Flag Service on top of distributed database: Update Is Propagated to Followers 1 and 3 and the client reads the feature state from them.

Then, if Service A reads again, it could again be routed to Follower 2 or 4, and thus it will read the feature state as Off despite having already seen it On.

Feature Flag Service: Update Is Propagated to Followers 1 and 3, while the client reads it from Follower 2 or 4.

So we have seen an example where the feature “flaps” from Service A’s point of view, but per our requirements such behaviour is acceptable as long as it eventually sees the final state. Eventually, all followers will receive the updates and every read will return On for this feature, assuming there were no further changes from the admin.

Feature Flag Service: all followers got the update, so the client will always read On as the feature state.

Such a system satisfies all the requirements and does not introduce additional performance impact, as we can have as many asynchronous followers as we want, and eventual consistency suits the purpose of the system.

Mutually Dependent Feature Flags

So far we have placed a constraint on the independence of feature flags from each other. But what if that is not the case? Let’s consider an example where we have a complex transition period and feature flags depend on each other.

Let’s assume that we have a website and two changes which we want to control separately:

  • add a new webpage (controlled by Feature_Flag_Generate_Page_3)
  • add a link to this new page on the main page (controlled by Feature_Flag_Show_Page_3_Ref)

Dependent features controlled by two separate beta flags

When both flags are off, there is no new link on the main page and no page 3 is generated. When both flags are on, page 3 is generated and the link to it is added to the main page.

The picture below shows an acceptable state, when the flag enabling page generation is on but there is still no link to it from the main page.

However, the opposite state is not acceptable, as it leads to a broken link and could be noticed by users and perceived as a bug.

At first glance it seems that eventual consistency will not work for us, because it does not guarantee the order in which the writes to the flags are propagated, so it is possible to have a situation where some service sees the page 3 flag off but the page reference flag on.

Let’s see what could happen with eventual consistency. Assume that initially all flags are OFF, and then the admin executes the following changes:

  1. Switch Feature_Flag_Generate_Page_3 On
  2. Switch Feature_Flag_Show_Page_3_Ref On

Although the admin changed those flags in the proper order, the client got an invalid state because of the out-of-order writes which are possible with eventual consistency.

Broken state due to out-of-order writes

In this example Follower 1 has received the Feature_Flag_Show_Page_3_Ref change but has not yet received the Feature_Flag_Generate_Page_3 change.

The following picture shows the possible states, and the transitions between them, which could be seen by a client.

Possible states and transitions with eventual consistency

In the described scenario, eventually all clients will see Possibility 3, but in between they can see any of the possibilities, and moreover they can see them in any order, because each new request could potentially be served by a follower in a different state.

Solutions involving other consistency models

To solve this problem we can make the consistency requirements stricter and choose another consistency model.

Consistent Prefix Read

With Consistent Prefix Read, a client can see only a state which really existed on the leader. The state which results in a broken link never existed on the leader, so it is impossible to observe such a state when querying a follower. Some flapping is still possible, because different followers can be in different states (from the subset of allowed ones).

Possible states and transitions with consistent prefix read

So it is possible that a client will see the flags on, and then will see a state from the past when the flags were off. But it will never see a non-existing combination of flags which could lead to the broken-link issue.

Monotonic Reads

Alternatively, we could solve this using the Monotonic Reads session guarantee for clients. Monotonic Reads ensures that once a client has seen some state, it will never see a state from the past.

Possible states and transitions with Monotonic Reads

Using both guarantees together restricts the possible transitions even further: with them, flapping is no longer possible.

Solution based on eventual consistency

Above it was shown how to solve our problem of mutually dependent feature flags by switching from eventual consistency to another, stricter consistency level.

But in fact there is another way: we can still use an eventually consistent store and effectively get the Consistent Prefix Read guarantee in the client application, with just a small change to our data model.

Here is a variation of our Feature Flag Service, assuming that some features must depend on each other.

Feature Flag Service supporting Dependent Flags on Top of an Eventually Consistent Store

Considerations for such a system:

  • We assume that a single document can be written to the database atomically.
  • All feature flags are kept in a single document, so even though the storage system is only eventually consistent, our application effectively gets the Consistent Prefix Read consistency model. This is important in the case of dependent flags.
  • If we add a version number to that document, we get a very simple implementation of Monotonic Reads – and all of this is done very simply on top of an eventually consistent system! A sketch of this client logic follows below.
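
Here is a minimal sketch of that client logic, assuming only that the store returns whole documents atomically; the document shape ("flags", "version") and the replica interface are illustrative, not a concrete database API:

```python
# Sketch: Consistent Prefix Read + Monotonic Reads on top of an eventually
# consistent store, by keeping all flags in one atomically-written document
# with a version number. The replica interface is illustrative.
class FlagClient:
    def __init__(self, replicas):
        self.replicas = replicas      # connections to replicas, any order
        self.last_version = -1        # highest document version seen so far

    def read_flags(self) -> dict:
        for replica in self.replicas:
            doc = replica.get("flags")            # atomic whole-document read
            if doc is not None and doc["version"] >= self.last_version:
                # Any whole document is a state that really existed on the
                # leader (consistent prefix); refusing to go backwards in
                # version additionally gives Monotonic Reads.
                self.last_version = doc["version"]
                return doc["flags"]
        raise RuntimeError("no replica has caught up to the last seen version")
```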

This example shows that it greatly depends on the application which consistency model it can function properly on top of. There are cases when changing the application allows us to use storage with weaker guarantees and thus improve the performance and availability of the system. However, in some cases the application logic becomes too complex if we try to make it work on top of a too-weak consistency model. So each time, all benefits and trade-offs need to be weighed to choose the best solution for the specific problem.

Eventual Consistency Conclusion

Eventual consistency is achieved with leader-follower replication even with asynchronous replication, when each read is sent to a random follower (a.k.a. secondary), because if writes stop, all servers will eventually reach the same state.

Though it should be remembered that a schema with asynchronous followers might not be durable in case of leader failure and the failover process; also, in a split-brain scenario writes might not be durable.

Consistent Prefix Reads in Leader Follower Systems

Consistent Prefix Read means that on any read the client gets some state which actually existed on the leader at some point in time.

There is one condition required to satisfy Consistent Prefix Reads: out-of-order writes must be prohibited, as was discussed in the corresponding section.

Consistent Prefix Reads

Under the condition that all writes become visible on every follower exactly in the order they were acknowledged by the leader, Consistent Prefix Read is automatically satisfied no matter where the client reads from. It can read from the leader, from a synchronous or an asynchronous follower; it can make one read from one follower and another read from a different follower, and it will still get a consistent prefix.

An example of when we might need a consistent prefix was provided in the section above: the case where we have mutually dependent flags in our feature flag service.

Session Guarantees aka Client-Centric Consistency in Leader Follower Systems

Session guarantees are implemented on the client side on top of an eventually consistent system. This means that if the schema with asynchronous followers provides eventual consistency, session guarantees can be achieved as well, simply by adding client logic that properly selects the server to read from.
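
The simplest piece of such client logic is making the follower choice sticky, which is also the precondition used in the table below; a minimal sketch (follower addresses are placeholders):

```python
# Sketch: sticky routing of a client session to one follower, so that all
# reads of a session observe a single replica's timeline (this is what the
# "always the same follower!" rows below assume). Addresses are placeholders.
import hashlib

FOLLOWERS = ["follower-1:5432", "follower-2:5432", "follower-3:5432"]

def follower_for_session(session_id: str) -> str:
    # The same session id always maps to the same follower.
    digest = hashlib.sha256(session_id.encode()).digest()
    return FOLLOWERS[int.from_bytes(digest[:4], "big") % len(FOLLOWERS)]
```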

I recommend watching the two videos referenced in the introduction ([1], [2]) to learn more about session guarantees and their implementation.

Also, even without any specific algorithms, if the client always reads from the leader, then all session guarantees are satisfied, unless there is a failover and some writes are lost.

What happens if the client reads from a follower?

Let’s see which session guarantees are satisfied when reading from the leader vs. reading from a follower.

The table below also assumes normal functioning of the system without failover.

Read from Leader:
  • Monotonic Reads: Yes
  • Read Your Writes: Yes
  • Monotonic Writes: Yes
  • Write Follows Read: Yes

Read from Synchronous Follower (always the same follower!):
  • Monotonic Reads: Yes
  • Read Your Writes: Possible. It depends on the exact database and its settings: under some setups “synchronous” may apply only to the changelog replication, not to applying that changelog on the replica.
  • Monotonic Writes: Possible. As we have seen above, even synchronous replicas do not always prevent out-of-order writes.
  • Write Follows Read: Possible. Writes always happen on the leader, and we always read from the same follower; the only edge case is out-of-order writes, which should be prohibited.

Read from Asynchronous Follower (always the same follower!):
  • Monotonic Reads: Yes
  • Read Your Writes: No. The write happens on the leader, and a read from the follower might not yet reflect that write.
  • Monotonic Writes: Possible. As we have seen above, out-of-order writes are possible unless explicitly prohibited.
  • Write Follows Read: Possible. Writes always happen on the leader, and we always read from the same follower; the only edge case is out-of-order writes, which should be prohibited.

As expected, when reads are done from the leader, all guarantees are satisfied, because in this case there is effectively one node which processes both reads and writes.

Assuming that, when reading from followers, a single client always reads from the same follower and replication is properly configured, it is possible to achieve almost all session guarantees, except Read Your Writes when reading from an asynchronous follower. To achieve it, the client must be taught to preserve Read Your Writes, either by blocking the read until the write has arrived or by searching for a follower that has already received the write.
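
For PostgreSQL, the blocking variant can be sketched with the built-in WAL position functions pg_current_wal_lsn() and pg_last_wal_replay_lsn(); the connections, table name and retry policy here are simplified placeholders:

```python
# Sketch: client-side Read Your Writes on PostgreSQL. After a write, remember
# the leader's WAL position; before reading, wait until the follower has
# replayed past it. Table name and connections are placeholders.
import time
import psycopg2

def write_and_remember(leader) -> str:
    with leader.cursor() as cur:
        cur.execute("UPDATE flags SET value = 'on' WHERE name = 'Feature_A'")
    leader.commit()
    with leader.cursor() as cur:
        # WAL position after our commit record; once a follower has replayed
        # up to here, our write is visible there.
        cur.execute("SELECT pg_current_wal_lsn()")
        lsn = cur.fetchone()[0]
    leader.commit()
    return lsn

def read_after(follower, lsn: str) -> str:
    follower.autocommit = True  # poll outside an explicit transaction
    with follower.cursor() as cur:
        while True:
            cur.execute("SELECT pg_last_wal_replay_lsn() >= %s::pg_lsn", (lsn,))
            if cur.fetchone()[0]:
                break
            time.sleep(0.05)    # naive backoff; a real client would cap retries
        cur.execute("SELECT value FROM flags WHERE name = 'Feature_A'")
        return cur.fetchone()[0]
```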

Causal Consistency in Leader-Follower Systems

Causality establishes a partial order of events: some events happen-before others, and some happen after them. Events without a happens-before relationship are concurrent. Causal consistency guarantees that events with a causal relationship are seen by clients in the proper order; events without a causal relationship can be seen in any order.

Causal consistency can be implemented as a combination of all four session guarantees ([8], [9], [10]):

  • Monotonic Reads
  • Monotonic Writes
  • Read Your Writes
  • Write Follows Read

and since client-centric consistency models can be implemented with a leader-follower schema even with asynchronous followers, it is possible to implement causal consistency as well.

Causal consistency is also automatically satisfied if the client always reads from the primary, unless the primary fails and some writes are lost.
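
MongoDB is a concrete system that implements cluster-wide causal consistency [10]; here is a minimal sketch with the pymongo driver (the URI, database and collection names are placeholders):

```python
# Minimal sketch of causally consistent reads/writes with pymongo [10].
# URI and collection names are placeholders; for the full guarantees MongoDB
# recommends majority read/write concerns.
from pymongo import MongoClient

client = MongoClient("mongodb://replicaset.example/?replicaSet=rs0")
db = client.social

with client.start_session(causal_consistency=True) as session:
    # The session threads a causal token (cluster time) through operations,
    # so the read below is guaranteed to observe the preceding write, even
    # if it is served by a secondary.
    db.posts.insert_one({"author": "Alice", "text": "Hello"}, session=session)
    post = db.posts.find_one({"author": "Alice"}, session=session)
```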

Example: Social-Media With Posts and Replies

Let’s imagine we use a system with leader-follower replication in a social media platform where users can write posts and other people can leave comments under the posts.

In this picture we have Alice, who always reads from Async Follower 1, and Bob, who always reads from Async Follower 2. They both send write requests to the leader, because in the leader-follower model only the leader accepts writes.

Use Case For Causal Consistency

In this picture, yellow posts and comments belong to Alice, while purple posts and comments belong to Bob. The arrows depict the causal relationships between posts and comments.

Causal consistency ensures there are no anomalies where someone sees a comment without the comments and posts that happened-before it. At the same time, the mutual order of Alice’s and Bob’s posts does not matter, nor does the mutual order of comments on different posts.

Sequential Consistency in Leader-Follower Systems

Introduction

If you are not confident in your understanding of sequential consistency, I recommend first watching my video about it; or you can try reading first and get back to the video if something is not clear.

In sequential consistency we must be able to place all reads and writes into some total order for all processes/clients. This total order must respect the program order of each specific process/client. But unlike linearizability, there are no requirements on the mutual (real-time) order of operations of different processes.

It seems that in practice we can’t get pure sequential consistency with a single leader + followers system; we will get either a weaker or a stronger consistency model. But if we consider a system where each partition has its own leader, such a system satisfies the sequential consistency properties while formally remaining a single-leader system, because each piece of data has only one leader.

Traditional Leader-Follower System

First, let’s review the asynchronous follower case. If we read from an arbitrary or a specific asynchronous follower, we will not get sequential consistency. Let’s see an example.

Broken Sequential Consistency With Asynchronous Follower

We can observe that the client wrote x = 1 and right after that read x back from the asynchronous follower, getting the old value. The requirement that every single process/client sees its own operations in program order is violated here, so we can say that such a system is not sequentially consistent.

What if we try to add Read Your Writes session guarantee and see if it might become sequentially consistent?

Let’s see an example which illustrates why it does not work. Now there are two clients: Client A and Client B. First Client A writes x = 1, then Client B writes y = 1. Then, before those writes are propagated to the followers, they want to read the values they have written back. Due to the Read Your Writes requirement they can’t read from the followers (the followers don’t have those writes yet), so they read from the leader, and they correctly get the values they have written.

Broken Sequential Consistency With Asynchronous Follower With Read Your Writes Guarantee

Then each of them wants to read the value written by the other client. Now they are not restricted by the Read Your Writes guarantee, so they read from the closest followers and get the old versions of the variables: Client A gets y = 0 and Client B gets x = 0.

This sequence violates sequential consistency. To understand why exactly, let’s look at the illustrations below. The picture below depicts the sequence of operations (reads, writes) as seen by each client. Writes are depicted in orange and red, while reads are depicted in blue and green.

Sequence of operation from the point of view of each client

From the definition of sequential consistency it follows that it should be possible to build a total order of operations which is legal (legal here means that if someone wrote something, subsequent reads should return it). So let’s try to build a total order from those operations that would make sense.

Example of Total Order That Violates Sequential Consistency

Notice that the operations in that total order violate the sequential consistency requirements: after we read x = 1 and y = 1, subsequent reads return 0 for these variables, although they were not updated in between according to that total order.

Remember that in sequential consistency we have the flexibility of shifting operations of different processes/clients relative to one another. So let’s try to rearrange the operations (keeping the operations of each process in their program order) to prevent this violation. We could rearrange them as shown in the image below.

Another Example of Total Order That Violates Sequential Consistency

We managed to get rid of one violation, but there is still another. It is impossible to rearrange the operations in a way that keeps the original program order and at the same time is legal from the history point of view.

With synchronous followers, where out-of-order writes are prohibited and the configuration ensures immediate visibility of changes on the followers, we effectively get linearizability (assuming some other requirements are satisfied as well).

So, to be clear: if a system satisfies the requirements of a consistency model stronger than sequential consistency, then it also satisfies the sequential consistency requirements; but since such consistency models have their own names (like linearizability), we don’t really call such systems sequentially consistent.

The main reason why we get a stronger consistency model in this setup is that all writes go through a single leader, and thus we have a strictly defined total order. There is no room to wiggle and no way to end up in a situation where the recency of operations between different clients/processes is not reflected in their total order.

Leader Per Shard System

With the leader-per-shard approach, the data is partitioned by some criterion and each partition has its own leader. It is still called a single-leader system, because each partition has a single leader. See details in my article.

Because only one leader is responsible for writing a specific piece of data, there are no conflicts in such a system, but at the same time we lose a total order of operations which respects the recency of operations made on different partitions. In other words, for writes on different partitions we can’t say which happened before the other, but we can establish the order of writes within each partition. And we should still be able to imagine some total order of operations (possibly with operations of different partitions shifted relative to each other) which is legal.

As we have seen in the previous section, asynchronous followers will result in a consistency model weaker than sequential consistency even with a single leader (per partition). So the only way to get sequential consistency is to use synchronous followers. Moreover, those followers must be configured to prevent out-of-order writes and to ensure immediate visibility of changes on the followers.

The fact that such system will provide sequential consistency is confirmed in [11].

Note that sequential consistency is quite a strict consistency model, and though in principle it is achievable in such a system, you should thoroughly check the documentation and run tests to ensure that your specific database provides sequential consistency in such a mode. Even if you have configured immediate visibility after acknowledgement on synchronous followers and out-of-order writes are prohibited, there is still the question of failover and all the issues related to it. See my previous article for more details on this.

Example: Distributed Inventory System

An example where we might want to use this schema is a Distributed Inventory System with the following requirements:

  1. The system should keep track of all products in a warehouse.
  2. There is a huge number of products, so we simply could not place all of them into a single database, or such a database would be very expensive.
  3. All users of the warehouse are located in one region, so the whole system resides in a single data center.
  4. It is important to prevent overselling – when a bigger quantity is sold than is available in the warehouse.
  5. The warehouse should survive node failures.

For the purposes of our example it does not matter whether a relational or a document-oriented database is used; here, just for the sake of illustration, a document-oriented database is shown.

Let’s assume that we have a document which describes a product in the warehouse. It has a name, quantity and category. Since by the requirements there are too many products to be stored in one database, we have to shard them by some criterion and store different documents in different shards. There are many approaches to sharding, but for simplicity we will split products by category and say that all products with the same category are stored in a single shard. Below we see an example of a product document, and dedicated shards for the categories Electronics, Sports and Toys.

Distributed Inventory System: Data Model
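
To make the data model concrete, here is a small illustrative sketch in Python; the document shape and the category-to-shard mapping are hypothetical, not a prescribed design:

```python
# Illustrative product document and category-based shard routing.
product = {
    "name": "Wireless Mouse",
    "quantity": 120,
    "category": "Electronics",
}

SHARD_BY_CATEGORY = {
    "Electronics": "shard-1",
    "Sports": "shard-2",
    "Toys": "shard-3",
}

def shard_for(doc: dict) -> str:
    # All products of one category live on one shard, and every write for
    # that shard goes through that shard's single leader.
    return SHARD_BY_CATEGORY[doc["category"]]
```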

There could be multiple ways to store those shards’ data, but here we will choose a sequentially consistent system based on a single leader per shard, where each node hosts multiple shards, and one node can be the leader for some shards and a follower for others.

Distributed Inventory System: Storage Organization

The requirement that all users reside in one region tells us that the whole system can be placed within a single data center, and thus all followers can be synchronous, because there will be no significant network overhead in calling them.

In reality, we most likely would not do exactly that: we would probably also have some asynchronous followers, and we could place some followers in an independent data center to mitigate the risk of one data center going down. But here we focus only on how a sequentially consistent system could be used, so we will not go into all those details; otherwise we might need a huge document just to describe how such a system could be designed.

Because each data item resides on only one partition and all changes to it go through only one leader, we don’t have any conflicts. Due to the strictly synchronous followers, each read returns the result of the most recent write. But since a real-time order between writes in different shards can’t be established, we get a sequentially consistent system.

So all clients always see the accurate quantity of each product available in the warehouse, and we can find a total order of operations which satisfies the requirements of sequential consistency (a sketch of the corresponding check-and-decrement write follows below).

If any node fails, it will affect only writes for the partitions whose leader was located on that node, and after failover the write functionality will be restored.
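
To illustrate requirement 4 (preventing overselling): whatever the concrete storage, the stock check and the decrement must happen atomically on the shard’s leader. A minimal relational sketch with psycopg2; the products table and the connection are hypothetical:

```python
# Sketch: reserving stock with an atomic conditional decrement on the shard
# leader. Table layout (id, quantity) and connection are hypothetical.
import psycopg2

def reserve(conn, product_id: str, amount: int) -> bool:
    with conn.cursor() as cur:
        # The WHERE clause rejects the decrement if it would oversell; the
        # shard's single leader serializes all writes for this product.
        cur.execute(
            "UPDATE products SET quantity = quantity - %s "
            "WHERE id = %s AND quantity >= %s",
            (amount, product_id, amount),
        )
        conn.commit()
        return cur.rowcount == 1   # one row updated => reservation succeeded
```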

Linearizability in Leader-Follower Systems

It makes sense to speak about linearizability only when there are no transactions. For systems with transactions we speak about strict serializability (a.k.a. external consistency) instead.

Let’s recall the main points of linearizability:

  • Behaves as though there is only one data copy
  • Total order of all operations across all processes
  • Real-time ordering of operations
  • Acknowledged write is immediately visible to all future reads
  • Once a read sees the value, all future reads must return the same value (unless there was another write)

The major difference from sequential consistency is that operations of different processes/clients must follow the real-time order in the total order of operations.

See here for a detailed explanation of linearizability with examples.

According to Kleppmann [12], a leader-follower replication system with synchronous replication is potentially linearizable when reads are done from the leader or a synchronous follower. But in reality not every single-leader system is linearizable by default, either by design or because of concurrency bugs.

It seems that the linearizability of such a system is obvious: we have a single leader which guarantees a total order of all writes, and synchronous followers guarantee that all changes will be immediately visible.

Therefore let’s focus on examples of cases when single leader system will not provide linearizability.

As explained at the beginning of this article, under some circumstances even with synchronous followers it is possible to get not-immediately-visible writes or out-of-order writes when reads are done from synchronous followers. This obviously violates the linearizability requirements about the real-time total order of operations and the operation taking effect between the request and the response.

Even if our synchronous followers are configured properly, or we don’t read from followers at all and prefer to make all reads from the leader, there is still the question of failover, described in more detail in my previous article, as a result of which linearizability might get compromised.

Adding Transactions: From Linearizability to External Consistency

How does linearizability relate to transactions at all?

It looks like linearizability does not make much sense when applied to transactions, because the semantics of transactions is that, by definition, writes should become visible when they are committed.

So, in the presence of transactions it makes more sense to speak about transactions becoming visible in their commit order, instead of each individual write becoming visible in its write order.

The concept of external consistency is well described in the Google Spanner documentation [13]:

Yes. In fact, Spanner provides external consistency, which is a stronger property than linearizability, because linearizability does not say anything about the behavior of transactions. Linearizability is a property of concurrent objects that support atomic read and write operations. In a database, an “object” would typically be a single row or even a single cell. External consistency is a property of transaction-processing systems, where clients dynamically synthesize transactions that contain multiple read and write operations on arbitrary objects. Linearizability can be viewed as a special case of external consistency, where a transaction can only contain a single read or write operation on a single object.

Below is an illustration that shows three transactions, A, B and C, which happen in parallel. Transaction start and commit are shown as black dots.

Ordering Illustration for External Consistency/Strict Serializability

To satisfy external consistency, we need to order the transactions according to their commit times t1 (C) < t2 (A) < t3 (B); hence in the resulting order these transactions must appear as C -> A -> B.

It is even harder to achieve than linearizability because it also involves transactions.

Examples of Potential Violations of External Consistency

Repeatable Read or Snapshot Isolation

Repeatable Read prevents linearizability in its original sense, because by design it means that after a transaction starts, it does not see the most recent writes which happened after the transaction started.

In the illustration we see four transactions, 1-4, executing on some database at the Repeatable Read isolation level. Each transaction has a start and a commit time, depicted as black circles, and some operations going on between those two points in time.

Repeatable Read: Concurrent Transactions

For the sake of our example, let’s choose Transaction 3 and see which snapshot of the data it will see continuously during its whole execution.

Repeatable Read: Snapshot for Transaction 3

By the time Transaction 3 starts, only Transaction 1 has committed. That’s why Transaction 3 sees only the results of Transaction 1. Despite the fact that, for example, the write of Transaction 2 happened before Reads 1, 3 and 4, those reads will not see that write, even though it precedes them in the real-time order of operations. Also, Read 4 of Transaction 3 happened after all other writes in the picture, but it will not see the preceding writes of Transactions 2 and 4.

Thus it’s obvious that neither the linearizability nor the external consistency requirements are satisfied here. The external consistency violation happens because Transaction 2 committed earlier than Transaction 3, so from the external consistency perspective all writes of Transaction 2 should have been visible to Transaction 3.

Transactions ordering in postgres preventing linearizability/external consistency

In PostgreSQL, transactions are ordered by their txid, which is assigned at the moment of the first write statement in the transaction, not by commit time. This means PostgreSQL does not satisfy external consistency/strict serializability.

Here is a simple experiment I made to illustrate the point, using PostgreSQL 14.5.

In the timeline below it can be seen that although Transaction 1 started before Transaction 2 and also committed before Transaction 2, Transaction 1 has a bigger txid = 794 than the txid = 793 of Transaction 2.

  • T1 – Transaction 1:
    ivan=# begin;
    BEGIN
  • T2 – Transaction 2:
    ivan=# begin;
    BEGIN
  • T3 – Transaction 2 (txid is assigned here for Transaction 2):
    ivan=# insert into comments (comment) values('Transaction 2');
    INSERT 0 1
  • T4 – Transaction 1 (txid is assigned here for Transaction 1):
    ivan=# insert into comments (comment) values('Transaction 1');
    INSERT 0 1
  • T5 – Transaction 1:
    ivan=*# select txid_current();
     txid_current
    --------------
              794
    (1 row)
  • T6 – Transaction 2:
    ivan=*# select txid_current();
     txid_current
    --------------
              793
    (1 row)
  • T7 – Transaction 1:
    ivan=*# commit;
    COMMIT
  • T8 – Transaction 2:
    ivan=*# commit;
    COMMIT

So txid(Transaction 1) > txid(Transaction 2), i.e. in PostgreSQL’s opinion Transaction 2 happened before Transaction 1. This violates external consistency/strict serializability, because external consistency requires that the order of transactions be the real-time order of their commits.

Example use-case for external consistency: banking system

There is a classical example of a banking system which supports four operations:

  • deposits
  • withdrawals
  • transfer
  • balance check

For such a system we could use a leader-follower setup with transactions, configured to provide external consistency.

Use Case Example for System with External Consistency

Read queries that need to see the linearized state could then go to a synchronous follower. This does not prevent us from also using asynchronous followers in the same system for cases when linearizable reads are not required: for example, we could use asynchronous followers for reporting, analytics and similar purposes when we don’t really need the latest state of the data.

Conclusion

Using the leader-follower replication schema it is possible to achieve a wide range of consistency models, from eventual consistency to external consistency/strict serializability, but there are tons of intricacies and implementation specifics, so it is necessary to understand the fundamental concepts and at least some basics of database internals to reason about this topic.

References

[1] Data-Centric Consistency Models in Distributed Systems.  https://youtu.be/WXnAvA0g8Wc

[2] Client-Centric Consistency Models in Distributed Systems. https://youtu.be/9XSWe-sp9M8

[3] PostgreSQL Documentation. Synchronous Commit. https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-SYNCHRONOUS-COMMIT

[4] MariaDB. Parallel Replication. https://mariadb.com/kb/en/parallel-replication/

[5] Terry, D. B., et al. 1995. Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles (SOSP ’95).

[6] Vogels, Werner. 2008. Eventually Consistent. ACM Queue 6

[7] Bailis, Peter, et al. 2013. Highly Available Transactions: Virtues and Limitations. VLDB.

[8] Brzeziński, Jerzy; Sobaniec, Cezary; Wawrzyniak, Dariusz. 2004. From Session Causality to Causal Consistency. Workshop on Parallel, Distributed, and Network-Based Processing (PDP), 152–158.

[9] Burckhardt, Sebastian. 2014. Principles of Eventual Consistency. Foundations and Trends® in Programming Languages, vol. 1, no. 1-2, 1–150.

[10] Tyulenev, M.; Schwerin, A.; Kamsky, A.; Tan, R.; Cabral, A.; Mulrow, J. 2019. Implementation of Cluster-wide Logical Clock and Causal Consistency in MongoDB. In SIGMOD ’19, 636–650.

[11] van Steen, Maarten, and Tanenbaum, Andrew S. Distributed Systems, 4th edition, 2023.

[12] Kleppmann, Martin. Designing Data-Intensive Applications. O’Reilly Media, 2017.

[13] Google Cloud. TrueTime and External Consistency. https://cloud.google.com/spanner/docs/true-time-external-consistency#linearizability
