Database Replication Streams: From Statement‑Based to WAL

Introduction

In the previous article I wrote about the basics of leader-follower replication, its variations, and common problems. This article focuses on how exactly the changes made on the leader are replicated to the followers.

There are several ways of sending changes (represented by the boxes with question marks in the image above), and the most commonly used are the following:

  • Statement-based Replication
  • Row-based (logical) replication
  • Mixed Replication
  • WAL-based replication
  • Trigger-based replication

Sometimes the replication method is dictated by the database you use, and sometimes databases let you choose how replication should happen.

Statement-based Replication

Changes in databases usually happen as a result of executing statements that modify the data. The obvious option is for the leader to simply record every executed statement and send this log of statements to the followers.

For example, assume that in a system which allows users to create meetings, Bob has created a meeting called “Review”, scheduled it for Feb 1, 2025, and invited Alice and Eve as participants.

To do that in our imaginary system, the client had to issue three INSERT statements and commit them.

As illustrated above, the leader records all of those statements after they were committed and sends them to the follower in exactly the same order.

The follower then applies all of those changes and, as a result, ends up in the same state as the leader.
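The mechanism can be sketched in a few lines. Below is a toy model of statement-based replication (the schema and helper names are invented for illustration, not taken from any real system): the leader records each committed SQL statement in a log, and the follower replays the same log in the same order.

```python
import sqlite3

# Hypothetical schema for the meeting example.
SCHEMA = "CREATE TABLE meetings (id INTEGER PRIMARY KEY, topic TEXT, day TEXT)"

leader = sqlite3.connect(":memory:")
follower = sqlite3.connect(":memory:")
for db in (leader, follower):
    db.execute(SCHEMA)

statement_log = []  # the replication stream: raw SQL text


def execute_on_leader(sql):
    leader.execute(sql)
    leader.commit()
    statement_log.append(sql)  # recorded only after a successful commit


execute_on_leader("INSERT INTO meetings VALUES (1, 'Review', '2025-02-01')")
execute_on_leader("UPDATE meetings SET day = '2025-02-03' WHERE id = 1")

# Replication: replay the statement log on the follower, in order.
for sql in statement_log:
    follower.execute(sql)
follower.commit()

print(follower.execute("SELECT topic, day FROM meetings").fetchall())
# [('Review', '2025-02-03')]
```

The log here contains the statements themselves, not their effects, which is exactly why non-deterministic functions break this scheme, as discussed below.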

The benefits of statement-based replication are the following:

  • it is comparatively high-level and therefore does not depend much on the underlying storage engine, so it can easily be used between different versions of database instances.
  • the replication log is very compact, especially when a single UPDATE or DELETE statement changes multiple rows
  • as a side effect of replication, we get a log of all commands executed on the database, which can be used for auditing.

However, there are also some significant drawbacks:

  • this technique can’t be used with non-deterministic statements. For example, now(), random(), and a handful of other functions can return different values depending on when and where they are executed.
  • if statements rely on auto-incremented counters or on other existing data, they must be executed in exactly the same order on all followers, which might not always be possible.
  • statements that cause side effects (triggers, for example) may result in different states on the leader and the followers unless all side effects are deterministic.

Those drawbacks are quite serious, so purely statement-based replication is not widely used in real systems. This approach was the default in MySQL until version 5.1, but was then replaced by mixed replication, described later in this article.

See https://dev.mysql.com/doc/refman/8.4/en/replication-sbr-rbr.html for MySQL documentation on this topic.

Logical (Row-Based) Replication

Logical replication (a.k.a. row-based replication) solves the major issue of statement-based replication, non-determinism, by sending followers the exact values of the changed rows instead of the commands which led to those values. That way, followers receive the changes themselves rather than instructions for how to make those changes.

So, for example, a now() call in a command is sent as the exact time that was evaluated at the moment of command execution on the leader. As a result, the follower applies all changes in a deterministic way.

For example, assume that in our meeting booking system Bob has created two meetings, “review” and “planning”, for Feb 1, 2025, but then changed his mind and moved them to Feb 3.

The commands above would generate four row changes: two for the inserts and two for the update. Notice that the single UPDATE statement, which touches two rows, is translated into two separate row changes, because two rows were modified as a result of it.
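This fan-out from one statement to one record per affected row can be sketched as follows (the record format and helper names are invented for illustration):

```python
meetings = {}  # leader's table, keyed by primary key
row_log = []   # replication stream: (op, primary key, values) records


def apply_insert(pk, row):
    meetings[pk] = dict(row)
    row_log.append(("insert", pk, dict(row)))


def apply_update(predicate, changes):
    # A single logical UPDATE fans out into one record per matched row.
    for pk, row in meetings.items():
        if predicate(row):
            row.update(changes)
            row_log.append(("update", pk, dict(changes)))


apply_insert(1, {"topic": "review", "day": "2025-02-01"})
apply_insert(2, {"topic": "planning", "day": "2025-02-01"})
apply_update(lambda r: r["day"] == "2025-02-01", {"day": "2025-02-03"})

print(len(row_log))  # 4: two inserts plus two update records

# The follower rebuilds identical state by applying the records in order.
replica = {}
for op, pk, values in row_log:
    if op == "insert":
        replica[pk] = dict(values)
    elif op == "update":
        replica[pk].update(values)

assert replica == meetings
```

Note that the update records carry the evaluated values, so any non-determinism was already resolved on the leader before the records were emitted.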

The benefits of that method are:

  • Non-determinism is not a problem
  • It does not depend much on the internal database engine, so it is easy to upgrade the leader and the followers one after another
  • The format is quite high-level and can easily be understood by an external application. This can be used to send database content to an external system (for example, to a data warehouse for analysis). This technique is called Change Data Capture.

The trade-offs are:

  • For bulk operations (where one command changes tons of rows) the log size is huge compared to statement-based replication. If in our example above we updated a million meetings, the leader would have to send the follower a million changed records, while with statement-based replication it would have sent just one UPDATE statement.
  • Since row-based replication focuses on data changes, it can’t really be used for DDL operations, which change the data structure rather than the data itself.

Despite the trade-offs, most real systems use some kind of logical replication.

Examples are:

  • MySQL starting from 5.1
  • MariaDB
  • Oracle GoldenGate
  • Microsoft SQL Server
  • MongoDB
  • Elasticsearch
  • IBM DB2 Q replication
  • Apache Kafka
  • RabbitMQ

Example: MongoDB

Let’s have a look at MongoDB as an example of a system which uses row-based replication. MongoDB has separate concepts of journaling (write-ahead logging) and the oplog (a logical format designed for replication). There is also a third concept called Change Streams, which is basically a simplified and standardized oplog for application use.

The oplog is stored in a capped collection (capped means that Mongo automatically removes old documents) named oplog.rs. It looks like a regular collection from the user’s point of view and contains logical changes:

  • document inserted
  • document updated
  • document removed

Let’s do an experiment to see exactly what is written to the oplog when we execute some operations.

Here I will create and use a collection storing meeting objects, where each meeting has an organiser, a topic, and an array of participants.

{
    meeting_topic: "Discuss the proposed change",
    organiser: "John Smith",
    participants: [
              "Alice",
              "Bob"
          ]
}

Below, the commands executed in the Mongo shell are shown together with the oplog records those commands generated. Notice that the visually compact shell commands generate much more verbose oplog records.

The first insert command, which also creates the collection in Mongo:

rs0 [direct: primary] test> db.meetings.insertOne({
...     "meeting_topic": "Discuss the proposed change",
...     "organiser": "John Smith",
...     "participants": ["Alice","Bob"]
... });
{
  acknowledged: true,
  insertedId: ObjectId('66fda154fa4f71aba0542683')
}

Collection and index creation

{
    op: 'c',
    ns: 'test.$cmd',
    ui: UUID('42dac1b0-5dd9-4352-a410-f8a5861045d1'),
    o: {
      create: 'meetings',
      idIndex: { v: 2, key: { _id: 1 }, name: '_id_' }
    },
    ts: Timestamp({ t: 1727897940, i: 1 }),
    t: Long('1'),
    v: Long('2'),
    wall: ISODate('2024-10-02T19:39:00.484Z')
  }

Document insertion

{
    lsid: {
      id: UUID('14e22b18-9368-464b-9a2a-a45afabfd766'),
      uid: Binary.createFromBase64('47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=', 0)
    },
    txnNumber: Long('1'),
    op: 'i',
    ns: 'test.meetings',
    ui: UUID('42dac1b0-5dd9-4352-a410-f8a5861045d1'),
    o: {
      _id: ObjectId('66fda154fa4f71aba0542683'),
      meeting_topic: 'Discuss the proposed change',
      organiser: 'John Smith',
      participants: [ 'Alice', 'Bob' ]
    },
    o2: { _id: ObjectId('66fda154fa4f71aba0542683') },
    stmtId: 0,
    ts: Timestamp({ t: 1727897940, i: 2 }),
    t: Long('1'),
    v: Long('2'),
    wall: ISODate('2024-10-02T19:39:00.485Z'),
    prevOpTime: { ts: Timestamp({ t: 0, i: 0 }), t: Long('-1') }
  }

Inserting a second meeting, “Pizza party”:

rs0 [direct: primary] test> db.meetings.insertOne({
...     "meeting_topic": "Pizza party",
...     "organiser": "Alice",
...     "participants": ["Bob","Eve"]
... });
{
  acknowledged: true,
  insertedId: ObjectId('66fda348fa4f71aba0542684')
}
{
    lsid: {
      id: UUID('14e22b18-9368-464b-9a2a-a45afabfd766'),
      uid: Binary.createFromBase64('47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=', 0)
    },
    txnNumber: Long('2'),
    op: 'i',
    ns: 'test.meetings',
    ui: UUID('42dac1b0-5dd9-4352-a410-f8a5861045d1'),
    o: {
      _id: ObjectId('66fda348fa4f71aba0542684'),
      meeting_topic: 'Pizza party',
      organiser: 'Alice',
      participants: [ 'Bob', 'Eve' ]
    },
    o2: { _id: ObjectId('66fda348fa4f71aba0542684') },
    stmtId: 0,
    ts: Timestamp({ t: 1727898440, i: 1 }),
    t: Long('1'),
    v: Long('2'),
    wall: ISODate('2024-10-02T19:47:20.291Z'),
    prevOpTime: { ts: Timestamp({ t: 0, i: 0 }), t: Long('-1') }
  }

rs0 [direct: primary] test> db.meetings.updateMany({}, { $push: { participants: "Jim" } });
{
  acknowledged: true,
  insertedId: null,
  matchedCount: 2,
  modifiedCount: 2,
  upsertedCount: 0
}
rs0 [direct: primary] test> 

Pay attention to the fact that a single command which updated two documents generated two oplog records.

First document update

{
    op: 'u',
    ns: 'test.meetings',
    ui: UUID('42dac1b0-5dd9-4352-a410-f8a5861045d1'),
    o: { '$v': 2, diff: { sparticipants: { a: true, u2: 'Jim' } } },
    o2: { _id: ObjectId('66fda154fa4f71aba0542683') },
    ts: Timestamp({ t: 1727898682, i: 1 }),
    t: Long('1'),
    v: Long('2'),
    wall: ISODate('2024-10-02T19:51:22.865Z')
  }

Second document update

{
    op: 'u',
    ns: 'test.meetings',
    ui: UUID('42dac1b0-5dd9-4352-a410-f8a5861045d1'),
    o: { '$v': 2, diff: { sparticipants: { a: true, u2: 'Jim' } } },
    o2: { _id: ObjectId('66fda348fa4f71aba0542684') },
    ts: Timestamp({ t: 1727898682, i: 2 }),
    t: Long('1'),
    v: Long('2'),
    wall: ISODate('2024-10-02T19:51:22.883Z')
  }

rs0 [direct: primary] test> db.meetings.updateOne({_id: ObjectId('66fda154fa4f71aba0542683')}, { $set: { randomNumber: Math.random() } });
{
  acknowledged: true,
  insertedId: null,
  matchedCount: 1,
  modifiedCount: 1,
  upsertedCount: 0
}
rs0 [direct: primary] test> 

Note how the non-deterministic Math.random() call is recorded in the oplog as the concrete value it evaluated to:
{
    lsid: {
      id: UUID('14e22b18-9368-464b-9a2a-a45afabfd766'),
      uid: Binary.createFromBase64('47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=', 0)
    },
    txnNumber: Long('4'),
    op: 'u',
    ns: 'test.meetings',
    ui: UUID('42dac1b0-5dd9-4352-a410-f8a5861045d1'),
    o: { '$v': 2, diff: { i: { randomNumber: 0.4219086450471152 } } },
    o2: { _id: ObjectId('66fda154fa4f71aba0542683') },
    stmtId: 0,
    ts: Timestamp({ t: 1727899160, i: 1 }),
    t: Long('1'),
    v: Long('2'),
    wall: ISODate('2024-10-02T19:59:20.039Z'),
    prevOpTime: { ts: Timestamp({ t: 0, i: 0 }), t: Long('-1') }
  }

rs0 [direct: primary] test> db.meetings.deleteOne({_id: ObjectId('66fda348fa4f71aba0542684')});
{ acknowledged: true, deletedCount: 1 }
rs0 [direct: primary] test> 

The delete generates a single oplog record identifying the removed document by its _id:
{
    lsid: {
      id: UUID('14e22b18-9368-464b-9a2a-a45afabfd766'),
      uid: Binary.createFromBase64('47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=', 0)
    },
    txnNumber: Long('3'),
    op: 'd',
    ns: 'test.meetings',
    ui: UUID('42dac1b0-5dd9-4352-a410-f8a5861045d1'),
    o: { _id: ObjectId('66fda348fa4f71aba0542684') },
    stmtId: 0,
    ts: Timestamp({ t: 1727898829, i: 1 }),
    t: Long('1'),
    v: Long('2'),
    wall: ISODate('2024-10-02T19:53:49.052Z'),
    prevOpTime: { ts: Timestamp({ t: 0, i: 0 }), t: Long('-1') }
  }

Mixed Replication in MySQL

MySQL offers a replication technique called “Mixed Replication,” which combines statement-based and row-based replication methods.

In this mode, MySQL decides whether to use statement-based or row-based replication on a per-statement basis. By default, it uses statement-based replication, but for specific statements that may produce non-deterministic results, it switches to row-based replication for that statement. This ensures better reliability for such scenarios.
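In MySQL the mode is selected with the binlog_format system variable. A minimal configuration sketch (the option name is from the MySQL documentation; the surrounding file layout is illustrative):

```ini
# my.cnf: pick the binary log format used for replication.
# Valid values are STATEMENT, ROW and MIXED.
[mysqld]
binlog_format = MIXED
```

The same variable can also be changed at runtime, e.g. SET GLOBAL binlog_format = 'MIXED'; subject to the restrictions described in the MySQL documentation.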

While mixed replication leverages the strengths of both methods, it can still be unreliable and prone to bugs, similar to statement-based replication. For example, in a Percona blog post, a bug in statement-based and mixed replication is highlighted: the CURRENT_TIMESTAMP function is not treated as non-deterministic. This oversight can lead to inconsistencies between the leader and follower databases.

This example illustrates one of the potential pitfalls of mixed replication. In general, this article recommends row-based replication as the preferred and more reliable choice.

WAL Replication

In order to achieve durability, databases usually write a log of all changes to disk (known as the Write-Ahead Log, or WAL) before the actual data change is made. With this approach, if the database server crashes, the data can be restored by replaying all not-yet-applied records from the WAL.

This log is usually very low-level and efficient. So a natural idea comes to mind: use this log for replication, so that followers can build exactly the same state as the leader.

The image below conveys the concept of this replication. As you can see, the units of change are called XLog records; they are quite low-level and include information about where exactly in the storage layout the changed data resides.

Examples of databases which use this technique are:

  • Postgres
  • Oracle
  • IBM DB2 in HADR mode

The benefits of WAL replication are:

  • It is fast and efficient since the log is binary
  • Non-determinism is not a problem
  • It works well for both DML and DDL commands

The major drawback comes directly from that benefit. An efficient, low-level binary log means that the leader and follower software must be of the same version for replication to work. This causes significant problems when upgrading servers between major versions: either downtime is required, or complex solutions involving logical replication.

Since this is a complex topic, in the following sections I will provide some background on write-ahead logging and its application to replication. The examples will use Postgres to be specific.

Storage Layout and Write-Ahead Log in Postgres

Concept of Write-Ahead Log

Before we talk about write-ahead logging, it is necessary to first discuss how data is stored in databases, because the necessity of WAL comes specifically from the underlying storage layout and the mechanics of how it works.

Further I will talk specifically about Postgres, but most of the concepts are quite generic and applicable to many modern databases.

Each table is physically stored on disk in one or more data files. Each data file is logically split into blocks of a fixed size, called pages. The page size in Postgres is 8 KB by default.

Each page contains multiple data tuples. When a page is full, a new page is created in the data file. When a data file is full, a new data file is created.

For historical reasons, rows or records are called tuples internally in databases.

It is important to mention that the page is the minimal unit in Postgres which can be read from or written to disk. At first glance this might seem inefficient. However, there are several reasons to have a fixed page structure, and one of the most significant is that the underlying storage usually supports only block-granular reads and writes, so databases had to adapt to that limitation.

Here we encounter the first problem: when reading or writing a single tuple, we need to read or write the whole page. Moreover, from the disk’s perspective those reads and writes are random access, as the page we want could be located in any file.

It is well known that random access to disk is very slow, so if we had to access the actual data on every request, the system would be terribly slow.

Therefore it is natural to introduce a cache of pages in memory, as shown in the picture below.

This memory area, where the cached versions of pages live, is called the shared buffer pool in Postgres. Now when a client reads or writes data, only the pages in the cache are updated, and this happens very fast. Pages modified only in the cache are called dirty pages. A checkpoint process then regularly writes dirty pages to disk.

At this point we don’t have performance problems, but as a trade-off we have a durability problem. Indeed, if a client wrote data that was updated only in the cache, then a server crash loses all recent changes, and such a system violates the durability requirement.

Here we come to the reason why write-ahead logs are needed. The actual storage layout in Postgres looks like the picture below.

Apart from data files and pages, the server also writes a log of all changes, called the Write-Ahead Log (WAL). Internally in Postgres it is called XLog, and each record is called an XLog record. Together with writing changes to pages in the shared buffer pool, a record of each change is added to the so-called WAL buffer, a memory area which stores XLog records until they are flushed to disk.

At the moment of transaction commit, the WAL buffer is flushed to disk into files referred to as WAL segment files. Now if the server crashes, we can be sure the actual data can be restored by replaying the XLog records.

One can notice that a commit still requires an actual write to disk; it is just that instead of pages we write XLog records. A reasonable question follows: if we have to write to disk on each commit anyway, wouldn’t that re-introduce the performance problem?

The answer is no, because of several reasons:

  • XLog is written sequentially, and sequential disk access is much faster than random access.
  • An XLog record contains only a small description of the change, i.e. just a diff against the previous state, which is much smaller than the actual page.
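The core idea can be demonstrated with a toy write-ahead log (the record format here is invented for illustration and has nothing to do with Postgres’s binary format): every change is appended and fsynced to a sequential log file before the in-memory data is touched, and after a crash, replaying the log rebuilds the state.

```python
import json
import os
import tempfile

wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")

table = {}  # in-memory cache of the data, lost on crash


def write(key, value):
    record = json.dumps({"op": "set", "key": key, "value": value})
    with open(wal_path, "a") as wal:  # 1. append the change to the WAL...
        wal.write(record + "\n")
        wal.flush()
        os.fsync(wal.fileno())        # ...and force it to stable storage
    table[key] = value                # 2. only then update the cached data


write("meeting:1", "Review, Feb 3")
write("meeting:2", "Planning, Feb 3")

# Simulated crash: the in-memory state is gone, the WAL file survives.
table = {}

with open(wal_path) as wal:           # recovery: replay the log in order
    for line in wal:
        record = json.loads(line)
        table[record["key"]] = record["value"]

print(table)  # the pre-crash state is fully restored
```

Note how each commit costs one small sequential append rather than a random write of a whole page, which is exactly the trade-off described above.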

Internal Layout of Pages and XLog Records

Now, when we understand the basics and the reasons behind WAL, let’s look a bit deeper into the page structure and WAL record structure which are depicted on the image below:

A page consists of a header portion and a data portion. Each XLog record has a unique identifier, the LSN (Log Sequence Number), and each page contains pd_lsn, the LSN of the XLog record which last updated that page. Page space is filled with tuples from the end: the first tuple is added at the end of the page, the next tuple is placed just before it, and so on. Each tuple has its own length depending on the actual data stored in the row. To navigate this structure of variable-length tuples, the beginning of each page holds an array of pointers to the tuples. The free space between the pointers and the tuples is called the hole.
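A toy model makes this layout concrete (this is a simplified sketch, not Postgres’s actual on-disk format; the class and field names are invented): tuples fill the page from the back, while the pointer array grows from the front, and the gap between them is the hole.

```python
PAGE_SIZE = 8192   # Postgres default page size: 8 KB
HEADER_SIZE = 24   # simplified stand-in for the page header


class Page:
    def __init__(self):
        self.data = bytearray(PAGE_SIZE)
        self.pointers = []        # line pointers, grow from the front
        self.lower = HEADER_SIZE  # end of the pointer array
        self.upper = PAGE_SIZE    # start of the tuple area

    def add_tuple(self, payload: bytes) -> bool:
        # Each pointer costs 4 bytes; the tuple itself goes at the back.
        if self.upper - len(payload) < self.lower + 4:
            return False          # no room left in the hole
        self.upper -= len(payload)
        self.data[self.upper:self.upper + len(payload)] = payload
        self.lower += 4
        self.pointers.append((self.upper, len(payload)))
        return True

    def get_tuple(self, idx: int) -> bytes:
        offset, length = self.pointers[idx]
        return bytes(self.data[offset:offset + length])


page = Page()
page.add_tuple(b"Review|2025-02-03")
page.add_tuple(b"Planning|2025-02-03")

print(page.get_tuple(0))  # b'Review|2025-02-03'
print(page.pointers)      # the second tuple sits before the first one
```

The fixed-position pointer array is what lets variable-length tuples be addressed by a stable index even as the tuple area shifts.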

Below are the attributes of XLogRecord (the header part of an XLog unit). I have included them not for detailed discussion, but just to show what it looks like.

// https://github.com/postgres/postgres/blob/master/src/include/access/xlogrecord.h#L41
typedef struct XLogRecord
{
	uint32		xl_tot_len;		/* total len of entire record */
	TransactionId xl_xid;		/* xact id */
	XLogRecPtr	xl_prev;		/* ptr to previous record in log */
	uint8		xl_info;		/* flag bits, see below */
	RmgrId		xl_rmid;		/* resource manager for this record */
	/* 2 bytes of padding here, initialize to zero */
	pg_crc32c	xl_crc;			/* CRC for this record */

	/* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */

} XLogRecord;

As you can see, the data within XLogRecord is quite low-level. For example, it has xl_rmid, a resource manager identifier. The idea is that different data changes produce XLog records of different types, and there are a number of resource managers responsible for processing the different record types.

From this it seems obvious that applying write-ahead logs to restore data must be performed by the same version of Postgres which generated them in the first place. That is what causes difficulties for upgrades.

WAL-Based Replication in Postgres

With an understanding of the principles of WAL, its application to replication should be easy to follow. In Postgres terminology, primary/standby means the same as leader/follower.

When the standby is started, it launches a walreceiver process, which sends a connection request to the primary. The postgres process on the primary starts a walsender process and hands the incoming TCP connection off to it.

After startup, the walreceiver process sends its most recent LSN (Log Sequence Number) to the walsender on the primary.

In our example, the primary has LSNs up to 100 while the standby has only up to LSN 80, so it sends 80 to the walsender, and the walsender sends the XLog records between 80 and 100. Once this catch-up process is complete, streaming replication starts, and new WAL records are streamed to the standby as they are produced.
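The catch-up handshake can be sketched as follows (the data structures and function names are invented for illustration; real walsender/walreceiver processes speak Postgres’s streaming replication protocol over TCP):

```python
# The primary's WAL: records identified by LSNs 1..100.
primary_wal = {lsn: f"xlog-record-{lsn}" for lsn in range(1, 101)}


class Standby:
    def __init__(self, records):
        self.records = dict(records)

    @property
    def last_lsn(self):
        return max(self.records)


# The standby has replayed everything up to LSN 80.
standby = Standby({lsn: primary_wal[lsn] for lsn in range(1, 81)})


def walsender(start_after_lsn):
    """Yield all WAL records the standby has not seen yet, in LSN order."""
    for lsn in sorted(primary_wal):
        if lsn > start_after_lsn:
            yield lsn, primary_wal[lsn]


# walreceiver side: report the last LSN, receive and apply the missing tail.
for lsn, record in walsender(standby.last_lsn):
    standby.records[lsn] = record

print(standby.last_lsn)  # 100: the standby has caught up
```

After the tail is applied, the same channel simply stays open and new records are pushed as they are generated, which is the streaming phase.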

Trigger-Based Replication

The replication methods described so far (statement-based, logical, WAL) are natively supported by the underlying databases, and they are mostly targeted at replicating the full database content to the followers.

Trigger-based replication relies on triggers, which are part of the database, but by its nature it is built on top of the database by some third-party process or application code rather than being an integral part of the database.

Most relational databases allow configuring triggers: pieces of code which are executed on various actions that modify the database. For example, a trigger can be configured so that whenever rows of a specific table are inserted, deleted, or updated, it fires with the exact details of that change.

Then, at the application level, we can implement any logic we wish to process the changes. We could, for example, process changes to only one specific table, or even only changes related to specific values of some columns. We can send changes to any location using any protocol, apply them to different databases, and execute additional logic on them.

The typical, simplified usage of trigger-based replication is depicted below. The client again inserts and updates some meetings, and the database has a trigger which fires on every change in the meetings table.

The trigger only writes the changes into a dedicated change-log table, called changes in our example. It logs the table name, the operation (insert, update, delete), which exact row was changed, and what was changed in that row.

Then there is usually some third-party replication process which listens for new records in the changes table and makes the corresponding changes in the follower database.
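The whole pipeline can be sketched with SQLite triggers (the schema, trigger names, and changes-table layout are all invented for this illustration): triggers on the leader append to a changes table, and a separate loop plays the role of the replication process.

```python
import sqlite3

leader = sqlite3.connect(":memory:")
follower = sqlite3.connect(":memory:")

leader.executescript("""
    CREATE TABLE meetings (id INTEGER PRIMARY KEY, topic TEXT, day TEXT);
    CREATE TABLE changes (seq INTEGER PRIMARY KEY AUTOINCREMENT,
                          tbl TEXT, op TEXT, row_id INTEGER,
                          new_topic TEXT, new_day TEXT);

    -- Triggers capture each change into the change-log table.
    CREATE TRIGGER meetings_ins AFTER INSERT ON meetings BEGIN
        INSERT INTO changes (tbl, op, row_id, new_topic, new_day)
        VALUES ('meetings', 'insert', NEW.id, NEW.topic, NEW.day);
    END;
    CREATE TRIGGER meetings_upd AFTER UPDATE ON meetings BEGIN
        INSERT INTO changes (tbl, op, row_id, new_topic, new_day)
        VALUES ('meetings', 'update', NEW.id, NEW.topic, NEW.day);
    END;
""")
follower.execute(
    "CREATE TABLE meetings (id INTEGER PRIMARY KEY, topic TEXT, day TEXT)")

leader.execute("INSERT INTO meetings VALUES (1, 'Review', '2025-02-01')")
leader.execute("UPDATE meetings SET day = '2025-02-03' WHERE id = 1")
leader.commit()

# The "replication process": read the change log in order and apply it.
for tbl, op, row_id, topic, day in leader.execute(
        "SELECT tbl, op, row_id, new_topic, new_day FROM changes ORDER BY seq"):
    if op == "insert":
        follower.execute("INSERT INTO meetings VALUES (?, ?, ?)",
                         (row_id, topic, day))
    elif op == "update":
        follower.execute("UPDATE meetings SET topic = ?, day = ? WHERE id = ?",
                         (topic, day, row_id))
follower.commit()

print(follower.execute("SELECT * FROM meetings").fetchall())
# [(1, 'Review', '2025-02-03')]
```

In a real setup the replication process would run continuously, track its position in the changes table, and clean up records it has already applied; frameworks like those listed below handle exactly these concerns.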

Application developers could build all of these triggers and replication processes from scratch if needed, but there are also frameworks which already provide the basic functionality and can simply be reused.

The best-known frameworks are:

  • Slony-I
  • Londiste
  • Bucardo
  • SymmetricDS

The benefits of trigger-based replication are:

  • Database-agnostic – we could replicate changes from one database to another.
  • Flexibility in what and how we replicate: we don’t have to replicate the whole database content; we can replicate only some tables, replicate different tables to different replicas, replicate only specific rows, and build any logic into the process.

And as usual, no method is without drawbacks:

  • Overhead and performance impact
  • Operational complexity to maintain and debug
  • Risk of drift (triggers can be buggy, they can get disabled accidentally, etc.)

Summary

There is no silver-bullet method which is best in all cases, but in many cases row-based replication or WAL replication can be used.

Here is a summary of the pros and cons of the described replication methods.

Statement-based Replication

  • Replicated changes: statements (INSERT, UPDATE, DELETE)
  • Benefits: compactness of the log; audit log as a side effect; no strict requirements about matching versions of leader and follower; can be used for both DML and DDL
  • Issues: problems with non-deterministic statements and with statements causing side effects
  • Used in: MySQL before 5.1

Logical (Row-based) Replication

  • Replicated changes: changed rows
  • Benefits: non-determinism is not a problem; no strict requirements about matching versions of leader and follower; can be used as CDC to send events to external systems
  • Issues: very verbose in case of bulk changes; focuses on row changes and cannot be used for DDL directly
  • Used in: MySQL starting from 5.1, MariaDB, Oracle GoldenGate, Microsoft SQL Server, MongoDB, Elasticsearch, IBM DB2 Q replication, Apache Kafka, RabbitMQ

WAL-based Replication

  • Replicated changes: low-level write-ahead log
  • Benefits: non-determinism is not a problem; fast and efficient since the log is binary; can be used for both DML and DDL
  • Issues: low-level binary log requires leader and follower to run the same version, which significantly complicates upgrades
  • Used in: Postgres, Oracle, IBM DB2 in HADR mode

Trigger-based Replication

  • Replicated changes: changed rows
  • Benefits: database-agnostic; very flexible
  • Issues: overhead and performance impact; operational complexity to maintain and debug; risk of drift

Apart from the described and widely used replication methods, other methods exist, though they are needed only in some very specific cases. For example, CDC (change data capture) is sometimes considered a separate replication method, despite being row-based replication in nature; the nuance is that the changes are published more like events, so that any consumer process can apply its own logic to them. Other alternatives include snapshot replication, where a snapshot is replicated periodically, and physical-level replication, where the content of the underlying storage is replicated. So it is useful to know that the replication methods are not restricted to this list, and depending on your specific situation and requirements you might benefit from one method or another.

References

[1] Martin Kleppmann. Designing Data-Intensive Applications. 2017

[2] https://dev.mysql.com/doc/refman/8.4/en/replication-sbr-rbr.html

[3] https://dev.mysql.com/doc/refman/8.4/en/binary-log-mixed.html

[4] https://www.percona.com/blog/is-mysql-statement-based-mixed-replication-really-safe

[5] https://bugs.mysql.com/bug.php?id=107293

[6] https://www.interdb.jp/pg/pgsql09.html

[7] https://www.postgresql.org/docs/current/storage-page-layout.html

[8] https://www.slony.info

[9] https://github.com/pgq/londiste

[10] https://bucardo.org/Bucardo/Overview

[11] https://symmetricds.org/docs/overview
