Simple Replication

Narrative
The goal is to develop a very simple, lightweight replication implementation for use in a cloud. So that a solution can be deployed quickly, the implementation will be statement-based and will use protobuf to specify binary log events.

Environmental assumptions

 * Servers are unreliable
 * Network is unreliable
 * Network has high bandwidth
 * Network has high latency

Consequences:
 * Slave fail-over to new masters is frequent
 * Transactions are frequently aborted

Features

 * Statement-based replication
 * Supporting fail-over to other masters

Decisions to be made
These decisions have not yet been finalized and need to be considered before the specification is complete.


 * How large shall the timestamp be? Right now, it is specified to be 64 bits, giving timestamps with nanosecond precision. It is questionable whether this is needed, and since every message header carries a timestamp, reducing its size even by a small fraction will save a lot of data.


 * Shall the timestamp be of a fixed type (meaning it cannot easily be changed) or using varint (meaning we can change the length later)?
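To put numbers on the two timestamp questions above, the following sketch compares the wire size of a fixed64 field with a protobuf varint at several precisions. It is a rough illustration only: the sample instant, the function names, and the set of precisions are assumptions made for this example, and the one-byte field tag is left out of both sizes.

```python
def varint_size(value: int) -> int:
    """Number of bytes protobuf needs to encode a non-negative integer as a varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    size = 1
    while value >= 0x80:  # each varint byte carries 7 payload bits
        value >>= 7
        size += 1
    return size


FIXED64_SIZE = 8  # a fixed64 field always occupies 8 bytes on the wire

# Hypothetical sample instant: seconds since the Unix epoch, August 2008.
SAMPLE_SECONDS = 1_219_229_280


def timestamp_sizes(seconds: int = SAMPLE_SECONDS) -> dict:
    """Varint size in bytes of the same instant at four precisions."""
    scales = {"s": 1, "ms": 10**3, "us": 10**6, "ns": 10**9}
    return {unit: varint_size(seconds * factor) for unit, factor in scales.items()}


if __name__ == "__main__":
    for unit, size in timestamp_sizes().items():
        print(f"{unit}: varint {size} bytes, fixed64 {FIXED64_SIZE} bytes")
```

Under these assumptions a nanosecond timestamp is larger as a varint (9 bytes) than as a fixed64 (8 bytes), so a varint only saves space if the precision is reduced as well.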

Unsorted issues
These are issues that need to be incorporated into the specification in some manner, either by explaining how they are supported or by clarifying what is required.


 * Do we have the necessary features to support multi-source replication? If all the features are present, it has to be clarified how it can be incorporated in the future.
 * Specify clearly how master fail-over works. This is necessary to ensure that the handshake protocol actually supports that scenario.

Binary log format
Notes:


 * We want to keep the name of the binary log out of the actual log and make the format of the binary log files independent of how they are organized.

Open issues

 * Find a better name for processing messages. Mkindahl 10:48, 20 August 2008 (UTC)

Structure
The binary log is a sequence of messages, where messages come in three flavors: replication messages, control messages, and frame messages. When considering changes to the database, only the replication messages are relevant. The control messages are used to control the behavior of replication, e.g., a change in version or format. The frame messages are used to structure the binary log into, for example, files, and play no role in the actual replication.

Message specification
package BinaryLog;

message Header {
  required fixed64 timestamp = 1;
  required uint32 server_id = 2;
  required uint32 trans_id = 3;
}

// Start of a binary log.
message Start {
  required Header header = 1;
  required uint32 server_version = 2;
  required string server_signature = 3;
}

// Chain a binary log to the next binary log
message Chain {
  required Header header = 1;
  required uint32 next = 2;           // Sequence number of next file
}

// An unparsed statement that was executed on the master
message Query {
  // Lists of tables
  message Tables {
    required string database = 1;
    repeated string tables = 2;
  }

  // Assignments to variables that are part of the query
  message Variable {
    required string name = 1;
    required string value = 2;
    // Character set?
  }

  required Header header          = 1;
  required uint32 session_id      = 2;
  required string query           = 3;
  repeated Variable variable      = 4;
  repeated Tables tables_written  = 5;
}

// A commit message
message Commit {
  required Header header = 1;
}

// A rollback message
message Rollback {
  required Header header = 1;
}

Server Objects
This section gives all server objects, for example, variables and tables available for administering and supporting replication.

Replication Progress Table
In order to keep track of the progress of replication, a table is maintained containing information on what the slave has seen. In effect, it is a vector clock over the known progress of other servers. It holds the following fields:

Assuming that the server id is unique within a cloud, the pair (Server_Id,Last_Seen_Transaction) denotes the global transaction id and uniquely identifies the transaction within a complete cloud.
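As a minimal sketch of the idea, the progress table can be modelled as a mapping from server id to the last transaction seen from that server. The class and method names are hypothetical, and the sketch assumes that each server hands out transaction ids in increasing order, which this specification does not state.

```python
from typing import Dict, Tuple

# (Server_Id, Last_Seen_Transaction) pair, i.e. a global transaction id.
GlobalTransactionId = Tuple[int, int]


class ProgressTable:
    """In-memory stand-in for the replication progress table (hypothetical)."""

    def __init__(self) -> None:
        self._last_seen: Dict[int, int] = {}

    def record(self, server_id: int, trans_id: int) -> None:
        """Remember the highest transaction id seen from a server."""
        current = self._last_seen.get(server_id)
        if current is None or trans_id > current:
            self._last_seen[server_id] = trans_id

    def has_seen(self, gtid: GlobalTransactionId) -> bool:
        """True if the transaction is at or before the recorded progress.

        Assumes transaction ids grow monotonically per server.
        """
        server_id, trans_id = gtid
        last = self._last_seen.get(server_id)
        return last is not None and trans_id <= last

    def as_vector(self) -> Dict[int, int]:
        """Copy of the progress vector, e.g. for sending to a master."""
        return dict(self._last_seen)


if __name__ == "__main__":
    progress = ProgressTable()
    progress.record(1, 5)
    progress.record(2, 7)
    print(progress.as_vector())
    print(progress.has_seen((1, 4)), progress.has_seen((3, 1)))
```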

Replication Transaction Index
This table maps each global transaction id to the position where the beginning of the transaction is stored. It holds the following fields:


 * Server Id
 * Transaction Id
 * File Index: The index of the file where the beginning of the transaction is stored
 * Position: The position in the file of the beginning of the transaction
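The four fields above suggest a lookup keyed on the global transaction id. The sketch below shows one way this could look in memory; the names `TransactionIndex`, `LogPosition`, `add`, and `lookup` are assumptions for illustration and are not part of the specification.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class LogPosition(NamedTuple):
    file_index: int  # index of the file holding the start of the transaction
    position: int    # offset of the start of the transaction within that file


class TransactionIndex:
    """In-memory stand-in for the replication transaction index (hypothetical)."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[int, int], LogPosition] = {}

    def add(self, server_id: int, trans_id: int,
            file_index: int, position: int) -> None:
        """Record where a transaction begins; the first recorded position wins."""
        self._entries.setdefault((server_id, trans_id),
                                 LogPosition(file_index, position))

    def lookup(self, server_id: int, trans_id: int) -> Optional[LogPosition]:
        """Return the start position of a transaction, or None if unknown."""
        return self._entries.get((server_id, trans_id))


if __name__ == "__main__":
    index = TransactionIndex()
    index.add(server_id=1, trans_id=42, file_index=3, position=1024)
    print(index.lookup(1, 42))
    print(index.lookup(1, 43))
```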

Replication execution
When a slave connects, an initial handshake takes place and after that the master sends a replication stream containing processing and control messages to the slave.

Handshake
When connecting a slave to a master there are some negotiations performed before starting the real replication. This section specifies that protocol.

In order to start replication it is necessary to support the following:


 * Starting replication from the beginning
 * Starting replication from a known "position"

The progress of replication is given as a table (above), so to give a starting position, it shall be possible to send a progress vector to the master. The master shall then start sending events such that no events are missed, although some events might be seen several times.
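The behaviour described above can be sketched as follows. This is only an illustration under stated assumptions: each `Event` stands for one whole transaction, transaction ids increase per server, and the function names are hypothetical. The master replays everything from the first unseen event, so the slave may receive duplicates and skips them using its own progress vector. An empty progress vector corresponds to starting replication from the beginning.

```python
from typing import Dict, List, NamedTuple, Sequence


class Event(NamedTuple):
    server_id: int
    trans_id: int
    payload: str  # stands in for the statements of one whole transaction


def is_new(event: Event, progress: Dict[int, int]) -> bool:
    """True if the progress vector does not yet cover this event."""
    last_seen = progress.get(event.server_id)
    return last_seen is None or event.trans_id > last_seen


def stream_from(log: Sequence[Event], progress: Dict[int, int]) -> List[Event]:
    """Master side: send everything from the first event the slave has not seen."""
    for i, event in enumerate(log):
        if is_new(event, progress):
            return list(log[i:])
    return []


def apply_stream(stream: Sequence[Event], progress: Dict[int, int]) -> List[Event]:
    """Slave side: apply unseen events, skip duplicates, update the vector."""
    applied = []
    for event in stream:
        if is_new(event, progress):
            applied.append(event)
            progress[event.server_id] = event.trans_id
    return applied


if __name__ == "__main__":
    log = [Event(1, 1, "a"), Event(2, 1, "b"), Event(1, 2, "c"), Event(2, 2, "d")]
    slave_progress = {1: 1, 2: 2}
    stream = stream_from(log, slave_progress)
    print(apply_stream(stream, slave_progress), slave_progress)
```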
