NTT's Development Projects

From PostgreSQL Wiki

Jump to: navigation, search

Contents

Synchronous log replication

This design approach document is partially based on the NTT's synchronous log shipping work. The goal is to simplify the synchronous log shipping so that it has a better chance of getting accepted by the PostgreSQL community. Another goal is to build the larger solution in bits and pieces where each piece is complete in itself, provides a significant addition to the overall goal and lays foundation for further work in the area.

In the recent time, there is good discussion in the PostgreSQL community regarding putting a replication solution inside the core. After going through those discussions, it becomes very clear that the community would like the solution to be small, simple and reliable so that it can be easily integrated with the core and maintained over a long period of time. They don't see this as a replacement for other existing replication solutions, but rather a simple tool which would serve the needs of most of the users. Based on these thoughts, I've put together a design document which IMHO would address these concerns.

The current implementation by NTT has several nice things. It provides a simple solution for synchronous replication using the WAL redo technique. Since replication based on log shipping is well known to PostgreSQL, there should be very few objections to the synchronous replication based on the same technique. The current NTT solution is also very well integrated with the High Availability (HA) software such as Heartbeat. Together they provide a robust solution for HA. But the same things also make the current implementation slightly complex. It becomes hard for PostgreSQL developers to understand and maintain the code and also becomes hard for normal user to setup, configure and use. I propose to make certain changes to the design to make it simple, yet getting the most widely used features in. The first part of the document is the changes we can make to the current design to make it simple. The second part is a complete high level design which is inspired from the NTT work, but have many new design choices.

Part 1. Proposed changes to the current implementation

I would like to make the following recommendations to the current NTT replication design.


Separating HA from the core replication:


IMHO this is an important step to simplify the code. Right now, even though the HA is provided by a set of scripts, there are still many points of integration with the core. That makes the code difficult to understand. Separating HA completely from the replication would make design more modular and there won't be any dependency on any other external software to setup replication.


Merging WALWriter code into the core:


The WALWriter should be part of the core. Its an essential component of the overall replication solution and there is no reason to keep it outside the control of postmaster. Merging the code would help better integration and management of the code.


Improving WALWriter design:


IMHO the WALWriter design can be significantly improved after its merged with the core. There seems to be lot of duplication of work for managing WAL files at the SBY. Also, one process writing to WAL files and then the recovery process reading those files does could be improved. Please see my comments in the second half on how we can improve this.


Separating backup acquisition from the core replication:


The initial backup acquisition can be left to the administrator. Postgres administrators are well aware of the process and leaving it outside will reduce code complexity of the core replication.


Making the code consistent with existing PG code:


I find the code to be bit inconsistent with the postgres coding style at some places. Those things can be improved. Also, though the choice of IPC is debatable, it would help if we could stick to the existing IPC mechanism unless can be proved to be not suitable for the purpose. I see duplication of code (e.g LastLSN.c) which is better avoided.

Part 2. A revamped high level design

Below I propose a high level design to implement synchronous replication in the postgres core. It's derived from the NTT work, with some inputs from my side. The primary goal is to simplify setup, configuration and usage of the solution for postgres users. Another goal is to keep the code simple for ease of development and maintenance.

Requirements

R1. Those who are not interested in replication should be able to run standalone postgres servers without any configuration change and any negative performance overhead.

R2. It should be possible to add a standby server without requiring dump/restore or initdb at the master.

R3. It should be possible to promote a standby server to master without requiring a restart at the standby.

R4. It should be possible to setup replication without any external dependency (such as Heartbeat)


Nice to have

N1. It should be possible to add a standby server without requiring restart at the master.

N2. It should be possible to add more than one standby servers to the master.

N3. It should be possible to configure one or more standby servers so that they receive WAL from other standby server.

N4. Standby should receive old WAL records from the sender even if the logs are archived. This would require either the sender or the receiver to be able to restore archived WAL logs.

N5. Different modes for replication such as sync, async should be supported.


Non-Requirements

NR1. Multi-master replication.

NR2. The initial base backup acquisition for standby.

NR3. Automatic failure detection of the master and automatic failover.

NR4. In case of multiple failures, it would be left to the administrator to identify the instance with the most recent updates and start that instance as the master.

NR5. Using standby as read-only server. This work is orthogonal to the replication work. Of course, the replication design should not stop anybody from working on the read-only standby work.

Based on these requirements and non-requirements, here is a high level design proposal. Comments/suggestions are most welcome.


Design Proposal

Postgres Startup Modes:


Postgres may be started in one of the following modes.

  • SA (Stand Alone)
  • ACT (Active with ready to send WAL)
  • SBY (Standby with ready to receive WAL)

Other modes like SBY-R where standby is ready to send WAL records may also be supported in the later phases.

When postgres is started in SBY mode, the user should specify the <addr:port> of the ACT server. The startup mode and other parameters can either be specified at the command line or through postgresql.conf file.

It should be possible to promote server from SA -> ACT and SBY -> SA or SBY -> ACT without restarting it. I am thinking of a configuration change and a signal to postmaster to reload the configuration could be one way to do this.


Standby Bootstrap:


A postgres server can anytime join as a standby to another instance of postgres running in ACT mode. An example is:

$PGINSTALL/bin/postgres --startup-mode=SBY
                        --act <act-hostname>
                        --act-port <act-port>

The administrator has the responsibility of creating a base backup and restoring the standby to certain point in time. The standby server after finishing the local recovery (by applying the restored WAL files), connects to the ACT server and asks for log records starting the LastLSN successfully applied at the end of local recovery. ACT then starts sending all log records starting from the LastLSN until the SBY catches up with the ACT. Once SBY catches up ACT, the SBY may be marked as READY and normal log shipping continues.

It might be noted that we are assuming that the ACT will have access to all the WAL records starting LastLSN. It's possible that the ACT may not find the intermediate WAL files (containing LastLSN) in the pg_xlog directory if those files have already been archived. In that case, the sync up of SBY will fail and the administrator must first copy the remaining archive files to SBY and retry again. Another suggested option could be to temporarily stop the archiving at the ACT (e.g by setting archive_command to ) while bringing up a SBY node. This would ensure that the necessary WAL files are found in the pg_xlog directory as desired.

Later we can improve upon this by adding capability to ship either the archive files directly from ACT to SBY or restoring the archive files locally in a temp location at ACT to send the intermediate WAL records to SBY.

IMHO this simple way to bring up SBY will make it a lot easier for users to setup the replication.


WAL Sender Process:


When postgres is started in or promoted to ACT mode, it will start a WALSender process which will be a child process of the postmaster and will be managed by it. The role of WALSender will be same as in the current design i.e. to send WAL records to the SBY and wait for ack. Having a single WALSender process would ensure that a single connection is opened between ACT and SBY, thus limiting the resource consumption and easy serialization of WAL shipping.

We could debate the IPC mechanism between various backends and WALSender, but ISTM that a simple shared memory and condition variable based approach should work just fine. A backend can request WAL shipping upto a certain LSN by updating a state in shared memory and signaling the WALSender. The WALSender will continuously ship WAL records to the SBY and update another state information in the shared memory. The exact data structures to hold the state information can be discussed later.

The WALSender process can obviously serve more than one SBY nodes. But we might limit ourselves to just one node to in phase one. The WALSender process also will be responsible to assist SBY to get into initial sync state. So as and when we support multiple SBY nodes, ACT may need to fork more processes for initial SBY synchronization.


WAL Receiver Process:


When a server is started in SBY mode, the WALReceiver process will be started as a child process of the postmaster. It will be responsible to get the initial sync state and later continuously receive WAL records from the ACT. The WALReceiver is similar to WAL Writer process in the current NTT design. But there are couple of important differences.

Since 8.3, postgres has a background WALWriter process. Job of the WALWriter process is to periodically flush WAL buffers to disk and create room for further WAL inserts in the WAL buffers. Since there is already a WALWriter process in 8.3, I thought of naming this process as WALReceiver to avoid any kind of confusion. Unlike NTT design, I am proposing to make this process a child process of the postmaster. While separating it from the core postgres may have its own advantages, the management of the process may become more difficult, especially without the HA environment. Also making it a child of postmaster also gives us the ability to tightly integrate it with the core.

Another significant change I would like to propose is the way redo recovery reads the WAL records. In postgres, redo recovery reads WAL records directly from the WAL files, one at a time, and apply the changes. The current NTT implementation has changed ReadRecord() so that it waits for more WAL records to be written to the WAL files. The WAL writer (in NTT design) receives records from the ACT, appends them to the WAL files and signals the recovery process (which is waiting inside ReadRecord()) to continue further. The recovery process then reads more from the WAL file and proceeds.

I would like to propose to ReadRecord() to read from the WAL buffers instead of the WAL file directly. In its simplest form, the recovery process itself can first populate the WAL buffers and then continue the recovery. The advantage of this approach is that the recovery process can read several blocks at a time, instead of one block at a time. In future, we can even improve postgres to fork a WALReader process which can read ahead the WAL buffers which can speed up the recovery considerably.

When postgres is running as a SBY, the WALReceiver process will receive WAL records from the ACT and populate them in the WAL buffers. This would greatly simplify the architecture of WALReceiver process. It can work with standard xlog interfaces to populate WAL buffers. Also IMHO if we use standard xlog interfaces, then the WAL files can be re-created at the SBY automagically. Of course, all this would require further validation, but the idea looks promising to me. We can even start the WALWriter (as of 8.3) during SBY operation. The process will be responsible for writing WAL buffers to disk, thus re-creating the WAL files at the SBY and making room for WALReceiver to populate the WAL buffers with new records.

In the current postgres, the WAL buffers are setup only after the redo recovery is finished. We would need to change that so that the buffers are setup upfront.


Standby Recovery:


The SBY will continuously run in recovery mode. The recovery process will read from the WAL buffers as and when newer WAL records are available and apply them. The WALReceiver process will populate the WAL buffers and let the recovery process know about it, if its not already working. The recovery process will keep track of the WAL records which have been applied so far so that those records can be written to disk and removed from the WAL buffers. Just like normal backends, the WALReceiver process may call XLogWrite() to flush the finished WAL records to disk if it runs out of space in WAL buffers. (If the WALWriter process is also started during SBY operation, it can also periodically flush WAL records to disk, thus re-creating WAL files and making room in WAL buffers).

Synchronous Modes


For flexibility with varying replication requirements, and in light of recent discussions in the community, we may want to consider various levels of synchronization and safety (wal_transfer_wait).

  • aysnc - ACT does not wait for ack from SBY before continuing
  • syncnet - ACT waits for ack from SBY that message was received
  • syncdisk - ACT waits for SBY to process message, and optionally fsync


High Availability Hooks:


The core replication solution is not tightly integrated with any of the HA solutions. But we would like to provide enough hooks so that any external HA solution can manage the setup. The basic hooks required is the ability of postgres server to promote or demote itself from SBY to ACT and vice versa. The postgres server should also be ready to provide information about its state when queried.

SA -> ACT: The WALSender process is started. It listens on the specified IP:port and accepts connection requests.

SBY -> ACT: The WALReceiver process writes any remaining WAL records to WAL buffer and terminates its connection with the ACT (we may also enforce that the ACT is down at this point). The recovery process finishes the recovery and postgres gets ready to serve client requests. It would also fork off a WALSender process which would listen on the given IP:port.

Apart from these hooks, it should be possible to extract information such as LastLSN available at the server even if the server is down. This information would help the admistrator or the HA software to identify the most recent server and make that ACT node.


Authentication:


Standby can authenticate itself with the ACT by using any of the authentication mechanisms provided by Postgres. We may only allow the users with administrative privileges to register as standby and receive WAL. Alternatively, the user must be granted a special previlege for it to setup a standby (e.g GRANT REPLICA TO <user>).


Summary of Proposed Design:


  • Two new postgres processes; WALReceiver at SBY and WALSender at ACT
  • Base backup is copied to the SBY and SBY is recovered to some point in time.
  • SBY connects to ACT and requests for subsequent WAL records.
  • WALSender reads from WAL buffers and/or WAL files and sends the buffers to WALReceiver. In phase one, we may assume that WALSender can only read from WAL buffers and WAL files in pg_xlog directory. Later on, this can be improved so that WALSender can temporarily restore archived files and read from that too.
  • WAL buffers are set up at SBY before the recovery starts
  • ReadRecord() always reads from WAL buffers. During normal crash or archive recovery, the recovery process can read multiple blocks from WAL files and populate WAL buffers. We could have a new version of XLogInsert which copies received data directly to the WAL buffers and updates XLogCtl appropriately.
  • WALReceiver receives WAL buffers. It can then extract WAL records and use current XLogInsert() interface to write WAL records to the WAL buffers. Alternatively, it can use the improvised XLogInsert interface discussed in the previous point to copy received buffers directly to WAL buffers.
  • We may have a WALReader process to read ahead WAL files during crash recovery, that process can also use the same interface to populate WAL buffers from the WAL files.
  • If XLogInsert does not find any free space in WAL buffers, it can call XLogWrite to flush the buffers to disk. This is very similar to what it does today.
  • Once WALReceiver has copied new records to WAL buffers, it wakes up the recovery process to proceed further.
  • A WALWriter process can be forked during SBY operation to periodically flush WAL buffers to the disk. This would automatically re-create WAL files locally at SBY and make room in WAL buffers.
  • When SBY is signaled to be ACT, it will terminate connection with the ACT (in fact, we could have a check to ensure that the ACT is already down at this time. But this can be discussed later). It will apply the remaining WAL records and quit the recovery process to become fully operational.

Credits

This is work of EnterpriseDB and NTT. Several people contributed

  • Pavan Deolasee - for writing this design doc
  • Koichi Suzuki - for design review and several discussions
  • Masao Fujii - for providing the prototype of replicator which is base on this design

Synch Rep for 8.4

Detailed Design

PDF file: The architecture and sequence of Synch Rep

Contents

  • P1: The process architecture of Synch Rep
  • P2: The flow of WAL in Synch Rep <-- Note: this page contains old information. The latest is P6, 7.
  • P3: The sequence of Synch Rep (startup)
  • P4: The sequence of Synch Rep (replication)
  • P5: The sequence of Synch Rep (timeout)
  • P6: The flow of WAL in Synch Rep (sharing archive)
  • P7: The flow of WAL in Synch Rep (not sharing archive)

User Overview

Feature of Synch Rep

  • Synchronous replication; Transaction commit waits for the WAL to be sent and written (not fsync) to the disk on the standby before the command returns a "success" to the client.
  • Asynchronous replication; Though the WAL is replicated to the standby sometime, transaction commit doesn't wait for it. But asynch replication doesn't create any risk of database inconsistency between two servers. User can configure the interval of replicating the WAL (wal_sender_delay; GUC).
  • User can choose the replication mode (synch / asynch) for every transaction (synchronous_replication; GUC).
  • User can choose independently the synchronization mode of replication (synchronous_replication) and local writing (synchronous_commit).
  • The standby can redo the replicated WAL continuously by using pg_standby like warm-standby. And, user can promote the standby to the primary without any transaction loss by preparing the trigger file for pg_standby.
  • User can separate the standby and continue processing on the primary when the error which prevents replication (outage of the standby, network trouble) occurs. In order to detect aggressively such a error, user can configure the timeout (replication_timeout; GUC).
  • User can make the standby catch up with the primary without stopping any processing on the primary.
  • Connection settings and authentication; User can configure the same settings as a normal connection (keepalive, pg_hba.conf..etc).
  • User can check the progress of replication by using 'ps' command; 'ps' command reports LSN of the already replicated WAL.

Restriction of Synch Rep

  • Replication beyond timeline; User has to get a fresh backup whenever making the old standby catch up.
  • Multiple standby; One standby is supported.
  • Clustering; Postgres doesn't provide any clustering feature.
  • Completely automated catching up; User has to carry out some procedure manually for making the standby catch up.

Documentation Plan

Add the new sgml file

log-shipping.sgml
  • Explain the log-shipping using continuous archiving (called warm standby currently) and the synchronous replication
  • Move "Warm Standby Servers for High Availability" from backup.sgml to log-shipping.sgml
  • Get rid of the word "warm standby", because it's the orthogonal concept against the log-shipping. "warm standby" competes with "hot standby".
  • Clear up what these features can carry out or not.
  • Explain roughly how these features act.
  • Write out all the procedures to handle these features.

Modify the existing sgml files

backup.sgml
  • Remove "Warm Standby Servers for High Availability"
high-availability.sgml
  • Add the synch / asynch replication as one of solutions.
pgstandby.sgml
  • Explain new options
config.sgml
  • Explain new options
libpq.sgml
  • Explain new libpq functions for replication
protocol.sgml
  • Explain new messages for replication
client-auth.sgml
  • Explain the authentication for replication
initdb.sgml
  • Explain new database for replication
runtime.sgml
  • Explain new database for replication
manage-ag.sgml
  • Explain new database for replication
perform.sgml
  • Explain that replication prevents from skipping wal operation

Document

Settings

  • synchronous_replication (boolean)

Specifies whether transaction commit will wait for WAL records to be replicated to the standby before the command returns a "success" indication to the client. The default, and safe, setting is on. When off, there can be a delay between when success is reported to the client and when the transaction is really guaranteed to be safe on the standby server. (The maximum delay is the same as wal_sender_delay.) Like synchronous_commit, setting this parameter to off does not create any risk of database inconsistency between the primary and the standby: a crash might result in some recent allegedly-committed transactions being lost on the standby, but the database state will be just the same as if those transactions had been aborted cleanly. So, turning synchronous_replication off can be a useful alternative when performance is more important than exact certainty about the durability of a transaction. For more discussion see Section ??.?.

This parameter can be changed at any time; the behavior for any one transaction is determined by the setting in effect when it commits. It is therefore possible, and useful, to replicate some transactions synchronously and others asynchronously. For example, to make a single multi-statement transaction replication asynchronously when the default is the opposite, issue SET LOCAL synchronous_replication TO OFF within the transaction.

  • wal_sender_delay (integer)

Specifies the delay between activity rounds for the WAL sender. In each round the sender will replicate WAL to the standby. It then sleeps for wal_sender_delay milliseconds, and repeats. The default value is 200 milliseconds (200ms). Note that on many systems, the effective resolution of sleep delays is 10 milliseconds; setting wal_sender_delay to a value that is not a multiple of 10 might have the same results as setting it to the next higher multiple of 10. This parameter can only be set in the postgresql.conf file or on the server command line.

  • replication_timeout (integer)

Maximum time to complete replication, in milliseconds. If the backend waits for replication to be complete the specified time or more, the connection is closed and walsender exits. Then, any backends are not aborted, but continue processing. A value of zero (the default) turns off the limitation. This parameter can be changed at any time.

Procedure

How to start the primary
  1. Set up continuous archiving on the primary. Ensure that archive_mode and archive_command are set appropriately on the primary.
  2. Configure authentication so that the standby can connect to the database named walsender only for replication. Ensure that connections and authentication are set appropriately at postgresql.conf and pg_hba.conf.
    • Those configuration parameters can be set up like a normal client authentication except specifying walsender as the database name.
    • A connection for replication is also treated like a normal client connection. Set up max_connections appropriately to reserve at least one "slot" for replication.
  3. Start postgres normally on the primary.
How to start the standby from a base backup (initial setup)
  1. Make a base backup of the primary, and load this data onto the standby
  2. Remove any files present in pg_xlog/ and pg_xlog/archive_status/; these came from the backup dump and are therefore probably obsolete rather than current. If you didn't archive pg_xlog/ at all, then recreate it, being careful to ensure that you re-establish it as a symbolic link if you had it set up that way before.
  3. Set up continuous archiving on the standby. Ensure that archive_mode and archive_command are set appropriately on the standby.
  4. Create a recovery command file recovery.conf in the cluster data directory. Ensure that restore_command, recovery_target_timeline, replication_primary_host and replication_primary_port are set appropriately on the standby.
    • restore_command which waits for the next WAL segment to become available from the primary must be set. pg_standby provided as a contrib module is suitable as restore_command. The same archive directory must be set at restore_command and archive_command.
    • recovery_target_timeline must be set to the current timeline of the primary or 'latest' (In this case, the recovery target timeline is chosen appropriately from the timeline history files).
    • replication_primary_host must be set to the address which the primary listens for the connection from the standby. The same address must be also set to listen_addresses on the primary.
    • replication_primary_port must be set to the port which the primary listen on.
  5. Configure authentication so that the standby can connect to the database named walsender only for replication. Ensure that .pgpass and the environment variables are set up appropriately on the standby.
  6. Copy only the following files present in the WAL archive area from the primary to the standby.
    • All timeline history files (.history).
    • Only the backup history file corresponding to the base backup (.backup).
    • Only the WAL segment including the recovery starting point and the subsequent ones. If the segment already exists on the standby, overwrite it. A recovery starting point is saved by the name of "START WAL LOCATION" at a backup label file backup_label.
  7. Start postgres normally on the standby.
  8. Copy the missing WAL segments (only after the recovery starting point) present in the WAL archive area from the primary to the standby.
How to start the standby from an existing database cluster directory
  1. Remove any files present in pg_xlog/ and pg_xlog/archive_status/; these came from the existing database cluster directory and are therefore probably obsolete rather than current.
  2. Set up continuous archiving on the standby. Ensure that archive_mode and archive_command are set appropriately on the standby.
  3. Create a recovery command file recovery.conf in the cluster data directory. Ensure that restore_command, recovery_target_timeline, replication_primary_host and replication_primary_port are set appropriately on the standby.
    • restore_command which waits for the next WAL segment to become available from the primary must be set. pg_standby provided as a contrib module is suitable as restore_command. The same archive directory must be set at restore_command and archive_command.
    • recovery_target_timeline must be set to the current timeline of the primary or 'latest' (In this case, the recovery target timeline is chosen appropriately from the timeline history files).
    • replication_primary_host must be set to the address which the primary listens for the connection from the standby. The same address must be also set to listen_addresses on the primary.
    • replication_primary_port must be set to the port which the primary listen on.
  4. Configure authentication so that the standby can connect to the database named walsender only for replication. Ensure that .pgpass and the environment variables are set up appropriately on the standby.
  5. Copy only the following files present in the WAL archive area from the primary to the standby.
    • All timeline history files (.history).
    • Only the WAL segment including the recovery starting point and the subsequent ones. If the segment already exists on the standby, overwrite it. A recovery starting point is saved at pg_control, and displayed by the name of "Latest checkpoint's REDO location" by using pg_controldata.
  6. Start postgres normally on the standby.
  7. Copy the missing WAL segments (only after the recovery starting point) present in the WAL archive area from the primary to the standby.
How to promote the standby to the new primary
  1. Create the trigger file specified in the restore command.
How to stop the primary
  1. Stop postgres normally on the primary.
    • In the Smart Shutdown or Fast Shutdown, postgres exits after replicating the last checkpoint WAL to the standby.
How to stop the standby
  1. Stop postgres by using the Fast Shutdown. Don't use the Smart Shutdown and Immediate Shutdown because they might not stop recovery.
How to start synchronous replication system
  1. Start the primary.
  2. Start the standby from the favorite one of the base backup and the existing database cluster. Of course, in initial start case, the standby must be started from the base backup.
How to perform failover
  1. Stop the primary if it still alive.
  2. Promote the standby to the new primary.
How to perform switchover
  1. Stop the primary.
  2. Promote the standby to the new primary.
  3. Start the original primary to the new standby.
How to stop synchronous replication system
  1. Stop the primary
  2. Stop the standby
QA
  • From which base data should we use to start the standby, a base backup? or an existing database cluster:
  • How to detect the connection error between two servers:

Patch set

How to install

Please perform the following procedure on the primary and standby.

1. Prepare the source HEAD
[example]

$ cvs -d :pserver:anoncvs@anoncvs.postgresql.org:/projects/cvsroot co -P pgsql
$ ls
pgsql
2. Apply the synchronous log-shipping patch for HEAD
[example]

$ tar zxf Synch_rep_patchset_vX.tgz
$ cp synch_rep_patchset_v4/whole_patch/walsender.c   pgsql/src/backend/postmaster
$ cp synch_rep_patchset_v4/whole_patch/walreceiver.c pgsql/src/backend/postmaster
$ cp synch_rep_patchset_v4/whole_patch/walsender.h   pgsql/src/include/postmaster
$ cp synch_rep_patchset_v4/whole_patch/walreceiver.h pgsql/src/include/postmaster
$ cd pgsql
$ patch -p 0 -d . < ../synch_rep_patchset_v4/whole_patch/synch_rep_vX.patch
3. Compile
[example]

$ configure
$ make install
4. Setup
Personal tools