178 – More sophisticated FAILOVER

Status:	RESOLVED FIXED

Alias:	None

Product:	Slony-I
Classification:	Unclassified
Component:	slonik
Version:	devel
Hardware:	PC Linux

Importance:	high enhancement
Assignee:	Jan Wieck

Depends on:	179
Blocks:	261

Reported:	2010-12-07 12:12 UTC by Christopher Browne
Modified:	2013-06-06 11:44 UTC
CC List:	1 user

Attachments
patch for proposed multi-node failover (65.91 KB, patch) 2011-11-29 13:29 UTC, Steve Singer
patch for proposed multi-node failover v2 (65.91 KB, patch) 2011-11-29 13:57 UTC, Steve Singer
proposed multi-node failover v3 (65.91 KB, patch) 2011-11-29 14:43 UTC, Steve Singer

Description Christopher Browne 2010-12-07 12:12:06 UTC

General proposal: make FAILOVER a much more sophisticated command that allows:
* Dropping nodes considered dead
* Doing several failovers of sets as one request

Thus, something like:
<pre>
  failover (dead nodes=(1,2,4),
            set id=1, backup node=3,
            set id=2, backup node=5,
            set id=3, backup node=3);
</pre>

* Failover should check the following conditions and abort if any of them is not met
** There need to be paths to support the communications that let the new masters catch up
** Slons need to be running for the nodes that are needed to let the masters catch up
** If a node hosts a subscription that cannot be kept, that subscription may be marked dead
*** Automatically kill that subscription?
*** Refuse failover until the subscription is marked dead?
This requires that a dead subscription clause be added...
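
As an illustration only, the pre-flight checks listed above could be sketched as follows. This is not Slony-I code: the function name, its parameters, and the in-memory representation of paths and running slons are all hypothetical, and reading "able to catch up" as "has a path to at least one surviving node" is an assumption made for the sketch.

```python
from typing import Dict, List, Set, Tuple


def failover_problems(
    dead_nodes: Set[int],
    backup_for_set: Dict[int, int],
    paths: Set[Tuple[int, int]],
    running_slons: Set[int],
) -> List[str]:
    """Return reasons to refuse the failover; an empty list means proceed.

    All names are hypothetical. `paths` holds (client, server) pairs,
    meaning `client` knows how to connect to `server`.
    """
    problems: List[str] = []
    for set_id, backup in sorted(backup_for_set.items()):
        # Assumption: a backup node that is itself listed as dead is an error.
        if backup in dead_nodes:
            problems.append(f"set {set_id}: backup node {backup} is listed as dead")
            continue
        # A slon must be running for the node that has to catch up.
        if backup not in running_slons:
            problems.append(f"set {set_id}: no slon running for backup node {backup}")
        # Assumption: catching up needs a path to at least one surviving node.
        reachable = any(
            client == backup and server not in dead_nodes
            for client, server in paths
        )
        if not reachable:
            problems.append(
                f"set {set_id}: backup node {backup} has no path to a surviving node"
            )
    return problems
```

With the request from the example above (dead nodes 1, 2 and 4; sets 1 and 3 moving to node 3, set 2 moving to node 5), paths between nodes 3 and 5 in both directions, and a slon running only for node 3, the function reports a single problem for set 2. The dead-subscription question is deliberately left out of the sketch, since the proposal leaves it open.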

Comment 1 Christopher Browne 2010-12-07 12:15:12 UTC

Note that this requires the WAIT FOR EVENT changes of Bug #179.

Comment 2 Steve Singer 2011-11-29 13:29:11 UTC

Since no patch was forthcoming for the original description, I propose an alternative.

I propose to address only the case of a cluster like this:

 Node 1-------------->Node 2          
    (set 1)             (set 2)
 |                      |
 |                      |
 V                      V
Node 3-------------->  Node 4

in which nodes 1 and 2 are both lost at the same time.

Assuming:
  - There are no outstanding configuration/subscription events at the time of failure.
  - No additional failures happen while the FAILOVER command is executing.

This patch proposes:

FAILOVER (  NODE=(ID=1,BACKUP NODE=2),
            NODE=(ID=3, BACKUP NODE=4));

It also includes a multi-node DROP NODE:
DROP NODE( id='1,2', event node=3);

https://github.com/ssinger/slony1-engine/tree/multi_node_failover_steve
Please review and comment on the syntax.

Comment 3 Steve Singer 2011-11-29 13:29:55 UTC

Created attachment 135
patch for proposed multi-node failover

Comment 4 Steve Singer 2011-11-29 13:57:58 UTC

Created attachment 136
patch for proposed multi-node failover v2

Comment 5 Steve Singer 2011-11-29 14:43:50 UTC

Created attachment 137
proposed multi-node failover v3

Comment 6 Christopher Browne 2012-01-16 11:51:16 UTC

The failover procedure (at a high level) is as follows:
   * 1. Get a list of failover candidates for each failed node.
   * 2. Validate that we have conninfo to all of them.
   * 3. Blank the communications paths to the failed nodes.
   * 4. Wait for slons to restart (implies a need to tell slons to restart).
   * 5. For each failed node, get the highest xid for each candidate.
   * 6. Execute FAILOVER on the candidate with the highest xid.
   * 7. MOVE SET to the backup node.
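
The candidate-selection part of this procedure (steps 1, 2, 5 and 6) can be sketched roughly as below. This is a simplified illustration, not the slonik implementation: the `Candidate` type, the `pick_failover_targets` function, and the idea that the xid values have already been collected into memory are all assumptions made for the sketch. Steps 3, 4 and 7 touch the live cluster and are not modelled.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one failover candidate of a failed node."""
    node_id: int
    conninfo: Optional[str]  # None if no conninfo is known for this node
    highest_xid: int         # highest xid this candidate has for the failed node


def pick_failover_targets(
    candidates_by_failed: Dict[int, List[Candidate]],
) -> Dict[int, int]:
    """Return {failed_node_id: node_id of the most advanced candidate}."""
    # Step 2: validate conninfo for every candidate before choosing anything.
    for failed_id, candidates in candidates_by_failed.items():
        if not candidates:
            raise ValueError(f"no failover candidates for node {failed_id}")
        for cand in candidates:
            if not cand.conninfo:
                raise ValueError(f"no conninfo for candidate node {cand.node_id}")

    # Steps 5 and 6: per failed node, choose the candidate with the highest xid.
    # On a tie, max() keeps the first candidate in list order.
    return {
        failed_id: max(candidates, key=lambda c: c.highest_xid).node_id
        for failed_id, candidates in candidates_by_failed.items()
    }
```

Validating every candidate up front, before any selection, mirrors the ordering in the list above: nothing is changed in the cluster until conninfo is known to be available for all candidates.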

Comment 7 Steve Singer 2013-06-06 11:44:55 UTC

This work was completed in the 2.2 development cycle and was primarily committed as part of http://git.postgresql.org/gitweb/?p=slony1-engine.git;a=commit;h=5e625828d1aefdeabd4ac1e138f54f8aae686f2