Bug 178 - More sophisticated FAILOVER
Status: RESOLVED FIXED
Product: Slony-I
Component: slonik
Version: devel
Platform: PC Linux
Importance: high enhancement
Assigned To: Jan Wieck
Depends on: 179
Blocks: 261
Reported: 2010-12-07 12:12 PST by Christopher Browne
Modified: 2013-06-06 11:44 PDT (History)


Attachments
patch for proposed multi-node failover (65.91 KB, patch)
2011-11-29 13:29 PST, Steve Singer
patch for proposed multi-node failover v2 (65.91 KB, patch)
2011-11-29 13:57 PST, Steve Singer
proposed multi-node failover v3 (65.91 KB, patch)
2011-11-29 14:43 PST, Steve Singer


Description Christopher Browne 2010-12-07 12:12:06 PST
General proposal, that FAILOVER be a much more sophisticated command, allowing:
* Dropping nodes considered dead
* Doing several failovers of sets as one request

Thus, something like:
<pre>
  failover (dead nodes=(1,2,4),
            set id=1, backup node=3,
            set id=2, backup node=5,
            set id=3, backup node=3);
</pre>

* Failover should check the following conditions and abort if any of them is not met
** There need to be paths to support communications to let the new masters catch up
** Slons need to be running for nodes that are needed to let masters catch up
** If a node hosts a subscription that cannot be kept, that subscription may be marked dead
*** Automatically kill that subscription?
*** Refuse failover until the subscription is marked dead?
This requires that a dead subscription clause be added...
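The checks proposed above can be sketched roughly as follows. This is only an illustration in Python, not slonik code: the FailoverRequest class, the check_failover function, and the shapes of the paths, running_slons and subscriptions arguments are all hypothetical, and the "dead subscription" handling follows the "refuse failover until the subscription is marked dead" option.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverRequest:
    """Hypothetical model of the proposed multi-set FAILOVER request."""
    dead_nodes: set   # node ids considered dead
    backups: dict     # set id -> backup node that becomes the new origin
    # (set id, receiver node) pairs explicitly marked dead (assumed clause)
    dead_subscriptions: set = field(default_factory=set)


def check_failover(req, paths, running_slons, subscriptions):
    """Return a list of problems; an empty list means the request may proceed.

    paths:          set of (client node, server node) pairs with a usable path
    running_slons:  set of node ids whose slon is running
    subscriptions:  dict mapping (set id, receiver node) -> provider node
    """
    problems = []
    for set_id, backup in sorted(req.backups.items()):
        if backup in req.dead_nodes:
            problems.append(f"set {set_id}: backup node {backup} is listed as dead")
        elif backup not in running_slons:
            problems.append(f"set {set_id}: no slon running for backup node {backup}")

    for (set_id, receiver), provider in sorted(subscriptions.items()):
        backup = req.backups.get(set_id)
        if backup is None or receiver in req.dead_nodes or receiver == backup:
            continue  # set not failed over, or receiver needs no new provider
        if provider not in req.dead_nodes:
            continue  # subscription still has a live provider
        if (set_id, receiver) in req.dead_subscriptions:
            continue  # operator already declared this subscription dead
        if (receiver, backup) not in paths:
            problems.append(
                f"set {set_id}: node {receiver} has no path to backup node {backup}"
            )
    return problems
```

With the request from the example above (dead nodes 1, 2 and 4), a subscriber on node 5 to set 1 is accepted only if it has a path to node 3 or if its subscription has been marked dead.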
Comment 1 Christopher Browne 2010-12-07 12:15:12 PST
Note that this requires the WAIT FOR EVENT changes of Bug #179
Comment 2 Steve Singer 2011-11-29 13:29:11 PST
Since no patch was forthcoming for the original description I propose an
alternative.

I propose to address only the case of a cluster like this:

Node 1 --------------> Node 2
(set 1)                (set 2)
  |                      |
  |                      |
  V                      V
Node 3 --------------> Node 4

where nodes 1 and 2 are both lost at the same time.

Assuming:
  - There are no outstanding configuration/subscription events at the time of
    failure.
  - No additional failures happen while the FAILOVER command is executing

This patch proposes

FAILOVER (  NODE=(ID=1, BACKUP NODE=3),
            NODE=(ID=2, BACKUP NODE=4));

It also includes a multi-node DROP NODE
DROP NODE( id='1,2', event node=3);

https://github.com/ssinger/slony1-engine/tree/multi_node_failover_steve
Please review and comment on the syntax.
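To make the proposed syntax easier to review, here is a minimal Python sketch that renders both statements from plain data. The function names are hypothetical helpers, not part of slonik, and the output simply follows the syntax proposed in this comment; the final syntax may differ.

```python
def failover_command(backups):
    """Render a multi-node FAILOVER statement.

    backups: dict mapping failed node id -> backup node id.
    """
    if not backups:
        raise ValueError("at least one failed node is required")
    overlap = set(backups) & set(backups.values())
    if overlap:
        # A node cannot be both failed and a backup in the same request.
        raise ValueError(f"nodes listed as both failed and backup: {sorted(overlap)}")
    clauses = [
        f"NODE=(ID={failed}, BACKUP NODE={backup})"
        for failed, backup in sorted(backups.items())
    ]
    return "FAILOVER (" + ", ".join(clauses) + ");"


def drop_node_command(node_ids, event_node):
    """Render a multi-node DROP NODE statement."""
    if event_node in node_ids:
        raise ValueError("the event node cannot be one of the dropped nodes")
    ids = ",".join(str(n) for n in sorted(node_ids))
    return f"DROP NODE (ID='{ids}', EVENT NODE={event_node});"


# Cluster from the diagram: nodes 1 and 2 are lost, nodes 3 and 4 take over.
print(failover_command({1: 3, 2: 4}))
print(drop_node_command([1, 2], event_node=3))
```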
Comment 3 Steve Singer 2011-11-29 13:29:55 PST
Created an attachment (id=135)
patch for proposed multi-node failover
Comment 4 Steve Singer 2011-11-29 13:57:58 PST
Created an attachment (id=136)
patch for proposed multi-node failover v2
Comment 5 Steve Singer 2011-11-29 14:43:50 PST
Created an attachment (id=137)
proposed multi-node failover v3
Comment 6 Christopher Browne 2012-01-16 11:51:16 PST
The failover procedure (at a high level) is as follows:
   1. Get a list of failover candidates for each failed node.
   2. Validate that we have conninfo for all of them.
   3. Blank the communications paths to the failed nodes.
   4. Wait for slons to restart (implies a need to tell slons to restart).
   5. For each failed node, get the highest xid for each candidate.
   6. Execute FAILOVER on the highest candidate.
   7. MOVE SET to the backup node.
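The last three steps of the list above can be sketched as a small planning function. This is an illustrative Python sketch under stated assumptions, not the code in the patch: plan_failover and its argument shapes are hypothetical, the tie-breaking rule is my own choice, and skipping MOVE SET when the most advanced candidate already is the requested backup is also an assumption.

```python
def plan_failover(candidates, highest_xid, backup):
    """Plan the FAILOVER / MOVE SET actions for each failed node.

    candidates:  dict failed node -> list of candidate node ids
    highest_xid: dict (failed node, candidate) -> highest xid seen on that candidate
    backup:      dict failed node -> backup node requested by the operator
    Returns a list of (action, from node, to node) tuples.
    """
    plan = []
    for failed in sorted(candidates):
        nodes = candidates[failed]
        if not nodes:
            raise ValueError(f"no failover candidate for node {failed}")
        # Most advanced candidate wins; ties go to the requested backup,
        # then to the lowest node id (assumed tie-breaking rule).
        best = max(
            nodes,
            key=lambda n: (highest_xid[(failed, n)], n == backup[failed], -n),
        )
        plan.append(("FAILOVER", failed, best))
        if best != backup[failed]:
            # The most advanced node is only a stepping stone to the backup.
            plan.append(("MOVE SET", best, backup[failed]))
    return plan
```

For example, if node 4 is further ahead than the requested backup node 3, the plan fails over to node 4 first and then moves the set to node 3, so no data that reached node 4 is lost.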
Comment 7 Steve Singer 2013-06-06 11:44:55 PDT
This work was completed in the 2.2 development cycle and was primarily
committed as part of
http://git.postgresql.org/gitweb/?p=slony1-engine.git;a=commit;h=5e625828d1aefdeabd4ac1e138f54f8aae686f2