Bugzilla – Bug 178
More sophisticated FAILOVER
Last modified: 2012-01-18 13:20:52 PST
General proposal: make FAILOVER a much more sophisticated command, allowing:
* Dropping nodes considered dead
* Doing several failovers of sets as one request

Thus, something like:
<pre>
failover (dead nodes=(1,2,4), set id=1, backup node=3, set id=2, backup node=5, set id=3, backup node=3);
</pre>

* Failover should check various conditions and abort if any of them is not met
** There need to be paths to support communications to let the new masters catch up
** Slons need to be running for nodes that are needed to let masters catch up
** If a node hosts a subscription that cannot be kept, that subscription may be marked dead
*** Automatically kill that subscription?
*** Refuse failover until subscription is marked dead? This requires a dead subscription clause be added...
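To make the abort conditions above concrete, here is a rough Python sketch of the kind of pre-flight check the command could run. It is illustrative only and is not taken from any Slony patch: the `SetFailover` type, the `failover_problems` function, and the shape of its inputs (`subscribers`, `paths`, `running_slons`) are all hypothetical, and it assumes that the nodes "needed to let masters catch up" are the surviving subscribers of each set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetFailover:
    """One 'set id=..., backup node=...' pair from the proposed syntax."""
    set_id: int
    backup_node: int


def failover_problems(dead_nodes, requests, subscribers, paths, running_slons):
    """Return reasons to abort; an empty list means the request looks safe.

    All parameters are hypothetical inputs for this sketch:
      dead_nodes     set of node ids considered dead
      requests       list of SetFailover
      subscribers    dict: set id -> set of node ids subscribed to that set
      paths          set of (client, server) node pairs with a usable path
      running_slons  set of node ids whose slon is known to be running
    """
    problems = []
    for req in requests:
        if req.backup_node in dead_nodes:
            problems.append(
                f"set {req.set_id}: backup node {req.backup_node} is dead")
            continue
        # Assumption: the new master may need to catch up from any
        # surviving subscriber of the same set.
        helpers = (subscribers.get(req.set_id, set())
                   - dead_nodes - {req.backup_node})
        for node in sorted(helpers):
            if (req.backup_node, node) not in paths:
                problems.append(
                    f"set {req.set_id}: no path from node "
                    f"{req.backup_node} to node {node}")
            if node not in running_slons:
                problems.append(
                    f"set {req.set_id}: slon for node {node} is not running")
    return problems


# The request from the example above: dead nodes 1, 2 and 4.
requests = [SetFailover(1, 3), SetFailover(2, 5), SetFailover(3, 3)]
problems = failover_problems(
    dead_nodes={1, 2, 4},
    requests=requests,
    subscribers={1: {3, 5}, 2: {3, 5}, 3: {3}},
    paths={(3, 5)},            # node 5 has no path to node 3
    running_slons={3, 5},
)
```

Returning every problem rather than stopping at the first one would let the command report everything that has to be fixed before the failover is retried.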
Note that this requires the WAIT FOR EVENT changes of Bug #179
Since no patch was forthcoming for the original description I propose an alternative.

I propose only to address the problem of a cluster like this

Node 1-------------->Node 2
(set 1)              (set 2)
  |                    |
  |                    |
  V                    V
Node 3--------------> Node 4

if nodes 1 and 2 are both lost at the same time.

Assuming:
- There are no outstanding configuration/subscription events at the time of failure.
- No additional failures happen while the FAILOVER command is executing.

This patch proposes

FAILOVER ( NODE=(ID=1,BACKUP NODE=2), NODE=(ID=3, BACKUP NODE=4));

It also includes a multi-node DROP NODE

DROP NODE( id='1,2', event node=3);

https://github.com/ssinger/slony1-engine/tree/multi_node_failover_steve

Please review and comment on the syntax.
Created an attachment (id=135) patch for proposed multi-node failover
Created an attachment (id=136) patch for proposed multi-node failover v2
Created an attachment (id=137) proposed multi-node failover v3
The failover procedure (at a high level) is as follows:
1. Get a list of failover candidates for each failed node.
2. Validate that we have conninfo to all of them.
3. Blank communications paths to the failed nodes.
4. Wait for slons to restart (implies need to tell slons to restart).
5. For each failed node get the highest xid for each candidate.
6. Execute FAILOVER on the highest candidate.
7. MOVE SET to the backup node.
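Steps 5 and 6 amount to picking, for each failed node, the candidate that has seen the most from it. A minimal Python sketch of that selection follows; it is not the code from the attached patches. The function name `pick_highest_candidates`, the input dictionaries, and the tie-break on lowest node id are all assumptions made for illustration.

```python
def pick_highest_candidates(candidates, highest_xid):
    """Choose the most advanced failover candidate for each failed node.

    Hypothetical inputs for this sketch:
      candidates   dict: failed node id -> list of candidate node ids
      highest_xid  dict: (failed node id, candidate node id) -> highest
                   xid that candidate has received from the failed node
    Returns a dict: failed node id -> chosen candidate node id.
    """
    chosen = {}
    for failed, nodes in candidates.items():
        if not nodes:
            raise ValueError(f"no failover candidates for node {failed}")
        # Highest xid wins; on a tie, prefer the lowest node id
        # (the tie-break rule is an assumption of this sketch).
        chosen[failed] = max(
            nodes, key=lambda n: (highest_xid[(failed, n)], -n))
    return chosen


# Nodes 1 and 2 have failed; nodes 3 and 4 are the survivors.
chosen = pick_highest_candidates(
    candidates={1: [3, 4], 2: [4]},
    highest_xid={(1, 3): 100, (1, 4): 120, (2, 4): 80},
)
```

If the chosen candidate is not the requested backup node, step 7 (MOVE SET) is what would hand the set over to the backup node afterwards.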