Bug 80 - slon daemon restarts itself in a loop after failover()
Status: ASSIGNED
Product: Slony-I
Component: core scripts
Version: devel
Hardware: All, OS: All
Priority: high, Severity: normal
Assigned To: Jan Wieck
Reported: 2009-04-07 03:08 PDT by Andreas Pfotenhauer
Modified: 2010-09-02 12:16 PDT
CC: 2 users

See Also: http://www.slony.info/bugzilla/show_bug.cgi?id=130


Attachments
log and slonik scripts (3.80 KB, application/x-tgz), 2009-04-07 03:08 PDT, Andreas Pfotenhauer




Description Andreas Pfotenhauer 2009-04-07 03:08:20 PDT
Created attachment (id=31): log and slonik scripts

When the "failed" node is dropped immediately after the failover() command, it
can happen that a node receives the dropNode_int() call while the failed node
is still referenced in the sl_set table. dropNode_int() then fails because the
foreign key constraint "set_origin-no_id-ref" is violated.
The slon daemon then restarts itself, tries to drop the node again (which
fails again), and so loops endlessly, rendering the node unusable.

If you wait a bit after the failover() before dropping the node, everything
works fine.

Attached are the slonik scripts to set up the test and to (hopefully) reproduce
the problem, along with part of the slon daemon log from the node that went wild.

The setup was as follows: four nodes 1-4, with node 1 as master and nodes 2-4
as slaves subscribing directly to the master. The master is then assumed to be
broken and node 4 is to become the new master, so nodes 2 and 3 are
re-subscribed to node 4, the failover is executed, a short wait follows, and
then the "failed" old master is dropped.
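For reference, the core of that sequence looks roughly like this (a sketch
only; the cluster name and conninfo strings are placeholders, the real scripts
are in the attachment):

    cluster name = testcluster;              # placeholder cluster name

    node 1 admin conninfo = 'dbname=db1';    # placeholder conninfo strings
    node 2 admin conninfo = 'dbname=db2';
    node 3 admin conninfo = 'dbname=db3';
    node 4 admin conninfo = 'dbname=db4';

    # re-point the surviving subscribers at the future master
    subscribe set (id = 1, provider = 4, receiver = 2, forward = yes);
    subscribe set (id = 1, provider = 4, receiver = 3, forward = yes);

    # promote node 4; node 1 is treated as failed
    failover (id = 1, backup node = 4);

    # dropping the failed node right away is what triggers the loop
    drop node (id = 1, event node = 4);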

I have to apologize: I forgot to switch languages on my dev machine, so some
of the error messages in the slon log are in German.

Cheers
Andreas
Comment 1 Christopher Browne 2009-04-09 15:10:59 PDT
I'll think about this over the weekend; my first reaction is to treat this as a
documentation patch and to recommend not rushing to drop the node out of the
cluster until the failover has actually completed.

As a first response, that's definitely what I'd recommend.
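Concretely, something along these lines (a sketch only, reusing the node
numbers from the attached scripts; the exact WAIT FOR EVENT parameters are a
suggestion, not tested):

    failover (id = 1, backup node = 4);

    # wait until the surviving subscribers (2 and 3) have confirmed the
    # events from the new origin before removing the failed node
    wait for event (origin = 4, confirmed = 2, wait on = 4);
    wait for event (origin = 4, confirmed = 3, wait on = 4);

    drop node (id = 1, event node = 4);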

When you drop it "too quickly," that introduces the risk, which you ran into,
that some later node gets the DROP NODE event before receiving the FAILOVER
event.

There's no easy way to evade that problem!

However, my second reaction is that it's not particularly reasonable for this
mistake to be allowed to break the cluster.

As a first thought on a solution, we might check whether there's a
FAILOVER_SET event pending, and somehow defer or ignore the DROP NODE.

Gonna have to sleep on that...
Comment 2 Steve Singer 2010-06-18 09:52:59 PDT
This issue is also referenced in bug 129.
There has been some discussion about making FAILOVER mark nodes as disabled so
they don't get included in the set of nodes that WAIT FOR EVENT ...
CONFIRMED = ALL uses (though that on its own won't fix this issue).
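To illustrate (a sketch, not taken from the attached scripts): with node 1
failed but still listed in the cluster configuration, a wait such as

    wait for event (origin = 4, confirmed = all, wait on = 4);

can block indefinitely, because CONFIRMED = ALL still counts the dead node
among the nodes that must confirm.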
Comment 3 Jan Wieck 2010-08-24 07:23:57 PDT
The issue is actually the asynchronous processing of events coming from
different nodes.

The FAILOVER_NODE event is faked by slonik so that it appears to come from the
failed node. This guarantees that every subscriber drains all outstanding SYNC
events from the failed node before starting to consume changes from the next
origin (either the backup node or a temporary origin).

The next origin will issue ACCEPT_SET. The purpose of the ACCEPT_SET event,
which is also seen in MOVE_SET, is that a subscriber suspends processing
events from the accepting node until it has seen the corresponding
FAILOVER_SET or MOVE_SET, so that it doesn't throw away data from the
accepting node. The accepting node can modify the tables and create sl_log
data long before everybody else has caught up.

What we want to do is reproduce the ACCEPT_SET logic in slon for DROP_NODE and
suspend processing of events from the DROP_NODE origin until there are no more
sets from that origin in slon's runtime configuration.
Comment 4 Jan Wieck 2010-08-25 07:24:40 PDT
*** Bug 130 has been marked as a duplicate of this bug. ***
Comment 5 Steve Singer 2010-08-25 11:22:57 PDT
This should also be in 2.0.
Comment 6 Jan Wieck 2010-09-02 12:16:33 PDT
We have to push this one back to devel.

There are several issues with a premature DROP NODE. One is that the function
dropNode_int() cleans up after the dropped node: it deletes every reference to
that node from sl_path, sl_listen, sl_confirm, and sl_event. This can destroy
the FAILOVER_NODE or MOVE_SET event before it has been forwarded to everybody
else.

However, we cannot easily detect what needs to be waited for, because a
multi-node failure is possible, in which case some other node will never
confirm those events.

At this point I don't have a plan for how to finally fix this problem. It
might require a new event type.