| Summary: | failover does not seem to update sl_set | | |
| --- | --- | --- | --- |
| Product: | Slony-I | Reporter: | Steve Singer <ssinger> |
| Component: | stored procedures | Assignee: | Slony Bugs List <slony1-bugs> |
| Status: | RESOLVED DUPLICATE | | |
| Severity: | enhancement | CC: | slony1-bugs |
| Priority: | medium | | |
| Version: | 2.0 | | |
| Hardware: | PC | | |
| OS: | Linux | | |
| See Also: | http://bugs.slony.info/bugzilla/show_bug.cgi?id=129, http://bugs.slony.info/bugzilla/show_bug.cgi?id=136, http://www.slony.info/bugzilla/show_bug.cgi?id=80 | | |
Description
Steve Singer
2010-05-26 14:26:43 UTC
(In reply to comment #0)

> Observed with 2.0.3
>
> In a cluster as follows
>
>     1
>      \
>       \
>     3===4
>      \
>       \
>        5
>
> A failover of set 1=>3 seems to work.
>
> However DROP NODE (id=1) fails with
>
>     <stdin>:12: PGRES_FATAL_ERROR select "_disorder_replica".dropNode(1); - ERROR:
>     Slony-I: Node 1 is still origin of one or more sets
>
> When I query sl_set it shows that node 1 is still the origin of the set even
> though the failover command seemed to work okay.

Where did you run these commands? If you ran them against node 1, then I'd fully expect this behaviour. Node #1 doesn't really know that it's "shunned." It certainly doesn't if the disks were ground into a powder (in which case your queries would fail because there's no database there anymore!). And if node 1 failed due to a network partition, it can't be expected to ever become aware that it's shunned. If the requests were hitting ex-node #1, then I don't think these results are necessarily wrong.

According to the documentation, DROP NODE shouldn't work with EVENT NODE = 1 (the node being dropped), so I don't think that was the case. I think what was happening is that FAILOVER returns right away, but the failover processing doesn't complete until later (the other subscribers still have to process an ACCEPT_SET). No WAIT FOR commands were being issued after the FAILOVER.

See the comments on bug #129; the relevant portion is duplicated below:

----------------
2) As part of a failover we want to mark the failed node as being inactive in
sl_node and make it so that WAIT FOR confirmed=all doesn't wait on the failed
node to confirm things.

3) slonik needs to remember the sequence number returned by failedNode2 so
that it is possible to WAIT FOR that event on the backup node to ensure it is
confirmed by all. Exactly how a slonik script can wait still needs to be
figured out. This won't be done until 2.1
------------------------

I think this is covered by the proposed patch in bug #136?
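The sequencing issue discussed above can be illustrated as a slonik script: DROP NODE should only run after the failover event has been confirmed by the surviving nodes. This is a sketch, not the patch under discussion; the cluster name `disorder_replica` comes from the error message, the conninfo strings are hypothetical, and making WAIT FOR behave sensibly when one node is dead is exactly what bugs #129/#136 address:

```
cluster name = disorder_replica;

# Hypothetical admin conninfo for the surviving nodes (node 1 has failed).
node 3 admin conninfo = 'dbname=disorder host=node3';
node 4 admin conninfo = 'dbname=disorder host=node4';
node 5 admin conninfo = 'dbname=disorder host=node5';

# Promote node 3 to be the origin of the sets owned by failed node 1.
failover (id = 1, backup node = 3);

# Wait until the failover event (and the subscribers' ACCEPT_SET
# processing) is confirmed.  Without a wait here, DROP NODE can run
# before sl_set has been updated to show node 3 as the origin.
wait for event (origin = 3, confirmed = all, wait on = 3);

# Only now is it safe to remove the failed node.
drop node (id = 1, event node = 3);
```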