Bug 213

Summary: failover while slon is not running
Product: Slony-I Reporter: Steve Singer <ssinger>
Component: slon Assignee: Slony Bugs List <slony1-bugs>
Status: NEW ---    
Severity: enhancement CC: slony1-bugs
Priority: low    
Version: devel   
Hardware: PC   
OS: Linux   

Description Steve Singer 2011-05-13 11:24:12 UTC
This bug was introduced in commits bfa8e601fe7ba1bd91a053901426d4f7195c53a0 (2.1.0) and 60566590d683b85733404ef290e6c1823c4c014c (2.0.5).

If a failover command is executed while the slon for the backup node (say node 2) is not running, the most-ahead node (say node 3) will have a FAILOVER_SET event generated with ev_origin=1 (the failing node).

For the failover to finish, that event needs to be processed on node 2. When the slon for node 2 is later started, it sees that no_active=false for node 1 in sl_node (this change was made in the commits referenced above). Since node 1 is marked inactive, no remoteWorkerThread_1 is started, so the slon for node 2 will never process the FAILOVER_SET event, because that event has ev_origin=1.
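
To make the failure mode concrete, here is a toy Python model of the startup decision described above. It is only a sketch and not slon's actual code: the Node and Event classes and the start_remote_workers() helper are hypothetical names, and the only behaviour modelled is "a remote worker thread is started for each other node that is marked active, and an event is processed only if a worker exists for its ev_origin".

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Mirrors the two sl_node facts used in this report.
    no_id: int
    no_active: bool


@dataclass
class Event:
    ev_origin: int
    ev_type: str


def start_remote_workers(local_id, sl_node):
    """Return the origins that would get a remoteWorkerThread_<id>.

    Hypothetical simplification: one worker per *active* remote node.
    """
    return {n.no_id for n in sl_node if n.no_id != local_id and n.no_active}


def stuck_events(events, workers):
    """Events whose origin has no worker thread and so are never processed."""
    return [e for e in events if e.ev_origin not in workers]


# State seen by the slon for node 2 when it starts after the failover.
sl_node = [Node(1, False), Node(2, True), Node(3, True)]
pending = [Event(1, "FAILOVER_SET")]

workers_before = start_remote_workers(2, sl_node)
stuck_before = stuck_events(pending, workers_before)

# Marking the failed node active again gives it a worker thread.
sl_node[0].no_active = True
workers_after = start_remote_workers(2, sl_node)
stuck_after = stuck_events(pending, workers_after)

print(sorted(workers_before), stuck_before)
print(sorted(workers_after), stuck_after)
```

In this model the FAILOVER_SET event stays stuck for as long as node 1 is inactive in node 2's copy of sl_node, which matches the behaviour reported here.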


As a workaround, if you get into this situation you can manually (with psql) set no_active=true for the failed node on node 2, then start the slon for node 2. It will now have a remoteWorkerThread_1 and process the FAILOVER_SET event.
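
A rough sketch of the statement this workaround amounts to, written as a small Python helper that only builds the SQL text so it can be reviewed before being pasted into psql on node 2. Several things here are assumptions not stated in this report: the cluster name "mycluster" is made up, the schema is assumed to follow the usual "_<clustername>" convention, and no_id is assumed to be the node id column in sl_node. The helper name build_workaround_sql is hypothetical.

```python
def build_workaround_sql(cluster_name: str, failed_node_id: int) -> str:
    """Build the UPDATE that marks the failed node active again.

    Assumes the Slony schema is named "_<cluster_name>" and that
    sl_node identifies nodes by no_id (both assumptions, see above).
    """
    # Quote the schema name as an SQL identifier, doubling any embedded quotes.
    schema = '"_' + cluster_name.replace('"', '""') + '"'
    # int() rejects anything that is not a plain node number.
    node_id = int(failed_node_id)
    return f"UPDATE {schema}.sl_node SET no_active = true WHERE no_id = {node_id};"


# Node 1 is the failed node in the scenario above; run the result on node 2 only.
print(build_workaround_sql("mycluster", 1))
```

Run the generated statement against node 2's database while its slon is still stopped, then start the slon.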

Longer term, we probably need to separate a node's inactive status as used for rebuilding listen paths and waiting from its use in deciding whether to start slon worker threads.