Bug 133

Summary: DROP set in the middle of a subscribe to the same set confuses slon
Product: Slony-I
Reporter: Steve Singer <ssinger>
Component: slon
Assignee: Slony Bugs List <slony1-bugs>
Status: NEW
Severity: normal
CC: slony1-bugs
Priority: low
Version: 2.0
Hardware: PC
OS: Linux

Description Steve Singer 2010-06-01 12:01:13 UTC
Consider a setup such as

1==>3===>4

Node 1 is the origin for a replication set (set 2).

This set is subscribed to on node 2.

We then concurrently issue the following from two different slonik instances:

1:  subscribe set (id=2, provider=3, receiver=4);
2:  drop set (id=2, origin=1);

It is possible for the slons at various nodes to get confused and start logging things like

db5 - 2010-06-01 14:51:02 EDTERROR  remoteWorkerThread_3: "select "_disorder_replica".subscribeSet_int(2, 3, 4, 't', 'f'); insert into "_disorder_replica".sl_event     (ev_origin, ev_seqno, ev_timestamp,      ev_snapshot, ev_type , ev_data1, ev_data2, ev_data3, ev_data4, ev_data5    ) values ('3', '5000000005', '2010-06-01 14:50:36.965441', '690742:690742:', 'SUBSCRIBE_SET', '2', '3', '4', 't', 'f'); insert into "_disorder_replica".sl_confirm     (con_origin, con_received, con_seqno, con_timestamp)    values (3, 5, '5000000005', now()); commit transaction;" PGRES_FATAL_ERROR ERROR:  Slony-I: subscribeSet_int(): set 2 not found



Note that the slon complaining is for node 5, and is not directly involved in the actions on set 2 (but other slons log similar errors).


I realize that I am trying to shoot myself in the foot with this example, but I can see how this could happen in real life (two admins who don't talk to each other, or misbehaving scripts), and it shouldn't be that easy to corrupt my replication cluster.  Perhaps subscribeSet_int() should log an error but still mark the event as processed?
Comment 1 Christopher Browne 2010-06-15 15:27:04 UTC
(In reply to comment #0)
Yep, there should be something here to behave less badly.
Comment 2 Steve Singer 2010-07-27 11:55:26 UTC
I am worried that we can't just ignore the subscribe set event and mark it confirmed on nodes where the set does not exist in sl_set.

Node 5 would have no way of knowing whether there is no row in sl_set because a DROP SET has already been processed or because the CREATE SET has not yet been received by node 5.   Both the create set and the drop set will be coming from the origin, but the subscribe set could be coming from a receiver.
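
To make the ambiguity concrete, here is a minimal sketch in Python. It is a toy model, not Slony-I code: the function apply_events and the idea of representing sl_set as a plain set of ids are assumptions made only for illustration.

```python
# Hypothetical model, not Slony-I code: a node's local view is reduced to
# the set of set ids it currently holds in sl_set.

def apply_events(events):
    """Apply CREATE_SET / DROP_SET events in order; return the resulting sl_set."""
    sl_set = set()
    for ev_type, set_id in events:
        if ev_type == "CREATE_SET":
            sl_set.add(set_id)
        elif ev_type == "DROP_SET":
            sl_set.discard(set_id)
    return sl_set

# Case A: node 5 has not yet received CREATE SET for set 2.
not_yet_created = apply_events([])

# Case B: node 5 has already processed CREATE SET and DROP SET for set 2.
already_dropped = apply_events([("CREATE_SET", 2), ("DROP_SET", 2)])

# When SUBSCRIBE_SET for set 2 arrives, both histories look the same locally.
indistinguishable = (not_yet_created == already_dropped)
```

In case A the right reaction would be to wait for the CREATE SET; in case B it would be to skip the event. The local state alone does not say which case applies.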

We could try to 'remember' the set after a DROP SET for some period of time/messages, but what would that period be? An arbitrary value would only delay the issue.   You could remember all old sets, 
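
A rough sketch of the "remember for a while" idea, again as a toy Python model rather than anything in Slony-I. The Node class, the tombstones dict, and counting the retention window in processed events are all assumptions made for illustration.

```python
class Node:
    """Hypothetical node that remembers dropped sets for `retention` events."""

    def __init__(self, retention):
        self.retention = retention
        self.sl_set = set()
        self.tombstones = {}   # set id -> events processed since the drop

    def handle(self, ev_type, set_id=None):
        # Age existing tombstones and forget the ones past the window.
        for sid in list(self.tombstones):
            self.tombstones[sid] += 1
            if self.tombstones[sid] > self.retention:
                del self.tombstones[sid]

        if ev_type == "CREATE_SET":
            self.sl_set.add(set_id)
        elif ev_type == "DROP_SET":
            self.sl_set.discard(set_id)
            self.tombstones[set_id] = 0
        elif ev_type == "SUBSCRIBE_SET":
            if set_id in self.sl_set:
                return "applied"
            if set_id in self.tombstones:
                return "ignored"        # known to be dropped: safe to confirm
            return "set not found"      # ambiguous again
        return "ok"


# SUBSCRIBE_SET arrives right after the drop: the tombstone is still there.
prompt_node = Node(retention=2)
prompt_node.handle("CREATE_SET", 2)
prompt_node.handle("DROP_SET", 2)
prompt_result = prompt_node.handle("SUBSCRIBE_SET", 2)

# SUBSCRIBE_SET arrives after the window has passed: same error as before.
late_node = Node(retention=2)
late_node.handle("CREATE_SET", 2)
late_node.handle("DROP_SET", 2)
late_node.handle("SYNC")
late_node.handle("SYNC")
late_result = late_node.handle("SUBSCRIBE_SET", 2)
```

With any finite window, a subscribe set that is delayed long enough lands in the "set not found" branch again, which is the sense in which an arbitrary value only delays the issue.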

We could try to make the subscribe set come from the origin not from the provider. This would mean that it comes from the same place as the create/drop set commands but that will have implications elsewhere.
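
The effect of changing the event's source can be sketched as follows. This Python model assumes only that a node applies events from any single origin in sequence order while events from different origins may interleave freely; valid_orders and the (origin, seqno, type) tuples are hypothetical names for illustration.

```python
from itertools import permutations

def valid_orders(events):
    """Return every delivery order that preserves per-origin sequence order."""
    result = []
    for perm in permutations(events):
        last_seen = {}
        ok = True
        for origin, seqno, _ in perm:
            if last_seen.get(origin, 0) > seqno:
                ok = False
                break
            last_seen[origin] = seqno
        if ok:
            result.append([ev_type for _, _, ev_type in perm])
    return result

# As in the log above: SUBSCRIBE_SET is an event from node 3 (the provider).
from_provider = valid_orders([
    (1, 1, "CREATE_SET"),
    (1, 2, "DROP_SET"),
    (3, 1, "SUBSCRIBE_SET"),
])

# Proposal: SUBSCRIBE_SET is an event from node 1 (the origin), here
# assumed to have been submitted before the DROP SET.
from_origin = valid_orders([
    (1, 1, "CREATE_SET"),
    (1, 2, "SUBSCRIBE_SET"),
    (1, 3, "DROP_SET"),
])
```

In this model the provider-sourced event can reach a node before the CREATE SET, between the two, or after the DROP SET, while the origin-sourced event has exactly one position relative to them. It says nothing about the implications elsewhere that the change would have.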

Ideas welcome