Bug 92

Summary: Events and Confirmations are not propagating
Product: Slony-I Reporter: Cyril Scetbon <cyril.scetbon>
Component: slonAssignee: Slony Bugs List <slony1-bugs>
Status: RESOLVED FIXED    
Severity: critical CC: slony1-bugs
Priority: high    
Version: devel   
Hardware: PC   
OS: Linux   
See Also: http://bugs.slony.info/bugzilla/show_bug.cgi?id=127
Bug Depends on: 127    
Bug Blocks:    

Description Cyril Scetbon 2009-05-18 02:35:04 UTC
I use test_slony_state and see some informations like :

Check of event info
---------------------------------------------------
Problem : Events not propagating to node 2
Problem : Events not propagating to node 4
Problem : Events not propagating to node 3

Check of sl_confirm aging
---------------------------------------------------
Confirmations not propagating from 2 to 1
Confirmations not propagating from 2 to 3
Confirmations not propagating from 2 to 4
Confirmations not propagating from 3 to 1
Confirmations not propagating from 3 to 2
Confirmations not propagating from 3 to 4
Confirmations not propagating from 4 to 1
Confirmations not propagating from 4 to 2
Confirmations not propagating from 4 to 3

You can see the results on one of my databases :

- for Confirmations

select con_origin, con_received, min(con_seqno) as minseq, max(con_seqno) as maxseq, date_trunc('minutes', min(now()-con_timestamp)) as age1, date_trunc('minutes', max(now()-con_timestamp)) as age2, min(now() - con_timestamp) > '00:30:00' as tooold from _pns_slony_voila_archi_0.sl_confirm group by con_origin, con_received order by con_origin, con_received;
con_origin | con_received | minseq  | maxseq  |   age1   |   age2   | tooold ------------+--------------+---------+---------+----------+----------+--------
       101 |          102 | 4464029 | 4464792 | 00:00:00 | 00:16:00 | f
       101 |          103 | 4464027 | 4464792 | 00:00:00 | 00:16:00 | f
       101 |          104 | 4464024 | 4464792 | 00:00:00 | 00:16:00 | f
       102 |          101 |      29 |      29 | 03:39:00 | 03:39:00 | t
       102 |          103 |      29 |      29 | 03:39:00 | 03:39:00 | t
       102 |          104 |      29 |      29 | 03:39:00 | 03:39:00 | t
       103 |          101 |      62 |      62 | 03:39:00 | 03:39:00 | t
       103 |          102 |      62 |      62 | 03:39:00 | 03:39:00 | t
       103 |          104 |      62 |      62 | 03:39:00 | 03:39:00 | t
       104 |          101 |      57 |      57 | 03:39:00 | 03:39:00 | t
       104 |          102 |      57 |      57 | 03:39:00 | 03:39:00 | t
       104 |          103 |      57 |      57 | 03:39:00 | 03:39:00 | t

- for Events

select ev_origin, min(ev_seqno), max(ev_seqno),
        date_trunc('minutes', min(now() - ev_timestamp)),
        date_trunc('minutes', max(now() - ev_timestamp)),
        min(now() - ev_timestamp) > '00:30:00' as agehi
    from _pns_slony_voila_archi_0.sl_event group by ev_origin;
ev_origin |   min   |   max   | date_trunc | date_trunc | agehi
-----------+---------+---------+------------+------------+-------
      103 |      62 |      62 | 03:49:00   | 03:49:00   | t
      104 |      57 |      57 | 03:49:00   | 03:49:00   | t
      102 |      29 |      29 | 03:49:00   | 03:49:00   | t
      101 | 4464493 | 4465346 | 00:00:00   | 00:14:00   | f

If I restart every slons processes new events are generated on these nodes but still not correctly propagated. However the last event on each node seems to be propagated thanks to the restart stage (see below)

After the restart I have something like :

select ev_origin, min(ev_seqno), max(ev_seqno),
       date_trunc('minutes', min(now() - ev_timestamp)),
       date_trunc('minutes', max(now() - ev_timestamp)),
       min(now() - ev_timestamp) > '00:30:00' as agehi
   from _pns_slony_voila_archi_0.sl_event group by ev_origin;
ev_origin |   min   |   max   | date_trunc | date_trunc | agehi
-----------+---------+---------+------------+------------+-------
      103 |      65 |      66 | 00:01:00   | 00:47:00   | f
      104 |      60 |      61 | 00:01:00   | 00:47:00   | f
      102 |      33 |      34 | 00:01:00   | 00:47:00   | f
      101 | 4541581 | 4542350 | 00:00:00   | 00:13:00   | f
(4 rows)

So, another event is generated (for example, 66 from node 103)

log extract :
2009-04-28 15:51:53 CEST DEBUG2 remoteWorkerThread_102: SYNC 33 processing
2009-04-28 15:51:53 CEST DEBUG1 remoteWorkerThread_102: no sets need syncing for this event
2009-04-28 15:51:53 CEST DEBUG2 remoteListenThread_104: queue event 102,33 SYNC
2009-04-28 15:51:53 CEST DEBUG2 remoteWorker_event: event 102,33 ignored - duplicate
2009-04-28 15:51:53 CEST DEBUG2 remoteListenThread_104: queue event 104,60 SYNC
2009-04-28 15:51:53 CEST DEBUG2 remoteWorker_event: event 104,60 ignored - duplicate
2009-04-28 15:51:53 CEST DEBUG2 remoteWorkerThread_104: SYNC 60 processing
2009-04-28 15:51:53 CEST DEBUG1 remoteWorkerThread_104: no sets need syncing for this event
2009-04-28 15:51:53 CEST DEBUG2 remoteWorkerThread_103: SYNC 65 processing
2009-04-28 15:51:53 CEST DEBUG1 remoteWorkerThread_103: no sets need syncing for this event

after a few minutes I have :

select ev_origin, min(ev_seqno), max(ev_seqno),
       date_trunc('minutes', min(now() - ev_timestamp)),
       date_trunc('minutes', max(now() - ev_timestamp)),
       min(now() - ev_timestamp) > '00:30:00' as agehi
   from _pns_slony_voila_archi_0.sl_event group by ev_origin;
ev_origin |   min   |   max   | date_trunc | date_trunc | agehi
-----------+---------+---------+------------+------------+-------
      103 |      66 |      66 | 00:12:00   | 00:12:00   | f
      104 |      61 |      61 | 00:12:00   | 00:12:00   | f
      102 |      34 |      34 | 00:12:00   | 00:12:00   | f
      101 | 4542310 | 4543008 | 00:00:00   | 00:11:00   | f

log extract :
2009-04-28 16:37:49 CEST DEBUG2 remoteWorkerThread_103: SYNC 66 processing
2009-04-28 16:37:49 CEST DEBUG1 remoteWorkerThread_103: no sets need syncing for this event
2009-04-28 16:37:49 CEST DEBUG2 remoteWorkerThread_104: SYNC 61 processing
2009-04-28 16:37:49 CEST DEBUG1 remoteWorkerThread_104: no sets need syncing for this event
2009-04-28 16:37:53 CEST DEBUG2 remoteWorkerThread_102: SYNC 34 processing
2009-04-28 16:37:53 CEST DEBUG1 remoteWorkerThread_102: no sets need syncing for this event

Precedent events (65,60 and 33) have been cleared, and a new one per node has been generated. But, the error continues and after 30 minutes I have :

select ev_origin, min(ev_seqno), max(ev_seqno),
       date_trunc('minutes', min(now() - ev_timestamp)),
       date_trunc('minutes', max(now() - ev_timestamp)),
       min(now() - ev_timestamp) > '00:30:00' as agehi
   from _pns_slony_voila_archi_0.sl_event group by ev_origin;
ev_origin |   min   |   max   | date_trunc | date_trunc | agehi
-----------+---------+---------+------------+------------+-------
      103 |      66 |      66 | 00:41:00   | 00:41:00   | t
      104 |      61 |      61 | 00:41:00   | 00:41:00   | t
      102 |      34 |      34 | 00:41:00   | 00:41:00   | t
      101 | 4543634 | 4544715 | 00:00:00   | 00:18:00   | f
Comment 1 Jean-Armel Luce 2009-05-29 04:36:36 UTC
I have the same error messages in my servers.

Does anybody know how I could correct this problem ?
Comment 2 Steve Singer 2010-06-10 10:26:10 UTC
I think this is the same issue as discussed in bug 127
Comment 3 Steve Singer 2010-08-11 14:07:31 UTC
I am marking this as fixed since bug 127 is now fixed.