Bug 324 - slony 2.1.4 slon creating extremely long query
Summary: slony 2.1.4 slon creating extremely long query
Status: NEW
Alias: None
Product: Slony-I
Classification: Unclassified
Component: slon (show other bugs)
Version: devel
Hardware: PC Linux
: low critical
Assignee: Slony Bugs List
URL:
Depends on:
Blocks:
 
Reported: 2013-11-13 09:42 UTC by Jeff Frost
Modified: 2013-11-19 14:31 UTC (History)
1 user (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Jeff Frost 2013-11-13 09:42:11 UTC
Seems as though this is a duplicate of bug #264, and that patch appears to be in 2.1.4, so I'm not sure why I'm still seeing this behavior.

After an initial sync that takes about 8hrs, slon generates this extremely long query (that log line by itself is 107MB in size) that looks like:

2013-11-13 10:44:26 EST ERROR  remoteWorkerThread_2: "declare LOG cursor for select log_origin, log_txid, log_tableid, log_actionseq, log_cmdtype, octet_length(log_cmddata), case when octet_length(log_cmddata) <= 8192 then log_cmddata else null end from "_migration".sl_log_1 where log_origin = 2 and log_tableid in (7,9,23,24,38,39,42,47,48,53,57,74,75,86,97,116,120,128,129,130,136,137,143) and log_txid >= '6236982' and log_txid < '6236996' and "pg_catalog".txid_visible_in_snapshot(log_txid, '6236996:6236996:')  and (  log_actionseq <> '5342437751'  and  log_actionseq <> '5342362294'  and  log_actionseq <> '5341832481'  and  log_actionseq <> '5342413870'  and  log_actionseq <> '5342418592'  and  log_actionseq <> '5342202064'  and  log_actionseq <> '5341939025'  and  log_actionseq <> '5341887545'  and  log_actionseq <> '5342114362'  and  log_actionseq <> '5342300971'  and  log_actionseq <> '5341759946'  and  log_actionseq <> '5342195755'  and  log_actionseq <> '5342198736'  and  log_actionseq <> '5342218415'  and  log_actionseq <> '5342188807'  and  log_actionseq <> '5342225657'  and  log_actionseq <> '5342434567'  and  log_actionseq <> '5342088216'  and  log_actionseq <> '5342293611'  and  log_actionseq <> '5341866011'  and  

.....

and  log_actionseq <> '5342342863' ) order by log_actionseq" PGRES_FATAL_ERROR ERROR:  stack depth limit exceeded

max_stack_depth is 4MB on these systems, but even setting it to the max doesn't do the job.

BTW, the versions in bugzilla only go up to 2.0, which is why this is filed under devel.
Comment 1 Jan Wieck 2013-11-13 13:48:43 UTC
I do see the order by clause on the log_actionseq select in the REL_2_1_STABLE branch, but the log snippet you pasted clearly shows those numbers out of order. Can you try to reindex the sl_log_* tables?

Note that in 2.2 the problem is gone entirely because we no longer use the log_actionseq for that. Instead the log selection for the first SYNC after subscribe is based on the snapshot of the copy_set() operation. At the time we did not feel comfortable to backpatch that into 2.1. That might be a reasonable thing to do now.
Comment 2 Jeff Frost 2013-11-19 11:39:57 UTC
For some reason this didn't make it back into bugzilla:

I actually went and downloaded the source for 2.1.4 and 2.1.3 and both of them
appear to have the ORDER BY, so I'm not sure what's going on here.

Unfortunately, I truncated the sl_log_* tables so we could get this sync
completed as it's holding up a migration.

How stable is 2.2.1 considered?

I have a feeling we are going to hit this problem again today, so is there any further info I could gather before I tear it down?
Comment 3 Jan Wieck 2013-11-19 12:20:27 UTC
Afilias is running 2.2.0 in production and we have done extensive testing on 2.2.1, will be upgrading to it shortly. So we definitely consider it "production".
Comment 4 Jan Wieck 2013-11-19 12:23:24 UTC
That said, we discussed the matter when I was in TO last week and we do think that backpatching the 2.2 changes, that eliminate the action sequence list altogether, is appropriate now. A subsequent 2.1 release should have that.
Comment 5 Jeff Frost 2013-11-19 12:25:07 UTC
Thanks! That's good to hear. 

Perhaps we'll upgrade to 2.2 if we can't get around the issue before the
backport.

So, is there any further info I can gather because I think it's likely to hit
this issue again today or not really because you're just going to backport the
2.2 code?
Comment 6 Jan Wieck 2013-11-19 12:36:04 UTC
The thing I don't understand is why you are hitting this issue at all. The 2.1 STABLE branch has the ORDER BY clause, so the list should be ordered. The log you pasted shows an unordered list. 

I hate to ask this, but is it possible that you upgraded Slony while there was an outstanding SUBSCRIBE going on?
Comment 7 Jeff Frost 2013-11-19 12:46:08 UTC
I agree, I actually went and downloaded the source for both 2.1.3 and 2.1.4 and they both have the ORDER BY in the code.

I actually went and asked on #postgresql-apt to just check that the 2.1.4 debian package was really build from 2.1.4, but haven't heard back yet.

It actually happened a couple times and in each case the cluster was completely dropped and recreated.

very strange!
Comment 8 Jan Wieck 2013-11-19 13:01:52 UTC
The place where that action sequence ID list is coming from is the sl_setsync.ssy_action_list. That datum is created by copy_set() as part of the SUBSCRIBE operation.

The only way I can imagine you ended up with a 2.1.4 running into that problem is that a SUBSCRIBE was done with a previous version, the sl_setsync entry created and then a successful upgrade to 2.1.4 was performed. That of course didn't fix the existing sl_setsync entry and the system is still stuck and cannot complete the initial SYNC after SUBSCRIBE.
Comment 9 Jeff Frost 2013-11-19 13:39:10 UTC
So, I just looked in sl_setsync for this new pair of servers and the ids are not in order:

'5864690123','5864690319','5864683875','5864690590','5864681889','5864660235','5864685143'

BUT, there is only about 400K of data in that column.

Also, these two servers have only ever had 2.1.4 on them prior to the subscription.
Comment 10 Jan Wieck 2013-11-19 14:31:20 UTC
Apparently a construct of 

    (select x, ... order by x)
    union
    (select x, ... order by x)

does NOT create a sorted list at all. It may have created something with one break in the sort order in some older PostgreSQL release, but right now I think we really need to backport the patch that went into 2.2.